News Nug

Simon Willison · 30d ago · 7 · agent workflow deployment

Shopify's internal coding agent 'River' interacts only in public Slack channels, so its work is visible and searchable and the organization can learn from it at scale. It is a practical example of transparency and observability improving both productivity and knowledge sharing in AI-assisted development workflows.

Interactive Jensen–Shannon Divergence Visualisation [P]

r/MachineLearning · 30d ago · 5 · tool tutorial

Interactive visualization tool for Jensen-Shannon divergence, a symmetric, bounded measure for comparing probability distributions (its square root is a true metric). While mathematically foundational for ML work, this is primarily an educational visualization rather than a practical tool for daily AI development workflows.
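For readers who want the definition behind the visualization, here is a minimal sketch of Jensen-Shannon divergence for discrete distributions. It is not taken from the linked tool; the function names are illustrative, and inputs are assumed to be equal-length lists of probabilities that each sum to 1.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms where p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (base-2 logs), so it lies in [0, 1]."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    # Mixture distribution: m_i > 0 whenever p_i > 0 or q_i > 0,
    # so the KL terms below never divide by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


if __name__ == "__main__":
    print(js_divergence([0.5, 0.5], [0.9, 0.1]))
    print(js_divergence([1.0, 0.0], [0.0, 1.0]))  # disjoint support: maximum value
```

Unlike KL divergence, swapping the arguments does not change the result, and distributions with disjoint support still give a finite value.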

V-JEPA 2.1's dense features are partitioned: a robustness study across all four model sizes [R]

r/MachineLearning · 30d ago · 8 · research benchmark open source

Pre-registered robustness study of Meta's V-JEPA 2.1 across all four model sizes (80M-2B) reports three findings: representational drift (the M2 metric) predicts failure on temporal corruptions but not on image noise; scaling is non-monotonic, so larger models aren't reliably more robust; and the models are unexpectedly sensitive to orientation even though temporal structure is preserved. The post proposes a mechanistic hypothesis linking these findings to hub marginalization in deep ViTs, and ships fully reproducible code and pre-registered decision rules.

How enterprises are scaling AI

OpenAI Blog · 30d ago · 5 · workflow deployment

Article discusses enterprise AI scaling strategies focusing on governance, workflow design, and quality assurance rather than specific technical implementations. Provides organizational/process perspective on moving from AI experiments to production systems, relevant for engineers managing AI infrastructure at scale.

unsloth/MiMo-V2.5-GGUF · Hugging Face

r/LocalLLaMA · 30d ago · 8 · new model inference deployment agent

MiMo-V2.5 is a native omnimodal model supporting text, image, video, and audio with agentic capabilities, featuring hybrid attention architecture that reduces KV-cache by 6× and supports 1M token context. The guide covers practical deployment across multiple inference frameworks (llama-cpp-python, Ollama, SGLang, Docker) with Unsloth's GGUF quantization, making it immediately usable for engineers building multimodal AI applications.

Why is human LLM annotation so expensive? [D]

r/MachineLearning · 31d ago · 5 · workflow tutorial

A discussion thread about data labeling trade-offs for ML practitioners: Scale AI offers quality but high cost, MTurk is cheap but low quality, leaving a gap for teams needing thousands of labeled examples for evals/fine-tuning. The post seeks practical solutions and community experiences on bridging this middle ground.

Quoting New York Times Editors’ Note

Simon Willison · 31d ago · 6 · workflow prompt engineering

A New York Times correction highlights a critical failure in AI tool usage: an AI-generated summary was mistakenly presented as a direct quotation, revealing the importance of verifying AI outputs before publication. This incident underscores a significant workflow issue for anyone integrating AI into content creation or information gathering—the tool produced plausible-sounding but inaccurate text that bypassed human verification.

ktx — ktx is an executable context layer for data and analytics agents 🐙 Allow Claude Code, Codex, and any AI agent to query data accurately through MCP with skills, memory and a semantic layer

GitHub Trending AI · 31d ago · 7 · tool agent open source workflow

ktx is an open-source context layer that enables AI agents to accurately query data warehouses by maintaining approved metrics, schema knowledge, and business context—solving the problem of agents reinventing metric logic on each query. It works with major SQL databases and integrates via MCP with Claude Code, Cursor, and other agent platforms, requiring only your own LLM API keys.

ktx-ai-data-agents-context — ktx is an executable context layer for data and analytics agents 🐙 Allow Claude Code, Codex, and any AI agent to query data accurately through MCP with skills, memory and a semantic layer

GitHub Trending AI · 31d ago · 7 · tool agent open source workflow

ktx is an open-source context layer that enables AI agents to accurately query data warehouses by maintaining semantic layer definitions, metric logic, and business knowledge automatically. It integrates with popular databases (PostgreSQL, Snowflake, BigQuery, etc.) and works via MCP with agents like Claude Code and Cursor, eliminating the need for agents to re-explore schemas or invent custom metrics.

ktx-ai-data-agents-mcp-context-skills — ktx is an executable context layer for data and analytics agents 🐙 Allow Claude Code, Codex, and any AI agent to query data accurately through MCP with skills and memory

GitHub Trending AI · 31d ago · 8 · tool open source agent rag api update

ktx is an open-source context layer that enables AI agents to query data warehouses accurately by maintaining approved metric definitions, business knowledge, and schema context. It integrates with popular databases (PostgreSQL, Snowflake, BigQuery, etc.) and agent platforms (Claude Code, Cursor, Codex) via MCP, eliminating the need for agents to reinvent metric logic on each query.

MachinaCheck: Building a Multi-Agent CNC Manufacturability System on AMD MI300X

HuggingFace Blog · 31d ago · 7 · agent open source inference workflow rag

MachinaCheck is a multi-agent AI system for CNC machine shops that analyzes STEP CAD files to determine manufacturability in 30 seconds. It uses Qwen 2.5 7B running locally on AMD MI300X (for on-premise privacy), cadquery for geometric feature extraction, and a five-component LangChain pipeline with vLLM inference to replace manual 30-60 minute feasibility assessments.

Signals: finding the most informative agent traces without LLM judges [R]

r/MachineLearning · 31d ago · 5

Anybody else noticing how good gemma-4-26b-a4b is with one-shotting three.js?

r/LocalLLaMA · 31d ago · 6 · tool workflow prompt engineering

A creative Python automation tool that cycles through prompts to generate Three.js demonstrations, with error detection and HTML archival. While primarily a fun project rather than production-critical, it demonstrates practical prompt engineering and automated code generation workflows that could inspire similar build-and-test pipelines for AI-assisted development.

Any implementations similar to D4RT? [D]

r/MachineLearning · 31d ago · 6 · research open source deployment

Discussion seeking open-source alternatives to DeepMind's D4RT for 4D scene understanding from video, which reconstructs 3D point clouds and estimates camera poses from dynamic scenes. While the original model isn't released, this identifies a gap in available tools for video-to-3D reconstruction and invites community pointers to similar implementations.

Parax v0.7: Parametric Modeling in JAX [P]

r/MachineLearning · 31d ago · 7 · library open source tool

Parax v0.7 is a JAX library that bridges functional PyTree-based modeling with object-oriented approaches, offering derived parameters, computed PyTrees, and abstract interfaces for constrained optimization and probabilistic sampling. The release includes polished APIs and practical examples for bounded optimization (JAXopt) and Bayesian sampling (BlackJAX), making it valuable for engineers building probabilistic ML systems in JAX.

"colss" a math-style expression evaluator for NumPy arrays [P]

r/MachineLearning · 31d ago · 6 · library open source tool

A new Python library that wraps NumPy operations with mathematical expression syntax, using C++/pybind11 for performance. While it provides cleaner notation for complex vectorized operations, it's early-stage and represents an ergonomic enhancement rather than a fundamental capability addition for AI engineers.

Exactly a year ago, I started working on an MCP server I launched on reddit that became by far my most active open source project!

r/LocalLLaMA · 32d ago · 8 · tool open source api update agent

Workspace MCP is a comprehensive Model Context Protocol server providing full natural language control over all Google Workspace services (Gmail, Drive, Calendar, Docs, Sheets, Slides, Forms, Tasks, Contacts, Chat, Apps Script) with OAuth 2.1 support and stateless deployment options. It enables AI assistants and agent platforms to access 12 Google services with fine-grained editing capabilities that exceed built-in Claude/ChatGPT integrations, available as open-source MIT-licensed software with CLI and Code Mode support.

LLM rankings are not a ladder: experimental results from a transitive benchmark graph [D]

r/MachineLearning · 32d ago · 7 · tool benchmark research

LLM Win is a visualization tool that models LLM benchmark results as a directed graph where edges represent win relationships, revealing that 94.2% of weaker models can reach stronger ones through transitive benchmark chains. The analysis identifies systematic benchmark reversals (119k cases where lower-ranked models outperform higher-ranked ones on specific tests) and suggests this reversal structure could signal either genuine model specialization or benchmark noise, opening new approaches for robust model evaluation metrics.
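The transitive-chain idea can be illustrated with a small reachability check. This is a sketch under assumptions, not LLM Win's code: an edge from X to Y is taken to mean "X beat Y on at least one benchmark", the ranking is a hypothetical strongest-to-weakest list, and all function names are made up for illustration.

```python
from collections import defaultdict, deque


def build_win_graph(results):
    """results: iterable of (winner, loser) pairs, one per benchmark comparison."""
    graph = defaultdict(set)
    for winner, loser in results:
        graph[winner].add(loser)
    return graph


def can_reach(graph, start, target):
    """Breadth-first search: is there a chain of wins from start to target?"""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == target:
                return True
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def upset_reachability(graph, ranking):
    """Fraction of (weaker, stronger) pairs where the weaker model reaches the stronger."""
    pairs = reached = 0
    for i, stronger in enumerate(ranking):
        for weaker in ranking[i + 1:]:
            pairs += 1
            reached += can_reach(graph, weaker, stronger)
    return reached / pairs if pairs else 0.0


if __name__ == "__main__":
    # Hypothetical data: C is ranked last but beats A on one benchmark.
    wins = [("A", "B"), ("B", "C"), ("C", "D"), ("C", "A")]
    print(upset_reachability(build_win_graph(wins), ["A", "B", "C", "D"]))
```

A single reversal (C beating A) is enough to let B reach A through C, which is why a handful of benchmark upsets can make most of a leaderboard mutually reachable.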

OncoAgent: A Dual-Tier Multi-Agent Framework for Privacy-Preserving Oncology Clinical Decision Support

HuggingFace Blog · 32d ago · 9 · open source fine tuning agent rag inference deployment

OncoAgent is an open-source clinical decision support system that combines two tiers of fine-tuned LLMs (9B and 27B, trained via QLoRA), a multi-agent LangGraph architecture, and Corrective RAG over medical guidelines under a strict Zero-PHI privacy policy. Reported technical highlights: a 56× speedup on AMD MI300X hardware via sequence packing, a fine-tuning dataset of 266K oncology cases, and on-premises inference that removes the dependency on cloud APIs.
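Sequence packing, mentioned above as the source of the speedup, fits several short training examples into one fixed-length row instead of padding each example separately. The sketch below is a generic first-fit-decreasing packer, not OncoAgent's implementation; the function names, the 1024-token row length, and the example lengths are assumptions for illustration. A real training setup also needs attention masks that keep packed examples from attending to each other, which is not shown.

```python
def pack_sequences(lengths, max_len):
    """Group sequence indices into rows whose total length is at most max_len.

    Uses first-fit decreasing: longest sequences are placed first, each into
    the first row that still has room.
    """
    rows = []  # each row: {"free": remaining tokens, "items": [sequence indices]}
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for i in order:
        n = lengths[i]
        if n > max_len:
            raise ValueError(f"sequence {i} has {n} tokens, more than max_len={max_len}")
        for row in rows:
            if row["free"] >= n:
                row["items"].append(i)
                row["free"] -= n
                break
        else:
            # No existing row had room, so open a new one.
            rows.append({"free": max_len - n, "items": [i]})
    return [row["items"] for row in rows]


def padding_fraction(lengths, num_rows, max_len):
    """Share of the batch's token slots that hold padding rather than data."""
    return 1 - sum(lengths) / (num_rows * max_len)


if __name__ == "__main__":
    lengths = [900, 100, 500, 400, 120]  # hypothetical token counts
    packed = pack_sequences(lengths, max_len=1024)
    print(packed)
    print(padding_fraction(lengths, len(lengths), 1024))  # one sequence per row
    print(padding_fraction(lengths, len(packed), 1024))   # after packing
```

Fewer rows with less padding means fewer wasted tokens per training step; how much wall-clock speedup that yields depends on the length distribution and the hardware.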

DeepSeek V4 paper full version is out, FP4 QAT details and stability tricks [D]

r/MachineLearning · 32d ago · 9 · new model research inference fine tuning benchmark

DeepSeek V4 paper reveals production-ready FP4 quantization-aware training achieving 2x QK selector speedup with 99.7% recall and 27% FLOPs reduction, plus novel training stabilization techniques (anticipatory routing, SwiGLU clamping) for trillion-parameter MoE models. Includes practical inference optimizations and generative reward modeling for RLHF that significantly reduce computational overhead for multi-agent and multi-call workflows.