r/MachineLearning · 10h ago · 8 · benchmark research open source

New structured output benchmark that measures value accuracy and faithfulness beyond just JSON schema validation, revealing significant gaps between schema compliance (90%+) and actual value correctness across all models tested. Includes a comprehensive evaluation framework with 7 key metrics across text, image, and audio modalities, plus open-source code and a leaderboard showing GPT-4 leading and GLM-4 performing competitively.
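The gap the benchmark measures can be illustrated in a few lines: a response can satisfy a schema while still getting the values wrong. A minimal sketch (the fields and checks here are hypothetical illustrations, not the benchmark's actual harness):

```python
import json

def schema_valid(obj):
    # Minimal stand-in for a JSON-schema check: required keys, right types.
    return isinstance(obj.get("name"), str) and isinstance(obj.get("year"), int)

def value_correct(obj, gold):
    # Value accuracy: fields must actually match the ground truth.
    return all(obj.get(k) == v for k, v in gold.items())

gold = {"name": "Ada Lovelace", "year": 1815}
response = json.loads('{"name": "A. Lovelace", "year": 1915}')

print(schema_valid(response))         # schema compliance: passes
print(value_correct(response, gold))  # value accuracy: fails
```

Schema validators only see the second kind of failure if the schema happens to encode it, which is why high compliance rates can coexist with low value correctness.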

Anthropic Blog · 12h ago · 7 · tool api update workflow open source

Anthropic released Claude connectors for creative tools including Blender, Autodesk, Adobe, Ableton, and Splice, built on the Model Context Protocol (MCP) standard. These connectors enable Claude to integrate directly with professional creative software, allowing developers to build AI-assisted workflows for 3D modeling, design, music production, and related tasks. The MCP-based approach ensures compatibility across multiple LLMs and emphasizes interoperability.

r/MachineLearning · 14h ago · 7 · tool visualization optimization

Interactive browser-based tool for visualizing neural network loss landscapes using the dimensionality-reduction techniques of Li et al. (NeurIPS 2018), allowing users to experiment with different architectures (MLPs to ResNet-8) and optimizers to see how they navigate high-dimensional optimization spaces. Provides practical intuition-building for local-minima geometry and optimizer behavior, though it acknowledges the limitations of 2D/3D projections for representing true high-dimensional surfaces.
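The core recipe from Li et al. is compact: pick two random directions in weight space, normalize them, and evaluate the loss on a 2D grid around the trained weights. A toy sketch with a quadratic stand-in for a network (the normalization below is a simple norm match to the parameter vector; the paper normalizes per filter for conv nets):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "network": a quadratic loss over a flat parameter vector.
# The same recipe applies to real models by flattening their weights.
theta_star = rng.normal(size=50)             # "trained" parameters
H = np.diag(rng.uniform(0.1, 5.0, size=50))  # fake positive curvature

def loss(theta):
    d = theta - theta_star
    return 0.5 * d @ H @ d

def random_direction():
    d = rng.normal(size=50)
    return d * np.linalg.norm(theta_star) / np.linalg.norm(d)

d1, d2 = random_direction(), random_direction()

# Evaluate the loss on a 2D slice around the trained weights.
alphas = np.linspace(-1, 1, 21)
surface = np.array([[loss(theta_star + a * d1 + b * d2) for b in alphas]
                    for a in alphas])

# The minimum of the slice sits at the trained weights (grid center).
print(surface.shape, surface[10, 10])
```

The tool's caveat follows directly from this construction: the surface is only a 2D slice through a million-dimensional space, so flatness or sharpness in the plot need not reflect the full geometry.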

r/LocalLLaMA · 15h ago · 8 · new model inference benchmark

NVIDIA released Nemotron 3 Nano Omni, a 31B multimodal model combining video, audio, image, and text understanding using a Mamba2-Transformer hybrid MoE architecture. Available commercially on Hugging Face/NGC with practical deployment guidance including vLLM 0.20.0+ requirements and ~62GB VRAM needs for inference.

HuggingFace Blog · 15h ago · 8 · new model inference agent open source

NVIDIA released Nemotron 3 Nano Omni, a multimodal model designed for efficient processing of documents, audio, video, and GUI-based agentic tasks with 7.4-9.2x higher system efficiency than comparable models. The 30B model uses Mamba state-space layers, MoE routing, and grouped-query attention to handle long-context reasoning across modalities while maintaining low latency for interactive workloads.

r/LocalLLaMA · 15h ago · 9 · new model open source agent inference benchmark

Ling-2.6-flash, a 104B parameter model with 7.4B active parameters, is now open-source and optimized for agent workloads with hybrid linear attention (MLA + Lightning Linear) and sparse MoE architecture. The model achieves 4× throughput improvements over comparable models while reducing token consumption—a critical optimization for production agent deployments where token costs are a major barrier.

r/MachineLearning · 16h ago · 5 · agent workflow

A developer shares an experiment comparing two iterations of an AI agent playing Dark Hex against itself, with a Colab notebook for reproducibility. While it demonstrates agent training/iteration workflows, it lacks technical depth on the methodology, model architecture, or learnings that would be immediately useful for other builders.

r/MachineLearning · 18h ago · 7 · library open source tool inference

Dynabatch is a PyTorch sampler that dynamically adjusts batch sizes based on sequence lengths using XGBoost to predict GPU memory pressure, achieving 3.3x throughput improvement on encoder-decoder models like NLLB-200. The tool uses a practical approach of sorting by token length and selecting optimal batch sizes within memory constraints, with built-in fallbacks for OOM errors.
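The length-sorting idea can be sketched independently of Dynabatch's actual API: sort examples by token count, then grow each batch until a memory proxy is exhausted. Here a fixed padded-token budget stands in for the XGBoost memory predictor, and the OOM fallback is omitted:

```python
def dynamic_batches(lengths, token_budget=1024):
    """Group example indices into batches whose padded token count
    (batch size x longest sequence in the batch) stays under budget."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    batches, batch = [], []
    for i in order:
        # Ascending order, so lengths[i] is the batch max once added.
        if batch and (len(batch) + 1) * lengths[i] > token_budget:
            batches.append(batch)
            batch = []
        batch.append(i)
    if batch:
        batches.append(batch)
    return batches

lengths = [12, 700, 48, 300, 31, 5, 64, 512]
batches = dynamic_batches(lengths)
print([[lengths[i] for i in b] for b in batches])
# Short sequences pack into large batches; long ones get small batches.
```

Sorting first is what makes the scheme pay off: without it, one long sequence in a batch forces padding on every short one, wasting the memory the larger batch size was meant to exploit.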

Simon Willison · 1d ago · 7 · tool workflow deployment

pip 26.1 introduces lockfile support (pylock.toml) for reproducible Python dependency management, plus dependency cooldowns via the --uploaded-prior-to flag, letting engineers restrict installs to package versions older than a specified cutoff for stability. These features are particularly useful for AI/ML projects that depend on packages like Datasette and LLM, improving dependency reproducibility in production environments.
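A hedged sketch of how the two features might be invoked; beyond the flag name given in the post, the subcommand shape, output option, and the cutoff's argument format are assumptions, so check `pip lock --help` and `pip install --help` on pip 26.1:

```shell
# Generate a PEP 751 lockfile pinning exact versions for reproduction.
pip lock -r requirements.txt -o pylock.toml

# Cooldown: only consider releases uploaded before the cutoff, so a
# freshly published (possibly broken or compromised) version is not
# picked up the moment it lands on PyPI.
pip install --uploaded-prior-to 2025-12-01 datasette llm
```

The cooldown idea trades freshness for stability: most supply-chain incidents and regressions surface within days of a release, so lagging the index by a short window filters out much of that risk.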

Simon Willison · 1d ago · 6 · new model open source fine tuning research

Talkie is a 13B language model trained exclusively on pre-1931 English text, with both base and instruction-tuned variants available under Apache 2.0 license. The project demonstrates novel approaches to training on out-of-copyright data and addresses contamination challenges, though the chat version relies on modern LLMs (Claude) for preference optimization, creating an interesting tension between data purity and practical fine-tuning.

r/LocalLLaMA · 1d ago · 7 · research benchmark training

Researchers trained 'vintage' language models on historical text (pre-1931) to study how LMs understand time, predict future events, and generate novel ideas. They evaluate these models on tasks like forecasting historical surprises and coding problems, providing insights into model capabilities and scaling behavior across different knowledge cutoffs.

HuggingFace Blog · 1d ago · 7 · new model open source deployment inference

NVIDIA and Siemens Healthineers released NV-Raw2Insights-US, an AI model that reconstructs ultrasound images directly from raw sensor data instead of traditional beamforming pipelines, enabling personalized speed-of-sound correction in real-time. The system uses Holoscan Sensor Bridge (open-source FPGA IP) to stream high-bandwidth ultrasound data to GPUs, demonstrating an end-to-end AI approach to medical imaging that learns adaptive physics-aware transformations for each patient.

OpenAI Blog · 1d ago · 6 · api update deployment

OpenAI's GPT models and Codex are now accessible through AWS, allowing developers to integrate these models within AWS infrastructure for enterprise deployments. This is primarily a deployment/infrastructure announcement rather than a technical capability breakthrough, but relevant for engineers deploying AI applications in AWS environments.

Simon Willison · 1d ago · 7 · new model tool open source inference deployment

Microsoft's VibeVoice is an MIT-licensed Whisper-style speech-to-text model with built-in speaker diarization, now available in an optimized MLX format for efficient inference on Apple Silicon. The post provides practical benchmarks (8:45 to transcribe one hour of audio on an M5 Max) and hands-on implementation details, including the JSON output structure and workarounds for the 1-hour audio limit.
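The 1-hour-limit workaround generalizes to any chunked transcription: split the audio into pieces under the cap, transcribe each, and shift the local timestamps back into global time. A sketch with a stub standing in for the real MLX model call (the function names and JSON shape here are assumptions, not VibeVoice's actual API):

```python
CHUNK_SECONDS = 55 * 60  # stay safely under the model's 1-hour cap

def transcribe(path, start, end):
    # Placeholder: a real implementation would run the model on
    # audio[start:end] and return segments with chunk-local timestamps.
    return [{"start": 0.0, "end": end - start, "text": f"[{start}-{end}]"}]

def transcribe_long(path, duration):
    segments = []
    offset = 0.0
    while offset < duration:
        end = min(offset + CHUNK_SECONDS, duration)
        for seg in transcribe(path, offset, end):
            # Shift chunk-local timestamps back into global time.
            segments.append({"start": seg["start"] + offset,
                             "end": seg["end"] + offset,
                             "text": seg["text"]})
        offset = end
    return segments

segs = transcribe_long("meeting.wav", duration=2 * 3600)  # 2-hour file
print(len(segs), segs[-1]["end"])
```

In practice you would also cut at silence rather than at a hard boundary, so a word is never split across two chunks; the timestamp arithmetic stays the same.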

Latent Space · 1d ago · 6 · deployment workflow research

Applied Intuition's founders discuss building a physical AI platform for autonomous systems, emphasizing that the bottleneck has shifted from model intelligence to deploying AI on constrained hardware with safety-critical reliability requirements. The conversation covers their evolution from simulation/data infrastructure to an Android-like OS for vehicles and machines, plus practical insights on AI tooling adoption and verification/validation approaches for autonomous systems.

r/MachineLearning · 1d ago · 8 · agent open source workflow deployment

Mahoraga is an open-source agent orchestrator that routes tasks between local and cloud AI models using LinUCB contextual bandits, with empirical results showing local 4B models (Qwen3) outperforming cloud APIs on constrained tasks like code generation while eliminating API costs. The system uses a two-stage routing strategy (task classification + bandit selection) with a custom 4-layer heuristic quality scorer, demonstrating that intelligent task-model matching can achieve both cost efficiency and better latency/quality trade-offs on consumer hardware.
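For readers unfamiliar with LinUCB, the routing core is small: each arm (here, a model backend) keeps a ridge-regression estimate of reward from task features and is scored by its mean prediction plus an upper-confidence bonus. A minimal sketch with two arms and toy features; this is the textbook algorithm, not Mahoraga's implementation:

```python
import numpy as np

class LinUCB:
    """Contextual bandit: one ridge-regression reward model per arm,
    scored by predicted reward plus an upper-confidence bonus."""
    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(dim) for _ in range(n_arms)]    # X^T X + I per arm
        self.b = [np.zeros(dim) for _ in range(n_arms)]  # X^T y per arm

    def choose(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                       # ridge estimate
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)
            scores.append(theta @ x + bonus)        # optimism under uncertainty
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

# Toy training: arm 0 ("local model") pays off when the feature is set,
# arm 1 ("cloud API") when it is not. Context = [feature, bias].
router = LinUCB(n_arms=2, dim=2)
for _ in range(50):
    router.update(0, np.array([1.0, 1.0]), 1.0)
    router.update(0, np.array([0.0, 1.0]), 0.0)
    router.update(1, np.array([1.0, 1.0]), 0.0)
    router.update(1, np.array([0.0, 1.0]), 1.0)

print(router.choose(np.array([1.0, 1.0])))  # routes to arm 0
print(router.choose(np.array([0.0, 1.0])))  # routes to arm 1
```

The confidence bonus is what lets the router keep probing the local model on task types it has rarely seen, instead of defaulting permanently to whichever arm won first.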