r/MachineLearning · 1d ago · 8 · open source inference workflow benchmark

A practical routing-based architecture for lightweight multilingual ASR that switches between specialized ~100M-parameter monolingual models instead of relying on a single large multilingual model. It reaches 13% WER on inter-utterance code-switching by coordinating Zipformer, Silero VAD, and SpeechBrain components with rollback logic. The open-source approach performs well on real-world language-switching scenarios with significantly lower computational requirements than cloud APIs.

HuggingFace Blog · 1d ago · 9 · new model open source inference tool

Mellum2 is a new open-source Mixture-of-Experts model optimized for low-latency inference in software engineering tasks, running 2x faster than similarly sized models while staying competitive on benchmarks. Designed as a specialized "focal" model for routing, retrieval, code completion, and agent subtasks within multi-model production systems, it is particularly suited to RAG pipelines, self-hosted deployments, and latency-sensitive workloads.

Latent Space · 1d ago · 7 · agent research inference workflow

Podcast episode featuring an engineer from NVIDIA Cosmos and xAI's Grok Imagine discussing frontier video generation systems, multimodal models, and the shift toward video agents rather than improved single-model performance. Key technical insights cover data pipelines, VAEs, diffusion transformers, inference optimization, and the argument that video model intelligence derives primarily from LLM components rather than video-specific training.

r/LocalLLaMA · 1d ago · 7 · inference optimization open source

GitHub PR discussion about optimizing VRAM usage in llama.cpp by reserving logits space only for n_seqs when possible, saving 1.2GB+ on typical configs with draft decoding. The optimization targets the llama-context API and includes benchmarks on RTX 3080/5070 Ti showing persistent memory headroom improvements without impacting model inference quality.
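A rough back-of-the-envelope sketch of where a saving of that size could come from, assuming the logits buffer holds one float32 row of vocabulary size per reserved output. The vocabulary size, batch size, and sequence count below are illustrative assumptions, not figures from the PR, and `logits_buffer_bytes` is a hypothetical helper rather than part of llama.cpp.

```python
# Hypothetical sizing helper; all numbers here are illustrative assumptions,
# not values taken from the PR discussion.
BYTES_PER_LOGIT = 4  # assumes logits are stored as float32


def logits_buffer_bytes(n_rows: int, n_vocab: int) -> int:
    """Bytes needed to hold n_rows rows of n_vocab float32 logits."""
    return n_rows * n_vocab * BYTES_PER_LOGIT


n_vocab = 151_936  # assumed vocabulary size
n_batch = 2_048    # assumed rows reserved when sizing by batch
n_seqs = 4         # assumed rows reserved when sizing by sequence count

before = logits_buffer_bytes(n_batch, n_vocab)
after = logits_buffer_bytes(n_seqs, n_vocab)
saved_gb = (before - after) / 1e9

print(f"before: {before / 1e9:.2f} GB")
print(f"after:  {after / 1e9:.4f} GB")
print(f"saved:  {saved_gb:.2f} GB")
```

Under these assumptions the reservation shrinks from roughly 1.24 GB to a few megabytes, which is in the same range as the saving quoted in the summary. Actual numbers depend on the model's vocabulary and the context configuration.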

r/MachineLearning · 1d ago · 7 · benchmark research agent

FML-Bench reveals that apparent 30→80% improvements in MLE-Bench are largely due to better base models and shifts in problem definition rather than algorithmic progress: the older AIDE matches modern agents when step budget and model are held constant. The research paper introduces FML-Bench as a unified benchmark for evaluating the true algorithmic efficiency (search/memory) of ML agents, independent of gains in model capability.

r/LocalLLaMA · 1d ago · 8 · new model open source inference tool

JetBrains open-sourced Mellum2, a 12B-parameter model optimized for production AI systems, with less than 50% of the inference latency of comparable models. Built for routing, Q&A, and agent workflows in software engineering, it is positioned as a specialized "focal model" component for multi-model AI systems rather than a general-purpose frontier model.

HuggingFace Blog · 1d ago · 6 · agent workflow research

IBM article discussing agent logic as a design pattern to improve AI agent performance in enterprise workflows by using domain-specific primitives (knowledge graphs, algorithms, program analysis) to constrain LLM context and reduce hallucinations. Provides practical examples of agent architecture for mainframe development, IT operations, and enterprise software delivery.

OpenAI Blog · 1d ago · 6 · deployment api update

OpenAI's frontier models and Codex are now available through AWS Marketplace, enabling enterprise developers to access OpenAI's APIs within existing AWS infrastructure and procurement processes. This primarily improves deployment workflow and enterprise adoption but doesn't introduce new models or technical capabilities.

HuggingFace Blog · 1d ago · 9 · new model research inference benchmark

Cosmos 3 is a unified multimodal foundation model using Mixture-of-Transformers architecture that combines video generation, scene understanding, reasoning, and policy generation in a single model for physical AI applications like robotics and autonomous vehicles. The architecture supports text, image, video, audio, and action modalities through shared representation with separate autoregressive and diffusion pathways, eliminating the need to juggle multiple specialized models.

r/MachineLearning · 1d ago · 6 · research

A discussion post asking about current academic research directions in world models and self-supervised learning, noting a shift from methods like Barlow Twins/DINO toward scaled video generation. While this reflects genuine technical evolution in representation learning, it's community commentary rather than a concrete technical resource, tool, or research paper.

r/LocalLLaMA · 1d ago · 9 · new model benchmark agent open source

M3 is a new open-weight frontier model combining 1M-token context via proprietary Sparse Attention, native multimodality, and world-leading coding/agentic capabilities. These are demonstrated through autonomous task execution, including ICLR paper reproduction and GPU kernel optimization without human intervention. The model achieves top benchmark scores on coding tasks, agent-based browsing (83.5 on BrowseComp), and multi-step reasoning, making it directly relevant for building AI assistants and automated workflows.

r/MachineLearning · 2d ago · 6 · tutorial workflow

A software engineer is troubleshooting convergence issues with a Conformer-based ASR model trained on dialectal Arabic speech using SpeechBrain, where combined CTC+KLDiv losses plateau early and validation WER remains near 100% despite multiple hyperparameter adjustments. This represents a practical deep learning debugging challenge relevant to engineers building speech models, though it's a specific problem thread rather than a generalizable technique or tool release.

Simon Willison · 2d ago · 5 · workflow prompt engineering

A thoughtful essay examining how AI coding agents can paradoxically reduce productivity by making project creation frictionless, leading to abandoned side projects and attention fragmentation. The post explores the psychological challenge of managing AI tools' efficiency and includes discussion of how some ADHD users find agents helpful for focus, presenting a nuanced perspective on workflow management with AI.

r/MachineLearning · 2d ago · 6 · workflow benchmark

A software engineer working on computer vision is seeking advice on clustering YOLO detections into groups and predicting strand counts. They've trained a YOLO object detector and XGBoost classifier achieving 70% accuracy, but believe better performance is possible given the constrained problem space (max 8 groups, 3 strands per group). This is a practical computer vision engineering problem discussing detection post-processing and classification approaches.

Simon Willison · 3d ago · 8 · deployment open source agent benchmark

Anthropic published detailed documentation on their sandboxing techniques across Claude products (Claude.ai, Claude Code, Cowork), covering process isolation methods like gVisor, Seatbelt, Bubblewrap, and full VMs. The post explains threat models and constraint strategies for preventing agent escapes and credential exfiltration, plus mentions their open-source srt (Sandbox Runtime) tool for building secure AI applications.

Simon Willison · 3d ago · 7 · tool workflow open source

Datasette Lite now uses Service Workers with Pyodide to run Python ASGI apps in the browser, enabling full JavaScript execution that was previously broken. This approach, developed with Claude Opus 4.8's assistance, allows running full Python web applications like Datasette in WebAssembly without server infrastructure.