r/MachineLearning · 3h ago · 9 · inference optimization open source benchmark research

KVarN is a novel KV-cache quantization method combining Hadamard rotations with variance normalization that achieves 3-4x compression with minimal accuracy loss on demanding benchmarks like AIME24. The approach includes a vLLM implementation and demonstrates actual speedups over fp16 baselines, making it immediately applicable for optimizing inference in reasoning and code-generation workloads.
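
To make the idea concrete, here is a minimal sketch of why rotating before quantizing can help. It is not the KVarN algorithm or its vLLM kernel: the RMS-based scaling, the 4-bit width, the clip range of 3, and all function names are assumptions chosen for illustration. A Hadamard rotation spreads a single outlier channel across all coordinates, so fewer values are clipped or rounded to zero.

```python
import math


def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform (it is its own inverse)."""
    n = len(vec)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(vec)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for j in range(start, start + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


def quantize_roundtrip(vec, bits=4, clip=3.0):
    """Divide by the RMS, quantize uniformly in [-clip, clip], then dequantize."""
    rms = math.sqrt(sum(v * v for v in vec) / len(vec)) or 1.0
    levels = 2 ** (bits - 1) - 1  # 7 positive levels for 4 bits
    restored = []
    for v in vec:
        z = max(-clip, min(clip, v / rms))
        code = round(z / clip * levels)  # the integer that would be stored
        restored.append(code / levels * clip * rms)
    return restored


def squared_error(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


# Toy vector (assumed data): small values plus one large outlier channel.
x = [math.sin(i) for i in range(64)]
x[0] = 50.0

plain = quantize_roundtrip(x)
rotated = fwht(quantize_roundtrip(fwht(x)))  # rotate, quantize, rotate back

err_plain = squared_error(x, plain)
err_rotated = squared_error(x, rotated)
print(f"no rotation: {err_plain:.1f}  with rotation: {err_rotated:.1f}")
```

On this toy input the rotated path reconstructs the vector with a much smaller squared error, because the outlier no longer dominates the scale of a single channel.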

HuggingFace Blog · 3h ago · 8 · new model open source inference tool

NVIDIA released Nemotron 3.5 ASR, a 600M-parameter multilingual streaming speech recognition model that supports 40 language-locales, produces punctuation and capitalization natively, and uses cache-aware processing to eliminate redundant computation in streaming scenarios. The model pairs a Cache-Aware FastConformer encoder with an RNNT decoder, supports language conditioning, and is available as a NeMo checkpoint for straightforward integration.

r/MachineLearning · 4h ago · 8 · research fine tuning workflow

On-policy distillation (OPD) is an emerging post-training technique used in recent frontier models (Qwen 3.6/3.7, GLM-5.1, DeepSeek-V4) that efficiently teaches models to avoid specific errors by injecting hint tokens into trajectories rather than requiring full rollout regeneration. The technique uses a separate model to identify mistakes in rollouts, then trains the main model via probability matching on the annotated trajectories—a practical efficiency win over naive reinforcement learning approaches.
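
One way to read "probability matching" is a per-token divergence between the main model's next-token distribution and that of a reference pass over the annotated trajectory. The sketch below assumes a reverse KL objective and assumes the reference logits come from a context that includes the injected hint. Both are illustrative choices, not details confirmed by the post, and all names are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-probabilities for one token position."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_total for v in logits]


def reverse_kl(student_logits, reference_logits):
    """KL(student || reference) for one token position."""
    log_p = log_softmax(student_logits)
    log_q = log_softmax(reference_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def trajectory_loss(student_steps, reference_steps):
    """Mean per-token divergence over one sampled trajectory."""
    if len(student_steps) != len(reference_steps):
        raise ValueError("trajectories must have the same length")
    total = sum(reverse_kl(s, r) for s, r in zip(student_steps, reference_steps))
    return total / len(student_steps)


# Toy logits over a 3-token vocabulary for a 2-token trajectory (assumed data).
student = [[2.0, 0.5, 0.1], [0.3, 0.3, 1.5]]
with_hint = [[0.2, 2.5, 0.1], [0.3, 0.3, 1.5]]  # the hint shifts mass at step 0

loss = trajectory_loss(student, with_hint)
print(f"mean per-token reverse KL: {loss:.4f}")
```

Only the first position contributes here, since the two distributions agree at the second. That is the intended behavior of a loss that targets specific mistakes rather than the whole rollout.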

HuggingFace Blog · 4h ago · 7 · benchmark agent open source tool

EVA-Bench is an expanded open-source voice agent evaluation benchmark now covering 3 enterprise domains (airline, IT service, healthcare HR) with 213 scenarios across 121 tools—4x larger than the original release. The benchmark includes detailed methodology for dataset generation and validation against frontier models, plus an upcoming multilingual extension, making it useful for engineers evaluating or building voice agents.

OpenAI Blog · 4h ago · 5 · agent workflow

Endava's case study on deploying AI agents and ChatGPT Enterprise for software delivery automation offers practical insights into enterprise implementation, though it is primarily a business-focused success story and provides little technical depth on the AI tools themselves.

r/MachineLearning · 5h ago · 6 · workflow research

A practical discussion on conducting ablation studies without full retraining by leveraging saved checkpoints and model components. The thread explores techniques like selective layer freezing, component masking, and gradient-based analysis to evaluate model component importance while maintaining reproducibility against the original baseline.
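
Component masking can be sketched without any training loop: load the saved model once, disable one component at a time, and compare the metric against the untouched baseline. The toy "checkpoint" below is a dictionary of hypothetical components standing in for real layers or heads. It is an illustration of the masking idea, not code from the thread.

```python
def predict(components, x, disabled=frozenset()):
    """Sum the outputs of every component that is not masked out."""
    return sum(fn(x) for name, fn in components.items() if name not in disabled)


def mse(components, data, disabled=frozenset()):
    """Mean squared error of the (possibly masked) model on fixed eval data."""
    errors = [(predict(components, x, disabled) - y) ** 2 for x, y in data]
    return sum(errors) / len(errors)


def ablate(components, data):
    """Increase in error caused by masking each component in isolation."""
    baseline = mse(components, data)
    return {
        name: mse(components, data, frozenset({name})) - baseline
        for name in components
    }


# Hypothetical frozen checkpoint: three additive components.
checkpoint = {
    "linear": lambda x: 2 * x,
    "bias": lambda x: 1.0,
    "unused_head": lambda x: 0.0,
}
eval_data = [(x, 2 * x + 1) for x in range(5)]

importance = ablate(checkpoint, eval_data)
for name, delta in sorted(importance.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} error increase: {delta:.2f}")
```

Because every ablation is evaluated against the same frozen weights and the same evaluation set, the comparison to the baseline stays reproducible.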

r/MachineLearning · 8h ago · 7 · library open source tool

AttnHut is an open-source repository providing modular, swappable attention mechanism implementations for language models and vision tasks, including MiniMax M3's sparse attention. The library enables easy experimentation and benchmarking of different attention variants, with applications across SLMs, computer vision, and RL.

r/MachineLearning · 13h ago · 6 · benchmark api update

Discussion exploring which AI models handle long-form video understanding and complex reasoning tasks effectively. Covers practical considerations for video input handling and reasoning capabilities across different model providers.

Latent Space · 13h ago · 8 · new model research training fine tuning benchmark

Microsoft released MAI-Thinking-1 with a detailed 109-page technical report covering training without synthetic data or distillation, reporting strong benchmark results (97% AIME, 53% SWE-Bench Pro). The report is unusually transparent about scaling recipes, MFU numbers, the training stack (SGLang, dspy.GEPA), and the data mixture (50% code, 17.5% each for STEM and math). Microsoft also introduced Frontier Tuning for RL-based model adaptation and multiple specialized models (MAI-Image-2.5, MAI-Code-1-Flash) with deployment into products.

Latent Space · 21h ago · 7 · research agent workflow inference

Axiom is developing 'Verified AI' systems that use formal verification (similar to type checkers in programming) to ensure mathematical and logical correctness in AI reasoning, applying this at both training and inference stages. The approach aims to address critical gaps in AI reasoning beyond coding by requiring systems to produce formally provable outputs, enabling better scaling and composability of AI capabilities.

r/MachineLearning · 21h ago · 7 · workflow deployment monitoring

Practical discussion of production ML monitoring and retraining strategies for handling data drift, covering continuous retraining (interval vs trigger-based), drift detection, shadow models, and human-in-the-loop approaches. The post emphasizes that operational constraints often matter more than model architecture when choosing drift mitigation strategies.
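
A trigger-based retraining check can be as small as one drift statistic and a threshold. The sketch below uses the Population Stability Index (PSI) as the statistic. The post does not name a specific detector, so PSI, the bin edges, and the 0.2 threshold (a common rule of thumb) are all assumptions, and the function names are hypothetical.

```python
import math


def bucket_fractions(sample, edges, eps=1e-6):
    """Share of the sample in each bucket; eps avoids log(0) for empty buckets."""
    counts = [0] * (len(edges) + 1)
    for value in sample:
        index = sum(value >= edge for edge in edges)  # edges must be sorted
        counts[index] += 1
    return [max(count / len(sample), eps) for count in counts]


def psi(reference, live, edges):
    """Population Stability Index between a reference and a live sample."""
    ref = bucket_fractions(reference, edges)
    cur = bucket_fractions(live, edges)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def should_retrain(reference, live, edges, threshold=0.2):
    """Trigger-based policy: retrain only when drift exceeds the threshold."""
    return psi(reference, live, edges) > threshold


# Assumed data: one feature observed at training time and in production.
edges = [0.25, 0.5, 0.75]
training_feature = [i / 100 for i in range(100)]
stable_traffic = list(training_feature)
shifted_traffic = [0.5 + i / 200 for i in range(100)]

print("stable :", should_retrain(training_feature, stable_traffic, edges))
print("shifted:", should_retrain(training_feature, shifted_traffic, edges))
```

An interval-based policy would skip the check and retrain on a schedule. Which of the two fits better depends on the operational constraints the post highlights, such as labeling latency and retraining cost.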

r/LocalLLaMA · 23h ago · 7 · library inference open source

Technical discussion in llama.cpp about extracting embeddings for Multi-Token Prediction (MTP) models, specifically whether to use pre-norm or post-norm hidden states depending on the model architecture. The thread explores API design options for decoupling embedding extraction from logits computation to support different MTP model requirements.
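
The pre-norm versus post-norm question comes down to whether the final output normalization is applied before the hidden state is handed to the caller. The sketch below is written in Python for brevity and is not llama.cpp code or its API. It assumes an RMSNorm-style output norm, and the function and parameter names are hypothetical.

```python
import math


def rms_norm(hidden, weight, eps=1e-6):
    """RMSNorm: rescale to unit root-mean-square, then apply a learned gain."""
    mean_sq = sum(h * h for h in hidden) / len(hidden)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [h * inv_rms * w for h, w in zip(hidden, weight)]


def extract_embedding(hidden, norm_weight, use_post_norm):
    """Return the last-layer state either raw (pre-norm) or normalized (post-norm).

    Keeping this separate from the logits computation lets each model choose
    the variant that its MTP head was trained on.
    """
    if use_post_norm:
        return rms_norm(hidden, norm_weight)
    return list(hidden)


# Assumed data: a 4-dimensional final hidden state with unit gains.
last_hidden = [3.0, -4.0, 12.0, 0.0]
gain = [1.0, 1.0, 1.0, 1.0]

pre = extract_embedding(last_hidden, gain, use_post_norm=False)
post = extract_embedding(last_hidden, gain, use_post_norm=True)
print("pre-norm :", pre)
print("post-norm:", [round(v, 4) for v in post])
```

The two outputs differ in scale and, with non-uniform gains, in direction, which is why feeding the wrong variant to a downstream head can degrade its predictions.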

r/MachineLearning · 23h ago · 6 · research fine tuning workflow

A software engineer shares detailed diagnostics of an AlphaZero training failure for 6x6 Othello, analyzing hyperparameters (c_puct, Dirichlet noise, temperature) and providing empirical metrics (value loss plateaus, policy entropy, KL-divergence trends) to understand why the model fails against simple baselines despite showing policy learning.
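
For reference, the three hyperparameters named above enter AlphaZero-style search in well-defined places: c_puct scales the exploration bonus during selection, Dirichlet noise is mixed into the root priors, and temperature shapes the policy derived from visit counts. The sketch below uses placeholder default values, not the poster's settings.

```python
import math
import random


def puct_score(q_value, prior, parent_visits, child_visits, c_puct=1.5):
    """Selection score: value estimate plus a prior-weighted exploration bonus."""
    bonus = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q_value + bonus


def add_dirichlet_noise(priors, alpha=0.3, epsilon=0.25, rng=random):
    """Mix Dirichlet(alpha) noise into the root priors to force exploration."""
    gammas = [rng.gammavariate(alpha, 1.0) for _ in priors]
    total = sum(gammas)
    return [(1 - epsilon) * p + epsilon * g / total for p, g in zip(priors, gammas)]


def visit_policy(visits, temperature=1.0):
    """Turn root visit counts into a move distribution."""
    if temperature == 0:  # greedy: all mass on the most-visited move
        best = visits.index(max(visits))
        return [1.0 if i == best else 0.0 for i in range(len(visits))]
    weights = [v ** (1.0 / temperature) for v in visits]
    total = sum(weights)
    return [w / total for w in weights]


# Assumed data: a root node with three legal moves.
priors = [0.6, 0.3, 0.1]
noisy = add_dirichlet_noise(priors, rng=random.Random(0))
scores = [puct_score(0.0, p, parent_visits=16, child_visits=0) for p in noisy]
print("noisy priors:", [round(p, 3) for p in noisy])
print("first move explored:", scores.index(max(scores)))
print("policy from visits:", visit_policy([10, 30, 60]))
```

When diagnosing a run like this one, it can help to log the bonus term and the value term separately. If the bonus dominates throughout training, search is mostly following the priors and the value head contributes little.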

r/LocalLLaMA · 23h ago · 9 · new model inference deployment

Google released Gemma 4 12B, a lightweight multimodal model designed for on-device deployment on consumer laptops (16GB RAM), with native audio/vision support and an encoder-free architecture. The model approaches the performance of the larger 26B variant while remaining efficient, enabling local agentic AI applications without cloud dependency.

r/LocalLLaMA · 1d ago · 9 · new model open source inference agent

Google DeepMind released Gemma 4 12B, a multimodal open-weight model with native audio/vision support, 256K context window, and both dense and MoE architecture variants optimized for local deployment from mobile to servers. The model features improved reasoning, coding capabilities, function-calling for agents, and is immediately usable via Hugging Face Transformers with Apache 2.0 licensing.

r/MachineLearning · 1d ago · 8 · open source library tool inference

A lightweight C++ implementation of Meta's EnCodec audio codec using Eigen with zero ML runtime dependencies, compiled weights, and single-threaded performance matching or exceeding ONNX Runtime. Provides an easily integrable CMake library for audio tokenization and compression tasks without external model files.