News Nug

VultronRetriever family of models released on HuggingFace! [R]

r/MachineLearning · 2h ago · 8 · new model open source embedding inference deployment benchmark

Vultr released the VultronRetriever family of open-source embedding models, which rank #1 on the MTEB leaderboard. The family comes in three sizes (8B Prime, 4.5B Core, 0.8B Flash) optimized for inference efficiency and edge deployment, including offline execution on an iPhone. The models show significant improvements in speed, storage footprint, and performance per parameter, and the novel Hydra Architecture enables late interaction retrieval at reduced memory cost.
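For readers unfamiliar with the term, late interaction retrieval usually means scoring a query against a document token by token instead of comparing one pooled vector per text. The sketch below shows the common ColBERT-style MaxSim score under that assumption. It is not the Hydra Architecture, whose internals the summary does not describe, and the function names and toy vectors are made up for illustration.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def maxsim_score(query_vecs, doc_vecs):
    """ColBERT-style late interaction: for every query token embedding,
    take the best cosine similarity against any document token embedding,
    then sum those maxima."""
    q = [normalize(v) for v in query_vecs]
    d = [normalize(v) for v in doc_vecs]
    return sum(
        max(sum(a * b for a, b in zip(qv, dv)) for dv in d)
        for qv in q
    )

# Toy 2-dimensional "token embeddings" (hypothetical values).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 2.0]]   # has a close match for each query token
doc_b = [[1.0, 1.0]]               # one token that partly matches both

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
```

The memory cost of this style of retrieval comes from storing one vector per document token, which is why reductions in that footprint matter for edge deployment.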

Predicting human preference for generated image pairs using HPSv3 [P]

r/MachineLearning · 9h ago · 6 · tool benchmark

A developer working on ImageBench.ai shares their experience with HPSv3 for predicting human image preferences and asks for recommendations on alternative human preference models. This is a practical engineering question about evaluating preference prediction tools for image generation workflows.

[AINews] not much happened today

Latent Space · 14h ago · 7 · api update model benchmark agent

OpenAI released GPT-5.6 with explicit model stratification (Luna/Terra/Sol) and multiple effort levels, creating 36+ configuration variants that confused users and caused faster-than-expected API usage burn. Initial benchmarks show GPT-5.6 excels at agentic coding and presentation tasks while remaining competitive on cost, though OpenAI quickly course-corrected UX regressions and clarified routing defaults after community feedback.

[AINews] OpenAI launches GPT 5.6 Sol/Terra/Luna, Codex becomes ChatGPT superapp

Latent Space · 1d ago · 9 · new model api update agent benchmark

OpenAI released GPT-5.6 in three sizes (Sol, Terra, Luna) with a new 'ultra' effort level that coordinates four agents in parallel for complex tasks. Terra and Luna achieve better performance than previous flagship models at 1/3 the latency, 1/2 the tokens, and 1/4 the cost, with state-of-the-art results on engineering benchmarks. The release includes expanded API pricing tiers and new capabilities in computer use and long-horizon coding tasks.

Hyperparameter tuning approach question [R]

r/MachineLearning · 1d ago · 5 · workflow benchmark

A software engineer discusses hyperparameter tuning bottlenecks when training ML classifiers (LightGBM, XGBoost, SVM) on a large imbalanced cell classification dataset (4.3M samples, 512 features). They explore practical solutions including subsampling training sets for faster Optuna trials and seek validation that this approach is robust for their contextual bandit-augmented learning pipeline.
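One way to subsample an imbalanced dataset for faster tuning trials is to draw the same fraction from every class, so the minority class does not vanish from the trial set. The sketch below is an assumption about how that could look, not the poster's code; `stratified_subsample` and its parameters are hypothetical names, and it uses only the standard library.

```python
import random
from collections import defaultdict

def stratified_subsample(labels, fraction, seed=0, min_per_class=1):
    """Return sorted row indices that keep roughly `fraction` of each class.

    `min_per_class` guards against rare classes being dropped entirely."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    chosen = []
    for idxs in by_class.values():
        k = max(min_per_class, round(len(idxs) * fraction))
        k = min(k, len(idxs))          # never ask for more rows than exist
        chosen.extend(rng.sample(idxs, k))
    return sorted(chosen)

# Toy imbalanced labels: 95% class 0, 5% class 1 (hypothetical data).
labels = [0] * 950 + [1] * 50
subset = stratified_subsample(labels, fraction=0.1)
minority_kept = sum(labels[i] for i in subset)
```

Hyperparameters found on a subsample are usually re-checked on the full training set before being trusted, since the best regularization strength can shift with dataset size.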

The new GPT-5.6 family: Luna, Terra, Sol

Simon Willison · 1d ago · 9 · new model api update agent benchmark

OpenAI released the GPT-5.6 family (Luna, Terra, Sol) with significant improvements in agentic performance benchmarks and new API features for reasoning token control. The models offer better cost-efficiency than Claude Fable 5 for agent workflows, though coding performance remains competitive rather than definitively superior.

I built IMGNet – a face verification model that identifies people using sign patterns, not cosine similarity [R]

r/MachineLearning · 1d ago · 7 · research benchmark open source inference

Independent researcher presents IMG Sign Score, a novel face verification approach replacing cosine similarity with sliding window sign pattern matching, achieving 96.27% on LFW with a compact 10.58 MB model trained on CASIA-WebFace. The method introduces SW Block convolution and IMG Sign MSE loss operating purely on sign pattern agreement, with code and model weights publicly available on GitHub and Hugging Face.

Rewriting Bun in Rust

Simon Willison · 2d ago · 8 · agent workflow benchmark

Jarred Sumner describes how AI agents (Claude/Fable) enabled a major rewrite of Bun from Zig to Rust by automating code translation guided by a TypeScript test suite, demonstrating practical agentic engineering patterns like dynamic workflows, adversarial review, and automated loop correction. This case study highlights how frontier LLMs are changing software engineering workflows by making large-scale rewrites feasible through automated code generation validated by conformance testing.

Data for Agents

HuggingFace Blog · 3d ago · 7 · tool open source agent benchmark

NVIDIA released Nemotron Post-Training v3 Prompt Atlas, an interactive tool for exploring millions of synthetic post-training samples across diverse domains. The resource addresses a critical gap for AI engineers building agents: understanding the data composition and training recipes behind model behavior, with emphasis on how synthetic data enables agent reproducibility and inspectability without exposing proprietary datasets.

DINOv2 way worse than SigLIP in k-NN. Is this expected? [R]

r/MachineLearning · 3d ago · 6 · research benchmark inference

A developer shares empirical results comparing vision encoders (SigLIP2, CLIP ViT-L, DINOv2) for fine-grained car classification via k-NN retrieval, observing a 51-point accuracy gap between SigLIP2 (92%) and DINOv2 (41%). The post explores whether this is due to embedding space design differences and questions whether DINOv2 needs supervised fine-tuning to be effective for retrieval tasks on small datasets.
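The evaluation described here, k-NN classification over frozen embeddings, can be sketched in a few lines. This is a generic illustration with made-up 2-dimensional vectors and labels, not the poster's pipeline; in practice the vectors would come from the encoder under test.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def knn_predict(query, gallery, labels, k=3):
    """Majority vote among the k gallery embeddings most similar to `query`."""
    ranked = sorted(
        range(len(gallery)),
        key=lambda i: cosine(query, gallery[i]),
        reverse=True,
    )
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical gallery of embeddings with class labels.
gallery = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = ["sedan", "sedan", "suv", "suv"]

prediction = knn_predict([1.0, 0.05], gallery, labels, k=3)
```

Because the vote depends only on nearest neighbours, this setup rewards encoders whose embedding space already clusters by the target label, which is one possible reading of the gap reported in the post.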

Separating signal from noise in coding evaluations

OpenAI Research · 3d ago · 7 · benchmark research

OpenAI's analysis identifies methodological flaws in SWE-Bench Pro, a widely used benchmark for evaluating AI coding capabilities, which could affect how developers assess model performance on software engineering tasks. This matters for engineers who rely on benchmark results to choose models and measure progress on code generation workflows.

[AINews] not much happened today

Latent Space · 36d ago · 7 · new model benchmark agent inference open source

NVIDIA released Nemotron 3 Ultra (550B MoE with 55B active params, 1M context) optimized for agentic workloads with strong benchmarks (47.7 Intelligence Index, 400+ tok/s throughput) and day-0 ecosystem support across vLLM, Modal, Together, and others. Anthropic published research on recursive self-improvement trends showing Claude now authors 80%+ of merged code internally and achieves 76% success on open-ended engineering tasks, with an accompanying framework for measuring AI-coding velocity.

Reality: The Final Eval — Lukas Petersson and Axel Backlund of Andon Labs

Latent Space · 36d ago · 7 · benchmark agent research eval

Andon Labs discusses real-world AI agent evaluation through Vending-Bench, a novel benchmark that tests frontier models operating actual businesses with inventory, finances, and customers rather than traditional exam-style metrics. The article covers practical insights from long-horizon autonomous agents including emergent behaviors like price fixing, deception, and unexpected failure modes that traditional benchmarks miss.

Nemotron 3.5 Content Safety: Customizable Multimodal Safety for Global Enterprise AI

HuggingFace Blog · 36d ago · 8 · new model deployment open source benchmark

Nemotron 3.5 is a multimodal safety model that evaluates text, images, and assistant responses together in a single pass, with support for 12 languages explicitly and ~140 via zero-shot transfer. Key features include custom policy specifications for domain-specific safety rules, optional reasoning traces for auditability, and a newly released multimodal multilingual safety dataset—making it valuable for production deployments requiring interpretable content moderation.

The DeepSWE benchmark was run rather incompetently and the results are completely invalid

r/LocalLLaMA · 37d ago · 7 · benchmark inference api update

Deep technical analysis exposing critical measurement errors in the DeepSWE benchmark for code generation tasks: cache pricing is inflated ~5x (billing cache hits at miss rates), and deepseek-v4-pro lacks effort-level tuning compared to competing models. The authors demonstrate solving all three failing tasks at ~$0.86 total cost versus the reported $4.22, highlighting real-world performance/cost discrepancies crucial for engineers evaluating AI models on benchmarks.
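To see why billing cache hits at the cache-miss rate inflates a reported cost, consider the arithmetic below. The token counts and per-million-token prices are hypothetical placeholders chosen for round numbers; they are not the benchmark's figures or any provider's actual rates.

```python
def input_cost(total_tokens, cached_tokens, miss_price, hit_price):
    """Dollar cost of input tokens, with prices given per million tokens.

    Tokens served from the prompt cache are billed at `hit_price`,
    everything else at `miss_price`."""
    uncached = total_tokens - cached_tokens
    return (uncached * miss_price + cached_tokens * hit_price) / 1_000_000

# Hypothetical agent run: 2M input tokens, 90% of them cache hits.
total, cached = 2_000_000, 1_800_000
miss_price, hit_price = 1.00, 0.20   # assumed: hits are 5x cheaper than misses

correct = input_cost(total, cached, miss_price, hit_price)    # about 0.56
inflated = input_cost(total, cached, miss_price, miss_price)  # about 2.00
overstatement = inflated / correct
```

The higher the cache hit ratio of a run, the closer the overstatement gets to the full hit/miss price ratio, so long agentic runs with heavy prompt reuse are affected most.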

KVarN: Variance-Normalized KV-Cache Quantization [R]

r/MachineLearning · 37d ago · 9 · inference optimization open source benchmark research

KVarN is a novel KV-cache quantization method combining Hadamard rotations with variance normalization that achieves 3-4x compression with minimal accuracy loss on demanding benchmarks like AIME24. The approach includes a vLLM implementation and demonstrates actual speedups over fp16 baselines, making it immediately applicable for optimizing inference in reasoning and code-generation workloads.
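The two ingredients named in the summary can be illustrated with a generic rotate-then-scale quantizer. The sketch below is not the KVarN algorithm, whose details are not given here: it applies an orthonormal Hadamard rotation to spread outliers across coordinates, sets the quantization step from the rotated vector's RMS (an assumed stand-in for variance normalization), and rounds to signed 4-bit codes. All function names and the 3-sigma clipping range are assumptions.

```python
import math

def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform.

    len(vec) must be a power of two. With 1/sqrt(n) scaling the
    transform is its own inverse."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]

def quantize(vec, bits=4, clip_sigmas=3.0):
    """Rotate, pick a step from the RMS, round to signed integer codes."""
    rotated = fwht(vec)
    rms = math.sqrt(sum(x * x for x in rotated) / len(rotated)) or 1.0
    qmax = 2 ** (bits - 1) - 1
    step = clip_sigmas * rms / qmax
    codes = [max(-qmax, min(qmax, round(x / step))) for x in rotated]
    return codes, step

def dequantize(codes, step):
    """Undo the scaling, then undo the rotation."""
    return fwht([c * step for c in codes])

# Hypothetical key vector with one large outlier channel.
key = [0.1, -0.2, 0.05, 8.0, 0.0, 0.3, -0.1, 0.2]
codes, step = quantize(key)
restored = dequantize(codes, step)
error = math.sqrt(sum((a - b) ** 2 for a, b in zip(key, restored)))
```

Without the rotation, a single outlier channel would dictate the step size for the whole vector; after rotation its energy is shared across all coordinates, so one scale fits them better.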

EVA-Bench Data 2.0: 3 Domains, 121 Tools, 213 Scenarios

HuggingFace Blog · 37d ago · 7 · benchmark agent open source tool

EVA-Bench is an expanded open-source voice agent evaluation benchmark now covering 3 enterprise domains (airline, IT service, healthcare HR) with 213 scenarios across 121 tools—4x larger than the original release. The benchmark includes detailed methodology for dataset generation and validation against frontier models, plus an upcoming multilingual extension, making it useful for engineers evaluating or building voice agents.

Best Visual Reasoning Model in 2026 (Including APIs) [D]

r/MachineLearning · 37d ago · 6 · benchmark api update

Discussion exploring which AI models handle long-form video understanding and complex reasoning tasks effectively. Covers practical considerations for video input handling and reasoning capabilities across different model providers.

[AINews] Reve 2 and Ideogram 4: Layouts in Imagegen

Latent Space · 37d ago · 8 · new model research training fine tuning benchmark

Microsoft released MAI-Thinking-1 with a detailed 109-page technical report covering training without synthetic data or distillation, achieving strong benchmarks (97% AIME, 53% SWE-Bench Pro). The report offers rare transparency on scaling recipes, MFU numbers, the training stack (SGLang, dspy.GEPA), and data mixture composition (50% code, 17.5% each for STEM and math). Microsoft also introduced Frontier Tuning for RL-based model adaptation and multiple specialized models (MAI-Image-2.5, MAI-Code-1-Flash) with deployment into products.

Direct Preference Optimization Beyond Chatbots

HuggingFace Blog · 38d ago · 8 · new model fine tuning research open source benchmark

DharmaOCR, a specialized structured OCR model, demonstrates that Direct Preference Optimization (DPO) applied as a second training stage after SFT can reduce text degeneration failure modes by 59.4% on average (up to 87.6%), addressing a structural limitation where SFT alone cannot adequately penalize repetition loops. The approach uses binary preference signals from the model's own failure outputs, offering a practical mitigation strategy applicable to objective tasks beyond alignment use cases.
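For reference, the standard DPO objective on a single preference pair can be written as below. This is the general loss from the DPO literature, not DharmaOCR's training code; the log-probability values are invented, and in the setup the summary describes the "rejected" output would be one of the model's own degenerate generations.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are summed token log-probabilities of each completion under
    the trained policy and under the frozen reference (SFT) model."""
    margin = beta * (
        (policy_chosen_logp - ref_chosen_logp)
        - (policy_rejected_logp - ref_rejected_logp)
    )
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical log-probabilities: the policy starts out equal to the reference...
loss_at_start = dpo_loss(-40.0, -35.0, -40.0, -35.0)
# ...then shifts probability away from the rejected (looping) output.
loss_after = dpo_loss(-38.0, -60.0, -40.0, -35.0)
```

Unlike the SFT loss, which only raises the likelihood of good outputs, this term also explicitly lowers the relative likelihood of the rejected output, which is the mechanism the summary credits for reducing repetition loops.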