Anthropic Research · 7h ago · 7 · benchmark research agent

Anthropic's research team released BioMysteryBench, a bioinformatics benchmark evaluating Claude's ability to analyze real-world biological datasets and tackle complex scientific workflows. The benchmark shows Claude's scientific reasoning improving across model generations, with performance now on par with human experts on biology tasks that go beyond knowledge tests to include data analysis, hypothesis generation, and experimental design.

Simon Willison · 15h ago · 8 · library api update workflow

LLM 0.32a0 introduces a major architectural shift from text prompt/response to a message-based interface with multi-type streaming outputs, enabling better handling of modern LLM capabilities like image/audio/video inputs and structured responses. The update replaces the simple prompt-based API with conversational message sequences and composite response streams, aligning with vendor APIs like OpenAI's chat completions while maintaining backward compatibility.
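The summary doesn't show LLM 0.32a0's actual interface, but the prompt-vs-message distinction it describes can be illustrated with OpenAI-style message dicts (a hypothetical sketch, not the library's real API):

```python
# Hypothetical illustration of the prompt-vs-message shift described above;
# these dicts mirror OpenAI-style chat messages, not LLM 0.32a0's actual API.

# Old style: a single prompt string in, a single text response out.
prompt = "Summarize this image and transcript."

# New style: a sequence of typed messages, each of which can carry
# multiple content parts (text, images, audio, ...).
messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are concise."}]},
    {"role": "user", "content": [
        {"type": "text", "text": "Summarize this image and transcript."},
        {"type": "image_url", "image_url": {"url": "https://example.com/frame.png"}},
    ]},
]

# A composite response stream interleaves typed chunks rather than raw text.
response_stream = [
    {"type": "text", "text": "The image shows "},
    {"type": "text", "text": "a whiteboard diagram."},
]
final_text = "".join(part["text"] for part in response_stream if part["type"] == "text")
```

The point of the composite stream is that consumers filter by chunk type instead of assuming everything is text.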

HuggingFace Blog · 17h ago · 7 · benchmark inference workflow research

AI evaluation costs have become a significant barrier to entry, with agent benchmarks costing $20k-$40k per comprehensive sweep and single runs on frontier models exceeding $2,800. The article explores cost drivers in agent evaluation (scaffold choice creating 33× variance), presents compression techniques like Flash-HELM that reduce compute 100-200× while preserving rankings, and discusses how evaluation can exceed pretraining costs during model development cycles.
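Flash-HELM's actual algorithm isn't detailed in the summary; a toy sketch of the general idea behind such compression techniques (rank models on a strided subsample of tasks and check the ranking matches the full run) might look like:

```python
# Toy sketch of benchmark compression by subsampling. Scores are synthetic
# and the method is illustrative, not the actual Flash-HELM algorithm.

# Per-task accuracies for three hypothetical models on a 1000-task benchmark.
full_scores = {
    "model_a": [1 if i % 10 < 7 else 0 for i in range(1000)],  # ~70%
    "model_b": [1 if i % 10 < 5 else 0 for i in range(1000)],  # ~50%
    "model_c": [1 if i % 10 < 9 else 0 for i in range(1000)],  # ~90%
}

def rank(scores: dict, stride: int = 1) -> list:
    """Rank models by mean accuracy over every `stride`-th task."""
    means = {m: sum(s[::stride]) / len(s[::stride]) for m, s in scores.items()}
    return sorted(means, key=means.get, reverse=True)

full_ranking = rank(full_scores)          # evaluates all 1000 tasks per model
compressed = rank(full_scores, stride=7)  # evaluates 143 tasks: ~7x cheaper
assert full_ranking == compressed         # ranking is preserved
```

Real compression schemes are adaptive (spending more compute only where models are close), which is how they reach 100-200× rather than the fixed factor shown here.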

r/LocalLLaMA · 18h ago · 9 · new model agent tool api update benchmark

Mistral released Medium 3.5, a new 128B flagship merged model with 256k context window scoring 77.6% on SWE-Bench Verified, designed for long-horizon coding tasks with configurable reasoning effort. The model powers remote async coding agents in Mistral Vibe and Le Chat's new Work mode, enabling developers to offload multi-step tasks to cloud infrastructure that runs in parallel and integrates with GitHub, Linear, Jira, and Slack.

r/LocalLLaMA · 18h ago · 8 · new model open source fine tuning inference

IBM released Granite 4.1, a comprehensive collection of enterprise-focused models including language models (3B-30B), vision, speech, embeddings, and Guardian safety models. The 8B instruct variant matches Granite 4.0's 32B MoE model while being more efficient for fine-tuning, with strong performance on instruction following and tool calling—practical advantages for cost-conscious production deployments.

r/LocalLLaMA · 19h ago · 9 · new model inference tool deployment

Mistral Medium 3.5 is a new 128B dense flagship model with 256k context window that unifies instruction-following, reasoning, and coding capabilities in a single set of weights, replacing previous Mistral Medium 3.1 and Devstral models. The model features configurable reasoning effort per request, supports variable image sizes, and achieves strong benchmarks (91.4% on τ³-Telecom, 77.6% on SWE-Bench Verified), with practical deployment guidance via vLLM, SGLang, and Mistral's API. An EAGLE speculative decoding model is available to accelerate local inference.
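Since vLLM and SGLang expose OpenAI-compatible endpoints, a request to a locally served instance would be shaped roughly like the payload below. The model identifier and the per-request reasoning-effort field are assumptions based on the summary, not confirmed API names:

```python
import json

# Sketch of a chat-completions payload for a vLLM OpenAI-compatible server.
# "mistral-medium-3.5" and "reasoning_effort" are assumed names, not
# confirmed identifiers from Mistral's or vLLM's documentation.
payload = {
    "model": "mistral-medium-3.5",       # assumed model identifier
    "messages": [
        {"role": "user", "content": "Refactor this function to be iterative."},
    ],
    "max_tokens": 2048,
    "reasoning_effort": "high",          # hypothetical per-request knob
}

body = json.dumps(payload).encode()
# Sending it would be a POST to http://localhost:8000/v1/chat/completions,
# omitted here because it requires a running server.
```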

HuggingFace Blog · 19h ago · 9 · new model open source research

IBM released Granite 4.1, a family of dense LLMs (3B, 8B, 30B) trained on 15T tokens with a sophisticated 5-phase pre-training pipeline featuring long-context extension to 512K tokens and GRPO-based RLHF. The 8B instruct model matches previous 32B MoE performance through rigorous data curation and multi-stage refinement, all released under Apache 2.0.

r/MachineLearning · 19h ago · 7 · tool open source rag

A developer built an interactive scientific paper map using SPECTER 2 embeddings, UMAP dimensionality reduction, and Voronoi partitioning on 10M OpenAlex papers to enable semantic exploration and hybrid keyword/semantic search. The system demonstrates practical application of embedding models, clustering algorithms, and analytics pipelines for knowledge discovery at scale.
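The hybrid keyword/semantic scoring such a system uses can be sketched in miniature: blend an exact-term overlap score with embedding cosine similarity. The toy 3-d vectors below stand in for SPECTER 2 outputs; the blending weight is an assumption, not the author's actual formula:

```python
import math

# Minimal sketch of hybrid keyword + semantic search. Embeddings are toy
# 3-d vectors standing in for real SPECTER 2 paper embeddings.
papers = {
    "p1": {"title": "protein folding with transformers", "emb": [0.9, 0.1, 0.0]},
    "p2": {"title": "graph clustering at scale", "emb": [0.1, 0.9, 0.1]},
    "p3": {"title": "transformer scaling laws", "emb": [0.8, 0.2, 0.1]},
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def hybrid_search(query_terms, query_emb, alpha=0.5):
    """Blend keyword overlap with embedding similarity; higher alpha favors semantics."""
    results = []
    for pid, p in papers.items():
        kw = len(set(query_terms) & set(p["title"].split())) / len(query_terms)
        sem = cosine(query_emb, p["emb"])
        results.append((alpha * sem + (1 - alpha) * kw, pid))
    return [pid for _, pid in sorted(results, reverse=True)]

ranking = hybrid_search(["transformer", "scaling"], [0.85, 0.15, 0.05])
```

Note that p1 scores well semantically despite zero exact-term overlap ("transformers" vs "transformer"), which is exactly the failure mode of pure keyword search that embedding similarity repairs.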

r/MachineLearning · 20h ago · 8 · library open source workflow research

AeroJAX is a JAX-native differentiable CFD framework enabling end-to-end gradient flow through Navier-Stokes and LBM solvers for inverse design and learned closures. The framework maintains full differentiability across physics simulation pipelines, allowing CFD solvers to be embedded directly in ML optimization loops without treating them as black boxes, which is valuable for physics-informed learning and inverse design applications.
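The inverse-design pattern this enables can be shown with a toy solver. Below, finite differences stand in for `jax.grad` so the example runs without JAX; AeroJAX's actual API is not shown in the summary. The structure is the point: the solver sits inside the optimization loop and gradients flow through it, rather than the solver being a black box:

```python
# Toy end-to-end differentiation through a solver: tune diffusivity `nu` so
# the solver's output matches a target (the inverse-design pattern).
# Finite differences stand in for autodiff; not AeroJAX's real API.

def solve(nu: float, steps: int = 50, dt: float = 0.01) -> float:
    """Explicit-Euler decay u' = -nu * u from u(0) = 1; returns u at t = steps*dt."""
    u = 1.0
    for _ in range(steps):
        u -= dt * nu * u
    return u

def grad(f, x: float, eps: float = 1e-6) -> float:
    """Central finite-difference derivative (jax.grad would replace this)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# Gradient descent on the squared error between solver output and a target.
target, nu, lr = 0.5, 2.0, 5.0
for _ in range(200):
    nu -= lr * grad(lambda n: (solve(n) - target) ** 2, nu)

assert abs(solve(nu) - target) < 1e-4  # recovered the diffusivity that hits the target
```

With a real differentiable framework the same loop works for full Navier-Stokes or LBM state, since autodiff replaces the finite-difference probe at no extra solver calls per parameter.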

r/MachineLearning · 23h ago · 6 · tool workflow inference

Discussion of autocomplete/typeahead system architectures balancing latency, quality, and infrastructure complexity, comparing classical methods (prefix/n-gram), full search backends, and LLM-based approaches. The author shares a lightweight Python library for query autocomplete and seeks production insights on hybrid retrieval+reranking patterns versus traditional approaches.
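The classical prefix approach under discussion is simple enough to sketch directly: precompute, for every prefix, the top-k completions ranked by query frequency (a flat prefix map here rather than a trie, and not the author's actual library):

```python
# Minimal prefix-based autocomplete of the "classical" kind discussed in the
# thread: a prefix map with completions ranked by query frequency.
from collections import defaultdict

class Autocomplete:
    def __init__(self, k: int = 3):
        self.k = k
        self.by_prefix = defaultdict(dict)  # prefix -> {query: frequency}

    def add(self, query: str, freq: int = 1):
        # Register the query under every one of its prefixes.
        for i in range(1, len(query) + 1):
            bucket = self.by_prefix[query[:i]]
            bucket[query] = bucket.get(query, 0) + freq

    def suggest(self, prefix: str):
        # Return up to k completions, most frequent first.
        bucket = self.by_prefix.get(prefix, {})
        return sorted(bucket, key=bucket.get, reverse=True)[: self.k]

ac = Autocomplete()
for q, f in [("machine learning", 50), ("machine translation", 20), ("macbook", 90)]:
    ac.add(q, f)
```

The latency/quality trade-off in the thread is visible even here: lookups are O(1)-ish and predictable, but the system can only ever return queries it has already seen, which is the gap retrieval+reranking or LLM-based approaches try to close.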

r/MachineLearning · 1d ago · 6 · research

A researcher shares a survey on weight-space learning—an emerging field focused on learning and reasoning directly in neural network parameter spaces rather than just input-output behavior. The post includes a pointer to a comprehensive arxiv survey and expresses interest in connecting with others working on related research problems.

r/MachineLearning · 1d ago · 6 · benchmark dataset open source

A multilingual speech language model challenge covering speaker diarization, ASR, and conversational understanding across 14 languages, with a free 2,100-hour dataset. Two tracks focus on speech recognition/diarization and semantic understanding through QA, offering hands-on experience relevant to building production speech systems.

Latent Space · 1d ago · 8 · inference tool new model benchmark

vLLM 0.20 brings significant inference optimizations including 2-bit KV cache quantization, MoE serving efficiency, and multi-hardware support (Blackwell, ROCm, Intel XPU), with early benchmarks showing substantial speedups for DeepSeek V4 serving. Multiple open model releases (Poolside Laguna XS, NVIDIA Nemotron 3 Nano Omni) emphasize deployment-friendly architectures with MoE efficiency and multi-modal capabilities, while community discussion highlights quantization trade-offs and potential hardware diversification away from CUDA lock-in.
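The 2-bit KV-cache idea can be illustrated with a toy group quantizer: store only a 2-bit level index per value plus a per-group offset and scale. This is illustrative only; vLLM's actual kernel-level scheme is not shown in the summary:

```python
# Toy sketch of 2-bit quantization as applied to a KV-cache block: map each
# float in a group to one of 4 levels, keeping a per-group (lo, scale) pair.
# Illustrative only, not vLLM 0.20's actual quantization kernel.

def quantize_2bit(values):
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 3 or 1.0  # 4 levels span 3 intervals; guard all-equal groups
    codes = [round((v - lo) / scale) for v in values]  # each code fits in 2 bits
    return codes, lo, scale

def dequantize_2bit(codes, lo, scale):
    return [lo + c * scale for c in codes]

kv_block = [0.10, -0.42, 0.05, 0.31]
codes, lo, scale = quantize_2bit(kv_block)
restored = dequantize_2bit(codes, lo, scale)
max_err = max(abs(a - b) for a, b in zip(kv_block, restored))  # bounded by scale/2
```

Storage drops from 16-32 bits per value to 2 bits plus a small per-group overhead, which is where the KV-cache memory savings, and hence longer contexts or larger batches, come from; the quantization trade-off the community discussion highlights is exactly the `max_err` term.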

r/MachineLearning · 1d ago · 6 · prompt engineering research

Reddit discussion exploring why LLMs express reasoning through natural language chains-of-thought rather than operating directly in latent vector space, and the tradeoffs between vector-based and language-based reasoning for interpretability, efficiency, and task performance. Touches on practical considerations for model architecture and reasoning transparency that are relevant to LLM engineering but lacks concrete technical solutions or research findings.

HuggingFace Blog · 1d ago · 7 · api update inference tool

DeepInfra is now integrated as a supported Inference Provider on Hugging Face Hub, offering serverless inference for 100+ models including LLMs, text-to-image, and embeddings with cost-effective pricing. Developers can access models like DeepSeek V4 and Kimi-K2.6 directly through Hugging Face SDKs (Python/JS) and agent frameworks without additional setup, with automatic routing and transparent billing.

r/MachineLearning · 1d ago · 8 · benchmark research open source

New structured output benchmark that measures value accuracy and faithfulness beyond just JSON schema validation, revealing significant gaps between schema compliance (90%+) and actual value correctness across all models. Includes comprehensive evaluation framework with 7 key metrics across text, image, and audio modalities, with open-source code and leaderboard showing GPT-4 leading and GLM-4 performing competitively.
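The compliance-vs-correctness gap the benchmark measures is easy to demonstrate: an output can have exactly the right keys and types while its values are wrong. A minimal sketch (toy schema and data, not the benchmark's actual metrics):

```python
# Toy illustration of schema compliance vs value accuracy: all three outputs
# below pass a schema check, but only some values match the ground truth.

schema = {"name": str, "year": int}
ground_truth = {"name": "Grace Hopper", "year": 1906}

outputs = [
    {"name": "Grace Hopper", "year": 1906},  # compliant and correct
    {"name": "Grace Hopper", "year": 1992},  # compliant, wrong value
    {"name": "Ada Lovelace", "year": 1815},  # compliant, wrong values
]

def schema_compliant(out):
    """Right keys and right types, saying nothing about the values."""
    return set(out) == set(schema) and all(isinstance(out[k], t) for k, t in schema.items())

def value_accuracy(out):
    """Fraction of fields whose values match the ground truth."""
    return sum(out[k] == ground_truth[k] for k in schema) / len(schema)

compliance = sum(map(schema_compliant, outputs)) / len(outputs)    # 1.0
accuracy = sum(value_accuracy(o) for o in outputs) / len(outputs)  # 0.5
```

Here compliance is 100% while value accuracy is 50%, mirroring the gap the benchmark reports between 90%+ schema compliance and actual correctness.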

Anthropic Blog · 1d ago · 7 · tool api update workflow open source

Anthropic released Claude connectors for creative tools including Blender, Autodesk, Adobe, Ableton, and Splice, built on the Model Context Protocol (MCP) standard. These connectors enable Claude to integrate directly with professional creative software, allowing developers to build AI-assisted workflows for 3D modeling, design, music production, and related tasks. The MCP-based approach ensures compatibility across multiple LLMs and emphasizes interoperability.
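MCP connectors advertise their capabilities to the model as tool declarations with a `name`, `description`, and JSON-Schema `inputSchema`; that three-field shape follows the Model Context Protocol spec, while the specific Blender-flavored tool below is hypothetical:

```python
# Shape of an MCP tool declaration such a connector might expose. The tool
# name and parameters are hypothetical; the name/description/inputSchema
# structure follows the Model Context Protocol specification.
tool = {
    "name": "create_mesh",  # hypothetical Blender-side tool
    "description": "Create a primitive mesh object in the active scene.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "primitive": {"type": "string", "enum": ["cube", "sphere", "plane"]},
            "size": {"type": "number", "minimum": 0},
        },
        "required": ["primitive"],
    },
}

# The MCP client advertises tools like this to the model, which then issues
# structured calls that the connector validates against inputSchema.
call = {"name": "create_mesh", "arguments": {"primitive": "cube", "size": 2.0}}
required_ok = all(k in call["arguments"] for k in tool["inputSchema"]["required"])
```

Because the tool contract lives in the protocol rather than in any one model's function-calling format, the same connector works across MCP-capable LLMs, which is the interoperability point the announcement emphasizes.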