Anthropic Blog · 1h ago · 9 · new model api update agent benchmark

Claude Opus 4.8 is released today with improved benchmark results across coding, reasoning, and agentic tasks, plus better honesty in flagging uncertainty. Key features include user-controllable effort levels on claude.ai, dynamic workflows in Claude Code for large-scale problems, and a fast mode that is 3x cheaper and runs at 2.5x speed. The model is 4x less likely than Opus 4.7 to let code flaws pass unremarked.

r/MachineLearning · 1h ago · 8 · benchmark agent deployment research

AgingBench introduces a longitudinal deployment benchmark showing that upgrading an agent's backbone model isn't straightforward: swapping Claude Sonnet 4.6 for Opus 4.7 in a coding agent decreased performance by ~15% because of how memory state evolves over long deployments. Memory management policy alone created a 4.5x spread in agent half-life, suggesting that model capability by itself doesn't predict how an agentic system performs over time in real-world use.

r/MachineLearning · 2h ago · 8 · new model open source fine tuning deployment

Wall-OSS-0.5 is a new 4B Vision-Language-Action model using a gradient bridge approach where discrete action-token CE dominates VLM backbone updates while flow matching contributes ~5%, combined with Vision-Aligned RVQ tokenization for semantic grounding of action tokens and DMuon optimizer for distributed training. The release includes strong real-robot evaluations (82% on held-out deformable tasks zero-shot, 60.5% average after fine-tuning across 15 tasks) and open-source code, making it immediately relevant for practitioners building embodied AI systems.

r/MachineLearning · 6h ago · 8 · tool library open source dataset benchmark

MONET is a new Apache 2.0 open-source image-text dataset with 104.9M high-quality samples curated from 2.9B images, accompanied by visualization tools, a retrieval system, and a T2I training codebase. This is a significant resource for engineers building multimodal AI systems, offering both the dataset and practical tooling for training text-to-image models.

r/LocalLLaMA · 11h ago · 7 · new model tool tutorial benchmark

Q-Judger is a fine-tuned vision-language model for automated evaluation of text-to-image generation quality, built on Qwen2.7-27B with structured JSON output. The article provides practical setup instructions across multiple inference frameworks (Transformers, vLLM, SGLang, Docker) and demonstrates hierarchical evaluation criteria validated against human expert rankings.

Anthropic Research · 14h ago · 6 · agent research benchmark

A survey of 1,260 quantitative social scientists (Feb-Mar 2026) reveals 81% adoption of generative AI for research, with coding agents like Claude Code enabling autonomous research execution: they automate data analysis, interpretation, and iteration, tasks previously considered irreducibly human. The research explores disparities in tool access, output quality differences, and potential impacts on the scholarly record, with planned randomized experiments to measure productivity effects.

r/MachineLearning · 16h ago · 7 · training research workflow

A researcher training GPT-like Transformer-decoder models (100M-500M parameters) on 750M tokens is encountering a common failure mode where the model gets stuck generating single tokens repeatedly, suggesting a training dynamics issue. The post includes detailed hyperparameters (AdamW optimizer, 1e-3 learning rate, 4M token batch size) and seeks guidance on whether decoder-only model training requires specific tricks or has undocumented failure modes.
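
One number that follows directly from the figures quoted above is how many optimizer updates the run gets. A minimal sketch, assuming a single pass over the data and that the "4M token batch size" means tokens per optimizer step (neither is confirmed by the summary); `optimizer_steps` is a hypothetical helper, not something from the post:

```python
def optimizer_steps(total_tokens: int, tokens_per_batch: int, epochs: int = 1) -> int:
    """Number of full optimizer steps available from a fixed token budget."""
    return (total_tokens * epochs) // tokens_per_batch


# Figures from the post summary; the single-epoch default is an assumption.
steps = optimizer_steps(total_tokens=750_000_000, tokens_per_batch=4_000_000)
print(steps)  # 187
```

Under these assumptions the run sees fewer than 200 updates, so a warmup or decay schedule sized in thousands of steps would never finish. Whether that contributes to the collapse described in the post is not established.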

r/MachineLearning · 17h ago · 7 · research open source tool

This project applies diffusion models to sketch-guided trajectory simulation in basketball, enabling controllable generation of player movements conditioned on partial instructions. The approach uses joint refinement of all trajectories through diffusion rather than autoregressive methods, with open-sourced code and models demonstrating a practical application of conditional generation for sports analytics.

r/LocalLLaMA · 18h ago · 9 · api update deployment agent

CVE-2026-48710 (BadHost) is a critical vulnerability in Starlette that affects FastAPI, vLLM, LiteLLM, and MCP servers—allowing HTTP Host header injection to bypass authentication. AI engineers building agents and services must immediately upgrade Starlette to version 1.0.1+ and audit any systems using these frameworks, as credentials stored in MCP servers are particularly at risk.
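
As a first audit step, a service can check which Starlette release is actually installed in its environment. The sketch below uses only the standard library; the 1.0.1 threshold is taken from the summary above and should be confirmed against the advisory, and `parse_version`, `is_patched`, and `installed_starlette_is_patched` are hypothetical helper names. It ignores pre-release suffixes, so a string like `1.0.1rc1` is treated as 1.0.1.

```python
import re
from importlib import metadata

PATCHED = (1, 0, 1)  # assumed fixed release, per the summary; verify before relying on it


def parse_version(version: str) -> tuple:
    """Extract the leading numeric release components, e.g. '0.46.2' -> (0, 46, 2)."""
    match = re.match(r"(\d+(?:\.\d+)*)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def is_patched(version: str, minimum: tuple = PATCHED) -> bool:
    """True when the release tuple is at or above the minimum."""
    release = parse_version(version)
    # Pad with zeros so that '1.0' compares as (1, 0, 0).
    width = max(len(release), len(minimum))
    release += (0,) * (width - len(release))
    minimum += (0,) * (width - len(minimum))
    return release >= minimum


def installed_starlette_is_patched():
    """Return True/False for the installed Starlette, or None if it is not installed."""
    try:
        return is_patched(metadata.version("starlette"))
    except metadata.PackageNotFoundError:
        return None
```

Because FastAPI, vLLM, and LiteLLM pull Starlette in as a dependency, the check is worth running inside each service's own environment rather than once per machine.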

Simon Willison · 19h ago · 6 · agent workflow

SQLite has published AGENTS.md documenting their policy on AI agents interacting with their codebase—they reject agentic code contributions but accept high-quality AI-generated bug reports with reproducible test cases. This reflects practical workflow considerations for engineers using AI agents in development, including how open-source projects are adapting policies around AI-generated contributions.

r/MachineLearning · 21h ago · 7 · open source benchmark agent research

An open-source Context Swarm Memory (CSM) system is benchmarked against Hindsight on BEAM 100K, reaching an AMB score of 0.757573 vs 0.733658 while using 38.2% fewer context tokens, at the cost of 4.5x slower retrieval. The author seeks methodology feedback before pursuing official leaderboard validation.

r/MachineLearning · 22h ago · 8 · inference open source library benchmark

TritonMoE is a portable MoE inference kernel written in Triton that achieves 89-131% of Megablocks throughput while running unchanged on both NVIDIA and AMD GPUs. The key optimization uses fused gate+up GEMM operations to reduce global memory traffic by 35%, though performance degrades at very long sequences (2048+ tokens) and under extreme routing skew.
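
The idea behind a fused gate+up projection can be illustrated without a GPU. The sketch below is plain Python, not the TritonMoE kernel: it assumes a SiLU-gated expert MLP (an assumption; the summary does not name the activation) and shows that one matrix product against the concatenated weights `[w_gate | w_up]` gives the same result as two separate products, so the input only needs to be read once. All function names are hypothetical.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of a (m x k) and b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def gated_unfused(x, w_gate, w_up):
    """Two GEMMs: the input x is read once per projection."""
    gate = matmul(x, w_gate)
    up = matmul(x, w_up)
    return [[silu(g) * u for g, u in zip(g_row, u_row)] for g_row, u_row in zip(gate, up)]


def gated_fused(x, w_gate, w_up):
    """One GEMM against [w_gate | w_up], then split the output columns."""
    w_cat = [g_row + u_row for g_row, u_row in zip(w_gate, w_up)]
    both = matmul(x, w_cat)
    hidden = len(w_gate[0])
    return [[silu(row[j]) * row[hidden + j] for j in range(hidden)] for row in both]


x = [[1.0, -2.0], [0.5, 3.0]]
w_gate = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
w_up = [[1.0, 0.0, -1.0], [2.0, 1.0, 0.5]]

fused = gated_fused(x, w_gate, w_up)
unfused = gated_unfused(x, w_gate, w_up)
same = all(
    math.isclose(f, u) for f_row, u_row in zip(fused, unfused) for f, u in zip(f_row, u_row)
)
print(same)
```

In a real kernel the saving comes from loading each input tile from global memory once and keeping the gate and up partial results on chip; the concatenation here only demonstrates that the two formulations are numerically equivalent.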

r/MachineLearning · 22h ago · 7 · open source fine tuning rag dataset

Open-source UK GDPR compliance QA dataset (1K pairs) with SME-focused questions, detailed answers linked to specific articles/ICO guidance, and generation metadata. Generated via Qwen 14B + DeepSeek API, released in JSON/Parquet with MIT license—directly applicable for fine-tuning compliance assistants or building RAG systems for privacy tools.

Latent Space · 1d ago · 8 · new model benchmark open source research

BioHub released ESMFold 2, a transformer-based protein structure prediction engine that achieves state-of-the-art performance on protein interactions and antibody design by scaling simple BERT-like models on diverse protein sequence data rather than using specialized architectures like AlphaFold3. The release includes an atlas of 6.8 billion predicted protein structures and demonstrates that inference-time scaling works across multiple biological targets, representing a significant shift toward general-purpose foundation models in structural biology.

r/MachineLearning · 1d ago · 7 · agent research workflow

A systems-focused writeup on building self-improving AI agent harnesses for benchmark tasks, exploring the challenge of safely compounding agent-proposed improvements and parallels to coding-agent customization patterns. The author shares both successful and failed approaches to implementing continuous self-improvement loops, offering practical insights for engineers building autonomous improvement systems.

r/MachineLearning · 1d ago · 8 · benchmark inference deployment research

NVIDIA's SOL-ExecBench revealed critical issues in AI-generated CUDA kernels that passed the benchmark verifier but misbehaved once deployed in production training loops. A detailed case study of a fused embedding-gradient + RMSNorm kernel demonstrates how bf16 accumulation bugs can cause training divergence that masquerades as a research failure, with practical debugging insights for transformer training implementations.
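
The general failure class is easy to reproduce in isolation. The sketch below is an illustration of bf16 accumulation in general, not the kernel from the case study: it emulates bfloat16 rounding with the standard library and sums many small, equal values twice, once in a wide accumulator and once rounding the running total to bf16 after every add. The `to_bf16` and `accumulate` helpers and the choice of 4096 values of about 0.01 are assumptions made for the demonstration.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias for the 16 dropped bits
    bits &= 0xFFFF0000                   # keep sign, exponent, and 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def accumulate(values, round_accumulator: bool) -> float:
    """Sum values, optionally rounding the running total to bf16 after each add."""
    total = 0.0
    for v in values:
        total += v
        if round_accumulator:
            total = to_bf16(total)
    return total


# Small, equal contributions, loosely like per-token gradient terms.
grads = [to_bf16(0.01)] * 4096

wide = accumulate(grads, round_accumulator=False)
narrow = accumulate(grads, round_accumulator=True)
print(f"wide accumulator: {wide:.4f}")
print(f"bf16 accumulator: {narrow:.4f}")
```

Once the running total grows large enough that half of its bf16 spacing exceeds the size of each addend, every further add rounds back to the same value and the sum stops growing. A verifier that only checks short reductions or loose tolerances can miss this, which is consistent with the kernel passing the benchmark and still diverging in training.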