Anthropic Research · 10h ago · 6 · agent research benchmark

A survey of 1,260 quantitative social scientists (Feb-Mar 2026) finds that 81% have adopted generative AI for research. Coding agents like Claude Code enable autonomous research execution, automating data analysis, interpretation, and iteration that were previously considered irreducibly human tasks. The research explores disparities in tool access, differences in output quality, and potential impacts on the scholarly record, with randomized experiments planned to measure productivity effects.

r/MachineLearning · 12h ago · 7 · training research workflow

A researcher training GPT-like Transformer-decoder models (100M-500M parameters) on 750M tokens is hitting a common failure mode in which the model gets stuck generating a single token repeatedly, which points to a training-dynamics issue. The post lists detailed hyperparameters (AdamW optimizer, 1e-3 learning rate, 4M-token batch size) and asks whether decoder-only training requires specific tricks or has undocumented failure modes.

r/MachineLearning · 13h ago · 7 · research open source tool

This project applies diffusion models to sketch-guided trajectory simulation in basketball, enabling controllable generation of player movements conditioned on partial instructions. The approach uses joint refinement of all trajectories through diffusion rather than autoregressive methods, with open-sourced code and models demonstrating a practical application of conditional generation for sports analytics.

r/LocalLLaMA · 14h ago · 9 · api update deployment agent

CVE-2026-48710 (BadHost) is a critical vulnerability in Starlette that affects FastAPI, vLLM, LiteLLM, and MCP servers—allowing HTTP Host header injection to bypass authentication. AI engineers building agents and services must immediately upgrade Starlette to version 1.0.1+ and audit any systems using these frameworks, as credentials stored in MCP servers are particularly at risk.
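
A quick way to start the audit described above is to compare the installed Starlette version against 1.0.1. The sketch below uses only the standard library. The helper names (`parse_release`, `is_patched`, `installed_starlette_status`) are illustrative, and the 1.0.1 threshold is taken from this summary rather than from an independently checked advisory.

```python
from importlib import metadata

# First fixed release according to the summary above (an assumption here).
PATCHED = (1, 0, 1)


def parse_release(version: str) -> tuple:
    """Turn '1.0.1' into (1, 0, 1), ignoring non-numeric suffixes such as 'rc1'."""
    parts = []
    for piece in version.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_patched(version: str) -> bool:
    """True if the version is at or above the assumed fixed release.

    Caveat: a pre-release such as '1.0.1rc1' is treated like '1.0.1'.
    """
    return parse_release(version) >= PATCHED


def installed_starlette_status() -> str:
    """Describe the Starlette version found in the current environment."""
    try:
        found = metadata.version("starlette")
    except metadata.PackageNotFoundError:
        return "starlette is not installed in this environment"
    state = "patched" if is_patched(found) else "needs upgrade"
    return f"starlette {found}: {state}"


if __name__ == "__main__":
    print(installed_starlette_status())
```

Running this in each virtual environment or container image that ships FastAPI, vLLM, LiteLLM, or an MCP server gives a first-pass inventory. It does not replace reading the upstream advisory.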

Simon Willison · 16h ago · 6 · agent workflow

SQLite has published AGENTS.md documenting their policy on AI agents interacting with their codebase—they reject agentic code contributions but accept high-quality AI-generated bug reports with reproducible test cases. This reflects practical workflow considerations for engineers using AI agents in development, including how open-source projects are adapting policies around AI-generated contributions.

r/MachineLearning · 17h ago · 7 · open source benchmark agent research

The open-source Context Swarm Memory (CSM) system was benchmarked against Hindsight on BEAM 100K, scoring 0.757573 AMB against Hindsight's 0.733658 while using 38.2% fewer context tokens, at the cost of 4.5x slower retrieval. The author is seeking methodology feedback before pursuing official leaderboard validation.
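
The trade-off reads more clearly in relative terms. The snippet below does that arithmetic using only the figures quoted above; the variable names are illustrative.

```python
csm_score = 0.757573        # CSM on BEAM 100K, as quoted
hindsight_score = 0.733658  # Hindsight on BEAM 100K, as quoted

absolute_gain = csm_score - hindsight_score
relative_gain = absolute_gain / hindsight_score

# 38.2% fewer context tokens means CSM uses this fraction of the baseline.
token_fraction = 1.0 - 0.382
# 4.5x slower retrieval, as quoted.
latency_factor = 4.5

print(f"absolute score gain: {absolute_gain:.6f}")
print(f"relative score gain: {relative_gain:.2%}")
print(f"context tokens vs baseline: {token_fraction:.1%}")
print(f"retrieval latency vs baseline: {latency_factor:.1f}x")
```

The score gain is a few percent relative, so whether the token savings justify the retrieval slowdown depends on which resource is the bottleneck in a given deployment.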

r/MachineLearning · 18h ago · 8 · inference open source library benchmark

TritonMoE is a portable MoE inference kernel written in Triton that achieves 89-131% of Megablocks throughput while running unchanged on both NVIDIA and AMD GPUs. The key optimization uses fused gate+up GEMM operations to reduce global memory traffic by 35%, though performance degrades at very long sequences (2048+ tokens) and under extreme routing skew.
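
The idea behind a fused gate+up GEMM can be shown without any GPU code. The sketch below is not TritonMoE's kernel: it is a pure-Python illustration that assumes a SwiGLU-style expert MLP (the summary does not name the activation) and uses hypothetical helpers `matmul`, `mlp_unfused`, and `mlp_fused`. It shows that concatenating the gate and up weight matrices lets the input be read once instead of twice.

```python
import math


def matmul(a, b):
    """Plain row-by-column matrix product on nested lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def silu(v):
    """SiLU activation, assumed here as the gate nonlinearity."""
    return v / (1.0 + math.exp(-v))


def mlp_unfused(x, w_gate, w_up):
    gate = matmul(x, w_gate)  # first pass over x
    up = matmul(x, w_up)      # second pass over x
    return [[silu(g) * u for g, u in zip(g_row, u_row)]
            for g_row, u_row in zip(gate, up)]


def mlp_fused(x, w_gate, w_up):
    hidden = len(w_gate[0])
    # Concatenate along the output dimension: [d, h] + [d, h] -> [d, 2h].
    # A real kernel would store the weights pre-concatenated.
    w_cat = [g_row + u_row for g_row, u_row in zip(w_gate, w_up)]
    both = matmul(x, w_cat)   # single pass over x
    return [[silu(g) * u for g, u in zip(row[:hidden], row[hidden:])]
            for row in both]


x = [[1.0, -2.0, 0.5], [0.25, 3.0, -1.0]]
w_gate = [[0.1, 0.2], [0.3, -0.4], [0.5, 0.6]]
w_up = [[-0.7, 0.8], [0.9, 1.0], [1.1, -1.2]]

assert mlp_fused(x, w_gate, w_up) == mlp_unfused(x, w_gate, w_up)
```

On a GPU the saving comes from loading each tile of `x` from global memory once and keeping the intermediate gate and up values on chip; the arithmetic itself is unchanged.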

r/MachineLearning · 18h ago · 7 · open source fine tuning rag dataset

Open-source UK GDPR compliance QA dataset (1K pairs) with SME-focused questions, detailed answers linked to specific articles/ICO guidance, and generation metadata. Generated via Qwen 14B + DeepSeek API, released in JSON/Parquet with MIT license—directly applicable for fine-tuning compliance assistants or building RAG systems for privacy tools.

Latent Space · 22h ago · 8 · new model benchmark open source research

BioHub released ESMFold 2, a transformer-based protein structure prediction engine that achieves state-of-the-art performance on protein interactions and antibody design by scaling simple BERT-like models on diverse protein sequence data rather than using specialized architectures like AlphaFold3. The release includes an atlas of 6.8 billion predicted protein structures and demonstrates that inference-time scaling works across multiple biological targets, representing a significant shift toward general-purpose foundation models in structural biology.

r/MachineLearning · 22h ago · 7 · agent research workflow

A systems-focused writeup on building self-improving AI agent harnesses for benchmark tasks, exploring the challenge of safely compounding agent-proposed improvements and parallels to coding-agent customization patterns. The author shares both successful and failed approaches to implementing continuous self-improvement loops, offering practical insights for engineers building autonomous improvement systems.

r/MachineLearning · 23h ago · 8 · benchmark inference deployment research

NVIDIA's SOL-ExecBench revealed critical issues in AI-generated CUDA kernels that pass the benchmark verifier yet misbehave when deployed in production training loops. A detailed case study of a fused embedding-gradient + RMSNorm kernel shows how bf16 accumulation bugs can cause training divergence that masquerades as a research failure, and offers practical debugging insights for transformer training implementations.
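
Why bf16 accumulation is dangerous can be reproduced in a few lines. The sketch below is a generic illustration, not the kernel from the case study: it emulates bfloat16 rounding in pure Python (the helper `to_bf16` is hypothetical) and sums the same values once with a bf16 accumulator and once with a wider one.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a finite value to bfloat16 precision (round-to-nearest-even).

    bfloat16 keeps the top 16 bits of an IEEE float32, so this works on the
    float32 bit pattern. NaN and infinity are not handled in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def accumulate_bf16(values):
    """Sum with a bf16 accumulator: every partial sum is rounded to bf16."""
    total = 0.0
    for v in values:
        total = to_bf16(total + to_bf16(v))
    return total


def accumulate_wide(values):
    """Sum the same bf16 inputs, but keep the running total in higher precision."""
    total = 0.0
    for v in values:
        total += to_bf16(v)
    return total


grads = [0.01] * 4096
narrow = accumulate_bf16(grads)
wide = accumulate_wide(grads)
print(f"bf16 accumulator: {narrow}")
print(f"wide accumulator: {wide}")
```

Once the running total is large enough that each increment is below half a bf16 unit in the last place, the narrow accumulator stops growing, so it ends far below the wide one. A verifier that only checks small inputs or loose tolerances can miss this, which is consistent with the failure pattern the summary describes.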

r/MachineLearning · 1d ago · 8 · tool open source benchmark workflow

noisekit is an open-source tool that generates realistic degraded audio datasets from clean annotated speech data, enabling accurate STT vendor benchmarking under production conditions (phone noise, codecs, reverb). It fills a critical gap for voice agent builders by providing WER-measurable datasets that approximate real-world phone call audio rather than relying on clean studio recordings.
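
For readers unfamiliar with the metric, word error rate (WER) is conventionally the word-level edit distance between a reference transcript and the STT output, divided by the number of reference words. The sketch below implements that standard definition; it is not noisekit's own scoring code, and the function name `wer` and the whitespace tokenization are assumptions.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


clean = wer("the cat sat on the mat", "the cat sat on the mat")
noisy = wer("the cat sat on the mat", "the cat sit on mat")
print(f"clean WER: {clean:.3f}, degraded WER: {noisy:.3f}")
```

Scoring the same vendor on clean and degraded versions of the same utterances, as in the two calls above, is what makes the gap attributable to audio conditions rather than to the text content.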

r/MachineLearning · 1d ago · 8 · research inference open source library

NeuroFlow is a training-free dynamic routing framework for Vision Transformers that achieves 55.8× wall-clock speedup on high-res video inference by eliminating redundant tokens via semantic surprise tracking in embedding space. The approach uses a dual-memory architecture with retinal gating and cortical caching to maintain 97%+ fidelity while achieving extreme sparsity (84% token reduction), with code and paper publicly available.

r/MachineLearning · 1d ago · 7 · research benchmark open source

Cross-species neuroscience study comparing learning rules (backpropagation, feedback alignment, predictive coding, STDP) across human fMRI and macaque electrophysiology (V1/V2/V4/IT). It finds that early visual alignment is conserved, whereas IT alignment scales with model capacity rather than learning rule. The study includes careful controls for stimulus confounds and capacity baselines, with code and companion papers provided.

r/MachineLearning · 1d ago · 7 · tutorial workflow benchmark

Explores the measurement paradox in PyTorch training profiling where synchronization calls can distort performance results, presenting CUDA events as a lightweight alternative to capture timing without forcing synchronization in the hot path. Useful as a first-pass profiling technique before deeper operator-level analysis with PyTorch Profiler or Nsight.
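
The CUDA-event pattern described above can be sketched as follows. This is a minimal illustration, not the post's code: `time_steps` and `step_fn` are hypothetical names, and the `torch` import is made optional with a CPU fallback so the sketch also runs on a machine without PyTorch or a GPU.

```python
import time

try:
    import torch  # optional: the CUDA path needs a CUDA-enabled PyTorch build
except ImportError:
    torch = None


def time_steps(step_fn, n_steps):
    """Return per-step durations in milliseconds."""
    use_cuda = torch is not None and torch.cuda.is_available()

    if not use_cuda:
        # CPU fallback so the sketch still runs without a GPU.
        durations = []
        for _ in range(n_steps):
            t0 = time.perf_counter()
            step_fn()
            durations.append((time.perf_counter() - t0) * 1000.0)
        return durations

    starts = [torch.cuda.Event(enable_timing=True) for _ in range(n_steps)]
    ends = [torch.cuda.Event(enable_timing=True) for _ in range(n_steps)]
    for start, end in zip(starts, ends):
        start.record()  # enqueued on the current stream; the host does not wait
        step_fn()
        end.record()

    # A single synchronization after the loop, outside the hot path.
    torch.cuda.synchronize()
    return [s.elapsed_time(e) for s, e in zip(starts, ends)]


if __name__ == "__main__":
    print(time_steps(lambda: sum(range(10_000)), 3))
```

Because the events are recorded on the stream and read back only after one final `torch.cuda.synchronize()`, the timed loop keeps its normal asynchronous behavior. Calling `synchronize()` around every step would instead serialize host and device and inflate the numbers being measured.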

OpenAI Blog · 1d ago · 6 · agent workflow api update

Case study on building a self-improving tax agent using OpenAI's Codex for automating tax filings and improving accuracy through iterative refinement. Demonstrates practical application of code generation models to domain-specific workflow automation.

r/MachineLearning · 1d ago · 7 · open source inference deployment

Open-source 7MB autonomous driving model that learns visual navigation, lane following, and drift recovery for edge deployment on lightweight hardware. Demonstrates practical real-time inference optimization for complex perception tasks without cloud infrastructure, valuable for understanding model compression and embedded AI systems.