r/MachineLearning · 1d ago · 8 · tool library open source dataset benchmark

MONET is a new Apache 2.0 open-source image-text dataset with 104.9M high-quality samples curated from 2.9B images, accompanied by visualization tools, a retrieval system, and a T2I training codebase. This is a significant resource for engineers building multimodal AI systems, offering both the dataset and practical tooling for training text-to-image models.

r/LocalLLaMA · 1d ago · 7 · new model tool inference api update

PaddleOCR-VL-1.6 is an upgraded document parsing model that achieves a state-of-the-art 96.33% on OmniDocBench, with improved table, formula, and chart recognition and zero-cost migration from v1.5. The release includes CLI and Python API usage patterns, vLLM integration support, and transformers library compatibility for document understanding tasks.

OpenAI Blog · 1d ago · 6 · agent workflow api update

Endava's case study demonstrates using OpenAI's Codex to automate requirements analysis and accelerate software delivery through agentic workflows. The approach reduces traditionally weeks-long processes to hours, showing practical application of code generation models in enterprise software development.

r/LocalLLaMA · 1d ago · 7 · new model tool tutorial benchmark

Q-Judger is a fine-tuned vision-language model for automated evaluation of text-to-image generation quality, built on Qwen2.7-27B with structured JSON output. The article provides practical setup instructions across multiple inference frameworks (Transformers, vLLM, SGLang, Docker) and demonstrates hierarchical evaluation criteria validated against human expert rankings.

Anthropic Research · 1d ago · 6 · agent research benchmark

A survey of 1,260 quantitative social scientists (Feb-Mar 2026) finds that 81% have adopted generative AI for research. Coding agents like Claude Code enable autonomous research execution, automating data analysis, interpretation, and iteration that were previously considered irreducibly human tasks. The research explores disparities in tool access, output quality differences, and potential impacts on the scholarly record, with planned randomized experiments to measure productivity effects.

r/MachineLearning · 1d ago · 7 · training research workflow

A researcher training GPT-like Transformer-decoder models (100M-500M parameters) on 750M tokens reports a common failure mode in which the model gets stuck repeating a single token, suggesting a training dynamics issue. The post lists detailed hyperparameters (AdamW optimizer, 1e-3 learning rate, 4M-token batch size) and asks whether decoder-only training requires specific tricks or has undocumented failure modes.
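A quick back-of-the-envelope check on the numbers in this post, as a sketch only. It assumes that the "4M token batch size" means tokens per optimizer step and that the 750M tokens are seen in a single pass; neither is confirmed by the summary, and the function names are illustrative.

```python
# Back-of-the-envelope arithmetic for the reported training setup.
# Assumptions (not confirmed by the post): the batch size is 4M tokens per
# optimizer step, and the 750M-token corpus is seen once (one epoch).

TOTAL_TOKENS = 750_000_000
BATCH_TOKENS = 4_000_000


def optimizer_steps_per_epoch(total_tokens: int, batch_tokens: int) -> float:
    """Number of optimizer updates in one pass over the data."""
    return total_tokens / batch_tokens


def tokens_per_parameter(total_tokens: int, n_params: int) -> float:
    """Training tokens available per model parameter."""
    return total_tokens / n_params


steps = optimizer_steps_per_epoch(TOTAL_TOKENS, BATCH_TOKENS)
ratio_small = tokens_per_parameter(TOTAL_TOKENS, 100_000_000)
ratio_large = tokens_per_parameter(TOTAL_TOKENS, 500_000_000)

print(f"optimizer steps per epoch: {steps}")       # 187.5
print(f"tokens/param at 100M params: {ratio_small}")  # 7.5
print(f"tokens/param at 500M params: {ratio_large}")  # 1.5
```

Under those assumptions the model receives fewer than 200 updates per epoch, which may be worth ruling out as a factor before looking for decoder-specific tricks.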

r/MachineLearning · 1d ago · 7 · research open source tool

This project applies diffusion models to sketch-guided trajectory simulation in basketball, enabling controllable generation of player movements conditioned on partial instructions. The approach uses joint refinement of all trajectories through diffusion rather than autoregressive methods, with open-sourced code and models demonstrating a practical application of conditional generation for sports analytics.

r/LocalLLaMA · 1d ago · 6 · tool benchmark fine tuning inference

This article covers a merged 31B parameter model (Gemma-4-Harmonia) with practical integration guides for Transformers, vLLM, and SGLang, along with MMLU benchmark results showing 84.55% accuracy. While the technical implementation details on model merging and quantization are useful, the content is heavily focused on a niche fine-tuned variant rather than addressing core workflow or breakthrough capabilities.

r/LocalLLaMA · 1d ago · 9 · api update deployment agent

CVE-2026-48710 (BadHost) is a critical vulnerability in Starlette that affects FastAPI, vLLM, LiteLLM, and MCP servers: HTTP Host header injection can be used to bypass authentication. AI engineers building agents and services must immediately upgrade Starlette to version 1.0.1+ and audit any systems using these frameworks, as credentials stored in MCP servers are particularly at risk.
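For auditing, a minimal sketch of a version check against the 1.0.1 threshold named above. The helper names are hypothetical, the parser only looks at the leading numeric release components, and it is not a substitute for the official advisory.

```python
import re
from importlib import metadata

# Threshold taken from the summary above ("version 1.0.1+").
MIN_PATCHED = (1, 0, 1)


def parse_release(version: str) -> tuple:
    """Extract leading numeric components, e.g. '1.0.1' -> (1, 0, 1).

    Simplification: suffixes such as 'rc1' or '.post1' are ignored, so a
    pre-release of the patched version is treated as patched.
    """
    match = re.match(r"\d+(?:\.\d+)*", version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    parts = tuple(int(p) for p in match.group(0).split("."))
    # Pad short versions such as '1.2' to three components.
    return parts + (0,) * max(0, 3 - len(parts))


def is_patched(version: str, minimum: tuple = MIN_PATCHED) -> bool:
    """True if the version is at or above the patched release."""
    return parse_release(version) >= minimum


def installed_starlette_version():
    """Return the installed Starlette version string, or None if absent."""
    try:
        return metadata.version("starlette")
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_starlette_version()
    if found is None:
        print("starlette is not installed in this environment")
    else:
        status = "patched" if is_patched(found) else "NEEDS UPGRADE"
        print(f"starlette {found}: {status}")
```

Run it inside each virtual environment or container image that serves traffic, since FastAPI, vLLM, and LiteLLM deployments may each pin their own Starlette version.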

Simon Willison · 1d ago · 6 · agent workflow

SQLite has published AGENTS.md, documenting its policy on AI agents interacting with the codebase: the project rejects agentic code contributions but accepts high-quality AI-generated bug reports with reproducible test cases. This reflects practical workflow considerations for engineers using AI agents in development, including how open-source projects are adapting policies around AI-generated contributions.

r/MachineLearning · 2d ago · 7 · open source benchmark agent research

An open-source Context Swarm Memory (CSM) system was benchmarked against Hindsight on BEAM 100K, scoring 0.757573 AMB versus 0.733658 while using 38.2% fewer context tokens, at the cost of 4.5x slower retrieval. The author is seeking methodology feedback before pursuing official leaderboard validation.

r/MachineLearning · 2d ago · 8 · inference open source library benchmark

TritonMoE is a portable MoE inference kernel written in Triton that achieves 89-131% of Megablocks throughput while running unchanged on both NVIDIA and AMD GPUs. The key optimization uses fused gate+up GEMM operations to reduce global memory traffic by 35%, though performance degrades at very long sequences (2048+ tokens) and under extreme routing skew.
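To illustrate what fusing the gate and up projections means, here is a pure-Python sketch (not Triton, and not TritonMoE's kernel). It assumes a SwiGLU-style expert FFN with a SiLU gate, which the summary does not state, and simply shows that one GEMM against concatenated weights produces the same result as two separate GEMMs while reading the input once.

```python
import math


def matmul(a, b):
    """Naive matrix multiply for lists of lists: (m x k) @ (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def silu(v: float) -> float:
    """SiLU activation, assumed here as the gate nonlinearity."""
    return v / (1.0 + math.exp(-v))


def ffn_unfused(x, w_gate, w_up):
    """Two separate GEMMs: the input x is read from memory twice."""
    gate = matmul(x, w_gate)
    up = matmul(x, w_up)
    return [[silu(g) * u for g, u in zip(g_row, u_row)]
            for g_row, u_row in zip(gate, up)]


def ffn_fused(x, w_gate, w_up):
    """One GEMM against [w_gate | w_up], then split the output columns."""
    d_ff = len(w_gate[0])
    # Concatenate along the output dimension: shape (d_model, 2 * d_ff).
    w_fused = [g_row + u_row for g_row, u_row in zip(w_gate, w_up)]
    fused = matmul(x, w_fused)
    return [[silu(g) * u for g, u in zip(row[:d_ff], row[d_ff:])]
            for row in fused]


# Tiny example: 2 tokens, d_model = 3, d_ff = 2.
x = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]]
w_gate = [[0.1, 0.2], [0.3, -0.4], [0.5, 0.6]]
w_up = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.5]]

out_unfused = ffn_unfused(x, w_gate, w_up)
out_fused = ffn_fused(x, w_gate, w_up)
```

In a real kernel the benefit comes from loading each input tile once and applying the activation before the intermediate is written back; this sketch only demonstrates the numerical equivalence.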

r/MachineLearning · 2d ago · 7 · open source fine tuning rag dataset

This open-source UK GDPR compliance QA dataset (1K pairs) combines SME-focused questions with detailed answers linked to specific articles and ICO guidance, plus generation metadata. It was generated with Qwen 14B and the DeepSeek API and is released in JSON and Parquet under the MIT license, making it directly applicable to fine-tuning compliance assistants or building RAG systems for privacy tools.

Latent Space · 2d ago · 8 · new model benchmark open source research

BioHub released ESMFold 2, a transformer-based protein structure prediction engine that achieves state-of-the-art performance on protein interactions and antibody design by scaling simple BERT-like models on diverse protein sequence data rather than using specialized architectures like AlphaFold3. The release includes an atlas of 6.8 billion predicted protein structures and demonstrates that inference-time scaling works across multiple biological targets, representing a significant shift toward general-purpose foundation models in structural biology.

r/MachineLearning · 2d ago · 7 · agent research workflow

A systems-focused writeup on building self-improving AI agent harnesses for benchmark tasks, exploring the challenge of safely compounding agent-proposed improvements and parallels to coding-agent customization patterns. The author shares both successful and failed approaches to implementing continuous self-improvement loops, offering practical insights for engineers building autonomous improvement systems.

r/MachineLearning · 2d ago · 8 · benchmark inference deployment research

NVIDIA's SOL-ExecBench revealed critical issues in AI-generated CUDA kernels that surfaced only when the kernels were deployed in production training loops, even though they had passed the benchmark verifier. A detailed case study of a fused embedding-gradient + RMSNorm kernel demonstrates how bf16 accumulation bugs can cause training divergence that masquerades as research failures, with practical debugging insights for transformer training implementations.
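Why bf16 accumulation is risky can be shown with a small CPU-side simulation. This is a sketch, not the kernel from the case study: it emulates bf16 rounding with the standard library (finite inputs only) and compares a running sum kept in bf16 against one kept in fp32 and cast at the end.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a finite value to the nearest bfloat16 (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bf16 keeps the top 16 bits of the float32 pattern. Adding this bias
    # before masking turns truncation into round-to-nearest-even.
    bias = 0x7FFF + ((bits >> 16) & 1)
    rounded = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", rounded))[0]


def to_fp32(x: float) -> float:
    """Round a Python float (double) to float32 precision."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def sum_with_bf16_accumulator(values) -> float:
    """Buggy pattern: the running sum is rounded to bf16 at every step."""
    acc = 0.0
    for v in values:
        acc = to_bf16(acc + to_bf16(v))
    return acc


def sum_with_fp32_accumulator(values) -> float:
    """Safer pattern: accumulate in fp32, cast to bf16 once at the end."""
    acc = 0.0
    for v in values:
        acc = to_fp32(acc + v)
    return to_bf16(acc)


ones = [1.0] * 1000
stalled = sum_with_bf16_accumulator(ones)    # 256.0
reference = sum_with_fp32_accumulator(ones)  # 1000.0
print(stalled, reference)
```

With only 8 significant bits, the bf16 accumulator stops growing at 256 because 256 + 1 rounds back to 256. A verifier that compares outputs on short reductions would not see this, which is consistent with the kind of issue the case study describes.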