r/MachineLearning · 1h ago · 7 · agent research workflow

A systems-focused writeup on building self-improving AI agent harnesses for benchmark tasks, exploring the challenge of safely compounding agent-proposed improvements and parallels to coding-agent customization patterns. The author shares both successful and failed approaches to implementing continuous self-improvement loops, offering practical insights for engineers building autonomous improvement systems.

r/MachineLearning · 5h ago · 8 · tool open source benchmark workflow

noisekit is an open-source tool that generates realistic degraded audio datasets from clean annotated speech data, enabling accurate STT vendor benchmarking under production conditions (phone noise, codecs, reverb). It fills a critical gap for voice agent builders by providing WER-measurable datasets that approximate real-world phone call audio rather than relying on clean studio recordings.
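
To make the "WER-measurable" claim concrete, here is a minimal sketch of word error rate: the word-level edit distance between a reference transcript and an STT hypothesis, divided by the number of reference words. The function name `word_error_rate` and the sample transcripts are illustrative assumptions and are not taken from noisekit.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1  # extra word in the hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: the STT output drops one of four reference words.
wer = word_error_rate("please hold the line", "please hold line")  # 0.25
```

Scoring the same annotated reference against transcripts of the clean and the degraded audio gives a per-vendor comparison under each condition.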

r/MachineLearning · 5h ago · 8 · research inference open source library

NeuroFlow is a training-free dynamic routing framework for Vision Transformers that achieves 55.8× wall-clock speedup on high-res video inference by eliminating redundant tokens via semantic surprise tracking in embedding space. The approach uses a dual-memory architecture with retinal gating and cortical caching to maintain 97%+ fidelity while achieving extreme sparsity (84% token reduction), with code and paper publicly available.

r/MachineLearning · 6h ago · 7 · research benchmark open source

Cross-species neuroscience study comparing learning rules (BP, FA, PC, STDP) across human fMRI and macaque electrophysiology (V1/V2/V4/IT), finding that early visual alignment is conserved but IT alignment scales with model capacity rather than learning rule. Includes careful controls for stimulus confounds and capacity baselines, with code and companion papers provided.

r/MachineLearning · 6h ago · 7 · tutorial workflow benchmark

Explores the measurement paradox in PyTorch training profiling where synchronization calls can distort performance results, presenting CUDA events as a lightweight alternative to capture timing without forcing synchronization in the hot path. Useful as a first-pass profiling technique before deeper operator-level analysis with PyTorch Profiler or Nsight.
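
A minimal sketch of the CUDA-event pattern described above, assuming PyTorch with a CUDA device. The helper name `time_steps_with_cuda_events` and the `step_fn` callback are hypothetical and not taken from the post. Events are recorded on the stream inside the loop, and the host synchronizes once after the loop instead of once per step.

```python
try:
    import torch
except ImportError:  # lets the sketch be imported on machines without PyTorch
    torch = None


def time_steps_with_cuda_events(step_fn, n_steps):
    """Return per-step GPU times in milliseconds, synchronizing only once at the end."""
    if torch is None or not torch.cuda.is_available():
        raise RuntimeError("this sketch needs PyTorch with a CUDA device")
    starts = [torch.cuda.Event(enable_timing=True) for _ in range(n_steps)]
    ends = [torch.cuda.Event(enable_timing=True) for _ in range(n_steps)]
    for start, end in zip(starts, ends):
        start.record()  # enqueued on the current stream; does not block the host
        step_fn()       # hypothetical training step (forward, backward, optimizer)
        end.record()
    torch.cuda.synchronize()  # single sync, outside the hot loop
    return [s.elapsed_time(e) for s, e in zip(starts, ends)]
```

Because `elapsed_time` is only read after the final synchronize, the measurement itself adds no per-step stalls.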

OpenAI Blog · 11h ago · 6 · agent workflow api update

Case study on building a self-improving tax agent using OpenAI's Codex for automating tax filings and improving accuracy through iterative refinement. Demonstrates practical application of code generation models to domain-specific workflow automation.

r/MachineLearning · 12h ago · 7 · open source inference deployment

Open-source 7MB autonomous driving model that learns visual navigation, lane following, and drift recovery for edge deployment on lightweight hardware. Demonstrates practical real-time inference optimization for complex perception tasks without cloud infrastructure, valuable for understanding model compression and embedded AI systems.

r/MachineLearning · 13h ago · 6 · research benchmark

A researcher shares a struggling GNN implementation for fraud detection on the IEEE-CIS dataset, reporting weak results (AUC 0.87, PR-AUC 0.52) across multiple architectures (GCN, GraphSAGE, GAT). This is practical ML engineering content with specific technical challenges but no novel insights; it is mainly useful as a debugging case study and a record of what did not work.

HuggingFace Blog · 18h ago · 8 · tutorial tool open source workflow inference deployment

Technical guide for building a fully local speech-to-speech pipeline (VAD → STT → LLM → TTS) with the Reachy Mini robot using open-source tools like llama.cpp, Parakeet, and Qwen3TTS. Demonstrates how to run conversational AI systems without cloud dependencies, with modular component swapping and customization for latency/quality tradeoffs.

HuggingFace Blog · 18h ago · 8 · tool workflow open source research inference

A practical open-source solution for efficient async RL training using sparse weight deltas instead of full model snapshots, reducing synchronization overhead from ~1TB to ~20GB per checkpoint. The approach leverages bf16 arithmetic properties where 98% of weights remain bit-equivalent between steps, enabling asynchronous weight distribution via shared storage (S3) without direct trainer-inference connectivity.
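
The sparse-delta idea can be sketched in a few lines, under the assumption that each bf16 weight is treated as a raw 16-bit pattern and only positions whose bits changed are shipped. The helpers `make_delta` and `apply_delta` and the toy values are hypothetical and do not reflect the project's actual format.

```python
def make_delta(old_bits, new_bits):
    """Return (index, new_value) pairs for positions whose 16-bit patterns differ."""
    if len(old_bits) != len(new_bits):
        raise ValueError("snapshots must have the same number of weights")
    return [(i, n) for i, (o, n) in enumerate(zip(old_bits, new_bits)) if o != n]


def apply_delta(old_bits, delta):
    """Rebuild the new snapshot from the old one plus the sparse delta."""
    new_bits = list(old_bits)
    for index, value in delta:
        new_bits[index] = value
    return new_bits


# Toy snapshot: 100 weights stored as raw bf16 bit patterns (hypothetical values).
old = [0x3F80] * 100  # 0x3F80 is the bf16 bit pattern for 1.0
new = list(old)
new[7] = 0x3F81       # one representable step above 1.0
new[42] = 0x3F7F      # one representable step below 1.0

delta = make_delta(old, new)
changed_fraction = len(delta) / len(old)  # 0.02, i.e. 98% of weights are bit-identical
```

With 2% of weights changed, shipping only the changed values is roughly consistent with the summary's drop from ~1TB to ~20GB, ignoring the extra cost of storing indices.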

r/MachineLearning · 1d ago · 7 · research library benchmark

EAMS presents an Equivariant Mesh Neural Network framework for robust anatomical mesh segmentation across medical imaging tasks (dental, liver, aneurysm), maintaining performance under geometric perturbations like patient pose variation where standard methods degrade by 25+ IoU points. The work combines intrinsic mesh descriptors with anatomy-aware PCA-derived priors in a lightweight (<2M parameter) architecture, demonstrating that equivariance principles from molecular modeling transfer effectively to 3D medical mesh tasks despite trade-offs in capturing subtle asymmetric features.

r/MachineLearning · 1d ago · 8 · tool open source rag

Tomesphere is a free research paper discovery platform indexing 3M arXiv/OpenAlex papers with AI-generated TLDRs, peer reviews, GitHub repos, HuggingFace models, and semantic similarity search using SPECTER2 embeddings in pgvector. The semantic graph approach enables discovery of topically related papers beyond citation networks, with a Chrome extension for arXiv integration and multiple ranking modes (influential, recent, hidden gems, nearest neighbors).
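
As a rough illustration of the "nearest neighbors" ranking mode, the sketch below ranks papers by cosine similarity between embedding vectors. It is plain Python over a toy dictionary: the paper IDs and 3-dimensional vectors are made up, and a real deployment would query a vector index such as pgvector rather than scan in memory.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_neighbors(query_id, embeddings, k=2):
    """Return the k paper IDs most similar to query_id, most similar first."""
    query = embeddings[query_id]
    scored = [
        (cosine_similarity(query, vec), paper_id)
        for paper_id, vec in embeddings.items()
        if paper_id != query_id
    ]
    scored.sort(reverse=True)
    return [paper_id for _, paper_id in scored[:k]]


# Hypothetical toy embeddings; real SPECTER2 vectors have far more dimensions.
toy_embeddings = {
    "paper_a": [1.0, 0.0, 0.0],
    "paper_b": [0.9, 0.1, 0.0],
    "paper_c": [0.0, 1.0, 0.0],
    "paper_d": [0.7, 0.7, 0.0],
}
neighbors = nearest_neighbors("paper_a", toy_embeddings, k=2)  # ["paper_b", "paper_d"]
```

Because the ranking depends only on embedding geometry, two papers can surface as neighbors even when neither cites the other.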

Simon Willison · 1d ago · 7 · agent security prompt engineering

Microsoft Copilot Cowork contained a critical security vulnerability where agentic systems could exfiltrate files through unapproved email messages with external image requests and pre-authenticated OneDrive links. This highlights a major design challenge in building safe autonomous agents: preventing prompt injection attacks from enabling data theft while maintaining agent autonomy.

r/MachineLearning · 1d ago · 7 · research agent inference

A technical essay critiques reasoning models' ability to perform faithful inference, arguing that jointly-generated reasoning traces and final answers lack genuine separation of concerns. The piece engages empirically with recent work (Lanham/Turpin/Mirzadeh) and compares architectural approaches (HRM, TRM, GRAM, AlphaProof, Kona/Aleph), offering conceptual framing around constraints vs. influence that's relevant for engineers building reasoning systems.

r/LocalLLaMA · 1d ago · 7 · new model library inference

MOSS-TTS-v1.5 expands multilingual text-to-speech capabilities to 31 languages with improved performance through FlashAttention 2 support and optimized dependencies. The update maintains backward compatibility with v1.0 while adding support for languages like Cantonese, Hindi, Thai, and Vietnamese, with straightforward installation and generation APIs.

r/MachineLearning · 1d ago · 9 · tool open source inference deployment

WAVE is a portable GPU kernel abstraction layer that compiles to a unified binary compatible with Metal, PTX, HIP, and SYCL across Apple, NVIDIA, and AMD hardware. This solves a critical pain point for AI engineers building cross-platform systems—write kernels once and deploy identically across diverse GPU architectures with verified PyTorch integration.

r/MachineLearning · 1d ago · 6 · workflow

A Reddit discussion asking for ML/AI community recommendations focused on deep technical work—papers, training dynamics, model debugging, and infrastructure challenges rather than LLM API projects. The post seeks spaces for sharing specific technical problems (e.g., anomalies in SSL training) and receiving substantive expert feedback.

r/LocalLLaMA · 1d ago · 7 · tutorial inference open source deployment

Practical guide covering multiple inference frameworks (Transformers, llama-cpp-python, vLLM, SGLang, Ollama, etc.) for running a 27B quantized Qwen model. Includes GGUF quantization options and benchmark comparisons showing minimal accuracy degradation, useful for engineers optimizing local model deployment.

r/LocalLLaMA · 1d ago · 6 · open source inference deployment fine tuning

Guide for using a fine-tuned Qwen 3.5-35B variant (with reduced content restrictions) across multiple inference frameworks including Transformers, vLLM, and SGLang, with MMLU benchmark results (83.72% accuracy) and multiple quantization options available. Practical for engineers looking to deploy modified open-source models with different inference backends.