HuggingFace Blog · 4d ago · 8 · tutorial fine tuning library rag

Practical tutorial on fine-tuning Qwen3-VL-Embedding-2B for Visual Document Retrieval (VDR) tasks using Sentence Transformers, demonstrating significant gains from domain-specific adaptation (NDCG@10 of 0.947 vs a 0.888 baseline). Covers the multimodal training pipeline, dataset construction, and implementation details for engineers building with vision-language models.
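
The NDCG@10 numbers quoted above can be computed from any ranked retrieval result; a minimal sketch of the metric itself, with standard log2 discounting and graded relevance (this is the textbook formula, not the tutorial's evaluation harness):

```python
import math

def dcg(relevances):
    # Discounted cumulative gain: relevance at rank i discounted by log2(i + 1),
    # using 1-indexed ranks, so the divisor is log2(i + 2) over a 0-indexed list.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def ndcg_at_k(ranked_relevances, k=10):
    # Normalize the DCG of the actual ranking by the DCG of the ideal
    # (relevance-sorted) ranking, truncated to the top k results.
    top_k = ranked_relevances[:k]
    ideal = sorted(ranked_relevances, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(top_k) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A perfect ranking scores 1.0; pushing the relevant document down the list lowers the score, which is what the 0.947 vs 0.888 comparison is measuring in aggregate.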

HuggingFace Blog · 4d ago · 6 · tool open source workflow

This article discusses a skill/test harness for porting language models to mlx-lm and provides commentary on the challenges of open-source maintenance in an era of AI code agents. While the tool itself (porting skill for mlx-lm) is technically relevant, the bulk of the piece focuses on broader open-source governance challenges rather than actionable technical content for daily AI builders.

Simon Willison · 4d ago · 5 · tool api update

Datasette's latest alpha introduces modern CSRF protection using browser headers instead of Django-style CSRF tokens, and adds a RenameTableEvent so plugins can stay compatible when tables are renamed. While technically sound engineering, this is primarily a database tooling update with limited direct relevance to AI/ML workflows.
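
Header-based CSRF protection typically leans on the browser's fetch-metadata headers rather than per-form tokens; a toy sketch of the pattern using `Sec-Fetch-Site` (illustrative only, not Datasette's actual implementation):

```python
# Fetch-metadata CSRF screening: modern browsers attach Sec-Fetch-Site to
# every request, so a server can reject cross-site state-changing requests
# without issuing and validating per-form tokens.
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}
ALLOWED_SITES = {"same-origin", "none"}  # "none" = direct user navigation

def is_request_allowed(method: str, headers: dict) -> bool:
    if method.upper() in SAFE_METHODS:
        return True  # reads are not CSRF targets
    site = headers.get("Sec-Fetch-Site")
    if site is None:
        # Header absent: very old browser or a non-browser client; a real
        # deployment has to pick an explicit fallback policy here.
        return False
    return site in ALLOWED_SITES
```

The appeal over token schemes is statelessness: nothing has to be minted, stored, or round-tripped through forms.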

r/MachineLearning · 4d ago · 6 · research benchmark

A researcher shares reproducibility issues encountered while validating claims from 7 papers this year: 4 results could not be reproduced, and 2 of those have unresolved GitHub issues. This highlights systemic problems in ML research quality and code availability that directly affect engineers evaluating and building on published work.

r/MachineLearning · 4d ago · 7 · research open source rag

Independent researcher presents a dual-output framework addressing a specific LLM failure mode: distinguishing familiar data from novel noise through a continuous familiarity score μ(x) derived from set-theoretic axioms. The work includes documented iterations addressing saturation bugs in high-dimensional spaces, PAC-Bayes convergence proofs, and testing on a 17k-topic knowledge base system, with technical reports and code available on GitHub.
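
The post derives its familiarity score μ(x) axiomatically, which is not reproduced here; as a loose illustration of the general idea (a continuous score that is high for inputs resembling known data and decays for novel ones), one can squash the distance to the nearest known embedding into [0, 1]. All names and the exponential form below are assumptions for illustration:

```python
import math

def familiarity(x, known, scale=1.0):
    # Toy familiarity score in [0, 1]: 1.0 for points identical to
    # something already seen, decaying toward 0 as x moves away from
    # every known point. (Illustrative only; the post's mu(x) is
    # derived from set-theoretic axioms, not this formula.)
    nearest = min(math.dist(x, k) for k in known)
    return math.exp(-nearest / scale)
```

A downstream system can then gate behavior on this score, e.g. flagging low-familiarity inputs as novel noise rather than answering confidently.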

Simon Willison · 4d ago · 8 · new model api update tool

Google released Gemini 3.1 Flash TTS, a new text-to-speech model accessible via the standard Gemini API that supports prompt-based direction for audio generation, including accent and tone control. The model is demonstrated with practical examples showing how detailed prompts can generate contextual speech variations (e.g., regional accents), making it useful for developers building voice-enabled applications.

DeepMind Blog · 4d ago · 8 · new model api update inference

Gemini 3.1 Flash TTS, Google's latest text-to-speech model, introduces granular audio tags for precise vocal control across 70+ languages with improved naturalness (Elo score 1,211 on benchmarks). Developers can now embed natural language commands directly in text to control style, pacing, and delivery, with all audio watermarked using SynthID, available in Google AI Studio, Vertex AI, and Google Vids.

r/MachineLearning · 4d ago · 7 · research prompt engineering agent

Technical analysis documenting five social engineering attacks against GPT-4, GPT-4o, and Claude 3.5 Sonnet, demonstrating alignment failures through psychological manipulation vectors (guilt, peer pressure, identity destabilization, etc.). The writeup argues these vulnerabilities stem from training data rather than mathematical exploits, reframing jailbreak research from software vulnerability to inherited social failure modes.

HuggingFace Blog · 4d ago · 7 · benchmark agent tool research

VAKRA is a new executable benchmark for evaluating AI agents on compositional reasoning across APIs and documents in enterprise-like environments, featuring 8,000+ locally-hosted APIs across 62 domains with real databases. It measures multi-step workflows (3-7 reasoning chains) and reveals significant performance gaps in current models, with detailed failure mode analysis included.

OpenAI Blog · 4d ago · 8 · tool agent api update deployment

OpenAI's Agents SDK now includes native sandbox execution and model-native harness features, enabling developers to build more secure and reliable long-running agents with safe file and tool access. This is a practical SDK update that directly impacts how software engineers implement agent-based workflows in production.

HuggingFace Blog · 4d ago · 7 · agent tool deployment

Holo3, a computer-use AI model, is now accessible via HoloTab, a Chrome extension that automates web tasks through natural language commands and visual demonstration-based routine recording. The extension enables agentic automation for repetitive workflows across any website without requiring technical setup, representing a practical application of vision models and action planning for browser-based task automation.

r/MachineLearning · 4d ago · 7 · fine tuning benchmark inference workflow

An engineer implemented GRPO (reinforcement learning) fine-tuning for summarization on a 3-node MLX cluster, combining length penalties with quality rewards (ROUGE-L) to reach average rollouts of ~64 tokens. The work demonstrates practical techniques for controlling output length while maintaining quality, using multi-axis LLM-as-a-Judge evaluation (faithfulness, coverage, conciseness, clarity); next steps focus on isolating the reward function's impact and detecting reward gaming.
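
The combined reward described above reduces to a scalar that GRPO can optimize group-relatively across rollouts; a toy sketch of one way to mix a quality score with a length penalty (the post uses ROUGE-L for quality; the weights, target, and tolerance below are assumptions, not the author's values):

```python
def length_penalty(n_tokens, target=64, tolerance=16):
    # Penalize rollouts that stray from the target length: full credit
    # inside the tolerance band, then a linear falloff clamped to [0, 1].
    overshoot = abs(n_tokens - target)
    return max(0.0, 1.0 - max(0, overshoot - tolerance) / target)

def combined_reward(quality, n_tokens, w_quality=0.7, w_length=0.3):
    # Weighted mix of a quality score in [0, 1] (e.g. ROUGE-L against a
    # reference) and the length penalty; GRPO then computes advantages
    # of this scalar relative to the other rollouts in the group.
    return w_quality * quality + w_length * length_penalty(n_tokens)
```

Reward gaming shows up exactly here: a policy can learn to hit the length band with low-quality text, which is why the post pairs the scalar reward with multi-axis judge evaluation.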

r/MachineLearning · 4d ago · 7 · benchmark research

Critical discussion of a research paper's evaluation methodology for SQL code generation in LLMs: the authors found that using natural language similarity metrics instead of execution-based metrics produces ~20% false positives, raising concerns about the paper's validity and about peer review standards at top-tier venues.
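
The failure mode is easy to demonstrate: two queries can differ textually yet be semantically identical, so a string-similarity metric scores them as different while execution agrees. A minimal sqlite3 sketch of execution-based comparison (illustrative; not the paper's evaluation code):

```python
import sqlite3

def execution_match(sql_a: str, sql_b: str, setup: str) -> bool:
    # Compare two queries by running both against the same in-memory
    # database and comparing result sets (order-insensitive here).
    conn = sqlite3.connect(":memory:")
    conn.executescript(setup)
    rows_a = sorted(conn.execute(sql_a).fetchall())
    rows_b = sorted(conn.execute(sql_b).fetchall())
    conn.close()
    return rows_a == rows_b

setup = "CREATE TABLE t(a INT); INSERT INTO t VALUES (1), (2), (3);"
gold = "SELECT a FROM t WHERE a > 1"
pred = "SELECT a FROM t WHERE NOT a <= 1"  # textually different, same rows
```

Any metric that compares `gold` and `pred` as strings penalizes a correct prediction; the reverse error (similar text, different semantics) produces the false positives the discussion is about.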

r/MachineLearning · 5d ago · 7 · fine tuning open source tool research

Fine-tuned open-source TTS model (Chatterbox) for 8 Indian languages using LoRA adapters (1.4% parameters) and grapheme-level tokenization with Brahmic script warm-start initialization. Achieves sub-0.25 CER for most languages except Malayalam (0.86), demonstrating efficient multilingual adaptation without full model retraining or language-specific G2P pipelines.
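
The CER figures quoted above are character error rates: character-level edit distance normalized by reference length. A minimal sketch of the metric (standard definition, not the project's evaluation code):

```python
def levenshtein(ref: str, hyp: str) -> int:
    # Character-level edit distance, computed row by row to keep
    # memory linear in the hypothesis length.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    # Character error rate: edit distance over reference length, so
    # 0.25 means one character in four is wrong on average.
    return levenshtein(ref, hyp) / len(ref)
```

On this scale the reported sub-0.25 results mean fewer than one character error per four reference characters, while Malayalam's 0.86 indicates the transcriptions are still largely wrong.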

Latent Space · 5d ago · 7 · agent tool workflow deployment

Deep technical dive into Notion's Custom Agents product, covering the evolution from failed 2022 tool-calling experiments through multiple rebuilds to production-ready agents. Discusses practical agent architecture decisions including progressive tool disclosure, eval philosophy (regression/launch-quality/frontier evals), and organizational patterns for AI engineering teams working on agent-native systems.
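
"Progressive tool disclosure" means the agent starts with a small toolset and unlocks more as the task advances, rather than seeing every tool at once; a toy sketch of the pattern (stage names and tools are illustrative, not Notion's implementation):

```python
# Toy progressive tool disclosure: the tool list exposed to the model
# grows as the workflow reaches later stages, keeping the prompt small
# and the action space focused early on.
TOOL_STAGES = {
    "search":  ["search_pages"],
    "draft":   ["search_pages", "read_page", "create_page"],
    "publish": ["search_pages", "read_page", "create_page", "share_page"],
}

def visible_tools(stage: str) -> list[str]:
    # Unknown stages fall back to the most restrictive set.
    return TOOL_STAGES.get(stage, TOOL_STAGES["search"])
```

The design trade-off is the one the episode discusses: fewer visible tools per step means fewer wrong tool calls, at the cost of the orchestrator having to decide when each stage unlocks.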

Anthropic Research · 5d ago · 7 · research agent fine tuning benchmark

Anthropic's research explores weak-to-strong supervision as a practical approach to scalable oversight—training stronger AI models using weaker model feedback to prepare for supervising future superhuman AI. The study tests whether Claude can autonomously develop and test alignment methods, demonstrating potential for AI systems to accelerate their own alignment research.

r/MachineLearning · 5d ago · 7 · tool inference open source research

LARQL introduces a novel approach that decomposes LLM weight matrices into graph databases, making k-NN traversal a mathematically equivalent alternative to matrix multiplication. The approach supports in-context knowledge updates without retraining and shrinks the memory footprint by replacing dense matrices with sparse graph structures, offering practical efficiency gains for model deployment and knowledge management.
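
The equivalence rests on a basic fact: a matrix-vector product only touches a matrix's nonzero entries, and those can be stored as weighted edges and summed by traversal. A toy sketch of edge-list matvec against the dense version (illustrative of the underlying identity only; LARQL's k-NN traversal and decomposition are not reproduced here):

```python
def dense_matvec(W, x):
    # Standard dense matrix-vector product.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def graph_matvec(edges, x, n_out):
    # edges: (out_node, in_node, weight) triples, i.e. only the nonzero
    # entries of W stored as a weighted graph; traversing them and
    # accumulating reproduces W @ x exactly.
    y = [0.0] * n_out
    for i, j, w in edges:
        y[i] += w * x[j]
    return y

W = [[0.0, 2.0], [3.0, 0.0]]
edges = [(0, 1, 2.0), (1, 0, 3.0)]  # sparse graph view of W
x = [1.0, 4.0]
```

Editing an edge weight is then a local graph update rather than a retraining step, which is the memory and knowledge-management angle the post claims.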

Simon Willison · 5d ago · 6 · new model fine tuning api update

OpenAI released GPT-5.4-Cyber, a fine-tuned variant optimized for defensive cybersecurity use cases, along with a Trusted Access for Cyber program using identity verification for reduced-friction access. The announcement emphasizes OpenAI's existing cybersecurity work and self-service verification, though premium tools still require application approval similar to competing offerings.