r/MachineLearning · 57d ago · 8 · dataset rag research open source benchmark

A software engineer has built a structured dataset of 20M+ Indian court cases with citation graphs, dense and sparse embeddings, and extracted metadata (judges, parties, sections, acts). The resource includes a heuristic + LLM-based NER extraction pipeline and cross-referenced legislation, and serves as a novel evaluation benchmark for legal RAG systems and graph neural networks on low-resource legal-domain data.

r/MachineLearning · 57d ago · 7 · benchmark inference research

Comprehensive benchmark comparing six LLMs on subtitle translation across six languages using reference-free quality metrics (MetricX-24 and COMETKiwi). A custom combined score reveals model-metric affinity bias and critical failures, such as TranslateGemma's inability to distinguish Simplified from Traditional Chinese despite high metric scores. The evaluation highlights practical limitations of current QE metrics and the real-world deployment risks of relying solely on automated scoring.

Latent Space · 57d ago · 6 · open source deployment benchmark

Community survey of popular open-weight models across local deployment use cases, highlighting Qwen 3.5, Gemma 4, DeepSeek V3.2, and others based on actual Reddit recommendations rather than benchmarks. Focuses on practical model selection for engineers building local inference systems, with specific callouts for coding (Qwen3-Coder-Next) and agentic workloads (MiniMax M2.5/M2.7).

r/MachineLearning · 57d ago · 8 · library research open source benchmark deployment

HALO-Loss is an open-source drop-in replacement for Cross-Entropy that uses Euclidean distance instead of dot products to bound model confidence, enabling native out-of-distribution detection without sacrificing base accuracy. The method addresses a fundamental neural network problem, where models hallucinate on unfamiliar data, by mathematically constraining confidence to finite distances and providing an implicit "abstain class" at the origin of the latent space. Testing shows zero accuracy drop, improved calibration (ECE down to 1.5%), and significantly fewer false positives on far-OOD detection compared to standard approaches.
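
To make the distance-based idea concrete, here is a minimal sketch, not the HALO-Loss implementation. It assumes each class has a prototype vector in latent space, uses the negative squared Euclidean distance to each prototype as the logit, and adds one extra "abstain" prototype fixed at the origin. The function names and the prototype setup are hypothetical.

```python
import math


def distance_logits(z, prototypes):
    """Logit for class k = negative squared Euclidean distance to prototype k.

    Every logit is <= 0, so confidence is bounded (unlike a dot product,
    which grows without limit as the embedding norm grows).
    """
    return [-sum((zi - pi) ** 2 for zi, pi in zip(z, p)) for p in prototypes]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_abstain(z, prototypes):
    """Return probabilities over the classes plus a final 'abstain' slot.

    Assumption for illustration: the abstain class is an extra prototype
    fixed at the origin of the latent space.
    """
    origin = [0.0] * len(z)
    return softmax(distance_logits(z, prototypes + [origin]))
```

In this toy setup an embedding close to a class prototype is assigned to that class, while an embedding close to the origin puts most of its probability on the abstain slot. Whether a trained encoder actually maps unfamiliar inputs near the origin is a property of the training method and is not shown here.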

r/MachineLearning · 57d ago · 7 · research open source inference benchmark

An indie developer trained a 1B-parameter Spiking Neural Network (SNN) from random initialization for language modeling, achieving 93% sparsity and spontaneous cross-lingual emergence, challenging the conventional wisdom that SNNs must be obtained through ANN conversion or distillation rather than direct training. While early-stage (loss of 4.4 at 27k steps), this demonstrates a viable pathway for neuromorphic computing and inference efficiency, with code and checkpoint shared for community feedback.
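
For readers new to spiking networks, the sketch below shows a single leaky integrate-and-fire (LIF) neuron and one common way to measure sparsity: the fraction of time steps with no spike. It is a textbook illustration and not the posted model. The threshold, decay value, and the reading of "sparsity" as spike sparsity are assumptions.

```python
def lif_run(inputs, threshold=1.0, decay=0.9):
    """Simulate one leaky integrate-and-fire neuron over a sequence of inputs.

    The membrane potential leaks by `decay` each step, accumulates the input,
    and emits a spike (1) with a hard reset when it reaches `threshold`.
    """
    v = 0.0
    spikes = []
    for x in inputs:
        v = decay * v + x
        if v >= threshold:
            spikes.append(1)
            v = 0.0  # hard reset after firing
        else:
            spikes.append(0)
    return spikes


def sparsity(spikes):
    """Fraction of time steps in which the neuron stayed silent."""
    return 1.0 - sum(spikes) / len(spikes)
```

With a constant input of 0.3 for ten steps, this neuron fires on the fourth and eighth steps, so eight of the ten steps are silent.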

r/MachineLearning · 57d ago · 7 · research workflow benchmark

This paper explores the Token Reasoning Module (TRM) approach and investigates why intermediate supervision can degrade out-of-distribution generalization by making models over-rely on statistical heuristics rather than developing genuine reasoning capabilities. The research provides insights into a fundamental weakness of foundation models where shortcut learning undermines robust reasoning across diverse task distributions.

DeepMind Blog · 57d ago · 9 · new model api update agent benchmark

Google released Gemini Robotics-ER 1.6, a specialized embodied reasoning model for robotic systems with enhanced spatial understanding, multi-view reasoning, and new instrument-reading capabilities like gauge interpretation. The model is now available via the Gemini API with improvements in pointing, counting, task planning, and success detection, all of which are critical for physical agent autonomy.

Simon Willison · 58d ago · 7 · tool library open source

Servo browser engine is now available on crates.io as an embeddable library, enabling Rust developers to integrate it into applications. The post demonstrates practical usage including a CLI screenshot tool and explores WebAssembly compilation possibilities, though full Servo WebAssembly compilation isn't feasible due to threading and dependency constraints.

GitHub Trending AI · 58d ago · 7 · tool open source workflow

skills-manage is a Tauri-based desktop app that centralizes AI coding agent skill management across 20+ platforms (Claude, Cursor, Gemini CLI, etc.) using a single ~/.agents/skills/ directory with symlink distribution. It implements the Agent Skills open pattern, allowing engineers to maintain one skill source of truth deployed to multiple AI coding tools.
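
The single-source-plus-symlinks pattern can be sketched in a few lines. This is an illustration of the general idea, not code from skills-manage. The function name is hypothetical, and the per-tool target directories are left for the caller to supply because the real paths each tool reads from are not given here.

```python
import os
from pathlib import Path


def distribute_skills(source_dir, target_dirs):
    """Symlink every skill folder in `source_dir` into each target directory.

    Returns the list of links that were newly created. Entries that already
    exist in a target are left untouched, so the call is safe to repeat.
    """
    source = Path(source_dir)
    created = []
    for target in map(Path, target_dirs):
        target.mkdir(parents=True, exist_ok=True)
        for skill in sorted(p for p in source.iterdir() if p.is_dir()):
            link = target / skill.name
            if link.is_symlink() or link.exists():
                continue  # do not overwrite anything already present
            os.symlink(skill.resolve(), link, target_is_directory=True)
            created.append(link)
    return created
```

A caller might pass `Path.home() / ".agents" / "skills"` as the source and one skills directory per installed tool as the targets. Because each target holds links rather than copies, an edit to a skill in the source directory is visible to every tool immediately.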

Simon Willison · 58d ago · 6 · prompt engineering workflow

Bryan Cantrill argues that LLMs lack the optimization pressure that human laziness (finite time) creates, leading to bloated systems and poor abstractions if left unchecked. The piece emphasizes how human constraints force better engineering practices, a useful perspective for AI engineers who rely on LLM-generated code or architectures in production systems.

Simon Willison · 58d ago · 7 · tutorial inference open source tool

Practical walkthrough of running local audio transcription using the Gemma 4 E2B model with the MLX framework on macOS via uv run. Demonstrates real-world inference with a 10GB model and shows actual transcription output with accuracy notes, useful for developers building local AI audio pipelines.

r/LocalLLaMA · 59d ago · 7 · open source inference tool benchmark

This PR adds audio processing support to Gemma 4 models in llama.cpp using a USM-style Conformer encoder, with key fixes for CUDA/Vulkan/Metal backend compatibility. The implementation includes optimizations like replacing unsupported ops (ggml_roll → view+concat) and fixing contiguity issues that caused CPU fallbacks, achieving strong audio transcription results across different quantization levels and backends.
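
The "roll → view+concat" substitution rests on a simple identity: a circular shift equals two slices joined in swapped order. The sketch below demonstrates that identity on plain Python lists. It is not the llama.cpp code, and it makes no claim about the exact argument conventions of `ggml_roll`.

```python
def roll(xs, shift):
    """Reference circular shift: element i moves to position (i + shift) % n."""
    n = len(xs)
    if n == 0:
        return []
    return [xs[(i - shift) % n] for i in range(n)]


def roll_via_slices(xs, shift):
    """Same result built from two slices (the 'views') and one concatenation."""
    n = len(xs)
    if n == 0:
        return []
    k = shift % n  # normalise negative or oversized shifts
    if k == 0:
        return list(xs)
    return xs[n - k:] + xs[:n - k]
```

Both functions return `[4, 5, 1, 2, 3]` for `[1, 2, 3, 4, 5]` shifted by 2. Expressing the shift with slicing and concatenation is useful when a backend lacks a dedicated roll kernel but supports those two primitives.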

r/MachineLearning · 59d ago · 6 · research benchmark

This essay explores whether LLM capabilities emerge purely from scale (data + compute) or instead require fundamental algorithmic innovations, tracing this debate from early computer vision work through GPT scaling. While intellectually engaging, it's primarily philosophical reflection on existing trends rather than an introduction of new technical methods, models, or practical tools for engineers building with AI.

GitHub Trending AI · 59d ago · 8 · tutorial research prompt engineering

Comprehensive educational resource covering LLM fundamentals including tokenization (BPE), attention mechanics (Q/K/V math), scaling factors, causal masking, backpropagation, and cross-entropy loss. Step-by-step explanations with numeric examples make this valuable for engineers building with LLMs who want to understand the mathematical foundations beneath their tools.
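
As a small companion to the topics listed, here is a dependency-free sketch of scaled dot-product attention with a causal mask. It is not taken from the resource itself. The function names are illustrative, and the mask is applied by letting each query see only keys at or before its own position, which gives the same result as setting future scores to negative infinity before the softmax.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(Q, K, V):
    """Scaled dot-product attention where position i attends to positions 0..i.

    Q, K, V are lists of equal-length rows (one row per token).
    Scores are divided by sqrt(d_k) to keep the softmax inputs well scaled.
    """
    d_k = len(K[0])
    out = []
    for i, q in enumerate(Q):
        visible_keys = K[: i + 1]  # causal mask: no access to future tokens
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
                  for k in visible_keys]
        weights = softmax(scores)
        # zip stops at len(weights), so only the visible value rows are mixed
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

With `Q = K = V = [[1.0, 0.0], [0.0, 1.0]]`, the first token can only attend to itself and returns its own value row unchanged, while the second token mixes both value rows with more weight on its own.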

TLDR AI · 59d ago · 6 · workflow benchmark

Survey findings reveal widespread developer distrust of AI-generated code (96%) alongside reliability concerns, highlighting the need for automated verification and deterministic guardrails in AI-assisted development workflows. The report positions AI as "trusted but verified", with emphasis on SDLC integration and automated quality gates rather than manual code review.