r/MachineLearning · 2d ago · 8 · research architecture inference open source

Residual Coupling (RC) is a novel architecture that connects frozen language models in parallel through lightweight linear bridge projections. The post reports significant improvements over baselines and MoE routing, including an 80.7% perplexity reduction in the medical domain. Because the base weights are never modified, multi-model systems can scale horizontally; suggested applications include collapsing multi-turn prompting into a single parallel forward pass and edge deployment.
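
The summary does not spell out how the bridges are wired, so the following is only a minimal sketch of one plausible reading: the hidden state of a frozen model B is passed through a small trainable linear map and added residually to the hidden state of a frozen model A. The names `matvec`, `couple`, and `bridge_b_to_a` are hypothetical and do not come from the post.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def couple(h_a, h_b, bridge_b_to_a):
    """Add a linear projection of model B's hidden state to model A's.

    Assumption: only `bridge_b_to_a` would be trained; both base models
    stay frozen, so h_a and h_b are treated as fixed inputs here.
    """
    projected = matvec(bridge_b_to_a, h_b)
    return [a + p for a, p in zip(h_a, projected)]


# Toy hidden states: model A has width 2, model B has width 3.
h_a = [1.0, 2.0]
h_b = [1.0, 0.0, -1.0]

# A zero bridge leaves model A's state untouched.
zero_bridge = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
unchanged = couple(h_a, h_b, zero_bridge)

# A non-zero bridge mixes information from B into A.
bridge = [[0.5, 0.0, 0.0], [0.0, 0.0, 1.0]]
coupled = couple(h_a, h_b, bridge)
```

Under this reading, a zero-initialized bridge reproduces the base model exactly, which is one reason a residual form is a natural fit for frozen weights.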

OpenAI Blog · 2d ago · 6 · deployment agent api update

OpenAI and Dell are partnering to enable on-premises deployment of Codex in enterprise environments, targeting secure AI coding in hybrid setups. Software engineers can integrate AI coding capabilities within their own infrastructure while maintaining data privacy and control.

r/MachineLearning · 2d ago · 8 · research benchmark inference agent

A detailed empirical analysis finds that a Mixture-of-Experts language model (Qwen 3.5-35B) exhibits dialect-conditioned routing divergence, producing different safety behavior for AAVE and Academic English prompts. The divergence occurs upstream of the refusal layers and is amplified when safety fine-tuning is weakened. Observed failures include extended token generation loops and operational versus mitigative response types that depend solely on linguistic register, exposing a latent deployment vulnerability in MoE-based safety mechanisms.
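
The summary does not say which metric the analysis uses for routing divergence. As a hedged illustration only, one common way to compare two routing patterns is the Jensen-Shannon divergence between the expert-selection frequencies recorded for each prompt set. The helper names and the toy expert IDs below are hypothetical.

```python
import math
from collections import Counter


def expert_distribution(expert_ids, n_experts):
    """Turn a list of selected expert IDs into a frequency distribution."""
    counts = Counter(expert_ids)
    total = len(expert_ids)
    return [counts.get(e, 0) / total for e in range(n_experts)]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the sum.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Toy routing traces for one MoE layer with 4 experts (made-up values).
routes_register_a = [0, 0, 1, 1, 2, 2, 3, 3]
routes_register_b = [0, 0, 0, 0, 1, 1, 1, 1]

p = expert_distribution(routes_register_a, n_experts=4)
q = expert_distribution(routes_register_b, n_experts=4)
divergence = js_divergence(p, q)
```

Computing this per layer would show where two registers start to route differently, which is the kind of comparison the "upstream of refusal layers" claim implies.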

r/MachineLearning · 2d ago · 7 · tool open source research

A 16-year-old built SAGE, an open-source XAI tool that computes feature sensitivity (∂prediction/∂feature) for black-box models like Random Forest and XGBoost using weighted perturbation and linear regression. The approach shows more stable results than centered finite differences on non-differentiable models, addressing a practical gap where understanding how to change predictions matters more than feature attribution.
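
To make the idea concrete, here is a minimal sketch of sensitivity estimation by weighted perturbation and linear regression. It is not SAGE's actual code: the Gaussian sampling, the Gaussian kernel weights, and every function name are assumptions chosen for illustration.

```python
import math
import random


def local_sensitivity(predict, x, feature, scale=0.5, n_samples=200, seed=0):
    """Estimate d(prediction)/d(feature) at x as a weighted regression slope."""
    rng = random.Random(seed)
    deltas, preds, weights = [], [], []
    for _ in range(n_samples):
        d = rng.gauss(0.0, scale)          # perturb one feature only
        x_pert = list(x)
        x_pert[feature] += d
        deltas.append(d)
        preds.append(predict(x_pert))
        # Assumed kernel: nearby perturbations count more.
        weights.append(math.exp(-0.5 * (d / scale) ** 2))
    w_sum = sum(weights)
    d_mean = sum(w * d for w, d in zip(weights, deltas)) / w_sum
    y_mean = sum(w * y for w, y in zip(weights, preds)) / w_sum
    cov = sum(w * (d - d_mean) * (y - y_mean)
              for w, d, y in zip(weights, deltas, preds))
    var = sum(w * (d - d_mean) ** 2 for w, d in zip(weights, deltas))
    return cov / var                        # weighted least-squares slope


def centered_difference(predict, x, feature, eps=1e-4):
    """Classic centered finite difference, for comparison."""
    hi, lo = list(x), list(x)
    hi[feature] += eps
    lo[feature] -= eps
    return (predict(hi) - predict(lo)) / (2 * eps)


def staircase(v):
    """Piecewise-constant stand-in for a tree-based model."""
    return float(math.floor(v[0]))


fd = centered_difference(staircase, [0.5], 0)   # flat region: no signal
slope = local_sensitivity(staircase, [0.5], 0)  # sees the overall trend
```

On the staircase, the centered difference lands inside a flat step and reports zero, while the regression slope is positive because it averages over a wider neighborhood. That is the instability on non-differentiable models that the summary describes.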

r/MachineLearning · 3d ago · 7 · workflow research tool

Discussion of practical approaches to the data scarcity problem in ML projects, including a proposed solution that combines curation of permissively licensed real-world data with synthetic expansion and fidelity reporting. The post identifies a real pain point for engineers building models, who must choose between accepting poor performance, spending engineering time on scraping and cleaning, or relying on marginal augmentation techniques. It then explores whether synthetic data generation with statistical validation could bridge this gap.

r/LocalLLaMA · 3d ago · 6 · inference deployment benchmark

A detailed cost analysis comparing local inference on Apple Silicon (M5 MacBook Pro) with cloud API providers such as OpenRouter. It finds that local inference costs 3x more per token but offers worthwhile speed and latency trade-offs. For most software engineers, cloud APIs remain more cost-effective unless latency or privacy requirements justify the hardware investment, though the economics vary significantly with assumptions about token throughput and device lifespan.
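
The dependence on throughput and lifespan assumptions is easy to see in a small amortization sketch. The formula below is a generic one, not the post's own model, and every number in the usage example is a hypothetical placeholder rather than a figure from the analysis.

```python
def local_cost_per_million_tokens(hardware_usd, lifespan_years, hours_per_day,
                                  tokens_per_second, watts=0.0, usd_per_kwh=0.0):
    """Amortized hardware plus electricity cost per one million tokens."""
    active_hours = lifespan_years * 365 * hours_per_day
    total_tokens = active_hours * 3600 * tokens_per_second
    energy_usd = active_hours * (watts / 1000) * usd_per_kwh
    return (hardware_usd + energy_usd) / total_tokens * 1_000_000


def cost_ratio(local_usd_per_million, cloud_usd_per_million):
    """How many times more expensive local inference is than the cloud price."""
    return local_usd_per_million / cloud_usd_per_million


# Hypothetical inputs: a 3000 USD laptop used 4 hours a day for 4 years,
# generating 40 tokens/s at 60 W with electricity at 0.30 USD/kWh.
local = local_cost_per_million_tokens(
    hardware_usd=3000, lifespan_years=4, hours_per_day=4,
    tokens_per_second=40, watts=60, usd_per_kwh=0.30,
)
ratio = cost_ratio(local, cloud_usd_per_million=1.0)  # cloud price is made up
```

Doubling `tokens_per_second` or `lifespan_years` roughly halves the hardware share of the local cost, which is why the conclusion shifts so much with those two inputs.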

DeepMind Blog · 3d ago · 7 · new model tool agent

Google has expanded Project Genie, its world model capable of generating interactive environments, by integrating Street View imagery to ground virtual worlds in real-world locations. This lets AI agents and robots train and simulate in realistic environments tied to actual places, with the capability now rolling out to Google AI Ultra subscribers globally.

DeepMind Blog · 3d ago · 8 · new model api update inference

Google released Gemini Omni Flash, a multimodal generative model that creates and edits video from text, image, audio, and video inputs with consistent physics and character continuity. The model supports iterative natural language editing and reasoning about real-world physics. It is now rolling out to the Gemini app, Google Flow, and YouTube Shorts, with plans to add image and audio generation.

r/MachineLearning · 3d ago · 7 · rag research benchmark

An experimental memory retrieval system that achieves 96.4% on the LongMemEval benchmark, built on cognitive science foundations (episodic memory theory, temporal context modeling) with key innovations in query decomposition, temporal salience scoring, and coherence re-ranking. The work isolates retrieval quality from model capability by using a smaller answering model and provides a detailed category-level performance breakdown. The author acknowledges limitations, including single-benchmark evaluation and the absence of ablation studies.

r/LocalLLaMA · 3d ago · 7 · inference optimization open source

A performance optimization for multi-token prediction (MTP) in llama.cpp that reduces memory traffic during prompt processing by avoiding unnecessary logit copying, improving throughput by ~20-50% on various hardware (RTX 5090, MI50). This is a practical inference optimization that affects token generation speed for models using MTP, relevant for engineers optimizing LLM serving.

DeepMind Blog · 3d ago · 7 · new model agent api update workflow

Google launches Gemini for Science, a collection of experimental AI tools (Co-Scientist, AlphaEvolve, Empirical Research Assistance, NotebookLM) designed to accelerate scientific research workflows by automating complex tasks such as literature analysis and data synthesis. Enterprise versions are already in private preview with companies including BASF and Bayer, and validation papers have been published in Nature.

DeepMind Blog · 3d ago · 6 · tool api update deployment

Google is expanding SynthID digital watermarking and C2PA Content Credentials verification across its products (Search, Gemini, Chrome, Pixel) to help detect AI-generated vs. authentic content. The verification tools have already been used 50 million times and are rolling out to more platforms, with industry partners like OpenAI and ElevenLabs adopting SynthID for their generated content.

r/MachineLearning · 3d ago · 6 · deployment security

A critical session isolation vulnerability in DeepSeek exposed user conversations through specific input patterns, highlighting architectural risks in shared backend AI platforms. The article analyzes how different deployment models (local execution like Cursor vs. isolated workspaces vs. shared infrastructure) present different security trade-offs, relevant for engineers choosing AI tools for sensitive work.

r/MachineLearning · 3d ago · 6 · workflow benchmark

A Reddit discussion about the gap between the simplified evaluation frameworks taught in PM cohorts (such as Product Faculty's layered defense approach) and the statistical realities of production ML evaluation. It highlights the challenge of bridging PM and ML engineer perspectives on eval methodology without dismissing either party's valid insights.

r/LocalLLaMA · 3d ago · 7 · new model open source tool inference agent benchmark

Jackrong/Qwopus3.5-9B-Coder-GGUF is a 9B fine-tuned coding model optimized for agentic tasks, tool calling, and complex reasoning, with practical integration guides across multiple inference frameworks (llama.cpp, vLLM, Ollama, etc.) and strong performance on SWE-bench benchmarks. The model runs efficiently on 16GB RAM devices at 8-bit precision, making it accessible for local development while maintaining competitive coding capabilities.

r/LocalLLaMA · 4d ago · 7 · tool inference open source deployment

Guide for running the G4-MeroMero-31B GGUF quantized model across multiple inference frameworks (llama-cpp-python, llama.cpp, Ollama, etc.). Includes MMLU benchmark results (86.83% accuracy) and technical details on K-quant preservation strategies for SSM tensors, useful for engineers deploying open-source models locally.

r/LocalLLaMA · 4d ago · 6 · tool inference tutorial

Tutorial covering deployment of a fine-tuned Gemma 4 31B GGUF model across multiple inference frameworks (Transformers, llama-cpp-python, vLLM, Ollama, etc.), with focus on creative writing and reduced content restrictions. While practically useful for engineers running quantized models locally, this is primarily a model card/deployment guide rather than introducing new technical capabilities or frameworks.