DeepMind Blog · 47d ago · 7 · new model api update inference

Google DeepMind released Nano Banana 2 (Gemini 3.1 Flash Image), a new image generation model combining advanced reasoning and world knowledge with Flash-speed inference. The model is now available across Google products (Gemini app, Search) and offers improved subject consistency, photorealism, and instruction-following capabilities with reduced latency compared to the Pro version.

Ahead of AI · 48d ago · 8 · new model research benchmark

Comprehensive technical comparison of 10+ major open-weight LLM releases from January-March 2026, analyzing architectural innovations like mixture-of-experts, sliding window attention, QK-norm, and gating mechanisms across models from Arcee, Moonshot, Qwen, and others. Serves as a practical reference for understanding current design patterns and trade-offs in large model architecture.
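
To make one of the surveyed mechanisms concrete, here is a minimal sketch (an illustration, not code from the article) of the causal mask behind sliding window attention: each query position may attend only to itself and the `window - 1` positions immediately before it.

```python
def sliding_window_mask(seq_len: int, window: int) -> list[list[bool]]:
    """Causal sliding-window attention mask.
    True = attention allowed, False = masked out."""
    return [
        [i - window < j <= i for j in range(seq_len)]  # only recent positions
        for i in range(seq_len)
    ]

mask = sliding_window_mask(5, window=3)
# Row 4 (last query) can see positions 2, 3, 4 but not 0 or 1.
```

In a real model this mask is applied to the attention-score matrix before the softmax, capping memory and compute per token at the window size.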

OpenAI Research · 50d ago · 7 · benchmark research

Analysis reveals significant data contamination and training-data leakage in SWE-bench Verified, a widely used benchmark for evaluating AI coding models, and recommends using SWE-bench Pro instead. This matters for engineers evaluating code-generation models and judging the reliability of current benchmarking standards.

OpenAI Research · 53d ago · 7 · benchmark research agent

The research team evaluates AI model performance on expert-level mathematical proof problems from the First Proof challenge, providing insight into the current capabilities and limitations of AI reasoning on formal mathematics. This benchmarking work is relevant for engineers building AI systems that require complex reasoning and problem-solving.

DeepMind Blog · 54d ago · 9 · new model api update benchmark

Google released Gemini 3.1 Pro, an upgraded core model with significantly improved reasoning (77.1% on ARC-AGI-2, more than double Gemini 3 Pro's score). Available through the Gemini API, Vertex AI, and consumer products, it excels at complex problem-solving: code generation, system synthesis, and advanced reasoning workflows that engineers building with AI will find immediately applicable.

DeepMind Blog · 55d ago · 6 · new model api update deployment

Google DeepMind released Lyria 3, an advanced music generation model integrated into the Gemini app, allowing users to create 30-second tracks from text descriptions or images with SynthID watermarking for AI-generated content detection. The model improves on previous versions with better audio quality and customization, and is also rolling out to YouTube creators for Dream Track.

OpenAI Research · 56d ago · 6 · benchmark agent research

EVMbench is a new benchmark for evaluating AI agents on smart contract security tasks like vulnerability detection and patching. While technically interesting for agent evaluation, it's specialized to blockchain/security domains rather than general AI engineering workflows.

OpenAI Research · 60d ago · 6 · research benchmark

GPT-5.2 generated a novel theoretical physics formula for gluon amplitudes that was subsequently validated by formal proof and peer verification. While intellectually interesting, this represents a scientific application outcome rather than actionable technical guidance for AI builders developing with current models.

OpenAI Research · 68d ago · 6 · agent workflow api update

An autonomous lab system integrates GPT-5 with cloud automation for closed-loop experimentation in synthetic biology, demonstrating a 40% cost reduction in protein synthesis. While showcasing practical AI agent application in scientific workflows, the focus is primarily on biotech outcomes rather than AI engineering techniques or tools.

Ahead of AI · 80d ago · 8 · inference prompt engineering tutorial research

Comprehensive overview of inference-time scaling techniques for LLMs, covering chain-of-thought prompting, self-consistency, best-of-N ranking, and rejection sampling with verifiers. The author shares practical experimentation results (improving accuracy from 15% to 52%) and categorizes approaches from both the academic literature and proprietary LLM implementations, making the piece directly applicable to deployed systems.
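
Two of the surveyed techniques are easy to sketch in a few lines. The snippet below is an illustration, not code from the article: self-consistency as a majority vote over answers extracted from N sampled chains of thought, and best-of-N ranking against an arbitrary verifier score. The sampling and verifier functions themselves are assumed to exist elsewhere.

```python
from collections import Counter

def self_consistency(answers: list[str]) -> str:
    """Majority vote over final answers from N sampled chains of thought."""
    return Counter(answers).most_common(1)[0][0]

def best_of_n(candidates: list[str], score) -> str:
    """Best-of-N ranking: return the candidate the verifier scores highest."""
    return max(candidates, key=score)

votes = ["42", "42", "41", "42", "40"]  # answers from 5 sampled generations
picked = self_consistency(votes)        # majority answer: "42"
```

Both approaches trade extra inference compute (N samples instead of one) for accuracy, which is the central theme of the article.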

Ahead of AI · 105d ago · 9 · research new model fine tuning benchmark

A comprehensive retrospective on 2025's major LLM developments. It opens with DeepSeek R1's January release, which showed that reinforcement learning (specifically RLVR/GRPO) can elicit reasoning-like behavior in LLMs and that state-of-the-art model training may cost an order of magnitude less than previously estimated. The article examines how post-training scaling through verifiable rewards marks a significant algorithmic shift away from SFT/RLHF, opening new routes to unlocking model capabilities.
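
The group-relative part of GRPO is simple to illustrate: instead of training a separate value model (critic), each sampled completion's advantage is its reward z-scored against the other completions for the same prompt. A minimal sketch of that computation (mine, not DeepSeek's code):

```python
def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-relative advantages: normalize each completion's reward
    against the mean/std of its own group, so no learned critic is needed."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four completions for one prompt; a verifier gave reward 1 to the correct ones.
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

With verifiable rewards (RLVR), the reward is just a programmatic check (unit test, answer match), which is what makes this loop cheap to scale.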

OpenAI Research · 117d ago · 8 · research benchmark tool workflow

OpenAI releases a framework and evaluation suite for monitoring chain-of-thought reasoning processes, demonstrating that internal reasoning transparency significantly outperforms output-only monitoring for AI control. The work includes 13 evaluations across 24 environments, providing practical tools for engineers building interpretable AI systems.

Ahead of AI · 132d ago · 8 · new model open source inference research

DeepSeek V3.2 is a new open-weight flagship model that reaches GPT-5/Gemini 3.0 Pro-level performance using a custom sparse attention mechanism that requires specialized inference infrastructure. The article provides a technical deep dive into the model's architecture, its training pipeline, and what has changed since V3/R1, making it essential reading for engineers working with state-of-the-art open-weight models.
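
As a toy illustration of the sparse-attention idea only (not DeepSeek's actual mechanism, which uses a learned indexer to pick keys), each query retains just its top-k key positions rather than attending to the full context:

```python
def topk_keys(scores: list[float], k: int) -> list[int]:
    """Indices of the k highest-scoring key positions for one query.
    Toy stand-in: V3.2 selects keys via a learned component, not raw scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])  # keep positional order for the kept keys

keep = topk_keys([0.1, 0.9, 0.3, 0.8, 0.2], k=2)  # positions 1 and 3 survive
```

Restricting each query to k keys is what turns the quadratic attention cost into roughly linear cost in context length, and it is also why standard inference stacks need modification.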

Ahead of AI · 161d ago · 7 · research benchmark tutorial

Comprehensive overview of alternative LLM architectures beyond standard transformers, including diffusion models, linear attention hybrids, state space models (SSMs), and specialized architectures like code world models. The article surveys emerging approaches aimed at improving efficiency and modeling performance, with comparisons to current SOTA transformer-based models like DeepSeek R1, Llama 4, and Qwen3.

HN AI Stories · 187d ago · 7 · research benchmark deployment

Anthropic, the UK AI Security Institute, and the Alan Turing Institute released findings that LLMs can be backdoored with as few as 250 poisoned documents regardless of model size, challenging the assumption that attackers must control a fixed percentage of the training data. This large-scale poisoning study shows that data-poisoning attacks are more practical than previously believed and highlights a critical security vulnerability in pretraining pipelines that AI builders need to understand.

Ahead of AI · 191d ago · 7 · benchmark tutorial workflow

Practical guide covering four main LLM evaluation methods: multiple-choice benchmarks, verifiers, leaderboards, and LLM judges, with code examples and analysis of their strengths/weaknesses. Essential reading for engineers comparing models, interpreting benchmarks, and measuring progress on their own projects.

Ahead of AI · 220d ago · 8 · tutorial open source research

Deep dive into Qwen3 architecture implementation from scratch in PyTorch, covering the open-weight model family's design choices and building blocks. Provides practical code examples and architectural patterns directly applicable to understanding modern LLM internals and building custom variations.
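
One building block that recurs throughout Qwen3 (and most modern LLMs) is RMSNorm. A dependency-free sketch in plain Python, rather than the article's PyTorch, for illustration:

```python
def rms_norm(x: list[float], weight: list[float], eps: float = 1e-6) -> list[float]:
    """RMSNorm: divide by the root-mean-square of the activations
    (no mean subtraction, no bias), then apply a learned per-channel gain."""
    ms = sum(v * v for v in x) / len(x)   # mean of squares
    inv = (ms + eps) ** -0.5              # reciprocal RMS
    return [v * inv * w for v, w in zip(x, weight)]

out = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
```

Dropping the mean-centering and bias of LayerNorm saves a little compute and, empirically, trains just as stably, which is why it has become the default in open-weight LLMs.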

HN AI Stories · 469d ago · 7 · benchmark new model inference

Comprehensive year-in-review of LLM developments in 2024, highlighting that 18 organizations now have models surpassing GPT-4, with major advances in context length (up to 2M tokens with Gemini), multimodal capabilities (video input), and expanded model availability across open-source and commercial providers. Key takeaways include the democratization of competitive model performance, practical improvements in long-context reasoning for code and document analysis, and emerging capabilities like AI agents and multimodal processing becoming standard.

HN AI Stories · 736d ago · 9 · open source library inference tutorial

llm.c is a high-performance C/CUDA implementation of LLM pretraining that eliminates heavy dependencies (PyTorch, Python) while running about 7% faster than PyTorch Nightly. It provides clean reference implementations for reproducing GPT-2/GPT-3 models with both GPU (CUDA) and CPU code paths, making it valuable for understanding model training mechanics and CUDA optimization.