Practical guide covering four main LLM evaluation methods: multiple-choice benchmarks, verifiers, leaderboards, and LLM judges, with code examples and analysis of their strengths/weaknesses. Essential reading for engineers comparing models, interpreting benchmarks, and measuring progress on their own projects.
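For orientation, here is a minimal sketch of the letter-choice accuracy loop that multiple-choice benchmarking boils down to; it is not the guide's own code, and `query_model`, `format_prompt`, and the sample item are hypothetical placeholders to swap for your actual model call and dataset.

```python
# Minimal multiple-choice benchmark scoring sketch (hypothetical helper names).

def query_model(prompt: str) -> str:
    # Placeholder: replace with a real model/API call that returns an answer letter.
    return "A"

def format_prompt(question: str, choices: dict[str, str]) -> str:
    lines = [question] + [f"{letter}. {text}" for letter, text in choices.items()]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def mc_accuracy(examples: list[dict]) -> float:
    """Fraction of examples where the model's first letter matches the gold answer."""
    correct = 0
    for ex in examples:
        prediction = query_model(format_prompt(ex["question"], ex["choices"]))
        correct += prediction.strip().upper().startswith(ex["answer"])
    return correct / len(examples)

examples = [
    {
        "question": "Which normalization do Llama-style decoders typically use?",
        "choices": {"A": "RMSNorm", "B": "BatchNorm", "C": "GroupNorm", "D": "None"},
        "answer": "A",
    },
]
print(f"accuracy = {mc_accuracy(examples):.2%}")
```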
Deep dive into implementing the Qwen3 architecture from scratch in PyTorch, covering the open-weight model family's design choices and building blocks. Provides practical code examples and architectural patterns directly applicable to understanding modern LLM internals and building custom variants.
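As a flavor of the building blocks involved, below is a standalone RMSNorm module of the kind used in Qwen/Llama-style decoders; it is a generic sketch, not the article's exact implementation, and details such as epsilon and dtype handling may differ.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Root-mean-square normalization, a standard block in Qwen/Llama-style decoders."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Normalize by the RMS over the last dimension, then apply a learned scale.
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

x = torch.randn(2, 8, 64)      # (batch, sequence, hidden)
print(RMSNorm(64)(x).shape)    # torch.Size([2, 8, 64])
```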
Comprehensive year-in-review of LLM developments in 2024, highlighting that 18 organizations now have models surpassing the original GPT-4, alongside major advances in context length (up to 2M tokens with Gemini), multimodal capabilities (including video input), and broader model availability across open-source and commercial providers. Key takeaways: competitive model performance has been democratized, long-context reasoning brings practical gains for code and document analysis, and capabilities such as AI agents and multimodal processing are becoming standard.
llm.c is a high-performance C/CUDA implementation of LLM pretraining that drops heavy dependencies (no PyTorch or Python required) while training roughly 7% faster than PyTorch Nightly. It provides clean reference implementations for reproducing GPT-2/GPT-3 models with both GPU (CUDA) and CPU code paths, making it valuable for understanding the mechanics of model training and CUDA optimization.
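To convey the flavor of those dependency-free kernels, here is a pure-Python transliteration of an elementwise GELU forward pass in the style llm.c writes in plain C; the function name and layout are illustrative rather than llm.c's actual API.

```python
import math

GELU_SCALE = math.sqrt(2.0 / math.pi)

def gelu_forward(out: list[float], inp: list[float], n: int) -> None:
    # GPT-2's tanh approximation of GELU, applied elementwise into a preallocated buffer,
    # mirroring the explicit-loop style of a C reference kernel.
    for i in range(n):
        x = inp[i]
        cube = 0.044715 * x * x * x
        out[i] = 0.5 * x * (1.0 + math.tanh(GELU_SCALE * (x + cube)))

inp = [-2.0, -1.0, 0.0, 1.0, 2.0]
out = [0.0] * len(inp)
gelu_forward(out, inp, len(inp))
print([round(v, 4) for v in out])
```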
The llamafile 0.10.0 update from Mozilla.ai lets developers distribute and run open LLMs as single-file executables across platforms with no installation required, now tracking recent llama.cpp releases more closely and supporting newer models. The project also ships whisperfile for single-file speech-to-text, making local LLM deployment significantly more accessible for developers.
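As a usage sketch, the snippet below queries a llamafile that is assumed to already be running in server mode and exposing the llama.cpp OpenAI-compatible chat endpoint on the default localhost:8080; the URL, port, and payload fields are assumptions to adapt to your setup.

```python
import json
import urllib.request

# Assumes a llamafile is already running in server mode on the default port;
# endpoint path and payload fields below are assumptions, not verified against the release.
url = "http://localhost:8080/v1/chat/completions"
payload = {
    "model": "local",  # required by the OpenAI schema; a single-model local server typically ignores it
    "messages": [{"role": "user", "content": "Summarize what a llamafile is in one sentence."}],
    "temperature": 0.2,
}
req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    reply = json.loads(resp.read())
print(reply["choices"][0]["message"]["content"])
```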