News Nug

r/LocalLLaMA · 19h ago · 7 · tool tutorial prompt engineering inference

Guide to using BLOOMZ-P3, a multilingual instruction-following model fine-tuned on 46 languages, with practical examples showing how to deploy it via Transformers, vLLM, SGLang, and Docker. Includes prompt engineering best practices for optimizing zero-shot task performance across languages, emphasizing clear prompt structure and contextual framing.

Mapping world model taxonomy [P]

r/MachineLearning · 1d ago · 7 · research tutorial

An explainer article proposing a classification framework for world models, categorizing different approaches and identifying emerging trends in the space. Useful for understanding how world models are organized conceptually, though limited in novel technical depth or implementation details.

How should I approach training this specific ML model for my startup project [D]

r/MachineLearning · 1d ago · 5 · fine tuning tutorial

A startup question about implementing sentiment analysis for Indian-language political content using the MuRIL model, from a poster without ML expertise who is seeking guidance on fine-tuning approaches and alternatives. While relevant to AI builders, this is a general advice post rather than a technical resource, tutorial, or new tool announcement.

Profiling in PyTorch (Part 3): Attention is all you profile

HuggingFace Blog · 1d ago · 8 · tutorial workflow inference

Deep dive into profiling attention mechanisms in PyTorch using the profiler to understand kernel execution, memory operations, and optimization techniques. Part 3 of a series covering naive attention, in-place operations, scaled dot-product attention (SDPA), and custom kernels with practical profiling traces and optimization patterns.

OpenMOSS-Team/MOSS-Transcribe-Diarize · Hugging Face

r/LocalLLaMA · 2d ago · 8 · tool tutorial inference deployment

MOSS-Transcribe-Diarize 0.9B is a practical end-to-end model for multi-speaker audio transcription and diarization in a single pass, with native Transformers support via custom remote code. The tutorial covers practical deployment options including vLLM and SGLang Omni serving with OpenAI-compatible APIs, plus prompt engineering for hotwords and optimization strategies for long-form audio.

Step 3.7 Flash IQ4_XS GGUF with preserve_thinking

r/LocalLLaMA · 2d ago · 7 · inference tutorial open source tool

Comprehensive guide for running the Step-3.7-Flash GGUF quantization across multiple inference frameworks (llama.cpp, vLLM, Ollama, etc.) with custom IQ4_XS quantization and a preserve_thinking chat template feature that maintains reasoning state across turns.

Reasoning-Medical0.1-27B (Qwen3.5-27B medical finetune, claims to surpass MedGemma)

r/LocalLLaMA · 2d ago · 7 · new model tutorial deployment fine tuning

EpistemeAI/Reasoning-Medical0.1-27B is a 27B parameter model fine-tuned on 100k medical reasoning examples using GRPO training and Unsloth optimization, with native Chain-of-Thought reasoning capabilities. The guide covers practical deployment across multiple inference frameworks (Transformers, vLLM, SGLang, Unsloth Studio) and API integration patterns using OpenAI SDK compatibility.

[AINews] Lilian Weng summarizes 35 papers on Harness Engineering for RSI

Latent Space · 3d ago · 7 · agent research workflow tutorial

Lilian Weng's research recap on harness engineering and its relationship to recursive self-improvement (RSI) offers practical design patterns and an overview of the optimization literature for building agent systems. Multiple platforms (Anthropic, Google, LangChain) are converging on harness-centric agent architectures as the proven approach for long-running workflows, moving away from direct model weight modification.

Learning FlashAttention the Hard Way. Part 1: The Algebraic Foundation [D]

r/MachineLearning · 3d ago · 8 · tutorial research inference

A theoretical tutorial series on FlashAttention built on a modern algebraic formalism (associative reductions, twisted monoids) that enables GPU scheduling optimizations and is presented as more powerful than the original framing. Covers safe softmax, Welford's variance, numerical stability bounds, and provides first-principles derivations of constants in FA-2 and Triton kernels.
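To make the "associative reduction" framing concrete, here is a minimal pure-Python sketch of safe softmax written as a merge over (running max, running sum) states. It is not taken from the linked series; the names `lift`, `merge`, and `safe_softmax` are hypothetical and chosen only for illustration.

```python
import math
from functools import reduce

def lift(x):
    """Map one score to a partial state: (max so far, sum of exp(score - max))."""
    return (x, 1.0)

def merge(a, b):
    """Combine two partial softmax states; the operation is associative."""
    m_a, s_a = a
    m_b, s_b = b
    m = max(m_a, m_b)
    # Rescale each partial sum to the shared maximum before adding.
    return (m, s_a * math.exp(m_a - m) + s_b * math.exp(m_b - m))

def safe_softmax(scores):
    """Softmax that never exponentiates a positive number, so it cannot overflow."""
    m, s = reduce(merge, map(lift, scores))
    return [math.exp(x - m) / s for x in scores]

# Hypothetical scores; the large value would overflow a naive exp().
scores = [1.0, 2.0, 3.0, 1000.0]
probs = safe_softmax(scores)
```

Because `merge` is associative, the scores can be split into blocks, reduced independently, and combined in any grouping, which is the property that lets a kernel process attention tiles one at a time.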

github-code Web Component

Simon Willison · 4d ago · 6 · tool tutorial open source

Simon Willison demonstrates building a Web Component for embedding GitHub code snippets using GPT-5.5, which fetches raw GitHub URLs and displays specified line ranges with line numbers. This showcases practical LLM-assisted web development for creating reusable components, though it's more of a creative experiment than a production tool or framework advancement.
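The core behavior described above (take a file from GitHub, keep a line range, prefix line numbers) can be sketched in a few lines. This is a Python illustration rather than the Web Component itself; `to_raw_url` and `render_lines` are hypothetical helpers, and the blob-to-raw URL rewrite is an assumption about one way to obtain the raw file, not necessarily what the component does.

```python
def to_raw_url(blob_url):
    """Rewrite a github.com 'blob' URL to its raw.githubusercontent.com form."""
    prefix = "https://github.com/"
    if not blob_url.startswith(prefix):
        raise ValueError("expected a https://github.com/ URL")
    owner, repo, kind, rest = blob_url[len(prefix):].split("/", 3)
    if kind != "blob":
        raise ValueError("expected a /blob/ URL")
    return f"https://raw.githubusercontent.com/{owner}/{repo}/{rest}"

def render_lines(text, start, end):
    """Return lines start..end (1-indexed, inclusive), each prefixed by its number."""
    selected = text.splitlines()[start - 1:end]
    width = len(str(end))  # right-align numbers to the widest one
    return "\n".join(
        f"{number:>{width}} {line}"
        for number, line in enumerate(selected, start)
    )

# Hypothetical file content standing in for a fetched raw file.
snippet = render_lines("a\nb\nc\nd", 2, 3)
```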

Adding MCP Tools to Reachy Mini

HuggingFace Blog · 38d ago · 7 · tool tutorial workflow

Tutorial on integrating remote tools into a robotics AI system using profiles and tool configuration files. Covers the tool system architecture (built-in, local custom, and remote tools), profile management via instructions.txt and tools.txt, and how to enable/discover tools from external sources via a Hub with MCP endpoints.

I Put a Datacenter GPU in My Gaming PC for £200

r/LocalLLaMA · 38d ago · 7 · hardware inference tutorial

A practical guide to using datacenter GPUs (Tesla V100) for local LLM inference by adding an SXM2-to-PCIe adapter, achieving 32GB VRAM across two GPUs for ~£200. The article provides technical details on memory bandwidth advantages and hardware compatibility considerations for engineers running models locally on consumer hardware.

Finetuning a Reasoning LLM with Supervised or Reinforcement Learning? [D]

r/MachineLearning · 40d ago · 8 · fine tuning agent workflow tutorial

A practitioner asks for guidance on fine-tuning small LLMs with reasoning traces and tool-calling data, specifically about optimal training data structuring (conversation sampling strategy with selective loss masking) and whether to follow SFT with RL (PPO/DPO) for tool-use behavior. This is highly relevant for engineers building agentic systems, covering practical dataset preparation, training methodology, and reinforcement learning considerations for multi-step reasoning.
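Selective loss masking, as raised in the question, is commonly implemented by copying the input token IDs into the labels and overwriting every position that should not contribute to the loss. The sketch below assumes a conversation already tokenized into per-role segments; `build_labels`, the role names, and the token IDs are hypothetical, and which roles to train on is a design choice rather than a recommendation from the thread.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss

def build_labels(segments, train_roles=("assistant",)):
    """Flatten (role, token_ids) segments into input_ids and masked labels.

    Tokens from roles outside train_roles get IGNORE_INDEX, so the loss
    is computed only on the turns the model is meant to produce.
    """
    input_ids, labels = [], []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        if role in train_roles:
            labels.extend(token_ids)
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return input_ids, labels

# Hypothetical tokenized conversation with a tool call in the middle.
conversation = [
    ("system", [1, 2]),
    ("user", [3, 4, 5]),
    ("assistant", [6, 7]),   # reasoning trace plus tool call
    ("tool", [8]),           # tool output: context only, not a target
    ("assistant", [9, 10]),  # final answer
]
input_ids, labels = build_labels(conversation)
```

Masking the tool output keeps the model from being trained to predict text it will never generate itself, while still letting it condition on that text.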

Arabic ASR model struggling to converge during training [D]

r/MachineLearning · 40d ago · 6 · tutorial workflow

A software engineer is troubleshooting convergence issues with a Conformer-based ASR model trained on dialectal Arabic speech using SpeechBrain, where combined CTC+KLDiv losses plateau early and validation WER remains near 100% despite multiple hyperparameter adjustments. This represents a practical deep learning debugging challenge relevant to engineers building speech models, though it's a specific problem thread rather than a generalizable technique or tool release.
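For readers unfamiliar with the metric in this thread, word error rate is the word-level edit distance between reference and hypothesis divided by the reference length. The following is a minimal sketch of the standard definition, not SpeechBrain's implementation; `word_error_rate` is a hypothetical name.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)

# An empty hypothesis scores exactly 1.0 (every reference word is deleted).
empty_output_wer = word_error_rate("a b c", "")
```

Since empty output, fully wrong output, and heavy insertion all land at or above 100%, inspecting a few decoded hypotheses alongside the number is one way to tell these cases apart.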

Qwen 3.7 🤖, Cursor Composer 2.5 👨‍💻, Anthropic acquires Stainless 🛠️

TLDR AI · 42d ago · 7 · rag prompt engineering tutorial deployment

A technical guide covering RAG (Retrieval-Augmented Generation) implementation patterns, including code snippets, prompt templates, and production anti-patterns for scaling AI-powered search systems. Provides practical patterns and ready-to-use prompt contracts for building reliable RAG applications.

Profiling in PyTorch (Part 1): A Beginner's Guide to torch.profiler

HuggingFace Blog · 43d ago · 8 · tutorial workflow inference benchmark

Comprehensive tutorial on profiling PyTorch models using torch.profiler, covering how to read trace files and identify performance bottlenecks in matrix operations and GPU kernels. Essential for engineers optimizing LLM inference and training loops, with practical examples using NVIDIA GPUs and step-by-step walkthroughs of profiler outputs.

Qwen/Qwen-Image-Bench · Hugging Face

r/LocalLLaMA · 44d ago · 7 · new model tool tutorial benchmark

Q-Judger is a fine-tuned vision-language model for automated evaluation of text-to-image generation quality, built on Qwen2.7-27B with structured JSON output. The article provides practical setup instructions across multiple inference frameworks (Transformers, vLLM, SGLang, Docker) and demonstrates hierarchical evaluation criteria validated against human expert rankings.

Profiling PyTorch training without accidentally stalling the GPU [D]

r/MachineLearning · 45d ago · 7 · tutorial workflow benchmark

Explores the measurement paradox in PyTorch training profiling where synchronization calls can distort performance results, presenting CUDA events as a lightweight alternative to capture timing without forcing synchronization in the hot path. Useful as a first-pass profiling technique before deeper operator-level analysis with PyTorch Profiler or Nsight.

Reachy Mini goes fully local

HuggingFace Blog · 45d ago · 8 · tutorial tool open source workflow inference deployment

Technical guide for building a fully local speech-to-speech pipeline (VAD → STT → LLM → TTS) for the Reachy Mini robot using open-source tools like llama.cpp, Parakeet, and Qwen3TTS. Demonstrates how to run conversational AI systems without cloud dependencies, with modular component swapping and customization for latency/quality tradeoffs.

Qwen3.5 27B Uncensored Heretic Native MTP Preserved is Out Now With the Full 15 MTPs Preserved and Retained, Available in Safetensors, GGUFs, NVFP4, NVFP4 GGUFs and GPTQ-Int4 Formats!

r/LocalLLaMA · 46d ago · 7 · tutorial inference open source deployment

Practical guide covering multiple inference frameworks (Transformers, llama-cpp-python, vLLM, SGLang, Ollama, etc.) for running a 27B quantized Qwen model. Includes GGUF quantization options and benchmark comparisons showing minimal accuracy degradation, useful for engineers optimizing local model deployment.