News Nug

Training GPT-like model on non-language series [R]

r/MachineLearning · 3d ago · 7 · training research workflow

A researcher training GPT-like decoder-only Transformer models (100M-500M parameters) on 750M tokens of non-language series data reports a common failure mode: the model gets stuck generating a single token repeatedly, which suggests a training-dynamics issue. The post lists detailed hyperparameters (AdamW optimizer, 1e-3 learning rate, 4M-token batch size) and asks whether training decoder-only models requires specific tricks or has undocumented failure modes.
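
For a sense of scale, the figures quoted in the summary can be turned into an optimizer-step count. This is only a back-of-the-envelope sketch: it assumes the 750M tokens are the whole training set and that the 4M-token batch is one optimizer step, neither of which the summary confirms. The function name is illustrative.

```python
import math

def steps_per_epoch(dataset_tokens: int, batch_tokens: int) -> int:
    """Optimizer steps needed to see every token once (last batch may be partial)."""
    return math.ceil(dataset_tokens / batch_tokens)

# Figures taken from the summary; their interpretation is an assumption.
DATASET_TOKENS = 750_000_000   # assumed to be the full training set
BATCH_TOKENS = 4_000_000       # assumed to be tokens per optimizer step

steps = steps_per_epoch(DATASET_TOKENS, BATCH_TOKENS)
print(f"{steps} optimizer steps per epoch")
```

Under those assumptions a single pass over the data is fewer than 200 updates, which may be worth keeping in mind when reading the learning rate and batch size together.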

Anyone tried this yet? LLM with knowledge date in the 1930s

r/LocalLLaMA · 33d ago · 7 · research benchmark training

Researchers trained 'vintage' language models on historical text (pre-1931) to study how LMs understand time, predict future events, and generate novel ideas. They evaluate these models on tasks like forecasting historical surprises and coding problems, providing insights into model capabilities and scaling behavior across different knowledge cutoffs.

ResBM: a new transformer-based architecture for low-bandwidth pipeline-parallel training, achieving 128× activation compression [R]

r/MachineLearning · 45d ago · 7 · research training inference

ResBM introduces a residual bottleneck architecture for efficient pipeline-parallel training that achieves 128× activation compression while maintaining convergence, directly addressing bandwidth constraints in distributed AI model training. The work combines encoder-decoder bottlenecks with low-rank identity paths and demonstrates practical results with the Muon optimizer, making it relevant for engineers optimizing large-scale model training infrastructure.
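
To illustrate what a 128× reduction means for traffic between pipeline stages, here is a rough arithmetic sketch. The batch size, sequence length, hidden size, and 2-byte precision are hypothetical values chosen for illustration, not figures from the ResBM work, and the code only counts bytes; it does not implement the architecture.

```python
def activation_bytes(batch: int, seq_len: int, hidden: int, bytes_per_value: int = 2) -> int:
    """Size of one activation tensor passed between pipeline stages."""
    return batch * seq_len * hidden * bytes_per_value

COMPRESSION = 128  # ratio quoted in the summary

# Hypothetical shapes: 8 sequences of 2048 tokens, hidden size 4096, 16-bit values.
full = activation_bytes(batch=8, seq_len=2048, hidden=4096)
compressed = full // COMPRESSION

print(f"uncompressed: {full / 2**20:.0f} MiB per transfer")
print(f"compressed:   {compressed / 2**20:.0f} MiB per transfer")
```

With these assumed shapes, each stage-to-stage transfer shrinks from 128 MiB to 1 MiB.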