News Nug

A new dataset with more that 100M hi-quality, curated images, with captions and meta data! [P]

r/MachineLearning · 3d ago · 8 · tool library open source dataset benchmark

MONET is a new Apache 2.0 open-source image-text dataset with 104.9M high-quality samples curated from 2.9B images, accompanied by visualization tools, a retrieval system, and a T2I training codebase. This is a significant resource for engineers building multimodal AI systems, offering both the dataset and practical tooling for training text-to-image models.

UK GDPR Small Business Q&A — 5,000 synthetic pairs with article-level citations [D]

r/MachineLearning · 3d ago · 7 · open source fine tuning rag dataset

Open-source UK GDPR compliance QA dataset (1K pairs) with SME-focused questions, detailed answers linked to specific articles/ICO guidance, and generation metadata. Generated via Qwen 14B + DeepSeek API, released in JSON/Parquet with MIT license—directly applicable for fine-tuning compliance assistants or building RAG systems for privacy tools.

Released a free 9.8M doc Indic multilingual corpus — Hindi, Bengali, Tamil, Telugu + 7 more (CC0, HuggingFace) [P]

r/MachineLearning · 12d ago · 7 · dataset open source fine tuning

A new multilingual dataset (Indic HPLT v1) with 9.8M documents across 11 Indian languages plus English, totaling 8.4B tokens, released under CC0 license on Hugging Face. Useful for training and fine-tuning language models for underrepresented Indian language families, though primarily a resource rather than a novel technical breakthrough.

Free Registration & $20K Prize Pool: 2nd MLC-SLM Challenge 2026 on Multilingual Speech LLMs [N]

r/MachineLearning · 32d ago · 6 · benchmark dataset open source

A multilingual speech language models challenge covering speaker diarization, ASR, and conversational understanding across 14 languages with 2,100 hours of free dataset. Two tracks focus on speech recognition/diarization and semantic understanding through QA, with practical experience building production speech systems.

How to Ground a Korean AI Agent in Real Demographics with Synthetic Personas

HuggingFace Blog · 40d ago · 7 · tool dataset tutorial agent deployment

NVIDIA released Nemotron-Personas-Korea, a synthetic dataset of 6M demographically-accurate Korean personas (zero PII) for grounding multilingual agents with cultural and contextual accuracy. The tutorial demonstrates deploying a Korean-aware agent using the dataset with NeMo Data Designer, NIM inference, or NVIDIA APIs—useful for engineers building localized AI systems.

SGOCR: A Spatially-Grounded OCR-focused Pipeline & V1 Dataset [P]

r/MachineLearning · 41d ago · 7 · open source tool dataset agent

Developer released SGOCR, an open-source dataset pipeline for generating spatially-grounded OCR-focused VQA data with rich metadata for training vision-language models. The project details a practical multi-stage architecture using Nvidia's nemotron-ocr-v2, Gemma/Qwen models, and Gemini 2.5 Flash for verification, plus an agentic optimization loop inspired by Karpathy's autoresearch for dataset quality improvement.

20M+ Indian legal documents with citation graphs and vector embeddings – potential uses for legal NLP? [D]

r/MachineLearning · 47d ago · 8 · dataset rag research open source benchmark

A software engineer has built a structured 20M+ Indian court case dataset with citation graphs, dense/sparse embeddings, and extracted metadata (judges, parties, sections, acts). The resource includes heuristic + LLM-based NER extraction pipeline, cross-referenced legislation, and serves as a novel evaluation benchmark for legal RAG systems and graph neural networks on low-resource legal domain data.