DataDecide: How to Predict Best Pretraining Data with Small Experiments • arXiv:2504.11393 • Published Apr 15, 2025
Rethinking Multilingual Continual Pretraining: Data Mixing for Adapting LLMs Across Languages and Resources • arXiv:2504.04152 • Published Apr 5, 2025
BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining • arXiv:2508.10975 • Published Aug 14, 2025
Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset • arXiv:2412.02595 • Published Dec 3, 2024
The Data-Quality Illusion: Rethinking Classifier-Based Quality Filtering for LLM Pretraining • arXiv:2510.00866 • Published Oct 1, 2025
Data, Data Everywhere: A Guide for Pretraining Dataset Construction • arXiv:2407.06380 • Published Jul 8, 2024
Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models • arXiv:2505.22232 • Published May 28, 2025
Nemotron-CC-Math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset • arXiv:2508.15096 • Published Aug 20, 2025
Reuse, Don't Retrain: A Recipe for Continued Pretraining of Language Models • arXiv:2407.07263 • Published Jul 9, 2024