AvinashBenkiGnani's Collections
LLM - Pretraining Dataset Research

updated 10 days ago

  • DataDecide: How to Predict Best Pretraining Data with Small Experiments

    Paper • 2504.11393 • Published Apr 15 • 18

  • Rethinking Multilingual Continual Pretraining: Data Mixing for Adapting LLMs Across Languages and Resources

    Paper • 2504.04152 • Published Apr 5 • 1

  • BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining

    Paper • 2508.10975 • Published Aug 14 • 60

  • Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset

    Paper • 2412.02595 • Published Dec 3, 2024 • 5

  • The Data-Quality Illusion: Rethinking Classifier-Based Quality Filtering for LLM Pretraining

    Paper • 2510.00866 • Published Oct 1

  • Data, Data Everywhere: A Guide for Pretraining Dataset Construction

    Paper • 2407.06380 • Published Jul 8, 2024

  • Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models

    Paper • 2505.22232 • Published May 28 • 18

  • Nemotron-CC-Math: A 133 Billion-Token-Scale High Quality Math Pretraining Dataset

    Paper • 2508.15096 • Published Aug 20 • 4

  • Reuse, Don't Retrain: A Recipe for Continued Pretraining of Language Models

    Paper • 2407.07263 • Published Jul 9, 2024