TL;DR · Key Takeaways
  • Common Crawl is the dominant web-crawl component in many documented open training recipes, either directly or through derivatives (C4, RefinedWeb, FineWeb, Dolma, RedPajama). Proprietary training mixtures are partly undisclosed. It applies no filtering — the filtering pipeline determines usable quality.
  • The 2025 Bartz v. Anthropic ruling drew a hard legal line: training on lawfully purchased books is "spectacularly transformative" fair use, but maintaining a library of pirated books is "inherently, irredeemably infringing." A proposed $1.5 billion settlement has been reported as the largest in a U.S. copyright case; confirm final court approval before citing as settled.
  • Data curation now beats raw scale: HuggingFace's FineWeb-Edu trained on 1.3T well-filtered tokens outperforms models trained on datasets 10× larger.
  • Median data exhaustion year is 2028 (Epoch AI, 2024) — but multi-epoch training, synthetic data, and multimodal data extend the runway.
  • Chinese frontier models — Qwen 3 (36T tokens) and DeepSeek V3 (14.8T tokens) — have reached Western data scales while operating under fundamentally different regulatory constraints.
Evidence note This reference covers public and disclosed datasets. Frontier model training mixtures from OpenAI, Google, and Anthropic are largely undisclosed; token counts and source proportions for proprietary models are drawn from leaked analyses, court filings, and researcher estimates where official documentation is unavailable. Legal case status (including Bartz v. Anthropic) should be verified before citing as settled. Dataset availability, scale, and legal status change frequently. Last reviewed: May 2026.

Why Training Data Is the Limiting Variable in AI Development

Modern large language models do not differ primarily in architecture — they differ in data. The attention mechanism has been largely stable since 2017. What separates frontier models from their predecessors, and from each other, is the scale, composition, and curation quality of their training corpora. Yet the field has no canonical public inventory of what those corpora contain.

This reference fills that gap. It profiles every significant dataset used in LLM pretraining, instruction tuning, alignment, and multimodal training — covering provenance, scale, filtering methodology, legal status, and downstream usage. It also profiles the Chinese data ecosystem, which operates under fundamentally different regulatory and censorship constraints, and which has produced models at comparable data scales to Western counterparts.

The filtering is more important than the raw data. Each generation of web crawl curation — from C4's crude heuristics in 2019 to FineWeb-Edu's LLM-annotated quality scoring in 2024 — demonstrates that data curation sophistication, not token count, is the binding constraint on model quality.

Three structural forces are reshaping this ecosystem simultaneously. First, a quality revolution: FineWeb-Edu, DCLM, and Microsoft's Phi series have demonstrated that a well-curated 1.3 trillion token dataset can outperform a noisy 15 trillion token one. Second, a legal reckoning over provenance: courts are establishing that fair use may protect AI training, but not if source data was pirated. Third, the platformisation of data access: Reddit, Twitter/X, Baidu, and Shutterstock have each moved to restrict or monetise crawl access, ending the era of freely available internet data.

300T
Estimated effective stock of quality-adjusted public text, in tokens. At current consumption rates, median exhaustion is projected for 2028.Epoch AI, 2024

Section 1: Web Crawl Datasets — The Foundation Layer

Web crawl data is the dominant component in many documented open training recipes. Nearly every dataset in this category derives from a single source: Common Crawl. (Proprietary training mixtures from frontier labs are partly undisclosed; share estimates vary by recipe and source.) The evolution of filtering pipelines applied to that source tells the story of the field's maturation.

Common Crawl

Common Crawl is the bedrock of modern AI training data. Operated by a nonprofit founded by Gil Elbaz, it has crawled the public web continuously since 2008. Each monthly snapshot adds roughly 2.4 billion pages totalling over 400 TiB of uncompressed data. The cumulative archive spans petabytes across 100+ snapshots, distributed as WARC, WAT, and WET files on AWS S3. Common Crawl applies no filtering — it is raw web data in all languages. Virtually every major LLM derives training data from it either directly or through derivative datasets. The underlying web content retains its original copyright, creating the legal fault line at the centre of nearly every AI copyright lawsuit.

C4 (Colossal Clean Crawled Corpus)

Google's first major attempt to clean Common Crawl for LLM training, created by the T5 team (Colin Raffel et al.). C4 extracted ~156 billion tokens from a single April 2019 snapshot. Filtering was aggressive but crude: pages not ending in terminal punctuation were removed; pages with fewer than five sentences were discarded; any page containing a word from a profanity list was excluded. Studies later showed this approach disproportionately removed African American English (42% removal rate) and Hispanic-aligned English (32%) versus White American English (6.2%). Licensed under ODC-BY; available on HuggingFace via AllenAI.

RefinedWeb

Created by the Falcon team at the Technology Innovation Institute (TII, Abu Dhabi), RefinedWeb represented a philosophical shift. Its thesis: properly filtered and deduplicated web data alone can outperform curated multi-source datasets such as The Pile. The Macrodata Refinement pipeline processed Common Crawl through URL filtering, Trafilatura text extraction, fastText language identification, heuristic filtering, and — critically — both fuzzy (MinHash) and exact substring deduplication, removing nearly 90% of original content. The internal dataset reached ~5 trillion tokens; a 600-billion-token public extract is available under ODC-BY. Models trained solely on RefinedWeb matched or exceeded those trained on The Pile, validating the web-only approach.

FineWeb

HuggingFace's FineWeb processed 96+ Common Crawl snapshots spanning 2013 to 2024, producing ~18.5 trillion tokens of cleaned English text — sufficient to train a Chinchilla-optimal 500B+ parameter model. Built using HuggingFace's datatrove library, FineWeb tested 50+ candidate filters to identify a small effective set, used Justext for text extraction, and applied MinHash deduplication. Oldest snapshots lost up to 94% of tokens to deduplication. FineWeb outperforms C4, The Pile, SlimPajama, and RefinedWeb on aggregate benchmarks. Licensed under ODC-BY.

FineWeb-Edu

Perhaps the most consequential dataset innovation of 2024. HuggingFace used Llama-3-70B-Instruct to score 500,000 samples on a 0–5 educational quality scale, then fine-tuned a BERT-like classifier on these annotations. Applying this classifier across all 15T tokens of FineWeb — at a cost of 6,000 H100 GPU hours — and retaining only scores ≥3 yielded 1.3 trillion tokens of high-educational-value content (92% removed). The result: dramatic improvements on knowledge-intensive benchmarks, outperforming models trained on datasets 10× larger. This proved that aggressive quality filtering followed by multi-epoch training beats raw scale.

OSCAR, CC-100, mC4, and CulturaX — Multilingual Web Data

OSCAR (Open Super-large Crawled Aggregated Corpus), created by Inria's ALMAnaCH team, extracts multilingual text from Common Crawl across 150–168 languages using fastText for language identification. Available on HuggingFace with gated access. CC-100, created by Meta for XLM-R, provides monolingual text for 100+ languages using the CCNet pipeline with perplexity-based filtering via KenLM models trained on Wikipedia for each language. mC4 extends C4's methodology to 101 languages across 86 Common Crawl dumps, totalling ~27 TB; toxicity filtering was not applied for non-English languages. CulturaX merges mC4 and all OSCAR versions with additional cleaning, producing 6.3 trillion tokens in 167 languages; available with authentication on HuggingFace.

Dolma

Allen Institute for AI's open pretraining corpus, created specifically for OLMo. Version 1.x contains ~3 trillion tokens across seven sources: Common Crawl (2,415B tokens), The Stack code (411B), C4 (175B), Reddit (89B), peS2o scientific papers (57B), Project Gutenberg (4.8B), and Wikipedia (3.6B). Dolma 3 (2025) expanded to 9.3 trillion tokens, incorporating PDFs processed by olmOCR. All tools, data, and processing code are open-sourced. Licensed under ODC-BY.

DCLM (DataComp for Language Models)

A multi-institutional benchmark framework providing a ~240 trillion token raw pool from Common Crawl, then systematically testing filtering approaches. Key finding: a fastText classifier trained on OpenHermes 2.5 and ELI5 data dramatically improves dataset quality. The DCLM-Baseline trained a 7B model to 64% MMLU — comparable to LLaMA 3 8B while using 6.6× less compute. Publicly available at datacomp.ai.

Section 2: Curated Multi-Source Datasets — The Recipe Books

Multi-source corpora assemble and weight heterogeneous data types according to a compositional recipe. The key design choices are source selection, source weighting, and how many training epochs to assign each component.

The Pile

EleutherAI's landmark open multi-source corpus, released December 2020. It assembled 825 GiB (~300B tokens) from 22 component datasets, with higher-quality components upsampled: Wikipedia seen 3×, most academic sources 2×. Major components included Pile-CC web text (18.1%), PubMed Central (14.4%), Books3 (12.1%), OpenWebText2 (10.0%), ArXiv (9.0%), and GitHub (7.6%). It powered GPT-NeoX-20B, GPT-J-6B, Pythia, BloombergGPT, and dozens of other models. Following copyright disputes over Books3, the original download was removed. EleutherAI released Common Pile v0.1 in June 2025 as a fully licensed replacement, in partnership with HuggingFace, the Library of Congress, and Poolside.

RedPajama v1 and v2

RedPajama v1 (Together AI) reverse-engineered Meta's LLaMA training recipe, assembling ~1.2 trillion tokens across seven slices: CommonCrawl (~878B), C4 (~175B), GitHub (~59B), ArXiv (~28B), Books (~26B), Wikipedia (~24B), and StackExchange (~20B). It spawned 500+ community models. Licensed under Apache 2.0. RedPajama v2 took a different approach: web-only, five languages, but at massive scale — over 100 trillion raw tokens from 84 Common Crawl snapshots, with 40+ pre-computed quality annotations per document, allowing developers to construct custom filtering pipelines.

SlimPajama

Cerebras demonstrated the power of global deduplication. Starting from RedPajama's 1.2T tokens, they applied MinHash deduplication across all sources simultaneously — not within each source — reducing the corpus to 627 billion tokens, a 49.6% reduction. This outperformed RedPajama baselines, establishing that cross-source deduplication is essential. Processing took ~2.5 days on 64 CPU cores. Licensed under Apache 2.0.

ROOTS

The uniquely governed training corpus for BLOOM, created by 1,000+ BigScience Workshop researchers across 60 countries. ROOTS assembled 1.6 TB across 46 natural languages and 13 programming languages from 498 constituent datasets. Its governance model — with dedicated working groups for data governance, sourcing, privacy, and legal scholarship — remains unprecedented in the field. Community hackathons with Masakhane, ML Tokyo, and LatinX in AI shaped language selection. Available with gated access on HuggingFace.

Section 3: Books Datasets — Copyright's Ground Zero

Books datasets represent the most legally consequential category in AI training data. The chain from shadow libraries to corporate LLM training is now documented in court filings, and the legal distinction between lawfully acquired and pirated source material is reshaping how labs source book content.

Books3

Created by independent developer Shawn Presser, Books3 is the most legally consequential dataset in AI history. Presser scraped 196,640 books from Bibliotik, a private BitTorrent tracker, and uploaded the 100.96 GiB collection in October 2020. It became the third-largest component of The Pile (12.1% weight). Meta explicitly cited Books3 in the LLaMA paper. Its takedown by Danish Rights Alliance in August 2023 triggered cascading legal consequences: Kadrey v. Meta, Authors Guild v. OpenAI (George R.R. Martin, John Grisham, 17+ authors), and suits against Apple, NVIDIA, and Bloomberg. Books3 is officially defunct on HuggingFace.

BookCorpus (Books1)

Scraped from Smashwords by University of Toronto researchers in 2015 — approximately 7,185 unique self-published books totalling ~985 million words. It trained GPT-1 and BERT. Over 100 books contained explicit statements that they were "licensed for your personal enjoyment only." No longer distributed from the original source.

Project Gutenberg

The only legally unimpeachable large-scale books dataset: 70,000+ public domain books, primarily pre-1919 Western literature. Its PG-19 subset (28,602 books, ~1.97 billion tokens) appears in The Pile, RedPajama, Dolma, and the 2025 Common Pile. The limitation is stylistic: dated linguistic patterns that poorly represent modern language use.

Library Genesis (LibGen) and the Piracy Pipeline

LibGen — a Russian-rooted shadow library of 7.5+ million books and 81+ million papers — sits at the centre of the field's most damaging legal revelations. Court documents in Kadrey v. Meta revealed that Meta employees discussed downloading LibGen via BitTorrent for LLaMA 3, with conversations escalating to CEO Mark Zuckerberg, who gave "permission to use" the data. Internal communications acknowledged that exposure could "undermine our negotiating position with regulators." NVIDIA internal emails revealed receipt of "roughly 500 TB of book data" from Anna's Archive. Anthropic maintained a "central library" of over 7 million pirated books. In September 2024, a US judge ordered LibGen to pay $30 million to publishers.

The 2025 Bartz v. Anthropic ruling drew a crucial legal line: training on lawfully purchased books is "spectacularly transformative" fair use, but maintaining a library of pirated books "plainly displaced demand." A proposed $1.5 billion settlement has been reported as the largest in a U.S. copyright case. Provenance is now a legal variable, not merely an ethical preference.

Section 4: Code Datasets — The Licence Filtering Debate

GitHub dominates code training data. The central debate concerns scope: should training include only permissively licensed code (MIT, Apache, BSD), or all public code regardless of licence? The Stack/StarCoder took the conservative approach; others train on all public code.

The Stack v1 and v2

The Stack v1 (BigCode: ServiceNow + HuggingFace) collected 6.4 TB of source code across 358 programming languages from GitHub repositories. It pioneered systematic permissive-licence filtering and the "Am I in The Stack" opt-out tool. The Stack v2, built in partnership with Software Heritage, expanded dramatically to 67.5 TB raw covering 619 programming languages and over 3 billion unique files. Sources expanded beyond GitHub to include pull requests, Jupyter notebooks, Kaggle notebooks, and documentation. Content is accessed via Software Heritage persistent identifiers (SWHIDs), providing stronger provenance tracking than prior approaches.

The scale progression across code datasets illustrates rapid field maturation:

A notable anomaly: the ROOTS code component for BLOOM accidentally filtered for GPL licences only — the inverse of the intended permissive-licence filter — due to a preprocessing bug, raising unresolved questions about copyleft implications for derivative models.

Section 5: Scientific and Academic Datasets

Open-access papers flow freely into training through S2ORC, PubMed Central, and ArXiv. The vast majority of scientific literature remains behind paywalls. Court documents in the NVIDIA case explicitly list Sci-Hub alongside LibGen and Z-Library as training data sources, illustrating the same provenance problem that afflicts books data.

S2ORC and peS2o

S2ORC (Semantic Scholar Open Research Corpus, AI2) aggregates 81.1 million English-language papers with metadata and abstracts, plus 8.1 million with structured full text. Later versions expanded to 136M+ paper nodes with 12.7M full-text papers and 467M citation edges. Now accessed through the Semantic Scholar API as a continuously updated bulk dataset, licensed under ODC-BY. peS2o is AI2's preprocessed version optimised for LLM training: ~40 million open-access papers with additional quality filtering and OCR error detection, forming a major component of Dolma/OLMo.

PubMed Central Open Access Subset

Approximately 4.5 million full-text biomedical articles from NIH/NLM. It has trained BioMedLM (Stanford/MosaicML, 300B tokens), PMC-LLaMA (75B tokens), Meditron (46B tokens), and BioMistral. A notable operational detail: the NIH holds the "PubMed" trademark — Stanford was required to rename PubMedGPT to BioMedLM.

ArXiv Bulk Data

Over 2.4 million papers (~2.7 TB of PDFs, ~1.1 TB of LaTeX source), growing ~100 GB monthly across physics, mathematics, and computer science since 1991. ArXiv's default licence grants distribution rights to ArXiv only — it states it "is unable to grant others the right to distribute arXiv articles" — creating a legal grey area for training use even for openly posted preprints.

USPTO Patent Text

One of the cleanest large-scale technical datasets: fully public domain as US government publications. The 2025 Common Pile uses USPTO as its second-largest source. The Pile uses "USPTO Backgrounds" — the background sections of granted patents, which are dense with technical exposition.

CORE

Aggregates 431 million metadata records with 46 million full texts from 15,000+ data providers in 102 countries, making it the largest open-access paper aggregator. Based at The Open University (UK), it harvests from institutional repositories worldwide using OAI-PMH.

Section 6: Conversational and Social Datasets

Conversational data is undergoing rapid enclosure. Sources that were freely available through 2022 are now either paywalled, legally restricted, or actively litigated. This creates a structural incumbency advantage: organisations that already scraped this data hold it; new entrants face prohibitive costs.

Reddit via Pushshift

Jason Baumgartner's Pushshift archive captured billions of comments and hundreds of millions of posts from 2.8M+ subreddits, dating to Reddit's 2005 inception — the dominant source of conversational training data until its collapse. In June 2023, Reddit imposed API restrictions. Baumgartner shut down real-time feeds. Reddit subsequently struck a ~$60 million/year deal with Google for Gemini training data (February 2024), followed by an estimated ~$70M/year deal with OpenAI. Historical dumps through mid-2023 remain at files.pushshift.io, but the era of free Reddit data is over.

StackExchange, Wikipedia, and Hacker News

StackExchange data dumps cover 170+ Q&A sites under CC-BY-SA licensing — one of the cleanest major data sources. In The Pile: 32.2 GiB (5.1% weight); also used in RedPajama and many other training mixes. Wikipedia appears in virtually every LLM training set. English Wikipedia provides 6.8M+ articles (~25 GB compressed) under CC-BY-SA 4.0. In The Pile it was upsampled to 3× epochs, reflecting its exceptional quality density. Hacker News contributed 3.9 GiB of tech-focused conversations to The Pile; Ubuntu IRC logs added 5.5 GiB of technical support dialogue, valued for spontaneous real-time character.

News Datasets

CC-News extracts news articles from Common Crawl's dedicated news crawl, running since 2016, and provided training data for RoBERTa. RealNews (120 GB, AI2) specifically targeted the top 5,000 Google News domains from 2016–2019 Common Crawl dumps, created to train and detect neural fake news via the Grover model. GDELT monitors global news in 100+ languages with 42 billion words of content, but serves primarily as an event index rather than direct LLM training data.

Section 7: Instruction and Alignment Datasets

Instruction datasets shape model behaviour post-pretraining. Their legal status varies considerably: some are clean (Apache 2.0, MIT), some are commercially restricted, and several exist in a legal grey area owing to derivation from proprietary model outputs — which OpenAI's Terms of Service prohibit from being used to train competing models.

OpenAssistant (OASST1)

Crowd-sourced from 13,500+ volunteers worldwide, producing 161,443 messages across 66,497 conversation trees in 35 languages. Each node carries quality labels, toxicity scores, and preference rankings. Licensed under Apache 2.0 — fully permissive and openly available.

Stanford Alpaca

Demonstrated that high-quality instruction data could be generated cheaply. Using 175 human-written seed pairs and GPT-3.5 (text-davinci-003), the team generated 52,000 instruction-following examples for under $500. The fine-tuned LLaMA-7B performed comparably to GPT-3.5. However, OpenAI's Terms of Service explicitly prohibit using outputs "to develop models that compete with OpenAI," making commercial use legally questionable. Licensed CC BY-NC 4.0.

ShareGPT and Vicuna

User-shared ChatGPT conversations (~70K–125K after cleaning) trained Vicuna (LMSYS), which achieved ~90% of ChatGPT quality at $300 training cost. The legal status remains unresolved: users shared outputs subject to OpenAI's Terms of Service, and the ShareGPT API was eventually disabled. Copies persist across HuggingFace.

FLAN Collection

Google's most comprehensive instruction collection. The 2022 version combines 1,800+ tasks across sub-collections including Chain-of-Thought reasoning, dialogue, and program synthesis. It yielded 4.2% improvement on MMLU and 8.5% on BBH over prior collections. Code is Apache 2.0; component data licences vary.

Anthropic HH-RLHF

Provides ~170K human preference comparisons for reward model training, with separate helpfulness and harmlessness splits. Crowdworkers conversed with AI assistants and selected preferred responses. Licensed under MIT. Used by 282+ models on HuggingFace — the most widely used open RLHF dataset, though it is explicitly not intended for supervised dialogue training.

WildChat and LMSYS-Chat-1M

WildChat (AI2) captured 4.8 million real user-ChatGPT conversations through July 2025, obtained via opt-in consent (free API access in exchange for anonymised transcripts). Over 10% contained toxic content. LMSYS-Chat-1M collected 1 million conversations from Chatbot Arena across 154+ languages and 210K unique IPs — an order of magnitude larger than Anthropic HH or OpenAssistant.

UltraChat

Tsinghua University generated 1.5 million synthetic dialogues by having two separate ChatGPT APIs interact — one playing the user, the other the assistant. A filtered 200K subset trained Zephyr-7B-β. Licensed under MIT.

Section 8: Synthetic Datasets — Bootstrapping from Models

Synthetic data generation has become a central strategy for circumventing data scarcity. Its viability is domain-dependent: it reliably improves capabilities in narrow, verifiable domains (mathematics, coding) but carries recursive risks when applied indiscriminately to web-scale data.

Cosmopedia

HuggingFace generated 28 billion tokens of synthetic textbooks, blog posts, and stories using Mixtral-8x7B-Instruct. Two prompt strategies: conditioning on curated sources (Stanford courses, OpenStax, KhanAcademy) and conditioning on web data from RefinedWeb. Crucially, only open-weight models were used, avoiding proprietary dependencies. It powers SmolLM and serves as an open alternative to Microsoft's unreleased Phi datasets.

Microsoft Phi — The Textbook Quality Paradigm

The most influential synthetic data programme in the field. The "Textbooks Are All You Need" approach (June 2023) used GPT-3.5 to generate synthetic coding textbooks for Phi-1 (7B tokens total). Phi-2 scaled to 250B tokens; Phi-4 reached 400B tokens via 50+ custom generation pipelines. The core insight: data quality dominates scale beyond certain thresholds — the 3.8B parameter Phi-3-mini matched Mixtral performance. The datasets themselves are not publicly released, motivating Cosmopedia as an open alternative.

Magpie

Introduced a method requiring no prompts or seed questions. Feeding only the pre-query chat template to an aligned LLM (such as Llama-3-Instruct) causes it to auto-regressively generate a user query, which is then answered. This extracted 4 million instructions with responses from open-weight models. Published at ICLR 2025.

Orca-Style Datasets

Microsoft's Orca series used "explanation tuning" — querying GPT-4 with prompts requesting detailed reasoning traces for FLAN tasks. Orca 1 used ~6M examples; Orca 3/AgentInstruct scaled to 25M. The community OpenOrca reproduced this approach with ~4.2M GPT-3.5/4 completions, available on HuggingFace.

The Model Collapse Risk

Research by Shumailov et al. (Nature, 2024) demonstrated that indiscriminate training on synthetic data causes irreversible degradation where distribution tails disappear — termed model collapse. The concern is systemic: as AI-generated text proliferates online, future web crawls will inevitably contain synthetic content, creating unintentional recursive training loops. However, mixing synthetic with organic data appears to avoid collapse (Gerstgrasser et al., 2024), and the effect is most severe when organic data is entirely excluded.

Section 9: Multimodal Datasets

LAION-5B and Re-LAION-5B

LAION-5B5.85 billion image-text pairs scraped from Common Crawl — powered Stable Diffusion and DALL-E 2 training. In December 2023, the Stanford Internet Observatory found 3,226 suspected instances of CSAM (child sexual abuse material). LAION immediately took down all datasets. After comprehensive safety revision with the Internet Watch Foundation, Re-LAION-5B was re-released in August 2024 under Apache 2.0 with 2,236 CSAM-linked URLs removed. Class-action artist lawsuits (Andersen v. Stability AI) remain active.

DataComp

Provides a 12.8 billion image-text pair pool (2.5× LAION-5B) as a benchmark for dataset curation. Its best filtering approach — CLIP-based filtering intersected with image-based filtering — produced DataComp-1B: 1.4 billion samples that trained OpenCLIP to 79.2% ImageNet zero-shot accuracy, outperforming OpenAI's CLIP with 9× less compute.

OBELICS

HuggingFace extracted 141 million interleaved image-text web documents (353M images, 115B text tokens) from Common Crawl. Unlike image-caption pairs, OBELICS preserves documents' natural structure — text paragraphs interleaved with images as they appear on web pages. It trained IDEFICS, HuggingFace's Flamingo reproduction.

Video Datasets

InternVid (Shanghai AI Lab) is the largest public video-text dataset: 7M+ videos yielding 234 million clips with LLM-generated captions totalling 4.1 billion words. WebVid-10M (Oxford, 10.7M stock footage clips) was effectively taken down after a Shutterstock cease-and-desist order. MMC4 (101.2M interleaved image-text documents) was partially lost when original copies at AI2 were accidentally deleted in February 2025 — illustrating the fragility of the open data commons.

Section 10: Benchmark Contamination — The Field's Accountability Problem

Benchmark contamination is the field's most persistent integrity problem. It inflates reported performance, distorts model comparisons, and renders widely-used benchmarks unreliable for frontier evaluation.

19%
Maximum MMLU performance inflation measured when leaked benchmark samples are detected in training data.Multiple contamination studies, 2023–2024

Detection methods include n-gram overlap, MinKProb token-level log-probabilities, permutation testing (checking whether performance degrades when answer options are shuffled), and canary strings. NVIDIA's CoDeC estimates contamination from model behaviour without requiring training data access. The fundamental challenge: most training datasets remain undisclosed, making definitive contamination determination impossible without cooperation from model developers.

Section 11: The Chinese LLM Training Data Ecosystem

China has built a fundamentally separate data ecosystem for LLM training. The Great Firewall creates a distinct digital information environment dominated by domestic platforms — WeChat, Weibo, Douyin, Baidu, Zhihu, Bilibili — rather than their Western equivalents. Chinese frontier models have reached comparable data scales to Western counterparts while operating under fundamentally different regulatory and censorship constraints.

Major Open Chinese Datasets

WuDaoCorpora 2.0 (BAAI, Beijing Academy of Artificial Intelligence) is the landmark Chinese dataset: ~3 TB of text (1.08 trillion Chinese characters), cleaned from 100 TB of raw web pages using 20+ rules. A 200 GB open-source base version exists; the full dataset requires institutional cooperation. WanJuan 3.0 (Shanghai AI Lab/InternLM team, January 2025) expanded to 1.2 TB+ across five non-Chinese languages (Thai, Russian, Arabic, Korean, Vietnamese) with 300 billion tokens, licensed under CC BY 4.0. SkyPile-150B (Kunlun Tech) provides ~150 billion tokens (620 GB) from 233 million unique Chinese web pages, processed with 200+ filtering rules and BERT-based sensitive content detection. ChineseWebText 2.0 (Chinese Academy of Sciences, 2024) provides 3.8 TB of cleaned Common Crawl Chinese text with multi-dimensional annotations including quality scores, domain labels, and toxicity classification.

Chinese Platform Data

Baidu Baike contains ~30 million entries — far larger than Chinese Wikipedia's 1.43 million — and is widely used but increasingly restricted; Baidu blocked Google and Bing crawlers in August 2024. Zhihu (China's Quora equivalent) provides high-quality Q&A pairs reported as "highly favoured for training Chinese LLMs." Douban supplies book and film reviews. All face increasing access restrictions, mirroring the platformisation trend observable in Western markets.

Training Data at Chinese Frontier Labs

The following table summarises publicly disclosed training configurations for major Chinese frontier models:

Alibaba Qwen — The Data Flywheel Model

Qwen exemplifies the data flywheel strategy. From 3T tokens (Qwen 1.0) to 36T tokens (Qwen 3), each generation uses prior models to bootstrap data quality: Qwen2.5-VL for PDF text recognition, Qwen2.5 for text refinement, Qwen2.5-Math and Qwen2.5-Coder for domain-specific synthesis. The multilingual annotation system labels 30T+ tokens across dimensions including educational value, domain, and safety.

DeepSeek — Technical Pipeline Sophistication

DeepSeek V2 introduced pipeline elements beyond standard filtering: perplexity-based filtering (proxy model removes unnatural segments), semantic deduplication (embedding similarity, beyond string matching), code deobfuscation (normalising variable names), and cross-lingual alignment (pairing documents with high-quality translations). DeepSeek-Coder-V2 trained on 10.2 trillion tokens (60% code, 10% math, 30% natural language). DeepSeek V3.2 added a large-scale agentic task synthesis pipeline generating 85,000+ prompts across 1,800+ environments.

Regulation and Its Structural Effects

China's Interim Measures for the Management of Generative AI Services (effective August 15, 2023) — the world's first binding generative AI regulation — requires that training data "uphold Core Socialist Values" and use "lawful sources." The Cyberspace Administration of China (CAC) mandates algorithm filing, security assessments, and disclosure of training data sources. As of March 2025, approximately 350 LLMs have filed with CAC. Research and internal use are exempt; only public-facing services trigger obligations.

Independent benchmarking by the Stanford Center for Research on Foundation Models and others has found Chinese-origin models exhibit higher refusal rates, shorter responses, and more inaccurate answers to politically sensitive questions — described as "censorship by design" embedded through training data and RLHF alignment. However, these same models demonstrate strong Chinese-specific capabilities precisely because of their rich, specialised domestic data ecosystem.

Section 12: Cross-Category Analysis

The Quality Revolution in Web Crawl Curation

The evolution of Common Crawl filtering tells the story of the field's maturation: C4's crude heuristics (2019) → RefinedWeb's strict deduplication (2022) → FineWeb's principled filter selection (2024) → DCLM's model-based classification (2024) → FineWeb-Edu's LLM-annotated quality scoring (2024). IBM's GneissWeb outperformed FineWeb by 2.73 percentage points by applying yet another layer of quality classification. The implication is significant: the limiting factor for LLM performance may not be data quantity but data curation sophistication. This partially deflates data wall concerns — the constraint is not total tokens but the ability to extract signal from existing sources.

English Dominance in Training Data

The linguistic skew in LLM training data is severe. GPT-3's training data was ~93% English; LLaMA 2 was ~90% English. In Common Crawl, languages such as Tagalog, Punjabi, and Amharic constitute less than 0.01% of tokens. Performance gaps are correspondingly large: GPT-4 scores 84.9% in English versus 68.1% in Urdu on MMLU (OpenAI, 2023). Efforts including BLOOM (46 natural languages), Aya (101 languages), and CulturaX (167 languages) have expanded coverage without eliminating the gap. English-only safety publications outnumber multilingual ones by approximately 10× at major AI conferences, and the gap is widening.

The Approaching Data Wall

Epoch AI estimates the effective stock of quality-adjusted public text at approximately 300 trillion tokens (90% CI: 100T–1,000T). At current consumption rates — Qwen 3 alone used 36T tokens — the median exhaustion year is 2028. Several factors complicate this projection:

The data wall is real but not a cliff. It is more accurately characterised as a gradual transition from pure scaling to a regime where data efficiency, synthetic data, and inference-time reasoning must carry increasing weight.

Section 13: Master Dataset Reference Table

The table below covers major public and disclosed datasets. Proprietary training mixtures from frontier labs are partly undisclosed; those entries reflect available public information only.

Dataset Creator Scale Type Licence Known use Status
Common CrawlCommon Crawl FoundationPetabytes (400+ TiB/snapshot)Raw web crawlCC TOUNearly all LLMsActive
C4Google Research156B tokensCleaned web (EN)ODC-BYT5, LLaMAAvailable
RefinedWebTII (Abu Dhabi)5T tokens (600B public)Cleaned web (EN)ODC-BYFalconAvailable
FineWebHuggingFace18.5T tokensCleaned web (EN)ODC-BYOpen-source communityAvailable
FineWeb-EduHuggingFace1.3T tokensEducational web (EN)ODC-BYSmolLMAvailable
OSCARInria/ALMAnaCHVaries by languageMultilingual web (168 langs)CC0 metadataCamemBERT, BARTGated
CC-100Meta2.5 TB (100+ langs)Multilingual webCC TOUXLM-RAvailable
mC4Google27 TB (101 langs)Multilingual webODC-BYmT5, ByT5Available
CulturaXU. Oregon / Adobe Research6.3T tokens (167 langs)Multilingual webGatedMultilingual LLMsGated
DolmaAI23T–9.3T tokensMulti-source (EN)ODC-BYOLMoAvailable
DCLMDataComp consortium240T raw / 2.6T curatedCleaned web (EN)OpenResearch communityAvailable
The PileEleutherAI825 GiB / ~300B tokensMulti-source (EN)MixedGPT-NeoX, GPT-J, PythiaPartial (Books3 removed)
RedPajama v1Together AI1.2T tokensMulti-source (EN)Apache 2.0RedPajama-INCITEAvailable
RedPajama v2Together AI30T tokens (dedup)Web-only (5 langs)CC TOUResearch communityAvailable
SlimPajamaCerebras627B tokensDeduplicated RedPajamaApache 2.0Cerebras-GPT, BTLMAvailable
ROOTSBigScience1.6 TB (59 langs)Multi-source multilingualRAILBLOOMGated
Books3Shawn Presser196,640 books / 101 GiBPirated booksNoneLLaMA, GPT-NeoXTaken down
BookCorpusU. Toronto / MIT~7,185 books / 985M wordsSelf-published booksUnclearGPT-1, BERTNo longer distributed
Project GutenbergVolunteer-run (since 1971)70,000+ booksPublic domain booksPublic domainThe Pile, Dolma, Common PileAvailable
The Stack v1BigCode6.4 TB (358 langs)Permissive-licensed codePer-file licencesStarCoder, SantaCoderAvailable
The Stack v2BigCode + Software Heritage67.5 TB (619 langs)Code (all licence types)SWH termsStarCoder2Available
S2ORCAI281M+ papers (8.1M full text)Academic papersODC-BYDolma/OLMo, SciBERTVia API
PubMed Central OANIH/NLM~4.5M articlesBiomedical papersVaries by articleBioMedLM, PMC-LLaMAAvailable
ArXiv bulkCornell University2.4M+ papers / 2.7 TB PDFsSTEM papersArXiv licenceThe Pile, Dolma, RedPajamaAvailable (requester pays)
Reddit (Pushshift)Jason BaumgartnerBillions of posts/commentsSocial mediaRestrictedGPT-3, LLaMA, OPTHistorical only
StackExchangeStack Exchange Inc.32 GiB (in Pile)Q&ACC-BY-SAThe Pile, RedPajamaAvailable
WikipediaWikimedia Foundation25 GB compressed (EN)EncyclopaediaCC-BY-SA 4.0Nearly all LLMsAvailable
CC-NewsCommon CrawlGrowing (since 2016)News articlesCC TOURoBERTaAvailable
OpenAssistant (OASST1)LAION community161K messages / 35 langsInstruction dataApache 2.0OpenAssistant modelsAvailable
AlpacaStanford CRFM52K examplesInstruction dataCC BY-NC 4.0Alpaca-7BAvailable
ShareGPTUser-contributed~70K–125K conversationsChatGPT conversationsGrey areaVicunaAPI disabled; copies exist
FLAN CollectionGoogle Research1,800+ tasks / ~300 GBInstruction tuningApache 2.0 (code)Flan-T5, Flan-PaLMAvailable
Anthropic HH-RLHFAnthropic170K preference pairsRLHF preference dataMIT282+ modelsAvailable
UltraChatTsinghua University1.5M dialoguesSynthetic conversationsMITUltraLM, ZephyrAvailable
WildChatAI24.8M conversationsReal user-LLM conversationsODC-BYResearch communityAvailable
LMSYS-Chat-1MLMSYS (UC Berkeley+)1M conversationsChatbot Arena logsResearch licenceResearch communityGated
CosmopediaHuggingFace28B tokensSynthetic textbooksOpenSmolLMAvailable
Phi textbook dataMicrosoft Research7B–400B tokensSynthetic textbooksNot releasedPhi-1/2/3/4Not public
MagpieXu, Lin et al.4M instructionsSynthetic instructionsOpenMagpieLM, communityAvailable
LAION-5B / Re-LAION-5BLAION e.V.5.85B image-text pairsMultimodal (image-text)Apache 2.0 (Re-LAION)Stable DiffusionRe-released
DataCompMulti-institutional12.8B image-text pairsMultimodal (image-text)ResearchOpenCLIPAvailable
OBELICSHuggingFace141M docs / 353M imagesInterleaved image-textOpenIDEFICSAvailable
InternVidShanghai AI Lab234M video clipsVideo-textResearchViCLIP, InternVideoAvailable
WuDaoCorpora 2.0BAAI3 TB (200 GB open)Chinese multi-sourceInstitutionalWu Dao 2.0Partial
SkyPile-150BKunlun Tech150B tokens / 620 GBChinese webCommunity licenceSkywork-13BAvailable
WanJuanShanghai AI Lab2 TB+ (v1.0)Chinese multi-sourceCC BY 4.0InternLMAvailable
ChineseWebTextCASIA3.8 TB (v2.0)Chinese webOpenChinese LLMsAvailable

Conclusion: Three Tectonic Forces Are Fracturing the Data Landscape

The LLM training data ecosystem is being reshaped by three simultaneous structural shifts.

First, the quality revolution: FineWeb-Edu, DCLM, and the Phi series have established that a well-curated 1.3 trillion token dataset can outperform a noisy 15 trillion token one. Data curation sophistication — not raw token count — is the binding constraint on model quality at the frontier.

Second, the legal reckoning over provenance: courts are establishing that fair use may protect AI training but not if source data was pirated. The 2025 Bartz v. Anthropic ruling — and a proposed $1.5 billion settlement reported as the largest in a U.S. copyright case — has made provenance a hard legal variable. Shadow libraries are no longer a defensible data source, concentrating advantage with organisations able to afford licensed content or generate high-quality synthetic data.

Third, the platformisation of access: Reddit charges $60M+/year, Twitter/X charges $42,000+/month for enterprise API access, Baidu blocks crawlers, and Shutterstock enforces takedowns. The era of freely available internet data for AI training is ending. Companies that already captured this data retain a structural incumbent advantage.

Whoever masters the combination of quality filtering, synthetic data generation, multimodal data exploitation, and inference-time reasoning will define the next generation of AI capabilities — regardless of how many tokens they can no longer freely scrape from the open web.

Open-source efforts — FineWeb, The Stack v2, Dolma, Common Pile, Cosmopedia — represent a deliberate attempt to maintain a viable data commons as commercial enclosure accelerates. Whether that commons remains competitive with proprietary pipelines is the defining data question of the next five years.

Last reviewed: May 2026 by Christopher Foster-McBride, Digital Human Assistants. This reference tracks a rapidly evolving field. Dataset availability, legal status, and scale figures change frequently. Check the blog for updates.

Common questions

Frequently Asked Questions

What is the most widely used LLM training dataset?

Common Crawl is the single most widely used source. It underlies 60–70% of typical LLM training mixtures either directly or through derivative datasets including C4, RefinedWeb, FineWeb, Dolma, and RedPajama. Every major frontier model — GPT-3, LLaMA, Falcon, BLOOM, Mistral, and DeepSeek — derives training data from Common Crawl. The dataset applies no filtering; downstream filtering pipelines determine usable quality.

When will LLM training data run out?

Epoch AI estimates the effective stock of quality-adjusted public text at approximately 300 trillion tokens (Epoch AI, 2024), with a 90% confidence interval of 100T–1,000T tokens. At current consumption rates — Qwen 3 alone used 36 trillion tokens — the median exhaustion year is 2028. However, this is not a sudden cliff: multi-epoch training (viable for 2–5× without significant degradation), improved quality filtering, synthetic data generation, and multimodal data (estimated 400T–20 quadrillion tokens of video and audio by 2030) all extend the effective data runway.

Is AI training on books legal?

The legal picture depends on provenance, not just use. The 2025 US ruling in Bartz v. Anthropic held that training on lawfully purchased books is 'spectacularly transformative' and qualifies as fair use, but maintaining a library of pirated books 'plainly displaced demand' and does not. Court documents in Kadrey v. Meta, and in cases involving NVIDIA and Anthropic, document corporate use of LibGen and Anna's Archive — both shadow libraries of pirated material. Project Gutenberg (70,000+ public domain books) is currently the only large-scale books dataset with unambiguous legal status.

What is benchmark contamination and how does it affect AI evaluation?

Benchmark contamination occurs when test questions from evaluation datasets (such as MMLU, GSM8K, or HumanEval) appear in LLM training data, inflating reported performance scores. Measured effects include MMLU inflation of up to 19% across model families (Yang et al., 2023), GSM8K inflation of up to 13%, and Phi-3 specifically showing a 6.7% performance reduction after decontamination (Phi-3 Technical Report, 2024) — these figures refer to different studies and model families. GPT-5.3 now scores 99% on GSM8K, rendering it non-discriminative for frontier model comparison. Detection methods include n-gram overlap, MinKProb log-probability analysis, and permutation testing, but definitive determination is impossible when training datasets are undisclosed.

How does the Chinese AI training data ecosystem differ from the Western one?

China's data ecosystem is structurally distinct in three ways. First, domestic platforms — Baidu, Zhihu, WeChat, Douyin, Weibo — constitute the primary Chinese-language data sources rather than Reddit, Twitter, or Wikipedia equivalents. Second, China's Interim Measures for the Management of Generative AI Services (August 2023) require training data to 'uphold Core Socialist Values' and use lawful sources, with approximately 350 LLMs filing with the Cyberspace Administration of China as of March 2025. Third, research shows Chinese-origin models exhibit higher refusal rates and more inaccurate answers on politically sensitive topics — characterised as censorship by design embedded through training data and RLHF alignment. Despite these constraints, Chinese frontier models such as Qwen 3 (36T tokens) and DeepSeek V3 (14.8T tokens) have reached data scales comparable to Western counterparts.

What is the difference between pretraining data and instruction tuning data?

Pretraining data — such as Common Crawl derivatives, books, scientific papers, and code — is used to train the base language model on the statistical patterns of language at scale, typically measured in trillions of tokens. Instruction tuning data — such as FLAN, OpenAssistant, Alpaca, and Anthropic HH-RLHF — is used after pretraining to align model behaviour with user intent through supervised fine-tuning and reinforcement learning from human feedback (RLHF). Instruction datasets are typically much smaller (thousands to millions of examples) but have outsized influence on how a model responds. The quality and legal provenance of both layers independently affect both capability and legal risk.

What is synthetic data and what are its risks for LLM training?

Synthetic data is text generated by existing language models rather than produced directly by humans, used to augment or replace organic training data. Microsoft's Phi series demonstrated its value: the 3.8B parameter Phi-3-mini trained primarily on synthetic textbooks matched the performance of Mixtral. The primary risk is model collapse — demonstrated by Shumailov et al. in Nature (2024) — whereby training recursively on model-generated text causes irreversible degradation as rare or nuanced information disappears from the distribution. Research by Gerstgrasser et al. (2024) found that mixing synthetic with organic data prevents collapse. Synthetic data works reliably for narrow, verifiable domains (mathematics, code) but is higher risk for general-purpose pretraining at scale.

About the Author

Christopher Foster-McBride is the Founder of AI Knowledge Signal and Digital Human Assistants. He works with organisations on structuring their knowledge so AI systems can accurately select, cite, and represent them in generated answers. He is the author of the AI Knowledge Signal Framework — a 6-phase methodology for AI visibility — and writes the weekly Signal newsletter on AI knowledge, GEO, and ASO.

Find out how AI systems represent you — then fix it.

The free AI Knowledge Signal Audit scores any public URL across five AI training readiness dimensions and returns a Corpus Survival Likelihood rating. The AI Knowledge Signal Framework — and the AI Knowledge Signal Chrome and Edge extension — give you the structure, audit, and re-score loop to fix what the audit finds.