AutoMathText-V2

A 2.46 Trillion Token AI-Curated STEM Pretraining Dataset

Chao Li, Yifan Zhang, Yang Yuan, Andrew C. Yao

📊 AutoMathText-V2 consists of 2.46 trillion tokens of high-quality, deduplicated text spanning web content, mathematics, code, reasoning, and bilingual data. The dataset was curated using a three-tier deduplication pipeline and AI-powered quality assessment to provide superior training data for LLMs.

Our dataset combines 50+ premium data sources with advanced processing techniques including semantic deduplication, contamination detection, and intelligent text cleaning to deliver exceptional model performance.

What makes AutoMathText-V2 special?

  • 🔢 STEM Concentration: Specially optimized for STEM content, especially mathematics
  • 🔍 Triple Deduplication: Exact → Fuzzy (MinHash + LSH) → Semantic (GTE embeddings)
  • 🤖 AI Quality Assessment: Qwen2-based classifier with multi-source score fusion
  • 🧹 Advanced Text Cleaning: Robust, high-performance cleaning (Ultimate Data Cleaner v7.5.0.5) tailored for web-scraped and scientific data
  • 🛡️ Contamination Prevention: Automatic test-set leak detection and removal

Dataset Composition

Token Distribution by Domain

Domain | Token Count | Percentage | Description
🏆 Nemotron CC High | 1,468.3B | 59.7% | High-quality CommonCrawl data
🌐 DCLM | 314.2B | 12.8% | DCLM baseline web content
💻 RefineCode | 279.4B | 11.4% | GitHub repositories (Academic Use Only)
⭐ Nemotron CC Medium-High | 254.5B | 10.3% | Medium-high quality CommonCrawl data
📚 FineWeb Edu | 117.4B | 4.8% | Educational web content
🌏 Chinese | 112.18B | 4.6% | Chinese general content
🧠 Reasoning QA | 86.2B | 3.5% | Instruction-following and complex reasoning tasks
🔢 Math Web | 68.3B | 2.8% | Mathematics and scientific content
📊 MegaMath | 28.5B | 1.2% | Specialized mathematical collections
🔄 Translation | 1.61B | 0.1% | English-Chinese translation pairs
Total | 2,460.71B | 100% | Complete dataset

🔥 Complete Data Sources by Domain (52 Premium Datasets)

๐Ÿ† Nemotron CC High Domain

SourceHuggingFace DatasetDescription
Nemotron-CC (High)nvidia/nemotron-ccHigh-quality CommonCrawl subset

โญ Nemotron CC Medium-High Domain

SourceHuggingFace DatasetDescription
Nemotron-CC (Medium-High)nvidia/nemotron-ccMedium-high quality CommonCrawl subset

๐Ÿ“ DCLM Domain

SourceHuggingFace DatasetDescription
DCLM-BaselineDCLM/dclm-baseline-1.0High-quality web content from DCLM

📚 FineWeb Edu Domain

Source | HuggingFace Dataset | Description
FineWeb-Edu | HuggingFaceFW/fineweb-edu | Educational web content (0-5 quality scale)

🔢 Math Web Domain

Source | HuggingFace Dataset | Description
AutoMathText | math-ai/AutoMathText | Math/Code/ArXiv content with lm_q1q2_score
FineMath | HuggingFaceTB/finemath | High-quality mathematics content (0-5 scale)
Open-Web-Math-Pro | gair-prox/open-web-math-pro | Mathematical web pages
InfiMM-WebMath-40B | Infi-MM/InfiMM-WebMath-40B | Multimodal mathematical content

📊 MegaMath Domain

Source | HuggingFace Dataset | Description
MegaMath-QA | LLM360/MegaMath | Large-scale mathematical QA
MegaMath-Translated-Code | LLM360/MegaMath | Mathematical code translations
MegaMath-Text-Code-Block | LLM360/MegaMath | Mixed math text and code blocks

💻 RefineCode Domain

Source | HuggingFace Dataset | Description
RefineCode | m-a-p/RefineCode | GitHub repositories (Academic Use Only)

🧠 Reasoning QA Domain

Source | HuggingFace Dataset | Description
OPC-Annealing-Corpus | OpenCoder-LLM/opc-annealing-corpus | Code training corpus
OPC-SFT-Stage1 | OpenCoder-LLM/opc-sft-stage1 | Instruction following data (stage 1)
OPC-SFT-Stage2 | OpenCoder-LLM/opc-sft-stage2 | Instruction following data (stage 2)
Magpie-Reasoning-V2-250K-CoT-QwQ | Magpie-Align/Magpie-Reasoning-V2-250K-CoT-QwQ | Chain-of-thought reasoning (QwQ)
Magpie-Reasoning-V1-150K-CoT-QwQ | Magpie-Align/Magpie-Reasoning-V1-150K-CoT-QwQ | Chain-of-thought reasoning (QwQ)
Magpie-Reasoning-V1-150K-CoT-Deepseek-R1-Llama-70B | Magpie-Align/Magpie-Reasoning-V1-150K-CoT-Deepseek-R1-Llama-70B | Advanced reasoning (DeepSeek-R1)
Magpie-Reasoning-V2-250K-CoT-Deepseek-R1-Llama-70B | Magpie-Align/Magpie-Reasoning-V2-250K-CoT-Deepseek-R1-Llama-70B | Advanced reasoning (DeepSeek-R1)
General-Instruction-Augmented-Corpora | instruction-pretrain/general-instruction-augmented-corpora | General instruction synthesis
FT-Instruction-Synthesizer-Collection | instruction-pretrain/ft-instruction-synthesizer-collection | Fine-tuning instruction synthesis
Code-Feedback-Filtered-Instruction | m-a-p/CodeFeedback-Filtered-Instruction | Code QA with feedback
XCoder-80K | banksy235/XCoder-80K | Code instruction data
Orca-Math-Word-Problems-200K | microsoft/orca-math-word-problems-200k | Math word problems
Meta-Math-QA | meta-math/MetaMathQA | Mathematical QA dataset
Numina-Math-CoT | AI-MO/NuminaMath-CoT | Math chain-of-thought
Scale-Quest-Math | dyyyyyyyy/ScaleQuest-Math | Mathematical problem solving
Calc-Ape210K | MU-NLPC/Calc-ape210k | Chinese math problems
MathInstruct | TIGER-Lab/MathInstruct | Math instruction data
MathScaleQA-2M | fdqerq22ds/MathScaleQA-2M | Large-scale math QA
Gretel-Math-GSM8K-V1 | gretelai/gretel-math-gsm8k-v1 | GSM8K style problems
Open-Math-Instruct-2 | nvidia/OpenMathInstruct-2 | Open math instructions
Stack-Math-QA | math-ai/StackMathQA | Stack Exchange math QA
OpenR1-Math-220K | open-r1/OpenR1-Math-220k | Advanced math reasoning
Natural-Reasoning | facebook/natural_reasoning | Natural language reasoning
Math-Code-Instruct | MathLLMs/MathCodeInstruct | Math with code instructions
Math-Code-Instruct-Plus | MathLLMs/MathCodeInstruct-Plus | Enhanced math-code instructions
Open-Orca | Open-Orca/OpenOrca | General instruction following
SlimOrca-Deduped-Cleaned-Corrected | Open-Orca/slimorca-deduped-cleaned-corrected | Cleaned instruction data
Orca-AgentInstruct-1M-V1-Cleaned | mlabonne/orca-agentinstruct-1M-v1-cleaned | Agent instruction data
FOL-NLI | tasksource/FOL-nli | First-order logic reasoning
Infinity-Instruct | BAAI/Infinity-Instruct | Multi-domain instructions
Llama-Nemotron-Post-Training-Dataset-V1 | nvidia/Llama-Nemotron-Post-Training-Dataset-v1 | Post-training dataset
Codeforces-CoTs | open-r1/codeforces-cots | Competitive programming
Reasoning-V1-20M | glaiveai/reasoning-v1-20m | Large-scale reasoning data
Lean-STaR-Plus | ScalableMath/Lean-STaR-plus | Lean formal proofs (enhanced)
Lean-STaR-Base | ScalableMath/Lean-STaR-base | Lean formal proofs (base)
Lean-CoT-Plus | ScalableMath/Lean-CoT-plus | Lean chain-of-thought (enhanced)
Lean-CoT-Base | ScalableMath/Lean-CoT-base | Lean chain-of-thought (base)
Lean-Github | internlm/Lean-Github | Lean repository code
Lean-Workbook | internlm/Lean-Workbook | Lean problem workbook
DeepSeek-Prover-V1 | deepseek-ai/DeepSeek-Prover-V1 | Formal proof verification

๐ŸŒ FineWeb Edu Chinese Domain

SourceHuggingFace DatasetDescription
FineWeb-Edu-Chineseopencsg/Fineweb-Edu-Chinese-V2.1Chinese educational content (3.4-5.0 scale)

🔄 Translation Domain

Source | HuggingFace Dataset | Description
UN-PC | Helsinki-NLP/un_pc | English-Chinese translation pairs
UN-PC-Reverse | Helsinki-NLP/un_pc | Chinese-English translation pairs

Processing Pipeline

1. Data Extraction & Standardization

Each record is standardized to a unified schema; the tokens field is the token count under the Qwen2.5 tokenizer:

{
    "domain_prefix": "lbty.org",
    "id": "117b6a7d-5126-41fe-9bc2-d276e98632e6",
    "meta": "{\"domain\": \"dclm\", \"ori_score\": 0.043276190757751465, \"source\": \"dclm_baseline\"}",
    "text": "Sabine Expedition\n\nThe Sabine Expedition was an expedition approved by the United States Congress in 1806...",
    "tokens": 145,
    "url": "https://lbty.org/american-indian-battles/sabine-expedition/",
    "score": 0.19072403013706207
}

2. Three-Tier Deduplication

  • 🎯 Exact Deduplication
    • SHA256 content hashing
    • Priority-based duplicate resolution
    • Result: ~30% exact duplicates removed
  • 🔄 Fuzzy Deduplication (a minimal sketch follows this list)
    • MinHash Locality Sensitive Hashing (LSH)
    • Jaccard similarity threshold: 0.9
    • Connected components clustering
    • Result: ~20% near-duplicates removed
  • 🧠 Semantic Deduplication
    • Alibaba-NLP/gte-multilingual-base embeddings
    • K-means clustering (k=100,000)
    • Cosine similarity threshold: 0.007
    • Result: ~10% semantic duplicates removed
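
To illustrate the fuzzy tier, the sketch below finds near-duplicate candidates with MinHash + LSH at a Jaccard threshold of 0.9. It uses the datasketch library and word-level tokens as simplifying assumptions; the production pipeline additionally clusters matches via connected components.

from datasketch import MinHash, MinHashLSH

def minhash_signature(text, num_perm=128):
    """Build a MinHash signature over whitespace tokens (illustrative)."""
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode("utf-8"))
    return m

docs = {
    "doc_a": "the sabine expedition was approved by the united states congress in 1806",
    "doc_b": "the sabine expedition was approved by the united states congress in 1806 .",
    "doc_c": "lean is an interactive theorem prover and programming language",
}

# Index all documents, then query for near-duplicate candidates at Jaccard ~0.9
lsh = MinHashLSH(threshold=0.9, num_perm=128)
signatures = {doc_id: minhash_signature(text) for doc_id, text in docs.items()}
for doc_id, sig in signatures.items():
    lsh.insert(doc_id, sig)

print(lsh.query(signatures["doc_a"]))  # candidate matches for doc_a (itself and near-duplicates)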

3. 🤖 AI Quality Assessment

Qwen2-Based Classifier Architecture (see the sketch after this list):

  • Fine-tuned regression head for quality scoring
  • Multi-source score normalization and fusion
  • MSE loss with sigmoid activation
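
The description above corresponds to a standard encoder-plus-regression-head setup. The following is a minimal sketch, assuming a small Qwen2 checkpoint and mean pooling; the actual classifier architecture and score-fusion procedure are documented in the technical deep dive.

import torch
import torch.nn as nn
from transformers import AutoModel

class QualityScorer(nn.Module):
    """Quality-regression head on top of a Qwen2 encoder (illustrative sketch)."""
    def __init__(self, base_model="Qwen/Qwen2-0.5B"):  # base checkpoint is an assumption
        super().__init__()
        self.encoder = AutoModel.from_pretrained(base_model)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        # Mean-pool over non-padding tokens
        mask = attention_mask.unsqueeze(-1).to(hidden.dtype)
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
        return torch.sigmoid(self.head(pooled)).squeeze(-1)  # quality score in [0, 1]

# Trained with MSE against normalized, fused source scores
loss_fn = nn.MSELoss()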

4. 🧹 Advanced Text Cleaning

All text data was processed using Ultimate Data Cleaner v7.5.0.5, which provides robust, high-performance cleaning tailored for web-scraped and scientific data.

Key Features Used:

  • Advanced LaTeX & Code Protection: Protects complex nested LaTeX environments (\begin{}...\end{}), inline math ($...$), commands, and markdown code fences.
  • Quality Heuristics: Removes corrupted samples with excessive repetition, severe bracket imbalances, etc. (a small illustrative filter follows this list).
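
A minimal sketch of the kind of quality heuristics applied at this stage; the thresholds and the specific repetition/bracket checks below are illustrative assumptions, not the cleaner's actual parameters.

def passes_quality_heuristics(text, max_dup_line_frac=0.3, max_bracket_imbalance=10):
    """Reject samples with excessive repetition or severe bracket imbalance (illustrative)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return False
    # Excessive repetition: a large fraction of duplicated lines
    duplicate_fraction = 1 - len(set(lines)) / len(lines)
    if duplicate_fraction > max_dup_line_frac:
        return False
    # Severe bracket imbalance, e.g. from truncated LaTeX or code
    for opener, closer in (("(", ")"), ("[", "]"), ("{", "}")):
        if abs(text.count(opener) - text.count(closer)) > max_bracket_imbalance:
            return False
    return True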

5. ๐Ÿ›ก๏ธ Contamination Detection

Test Set Protection:

  • Math dataset test questions
  • GSM8K evaluation problems
  • Exact string matching with preprocessing
  • Automatic filtering during data extraction
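
The sketch below illustrates the exact-matching idea, assuming simple lowercase/whitespace normalization and the public GSM8K test split as the benchmark source; the production pipeline covers additional evaluation sets.

import re
from datasets import load_dataset

def normalize(text):
    # Lowercase and collapse whitespace before exact matching
    return re.sub(r"\s+", " ", text.lower()).strip()

# Build a lookup of normalized benchmark questions (GSM8K test split as an example)
gsm8k_test = load_dataset("openai/gsm8k", "main", split="test")
benchmark_questions = {normalize(row["question"]) for row in gsm8k_test}

def is_contaminated(sample_text):
    """Flag a training sample that contains a benchmark question verbatim."""
    normalized = normalize(sample_text)
    return any(question in normalized for question in benchmark_questions)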

How to Use

Loading with Datasets

from datasets import load_dataset

# Load full dataset
dataset = load_dataset("OpenSQZ/AutoMathText-V2", streaming=True)

# Load specific domain
math_data = load_dataset("OpenSQZ/AutoMathText-V2", name="math_web", streaming=True)
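
Once loaded in streaming mode, records can be iterated lazily. The snippet below peeks at a single sample, assuming the standard "train" split and the unified schema shown in the processing section.

# Peek at one streamed record (assumes the standard "train" split)
sample = next(iter(dataset["train"]))
print(sample["url"], sample["score"], sample["tokens"])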

💻 RefineCode Content Download

Important: For the RefineCode domain, only metadata is included in the dataset. The actual code content was removed to reduce storage requirements. To access the full code content, use the blob_id field from the metadata to download from AWS S3:

import os
import json
import boto3
from smart_open import open
from datasets import load_dataset

# Setup AWS credentials
session = boto3.Session(
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"]
)
s3 = session.client("s3")

def download_code_content(blob_id, src_encoding):
    """Download code content from AWS S3 using blob_id"""
    s3_url = f"s3://softwareheritage/content/{blob_id}"
    
    try:
        with open(s3_url, "rb", compression=".gz", transport_params={"client": s3}) as fin:
            content = fin.read().decode(src_encoding)
        return {"content": content}
    except Exception as e:
        return {"content": None, "error": str(e)}

# Load RefineCode domain (assuming the standard "train" split)
refinecode_data = load_dataset("OpenSQZ/AutoMathText-V2", name="refinecode", split="train", streaming=True)

# Process each sample to download content
for sample in refinecode_data:
    # Parse metadata to extract blob_id and encoding
    meta = json.loads(sample["meta"])
    blob_id = meta.get("blob_id")
    src_encoding = meta.get("src_encoding", "utf-8")
    
    if blob_id:
        # Download the actual code content
        code_data = download_code_content(blob_id, src_encoding)
        
        # Combine metadata with downloaded content
        full_sample = {
            **sample,
            "code_content": code_data["content"]
        }
        
        print(f"Downloaded content for {sample['id']}")
        print(f"Content length: {len(code_data['content']) if code_data['content'] else 0}")
        break

Requirements:

  • AWS credentials with access to Software Heritage S3 bucket
  • smart_open library: pip install smart_open[s3]
  • boto3 library: pip install boto3

Note: This download method is required only for the RefineCode domain. All other domains contain the full text content directly in the dataset.

Dataset Structure & Configurations

Directory Structure

The dataset is organized by domain with quality-based token splits:

AutoMathText-V2/
├── dclm/                 # DCLM baseline web content
│   ├── 0-10/             # Bottom 10% quality tokens (score-based)
│   ├── 10-20/            # 10-20% quality tokens
│   ├── 20-30/            # 20-30% quality tokens
│   ├── ...               # Additional percentile ranges
│   └── 90-100/           # Top 10% highest quality tokens
├── fineweb_edu/          # FineWeb educational content
│   ├── ...
│   └── 90-100/
├── fineweb_edu_chinese/  # Chinese educational content
│   ├── ...
│   └── 90-100/
├── math_web/             # Mathematics and scientific content
│   ├── ...
│   └── 90-100/
├── megamath/             # Specialized math collections
│   ├── ...
│   └── 90-100/
├── nemotron_cc_high/     # High quality Nemotron CommonCrawl
│   ├── ...
│   └── 90-100/
├── nemotron_cc_medium_high/ # Medium-high quality Nemotron CommonCrawl
│   ├── ...
│   └── 90-100/
├── reasoning_qa/         # Instruction and reasoning data
│   ├── ...
│   └── 90-100/
├── refinecode/           # GitHub code repositories (Academic Use Only)
│   ├── ...
│   └── 90-100/
└── translation/          # English-Chinese translation pairs
    ├── ...
    └── 90-100/

Quality-Based Token Distribution

Each domain is divided into 10 quality percentiles (0-10, 10-20, ..., 90-100) based on:

  • Token count: Equal number of tokens per percentile bucket
  • Quality scores: AI classifier scores from Qwen2-based quality assessment
  • Percentile ranking: Higher percentiles contain higher quality content (see the sketch after this list)
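
Concretely, documents within a domain are sorted by classifier score and cut into deciles that each hold roughly the same number of tokens. The sketch below is an illustrative reconstruction of that assignment, not the pipeline's actual code.

import numpy as np

def assign_quality_buckets(scores, token_counts, n_buckets=10):
    """Map each document to a decile (0 -> "0-10", ..., 9 -> "90-100") so that
    every decile holds roughly the same number of tokens, ordered by score."""
    order = np.argsort(scores)                               # lowest quality first
    cum_tokens = np.cumsum(np.asarray(token_counts)[order])
    token_fraction = cum_tokens / cum_tokens[-1]             # cumulative token share
    bucket_sorted = np.minimum((token_fraction * n_buckets).astype(int), n_buckets - 1)
    buckets = np.empty(len(scores), dtype=int)
    buckets[order] = bucket_sorted                           # map back to original order
    return buckets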

Available Configurations

  • Domain-specific configs: Load individual domains (dclm, fineweb_edu, math_web, reasoning_qa, etc.)
  • Quality-filtered configs: Load specific quality ranges (e.g., dclm/90-100 for top quality DCLM content; see the loading example after this list)
  • Nemotron variants: Choose between nemotron_cc_high and nemotron_cc_medium_high based on quality needs
  • Combined configs: Mix domains and quality levels based on training requirements
  • Custom sampling: Select percentile ranges across multiple domains for balanced training
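
As an example of a quality-filtered load, the snippet below pulls only the top-decile DCLM files by pointing data_dir at the corresponding directory. The path follows the layout shown above, and the "train" split name is an assumption.

from datasets import load_dataset

# Top 10% quality slice of the DCLM domain (path per the directory layout above)
top_dclm = load_dataset(
    "OpenSQZ/AutoMathText-V2",
    data_dir="dclm/90-100",
    split="train",
    streaming=True,
)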

Language Distribution

  • English: ~95% of content
  • Chinese: ~5% of content

Deep Dive & Contributing

🔬 Technical Deep Dive

For detailed technical documentation, including processing pipeline specifications, deduplication algorithm details, quality classifier training procedures, and contamination detection methodology, please refer to our Technical Documentation and GitHub Repository.

๐Ÿค Contributing

We welcome contributions to improve dataset quality and processing techniques:

  • ๐Ÿ› Bug Reports: Issues with data quality or processing
  • ๐Ÿ’ก Feature Requests: New data sources or processing improvements
  • ๐Ÿ“š Documentation: Help improve our guides and examples
  • ๐Ÿ”ฌ Research: Collaborate on quality assessment and deduplication methods

Licensing & Citation

Citation

@misc{automathtext_v2_2025,
  title={AutoMathText-V2: A 2.46 Trillion Token AI-Curated STEM Pretraining Dataset},
  author={Li, Chao and Zhang, Yifan and Yuan, Yang and Yao, Andrew C},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/datasets/OpenSQZ/AutoMathText-V2},
  note={A 2.46T token multi-domain dataset with fine-grained deduplication and AI-powered quality assessment.}
}

@inproceedings{zhang2025autonomous,
  title={Autonomous Data Selection with Zero-shot Generative Classifiers for Mathematical Texts},
  author={Zhang, Yifan and Luo, Yifan and Yuan, Yang and Yao, Andrew C},
  booktitle={Findings of the Association for Computational Linguistics: ACL 2025},
  year={2025}
}

License

Released under AutoMathText Data Agreement for Model Training (See LICENSE).