- ๐ข STEM Concentration: Specially optimized for STEM content (especially Math)
- ๐ Triple Deduplication: Exact โ Fuzzy (MinHash+LSH) โ Semantic (GTE embeddings)
- ๐ค AI Quality Assessment: Qwen2-based classifier with multi-source score fusion
- ๐งน Advanced Text Cleaning: All text data was processed using Ultimate Data Cleaner v7.5.0.5, which provides robust, high-performance cleaning tailored for web-scraped and scientific data.
- ๐ก๏ธ Contamination Prevention: Automatic test set leak detection and removal
๐ AutoMathText-V2 consists of 2.46 trillion tokens of high-quality, deduplicated text spanning web content, mathematics, code, reasoning, and bilingual data. This dataset was meticulously curated using a three-tier deduplication pipeline and AI-powered quality assessment to provide superior training data for LLMs.
Our dataset combines 50+ premium data sources with advanced processing techniques including semantic deduplication, contamination detection, and intelligent text cleaning to deliver exceptional model performance.
What makes AutoMathText-V2 special?
Dataset Composition
Token Distribution by Domain
Domain | Token Count | Percentage | Description |
---|---|---|---|
๐ Nemotron CC High | 1,468.3B | 59.7% | High quality CommonCrawl data |
๐ DCLM | 314.2B | 12.8% | DCLM baseline web content |
๐ป RefineCode | 279.4B | 11.4% | GitHub repositories (Academic Use Only) |
โญ Nemotron CC Medium-High | 254.5B | 10.3% | Medium-high quality CommonCrawl data |
๐ FineWeb Edu | 117.4B | 4.8% | Educational web content |
๐ Chinese | 112.18B | 4.6% | Chinese general content |
๐ง Reasoning QA | 86.2B | 3.5% | Instruction-following and complex reasoning tasks |
๐ข Math Web | 68.3B | 2.8% | Mathematics and scientific content |
๐ MegaMath | 28.5B | 1.2% | Specialized mathematical collections |
๐ Translation | 1.61B | 0.1% | English-Chinese translation pairs |
Total | 2,460.71B | 100% | Complete dataset |
๐ฅ Complete Data Sources by Domain (52 Premium Datasets)
๐ Nemotron CC High Domain
Source | HuggingFace Dataset | Description |
---|---|---|
Nemotron-CC (High) | nvidia/nemotron-cc | High-quality CommonCrawl subset |
โญ Nemotron CC Medium-High Domain
Source | HuggingFace Dataset | Description |
---|---|---|
Nemotron-CC (Medium-High) | nvidia/nemotron-cc | Medium-high quality CommonCrawl subset |
๐ DCLM Domain
Source | HuggingFace Dataset | Description |
---|---|---|
DCLM-Baseline | DCLM/dclm-baseline-1.0 | High-quality web content from DCLM |
๐ FineWeb Edu Domain
Source | HuggingFace Dataset | Description |
---|---|---|
FineWeb-Edu | HuggingFaceFW/fineweb-edu | Educational web content (0-5 quality scale) |
๐ข Math Web Domain
Source | HuggingFace Dataset | Description |
---|---|---|
AutoMathText | math-ai/AutoMathText | Math/Code/ArXiv content with lm_q1q2_score |
FineMath | HuggingFaceTB/finemath | High-quality mathematics content (0-5 scale) |
Open-Web-Math-Pro | gair-prox/open-web-math-pro | Mathematical web pages |
InfiMM-WebMath-40B | Infi-MM/InfiMM-WebMath-40B | Multimodal mathematical content |
๐ MegaMath Domain
Source | HuggingFace Dataset | Description |
---|---|---|
MegaMath-QA | LLM360/MegaMath | Large-scale mathematical QA |
MegaMath-Translated-Code | LLM360/MegaMath | Mathematical code translations |
MegaMath-Text-Code-Block | LLM360/MegaMath | Mixed math text and code blocks |
๐ป RefineCode Domain
Source | HuggingFace Dataset | Description |
---|---|---|
RefineCode | m-a-p/RefineCode | GitHub repositories (Academic Use Only) |
๐ง Reasoning QA Domain
Source | HuggingFace Dataset | Description |
---|---|---|
OPC-Annealing-Corpus | OpenCoder-LLM/opc-annealing-corpus | Code training corpus |
OPC-SFT-Stage1 | OpenCoder-LLM/opc-sft-stage1 | Instruction following data (stage 1) |
OPC-SFT-Stage2 | OpenCoder-LLM/opc-sft-stage2 | Instruction following data (stage 2) |
Magpie-Reasoning-V2-250K-CoT-QwQ | Magpie-Align/Magpie-Reasoning-V2-250K-CoT-QwQ | Chain-of-thought reasoning (QwQ) |
Magpie-Reasoning-V1-150K-CoT-QwQ | Magpie-Align/Magpie-Reasoning-V1-150K-CoT-QwQ | Chain-of-thought reasoning (QwQ) |
Magpie-Reasoning-V1-150K-CoT-Deepseek-R1-Llama-70B | Magpie-Align/Magpie-Reasoning-V1-150K-CoT-Deepseek-R1-Llama-70B | Advanced reasoning (DeepSeek-R1) |
Magpie-Reasoning-V2-250K-CoT-Deepseek-R1-Llama-70B | Magpie-Align/Magpie-Reasoning-V2-250K-CoT-Deepseek-R1-Llama-70B | Advanced reasoning (DeepSeek-R1) |
General-Instruction-Augmented-Corpora | instruction-pretrain/general-instruction-augmented-corpora | General instruction synthesis |
FT-Instruction-Synthesizer-Collection | instruction-pretrain/ft-instruction-synthesizer-collection | Fine-tuning instruction synthesis |
Code-Feedback-Filtered-Instruction | m-a-p/CodeFeedback-Filtered-Instruction | Code QA with feedback |
XCoder-80K | banksy235/XCoder-80K | Code instruction data |
Orca-Math-Word-Problems-200K | microsoft/orca-math-word-problems-200k | Math word problems |
Meta-Math-QA | meta-math/MetaMathQA | Mathematical QA dataset |
Numina-Math-CoT | AI-MO/NuminaMath-CoT | Math chain-of-thought |
Scale-Quest-Math | dyyyyyyyy/ScaleQuest-Math | Mathematical problem solving |
Calc-Ape210K | MU-NLPC/Calc-ape210k | Chinese math problems |
MathInstruct | TIGER-Lab/MathInstruct | Math instruction data |
MathScaleQA-2M | fdqerq22ds/MathScaleQA-2M | Large-scale math QA |
Gretel-Math-GSM8K-V1 | gretelai/gretel-math-gsm8k-v1 | GSM8K style problems |
Open-Math-Instruct-2 | nvidia/OpenMathInstruct-2 | Open math instructions |
Stack-Math-QA | math-ai/StackMathQA | Stack Exchange math QA |
OpenR1-Math-220K | open-r1/OpenR1-Math-220k | Advanced math reasoning |
Natural-Reasoning | facebook/natural_reasoning | Natural language reasoning |
Math-Code-Instruct | MathLLMs/MathCodeInstruct | Math with code instructions |
Math-Code-Instruct-Plus | MathLLMs/MathCodeInstruct-Plus | Enhanced math-code instructions |
Open-Orca | Open-Orca/OpenOrca | General instruction following |
SlimOrca-Deduped-Cleaned-Corrected | Open-Orca/slimorca-deduped-cleaned-corrected | Cleaned instruction data |
Orca-AgentInstruct-1M-V1-Cleaned | mlabonne/orca-agentinstruct-1M-v1-cleaned | Agent instruction data |
FOL-NLI | tasksource/FOL-nli | First-order logic reasoning |
Infinity-Instruct | BAAI/Infinity-Instruct | Multi-domain instructions |
Llama-Nemotron-Post-Training-Dataset-V1 | nvidia/Llama-Nemotron-Post-Training-Dataset-v1 | Post-training dataset |
Codeforces-CoTs | open-r1/codeforces-cots | Competitive programming |
Reasoning-V1-20M | glaiveai/reasoning-v1-20m | Large-scale reasoning data |
Lean-STaR-Plus | ScalableMath/Lean-STaR-plus | Lean formal proofs (enhanced) |
Lean-STaR-Base | ScalableMath/Lean-STaR-base | Lean formal proofs (base) |
Lean-CoT-Plus | ScalableMath/Lean-CoT-plus | Lean chain-of-thought (enhanced) |
Lean-CoT-Base | ScalableMath/Lean-CoT-base | Lean chain-of-thought (base) |
Lean-Github | internlm/Lean-Github | Lean repository code |
Lean-Workbook | internlm/Lean-Workbook | Lean problem workbook |
DeepSeek-Prover-V1 | deepseek-ai/DeepSeek-Prover-V1 | Formal proof verification |
๐ FineWeb Edu Chinese Domain
Source | HuggingFace Dataset | Description |
---|---|---|
FineWeb-Edu-Chinese | opencsg/Fineweb-Edu-Chinese-V2.1 | Chinese educational content (3.4-5.0 scale) |
๐ Translation Domain
Source | HuggingFace Dataset | Description |
---|---|---|
UN-PC | Helsinki-NLP/un_pc | English-Chinese translation pairs |
UN-PC-Reverse | Helsinki-NLP/un_pc | Chinese-English translation pairs |
Processing Pipeline
1. Data Extraction & Standardization
{
"domain_prefix": "lbty.org",
"id": "117b6a7d-5126-41fe-9bc2-d276e98632e6",
"meta": "{\"domain\": \"dclm\", \"ori_score\": 0.043276190757751465, \"source\": \"dclm_baseline\"}",
"text": "Sabine Expedition\n\nThe Sabine Expedition was an expedition approved by the United States Congress in 1806...",
"tokens": 145, # Token count using Qwen2.5 tokenizer
"url": "https://lbty.org/american-indian-battles/sabine-expedition/",
"score": 0.19072403013706207
}
2. Three-Tier Deduplication
- ๐ฏ Exact Deduplication
- SHA256 content hashing
- Priority-based duplicate resolution
- Result: ~30% exact duplicates removed
- ๐ Fuzzy Deduplication
- MinHash Locality Sensitive Hashing (LSH)
- Jaccard similarity threshold: 0.9
- Connected components clustering
- Result: ~20% near-duplicates removed
- ๐ง Semantic Deduplication
Alibaba-NLP/gte-multilingual-base
embeddings- K-means clustering (k=100,000)
- Cosine similarity threshold: 0.007
- Result: ~10% semantic duplicates removed
3. ๐ค AI Quality Assessment
Qwen2-Based Classifier Architecture:
- Fine-tuned regression head for quality scoring
- Multi-source score normalization and fusion
- MSE loss with sigmoid activation
4. ๐งน Advanced Text Cleaning
All text data was processed using Ultimate Data Cleaner v7.5.0.5, which provides robust, high-performance cleaning tailored for web-scraped and scientific data.
Key Features Used:
- Advanced LaTeX & Code Protection: protect complex nested LaTeX environments (
\begin{}...\end{}
), inline math ($...$
), commands, and markdown code fences. - Quality Heuristics: Removes corrupted samples with excessive repetition, severe bracket imbalances, etc.
5. ๐ก๏ธ Contamination Detection
Test Set Protection:
- Math dataset test questions
- GSM8K evaluation problems
- Exact string matching with preprocessing
- Automatic filtering during data extraction
How to Use
Loading with Datasets
from datasets import load_dataset
# Load full dataset
dataset = load_dataset("OpenSQZ/AutoMathText-V2", streaming=True)
# Load specific domain
math_data = load_dataset("OpenSQZ/AutoMathText-V2", name="math_web", streaming=True)
๐ป RefineCode Content Download
Important: For the RefineCode domain, only metadata is included in the dataset. The actual code content was removed to reduce storage requirements. To access the full code content, use the blob_id
field from the metadata to download from AWS S3:
import os
import json
import boto3
from smart_open import open
from datasets import load_dataset
# Setup AWS credentials
session = boto3.Session(
aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"]
)
s3 = session.client("s3")
def download_code_content(blob_id, src_encoding):
"""Download code content from AWS S3 using blob_id"""
s3_url = f"s3://softwareheritage/content/{blob_id}"
try:
with open(s3_url, "rb", compression=".gz", transport_params={"client": s3}) as fin:
content = fin.read().decode(src_encoding)
return {"content": content}
except Exception as e:
return {"content": None, "error": str(e)}
# Load RefineCode domain
refinecode_data = load_dataset("OpenSQZ/AutoMathText-V2", name="refinecode", streaming=True)
# Process each sample to download content
for sample in refinecode_data:
# Parse metadata to extract blob_id and encoding
meta = json.loads(sample["meta"])
blob_id = meta.get("blob_id")
src_encoding = meta.get("src_encoding", "utf-8")
if blob_id:
# Download the actual code content
code_data = download_code_content(blob_id, src_encoding)
# Combine metadata with downloaded content
full_sample = {
**sample,
"code_content": code_data["content"]
}
print(f"Downloaded content for {sample['id']}")
print(f"Content length: {len(code_data['content']) if code_data['content'] else 0}")
break
Requirements:
- AWS credentials with access to Software Heritage S3 bucket
smart_open
library:pip install smart_open[s3]
boto3
library:pip install boto3
Note: This download method is required only for the RefineCode domain. All other domains contain the full text content directly in the dataset.
Dataset Structure & Configurations
Directory Structure
The dataset is organized by domain with quality-based token splits:
AutoMathText-V2/
โโโ dclm/ # DCLM baseline web content
โ โโโ 0-10/ # Bottom 10% quality tokens (score-based)
โ โโโ 10-20/ # 10-20% quality tokens
โ โโโ 20-30/ # 20-30% quality tokens
โ โโโ ... # Additional percentile ranges
โ โโโ 90-100/ # Top 10% highest quality tokens
โโโ fineweb_edu/ # FineWeb educational content
โ โโโ ...
โ โโโ 90-100/
โโโ fineweb_edu_chinese/ # Chinese educational content
โ โโโ ...
โ โโโ 90-100/
โโโ math_web/ # Mathematics and scientific content
โ โโโ ...
โ โโโ 90-100/
โโโ megamath/ # Specialized math collections
โ โโโ ...
โ โโโ 90-100/
โโโ nemotron_cc_high/ # High quality Nemotron CommonCrawl
โ โโโ ...
โ โโโ 90-100/
โโโ nemotron_cc_medium_high/ # Medium-high quality Nemotron CommonCrawl
โ โโโ ...
โ โโโ 90-100/
โโโ reasoning_qa/ # Instruction and reasoning data
โ โโโ ...
โ โโโ 90-100/
โโโ refinecode/ # GitHub code repositories (Academic Use Only)
โ โโโ ...
โ โโโ 90-100/
โโโ translation/ # English-Chinese translation pairs
โโโ ...
โโโ 90-100/
Quality-Based Token Distribution
Each domain is divided into 10 quality percentiles (0-10, 10-20, ..., 90-100) based on:
- Token count: Equal number of tokens per percentile bucket
- Quality scores: AI classifier scores from Qwen2-based quality assessment
- Percentile ranking: Higher percentiles contain higher quality content
Available Configurations
- Domain-specific configs: Load individual domains (
dclm
,fineweb_edu
,math_web
,reasoning_qa
, etc.) - Quality-filtered configs: Load specific quality ranges (e.g.,
dclm/90-100
for top quality DCLM content) - Nemotron variants: Choose between
nemotron_cc_high
andnemotron_cc_medium_high
based on quality needs - Combined configs: Mix domains and quality levels based on training requirements
- Custom sampling: Select percentile ranges across multiple domains for balanced training
Language Distribution
- English: ~95% of content
- Chinese: ~5% of content
Deep Dive & Contributing
๐ฌ Technical Deep Dive
For detailed technical documentation, including processing pipeline specifications, deduplication algorithm details, quality classifier training procedures, and contamination detection methodology, please refer to our Technical Documentation and GitHub Repository.
๐ค Contributing
We welcome contributions to improve dataset quality and processing techniques:
- ๐ Bug Reports: Issues with data quality or processing
- ๐ก Feature Requests: New data sources or processing improvements
- ๐ Documentation: Help improve our guides and examples
- ๐ฌ Research: Collaborate on quality assessment and deduplication methods
Licensing & Citation
Citation
@misc{automathtext_v2_2025,
title={AutoMathText-V2: A 2.46 Trillion Token AI-Curated STEM Pretraining Dataset},
author={Li, Chao and Zhang, Yifan and Yuan, Yang and Yao, Andrew C},
year={2025},
publisher={Hugging Face},
url={https://huggingface.co/datasets/OpenSQZ/AutoMathText-V2},
note={A 2.46T token multi-domain dataset with fine-grained deduplication and AI-powered quality assessment.}
}
@article{zhang2025autonomous,
title={Autonomous Data Selection with Zero-shot Generative Classifiers for Mathematical Texts},
author={Zhang, Yifan and Luo, Yifan and Yuan, Yang and Yao, Andrew C},
journal={The 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025 Findings)},
year={2025}
}
License
Released under AutoMathText Data Agreement for Model Training (See LICENSE).