FinePDFs Release: Massive PDF Dataset For AI Research


Hugging Face has released FinePDFs, the largest public dataset of PDF documents, built to support research in artificial intelligence (AI). The release includes 475 million documents in 1,733 languages, totaling 3 trillion tokens and 3.65 terabytes. Unlike web-based datasets, FinePDFs draws on the high-quality, domain-specific content found in PDFs, giving researchers a source of training data beyond HTML pages.

The dataset's diverse, high-quality content makes it a useful resource for researchers and developers working to improve AI models. Models trained on FinePDFs can learn to better understand and process the kind of text found in PDF files. That makes the dataset relevant to natural language processing, machine learning, and any other AI project involving documents.

FinePDFs Release: What Are The Key Features?


The FinePDFs release addresses the difficulty of processing PDFs by combining text extraction with GPU-powered OCR. The dataset spans diverse fields such as law and academia. Here are its key specifications:


  • Size: 3.65 terabytes, 3 trillion tokens
  • Documents: 475 million across 1,733 languages
  • Top Languages: English (1.1 trillion tokens), Spanish, German, French, Russian, Japanese (over 100 billion tokens each)
  • Smaller Languages: 978 languages with over 1 million tokens
  • Processing: Docling for text extraction and RolmOCR for GPU-powered OCR, followed by deduplication and PII anonymization
  • License: Open Data Commons Attribution, free for research
  • Access: Available on the Hugging Face Hub, through the datasets library, and through the Datatrove library
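
As a rough illustration of the datasets route listed above, the following Python sketch streams a few records rather than downloading all 3.65 terabytes. The repository id "HuggingFaceFW/finepdfs", the config name "eng_Latn", and the "text" field are assumptions not stated in this article, so check the dataset card before relying on them. The function names are illustrative only.

```python
from itertools import islice
from typing import Iterable, Iterator


def take(records: Iterable[dict], n: int) -> Iterator[dict]:
    """Yield at most n records from any iterable without reading past them."""
    return islice(records, n)


def stream_finepdfs(config: str = "eng_Latn"):
    """Open one language subset in streaming mode (needs network access)."""
    # Imported lazily so take() can be used without the library installed.
    from datasets import load_dataset

    # ASSUMPTION: repository id and config name; verify on the dataset card.
    return load_dataset(
        "HuggingFaceFW/finepdfs",
        name=config,
        split="train",
        streaming=True,  # iterate lazily instead of downloading everything
    )


def preview(n: int = 3) -> None:
    """Print the first 200 characters of the first n documents."""
    for record in take(stream_finepdfs(), n):
        # ASSUMPTION: the extracted text is stored under a "text" key.
        print(str(record.get("text", ""))[:200])
```

Calling preview() would fetch only the first few records. Streaming is the practical choice here because a full local copy of the dataset is several terabytes.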

Hugging Face tested FinePDFs by training 1.67B-parameter models. The results show that it rivals SmolLM-3 Web, a top HTML dataset, and that combining the two datasets improves performance across benchmarks. This highlights the value of PDFs for long-context training, as they often contain longer texts than web pages.

The release also documents its pipeline transparently, covering each stage from OCR detection to deduplication. Community feedback on LinkedIn raised questions about evaluation metrics, and Hugging Face’s team responded by emphasizing probability-based reporting over single scores. Researchers have praised FinePDFs for advancing data transparency and AI model training.
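
The long-context point can be put in rough numbers using only the headline figures quoted in this article. The short Python sketch below is a back-of-the-envelope estimate, not a reported statistic: the inputs are rounded, so the outputs are approximate, and an average says nothing about how document lengths are distributed.

```python
# Headline figures as quoted in the article (all rounded).
TOTAL_TOKENS = 3e12      # 3 trillion tokens
TOTAL_DOCS = 475e6       # 475 million documents
ENGLISH_TOKENS = 1.1e12  # 1.1 trillion English tokens

# Mean document length implied by the totals.
avg_tokens_per_doc = TOTAL_TOKENS / TOTAL_DOCS

# Fraction of all tokens that are English.
english_share = ENGLISH_TOKENS / TOTAL_TOKENS

print(f"Average tokens per document: {avg_tokens_per_doc:,.0f}")
print(f"English share of tokens: {english_share:.1%}")
```

On these figures the average works out to roughly 6,300 tokens per document, with English accounting for a little over a third of all tokens.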
