Introducing Docmatix: A Game-Changer in Document Visual Question Answering
In the ever-evolving landscape of artificial intelligence and machine learning, robust datasets are paramount, especially for specialized tasks like Document Visual Question Answering (DocVQA). Today, we are excited to introduce Docmatix, a dataset that significantly outstrips previous offerings in scale. With 2.4 million images and 9.5 million question-answer pairs sourced from 1.3 million PDF documents, Docmatix represents a 240x increase in scale over previously available DocVQA datasets.
The Genesis of Docmatix
The inception of Docmatix emerged during the development of The Cauldron, a comprehensive collection of 50 datasets for fine-tuning Vision-Language Models (VLMs). While working on Idefics2, we identified a critical gap: the largest existing dataset for this task, DocVQA, contained only 10,000 images and 39,000 Q/A pairs, far too few for training advanced models. This realization catalyzed the creation of Docmatix.
Scale and Quality of the Dataset
Docmatix is a monumental leap forward for researchers and practitioners in the AI field. Starting from PDFA, an extensive OCR dataset of 2.1 million PDFs, we generated Q/A pairs with the Phi-3-small model. A rigorous filtering pass then discarded the roughly 15% of Q/A pairs identified as hallucinated or irrelevant, helping ensure that the remaining pairs are meaningful and reliable, which ultimately leads to better model performance.
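As an illustration of this kind of filtering, a minimal heuristic might drop Q/A pairs whose answers share too few words with the page's OCR text. The function name and the 0.5 threshold below are assumptions for the sketch; our actual filtering pipeline is more involved.

```python
def keep_pair(answer: str, ocr_text: str, min_overlap: float = 0.5) -> bool:
    """Keep a Q/A pair only if enough answer tokens appear in the OCR text.

    Hypothetical heuristic: an answer whose words are mostly absent from
    the page text is likely a hallucination.
    """
    answer_tokens = answer.lower().split()
    if not answer_tokens:
        return False
    page_tokens = set(ocr_text.lower().split())
    overlap = sum(t in page_tokens for t in answer_tokens) / len(answer_tokens)
    return overlap >= min_overlap

page = "Invoice #4521 Total due: 99.50 EUR Date: 2021-03-04"
print(keep_pair("99.50 EUR", page))       # grounded in the page text
print(keep_pair("the blue whale", page))  # likely hallucinated
```

A real pipeline would combine several such signals (and, in our case, model-based checks) rather than relying on token overlap alone.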
An example from the dataset
Evaluating Docmatix’s Performance
To evaluate the effectiveness of Docmatix, we conducted a series of ablation studies using the Florence-2 model. We trained two versions: one trained for several epochs on the DocVQA dataset alone, and another trained for just one epoch on Docmatix before being fine-tuned on DocVQA. The second version performed roughly 20% better, indicating that larger datasets can significantly enhance the capabilities of VLMs.
Performance Comparison
Here’s a comparative look at the performance metrics of models trained on different datasets:
| Dataset | ANLS on DocVQA | Model Size |
|---|---|---|
| Florence-2 fine-tuned on DocVQA | 60.1 | 700M |
| Florence-2 fine-tuned on Docmatix | 71.4 | 700M |
| Idefics2 | 74.0 | 8B |
The data illustrates that fine-tuning on Docmatix lets a 700M-parameter model approach the performance of Idefics2, a model more than ten times its size trained on a mixture of datasets.
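For context, ANLS (Average Normalized Levenshtein Similarity), the standard DocVQA metric, scores each prediction by its normalized edit distance to the closest ground-truth answer, clipping low similarities to zero. A minimal sketch:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def anls(prediction: str, truths: list, tau: float = 0.5) -> float:
    """Best normalized similarity against any ground-truth answer;
    similarities below the threshold tau count as 0 (tau = 0.5 follows
    the metric's usual definition)."""
    best = 0.0
    for t in truths:
        p, g = prediction.lower().strip(), t.lower().strip()
        sim = 1 - levenshtein(p, g) / max(len(p), len(g), 1)
        best = max(best, sim if sim >= tau else 0.0)
    return best

print(anls("99.50 EUR", ["99.50 eur"]))  # exact match after normalization: 1.0
```

The table's ANLS numbers are averages of this per-question score over the DocVQA evaluation set.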
Exploring the Dataset
For those interested in delving deeper into the contents of Docmatix, we have made it accessible for exploration. Users can engage with the dataset directly to see the types of documents and question-answer pairs it contains. This hands-on approach allows researchers to better understand how to leverage Docmatix for their specific needs.
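The snippet below sketches how one might iterate over Q/A pairs in a record. The `HuggingFaceM4/Docmatix` repository id and the `images`/`texts` record layout are assumptions for this sketch, and the inline dummy record stands in for a streamed example:

```python
# In practice one would stream the real dataset:
#   from datasets import load_dataset
#   ds = load_dataset("HuggingFaceM4/Docmatix", streaming=True, split="train")
#   example = next(iter(ds))

# Dummy record mirroring the assumed schema: a list of page images plus
# a list of Q/A turns.
example = {
    "images": ["<PIL.Image page 1>", "<PIL.Image page 2>"],
    "texts": [
        {"user": "What is the total amount due?", "assistant": "99.50 EUR"},
        {"user": "When was the invoice issued?", "assistant": "2021-03-04"},
    ],
}

def iter_qa(record):
    """Yield (question, answer) pairs from one Docmatix-style record."""
    for turn in record["texts"]:
        yield turn["user"], turn["assistant"]

for question, answer in iter_qa(example):
    print(f"Q: {question}\nA: {answer}")
```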
Processing Pipeline
For the creation of Docmatix, we processed every PDF document, rendering each page as an image at a resolution of 150 dpi. This process was resource-intensive, but it was essential for ensuring the dataset's accessibility and usability. The original PDFs can be traced back to the PDFA dataset, providing transparency and reliability—key attributes for any dataset used in research.
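Concretely, PDF page geometry is specified in points (1/72 inch), so rendering at 150 dpi scales each dimension by 150/72. A small sketch of the arithmetic, with one common library call for the actual conversion shown commented out (it requires poppler and a local file, and is only one possible tool for the job):

```python
def points_to_pixels(width_pts: float, height_pts: float, dpi: int = 150):
    """Convert a PDF page size in points (1/72 inch) to pixel dimensions."""
    scale = dpi / 72
    return round(width_pts * scale), round(height_pts * scale)

# A US-Letter page (612 x 792 points) rendered at 150 dpi:
print(points_to_pixels(612, 792))  # (1275, 1650)

# With the pdf2image library, the rendering itself would look like:
#   from pdf2image import convert_from_path
#   pages = convert_from_path("document.pdf", dpi=150)
```

At this resolution a typical page stays legible for OCR-heavy content while keeping storage for millions of images manageable.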
Processing pipeline to generate Docmatix
Insights from Prompt Analysis
During the dataset generation phase, we aimed to create approximately four Q/A pairs per page. This balance ensures diversity without excessive overlap. We also guided the Phi-3 model to generate questions based on specific document content, which minimized repetition. The result is a dataset rich in variety, offering a robust foundation for training effective VLMs.
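As an illustration, asking the model for a fixed number of pairs and parsing a structured reply might look like the sketch below. The prompt wording and the `Q:`/`A:` line format are assumptions for the sketch, not the exact prompts we used:

```python
def build_prompt(page_text: str, n_pairs: int = 4) -> str:
    """Ask the model for n question/answer pairs grounded in the page text."""
    return (
        f"Generate {n_pairs} diverse question/answer pairs that can be "
        f"answered using only the document below. Use the format "
        f"'Q: ...' and 'A: ...' on alternating lines.\n\n{page_text}"
    )

def parse_pairs(reply: str):
    """Parse alternating 'Q:'/'A:' lines into (question, answer) tuples."""
    pairs, question = [], None
    for line in reply.splitlines():
        line = line.strip()
        if line.startswith("Q:"):
            question = line[2:].strip()
        elif line.startswith("A:") and question is not None:
            pairs.append((question, line[2:].strip()))
            question = None
    return pairs

reply = "Q: What is the total?\nA: 99.50 EUR\nQ: Who issued it?\nA: Acme Corp"
print(parse_pairs(reply))
```

Grounding the prompt in the page text, as above, is what discourages the model from repeating generic questions across documents.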
Analysis of Docmatix per prompt
Conclusion
Docmatix represents a significant advancement in the field of Document Visual Question Answering. By offering a dataset that is larger, more diverse, and of higher quality than its predecessors, we hope to empower the open-source community to reach new heights in model development. With a 20% improvement in performance metrics, Docmatix is poised to bridge the gap between proprietary and open-source models, fostering innovation and collaboration in the AI field.
Useful Resources
We extend our gratitude to those who contributed to the reviews and thumbnails for this blog. For further exploration and insights, be sure to check the resources linked here and dive into the exciting world of Docmatix!