Introducing Docmatix: A Game-Changer in Document Visual Question Answering
In the ever-evolving landscape of artificial intelligence and machine learning, robust datasets are paramount, especially for specialized tasks like Document Visual Question Answering (DocVQA). Today, we are excited to introduce Docmatix, a dataset that significantly outstrips previous offerings in scale. With 2.4 million images and 9.5 million question-answer pairs sourced from 1.3 million PDF documents, Docmatix represents a 240x increase in scale over previously available DocVQA datasets.
The Genesis of Docmatix
The inception of Docmatix emerged during the development of The Cauldron, a comprehensive collection of 50 datasets for fine-tuning Vision-Language Models (VLMs). While working on Idefics2, we identified a critical gap: the largest existing dataset for this task, DocVQA, contained only 10,000 images and 39,000 Q/A pairs, far too few for training advanced models. This realization catalyzed the creation of Docmatix.
Scale and Quality of the Dataset
Docmatix is a monumental leap forward for researchers and practitioners in the AI field. Starting from PDFA, an extensive OCR dataset of 2.1 million PDFs, we generated Q/A pairs with the Phi-3-small model. A rigorous filtering pass then discarded the roughly 15% of Q/A pairs identified as hallucinated or irrelevant, helping ensure that the remaining pairs are meaningful and reliable, which ultimately leads to better model performance.
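As an illustration of this kind of filtering, a minimal heuristic might drop Q/A pairs whose answers share too few words with the page's OCR text. The function name and the 0.5 threshold below are assumptions for the sketch; our actual filtering pipeline is more involved.

```python
def keep_pair(answer: str, ocr_text: str, min_overlap: float = 0.5) -> bool:
    """Keep a Q/A pair only if enough answer tokens appear in the OCR text.

    Hypothetical heuristic: an answer whose words are mostly absent from
    the page text is likely a hallucination.
    """
    answer_tokens = answer.lower().split()
    if not answer_tokens:
        return False
    page_tokens = set(ocr_text.lower().split())
    overlap = sum(t in page_tokens for t in answer_tokens) / len(answer_tokens)
    return overlap >= min_overlap

page = "Invoice #4521 Total due: 99.50 EUR Date: 2021-03-04"
print(keep_pair("99.50 EUR", page))       # grounded in the page text
print(keep_pair("the blue whale", page))  # likely hallucinated
```

A real pipeline would combine several such signals (and, in our case, model-based checks) rather than relying on token overlap alone.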
An example from the dataset
Evaluating Docmatix’s Performance
To evaluate the effectiveness of Docmatix, we conducted a series of ablation studies using the Florence-2 model. We trained two versions: one trained for several epochs on the DocVQA dataset alone, and another trained for just one epoch on Docmatix before being fine-tuned on DocVQA. The second version performed roughly 20% better, indicating that larger datasets can significantly enhance the capabilities of VLMs.
Performance Comparison
Here’s a comparative look at the performance metrics of models trained on different datasets:
| Dataset | ANLS on DocVQA | Model Size |
|---|---|---|
| Florence-2 fine-tuned on DocVQA | 60.1 | 700M |
| Florence-2 fine-tuned on Docmatix | 71.4 | 700M |
| Idefics2 | 74.0 | 8B |
The data illustrates that fine-tuning on Docmatix lets a 700M-parameter model approach the performance of Idefics2, a model more than ten times its size trained on a mixture of datasets.
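For context, ANLS (Average Normalized Levenshtein Similarity), the standard DocVQA metric, scores each prediction by its normalized edit distance to the closest ground-truth answer, clipping low similarities to zero. A minimal sketch:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def anls(prediction: str, truths: list, tau: float = 0.5) -> float:
    """Best normalized similarity against any ground-truth answer;
    similarities below the threshold tau count as 0 (tau = 0.5 follows
    the metric's usual definition)."""
    best = 0.0
    for t in truths:
        p, g = prediction.lower().strip(), t.lower().strip()
        sim = 1 - levenshtein(p, g) / max(len(p), len(g), 1)
        best = max(best, sim if sim >= tau else 0.0)
    return best

print(anls("99.50 EUR", ["99.50 eur"]))  # exact match after normalization: 1.0
```

The table's ANLS numbers are averages of this per-question score over the DocVQA evaluation set.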
Exploring the Dataset
For those interested in delving deeper into the contents of Docmatix, we have made it accessible for exploration. Users can engage with the dataset directly to see the types of documents and question-answer pairs it contains. This hands-on approach allows researchers to better understand how to leverage Docmatix for their specific needs.
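The snippet below sketches how one might iterate over Q/A pairs in a record. The `HuggingFaceM4/Docmatix` repository id and the `images`/`texts` record layout are assumptions for this sketch, and the inline dummy record stands in for a streamed example:

```python
# In practice one would stream the real dataset:
#   from datasets import load_dataset
#   ds = load_dataset("HuggingFaceM4/Docmatix", streaming=True, split="train")
#   example = next(iter(ds))

# Dummy record mirroring the assumed schema: a list of page images plus
# a list of Q/A turns.
example = {
    "images": ["<PIL.Image page 1>", "<PIL.Image page 2>"],
    "texts": [
        {"user": "What is the total amount due?", "assistant": "99.50 EUR"},
        {"user": "When was the invoice issued?", "assistant": "2021-03-04"},
    ],
}

def iter_qa(record):
    """Yield (question, answer) pairs from one Docmatix-style record."""
    for turn in record["texts"]:
        yield turn["user"], turn["assistant"]

for question, answer in iter_qa(example):
    print(f"Q: {question}\nA: {answer}")
```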
Processing Pipeline
For the creation of Docmatix, we processed every PDF document, rendering each page as an image at a resolution of 150 dpi. This process was resource-intensive, but it was essential for ensuring the dataset's accessibility and usability. The original PDFs can be traced back to the PDFA dataset, providing transparency and reliability—key attributes for any dataset used in research.
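Concretely, PDF page geometry is specified in points (1/72 inch), so rendering at 150 dpi scales each dimension by 150/72. A small sketch of the arithmetic, with one common library call for the actual conversion shown commented out (it requires poppler and a local file, and is only one possible tool for the job):

```python
def points_to_pixels(width_pts: float, height_pts: float, dpi: int = 150):
    """Convert a PDF page size in points (1/72 inch) to pixel dimensions."""
    scale = dpi / 72
    return round(width_pts * scale), round(height_pts * scale)

# A US-Letter page (612 x 792 points) rendered at 150 dpi:
print(points_to_pixels(612, 792))  # (1275, 1650)

# With the pdf2image library, the rendering itself would look like:
#   from pdf2image import convert_from_path
#   pages = convert_from_path("document.pdf", dpi=150)
```

At this resolution a typical page stays legible for OCR-heavy content while keeping storage for millions of images manageable.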
Processing pipeline to generate Docmatix
Insights from Prompt Analysis
During the dataset generation phase, we aimed to create approximately four Q/A pairs per page. This balance ensures diversity without excessive overlap. We also guided the Phi-3 model to generate questions based on specific document content, which minimized repetition. The result is a dataset rich in variety, offering a robust foundation for training effective VLMs.
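As an illustration, asking the model for a fixed number of pairs and parsing a structured reply might look like the sketch below. The prompt wording and the `Q:`/`A:` line format are assumptions for the sketch, not the exact prompts we used:

```python
def build_prompt(page_text: str, n_pairs: int = 4) -> str:
    """Ask the model for n question/answer pairs grounded in the page text."""
    return (
        f"Generate {n_pairs} diverse question/answer pairs that can be "
        f"answered using only the document below. Use the format "
        f"'Q: ...' and 'A: ...' on alternating lines.\n\n{page_text}"
    )

def parse_pairs(reply: str):
    """Parse alternating 'Q:'/'A:' lines into (question, answer) tuples."""
    pairs, question = [], None
    for line in reply.splitlines():
        line = line.strip()
        if line.startswith("Q:"):
            question = line[2:].strip()
        elif line.startswith("A:") and question is not None:
            pairs.append((question, line[2:].strip()))
            question = None
    return pairs

reply = "Q: What is the total?\nA: 99.50 EUR\nQ: Who issued it?\nA: Acme Corp"
print(parse_pairs(reply))
```

Grounding the prompt in the page text, as above, is what discourages the model from repeating generic questions across documents.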
Analysis of Docmatix per prompt
Conclusion
Docmatix represents a significant advancement in the field of Document Visual Question Answering. By offering a dataset that is larger, more diverse, and of higher quality than its predecessors, we hope to empower the open-source community to reach new heights in model development. With a 20% improvement in performance metrics, Docmatix is poised to bridge the gap between proprietary and open-source models, fostering innovation and collaboration in the AI field.
Useful Resources
We extend our gratitude to those who contributed to the reviews and thumbnails for this blog. For further exploration and insights, be sure to check the resources linked here and dive into the exciting world of Docmatix!