Unlocking India’s Linguistic Diversity: The Vaani Dataset
The Indian Institute of Science (IISc) and ARTPARK have partnered with Hugging Face to make Vaani, India’s most diverse open-source, multi-modal, and multi-lingual dataset, more widely accessible. The partnership reflects a shared commitment to inclusive, accessible AI technologies that represent the country’s linguistic and cultural diversity.
Partnership Overview
The collaboration between Hugging Face and IISc/ARTPARK aims to make the Vaani dataset easier to use, so that developers can build AI systems that better understand India’s many languages and dialects. By making the dataset widely available, the organizations encourage developers to create solutions tailored to the digital needs of a diverse population.
About the Vaani Dataset
Launched in 2022 through a joint effort by IISc/ARTPARK and Google, Project Vaani is a groundbreaking initiative designed to curate an open-source multi-modal dataset that authentically mirrors India’s linguistic diversity. Unlike many datasets that focus primarily on mainstream languages, Vaani adopts a geo-centric approach, capturing dialects and languages spoken in remote and underserved regions.
Vaani targets more than 150,000 hours of speech and 15,000 hours of transcribed text data from 1 million people across all 773 districts of India. The dataset is being built in phases: Phase 1, covering 80 districts, has already been released, and the ongoing Phase 2 aims to add another 100 districts.
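To put these targets in proportion, the short Python sketch below works through the arithmetic using only the figures quoted above. The per-speaker average is a back-of-envelope number implied by the targets, not a description of how recordings are actually collected, and the variable and function names are illustrative.

```python
# Figures quoted in the text above.
TOTAL_DISTRICTS = 773
PHASE_1_DISTRICTS = 80
PHASE_2_DISTRICTS = 100          # additional districts targeted in Phase 2
TARGET_SPEECH_HOURS = 150_000
TARGET_TRANSCRIBED_HOURS = 15_000
TARGET_SPEAKERS = 1_000_000


def coverage_pct(districts: int, total: int = TOTAL_DISTRICTS) -> float:
    """Share of all districts covered, as a percentage rounded to one decimal."""
    return round(100 * districts / total, 1)


phase_1_pct = coverage_pct(PHASE_1_DISTRICTS)
after_phase_2_pct = coverage_pct(PHASE_1_DISTRICTS + PHASE_2_DISTRICTS)

# Ratio of the transcription target to the speech target, in percent.
transcribed_ratio_pct = 100 * TARGET_TRANSCRIBED_HOURS / TARGET_SPEECH_HOURS

# Average speech per person implied by the targets (an assumption-free
# division, but only an average; actual contributions will vary).
avg_minutes_per_speaker = TARGET_SPEECH_HOURS * 60 / TARGET_SPEAKERS

print(f"Phase 1 coverage: {phase_1_pct}% of districts")
print(f"Coverage after Phase 2: {after_phase_2_pct}% of districts")
print(f"Transcription target vs. speech target: {transcribed_ratio_pct}%")
print(f"Implied average per speaker: {avg_minutes_per_speaker} minutes")
```

In other words, Phase 1 covers roughly a tenth of the districts, and Phases 1 and 2 together would cover a little under a quarter of them.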
Key Highlights of the Vaani Dataset (open-sourced as of 15 February 2025)
District-wise Language Distribution
The Vaani dataset shows how languages are distributed across India’s districts, capturing the country’s linguistic richness at a granular level. This data is valuable for researchers, AI developers, and language technology builders who want to create speech models for specific regions and dialects. For detailed district-wise language distribution, visit the Vaani Dataset on Hugging Face.
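As a rough illustration of what a district-wise language breakdown looks like in practice, the sketch below totals recorded hours per language within each district. The records are made-up placeholders, and the field names (`district`, `language`, `duration_s`) are assumptions for illustration rather than Vaani’s actual schema; check the dataset card for the real column names.

```python
from collections import defaultdict

# Placeholder records. Field names and values are hypothetical and do not
# reflect Vaani's actual schema or statistics.
records = [
    {"district": "District A", "language": "Hindi", "duration_s": 5400.0},
    {"district": "District A", "language": "Bhojpuri", "duration_s": 1800.0},
    {"district": "District A", "language": "Hindi", "duration_s": 1800.0},
    {"district": "District B", "language": "Kannada", "duration_s": 3600.0},
]


def hours_by_district_language(rows):
    """Sum audio duration (in hours) per language within each district."""
    totals = defaultdict(lambda: defaultdict(float))
    for row in rows:
        totals[row["district"]][row["language"]] += row["duration_s"] / 3600
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {district: dict(langs) for district, langs in totals.items()}


distribution = hours_by_district_language(records)
for district, langs in distribution.items():
    print(district, langs)
```

The same grouping works on any iterable of per-recording metadata, so it could be applied to one district at a time if the full metadata does not fit in memory.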
Transcribed Subset
For those specifically interested in transcribed data, a subset of the main dataset has been made available, consisting of 790 hours of transcribed audio from approximately 700,000 speakers covering 70,000 images. This resource includes smaller, segmented audio units paired with precise transcriptions, enabling various tasks, such as:
- Speech Recognition: Training models to accurately convert spoken language into text.
- Language Modeling: Developing refined language models that enhance understanding and interaction.
- Segmentation Tasks: Identifying discrete speech units to improve transcription accuracy.
This additional dataset complements the main Vaani dataset, facilitating the development of end-to-end speech recognition systems and targeted AI solutions.
Utility of Vaani in the Age of LLMs
The Vaani dataset covers 54 languages, draws on diverse geographical regions, and spans a broad demographic spectrum. These properties support inclusive AI work in areas such as:
- Speech-to-Text and Text-to-Speech: Fine-tuning models for both LLM and non-LLM applications, including transcription tagging for code-switching (Indic and English).
- Foundational Speech Models for Indic Languages: Supporting the development of robust foundational models that cater to the nuances of Indic languages.
- Speaker Identification/Verification Models: Leveraging data from over 80,000 speakers to create reliable identification and verification systems.
- Language Identification Models: Enabling the development of models tailored for diverse real-world applications.
- Speech Enhancement Systems: Utilizing the dataset’s tagging system to build advanced speech enhancement technologies.
- Enhancing Multimodal LLMs: The unique data collection method supports the development of multimodal capabilities in LLMs when integrated with other datasets.
- Performance Benchmarking: Providing an ideal platform for benchmarking speech models due to its rich linguistic, geographical, and real-world data properties.
These AI models can power various Conversational AI applications, from educational tools to telemedicine platforms, healthcare solutions, voter helplines, media localization, and multilingual smart devices, making the Vaani dataset a transformative resource in real-world scenarios.
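The code-switching item in the list above can be made concrete with a simple heuristic: flag a transcript as code-switched when it mixes Latin letters with letters from another script. The sketch below is only an illustration under that assumption and is not Vaani’s tagging scheme. It will miss romanized Indic text and English words written in an Indic script.

```python
import unicodedata


def scripts_in(text: str) -> set[str]:
    """Collect script names, taken as the first word of each letter's Unicode name."""
    scripts = set()
    for ch in text:
        if ch.isalpha():  # skips spaces, digits, punctuation, combining marks
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def is_code_switched(text: str) -> bool:
    """Heuristic: Latin letters appear alongside at least one other script."""
    found = scripts_in(text)
    return "LATIN" in found and len(found) > 1


samples = ["मुझे coffee चाहिए", "hello world", "नमस्ते"]
for sample in samples:
    print(sample, "->", is_code_switched(sample))
```

A production tagger would more likely work at the word level and use a language-identification model, but a script check like this is a cheap first filter when exploring transcripts.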
What’s Next
The partnership between IISc/ARTPARK and Google has expanded into Phase 2, which will cover an additional 100 districts, ultimately bringing the Vaani dataset to all states in India. This expansion signifies a major milestone in making linguistic resources available for broader use.
[Map: districts across India where data has been collected as of Feb 5, 2025]
How You Can Contribute
The most impactful way to contribute to the Vaani project is to use the dataset. Whether you’re building AI applications, conducting research, or exploring new use cases, your work helps improve and expand the initiative.
We welcome feedback, insights, and collaboration opportunities. For inquiries or to share your experiences, please reach out via email at vaanicontact@gmail.com or fill out our feedback form.
Made with ❤️ for India’s linguistic diversity.