Unlocking India’s Linguistic Diversity: The Vaani Dataset

In a pioneering collaboration, the Indian Institute of Science (IISc) and ARTPARK have partnered with Hugging Face to make Vaani, India’s most diverse open-source, multi-modal, and multi-lingual dataset, easier to access. The partnership rests on a shared commitment to inclusive, accessible AI technologies that celebrate linguistic and cultural diversity across the nation.

Contents

Partnership Overview
About the Vaani Dataset
District-wise Language Distribution
Transcribed Subset
Utility of Vaani in the Age of LLMs
What’s Next
How You Can Contribute

Partnership Overview

The collaboration between Hugging Face and IISc/ARTPARK aims to enhance the usability of the Vaani dataset, paving the way for the development of AI systems that can better comprehend India’s rich linguistic tapestry. By making this dataset widely available, the organizations are encouraging developers to create solutions tailored to the digital needs of a diverse population.

About the Vaani Dataset

Launched in 2022 through a joint effort by IISc/ARTPARK and Google, Project Vaani is a groundbreaking initiative designed to curate an open-source multi-modal dataset that authentically mirrors India’s linguistic diversity. Unlike many datasets that focus primarily on mainstream languages, Vaani adopts a geo-centric approach, capturing dialects and languages spoken in remote and underserved regions.

Vaani targets more than 150,000 hours of speech and 15,000 hours of transcribed speech from 1 million speakers across all 773 districts of India. The dataset is being built in phases: Phase 1, covering 80 districts, has already been released, and the ongoing Phase 2 aims to add a further 100 districts.
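The targets quoted above imply a few ratios that are easy to check. The short Python sketch below derives them from the figures in this section; the variable names are illustrative, and the per-speaker average is only a rough implication of the stated targets, not a figure published by the project.

```python
# Figures quoted in the text above (project targets, not measured results).
TARGET_SPEECH_HOURS = 150_000
TARGET_TRANSCRIBED_HOURS = 15_000
TARGET_SPEAKERS = 1_000_000
TOTAL_DISTRICTS = 773
PHASE_1_DISTRICTS = 80
PHASE_2_ADDITIONAL_DISTRICTS = 100

# Transcribed hours relative to the speech-hours target.
transcribed_to_speech_ratio = TARGET_TRANSCRIBED_HOURS / TARGET_SPEECH_HOURS

# Average speech per speaker implied by the targets, in minutes.
minutes_per_speaker = TARGET_SPEECH_HOURS * 60 / TARGET_SPEAKERS

# District coverage after each phase, as a fraction of all districts.
phase_1_coverage = PHASE_1_DISTRICTS / TOTAL_DISTRICTS
phase_2_coverage = (
    PHASE_1_DISTRICTS + PHASE_2_ADDITIONAL_DISTRICTS
) / TOTAL_DISTRICTS

print(f"Transcribed-to-speech ratio: {transcribed_to_speech_ratio:.0%}")
print(f"Implied speech per speaker: {minutes_per_speaker:.0f} minutes")
print(f"District coverage after Phase 1: {phase_1_coverage:.1%}")
print(f"District coverage after Phase 2: {phase_2_coverage:.1%}")
```

Put plainly, the targets work out to about 9 minutes of speech per speaker and one transcribed hour for every ten hours of speech, while Phases 1 and 2 together reach roughly 23% of districts.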

Key Highlights of the Vaani Dataset, open sourced as of 15-02-2025

District-wise Language Distribution

The Vaani dataset reveals an impressive distribution of languages across India’s districts, showcasing the country’s linguistic richness at a granular level. This data is invaluable for researchers, AI developers, and language technology innovators aiming to construct speech models that cater to specific regions and dialects. For more detailed insights into district-wise language distribution, visit the Vaani Dataset on Hugging Face.
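To make the idea of a district-wise distribution concrete, here is a minimal Python sketch that tallies recorded hours per language within each district. The record layout (`district`, `language`, `duration_s`) and the sample values are hypothetical placeholders invented for illustration. They are not Vaani's actual schema or statistics, so the real field names should be taken from the dataset card.

```python
from collections import defaultdict

# Hypothetical records: placeholder names and durations, not real Vaani data.
records = [
    {"district": "district_a", "language": "language_x", "duration_s": 5400.0},
    {"district": "district_a", "language": "language_y", "duration_s": 1800.0},
    {"district": "district_b", "language": "language_x", "duration_s": 3600.0},
    {"district": "district_a", "language": "language_x", "duration_s": 1800.0},
]


def hours_by_district_and_language(rows):
    """Return {district: {language: hours}} summed over all rows."""
    totals = defaultdict(lambda: defaultdict(float))
    for row in rows:
        totals[row["district"]][row["language"]] += row["duration_s"] / 3600.0
    # Convert the nested defaultdicts to plain dicts for predictable printing.
    return {district: dict(langs) for district, langs in totals.items()}


distribution = hours_by_district_and_language(records)
for district, langs in sorted(distribution.items()):
    for language, hours in sorted(langs.items()):
        print(f"{district}\t{language}\t{hours:.2f} h")
```

A table of this shape is what lets a developer pick the districts with enough audio in a given language before training a region-specific model.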

Transcribed Subset

For those specifically interested in transcribed data, a subset of the main dataset has been made available. It consists of 790 hours of transcribed audio from approximately 700,000 speakers, covering 70,000 images. This resource includes smaller, segmented audio units paired with precise transcriptions, enabling tasks such as:

Speech Recognition: Training models to accurately convert spoken language into text.
Language Modeling: Developing refined language models that enhance understanding and interaction.
Segmentation Tasks: Identifying discrete speech units to improve transcription accuracy.

This additional dataset complements the main Vaani dataset, facilitating the development of end-to-end speech recognition systems and targeted AI solutions.

Utility of Vaani in the Age of LLMs

The Vaani dataset covers 54 languages, draws on diverse geographical regions, and represents a broad demographic spectrum. These properties support the development of inclusive AI models in areas such as:

Speech-to-Text and Text-to-Speech: Fine-tuning models for both LLM and non-LLM applications, including transcription tagging for code-switching (Indic and English).
Foundational Speech Models for Indic Languages: Supporting the development of robust foundational models that cater to the nuances of Indic languages.
Speaker Identification/Verification Models: Leveraging data from over 80,000 speakers to create reliable identification and verification systems.
Language Identification Models: Enabling the development of models tailored for diverse real-world applications.
Speech Enhancement Systems: Utilizing the dataset’s tagging system to build advanced speech enhancement technologies.
Enhancing Multimodal LLMs: The unique data collection method supports the development of multimodal capabilities in LLMs when integrated with other datasets.
Performance Benchmarking: Providing an ideal platform for benchmarking speech models due to its rich linguistic, geographical, and real-world data properties.

These AI models can power various Conversational AI applications, from educational tools to telemedicine platforms, healthcare solutions, voter helplines, media localization, and multilingual smart devices, making the Vaani dataset a transformative resource in real-world scenarios.
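For the benchmarking use listed above, speech recognition quality is commonly reported as word error rate (WER): the word-level edit distance between a reference transcript and a model's output, divided by the number of reference words. The article does not say which metrics the Vaani team uses, so the following is only a minimal sketch of that common metric. It assumes whitespace tokenization, which is a simplification for many Indic scripts, and the example sentences are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j]: edit distance between the first i-1 reference words
    # and the first j hypothesis words (row 0 before the loop starts).
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing from hypothesis
            insertion = current[j - 1] + 1  # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


# Made-up sentences, used only to exercise the function.
print(word_error_rate("the cat sat on the mat", "the cat sat on the mat"))
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Computing the score separately per district or per language, rather than as one global average, is one way to expose the regional gaps that a geo-centric dataset like Vaani is meant to reveal.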

What’s Next

The partnership between IISc/ARTPARK and Google has expanded into Phase 2, which will cover an additional 100 districts, ultimately bringing the Vaani dataset to all states in India. This expansion signifies a major milestone in making linguistic resources available for broader use.

The map highlights the districts across India where data has been collected as of Feb 5, 2025.

How You Can Contribute

The most impactful way to contribute to the Vaani project is to use the dataset. Whether you’re building AI applications, conducting research, or exploring new use cases, your work will help improve and expand this initiative.

We welcome feedback, insights, and collaboration opportunities. For inquiries or to share your experiences, please reach out via email at vaanicontact@gmail.com or fill out our feedback form.

Made with ❤️ for India’s linguistic diversity.
