Data Is Better Together: Empowering Open-Source Dataset Creation
Data Is Better Together (DIBT) is a collaboration between Hugging Face and Argilla that brings the open-source community together to create impactful datasets that help advance machine learning models. This article covers what the initiative has achieved so far, how the community has been involved, and the tools it offers for collaborative dataset creation.
Community Efforts
Community engagement is at the heart of the DIBT initiative. Our first project, the Prompt Ranking Project, aimed to compile a dataset of 10,000 prompts, both synthetic and human-generated, ranked by quality. The community responded quickly and in large numbers:
- Within days, over 385 individuals joined the initiative.
- We successfully launched the DIBT/10k_prompts_ranked dataset, which is tailored for prompt ranking tasks and synthetic data generation.
- This dataset has already been instrumental in developing new models, such as SPIN.
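To make the idea of ranking prompts by quality concrete, here is a minimal sketch of how several annotators' ratings could be combined into one ranking. The sample prompts, the 1-5 rating scale, and the field names `avg_rating` and `num_responses` are assumptions chosen for illustration; they are not taken from the project's actual pipeline or schema.

```python
from statistics import mean

# Hypothetical annotation records: (prompt, rating on an assumed 1-5 scale).
# This layout is for illustration only, not the real dataset schema.
annotations = [
    ("Explain gradient descent to a beginner.", 5),
    ("Explain gradient descent to a beginner.", 4),
    ("hi", 1),
    ("hi", 2),
    ("Write a haiku about open-source software.", 4),
    ("Write a haiku about open-source software.", 4),
]


def rank_prompts(records):
    """Group ratings by prompt and sort prompts by average rating, best first."""
    ratings = {}
    for prompt, rating in records:
        ratings.setdefault(prompt, []).append(rating)
    summary = [
        {"prompt": p, "avg_rating": mean(r), "num_responses": len(r)}
        for p, r in ratings.items()
    ]
    # Break ties by number of responses, then by prompt text, so the order is deterministic.
    return sorted(
        summary,
        key=lambda row: (-row["avg_rating"], -row["num_responses"], row["prompt"]),
    )


ranked = rank_prompts(annotations)
for row in ranked:
    print(f'{row["avg_rating"]:.1f}  ({row["num_responses"]} ratings)  {row["prompt"]}')
```

Averaging is only one possible way to aggregate ratings. A real project might also track how strongly annotators agree with each other before trusting a score.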
English-centric data alone is not enough. To address the lack of language-specific benchmarks for open Large Language Models (LLMs), we started the Multilingual Prompt Evaluation Project (MPEP). The goal of MPEP is to build a leaderboard for evaluating open LLMs across multiple languages.
The project has reached several milestones:
- A curated selection of 500 high-quality prompts from the DIBT/10k_prompts_ranked dataset was translated into various languages.
- More than 18 language leaders created Spaces for these translations.
- Translations are complete for Dutch, Russian, and Spanish, and work on further languages is ongoing.
We also established a community of dataset builders on Discord, which gives contributors a place to collaborate and share knowledge.
Cookbook Efforts
Beyond community involvement, the DIBT initiative is dedicated to equipping individuals with the resources needed to create high-quality datasets independently. This is encapsulated in our Cookbook Efforts, which provide guides and tools that empower users to build valuable datasets tailored to their unique needs.
Some key projects within the cookbook efforts include:
- Domain Specific Dataset: Designed to jumpstart the creation of domain-specific datasets, this project connects engineers with domain experts to enhance the relevance of the data produced.
- DPO/ORPO Dataset: Aimed at encouraging the community to produce more datasets in the style used for Direct Preference Optimization (DPO) and Odds Ratio Preference Optimization (ORPO), across a wider range of languages and domains.
- KTO Dataset: A resource to help the community build their own Kahneman-Tversky Optimization (KTO) datasets for a broader range of tasks.
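The difference between DPO-style and KTO-style data is easiest to see in the record shape. A DPO-style dataset commonly pairs one preferred and one rejected response per prompt, while a KTO-style dataset commonly stores single responses with a binary desirability label. The sketch below assumes the field names `prompt`, `chosen`, `rejected`, `completion`, and `label`, a convention used by some training libraries. These names, the helper `unpair_preferences`, and the sample rows are illustrative assumptions, not the cookbook's own format.

```python
# Hypothetical DPO-style rows: one preferred and one rejected answer per prompt.
dpo_rows = [
    {
        "prompt": "What is the capital of France?",
        "chosen": "The capital of France is Paris.",
        "rejected": "France is a country in Europe.",
    },
    {
        "prompt": "Translate 'good morning' into Spanish.",
        "chosen": "Buenos días.",
        "rejected": "Buenas noches.",
    },
]


def unpair_preferences(pairs):
    """Split each (prompt, chosen, rejected) row into two labeled single-response rows."""
    rows = []
    for pair in pairs:
        # The preferred answer becomes a desirable example ...
        rows.append({"prompt": pair["prompt"], "completion": pair["chosen"], "label": True})
        # ... and the rejected answer becomes an undesirable one.
        rows.append({"prompt": pair["prompt"], "completion": pair["rejected"], "label": False})
    return rows


kto_rows = unpair_preferences(dpo_rows)
for row in kto_rows:
    print(row["label"], "|", row["prompt"], "->", row["completion"])
```

Going the other way is not generally possible, because unpaired feedback does not say which two responses should be compared. This is one reason unpaired, single-response feedback can be simpler to collect.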
What Have We Learned?
Throughout the development of these initiatives, several key insights have emerged:
- Eagerness to Participate: The community’s response has demonstrated a strong desire to engage in collaborative efforts focused on dataset creation.
- Addressing Inequalities: Our work has highlighted existing disparities in the availability of comprehensive benchmarks. Certain languages, domains, and tasks remain underrepresented in the open-source community, necessitating targeted efforts to rectify these gaps.
- Tools for Collaboration: Many of the tools needed for effective collaboration already exist. The challenge now is to use them to build valuable datasets together.
How Can You Get Involved?
The DIBT initiative is open for continued participation and collaboration. If you’re interested in contributing to the cookbook efforts, here are several ways to get involved:
- Follow the Project Instructions: Each project has a README file with guidelines on how to contribute. This is your starting point for getting involved.
- Share Your Datasets: If you have created datasets or have results to share, please contribute them to the community.
- Provide New Guides and Tools: Your insights and expertise can help others in the community. Offering new guides or tools can significantly enhance the dataset-building process.
For those eager to join this collaborative effort, we invite you to participate in the #data-is-better-together channel on the Hugging Face Discord. This is a space where you can connect with like-minded individuals and share your ideas on what can be developed together.
The strength of the open-source community lies in its ability to collaborate and innovate. With your contributions, we can continue to build better datasets and drive the future of machine learning forward. Join us in this exciting journey of collective dataset creation!

