Data Creation and Verification: Building the ECLeKTic Dataset
Robust datasets are fundamental to advancing the capabilities of language models. ECLeKTic is a multilingual question-answering dataset curated systematically from Wikipedia articles that exist in only one language. This article describes the construction and verification steps that ensure the quality and relevance of the data.
Selecting Unique Articles from Wikipedia
The first step in building ECLeKTic was to select Wikipedia articles that exist in only one of twelve languages: English, French, German, Hebrew, Hindi, Indonesian, Italian, Japanese, Korean, Mandarin Chinese, Portuguese, and Spanish. This selection criterion is significant, as these articles tend to cover topics that resonate deeply with speakers of their respective languages. Despite their localized nature, the information contained within these articles can be of interest to a broader audience worldwide.
Wikipedia was chosen as the source because of its vastness and diversity of content. Since it is impossible to inspect the training data of every large language model (LLM), the presence of an article in a given language's Wikipedia serves as a proxy for whether models were likely exposed to that information in that language. Under this assumption, an article that exists in only one language contains knowledge that a model probably saw only in that language, so the model must transfer knowledge from the source language to the other eleven target languages to successfully tackle the question-answering (QA) tasks posed by ECLeKTic.
Analyzing Wikipedia Downloads
Constructing the ECLeKTic dataset started from an analysis of the July 2023 Wikipedia download. For each of the twelve languages, researchers selected 100 random articles that met specific criteria: they needed to contain at least 200 characters, have been viewed at least 100 times during the year, and, crucially, lack equivalent articles in any of the other eleven languages. The length and view-count thresholds ensure that the chosen articles have substantive content and an actual readership, while the last criterion ensures that they are unique to their language.
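The selection criteria above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the `Article` record, its field names, and the idea of deriving `other_langs` from interlanguage links are all assumptions made for the example.

```python
import random
from dataclasses import dataclass, field

# ISO-style codes for the twelve ECLeKTic languages (the codes are an assumption).
LANGUAGES = ["en", "fr", "de", "he", "hi", "id", "it", "ja", "ko", "zh", "pt", "es"]


@dataclass
class Article:
    """Hypothetical record for one Wikipedia article."""
    title: str
    lang: str
    text: str
    yearly_views: int
    # Languages in which an equivalent article exists (hypothetical field).
    other_langs: set = field(default_factory=set)


def is_candidate(article, min_chars=200, min_views=100):
    """Apply the three criteria: length, views, and no equivalent in the other eleven languages."""
    others = set(LANGUAGES) - {article.lang}
    has_equivalent = bool(article.other_langs & others)
    return (
        len(article.text) >= min_chars
        and article.yearly_views >= min_views
        and not has_equivalent
    )


def sample_candidates(articles, lang, k=100, seed=0):
    """Randomly pick up to k qualifying articles for one language."""
    pool = [a for a in articles if a.lang == lang and is_candidate(a)]
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(pool, min(k, len(pool)))
```

Note that in this sketch an equivalent article in a language outside the twelve does not disqualify a candidate, which matches the criterion as stated above.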
From these selected articles, the first ten sentences were extracted. This extraction serves as the basis for generating question-answer pairs, a critical component of the dataset. By focusing on shorter, concise segments of text, the team aimed to maintain clarity and relevance in the information conveyed.
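As a rough sketch of this extraction step, the function below keeps the first ten sentences of an article. The splitter is deliberately naive and is an assumption of this example: it breaks on a small set of sentence-ending marks (including the CJK full stop and the Devanagari danda) and will mis-split abbreviations and decimal numbers, so a real pipeline would likely use a language-aware sentence segmenter.

```python
import re

# Sentence-ending marks for this sketch: Latin, CJK, and Devanagari punctuation.
_TERMINATORS = ".!?。！？।"
_SENTENCE = re.compile(rf"[^{_TERMINATORS}]+[{_TERMINATORS}]*")


def first_sentences(text, n=10):
    """Return the first n sentences of text, using a naive punctuation-based split."""
    sentences = (m.group().strip() for m in _SENTENCE.finditer(text))
    return [s for s in sentences if s][:n]
```

Calling `first_sentences(article_text)` on an article with fewer than ten sentences simply returns all of them.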
Human Annotation for Quality Assurance
Quality control is vital in any dataset creation process, and ECLeKTic is no exception. Human annotators, each fluent in the relevant language, filtered and corrected the generated question-answer pairs. The annotators first ensured that each question could be answered in a closed book setting, meaning that the question should not refer explicitly to the surrounding context of the article. This step is essential for creating questions that stand alone and can be answered without access to the source text.
Furthermore, the annotators validated that the questions pertained specifically to topics that are salient for speakers of the language in question. This focus on localized knowledge helps to ensure that the dataset reflects the unique cultural and contextual nuances that may be lost in more generalized datasets. Questions that strayed from this focus, such as those relating to general knowledge in science or current events, were discarded.
The Process of Decontextualization
An intriguing aspect of the annotation process in ECLeKTic is the decontextualization step. Here, the annotators worked to ensure that each question contained all necessary information for it to be answerable when translated into other languages. For instance, a question in Hebrew about the “supreme court” was refined to specify “the Israeli supreme court,” thereby eliminating ambiguity. Similarly, named entities were clarified to enhance understanding across languages, ensuring that a reference to “Ambev” explicitly identified it as “the Brazilian brewing company, Ambev.”
Translation and Verification
Once the questions and answers were finalized, the next step involved automatic translation into the other eleven languages. This multilingual approach is crucial for the dataset’s objective, which aims to facilitate knowledge transfer across languages. To ensure accuracy and cultural relevance, another set of human annotators verified the translations, making modifications where necessary. During this stage, some examples were even discarded if they proved untranslatable, particularly questions that referred to the meaning of a word in the source language.
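The translation step can be pictured as expanding each finalized example into one row per target language. The sketch below assumes a hypothetical `translate(text, src, tgt)` callable standing in for whatever machine translation system is used, and the dictionary keys and `verified` flag are illustrative only.

```python
from typing import Callable, Dict, List

# ISO-style codes for the twelve ECLeKTic languages (the codes are an assumption).
LANGUAGES = ["en", "fr", "de", "he", "hi", "id", "it", "ja", "ko", "zh", "pt", "es"]


def expand_to_targets(
    example: Dict[str, str],
    translate: Callable[[str, str, str], str],
) -> List[Dict[str, object]]:
    """Translate one QA pair into every language except its source language."""
    src = example["source_lang"]
    rows = []
    for tgt in LANGUAGES:
        if tgt == src:
            continue  # the source-language original is not a translation
        rows.append({
            "source_lang": src,
            "target_lang": tgt,
            "question": translate(example["question"], src, tgt),
            "answer": translate(example["answer"], src, tgt),
            "verified": False,  # to be set once a human annotator checks the row
        })
    return rows
```

With a stub such as `lambda text, src, tgt: text`, each source example yields eleven rows, one per target language, each awaiting human verification.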
Finalizing the ECLeKTic Dataset
This process produced the final ECLeKTic dataset, which consists of 384 unique questions and a total of 4,224 translated examples (384 questions, each translated into eleven target languages). This dataset not only serves as a valuable resource for evaluating language models but also highlights the importance of combining automated processes with human expertise to create high-quality, contextually relevant data.
By focusing on unique, language-specific articles and employing rigorous verification methods, ECLeKTic stands as a testament to the potential of multilingual datasets in enhancing the capabilities of language models.

