Explore the research paper MuRating: A High Quality Data Selecting Approach to Multilingual Large Language Model Pretraining, authored by Zhixun Chen and a team of 12 contributors.
Abstract: Data quality is a critical driver of large language model performance, yet existing model-based selection methods focus almost exclusively on English. We introduce MuRating, a scalable framework that transfers high-quality English data-quality signals into a single rater for 17 target languages. MuRating aggregates multiple English “raters” via pairwise comparisons to learn unified document-quality scores, then projects these judgments through translation to train a multilingual evaluator on monolingual, cross-lingual, and parallel text pairs. Applied to web data, MuRating selects balanced subsets of English and multilingual content to pretrain a 1.2B-parameter LLaMA model. Compared to strong baselines, including QuRater, AskLLM, and DCLM, our approach boosts average accuracy on both English benchmarks and multilingual evaluations, with especially large gains on knowledge-intensive tasks. We further analyze translation fidelity, selection biases, and underrepresentation of narrative material, outlining directions for future work.
Submission History
From: Zhixun Chen
[v1] Wed, 2 Jul 2025 15:11:12 UTC (2,780 KB)
[v2] Tue, 30 Dec 2025 08:00:04 UTC (2,772 KB)
[v3] Thu, 5 Mar 2026 07:04:12 UTC (2,772 KB)
—
### Understanding MuRating: A Leap Towards Multilingual Model Efficiency
The performance of large language models (LLMs) is driven to a large degree by the quality of the data used during pretraining. Yet traditional model-based selection methods have concentrated almost entirely on English datasets, leaving a noticeable gap for multilingual applications. MuRating addresses this challenge by carrying high-quality data-selection signals across a spectrum of languages.
### The Foundations of MuRating
MuRating is a framework for evaluating data quality beyond English alone. It transfers high-quality English data-quality signals into a single rater that operates across 17 target languages. This is a significant advance at a time when demand for multilingual capability in language models continues to grow.
### Methodology: Pairwise Comparisons and Quality Scores
At the core of MuRating is its method for aggregating multiple English “raters.” Using pairwise comparisons between documents, the framework learns a single set of unified document-quality scores. These quality judgments are then projected through translation and used to train a multilingual evaluator on monolingual, cross-lingual, and parallel text pairs, giving the evaluator coverage of diverse linguistic contexts.
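Learning a single scalar score per document from pairwise preferences can be sketched with a Bradley-Terry-style model, where the probability that document i beats document j is a logistic function of the score difference. The sketch below is a minimal illustration of that general technique, not the paper's implementation; the function name, hyperparameters, and toy data are all assumptions for demonstration.

```python
import math

def fit_bradley_terry(pairs, n_docs, epochs=500, lr=0.1):
    """Fit per-document quality scores from pairwise preferences.

    pairs: list of (winner, loser) document indices.
    Models P(winner beats loser) = sigmoid(s_w - s_l) and maximizes
    the log-likelihood by simple gradient ascent.
    """
    s = [0.0] * n_docs
    for _ in range(epochs):
        grad = [0.0] * n_docs
        for w, l in pairs:
            p = 1.0 / (1.0 + math.exp(-(s[w] - s[l])))
            grad[w] += 1.0 - p   # push the winner's score up
            grad[l] -= 1.0 - p   # push the loser's score down
        for i in range(n_docs):
            s[i] += lr * grad[i]
    mean = sum(s) / n_docs
    return [x - mean for x in s]  # center scores for comparability

# Toy preferences: doc 0 is usually preferred over 1, and 1 over 2.
pairs = [(0, 1)] * 8 + [(1, 2)] * 8 + [(0, 2)] * 8 + [(2, 0)]
scores = fit_bradley_terry(pairs, 3)
```

Because the model only depends on score differences, the fitted scores are centered to zero mean; any monotone rescaling would rank documents the same way.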
### Practical Applications: Data Selection for Pretraining
The practical payoff comes when MuRating is applied to web data. The framework selects balanced subsets of both English and multilingual content, which are then used to pretrain a 1.2-billion-parameter LLaMA model. This selection step matters because it shapes the model's ability to perform well across many languages and tasks, closing previously identified performance gaps.
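One simple way to realize "balanced subsets" from per-document quality scores is to keep the top fraction of documents within each language, so high-resource languages cannot crowd out low-resource ones. This is a hypothetical sketch of that idea; the function name and the `frac` knob are illustrative assumptions, not values from the paper.

```python
from collections import defaultdict

def select_balanced(scores, langs, frac=0.3):
    """Keep the top `frac` of documents per language by quality score.

    scores: quality score per document.
    langs:  language tag per document (same length as scores).
    Returns the sorted indices of the retained documents.
    """
    by_lang = defaultdict(list)
    for i, lang in enumerate(langs):
        by_lang[lang].append(i)
    keep = []
    for lang, idxs in by_lang.items():
        idxs.sort(key=lambda i: scores[i], reverse=True)
        k = max(1, int(len(idxs) * frac))  # always keep at least one doc
        keep.extend(idxs[:k])
    return sorted(keep)

# Toy corpus: three English and three German documents.
scores = [0.9, 0.1, 0.5, 0.2, 0.8, 0.4]
langs = ["en", "en", "en", "de", "de", "de"]
selected = select_balanced(scores, langs, frac=0.34)
```

Selecting per language rather than globally is one design choice; another is to set a per-language quota proportional to a target mixture, which the same structure accommodates by varying `k` per language.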
### Comparative Performance: Advancements Over Existing Models
Evaluated against established baselines such as QuRater, AskLLM, and DCLM, MuRating delivers a marked improvement in average accuracy. It not only excels on English benchmarks but also significantly improves multilingual evaluations, with especially large gains on knowledge-intensive tasks. These results underscore the framework's potential to raise the standard of multilingual natural language understanding.
### Insights and Future Directions
Through the lens of MuRating, further analyses reveal critical aspects of translation fidelity, selection biases, and the underrepresentation of narrative materials across languages. These findings illuminate pathways for future research, providing a granular understanding of how to improve multilingual models and mitigate biases within training datasets.
### The Authors Behind MuRating
This groundbreaking research is a collaborative effort involving esteemed researchers such as Zhixun Chen, Ping Guo, and Yifan Zhang, among others. Collectively, they bring a wealth of knowledge and expertise to the project, paving the way for advancements in multilingual language processing.
In this era of technological globalization, frameworks like MuRating not only enhance the efficiency and quality of multilingual data processing but also foster a more inclusive approach to AI. The future of language processing technology is indeed multilingual, and MuRating represents a substantial step forward in realizing this vision.

