Optimizing Multilingual Large Language Model Pretraining: A High-Quality Data Selection Strategy

Submitted on: 2 Jul 2025 (v1), Last revised: 5 Mar 2026 (this version, v3)

Authors: Zhixun Chen, Ping Guo, Wenhan Han, Yifan Zhang, Binbin Liu, Haobin Lin, Fengze Liu, Yan Zhao, Bingni Zhang, Taifeng Wang, Yin Zheng, Trevor Cohn, Meng Fang

This article summarizes the research paper "MuRating: A High Quality Data Selecting Approach to Multilingual Large Language Model Pretraining" by Zhixun Chen and 12 co-authors.

Abstract: Data quality is a critical driver of large language model performance, yet existing model-based selection methods focus almost exclusively on English. We introduce MuRating, a scalable framework that transfers high-quality English data-quality signals into a single rater for 17 target languages. MuRating aggregates multiple English “raters” via pairwise comparisons to learn unified document-quality scores, then projects these judgments through translation to train a multilingual evaluator on monolingual, cross-lingual, and parallel text pairs. Applied to web data, MuRating selects balanced subsets of English and multilingual content to pretrain a 1.2 B-parameter LLaMA model. Compared to strong baselines, including QuRater, AskLLM, DCLM, and so on, our approach boosts average accuracy on both English benchmarks and multilingual evaluations, with especially large gains on knowledge-intensive tasks. We further analyze translation fidelity, selection biases, and underrepresentation of narrative material, outlining directions for future work.

Submission History

From: Zhixun Chen

[v1] Wed, 2 Jul 2025 15:11:12 UTC (2,780 KB)

[v2] Tue, 30 Dec 2025 08:00:04 UTC (2,772 KB)

[v3] Thu, 5 Mar 2026 07:04:12 UTC (2,772 KB)

—

### Understanding MuRating: A Leap Towards Multilingual Model Efficiency

Data quality is a major driver of large language model (LLM) performance, yet existing model-based data selection methods focus almost exclusively on English. That leaves a gap for multilingual pretraining, where comparable quality signals are scarce. MuRating is a framework designed to close this gap by extending quality-based data selection to many languages.

### The Foundations of MuRating

MuRating aims to rate data quality beyond English. It transfers English data-quality signals into a single rater that covers 17 target languages, so one model can score documents in any of them rather than requiring a separate rater per language.

### Methodology: Pairwise Comparisons and Quality Scores

MuRating works in two stages. First, it aggregates multiple English “raters” through pairwise comparisons to learn a unified document-quality score. Second, it projects these judgments through translation and uses them to train a multilingual evaluator on monolingual, cross-lingual, and parallel text pairs.
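The aggregation stage can be illustrated with a small sketch. The summary above does not specify how the pairwise comparisons are turned into scores, so the code below assumes a Bradley-Terry style model fitted by gradient ascent, which is one common way to do this and not necessarily the authors' procedure. The function names, the toy rater scores, and the hyperparameters (`lr`, `steps`, `l2`) are all hypothetical.

```python
import math
from itertools import combinations


def pairwise_preferences(rater_scores):
    """Turn per-rater scores into soft pairwise preferences.

    rater_scores: one inner list per rater, one score per document.
    Returns {(i, j): p}, where p is the fraction of raters that prefer
    document i over document j (ties count as half a vote).
    """
    n_docs = len(rater_scores[0])
    prefs = {}
    for i, j in combinations(range(n_docs), 2):
        votes = 0.0
        for scores in rater_scores:
            if scores[i] > scores[j]:
                votes += 1.0
            elif scores[i] == scores[j]:
                votes += 0.5
        prefs[(i, j)] = votes / len(rater_scores)
    return prefs


def fit_bradley_terry(prefs, n_docs, lr=0.1, steps=2000, l2=0.01):
    """Fit one scalar score per document so that sigmoid(s_i - s_j)
    approximates the observed preference p_ij (assumed model)."""
    s = [0.0] * n_docs
    for _ in range(steps):
        # Small L2 penalty keeps scores bounded when a pair is unanimous.
        grad = [-l2 * si for si in s]
        for (i, j), p in prefs.items():
            predicted = 1.0 / (1.0 + math.exp(-(s[i] - s[j])))
            grad[i] += p - predicted
            grad[j] -= p - predicted
        s = [si + lr * g for si, g in zip(s, grad)]
    return s


# Hypothetical toy data: three raters on different scales, four documents.
raters = [
    [0.9, 0.2, 0.6, 0.1],
    [4, 1, 5, 2],
    [70, 30, 60, 10],
]
prefs = pairwise_preferences(raters)
scores = fit_bradley_terry(prefs, n_docs=4)
ranking = sorted(range(4), key=lambda i: scores[i], reverse=True)
print(ranking)  # [0, 2, 1, 3]
```

Working with pairwise preferences rather than raw scores sidesteps the fact that the raters use incompatible scales: only the relative order within each rater matters, and disagreements between raters show up as preferences strictly between 0 and 1.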

### Practical Applications: Data Selection for Pretraining

Applied to web data, MuRating selects balanced subsets of English and multilingual content, and these subsets are used to pretrain a 1.2-billion-parameter LLaMA model. The selection step is where the learned quality scores pay off: only the documents the rater ranks highly enter the pretraining corpus.
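A minimal sketch of such a selection step is shown below. It assumes that "balanced" means keeping the same top fraction of documents within each language, which is an interpretation and not a detail confirmed by the summary. The document fields (`id`, `lang`, `score`), the function name, and the `keep_fraction` parameter are hypothetical.

```python
from collections import defaultdict


def select_top_fraction_per_language(docs, keep_fraction=0.1):
    """Keep the highest-scoring fraction of documents in each language.

    docs: iterable of dicts with "id", "lang", and "score" keys (assumed schema).
    Returns the selected documents, grouped by language in first-seen order.
    """
    by_lang = defaultdict(list)
    for doc in docs:
        by_lang[doc["lang"]].append(doc)

    selected = []
    for group in by_lang.values():
        ranked = sorted(group, key=lambda d: d["score"], reverse=True)
        # Always keep at least one document so small languages are not dropped.
        k = max(1, int(len(ranked) * keep_fraction))
        selected.extend(ranked[:k])
    return selected


# Hypothetical toy corpus with quality scores already assigned by a rater.
corpus = [
    {"id": "en-1", "lang": "en", "score": 0.91},
    {"id": "en-2", "lang": "en", "score": 0.15},
    {"id": "en-3", "lang": "en", "score": 0.64},
    {"id": "en-4", "lang": "en", "score": 0.40},
    {"id": "de-1", "lang": "de", "score": 0.33},
    {"id": "de-2", "lang": "de", "score": 0.72},
]
subset = select_top_fraction_per_language(corpus, keep_fraction=0.5)
print([d["id"] for d in subset])  # ['en-1', 'en-3', 'de-2']
```

Ranking within each language, rather than across the whole corpus, prevents a language whose scores happen to run lower from being crowded out by higher-scoring English text.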

### Comparative Performance: Gains Over Existing Selection Methods

Compared with strong data selection baselines, including QuRater, AskLLM, and DCLM, MuRating improves average accuracy on both English benchmarks and multilingual evaluations. The gains are especially large on knowledge-intensive tasks.

### Insights and Future Directions

The authors also analyze translation fidelity, selection biases, and the underrepresentation of narrative material in the selected data. They use these findings to outline directions for future work on multilingual data selection.

### The Authors Behind MuRating

The paper is a collaboration among 13 researchers, including Zhixun Chen, Ping Guo, and Yifan Zhang.

By extending quality-based data selection beyond English, frameworks like MuRating support a more inclusive approach to AI. The reported gains on both English and multilingual evaluations suggest that careful data selection is a practical lever for building better multilingual models.
