Initial Assessment Of Language Models: Early Training Evaluation Techniques

Join us in building benchmarks that capture early-stage reasoning & scientific knowledge in LLMs!

The landscape of Large Language Models (LLMs) evolves rapidly, driven by ongoing research and experimentation. A fundamental part of developing LLMs is **ablation experiments**: systematic comparisons of model architectures, data mixtures, and training hyperparameters. These experiments matter most during the early stages of training, where researchers mainly track two signals: the training loss curve and evaluation scores. Unfortunately, most existing evaluation benchmarks are inadequate at this stage, where LLMs have been trained on only about 200 billion tokens. Without a meaningful signal from the benchmarks, it is difficult to draw actionable conclusions from ongoing experiments.

In this new competition, we invite contributors to build benchmarks that capture relevant signals during the early training phases of LLMs, focusing on the **scientific knowledge domain**. The initiative aims to strengthen early-stage evaluation research and to foster an active community of practitioners and researchers.

How to Participate

We are thrilled to host this competition on a dedicated **Hugging Face organization**, ensuring a seamless experience for participants. To get started, simply register for the competition through our registration link 👉 [Registration Link](https://e2lmc.github.io/registration). Participants are required to submit solutions based on the lm-evaluation-harness library via a Hugging Face Space.

A live leaderboard keeps the competition transparent and lets participants track promising submissions. The models are designed to be accessible and can be run by anyone on free-tier **Google Colab GPUs**. We also provide a **starting kit** with several notebooks that guide newcomers through the competition.

Evaluation Metrics

Each submission undergoes rigorous evaluation through three distinct scores: the **Signal Quality Score (Score_SQ)**, the **Ranking Consistency Score (Score_RC)**, and the **Compliance with Scientific Knowledge Score (Score_CS)**. These criteria come together to form a global score that determines the final rankings.

To ensure high standards, two validation procedures will be applied to all submissions: firstly, verifying alignment with established scientific knowledge, and secondly, detecting potential information leakage, especially regarding the presence of answers within the question prompts. The overall score is a weighted sum:

Score = α₁ × Score_SQ + α₂ × Score_RC + α₃ × Score_CS

Here, the weighting coefficients (α₁, α₂, and α₃) reflect the relative importance of each criterion. For this competition, the weights are α₁ = 0.5, α₂ = 0.1, and α₃ = 0.4. This places the greatest emphasis on signal quality and compliance with scientific knowledge, which we consider the critical metrics for evaluating submissions.
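As a quick illustration of the weighted sum above, here is a minimal Python sketch. The weights come from the text; the function name, the dictionary keys, and the example subscores are hypothetical, and the sketch assumes all three subscores are reported on a common scale.

```python
# Weights as stated in the text (0.5, 0.1, 0.4); the names are illustrative.
WEIGHTS = {"SQ": 0.5, "RC": 0.1, "CS": 0.4}


def global_score(score_sq: float, score_rc: float, score_cs: float) -> float:
    """Return the weighted sum of the three subscores."""
    return (
        WEIGHTS["SQ"] * score_sq
        + WEIGHTS["RC"] * score_rc
        + WEIGHTS["CS"] * score_cs
    )


# Hypothetical subscores, chosen only to show the arithmetic.
example = global_score(score_sq=0.8, score_rc=0.5, score_cs=1.0)
print(round(example, 3))
```

With these made-up inputs the result is 0.5 × 0.8 + 0.1 × 0.5 + 0.4 × 1.0 = 0.85. Because Score_RC carries only a tenth of the weight, a benchmark cannot make up for weak signal quality or poor scientific compliance through ranking consistency alone.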

Participants can compute the Signal Quality subscore locally using the provided model checkpoints from three Small Language Models of 0.5B, 1B, and 3B parameters (trained on token counts ranging from 0 to 200B). The other two subscores cannot be computed locally, because the relevant checkpoints remain hidden throughout the competition. The global score is nevertheless calculated automatically upon submission through the Hugging Face competition space, so participants can track their overall performance continuously.
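The exact definition of Score_SQ is not given here, so the following is not the official metric. It is only a rough sketch of one way to check locally whether a benchmark produces a usable signal across checkpoints: the Spearman rank correlation between tokens seen and benchmark score. All function names and all data points below are hypothetical.

```python
def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no ranking information
    return cov / (var_x * var_y) ** 0.5


# Hypothetical checkpoints (billions of tokens) and made-up accuracies.
tokens_seen = [0, 50, 100, 150, 200]
informative = [0.25, 0.28, 0.31, 0.33, 0.36]  # improves steadily
noisy = [0.25, 0.24, 0.26, 0.25, 0.24]        # hovers around chance

print(spearman(tokens_seen, informative))
print(spearman(tokens_seen, noisy))
```

A correlation close to 1 suggests that the benchmark score rises as training progresses, while a value near 0 suggests the scores are dominated by noise at this scale. The official subscore may weigh other properties as well, so treat this only as a sanity check before submitting.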

For detailed insights regarding each evaluation metric and comprehensive scoring results on state-of-the-art benchmarks, interested individuals can explore the competition proposal.
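The leakage validation mentioned in this section (answers appearing inside the question prompt) can be approximated before submitting. The sketch below is an assumption about what such a check might look like, not the organizers' procedure; the function names and the sample questions are hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Lowercase the text and keep only alphanumeric words, space-separated."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def answer_leaks(question: str, answer: str) -> bool:
    """Return True if the normalized answer occurs as a whole phrase in the question."""
    padded_question = f" {normalize(question)} "
    normalized_answer = normalize(answer)
    return bool(normalized_answer) and f" {normalized_answer} " in padded_question


# Hypothetical items, written only to exercise the check.
leaky = answer_leaks(
    "Which organelle, the mitochondrion, produces most of a cell's ATP?",
    "Mitochondrion",
)
clean = answer_leaks(
    "Which organelle produces most of a cell's ATP?",
    "Mitochondrion",
)
print(leaky, clean)
```

This only catches literal restatements of the answer. For multiple-choice items, apply it to the question stem rather than to the full prompt, since the answer options legitimately contain the correct answer. Paraphrased leaks would still need manual review.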

Competition Timeline

Competition kick-off	14 July 2025
Warm-up Phase	14 July 2025 – 17 August 2025 (5 weeks)
Development Phase	18 August 2025 – 26 October 2025 (10 weeks)
Final Phase	27 October 2025 – 03 November 2025 (1 week)
Results Announcement	04 November 2025
Winners’ Fact Sheets & Code Release Due	22 November 2025
NeurIPS Competition Workshop Presentation	6 or 7 December 2025

Prizes

🥇 1st Place: 6,000 USD
🥈 2nd Place: 4,000 USD
🥉 3rd Place: 2,000 USD
🎓 Student Awards: 2x 2,000 USD for the top 2 solutions submitted by participants who can provide proof of student status

Support and Contact

For any inquiries or support, feel free to reach out to our task coordinators at e2lmc@tii.ae. Additionally, participants can join our **Discord channel** to interact directly with our team for real-time assistance.
