Join us in building benchmarks that capture early-stage reasoning & scientific knowledge in LLMs!
The landscape of Large Language Models (LLMs) evolves rapidly, driven by innovative research and experimentation. A fundamental part of developing LLMs is **ablation experiments**, a meticulous process in which model architectures, data mixtures, and training hyperparameters are evaluated systematically. This is especially important during the early stages of training, where researchers rely mainly on two signals: the training loss curve and evaluation scores. Unfortunately, many existing evaluation benchmarks are inadequate during these initial stages, where LLMs are trained on about 200 billion tokens. The lack of meaningful signal makes it difficult to draw actionable insights from ongoing experiments.
In our exciting new competition, we invite contributors to collaborate on building innovative benchmarks that capture relevant signals during the early training phases of LLMs, focusing particularly on the **scientific knowledge domain**. This initiative not only strengthens the research landscape but also fosters a vibrant community of practitioners and researchers.
How to Participate
We are thrilled to host this competition on a dedicated **Hugging Face organization**, ensuring a seamless experience for participants. To get started, simply register for the competition through our registration link 👉 [Registration Link](https://e2lmc.github.io/registration). Participants are required to submit solutions based on the lm-evaluation-harness library via a Hugging Face Space.
An active leaderboard keeps the competition transparent and lets participants track promising submissions. The models are designed to be accessible and can be run by anyone on free-tier **Google Colab GPUs**. Additionally, we provide a comprehensive **starting kit** that includes several notebooks to guide newcomers through the competition.
Evaluation Metrics
Each submission undergoes rigorous evaluation through three distinct scores: the **Signal Quality Score (ScoreSQ)**, the **Ranking Consistency Score (ScoreRC)**, and the **Compliance with Scientific Knowledge Score (ScoreCS)**. These criteria come together to form a global score that determines the final rankings.
To ensure high standards, two validation procedures will be applied to all submissions: firstly, verifying alignment with established scientific knowledge, and secondly, detecting potential information leakage, especially regarding the presence of answers within the question prompts. The overall score is a weighted sum:
Score = α1 × ScoreSQ + α2 × ScoreRC + α3 × ScoreCS
Here, the weighting coefficients (α1, α2, and α3) reflect the relative importance of each criterion. For this competition, we have set the weights as α1 = 0.5, α2 = 0.1, and α3 = 0.4. This structure places greater emphasis on signal quality and compliance with scientific knowledge, which we consider the most critical criteria for evaluating submissions.
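As a quick illustration of the weighted sum, here is a minimal Python sketch using the weights stated for this competition. The sub-score values and the assumption that each sub-score lies in [0, 1] are hypothetical and used only for illustration; the function name `global_score` is ours, not part of the competition tooling.

```python
# Weights as stated in the competition description.
WEIGHTS = {"SQ": 0.5, "RC": 0.1, "CS": 0.4}


def global_score(score_sq: float, score_rc: float, score_cs: float) -> float:
    """Weighted sum of the three sub-scores."""
    return (
        WEIGHTS["SQ"] * score_sq
        + WEIGHTS["RC"] * score_rc
        + WEIGHTS["CS"] * score_cs
    )


# Hypothetical sub-scores, assumed to lie in [0, 1] (illustration only).
example = global_score(score_sq=0.8, score_rc=0.5, score_cs=0.9)
print(f"{example:.2f}")  # 0.81
```

With these weights, a one-point gain in signal quality moves the global score five times as much as the same gain in ranking consistency.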
Participants can compute the Signal Quality subscore locally using the provided model checkpoints, which cover three Small Language Models of 0.5B, 1B, and 3B parameters (trained on token counts ranging from 0 to 200B). However, the other two subscores cannot be computed locally, as the relevant checkpoints will remain hidden throughout the competition. Nonetheless, the global score will be automatically calculated upon submission through the Hugging Face competition space, allowing continuous tracking of overall performance and encouraging innovative solutions.
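The official ScoreSQ definition is given in the competition proposal and is not reproduced here. As a rough local sanity check only, one simple proxy for "signal" is how monotonically a benchmark's score rises with the number of training tokens, measured with a Spearman rank correlation. The sketch below is an assumption on our part, not the competition metric; the checkpoint positions and accuracy values are hypothetical.

```python
def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on ranks."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0  # a constant series carries no ranking information
    return cov / (vx * vy) ** 0.5


# Hypothetical checkpoints (billions of tokens) and benchmark accuracies.
tokens_seen = [0, 50, 100, 150, 200]
accuracy = [0.25, 0.27, 0.26, 0.31, 0.34]
print(f"{spearman(tokens_seen, accuracy):.2f}")  # 0.90
```

A value near 1 suggests the benchmark score tracks training progress; a value near 0 suggests the scores are mostly noise at this scale.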
For detailed insights regarding each evaluation metric and comprehensive scoring results on state-of-the-art benchmarks, interested individuals can explore the competition proposal.
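Before submitting, participants may want to screen their own benchmark for the kind of information leakage mentioned above, where the answer already appears in the question prompt. The following is a minimal sketch of such a check, assuming each item has a question string and an answer string; it is not the organizers' validation procedure, and the helper names are hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and keep only alphanumeric tokens, joined by single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def answer_leaks(question: str, answer: str) -> bool:
    """True if the normalized answer occurs as a whole-token phrase in the question."""
    padded_question = f" {normalize(question)} "
    normalized_answer = normalize(answer)
    return bool(normalized_answer) and f" {normalized_answer} " in padded_question


# Hypothetical items, for illustration only.
clean = ("Which gas do plants absorb during photosynthesis?", "Carbon dioxide")
leaky = ("Carbon dioxide is absorbed by plants. Which gas do plants absorb?",
         "carbon dioxide")

print(answer_leaks(*clean))  # False
print(answer_leaks(*leaky))  # True
```

Exact phrase matching like this only catches verbatim leaks; paraphrased answers would need a fuzzier comparison.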
Competition Timeline
| Competition kick-off | 14 July 2025 |
| Warm-up Phase | 14 July 2025 – 17 August 2025 (5 weeks) |
| Development Phase | 18 August 2025 – 26 October 2025 (10 weeks) |
| Final Phase | 27 October 2025 – 03 November 2025 (1 week) |
| Results Announcement | 04 November 2025 |
| Winners’ Fact Sheets & Code Release Due | 22 November 2025 |
| NeurIPS Competition Workshop Presentation | 6 or 7 December 2025 |
Prizes
- 🥇 1st Place: 6,000 USD
- 🥈 2nd Place: 4,000 USD
- 🥉 3rd Place: 2,000 USD
- 🎓 Student Awards: 2 × 2,000 USD for the top 2 solutions submitted by participants who can provide proof of student status
Support and Contact
For any inquiries or support, feel free to reach out to our task coordinators at e2lmc@tii.ae. Additionally, participants can join our **Discord channel** to interact directly with our team for real-time assistance.
Affiliated Institutions

