TL;DR: Benchmark datasets on Hugging Face can now host leaderboards. Models store their own eval scores. Everything links together. The community can submit results via PR. Verified badges mark results that have been reproduced.
Evaluation is Broken
Heading into 2026, it is time to face some harsh truths about the state of evaluation in AI. Well-known benchmarks have saturated: MMLU scores have plateaued above 91%, GSM8K sits at 94%+, and HumanEval is similarly maxed out. Yet many models that ace these benchmarks still struggle with practical tasks: browsing the web effectively, writing production-quality code, or completing multi-step work without hallucinating. The gap between the scores we see and real-world model performance is stark.
A second problem is inconsistent reporting. The same benchmark score for the same model often differs across sources, from model cards to academic papers, leaving the AI community without a single source of truth.
What We’re Shipping
Decentralized and Transparent Evaluation Reporting
We are changing how evaluations are reported on the Hugging Face Hub: reporting is now decentralized, with a community-driven way to submit evaluation scores for benchmarks. We are starting with four pivotal benchmarks and plan to expand to more relevant ones over time.
For Benchmarks: Dataset repositories can now register as benchmarks (MMLU-Pro, GPQA, and HLE are already live). A benchmark dataset automatically aggregates results reported across the Hub and displays them as a leaderboard on its dataset card. Each benchmark defines its evaluation specification in an eval.yaml file using the Inspect AI format, so runs are reproducible, and reported results must match one of the benchmark's task definitions.
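To make this concrete, here is a sketch of what a benchmark's eval.yaml might look like. The field names below are illustrative assumptions, not the official schema; the real format follows the Inspect AI specification linked from the documentation.

```yaml
# Hypothetical eval.yaml for a benchmark dataset repository.
# Field names are illustrative, not the official schema.
name: mmlu-pro
tasks:
  - name: mmlu_pro
    dataset: TIGER-Lab/MMLU-Pro
    solver: multiple_choice
    metrics:
      - accuracy
```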
For Models: Evaluation scores live in .eval_results/*.yaml files in the model repository. They appear on the model card and feed into the corresponding benchmark datasets. Results from model authors are aggregated together with any open pull requests reporting scores, giving a fuller picture of each model's performance.
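A result file in a model repo might look like the sketch below. Again, field names and values are hypothetical; each benchmark's eval.yaml determines what a valid result record contains.

```yaml
# Hypothetical .eval_results/mmlu-pro.yaml inside a model repository.
# Field names and values are illustrative, not the official schema.
benchmark: TIGER-Lab/MMLU-Pro
task: mmlu_pro
metrics:
  accuracy: 0.78
source: model-author
date: 2025-11-30
```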
For the Community: Anyone can contribute evaluation results for any model via a PR. These results are displayed as "community"-sourced without waiting for model authors to approve or merge them. Contributors can link to external references such as research papers or third-party evaluation platforms, and scores can be discussed like any other PR. Because the Hub is Git-based, the full history of when evaluations were submitted and amended is always available.
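Since evaluation files live in ordinary Git repositories, a community submission is just a commit opened as a pull request. Here is a minimal sketch using the huggingface_hub library; the file path, repo id, and YAML fields are illustrative assumptions, and the real schema is set by each benchmark's eval.yaml.

```python
# Sketch: submit a community eval result as a pull request.
# File path and YAML fields are hypothetical examples.

RESULT_YAML = """\
benchmark: TIGER-Lab/MMLU-Pro
task: mmlu_pro
metrics:
  accuracy: 0.78
source: community
"""

def submit_eval_result(repo_id: str, result_yaml: str, dry_run: bool = True) -> str:
    """Open a PR adding an eval result file under .eval_results/."""
    path = ".eval_results/mmlu-pro.yaml"  # hypothetical file name
    if dry_run:
        return path  # no network call; handy for checking the payload
    # Imported lazily so a dry run needs no extra dependencies.
    from huggingface_hub import CommitOperationAdd, HfApi

    HfApi().create_commit(
        repo_id=repo_id,
        operations=[CommitOperationAdd(
            path_in_repo=path,
            path_or_fileobj=result_yaml.encode(),
        )],
        commit_message="Add community MMLU-Pro result",
        create_pr=True,  # open a PR instead of pushing to main
    )
    return path
```

With `dry_run=False` (and a valid token), `create_pr=True` turns the commit into a pull request on the target model repo rather than a direct push.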
See our documentation for more on evaluation results.
Why This Matters
Decentralizing evaluation surfaces scores that already exist in the community but are scattered across model cards and academic papers. Bringing them into one place lets everyone aggregate, analyze, and compare evaluation results, and APIs will make it straightforward to build curated leaderboards and dashboards on top of them.
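As a rough illustration of what building on these results could look like, the sketch below aggregates a handful of hypothetical result records into a simple leaderboard. The records and field names are made up for the example; on the Hub they would come from .eval_results/*.yaml files across model repos.

```python
# Sketch: turn per-model eval records into a leaderboard.
# All data below is hypothetical example data.
from operator import itemgetter

results = [
    {"model": "org/model-a", "metric": "accuracy", "value": 0.78, "source": "model-author"},
    {"model": "org/model-b", "metric": "accuracy", "value": 0.81, "source": "community"},
    {"model": "org/model-a", "metric": "accuracy", "value": 0.80, "source": "community"},
]

def leaderboard(records, metric="accuracy"):
    """Best reported score per model, sorted descending."""
    best = {}
    for r in records:
        if r["metric"] != metric:
            continue
        best[r["model"]] = max(best.get(r["model"], 0.0), r["value"])
    return sorted(best.items(), key=itemgetter(1), reverse=True)

# leaderboard(results) ranks org/model-b (0.81) above org/model-a (0.80)
```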
To be clear, community evaluations are not a replacement for established benchmarks; leaderboards and closed evaluations remain essential. But open, reproducible evaluation results matter just as much. This initiative will not solve benchmark saturation or the gap between benchmarks and real-world use on its own, but it does make visible what is being evaluated, how, when, and by whom.
Ultimately, our aspiration is to transform the Hub into a thriving space for sharing and developing reproducible benchmarks, with a particular emphasis on new tasks and domains that rigorously challenge state-of-the-art models.
Get Started
Add Eval Results: To contribute, publish your evaluation results as YAML files in .eval_results/ within any model repository.
Check Out Scores: You can view the updated scores on your chosen benchmark dataset.
Register a New Benchmark: If you’re interested in creating a new benchmark, add eval.yaml to your dataset repository and reach out to us for inclusion in the shortlist.
Please note that this feature is currently in beta. We are building this in an open environment, and your feedback is immensely welcome!



