Judge Arena: The Next Frontier in Evaluating LLMs
In AI, and in language model applications in particular, Large Language Models (LLMs) are increasingly used as judges of other models’ outputs. The open question is how to determine which models excel in this judging capacity. Enter Judge Arena: a platform that lets you compare judge models side by side and ranks them using crowdsourced feedback.
What is Judge Arena?
Judge Arena is designed to facilitate a fun and interactive way to assess LLMs. The platform allows users to evaluate different models based on how they score and critique AI-generated responses. Once you run the judges on a test sample, you can cast your vote on the evaluation that resonates most with you. Ultimately, the results culminate in a leaderboard showcasing the top-performing models.
The concept of crowdsourced, randomized battles isn’t new; it has proven to be a potent method for benchmarking LLMs. Inspired by LMSYS’s Chatbot Arena, which has accumulated over 2 million votes, Judge Arena similarly aims to leverage human preferences to refine AI evaluations. Your direct feedback determines which LLM judges prove to be the most effective.
How Judge Arena Works
Using Judge Arena is straightforward, involving a few simple steps:
- Choose Your Sample for Evaluation: You can either let the system randomly generate a User Input/AI Response pair or input your custom sample.
- Evaluation by Two LLM Judges: Each judge will score the response and provide their reasoning for the assessment.
- Review and Vote: After reviewing both evaluations, you vote for the judge whose critique aligns more closely with your judgment. It’s recommended to look at the scores before delving into the critiques for a balanced perspective.
Following each vote, you have various options:
- Regenerate Judges: Get fresh evaluations for the same sample.
- Start a New Round: Randomly generate a fresh sample for evaluation.
- Input a New Custom Sample: Supply your own User Input/AI Response pair and have the judges evaluate it.
To maintain objectivity, model names are disclosed only after the vote is submitted, thus eliminating bias from the decision-making process.
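The round described above can be sketched in a few lines of Python. This is only an illustration of the blind-voting flow, not Judge Arena’s actual implementation: the names `Evaluation`, `run_round`, `cast_vote`, and the stub `fake_evaluate` are all hypothetical, and a real judge call would go to a model API rather than return a fixed score.

```python
import random
from dataclasses import dataclass


@dataclass
class Evaluation:
    judge: str      # model name, kept hidden until the vote is cast
    score: int      # the judge's score for the AI response
    critique: str   # the judge's written reasoning


def fake_evaluate(judge_name):
    """Hypothetical stand-in for a real call to an LLM judge."""
    return Evaluation(judge=judge_name, score=3,
                      critique="The response is partially correct.")


def run_round(judges, evaluate, rng):
    """Pick two distinct judges at random and anonymize their output."""
    first, second = rng.sample(list(judges), 2)
    hidden = {"A": evaluate(first), "B": evaluate(second)}
    # Only scores and critiques are shown to the voter, never model names.
    shown = {label: (ev.score, ev.critique) for label, ev in hidden.items()}
    return shown, hidden


def cast_vote(hidden, choice):
    """Record the vote; model names are revealed only at this point."""
    if choice not in hidden:
        raise ValueError("choice must be 'A' or 'B'")
    other = "B" if choice == "A" else "A"
    return {"winner": hidden[choice].judge, "loser": hidden[other].judge}


rng = random.Random(0)
shown, hidden = run_round(["judge-x", "judge-y", "judge-z"], fake_evaluate, rng)
vote = cast_vote(hidden, "A")
```

Keeping the `shown` view separate from the `hidden` mapping is one simple way to guarantee that model names cannot influence the voter before the vote is recorded.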
Selected Models for Evaluation
Judge Arena focuses specifically on the LLM-as-a-Judge paradigm, in which generative models act as evaluators. The team sets high standards for model selection, emphasizing two main criteria:
- Scoring and Critiquing: The model should effectively score and critique responses.
- Versatility: The model should be capable of evaluating in various scoring formats across diverse criteria.
Currently, 18 cutting-edge LLMs are included in the leaderboard, representing a mix of popular open-source models and proprietary API services. This allows for a comparative analysis that reveals insights into both open and closed approaches.
Featured Models:
- OpenAI: GPT-4o, GPT-4 Turbo, GPT-3.5 Turbo
- Anthropic: Claude 3.5 Sonnet, Opus, Haiku
- Meta: Llama 3.1 Instruct
- Alibaba: Qwen 2.5 Instruct Turbo
- Google: Gemma 2
- Mistral: Instruct v0.3, v0.1
This collection is representative of models frequently utilized in AI evaluation pipelines, and there are plans for expanding this list based on community feedback.
The Leaderboard: Tracking Performance
The cumulative votes gathered from Judge Arena are compiled into a public leaderboard that shows each model’s performance. The leaderboard updates hourly, calculating an Elo score for each model to indicate its ranking among its peers.
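To make the Elo idea concrete, here is a minimal sketch of how pairwise votes can be turned into a ranking using the standard Elo update. The article does not say how Judge Arena configures its calculation, so the starting rating of 1500, the K-factor of 32, and the function names below are assumptions chosen for illustration.

```python
from collections import defaultdict


def expected_score(rating_a, rating_b):
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings, winner, loser, k=32.0):
    """Apply one vote: the winner gains exactly what the loser loses."""
    expected_win = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - expected_win)
    ratings[winner] += delta
    ratings[loser] -= delta


def build_leaderboard(votes, initial=1500.0, k=32.0):
    """Replay (winner, loser) votes in order and rank judges by rating."""
    ratings = defaultdict(lambda: initial)  # assumed starting rating
    for winner, loser in votes:
        update_elo(ratings, winner, loser, k)
    return sorted(ratings.items(), key=lambda item: item[1], reverse=True)


board = build_leaderboard([("judge-a", "judge-b")])
```

With these assumed settings, a single vote between two previously unrated judges moves each of them by 16 points (to 1516 and 1484), because the expected score for evenly rated judges is 0.5. A win over a higher-rated judge moves the ratings by more than a win over a lower-rated one.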
Early Insights from Judge Arena
As we launch Judge Arena, initial observations offer a glimpse into its potential:
- A Competitive Mix: The leaderboard indicates a robust blend of proprietary and open-source models. GPT-4 Turbo currently leads, but alternatives like Llama and Qwen have shown commendable performance.
- Surprising Performance from Smaller Models: Notably, the Qwen 2.5 7B and Llama 3.1 8B models are showing impressive capabilities, competing fiercely with their larger counterparts. As more data becomes available, we look forward to exploring the correlation between model scale and judging proficiency.
- Alignment with Existing Research: Early data supports literature suggesting Llama models function well as foundational models for evaluations. Their strong out-of-the-box performance affirms their place in the landscape, as seen with Llama 3.1 ranking prominently on the leaderboard.
How to Contribute to the Judge Arena
The developers of Judge Arena aim to enrich resources for the community. By engaging with the leaderboard, users help guide developers in choosing suitable models for their evaluation frameworks. A forthcoming initiative will allow the sharing of 20% of anonymized voting data, empowering researchers and developers to craft more aligned evaluators.
We welcome community input! Whether you have feature requests, model suggestions, or general feedback, the team encourages open dialogue. Engage through the community tab, via Discord, or even reach out on social media platforms like X/Twitter.
Atla funds this initiative independently and is currently looking for API credits to further support this community endeavor. Collaboration inquiries are welcome!
Acknowledgments
A heartfelt thanks to everyone who contributed to testing the arena, along with a special nod to the LMSYS team for their inspiration. Additional gratitude goes to Clémentine Fourrier and the Hugging Face team for their invaluable support.
Judge Arena is set to redefine how we evaluate LLMs, making it an exciting resource for developers, researchers, and the broader AI community alike.

