Explore Arabic Instruction Following, AraGen Updates, and Additional Enhancements

By Neha Sengupta
Last updated: April 13, 2025
At Inception, we have been working to enhance AI model evaluations within the Arabic language context. Previously, we introduced AraGen, one of the first generative Arabic leaderboards, serving as a benchmark for evaluating Arabic LLMs on generative tasks.

As part of our ongoing efforts, we are excited to share the following updates:

  • Arabic-Leaderboards Space, launched in collaboration with Mohammed bin Zayed University of Artificial Intelligence (MBZUAI) to consolidate Arabic AI evaluations in one place. This platform currently supports AraGen-03-25 and Arabic Instruction Following, with plans to expand to leaderboards for Arabic AI models across various modalities.
  • AraGen-03-25 release, featuring an expanded dataset and a refined judge system prompt.
  • Instruction Following leaderboard, powered by the Arabic IFEval benchmark, the first publicly available benchmark for evaluating instruction-following capabilities in Arabic.

The following sections provide details about each of these updates.

Arabic-Leaderboards Space

Arabic-Leaderboards is a comprehensive, unified space for all Arabic evaluations and tasks. It is meant to serve as a central hub covering a broad spectrum of evaluations for models across modalities. It currently hosts AraGen-03-25 and Arabic Instruction Following as live leaderboards, and we plan to expand it with more leaderboards and tasks for Arabic AI models across various modalities.

We invite interested contributors to reach out through the community tab or by email to discuss integrating their work and leaderboards as additional tabs in this space.

Latest Updates in AraGen Leaderboard

In December 2024, we introduced the AraGen Benchmark as the foundation for the AraGen Leaderboard. A key feature of this leaderboard is its dynamic nature, with evaluation datasets remaining private (blind testing) for three months to ensure fair and unbiased assessments. Adhering to the same philosophy, we are publicly releasing the AraGen-12-24 benchmark, along with all model responses evaluated by Claude-3.5-Sonnet following the 3C3H guidelines.

By sharing this benchmark and model responses, we aim to encourage the community to review them, identify any unexpected behaviors we may have missed and help us refine our evaluation framework.
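
For readers who want to inspect the released data directly, the sketch below shows one way to pull it down with the Hugging Face datasets library. The repository id and split name are placeholders, not confirmed here; the exact location is announced on the Arabic-Leaderboards Space.

from datasets import load_dataset

# Placeholder repository id and split; check the Arabic-Leaderboards Space
# for where the AraGen-12-24 benchmark and judged model responses are published.
aragen_12_24 = load_dataset("inceptionai/AraGen-12-24", split="train")

# Inspect one record; the exact field names depend on the published schema.
print(aragen_12_24[0])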

AraGen-03-25 Release

In this latest AraGen release, we have expanded the dataset to include 340 pairs of questions and answers, up from 279 in the previous version. The distribution remains relatively similar:

  • Question Answering: ~200 pairs
  • Reasoning: 70 pairs
  • Safety Questions: 40 pairs
  • Orthographic and Grammatical Analysis: 30 pairs

This allocation reflects the primary focus on question answering as the main use case of any language model, chatbot, or AI assistant, while still covering the other evaluation areas, particularly given the complexity of generating challenging queries in Arabic grammar and orthography.

Figure: Task distribution (%) across the AraGen-03-25 benchmark.
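
As a quick sanity check, the shares shown in the figure can be reproduced from the counts listed above; a minimal sketch in Python:

counts = {
    "Question Answering": 200,   # approximate
    "Reasoning": 70,
    "Safety Questions": 40,
    "Orthographic and Grammatical Analysis": 30,
}
total = sum(counts.values())  # 340 pairs in AraGen-03-25
for task, n in counts.items():
    print(f"{task}: {100 * n / total:.1f}%")
# Roughly 59% / 21% / 12% / 9% of the benchmark.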

Additionally, we refined the judge system prompt to enhance clarity, even for smaller or weaker judge models.

Dynamic Evaluation and Ranking Analysis

Maintaining consistency and reliability in our benchmark and evaluation pipeline is crucial as we introduce dynamic evaluation cycles. To ensure this, we analyzed ranking variations among the top 10 models across different dataset versions and system prompt configurations.

Analysis of Ranking Changes

We analyzed model performance under two evaluation scenarios:

  • Compared the previous system prompt (SP1) against the current system prompt (SP2) on the latest AraGen version (AraGen-03-25).
  • Assessed the impact of updating both the dataset and the judge system prompt.

The overall rankings were stable, with the top-performing model (o1-2024-12-17) consistently maintaining its lead. Notably, we observed a swap in rankings between two Claude models, underscoring the sensitivity of our evaluation approach, especially given their initially close scores.

The only significant change in rankings was for the gpt-4o-2024-08-06 model, whose performance markedly improved with the updated dataset and prompt. This sudden jump is currently under investigation as part of our ongoing benchmarks-design research.

No major variations occurred solely due to changes in the system prompt, indicating good reproducibility as long as the same judge model (claude-3.5-sonnet) is used. However, we anticipate potential variations with smaller or weaker models as judges, where employing the second system prompt (SP2) may enhance consistency.

In summary, o1-2024-12-17 remains the consistent top performer, even though its score dropped from 82.67% to 70.25% under the more challenging updated benchmark, which reinforces its reliability for Arabic applications. While the updates to the evaluation pipeline introduced minor ranking shifts, the overall framework remained stable, with the top and bottom performers keeping their positions. Many of the observed ranking adjustments likely fall within typical evaluation error margins, given the small score differences involved. Notably, scores for the second- through fifth-ranked models, previously between 70% and 78%, now sit between 51% and 57%, highlighting that the updated AraGen dataset is a markedly more difficult benchmark, in line with ongoing advances in reasoning models. Despite these shifts in absolute scores, it is encouraging that leaderboard positions remained largely consistent, underscoring the robustness of the evaluation approach going forward.

More Detailed Scores

Comparison 1: System Prompt Effect (AraGen-03-25 SP1 vs. AraGen-03-25 SP2)

Table 1. AraGen-03-25 (SP1) Rankings

| Rank | Model Name | 3C3H Score | Correctness | Completeness | Conciseness | Helpfulness | Honesty | Harmlessness |
|------|------------|------------|-------------|--------------|-------------|-------------|---------|--------------|
| 1 | o1-2024-12-17 | 69.49% | 74.90% | 73.04% | 47.11% | 72.40% | 74.56% | 74.90% |
| 2 | gpt-4o-2024-08-06 | 56.10% | 61.96% | 58.92% | 34.22% | 58.80% | 60.81% | 61.89% |
| 3 | claude-3-5-sonnet-20241022 | 54.29% | 59.31% | 57.65% | 34.31% | 57.13% | 58.01% | 59.31% |
| 4 | claude-3-7-sonnet-20250219 | 53.21% | 59.31% | 56.76% | 28.53% | 56.86% | 58.53% | 59.24% |
| 5 | o3-mini-2025-01-31 | 51.65% | 56.67% | 54.31% | 31.74% | 54.46% | 56.10% | 56.59% |
| 6 | deepseek-chat | 47.82% | 54.31% | 52.35% | 20.56% | 51.94% | 53.46% | 54.31% |
| 7 | claude-3-5-haiku-20241022 | 43.62% | 48.14% | 44.61% | 28.92% | 45.37% | 46.57% | 48.14% |
| 8 | o1-mini-2024-09-12 | 43.60% | 47.55% | 47.06% | 26.54% | 46.35% | 46.57% | 47.55% |
| 9 | Qwen/Qwen2.5-72B-Instruct | 42.18% | 48.63% | 47.55% | 16.03% | 44.93% | 47.38% | 48.55% |
| 10 | gpt-4o-mini-2024-07-18 | 40.96% | 45.10% | 44.02% | 24.24% | 43.19% | 44.14% | 45.10% |

Table 2. AraGen-03-25 (SP2) Rankings

| Rank | Model Name | 3C3H Score | Correctness | Completeness | Conciseness | Helpfulness | Honesty | Harmlessness |
|------|------------|------------|-------------|--------------|-------------|-------------|---------|--------------|
| 1 | o1-2024-12-17 | 70.25% | 75.88% | 70.98% | 51.25% | 72.55% | 75.25% | 75.59% |
| 2 | gpt-4o-2024-08-06 | 57.38% | 63.14% | 56.67% | 39.95% | 59.66% | 61.79% | 63.06% |
| 3 | claude-3-7-sonnet-20250219 | 56.54% | 62.25% | 58.53% | 34.49% | 60.39% | 61.40% | 62.18% |
| 4 | claude-3-5-sonnet-20241022 | 55.60% | 60.49% | 56.67% | 39.14% | 58.60% | 58.50% | 60.20% |
| 5 | o3-mini-2025-01-31 | 51.63% | 56.08% | 52.35% | 36.72% | 53.53% | 55.10% | 56.00% |
| 6 | deepseek-chat | 51.00% | 57.55% | 53.92% | 25.61% | 54.95% | 56.42% | 57.55% |
| 7 | claude-3-5-haiku-20241022 | 44.79% | 48.92% | 44.51% | 32.40% | 46.67% | 47.38% | 48.85% |
| 8 | o1-mini-2024-09-12 | 43.78% | 47.55% | 46.76% | 28.04% | 46.27% | 46.67% | 47.40% |
| 9 | Qwen/Qwen2.5-72B-Instruct | 43.09% | 48.82% | 47.55% | 19.73% | 46.59% | 47.11% | 48.75% |
| 10 | gpt-4o-mini-2024-07-18 | 40.62% | 45.10% | 40.88% | 27.60% | 42.06% | 43.58% | 44.51% |

Comparison 2: Dataset and Prompt Update Effect (AraGen-12-24 SP1 (old) vs. AraGen-03-25 SP2 (new))

Table 3. AraGen-12-24 (SP1) Rankings

| Rank | Model Name | 3C3H Score | Correctness | Completeness | Conciseness | Helpfulness | Honesty | Harmlessness |
|------|------------|------------|-------------|--------------|-------------|-------------|---------|--------------|
| 1 | o1-2024-12-17 | 82.67% | 92.71% | 92.47% | 34.65% | 91.19% | 92.26% | 92.71% |
| 2 | claude-3-5-sonnet-20241022 | 78.74% | 88.31% | 87.81% | 33.27% | 86.97% | 87.78% | 88.31% |
| 3 | claude-3-7-sonnet-20250219 | 77.71% | 87.89% | 87.77% | 29.20% | 86.27% | 87.26% | 87.89% |
| 4 | gpt-4o-2024-08-06 | 73.89% | 83.75% | 82.91% | 28.94% | 80.99% | 83.00% | 83.75% |
| 5 | deepseek-chat | 71.28% | 81.89% | 81.89% | 21.13% | 79.53% | 81.32% | 81.89% |
| 6 | o3-mini-2025-01-31 | 70.91% | 80.29% | 79.21% | 27.33% | 78.38% | 79.99% | 80.29% |
| 7 | claude-3-5-haiku-20241022 | 66.40% | 74.43% | 73.36% | 30.56% | 72.34% | 73.30% | 74.43% |
| 8 | o1-mini-2024-09-12 | 64.95% | 74.22% | 74.22% | 21.46% | 72.24% | 73.32% | 74.22% |
| 9 | gpt-4o-mini-2024-07-18 | 63.40% | 72.10% | 71.38% | 22.98% | 70.41% | 71.41% | 72.10% |
| 10 | Qwen/Qwen2.5-72B-Instruct | 62.58% | 71.92% | 71.80% | 19.06% | 69.86% | 70.94% | 71.92% |

Table 4. AraGen-03-25 (SP2) Rankings

| Rank | Model Name | 3C3H Score | Correctness | Completeness | Conciseness | Helpfulness | Honesty | Harmlessness |
|------|------------|------------|-------------|--------------|-------------|-------------|---------|--------------|
| 1 | o1-2024-12-17 | 70.25% | 75.88% | 70.98% | 51.25% | 72.55% | 75.25% | 75.59% |
| 2 | gpt-4o-2024-08-06 | 57.38% | 63.14% | 56.67% | 39.95% | 59.66% | 61.79% | 63.06% |
| 3 | claude-3-7-sonnet-20250219 | 56.54% | 62.25% | 58.53% | 34.49% | 60.39% | 61.40% | 62.18% |
| 4 | claude-3-5-sonnet-20241022 | 55.60% | 60.49% | 56.67% | 39.14% | 58.60% | 58.50% | 60.20% |
| 5 | o3-mini-2025-01-31 | 51.63% | 56.08% | 52.35% | 36.72% | 53.53% | 55.10% | 56.00% |
| 6 | deepseek-chat | 51.00% | 57.55% | 53.92% | 25.61% | 54.95% | 56.42% | 57.55% |
| 7 | claude-3-5-haiku-20241022 | 44.79% | 48.92% | 44.51% | 32.40% | 46.67% | 47.38% | 48.85% |
| 8 | o1-mini-2024-09-12 | 43.78% | 47.55% | 46.76% | 28.04% | 46.27% | 46.67% | 47.40% |
| 9 | Qwen/Qwen2.5-72B-Instruct | 43.09% | 48.82% | 47.55% | 19.73% | 46.59% | 47.11% | 48.75% |
| 10 | gpt-4o-mini-2024-07-18 | 40.62% | 45.10% | 40.88% | 27.60% | 42.06% | 43.58% | 44.51% |
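
To put a number on the ranking stability discussed above, one can compute a rank correlation between the SP1 and SP2 orderings taken from Tables 1 and 2. The sketch below uses SciPy's Spearman correlation; it is an illustration of the idea, not part of our evaluation pipeline.

from scipy.stats import spearmanr

sp1 = ["o1-2024-12-17", "gpt-4o-2024-08-06", "claude-3-5-sonnet-20241022",
       "claude-3-7-sonnet-20250219", "o3-mini-2025-01-31", "deepseek-chat",
       "claude-3-5-haiku-20241022", "o1-mini-2024-09-12",
       "Qwen/Qwen2.5-72B-Instruct", "gpt-4o-mini-2024-07-18"]
sp2 = ["o1-2024-12-17", "gpt-4o-2024-08-06", "claude-3-7-sonnet-20250219",
       "claude-3-5-sonnet-20241022", "o3-mini-2025-01-31", "deepseek-chat",
       "claude-3-5-haiku-20241022", "o1-mini-2024-09-12",
       "Qwen/Qwen2.5-72B-Instruct", "gpt-4o-mini-2024-07-18"]

# Rank of each model under SP2, listed in SP1 order; only the two Claude
# models swap places, so the correlation stays close to 1.
sp2_rank = [sp2.index(m) + 1 for m in sp1]
rho, _ = spearmanr(list(range(1, len(sp1) + 1)), sp2_rank)
print(f"Spearman rank correlation (SP1 vs. SP2): {rho:.3f}")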

Analysis of 3C3H

As part of our December release, we introduced 3C3H as a new evaluation measure of the chat capability of models, aimed at assessing both the factuality and usability of LLMs’ answers. Over the past three months, we have observed some interesting findings, which we share in this section.

One emergent trend is that the various dimensions are almost perfectly correlated. In most cases, correct answers are scored as both highly helpful and harmless, while most models fail to maintain this correlation for the conciseness dimension. This generally reflects the way we train these models today, where more verbose answers are often rewarded as more helpful. This trend has recently caught the attention of the research community, as exemplified by the release of OpenAI’s GPT-4.5 model. According to their use cases section, answers from GPT-4.5 are more concise than those from GPT-4, while still being equally helpful.
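
The correlation claim can be checked directly from per-response judge scores. The helper below is a minimal sketch, assuming the scores are collected into a pandas DataFrame with one column per 3C3H dimension; the column names are illustrative.

import pandas as pd

def dimension_correlations(scores: pd.DataFrame) -> pd.DataFrame:
    """Pairwise Pearson correlations between the 3C3H dimensions.

    Assumes `scores` has one row per evaluated response and one column per
    dimension (correctness, completeness, conciseness, helpfulness,
    honesty, harmlessness), on whatever scale the judge assigns.
    """
    return scores.corr()

# The conciseness column is the one that tends to decouple from the rest:
# dimension_correlations(scores)["conciseness"]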

Figure: 3C3H heatmap for o1-2024-12-17.

A model that stood out in this analysis is “silma-ai/SILMA-9B-Instruct-v1.0”, which exhibited a higher conciseness score compared to other open-weight models—even those with larger sizes. However, this gain in conciseness came at the cost of helpfulness and other dimensions when compared to its base model, “google/gemma-2-9b-it”. We believe that this analysis, along with optimizing for 3C3H, will enable the community to develop better models through curated datasets while maintaining the correlation across all dimensions.

Figure: SILMA-9B-Instruct-v1.0 vs. Gemma-2-9b-it heatmaps.

This is an ongoing effort to better understand how these dimensions are interconnected and how various scenarios and training recipes affect this relationship. Below, we provide a space where you can generate heatmaps for any combination of models of your choice. We hope the community finds it helpful in spotting additional trends that we may not have noticed. Ultimately, we aim for this tool to foster more discussion about evaluation and 3C3H, serving as a resource for others’ work.

We believe that one limitation of this analysis is the zeroing rule, whereby we do not evaluate the other dimensions if the answer is not correct. In the future, we plan to investigate further whether an answer can be helpful despite being incorrect, and how dimensions such as conciseness and harmlessness factor into this evaluation if the answer is not correct.

Instruction Following Leaderboard

What is Instruction Following as a Benchmark?

One of the core capabilities of large language models (LLMs) is their ability to understand and follow human instructions. This skill is crucial for building reliable chatbots, virtual assistants, and AI systems that do what users ask. Without strong instruction following, a model might generate correct information but in the wrong format, ignore user-specified constraints, or produce unwanted content. An instruction-following benchmark provides a standardized, objective way to measure a model’s instruction adherence and to compare models fairly, driving improvements.

Dataset: Arabic IFEval

Our work took inspiration from the IFEval dataset. IFEval, originally introduced by Google, provides a structured benchmark designed to evaluate LLMs on their ability to follow verifiable instructions. It consists of prompts containing specific, objectively measurable commands such as “use exactly three bullet points,” “include the word ‘innovation’ twice,” or “limit your answer to 100 words.” The English IFEval dataset contains around 500 prompts covering 25 different types of such verifiable instructions. Evaluation within IFEval is conducted through Python functions that automatically verify whether instructions are followed, avoiding the need for human evaluators or another AI judge and making the evaluations reproducible and unbiased. While IFEval has become the standard for assessing LLMs responding in English, a similarly detailed and structured resource has been absent for Arabic.
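
To make the idea concrete, here is a minimal sketch (not the official IFEval code) of what verifier functions for the three example instructions quoted above could look like:

def has_exact_bullet_count(response: str, n: int = 3) -> bool:
    # "use exactly three bullet points"
    bullets = [line for line in response.splitlines()
               if line.lstrip().startswith(("- ", "* ", "• "))]
    return len(bullets) == n

def has_keyword_frequency(response: str, keyword: str = "innovation",
                          times: int = 2) -> bool:
    # "include the word 'innovation' twice" (at least that many occurrences)
    return response.lower().count(keyword.lower()) >= times

def within_word_limit(response: str, max_words: int = 100) -> bool:
    # "limit your answer to 100 words"
    return len(response.split()) <= max_words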

Construction of our Arabic IFEval dataset began by carefully adapting approximately 300 prompts from the original English IFEval. This wasn’t a straightforward, word-for-word translation; instead, we thoughtfully adjusted prompts to clearly reflect Arabic linguistic nuances and cultural contexts. Instructions that made little sense in Arabic, such as those involving English-specific vowel constraints, were either adapted to equivalent Arabic linguistic challenges or omitted entirely. Cultural references specific to English-speaking contexts were replaced with culturally relevant or Arabic-language equivalents to maintain contextual clarity. Additionally, we created unique Arabic-specific samples from scratch, specifically designed to emphasize distinctive Arabic phonetics, orthographic characteristics, and morphology, such as the careful use of diacritical marks (tashkīl), phonetic constraints like avoiding certain letters (e.g., writing without using the letter Alef (ا)), and leveraging root-based morphology to challenge models’ word-selection abilities. All prompts underwent rigorous expert validation by Arabic linguists and domain experts who ensured grammatical accuracy, cultural appropriateness, and unambiguous clarity of each instruction.

The Arabic IFEval dataset is publicly available for the research community to use, test, and contribute to. It is hosted on Hugging Face under inceptionai/Arabic_IFEval.
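
The dataset can be loaded in a couple of lines with the Hugging Face datasets library. The split name below is an assumption (check the dataset card), while the record fields mirror the JSON samples shown in the next section.

from datasets import load_dataset

# Repository id as given above; the "train" split name is assumed.
arabic_ifeval = load_dataset("inceptionai/Arabic_IFEval", split="train")

sample = arabic_ifeval[0]
print(sample["prompt"])               # the Arabic instruction prompt
print(sample["instruction_id_list"])  # the verifiable constraints it carries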

Sample I: Arabic IFEval

Prompt (Ar):
فسر كيف يمكن للتقنيات الحديثة مثل الذكاء الاصطناعي أن تسهم في الحفاظ على الأدب العربي، مع تضمين 12 كلمة تنتهي بأحد الحروف الرافسة (د، ذ، أ، ر، ز، و)، وأن تكون الإجابة مكتوبة بأسلوب موجز لا يتجاوز 120 كلمة. يجب أن لا تحتوي إجابتك على أي فواصل.

Prompt Translation (En):
Explain how modern technologies, such as artificial intelligence, can contribute to preserving Arabic literature. Your answer should include at least 12 words ending with one of these specific Arabic letters (د، ذ، أ، ر، ز، و), be concise, and should not exceed 120 words. Your response must not contain any commas.

Instructions to follow:

  • Letter Frequency Constraint: Include at least 12 words ending with one of the letters (د، ذ، أ، ر، ز، و).
  • Punctuation Constraint: Do not use commas.
  • Length Constraint: Write concisely, not exceeding 120 words.

Example JSON Format:

{
  "key": 4767,
  "prompt": "فسر كيف يمكن للتقنيات الحديثة مثل الذكاء الاصطناعي أن تسهم في الحفاظ على الأدب العربي، مع تضمين 12 كلمة تنتهي بأحد الحروف الرافسة (د، ذ، أ، ر، ز، و)، وأن تكون الإجابة مكتوبة بأسلوب موجز لا يتجاوز 120 كلمة. يجب أن لا تحتوي إجابتك على أي فواصل.",
  "instruction_id_list": [
    "keywords:letter_list_freq",
    "punctuation:no_comma",
    "length_constraints:number_words"
  ],
  "kwargs": [
    {
      "letters": ["د", "ذ", "أ", "ر", "ز", "و"],
      "frequency": 12,
      "relation": "at least",
      "position": "end"
    },
    {},
    {
      "relation": "less than",
      "num_words": 500
    }
  ],
  "lang": ["ar"]
}
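
As a rough illustration of how the kwargs above could translate into checks, here is a simplified sketch; it is not the official evaluation code, and whitespace tokenization glosses over attached punctuation and diacritics.

def count_words_ending_with(text: str, letters: list[str]) -> int:
    # keywords:letter_list_freq with position="end": count words whose last
    # character is one of the listed Arabic letters.
    return sum(1 for word in text.split() if word and word[-1] in letters)

def has_no_comma(text: str) -> bool:
    # punctuation:no_comma: reject both the Latin comma and the Arabic comma.
    return "," not in text and "،" not in text

def satisfies_word_count(text: str, num_words: int, relation: str) -> bool:
    # length_constraints:number_words with relation "less than" / "at least".
    n = len(text.split())
    return n < num_words if relation == "less than" else n >= num_words
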
Sample II: Arabic IFEval

Prompt (Ar):
اكتب قصة قصيرة عن الرقم 600، على أن يكتب الرقم في القصة بالكلمات وبكل الصيغ المفقطة الممكنة له على الأقل مرة (ستة مائة – ست مئة – ستمئة – ستمائة).

Prompt Translation (En):
Write a short story about the number 600. Within the story, the number should be spelled out in Arabic in all possible written forms at least once each (“ستة مائة”, “ست مئة”, “ستمئة”, “ستمائة”).

Instructions to follow:
Your response must explicitly include the following Arabic spellings at least once each:

  • ستة
  • مائة
  • ست
  • مئة
  • ستمئة
  • ستمائة

Example JSON Format:

{
  "key": 4768,
  "prompt": "اكتب قصة قصيرة عن الرقم 600، على أن يكتب الرقم في القصة بالكلمات وبكل الصيغ المفقطة الممكنة له على الأقل مرة (ستة مائة - ست مئة - ستمئة - ستمائة).",
  "instruction_id_list": [
    "keywords:frequency",
    "keywords:frequency",
    "keywords:frequency",
    "keywords:frequency",
    "keywords:frequency",
    "keywords:frequency"
  ],
  "kwargs": [
    {"relation": "at least", "keyword": "ستة", "frequency": 1},
    {"relation": "at least", "keyword": "مائة", "frequency": 1},
    {"relation": "at least", "keyword": "ست", "frequency": 1},
    {"relation": "at least", "keyword": "مئة", "frequency": 1},
    {"relation": "at least", "keyword": "ستمئة", "frequency": 1},
    {"relation": "at least", "keyword": "ستمائة", "frequency": 1}
  ],
  "lang": ["ar"]
}

Evaluation Methodology & Metrics

To evaluate the models, we adopted a comprehensive methodology combining both explicit and implicit evaluation techniques. Explicit evaluation involved using automated scripts to assess whether instructions were strictly followed, focusing on elements such as correct formatting and specific word usage. Implicit evaluation addressed more nuanced linguistic expectations, such as maintaining the intended response language and avoiding repetitive patterns.

Additionally, we utilized the scoring metrics introduced by Google in the IFEval framework, applying them at both prompt-level and instruction-level granularity. These metrics were measured using strict-criteria accuracy, which requires exact adherence to the provided instructions. The prompt-level score is notably harder, reflecting the user’s viewpoint by asking, “Did I get everything I requested?” If a prompt included multiple requirements, failing to meet any single one means the request was not fully satisfied. In contrast, the instruction-level score is more lenient, allowing us to evaluate partial compliance.

In our analysis, we will emphasize the prompt-level strict accuracy as it provides the most rigorous assessment of a model’s instruction-following capabilities.
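
Given per-prompt lists of pass/fail outcomes for each instruction, the two granularities differ only in how the booleans are aggregated; a minimal sketch:

def prompt_level_strict(outcomes: list[list[bool]]) -> float:
    # A prompt counts only if every instruction it contains was followed.
    return sum(all(per_prompt) for per_prompt in outcomes) / len(outcomes)

def instruction_level_strict(outcomes: list[list[bool]]) -> float:
    # Each instruction is scored on its own, so partial compliance counts.
    flat = [ok for per_prompt in outcomes for ok in per_prompt]
    return sum(flat) / len(flat)

# Two prompts: the second fails one of its three instructions.
outcomes = [[True, True], [True, False, True]]
print(prompt_level_strict(outcomes))       # 0.5
print(instruction_level_strict(outcomes))  # 0.8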

Results & Analysis

We evaluated a broad range of LLMs on both the English IFEval benchmark and our newly introduced Arabic IFEval. This encompassed closed-source models (such as OpenAI’s GPT series and Anthropic’s Claude models) and open-source alternatives (including the Jais series, Meta’s LLaMA-2 variants, and various open bilingual models). Below, we present a summary of results for a representative subset of these models, comparing their prompt-level accuracy on both English and Arabic IFEval. Accuracy is reported using both strict and loose criteria, with values expressed as the percentage of prompts successfully completed.

Figure: Instruction Following leaderboard sample.

Table 5. Sample Scores from Instruction Following Benchmark

| Rank | Model Name | Arabic Prompt-lvl (%) | English Prompt-lvl (%) |
|------|------------|-----------------------|------------------------|
| 1 | claude-3.5-sonnet | 72.5 | 84.7 |
| 2 | gpt-4o-2024-08-06 | 70.8 | 79.4 |
| 3 | gpt-4o-mini-2024-07-18 | 68.1 | 76.9 |
| 4 | claude-3.5-haiku | 67.1 | 78.2 |
| 5 | Qwen/Qwen2.5-72B-Instruct | 67.3 | 83.5 |
| 6 | Qwen/Qwen2.5-32B-Instruct | 60.4 | 77.6 |
| 7 | google/gemma-2-27b-it | 59.4 | 76.1 |
| 8 | CohereForAI/aya-expanse-32b | 56.7 | 65.1 |
| 9 | CohereForAI/c4ai-command-r7b-12-2024 | 56.4 | 74.9 |
| 10 | meta-llama/Llama-3.3-70B-Instruct | 58.2 | 88.2 |

Upcoming Work

We will keep adding and updating leaderboards in the Arabic-Leaderboards Space as our internal work progresses. In upcoming releases, we expect to publish a leaderboard for visual question answering across multiple tasks, powered by camel-bench and kitab from our collaborators at MBZUAI.
