Evaluating LLM Performance on AI-Generated CUDA Code Using ComputeEval 2025.2: A Comprehensive Benchmarking Study

As AI coding assistants mature, a practical question arises: can they write efficient CUDA code? To answer it, we introduced ComputeEval, an open-source benchmark that evaluates AI models and agents across a range of CUDA programming tasks.

A few months ago, we released the first version of ComputeEval. We are now announcing its first major expansion: over 100 new CUDA challenges. This update brings the dataset to a total of 232 CUDA and CUDA Core Compute Libraries (CCCL) problems.

With this expansion, we have deliberately raised the difficulty of the challenges. The new problems require large language models (LLMs) to use modern CUDA features such as Tensor Cores, advanced shared memory patterns, and warp-level primitives. They also test the ability to orchestrate CUDA Graphs, Streams, and Events correctly, all within the context of real-world applications such as dynamic simulations.

LLM Performance on CUDA Programming

To establish baseline metrics and understand the current state of AI-assisted CUDA programming, our team evaluated several leading LLMs on ComputeEval. Table 1 reports the pass@1 accuracy of each model on both versions of the benchmark.

Model	ComputeEval 2025.2 (232 problems) pass@1	ComputeEval 2025.1 (128 problems) pass@1
GPT-5 (medium)	0.5819	0.61
Claude Sonnet 4.0	0.5517	0.64
gpt-oss-20b (high)	0.5474	N/A
gpt-oss-120b (high)	0.5302	N/A
Claude Opus 4.0	0.5216	N/A
DeepSeek-R1	0.4397	0.55
gpt-oss-120b (medium)	0.4224	N/A
gpt-oss-20b (medium)	0.4224	N/A
gpt-oss-120b (low)	0.4052	N/A
DeepSeek-V3.1	0.3750	0.44
Llama 4 Maverick 17B 128E	0.3448	0.47
Llama 3.1 405B	0.3405	0.4
gpt-oss-20b (low)	0.3319	0.41

Table 1. Pass@1 accuracy of state-of-the-art LLMs on ComputeEval 2025.1 and 2025.2. The latest version expands the benchmark to 232 CUDA programming challenges, providing a tougher test of AI-assisted coding. N/A indicates that no 2025.1 score is reported for that model.
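For readers unfamiliar with the metric, pass@1 is commonly read as the fraction of problems a model solves on a single attempt. The sketch below uses the unbiased pass@k estimator that is widely used for code-generation benchmarks. It is an illustration only: the article does not say how ComputeEval samples or aggregates results, and the function names `pass_at_k` and `benchmark_pass_at_k` are hypothetical, not part of the ComputeEval code base.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed.

    Uses the standard unbiased form 1 - C(n - c, k) / C(n, k).
    """
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k=1):
    """Average the per-problem estimate over (n, c) pairs, one per problem."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Assumed setup: one sample per problem, 135 of 232 problems solved.
    results = [(1, 1)] * 135 + [(1, 0)] * 97
    print(round(benchmark_pass_at_k(results, k=1), 4))
```

With one sample per problem, pass@1 reduces to solved problems divided by total problems. Under that assumption, a score of 0.5819 on 232 problems would correspond to 135 solved problems; this is an inference from the reported numbers, not something the article states.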

Every model with a reported score on both versions scored lower on ComputeEval 2025.2 than on 2025.1. This does not indicate that the models have become less capable; rather, it reflects the increased difficulty of the benchmark. With each release, we aim to demand a deeper understanding of the subtleties of accelerated computing from AI systems.
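The size of the drop can be checked directly from Table 1. The short sketch below copies the pass@1 values for the seven models that have a score on both versions and computes the change for each; the dictionary layout and the helper name `score_changes` are illustrative choices, not part of ComputeEval.

```python
# Pass@1 scores copied from Table 1, restricted to models with both scores.
# Tuple layout: (ComputeEval 2025.2, ComputeEval 2025.1)
SCORES = {
    "GPT-5 (medium)": (0.5819, 0.61),
    "Claude Sonnet 4.0": (0.5517, 0.64),
    "DeepSeek-R1": (0.4397, 0.55),
    "DeepSeek-V3.1": (0.3750, 0.44),
    "Llama 4 Maverick 17B 128E": (0.3448, 0.47),
    "Llama 3.1 405B": (0.3405, 0.4),
    "gpt-oss-20b (low)": (0.3319, 0.41),
}


def score_changes(scores):
    """Return the 2025.2 minus 2025.1 pass@1 difference for each model."""
    return {model: round(new - old, 4) for model, (new, old) in scores.items()}


if __name__ == "__main__":
    changes = score_changes(SCORES)
    # Print the largest drops first.
    for model, delta in sorted(changes.items(), key=lambda item: item[1]):
        print(f"{model:28s} {delta:+.4f}")
```

Every difference is negative. Note that the two columns cover different problem sets (128 versus 232 problems), so the differences describe how much harder the new benchmark is for each model rather than a like-for-like regression.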

What’s Next and How to Get Involved

The journey doesn’t stop here. We are committed to further expanding the dataset and enhancing the capabilities of the ComputeEval evaluation framework. Work is already underway to broaden ComputeEval’s coverage to additional CUDA-X libraries, including cuBLAS, CUTLASS, cuDNN, and RAPIDS. We invite members of the HPC and AI communities to contribute and collaborate.

Explore the code on GitHub and access the dataset on Hugging Face. Together, let’s reshape the future of AI-powered coding!
