As the landscape of artificial intelligence evolves, one question keeps coming up: can AI coding assistants write efficient CUDA code? To answer it, we introduced ComputeEval, an open-source benchmark designed to evaluate AI models and agents across a range of CUDA programming tasks.
We unveiled the first iteration of ComputeEval a few months ago, and we're now announcing its first major expansion: more than 100 new CUDA challenges. This update brings the dataset to 232 problems covering CUDA and the CUDA Core Compute Libraries (CCCL), and reflects our commitment to keep pushing the envelope of AI-assisted coding.
With this expansion, we deliberately raised the difficulty of the challenges. The new problems require large language models (LLMs) to use modern CUDA features such as Tensor Cores, advanced shared memory patterns, and warp-level primitives. They also test the ability to orchestrate CUDA Graphs, streams, and events within real-world applications such as dynamic simulations.
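To give a concrete flavor of what these problems ask for, here is a minimal, hypothetical sketch (not an actual benchmark task) that combines a warp-level reduction using `__shfl_down_sync()` with stream and event orchestration. The kernel name and launch configuration are illustrative assumptions only:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel in the spirit of the new problems: a grid-stride sum
// reduction that uses the warp-level primitive __shfl_down_sync() to keep
// partial sums in registers instead of shared memory.
__global__ void warpReduceSum(const float* in, float* out, int n) {
    float val = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        val += in[i];

    // Reduce within each warp via register shuffles (full warps assumed,
    // since blockDim.x is a multiple of 32).
    for (int offset = warpSize / 2; offset > 0; offset >>= 1)
        val += __shfl_down_sync(0xffffffff, val, offset);

    // Lane 0 of each warp contributes its partial sum.
    if ((threadIdx.x & (warpSize - 1)) == 0)
        atomicAdd(out, val);
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;
    *out = 0.0f;

    // Launch on a non-default stream and time it with events, the kind of
    // stream/event plumbing the benchmark also exercises.
    cudaStream_t stream;
    cudaEvent_t start, stop;
    cudaStreamCreate(&stream);
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, stream);
    warpReduceSum<<<256, 256, 0, stream>>>(in, out, n);
    cudaEventRecord(stop, stream);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("sum = %.0f in %.3f ms\n", *out, ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaStreamDestroy(stream);
    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

The shuffle-based reduction is exactly the kind of idiom the new problems probe: correct use of warp-level primitives, plus the discipline of routing work and timing through explicit streams and events rather than the default stream.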
LLM Performance on CUDA Programming
To establish baseline performance metrics and take stock of the current state of AI-assisted CUDA programming, we evaluated several leading LLMs on ComputeEval. The results, shown in Table 1, reflect the difficulty of the new benchmark; all scores are pass@1 (a sketch of the standard estimator follows the table).
| Model | ComputeEval 2025.2 (232 problems) pass@1 | ComputeEval 2025.1 (128 problems) pass@1 |
|---|---|---|
| GPT-5 (medium) | 0.5819 | 0.61 |
| Claude Sonnet 4.0 | 0.5517 | 0.64 |
| gpt-oss-20b (high) | 0.5474 | N/A |
| gpt-oss-120b (high) | 0.5302 | N/A |
| Claude Opus 4.0 | 0.5216 | N/A |
| DeepSeek-R1 | 0.4397 | 0.55 |
| gpt-oss-120b (medium) | 0.4224 | N/A |
| gpt-oss-20b (medium) | 0.4224 | N/A |
| gpt-oss-120b (low) | 0.4052 | N/A |
| DeepSeek-V3.1 | 0.3750 | 0.44 |
| Llama 4 Maverick 17B 128E | 0.3448 | 0.47 |
| Llama 3.1 405B | 0.3405 | 0.4 |
| gpt-oss-20b (low) | 0.3319 | 0.41 |

Table 1. Baseline pass@1 scores on ComputeEval 2025.2 and 2025.1
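For reference, pass@1 is the standard metric from the pass@k family: the probability that a model's first sampled solution passes all functional tests. ComputeEval's exact harness isn't described here, so the following is only a minimal sketch of the widely used unbiased pass@k estimator (Chen et al., 2021), with toy per-problem sample counts as assumed inputs:

```cpp
#include <cstdio>
#include <utility>
#include <vector>

// Unbiased pass@k estimator: for a problem with n samples of which c pass,
// pass@k = 1 - C(n - c, k) / C(n, k), computed in a numerically stable
// product form. How ComputeEval chooses n and c is an assumption here;
// this only illustrates the metric itself.
double passAtK(int n, int c, int k) {
    if (n - c < k) return 1.0;  // every size-k draw contains a passing sample
    double result = 1.0;
    for (int i = n - c + 1; i <= n; ++i)
        result *= 1.0 - static_cast<double>(k) / i;
    return 1.0 - result;
}

int main() {
    // Toy aggregate: per-problem (n, c) counts. The benchmark-level pass@1
    // is the mean of the per-problem estimates (here k = 1, so each
    // estimate reduces to c / n).
    std::vector<std::pair<int, int>> problems = {{10, 6}, {10, 0}, {10, 10}};
    double sum = 0.0;
    for (auto [n, c] : problems) sum += passAtK(n, c, 1);
    printf("mean pass@1 = %.4f\n", sum / problems.size());
    return 0;
}
```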
Interestingly, all models scored lower on ComputeEval 2025.2 than on 2025.1. This is not a sign of regressing capabilities; it reflects the increased difficulty of the benchmark. Each new release is a deliberate step toward demanding a deeper understanding of the subtleties of accelerated computing from AI systems.
What’s Next and How to Get Involved
The journey doesn't stop here. We are committed to expanding the dataset and the capabilities of the ComputeEval evaluation framework, and plans are already underway to broaden coverage to additional CUDA-X libraries such as cuBLAS, CUTLASS, cuDNN, and RAPIDS. We invite the HPC and AI communities to contribute and collaborate on this effort.
Explore the code on GitHub and access the dataset on Hugging Face. Together, let’s reshape the future of AI-powered coding!