ParEval Leaderboard: Evaluating the Ability of Large Language Models to Generate Parallel Code

1 minute read

We introduced the ParEval benchmark in "Can Large Language Models Write Parallel Code?" to evaluate how well LLMs generate parallel code. We found a significant gap between their ability to generate sequential versus parallel code across a large set of computational problems and parallel programming models. This page maintains an up-to-date table tracking the progress of state-of-the-art LLMs on ParEval.

ParEval Results

| Model | No. Parameters | HumanEval | ParEval Serial | ParEval Parallel |
|---|---|---|---|---|
| StarCoder2-3B | 3B | 31.7 | 42.7 | 9.6 |
| StarCoder2-7B | 7B | 35.4 | 59.4 | 15.9 |
| CodeLlama-7B | 7B | 29.9 | 48.4 | 15.3 |
| CodeLlama-13B | 13B | 35.0 | 52.8 | 17.4 |
| StarCoder2-15B | 15B | 46.3 | 61.6 | 23.1 |
| StarCoderBase | 15.5B | 30.3 | 51.7 | 18.6 |
| CodeLlama-34B | 34B | 45.1 | 54.0 | 10.2 |
| Phind-V2 | 34B | 71.9 | 65.6 | 32.1 |
| Gemini-Pro | – | 67.7 | 59.3 | 25.1 |
| GPT-3.5 | – | 61.5 | 76.0 | 39.6 |
| GPT-4 | – | 84.1 | 76.1 | 37.8 |

Last updated March 5, 2024

If you would like a model added, you can reach out or open an issue in the GitHub repo.

Citing ParEval

@inproceedings{nichols2024parallel,
      title = {Can Large Language Models Write Parallel Code?},
      author = {Daniel Nichols and Joshua H. Davis and Zhaojun Xie and
                Arjun Rajaram and Abhinav Bhatele},
      year = {2024},
      booktitle = {Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing},
      series = {HPDC '24},
      publisher = {Association for Computing Machinery},
      address = {New York, NY, USA}
}