Performance Portability

With the increasing dominance of heterogeneous platforms built around GPUs from different vendors, using HPC machines efficiently for scientific applications demands performance portability: the ability of a code to achieve high performance across a range of target platforms. Developers might address this by maintaining multiple specialized versions of an application, each optimized for a particular target system, but this divergence in code creates a significant maintenance and development burden. There is therefore a strong need for programming models that enable single-source performance portability in scientific applications. Choosing a programming model for porting a CPU-only application to GPUs is, however, a major commitment, requiring significant time for developer training and programming. If a programming model turns out to be ill-suited for an application, resulting in unacceptable performance, then that investment is wasted.

Thus, we are working to address the developer’s dilemma of choosing a programming model by providing a comprehensive empirical study of programming models in terms of their ability to enable performance portability on GPU-based platforms. We use a variety of proxy applications implemented in the most popular programming models and test them across multiple leadership-class production supercomputers.

Programming Models and Proxy Applications

We evaluate CUDA, HIP, SYCL, Kokkos, RAJA, OpenMP, and OpenACC using the following proxy apps:

  • BabelStream: a memory bandwidth benchmark with five kernels: copy, add, mul, triad, and dot (a triad sketch follows this list).
  • CloverLeaf: a structured-grid solver for the 2D compressible Euler equations from the Mantevo suite. Note that we do not have CloverLeaf OpenACC results available at this time.
  • XSBench: a proxy app for the OpenMC Monte Carlo neutron transport code, representing its macroscopic cross-section lookup kernel.
  • su3_bench: a proxy app for MILC Lattice QCD, implementing a complex matrix-matrix multiplication routine.
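
To illustrate what single-source portability looks like in practice, the sketch below shows a BabelStream-style triad kernel (a[i] = b[i] + scalar * c[i]) written once in Kokkos; the same source can be built against the CUDA, HIP, or OpenMP backends. This is a minimal illustration rather than code taken from the BabelStream implementation, and the array length and scalar value are arbitrary.

    // Minimal sketch of a BabelStream-style triad kernel written once in Kokkos.
    // The same source targets NVIDIA or AMD GPUs depending on the backend Kokkos
    // was built with. Array length and scalar are illustrative values only.
    #include <Kokkos_Core.hpp>

    int main(int argc, char* argv[]) {
      Kokkos::initialize(argc, argv);
      {
        const int n = 1 << 25;       // illustrative array length
        const double scalar = 0.4;   // illustrative scalar constant

        // Device-resident arrays; the memory space follows the enabled backend.
        Kokkos::View<double*> a("a", n), b("b", n), c("c", n);
        Kokkos::deep_copy(b, 1.0);
        Kokkos::deep_copy(c, 2.0);

        // Triad: one fused multiply-add per element; bandwidth-bound.
        Kokkos::parallel_for("triad", n, KOKKOS_LAMBDA(const int i) {
          a(i) = b(i) + scalar * c(i);
        });
        Kokkos::fence();
      }
      Kokkos::finalize();
      return 0;
    }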

Evaluation Platforms and Compilers

We present results on the following platforms:

  • Summit (OLCF): IBM POWER9 CPUs with NVIDIA V100 (16 GB) GPUs.
  • Perlmutter (NERSC): AMD EPYC 7763 CPUs with NVIDIA A100 (40 GB) GPUs.
  • Corona (LLNL): AMD Rome CPUs with AMD MI50 (32 GB) GPUs.
  • Frontier (OLCF): AMD optimized 3rd Gen EPYC CPUs with AMD MI250X (64 GB) GPUs.

The following table lists the compilers we used for each combination of system and programming model. Note that we were unable to get an OpenACC compiler working on the Corona platform.

Prog. Model    Summit          Perlmutter      Corona          Frontier
CUDA           GCC 11.2.0      GCC 11.2.0      N/A             N/A
HIP            XL 16.1.1-10    GCC 11.2.0      LLVM 16         GCC 11.2.0
Kokkos         GCC 11.2.0      GCC 11.2.0      GCC 11.2.0      GCC 11.2.0
RAJA           GCC 11.2.0      GCC 11.2.0      GCC 11.2.0      GCC 11.2.0
OpenMP         NVHPC 22.7      NVHPC 22.7      LLVM 16         LLVM 17 (2023-08-09)
OpenACC        NVHPC 22.7      NVHPC 22.7      N/A             Clacc 2023-08-15
SYCL           DPC++ 2023.03   DPC++ 2023.03   DPC++ 2023.03   DPC++ 2023.03

Results

Below we present the latest performance results for the four proxy applications in seven programming models across the four hardware platforms. Each data point is an average over three trials, with one warm-up execution performed before measurement. Note that all results include data movement to and from the device as required.
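
As a concrete illustration of this measurement protocol, the sketch below performs one untimed warm-up execution followed by three timed trials and reports their average. The triad() function is a hypothetical CPU stand-in for a proxy-app run, not code taken from any of the benchmark harnesses; a real trial would launch the GPU kernels and synchronize with the device before stopping the clock.

    // Minimal sketch of the measurement protocol: one untimed warm-up execution,
    // then the average of three timed trials. The triad() function below is a
    // hypothetical CPU stand-in for a proxy-app run, not code from the harnesses.
    #include <chrono>
    #include <cstddef>
    #include <iostream>
    #include <vector>

    // Hypothetical stand-in workload; a real trial would launch the GPU kernels
    // and synchronize with the device before returning.
    void triad(std::vector<double>& a, const std::vector<double>& b,
               const std::vector<double>& c, double scalar) {
      for (std::size_t i = 0; i < a.size(); ++i)
        a[i] = b[i] + scalar * c[i];
    }

    int main() {
      constexpr int trials = 3;
      const std::size_t n = std::size_t(1) << 24;
      std::vector<double> a(n), b(n, 1.0), c(n, 2.0);

      triad(a, b, c, 0.4);  // warm-up execution, excluded from the measurement

      double total_seconds = 0.0;
      for (int t = 0; t < trials; ++t) {
        const auto start = std::chrono::steady_clock::now();
        triad(a, b, c, 0.4);
        const auto stop = std::chrono::steady_clock::now();
        total_seconds += std::chrono::duration<double>(stop - start).count();
      }

      std::cout << "average over " << trials << " trials: "
                << total_seconds / trials << " s\n";
      return 0;
    }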

[Performance plots: BabelStream copy, BabelStream dot, and XSBench. Plots for BabelStream triad, CloverLeaf, and su3_bench are not available at this time.]


Related Publications

[1] Josh Davis et al., "Evaluating Performance Portability of GPU Programming Models," Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '23), November 2023.