Preprints

arXiv

Daniel Nichols, Pranav Polasam, Harshitha Menon, Aniruddha Marathe, Todd Gamblin, Abhinav Bhatele

arXiv:2404.18864

Abstract  

Optimizing scientific software is a difficult task because codebases are often large and complex, and performance can depend upon several factors, including the algorithm, its implementation, and the hardware. Causes of poor performance can originate from disparate sources and be difficult to diagnose. Recent years have seen a multitude of works that use large language models (LLMs) to assist in software development tasks. However, these tools are trained to model the distribution of code as text, and are not specifically designed to understand performance aspects of code. In this work, we introduce a reinforcement-learning-based methodology to align the outputs of code LLMs with performance. This allows us to build upon the current code modeling capabilities of LLMs and extend them to generate better-performing code. We demonstrate that our fine-tuned model improves the expected speedup of generated code over base models for a set of benchmark tasks from 0.9 to 1.6 for serial code and from 1.9 to 4.5 for OpenMP code.
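
To make the alignment signal concrete: the reward for a generated code variant can be its measured speedup over a reference implementation. The sketch below illustrates that idea only; the helper names and timing harness are hypothetical, and the paper's actual reward formulation and RL algorithm are not reproduced here.

```python
import subprocess
import time

def run_and_time(executable, args):
    """Run a benchmark binary and return its wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run([executable, *args], check=True, capture_output=True)
    return time.perf_counter() - start

def speedup_reward(reference_exe, generated_exe, args):
    """Hypothetical RL reward: speedup of the LLM-generated code over a
    reference; values above 1.0 mean the generated code is faster. Failed
    builds or incorrect outputs would receive a zero reward in practice."""
    t_ref = run_and_time(reference_exe, args)
    t_gen = run_and_time(generated_exe, args)
    return t_ref / t_gen
```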

arXiv

Josh Davis, Pranav Sivaraman, Isaac Minn, Konstantinos Parasyris, Harshitha Menon, Giorgis Georgakoudis, Abhinav Bhatele

Abstract  

Ensuring high productivity in scientific software development necessitates developing and maintaining a single codebase that can run efficiently on a range of accelerator-based supercomputing platforms. While prior work has investigated the performance portability of a few selected proxy applications or programming models, this paper provides a comprehensive study of a range of proxy applications implemented in the major programming models suitable for GPU-based platforms. We present and analyze performance results across NVIDIA and AMD GPU hardware currently deployed in leadership-class computing facilities using a representative range of scientific codes and several programming models – CUDA, HIP, Kokkos, RAJA, OpenMP, OpenACC, and SYCL. Based on the specific characteristics of applications tested, we include recommendations to developers on how to choose the right programming model for their code. We find that Kokkos, RAJA, and SYCL in particular offer the most promise empirically as performance portable programming models. These results provide a comprehensive evaluation of the extent to which each programming model for heterogeneous systems provides true performance portability in real-world usage.

arXiv

Onur Cankur, Aditya Tomar, Daniel Nichols, Connor Scully-Allison, Katherine E. Isaacs, Abhinav Bhatele

Abstract  

Developing efficient parallel applications is critical to advancing scientific development but requires significant performance analysis and optimization. Performance analysis tools help developers manage the increasing complexity and scale of performance data, but often rely on the user to manually explore low-level data and are rigid in how the data can be manipulated. We propose a Python-based API, Chopper, which provides high-level and flexible performance analysis for both single and multiple executions of parallel applications. Chopper facilitates performance analysis and reduces developer effort by providing configurable high-level methods for common performance analysis tasks such as calculating load imbalance, hot paths, scalability bottlenecks, correlation between metrics and CCT nodes, and causes of performance variability within a robust and mature Python environment that provides fluid access to lower-level data manipulations. We demonstrate how Chopper allows developers to quickly and succinctly explore performance and identify issues across applications such as AMG, Laghos, LULESH, Quicksilver and Tortuga.
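
As a rough illustration of the workflow the abstract describes, a Chopper-style analysis might look like the following. The reader function comes from Hatchet, on which Chopper builds, but the high-level method names are assumptions taken from the abstract rather than the exact Chopper API.

```python
import hatchet as ht

# Load a profile into a Hatchet GraphFrame (Chopper builds on Hatchet).
gf = ht.GraphFrame.from_caliper("lulesh-512ranks.json")

# Hypothetical high-level calls named after the analyses listed in the
# abstract; consult the Chopper documentation for the exact signatures.
imbalance = gf.load_imbalance()   # per-node load imbalance across ranks
hot_path  = gf.hot_path()         # most expensive call path in the CCT
```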

arXiv

Siddharth Singh, Prajwal Singhania, Aditya Ranjan, Zack Sating, Abhinav Bhatele

Abstract  

Heavy communication, in particular collective operations, can become a critical performance bottleneck in scaling the training of billion-parameter neural networks to large-scale parallel systems. This paper introduces a four-dimensional (4D) approach to optimize communication in parallel training. This 4D approach is a hybrid of 3D tensor and data parallelism, and is implemented in the AxoNN framework. In addition, we employ two key strategies to further minimize communication overheads. First, we aggressively overlap expensive collective operations (reduce-scatter, all-gather, and all-reduce) with computation. Second, we develop an analytical model to identify high-performing configurations within the large search space defined by our 4D algorithm. This model empowers practitioners by simplifying the tuning process for their specific training workloads. When training an 80-billion parameter GPT on 1024 GPUs of Perlmutter, AxoNN surpasses Megatron-LM, a state-of-the-art framework, by a significant 26%. Additionally, it achieves 57% of the theoretical peak FLOP/s, or 182 PFLOP/s in total.
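
To illustrate the role of the analytical model, the toy sketch below enumerates ways to split a GPU count across three tensor-parallel dimensions plus data parallelism and ranks them with a placeholder cost function. AxoNN's actual model accounts for message sizes, link bandwidths, and communication/computation overlap, none of which is captured here.

```python
def factorizations(n, dims):
    """Yield every way to write n as an ordered product of `dims` positive integers."""
    if dims == 1:
        yield (n,)
        return
    for d in range(1, n + 1):
        if n % d == 0:
            for rest in factorizations(n // d, dims - 1):
                yield (d,) + rest

def toy_comm_cost(gx, gy, gz, gd):
    """Placeholder cost: assume tensor-parallel groups communicate more per
    step than data-parallel replicas do (illustrative weights only)."""
    return 3.0 * (gx + gy + gz) + 1.0 * gd

def best_configs(num_gpus, top_k=5):
    """Return the top_k (x, y, z, data) decompositions by estimated cost."""
    configs = factorizations(num_gpus, 4)
    return sorted(configs, key=lambda c: toy_comm_cost(*c))[:top_k]

print(best_configs(1024))
```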

arXiv

Daniel Nichols, Siddharth Singh, Shu-Huai Lin, Abhinav Bhatele

Abstract  

The field of deep learning has witnessed a remarkable shift towards extremely compute- and memory-intensive neural networks. These newer, larger models have enabled researchers to advance state-of-the-art tools across a variety of fields. This phenomenon has spurred the development of algorithms for distributed training of neural networks over a larger number of hardware accelerators. In this paper, we discuss and compare current state-of-the-art frameworks for large-scale distributed deep learning. First, we survey current practices in distributed learning and identify the different types of parallelism used. Then, we present empirical results comparing their performance on large image and language training tasks. Additionally, we address their statistical efficiency and memory consumption behavior. Based on our results, we discuss algorithmic and implementation aspects of each framework that hinder performance.

arXiv

Ian J. Costello, Abhinav Bhatele

Abstract  

Performance of a parallel code running on a large supercomputer can vary significantly from one run to another even when the executable and its input parameters are left unchanged. Such variability can occur due to perturbation of the computation and/or communication in the code. In this paper, we investigate the case of performance variability arising due to network effects on supercomputers that use a dragonfly topology – specifically, Cray XC systems equipped with the Aries interconnect. We perform post-mortem analysis of network hardware counters, profiling output, job queue logs, and placement information, all gathered from periodic representative application runs. We investigate the causes of performance variability using deviation prediction and recursive feature elimination. Additionally, using time-stepped performance data of individual applications, we train machine learning models that can forecast the execution time of future time steps.

Papers

NeurIPS 24

Prajwal Singhania, Siddharth Singh, Shwai He, Soheil Feizi, Abhinav Bhatele

[To Appear] Proceedings of the Thirty-Eighth Annual Conference on Neural Information Processing Systems (Main Conference Track)

Abstract  

Inference on large language models can be expensive in terms of the compute and memory costs involved, especially when long sequence lengths are used. In particular, the self-attention mechanism used in such models contributes significantly to these costs, which has resulted in several recent works that propose sparse attention approximations for inference. In this work, we propose to approximate the self-attention computation by focusing on the dimensionality of key vectors computed in the attention block. Our analysis reveals that the key vectors lie in a significantly lower-dimensional space, consistently across several datasets and models. Exploiting this observation, we propose Loki, a novel sparse attention method that ranks and selects tokens in the KV-cache based on attention scores computed in low-dimensional space. Our evaluations show that Loki is able to maintain the efficacy of the models better than other popular approximation methods, while speeding up the attention computation due to reduced data movement (load/store) and compute costs.
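
The observation that keys occupy a low-dimensional subspace suggests a simple approximation: rank cached tokens by attention scores computed after projecting queries and keys onto a few principal components, then attend exactly over only the top-ranked tokens. The NumPy sketch below shows that idea for a single query; it is a simplification, not the authors' implementation (in practice the projection basis is precomputed offline and the KV-cache handling is more involved).

```python
import numpy as np

def topk_attention_lowdim(q, K, V, d_low=32, k=256):
    """Approximate single-query attention over a KV-cache.

    q: (d,) query vector; K, V: (n, d) cached keys/values.
    Tokens are ranked with scores computed in a d_low-dimensional PCA
    subspace of the keys, and exact attention is done over the top-k only.
    """
    # PCA basis of the keys (in practice computed offline on calibration data).
    Kc = K - K.mean(axis=0)
    _, _, vt = np.linalg.svd(Kc, full_matrices=False)
    P = vt[:d_low].T                       # (d, d_low) projection matrix

    # Cheap scores in the low-dimensional space.
    approx_scores = (K @ P) @ (P.T @ q)    # (n,)
    top = np.argsort(approx_scores)[-k:]   # indices of the k highest scores

    # Exact attention restricted to the selected tokens.
    s = K[top] @ q / np.sqrt(K.shape[1])
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V[top]
```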

SC 2024

Daniel Nichols, Harshitha Menon, Todd Gamblin, Abhinav Bhatele

Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, November 2024

Abstract  

Modern scientific software in high performance computing is often complex, and many parallel applications and libraries depend on several other software packages or libraries. Developers and users of such complex software often use package managers for building them. Package managers depend on humans to codify package constraints (for dependency and version selection), and the dependency graph of a software package can often become large (hundreds of vertices). In addition, package constraints often become outdated and inconsistent over time since they are maintained by different people for different packages, which is a laborious task. This can cause package builds to fail for certain package configurations. In this paper, we propose a methodology that uses historical build results to assist a package manager in selecting the best versions of package dependencies with an aim to improve the likelihood of a successful build. We utilize a machine learning (ML) model to predict the probability of build outcomes of different configurations of packages in the Spack package manager. When evaluated on common scientific software stacks, this ML model-based approach is able to achieve a 13% higher success rate in building packages than the default version selection mechanism in Spack.
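
A hedged sketch of the kind of model the abstract describes: a classifier trained on historical build outcomes whose predicted success probability is used to rank candidate version choices. The feature encoding and data below are synthetic placeholders, and the integration with Spack's version selection is not shown.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Toy historical data: each row encodes the versions chosen for a package and
# its dependencies; y records whether that configuration built successfully.
rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(500, 8))     # 8 encoded version choices
y = (X[:, 0] + X[:, 3] < 7).astype(int)   # synthetic stand-in for build outcome

model = GradientBoostingClassifier().fit(X, y)

# Rank candidate configurations by predicted probability of a successful build.
candidates = rng.integers(0, 5, size=(10, 8))
probs = model.predict_proba(candidates)[:, 1]
best = candidates[np.argmax(probs)]
print("most promising configuration:", best, "p(success) =", probs.max())
```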

SC 2024

Siddharth Singh, Prajwal Singhania, Aditya Ranjan, John Kirchenbauer, Jonas Geiping, Yuxin Wen, Neel Jain, Abhimanyu Hans, Manli Shu, Aditya Tomar, Tom Goldstein, Abhinav Bhatele

Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, November 2024

Abstract  

Training and fine-tuning large language models (LLMs) with hundreds of billions to trillions of parameters requires tens of thousands of GPUs, and a highly scalable software stack. In this work, we present a novel 3D tensor + data parallel hybrid algorithm implemented in a highly scalable, portable, open-source framework called AxoNN. We describe several performance optimizations in AxoNN to improve kernel performance, overlap non-blocking collectives with computation, and performance modeling to choose performance optimal configurations. These have resulted in unprecedented scaling and peak flop/s for training of GPT-style transformer models on Frontier. While the performance of LLMs scales with the number of trainable parameters, the increase in parameter counts unlocked by AxoNN also heightens privacy risks brought forth by an enhanced ability to rapidly memorize training data. This can lead to potential disclosure of sensitive or private information when deployed. We highlight this side effect of scale through a series of experiments exploring catastrophic memorization by LLMs and methods to prevent it in the many billion-parameter regime.

HPDC 2024

Daniel Nichols, Joshua H. Davis, Zhaojun Xie, Arjun Rajaram, Abhinav Bhatele

Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing

Abstract  

Large language models are increasingly becoming a popular tool for software development. Their ability to model and generate source code has been demonstrated in a variety of contexts, including code completion, summarization, translation, and lookup. However, they often struggle to generate code for complex programs. In this paper, we study the capabilities of state-of-the-art language models to generate parallel code. In order to evaluate language models, we create a benchmark, ParEval, consisting of prompts that represent 420 different coding tasks related to scientific and parallel computing. We use ParEval to evaluate the effectiveness of several state-of-the-art open- and closed-source language models on these tasks. We introduce novel metrics for evaluating the performance of generated code, and use them to explore how well each large language model performs for 12 different computational problem types and six different parallel programming models.

ISC 2024

Daniel Nichols, Aniruddha Marathe, Harshitha Menon, Todd Gamblin, Abhinav Bhatele

Proceedings of the ISC High Performance Conference

Abstract  

Parallel programs in high performance computing (HPC) continue to grow in complexity and scale in the exascale era. The diversity in hardware and parallel programming models makes developing, optimizing, and maintaining parallel software even more burdensome for developers. One way to alleviate some of these burdens is with automated development and analysis tools. Such tools can perform complex and/or remedial tasks for developers that increase their productivity and decrease the chance for error. Until recently, such tools for code development and performance analysis have been limited in the complexity of tasks they can perform, especially for parallel programs. However, with recent advancements in language modeling and the availability of large amounts of open-source code-related data, these tools have started to utilize predictive language models to automate more complex tasks. In this paper, we show how large language models (LLMs) can be applied to tasks specific to high performance and scientific codes. We introduce a new dataset of HPC and scientific codes and use it to fine-tune several pre-trained models. We compare several pre-trained LLMs on HPC-related tasks and introduce a new model, HPC-Coder, fine-tuned on parallel codes. In our experiments, we show that this model can auto-complete HPC functions where generic models cannot, decorate for loops with OpenMP pragmas, and model performance changes in scientific application repositories as well as programming competition solutions.

IPDPS 2024

Daniel Nichols, Alexander Movsesyan, Jae-Seung Yeom, Abhik Sarkar, Daniel Milroy, Tapasya Patki, Abhinav Bhatele

Proceedings of the IEEE International Parallel & Distributed Processing Symposium. IEEE Computer Society, May 2024

Abstract  

A variety of hardware architectures, both CPUs and GPUs, are used today to build supercomputers and parallel clusters. Oftentimes, users can choose which hardware platform they want to run on. Modern scientific workflows have multiple computational tasks, and each task may be better suited for a different architecture in terms of performance. Deciding where to run an application or workflow task is not straightforward because of the complexity of applications and hardware architectures, which makes performance predictions challenging. Hence, modeling the performance of scientific applications across a variety of architectures is important for achieving the best performance. In this paper, we present a machine-learning-based methodology to model the relative performance of applications across multiple architectures using hardware performance counters. Our machine learning model can predict the relative performance of an application with a mean absolute error of 0.11, and can be used effectively to make performance-aware and multi-architecture scheduling decisions, reducing makespan by up to 20%.
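
As a rough illustration only (the paper's feature set, model, and scheduling policy are not reproduced), a regressor can map hardware performance counters to the relative performance expected on another architecture, and that prediction can drive a placement decision:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy training data: rows are counter vectors (cache misses, FLOPs, bandwidth, ...)
# from a reference system; the target is relative performance on a GPU node.
rng = np.random.default_rng(1)
counters = rng.random((200, 6))
relative_perf = 2.0 * counters[:, 1] - counters[:, 4] + rng.normal(0, 0.05, 200)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(counters, relative_perf)

def pick_architecture(task_counters):
    """Schedule the task on the GPU partition only if the predicted
    relative performance clears a (hypothetical) threshold."""
    predicted = model.predict(task_counters.reshape(1, -1))[0]
    return "gpu-partition" if predicted > 1.0 else "cpu-partition"

print(pick_architecture(rng.random(6)))
```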

MSR 2024

Harshitha Menon, Daniel Nichols, Abhinav Bhatele, Todd Gamblin

21st International Conference on Mining Software Repositories (MSR '24)

Abstract  

Software has become increasingly complex, with a typical application depending on tens or hundreds of packages. Finding compatible versions and build configurations of these packages is challenging. This paper presents a method to learn the likelihood of software build success, and techniques for leveraging this information to guide dependency solvers to better software configurations. We leverage the heavily parameterized package recipes from the Spack package manager to produce a training data set of builds, and we use Graph Neural Networks to learn whether a given package configuration will build successfully or not. We apply our tool to the U.S. Exascale Computing Project’s software stack. We demonstrate its effectiveness in predicting whether a given package will build successfully. We show that our technique can be used to improve the solutions generated by dependency solvers, reducing the need for developers to find working builds by trial and error.
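
A minimal sketch of a graph neural network classifier of the kind the abstract describes, written with PyTorch Geometric; the architecture, node features, and graph below are illustrative placeholders rather than the authors' model.

```python
import torch
import torch.nn.functional as F
from torch_geometric.data import Data
from torch_geometric.nn import GCNConv, global_mean_pool

class BuildOutcomeGNN(torch.nn.Module):
    """Predicts P(build succeeds) from a package dependency graph whose node
    features encode package and version choices."""
    def __init__(self, num_features, hidden=64):
        super().__init__()
        self.conv1 = GCNConv(num_features, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.readout = torch.nn.Linear(hidden, 1)

    def forward(self, x, edge_index, batch):
        x = F.relu(self.conv1(x, edge_index))
        x = F.relu(self.conv2(x, edge_index))
        graph_embedding = global_mean_pool(x, batch)   # one vector per graph
        return torch.sigmoid(self.readout(graph_embedding)).squeeze(-1)

# A tiny dependency graph: 3 packages, edges point from dependent to dependency.
graph = Data(x=torch.rand(3, 16),
             edge_index=torch.tensor([[0, 0], [1, 2]], dtype=torch.long))
batch = torch.zeros(3, dtype=torch.long)   # all nodes belong to one graph
prob_success = BuildOutcomeGNN(num_features=16)(graph.x, graph.edge_index, batch)
```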

ICS 2023

Siddharth Singh, Olatunji Ruwase, Ammar Ahmad Awan, Samyam Rajbhandari, Yuxiong He, Abhinav Bhatele

Proceedings of the ACM International Conference on Supercomputing, June 2023

Abstract  

Mixture-of-Experts (MoE) is a neural network architecture that adds sparsely activated expert blocks to a base model, increasing the number of parameters without impacting computational costs. However, current distributed deep learning frameworks are limited in their ability to train high-quality MoE models with large base models. In this work, we present DeepSpeed-TED, a novel, three-dimensional, hybrid parallel algorithm that combines data, tensor, and expert parallelism to enable the training of MoE models with 4–8x larger base models than the current state-of-the-art. We also describe memory optimizations in the optimizer step, and communication optimizations that eliminate unnecessary data movement. We implement our approach in DeepSpeed and achieve speedups of 26% over a baseline (i.e. without our communication optimizations) when training a 40 billion parameter MoE model (6.7 billion base model with 16 experts) on 128 V100 GPUs.

IPDPS 2023

Josh Davis, Justin Shafner, Daniel Nichols, Nathan Grube, Pino Martin, Abhinav Bhatele

Proceedings of the IEEE International Parallel & Distributed Processing Symposium. IEEE Computer Society, May 2023

Abstract  

Accurate modeling of turbulent hypersonic flows has tremendous scientific and commercial value, and applies to atmospheric flight, supersonic combustion, materials discovery and climate prediction. In this paper, we describe our experiences in extending the capabilities of and modernizing CRoCCo, an MPI-based, CPU-only compressible computational fluid dynamics code. We extend CRoCCo to support block-structured adaptive mesh refinement using a highly-scalable AMR library, AMReX, and add support for a fully curvilinear solver. We also port the computational kernels in CRoCCo to GPUs to enable scaling on modern exascale systems. We present our techniques for overcoming performance challenges and evaluate the updated code, CRoCCo v2.0, on the Summit system, demonstrating a 6x to 44x speedup over the CPU-only version.

IPDPS 2023

Siddharth Singh, Abhinav Bhatele

Proceedings of the IEEE International Parallel & Distributed Processing Symposium. IEEE Computer Society, May 2023

Abstract  

Parallel training of neural networks at scale is challenging due to significant overheads arising from communication. Recently, deep learning researchers have developed a variety of pruning algorithms that are capable of pruning (i.e. setting to zero) 80-90% of the parameters in a neural network to yield sparse subnetworks that equal the accuracy of the unpruned parent network. In this work, we propose a novel approach that exploits these sparse subnetworks to optimize the memory utilization and communication in two popular algorithms for parallel deep learning, namely data and inter-layer parallelism. We integrate our approach into AxoNN, a highly scalable framework for parallel deep learning that relies on data and inter-layer parallelism, and demonstrate the reduction in communication time and memory utilization. On 512 NVIDIA V100 GPUs, our optimizations reduce the memory consumption of a 2.7 billion parameter model by 74%, and the total communication time by 40%, thus providing an overall speedup of 34% over AxoNN, 32% over DeepSpeed-3D and 46% over Sputnik, a sparse matrix computation baseline.

TVCG 2023

Suraj Kesavan, Harsh Bhatia, Abhinav Bhatele, Stephanie Brink, Olga Pearce, Todd Gamblin, Peer-Timo Bremer, Kwan-Liu Ma

IEEE Transactions on Visualization and Computer Graphics

Abstract  

Optimizing the performance of large-scale parallel codes is critical for efficient utilization of computing resources. Code developers often explore various execution parameters, such as hardware configurations, system software choices, and application parameters, and are interested in detecting and understanding bottlenecks in different executions. They often collect hierarchical performance profiles represented as call graphs, which combine performance metrics with their execution contexts. The crucial task of exploring multiple call graphs together is tedious and challenging because of the many structural differences in the execution contexts and significant variability in the collected performance metrics (e.g., execution runtime). In this paper, we present Ensemble CallFlow to support the exploration of ensembles of call graphs using new types of visualizations, analysis, graph operations, and features. We introduce ensemble-Sankey, a new visual design that combines the strengths of resource-flow (Sankey) and box-plot visualization techniques. Whereas the resource-flow visualization can easily and intuitively describe the graphical nature of the call graph, the box plots overlaid on the nodes of Sankey convey the performance variability within the ensemble. Our interactive visual interface provides linked views to help explore ensembles of call graphs, e.g., by facilitating the analysis of structural differences, and identifying similar or distinct call graphs. We demonstrate the effectiveness and usefulness of our design through case studies on large-scale parallel codes.

ISC 2022

Onur Cankur, Abhinav Bhatele

International Conference on High Performance Computing. Springer, May 2022

Abstract  

Call graphs generated by profiling tools are critical to dissecting the performance of parallel programs. Although many mature and sophisticated profiling tools record call graph data, each tool is different in its runtime overheads, memory consumption, and output data generated. In this work, we perform a comparative evaluation study on the call graph data generation capabilities of several popular profiling tools – Caliper, HPCToolkit, TAU, and Score-P. We evaluate their runtime overheads, memory consumption, and generated call graph data (size and quality). We perform this comparison empirically by running several proxy applications, AMG, LULESH, and Quicksilver on a parallel cluster. Our results show which tool results in the lowest overheads and produces more meaningful call graph data under different conditions.

IPDPS 2022

Daniel Nichols, Aniruddha Marathe, Kathleen Shoga, Todd Gamblin, Abhinav Bhatele

Proceedings of the IEEE International Parallel & Distributed Processing Symposium. IEEE Computer Society, May 2022

Abstract  

Resource contention on high performance computing (HPC) platforms can lead to significant variation in application performance. When several jobs experience such large variations in run times, it can lead to less efficient use of system resources. It can also lead to users over-estimating their job’s expected run time, which degrades the efficiency of the system scheduler. Mitigating performance variation on HPC platforms benefits end users and also enables more efficient use of system resources. In this paper, we present a pipeline for collecting and analyzing system and application performance data for jobs submitted over long periods of time. We use a set of machine learning (ML) models trained on this data to classify performance variation using current system counters. Additionally, we present a new resource-aware job scheduling algorithm that utilizes the ML pipeline and current system state to mitigate job variation. We evaluate our pipeline, ML models, and scheduler using various proxy applications and an actual implementation of the scheduler on an Infiniband-based fat-tree cluster.

IPDPS 2022

Siddharth Singh, Abhinav Bhatele

Proceedings of the IEEE International Parallel & Distributed Processing Symposium. IEEE Computer Society, May 2022

Abstract  

In the last few years, the memory requirements to train state-of-the-art neural networks have far exceeded the DRAM capacities of modern hardware accelerators. This has necessitated the development of efficient algorithms to train these neural networks in parallel on large-scale GPU-based clusters. Since computation is relatively inexpensive on modern GPUs, designing and implementing extremely efficient communication in these parallel training algorithms is critical for extracting the maximum performance. This paper presents AxoNN, a parallel deep learning framework that exploits asynchrony and message-driven execution to schedule neural network operations on each GPU, thereby reducing GPU idle time and maximizing hardware efficiency. By using the CPU memory as a scratch space for offloading data periodically during training, AxoNN is able to reduce GPU memory consumption by four times. This allows us to increase the number of parameters per GPU by four times, thus reducing the amount of communication and increasing performance by over 13%. When tested against large transformer models with 12-100 billion parameters on 48-384 NVIDIA Tesla V100 GPUs, AxoNN achieves a per-GPU throughput of 49.4-54.78% of theoretical peak and reduces the training time by 22-37 days (15-25% speedup) as compared to the state-of-the-art.

IPCCC 2021

Saptarshi Bhowmik, Nikhil Jain, Xin Yuan, Abhinav Bhatele

Proceedings of the IEEE International Performance Computing and Communications Conference, October 2021

Abstract  

Compute nodes on high performance computing (HPC) platforms are increasingly equipped with multiple GPUs. This results in increased computational capacity per node, and reduction in the total number of nodes or endpoints in the system. This trend changes the computation and communication balance in comparison to pre-GPU era HPC platforms, which warrants a new study of hardware architectural parameters. In this work, we leverage the end-to-end system simulation capabilities of TraceR-CODES and study the impact of several hardware design parameters on the performance of realistic HPC workloads. We focus on three crucial hardware parameters: (1) number of GPUs per node, (2) network link bandwidth, and (3) network interface controller (NIC) scheduling policies, in the context of two popular network topologies – fat-tree and dragonfly.

TVCG 2021

Huu Tan Nguyen, Abhinav Bhatele, Nikhil Jain, Suraj Kesavan, Harsh Bhatia, Todd Gamblin, Kwan-Liu Ma, Peer-Timo Bremer

IEEE Transactions on Visualization and Computer Graphics

Abstract  

Calling context trees (CCTs) couple performance metrics with call paths, helping understand the execution and performance of parallel programs. To identify performance bottlenecks, programmers and performance analysts visually explore CCTs to form and validate hypotheses regarding degraded performance. However, due to the complexity of parallel programs, existing visual representations do not scale to applications running on a large number of processors. We present CallFlow, an interactive visual analysis tool that provides a high-level overview of CCTs together with semantic refinement operations to progressively explore CCTs. Using a flow-based metaphor, we visualize a CCT by treating execution time as a resource spent during the call chain, and demonstrate the effectiveness of our design with case studies on large-scale, production simulation codes.

ProTools 2020

Stephanie Brink, Ian Lumsden, Connor Scully-Allison, Katy Williams, Olga Pearce, Todd Gamblin, Michela Taufer, Katherine E. Isaacs, Abhinav Bhatele

Proceedings of the Workshop on Programming and Performance Visualization Tools. November 2020

Abstract  

Performance analysis is critical for pinpointing bottlenecks in applications. Many different profilers exist to instrument parallel programs on HPC systems, however, there is a lack of tools for analyzing such data programmatically. Hatchet, an open-source Python library, can read profiling data from several tools, and enables the user to perform a variety of analyses on hierarchical performance data. In this paper, we augment Hatchet to support new features: a call path query language for representing call path-related queries, visualizations for displaying and interacting with the structured data, and new operations for performing analysis on multiple datasets. Additionally, we present performance optimizations in Hatchet’s HPCToolkit reader and the unify operation to enable scalable analysis of large profiles.

CLUSTER 2020

Sascha Hunold, Abhinav Bhatele, George Bosilca, Peter Knees

Proceedings of the IEEE Cluster Conference. September 2020.

Abstract  

The Message Passing Interface (MPI) defines the semantics of data communication operations, while the implementing libraries provide several parameterized algorithms for each operation. Each algorithm of an MPI collective operation may work best on a particular system and may be dependent on the specific communication problem. Internally, MPI libraries employ heuristics to select the best algorithm for a given communication problem when being called by an MPI application. The majority of MPI libraries allow users to override the default algorithm selection, enabling the tuning of this selection process. The problem then becomes how to select the best possible algorithm for a specific case automatically. In this paper, we address the algorithm selection problem for MPI collective communication operations. To solve this problem, we propose an auto-tuning framework for collective MPI operations based on machine-learning techniques. First, we execute a set of benchmarks of an MPI library and its entire set of collective algorithms. Second, for each algorithm, we fit a performance model by applying regression learners. Last, we use the regression models to predict the best possible (fastest) algorithm for an unseen communication problem. We evaluate our approach for different MPI libraries and several parallel machines. The experimental results show that our approach outperforms the standard algorithm selection heuristics, which are hard-coded into the MPI libraries, by a significant margin.
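
In the spirit of the approach described (fit one regression model per collective algorithm from benchmark data, then pick the predicted-fastest algorithm for an unseen problem), here is a minimal sketch with synthetic data; the features, algorithms, and latency curves are placeholders, not measurements from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Benchmark features: (message size, communicator size), both log2-scaled.
features = np.column_stack([
    rng.uniform(2, 24, 2000),    # log2(message size in bytes)
    rng.uniform(1, 10, 2000),    # log2(number of ranks)
])

# One runtime model per allreduce algorithm (synthetic stand-ins for measurements).
measured = {
    "recursive_doubling": 1.0 * features[:, 0] + 5.0 * features[:, 1],
    "ring":               2.0 * features[:, 0] + 1.0 * features[:, 1],
}

models = {
    name: RandomForestRegressor(n_estimators=100, random_state=0).fit(features, times)
    for name, times in measured.items()
}

def select_algorithm(log_msg_size, log_num_ranks):
    """Return the algorithm with the lowest predicted runtime."""
    x = np.array([[log_msg_size, log_num_ranks]])
    return min(models, key=lambda name: models[name].predict(x)[0])

print(select_algorithm(20.0, 8.0))   # large message on 256 ranks
```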

ICS 2020

Jaemin Choi, David Richards, Laxmikant V. Kale, Abhinav Bhatele

Proceedings of the International Conference on Supercomputing. June 2020.

Abstract  

With the growing number of GPU-based supercomputing platforms and GPU-enabled applications, the ability to accurately model the performance of such applications is becoming increasingly important. Most current performance models for GPU-enabled applications are limited to single node performance. In this work, we propose a methodology for end-to-end performance modeling of distributed GPU applications. Our work strives to create performance models that are both accurate and easily applicable to any distributed GPU application. We combine trace-driven simulation of MPI communication using the TraceR-CODES framework with a profiling-based roofline model for GPU kernels. We make substantial modifications to these models to capture the complex effects of both on-node and off-node networks in today’s multi-GPU supercomputers. We validate our model against empirical data from GPU platforms and also vary tunable parameters of our model to observe how they might affect application performance.
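
The GPU-kernel side of the methodology is roofline-based, and the standard roofline bound on attainable performance is easy to state. The sketch below uses illustrative peak numbers for a V100-class GPU (assumed, not taken from the paper):

```python
def roofline_time(flops, bytes_moved, peak_flops, peak_bw):
    """Predicted kernel time (s) under a simple roofline model:
    attainable FLOP/s = min(peak_flops, peak_bw * arithmetic_intensity)."""
    intensity = flops / bytes_moved                      # FLOP per byte
    attainable = min(peak_flops, peak_bw * intensity)    # FLOP/s
    return flops / attainable

# Illustrative numbers: ~7.8 TFLOP/s FP64 peak, ~900 GB/s memory bandwidth.
t = roofline_time(flops=2e12, bytes_moved=4e11,
                  peak_flops=7.8e12, peak_bw=9.0e11)
print(f"predicted kernel time: {t*1e3:.1f} ms")
```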

IPDPS 2020

Abhinav Bhatele, Jayaraman J. Thiagarajan, Taylor Groves, Rushil Anirudh, Staci A. Smith, Brandon Cook, David K. Lowenthal

Proceedings of the IEEE International Parallel & Distributed Processing Symposium. IEEE Computer Society, May 2020

Abstract  

Performance of a parallel code running on a large supercomputer can vary significantly from one run to another even when the executable and its input parameters are left unchanged. Such variability can occur due to perturbation of the computation and/or communication in the code. In this paper, we investigate the case of performance variability arising due to network effects on supercomputers that use a dragonfly topology – specifically, Cray XC systems equipped with the Aries interconnect. We perform post-mortem analysis of network hardware counters, profiling output, job queue logs, and placement information, all gathered from periodic representative application runs. We investigate the causes of performance variability using deviation prediction and recursive feature elimination. Additionally, using time-stepped performance data of individual applications, we train machine learning models that can forecast the execution time of future time steps.

IPDPS 2020

Harshitha Menon, Abhinav Bhatele, Todd Gamblin

Proceedings of the IEEE International Parallel & Distributed Processing Symposium. IEEE Computer Society, May 2020

Abstract  

High performance computing applications, runtimes, and platforms are becoming more configurable to enable applications to obtain better performance. As a result, users are increasingly presented with a multitude of options to configure application-specific as well as platform-level parameters. The combined effect of different parameter choices on application performance is difficult to predict, and an exhaustive evaluation of this combinatorial parameter space is practically infeasible. One approach to parameter selection is a user-guided exploration of a part of the space. However, such an ad hoc exploration of the parameter space can result in suboptimal choices. Therefore, an automatic approach that can efficiently explore the parameter space is needed. In this paper, we propose HiPerBOt, a Bayesian optimization based configuration selection framework to identify application and platform-level parameters that result in high performing configurations. We demonstrate the effectiveness of HiPerBOt in tuning parameters that include compiler flags, runtime settings, and application-level options for several parallel codes, including Kripke, Hypre, LULESH, and OpenAtom.
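
The abstract describes Bayesian optimization over mixed application and platform parameters. Below is a generic sketch using scikit-optimize; HiPerBOt itself is not reproduced, and the parameter space and objective are hypothetical placeholders (a real objective would build and run the application).

```python
from skopt import gp_minimize
from skopt.space import Categorical, Integer

# Hypothetical configuration space: a compiler flag, an OpenMP thread count,
# and an application-level blocking factor.
space = [
    Categorical(["-O2", "-O3"], name="opt_flag"),
    Integer(1, 64, name="num_threads"),
    Integer(8, 256, name="block_size"),
]

def run_application(params):
    """Placeholder objective: would build/run the code with `params` and
    return the measured execution time; a synthetic function stands in here."""
    opt_flag, num_threads, block_size = params
    penalty = 0.0 if opt_flag == "-O3" else 5.0
    return penalty + abs(num_threads - 32) * 0.1 + abs(block_size - 128) * 0.01

result = gp_minimize(run_application, space, n_calls=30, random_state=0)
print("best configuration:", result.x, "objective:", result.fun)
```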

HiPC 2019

Giorgis Georgakoudis, Nikhil Jain, Takatsugu Ono, Koji Inoue, Shinobu Miwa, Abhinav Bhatele

Proceedings of the IEEE International Conference on High Performance Computing, December 2019

Abstract  

Interconnection networks grow larger as supercomputers include more nodes and require higher bandwidth for performance. This scaling significantly increases the fraction of power consumed by the network, by increasing the number of network components (links and switches). Typically, network links consume power continuously once they are turned on. However, recent proposals for energy efficient interconnects have introduced low-power operation modes for periods when network links are idle. Low-power operation can increase messaging time when switching a link from low-power to active operation. We extend the TraceR-CODES network simulator for power modeling to evaluate the impact of energy efficient networking on power and performance. Our evaluation presents the first study on both single-job and multi-job execution to realistically simulate power consumption and performance under congestion for a large-scale HPC network. Results on several workloads consisting of HPC proxy applications show that single-job and multi-job execution favor different modes of low-power operation for achieving significant power savings at the cost of minimal performance degradation.

SC 2019

Abhinav Bhatele, Stephanie Brink, Todd Gamblin

Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, November 2019

Abstract  

Performance analysis is critical for eliminating scalability bottlenecks in parallel codes. There are many profiling tools that can instrument codes and gather performance data. However, analytics and visualization tools that are general, easy to use, and programmable are limited. In this paper, we focus on the analytics of structured profiling data, such as that obtained from calling context trees or nested region timers in code. We present a set of techniques and operations that build on the pandas data analysis library to enable analysis of parallel profiles. We have implemented these techniques in a Python-based library called Hatchet that allows structured data to be filtered, aggregated, and pruned. Using performance datasets obtained from profiling parallel codes, we demonstrate performing common performance analysis tasks reproducibly with a few lines of Hatchet code. Hatchet brings the power of modern data science tools to bear on performance analysis.
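
A short example of the style of analysis the abstract describes, using Hatchet's GraphFrame readers and its pandas-backed dataframe. The file path is a placeholder, and the metric column names depend on the profiler that produced the data.

```python
import hatchet as ht

# Read an HPCToolkit database into a GraphFrame; readers for Caliper and
# other profilers are also available (the path here is a placeholder).
gf = ht.GraphFrame.from_hpctoolkit("lulesh-hpctoolkit-database")

# Per-node metrics live in a pandas DataFrame indexed by call tree nodes.
print(gf.dataframe.sort_values("time", ascending=False).head())

# Filter to expensive nodes, then squash to rewire the pruned call tree.
filtered = gf.filter(lambda row: row["time"] > 1e5)
print(filtered.squash().tree())
```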

Posters

SC 2023

Alexander Movsesyan, Rakrish Dhakal, Aditya Ranjan, Jordan Marry, Onur Cankur, Abhinav Bhatele

Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’23, November 2023

SC 2023

Daniel Nichols, Aniruddha Marathe, Harshitha Menon, Todd Gamblin, Abhinav Bhatele

Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’23, November 2023

SC 2021

Yiheng Xu, Kathryn Mohror, Hariharan Devarajan, Cameron Stanavige, Abhinav Bhatele

Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’21, November 2021