Publications

arXiv

LLM Inference Beyond a Single Node: From Bottlenecks to Mitigations with Fast All-Reduce Communication

Prajwal Singhania, Siddharth Singh, Lannie Dalton Hough, Akarsh Srivastava, Harshitha Menon', Charles Fredrick Jekel, Abhinav Bhatele

arXiv. 2511.09557.

Abstract

As large language models (LLMs) continue to grow in size, distributed inference has become increasingly important. Model-parallel strategies must now efficiently scale not only across multiple GPUs but also across multiple nodes. In this work, we present a detailed performance study of multi-node distributed inference using LLMs on GPU-based supercomputers. We conduct experiments with several state-of-the-art inference engines alongside YALIS, a research-oriented prototype engine designed for controlled experimentation. We analyze the strong-scaling behavior of different model-parallel schemes and identify key bottlenecks. Since all-reduce operations are a common performance bottleneck, we develop NVRAR, a hierarchical all-reduce algorithm based on recursive doubling with NVSHMEM. NVRAR achieves up to 1.9x-3.6x lower latency than NCCL for message sizes between 128 KB and 2 MB on HPE Slingshot and InfiniBand interconnects. Integrated into YALIS, NVRAR achieves up to a 1.72x reduction in end-to-end batch latency for the Llama 3.1 405B model in multi-node decode-heavy workloads using tensor parallelism.

arXiv

Performance-Aligned LLMs for Generating Fast Code

Daniel Nichols, Pranav Polasam, Harshitha Menon, Aniruddha Marathe, Todd Gamblin, Abhinav Bhatele

arXiv. 2404.18864.

Abstract

Optimizing scientific software is a difficult task because codebases are often large and complex, and performance can depend upon several factors including the algorithm, its implementation, and hardware among others. Causes of poor performance can originate from disparate sources and be difficult to diagnose. Recent years have seen a multitude of work that use large language models (LLMs) to assist in software development tasks. However, these tools are trained to model the distribution of code as text, and are not specifically designed to understand performance aspects of code. In this work, we introduce a reinforcement learning based methodology to align the outputs of code LLMs with performance. This allows us to build upon the current code modeling capabilities of LLMs and extend them to generate better performing code. We demonstrate that our fine-tuned model improves the expected speedup of generated code over base models for a set of benchmark tasks from 0.9 to 1.6 for serial code and 1.9 to 4.5 for OpenMP code.

arXiv

Automated Programmatic Performance Analysis of Parallel Programs

Onur Cankur, Aditya Tomar, Daniel Nichols, Connor Scully-Allison, Katherine E. Isaacs, Abhinav Bhatele

Abstract

Developing efficient parallel applications is critical to advancing scientific development but requires significant performance analysis and optimization. Performance analysis tools help developers manage the increasing complexity and scale of performance data, but often rely on the user to manually explore low-level data and are rigid in how the data can be manipulated. We propose a Python-based API, Chopper, which provides high-level and flexible performance analysis for both single and multiple executions of parallel applications. Chopper facilitates performance analysis and reduces developer effort by providing configurable high-level methods for common performance analysis tasks such as calculating load imbalance, hot paths, scalability bottlenecks, correlation between metrics and CCT nodes, and causes of performance variability within a robust and mature Python environment that provides fluid access to lower-level data manipulations. We demonstrate how Chopper allows developers to quickly and succinctly explore performance and identify issues across applications such as AMG, Laghos, LULESH, Quicksilver and Tortuga.

SC 2025

Plexus: Taming Billion-edge Graphs with 3D Parallel Full-graph GNN Training

Aditya Ranjan, Siddharth Singh, Cunyang Wei, Abhinav Bhatele

Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, November 2025

Abstract

Graph neural networks (GNNs) leverage the connectivity and structure of real-world graphs to learn intricate properties and relationships between nodes. Many real-world graphs exceed the memory capacity of a GPU due to their sheer size, and training GNNs on such graphs requires techniques such as mini-batch sampling to scale. The alternative approach of distributed full-graph training suffers from high communication overheads and load imbalance due to the irregular structure of graphs. We propose a three-dimensional (3D) parallel approach for full-graph training that tackles these issues and scales to billion-edge graphs. In addition, we introduce optimizations such as a double permutation scheme for load balancing, and a performance model to predict the optimal 3D configuration of our parallel implementation – plexus. We evaluate plexus on six different graph datasets and show scaling results on up to 2048 GPUs of Perlmutter, and 1024 GPUs of Frontier. plexus achieves unprecedented speedups of 2.3-12.5x over prior state of the art, and a reduction in time-to-solution by 5.2-8.7x on Perlmutter and 7.0-54.2x on Frontier.

ICPP 2025

ParEval-Repo: A Benchmark Suite for Evaluating LLMs with Repository-level HPC Translation Tasks

Joshua H. Davis, Daniel Nichols, Ishan Khillan, Abhinav Bhatele

Proceedings of the 54th International Conference on Parallel Processing, September 2025.

Abstract

GPGPU architectures have become significantly more diverse in recent years, which has led to an emergence of a variety of specialized programming models and software stacks to support them. Portable programming models exist, but they require significant developer effort to port to and optimize for different hardware architectures. Large language models (LLMs) may help to reduce this programmer burden. In this paper, we present a novel benchmark and testing framework, ParEval-Repo, which can be used to evaluate the efficacy of LLM-based approaches in automatically translating entire codebases across GPGPU execution models. ParEval-Repo includes several scientific computing and AI mini-applications in a range of programming models and levels of repository complexity. We use ParEval-Repo to evaluate a range of state-of-the-art open-source and commercial LLMs, with both a non-agentic and a top-down agentic approach. We assess code generated by the LLMs and approaches in terms of compilability, functional correctness, categories of build errors, and the cost of translation in terms of the number of inference tokens. Our results demonstrate that LLM translation of scientific applications is feasible for small programs but difficulty with generating functional build systems and cross-file dependencies pose challenges in scaling to larger codebases.

SEA 2025

Elias-Fano Compression for Space-Efficient Rank and Select Structures

Lannie Dalton Hough, Abhinav Bhatele

Proceedings of the 23rd International Symposium on Experimental Algorithms, July 2025

Abstract

Bit vectors are an important component in many data structures. Such data structures are used in a variety of applications and domains including databases, search engines, and computational biology. Many use cases depend on being able to perform rank and/or select queries on the bit vector. No existing rank and select structure enabling these queries is most efficient both for space and for time; there is a tradeoff between the two. In practice, the smallest rank and select data structures, cs-poppy and pasta-flat, impose a space overhead of 3.51%, or 3.125% if only rank needs to be supported. In this paper, we present a new data structure, orzo, which reduces the overhead of the rank component by a further 26.5%. We preserve desirable cache-centric design decisions made in prior work, which allows us to minimize the performance penalty of creating a smaller data structure.

ICS 2025

Taking GPU Programming Models to Task for Performance Portability

Joshua H. Davis, Pranav Sivaraman, Joy Kitson, Konstantinos Parasyris, Harshitha Menon, Isaac Minn, Giorgis Georgakoudis, Abhinav Bhatele

Proceedings of the International Conference on Supercomputing. June 2025.

Abstract

Portability is critical to ensuring high productivity in developing and maintaining scientific software as the diversity in on-node hardware architectures increases. While several programming models provide portability for diverse GPU systems, they don’t make any guarantees about performance portability. In this work, we explore several programming models – CUDA, HIP, Kokkos, RAJA, OpenMP, OpenACC, and SYCL, to assess the consistency of their performance across NVIDIA and AMD GPUs. We use five proxy applications from different scientific domains, create implementations where missing, and use them to present a comprehensive comparative evaluation of the performance portability of these programming models. We provide a Spack scripting-based methodology to ensure reproducibility of experiments conducted in this work. Finally, we analyze the reasons for why some programming models underperform in certain scenarios and in some cases, present performance optimizations to the proxy applications.

IPDPS 2025

Pandemics in silico: Scaling an Agent-Based Simulation on Realistic Social Contact Networks

Joy Kitson, Ian Costello, Jiangzhuo Chen, Diego Jiménez, Stefan Hoops, Henning Mortveit, Esteban Meneses, Jae-Seung Yeom, Madhav V. Marathe, Abhinav Bhatele

Proceedings of the IEEE International Parallel & Distributed Processing Symposium. IEEE Computer Society, June 2025

Abstract

Preventing the spread of infectious diseases requires implementing interventions at various levels of government and evaluating the potential impact and efficacy of those preemptive measures. Agent-based modeling can be used for detailed studies of the spread of such diseases in the presence of possible interventions. The computational cost of modeling epidemic diffusion through large social contact networks necessitates the use of parallel algorithms and resources in order to achieve quick turnaround times. In this work, we present Loimos, a scalable parallel framework for simulating epidemic diffusion. Loimos uses a hybrid of time-stepping and discrete event simulation to model disease spread, and is implemented on top of Charm++, an asynchronous, many-task runtime that enables over-decomposition and adaptive overlap of computation and communication. We demonstrate that Loimos is able to achieve significant speedups while scaling to large core counts. In particular, Loimos is able to simulate 200 days of a COVID-19 outbreak on a digital twin of California in about 42 seconds, for an average of 4.6 billion traversed edges per second (TEPS), using 4096 cores on Perlmutter at NERSC.

ISC 2025

HPC-Coder-v2: Studying Code LLMs Across Low-Resource Parallel Languages

Aman Chaturvedi, Daniel Nichols, Siddharth Singh, Abhinav Bhatele

Proceedings of the ISC High Performance Conference

Abstract

Large Language Model (LLM) based coding tools have been tremendously successful as software development assistants, yet they are often designed for general purpose programming tasks and perform poorly for more specialized domains such as high performance computing. Creating specialized models and tools for these domains is crucial towards gaining the benefits of LLMs in areas such as HPC. While previous work has explored HPC-specific models, LLMs still struggle to generate parallel code and it is not at all clear what hurdles are still holding back these LLMs and what must be done to overcome them. In this work, we conduct an in-depth study along the many axes of fine-tuning a specialized HPC LLM in order to better understand the challenges. Based on our findings we fine-tune and evaluate a specialized HPC LLM that is shown to be the best performing open-source code LLM for parallel code generation to date.

LLM4Code 2025

The Shortcomings of Code LLMs in Modeling Code Properties

Srivishnu Pyda, Daniel Nichols, Abhinav Bhatele

Proceedings of the Second International Workshop on Large Language Models for Code

Abstract

Large language models have rapidly taken over software development tools and are now being used to generate code, write documentation, and even fix GitHub issues. Despite their success, many studies across various fields of computer science have shown that these models often struggle to reason about code properties, such as performance, security, etc. In this paper, we demonstrate the limitations of text-based learning for code properties and how structured code representations are more effective for understanding some code properties. We evaluate over several code benchmarks and demonstrate the limitations of the internal code representation within large language models.

NeurIPS 24

Loki: Low-rank Keys for Efficient Sparse Attention

Prajwal Singhania, Siddharth Singh, Shwai He, Soheil Feizi, Abhinav Bhatele

Proceedings of the Thirty-Eighth Annual Conference on Neural Information Processing Systems

Abstract

Inference on large language models can be expensive in terms of the compute and memory costs involved, especially when long sequence lengths are used. In particular, the self-attention mechanism used in such models contributes significantly to these costs, which has resulted in several recent works that propose sparse attention approximations for inference. In this work, we propose to approximate the self-attention computation by focusing on the dimensionality of key vectors computed in the attention block. Our analysis reveals that the key vectors lie in a significantly lower-dimensional space, consistently across several datasets and models. Exploiting this observation, we propose Loki, a novel sparse attention method that ranks and selects tokens in the KV-cache based on attention scores computed in low-dimensional space. Our evaluations show that Loki is able to maintain the efficacy of the models better than other popular approximation methods, while speeding up the attention computation due to reduced data movement (load/store) and compute costs.

NeurIPS 24

Be like a Goldfish, Don’t Memorize! Mitigating Memorization in Generative LLMs

Abhimanyu Hans, Yuxin Wen, Neel Jain, John Kirchenbauer, Hamid Kazemi, Prajwal Singhania, Siddharth Singh, Gowthami Somepalli, Jonas Geiping, Abhinav Bhatele, Tom Goldstein

Proceedings of the Thirty-Eighth Annual Conference on Neural Information Processing Systems

Abstract

Large language models can memorize and repeat their training data, causing privacy and copyright risks. To mitigate memorization, we introduce a subtle modification to the next-token training objective that we call the goldfish loss. During training, randomly sampled subsets of tokens are excluded from the loss computation. These dropped tokens are not memorized by the model, which prevents verbatim reproduction of a complete chain of tokens from the training set. We run extensive experiments training billion-scale Llama-2 models, both pre-trained and trained from scratch, and demonstrate significant reductions in extractable memorization with little to no impact on downstream benchmarks.

SC 2024

A Probabilistic Approach To Selecting Build Configurations in Package Managers

Daniel Nichols, Harshitha Menon, Todd Gamblin, Abhinav Bhatele

Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, November 2024

Abstract

Modern scientific software in high performance computing is often complex, and many parallel applications and libraries depend on several other software or libraries. Developers and users of such complex software often use package managers for building them. Package managers depend on hu- mans to codify package constraints (for dependency and version selection), and the dependency graph of a software package can often become large (hundreds of vertices). In addition, package constraints often become outdated and inconsistent over time since they are maintained by different people for different packages, which is a laborious task. This can result in package builds to fail for certain package configurations. In this paper, we propose a methodology that uses historical build results to assist a package manager in selecting the best versions of package dependencies with an aim to improve the likelihood of a successful build. We utilize a machine learning (ML) model to predict the probability of build outcomes of different configurations of packages in the Spack package manager. When evaluated on common scientific software stacks, this ML model-based approach is able to achieve a 13% higher success rate in building packages than the default version selection mechanism in Spack.

SC 2024

Democratizing AI: Open-source Scalable LLM Training on GPU-based Supercomputers

Siddharth Singh, Prajwal Singhania, Aditya Ranjan, John Kirchenbauer, Jonas Geiping, Yuxin Wen, Neel Jain, Abhimanyu Hans, Manli Shu, Aditya Tomar, Tom Goldstein, Abhinav Bhatele

Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, November 2024

Abstract

Training and fine-tuning large language models (LLMs) with hundreds of billions to trillions of parameters requires tens of thousands of GPUs, and a highly scalable software stack. In this work, we present a novel 3D tensor + data parallel hybrid algorithm implemented in a highly scalable, portable, open-source framework called AxoNN. We describe several performance optimizations in AxoNN to improve kernel performance, overlap non-blocking collectives with computation, and performance modeling to choose performance optimal configurations. These have resulted in unprecedented scaling and peak flop/s for training of GPT-style transformer models on Frontier. While the performance of LLMs scales with the number of trainable parameters, the increase in parameter counts unlocked by AxoNN also heightens privacy risks brought forth by an enhanced ability to rapidly memorize training data. This can lead to potential disclosure of sensitive or private information when deployed. We highlight this side effect of scale through a series of experiments exploring catastrophic memorization by LLMs and methods to prevent it in the many billion-parameter regime.

HPDC 2024

Can Large Language Models Write Parallel Code?

Daniel Nichols, Joshua H. Davis, Zhaojun Xie, Arjun Rajaram, Abhinav Bhatele

Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing

Abstract

Large language models are increasingly becoming a popular tool for software development. Their ability to model and generate source code has been demonstrated in a variety of contexts, including code completion, summarization, translation, and lookup. However, they often struggle to generate code for complex programs. In this paper, we study the capabilities of state-of-the-art language models to generate parallel code. In order to evaluate language models, we create a benchmark, ParEval, consisting of prompts that represent 420 different coding tasks related to scientific and parallel computing. We use ParEval to evaluate the effectiveness of several state-of-the-art open- and closed-source language models on these tasks. We introduce novel metrics for evaluating the performance of generated code, and use them to explore how well each large language model performs for 12 different computational problem types and six different parallel programming models.

ISC 2024

HPC-Coder: Modeling Parallel Programs using Large Language Models

Daniel Nichols, Aniruddha Marathe, Harshitha Menon, Todd Gamblin, Abhinav Bhatele

Proceedings of the ISC High Performance Conference

Abstract

Parallel programs in high performance computing (HPC) continue to grow in complexity and scale in the exascale era. The diversity in hardware and parallel programming models make developing, optimizing, and maintaining parallel software even more burdensome for developers. One way to alleviate some of these burdens is with automated development and analysis tools. Such tools can perform complex and/or remedial tasks for developers that increase their productivity and decrease the chance for error. Until recently, such tools for code development and performance analysis have been limited in the complexity of tasks they can perform, especially for parallel programs. However, with recent advancements in language modeling, and the availability of large amounts of open-source code related data, these tools have started to utilize predictive language models to automate more complex tasks. In this paper, we show how large language models (LLMs) can be applied to tasks specific to high performance and scientific codes. We introduce a new dataset of HPC and scientific codes and use it to fine-tune several pre-trained models. We compare several pre-trained LLMs on HPC-related tasks and introduce a new model, HPC-Coder, fine-tuned on parallel codes. In our experiments, we show that this model can auto-complete HPC functions where generic models cannot, decorate for loops with OpenMP pragmas, and model performance changes in scientific application repositories as well as programming competition solutions.

IPDPS 2024

Predicting Cross-Architecture Performance of Parallel Programs

Daniel Nichols, Alexander Movsesyan, Jae-Seung Yeom, Abhik Sarkar, Daniel Milroy, Tapasya Patki, Abhinav Bhatele

Proceedings of the IEEE International Parallel & Distributed Processing Symposium. IEEE Computer Society, May 2024

Abstract

A variety of hardware architectures, both CPUs and GPUs, are used today to build supercomputers and parallel clusters. Often times, users can choose which hardware platform they want to run on. Modern scientific workflows have multiple computational tasks, and each task may be better suited for a different architecture in terms of performance. Deciding where to run an application or workflow task is not straightforward because of the complexity of applications, and hardware ar- chitectures, which makes performance predictions challenging. Hence, modeling the performance of scientific applications across a variety of architectures is important for achieving the best performance. In this paper, we present a machine learning based methodology to model the relative performance of applica- tions across multiple architectures using hardware performance counters. Our machine learning model can predict the relative performance of an application with a mean absolute error of 0.11, and can be used effectively to make performance-aware and multi-architecture scheduling decisions, reducing makespan by up to 20%.

MSR 2024

Learning to Predict and Improve Build Successes in Package Ecosystems

Harshitha Menon, Daniel Nichols, Abhinav Bhatele, Todd Gamblin

21st International Conference on Mining Software Repositories (MSR '24)

Abstract

Software has become increasingly complex, with a typical applica- tion depending on tens or hundreds of packages. Finding compatible versions and build configurations of these packages is challenging. This paper presents a method to learn the likelihood of software build success, and techniques for leveraging this information to guide dependency solvers to better software configurations. We leverage the heavily parameterized package recipes from the Spack package manager to produce a training data set of builds, and we use Graph Neural Networks to learn whether a given package con- figuration will build successfully or not. We apply our tool to the U.S. Exascale Computing Project’s software stack. We demonstrate its effectiveness in predicting whether a given package will build successfully. We show that our technique can be used to improve the solutions generated by dependency solvers, reducing the need for developers to find working builds by trial and error.

ICS 2023

A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training

Siddharth Singh, Olatunji Ruwase, Ammar Ahmad Awan, Samyam Rajbhandari, Yuxiong He, Abhinav Bhatele

Proceedings of the ACM International Conference on Supercomputing, June 2023

Abstract

Mixture-of-Experts (MoE) is a neural network architecture that adds sparsely activated expert blocks to a base model, increasing the number of parameters without impacting computational costs. However, current distributed deep learning frameworks are limited in their ability to train high-quality MoE models with large base models. In this work, we present DeepSpeed-TED, a novel, three- dimensional, hybrid parallel algorithm that combines data, tensor, and expert parallelism to enable the training of MoE models with 4–8x larger base models than the current state-of-the-art. We also describe memory optimizations in the optimizer step, and commu- nication optimizations that eliminate unnecessary data movement. We implement our approach in DeepSpeed and achieve speedups of 26% over a baseline (i.e. without our communication optimizations) when training a 40 billion parameter MoE model (6.7 billion base model with 16 experts) on 128 V100 GPUs.

IPDPS 2023

Porting a Computational Fluid Dynamics Code with AMR to Large-scale GPU Platforms

Joshua H. Davis, Justin Shafner, Daniel Nichols, Nathan Grube, Pino Martin, Abhinav Bhatele

Proceedings of the IEEE International Parallel & Distributed Processing Symposium. IEEE Computer Society, May 2023

Abstract

Accurate modeling of turbulent hypersonic flows has tremendous scientific and commercial value, and applies to atmospheric flight, supersonic combustion, materials discovery and climate prediction. In this paper, we describe our experiences in extending the capabilities of and modernizing CRoCCo, an MPI-based, CPU-only compressible computational fluid dynamics code. We extend CRoCCo to support block-structured adaptive mesh refinement using a highly-scalable AMR library, AMReX, and add support for a fully curvilinear solver. We also port the computational kernels in CRoCCo to GPUs to enable scaling on modern exascale systems. We present our techniques for overcoming performance challenges and evaluate the updated code, CRoCCo v2.0, on the Summit system, demonstrating a 6x to 44x speedup over the CPU-only version.

IPDPS 2023

Exploiting Sparsity in Pruned Neural Networks to Optimize Large Model Training

Siddharth Singh, Abhinav Bhatele

Proceedings of the IEEE International Parallel & Distributed Processing Symposium. IEEE Computer Society, May 2023

Abstract

Parallel training of neural networks at scale is challenging due to significant overheads arising from communication. Recently, deep learning researchers have developed a variety of pruning algorithms that are capable of pruning (i.e. setting to zero) 80-90% of the parameters in a neural network to yield sparse subnetworks that equal the accuracy of the unpruned parent network. In this work, we propose a novel approach that exploits these sparse subnetworks to optimize the memory utilization and communication in two popular algorithms for parallel deep learning namely - data and inter-layer parallelism. We integrate our approach into AxoNN, a highly scalable framework for parallel deep learning that relies on data and inter-layer parallelism, and demonstrate the reduction in communication time and memory utilization. On 512 NVIDIA V100 GPUs, our optimizations reduce the memory consumption of a 2.7 billion parameter model by 74%, and the total communication time by 40%, thus providing an overall speedup of 34% over AxoNN, 32% over DeepSpeed-3D and 46% over Sputnik, a sparse matrix computation baseline.

TVCG 2023

Scalable comparative visualization of ensembles of call graphs

Suraj Kesavan, Harsh Bhatia, Abhinav Bhatele, Stephanie Brink, Olga Pearce, Todd Gamblin, Peer-Timo Bremer, Kwan-Liu Ma

IEEE Transactions on Visualization and Computer Graphics

Abstract

Optimizing the performance of large-scale parallel codes is critical for efficient utilization of computing resources. Code developers often explore various execution parameters, such as hardware configurations, system software choices, and application parameters, and are interested in detecting and understanding bottlenecks in different executions. They often collect hierarchical performance profiles represented as call graphs, which combine performance metrics with their execution contexts. The crucial task of exploring multiple call graphs together is tedious and challenging because of the many structural differences in the execution contexts and significant variability in the collected performance metrics (e.g., execution runtime). In this paper, we present Ensemble CallFlow to support the exploration of ensembles of call graphs using new types of visualizations, analysis, graph operations, and features. We introduce ensemble-Sankey, a new visual design that combines the strengths of resource-flow (Sankey) and box-plot visualization techniques. Whereas the resource-flow visualization can easily and intuitively describe the graphical nature of the call graph, the box plots overlaid on the nodes of Sankey convey the performance variability within the ensemble. Our interactive visual interface provides linked views to help explore ensembles of call graphs, e.g., by facilitating the analysis of structural differences, and identifying similar or distinct call graphs. We demonstrate the effectiveness and usefulness of our design through case studies on large-scale parallel codes.

ISC 2022

Comparative Evaluation of Call Graph Generation by Profiling Tools

Onur Cankur, Abhinav Bhatele

International Conference on High Performance Computing. Springer, May 2022

Abstract

Call graphs generated by profiling tools are critical to dissecting the performance of parallel programs. Although many mature and sophisticated profiling tools record call graph data, each tool is different in its runtime overheads, memory consumption, and output data generated. In this work, we perform a comparative evaluation study on the call graph data generation capabilities of several popular profiling tools – Caliper, HPCToolkit, TAU, and Score-P. We evaluate their runtime overheads, memory consumption, and generated call graph data (size and quality). We perform this comparison empirically by running several proxy applications, AMG, LULESH, and Quicksilver on a parallel cluster. Our results show which tool results in the lowest overheads and produces more meaningful call graph data under different conditions.

IPDPS 2022

Resource Utilization Aware Job Scheduling to Mitigate Performance Variability

Daniel Nichols, Aniruddha Marathe, Kathleen Shoga, Todd Gamblin, Abhinav Bhatele

Proceedings of the IEEE International Parallel & Distributed Processing Symposium. IEEE Computer Society, May 2022

Abstract

Resource contention on high performance computing (HPC) platforms can lead to significant variation in application performance. When several jobs experience such large variations in run times, it can lead to less efficient use of system resources. It can also lead to users over-estimating their job’s expected run time, which degrades the efficiency of the system scheduler. Mitigating performance variation on HPC platforms benefits end users and also enables more efficient use of system resources. In this paper, we present a pipeline for collecting and analyzing system and application performance data for jobs submitted over long periods of time. We use a set of machine learning (ML) models trained on this data to classify performance variation using current system counters. Additionally, we present a new resource-aware job scheduling algorithm that utilizes the ML pipeline and current system state to mitigate job variation. We evaluate our pipeline, ML models, and scheduler using various proxy applications and an actual implementation of the scheduler on an Infiniband-based fat-tree cluster.

IPDPS 2022

AxoNN: An asynchronous, message-driven parallel framework for extreme-scale deep learning

Siddharth Singh, Abhinav Bhatele

Proceedings of the IEEE International Parallel & Distributed Processing Symposium. IEEE Computer Society, May 2022

Abstract

In the last few years, the memory requirements to train state-of-the-art neural networks have far exceeded the DRAM capacities of modern hardware accelerators. This has necessitated the development of efficient algorithms to train these neural networks in parallel on large-scale GPU-based clusters. Since computation is relatively inexpensive on modern GPUs, designing and implementing extremely efficient communication in these parallel training algorithms is critical for extracting the maximum performance. This paper presents AxoNN, a parallel deep learning framework that exploits asynchrony and message-driven execution to schedule neural network operations on each GPU, thereby reducing GPU idle time and maximizing hardware efficiency. By using the CPU memory as a scratch space for offloading data periodically during training, AxoNN is able to reduce GPU memory consumption by four times. This allows us to increase the number of parameters per GPU by four times, thus reducing the amount of communication and increasing performance by over 13%. When tested against large transformer models with 12-100 billion parameters on 48-384 NVIDIA Tesla V100 GPUs, AxoNN achieves a per-GPU throughput of 49.4-54.78% of theoretical peak and reduces the training time by 22-37 days (15-25% speedup) as compared to the state-of-the-art.

IPCCC 2021

A Simulation Study of Hardware Parameters for Future GPU-based HPC Platforms

Saptarshi Bhowmik, Nikhil Jain, Xin Yuan, Abhinav Bhatele

Proceedings of the IEEE International Performance Computing and Communications Conference, October 2021

Abstract

Compute nodes on high performance computing (HPC) platforms are increasingly equipped with multiple GPUs. This results in increased computational capacity per node, and reduction in the total number of nodes or endpoints in the system. This trend changes the computation and communication balance in comparison to pre-GPU era HPC platforms, which warrants a new study of hardware architectural parameters. In this work, we leverage the end-to-end system simulation capabilities of TraceR-CODES and study the impact of several hardware design parameters on the performance of realistic HPC workloads. We focus on three crucial hardware parameters: (1) number of GPUs per node, (2) network link bandwidth, and (3) network interface controller (NIC) scheduling policies, in the context of two popular network topologies – fat-tree and dragonfly.

TVCG 2021

Visualizing Hierarchical Performance Profiles of Parallel Codes using CallFlow

Huu Tan Nguyen, Abhinav Bhatele, Nikhil Jain, Suraj Kesavan, Harsh Bhatia, Todd Gamblin, Kwan-Liu Ma, Peer-Timo Bremer

IEEE Transactions on Visualization and Computer Graphics

Abstract

Calling context trees (CCTs) couple performance metrics with call paths, helping understand the execution and performance of parallel programs. To identify performance bottlenecks, programmers and performance analysts visually explore CCTs to form and validate hypotheses regarding degraded performance. However, due to the complexity of parallel programs, existing visual representations do not scale to applications running on a large number of processors. We present CallFlow, an interactive visual analysis tool that provides a high-level overview of CCTs together with semantic refinement operations to progressively explore CCTs. Using a flow-based metaphor, we visualize a CCT by treating execution time as a resource spent during the call chain, and demonstrate the effectiveness of our design with case studies on large-scale, production simulation codes.

ProTools 2020

Usability and Performance Improvements in Hatchet

Stephanie Brink, Ian Lumsden, Connor Scully-Allison, Katy Williams, Olga Pearce, Todd Gamblin, Michela Taufer, Katherine E. Isaacs, Abhinav Bhatele

Proceedings of the Workshop on Programming and Performance Visualization Tools. November 2020

Abstract

Performance analysis is critical for pinpointing bottlenecks in applications. Many different profilers exist to instrument parallel programs on HPC systems, however, there is a lack of tools for analyzing such data programmatically. Hatchet, an open-source Python library, can read profiling data from several tools, and enables the user to perform a variety of analyses on hierarchical performance data. In this paper, we augment Hatchet to support new features: a call path query language for representing call path-related queries, visualizations for displaying and interacting with the structured data, and new operations for performing analysis on multiple datasets. Additionally, we present performance optimizations in Hatchet’s HPCToolkit reader and the unify operation to enable scalable analysis of large profiles.

CLUSTER 2020

Predicting MPI collective communication performance using machine learning

Sascha Hunold, Abhinav Bhatele, George Bosilca, Peter Knees

Proceedings of the IEEE Cluster Conference. September 2020.

Abstract

The Message Passing Interface (MPI) defines the semantics of data communication operations, while the imple- menting libraries provide several parameterized algorithms for each operation. Each algorithm of an MPI collective operation may work best on a particular system and may be dependent on the specific communication problem. Internally, MPI libraries employ heuristics to select the best algorithm for a given com- munication problem when being called by an MPI application. The majority of MPI libraries allow users to override the default algorithm selection, enabling the tuning of this selection process. The problem then becomes how to select the best possible algorithm for a specific case automatically. In this paper, we address the algorithm selection problem for MPI collective communication operations. To solve this problem, we propose an auto-tuning framework for collective MPI operations based on machine-learning techniques. First, we execute a set of benchmarks of an MPI library and its entire set of collective algorithms. Second, for each algorithm, we fit a performance model by applying regression learners. Last, we use the regression models to predict the best possible (fastest) algorithm for an unseen communication problem. We evaluate our approach for different MPI libraries and several parallel machines. The experimental results show that our approach outperforms the standard algorithm selection heuristics, which are hard-coded into the MPI libraries, by a significant margin.

ICS 2020

End-to-end performance modeling of distributed GPU applications

Jaemin Choi, David Richards, Laxmikant V. Kale, Abhinav Bhatele

Proceedings of the International Conference on Supercomputing. June 2020.

Abstract

With the growing number of GPU-based supercomputing platforms and GPU-enabled applications, the ability to accurately model the performance of such applications is becoming increasingly impor- tant. Most current performance models for GPU-enabled appli- cations are limited to single node performance. In this work, we propose a methodology for end-to-end performance modeling of distributed GPU applications. Our work strives to create perfor- mance models that are both accurate and easily applicable to any distributed GPU application. We combine trace-driven simulation of MPI communication using the TraceR-CODES framework with a profiling-based roofline model for GPU kernels. We make substan- tial modifications to these models to capture the complex effects of both on-node and off-node networks in today’s multi-GPU su- percomputers. We validate our model against empirical data from GPU platforms and also vary tunable parameters of our model to observe how they might affect application performance.

IPDPS 2020

The case of performance variability on dragonfly-based systems

Abhinav Bhatele, Jayaraman J. Thiagarajan, Taylor Groves, Rushil Anirudh, Staci A. Smith, Brandon Cook, David K. Lowenthal

Proceedings of the IEEE International Parallel & Distributed Processing Symposium. IEEE Computer Society, May 2020

Abstract

Performance of a parallel code running on a large supercomputer can vary significantly from one run to another even when the executable and its input parameters are left unchanged. Such variability can occur due to perturbation of the computation and/or communication in the code. In this paper, we investigate the case of performance variability arising due to network effects on supercomputers that use a dragonfly topology – specifically, Cray XC systems equipped with the Aries interconnect. We perform post-mortem analysis of network hardware counters, profiling output, job queue logs, and placement information, all gathered from periodic representative application runs. We investigate the causes of performance variability using deviation prediction and recursive feature elimination. Additionally, using time-stepped performance data of individual applications, we train machine learning models that can forecast the execution time of future time steps.

IPDPS 2020

Auto-tuning parameter choices in HPC applications using bayesian optimization

Harshitha Menon, Abhinav Bhatele, Todd Gamblin

Proceedings of the IEEE International Parallel & Distributed Processing Symposium. IEEE Computer Society, May 2020

Abstract

High performance computing applications, runtimes, and platforms are becoming more configurable to enable ap- plications to obtain better performance. As a result, users are increasingly presented with a multitude of options to configure application-specific as well as platform-level parameters. The combined effect of different parameter choices on application performance is difficult to predict, and an exhaustive evaluation of this combinatorial parameter space is practically infeasible. One approach to parameter selection is a user-guided exploration of a part of the space. However, such an ad hoc exploration of the parameter space can result in suboptimal choices. Therefore, an automatic approach that can efficiently explore the parameter space is needed. In this paper, we propose HiPerBOt, a Bayesian optimization based configuration selection framework to identify application and platform-level parameters that result in high performing configurations. We demonstrate the effectiveness of HiPerBOt in tuning parameters that include compiler flags, runtime settings, and application-level options for several parallel codes, including, Kripke, Hypre, LULESH, and OpenAtom.

HiPC 2019

Evaluating the impact of energy efficient networks on HPC workloads

Giorgis Georgakoudis, Nikhil Jain, Takatsugu Ono, Koji Inoue, Shinobu Miwa, Abhinav Bhatele

Proceedings of the IEEE International Conference on High Performance Computing, December 2019

Abstract

Interconnection networks grow larger as supercomputers include more nodes and require higher bandwidth for performance. This scaling significantly increases the fraction of power consumed by the network, by increasing the number of network components (links and switches). Typically, network links consume power continuously once they are turned on. However, recent proposals for energy efficient interconnects have introduced low-power operation modes for periods when network links are idle. Low-power operation can increase messaging time when switching a link from low-power to active operation. We extend the TraceR-CODES network simulator for power modeling to evaluate the impact of energy efficient networking on power and performance. Our evaluation presents the first study on both single-job and multi-job execution to realistically simulate power consumption and performance under congestion for a large-scale HPC network. Results on several workloads consisting of HPC proxy applications show that single-job and multi-job execution favor different modes of low power operation to have significant power savings at the cost of minimal performance degradation.

SC 2019

Hatchet: Pruning the overgrowth in parallel profiles

Abhinav Bhatele, Stephanie Brink, Todd Gamblin

Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, November 2019

Abstract

Performance analysis is critical for eliminating scalability bottlenecks in parallel codes. There are many profiling tools that can instrument codes and gather performance data. However, analytics and visualization tools that are general, easy to use, and programmable are limited. In this paper, we focus on the analytics of structured profiling data, such as that obtained from calling context trees or nested region timers in code. We present a set of techniques and operations that build on the pandas data analysis library to enable analysis of parallel profiles. We have implemented these techniques in a Python-based library called Hatchet that allows structured data to be filtered, aggregated, and pruned. Using performance datasets obtained from profiling parallel codes, we demonstrate performing common performance analysis tasks reproducibly with a few lines of Hatchet code. Hatchet brings the power of modern data science tools to bear on performance analysis.

PhD 2025

On Learning Behaviors of Parallel Code and Systems Across Modalities

Daniel Nichols

Department of Computer Science, University of Maryland, March 2025

Abstract

Performance modeling is an integral part of the research process for computational scientists. It enables them to understand how different factors contribute to the final runtime of an application. This understanding is crucial to developing efficient scientific applications and simulations. While important, performance modeling is difficult as there are a large number of factors that may contribute to final performance. Factors such as the algorithm, problem size, implementation, architecture, and systems software stack all impact performance in an often complex relationship. Analytical models can be employed to study these causal variables and performance, however, they are difficult to scale up to a large number of input variables. Additionally, the relationship between the causal variables and performance may be unknown or complex, making it challenging to derive an analytical model. Fortunately, machine learning (ML) can help address these challenges as ML algorithms excel at modeling unknown and complex relationships. Furthermore, ML-based performance models can handle a large number of input variables, making them ideal for modeling complex scientific codes. By training ML mod- els on historical performance data, computational scientists can develop accurate models that can predict the performance of new applications and simulations under different scenarios. However, current ML-based modeling approaches are limited to modeling one or two sources of performance data, such as hardware counters or application features. This limitation prevents models from making use of all available causal variables that may impact performance. This thesis introduces novel approaches to modeling performance that can make use of all available data sources. Additionally, it introduces performance latent spaces that can be used to model various output metrics, such as runtime or energy consumption, in a unified manner. Finally, a method to integrate these performance models into large language models is introduced to enable modeling and improving the performance of code.

SC 2025

Understanding Communication Bottlenecks in Multi-node LLM Inference

Prajwal Singhania, Siddharth Singh, Lannie Dalton Hough, Ishaan Revankar, Harshitha Menon, Charles Jekel, Abhinav Bhatele

Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, November 2025

SC 2025

Unmasking Performance Variability in GPU Codes on Supercomputers

Cunyang Wei, Keshav Pradeep, Abhinav Bhatele

Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, November 2025

SC 2025

Optimizing Collectives with Large Payloads on GPU-based Supercomputers

Siddharth Singh, Mahua Singh, Keshav Pradeep, Abhinav Bhatele

Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, November 2025

SC 2025

Understanding GPU Utilization Using LDMS Data on Perlmutter

Onur Cankur, Brian Austin, Abhinav Bhatele

Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, November 2025

IPDPS 2025

Scalable Epidemiological Agent-based Modeling with Dynamic Behaviors

Joy Kitson, Abhinav Bhatele

Proceedings of the IEEE International Parallel & Distributed Processing Symposium, June 2025

SC 2024

Creating Code LLMs for HPC: It’s LLMs All the Way Down

Aman Chaturvedi, Daniel Nichols, Siddharth Singh, Abhinav Bhatele

Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, November 2024

SC 2024

Eve: Less Memory, Same Might

Aditya Tomar, Siddharth Singh, Tom Goldstein, Abhinav Bhatele

Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC '24, November 2024

SC 2023

The impact of process topology on RMA programming models: A study on NERSC Perlmutter

Nikodemos Koutsoheras, Sayan Ghosh, Joshua Suetterlein, Nathan Tallent, Abhinav Bhatele

Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’23, November 2023

SC 2023

Evaluating Performance Portability of GPU Programming Models

Joshua H. Davis, Pranav Sivaraman, Isaac Minn, Abhinav Bhatele

Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’23, November 2023

SC 2023

Pipit: Simplifying Analysis of Parallel Execution Traces

Alexander Movsesyan, Rakrish Dhakal, Aditya Ranjan, Jordan Marry, Onur Cankur, Abhinav Bhatele

Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’23, November 2023

SC 2023

Modeling Parallel Programs using Large Language Models

Daniel Nichols, Aniruddha Marathe, Harshitha Menon, Todd Gamblin, Abhinav Bhatele

Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’23, November 2023

IPDPS 2023

AxoNN: Hybrid Asynchronous Algorithms for Parallel Deep Learning

Siddharth Singh, Abhinav Bhatele

Proceedings of the IEEE International Parallel & Distributed Processing Symposium. IEEE Computer Society, May 2023

SC 2022