Preprints

arXiv

Ian J. Costello, Abhinav Bhatele

Abstract  

Performance of a parallel code running on a large supercomputer can vary significantly from one run to another even when the executable and its input parameters are left unchanged. Such variability can occur due to perturbation of the computation and/or communication in the code. In this paper, we investigate the case of performance variability arising due to network effects on supercomputers that use a dragonfly topology – specifically, Cray XC systems equipped with the Aries interconnect. We perform post-mortem analysis of network hardware counters, profiling output, job queue logs, and placement information, all gathered from periodic representative application runs. We investigate the causes of performance variability using deviation prediction and recursive feature elimination. Additionally, using time-stepped performance data of individual applications, we train machine learning models that can forecast the execution time of future time steps.
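To make the approach concrete, here is a minimal sketch of deviation prediction with recursive feature elimination over counter-derived features, using scikit-learn on synthetic placeholder data rather than the datasets gathered in the paper.

```python
# Hypothetical sketch: rank per-run features (network counters, placement,
# queue state) by their influence on whether a run's execution time deviated
# from the norm. All feature names and values are illustrative placeholders.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFE

rng = np.random.default_rng(0)
runs = pd.DataFrame({
    "aries_stall_counts": rng.random(200),
    "link_traffic":       rng.random(200),
    "groups_spanned":     rng.integers(1, 10, 200),
    "queue_depth":        rng.integers(0, 50, 200),
})
# Label: 1 if the run deviated significantly from the median execution time.
deviated = rng.integers(0, 2, 200)

# Recursively eliminate the least important feature until two remain.
selector = RFE(GradientBoostingClassifier(), n_features_to_select=2)
selector.fit(runs, deviated)
print(dict(zip(runs.columns, selector.ranking_)))  # rank 1 = most predictive
```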

Papers

ProTools 2020

Stephanie Brink, Ian Lumsden, Connor Scully-Allison, Katy Williams, Olga Pearce, Todd Gamblin, Michela Taufer, Katherine E. Isaacs, Abhinav Bhatele

Proceedings of the Workshop on Programming and Performance Visualization Tools. November 2020.

Abstract  

Performance analysis is critical for pinpointing bottlenecks in applications. Many different profilers exist to instrument parallel programs on HPC systems; however, there is a lack of tools for analyzing such data programmatically. Hatchet, an open-source Python library, can read profiling data from several tools and enables the user to perform a variety of analyses on hierarchical performance data. In this paper, we augment Hatchet to support new features: a call path query language for representing call path-related queries, visualizations for displaying and interacting with the structured data, and new operations for performing analysis on multiple datasets. Additionally, we present performance optimizations in Hatchet’s HPCToolkit reader and the unify operation to enable scalable analysis of large profiles.
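The sketch below illustrates the kind of call path query described here, using Hatchet's Python API; the HPCToolkit database path is a placeholder, and exact call signatures may differ across Hatchet versions.

```python
# Illustrative use of Hatchet's call path query language (paths are placeholders).
import hatchet as ht

# Read an HPCToolkit database into a GraphFrame (call graph + pandas DataFrame).
gf = ht.GraphFrame.from_hpctoolkit("hpctoolkit-lulesh-database")

# Query: match call paths rooted at any MPI_* call, plus everything beneath
# them ("*" matches an arbitrary subpath).
query = [{"name": "MPI_.*"}, "*"]
mpi_paths = gf.filter(query)

# The result is still a GraphFrame, so pandas-style analysis applies directly.
print(mpi_paths.dataframe[["name", "time"]].head())
```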

CLUSTER 2020

Sascha Hunold, Abhinav Bhatele, George Bosilca, Peter Knees

Proceedings of the IEEE Cluster Conference. September 2020.

Abstract  

The Message Passing Interface (MPI) defines the semantics of data communication operations, while the implementing libraries provide several parameterized algorithms for each operation. Each algorithm of an MPI collective operation may work best on a particular system and may depend on the specific communication problem. Internally, MPI libraries employ heuristics to select the best algorithm for a given communication problem when called by an MPI application. The majority of MPI libraries allow users to override the default algorithm selection, enabling the tuning of this selection process. The problem then becomes how to automatically select the best possible algorithm for a specific case. In this paper, we address the algorithm selection problem for MPI collective communication operations. To solve this problem, we propose an auto-tuning framework for collective MPI operations based on machine-learning techniques. First, we execute a set of benchmarks of an MPI library and its entire set of collective algorithms. Second, for each algorithm, we fit a performance model by applying regression learners. Last, we use the regression models to predict the best possible (fastest) algorithm for an unseen communication problem. We evaluate our approach for different MPI libraries and several parallel machines. The experimental results show that our approach outperforms the standard algorithm selection heuristics, which are hard-coded into the MPI libraries, by a significant margin.
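A minimal sketch of the selection idea, assuming synthetic benchmark data and placeholder algorithm names (this is a generic illustration, not the paper's framework):

```python
# Fit one regression model per collective algorithm, then pick the algorithm
# with the smallest predicted runtime for an unseen communication problem.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
algorithms = ["binomial_tree", "recursive_doubling", "ring"]  # placeholder names

models = {}
for algo in algorithms:
    # Placeholder benchmark data: (process count, message size) -> runtime.
    X = np.column_stack([rng.integers(2, 1024, 500),
                         2 ** rng.integers(0, 24, 500)])
    y = rng.random(500)
    models[algo] = RandomForestRegressor(n_estimators=50).fit(X, y)

def select_algorithm(num_procs, msg_size):
    """Predict the runtime of each algorithm and return the fastest one."""
    case = np.array([[num_procs, msg_size]])
    predictions = {a: m.predict(case)[0] for a, m in models.items()}
    return min(predictions, key=predictions.get)

print(select_algorithm(256, 65536))
```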

ICS 2020

Jaemin Choi, David Richards, Laxmikant V. Kale, Abhinav Bhatele

Proceedings of the International Conference on Supercomputing. June 2020.

Abstract  

With the growing number of GPU-based supercomputing platforms and GPU-enabled applications, the ability to accurately model the performance of such applications is becoming increasingly important. Most current performance models for GPU-enabled applications are limited to single node performance. In this work, we propose a methodology for end-to-end performance modeling of distributed GPU applications. Our work strives to create performance models that are both accurate and easily applicable to any distributed GPU application. We combine trace-driven simulation of MPI communication using the TraceR-CODES framework with a profiling-based roofline model for GPU kernels. We make substantial modifications to these models to capture the complex effects of both on-node and off-node networks in today’s multi-GPU supercomputers. We validate our model against empirical data from GPU platforms and also vary tunable parameters of our model to observe how they might affect application performance.
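The roofline component of such a model can be sketched as follows; the peak rates, operation counts, and communication time below are illustrative values, not measurements from the paper.

```python
# Simple roofline estimate for a GPU kernel: its time is bounded by either
# peak compute throughput or peak memory bandwidth, whichever is slower.
def roofline_kernel_time(flops, bytes_moved, peak_flops, peak_bw):
    """Return the estimated kernel time in seconds under a roofline model."""
    compute_time = flops / peak_flops      # time if compute-bound
    memory_time = bytes_moved / peak_bw    # time if bandwidth-bound
    return max(compute_time, memory_time)

# Example with illustrative numbers: 2e12 FLOPs, 4e11 bytes moved, on a GPU
# with 7e12 FLOP/s peak compute and 9e11 B/s peak memory bandwidth.
t_kernel = roofline_kernel_time(2e12, 4e11, 7e12, 9e11)

# End-to-end time per step combines kernel estimates with communication time
# predicted separately by a network simulation (TraceR-CODES in the paper).
t_comm = 1.5e-3   # placeholder communication time from a separate simulation
print(t_kernel + t_comm)
```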

IPDPS 2020

Abhinav Bhatele, Jayaraman J. Thiagarajan, Taylor Groves, Rushil Anirudh, Staci A. Smith, Brandon Cook, David K. Lowenthal

Proceedings of the IEEE International Parallel & Distributed Processing Symposium. IEEE Computer Society, May 2020.

Abstract  

Performance of a parallel code running on a large supercomputer can vary significantly from one run to another even when the executable and its input parameters are left unchanged. Such variability can occur due to perturbation of the computation and/or communication in the code. In this paper, we investigate the case of performance variability arising due to network effects on supercomputers that use a dragonfly topology – specifically, Cray XC systems equipped with the Aries interconnect. We perform post-mortem analysis of network hardware counters, profiling output, job queue logs, and placement information, all gathered from periodic representative application runs. We investigate the causes of performance variability using deviation prediction and recursive feature elimination. Additionally, using time-stepped performance data of individual applications, we train machine learning models that can forecast the execution time of future time steps.
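A minimal sketch of per-time-step forecasting in this spirit, trained on synthetic step times and counter features rather than the paper's monitoring data:

```python
# Predict the next time step's execution time from the previous few step
# times plus that step's counter features. All data here is synthetic.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
n_steps, lag = 1000, 5
step_time = 1.0 + np.cumsum(rng.normal(0, 0.01, n_steps))  # synthetic step times
counters = rng.random((n_steps, 3))                         # synthetic counters

# Features for step i: the previous `lag` step times and step i's counters.
lagged = np.array([step_time[i - lag:i] for i in range(lag, n_steps)])
X = np.hstack([lagged, counters[lag:]])
y = step_time[lag:]

model = RandomForestRegressor(n_estimators=100).fit(X[:-100], y[:-100])
predicted = model.predict(X[-100:])
print(np.mean(np.abs(predicted - y[-100:])))  # mean absolute error, held-out steps
```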

IPDPS 2020

Harshitha Menon, Abhinav Bhatele, Todd Gamblin

Proceedings of the IEEE International Parallel & Distributed Processing Symposium. IEEE Computer Society, May 2020.

Abstract  

High performance computing applications, runtimes, and platforms are becoming more configurable to enable applications to obtain better performance. As a result, users are increasingly presented with a multitude of options to configure application-specific as well as platform-level parameters. The combined effect of different parameter choices on application performance is difficult to predict, and an exhaustive evaluation of this combinatorial parameter space is practically infeasible. One approach to parameter selection is a user-guided exploration of a part of the space. However, such an ad hoc exploration of the parameter space can result in suboptimal choices. Therefore, an automatic approach that can efficiently explore the parameter space is needed. In this paper, we propose HiPerBOt, a Bayesian optimization-based configuration selection framework that identifies application- and platform-level parameters resulting in high-performing configurations. We demonstrate the effectiveness of HiPerBOt in tuning parameters that include compiler flags, runtime settings, and application-level options for several parallel codes, including Kripke, Hypre, LULESH, and OpenAtom.
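The sketch below shows a generic Bayesian-optimization loop of the kind described here, with a Gaussian-process surrogate and an expected-improvement acquisition function; the objective and parameter space are synthetic stand-ins, not HiPerBOt's implementation.

```python
# Generic Bayesian optimization over a 2-D normalized configuration space.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def runtime(cfg):
    """Placeholder objective: stands in for running the application with cfg."""
    return (cfg[0] - 0.3) ** 2 + (cfg[1] - 0.7) ** 2

rng = np.random.default_rng(3)
candidates = rng.random((500, 2))        # candidate configurations to consider
X = list(rng.random((5, 2)))             # a few initial random evaluations
y = [runtime(c) for c in X]

for _ in range(20):
    # Surrogate model of runtime as a function of the configuration.
    gp = GaussianProcessRegressor(normalize_y=True, alpha=1e-6).fit(np.array(X), np.array(y))
    mu, sigma = gp.predict(candidates, return_std=True)
    best = min(y)
    # Expected improvement over the best runtime observed so far (minimization).
    z = (best - mu) / np.maximum(sigma, 1e-9)
    ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
    nxt = candidates[np.argmax(ei)]      # evaluate the most promising configuration
    X.append(nxt)
    y.append(runtime(nxt))

print(min(y), X[int(np.argmin(y))])
```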

HiPC 2019

Giorgis Georgakoudis, Nikhil Jain, Takatsugu Ono, Koji Inoue, Shinobu Miwa, Abhinav Bhatele

Proceedings of the IEEE International Conference on High Performance Computing, December 2019.

Abstract  

Interconnection networks grow larger as supercomputers include more nodes and require higher bandwidth for performance. This scaling significantly increases the fraction of power consumed by the network, by increasing the number of network components (links and switches). Typically, network links consume power continuously once they are turned on. However, recent proposals for energy-efficient interconnects have introduced low-power operation modes for periods when network links are idle. Low-power operation can increase messaging time when switching a link from low-power to active operation. We extend the TraceR-CODES network simulator with power modeling to evaluate the impact of energy-efficient networking on power and performance. Our evaluation presents the first study of both single-job and multi-job execution to realistically simulate power consumption and performance under congestion for a large-scale HPC network. Results on several workloads consisting of HPC proxy applications show that single-job and multi-job execution favor different low-power operation modes to achieve significant power savings at the cost of minimal performance degradation.
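A back-of-the-envelope sketch of the power/latency trade-off described here, with purely illustrative wattage and latency values:

```python
# Links in a low-power mode draw less power while idle, but waking one up
# adds latency to the first message that needs it. All values are placeholders.
def network_power(num_links, active_fraction, p_active=5.0, p_lowpower=1.0):
    """Total link power in watts when only a fraction of links stay active."""
    active = int(num_links * active_fraction)
    return active * p_active + (num_links - active) * p_lowpower

def message_time(base_time, link_was_idle, wakeup_latency=1e-6):
    """Message time grows by the wake-up latency if the link must transition."""
    return base_time + (wakeup_latency if link_was_idle else 0.0)

# Example: keeping only 40% of 10,000 links active saves power relative to an
# always-on network, at the cost of occasional wake-up delays on messages.
print(network_power(10_000, 1.0), network_power(10_000, 0.4))
print(message_time(2e-6, link_was_idle=True))
```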

SC 2019

Abhinav Bhatele, Stephanie Brink, Todd Gamblin

Proceedings of the ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, November 2019.

Abstract  

Performance analysis is critical for eliminating scalability bottlenecks in parallel codes. There are many profiling tools that can instrument codes and gather performance data. However, analytics and visualization tools that are general, easy to use, and programmable are limited. In this paper, we focus on the analytics of structured profiling data, such as that obtained from calling context trees or nested region timers in code. We present a set of techniques and operations that build on the pandas data analysis library to enable analysis of parallel profiles. We have implemented these techniques in a Python-based library called Hatchet that allows structured data to be filtered, aggregated, and pruned. Using performance datasets obtained from profiling parallel codes, we demonstrate performing common performance analysis tasks reproducibly with a few lines of Hatchet code. Hatchet brings the power of modern data science tools to bear on performance analysis.
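As an illustration of this workflow, the few-line example below loads a profile, filters it, and sorts the result; the database path is a placeholder and exact signatures may differ by Hatchet version.

```python
# Load a profile into a Hatchet GraphFrame, keep only the hottest nodes, and
# inspect the result with ordinary pandas operations. Paths are placeholders.
import hatchet as ht

gf = ht.GraphFrame.from_hpctoolkit("hpctoolkit-kripke-database")

# Filter: keep nodes accounting for more than 5% of total exclusive time.
total = gf.dataframe["time"].sum()
hot = gf.filter(lambda row: row["time"] > 0.05 * total)

# The underlying pandas DataFrame supports the usual sorting and aggregation.
print(hot.dataframe.sort_values("time", ascending=False)[["name", "time"]].head())
```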

Posters