Controlling the spread of infectious diseases in large populations is an important societal challenge, and one which has been highlighted by current events. Mathematically, the problem is best captured as a certain class of reaction-diffusion processes (referred to as contagion processes) over appropriate synthesized interaction networks. Agent-based models have been successfully used in the recent past to study such contagion processes. Our work revolves around the development of Loimos, a highly scalable parallel code written in Charm++ which uses agent-based modeling to simulate disease spread over large, realistic, co-evolving networks of interaction.
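To make the idea of a contagion process over an interaction network concrete, here is a minimal serial sketch in Python. It is a toy susceptible-infectious-recovered (SIR) model on a small static contact network, not Loimos's model or code: the function name `simulate_sir`, its parameters, and the example network are all hypothetical, and the real simulator runs in parallel over co-evolving networks.

```python
import random


def simulate_sir(contacts, initially_infected, p_transmit,
                 days_infectious, num_days, seed=0):
    """Toy discrete-time SIR contagion over a static contact network.

    contacts maps each person to a list of neighbors. All names and
    parameters here are illustrative assumptions.
    """
    rng = random.Random(seed)
    state = {person: "S" for person in contacts}
    days_left = {}  # infectious person -> remaining infectious days
    for person in initially_infected:
        state[person] = "I"
        days_left[person] = days_infectious

    history = []
    for _ in range(num_days):
        # "Reaction" step: each infectious person may infect susceptible neighbors.
        newly_infected = set()
        for person in sorted(days_left):
            for neighbor in contacts[person]:
                if state[neighbor] == "S" and rng.random() < p_transmit:
                    newly_infected.add(neighbor)

        # Progress existing infections; recover those whose time has run out.
        for person in list(days_left):
            days_left[person] -= 1
            if days_left[person] == 0:
                state[person] = "R"
                del days_left[person]

        # People infected today become infectious starting tomorrow.
        for person in newly_infected:
            state[person] = "I"
            days_left[person] = days_infectious

        counts = {"S": 0, "I": 0, "R": 0}
        for s in state.values():
            counts[s] += 1
        history.append(counts)
    return history


# A four-person chain: 0 - 1 - 2 - 3
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
history = simulate_sir(chain, initially_infected=[0], p_transmit=1.0,
                       days_infectious=1, num_days=4)
```

With certain transmission and a one-day infectious period, the infection moves one hop along the chain per day until everyone has recovered. An agent-based simulator at scale replaces the static `contacts` dictionary with interactions derived from where and when agents are co-located.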
HPC Data Analytics
Hundreds to thousands of jobs run simultaneously on HPC systems via batch scheduling. MPI communication and I/O traffic from all running jobs use shared system resources, which can lead to inter-job interference. This interference can slow down individual jobs to varying degrees, an effect referred to as performance variability. The figures to the right show two identical runs of an application (in blue); only the rest of the system differed, yet the two runs experienced a nearly 25% difference in messaging rate. Application-specific data and system-wide monitoring data can be analyzed to identify performance bottlenecks, anomalies, and correlations between disparate sources of data. Such analytics of HPC performance data can help mitigate performance variability and improve application performance and system throughput.
Our research analyzes system-wide monitoring data and data from “control” jobs to identify performance bottlenecks, anomalies, and correlations. We use these analyses to predict variability in future jobs and to build resource-aware job schedulers.
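As a small illustration of how run-to-run variability can be quantified, the sketch below computes each run's slowdown relative to the fastest run, the coefficient of variation, and a simple threshold-based flag for affected runs. The function names, the 10% threshold, and the runtimes are made up for the example; this is not the group's analysis pipeline.

```python
from statistics import mean, pstdev


def relative_slowdown(runtimes):
    """Slowdown of each run relative to the fastest run (0.25 means 25% slower)."""
    best = min(runtimes)
    return [t / best - 1.0 for t in runtimes]


def coefficient_of_variation(runtimes):
    """Population standard deviation divided by the mean runtime."""
    return pstdev(runtimes) / mean(runtimes)


def flag_slow_runs(runtimes, threshold=0.10):
    """Indices of runs more than `threshold` slower than the fastest run."""
    return [i for i, s in enumerate(relative_slowdown(runtimes)) if s > threshold]


# Hypothetical runtimes (seconds) of the same job submitted three times.
runtimes = [100.0, 125.0, 102.0]
slowdowns = relative_slowdown(runtimes)
slow_runs = flag_slow_runs(runtimes)
cv = coefficient_of_variation(runtimes)
```

Here the second run is 25% slower than the fastest and is the only one flagged. In practice, flagged runs would then be correlated with system-wide monitoring data collected over the same time window.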
Parallel Deep Learning
Deep learning algorithms in fields like computer vision and natural language processing have seen a movement toward increasingly large neural network architectures. The largest neural networks being trained today require enormous amounts of compute and memory, with training often taking several months even on hundreds of GPUs. It has thus become critical to design frameworks that can train these models efficiently at scale.
This research project aims to explore and develop algorithms for parallel deep learning. We are working on improving both the time and memory efficiency of training large neural networks in a distributed setting. We also seek to scale beyond the current state of the art to train even larger architectures. The aim is to develop a robust and user-friendly deep learning framework that makes it easy for end users to train large neural networks in distributed environments.
This project aims to improve the performance of scientific applications on diverse hardware platforms through performance portability. Heterogeneous, accelerator-based hardware architectures have become the dominant paradigm in the design of parallel clusters, and maintaining separate versions of a complex scientific application for each architecture is highly undesirable for productivity. As a result, portable programming models such as Kokkos, RAJA, and SYCL have emerged, which allow one code to run on multiple hardware platforms, whether powered by NVIDIA, AMD, or Intel GPUs and CPUs. However, how well these models enable not just single-source correctness but single-source performance across all target systems is not well understood. We are conducting a comprehensive study of applications and mini-apps from a wide range of scientific domains, implemented with multiple programming models across hardware architectures and vendors, to analyze how well the available programming models enable performance portability.
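One way to put a number on performance portability is the metric proposed by Pennycook, Sewall, and Lee: the harmonic mean of an application's performance efficiency across a set of platforms, defined as zero if the application does not run on any one of them. The text above does not say which metric this study uses, so the sketch below is only an illustration; the function name, platform labels, and efficiency values are hypothetical.

```python
def performance_portability(efficiencies):
    """Harmonic mean of per-platform efficiencies (each in (0, 1]).

    efficiencies maps a platform name to the fraction of the best achievable
    performance reached on that platform, or None if the code does not run
    there. Returns 0.0 if any platform is unsupported.
    """
    if not efficiencies:
        raise ValueError("need at least one platform")
    values = list(efficiencies.values())
    if any(e is None or e <= 0 for e in values):
        return 0.0
    return len(values) / sum(1.0 / e for e in values)


# Hypothetical efficiencies of one application under one programming model.
pp_supported = performance_portability({"gpu_vendor_a": 0.5, "gpu_vendor_b": 1.0})
pp_unsupported = performance_portability({"gpu_vendor_a": 0.8, "gpu_vendor_b": None})
```

Because the harmonic mean is dominated by the smallest value, one poorly performing platform pulls the score down sharply, which matches the intuition that a portable code should perform well everywhere rather than excel on a single system.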
We develop data analysis and visualization tools for analyzing the performance of large-scale parallel applications.
Hatchet is a Python-based library that allows Pandas DataFrames to be indexed by structured tree and graph data. It is intended for analyzing performance data that has a hierarchy (for example, serial or parallel profiles that represent calling context trees, call graphs, nested regions’ timers, etc.). Hatchet implements various operations to analyze a single hierarchical data set or compare multiple data sets, and its API facilitates analyzing such data programmatically.
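The idea of a metrics table indexed by nodes of a calling context tree can be sketched in plain Python. This is a conceptual illustration only and deliberately does not use Hatchet's API: the `Node` class, the `inclusive_time` helper, and the timings are hypothetical.

```python
class Node:
    """A node in a toy calling context tree."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def path(self):
        """Call path from the root to this node, e.g. 'main/solve'."""
        parts = []
        node = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return "/".join(reversed(parts))


def inclusive_time(node, exclusive):
    """Exclusive time of the node plus the inclusive time of its children."""
    return exclusive[node] + sum(inclusive_time(c, exclusive) for c in node.children)


# Build a small tree: main -> {solve -> MPI_Allreduce, write_output}
main = Node("main")
solve = Node("solve", main)
allreduce = Node("MPI_Allreduce", solve)
output = Node("write_output", main)

# Hypothetical exclusive times in seconds, keyed by tree node.
exclusive = {main: 1.0, solve: 6.0, allreduce: 2.5, output: 0.5}

# A table of metrics indexed by call path, one row per tree node.
table = {
    node.path(): {"exclusive": exclusive[node],
                  "inclusive": inclusive_time(node, exclusive)}
    for node in exclusive
}
```

Keeping the tree structure alongside the table is what allows hierarchy-aware operations, such as rolling child times up into their parents or filtering out a subtree, in addition to ordinary row and column operations.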
Pipit is a Python-based library designed for parallel execution trace analysis, built on top of the Pandas library. It supports various trace file formats such as OTF2, HPCToolkit, Projections, and Nsight, providing a uniform data structure in the form of a Pandas DataFrame. Pipit provides a range of data manipulation operations for aggregating, filtering, and transforming trace events to present the data in different ways. Additionally, it includes several functions for easy and efficient identification of performance issues.
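A typical trace-analysis operation is matching each function's enter event with its leave event to compute time spent per function. The sketch below shows this for a single process using only the Python standard library. It does not use Pipit's API; the event layout `(timestamp, kind, name)`, the function name, and the example trace are assumptions made for illustration.

```python
from collections import defaultdict


def inclusive_time_per_function(events):
    """Total inclusive time per function for one process.

    events is a list of (timestamp, kind, name) tuples sorted by timestamp,
    where kind is "Enter" or "Leave" and calls are properly nested.
    """
    stack = []  # currently open calls as (name, enter_timestamp)
    totals = defaultdict(float)
    for timestamp, kind, name in events:
        if kind == "Enter":
            stack.append((name, timestamp))
        elif kind == "Leave":
            if not stack:
                raise ValueError(f"Leave without matching Enter: {name}")
            open_name, start = stack.pop()
            if open_name != name:
                raise ValueError(f"Mismatched events: {open_name} vs {name}")
            totals[name] += timestamp - start
    return dict(totals)


# Hypothetical trace of one process: main calls MPI_Send twice.
trace = [
    (0.0, "Enter", "main"),
    (1.0, "Enter", "MPI_Send"),
    (3.0, "Leave", "MPI_Send"),
    (4.0, "Enter", "MPI_Send"),
    (5.0, "Leave", "MPI_Send"),
    (10.0, "Leave", "main"),
]
times = inclusive_time_per_function(trace)
```

Storing events in a uniform table, whatever the original trace format, means that an operation like this one only has to be written once.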
Computational fluid dynamics (CFD) solvers are essential for understanding and predicting turbulent hypersonic flows, providing a critical resource for the timely development of atmospheric and space flight technologies as well as for improving climate science. However, the sensitivity of hypersonic turbulence demands a high degree of numerical fidelity in simulations.
Existing approaches have been shown to achieve good performance on CPU-based systems using only MPI, but the emergence of GPU-based supercomputing platforms has created a new opportunity to further improve performance. In addition, adaptive mesh refinement (AMR) can massively decrease the amount of work required to achieve a given level of fidelity. In this project, we have adapted an existing MPI-only hypersonics CFD code to support GPU acceleration and AMR using the AMReX library, extending our use of AMReX to handle previously unsupported curvilinear grids in interpolation and data management. Together, these changes yield orders-of-magnitude reductions in time-to-solution on representative benchmarks.
Parallel File Systems
This research project aims to build an efficient file system for high-performance computing (HPC) applications. Developing user-level file systems for specific workloads requires analyzing the I/O behavior of parallel programs and identifying I/O bottlenecks and limitations. The project involves studying the I/O behavior of several HPC benchmarks and applications, analyzing the collected data to identify bottlenecks, and then developing strategies to mitigate those bottlenecks and improve I/O performance.