Controlling the spread of infectious diseases in large populations is an important societal challenge, and one which has been highlighted by current events. Mathematically, the problem is best captured as a certain class of reaction-diffusion processes (referred to as contagion processes) over appropriate synthesized interaction networks. Agent-based models have been successfully used in the recent past to study such contagion processes. Our work revolves around the development of Loimos, a highly scalable parallel code written in Charm++ which uses agent-based modeling to simulate disease spread over large, realistic, co-evolving networks of interaction.
HPC Data Analytics
Hundreds to thousands of jobs run simultaneously on HPC systems via batch scheduling. MPI communication and I/O data from all running jobs use shared system resources, which can lead to inter-job interference. This interference can slow down the execution of individual jobs to varying degrees. This slowdown is referred to as performance variability. The figures to the right shows two identical runs of an application (in blue) with the rest of the system differing, yet they experienced a nearly 25% difference in messaging rate. Application-specific data and system-wide monitoring data can be analyzed to identify performance bottlenecks, anomalies and correlations between disparate sources of data. Such analytics of HPC performance data can help mitigate performance variability, and improve application performance and system throughput.
Our research uses data analytics of system-wide monitoring data and “control” jobs data to identify performance bottlenecks, anomalies, and correlations. We use this data to predict variability in future jobs and make resource-aware job schedulers.
With the growing number of GPU-based supercomputing platforms and GPU-enabled applications, the ability to accurately model the performance of such applications is becoming increasingly important. Most current performance models for GPU-enabled applications are limited to single node performance. In this project, we are working on developing performance models that are both accurate and easily applicable to any distributed GPU application. We will use both analytical and empirical approaches to build either qualitative models, which aim at pointing out the bottleneck of an application for further optimization, or quantitative models which can predict the elapse time of applications on given hardware platform.
Parallel Deep Learning
Deep learning algorithms in fields like computer vision and natural language processing have seen a movement towards increasingly larger Neural Networks in terms of the depth and the number of parameters. This creates two major downsides for Deep Learning researchers -
- It takes a lot of time to train these neural networks even on GPUs.
- The memory footprint of these neural networks is so large that they can’t fit on a typical GPU DRAM. This research project aims to explore and develop algorithms for parallel deep learning. We are working on improving both the time as well as the memory efficiency for training large neural networks in a distributed setting. We also seek to scale beyond the current state-of-the-art to train even larger architectures. The aim is to develop a robust and user-friendly deep learning framework that makes it extremely easy for the end user to train large neural networks in distributed environments.
Parallel File Systems
The research project is to build an efficient file system for high-performance computing (HPC) applications. Developing user-level filesystems for specific workloads requires analyzing the I/O behavior of parallel programs, and identifying I/O bottlenecks and limitations. Based on the analysis, strategies can be developed to improve I/O performance. The project involves studying the I/O behavior of several HPC benchmarks and applications. It also involves analyzing the collected data to identify bottlenecks, and then developing strategies to mitigate those bottlenecks.
We develop data analysis and visualization tools for analyzing the performance of large-scale parallel applications.
Hatchet is a Python-based library that allows Pandas dataframes to be indexed by structured tree and graph data. It is intended for analyzing performance data that has a hierarchy (for example, serial or parallel profiles that represent calling context trees, call graphs, nested regions’ timers, etc.). Hatchet implements various operations to analyze a single hierarchical data set or compare multiple data sets, and its API facilitates analyzing such data programmatically.