Controlling the spread of infectious diseases in large populations is an important societal challenge, and one that recent events have highlighted. Mathematically, the problem is best captured as a certain class of reaction-diffusion processes (referred to as contagion processes) over appropriate synthesized interaction networks. Agent-based models have been used successfully in the recent past to study such contagion processes. Our work revolves around the development of Loimos, a highly scalable parallel code written in Charm++ that uses agent-based modeling to simulate disease spread over large, realistic, co-evolving networks of interaction.
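As a rough illustration of what an agent-based contagion model computes, the sketch below runs a textbook susceptible-infected-recovered (SIR) process over a small contact network. It is not Loimos code: it is serial Python rather than Charm++, the network is static rather than co-evolving, and the function name `simulate_sir`, its parameters, and the four-person chain are all assumptions made for the example.

```python
import random


def simulate_sir(contacts, initially_infected, p_transmit, recovery_days, days, seed=0):
    """Discrete-time SIR contagion over a static contact network.

    contacts maps each agent to the list of agents it meets every day.
    Returns one (susceptible, infected, recovered) tuple per simulated day.
    """
    rng = random.Random(seed)
    state = {agent: "S" for agent in contacts}
    days_infected = {}
    for agent in initially_infected:
        state[agent] = "I"
        days_infected[agent] = 0

    history = []
    for _ in range(days):
        # Each infected agent may infect its susceptible contacts.
        newly_infected = set()
        for agent in days_infected:
            for neighbour in contacts[agent]:
                if state[neighbour] == "S" and rng.random() < p_transmit:
                    newly_infected.add(neighbour)

        # Agents infected for recovery_days days move to the recovered state.
        for agent in list(days_infected):
            days_infected[agent] += 1
            if days_infected[agent] >= recovery_days:
                state[agent] = "R"
                del days_infected[agent]

        # Apply new infections only after the day's contacts are processed.
        for agent in newly_infected:
            state[agent] = "I"
            days_infected[agent] = 0

        values = list(state.values())
        history.append((values.count("S"), values.count("I"), values.count("R")))
    return history


# Hypothetical four-person chain of contacts: 0 - 1 - 2 - 3.
chain = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
history = simulate_sir(chain, initially_infected=[0], p_transmit=1.0,
                       recovery_days=2, days=5)
```

With a transmission probability of 1.0 the infection moves one hop along the chain per day, which makes the sketch easy to check by hand. A realistic model would use per-contact probabilities and far larger networks.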
HPC Data Analytics
Hundreds to thousands of jobs run simultaneously on HPC systems via batch scheduling. MPI communication and I/O traffic from all running jobs use shared system resources, which can lead to inter-job interference. This interference can slow down individual jobs to varying degrees, an effect referred to as performance variability. The figures to the right show two identical runs of an application (in blue) that differ only in what else was running on the system, yet they experienced a nearly 25% difference in messaging rate. Application-specific data and system-wide monitoring data can be analyzed to identify performance bottlenecks, anomalies, and correlations between disparate sources of data. Such analytics of HPC performance data can help mitigate performance variability and improve application performance and system throughput.
Our research applies data analytics to system-wide monitoring data and data from “control” jobs to identify performance bottlenecks, anomalies, and correlations. We use this data to predict variability in future jobs and to build resource-aware job schedulers.
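One simple way to quantify variability from repeated control-job runs is to measure each run's slowdown relative to the fastest run and flag the outliers. The sketch below does only that. The function names, the 10% threshold, and the runtimes are hypothetical, and the analytics described above draw on much richer monitoring data.

```python
def relative_slowdown(runtimes):
    """Slowdown of each run relative to the fastest run (0.0 means fastest)."""
    best = min(runtimes)
    return [t / best - 1.0 for t in runtimes]


def flag_variable_runs(runtimes, threshold=0.10):
    """Indices of runs that are more than `threshold` slower than the best run."""
    return [i for i, s in enumerate(relative_slowdown(runtimes)) if s > threshold]


# Hypothetical runtimes (seconds) of the same control job on different days.
control_runtimes = [100.0, 125.0, 102.0, 99.0, 131.0]
flagged = flag_variable_runs(control_runtimes)
```

The flagged runs can then be cross-referenced with system-wide monitoring data from the same time window to look for the source of the interference.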
Fluid solvers are essential for simulating and studying physical systems. Parallelizing these solvers is often straightforward for structured problems; unstructured domains, however, typically introduce complex data access patterns and load imbalance, which lead to poor performance and resource utilization. To address load imbalance, we study the use of adaptive mesh refinement and Charm++ to partition work between processing elements. At a finer granularity, we use novel GPU-based algorithms to accelerate the numerics in spite of the data access patterns. Our research in this area aims to solve these issues for high-order numerical methods with non-standard stencils.
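To make the load-imbalance problem concrete, the sketch below assigns mesh blocks of unequal cost to processing elements with a generic greedy heuristic (heaviest block first, onto the least-loaded element) and reports the ratio of maximum to mean load. It is not the load balancing that Charm++ performs or the strategy used in this project. The function names and block costs are assumptions for illustration.

```python
import heapq


def greedy_partition(costs, num_pes):
    """Assign work items to processing elements, heaviest item first."""
    heap = [(0.0, pe) for pe in range(num_pes)]  # (current load, PE id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_pes)]
    for idx in sorted(range(len(costs)), key=lambda i: costs[i], reverse=True):
        load, pe = heapq.heappop(heap)           # least-loaded PE so far
        assignment[pe].append(idx)
        heapq.heappush(heap, (load + costs[idx], pe))
    return assignment


def imbalance(costs, assignment):
    """Ratio of the maximum PE load to the mean PE load (1.0 is perfect)."""
    loads = [sum(costs[i] for i in part) for part in assignment]
    return max(loads) / (sum(loads) / len(loads))


# Hypothetical per-block costs, e.g. after refining some mesh regions.
block_costs = [8.0, 7.0, 6.0, 5.0, 4.0]
greedy = greedy_partition(block_costs, num_pes=2)
contiguous = [[0, 1, 2], [3, 4]]  # naive split by block index
greedy_ratio = imbalance(block_costs, greedy)
contiguous_ratio = imbalance(block_costs, contiguous)
```

On these costs the greedy assignment gives a lower imbalance ratio than the contiguous split, though a real partitioner must also account for communication between neighbouring blocks.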
Parallel Deep Learning
Deep learning in fields like computer vision and natural language processing has moved towards increasingly large neural network architectures. The largest neural networks being trained today require enormous amounts of compute and memory, with training often taking several months even on hundreds of GPUs. It has thus become critical to design frameworks that can train these models efficiently at scale.
This research project aims to explore and develop algorithms for parallel deep learning. We are working on improving both the time and memory efficiency of training large neural networks in a distributed setting. We also seek to scale beyond the current state of the art to train even larger architectures. The aim is to develop a robust and user-friendly deep learning framework that makes it easy for the end user to train large neural networks in distributed environments.
Parallel File Systems
This research project aims to build an efficient file system for high-performance computing (HPC) applications. Developing user-level file systems for specific workloads requires analyzing the I/O behavior of parallel programs and identifying I/O bottlenecks and limitations. Based on this analysis, strategies can be developed to improve I/O performance. The project involves studying the I/O behavior of several HPC benchmarks and applications, analyzing the collected data to identify bottlenecks, and then developing strategies to mitigate those bottlenecks.
We develop data analysis and visualization tools for analyzing the performance of large-scale parallel applications.
Hatchet is a Python-based library that allows Pandas dataframes to be indexed by structured tree and graph data. It is intended for analyzing performance data that has a hierarchy (for example, serial or parallel profiles that represent calling context trees, call graphs, nested regions’ timers, etc.). Hatchet implements various operations to analyze a single hierarchical data set or compare multiple data sets, and its API facilitates analyzing such data programmatically.
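The idea of a table indexed by tree nodes can be illustrated without Hatchet itself. The sketch below uses only the Python standard library and does not use or reproduce Hatchet's API: it flattens two nested profiles into rows keyed by call path and compares them. The helper names `flatten` and `diff`, the profile layout, and the timings are all hypothetical.

```python
def flatten(tree, parent_path=()):
    """Turn a nested profile into rows keyed by call path (a tuple of names)."""
    path = parent_path + (tree["name"],)
    rows = {path: tree["time"]}
    for child in tree.get("children", []):
        rows.update(flatten(child, path))
    return rows


def diff(rows_a, rows_b):
    """Per-call-path time difference (a minus b); missing paths count as 0."""
    paths = set(rows_a) | set(rows_b)
    return {p: rows_a.get(p, 0.0) - rows_b.get(p, 0.0) for p in paths}


# Two hypothetical profiles of the same program (exclusive time in seconds).
run1 = {"name": "main", "time": 1.0, "children": [
    {"name": "solve", "time": 8.0, "children": [
        {"name": "MPI_Allreduce", "time": 3.0}]},
    {"name": "io", "time": 2.0},
]}
run2 = {"name": "main", "time": 1.0, "children": [
    {"name": "solve", "time": 8.0, "children": [
        {"name": "MPI_Allreduce", "time": 5.0}]},
    {"name": "io", "time": 2.0},
]}

# Keep only the call paths that got slower in the second run.
slower = {p: d for p, d in diff(flatten(run2), flatten(run1)).items() if d > 0}
```

Here the comparison isolates the one call path whose time changed between the two runs. Hatchet provides this kind of filtering and differencing on top of Pandas dataframes, so the hierarchy and the metrics stay linked.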