PSSG at ACM/IEEE Supercomputing 2024


PSSG will have a strong presence at SC ’24, the premier international conference for high-performance computing, networking, storage, and analysis. We are excited to be contributing in several impactful ways.

Technical Paper Presentations

Daniel Nichols (final-year PhD student) will present his paper on a probabilistic approach to selecting build configurations in package managers. The approach leverages historical build data and is integrated into the Spack package manager via probabilistic answer set programming. [Talk Link]

Siddharth Singh (final-year PhD student) and his team will present their work on scaling AxoNN, a 4D parallel AI training framework, on the Frontier, Perlmutter, and Alps supercomputers. They achieve an impressive 1.423 Exaflop/s on 6,144 NVIDIA H100 GPUs, 1.381 Exaflop/s on 32,768 AMD MI250X GCDs, and 620.1 Petaflop/s on 4,096 NVIDIA A100 GPUs for half-precision (bf16) LLM training. This scale also lets them analyze the side effects of scaling AI models: they run experiments exploring catastrophic memorization, where models become large enough to memorize training data in a single pass, and present a preventative approach. [Talk Link]

Awards

Daniel Nichols is a recipient of the 2024 ACM-IEEE CS George Michael Memorial HPC Fellowship. This award will be presented during the awards ceremony at SC ’24.

The AxoNN team’s submission has been selected as a finalist for the prestigious ACM Gordon Bell Prize at SC ’24. [Talk Link]

Posters

Aman Chaturvedi (undergraduate student) will present his poster titled “Creating Code LLMs for HPC: It’s LLMs All the Way Down,” about creating HPC-Coder-v2 using synthetic code data. HPC-Coder-v2 is shown to be the best LLM with fewer than 30B parameters at writing parallel code.

Aditya Tomar (UC Berkeley undergraduate student) will present his poster titled “Eve: Less Memory, Same Might,” about Eve, an approximation of the AdamW optimizer for LLM training that uses nearly 25% less memory while matching AdamW’s convergence.

Tutorials

The AxoNN team, led by Siddharth Singh, will present a tutorial on AxoNN, a highly scalable and easy-to-use parallel framework for AI training. Attendees will learn how to use state-of-the-art parallel algorithms in AxoNN to parallelize their GPU training workloads with minimal code changes. We will demonstrate this with a very popular use case: finetuning open-source LLMs like Llama-3 on instruction finetuning data.

Birds of a Feather Session

Daniel Nichols and others will organize a BoF session titled “Toward Integrating LLMs in HPC Software Development.” LLM-based coding assistants have already proven to be useful tools for software developers, and adopting them in HPC software development promises to improve the quality of scientific codes and shorten development time. This would let researchers devote more attention to scientific challenges and less to software development intricacies, driving scientific progress forward. The BoF will provide a place for the community to discuss the use of LLMs for HPC software development. [Session Link] [BoF Website]

UMD Booth Talks

PhD students Daniel Nichols, Siddharth Singh, Joy Kitson, Josh Davis, and Prajwal Singhania will be presenting their research at the UMD exhibitor booth at SC ’24. The schedule of booth talks is below.

  • Tuesday, November 19
    • 10:10 a.m. “Taking GPU Programming Models to Task for Performance Portability” – Joshua Davis
    • 1:10 p.m. “Eve: Pruning Adam’s State for Scalable Deep Learning” – Aditya Tomar
    • 3:10 p.m. “Strategies for Parallelising an Agent-Based Model of Infectious Disease Spread” – Joy Kitson
  • Wednesday, November 20
    • 10:10 a.m. “Loki: Low-rank Keys for Efficient Sparse Attention” – Prajwal Singhania
    • 1:10 p.m. “Improving Build Likelihood in Package Managers with Probabilistic Constraints” – Daniel Nichols
    • 3:10 p.m. “Insights from Longitudinal GPU Workload Monitoring on Perlmutter” – Onur Cankur
  • Thursday, November 21
    • 10:10 a.m. “Creating Code LLMs for HPC: It’s LLMs All the Way Down” – Aman Chaturvedi
    • 1:10 p.m. “A hybrid tensor-expert-data parallelism approach to optimize mixture-of-experts training” – Siddharth Singh

We look forward to connecting with colleagues and collaborators at SC ’24. Stay tuned for more updates as the conference approaches!