Optimizing Inference for AI Models
The transformer architecture has revolutionized the field of deep learning, and Large Language Models built on it have grown dramatically in size since its introduction. The largest models today require enormous amounts of compute and memory, both during training and when running inference on the trained model. Moreover, the attention mechanism at the core of the transformer becomes a significant bottleneck for long-sequence inference: because these models generate tokens auto-regressively, the keys and values of all previous tokens must be cached and attended to at every generation step. Unlike training, which is a one-time process, inference happens continuously once the model is deployed. Hence, it is critical to design systems and techniques that reduce the compute and memory cost of inference, which translates directly into a significant reduction in the dollar cost of running these models. This research project aims to explore and develop algorithms for efficient inference of AI models.
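To make the memory pressure concrete, the short Python sketch below estimates how the key/value (KV) cache grows with sequence length for a hypothetical GPT-style decoder. The model dimensions used here (number of layers, heads, head size, fp16 storage) are illustrative assumptions, not figures from any specific model or from the publication below.

# Minimal sketch: KV-cache memory as a function of sequence length.
# All model dimensions are hypothetical and chosen only for illustration.

def kv_cache_bytes(seq_len: int,
                   num_layers: int = 32,
                   num_heads: int = 32,
                   head_dim: int = 128,
                   bytes_per_elem: int = 2,   # fp16 / bf16 storage
                   batch_size: int = 1) -> int:
    """Bytes needed to store keys and values for all previously generated tokens."""
    # Keys and values (factor of 2) are cached for every layer, head, and token.
    return 2 * num_layers * num_heads * head_dim * seq_len * bytes_per_elem * batch_size

if __name__ == "__main__":
    for seq_len in (1_024, 8_192, 131_072):
        gib = kv_cache_bytes(seq_len) / 2**30
        print(f"seq_len={seq_len:>7}: KV cache ~ {gib:.1f} GiB")

Under these assumptions the cache grows linearly with sequence length, from roughly 0.5 GiB at 1K tokens to tens of GiB at 128K tokens per sequence, which is why long-sequence auto-regressive inference quickly becomes memory-bound.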
Related Publications
[1] Prajwal Singhania et al., "Loki: Low-Rank Keys for Efficient Sparse Attention," to appear in Proceedings of the Thirty-Eighth Annual Conference on Neural Information Processing Systems (NeurIPS), Main Conference Track.