POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference

Aditya K. Kamath, Ramya Prabhu, Jayashree Mohan, Simon Peter, Ramachandran Ramjee, Ashish Panwar. POD-Attention: Unlocking Full Prefill-Decode Overlap for Faster LLM Inference. In Lieven Eeckhout, Georgios Smaragdakis, Kaitai Liang, Adrian Sampson, Martha A. Kim, Christopher J. Rossbach, editors, Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ASPLOS 2025, Rotterdam, Netherlands, 30 March 2025 - 3 April 2025. Pages 897-912, ACM, 2025.
