NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference

Xuanlin Jiang, Yang Zhou, Shiyi Cao, Ion Stoica, Minlan Yu. NEO: Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference. In Matei Zaharia, Gauri Joshi, Yingyan (Celine) Lin, editors, Proceedings of the Eighth Conference on Machine Learning and Systems (MLSys 2025), Santa Clara, CA, USA, May 12-15, 2025. OpenReview.net/mlsys.org, 2025.
