Deferred Continuous Batching in Resource-Efficient Large Language Model Serving

Yongjun He, Yao Lu, Gustavo Alonso. Deferred Continuous Batching in Resource-Efficient Large Language Model Serving. In Proceedings of the 4th Workshop on Machine Learning and Systems, EuroMLSys 2024, Athens, Greece, 22 April 2024. pages 98-106, ACM, 2024.
