Vela: A Virtualized LLM Training System with GPU Direct RoCE

Apoorve Mohan, Robert Walkup, Bengi Karacali, Ming-Hung Chen, Abdullah Kayi, Liran Schour, Shweta Salaria, Sophia Wen, I-Hsin Chung, Abdul Alim, Constantinos Evangelinos, Lixiang Luo, Marc Dombrowa, Laurent Schares, Ali Sydney, Pavlos Maniotis, Sandhya Koteshwara, Brent Tang, Joel Belog, Rei Odaira, Vasily Tarasov, Eran Gampel, Drew Thorstensen, Talia Gershon, Seetharami Seelam. Vela: A Virtualized LLM Training System with GPU Direct RoCE. In Lieven Eeckhout, Georgios Smaragdakis, Katai Liang, Adrian Sampson, Martha A. Kim, Christopher J. Rossbach, editors, Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ASPLOS 2025, Rotterdam, Netherlands, 30 March 2025 - 3 April 2025, pages 1348-1364. ACM, 2025.
