Weilin Cai, Le Qin, Jiayi Huang. MoC-System: Efficient Fault Tolerance for Sparse Mixture-of-Experts Model Training. In Lieven Eeckhout, Georgios Smaragdakis, Katai Liang, Adrian Sampson, Martha A. Kim, Christopher J. Rossbach, editors, Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ASPLOS 2025, Rotterdam, Netherlands, 30 March 2025 - 3 April 2025, pages 655-671. ACM, 2025.