Understanding Stragglers in Large Model Training Using What-if Analysis

Jinkun Lin, Ziheng Jiang, Zuquan Song, Sida Zhao, Menghan Yu, Zhanghan Wang, Chenyuan Wang, Zuocheng Shi, Xiang Shi, Wei Jia, Zherui Liu, Shuguang Wang, Haibin Lin, Xin Liu, Aurojit Panda, Jinyang Li. Understanding Stragglers in Large Model Training Using What-if Analysis. In Lidong Zhou, Yuanyuan Zhou, editors, 19th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2025, Boston, MA, USA, July 7-9, 2025. Pages 483-498, USENIX Association, 2025.
