Understanding Stragglers in Large Model Training Using What-if Analysis

Jinkun Lin, Ziheng Jiang, Zuquan Song, Sida Zhao, Menghan Yu, Zhanghan Wang, Chenyuan Wang, Zuocheng Shi, Xiang Shi, Wei Jia, Zherui Liu, Shuguang Wang, Haibin Lin, Xin Liu 0086, Aurojit Panda, Jinyang Li. Understanding Stragglers in Large Model Training Using What-if Analysis. In Lidong Zhou, Yuanyuan Zhou 0001, editors, 19th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2025, Boston, MA, USA, July 7-9, 2025, pages 483-498. USENIX Association, 2025.

Authors

Jinkun Lin

Ziheng Jiang

Zuquan Song

Sida Zhao

Menghan Yu

Zhanghan Wang

Chenyuan Wang

Zuocheng Shi

Xiang Shi

Wei Jia

Zherui Liu

Shuguang Wang

Haibin Lin

Xin Liu 0086

Aurojit Panda

Jinyang Li
