BAFT: bubble-aware fault-tolerant framework for distributed DNN training with hybrid parallelism

Runzhe Chen, Guandong Lu, Yakai Wang, Rui Zhang, Zheng Hu, Yanming Miao, Zhifang Cai, Jingwen Leng, Minyi Guo. BAFT: bubble-aware fault-tolerant framework for distributed DNN training with hybrid parallelism. Frontiers of Computer Science in China, 19(1):191102, January 2025.
