Swift: Expedited Failure Recovery for Large-Scale DNN Training

Yuchen Zhong, Guangming Sheng, Juncheng Liu, Jinhui Yuan, Chuan Wu. Swift: Expedited Failure Recovery for Large-Scale DNN Training. In Maryam Mehri Dehnavi, Milind Kulkarni, Sriram Krishnamoorthy, editors, Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, PPoPP 2023, Montreal, QC, Canada, 25 February 2023 - 1 March 2023, pages 447-449. ACM, 2023.
