Swift: Expedited Failure Recovery for Large-Scale DNN Training

Yuchen Zhong, Guangming Sheng, Juncheng Liu, Jinhui Yuan, Chuan Wu. Swift: Expedited Failure Recovery for Large-Scale DNN Training. In Maryam Mehri Dehnavi, Milind Kulkarni, Sriram Krishnamoorthy, editors, Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, PPoPP 2023, Montreal, QC, Canada, 25 February 2023 - 1 March 2023. Pages 447-449, ACM, 2023.

Authors

Yuchen Zhong

Guangming Sheng

Juncheng Liu

Jinhui Yuan

Chuan Wu
