Swift: Expedited Failure Recovery for Large-Scale DNN Training

Yuchen Zhong, Guangming Sheng, Juncheng Liu, Jinhui Yuan, Chuan Wu. Swift: Expedited Failure Recovery for Large-Scale DNN Training. In Maryam Mehri Dehnavi, Milind Kulkarni, Sriram Krishnamoorthy, editors, Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, PPoPP 2023, Montreal, QC, Canada, 25 February 2023 - 1 March 2023. pages 447-449, ACM, 2023. https://doi.org/10.1145/3572848.3577510

@inproceedings{ZhongSLY023,
  title = {Swift: Expedited Failure Recovery for Large-Scale DNN Training},
  author = {Yuchen Zhong and Guangming Sheng and Juncheng Liu and Jinhui Yuan and Chuan Wu},
  year = {2023},
  doi = {10.1145/3572848.3577510},
  url = {https://doi.org/10.1145/3572848.3577510},
  researchr = {https://researchr.org/publication/ZhongSLY023},
  pages = {447--449},
  booktitle = {Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, PPoPP 2023, Montreal, QC, Canada, 25 February 2023 - 1 March 2023},
  editor = {Maryam Mehri Dehnavi and Milind Kulkarni and Sriram Krishnamoorthy},
  publisher = {ACM},
  isbn = {979-8-4007-0015-6},
}