An Efficient Checkpointing System for Large Machine Learning Model Training

Wubiao Xu, Xin Huang, Shiman Meng, Weiping Zhang, Luanzheng Guo, Kento Sato. An Efficient Checkpointing System for Large Machine Learning Model Training. In SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, Atlanta, GA, USA, November 17-22, 2024. pages 896-900, IEEE, 2024.
