An Efficient Checkpointing System for Large Machine Learning Model Training

Wubiao Xu, Xin Huang, Shiman Meng, Weiping Zhang, Luanzheng Guo, Kento Sato. An Efficient Checkpointing System for Large Machine Learning Model Training. In SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, Atlanta, GA, USA, November 17-22, 2024. Pages 896-900, IEEE, 2024.

Authors

Wubiao Xu

Xin Huang

Shiman Meng

Weiping Zhang

Luanzheng Guo

Kento Sato