Wubiao Xu, Xin Huang, Shiman Meng, Weiping Zhang, Luanzheng Guo, Kento Sato. An Efficient Checkpointing System for Large Machine Learning Model Training. In SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, Atlanta, GA, USA, November 17-22, 2024. pages 896-900, IEEE, 2024.