Pengfei Yu 0002, Jingjing Gu, Hao Han, Dazhong Shen, Bao Wen, Yang Liu 0390. Exploring and Mitigating Failure Behavior of Large Language Model Training Workloads in HPC Systems. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2025, St. Louis, MO, USA, November 16-21, 2025. pages 1165-1179, ACM, 2025.