Ning Lu, Qian Xie, Hao Zhang, Wenyi Fang, Yang Zheng, Zheng Hu, Jiantao Ma. Training Overhead Ratio: A Practical Reliability Metric for Large Language Model Training Systems. In 35th IEEE International Symposium on Software Reliability Engineering, ISSRE 2024 - Workshops, Tsukuba, Japan, October 28-31, 2024, pages 391-393. IEEE, 2024.