L4: Diagnosing Large-scale LLM Training Failures via Automated Log Analysis

Zhihan Jiang, Junjie Huang 0008, Guangba Yu, Zhuangbin Chen, Yichen Li 0003, Renyi Zhong, Cong Feng, Yongqiang Yang, Zengyin Yang, Michael R. Lyu. L4: Diagnosing Large-scale LLM Training Failures via Automated Log Analysis. In Leonardo Montecchi, Jingyue Li, Denys Poshyvanyk, Dongmei Zhang 0001, editors, Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering, FSE Companion 2025, Clarion Hotel Trondheim, Trondheim, Norway, June 23-28, 2025. Pages 51-63, ACM, 2025.
