Kiran Gunnam, Pranshu Tapan Mandal, Shivam Khandelwal, Rajesh Bhagwat. Enhancing Training Efficiency: A Novel Approach to Handling GPU Failures in Large-Scale Distributed System for LLM Training. In 7th IEEE International Conference on Artificial Intelligence Circuits and Systems, AICAS 2025, Bordeaux, France, April 28-30, 2025. pages 1-5, IEEE, 2025.