Just-In-Time Checkpointing: Low Cost Error Recovery from Deep Learning Training Failures

Tanmaey Gupta, Sanjeev Krishnan, Rituraj Kumar, Abhishek Vijeev, Bhargav S. Gulavani, Nipun Kwatra, Ramachandran Ramjee, Muthian Sivathanu. Just-In-Time Checkpointing: Low Cost Error Recovery from Deep Learning Training Failures. In Proceedings of the Nineteenth European Conference on Computer Systems, EuroSys 2024, Athens, Greece, April 22-25, 2024. Pages 1110-1125, ACM, 2024.
