Towards Low-Overhead Resilience for Data Parallel Deep Learning

Bogdan Nicolae, Tanner Hobson, Orcun Yildiz, Tom Peterka, Dmitriy Morozov. Towards Low-Overhead Resilience for Data Parallel Deep Learning. In 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing, CCGrid 2022, Taormina, Italy, May 16-19, 2022. Pages 336-345, IEEE, 2022.

Abstract

Abstract is missing.