Towards Low-Overhead Resilience for Data Parallel Deep Learning

Bogdan Nicolae, Tanner Hobson, Orcun Yildiz, Tom Peterka, Dmitriy Morozov. Towards Low-Overhead Resilience for Data Parallel Deep Learning. In 22nd IEEE International Symposium on Cluster, Cloud and Internet Computing, CCGrid 2022, Taormina, Italy, May 16-19, 2022, pages 336-345. IEEE, 2022.

Authors

Bogdan Nicolae

Tanner Hobson

Orcun Yildiz

Tom Peterka

Dmitriy Morozov
