Understanding and Improving Failure Tolerant Training for Deep Learning Recommendation with Partial Recovery

Kiwan Maeng, Shivam Bharuka, Isabel Gao, Mark C. Jeffrey, Vikram Saraph, Bor-Yiing Su, Caroline Trippel, Jiyan Yang, Mike Rabbat, Brandon Lucia, Carole-Jean Wu. Understanding and Improving Failure Tolerant Training for Deep Learning Recommendation with Partial Recovery. In Alex Smola, Alex Dimakis, Ion Stoica, editors, Proceedings of Machine Learning and Systems 2021, MLSys 2021, virtual, April 5-9, 2021. mlsys.org, 2021.

Authors

Kiwan Maeng

Shivam Bharuka

Isabel Gao

Mark C. Jeffrey

Vikram Saraph

Bor-Yiing Su

Caroline Trippel

Jiyan Yang

Mike Rabbat

Brandon Lucia

Carole-Jean Wu
