Reliability of Large Scale GPU Clusters for Deep Learning Workloads

Junjie Qian, Taeyoon Kim, Myeongjae Jeon. Reliability of Large Scale GPU Clusters for Deep Learning Workloads. In Jure Leskovec, Marko Grobelnik, Marc Najork, Jie Tang, Leila Zia, editors, Companion of The Web Conference 2021, Virtual Event / Ljubljana, Slovenia, April 19-23, 2021. Pages 179-181, ACM / IW3C2, 2021.

Authors

Junjie Qian

Taeyoon Kim

Myeongjae Jeon
