Reliability of Large Scale GPU Clusters for Deep Learning Workloads

Junjie Qian, Taeyoon Kim, Myeongjae Jeon. Reliability of Large Scale GPU Clusters for Deep Learning Workloads. In Jure Leskovec, Marko Grobelnik, Marc Najork, Jie Tang, Leila Zia, editors, Companion of The Web Conference 2021, Virtual Event / Ljubljana, Slovenia, April 19-23, 2021, pages 179-181. ACM / IW3C2, 2021. doi:10.1145/3442442.3452056

@inproceedings{QianKJ21,
  title = {Reliability of Large Scale GPU Clusters for Deep Learning Workloads},
  author = {Junjie Qian and Taeyoon Kim and Myeongjae Jeon},
  year = {2021},
  doi = {10.1145/3442442.3452056},
  url = {https://doi.org/10.1145/3442442.3452056},
  researchr = {https://researchr.org/publication/QianKJ21},
  cites = {0},
  citedby = {0},
  pages = {179--181},
  booktitle = {Companion of The Web Conference 2021, Virtual Event / Ljubljana, Slovenia, April 19-23, 2021},
  editor = {Jure Leskovec and Marko Grobelnik and Marc Najork and Jie Tang and Leila Zia},
  publisher = {ACM / IW3C2},
  isbn = {978-1-4503-8313-4},
}