Towards Scalable Distributed Training of Deep Learning on Public Cloud Clusters

Shaohuai Shi, Xianhao Zhou, Shutao Song, Xingyao Wang, Zilin Zhu, Xue Huang, Xinan Jiang, Feihu Zhou, Zhenyu Guo, Liqiang Xie, Rui Lan, Xianbin Ouyang, Yan Zhang, Jieqian Wei, Jing Gong, Weiliang Lin, Ping Gao, Peng Meng, Xiaomin Xu, Chenyang Guo, Bo Yang, Zhibo Chen 0006, Yongjian Wu, Xiaowen Chu 0001. Towards Scalable Distributed Training of Deep Learning on Public Cloud Clusters. In Alex Smola, Alex Dimakis, Ion Stoica, editors, Proceedings of Machine Learning and Systems 2021, MLSys 2021, virtual, April 5-9, 2021. mlsys.org, 2021. [doi]

Abstract

Abstract is missing.