dPRO: A Generic Performance Diagnosis and Optimization Toolkit for Expediting Distributed DNN Training

Hanpeng Hu, Chenyu Jiang, Yuchen Zhong, Yanghua Peng, Chuan Wu, Yibo Zhu, Haibin Lin, Chuanxiong Guo. dPRO: A Generic Performance Diagnosis and Optimization Toolkit for Expediting Distributed DNN Training. In Diana Marculescu, Yuejie Chi, Carole-Jean Wu, editors, Proceedings of Machine Learning and Systems 2022, MLSys 2022, Santa Clara, CA, USA, August 29 - September 1, 2022. mlsys.org, 2022.
