Non-Idle Machine-Aware Worker Placement for Efficient Distributed Training in GPU Clusters

Jin Fang, Gongming Zhao, Hongli Xu, Luyao Luo, Zhen Yao, An Xie. Non-Idle Machine-Aware Worker Placement for Efficient Distributed Training in GPU Clusters. In 32nd IEEE International Conference on Network Protocols, ICNP 2024, Charleroi, Belgium, October 28-31, 2024. Pages 1-11, IEEE, 2024.
