GEMINI: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints

Zhuang Wang, Zhen Jia, Shuai Zheng, Zhen Zhang, Xinwei Fu, T. S. Eugene Ng, Yida Wang. GEMINI: Fast Failure Recovery in Distributed Training with In-Memory Checkpoints. In Jason Flinn, Margo I. Seltzer, Peter Druschel, Antoine Kaufmann, Jonathan Mace, editors, Proceedings of the 29th Symposium on Operating Systems Principles, SOSP 2023, Koblenz, Germany, October 23-26, 2023. pages 364-381, ACM, 2023.

Authors

Zhuang Wang

Zhen Jia

Shuai Zheng

Zhen Zhang

Xinwei Fu

T. S. Eugene Ng

Yida Wang
