Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates

Insu Jang, Zhenning Yang, Zhen Zhang, Xin Jin, Mosharaf Chowdhury. Oobleck: Resilient Distributed Training of Large Models Using Pipeline Templates. In Jason Flinn, Margo I. Seltzer, Peter Druschel, Antoine Kaufmann, Jonathan Mace, editors, Proceedings of the 29th Symposium on Operating Systems Principles, SOSP 2023, Koblenz, Germany, October 23-26, 2023. pages 382-395, ACM, 2023.