Learning from Failure Across Multiple Clusters: A Trace-Driven Approach to Understanding, Predicting, and Mitigating Job Terminations

Nosayba El-Sayed, Hongyu Zhu, Bianca Schroeder. Learning from Failure Across Multiple Clusters: A Trace-Driven Approach to Understanding, Predicting, and Mitigating Job Terminations. In Kisung Lee, Ling Liu, editors, 37th IEEE International Conference on Distributed Computing Systems, ICDCS 2017, Atlanta, GA, USA, June 5-8, 2017, pages 1333-1344. IEEE Computer Society, 2017.
