Coping with silent and fail-stop errors at scale by combining replication and checkpointing

Anne Benoit, Aurélien Cavelan, Franck Cappello, Padma Raghavan, Yves Robert, Hongyang Sun. Coping with silent and fail-stop errors at scale by combining replication and checkpointing. J. Parallel Distrib. Comput., 122:209-225, 2018.

Abstract

Abstract is missing.