Shrink or Substitute: Handling Process Failures in HPC Systems Using In-Situ Recovery

Rizwan A. Ashraf, Saurabh Hukerikar, Christian Engelmann. Shrink or Substitute: Handling Process Failures in HPC Systems Using In-Situ Recovery. In Ivan Merelli, Pietro Liò, Igor V. Kotenko, editors, 26th Euromicro International Conference on Parallel, Distributed and Network-based Processing, PDP 2018, Cambridge, United Kingdom, March 21-23, 2018, pages 178-185. IEEE Computer Society, 2018.

@inproceedings{AshrafHE18-0,
  title = {Shrink or Substitute: Handling Process Failures in {HPC} Systems Using In-Situ Recovery},
  author = {Rizwan A. Ashraf and Saurabh Hukerikar and Christian Engelmann},
  year = {2018},
  doi = {10.1109/PDP2018.2018.00032},
  url = {http://doi.ieeecomputersociety.org/10.1109/PDP2018.2018.00032},
  researchr = {https://researchr.org/publication/AshrafHE18-0},
  cites = {0},
  citedby = {0},
  pages = {178--185},
  booktitle = {26th Euromicro International Conference on Parallel, Distributed and Network-based Processing, PDP 2018, Cambridge, United Kingdom, March 21-23, 2018},
  editor = {Ivan Merelli and Pietro Liò and Igor V. Kotenko},
  publisher = {IEEE Computer Society},
  isbn = {978-1-5386-4975-6},
}