Understanding Permanent Hardware Failures in Deep Learning Training Accelerator Systems

Yi He, Yanjing Li. Understanding Permanent Hardware Failures in Deep Learning Training Accelerator Systems. In IEEE European Test Symposium, ETS 2023, Venezia, Italy, May 22-26, 2023. Pages 1-6, IEEE, 2023. doi:10.1109/ETS56758.2023.10173972

@inproceedings{HeL23-10,
  title = {Understanding Permanent Hardware Failures in Deep Learning Training Accelerator Systems},
  author = {Yi He and Yanjing Li},
  year = {2023},
  doi = {10.1109/ETS56758.2023.10173972},
  url = {https://doi.org/10.1109/ETS56758.2023.10173972},
  researchr = {https://researchr.org/publication/HeL23-10},
  cites = {0},
  citedby = {0},
  pages = {1--6},
  booktitle = {IEEE European Test Symposium, ETS 2023, Venezia, Italy, May 22-26, 2023},
  publisher = {IEEE},
  isbn = {979-8-3503-3634-4},
}