Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators

HPC4E researcher Paolo Rech (UFRGS) gave the talk "Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators" at HPCA 2017 (High Performance Computer Architecture 2017), which took place in Austin (USA) from 4 to 8 February. The talk was part of session 8A, which focused on accelerators and was chaired by Akanksha Jain (UT Austin). As Rech explains: "we evaluate through radiation experiment the reliability of both Intel Xeon Phi and NVIDIA K40. We qualify, not just quantify, Silent Data Corruption at the output of representative applications. I have presented data that covers about 91,000 years of natural neutron exposure for each architecture".

Rech also stresses that "HPCA is the major event for computer architecture researchers. Reliability is becoming a mainstream in the community, and with our contribution we showed how to get realistic data using experiments. People from the major industries, universities, and research centers attended the conference and our talk". Promising collaborations on reliability enhancement for HPC are already on the horizon. Future plans include using fault injection to better analyze the sources of critical Silent Data Corruptions (SDCs) and proposing specific, efficient hardening strategies for them.