•   When: Tuesday, February 18, 2020 from 11:00 AM to 12:00 PM
  •   Speakers: Chao Chen
  •   Location: Engineering Building 4201

Abstract:

Transient faults are becoming a significant concern for emerging extreme-scale high-performance computing (HPC) systems. This nascent problem is exacerbated by technology trends toward smaller transistor sizes, higher circuit density, and the use of near-threshold voltage techniques to save power. Transient faults can corrupt the execution of long-running scientific applications, leading either to silent data corruptions (SDCs; incorrect values in outputs) or to soft failures (abnormal termination, e.g., process crashes). While SDCs undermine confidence in computations and can lead to inaccurate and untrustworthy scientific insights, soft failures degrade system efficiency and performance, since the affected jobs must be restarted from their checkpoints and must re-execute the lost computations before continuing normal operation. As a consequence, both the detection of and the recovery from transient faults must be addressed in HPC system design, for the sake of usability (trust in the output results) and efficiency (speedup and energy efficiency). In particular, solutions must have very low overhead during regular execution and must be able to detect (and potentially recover from) a large set of faults with negligible downtime.

 

In this talk, I will present two compiler-driven resilience techniques, LADR and CARE, designed for SDC detection and soft failure (SF) recovery, respectively. By exploiting application knowledge through compiler techniques, both achieve high fault coverage (~80%) while incurring negligible or even zero runtime overhead. I will first describe LADR, which detects SDCs in scientific applications by watching for data anomalies in their state variables (those of scientific interest) and employs compile-time data-flow analysis to minimize the number of monitored variables, thereby reducing runtime and memory overheads. The compiler analysis uses the algebraic properties of the underlying data flow to select the variables in which a fault appears in magnified form. The technique maintains a high level of fault coverage with low false positive rates. I will then introduce CARE, a compiler-assisted online recovery technique for soft failures. CARE has two advantages: it can repair a crashed process on the fly within milliseconds, allowing the application to continue executing instead of being terminated and restarted, and it incurs zero runtime overhead during normal execution. For recovery, it uses the live variables of the program that are resident in registers and reconstructs the failed computation. Finally, I will conclude by describing future directions for applying compiler technologies to implement the desired system properties efficiently.

  

Bio: Chao Chen is a Ph.D. candidate in the School of Computer Science at Georgia Tech, advised by Santosh Pande and Greg Eisenhauer. His research interests are broadly in compilers and systems, with thesis research on lightweight resilience techniques for HPC applications that exploit application properties. His work has appeared in top-tier HPC venues and was nominated for Best Student Paper at SC '19.
