GSRC Student Profile:
Research Overview: Providing Low-cost Reliability Solutions for Multicore Systems Running Multithreaded Applications
Continued device scaling is producing smaller devices that are increasingly vulnerable to errors from sources such as wear-out and high-energy particle strikes. As this reliability threat grows, traditional redundancy-based solutions will become unsuitable for the mainstream computing market owing to their high overheads. A promising approach is to use software-level symptoms to detect hardware faults. Researchers have proposed always-on monitors that perform such detection at low cost. In the rare event of a fault, a more expensive diagnosis mechanism can be invoked alongside a checkpoint/replay-based recovery procedure. Previous studies, however, considered only single-threaded applications.
My proposed research is to develop a comprehensive reliability solution for multicore systems running multithreaded applications. In particular, the contributions of this work are the following.
(1) The previously proposed symptoms are augmented with multicore counterparts, resulting in a high coverage of 99.1% and a low silent data corruption (SDC) rate of 0.2% for permanent faults. This shows that symptom-based detection is applicable to faults in multicore systems running multithreaded workloads.
(2) Cross-core fault propagation makes diagnosis in the presence of hardware faults a challenge. We proposed a novel mechanism that identifies the faulty core by running the symptom-activating trace from each thread in isolation using opportunistic Triple-Modular Redundancy (TMR), with zero performance overhead in fault-free cases. Our results show that this technique of synthesizing TMR to diagnose the faulty core achieves high diagnosability at low diagnosis latencies. Once the faulty core is identified, we rely on previous work to diagnose the faulty microarchitectural unit.
This work was recently accepted for publication at MICRO-09.
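As a rough illustration of the majority-voting idea behind TMR-based core diagnosis, the toy sketch below replays one trace on three modeled cores and flags the core whose result disagrees with the other two. It is not the mechanism from the MICRO-09 work: the core model (a 32-bit accumulator), the stuck-at-1 fault, and every name in it (`healthy_core`, `make_stuck_at_core`, `diagnose_faulty_core`) are assumptions made purely for illustration.

```python
from collections import Counter
from typing import Callable, Dict, Optional, Sequence

# Hypothetical model: a "core" is a function that replays a trace of
# integers and returns its final architectural state (one 32-bit value).
Trace = Sequence[int]
Core = Callable[[Trace], int]

MASK32 = 0xFFFFFFFF


def healthy_core(trace: Trace) -> int:
    """Fault-free core: accumulates the trace modulo 2**32."""
    acc = 0
    for value in trace:
        acc = (acc + value) & MASK32
    return acc


def make_stuck_at_core(bit: int) -> Core:
    """Build a core whose accumulator has one bit permanently stuck at 1."""
    def faulty_core(trace: Trace) -> int:
        acc = 0
        for value in trace:
            acc = ((acc + value) & MASK32) | (1 << bit)
        return acc
    return faulty_core


def diagnose_faulty_core(cores: Dict[str, Core], trace: Trace) -> Optional[str]:
    """Replay the trace on exactly three cores and majority-vote the results.

    Returns the name of the single disagreeing core, or None when all
    three agree (the trace did not expose a fault).
    """
    if len(cores) != 3:
        raise ValueError("TMR voting needs exactly three cores")
    results = {name: core(trace) for name, core in cores.items()}
    majority_value, votes = Counter(results.values()).most_common(1)[0]
    if votes == 3:
        return None
    if votes < 2:
        raise RuntimeError("no majority: more than one core disagrees")
    return next(name for name, value in results.items() if value != majority_value)


cores = {
    "core0": healthy_core,
    "core1": make_stuck_at_core(3),  # injected permanent fault (assumption)
    "core2": healthy_core,
}
print(diagnose_faulty_core(cores, [1, 2, 4]))
```

In this model, a trace that never exercises the stuck bit (for example `[8]`) produces identical results on all three cores, so no core is flagged. That is one way to see why the replayed trace needs to be the one that activated the symptom.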
I have also been involved in the development of a novel fault injection infrastructure that uses hierarchical simulation to study the system-level manifestations of permanent (and transient) gate-level faults. We have shown that this infrastructure incurs a modest average performance overhead of under 2x compared to pure microarchitectural simulation while retaining the accuracy of gate-level fault simulation.
Overall, the aim of this study is to provide a comprehensive hardware reliability solution for multicore systems. In the future, I plan to develop an FPGA-based framework to validate the hardware fault detection and diagnosis mechanisms developed for multicore systems. I also plan to extend the FPGA framework to study off-core faults.