Resilient System Design

 

Video Introduction
by leader Todd Austin
With further scaling of technology, reliability problems are bound to become the dominant design challenge. Design integrity can be impacted in many ways. Some faults are permanent, others intermittent. Some are due to design or manufacturing artifacts, others to environmental impact or aging. The list below is only a first-order enumeration of potential fault sources:
  • Increasing design complexity, which raises the probability of functional errors dramatically. Traditional verification is just not up to the task.
  • Increasing device parameter variations, caused by manufacturing and lithographic effects as well as the statistical behavior of extremely scaled devices.
  • Reduced noise margins (resulting from supply voltage reduction), combined with non-scaling noise sources such as supply noise and crosstalk.
  • Aggressive deployment - to reduce energy or increase performance, designs are run with smaller or even negative margins.
  • Soft errors - a direct result of capacitance scaling.
  • Environmental variations, such as temperature gradients.
  • Aging (extensive burn-in will become harder).
It may be possible to address each of these challenges individually - in fact, this is what should be done if a problem is tractable and solvable. Yet, in general, this will lead to designs that are over-dimensioned, slow and/or power hungry. As a result, one may ask whether further technology scaling even makes sense. In contrast, we believe that a holistic and systematic approach to error-resilient design is the preferable option. The central tenet of this approach is that in a world where billions of transistors are available, it makes perfect economic sense to devote a sizable fraction of these transistors to ensuring that the design performs correctly under all circumstances (hence giving a yield close to 100% without giving up performance or energy efficiency). The main challenge is to use these redundant devices in an effective and versatile fashion. While some errors are best prevented at the circuit level, others are most effectively addressed at the architecture or system level. Systematic reasoning about these trade-offs and a full exploration of novel solutions are only possible in a design framework that allows for the expression and analysis of reliability constraints and requirements at the different abstraction layers.

To support such an error-resilient design methodology, the following research needs must be addressed.

  • System-level design abstractions that allow for expression of reliability requirements and propagation of reliability constraints.
  • Timing strategies that seamlessly support self-adaptation.
  • Built-in self-diagnosis and self-tuning, compensation and recovery schemes for heterogeneous components, including digital, high-speed IO, RF, MEMS, optical, and nano.
  • Self-adapting systems that dynamically adjust or correct for technology and environment variations, so that system-level design specifications are met unconditionally.
  • Reliable system design based on unreliable components, as future technology will have to rely upon devices whose operation cannot be specified deterministically. This challenge has to be addressed in a holistic way from device modeling to system level design.
  • Design solutions and methodologies that encompass "better than worst-case" design, especially designs that use error resiliency to produce self-adaptive, self-checking, and ultimately self-healing systems that remain robust and efficient even when implemented in the fragile fabrication technologies envisioned for the near future.
  • On-line X architectures and methodologies, which increasingly move verification, task mapping and test on-line to accommodate process and environmental variations and dynamic errors, and to reduce the verification cost of increasingly complex hard-soft systems.
  • Micro-architectures and system architectures that are "bullet-proof"; that is, integrated systems that continue to operate correctly even in the presence of run-time failures and malfunctions.
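
As a rough illustration of the "reliable system from unreliable components" item above, the following minimal Python sketch applies triple modular redundancy with majority voting to a simulated faulty adder. It is not a method described by this theme; the names unreliable_add, majority_vote, tmr_add and error_rate, the single-bit-flip fault model and the fault rate are all assumptions made for the example.

```python
import random
from collections import Counter


def unreliable_add(a, b, fault_rate, rng):
    """Hypothetical unreliable component: computes a + b, but with
    probability fault_rate flips one of the low 8 bits of the result."""
    result = a + b
    if rng.random() < fault_rate:
        result ^= 1 << rng.randrange(8)  # assumed fault model: single bit flip
    return result


def majority_vote(values):
    """Return the value held by a strict majority of the inputs, or None."""
    value, count = Counter(values).most_common(1)[0]
    return value if count * 2 > len(values) else None


def tmr_add(a, b, fault_rate, rng):
    """Run three redundant copies of the unreliable adder and vote."""
    copies = [unreliable_add(a, b, fault_rate, rng) for _ in range(3)]
    return majority_vote(copies)


def error_rate(adder, trials, fault_rate, seed=0):
    """Fraction of trials in which adder(20, 22, ...) does not return 42."""
    rng = random.Random(seed)
    errors = sum(adder(20, 22, fault_rate, rng) != 42 for _ in range(trials))
    return errors / trials


if __name__ == "__main__":
    print("single copy:", error_rate(unreliable_add, 20000, 0.05))
    print("TMR + vote :", error_rate(tmr_add, 20000, 0.05))
```

Under this fault model the voted result is wrong only when at least two of the three copies fail in the same trial, so the observed error rate should fall well below that of a single copy, at the cost of three times the hardware. The voter is assumed to be fault-free here, which a real design could not take for granted.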
Because many of the integrity challenges will only come into full focus starting from the 35 nm node, the time window addressed in this theme is somewhat further out (5-10 years).
 
©1998-2008 GSRC