GSRC Presentations



   TLM Platform for Heterogeneous System-on-Chip Integration
Pub ID:  2679 Authors:  Michele Petracca, Emilio G. Cota, Luca Carloni
As Moore's law continues to progress, designers can now integrate multiple heterogeneous cores on the same die in order to build increasingly complex Systems-on-Chip (SoC). However, more functionalities bring higher design and verification challenges, which negatively impact time-to-market. Recently High-Level Synthesis (HLS) has finally emerged as a technology that can bridge the gap between complexity and productivity. HLS tools can derive RTL descriptions from behavioral models. In turn, high-level models enable faster simulations and improve design portability and reusability. Transaction Level Modeling (TLM) further enhances the power of HLS by specifying standard interfaces that clearly separate communication and computation. We propose a design methodology and a corresponding TLM platform architecture that allow the designer i) to easily integrate multiple heterogeneous components to build an SoC, ii) to optimize such integration, and, iii) to start early hardware-software co-design and functional verification of the SoC.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Hierarchical Mixed-Signal/RF System Level Test & Validation
Pub ID:  2698 Authors:  Shyam Kumar Devarakond, Vishwanath Natarajan, Aritra Banerjee, Debashis Banerjee, Shreyas Sen, Hyun Choi, Abhijit Chatterjee
In this work, a new hierarchical signature-driven testing/validation approach for RF systems has been developed. The proposed method determines module-level performances from the system-level response (signature) to an applied RF diagnostic test using top-down model diagnosis. A comprehensive set of specifications of multiple RF module chains is computed simultaneously from the observed DUT response using a single data acquisition. A key contribution of this work is the use of test generation algorithms to determine the optimized test stimulus from which all the DUT specifications, including the system-level EVM metric, are computed. The proposed concept is applied to MIMO and SISO OFDM WLAN RF systems. Experimental results for a 2.4 GHz commercial WLAN transceiver product validate the proposed concept.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Combined Loop Transformation and Hierarchy Allocation for Data Reuse Optimization
Pub ID:  2723 Authors:  Jason Cong, Peng Zhang, Yi Zou
External memory bandwidth is a crucial bottleneck in the majority of computation-intensive applications for both performance and power consumption. Data reuse is an important technique for reducing external memory accesses by utilizing the memory hierarchy. Loop transformation for data locality and memory hierarchy allocation are the two major steps in the data reuse optimization flow, but they were previously carried out independently. This paper presents a combined approach which optimizes loop transformation and memory hierarchy allocation simultaneously to achieve globally optimal results on external memory bandwidth and on-chip data reuse buffer size. We develop an efficient and optimal solution to the combined problem by decomposing the solution space into two subspaces with linear and nonlinear constraints respectively. We show that we can significantly prune the solution space without losing optimality. Experimental results show that our scheme can save up to 31% of on-chip memory size compared to the separated two-step method when the memory hierarchy allocation problem is not trivial. Also, run-time complexity is acceptable for practical cases.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Platform Viability Theme Overview
Pub ID:  2745 Author:  Kwang‑Ting Cheng
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   RF Real-Time Adaptation for Error Resilience, Low Power and Performance
Pub ID:  2765 Author:  Abhijit Chatterjee
CMOS technology scaling along with the resulting large variability of circuit performance metrics in the presence of manufacturing process variations has made post-silicon circuit built-in test and adaptation/tuning almost a necessity for deeply scaled (DSM) technologies. Currently, circuits are designed to tolerate worst-case process corners. In addition, circuits must be designed for worst-case operating conditions as well (e.g. environmental noise). This forces designers to excessively guard-band their products and increasingly more so as technology scales down to the 45nm node and beyond, resulting in unacceptable power-performance-yield tradeoffs. One way to tackle this problem is to design circuits that are "self-aware" and can adapt to environmental operating conditions and process variations to conserve power while maximizing yield and reliability. Such self-awareness involves incorporation of built-in test, diagnosis and tuning/adaptation mechanisms into the circuits and systems concerned. A key issue is that of test, diagnosis and tuning of complex circuit and system-level parameters that must be evaluated and traded off against one another during the adaptation process without access to complex external test instrumentation. This talk summarizes recent results obtained in the design of self-aware/adaptive wireless communications systems and points to directions for future work in this area.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   A Design Framework for Distributed Power Management of Heterogeneous Systems-on-Chip
Pub ID:  2653 Authors:  Alberto Puggelli, Michele Petracca, Pierluigi Nuzzo, Luca Carloni, Alberto Sangiovanni‑Vincentelli
We present a framework to support the design of integrated systems with on-chip distributed power management, targeting fine-grained voltage island architectures and dynamic voltage and frequency scaling techniques. Our flow builds upon the platform-based design methodology, to allow the automated co-design of digital circuits with the corresponding analog power management units. First, we represent system functionality with Transaction-Level Modeling (TLM). Second, we abstract all architectural components as library elements decorated with behavioral and accurately-extracted performance models. Finally, the design is cast as an optimization problem where the tradeoff between power consumption and area is explored, subject to preserving the functionality of the system. Our approach is validated on the design of a JPEG encoder in a 90nm CMOS process: the fully-integrated encoder consumes 25.6% less power with an area penalty of 14.7% with respect to a similar system operated by a single off-chip DCDC converter with fixed output voltage.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Parallel Assertions for Debugging Parallel Programs
Pub ID:  2681 Authors:  Daniel Schwartz‑Narbonne, Sharad Malik, Feng Liu
A parallel program must execute correctly even in the presence of unpredictable thread interleavings. This interleaving makes it hard to write correct parallel programs, and also makes it hard to find bugs in incorrect parallel programs. A range of tools have been developed to help debug parallel programs, ranging from atomicity-violation and data-race detectors to model-checkers and theorem provers. One technique that has been successful for debugging sequential programs, but less effective for parallel programs, is running the program using assertion predicates provided by the developer. These assertions allow programmers to specify and check their assumptions. In a multi-threaded program, the programmer’s assumptions include both the current state, and any actions (e.g. access to shared memory) that other, parallel executing threads might take. We introduce parallel assertions which allow programmers to express these assumptions for parallel programs using simple and intuitive syntax and semantics. We present a proof-of-concept implementation, and demonstrate its value by testing a number of benchmark programs using parallel assertions.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review
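
To illustrate the kind of assumption a parallel assertion captures (Pub ID 2681 above), the following is a minimal sketch that scans a recorded event trace for writes by other threads to a variable while an assertion scope on that variable is open. The trace format, the `Event` type, and `check_no_foreign_writes` are hypothetical names chosen for this illustration; they are not the syntax, semantics, or implementation presented by the authors.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Event:
    thread: int  # id of the thread performing the action
    kind: str    # "begin", "end", "read", or "write"
    var: str     # shared variable the event refers to


def check_no_foreign_writes(trace: List[Event]) -> List[Tuple[int, Event]]:
    """Return (index, event) pairs where a thread wrote a variable while
    a different thread held an open assertion scope on that variable."""
    open_scopes: Dict[str, Set[int]] = {}  # var -> threads with an open scope
    violations = []
    for i, ev in enumerate(trace):
        if ev.kind == "begin":
            open_scopes.setdefault(ev.var, set()).add(ev.thread)
        elif ev.kind == "end":
            open_scopes.get(ev.var, set()).discard(ev.thread)
        elif ev.kind == "write":
            owners = open_scopes.get(ev.var, set())
            # The write is a violation only if some *other* thread asserted on var.
            if any(owner != ev.thread for owner in owners):
                violations.append((i, ev))
    return violations


trace = [
    Event(0, "begin", "x"),
    Event(0, "write", "x"),  # allowed: the asserting thread itself
    Event(1, "write", "x"),  # violation: thread 1 writes inside thread 0's scope
    Event(0, "end", "x"),
    Event(1, "write", "x"),  # allowed: the scope is closed
]
violations = check_no_foreign_writes(trace)
print(violations)
```

This sketch checks a trace after the fact; a runtime checker would need to observe the accesses as they happen, which is outside what this example attempts.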

   Long term video segmentation through pixel level spectral clustering on GPUs
Pub ID:  2699 Authors:  Narayanan Sundaram, Kurt Keutzer
We introduce a new technique for performing video segmentation combining state-of-the-art image segmentation and optical flow algorithms on GPUs. We avoid pre-clustering into superpixels and probabilistic reasoning, and instead view the problem as a generalization of image segmentation techniques. Utilizing spectral clustering techniques at the pixel level (as opposed to 2D/3D superpixels), we demonstrate video segmentation over hundreds of frames, far beyond what has been achieved through pixel-level spectral segmentation techniques before. Our algorithm achieves accuracy comparable to other sparse motion clustering techniques while still maintaining 100% density in segmentation over long time periods. We achieve better accuracy with lower oversegmentation compared to dense video segmentation techniques. We exploit increased computational power made available through parallelism in GPUs and efficient numerical algorithms to achieve these results. We show our results on the motion segmentation dataset (Brox & Malik, 2010). Our technique can also be used to provide good quality 3D superpixels and extended to tasks where the ability to track 3D volumes over time is useful.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Hierarchical Mixed-Signal/RF System Level Test & Validation
Pub ID:  2725 Authors:  Shyam Kumar Devarakond, Vishwanath Natarajan, Debashis Banerjee, Aritra Banerjee, Hyun Choi, Shreyas Sen
In this work, a new hierarchical signature-driven testing/validation approach for RF systems has been developed. The proposed method determines module-level performances from the system-level response (signature) to an applied RF diagnostic test using top-down model diagnosis. A comprehensive set of specifications of multiple RF module chains is computed simultaneously from the observed DUT response using a single data acquisition. A key contribution of this work is the use of test generation algorithms to determine the optimized test stimulus from which all the DUT specifications, including the system-level EVM metric, are computed. The proposed concept is applied to MIMO and SISO OFDM WLAN RF systems. Experimental results for a 2.4 GHz commercial WLAN transceiver product validate the proposed concept.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Programming Concurrent Systems Theme Overview
Pub ID:  2746 Author:  Kurt Keutzer
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Stochastic Computing: Principles and Practice
Pub ID:  2766 Author:  Naresh Shanbhag
This presentation describes various principles and practices involved in stochastic computing.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Scalable Security Vulnerability Analysis via Sampling
Pub ID:  2654 Authors:  Joseph Greathouse, Ilya Wagner, Valeria Bertacco, Todd Austin

Software is complicated. This complexity makes it difficult to write correct programs, and the increasing performance of personal computers has exacerbated the problem through the drive for more features. Software bugs can cost money and lives, and information about them is now sold on black markets as weapons for cyber-warfare and for use in criminal activities.

Developers commonly use automated tools to hunt down these bugs. Dynamic dataflow analysis, one example of this type of tool, finds software errors by tracking meta-values associated with a program's runtime data, and can find subtle errors that would normally escape notice.

Such tests are more likely to find errors if they observe the program under a multitude of runtime situations. Ideally, a program's users would analyze it, testing situations the developers may never have thought to try. Unfortunately, the orders-of-magnitude slowdowns that accompany these systems limit their use to the development stage; few users would tolerate such overheads.

This poster describes methods of distributing these tests across large populations. We sample these analyses, missing some errors in exchange for keeping slowdowns below a user-defined threshold. Previous sampling methods are inadequate for dynamic dataflow analyses, so we describe a novel sampling mechanism, and describe hardware-assisted and software-only implementations.

In the end, the large populations gained by distributing the analysis can, in aggregate, analyze a larger portion of the program than is possible by any single user running a complete, but slow, analysis.

Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Row Buffer Locality-Aware Hybrid Memory Caching Policies
Pub ID:  2682 Authors:  HanBin Yoon, Justin Meza, Rachata Ausavarungnirun, Rachael Harding, Onur Mutlu
Phase change memory (PCM) is a promising alternative to DRAM, though its high latency and energy costs prohibit its adoption as a drop-in DRAM replacement. Hybrid memory systems comprising DRAM and PCM attempt to achieve the low access latencies of DRAM at the large capacities of PCM. However, known solutions neglect to assess the utility of data placed in DRAM, and hence fail to achieve high performance and energy efficiency. We propose a new DRAM-PCM hybrid memory system that exploits row buffer locality. The main idea is to place data that cause frequent row buffer miss accesses in DRAM, and data that do not in PCM. The key insight behind this approach is that data which generally hit in the row buffer can take advantage of the large memory capacity that PCM has to offer, and still be accessed as quickly as if the data were placed in DRAM. We observe our mechanism (1) effectively mitigates the high access latencies and energy costs of PCM, (2) reduces memory channel bandwidth consumption due to the migration of data between DRAM and PCM, and (3) prevents data that exhibit low reuse from polluting DRAM. We evaluate our row buffer locality-aware scheme and show that it outperforms previously proposed hybrid memory systems over a wide range of multiprogrammed workloads. Across 500 workloads on a 16-core system with 256 MB of DRAM, we find that our scheme improves system performance by 41% over using DRAM as a conventional cache to PCM, while reducing maximum slowdown by 32%. Furthermore, our scheme shows 17% performance gain over a competitive all-PCM memory system, and comes to within 21% of the performance of an unlimited-size all-DRAM memory system.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review
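
The placement idea in the abstract above (Pub ID 2682), moving data that frequently miss in the row buffer to DRAM and leaving data that mostly hit in PCM, can be sketched with a toy model. Everything here is an assumption made for illustration: a single PCM bank with one row buffer, a per-row miss counter, a fixed `miss_threshold`, and the class name `HybridMemory`. It does not reproduce the authors' mechanism or parameters.

```python
from collections import defaultdict


class HybridMemory:
    """Toy model: one PCM bank with a single row buffer, plus a set of
    rows that have been migrated to DRAM."""

    def __init__(self, miss_threshold: int = 2):
        self.miss_threshold = miss_threshold  # hypothetical tuning knob
        self.open_pcm_row = None              # row held in the PCM row buffer
        self.pcm_misses = defaultdict(int)    # row -> row buffer misses in PCM
        self.in_dram = set()                  # rows migrated to DRAM

    def access(self, row: int) -> str:
        if row in self.in_dram:
            return "dram"
        if row == self.open_pcm_row:
            return "pcm-hit"                  # served from the row buffer
        # Row buffer miss in PCM: open the row and count the miss.
        self.open_pcm_row = row
        self.pcm_misses[row] += 1
        if self.pcm_misses[row] >= self.miss_threshold:
            self.in_dram.add(row)             # later accesses go to DRAM
        return "pcm-miss"


mem = HybridMemory(miss_threshold=2)
outcomes = [mem.access(row) for row in [7, 7, 7, 3, 5, 3, 5, 3]]
print(outcomes)
print(sorted(mem.in_dram))
```

In this trace, row 7 is accessed back to back and keeps hitting in the row buffer, so it stays in PCM, while rows 3 and 5 evict each other and are migrated once they reach the threshold.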

   Coherent 3D Scene Understanding from Images
Pub ID:  2700 Authors:  Sid Ying‑Ze Bao, Jason Clemons, Mohit Bagra, Todd Austin, Silvio Savarese
We propose a new framework for jointly recognizing objects as well as reconstructing the underlying 3D geometry of the scene (cameras, points and objects). In our SSFM framework we leverage the intuition that measurements of keypoints and objects must be semantically and geometrically consistent across view points. Our SSFM framework has the unique ability to: i) estimate camera poses from object detections only; ii) enhance camera pose estimation, compared to feature-point-based SFM algorithms; iii) improve object detections given multiple uncalibrated images, compared to independently detecting objects in single images.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Performance-Oriented Mapping of Kernel Parallelism on FPGAs
Pub ID:  2726 Authors:  Alex Papakonstantinou, Yun Liang, Karthik Gururaj, John Stratton, Deming Chen, Jason Cong, Wen‑mei Hwu
Recent progress in High-Level Synthesis (HLS) techniques has helped raise the abstraction level of FPGA programming. However, implementation and performance evaluation of the HLS-generated RTL involve lengthy logic synthesis and physical design flows. Moreover, mapping of different levels of coarse-grained parallelism onto hardware spatial parallelism affects the final FPGA-based performance both in terms of cycles and frequency. Evaluation of the rich design space through the full implementation flow, starting with high-level source code and ending with a routed netlist, is prohibitive in various scientific and computing domains, thus hindering the adoption of reconfigurable computing. This work presents a framework for multilevel granularity parallelism exploration with HLS-order of efficiency. Our framework considers different granularities of parallelism for mapping CUDA kernels onto high performance FPGA-based accelerators. We leverage resource and clock period models to estimate the impact of multi-granularity parallelism extraction on execution cycles and frequency. The proposed Multilevel Granularity Parallelism Synthesis (ML-GPS) framework employs an efficient design space search heuristic in tandem with the estimation models as well as design layout information to derive a performance near-optimal configuration. Our experimental results demonstrate that ML-GPS can efficiently identify and generate CUDA kernel configurations that can offer competitive performance compared to software kernel execution on GPUs at a fraction of the energy cost.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Resilient Systems Theme Overview
Pub ID:  2747 Author:  Valeria Bertacco
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Michigan Visual Sonification System
Pub ID:  2655 Authors:  Jason Clemons, Sid Ying‑Ze Bao, Mohit Bagra, Max Seiden, Silvio Savarese, Todd Austin
Visual Sonification is the process of converting visual properties of objects into sound signals. This work describes the Michigan Visual Sonification System (MVSS) that utilizes this process to assist the visually impaired in distinguishing different objects in their surroundings. MVSS uses depth information to first segment and localize salient objects and then represents object appearance using histograms of visual features. A dictionary of invariant visual features (or words) is created in an a priori off-line learning phase using Bag-of-Words modeling. The histogram of a segmented object is then converted to a sound signal, the volume and 3D placement of which are determined by the relative position of the object with respect to the user. The system then relies on the considerable discriminating power of the human brain to localize and "classify" the sound, thus enabling the user to distinguish between visually distinct object classes. The poster describes the different components of MVSS in detail and presents some promising initial experimental results. It also demonstrates the need for improved computation of computer vision based feature extraction.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review
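
The Bag-of-Words step mentioned in the abstract above (Pub ID 2655), representing a segmented object as a histogram over a dictionary of visual words, can be sketched as follows. The two-dimensional feature vectors, the three-word dictionary, and the function names `nearest_word` and `bow_histogram` are assumptions for illustration only; the features and dictionary used by MVSS are not given in the abstract.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest_word(feature: Vector, dictionary: List[Vector]) -> int:
    """Index of the dictionary word closest to the feature (squared Euclidean)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(dictionary)), key=lambda k: sq_dist(feature, dictionary[k]))


def bow_histogram(features: List[Vector], dictionary: List[Vector]) -> List[float]:
    """Normalized count of how often each dictionary word is the nearest one."""
    counts = [0] * len(dictionary)
    for f in features:
        counts[nearest_word(f, dictionary)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(counts)


# Hypothetical dictionary learned off-line, and features from one segmented object.
dictionary = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
features = [(0.1, 0.0), (0.9, 1.2), (1.1, 0.8), (4.0, 6.0)]
histogram = bow_histogram(features, dictionary)
print(histogram)
```

A sonification stage would then map each histogram bin to some property of the sound; that mapping is not specified in the abstract and is not attempted here.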

   Supervised Design Space Exploration of Accelerators and Cores in Heterogeneous SoCs
Pub ID:  2683 Authors:  Hung‑Yi Liu, Michele Petracca, Luca Carloni

Emerging heterogeneous Systems-on-Chip (SoCs) feature a mix of hardware accelerators and programmable cores. In order to raise the design productivity for these SoCs, it is critical both to reuse pre-designed soft IP components and to implement the components with synthesis tools. We present the first top-down design-space exploration framework that characterizes the system's cost-performance trade-off by adaptively synthesizing its components. Our approach minimizes the number of synthesis runs while discovering the desired system's Pareto set.

At the RTL level, we propose an optimal algorithm for Compositional Approximation of Pareto Sets (CAPS). For bi-objective design spaces, CAPS can return an approximate Pareto set that captures the cost-performance tradeoff with guaranteed accuracy in the fewest possible number of synthesis runs. On the other hand, at the system level, we propose a high-level synthesis (HLS)-driven methodology to enhance design reuse. Starting from libraries of SystemC specifications and companion HLS configurations, we present an algorithm for HLS planning that enables an effective parallelization of the HLS runs to construct the system-level Pareto curve.

Nov 16, 2011,   GSRC/MuSyC Annual Joint Review
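
For readers unfamiliar with the Pareto sets referred to in the abstract above (Pub ID 2683), the following is a minimal sketch of exact Pareto filtering over a list of already-synthesized design points. It is not the CAPS algorithm, which approximates the set while minimizing synthesis runs; the objective pair (cost, latency), both to be minimized, and the sample values are assumptions made for illustration.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (cost, latency), both minimized


def pareto_set(points: List[Point]) -> List[Point]:
    """Return the non-dominated points, sorted by increasing cost."""
    front: List[Point] = []
    best_latency = float("inf")
    # After sorting by cost (then latency), a point is non-dominated exactly
    # when its latency is strictly better than every cheaper-or-equal point.
    for cost, latency in sorted(set(points)):
        if latency < best_latency:
            front.append((cost, latency))
            best_latency = latency
    return front


# Hypothetical implementations of one component: (area cost, latency).
designs = [(10, 9.0), (12, 7.5), (15, 8.0), (20, 4.0), (20, 5.0), (25, 4.0)]
front = pareto_set(designs)
print(front)
```

Here (15, 8.0) is dropped because (12, 7.5) is both cheaper and faster, and (25, 4.0) is dropped because (20, 4.0) matches its latency at lower cost.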

   DAE2FSM: A Fully Automated Technique for Generating Finite State Machine Abstractions for Digital, Analog, and Mixed-Signal Circuits
Pub ID:  2701 Authors:  Karthik V Aadithya, Chenjie Gu, Jaijeet Roychowdhury
In this project, we consider the problem of automatically generating discrete-time abstractions for digital, analog, and mixed-signal circuits. Specifically, we have developed a tool (called DAE2FSM) that takes as input a transistor level description of a circuit (e.g., a SPICE netlist), and produces as output a Finite State Machine (FSM) abstraction that accurately reflects the circuit dynamics under both ideal and non-ideal operating conditions. DAE2FSM thereby enables highly efficient, symbol-level simulation of hardware modules, which is a key requirement in the design of many digital/analog/mixed-signal sub-systems (e.g., high-speed communication links).
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Fine grained accelerator integration using 3D
Pub ID:  2727 Authors:  Jason Cong, Karthik Gururaj
Customized instructions implemented using reconfigurable functional units have been proposed as a way of improving the performance and energy efficiency of software while minimizing the cost of designing and verifying accelerators from scratch. However, previous work is limited in the way the custom instructions interact with the rest of the processor pipeline: input/output from registers only, with no memory operations from within the custom instruction. In this work, we propose an approach by which custom instructions can launch memory operations directly. We extend the system to a 3D architecture combining FPGAs and CMP to obtain higher savings.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Power-Aware Dynamic Control of Error-Resilience Mechanisms
Pub ID:  2748 Author:  Wenchao Li
Aggressive technology scaling has necessitated the development of techniques to ensure resilience to device faults, including soft errors, circuit wearout, variability, and environmental effects. All error resilience techniques employ some form of redundancy, resulting in added cost such as area or power overhead. Existing selective hardening techniques have been focused on identifying the most vulnerable components and then statically hardening them to produce a resilience to overhead tradeoff. This paper proposes a new technique that can further reduce this overhead for error resilience mechanisms that are controllable. The key idea is to generate control predicates that can turn the resilience mechanisms ON and OFF dynamically and at the right time. These predicates are mined using a 0-1 integer linear optimization formulation. An experimental evaluation shows that the proposed approach provides a systematic way to control error-resilience so as to meet reliability targets under a specified power budget. For example, for a chip multiprocessor router, our approach achieves the same amount of soft error resilience with only half of the power overhead compared with the static hardening approach.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Ultra-low power sensing architectures and platforms for intelligent and scalable biomedical monitoring
Pub ID:  2666 Authors:  Kyong Ho Lee, Mohammed Shoaib, Naveen Verma
Intelligent biomedical devices that are capable of interpreting specific physiological state from patient signals are important in scalable biomedical monitoring. The challenge is that these devices are highly energy-constrained (i.e., 1-10mW for wearable devices, 10-100uW for implantable devices). Machine learning is a powerful tool to model correlations in physiological signals, but model complexity in typical biomedical applications makes detection too energy consumptive.

We propose methods to enable low-power scalable biomedical monitoring which includes a low-power IC design employing embedded machine learning support. We also introduce system-level and application-level techniques to reduce communication cost and burden on clinical resources, and enhance system resiliency for low-power devices.

Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Relyzer: Application Resiliency Analyzer for Transient Faults
Pub ID:  2684 Authors:  Siva Kumar Sastry Hari, Sarita Adve, Helia Naeimi, Pradeep Ramachandran
Future microprocessors need low-cost reliability solutions to enable reliable operations in the presence of failure-prone devices. The state-of-the-art reliability solutions detect the presence of hardware faults by deploying low-cost software-level symptom monitors. Recently, researchers have shown that these detection mechanisms provide high fault coverage with only a few faults going undetected. There is a risk that these undetected faults can result in silent data corruptions (SDCs). The SDC rates demonstrated by the state-of-the-art symptom detection mechanisms have been an impressive <0.5% for permanent and transient hardware faults in all hardware units studied except the data-centric FPU. However, a thorough and accurate analysis is needed to evaluate the SDC rate to make these techniques practically viable. This poster presents Relyzer, an approach that analyzes the fault-free execution of applications and performs smart selective fault injection experiments, as opposed to random fault injections. Relyzer can thus provide a tight bound on the SDC rate. Relyzer first lists all the architecture-level hardware faults that can possibly affect an application. It then employs a set of novel fault pruning techniques to eliminate a large fraction of them by predicting their outcomes and showing them equivalent to others. The hardware faults that remain after the pruning phase are the only ones that need thorough fault injection experiments. Our results show that Relyzer is capable of pruning about 99.995% of hardware faults for the workloads that we studied.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Post-silicon Validation of Weak Memory Consistency
Pub ID:  2704 Authors:  Biruk Mammo, Debapriya Chatterjee, Valeria Bertacco

Functional verification of modern chip multiprocessors (CMPs) is a challenging task because of increasing complexity and shrinking production schedules. The memory subsystem in CMPs is the glue that brings the whole system together. Many subtle yet devastating bugs in the memory subsystems of CMPs are being released into final silicon. This calls for an efficient, high-coverage verification solution for identifying bugs in the memory subsystem of modern CMPs.

Our solution, a post-silicon validation technique that executes directly on prototype hardware and can be disabled before final silicon, targets a set of bugs that cause illegal memory operation ordering in CMPs. The legal ordering of memory operations is specified by a memory consistency model which may dictate order among all memory operations or among some operations and ordering instructions (barriers/fences). Most modern CMPs are designed for weak memory consistency models of the latter type to enable optimizations for higher performance. Our validation solution, when activated, reconfigures a portion of cache storage to log memory accesses and any ordering instructions from the executing program using a compact data-coloring scheme. Periodically, execution of the program is paused and a distributed validation algorithm runs in-situ on the CMP to verify whether the memory operation ordering stored in the logs is correct.

The overheads from our solution are only observed during the post-silicon validation phase. It can be disabled before shipment, releasing all cache storage and leaving only a nominal silicon area footprint.

Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   PROMOTE: PROcess MOnitoring and TEsting of Analog/RF circuits
Pub ID:  2728 Authors:  Shyam Kumar Devarakond, Shreyas Sen, Abhijit Chatterjee, Soumendu Bhattacharya
In this paper, a novel process-specification (cause-effect) monitoring approach that allows the effects of process variations and DUT specification variations for Analog/RF systems to be monitored on a per-chip basis is presented. As opposed to existing techniques that rely only on electrical test data gathered across lots of wafers, a greater degree of process control monitoring can be achieved through the proposed technique. The method relies on the use of alternate diagnostic tests under which the DUT response (alternate diagnostic signature) exhibits strong simultaneous correlation with its specifications as well as critical SPICE-level device parameters. This allows both to be predicted accurately from the DUT response with virtually zero extra test-time or test-hardware cost. A key consequence is the ability to perform cause-effect analysis, relating specification perturbations to device-level anomalies on a per-chip basis to provide essential diagnostics.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Durability and Availability in RAMCloud
Pub ID:  2749 Author:  John Ousterhout
RAMCloud is a DRAM-based storage system that provides inexpensive durability and availability by recovering quickly after crashes, rather than storing replicas in DRAM. RAMCloud scatters backup data across hundreds or thousands of disks, and it harnesses hundreds of servers in parallel to reconstruct lost data. The system uses a log-structured approach for all its data, in DRAM as well as on disk; this provides high performance both during normal operation and during recovery. RAMCloud employs randomized techniques to manage the system in a scalable and decentralized fashion. In a 60-node cluster, RAMCloud recovers 35 GB of data from a failed server in 1.6 seconds. Our measurements suggest that the approach will scale to recover larger memory sizes (64 GB or more) in less time with larger clusters.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   A Systematic Methodology to Develop Resilient Cache Coherence Protocols
Pub ID:  2664 Authors:  Konstantinos Aisopos, Li‑Shiuan Peh
Aggressive transistor scaling continues to increase integration capacity with each new technology node, but technology downscaling also increases the vulnerability of semiconductor devices and causes silicon failures. Thus, fault-tolerant architectures are emerging to guarantee reliable functionality on unreliable silicon. While tolerating faults within a processor core has been extensively researched, the many-core era introduces the challenge of reliable on-chip communication in Chip Multi-Processors (CMPs). In CMP systems, an unreliable interconnection network can lose or corrupt coherence messages, causing the entire chip to deadlock. In this work, we argue for a system-level resiliency solution to tolerate an unreliable underlying Network-on-Chip (NoC). We introduce a systematic methodology to transform a coherence protocol to a resilient one, by extending its Finite State Machine (FSM) with safe states and incorporating additional handshaking messages into transactions. The modified protocol ensures coherent and reliable transactions over any lossy NoC. Our approach is generic and can be applied to a wide range of protocols. It requires minimal hardware modifications and introduces only a slight performance overhead (an average of 0.8% during fault-free operation, and 1.9% even at an aggressive fault rate of one fault per msec).
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   U-QED Tests for Effective Post-Silicon Validation of SoCs
Pub ID:  2685 Authors:  David Lin, Ted Hong, Subhasish Mitra

U-QED, or Uncore Quick Error Detection, is a post-silicon validation technique that significantly reduces error detection latencies of bugs in an SoC’s core and uncore components. Long error detection latency, the time elapsed between the occurrence of an error caused by a bug and its manifestation as a system-level failure, is a major challenge in post-silicon validation because it limits the effectiveness of existing post-silicon bug localization techniques based on trace recording, simulation, and formal analysis. In this work we focus on bugs in both core and uncore components because validation of uncore components, which account for a significant portion of modern SoCs, is difficult due to issues such as multiple clock domains and difficulties in observing internal interfaces. U-QED consists of a set of systematic and automatable software-only transformations that transform existing validation tests into U-QED validation tests. The U-QED validation tests contain a number of code blocks that significantly reduce the error detection latency and improve coverage. Extensive simulation results on an OpenSPARC T2-like SoC show that U-QED validation tests significantly reduce the error detection latencies of logic bugs in both core and uncore components of SoCs. Results also show that U-QED validation tests significantly improve coverage by detecting bugs that escaped the original validation tests.

Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Formalizing and Demystifying the PowerPC Memory Model
Pub ID:  2708 Authors:  Sela Mador‑Haim, Rajeev Alur, Milo Martin
PowerPC's memory consistency model is the most complex hardware-level memory model to date. Lack of store atomicity, relaxed coherence, and many subtle rules for local ordering make the formalization of a precise yet understandable specification a challenge. Recently, a complex operational model for PowerPC has been published. We present a simpler, more abstract axiomatic model that is observationally equivalent to the published operational model.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Dynamic Parallelization of JavaScript Applications
Pub ID:  2729 Authors:  Janghaeng Lee, Mojtaba Mehrara, Scott Mahlke
As the web becomes the platform of choice for execution of more complex applications, a growing portion of computation is handed off by developers to the client side to reduce network traffic and improve application responsiveness. Therefore, the client-side component, often written in JavaScript, is becoming larger and more compute-intensive, increasing the demand for high performance JavaScript execution. This has led to many recent efforts to improve the performance of JavaScript engines in the web browsers. Furthermore, considering the wide-spread deployment of multi-cores in today’s computing systems, exploiting parallelism in these applications is a promising approach to meet their performance requirement. However, JavaScript has traditionally been treated as a sequential language with no support for multithreading, limiting its potential to make use of the extra computing power in multicore systems. In this work, to exploit hardware concurrency while retaining traditional sequential programming model, we develop ParaScript, an automatic runtime parallelization system for JavaScript applications on the client’s browser. First, we propose an optimistic runtime scheme for identifying parallelizable regions, generating the parallel code on-the-fly, and speculatively executing it. Second, we introduce an ultra-lightweight software speculation mechanism to manage parallel execution. This speculation engine consists of a selective checkpointing scheme and a novel runtime dependence detection mechanism based on reference counting and range-based array conflict detection. Our system is able to achieve an average of 2.18x speedup over the Firefox browser using 8 threads on commodity multi-core systems, while performing all required analyses and conflict detection dynamically at runtime.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review
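
The range-based array conflict detection mentioned in the ParaScript abstract can be illustrated with a minimal sketch. It assumes that each speculative thread summarizes its reads and writes to an array as a single [lo, hi] index range, and that a conflict is flagged whenever one thread's write range overlaps another thread's read or write range. The abstract does not give these details, so the class and function names below are hypothetical, and the code is plain Python rather than the JavaScript runtime the authors describe.

```python
class AccessRange:
    """Smallest index interval covering every recorded array access."""

    def __init__(self):
        self.lo = None
        self.hi = None

    def record(self, index):
        # Widen the interval so that it includes this index.
        if self.lo is None:
            self.lo = self.hi = index
        else:
            self.lo = min(self.lo, index)
            self.hi = max(self.hi, index)

    def overlaps(self, other):
        # An empty range (no accesses recorded) never overlaps anything.
        if self.lo is None or other.lo is None:
            return False
        return self.lo <= other.hi and other.lo <= self.hi


class ThreadLog:
    """Per-thread read and write ranges for one shared array (hypothetical)."""

    def __init__(self):
        self.reads = AccessRange()
        self.writes = AccessRange()


def conflicts(a, b):
    # Write/write, write/read, and read/write overlaps count as conflicts;
    # read/read overlap is harmless.
    return (a.writes.overlaps(b.writes)
            or a.writes.overlaps(b.reads)
            or a.reads.overlaps(b.writes))


def run_chunk(log, start, stop, stride_read=0):
    # Toy loop body: iteration i writes element i and reads element i + stride_read.
    for i in range(start, stop):
        log.reads.record(i + stride_read)
        log.writes.record(i)
```

Range summaries of this kind are conservative: they can report a conflict for accesses that never touch the same element, but the bookkeeping per array stays constant-size.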

   SAT-based Post-silicon Fault Localization
Pub ID:  2750 Authors:  Sharad Malik, Shucheng Zhu, Georg Weissenbacher
The localisation of faults in integrated circuits is a challenging problem and a dominating factor in the overall verification effort. Electrical bugs, in particular, surface only in the fabricated prototypes, leading to behaviour deviating from the golden model. Limited observability complicates their localisation: Logging mechanisms such as trace buffers allow us to retain only a limited execution history. A symbolic analysis of the RTL design can find discrepancies between the values recorded in the trace buffer and the intended behaviour. Contemporary MAX-SAT solvers are then able to identify a maximal subset of the RTL design that is consistent with the observed behaviour. The elements in the complement of this subset represent potential locations of the fault. The scalability of contemporary decision procedures dictates the size of a window of execution cycles which we can analyse using symbolic techniques. Current MAX-SAT-based fault localisation techniques require this window to span the fault as well as the error it causes. To address the scalability issues resulting from large window sizes, we propose to slide a smaller window along the temporal axis, constraining it with the information recorded in the trace buffer for the respective execution cycles. In this scenario, the localisation attempt may fail: The limited information provided by the trace buffer may be insufficient to pin down the exact temporal and spatial location of the fault. We propose to use backbones to identify information that can be propagated across sliding windows. The backbone of a symbolic representation of a circuit is the set of signals that are immutable under the given constraints (e.g., the output and trace buffer values). This additional information has several benefits: Firstly, it may be instrumental in locating the fault. Secondly, it may enable a reduction of the size of the trace buffers and the sliding window.
Our preliminary experimental results demonstrate that the use of backbones allows us to reduce the size of the sliding windows or the trace buffer.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review
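
The backbone notion used in this abstract can be made concrete with a small sketch. It assumes the circuit constraints are given as a CNF formula in DIMACS-style integer literals and finds the backbone by enumerating all assignments, which is only feasible for toy instances. The function name and the example formula are illustrative assumptions, not the authors' tool, which would rely on a SAT solver.

```python
from itertools import product


def backbone(clauses, num_vars):
    """Return {variable: value} for variables fixed in every satisfying assignment.

    clauses: list of clauses, each a list of non-zero ints
             (k means variable k is true, -k means it is false).
    Returns an empty dict if the formula is unsatisfiable.
    """
    fixed = None  # None until the first satisfying assignment is seen
    for bits in product([False, True], repeat=num_vars):
        # A clause holds if at least one literal agrees with the assignment.
        satisfied = all(
            any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
            for clause in clauses
        )
        if not satisfied:
            continue
        if fixed is None:
            fixed = {v + 1: bits[v] for v in range(num_vars)}
        else:
            # Keep only the variables that agree with this model as well.
            fixed = {v: val for v, val in fixed.items() if bits[v - 1] == val}
    return {} if fixed is None else fixed


# x1 is forced true, which forces x2 true; x3 remains free.
example = [[1], [-1, 2], [2, 3]]
print(backbone(example, 3))
```

In the sliding-window setting described above, the fixed values found for one window are the kind of information that could be carried over as constraints on the next window.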

   Stochastic Sensor Network-on-a-Chip
Pub ID:  2669 Authors:  Eric Kim, Dan Baker, Sriram Narayanan, Naresh Shanbhag, Douglas L. Jones
We present a 256-tap PN code acquisition filter in a 180nm CMOS process employing a stochastic sensor network-on-a-chip (SSNOC). Under voltage overscaling (VOS), a near-constant detection probability (Pdet) above 90% with a 5.8X reduction in energy is achieved at a supply voltage 27% below the point of first failure (PoFF) with an error rate (pe) of 0.868. This is an improvement of 5.8X in energy efficiency over conventional error-free designs, and of 3.79X in energy efficiency and 2170X in error tolerance over existing error-tolerant designs.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Bio-Inspired Sensory Signal Processing
Pub ID:  2686 Authors:  Ping‑Chen Huang, Jan Rabaey
Signal processing tasks such as classification or recognition may benefit from implementation strategies inspired by biological sensory pathways. In this project, we explore the biological system from both the top-down and bottom-up perspectives. From the top down, we explore the functional models in computational neuroscience and seek architectures that allow the use of low-power and low-precision computational units. From the bottom up, we investigate the strategies that the sensory systems have used to seamlessly interact with the analog inputs and perform the computation in an asynchronous way. In this poster, an odor recognition architecture is proposed based on the olfactory sensory pathway. This design seamlessly interacts with the analog sensor responses and performs distributed computation with an overcomplete number of low-power, low-precision analog components, leading to energy efficiency and resiliency.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Variation Tolerant Optical NoC
Pub ID:  2707 Authors:  Peter Lisherness, Ming Gao, Kwang‑Ting Cheng, Yan Zheng, Jock Bovington
This poster describes a series of techniques to overcome extreme optical device variation sensitivity and enable on-chip optical interconnect.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   End-to-End Error Correction and Online Diagnosis for On-Chip Networks
Pub ID:  2733 Authors:  Amirali Ghofrani, Ritesh Parikh, Saeed Shamshiri, Andrew DeOrio, Kwang‑Ting Cheng, Valeria Bertacco
We propose a comprehensive solution for end-to-end (e2e) error correction and online defect diagnosis for on-chip networks. For e2e error correction, we propose an interleaved error-locality-aware code that efficiently corrects both random and burst errors. We demonstrate that for 64-bit wide network links, interleaving four of the proposed code, 2G4L(26,16), each of which supports 16-bit data, can correct as many as two random errors or 16 adjacent errors. In order to maintain the error correction capability of the Error Correcting Code (ECC) for transient and intermittent errors, we further propose an e2e data gathering and online diagnosis approach that locates the defective wires and replaces them with the spare wires embedded in the network. Moreover, packet/flit-counting techniques are used to provide high-resolution control-logic defect diagnosis. Our analytical and experimental studies show that under heavy noise, high escape rate, uncertainty about routing, and many other harmful effects, the diagnostic data collected by the proposed approach are accurate enough for the purpose of passive diagnosis.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review
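
The benefit of interleaving against burst errors can be seen in a short sketch. It assumes a simple round-robin bit interleaving of four 26-bit codewords across adjacent wires; the abstract does not state the authors' actual wire mapping or the internals of the 2G4L(26,16) code, so the functions below are hypothetical and only count how a burst is distributed.

```python
def interleave(codewords):
    """Round-robin bit interleaving: wire i carries bit i // n of codeword i % n."""
    n = len(codewords)
    length = len(codewords[0])
    return [codewords[i % n][i // n] for i in range(n * length)]


def deinterleave(wires, n):
    """Inverse of interleave: codeword k is every n-th wire starting at wire k."""
    return [wires[k::n] for k in range(n)]


def errors_per_codeword(burst_start, burst_len, n):
    """Count how many bits of each codeword a burst of adjacent wire errors hits."""
    counts = [0] * n
    for wire in range(burst_start, burst_start + burst_len):
        counts[wire % n] += 1
    return counts


# Four 26-bit codewords with arbitrary bit patterns (illustrative data only).
codewords = [[(k + j) % 2 for j in range(26)] for k in range(4)]
wires = interleave(codewords)
assert deinterleave(wires, 4) == codewords

# A burst of 16 adjacent wire errors touches each codeword in exactly 4 bits.
print(errors_per_codeword(37, 16, 4))
```

Under this assumed mapping, a burst on 16 adjacent wires is split into four adjacent-bit errors per codeword, so each constituent code only has to handle a much shorter burst.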

   Relyzer: Application Resiliency Analyzer for Transient Faults
Pub ID:  2751 Authors:  Siva Kumar Sastry Hari, Sarita Adve, Helia Naeimi, Pradeep Ramachandran
Future microprocessors need low-cost solutions for reliable operation in the presence of failure-prone devices. A promising approach is to detect hardware faults by deploying low cost monitors of software-level symptoms of such faults. Recently, researchers have shown these mechanisms work well, but there remains an uncomfortably non-negligible risk that several faults remain undetected and result in silent data corruptions (SDCs). Further, most prior evaluations of symptom-based detectors are based on fault injection campaigns for application benchmarks, where each run simulates the impact of a fault injected at a hardware site at a certain point in the application’s execution. Since the total number of such faults is prohibitive (trillions), it is not feasible to study all possible faults. Previous work therefore typically studies a randomly selected sample of faults. However, these sampling methods, especially for application sites, have not been validated. These mechanisms also do not provide feedback on the portions of the application that remain vulnerable to SDCs so they could be protected through other means if needed. This talk presents Relyzer, an approach that systematically analyzes all application fault sites and carefully picks a small subset to perform selective fault injections. Relyzer employs novel fault pruning techniques that prune faults by either predicting their outcomes or showing them equivalent to others.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   ERSA: Error Resilient System Architecture For Probabilistic Applications
Pub ID:  2680 Authors:  Hyungmin Cho, Subhasish Mitra
There is growing concern about the increasing vulnerability of future computing systems to errors in the underlying hardware. Traditional redundancy techniques are expensive for designing energy-efficient systems that are resilient to high error rates. We present Error Resilient System Architecture (ERSA), a robust system architecture capable of ensuring high degrees of resilience at low costs for emerging killer probabilistic applications such as Recognition, Mining and Synthesis (RMS) applications. Using the concept of configurable reliability, ERSA may also be adapted for general-purpose applications that are less resilient to errors (but at higher costs). While resilience of RMS applications to errors in low-order bits of data is well known, execution of such applications on error-prone hardware significantly degrades output quality (due to high-order bit errors and crashes). ERSA achieves high error resilience to high-order bit errors and control flow errors (in addition to low-order bit errors) using a judicious combination of the following key ideas: (1) asymmetric reliability in many-core architectures, (2) error resilient algorithms at the core of probabilistic applications, and (3) intelligent software optimizations. Error injection experiments on a multi-core ERSA hardware prototype demonstrate that, even at very high error rates, ERSA maintains 90% or better accuracy of output results, together with minimal impact on execution time. In addition, we demonstrate the effectiveness of ERSA in tolerating high rates of static memory errors that are characteristic of emerging challenges such as Vccmin problems and erratic bit errors.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   PROMOTE: PROcess MOnitoring and TEsting of Analog/RF circuits
Pub ID:  2734 Authors:  Shyam Kumar Devarakond , Shreyas Sen, Bhattacharya Soumendu, Abhijit Chatterjee
A novel process-specification (cause-effect) monitoring approach that allows the effects of process variations and DUT specification variations for analog/RF systems to be monitored on a per-chip basis is presented. As opposed to existing techniques that rely only on electrical test data gathered across lots of wafers, a greater degree of process control monitoring can be achieved through the proposed technique. The method relies on the use of alternate diagnostic tests under which the DUT response (alternate diagnostic signature) exhibits strong simultaneous correlation with its specifications as well as critical SPICE-level device parameters. This allows both the specifications and the critical SPICE-level device parameters of the analog/RF circuits to be predicted accurately from the DUT response with lower test time and test-hardware cost compared to standard testing techniques. A key consequence is the ability to perform cause-effect analysis, relating specification perturbations to device-level anomalies on a per-chip basis to provide essential diagnostics.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   TCP/IP Protocol Stack in PARSEC 3.0
Pub ID:  2715 Authors:  Yungang Bao, Kai Li
Despite the popularity of network workloads, there is a lack of studies that systematically benchmark such workloads from the bottom TCP/IP stack layer to the top application layer. This poster presents a new framework introduced into PARSEC 3.0 for benchmarking network workloads. The framework integrates a user-level TCP/IP stack (u-TCP/IP), which is extracted from the FreeBSD kernel and behaves similarly to the original in-kernel stack, into applications for evaluating network workloads. The framework can provide characteristics of both the bottom TCP/IP stack and the top applications. We apply this framework to some existing PARSEC workloads (e.g., dedup) and show the impact of the network stack on applications. Experimental results show that the TCP/IP stack can have a significant impact on the architectural characteristics of the whole system, such as instructions and cache misses, and that the network stack should be considered for inclusion in future network workloads for computer architecture studies. The network benchmark framework of PARSEC 3.0 provides a convenient platform for the architecture community to systematically investigate the behavior of emerging network workloads.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Copperhead: Data Parallel Python
Pub ID:  2735 Authors:  Bryan Catanzaro, Michael Garland, Patrick Li, Kurt Keutzer
Modern parallel microprocessors deliver high performance on applications that expose substantial fine-grained data parallelism. Although data parallelism is widely available in many computations, implementing data parallel algorithms in low-level languages is often an unnecessarily difficult task. The characteristics of parallel microprocessors and the limitations of current programming methodologies motivate our design of Copperhead, a high-level data parallel language embedded in Python. The Copperhead programmer describes parallel computations via composition of familiar data parallel primitives supporting both flat and nested data parallel computation on arrays of data. Copperhead programs are expressed in a subset of the widely used Python programming language and interoperate with standard Python modules, including libraries for numeric computation, data visualization, and analysis. Copperhead includes a number of compiler and runtime features that enable it to efficiently execute data parallel code. We define the restricted subset of Python which Copperhead supports and introduce the program analysis techniques necessary for compiling Copperhead code into efficient low-level implementations. We demonstrate the effectiveness of our techniques with several examples targeting the CUDA platform for parallel programming on GPUs. Copperhead code is concise, on average requiring 3.6 times fewer lines of code than CUDA, and the compiler generates efficient code, yielding 45-100% of the performance of hand-crafted, well optimized CUDA code.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review
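
The style of programming described in the Copperhead abstract, composing familiar data parallel primitives over arrays, can be sketched in ordinary Python. The functions below run sequentially in the standard interpreter and do not use the Copperhead API or compiler; they only illustrate, under that assumption, the flat and nested map/reduce patterns that such a system would compile to parallel code.

```python
from functools import reduce
from operator import add


def axpy(a, x, y):
    # Flat data parallelism: an elementwise map over two arrays,
    # where every output element is independent of the others.
    return [a * xi + yi for xi, yi in zip(x, y)]


def dot(x, y):
    # A map (elementwise product) followed by a reduction (sum).
    return reduce(add, map(lambda xi, yi: xi * yi, x, y), 0)


def spmv_rows(rows, x):
    # Nested data parallelism: an outer map over rows and an inner
    # map/reduce within each row. Each row is a list of
    # (column_index, value) pairs for its non-zero entries.
    return [reduce(add, [v * x[j] for j, v in row], 0) for row in rows]


print(axpy(2, [1, 2, 3], [10, 20, 30]))
print(dot([1, 2, 3], [4, 5, 6]))
print(spmv_rows([[(0, 2)], [(0, 1), (1, 3)], []], [1, 2]))
```

Because none of these functions mutate shared state, each element of the result could in principle be computed by a separate thread, which is the property a data parallel compiler exploits.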

   Bug Positioning System
Pub ID:  2752 Authors:  Valeria Bertacco, Andrew DeOrio, Daya Shanker Khudia
Presentation on a novel post-silicon bug diagnosis technique to detect bugs that manifest inconsistently over multiple executions.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Functional Correctness for CMP Interconnects
Pub ID:  2671 Authors:  Rawan Abdel‑Khalek, Ritesh Parikh, Andrew DeOrio, Valeria Bertacco
As transistor counts continue to scale, modern designs are transitioning towards large chip multi-processors (CMPs). In order to match the advancing performance of CMPs, on-chip interconnects are becoming increasingly complex, commonly deploying advanced network-on-chip (NoC) structures. Ensuring the correct operation of these system-level infrastructures has become increasingly problematic and, in order to avoid the potential for functional design errors manifesting into the final product, there is a need for mechanisms to safeguard communication integrity at runtime. In this paper, we propose SafeNoC, an end-to-end error detection and recovery solution to ensure the functional correctness of CMP interconnects. SafeNoC augments the existing interconnect with a simple, lightweight checker network that is guaranteed to deliver messages correctly. For each data message sent over the primary NoC, a look-ahead signature is transmitted over the checker network and is used to detect errors in the corresponding data message. If a functional communication bug is detected, a novel recovery algorithm reconstructs the data that was in flight at the time of the error occurrence, ensuring that it reaches the intended destination. In our experiments, we found that SafeNoC can recover from a wide variety of errors, with almost no performance impact in the absence of errors. A lightweight solution, SafeNoC occupies a 2.41% area overhead in a 64-core CMP, 7x smaller than common retransmission-based approaches.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Stochastic Computing on Error Resilient System Architecture (ERSA) Platforms
Pub ID:  2688 Authors:  Rami Abdallah, Hyungmin Cho, Subhasish Mitra, Naresh Shanbhag
Error resiliency has emerged as a promising design approach in the increasingly unreliable nanometer-scale processes. Stochastic computation exploits the statistical performance metrics of emerging DSP-heavy applications and matches them to the statistical nature of the underlying hardware based on the principles of estimation and detection theory. The application specificity of stochastic computing has limited its robustness and energy benefits to correcting data path errors in ASICs. On the other hand, robust techniques for general-purpose/programmable computing, such as Error Resilient System Architecture (ERSA), have resorted to heuristics to achieve efficient error detection and correction. In this work, we map stochastic computing techniques onto ERSA platforms to overcome the application specificity of stochastic computation and provide ERSA with a systematic approach for error filtering to increase the overall system reliability. We demonstrate the robustness benefits of ERSA-mapped stochastic computing techniques, in particular algorithmic noise tolerance (ANT), in the design of a discrete cosine transform (DCT) codec running on a BEE3 FPGA. Results show that ERSA-mapped ANT can achieve a peak signal-to-noise ratio (PSNR) greater than 36 dB for high error rate injection (20 errors/flip-flop/sec) while conventional systems fail catastrophically at very low error rates (< 4 errors/flip-flop/sec) due to uncorrected control and memory errors. This work is jointly done by researchers at University of Illinois and Stanford University and is accompanied by a demo to demonstrate the effectiveness of the proposed approach in an image acquisition application running on a BEE3 FPGA.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Algorithmic Techniques for Fault Tolerance for Sparse Linear Algebra
Pub ID:  2710 Authors:  Joseph Sloan, Rakesh Kumar, Greg Bronevetsky
The increasing size and complexity of High-Performance Computing systems is making it increasingly likely that individual circuits will produce erroneous results, especially when operated in a low-energy mode. Previous techniques for Algorithm-Based Fault Tolerance (ABFT) have been proposed for detecting errors in dense linear operations, but have high overhead in the context of sparse problems. In this paper, we propose a set of algorithmic techniques that minimize the overhead of fault detection for sparse problems. The techniques are based on two insights. First, many sparse problems are well structured (e.g., diagonal, banded diagonal, block diagonal), which allows for sampling techniques to produce good approximations of the checks used for fault detection. These approximate checks may be acceptable for many sparse linear algebra applications. Second, many linear applications have enough reuse that preconditioning techniques can be used to make these applications more amenable to algorithmic checks. The proposed techniques are shown to yield up to a 2x performance improvement over traditional ABFT checks. A case study using common linear solvers further illustrates the benefits of the proposed algorithmic techniques.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review
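
For context, the kind of traditional ABFT check that this abstract takes as its baseline can be sketched for a matrix-vector product. The sketch assumes the classic column-checksum identity, sum(A x) = (1^T A) x, on a small dense matrix with integer entries; it is not the sampled or preconditioned check proposed by the authors, and the function names are illustrative.

```python
def matvec(A, x):
    """Plain matrix-vector product y = A x for a list-of-rows matrix."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def column_checksum(A):
    """Checksum row c = 1^T A, i.e. the sum of each column of A."""
    return [sum(col) for col in zip(*A)]


def abft_check(A, x, y, tol=1e-9):
    """Return True if y is consistent with A x under the checksum identity.

    The identity sum(y) == c . x holds for a fault-free product, so a
    mismatch signals that some entry of y was corrupted.
    """
    expected = sum(c * xi for c, xi in zip(column_checksum(A), x))
    return abs(sum(y) - expected) <= tol


A = [[2, 0, 0],
     [1, 3, 0],
     [0, 1, 4]]
x = [1, 2, 3]

y = matvec(A, x)
print(abft_check(A, x, y))  # fault-free result passes the check

y_faulty = list(y)
y_faulty[1] += 1  # inject a single erroneous entry
print(abft_check(A, x, y_faulty))
```

The checksum row can be computed once and reused across many products with the same matrix, which is why the cost of the check relative to the product matters most when the matrix is sparse.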

   Portability and Performance of OpenCL Kernels on GPU and CPU Platforms
Pub ID:  2736 Authors:  Michael Anderson, Bor‑Yiing Su, Kurt Keutzer
This research focuses on understanding the portability and performance of OpenCL on various GPU and CPU platforms. The analysis includes four case studies: matrix-matrix multiplication, sparse matrix-vector multiplication, a flattened data-parallel kernel, and the conjugate gradient solver for large displacement optical flow. The portability studies were conducted by compiling the same code on different platforms. The performance studies were conducted by comparing the performance of OpenCL with that of the native programming languages on the platforms.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Enabling Advanced Inference in Small-scale Sensors: resilient devices for analyzing physiological signals
Pub ID:  2753 Author:  Naveen Verma
Presentation of work in Application Drivers Theme, Task 5.1.2.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   CrashTest'ing SWAT: Accurate, Gate-Level Evaluation of Symptom-Based Resiliency Solutions
Pub ID:  2672 Authors:  Andrea Pellegrini, Robert Smolinski, Lei Chen, Xin Fu, Siva Kumar Sastry Hari, Junhao Jiang, Sarita Adve, Todd Austin, Valeria Bertacco
Current technology scaling is leading to increasingly fragile components, making hardware reliability a primary design consideration. Recently researchers have proposed low-cost reliability solutions that detect hardware faults through software-level symptom monitoring. SWAT (SoftWare Anomaly Treatment), one such solution, demonstrated through microarchitecture-level simulations that it can provide high fault coverage and low Silent Data Corruption (SDC) rate. However, more accurate evaluations of SWAT are needed to test its capability to tackle hardware faults on industry-strength processors running realistic workloads on top of commercial operating systems.

In this poster, we evaluate an FPGA-based analysis platform capable of testing a complete computer system against detailed fault model accuracy to verify symptom-based fault detection schemes. With this platform, we performed a gate-level accurate fault campaign of 51,630 fault injections across the processor core logic of the OpenSPARC T1 design and five SPECInt 2000 benchmarks. With a conservative overall SDC rate of 0.76%, our results are comparable to previous microarchitecture-level evaluations of SWAT, demonstrating the effectiveness of symptom-based software detectors for permanent faults in most microprocessor components.

Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   ParaFin: A Framework for Parameterized Model Checking of Fine Grained Concurrency
Pub ID:  2689 Authors:  Divjyot Sethi, Murali Talupur, Sharad Malik

We show how the correctness of fine grained concurrent data structures, with a fixed number of elements but an unbounded number of threads accessing them, can be established in a highly automated fashion. Though the technique we use, called the CMP method, was used primarily for message passing protocols, the underlying principles carry over to shared memory systems as well. The CMP method is semi-automatic as it requires user guidance in the form of lemmas. We show how for concurrent data structures most of these lemmas can be automatically generated using off-the-shelf invariant generators. Thus, in contrast to static analysis based on separation logic, our method places very little proof-burden on the user and does not require the user to understand complex logics. Further, in contrast to other model checking based works, which proved correctness of concurrent data structures for a fixed number of threads, we establish correctness for any number of threads. We verified several challenging list based concurrent set data structures, and also found a known bug in one of the data structures. These preliminary results demonstrate the efficacy of our method and establish it as a promising technique for verifying concurrent data structures.

Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Programming Interfaces for Stochastic Computing
Pub ID:  2711 Authors:  Yavuz Yetim, Sharad Malik, Margaret Martonosi
New fabrics have higher error rates due to smaller transistor sizes and increased process variations. Furthermore, if we operate a circuit at a frequency/voltage setting for which it was not designed, timing errors start to occur and degrade application output. In our work, we define a computation stack that starts from hardware and ends at the application-level metrics, and study how errors propagate through these layers. Moreover, we define different levels of accuracy criteria for the control flow and memory accesses, and see the effects of these levels on a representative set of applications from the StreamIt benchmarks. Using this framework, the overall system can be optimized for better utilization of the faulty fabric.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Variability at Low Voltages: SRAM, Error Tolerance
Pub ID:  2737 Authors:  Dennis Sylvester, David Blaauw, Dongsuk Jeon, Mingoo Seok, Gyouho Kim, David Fick
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Bio-Inspired Sensory Signal Processing
Pub ID:  2754 Authors:  Ping‑Chen Huang, Jan Rabaey
Signal processing tasks such as classification or recognition may benefit from implementation strategies inspired by biological sensory pathways. In this project, we explore the biological system from both the top-down and bottom-up perspectives. From the top down, we explore the functional models in computational neuroscience and seek architectures that allow the use of low-power and low-precision computational units. From the bottom up, we investigate the strategies that the sensory systems have used to seamlessly interact with the analog inputs and perform the computation in an asynchronous way. In this talk, an odor recognition architecture is proposed based on the olfactory sensory pathway. This design seamlessly interacts with the analog sensor responses and performs distributed computation with an overcomplete number of low-power, low-precision analog components, leading to energy efficiency and resiliency.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Error Modeling of Electrical Bugs for Post-Silicon Validation
Pub ID:  2673 Authors:  Ming Gao, Peter Lisherness, Kwang‑Ting Cheng
Post-silicon validation for electrical bugs becomes both expensive and challenging with increasing variability and shrinking noise margins. High quality validation resources are needed to address this problem. Currently, the development and selection of validation tests and DfD structures primarily rely on intuition and the brute force of voluminous test stimuli either generated randomly or derived from applications. Although these validation resources can be effective, error models for coverage measure are needed to evaluate the sufficiency of the test suites, to quantify effectiveness of DfD alternatives, and to complement the potential biases of intuition. Effective error models for post-silicon validation have specific criteria: they must be efficiently computable with functional tests, must be sufficiently aware of bug activation conditions, and must account for the limited error observability in silicon. However, no prior error models meet all these goals. While the errors induced by electrical bugs ultimately manifest as an excessive delay, delay faults are not suitable bug models for validation tests. An important reason is that checking error detections at system observability points requires functional sequential fault simulations which are unaffordable. On the other hand, the Random Bit-Flip (RBF) error model was commonly employed for observability evaluation but it can hardly provide any meaningful coverage measure without taking any electrical bug activation conditions into account. Therefore, we introduce COBE, a COnstrained Bit-flip Error model that combines the accuracy of bug activation conditions extracted from low-level circuit model with the efficiency of error observability evaluation at high-level. Experimental results using an Alpha 21264 processor implementation and the OpenRISC SoC design demonstrate that COBE model correlates with electrical bugs significantly stronger than RBF models (correlation factors 0.921 vs. 0.482). 
It also exposes the shortcomings of the COBE model with only transition error constraints, highlighting the need and opportunities for improvement. We also propose a "MUX-glitch" error constraint that improves the accuracy of the COBE model by more than six times with negligible simulation overhead.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   System Level Metrics for Analog-to-Digital Converter Design
Pub ID:  2690 Authors:  Andrew Bean, Aolin Xu
The usual figures of merit for analog-to-digital converter (ADC) design are more appropriate to signal reconstruction, and do not directly address the needs of communication systems. It is thus our goal to design new ADC performance metrics that directly account for the incorporation of these converters into communication systems. To this end, we consider two metrics for choosing the timing parameters of time-interleaved ADCs, one based on mutual information and the other based on deflection (generalized signal to noise ratio). We also consider BER as a system level metric for the choice of ADC quantization levels, and further demonstrate the potential for significant performance improvement and power reduction by using these techniques.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Post-Silicon Fault Localisation Using SAT and Backbones
Pub ID:  2712 Authors:  Shucheng Zhu, Georg Weissenbacher, Sharad Malik
The localisation of faults in integrated circuits is a challenging problem and a dominating factor in the overall verification effort. Electrical bugs, in particular, surface only in the fabricated prototypes, leading to behaviour deviating from the golden model. Limited observability complicates their localisation: Logging mechanisms such as trace buffers allow us to retain only a limited execution history. A symbolic analysis of the RTL design can find discrepancies between the values recorded in the trace buffer and the intended behaviour. Contemporary MAX-SAT solvers are then able to identify a maximal subset of the RTL design that is consistent with the observed behaviour. The elements in the complement of this subset represent potential locations of the fault. The scalability of contemporary decision procedures dictates the size of a window of execution cycles which we can analyse using symbolic techniques. Current MAX-SAT-based fault localisation techniques require this window to span the fault as well as the error it causes. To address the scalability issues resulting from large window sizes, we propose to slide a smaller window along the temporal axis, constraining it with the information recorded in the trace buffer for the respective execution cycles. In this scenario, the localisation attempt may fail: The limited information provided by the trace buffer may be insufficient to pin down the exact temporal and spatial location of the fault. We propose to use backbones to identify information that can be propagated across sliding windows. The backbone of a symbolic representation of a circuit is the set of signals that are immutable under the given constraints (e.g., the output and trace buffer values). This additional information has several benefits: Firstly, it may be instrumental in locating the fault. Secondly, it may enable a reduction of the size of the trace buffers and the sliding window.
Our preliminary experimental results demonstrate that the use of backbones allows us to reduce the size of the sliding windows or the trace buffer.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review
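
The backbone notion in the abstract above (signals that are immutable under the given constraints) can be illustrated with a small sketch. This is not the authors' MAX-SAT or solver-based procedure: it is a brute-force enumeration, practical only for a handful of variables, and the names `backbone` and `AND_GATE` as well as the DIMACS-style clause encoding are assumptions made for the example.

```python
from itertools import product

def backbone(num_vars, clauses):
    """Return {var: value} for variables fixed in every satisfying assignment.

    clauses use DIMACS-style literals: k means variable k is True,
    -k means variable k is False (variables are numbered from 1).
    Returns None if the formula is unsatisfiable.
    """
    fixed = None
    for bits in product([False, True], repeat=num_vars):
        satisfied = all(
            any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
            for clause in clauses
        )
        if not satisfied:
            continue
        if fixed is None:
            # First model found: every variable is a backbone candidate.
            fixed = {v: bits[v - 1] for v in range(1, num_vars + 1)}
        else:
            # Keep only candidates that agree with this model too.
            fixed = {v: val for v, val in fixed.items() if bits[v - 1] == val}
    return fixed

# Hypothetical circuit: out = a AND b, with a=1, b=2, out=3 (Tseitin encoding).
AND_GATE = [[-3, 1], [-3, 2], [3, -1, -2]]

# Observing out = True pins down both inputs; observing out = False pins
# down only the output itself.
print(backbone(3, AND_GATE + [[3]]))
print(backbone(3, AND_GATE + [[-3]]))
```

In the sliding-window setting described in the abstract, the values returned for one window would be the candidates for propagation into the next window; how that propagation is done is not specified here.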

   Abstraction-Based NoC Performance Analysis
Pub ID:  2738 Authors:  Daniel E. Holcomb, Bryan A. Brady, Sanjit A. Seshia
We present an approach to formally verify quality-of-service (QoS) properties of network-on-chip (NoC) designs. To tackle industrial-scale designs, we adopt an abstraction-based approach, where only the nodes of interest in the network are precisely modeled and the rest of the network is abstracted away as sources and sinks of traffic. We give an automatic technique to infer a traffic model, comprising formal models of sources and sinks, from simulation traces derived from software benchmarks. Experimental results demonstrate that our abstraction-based approach can efficiently and accurately verify industrial-scale NoC designs.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Resilient Coherence in Many-Core CMPs
Pub ID:  2756 Authors:  Konstantinos Aisopos, Valeria Bertacco, Li‑Shiuan Peh, Andrew DeOrio
The talk introduces two novel techniques to address the recovery of data within correct protocol specifications in interconnect networks for CMPs facing transistor failures.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   A Case for Locality-Aware Task Management on Many-Core Processors
Pub ID:  2641 Authors:  Richard Yoo, Christopher Hughes, Changkyu Kim, Yen‑Kuang Chen, Christos Kozyrakis
As more cores are packed onto the same die, applications must be split into more, finer-grained tasks in order to exploit the available parallelism. However, fine-grained tasks are more sensitive to cache misses due to interference or true communication. The cost of misses is also increasing, as cache hierarchies are becoming deeper and more complex due to the large core counts. Consequently, it is becoming increasingly important to schedule tasks on many-core chips in a locality-aware manner.

This project investigates all aspects of locality-aware management of fine-grained tasks, covering both task scheduling and stealing. We analyze the three key decisions in generating a locality-aware schedule: task grouping, task ordering, and task size, and propose a recursive approach to task scheduling that is generally applicable to any cache hierarchy. Our simulation results on two distinct 32-core systems show an average speedup of 1.60x over a randomized schedule and 1.43x over a published, baseline scheduling scheme. The locality-aware schedule reduces energy consumption by 55% and 47%, relative to the random and the baseline schedule, respectively. The importance of the three decisions is also verified.

We also highlight the importance of locality-aware stealing when the tasks are scheduled in a locality-aware fashion, and we develop a recursive task stealing scheme that preserves the benefits of a locality-aware schedule while load balancing. The proposed stealing scheme shows a speedup for stolen tasks of up to 2.0x over randomized stealing.

Finally, we show how a pattern-based approach could be utilized to reduce scheduling overheads and to improve locality for stealing.

Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Impact of Stochastic Computing on Minimum Energy Operating Point (MEOP)
Pub ID:  2674 Authors:  Rami Abdallah, Naresh Shanbhag
Error resiliency has demonstrated significant robustness and energy benefits in superthreshold performance-constrained applications. In this work, we study the impact of error resiliency, in particular stochastic computation, in subthreshold energy-constrained applications where designs are operated at their minimum-energy operating point (MEOP) and error resiliency is still under-explored. We show that the minimum energy (Emin) at the MEOP in subthreshold designs can be further lowered by employing frequency overscaling (FOS) or voltage overscaling (VOS) and stochastic computation techniques to correct for intermittent timing errors. Simulation results demonstrate a 26% reduction in the Emin of a stochastic computation based filter along with increased robustness to voltage variations. To further verify our results, we design a subthreshold ECG processor IC in IBM 45nm SOI CMOS employing statistical system-level error compensation. Measurement results show a detection probability Pdet ≥ 86% with 28% reduction in Emin at a supply voltage scaled to 15% below its critical value while tolerating a raw hardware error rate of up to 58%. This is an improvement of 17X in Pdet, and 61X in error tolerance. At 15fJ/cycle/k-gate, this IC has 4.5X better energy-efficiency than state of the art and is robust to variations.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Impact Productivity Tools for Developing High-Performance Throughput-Oriented Applications
Pub ID:  2692 Authors:  Nasser Anssari, Isaac Gelado, Christopher Rodrigues, John Stratton, I‑Jui Sung, Wen‑mei Hwu

Heterogeneous systems have become an important building block for modern computing platforms, ranging from supercomputers to mobile devices. New programming languages are being adopted as an interface to a variety of parallel processors. However, writing high-performance parallel code is still too complicated today. The IMPACT group is developing a set of tools to address various aspects of the complexity inherent in developing high-performance parallel applications targeting heterogeneous systems.

Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Phase-Locked Loop Verification by a Bounded Model Checking Approach
Pub ID:  2719 Authors:  Sicun Gao, Ying‑Chih Wang
Recently, more analog designs contain digital parts to ease the design difficulties brought about by process scaling into the nanometer regime. This trend poses challenges for verification because of mixed-signal behaviors. We are investigating the use of hybrid system verification techniques to verify the locking time of a phase-locked loop. Namely, the discrete parts of the design are modeled by an abstracted (epsilon-bisimulated) continuous model to avoid possible state explosion. The abstracted model is then checked with the hybrid system verification tool dReach against bounded model checking properties or invariants.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Coherent 3D Scene Understanding from Images
Pub ID:  2740 Authors:  Sid Ying‑Ze Bao, Jason Clemons, Mohit Bagra, Todd Austin, Silvio Savarese
A one-slide overview of the layout estimation part of the visual sonification system.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Network-Driven Chips
Pub ID:  2757 Authors:  Li‑Shiuan Peh, Tushar Krishna
Network-Driven Chips: Towards the ideal NoC
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Energy Benefits of Power Gating on Memory Misses in Multi-Core Systems
Pub ID:  2647 Authors:  Andrew B. Kahng, Seokhyeong Kang, Tajana Simunic Rosing, Richard Strong
In mobile systems, the problems of short battery life and increased temperature are exacerbated by wasted leakage power. Leakage power waste can be reduced by power-gating a core while it is stalled waiting for a resource. In this work, we propose and model memory miss power gating (MMPG), a low-overhead technique to enable power-gating of an active core when it stalls during a long memory access. We describe a programmable two-stage power gating switch design that can vary a core’s wakeup delay while maintaining voltage noise limits and leakage power savings. We also model the processor power distribution network and the effect of memory miss power gating on neighboring cores. Last, we apply our power gating technique to actual benchmarks, and examine energy savings and overheads from power gating stalled cores during long memory accesses. Our analyses show the potential for over 38% energy savings given “perfect” power gating on memory misses; we achieve energy savings exceeding 20% for a practical, counter-based implementation.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Towards the Ideal On-chip Fabric for 1-to-Many and Many-to-1 Communication
Pub ID:  2675 Authors:  Tushar Krishna, Li‑Shiuan Peh, Bradford M. Beckmann, Steven K. Reinhardt
The prevalence of multicore architectures has accentuated the need for scalable cache coherence solutions. Many of the proposed designs use a mix of 1-to-1, 1-to-many (1-to-M), and many-to-1 (M-to-1) communication to maintain data coherence and consistency. The on-chip network is the communication backbone that needs to handle all these flows efficiently to allow these protocols to scale. However, most research in on-chip networks has focused on optimizing only 1-to-1 traffic. There has been some recent work addressing 1-to-M traffic by proposing the forking of multicast packets within the network at routers, but these techniques incur high packet delays and power penalties. There has been little research in addressing M-to-1 traffic. We propose two in-network techniques, Flow Across Network Over Uncongested Trees (FANOUT) and Flow AggregatioN In-Network (FANIN), which perform efficient 1-to-M forking and M-to-1 aggregation, respectively, such that packets incur only single-cycle delays at most routers along their path, thus approaching an ideal network (one that incurs only wire delay/energy). Full-system simulations on a 64-core CMP with SPLASH-2 and PARSEC benchmarks show that FANOUT and FANIN together reduce runtime by 14.9% and network energy by 40.2%, on average, compared to state-of-the-art networks, operating at just 1% and 9.6% above the runtime and energy of an ideal network.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Integrated Framework Combining Virtual Platform and NoC Synthesis for Heterogeneous Systems-on-Chip
Pub ID:  2693 Authors:  Young Jin Yoon, Nicola Concer, Luca Carloni

Future platform architectures will be empowered by heterogeneous multi-core Systems-on-Chip that integrate an increasing number of processors, specialized accelerators, and memories into a single die. The challenges in SoC design will be primarily in the integration and management of their components: these must be interconnected with a flexible and scalable communication infrastructure and must be activated only when needed since the whole chip must run within very tight power constraints.

We present an integrated framework that enables effective design-space exploration of the communication infrastructure for heterogeneous SoCs by providing fast yet accurate estimation of performance, power and area.

Our framework leverages a time-approximate virtualized platform to execute complex application scenarios running on top of a full Linux environment. In our approach, we first rely on an abstract network to run efficient simulations. This allows us to derive the information necessary to synthesize an optimal network-on-chip that supports the target applications. The resulting NoC is then automatically synthesized at RTL level to obtain accurate power and area analysis. Finally the NoC components are back-annotated with the RTL-synthesis results and the NoC is plugged into the virtualized platform so that the full system can be tested to refine the performance, power, and area analysis.

Our framework is a promising test-bench for the development of advanced communication protocols and supporting network architectures for future heterogeneous SoCs.

Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Surviving Memory and Concurrency Errors
Pub ID:  2720 Authors:  Nitin Gupta, Justin Aquadro, Tongping Liu, Charlie Curtsinger, Emery Berger

Lockdown:
Most applications make extensive use of third party libraries that act as their interface to the outside world (like various image/video/audio codecs, socket library). All of these libraries are convenient and frequent vectors for attack: because they run in the same address space as the application, they have full read and write access to all of the application's memory. Weaknesses in these libraries can allow corruption of the application's stack and heap via buffer overflows, and give an attacker control of the process. We present a runtime system called Lockdown which automatically isolates libraries from the main application and prevents a range of attacks resulting from invalid or unauthorized memory accesses.

Dthreads:
Multithreaded programming is notoriously difficult to get right. A key problem is non-determinism, which complicates debugging, testing, and reproducing errors. One way to simplify multithreaded programming is to enforce deterministic execution, but current deterministic systems for C/C++ are incomplete or impractical. These systems require program modification, do not ensure determinism in the presence of data races, do not work with general-purpose multithreaded programs, or run up to 8.4x slower than pthreads. We present DTHREADS, an efficient deterministic multithreading system for unmodified C/C++ applications that replaces the pthreads library. DTHREADS enforces determinism in the face of data races and deadlocks. DTHREADS works by exploding multithreaded applications into multiple processes, with private, copy-on-write mappings to shared memory. It uses standard virtual memory protection to track writes, and deterministically orders updates by each thread. By separating updates from different threads, DTHREADS has the additional benefit of eliminating false sharing. Experimental results show that DTHREADS substantially outperforms a state-of-the-art deterministic runtime system, and for a majority of the benchmarks evaluated here, matches and occasionally exceeds the performance of pthreads.

Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   GSRC State of the Center
Pub ID:  2741 Author:  Sharad Malik
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Parallel Assertions for Debugging Parallel Programs
Pub ID:  2758 Author:  Daniel Schwartz‑Narbonne
A parallel program must execute correctly even in the presence of unpredictable thread interleavings. This interleaving makes it hard to write correct parallel programs, and also makes it hard to find bugs in incorrect parallel programs. A range of tools have been developed to help debug parallel programs, ranging from atomicity-violation and data-race detectors to model-checkers and theorem provers. One technique that has been successful for debugging sequential programs, but less effective for parallel programs, is running the program using assertion predicates provided by the developer. These assertions allow programmers to specify and check their assumptions. In a multi-threaded program, the programmer's assumptions include both the current state, and any actions (e.g. access to shared memory) that other, parallel executing threads might take. We introduce parallel assertions which allow programmers to express these assumptions for parallel programs using simple and intuitive syntax and semantics. We present a proof-of-concept implementation, and demonstrate its value by testing a number of benchmark programs using parallel assertions.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Predicting Serializability Violations: SMT-based Search vs. DPOR-based Search
Pub ID:  2646 Authors:  Arnab Sinha, Sharad Malik
In our recent work, we addressed the problem of detecting serializability violations in a concurrent program using predictive analysis, where we used a graph-based method to derive a predictive model from a given test execution. The exploration of the predictive model to check alternate interleavings of events in the execution was performed explicitly, based on stateless model checking using dynamic partial order reduction (DPOR). Although this was effective on some benchmarks, the explicit enumeration was too expensive on other examples. This motivated us to examine alternatives based on symbolic exploration using SMT solvers. In this work, we propose an SMT-based encoding for detecting serializability violations in our predictive model. SMT-based encodings for detecting simpler atomicity violations (with two threads and a single variable) have been used before, but to our knowledge, our work is the first to use them for serializability violations with any number of threads and variables. We also describe details of our DPOR-based explicit search and pruning, and present an experimental evaluation comparing the two search techniques. This provides some insight into the characteristics of the instances when one of these is superior to the other. These characteristics can then be used to predict the preferred technique for a given instance.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Contracts for Correct Composition and System-Level Design of Analog and Mixed-Signal Circuits
Pub ID:  2676 Authors:  Pierluigi Nuzzo, Alberto Puggelli, Alberto Sangiovanni‑Vincentelli
We explore techniques to make analog and mixed-signal circuit abstractions more reliable for system-level design. We develop analog contracts to address the problem of assembling integrated systems out of pre-designed components. Horizontal contracts encode correct composition rules, while vertical contracts define under which conditions an aggregation of components satisfies the requirements posed at a higher level of abstraction. Both contracts enable “valid” system integration, by limiting the design exploration to the regions in which compositions are “legal” and specifications are met. We demonstrate the effectiveness of our approach on the design of an ultra-wide band receiver used in an Intelligent Tire system, an on-vehicle wireless sensor network for active safety applications.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Improving Cache Performance Using Victim Tag Stores
Pub ID:  2694 Authors:  Vivek Seshadri, Onur Mutlu
With increasing pressure on memory bandwidth, there have been a number of proposals that improve the cache replacement policy. These mechanisms monitor the cache blocks while they are in the cache and evict blocks that are deemed to have low temporal locality. However, a majority of these mechanisms are agnostic to the temporal locality of a missed block and follow a single insertion policy for all incoming blocks. There is comparatively very little work on mechanisms to distinguish between missed blocks based on their temporal reuse behavior. Prior work has shown that distinguishing missed blocks based on their temporal locality and choosing the insertion policy on a per-block basis can significantly improve performance. To this end, we propose a new, simple hardware mechanism that predicts the temporal locality of a missed block before inserting it into the cache. The key insight behind the prediction scheme is that if a block with good temporal locality gets prematurely evicted from the cache, it will be accessed soon after eviction. To implement this prediction scheme, our mechanism augments the conventional cache with a structure, victim tag store, that keeps track of addresses of blocks evicted from the cache. We provide a practical, low-complexity hardware implementation of our mechanism using Bloom filters. We qualitatively and quantitatively compare our mechanism to five different cache management mechanisms and show that it provides significant performance improvements.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review
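
The victim tag store idea in the abstract above can be sketched as follows: evicted block addresses are recorded in a Bloom filter, and a later miss on a recorded address is taken as a sign that the block was evicted prematurely. This is a minimal sketch under stated assumptions, not the authors' hardware design: the class names, the filter sizes, the use of SHA-256 as a hash, and the policy of clearing the filter once `capacity` evictions have been recorded are all hypothetical choices made for the example.

```python
import hashlib

class BloomFilter:
    """A plain Bloom filter over integer keys (sizes are assumed values)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity
        self.count = 0

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1
        self.count += 1

    def __contains__(self, key):
        return all(self.bits[pos] for pos in self._positions(key))

    def clear(self):
        self.bits = bytearray(self.num_bits)
        self.count = 0


class VictimTagStore:
    """Tracks recently evicted block addresses (hypothetical interface)."""

    def __init__(self, capacity=256):
        self.capacity = capacity
        self.filter = BloomFilter()

    def record_eviction(self, block_addr):
        # Assumption: Bloom filters cannot delete entries, so start over
        # once `capacity` evictions have been recorded.
        if self.filter.count >= self.capacity:
            self.filter.clear()
        self.filter.add(block_addr)

    def predict_reuse(self, block_addr):
        # True means "seen among recent victims": likely good temporal
        # locality. False positives are possible, false negatives are not
        # (until the filter is cleared).
        return block_addr in self.filter


vts = VictimTagStore()
vts.record_eviction(0x40)
print(vts.predict_reuse(0x40))
```

A cache model would call `predict_reuse` on each miss and pick the insertion priority for the incoming block from the result; the exact insertion policies are not given in the abstract and are left out here.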

   Digitally Assisted VaST Driven Post-manufacture Adaptation of Analog / RF Systems
Pub ID:  2721 Authors:  Aritra Banerjee, Shyam Kumar Devarakond , Vishwanath Natarajan, Shreyas Sen, Abhijit Chatterjee
Post-manufacture adaptation of RF systems has become a necessity due to the use of scaled CMOS technologies that make analog/RF devices increasingly susceptible to manufacturing process variations. Tuning involves diagnosis of mixed-signal/RF performance parameters at the system and module levels and adjustment of digital and analog “tuning knobs” designed into the mixed-signal/RF and baseband signal processing modules so that overall system level performances are met with the least impact on total power consumption. Three techniques have been proposed for power-aware post-manufacture tuning of analog/RF systems: (1) model parameter computation based bottom-up tuning, (2) DSP-assisted regression-based tuning, and (3) tuning with the help of on-chip digital circuits.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Alternative Research Theme Overview
Pub ID:  2742 Author:  Naresh Shanbhag
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Running 1000 Threads on a General-Purpose Multi-Core
Pub ID:  2760 Authors:  Daniel Sanchez, Christos Kozyrakis
Scaling chip-multiprocessors (CMPs) to support thousands of threads requires significant innovation across the software-hardware stack. We present a set of software and hardware contributions that tackle these important scalability challenges. First, we enable scalable software by designing runtimes that support rich abstractions for parallelism, heterogeneity, and locality, and perform scheduling dynamically and at fine granularity to avoid load imbalance. Moreover, we introduce flexible hardware support to accelerate fine-grain scheduling, ensuring low scheduler overheads at high thread and core counts. Second, we present a set of techniques that enable scalable coherent cache hierarchies that are highly efficient, provide QoS and are configurable by software. We first design a novel cache array that implements high associativity cheaply and provides analytical guarantees on associativity, and use it to implement a scalable cache partitioning technique (so that thousands of threads can share the cache in a controlled manner, providing QoS guarantees) and scalable cache coherence.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Research on Network-on-Chip Router Modeling
Pub ID:  2648 Author:  Siddhartha Nath

The era of many-core computing demands highly efficient design of Network-on-Chip routers. This in turn requires more accurate, detailed architectural simulators with NoCs. Currently available models for NoC routers, such as Orion2.0, still have gaps in their area and power estimations – e.g., through modeling only the data paths. Our research seeks to improve accuracy of area and power estimation of NoC router simulators. We apply a highly detailed methodology to analyze component blocks (crossbar, switch and virtual channel arbiters, etc.) of academic NoC routers at the gate-netlist and post-layout level, and propose detailed models of data and control paths of each component to provide highly accurate NoC router modeling.

Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Thread Cluster Memory Scheduling
Pub ID:  2695 Authors:  Yoongu Kim, Michael Papamichael, Onur Mutlu, Mor Harchol‑Balter
This paper presents a new memory scheduling algorithm that addresses system throughput and fairness separately with the goal of achieving the best of both. The main idea is to divide threads into two separate clusters and employ different memory request scheduling policies in each cluster.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   VaST: Validation Signature Test
Pub ID:  2722 Authors:  Aritra Banerjee, Shreyas Sen, Shyam Kumar Devarakond , Vishwanath Natarajan, Abhijit Chatterjee
Low cost test and validation of analog/RF systems has become an important problem due to increased process variability effects on the performance of devices and the need to ramp-up yield. In this work we propose a model parameter computation based testing method and genetic algorithm driven test stimulus generation approach which produces a compact, deterministic test signal in such a way that the RF DUT model parameters can be computed directly from the DUT response (called the DUT signature). This is achieved through use of a non-linear solver that adjusts the DUT model parameters iteratively until the model response to the applied test matches the observed DUT test response signature. Here we show: diagnosis of an RF transceiver; testing of a delta-sigma ADC; and post-silicon validation using an optimized signature.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Application Drivers Theme Overview
Pub ID:  2743 Author:  Todd Austin
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Architecture and Synthesis Support for Accelerator-Rich CMPs
Pub ID:  2762 Author:  Jason Cong
Motivation – Why accelerator-rich CMPs (AXR-CMP)?; accelerator management and virtualization; AXR-CMP implementation alternatives; accelerator synthesis and generation; accelerator selection; concluding remarks and future work.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Exploiting the Forgiving Nature of GPU Applications to Reduce Control and Memory Divergence
Pub ID:  2649 Authors:  John Sartori, Rakesh Kumar
Control and memory divergence between threads within the same execution bundle, or warp, have been shown to cause significant performance bottlenecks for GPU applications. In this paper, we observe that many GPU applications produce acceptable outputs even if a small number of threads in a SIMD warp are forced to go down the wrong control path or are forced to load from an incorrect (albeit spatially local) address. We exploit this observation to propose branch and data herding. Branch herding eliminates control divergence by forcing all threads in a warp to take the same control path. Data herding eliminates memory divergence by forcing each thread in a warp to load the same memory block. Our software implementation of branch herding on NVIDIA GeForce GTX 480 improves performance by an average of 13% for a suite of NVIDIA CUDA SDK and Parboil benchmarks. Our hardware implementation of branch herding improves performance by an average of 30%. Data herding improves performance by up to 32% (16%, on average). Output quality degradation is minimal for most applications.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review
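
Branch and data herding, as described in the abstract above, can be modeled in a few lines of host-side Python. The abstract does not say how the common path or the common memory block is chosen, so this sketch assumes a majority vote (ties resolved towards "taken") and assumes that a redirected load keeps its offset within the chosen block. `herd_branch`, `herd_addresses`, and `BLOCK_SIZE` are hypothetical names, and this is a behavioral model, not the authors' software or hardware implementation.

```python
from collections import Counter

BLOCK_SIZE = 128  # bytes per memory block; an assumed value

def herd_branch(predicates):
    """Force every thread in a warp onto the majority branch outcome."""
    taken = sum(1 for p in predicates if p)
    majority = taken * 2 >= len(predicates)  # assumption: ties count as taken
    return [majority] * len(predicates)

def herd_addresses(addresses, block_size=BLOCK_SIZE):
    """Redirect every load in a warp into the most commonly accessed block."""
    blocks = [addr // block_size for addr in addresses]
    target = Counter(blocks).most_common(1)[0][0]
    # Assumption: a redirected thread keeps its byte offset within the block.
    return [target * block_size + (addr % block_size) for addr in addresses]

# One of four threads disagrees on the branch; one load falls in another block.
print(herd_branch([True, True, False, True]))   # [True, True, True, True]
print(herd_addresses([0, 4, 8, 130]))           # [0, 4, 8, 2]
```

The minority threads now compute on a wrong path or wrong data, which is the source of the output quality degradation the abstract mentions.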

   Concurrent Autonomous Self-Test and Diagnostics
Pub ID:  2718 Authors:  Yanjing Li, Subhasish Mitra
Concurrent autonomous self-test, or online self-test, allows a system to test itself concurrently during normal operation, with no system downtime visible to the end user. Online self-test is important for overcoming major reliability challenges such as early-life failures and circuit aging in future Systems-on-Chip (SoCs). We present an efficient online self-test technique that is applicable to both processor cores and uncore components (e.g., cache controllers, DRAM controllers, and I/O controllers). An implementation of online self-test for the uncore components of the open-source OpenSPARC T2 multi-core SoC achieves high test coverage for both failure prediction and hard failure detection at 3.5% area impact, 2% power impact, and 3.6% system-level performance impact.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Content Aware Channel Adaptive Low Power MIMO System for Video Transmission
Pub ID:  2697 Authors:  Debashis Banerjee, Joshua Wells, Shreyas Sen, Shyam Kumar Devarakond , Abhijit Chatterjee
With increasing demand for reliable, fast communication over adverse channel conditions, new technologies have come to the fore to ensure robust error-free data transmission and reception. A key concept that has had a large footprint in the area of reliable wireless communication is that of multiple-input-multiple-output (MIMO) wireless systems. With the help of adaptive RF circuits and systems we can trade off performance for power in a MIMO Virtually Zero Margin Adaptive RF Front-end (MIMO-VIZOR) receiver when applicable. Such low power techniques must also adapt with process variation. Further, intelligent encoding algorithms could be used for video transmission to save power at the baseband.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Content Aware Channel Adaptive Low Power MIMO System for Video Transmission
Pub ID:  2724 Authors:  Debashis Banerjee, Joshua Wells, Abhijit Chatterjee
With increasing demand for reliable, fast communication over adverse channel conditions, new technologies have come to the fore to ensure robust error-free data transmission and reception. A key concept that has had a large footprint in the area of reliable wireless communication is that of multiple-input-multiple-output (MIMO) wireless systems. With the help of adaptive RF circuits and systems we can trade off performance for power in a MIMO Virtually Zero Margin Adaptive RF Front-end (MIMO-VIZOR) receiver when applicable. Further, intelligent encoding algorithms could be used for video transmission to save power at the baseband.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Platform Architectures Theme Overview
Pub ID:  2744 Authors:  Margaret Martonosi, Luca Carloni
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Stochastic Communications
Pub ID:  2764 Authors:  Andrew Singer, Naresh Shanbhag
In this talk we look at the non-traditional design of ADCs in an AFE for a communication link. We consider the goal of preserving information, rather than the precise waveform (i.e., use BER as a guide, not SNDR). This leads to non-traditional ADCs, where quantization is nonuniform in time or in amplitude. This provides a BER gain / power savings. We are able to maintain these benefits in the presence of non-ideal circuit models.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Formally Enhanced Runtime Verification for NoCs
Pub ID:  2652 Authors:  Ritesh Parikh, Valeria Bertacco
As silicon technology scales, modern processors and embedded systems are rapidly shifting towards complex chip multi-processor (CMP) and system-on-chip (SoC) designs, comprising several processor cores and IP components communicating via a network-on-chip (NoC). As a side-effect of this trend, ensuring their correctness has become increasingly problematic. In particular, the network-on-chip often includes complex features and components to support the required communication bandwidth among the nodes in the system. In this landscape, it is no wonder that design errors in the NoC may go undetected and escape into the final silicon, with potential detrimental impact on the overall system. In this work, we propose ForEVeR, a solution that complements the use of formal methods and runtime verification to ensure functional correctness in NoCs. Formal verification, due to its scalability limitations, is used to verify the smaller modules, such as individual router components. We complete the protection against escaped design errors with a runtime technique, a network-level error detection and recovery solution, which monitors the traffic in the NoC and protects it against escaped functional bugs that affect the communication paths in the network. To this end, ForEVeR augments the baseline NoC with a lightweight checker network that alerts destination nodes of incoming packets ahead of time. If a bug is detected, flagged by missed packet arrivals, a recovery mechanism delivers the in-flight data safely to the intended destination via the checker network. ForEVeR’s experimental evaluation shows that it can recover from NoC design errors at only 5.9% area cost for an 8x8 mesh interconnect, with a recovery performance cost of less than 30K cycles per functional bug manifestation. Additionally, it incurs no performance overhead in the absence of errors.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Power-Performance Modeling for General-Purpose and Accelerator-Based Systems
Pub ID:  2716 Authors:  Chih‑Ning Amanda Tseng, Yakun Sophia Shao, Svilen Kanev, David Brooks
The poster discusses a power-performance modeling framework for general-purpose and accelerator-based systems. Performance-power models for Atom and Nehalem-like microarchitectures are described and evaluated. Using the modeling framework, we explore a large space of potential designs from a cost-benefit viewpoint.
Nov 16, 2011,   GSRC/MuSyC Annual Joint Review

   Characterizing and Improving Last-level Cache Management using Signature-based and Prefetch-aware Approaches
Pub ID:  2717 Author:  Carole‑Jean Wu
Hardware prefetching and last-level caching are two independent mechanisms to mitigate the growing latency to memory. Prefetching improves performance by fetching useful data in advance, but introduces performance variability for applications under different cache management policies. In this talk, I will present a Prefetch-Aware Cache Management (PACMan) proposal for providing better and more predictable performance under the influence of prefetching by modifying the cache insertion and hit promotion policies to treat demand and prefetch requests differently. Then, I will present a novel Signature-based Hit Predictor (SHiP) to learn the re-reference behavior of cache lines. SHiP correlates cache references with unique signatures (memory region, program counter, and instruction history sequence) and uses these signatures to better predict the re-reference intervals of cache references. While using less hardware, SHiP doubles the performance gains of the prior art. PACMan and SHiP will be presented at MICRO 2011.
Nov 1, 2011,   GSRC e-seminar: Characterizing and Improving Last-level Cache Management using Signature-based and Prefetch-aware Approaches

   Debugging Parallel Programs for the Masses
Pub ID:  2639 Authors:  Daniel Schwartz‑Narbonne, Feng Liu, Tarun Pondicherry, David August, Sharad Malik
A parallel program must execute correctly even in the presence of unpredictable thread interleavings. This interleaving makes it hard to write correct parallel programs, and also makes it hard to find bugs in incorrect parallel programs. Many tools have been developed to help debug parallel programs, ranging from atomicity-violation and data-race detectors to model-checkers and theorem provers. One technique that has been successful for debugging sequential programs, but less effective for parallel programs, is running the program using assertion predicates provided by the developer. These assertions allow programmers to specify and check their assumptions. In a multi-threaded program, the programmer's assumptions include both the current state and any actions (e.g. access to shared memory) that other threads executing in parallel might take. We introduce parallel assertions, which allow programmers to express these assumptions for parallel programs using simple and intuitive syntax and semantics. We present a proof-of-concept implementation, and demonstrate its value by testing a number of benchmark programs using parallel assertions.
Sep 13, 2011,   GSRC e-seminar: Debugging Parallel Programs for the Masses