Hierarchical Mixed-Signal/RF System Level Test & Validation
Combined Loop Transformation and Hierarchy Allocation for Data Reuse Optimization
Platform Viability Theme Overview
RF Real-Time Adaptation for Error Resilience, Low Power and Performance
A Design Framework for Distributed Power Management of Heterogeneous Systems-on-Chip
Parallel Assertions for Debugging Parallel Programs
Long term video segmentation through pixel level spectral clustering on GPUs
Programming Concurrent Systems Theme Overview
Stochastic Computing: Principles and Practice
Scalable Security Vulnerability Analysis via Sampling
Software is complicated. This complexity makes it difficult to write correct programs, and the increasing performance of personal computers has exacerbated the problem through the drive for more features. Software bugs can cost money and lives, and information about them is now sold on black markets as weapons for cyber-warfare and for use in criminal activities.
Developers commonly use automated tools to hunt down these bugs. Dynamic dataflow analysis, one example of this type of tool, finds software errors by tracking meta-values associated with a program's runtime data, and can find subtle errors that would normally escape notice.
Such tests are more likely to find errors if they observe the program under a multitude of runtime situations. Ideally, a program's users would analyze it, testing situations the developers may never have thought to try. Unfortunately, the orders-of-magnitude slowdowns that accompany these systems limit their use to the development stage; few users would tolerate such overheads.
This poster describes methods of distributing these tests across large populations. We sample these analyses, missing some errors in exchange for keeping slowdowns below a user-defined threshold. Previous sampling methods are inadequate for dynamic dataflow analyses, so we describe a novel sampling mechanism along with hardware-assisted and software-only implementations.
In the end, the large populations gained by distributing the analysis can, in aggregate, analyze a larger portion of the program than is possible by any single user running a complete, but slow, analysis.
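The trade-off described above can be illustrated with a toy taint tracker. This is a minimal sketch, not the poster's mechanism: the per-operation coin flip stands in for the overhead-threshold-driven sampling, which the abstract does not detail, and the names `run_sampled`, `aggregate`, and the trace format are hypothetical.

```python
import random

def run_sampled(trace, tainted_inputs, sinks, sample_rate, rng):
    """One user's run: propagate taint only on the operations that are sampled.

    trace: list of (dst, [srcs]) operations (hypothetical format).
    Returns the set of sinks that were observed receiving tainted data.
    """
    shadow = set(tainted_inputs)  # meta-values: names that are currently tainted
    reports = set()
    for dst, srcs in trace:
        if rng.random() >= sample_rate:
            # Analysis is off for this operation. Drop dst's meta-value so the
            # run can only miss errors, never report a false one.
            shadow.discard(dst)
            continue
        if any(s in shadow for s in srcs):
            shadow.add(dst)
        else:
            shadow.discard(dst)
        if dst in sinks and dst in shadow:
            reports.add(dst)
    return reports

def aggregate(trace, tainted_inputs, sinks, sample_rate, users, seed=0):
    """Union of the reports from many independently sampled runs."""
    rng = random.Random(seed)
    found = set()
    for _ in range(users):
        found |= run_sampled(trace, tainted_inputs, sinks, sample_rate, rng)
    return found

# Toy program: untrusted input flows through a and b into a jump target.
trace = [("a", ["input"]), ("b", ["a"]), ("c", ["k"]), ("jump_target", ["b"])]
full = aggregate(trace, {"input"}, {"jump_target"}, sample_rate=1.0, users=1)
many = aggregate(trace, {"input"}, {"jump_target"}, sample_rate=0.5, users=200)
print(full, many)
```

With `sample_rate=1.0` every operation is analyzed, which corresponds to the complete but slow analysis. At lower rates each run sees only part of the dataflow, and the union over many users can only ever contain errors the full analysis would also report.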
Row Buffer Locality-Aware Hybrid Memory Caching Policies
Coherent 3D Scene Understanding from Images
Performance-Oriented Mapping of Kernel Parallelism on FPGAs
Resilient Systems Theme Overview
Michigan Visual Sonification System
Supervised Design Space Exploration of Accelerators and Cores in Heterogeneous SoCs
Emerging heterogeneous Systems-on-Chip (SoCs) feature a mix of hardware accelerators and programmable cores. In order to raise the design productivity for these SoCs, it is critical both to reuse pre-designed soft IP components and to implement the components with synthesis tools. We present the first top-down design-space exploration framework that characterizes the system's cost-performance trade-off by adaptively synthesizing its components. Our approach minimizes the number of synthesis runs while discovering the desired system's Pareto set.
At the register-transfer level (RTL), we propose an optimal algorithm for Compositional Approximation of Pareto Sets (CAPS). For bi-objective design spaces, CAPS returns an approximate Pareto set that captures the cost-performance trade-off with guaranteed accuracy in the fewest possible synthesis runs. At the system level, we propose a high-level synthesis (HLS)-driven methodology to enhance design reuse. Starting from libraries of SystemC specifications and companion HLS configurations, we present an algorithm for HLS planning that enables an effective parallelization of the HLS runs to construct the system-level Pareto curve.
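For readers unfamiliar with Pareto sets, the sketch below shows the exact (not approximate) bi-objective case on already-synthesized design points. It is not the CAPS algorithm, which chooses which synthesis runs to perform. The function names and the assumption that component cost and latency simply add up at the system level are hypothetical.

```python
def pareto_front(points):
    """Return the non-dominated (cost, latency) points, both to be minimized."""
    front = []
    best_latency = float("inf")
    # Sorted by cost, then latency: a point is kept only if it improves on the
    # best latency seen among all cheaper (or equally cheap) points.
    for cost, latency in sorted(set(points)):
        if latency < best_latency:
            front.append((cost, latency))
            best_latency = latency
    return front

def compose(front_a, front_b):
    """Combine two component fronts, assuming cost and latency are additive."""
    combined = [(ca + cb, la + lb) for ca, la in front_a for cb, lb in front_b]
    return pareto_front(combined)

# Hypothetical synthesis results for one component: (area, latency).
component = [(1, 10), (2, 8), (2, 9), (3, 8), (4, 3), (5, 4)]
print(pareto_front(component))  # [(1, 10), (2, 8), (4, 3)]
print(compose([(1, 10), (3, 4)], [(2, 5), (4, 1)]))  # [(3, 15), (5, 9), (7, 5)]
```

Composing component fronts rather than enumerating every full-system configuration is what keeps the number of evaluated points small in this toy setting.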
DAE2FSM: A Fully Automated Technique for Generating Finite State Machine Abstractions for Digital, Analog, and Mixed-Signal Circuits
Fine grained accelerator integration using 3D
Power-Aware Dynamic Control of Error-Resilience Mechanisms
Ultra-low power sensing architectures and platforms for intelligent and scalable biomedical monitoring
We propose methods to enable low-power, scalable biomedical monitoring, including a low-power IC design with embedded machine learning support. We also introduce system-level and application-level techniques to reduce communication cost and the burden on clinical resources, and to enhance system resiliency for low-power devices.
Relyzer: Application Resiliency Analyzer for Transient Faults
Post-silicon Validation of Weak Memory Consistency
Functional verification of modern chip multiprocessors (CMPs) is a challenging task because of increasing complexity and shrinking production schedules. The memory subsystem in CMPs is the glue that brings the whole system together. Many subtle yet devastating bugs in the memory subsystems of CMPs are being released into final silicon. This calls for an efficient, high-coverage verification solution for identifying bugs in the memory subsystem of modern CMPs.
Our solution, a post-silicon validation technique that executes directly on prototype hardware and can be disabled before final silicon, targets a set of bugs that cause illegal memory operation ordering in CMPs. The legal ordering of memory operations is specified by a memory consistency model, which may dictate order among all memory operations or only among some operations and ordering instructions (barriers/fences). Most modern CMPs are designed for weak memory consistency models of the latter type to enable optimizations for higher performance. Our validation solution, when activated, reconfigures a portion of cache storage to log memory accesses and any ordering instructions from the executing program using a compact data-coloring scheme. Periodically, execution of the program is paused and a distributed validation algorithm runs in situ on the CMP to verify whether the memory operation ordering stored in the logs is correct.
The overheads from our solution are only observed during the post-silicon validation phase. It can be disabled before shipment, releasing all cache storage and leaving only a nominal silicon area footprint.
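One common way to phrase such an ordering check is as cycle detection on a graph of memory operations. The sketch below assumes a simplified weak model in which only fences and same-address accesses enforce program order. It does not model the poster's data-coloring logs or its distributed algorithm, and the operation format and function names are hypothetical.

```python
def enforced_edges(thread):
    """Program-order edges the (assumed) weak model requires within one thread.

    thread: list of (op_id, kind, addr) in program order, kind in {"R", "W", "F"}.
    Two accesses stay ordered if a fence separates them or they share an address.
    """
    edges = []
    for i, (id_a, kind_a, addr_a) in enumerate(thread):
        if kind_a == "F":
            continue
        fence_seen = False
        for id_b, kind_b, addr_b in thread[i + 1:]:
            if kind_b == "F":
                fence_seen = True
            elif fence_seen or addr_a == addr_b:
                edges.append((id_a, id_b))
    return edges

def has_cycle(edges):
    """Depth-first search for a cycle; a cycle means no legal ordering exists."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, [])
    state = {}  # 1 = on the current DFS path, 2 = fully explored

    def visit(node):
        state[node] = 1
        for nxt in graph[node]:
            if state.get(nxt) == 1 or (nxt not in state and visit(nxt)):
                return True
        state[node] = 2
        return False

    return any(node not in state and visit(node) for node in graph)

# Message-passing test: T1 sees the new y but the old x.
t0 = [("Wx", "W", "x"), ("F0", "F", None), ("Wy", "W", "y")]
t1 = [("Ry", "R", "y"), ("F1", "F", None), ("Rx", "R", "x")]
# Observed in the log: Ry read Wy's value; Rx read the value from before Wx.
observed = [("Wy", "Ry"), ("Rx", "Wx")]

with_fences = enforced_edges(t0) + enforced_edges(t1) + observed
no_fences = enforced_edges([t0[0], t0[2]]) + enforced_edges([t1[0], t1[2]]) + observed
print(has_cycle(with_fences), has_cycle(no_fences))  # True False
```

With the fences in place the observed outcome is illegal, so the check flags it. Without them the same outcome is allowed under the assumed model and no cycle appears.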
PROMOTE: PROcess MOnitoring and TEsting of Analog/RF circuits
Durability and Availability in RAMCloud
A Systematic Methodology to Develop Resilient Cache Coherence Protocols
U-QED Tests for Effective Post-Silicon Validation of SoCs
U-QED, or Uncore Quick Error Detection, is a post-silicon validation technique that significantly reduces the error detection latencies of bugs in an SoC's core and uncore components. Long error detection latency, the time elapsed between the occurrence of an error caused by a bug and its manifestation as a system-level failure, is a major challenge in post-silicon validation because it limits the effectiveness of existing post-silicon bug localization techniques based on trace recording, simulation, and formal analysis. In this work we focus on bugs in both core and uncore components because validation of uncore components, which account for a significant portion of modern SoCs, is difficult due to issues such as multiple clock domains and difficulties in observing internal interfaces. U-QED consists of a set of systematic and automatable software-only transformations that turn existing validation tests into U-QED validation tests. The U-QED validation tests contain a number of code blocks that significantly reduce the error detection latency and improve coverage. Extensive simulation results on an OpenSPARC T2-like SoC show that U-QED validation tests significantly reduce the error detection latencies of logic bugs in both core and uncore components of SoCs. Results also show that U-QED validation tests significantly improve coverage by detecting bugs that escaped the original validation tests.
Formalizing and Demystifying the PowerPC Memory Model
Dynamic Parallelization of JavaScript Applications
SAT-based Post-silicon Fault Localization
Stochastic Sensor Network-on-a-Chip
Bio-Inspired Sensory Signal Processing
Variation Tolerant Optical NoC
End-to-End Error Correction and Online Diagnosis for On-Chip Networks
Relyzer: Application Resiliency Analyzer for Transient Faults
ERSA: Error Resilient System Architecture For Probabilistic Applications
TCP/IP Protocol Stack in PARSEC 3.0
Copperhead: Data Parallel Python
Bug Positioning System
Functional Correctness for CMP Interconnects
Stochastic Computing on Error Resilient System Architecture (ERSA) Platforms
Algorithmic Techniques for Fault Tolerance for Sparse Linear Algebra
Portability and Performance of OpenCL Kernels on GPU and CPU Platforms
Enabling Advanced Inference in Small-scale Sensors: resilient devices for analyzing physiological signals
CrashTest'ing SWAT: Accurate, Gate-Level Evaluation of Symptom-Based Resiliency Solutions
In this poster, we evaluate an FPGA-based analysis platform capable of testing a complete computer system against detailed fault model accuracy to verify symptom-based fault detection schemes. With this platform, we performed a gate-level accurate fault campaign of 51,630 fault injections across the processor core logic of the OpenSPARC T1 design and five SPECInt 2000 benchmarks. With a conservative overall SDC rate of 0.76%, our results are comparable to previous microarchitecture-level evaluations of SWAT, demonstrating the effectiveness of symptom-based software detectors for permanent faults in most microprocessor components.
ParaFin: A Framework for Parameterized Model Checking of Fine Grained Concurrency
We show how the correctness of fine-grained concurrent data structures, with a fixed number of elements but an unbounded number of threads accessing them, can be established in a highly automated fashion. Though the technique we use, called the CMP method, was used primarily for message-passing protocols, the underlying principles carry over to shared-memory systems as well. The CMP method is semi-automatic, as it requires user guidance in the form of lemmas. We show how, for concurrent data structures, most of these lemmas can be automatically generated using off-the-shelf invariant generators. Thus, in contrast to static analysis based on separation logic, our method places very little proof burden on the user and does not require the user to understand complex logics. Further, in contrast to other model-checking-based works, which proved correctness of concurrent data structures for a fixed number of threads, we establish correctness for any number of threads. We verified several challenging list-based concurrent set data structures, and also found a known bug in one of the data structures. These preliminary results demonstrate the efficacy of our method and establish it as a promising technique for verifying concurrent data structures.
Programming Interfaces for Stochastic Computing
Variability at Low Voltages: SRAM, Error Tolerance
Error Modeling of Electrical Bugs for Post-Silicon Validation
System Level Metrics for Analog-to-Digital Converter Design
Post-Silicon Fault Localisation Using SAT and Backbones
Abstraction-Based NoC Performance Analysis
Resilient Coherence in Many-Core CMPs
A Case for Locality-Aware Task Management on Many-Core Processors
This project investigates all aspects of locality-aware management of fine-grained tasks, covering both task scheduling and stealing. We analyze the three key decisions in generating a locality-aware schedule (task grouping, task ordering, and task size) and propose a recursive approach to task scheduling that is generally applicable to any cache hierarchy. Our simulation results on two distinct 32-core systems show an average speedup of 1.60x over a randomized schedule and 1.43x over a published baseline scheduling scheme. The locality-aware schedule reduces energy consumption by 55% and 47% relative to the random and the baseline schedules, respectively. The importance of the three decisions is also verified.
We also highlight the importance of locality-aware stealing when the tasks are scheduled in a locality-aware fashion, and we develop a recursive task stealing scheme that preserves the benefits of a locality-aware schedule while load balancing. The proposed stealing scheme shows a speedup for stolen tasks of up to 2.0x over randomized stealing.
Finally, we show how a pattern-based approach could be utilized to reduce scheduling overheads and to improve locality for stealing.
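The recursive idea can be sketched as follows, under the assumption that tasks are already ordered so that neighbors share data and that the cache hierarchy is described by a list of fan-outs (for example, `[2, 2]` for two shared caches with two cores each). The function names and this hierarchy encoding are hypothetical, and the project's actual grouping, ordering, and sizing decisions are not reproduced here.

```python
def recursive_schedule(tasks, fanouts):
    """Split locality-ordered tasks into one contiguous queue per core.

    fanouts[0] is the number of children at the top cache level, fanouts[1]
    at the next level, and so on down to individual cores.
    """
    if not fanouts:
        return [list(tasks)]  # leaf: a single core's queue
    parts = fanouts[0]
    size, extra = divmod(len(tasks), parts)
    queues, start = [], 0
    for p in range(parts):
        end = start + size + (1 if p < extra else 0)
        queues.extend(recursive_schedule(tasks[start:end], fanouts[1:]))
        start = end
    return queues

def steal_order(core, fanouts):
    """Victim cores for `core`, nearest in the cache hierarchy first."""
    total = 1
    for f in fanouts:
        total *= f

    def distance(other):
        # Number of levels to climb before both cores share a cache.
        span, levels = 1, 0
        for f in reversed(fanouts):
            span *= f
            levels += 1
            if core // span == other // span:
                break
        return levels

    victims = [c for c in range(total) if c != core]
    return sorted(victims, key=lambda c: (distance(c), c))

print(recursive_schedule(list(range(8)), [2, 2]))  # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(steal_order(2, [2, 2]))  # [3, 0, 1]
```

Because the same splitting rule is applied at every level, tasks that are adjacent in the ordering end up under the same shared cache, and a thief looks first at the queue whose tasks are most likely to hit in a cache it already shares.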
Impact of Stochastic Computing on Minimum Energy Operating Point (MEOP)
Impact Productivity Tools for Developing High-Performance Throughput-Oriented Applications
Heterogeneous systems have become an important building block for modern computing platforms, ranging from supercomputers to mobile devices. New programming languages are being adopted as an interface to a variety of parallel processors. However, writing high-performance parallel code is still too complicated today. The IMPACT group is developing a set of tools to address various aspects of the complexity inherent in developing high-performance parallel applications targeting heterogeneous systems.
Phase-Locked Loop Verification by a Bounded Model Checking Approach
Network-Driven Chips
Energy Benefits of Power Gating on Memory Misses in Multi-Core Systems
Towards the Ideal On-chip Fabric for 1-to-Many and Many-to-1 Communication
Integrated Framework Combining Virtual Platform and NoC Synthesis for Heterogeneous Systems-on-Chip
Future platform architectures will be empowered by heterogeneous multi-core Systems-on-Chip that integrate an increasing number of processors, specialized accelerators, and memories into a single die. The challenges in SoC design will be primarily in the integration and management of their components: these must be interconnected with a flexible and scalable communication infrastructure and must be activated only when needed since the whole chip must run within very tight power constraints.
We present an integrated framework that enables effective design-space exploration of the communication infrastructure for heterogeneous SoCs by providing fast yet accurate estimation of performance, power and area.
Our framework leverages a time-approximate virtualized platform to execute complex application scenarios running on top of a full Linux environment. In our approach, we first rely on an abstract network to run efficient simulations. This allows us to derive the information necessary to synthesize an optimal network-on-chip that supports the target applications. The resulting NoC is then automatically synthesized at the RTL to obtain accurate power and area analysis. Finally, the NoC components are back-annotated with the RTL-synthesis results and the NoC is plugged into the virtualized platform so that the full system can be tested to refine the performance, power, and area analysis.
Our framework is a promising test-bench for the development of advanced communication protocols and supporting network architectures for future heterogeneous SoCs.
Surviving Memory and Concurrency Errors
Lockdown: Most applications make extensive use of third-party libraries that act as their interface to the outside world (such as image, video, and audio codecs or socket libraries). All of these libraries are convenient and frequent vectors for attack: because they run in the same address space as the application, they have full read and write access to all of the application's memory. Weaknesses in these libraries can allow corruption of the application's stack and heap via buffer overflows, and give an attacker control of the process. We present a runtime system called Lockdown which automatically isolates libraries from the main application and prevents a range of attacks resulting from invalid or unauthorized memory accesses.
Dthreads: Multithreaded programming is notoriously difficult to get right. A key problem is non-determinism, which complicates debugging, testing, and reproducing errors. One way to simplify multithreaded programming is to enforce deterministic execution, but current deterministic systems for C/C++ are incomplete or impractical. These systems require program modification, do not ensure determinism in the presence of data races, do not work with general-purpose multithreaded programs, or run up to 8.4x slower than pthreads. We present DTHREADS, an efficient deterministic multithreading system for unmodified C/C++ applications that replaces the pthreads library. DTHREADS enforces determinism in the face of data races and deadlocks. DTHREADS works by exploding multithreaded applications into multiple processes, with private, copy-on-write mappings to shared memory. It uses standard virtual memory protection to track writes, and deterministically orders updates by each thread. By separating updates from different threads, DTHREADS has the additional benefit of eliminating false sharing. Experimental results show that DTHREADS substantially outperforms a state-of-the-art deterministic runtime system, and for a majority of the benchmarks evaluated here, matches and occasionally exceeds the performance of pthreads.
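The isolate-then-commit idea behind DTHREADS can be mimicked in a few lines. This is a toy model only: dictionaries stand in for memory pages, private copies are made eagerly rather than copy-on-write, deletions are ignored, and committing in thread-index order is an assumed simplification of how the real system orders updates. `run_deterministic` is a hypothetical name.

```python
def run_deterministic(shared, thread_fns, exec_order=None):
    """Run each thread on a private copy, then commit diffs in a fixed order.

    exec_order lets the caller vary the order in which threads actually run,
    to show that it does not affect the committed result.
    """
    snapshot = dict(shared)
    order = exec_order if exec_order is not None else range(len(thread_fns))
    diffs = {}
    for tid in order:
        private = dict(snapshot)  # isolated view of shared memory
        thread_fns[tid](private)
        # Keep only what this thread changed relative to the snapshot.
        diffs[tid] = {k: v for k, v in private.items()
                      if k not in snapshot or snapshot[k] != v}
    result = dict(snapshot)
    for tid in sorted(diffs):  # deterministic commit order
        result.update(diffs[tid])
    return result

def t0(mem):
    mem["x"] = 1              # races with t1 on x
    mem["a"] = mem["y"] + 1

def t1(mem):
    mem["x"] = 2

shared = {"x": 0, "y": 10}
print(run_deterministic(shared, [t0, t1]))                     # {'x': 2, 'y': 10, 'a': 11}
print(run_deterministic(shared, [t0, t1], exec_order=[1, 0]))  # same result
```

Both threads write `x`, yet the outcome is the same regardless of which thread runs first, because neither sees the other's writes until the ordered commit.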
GSRC State of the Center
Predicting Serializability Violations: SMT-based Search vs. DPOR-based Search
Contracts for Correct Composition and System-Level Design of Analog and Mixed-Signal Circuits
Improving Cache Performance Using Victim Tag Stores
Digitally Assisted VaST Driven Post-manufacture Adaptation of Analog / RF Systems
Alternative Research Theme Overview
Running 1000 Threads on a General-Purpose Multi-Core
Research on Network-on-Chip Router Modeling
The era of many-core computing demands highly efficient design of Network-on-Chip routers. This in turn requires more accurate, detailed architectural simulators with NoCs. Currently available models for NoC routers, such as Orion2.0, still have gaps in their area and power estimations – e.g., through modeling only the data paths. Our research seeks to improve accuracy of area and power estimation of NoC router simulators. We apply a highly detailed methodology to analyze component blocks (crossbar, switch and virtual channel arbiters, etc.) of academic NoC routers at the gate-netlist and post-layout level, and propose detailed models of data and control paths of each component to provide highly accurate NoC router modeling.
Thread Cluster Memory Scheduling
VaST: Validation Signature Test
Application Drivers Theme Overview
Architecture and Synthesis Support for Accelerator-Rich CMPs
Exploiting the Forgiving Nature of GPU Applications to Reduce Control and Memory Divergence
Concurrent Autonomous Self-Test and Diagnostics
Content Aware Channel Adaptive Low Power MIMO System for Video Transmission
Platform Architectures Theme Overview
Stochastic Communications
Formally Enhanced Runtime Verification for NoCs
Power-Performance Modeling for General-Purpose and Accelerator-Based Systems
Characterizing and Improving Last-level Cache Management using Signature-based and Prefetch-aware Approaches
Debugging Parallel Programs for the Masses