What is Shadow Computing?

Shadow Puppets
Photo by cayusa.

We are proposing a new computational model, called shadow computing, which provides goal-based adaptive resilience through dynamic execution to meet the requirements of complex applications in highly parallel, failure-prone environments. Adaptive resilience is the ability of the system to dynamically harness all available resources to achieve the highest level of quality of service (QoS) for a given application. Dynamic execution is the ability to change an application's QoS while it is executing. For example, we have built systems that execute processes at variable speeds using dynamic voltage and frequency scaling (DVFS), which changes the application's response-time QoS. The challenge is to maintain an application's QoS while minimizing the system resources used, in spite of system-level changes such as failures or the availability of additional resources. To achieve adaptive resilience, the shadow computing model associates a set of shadows with the main execution. These shadows are dynamically instantiated and adjusted to address the current state of the system and maintain the application's QoS requirements.
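To make the idea of dynamic execution more concrete, the following is a minimal sketch in Python of how a single shadow might affect response time. It is an illustration under simplifying assumptions, not the project's implementation: the function name `completion_time`, its parameters, and the linear work model (work completed = speed × time, one main process, one shadow, at most one failure) are all hypothetical. The shadow runs at a reduced speed while the main process is healthy and switches to a higher speed, for example via DVFS, if the main process fails.

```python
from typing import Optional


def completion_time(
    work: float,
    shadow_speed_before: float,
    shadow_speed_after: float,
    main_failure_time: Optional[float] = None,
    main_speed: float = 1.0,
) -> float:
    """Return the time at which the task completes (hypothetical model).

    Assumptions, not taken from the source:
    - work completed is speed * elapsed time;
    - the main process runs at `main_speed` until it finishes or fails;
    - the shadow starts at the same time as the main process, runs at
      `shadow_speed_before` until the main fails, then at `shadow_speed_after`;
    - the shadow itself never fails.
    """
    if work <= 0:
        raise ValueError("work must be positive")
    if main_speed <= 0 or shadow_speed_after <= 0:
        raise ValueError("main_speed and shadow_speed_after must be positive")
    if not 0 <= shadow_speed_before <= main_speed:
        raise ValueError("shadow_speed_before must lie in [0, main_speed]")

    main_finish = work / main_speed

    # No failure, or the failure happens after the main has already finished.
    if main_failure_time is None or main_failure_time >= main_finish:
        return main_finish

    # Work the slow shadow has already completed when the main fails.
    done_by_shadow = shadow_speed_before * main_failure_time
    remaining = work - done_by_shadow

    # The shadow speeds up and finishes the remaining work alone.
    return main_failure_time + remaining / shadow_speed_after


if __name__ == "__main__":
    # Failure-free run: the main process alone determines the response time.
    print(completion_time(work=100, shadow_speed_before=0.5, shadow_speed_after=1.0))
    # Main fails at t=40: the half-speed shadow has 20 units done, 80 remain.
    print(completion_time(100, 0.5, 1.0, main_failure_time=40))
```

In this toy model, a slower shadow consumes fewer resources while the main process is healthy but delays completion after a failure, so choosing the shadow speeds amounts to trading resource use against the application's response-time requirement.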

Recent News


    SC 2018 Paper Accepted

    Our recent submission to Supercomputing 2018 (SC18) has been accepted for publication. This paper removes the traditional assumption of uniform node failures in HPC systems and analytically studies the usefulness of partial replication without this assumption. Contributions include a novel result about the optimal selection and pairing of replicas, as well as an in-depth analysis of the scenarios in which partial replication provides the best performance under failures.

    ICIN 2018 Paper Accepted

    Our recent submission to ICIN 2018 has been accepted for publication. This paper generalizes the shadow computing model to allow for recovery from silent errors in addition to crash failures, while taking into consideration application QoS constraints.

    HPCC 2017 Paper Accepted

    Our recent submission to HPCC 2017 has been accepted for publication. In this paper we discuss the most recent design of the shadow computing model, along with rejuvenation techniques that enhance tolerance to multiple failures. A prototype implementation is presented, and empirical evaluation results are analyzed.

    FSP Paper Accepted

    Our recent submission to Frontiers in Signal Processing (FSP) has been published. This work studies the application of Lazy Shadowing for adaptive and power-aware resilience in failure-prone, extreme-scale computing environments.

    DoE presentation

    Lazy Shadowing was showcased at DoE's meeting. Here are the links to the [Handout], [Presentation], [Poster], and [Quad Chart].

    ScalCom 2016 Paper Accepted

    Our recent submission to ScalCom 2016 has been accepted for publication. In this paper we devise novel techniques, referred to as shadow collocation and shadow leaping, and integrate them with Shadow Replication to form a more efficient and scalable paradigm that we call Lazy Shadowing.

    Energies 2014 Paper Accepted

    Our recent submission to Energies has been accepted for publication. This paper extends our CLOSER 2014 paper and further studies the impact of different coordination types among distributed applications.

    CLOSER 2014 Paper Accepted

    Our recent submission to CLOSER 2014 has been accepted for publication. In this paper we propose Shadow Replication for the cloud computing environment. We show that using Shadow Replication as a fault-tolerance scheme for MapReduce applications can both maximize profit and reduce energy consumption.

    PDP 2014 Paper Accepted

    Our recent submission to PDP 2014 has been accepted for publication. This paper was a joint project between the University of Pittsburgh and Sandia National Laboratories. In this paper we develop an instance of shadow computing for high-performance computing (HPC) and show that we can reduce power consumption while increasing application performance.