Keeping airplanes in the sky since 2006
(or: Well-typed, Trustworthy Computing in the Presence of Transient Faults)
About Project Zap
In recent decades, microprocessor performance has been increasing exponentially, due in large part to smaller and faster transistors enabled by improved fabrication technology. While such transistors yield performance enhancements, their lower threshold voltages and tighter noise margins make them less reliable, rendering processors that use them more susceptible to transient faults caused by energetic particles striking the chip. These faults do not cause permanent hardware damage, but they alter values and corrupt computations. In 2000, Sun Microsystems acknowledged that cosmic rays caused crashes in server systems at major customer sites, including America Online, eBay, and dozens of others. There are hardware-only solutions to the problem of transient faults, but such solutions exact high hardware costs, can't be deployed after-the-fact when transient faults appear in the field, and only provide one-size-fits-all reliability policies.
The goal of Project Zap is to develop a trustworthy, flexible and efficient platform for reliable computing in the presence of transient faults. Our multidisciplinary team is working on an end-to-end solution for producing reliable software involving the following components:
  1. Programming language-level reliability specifications so consumers can dictate the level of reliability they need on a case-by-case and application-by-application basis,
  2. Reliability-directed compilation so compilers can interpret programmer reliability specifications and introduce the redundancy necessary to tolerate the specified faults,
  3. Typed intermediate languages with novel type systems capable of automatically verifying reliability properties of intermediate program representations,
  4. A new framework for type- and reliability-preserving optimizations as many current, common compiler optimizations are, surprisingly, not reliability-preserving,
  5. Automatic, machine-level reliability verification so compiler output can be proven reliable,
  6. New software-modulated fault tolerance techniques at the harware/software boundary to implement the reliability specifications, including both software-only and hybrid hardware/software techniques, to maximize speed and minimize power, offering reliability at the right cost, and
  7. Microarchitectural optimization that explores opportunities to trade reliability of components off against demands for performance, power, and cost.
For more information, see our publications below or the NSF Grant Proposal and the NSF Cyber Trust Meeting Poster.
Project Zap is funded in part by NSF award CNS-0627650 (September 2006 - September 2010). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF.
People
Publications
Reasoning about Control Flow in the Presence of Transient Faults
Frances Perry and David Walker.
International Static Analysis Symposium. to appear July 2008.
Reasoning about Control Flow in the Presence of Transient Faults (Extended Version)
Frances Perry and David Walker.
Princeton University Technical Report TR-799-07. October 2007.
Online Proof Appendix. October 2007. [proofs]
Fault-tolerant Typed Assembly Language
Frances Perry, Lester Mackey, George A. Reis, Jay Ligatti, David I. August, and David Walker.
ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). June 2007. [slides]
Joint winner of the PLDI 2007 Best Paper Award.
Fault-tolerant Typed Assembly Language (Extended Version)
Frances Perry, Lester Mackey, George A. Reis, Jay Ligatti, David I. August, and David Walker.
Princeton University Technical Report TR-776-07. April 2007.
A Very Modal Model of a Modern, Major, General Type System
Andrew W. Appel, Paul-Andre Mellies, Christopher D. Richards, and Jerome Vouillon.
ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL). January 2007.
Automatic Instruction-Level Software-Only Recovery Methods
George A. Reis, Jonathan Chang, and David I. August.
IEEE Micro, Volume 27, Number 1. January 2007.
IEEE Micro's "Top Picks" special issue for papers "most relevant to industry and significant in contribution to the field of computer architecture" in 2006.
Configurable Transient Fault Detection via Dynamic Binary Translation
George A. Reis, Jonathan Chang, David I. August, Robert Cohn, and Shubhendu S. Mukherjee.
Proceedings of the 2nd Workshop on Architectural Reliability (WAR). December 2006.
Non-Uniform Fault Tolerance
Jonathan Chang, George A. Reis, and David I. August.
Proceedings of the 2nd Workshop on Architectural Reliability (WAR). December 2006.
Static Typing for a Faulty Lambda Calculus
David Walker, Lester Mackey, Jay Ligatti, George Reis, and David August.
ACM SIGPLAN International Conference on Functional Programming (ICFP). September 2006.
Automatic Instruction-Level Software-Only Recovery Methods
Jonathan Chang, George A. Reis, and David I. August.
Proceedings of the International Conference on Dependable Systems and Networks (DSN). June 2006.
Winner of the William C. Carter Award.
Selected for IEEE Micro's "Top Picks" special issue for papers "most relevant to industry and significant in contribution to the field of computer architecture" in 2006.
Updated 10/15/2007.