Science.gov

Sample records for hardware fault tolerance

  1. Hardware and software fault tolerance - A unified architectural approach

    NASA Technical Reports Server (NTRS)

    Lala, Jaynarayan H.; Alger, Linda S.

    1988-01-01

    The loss of hardware fault tolerance which often arises when design diversity is used to improve the fault tolerance of computer software is considered analytically, and a unified design approach is proposed to avoid the problem. The fundamental theory of fault-tolerant (FT) architectures is reviewed; the current status of design-diversity software development is surveyed; and the FT-processor/attached-processor (FTP/AP) architecture developed by Lala et al. (1986) is described in detail and illustrated with diagrams. FTP/AP is shown to permit efficient implementation of N-version FT software while still tolerating random hardware failures with very high coverage; the reliability is found to be significantly higher than that of conventional majority-vote N-version software.

  2. Fault Tolerant Characteristics of Artificial Neural Network Electronic Hardware

    NASA Technical Reports Server (NTRS)

    Zee, Frank

    1995-01-01

    The fault tolerant characteristics of analog-VLSI artificial neural network (with 32 neurons and 532 synapses) chips are studied by exposing them to high energy electrons, high energy protons, and gamma ionizing radiations under biased and unbiased conditions. The biased chips became nonfunctional after receiving a cumulative dose of less than 20 krads, while the unbiased chips only started to show degradation with a cumulative dose of over 100 krads. As the total radiation dose increased, all the components demonstrated graceful degradation. The analog sigmoidal function of the neuron became steeper (increase in gain), current leakage from the synapses progressively shifted the sigmoidal curve, and the digital memory of the synapses and the memory addressing circuits began to gradually fail. From these radiation experiments, we can learn how to modify certain designs of the neural network electronic hardware without using radiation-hardening techniques to increase its reliability and fault tolerance.

  3. Analysis of a hardware and software fault tolerant processor for critical applications

    NASA Technical Reports Server (NTRS)

    Dugan, Joanne B.

    1993-01-01

    Computer systems for critical applications must be designed to tolerate software faults as well as hardware faults. A unified approach to tolerating hardware and software faults is characterized by classifying faults in terms of duration (transient or permanent) rather than source (hardware or software). Errors arising from transient faults can be handled through masking or voting, but errors arising from permanent faults require system reconfiguration to bypass the failed component. Most errors which are caused by software faults can be considered transient, in that they are input-dependent. Software faults are triggered by a particular set of inputs. Quantitative dependability analysis of systems which exhibit a unified approach to fault tolerance can be performed by a hierarchical combination of fault tree and Markov models. A methodology for analyzing hardware and software fault tolerant systems is applied to the analysis of a hypothetical system, loosely based on the Fault Tolerant Parallel Processor. The models consider both transient and permanent faults, hardware and software faults, independent and related software faults, automatic recovery, and reconfiguration.

  4. AVR microcontroller simulator for software implemented hardware fault tolerance algorithms research

    NASA Astrophysics Data System (ADS)

    Piotrowski, Adam; Tarnowski, Szymon; Napieralski, Andrzej

    2008-01-01

    Reliability of new, advanced electronic systems is becoming a serious problem, especially in places like accelerators and synchrotrons, where sophisticated digital devices operate close to radiation sources. One possible way to harden a microprocessor-based system is a strict programming approach known as Software Implemented Hardware Fault Tolerance. Unfortunately, in real environments it is not possible to perform precise and accurate tests of new algorithms due to hardware limitations. This paper presents an AVR-family microcontroller simulator project equipped with appropriate monitoring and SEU injection systems.

  5. Study of a unified hardware and software fault-tolerant architecture

    NASA Technical Reports Server (NTRS)

    Lala, Jaynarayan; Alger, Linda; Friend, Steven; Greeley, Gregory; Sacco, Stephen; Adams, Stuart

    1989-01-01

    A unified architectural concept, called the Fault Tolerant Processor Attached Processor (FTP-AP), that can tolerate hardware as well as software faults is proposed for applications requiring ultrareliable computation capability. An emulation of the FTP-AP architecture, consisting of a breadboard Motorola 68010-based quadruply redundant Fault Tolerant Processor, four VAX 750s as attached processors, and four versions of a transport aircraft yaw damper control law, is used as a testbed in the AIRLAB to examine a number of critical issues. Solutions of several basic problems associated with N-Version software are proposed and implemented on the testbed. This includes a confidence voter to resolve coincident errors in N-Version software. A reliability model of N-Version software that is based upon the recent understanding of software failure mechanisms is also developed. The basic FTP-AP architectural concept appears suitable for hosting N-Version application software while at the same time tolerating hardware failures. Architectural enhancements for greater efficiency, software reliability modeling, and N-Version issues that merit further research are identified.

  6. A hardware implementation of a provably correct design of a fault-tolerant clock synchronization circuit

    NASA Technical Reports Server (NTRS)

    Torres-Pomales, Wilfredo

    1993-01-01

    A fault-tolerant clock synchronization system was designed to a proven correct formal specification. Formal methods were used in the development of this specification. A description of the system and an analysis of the tests performed are presented. Plots of typical experimental results are included.

  7. Computer hardware fault administration

    DOEpatents

    Archer, Charles J.; Megerian, Mark G.; Ratterman, Joseph D.; Smith, Brian E.

    2010-09-14

    Computer hardware fault administration carried out in a parallel computer, where the parallel computer includes a plurality of compute nodes. The compute nodes are coupled for data communications by at least two independent data communications networks, where each data communications network includes data communications links connected to the compute nodes. Typical embodiments carry out hardware fault administration by identifying a location of a defective link in the first data communications network of the parallel computer and routing communications data around the defective link through the second data communications network of the parallel computer.

  8. Relaxed fault-tolerant hardware implementation of neural networks in the presence of multiple transient errors.

    PubMed

    Mahdiani, Hamid Reza; Fakhraie, Sied Mehdi; Lucas, Caro

    2012-08-01

    Reliability should be identified as the most important challenge in future nano-scale very large scale integration (VLSI) implementation technologies for the development of complex integrated systems. Normally, fault tolerance (FT) in a conventional system is achieved by increasing its redundancy, which also implies higher implementation costs and lower performance, sometimes to the point of making the design infeasible. In contrast to custom approaches, this paper categorizes a new class of applications that are inherently capable of absorbing some degree of vulnerability and providing FT based on their natural properties. Neural networks are good examples of imprecision-tolerant applications. We also propose a new class of FT techniques, called relaxed fault-tolerant (RFT) techniques, developed for VLSI implementation of imprecision-tolerant applications. The main advantage of RFT techniques over traditional FT solutions is that they exploit the inherent FT of different applications to reduce their implementation costs while improving their performance. To show the applicability and efficiency of the RFT method, experimental results are presented for the implementation of a computationally intensive face-recognition neural network and its corresponding RFT realization. The results demonstrate promising higher performance of artificial neural network VLSI solutions for complex applications in faulty nano-scale implementation environments. PMID:24807519

  9. Hardware Fault Simulator for Microprocessors

    NASA Technical Reports Server (NTRS)

    Hess, L. M.; Timoc, C. C.

    1983-01-01

    Breadboarded circuit is faster and more thorough than software simulator. Elementary fault simulator for AND gate uses three gates and shift register to simulate stuck-at-one or stuck-at-zero conditions at inputs and output. Experimental results showed hardware fault simulator for microprocessor gave faster results than software simulator, by two orders of magnitude, with one test being applied every 4 microseconds.
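
    The stuck-at fault model named in this record can be illustrated in software. The following is a minimal sketch, not the breadboarded circuit from the report: the function names, the fault tuple format, and the choice of a two-input gate are assumptions made for illustration. It forces one node of an AND gate to a fixed value and lists the input vectors whose output differs from the fault-free gate.

```python
from itertools import product

def and_gate(a, b, fault=None):
    """Two-input AND gate with an optional single stuck-at fault.

    fault is a hypothetical (node, value) pair, where node is "a", "b",
    or "out" and value is the stuck level (0 or 1).
    """
    if fault is not None:
        node, value = fault
        if node == "a":
            a = value          # input a stuck at value
        elif node == "b":
            b = value          # input b stuck at value
    out = a & b
    if fault is not None and fault[0] == "out":
        out = fault[1]         # output stuck at value
    return out

# Every single stuck-at fault on the two inputs and the output.
FAULTS = [(node, value) for node in ("a", "b", "out") for value in (0, 1)]

def detecting_vectors(fault):
    """Input vectors for which the faulty gate disagrees with the good gate."""
    return [
        (a, b)
        for a, b in product((0, 1), repeat=2)
        if and_gate(a, b, fault) != and_gate(a, b)
    ]

if __name__ == "__main__":
    for fault in FAULTS:
        print(fault, detecting_vectors(fault))
```

    A fault is detected by a test only when the faulty and fault-free outputs differ, which is why an input stuck at one is exposed by exactly one of the four input vectors.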

  10. Fault recovery characteristics of the fault tolerant multi-processor

    NASA Technical Reports Server (NTRS)

    Padilla, Peter A.

    1990-01-01

    The fault handling performance of the fault tolerant multiprocessor (FTMP) was investigated. Fault handling errors detected during fault injection experiments were characterized. In these fault injection experiments, the FTMP disabled a working unit instead of the faulted unit once every 500 faults, on the average. System design weaknesses allow active faults to exercise a part of the fault management software that handles Byzantine or lying faults. It is pointed out that these weak areas in the FTMP's design increase the probability that, for any hardware fault, a good LRU (line replaceable unit) is mistakenly disabled by the fault management software. It is concluded that fault injection can help detect and analyze the behavior of a system in the ultra-reliable regime. Although fault injection testing cannot be exhaustive, it has been demonstrated that it provides a unique capability to unmask problems and to characterize the behavior of a fault-tolerant system.

  11. SFT: Scalable Fault Tolerance

    SciTech Connect

    Petrini, Fabrizio; Nieplocha, Jarek; Tipparaju, Vinod

    2006-04-15

    In this paper we will present a new technology that we are currently developing within the SFT: Scalable Fault Tolerance FastOS project which seeks to implement fault tolerance at the operating system level. Major design goals include dynamic reallocation of resources to allow continuing execution in the presence of hardware failures, very high scalability, high efficiency (low overhead), and transparency—requiring no changes to user applications. Our technology is based on a global coordination mechanism, that enforces transparent recovery lines in the system, and TICK, a lightweight, incremental checkpointing software architecture implemented as a Linux kernel module. TICK is completely user-transparent and does not require any changes to user code or system libraries; it is highly responsive: an interrupt, such as a timer interrupt, can trigger a checkpoint in as little as 2.5μs; and it supports incremental and full checkpoints with minimal overhead—less than 6% with full checkpointing to disk performed as frequently as once per minute.
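
    The difference between full and incremental checkpoints mentioned in this record can be sketched with a toy model. This is not TICK, which the abstract describes as a Linux kernel module; the class `CheckpointedMemory`, the page-granular dirty set, and the `restore` helper are all hypothetical names chosen for illustration.

```python
class CheckpointedMemory:
    """Toy page-based memory that tracks which pages changed since the last checkpoint."""

    def __init__(self, n_pages):
        self.pages = [b""] * n_pages
        self.dirty = set()

    def write(self, page, data):
        self.pages[page] = data
        self.dirty.add(page)       # remember the page for the next incremental checkpoint

    def full_checkpoint(self):
        """Save every page and reset dirty tracking."""
        self.dirty.clear()
        return dict(enumerate(self.pages))

    def incremental_checkpoint(self):
        """Save only the pages written since the previous checkpoint."""
        delta = {page: self.pages[page] for page in sorted(self.dirty)}
        self.dirty.clear()
        return delta


def restore(full, deltas):
    """Rebuild memory from one full checkpoint plus later incremental ones, in order."""
    pages = dict(full)
    for delta in deltas:
        pages.update(delta)
    return [pages[i] for i in range(len(pages))]


if __name__ == "__main__":
    mem = CheckpointedMemory(3)
    mem.write(0, b"a")
    full = mem.full_checkpoint()
    mem.write(2, b"c")
    delta = mem.incremental_checkpoint()
    print(delta)
    print(restore(full, [delta]))
```

    The incremental checkpoint holds only the pages touched since the previous checkpoint, so its cost follows the write rate rather than the total memory size.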

  12. The fault-tolerant multiprocessor computer

    NASA Technical Reports Server (NTRS)

    Smith, T. B., III (Editor); Lala, J. H. (Editor); Goldberg, J. (Editor); Kautz, W. H. (Editor); Melliar-Smith, P. M. (Editor); Green, M. W. (Editor); Levitt, K. N. (Editor); Schwartz, R. L. (Editor); Weinstock, C. B. (Editor); Palumbo, D. L. (Editor)

    1986-01-01

    The development and evaluation of fault-tolerant computer architectures and software-implemented fault tolerance (SIFT) for use in advanced NASA vehicles and potentially in flight-control systems are described in a collection of previously published reports prepared for NASA. Topics addressed include the principles of fault-tolerant multiprocessor (FTMP) operation; processor and slave regional designs; FTMP executive, facilities, acceptance-test/diagnostic, applications, and support software; FTMP reliability and availability models; SIFT hardware design; and SIFT validation and verification.

  13. Fault tolerant control of spacecraft

    NASA Astrophysics Data System (ADS)

    Godard

    Autonomous multiple spacecraft formation flying space missions demand the development of reliable control systems to ensure rapid, accurate, and effective response to various attitude and formation reconfiguration commands. Keeping in mind the complexities involved in the technology development to enable spacecraft formation flying, this thesis presents the development and validation of a fault tolerant control algorithm that augments the AOCS onboard a spacecraft to ensure that these challenging formation flying missions will fly successfully. Taking inspiration from the existing theory of nonlinear control, a fault-tolerant control system for the RyePicoSat missions is designed to cope with actuator faults whilst maintaining the desirable degree of overall stability and performance. An autonomous fault tolerant adaptive control scheme for spacecraft equipped with redundant actuators and robust control of spacecraft in underactuated configurations represent the two central themes of this thesis. The developed algorithms are validated using a hardware-in-the-loop simulation. A reaction wheel testbed is used to validate the proposed fault tolerant attitude control scheme. A spacecraft formation flying experimental testbed is used to verify the performance of the proposed robust control scheme for underactuated spacecraft configurations. The proposed underactuated formation flying concept leads to more than 60% savings in fuel consumption when compared to a fully actuated spacecraft formation configuration. We also developed a novel attitude control methodology that requires only a single thruster to stabilize the three-axis attitude and angular velocity components of a spacecraft. Numerical simulations and hardware-in-the-loop experimental results, along with rigorous analytical stability analysis, show that the proposed methodology will greatly enhance the reliability of the spacecraft, while allowing for potentially significant overall mission cost reduction.

  14. A methodology for testing fault-tolerant software

    NASA Technical Reports Server (NTRS)

    Andrews, D. M.; Mahmood, A.; Mccluskey, E. J.

    1985-01-01

    A methodology for testing fault tolerant software is presented. There are problems associated with testing fault tolerant software because many errors are masked or corrected by voters, limiters, or automatic channel synchronization. This methodology illustrates how the same strategies used for testing fault tolerant hardware can be applied to testing fault tolerant software. For example, one strategy used in testing fault tolerant hardware is to disable the redundancy during testing. A similar testing strategy is proposed for software, namely, to move the major emphasis on testing earlier in the development cycle (before the redundancy is in place), thus reducing the possibility that undetected errors will be masked when limiters and voters are added.

  15. Fault tolerant linear actuator

    DOEpatents

    Tesar, Delbert

    2004-09-14

    In varying embodiments, the fault tolerant linear actuator of the present invention is a new and improved linear actuator with fault tolerance and positional control that may incorporate velocity summing, force summing, or a combination of the two. In one embodiment, the invention offers a velocity summing arrangement with a differential gear between two prime movers driving a cage, which then drives a linear spindle screw transmission. Other embodiments feature two prime movers driving separate linear spindle screw transmissions, one internal and one external, in a totally concentric and compact integrated module.

  16. Validated Fault Tolerant Architectures for Space Station

    NASA Technical Reports Server (NTRS)

    Lala, Jaynarayan H.

    1990-01-01

    Viewgraphs on validated fault tolerant architectures for space station are presented. Topics covered include: fault tolerance approach; advanced information processing system (AIPS); and fault tolerant parallel processor (FTPP).

  17. Study of fault-tolerant software technology

    NASA Technical Reports Server (NTRS)

    Slivinski, T.; Broglio, C.; Wild, C.; Goldberg, J.; Levitt, K.; Hitt, E.; Webb, J.

    1984-01-01

    Presented is an overview of the current state of the art of fault-tolerant software and an analysis of quantitative techniques and models developed to assess its impact. It examines research efforts as well as experience gained from commercial application of these techniques. The paper also addresses the computer architecture and design implications on hardware, operating systems and programming languages (including Ada) of using fault-tolerant software in real-time aerospace applications. It concludes that fault-tolerant software has progressed beyond the pure research state. The paper also finds that, although not perfectly matched, newer architectural and language capabilities provide many of the notations and functions needed to effectively and efficiently implement software fault-tolerance.

  18. The MAFT architecture for distributed fault tolerance

    SciTech Connect

    Kieckhafer, R.M.; Walter, C.J.; Finn, A.M.; Thambidurai, P.M.

    1988-04-01

    This paper describes the Multicomputer Architecture for Fault-Tolerance (MAFT), a distributed system designed to provide extremely reliable computation in real-time control systems. MAFT is based on the physical and functional partitioning of executive functions from application functions. The implementation of the executive functions in a special-purpose hardware processor allows the fault-tolerance functions to be transparent to the application programs and minimizes overhead. Byzantine Agreement and Approximate Agreement algorithms are employed for critical system parameters. MAFT supports the use of multiversion hardware and software to tolerate built-in or generic faults. Graceful degradation and restoration of the application workload is permitted in response to the exclusion and readmission of nodes, respectively.
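
    The idea behind approximate agreement can be sketched in a few lines. The abstract does not describe MAFT's actual algorithm, so the following shows one common textbook building block instead: each node sorts the values it received, discards the f lowest and f highest, and averages the rest. The function name `fault_tolerant_mean` and the parameter `f` (the assumed maximum number of faulty sources) are hypothetical.

```python
def fault_tolerant_mean(values, f):
    """Average the values that remain after dropping the f smallest and f largest.

    If at most f of the inputs come from faulty sources, every surviving
    value lies within the range spanned by the correct inputs, so a wildly
    wrong reading cannot drag the result outside that range.
    """
    if f < 0:
        raise ValueError("f must be non-negative")
    if len(values) <= 2 * f:
        raise ValueError("need more than 2*f values to trim f from each end")
    trimmed = sorted(values)[f:len(values) - f]
    return sum(trimmed) / len(trimmed)


if __name__ == "__main__":
    # Three plausible readings and one arbitrary value from a faulty node.
    readings = [100, 102, 98, 5000]
    print(fault_tolerant_mean(readings, f=1))
```

    Trimming from both ends is what separates this from a plain average: the faulty value 5000 is discarded before it can influence the result.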

  19. Software fault tolerance in computer operating systems

    NASA Technical Reports Server (NTRS)

    Iyer, Ravishankar K.; Lee, Inhwan

    1994-01-01

    This chapter provides data and analysis of the dependability and fault tolerance for three operating systems: the Tandem/GUARDIAN fault-tolerant system, the VAX/VMS distributed system, and the IBM/MVS system. Based on measurements from these systems, basic software error characteristics are investigated. Fault tolerance in operating systems resulting from the use of process pairs and recovery routines is evaluated. Two levels of models are developed to analyze error and recovery processes inside an operating system and interactions among multiple instances of an operating system running in a distributed environment. The measurements show that the use of process pairs in Tandem systems, which was originally intended for tolerating hardware faults, allows the system to tolerate about 70% of defects in system software that result in processor failures. The loose coupling between processors which results in the backup execution (the processor state and the sequence of events occurring) being different from the original execution is a major reason for the measured software fault tolerance. The IBM/MVS system fault tolerance almost doubles when recovery routines are provided, in comparison to the case in which no recovery routines are available. However, even when recovery routines are provided, there is almost a 50% chance of system failure when critical system jobs are involved.

  20. Fault-tolerant processing system

    NASA Technical Reports Server (NTRS)

    Palumbo, Daniel L. (Inventor)

    1996-01-01

    A fault-tolerant, fiber optic interconnect, or backplane, which serves as a via for data transfer between modules. Fault tolerance algorithms are embedded in the backplane by dividing the backplane into a read bus and a write bus and placing a redundancy management unit (RMU) between the read bus and the write bus so that all data transmitted by the write bus is subjected to the fault tolerance algorithms before the data is passed for distribution to the read bus. The RMU provides both backplane control and fault tolerance.

  1. Fabrication of fault-tolerant systolic array processors

    SciTech Connect

    Golovko, V.A.

    1995-05-01

    Methods for designing fault-tolerant systolic array processors are discussed. Several ways of bypassing faulty elements in configurations, which depend on an input-data flow organization, are suggested. An analysis of the additional hardware costs of providing fault tolerance by various techniques and for various levels of redundancy is presented. Hadamard fault-tolerant processor design was used to illustrate the efficiency of the techniques suggested.

  2. Abnormal fault-recovery characteristics of the fault-tolerant multiprocessor uncovered using a new fault-injection methodology

    NASA Astrophysics Data System (ADS)

    Padilla, Peter A.

    1991-03-01

    An investigation was made in AIRLAB of the fault handling performance of the Fault Tolerant MultiProcessor (FTMP). Fault handling errors detected during fault injection experiments were characterized. In these fault injection experiments, the FTMP disabled a working unit instead of the faulted unit once in every 500 faults, on the average. System design weaknesses allow active faults to exercise a part of the fault management software that handles Byzantine or lying faults. Byzantine faults behave such that the faulted unit points to a working unit as the source of errors. The design's problems involve: (1) the design and interface between the simplex error detection hardware and the error processing software, (2) the functional capabilities of the FTMP system bus, and (3) the communication requirements of a multiprocessor architecture. These weak areas in the FTMP's design increase the probability that, for any hardware fault, a good line replaceable unit (LRU) is mistakenly disabled by the fault management software.

  3. Abnormal fault-recovery characteristics of the fault-tolerant multiprocessor uncovered using a new fault-injection methodology

    NASA Technical Reports Server (NTRS)

    Padilla, Peter A.

    1991-01-01

    An investigation was made in AIRLAB of the fault handling performance of the Fault Tolerant MultiProcessor (FTMP). Fault handling errors detected during fault injection experiments were characterized. In these fault injection experiments, the FTMP disabled a working unit instead of the faulted unit once in every 500 faults, on the average. System design weaknesses allow active faults to exercise a part of the fault management software that handles Byzantine or lying faults. Byzantine faults behave such that the faulted unit points to a working unit as the source of errors. The design's problems involve: (1) the design and interface between the simplex error detection hardware and the error processing software, (2) the functional capabilities of the FTMP system bus, and (3) the communication requirements of a multiprocessor architecture. These weak areas in the FTMP's design increase the probability that, for any hardware fault, a good line replaceable unit (LRU) is mistakenly disabled by the fault management software.

  4. Fault Tolerant State Machines

    NASA Technical Reports Server (NTRS)

    Burke, Gary R.; Taft, Stephanie

    2004-01-01

    State machines are commonly used to control sequential logic in FPGAs and ASICs. An errant state machine can cause considerable damage to the device it is controlling. For example, in space applications, the FPGA might be controlling Pyros, which when fired at the wrong time will cause a mission failure. Even a well designed state machine can be subject to random errors as a result of SEUs from the radiation environment in space. There are various ways to encode the states of a state machine, and the type of encoding makes a large difference in the susceptibility of the state machine to radiation. In this paper we compare 4 methods of state machine encoding and find which method gives the best fault tolerance, as well as determining the resources needed for each method.
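
    Why the choice of encoding matters can be shown with a small counting exercise. The abstract does not name the four encodings it compares, so the two used here (plain binary and one-hot) are assumptions chosen for illustration, as are all function names. The sketch flips each bit of each valid state code and counts how often the result is another valid code, meaning the upset goes unnoticed, versus an invalid code that checking logic could in principle flag.

```python
def binary_codes(n_states):
    """Dense binary encoding: state s is stored as the integer s."""
    width = max(1, (n_states - 1).bit_length())
    return {s: s for s in range(n_states)}, width

def one_hot_codes(n_states):
    """One-hot encoding: state s is stored as a single set bit at position s."""
    return {s: 1 << s for s in range(n_states)}, n_states

def single_flip_outcomes(codes, width):
    """Count single-bit upsets that land on a valid code versus an invalid one.

    Returns (silent, detectable): silent flips turn one valid state into
    another valid state; detectable flips produce a code no state uses.
    """
    valid = set(codes.values())
    silent = detectable = 0
    for code in valid:
        for bit in range(width):
            if (code ^ (1 << bit)) in valid:
                silent += 1
            else:
                detectable += 1
    return silent, detectable

if __name__ == "__main__":
    for name, make in (("binary", binary_codes), ("one-hot", one_hot_codes)):
        codes, width = make(4)
        print(name, width, single_flip_outcomes(codes, width))
```

    For four states, every single-bit flip of a two-bit binary code yields another valid state, while every flip of a one-hot code yields an invalid pattern, at the cost of two extra flip-flops. Detecting an invalid code is not the same as recovering from it; that still requires additional logic.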

  5. Fault Tolerant Homopolar Magnetic Bearings

    NASA Technical Reports Server (NTRS)

    Li, Ming-Hsiu; Palazzolo, Alan; Kenny, Andrew; Provenza, Andrew; Beach, Raymond; Kascak, Albert

    2003-01-01

    Magnetic suspensions (MS) satisfy the long life and low loss conditions demanded by satellite and ISS based flywheels used for Energy Storage and Attitude Control (ACESE) service. This paper summarizes the development of a novel MS that improves reliability via fault tolerant operation. Specifically, flux coupling between poles of a homopolar magnetic bearing is shown to deliver desired forces even after termination of coil currents to a subset of failed poles. Linear, coordinate decoupled force-voltage relations are also maintained before and after failure by bias linearization. Current distribution matrices (CDM) which adjust the currents and fluxes following a pole set failure are determined for many faulted pole combinations. The CDMs and the system responses are obtained utilizing 1D magnetic circuit models with fringe and leakage factors derived from detailed, 3D, finite element field models. Reliability results are presented vs. detection/correction delay time and individual power amplifier reliability for 4, 6, and 7 pole configurations. Reliability is shown for two success criteria, i.e. (a) no catcher bearing contact following pole failures and (b) re-levitation off of the catcher bearings following pole failures. An advantage of the method presented over other redundant operation approaches is a significantly reduced requirement for backup hardware such as additional actuators or power amplifiers.

  6. Implementing fault-tolerant sensors

    NASA Technical Reports Server (NTRS)

    Marzullo, Keith

    1989-01-01

    One aspect of fault tolerance in process control programs is the ability to tolerate sensor failure. A methodology is presented for transforming a process control program that cannot tolerate sensor failures to one that can. Additionally, a hierarchy of failure models is identified.

  7. A fault tolerant 80960 engine controller

    NASA Technical Reports Server (NTRS)

    Reichmuth, D. M.; Gage, M. L.; Paterson, E. S.; Kramer, D. D.

    1993-01-01

    The paper describes the design of the 80960 Fault Tolerant Engine Controller for the supervision of engine operations, which was designed for the NASA Marshall Space Center. Consideration is given to the major electronic components of the controller, including the engine controller, effectors, and the sensors, as well as to the controller hardware, the controller module and the communications module, and the controller software. The architecture of the controller hardware allows modifications to be made to fit the requirements of any new propulsion systems. Multiple flow diagrams are presented illustrating the controller's operations.

  8. FTAPE: A fault injection tool to measure fault tolerance

    NASA Astrophysics Data System (ADS)

    Tsai, Timothy K.; Iyer, Ravishankar K.

    1994-07-01

    The paper introduces FTAPE (Fault Tolerance And Performance Evaluator), a tool that can be used to compare fault-tolerant computers. The tool combines system-wide fault injection with a controllable workload. A workload generator is used to create high stress conditions for the machine. Faults are injected based on this workload activity in order to ensure a high level of fault propagation. The errors/fault ratio and performance degradation are presented as measures of fault tolerance.

  9. FTAPE: A fault injection tool to measure fault tolerance

    NASA Technical Reports Server (NTRS)

    Tsai, Timothy K.; Iyer, Ravishankar K.

    1994-01-01

    The paper introduces FTAPE (Fault Tolerance And Performance Evaluator), a tool that can be used to compare fault-tolerant computers. The tool combines system-wide fault injection with a controllable workload. A workload generator is used to create high stress conditions for the machine. Faults are injected based on this workload activity in order to ensure a high level of fault propagation. The errors/fault ratio and performance degradation are presented as measures of fault tolerance.

  10. FTAPE: A fault injection tool to measure fault tolerance

    NASA Technical Reports Server (NTRS)

    Tsai, Timothy K.; Iyer, Ravishankar K.

    1995-01-01

    The paper introduces FTAPE (Fault Tolerance And Performance Evaluator), a tool that can be used to compare fault-tolerant computers. The tool combines system-wide fault injection with a controllable workload. A workload generator is used to create high stress conditions for the machine. Faults are injected based on this workload activity in order to ensure a high level of fault propagation. The errors/fault ratio and performance degradation are presented as measures of fault tolerance.

  11. Fault-tolerant rotary actuator

    DOEpatents

    Tesar, Delbert

    2006-10-17

    A fault-tolerant actuator module, in a single containment shell, containing two actuator subsystems that are either asymmetrically or symmetrically laid out is provided. Fault tolerance in the actuators of the present invention is achieved by the employment of dual sets of equal resources. Dual resources are integrated into single modules, with each having the external appearance and functionality of a single set of resources.

  12. Chip level simulation of fault tolerant computers

    NASA Technical Reports Server (NTRS)

    Armstrong, J. R.

    1982-01-01

    Chip-level modeling techniques in the evaluation of fault tolerant systems were researched. A fault tolerant computer was modeled. An efficient approach to functional fault simulation was developed. Simulation software was also developed.

  13. Fault-tolerant PACS server

    NASA Astrophysics Data System (ADS)

    Cao, Fei; Liu, Brent J.; Huang, H. K.; Zhou, Michael Z.; Zhang, Jianguo; Zhang, X. C.; Mogel, Greg T.

    2002-05-01

    Failure of a PACS archive server could cripple an entire PACS operation. Last year we demonstrated that it was possible to design a fault-tolerant (FT) server with 99.999% uptime. The FT design was based on a triple modular redundancy with a simple majority vote to automatically detect and mask a faulty module. The purpose of this presentation is to report on its continuous developments in integrating with external mass storage devices, and to delineate laboratory failover experiments. An FT PACS Simulator with generic PACS software has been used in the experiment. To simulate a PACS clinical operation, image examinations are transmitted continuously from the modality simulator to the DICOM gateway and then to the FT PACS server and workstations. The hardware failures in network, FT server module, disk, RAID, and DLT are manually induced to observe the failover recovery of the FT PACS to resume its normal data flow. We then test and evaluate the FT PACS server in its reliability, functionality, and performance.
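
    Triple modular redundancy with a simple majority vote, as described in this record, can be sketched as follows. This is an illustrative voter only, not the PACS server's implementation; the function name `majority_vote` and its return format are assumptions.

```python
from collections import Counter

def majority_vote(outputs):
    """Return (value, faulty_indices) for a list of redundant module outputs.

    value is the output produced by a strict majority of modules, and
    faulty_indices lists the modules that disagreed with it. A ValueError
    is raised when no strict majority exists, because the fault can then
    no longer be masked.
    """
    if not outputs:
        raise ValueError("no module outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count <= len(outputs):
        raise ValueError("no majority among module outputs")
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty

if __name__ == "__main__":
    # Module 2 returns a corrupted result; the voter masks it and reports it.
    print(majority_vote([7, 7, 9]))
```

    With three modules the voter masks any single faulty module and also identifies it, which is what lets a system of this kind isolate the module for repair while the other two continue to serve requests.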

  14. Fault detection and fault tolerance in robotics

    NASA Technical Reports Server (NTRS)

    Visinsky, Monica; Walker, Ian D.; Cavallaro, Joseph R.

    1992-01-01

    Robots are used in inaccessible or hazardous environments in order to alleviate some of the time, cost and risk involved in preparing men to endure these conditions. In order to perform their expected tasks, the robots are often quite complex, thus increasing their potential for failures. If men must be sent into these environments to repair each component failure in the robot, the advantages of using the robot are quickly lost. Fault tolerant robots are needed which can effectively cope with failures and continue their tasks until repairs can be realistically scheduled. Before fault tolerant capabilities can be created, methods of detecting and pinpointing failures must be perfected. This paper develops a basic fault tree analysis of a robot in order to obtain a better understanding of where failures can occur and how they contribute to other failures in the robot. The resulting failure flow chart can also be used to analyze the resiliency of the robot in the presence of specific faults. By simulating robot failures and fault detection schemes, the problems involved in detecting failures for robots are explored in more depth.
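
    A basic fault tree of the kind this abstract describes can be evaluated numerically once basic-event probabilities are assigned. The sketch below is an assumption-laden illustration, not the paper's model: it assumes statistically independent basic events, uses only AND and OR gates, and the example tree and its probabilities are invented for demonstration.

```python
from math import prod


def failure_probability(node):
    """Evaluate a fault tree.

    node is either a number (probability of a basic event) or a tuple
    (gate, children) with gate equal to "AND" or "OR".
    Assumes all basic events are independent.
    """
    if isinstance(node, (int, float)):
        return float(node)
    gate, children = node
    probs = [failure_probability(child) for child in children]
    if gate == "AND":
        # All inputs must fail for the gate output to fail.
        return prod(probs)
    if gate == "OR":
        # The gate output fails unless every input survives.
        return 1.0 - prod(1.0 - p for p in probs)
    raise ValueError(f"unknown gate: {gate}")


# Hypothetical joint: fails if the motor fails OR both redundant encoders fail.
joint = ("OR", [0.01, ("AND", [0.05, 0.05])])
joint_failure = failure_probability(joint)
```

    Replacing a single basic event with an AND gate over redundant components, as done for the encoders here, shows directly how added redundancy lowers the top-event probability.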

  15. Locating hardware faults in a parallel computer

    DOEpatents

    Archer, Charles J.; Megerian, Mark G.; Ratterman, Joseph D.; Smith, Brian E.

    2010-04-13

    Locating hardware faults in a parallel computer, including defining within a tree network of the parallel computer two or more sets of non-overlapping test levels of compute nodes of the network that together include all the data communications links of the network, each non-overlapping test level comprising two or more adjacent tiers of the tree; defining test cells within each non-overlapping test level, each test cell comprising a subtree of the tree including a subtree root compute node and all descendant compute nodes of the subtree root compute node within a non-overlapping test level; performing, separately on each set of non-overlapping test levels, an uplink test on all test cells in a set of non-overlapping test levels; and performing, separately from the uplink tests and separately on each set of non-overlapping test levels, a downlink test on all test cells in a set of non-overlapping test levels.

  16. Interstitial fault tolerance-a technique for making systolic arrays fault tolerant

    SciTech Connect

    Kuhn, R.H.

    1983-01-01

    Systolic arrays are a popular model for the implementation of highly parallel VLSI systems. In this paper interstitial fault tolerance (IFT), a technique for incorporating fault tolerance into systolic arrays in a natural manner, is discussed. IFT can be used for reliable computation or for yield enhancement. Previous fault tolerance techniques for reliable computation on SIMD systems have employed redundant hardware. IFT, on the other hand, employs time redundancy. Previous wafer scale integration techniques for yield enhancement have been proposed only for linear processing element arrays. IFT is effective for both linear and two-dimensional arrays. The time redundancy to achieve IFT is shown to be bounded by a factor of 3, allowing no processor redundancy. Results of Monte Carlo simulation of IFT are presented. 19 references.

  17. Intelligent fault-tolerant controllers

    NASA Technical Reports Server (NTRS)

    Huang, Chien Y.

    1987-01-01

    A system with fault tolerant controls is one that can detect, isolate, and estimate failures and perform necessary control reconfiguration based on this new information. Artificial intelligence (AI) is concerned with semantic processing, and it has evolved to include the topics of expert systems and machine learning. This research represents an attempt to apply AI to fault tolerant controls, hence, the name intelligent fault tolerant control (IFTC). A generic solution to the problem is sought, providing a system based on logic in addition to analytical tools, and offering machine learning capabilities. The advantages are that redundant system specific algorithms are no longer needed, that reasonableness is used to quickly choose the correct control strategy, and that the system can adapt to new situations by learning about its effects on system dynamics.

  18. Fault-tolerance - The survival attribute of digital systems

    NASA Technical Reports Server (NTRS)

    Avizienis, A.

    1978-01-01

    Fault-tolerance is the architectural attribute of a digital system that keeps the logic machine doing its specified tasks when its host, the physical system, suffers various kinds of failures of its components. A more general concept of fault-tolerance also includes human mistakes committed during software and hardware implementation and during man/machine interaction among the causes of faults that are to be tolerated by the logic machine. This paper discusses the concept of fault-tolerance, the reasons for its inclusion in digital system architecture, and the methods of its implementation. A chronological view of the evolution of fault-tolerant systems and an outline of some goals for its further development conclude the presentation.

  19. Software Fault Tolerance: A Tutorial

    NASA Technical Reports Server (NTRS)

    Torres-Pomales, Wilfredo

    2000-01-01

    Because of our present inability to produce error-free software, software fault tolerance is and will continue to be an important consideration in software systems. The root cause of software design errors is the complexity of the systems. Compounding the problems in building correct software is the difficulty in assessing the correctness of software for highly complex systems. After a brief overview of the software development processes, we note how hard-to-detect design faults are likely to be introduced during development and how software faults tend to be state-dependent and activated by particular input sequences. Although component reliability is an important quality measure for system level analysis, software reliability is hard to characterize and the use of post-verification reliability estimates remains a controversial issue. For some applications software safety is more important than reliability, and fault tolerance techniques used in those applications are aimed at preventing catastrophes. Single version software fault tolerance techniques discussed include system structuring and closure, atomic actions, inline fault detection, exception handling, and others. Multiversion techniques are based on the assumption that software built differently should fail differently and thus, if one of the redundant versions fails, it is expected that at least one of the other versions will provide an acceptable output. Recovery blocks, N-version programming, and other multiversion techniques are reviewed.
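
    The recovery block technique named in this abstract can be sketched as follows. This is a simplified illustration under stated assumptions, not code from the tutorial: the names recovery_block, primary, and alternate are hypothetical, the state is a plain dictionary, and the restore point is modeled by giving each variant its own copy of the state.

```python
def recovery_block(variants, acceptance_test, state):
    """Run variants in order; return the first result that passes the acceptance test."""
    for variant in variants:
        checkpoint = dict(state)  # each variant starts from the same saved state
        try:
            result = variant(checkpoint)
        except Exception:
            continue  # a crash counts as a failed variant; try the next one
        if acceptance_test(result):
            return result
    raise RuntimeError("all variants failed the acceptance test")


def primary(s):
    # Deliberately faulty primary version (illustrative only).
    return s["x"] ** 0.5 + 1.0


def alternate(s):
    # Independently written alternate version.
    return s["x"] ** 0.5


state = {"x": 9.0}
# Acceptance test: squaring the result must reproduce the input.
root = recovery_block(
    [primary, alternate],
    lambda r: abs(r * r - state["x"]) < 1e-9,
    state,
)
```

    Unlike N-version programming, which runs all versions and votes on their outputs, a recovery block runs the alternates only when the acceptance test rejects an earlier result.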

  20. Fault tolerance analysis of the class of rearrangeable interconnection networks

    SciTech Connect

    Pakzad, S. (Dept. of Electrical Engineering)

    1989-08-01

    This paper analyzes the fault tolerance characteristics of a range of rearrangeable β-networks based on the concepts and the framework developed by S. Pakzad and S. Lakshmivarahan. These rearrangeable β-networks include the Benes network, the Waksman network, the Joel network, and the serial network. In addition, this paper presents a comparative analysis of the aforementioned networks according to their hardware cost, performance, and degree of fault tolerance.

  1. CMOS processor element for a fault-tolerant SVD array

    NASA Astrophysics Data System (ADS)

    Kota, Kishore; Cavallaro, Joseph R.

    1993-11-01

    This paper describes the VLSI implementation of a CORDIC based processor element for use in a fault-reconfigurable systolic array to compute the singular value decomposition (SVD) of a matrix. The chip implements a time redundant fault tolerance scheme, which allows processors adjacent to a faulty processor to act as computation backup during the systolic idle time. Also, processors around a fault collaborate to reroute data around the faulty processor. This form of time redundancy is attractive when tolerance to a few faults needs to be achieved with little hardware overhead.

  2. Fault-Tolerant Heat Exchanger

    NASA Technical Reports Server (NTRS)

    Izenson, Michael G.; Crowley, Christopher J.

    2005-01-01

    A compact, lightweight heat exchanger has been designed to be fault-tolerant in the sense that a single-point leak would not cause mixing of heat-transfer fluids. This particular heat exchanger is intended to be part of the temperature-regulation system for habitable modules of the International Space Station and to function with water and ammonia as the heat-transfer fluids. The basic fault-tolerant design is adaptable to other heat-transfer fluids and heat exchangers for applications in which mixing of heat-transfer fluids would pose toxic, explosive, or other hazards: Examples could include fuel/air heat exchangers for thermal management on aircraft, process heat exchangers in the cryogenic industry, and heat exchangers used in chemical processing. The reason this heat exchanger can tolerate a single-point leak is that the heat-transfer fluids are everywhere separated by a vented volume and at least two seals. The combination of fault tolerance, compactness, and light weight is implemented in a unique heat-exchanger core configuration: Each fluid passage is entirely surrounded by a vented region bridged by solid structures through which heat is conducted between the fluids. Precise, proprietary fabrication techniques make it possible to manufacture the vented regions and heat-conducting structures with very small dimensions to obtain a very large coefficient of heat transfer between the two fluids. A large heat-transfer coefficient favors compact design by making it possible to use a relatively small core for a given heat-transfer rate. Calculations and experiments have shown that in most respects, the fault-tolerant heat exchanger can be expected to equal or exceed the performance of the non-fault-tolerant heat exchanger that it is intended to supplant (see table). The only significant disadvantages are a slight weight penalty and a small decrease in the mass-specific heat transfer.

  3. Measuring fault tolerance with the FTAPE fault injection tool

    NASA Astrophysics Data System (ADS)

    Tsai, Timothy K.; Iyer, Ravishankar K.

    1995-05-01

    This paper describes FTAPE (Fault Tolerance And Performance Evaluator), a tool that can be used to compare fault-tolerant computers. The major parts of the tool include a system-wide fault-injector, a workload generator, and a workload activity measurement tool. The workload creates high stress conditions on the machine. Using stress-based injection, the fault injector is able to utilize knowledge of the workload activity to ensure a high level of fault propagation. The errors/fault ratio, performance degradation, and number of system crashes are presented as measures of fault tolerance.

  4. Measuring fault tolerance with the FTAPE fault injection tool

    NASA Technical Reports Server (NTRS)

    Tsai, Timothy K.; Iyer, Ravishankar K.

    1995-01-01

    This paper describes FTAPE (Fault Tolerance And Performance Evaluator), a tool that can be used to compare fault-tolerant computers. The major parts of the tool include a system-wide fault-injector, a workload generator, and a workload activity measurement tool. The workload creates high stress conditions on the machine. Using stress-based injection, the fault injector is able to utilize knowledge of the workload activity to ensure a high level of fault propagation. The errors/fault ratio, performance degradation, and number of system crashes are presented as measures of fault tolerance.

  5. Fault tolerant software modules for SIFT

    NASA Technical Reports Server (NTRS)

    Hecht, M.; Hecht, H.

    1982-01-01

    The implementation of software fault tolerance is investigated for critical modules of the Software Implemented Fault Tolerance (SIFT) operating system to support the computational and reliability requirements of advanced fly-by-wire transport aircraft. Fault tolerant designs generated for the error reporter and global executive are examined. Descriptions of the alternate routines, implementation requirements, and software validation are included.

  6. Fault-tolerant architectures for superconducting qubits

    NASA Astrophysics Data System (ADS)

    DiVincenzo, David P.

    2009-12-01

    In this short review, I draw attention to new developments in the theory of fault tolerance in quantum computation that may give concrete direction to future work in the development of superconducting qubit systems. The basics of quantum error-correction codes, which I will briefly review, have not significantly changed since their introduction 15 years ago. But an interesting picture has emerged of an efficient use of these codes that may put fault-tolerant operation within reach. It is now understood that two-dimensional surface codes, close relatives of the original toric code of Kitaev, can be adapted as shown by Raussendorf and Harrington to effectively perform logical gate operations in a very simple planar architecture, with error thresholds for fault-tolerant operation simulated to be 0.75%. This architecture uses topological ideas in its functioning, but it is not 'topological quantum computation'—there are no non-abelian anyons in sight. I offer some speculations on the crucial pieces of superconducting hardware that could be demonstrated in the next couple of years that would be clear stepping stones towards this surface-code architecture.

  7. A fault-tolerant intelligent robotic control system

    NASA Technical Reports Server (NTRS)

    Marzwell, Neville I.; Tso, Kam Sing

    1993-01-01

    This paper describes the concept, design, and features of a fault-tolerant intelligent robotic control system being developed for space and commercial applications that require high dependability. The comprehensive strategy integrates system level hardware/software fault tolerance with task level handling of uncertainties and unexpected events for robotic control. The underlying architecture for system level fault tolerance is the distributed recovery block which protects against application software, system software, hardware, and network failures. Task level fault tolerance provisions are implemented in a knowledge-based system which utilizes advanced automation techniques such as rule-based and model-based reasoning to monitor, diagnose, and recover from unexpected events. The two level design provides tolerance of two or more faults occurring serially at any level of command, control, sensing, or actuation. The potential benefits of such a fault tolerant robotic control system include: (1) a minimized potential for damage to humans, the work site, and the robot itself; (2) continuous operation with a minimum of uncommanded motion in the presence of failures; and (3) more reliable autonomous operation providing increased efficiency in the execution of robotic tasks and decreased demand on human operators for controlling and monitoring the robotic servicing routines.

  8. Fault tree models for fault tolerant hypercube multiprocessors

    NASA Technical Reports Server (NTRS)

    Boyd, Mark A.; Tuazon, Jezus O.

    1991-01-01

    Three candidate fault tolerant hypercube architectures are modeled, their reliability analyses are compared, and the resulting implications of these methods of incorporating fault tolerance into hypercube multiprocessors are discussed. In the course of performing the reliability analyses, the use of HARP and fault trees in modeling sequence dependent system behaviors is demonstrated.

  9. Parallel fault-tolerant robot control

    NASA Technical Reports Server (NTRS)

    Hamilton, D. L.; Bennett, J. K.; Walker, I. D.

    1992-01-01

    A shared memory multiprocessor architecture is used to develop a parallel fault-tolerant robot controller. Several versions of the robot controller are developed and compared. A robot simulation is also developed for control observation. Comparison of a serial version of the controller and a parallel version without fault tolerance showed the speedup possible with the coarse-grained parallelism currently employed. The performance degradation due to the addition of processor fault tolerance was demonstrated by comparison of these controllers with their fault-tolerant versions. Comparison of the more fault-tolerant controller with the lower-level fault-tolerant controller showed how varying the amount of redundant data affects performance. The results demonstrate the trade-off between speed performance and processor fault tolerance.

  10. Fault model development for fault tolerant VLSI design

    NASA Astrophysics Data System (ADS)

    Hartmann, C. R.; Lala, P. K.; Ali, A. M.; Visweswaran, G. S.; Ganguly, S.

    1988-05-01

    Fault models provide systematic and precise representations of physical defects in microcircuits in a form suitable for simulation and test generation. The current difficulty in testing VLSI circuits can be attributed to the tremendous increase in design complexity and the inappropriateness of traditional stuck-at fault models. This report develops fault models for three different types of common defects that are not accurately represented by the stuck-at fault model. The faults examined in this report are: bridging faults, transistor stuck-open faults, and transient faults caused by alpha particle radiation. A generalized fault model could not be developed for the three fault types. However, microcircuit behavior and fault detection strategies are described for the bridging, transistor stuck-open, and transient (alpha particle strike) faults. The results of this study can be applied to the simulation and analysis of faults in fault tolerant VLSI circuits.

  11. Hardware fault insertion and instrumentation system: Mechanization and validation

    NASA Technical Reports Server (NTRS)

    Benson, J. W.

    1987-01-01

    Automated test capability for extensive low-level hardware fault insertion testing is developed. The test capability is used to calibrate fault detection coverage and associated latency times as relevant to projecting overall system reliability. Described are modifications made to the NASA Ames Reconfigurable Flight Control System (RDFCS) Facility to fully automate the total test loop involving the Draper Laboratories' Fault Injector Unit. The automated capability included the application of sequences of simulated low-level hardware faults, the precise measurement of fault latency times, the identification of fault symptoms, and bulk storage of test case results. A PDP-11/60 served as a test coordinator, and a PDP-11/04 as an instrumentation device. The fault injector was controlled by applications test software in the PDP-11/60, rather than by manual commands from a terminal keyboard. The time base was especially developed for this application to use a variety of signal sources in the system simulator.

  12. Advanced information processing system: The Army fault tolerant architecture conceptual study. Volume 2: Army fault tolerant architecture design and analysis

    NASA Technical Reports Server (NTRS)

    Harper, R. E.; Alger, L. S.; Babikyan, C. A.; Butler, B. P.; Friend, S. A.; Ganska, R. J.; Lala, J. H.; Masotto, T. K.; Meyer, A. J.; Morton, D. P.

    1992-01-01

    Described here is the Army Fault Tolerant Architecture (AFTA) hardware architecture and components and the operating system. The architectural and operational theory of the AFTA Fault Tolerant Data Bus is discussed. The test and maintenance strategy developed for use in fielded AFTA installations is presented. An approach to be used in reducing the probability of AFTA failure due to common mode faults is described. Analytical models for AFTA performance, reliability, availability, life cycle cost, weight, power, and volume are developed. An approach is presented for using VHSIC Hardware Description Language (VHDL) to describe and design AFTA's developmental hardware. A plan is described for verifying and validating key AFTA concepts during the Dem/Val phase. Analytical models and partial mission requirements are used to generate AFTA configurations for the TF/TA/NOE and Ground Vehicle missions.

  13. A Log-Scaling Fault Tolerant Agreement Algorithm for a Fault Tolerant MPI

    SciTech Connect

    Hursey, Joshua J; Naughton, III, Thomas J; Vallee, Geoffroy R; Graham, Richard L

    2011-01-01

    The lack of fault tolerance is becoming a limiting factor for application scalability in HPC systems. The MPI does not provide standardized fault tolerance interfaces and semantics. The MPI Forum's Fault Tolerance Working Group is proposing a collective fault tolerant agreement algorithm for the next MPI standard. Such algorithms play a central role in many fault tolerant applications. This paper combines a log-scaling two-phase commit agreement algorithm with a reduction operation to provide the necessary functionality for the new collective without any additional messages. Error handling mechanisms are described that preserve the fault tolerance properties while maintaining overall scalability.

  14. Fault-tolerant parallel processor

    SciTech Connect

    Harper, R.E.; Lala, J.H.

    1991-06-01

    This paper addresses issues central to the design and operation of an ultrareliable, Byzantine resilient parallel computer. Interprocessor connectivity requirements are met by treating connectivity as a resource that is shared among many processing elements, allowing flexibility in their configuration and reducing complexity. Redundant groups are synchronized solely by message transmissions and receptions, which also provide input data consistency and output voting. Reliability analysis results are presented that demonstrate the reduced failure probability of such a system. Performance analysis results are presented that quantify the temporal overhead involved in executing such fault-tolerance-specific operations. Empirical performance measurements of prototypes of the architecture are presented. 30 refs.

  15. Single event upset tests of a RISC-based fault-tolerant computer

    SciTech Connect

    Kimbrough, J.R.; Butner, D.N.; Colella, N.J.; Kaschmitter, J.L.; Shaeffer, D.L.; McKnett, C.L.; Coakley, P.G.; Casteneda, C.

    1996-03-23

    The project successfully demonstrated that dual lock-step comparison of commercial RISC processors is a viable fault-tolerant approach to handling SEUs in the space environment. The on-orbit error rate of the fault-tolerant approach was 38 times lower than the single-processor error rate. The random nature of the upsets and their appearance in critical code sections show that it is essential to incorporate both hardware and software in the design and operation of fault-tolerant computers.

  16. Parametric Modeling and Fault Tolerant Control

    NASA Technical Reports Server (NTRS)

    Wu, N. Eva; Ju, Jianhong

    2000-01-01

    Fault tolerant control is considered for a nonlinear aircraft model expressed as a linear parameter-varying system. By proper parameterization of foreseeable faults, the linear parameter-varying system can include fault effects as additional varying parameters. A recently developed technique in fault effect parameter estimation allows us to assume that estimates of the fault effect parameters are available on-line. Reconfigurability is calculated for this model with respect to the loss of control effectiveness to assess the potentiality of the model to tolerate such losses prior to control design. The control design is carried out by applying a polytopic method to the aircraft model. An error bound on fault effect parameter estimation is provided, within which the Lyapunov stability of the closed-loop system is robust. Our simulation results show that as long as the fault parameter estimates are sufficiently accurate, the polytopic controller can provide satisfactory fault-tolerance.

  17. Design study of Software-Implemented Fault-Tolerance (SIFT) computer

    NASA Technical Reports Server (NTRS)

    Wensley, J. H.; Goldberg, J.; Green, M. W.; Kutz, W. H.; Levitt, K. N.; Mills, M. E.; Shostak, R. E.; Whiting-Okeefe, P. M.; Zeidler, H. M.

    1982-01-01

    Software-implemented fault tolerant (SIFT) computer design for commercial aviation is reported. A SIFT design concept is addressed. Alternate strategies for physical implementation are considered. Hardware and software design correctness is addressed. System modeling and effectiveness evaluation are considered from a fault-tolerant point of view.

  18. Development and evaluation of a fault-tolerant multiprocessor (FTMP) computer. Volume 1: FTMP principles of operation

    NASA Technical Reports Server (NTRS)

    Smith, T. B., Jr.; Lala, J. H.

    1983-01-01

    The basic organization of the fault tolerant multiprocessor (FTMP) is that of a general purpose homogeneous multiprocessor. Three processors operate on a shared system (memory and I/O) bus. Replication and tight synchronization of all elements and hardware voting are employed to detect and correct any single fault. Reconfiguration is then employed to repair a fault. Multiple faults may be tolerated as a sequence of single faults with repair between fault occurrences.

  19. Fault-tolerant software - Experiment with the SIFT operating system. [Software Implemented Fault Tolerance computer]

    NASA Technical Reports Server (NTRS)

    Brunelle, J. E.; Eckhardt, D. E., Jr.

    1985-01-01

    Results are presented of an experiment conducted in the NASA Avionics Integrated Research Laboratory (AIRLAB) to investigate the implementation of fault-tolerant software techniques on fault-tolerant computer architectures, in particular the Software Implemented Fault Tolerance (SIFT) computer. The N-version programming and recovery block techniques were implemented on a portion of the SIFT operating system. The results indicate that, to effectively implement fault-tolerant software design techniques, system requirements will be impacted and suggest that retrofitting fault-tolerant software on existing designs will be inefficient and may require system modification.

  20. Robot Position Sensor Fault Tolerance

    NASA Technical Reports Server (NTRS)

    Aldridge, Hal A.

    1997-01-01

    Robot systems in critical applications, such as those in space and nuclear environments, must be able to operate during component failure to complete important tasks. One failure mode that has received little attention is the failure of joint position sensors. Current fault tolerant designs require the addition of directly redundant position sensors which can affect joint design. A new method is proposed that utilizes analytical redundancy to allow for continued operation during joint position sensor failure. Joint torque sensors are used with a virtual passive torque controller to make the robot joint stable without position feedback and improve position tracking performance in the presence of unknown link dynamics and end-effector loading. Two Cartesian accelerometer based methods are proposed to determine the position of the joint. The joint specific position determination method utilizes two triaxial accelerometers attached to the link driven by the joint with the failed position sensor. The joint specific method is not computationally complex and the position error is bounded. The system wide position determination method utilizes accelerometers distributed on different robot links and the end-effector to determine the position of sets of multiple joints. The system wide method requires fewer accelerometers than the joint specific method to make all joint position sensors fault tolerant but is more computationally complex and has lower convergence properties. Experiments were conducted on a laboratory manipulator. Both position determination methods were shown to track the actual position satisfactorily. A controller using the position determination methods and the virtual passive torque controller was able to servo the joints to a desired position during position sensor failure.

  1. Sequoia: A fault-tolerant tightly coupled multiprocessor for transaction processing

    SciTech Connect

    Bernstein, P.A.

    1988-02-01

    The Sequoia computer is a tightly coupled multiprocessor, and thus attains the performance advantages of this style of architecture. It avoids most of the fault-tolerance disadvantages of tight coupling by using a new fault-tolerance design. The Sequoia architecture is similar to other multimicroprocessor architectures, such as those of Encore and Sequent, in that it gives dozens of microprocessors shared access to a large main memory. It resembles the Stratus architecture in its extensive use of hardware fault-detection techniques. It resembles Stratus and Auragen in its ability to quickly recover all processes after a single point failure, transparently to the user. However, Sequoia is unique in its combination of a large-scale tightly coupled architecture with a hardware approach to fault tolerance. This article gives an overview of how the hardware architecture and operating systems (OS) work together to provide a high degree of fault tolerance with good system performance.

  2. Adding Fault Tolerance to NPB Benchmarks Using ULFM

    SciTech Connect

    Parchman, Zachary W; Vallee, Geoffroy R; Naughton III, Thomas J; Engelmann, Christian; Bernholdt, David E; Scott, Stephen L

    2016-01-01

    In the world of high-performance computing, fault tolerance and application resilience are becoming some of the primary concerns because of increasing hardware failures and memory corruptions. While the research community has been investigating various options, from system-level solutions to application-level solutions, standards such as the Message Passing Interface (MPI) are also starting to include such capabilities. The current proposal for MPI fault tolerance is centered around the User-Level Failure Mitigation (ULFM) concept, which provides means for fault detection and recovery of the MPI layer. This approach does not address application-level recovery, which is currently left to application developers. In this work, we present a modification of some of the benchmarks of the NAS parallel benchmark (NPB) to include support of the ULFM capabilities as well as application-level strategies and mechanisms for application-level failure recovery. As such, we present: (i) an application-level library to checkpoint and restore data, (ii) extensions of NPB benchmarks for fault tolerance based on different strategies, (iii) a fault injection tool, and (iv) some preliminary results that show the impact of such fault tolerant strategies on the application execution.
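
    The idea of application-level checkpoint and restore, combined with an injected fault, can be illustrated with a small single-process sketch. It does not use MPI or ULFM and is not the library described in the abstract; the function names save_checkpoint, load_checkpoint, and run, the pickle file format, and the fail_at parameter are all assumptions made for this example.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state atomically: dump to a temporary file, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)  # atomic rename, so a crash never leaves a partial checkpoint


def load_checkpoint(path, default):
    """Return the saved state, or default if no checkpoint exists yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


def run(path, n_steps, fail_at=None):
    """Sum 0..n_steps-1, checkpointing after every step; fail_at injects a fault."""
    state = load_checkpoint(path, {"step": 0, "total": 0})
    while state["step"] < n_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("injected fault")
        state["total"] += state["step"]
        state["step"] += 1
        save_checkpoint(path, state)
    return state["total"]
```

    Calling run again with the same path after an injected fault resumes from the last saved step instead of restarting, so the cost of a failure is limited to the work done since the most recent checkpoint.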

  3. Fault-tolerant dynamic task graph scheduling

    SciTech Connect

    Kurt, Mehmet C.; Krishnamoorthy, Sriram; Agrawal, Kunal; Agrawal, Gagan

    2014-11-16

    In this paper, we present an approach to fault tolerant execution of dynamic task graphs scheduled using work stealing. In particular, we focus on selective and localized recovery of tasks in the presence of soft faults. We elicit from the user the basic task graph structure in terms of successor and predecessor relationships. The work stealing-based algorithm to schedule such a task graph is augmented to enable recovery when the data and meta-data associated with a task get corrupted. We use this redundancy, and the knowledge of the task graph structure, to selectively recover from faults with low space and time overheads. We show that the fault tolerant design retains the essential properties of the underlying work stealing-based task scheduling algorithm, and that the fault tolerant execution is asymptotically optimal when task re-execution is taken into account. Experimental evaluation demonstrates the low cost of recovery under various fault scenarios.

  4. Reconfigurable fault tolerant avionics system

    NASA Astrophysics Data System (ADS)

    Ibrahim, M. M.; Asami, K.; Cho, Mengu

    This paper presents the design of a reconfigurable avionics system based on modern Static Random Access Memory (SRAM)-based Field Programmable Gate Array (FPGA) to be used in future generations of nano satellites. A major concern in satellite systems and especially nano satellites is to build robust systems with low-power consumption profiles. The system is designed to be flexible by providing the capability of reconfiguring itself based on its orbital position. As Single Event Upsets (SEU) do not have the same severity and intensity in all orbital locations, having the maximum at the South Atlantic Anomaly (SAA) and the polar cusps, the system does not have to be fully protected all the time in its orbit. An acceptable level of protection against high-energy cosmic rays and charged particles roaming in space is provided within the majority of the orbit through software fault tolerance. Checkpointing and rollback, along with control flow assertions, are used for that level of protection. In the minority part of the orbit where severe SEUs are expected to exist, a reconfiguration for the system FPGA is initiated where the processor systems are triplicated and protection through Triple Modular Redundancy (TMR) with feedback is provided. This technique of reconfiguring the system as per the level of the threat expected from SEU-induced faults helps in reducing the average dynamic power consumption of the system to one-third of its maximum. This technique can be viewed as a smart protection through system reconfiguration. The system is built on the commercial version of the (XC5VLX50) Xilinx Virtex5 FPGA on bulk silicon with 324 IO. Simulations of orbit SEU rates were carried out using the SPENVIS web-based software package.
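
    The TMR-with-feedback protection mentioned above can be sketched in software as follows. This is only an illustrative model, assuming hashable replica outputs; the paper implements the voter in FPGA logic, and the names `tmr_vote` and `tmr_step` are hypothetical.

```python
from collections import Counter


def tmr_vote(a, b, c):
    """Majority-vote three replica outputs.

    Returns (voted_value, fault_detected). Raises if no two replicas agree,
    which is outside the single-fault assumption of TMR.
    """
    value, votes = Counter([a, b, c]).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority: more than one replica disagrees")
    return value, votes < 3


def tmr_step(states, step):
    """Advance three replicas one step and feed the voted state back to all.

    The feedback overwrites a corrupted replica with the majority value, so a
    single upset does not persist into the next step.
    """
    outputs = [step(s) for s in states]
    voted, fault = tmr_vote(*outputs)
    return [voted, voted, voted], fault
```

    For example, `tmr_step([4, 4, 9], lambda s: s + 1)` votes out the corrupted third replica and returns three copies of 5 together with a fault flag.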

  5. Concatenated codes for fault tolerant quantum computing

    SciTech Connect

    Knill, E.; Laflamme, R.; Zurek, W.

    1995-05-01

    The application of concatenated codes to fault tolerant quantum computing is discussed. We have previously shown that for quantum memories and quantum communication, a state can be transmitted with error ε provided each gate has error at most cε. We show how this can be used with Shor's fault tolerant operations to reduce the accuracy requirements when maintaining states not currently participating in the computation. Viewing Shor's fault tolerant operations as a method for reducing the error of operations, we give a concatenated implementation which promises to propagate the reduction hierarchically. This has the potential of reducing the accuracy requirements in long computations.

  6. An aircraft sensor fault tolerant system

    NASA Technical Reports Server (NTRS)

    Caglayan, A. K.; Lancraft, R. E.

    1982-01-01

    The design of a sensor fault tolerant system which uses analytical redundancy for the Terminal Configured Vehicle (TCV) research aircraft in a Microwave Landing System (MLS) environment was studied. The fault tolerant system provides reliable estimates for aircraft position, velocity, and attitude in the presence of possible failures in navigation aid instruments and onboard sensors. The estimates, provided by the fault tolerant system, are used by the automated guidance and control system to land the aircraft along a prescribed path. Sensor failures are identified by utilizing the analytic relationship between the various sensor outputs arising from the aircraft equations of motion.

  7. Error Mitigation of Point-to-Point Communication for Fault-Tolerant Computing

    NASA Technical Reports Server (NTRS)

    Akamine, Robert L.; Hodson, Robert F.; LaMeres, Brock J.; Ray, Robert E.

    2011-01-01

    Fault tolerant systems require the ability to detect and recover from physical damage caused by the hardware's environment, faulty connectors, and system degradation over time. This ability applies to military, space, and industrial computing applications. The integrity of Point-to-Point (P2P) communication, between two microcontrollers for example, is an essential part of fault tolerant computing systems. In this paper, different methods of fault detection and recovery are presented and analyzed.

  8. An empirical comparison of software fault tolerance and fault elimination

    NASA Technical Reports Server (NTRS)

    Shimeall, Timothy J.; Leveson, Nancy G.

    1991-01-01

    Reliability is an important concern in the development of software for modern systems. Some researchers have hypothesized that particular fault-handling approaches or techniques are so effective that other approaches or techniques are superfluous. The authors have performed a study that compares two major approaches to the improvement of software, software fault elimination and software fault tolerance, by examination of the fault detection obtained by five techniques: run-time assertions, multi-version voting, functional testing augmented by structural testing, code reading by stepwise abstraction, and static data-flow analysis. This study has focused on characterizing the sets of faults detected by the techniques and on characterizing the relationships between these sets of faults. The results of the study show that none of the techniques studied is necessarily redundant to any combination of the others. Further results reveal strengths and weakness in the fault detection by the techniques studied and suggest directions for future research.

  9. Fault-tolerant communication channel structures

    NASA Technical Reports Server (NTRS)

    Alkalai, Leon (Inventor); Chau, Savio N. (Inventor); Tai, Ann T. (Inventor)

    2006-01-01

    Systems and techniques for implementing fault-tolerant communication channels and features in communication systems. Selected commercial-off-the-shelf devices can be integrated in such systems to reduce the cost.

  10. Performance Analysis on Fault Tolerant Control System

    NASA Technical Reports Server (NTRS)

    Shin, Jong-Yeob; Belcastro, Christine

    2005-01-01

    In a fault tolerant control (FTC) system, a parameter varying FTC law is reconfigured based on fault parameters estimated by fault detection and isolation (FDI) modules. FDI modules require some time to detect fault occurrences in aero-vehicle dynamics. In this paper, an FTC analysis framework is provided to calculate the upper bound of an induced-L2 norm of an FTC system in the presence of false identification and detection time delay. The upper bound is written as a function of a fault detection time and exponential decay rates and has been used to determine which FTC law produces less performance degradation (tracking error) due to false identification. The analysis framework is applied for an FTC system of a HiMAT (Highly Maneuverable Aircraft Technology) vehicle. Index Terms: fault tolerant control system, linear parameter varying system, HiMAT vehicle.

  11. The cost of software fault tolerance

    NASA Technical Reports Server (NTRS)

    Migneault, G. E.

    1982-01-01

    The proposed use of software fault tolerance techniques as a means of reducing software costs in avionics and as a means of addressing the issue of system unreliability due to faults in software is examined. A model is developed to provide a view of the relationships among cost, redundancy, and reliability which suggests strategies for software development and maintenance which are not conventional.

  12. Fault-free performance validation of fault-tolerant multiprocessors

    NASA Technical Reports Server (NTRS)

    Czeck, Edward W.; Feather, Frank E.; Grizzaffi, Ann Marie; Segall, Zary Z.; Siewiorek, Daniel P.

    1987-01-01

    A validation methodology for testing the performance of fault-tolerant computer systems was developed and applied to the Fault-Tolerant Multiprocessor (FTMP) at NASA-Langley's AIRLAB facility. This methodology was claimed to be general enough to apply to any ultrareliable computer system. The goal of this research was to extend the validation methodology and to demonstrate the robustness of the validation methodology by its more extensive application to NASA's Fault-Tolerant Multiprocessor System (FTMP) and to the Software Implemented Fault-Tolerance (SIFT) Computer System. Furthermore, the performance of these two multiprocessors was compared by conducting similar experiments. An analysis of the results shows high level language instruction execution times for both SIFT and FTMP were consistent and predictable, with SIFT having greater throughput. At the operating system level, FTMP consumes 60% of the throughput for its real-time dispatcher and 5% on fault-handling tasks. In contrast, SIFT consumes 16% of its throughput for the dispatcher, but consumes 66% in fault-handling software overhead.

  13. A verified design of a fault-tolerant clock synchronization circuit: Preliminary investigations

    NASA Technical Reports Server (NTRS)

    Miner, Paul S.

    1992-01-01

    Schneider demonstrates that many fault tolerant clock synchronization algorithms can be represented as refinements of a single proven correct paradigm. Shankar provides mechanical proof that Schneider's schema achieves Byzantine fault tolerant clock synchronization provided that 11 constraints are satisfied. Some of the constraints are assumptions about physical properties of the system and cannot be established formally. Proofs are given that the fault tolerant midpoint convergence function satisfies three of the constraints. A hardware design is presented, implementing the fault tolerant midpoint function, which is shown to satisfy the remaining constraints. The synchronization circuit will recover completely from transient faults provided the maximum fault assumption is not violated. The initialization protocol for the circuit also provides a recovery mechanism from total system failure caused by correlated transient faults.

  14. The Design of a Fault-Tolerant COTS-Based Bus Architecture

    NASA Technical Reports Server (NTRS)

    Chau, Savio N.; Alkalai, Leon; Burt, John B.; Tai, Ann T.

    1999-01-01

    In this paper, we report our experiences and findings on the design of a fault-tolerant bus architecture comprised of two COTS buses, the IEEE 1394 and the I2C. This fault-tolerant bus is the backbone system bus for the avionics architecture of the X2000 program at the Jet Propulsion Laboratory. COTS buses are attractive because of the availability of low cost commercial products. However, they are not specifically designed for highly reliable applications such as long-life deep-space missions. The X2000 design team has devised a multi-level fault tolerance approach to compensate for this shortcoming of COTS buses. First, the approach enhances the fault tolerance capabilities of the IEEE 1394 and I2C buses by adding a layer of fault handling hardware and software. Second, algorithms are developed to enable the IEEE 1394 and the I2C buses to assist each other to isolate and recover from faults. Third, the set of IEEE 1394 and I2C buses is duplicated to further enhance system reliability. The X2000 design team has paid special attention to guarantee that all fault tolerance provisions will not cause the bus design to deviate from the commercial standard specifications. Otherwise, the economic attractiveness of using COTS will be diminished. The hardware and software design of the X2000 fault-tolerant bus are being implemented and flight hardware will be delivered to the ST4 and Europa Orbiter missions.

  15. Fault-tolerant multichannel demultiplexer subsystems

    NASA Technical Reports Server (NTRS)

    Redinbo, Robert

    1991-01-01

    Fault tolerance in future processing and switching communication satellites is addressed by showing new methods for detecting hardware failures in the first major subsystem, the multichannel demultiplexer. An efficient method for demultiplexing frequency slotted channels uses multirate filter banks which contain fast Fourier transform processing. All numerical processing is performed at a lower rate commensurate with the small bandwidth of each baseband channel. The integrity of the demultiplexing operations is protected by using real number convolutional codes to compute comparable parity values which detect errors at the data sample level. High rate, systematic convolutional codes produce parity values at a much reduced rate, and protection is achieved by generating parity values in two ways and comparing them. Parity values corresponding to each output channel are generated in parallel by a subsystem, operating even slower and in parallel with the demultiplexer, that is virtually identical to the original structure. These parity calculations may be time shared with the same processing resources because they are so similar.

  16. Fault Injection Campaign for a Fault Tolerant Duplex Framework

    NASA Technical Reports Server (NTRS)

    Sacco, Gian Franco; Ferraro, Robert D.; von Allmen, Paul; Rennels, Dave A.

    2007-01-01

    Fault tolerance is an efficient approach adopted to avoid or reduce the damage of a system failure. In this work we present the results of a fault injection campaign we conducted on the Duplex Framework (DF). The DF is software developed by the UCLA group [1, 2] that uses a fault tolerant approach and allows two replicas of the same process to be run on two different nodes of a commercial off-the-shelf (COTS) computer cluster. A third process, running on a different node, constantly monitors the results computed by the two replicas and restarts the two replica processes if an inconsistency in their computation is detected. This approach is very cost efficient and can be adopted to control processes on spacecraft where the fault rate produced by cosmic rays is not very high.
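
    The duplex compare-and-restart pattern described above can be sketched as follows. This is a simplified in-process model, not the Duplex Framework itself: replicas are plain callables rather than processes on separate cluster nodes, and `run_duplex`, `make_flaky`, and `max_restarts` are hypothetical names introduced for the example.

```python
def run_duplex(task, replica_a, replica_b, max_restarts=3):
    """Run two replicas on the same task and compare their results.

    On a mismatch both replicas are rerun (the "restart"). Returns the agreed
    result and the number of restarts that were needed.
    """
    for restarts in range(max_restarts + 1):
        result_a = replica_a(task)
        result_b = replica_b(task)
        if result_a == result_b:
            return result_a, restarts
    raise RuntimeError("replicas still disagree after %d restarts" % max_restarts)


def make_flaky(fn, bad_calls):
    """Wrap fn so that the calls numbered in bad_calls return a corrupted value."""
    calls = {"n": 0}

    def replica(x):
        calls["n"] += 1
        result = fn(x)
        return result + 1 if calls["n"] in bad_calls else result

    return replica


def square(x):
    return x * x
```

    With `make_flaky(square, {1})` as one replica, the first comparison fails and the second succeeds, so `run_duplex` reports one restart. Note that a duplex scheme can only detect a disagreement; it cannot tell which replica was wrong, which is why both are restarted.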

  17. Software reliability through fault-avoidance and fault-tolerance

    NASA Technical Reports Server (NTRS)

    Vouk, Mladen A.; Mcallister, David F.

    1991-01-01

    Twenty independently developed but functionally equivalent software versions were used to investigate and compare empirically some properties of N-version programming, Recovery Block, and Consensus Recovery Block, using the majority and consensus voting algorithms. This was also compared with another hybrid fault-tolerant scheme called Acceptance Voting, using dynamic versions of consensus and majority voting. Consensus voting provides adaptation of the voting strategy to varying component reliability, failure correlation, and output space characteristics. Since failure correlation among versions effectively reduces the cardinality of the space in which the voter makes decisions, consensus voting is usually preferable to simple majority voting in any fault-tolerant system. When versions have considerably different reliabilities, the version with the best reliability will perform better than any of the fault-tolerant techniques.
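
    The difference between majority and consensus voting can be illustrated with a small sketch. It assumes a simplified reading of consensus voting as "pick the largest agreeing group of outputs", with ties reported as no decision; the tie-breaking rule and the use of `None` for "no decision" are assumptions of this example, not details taken from the abstract.

```python
from collections import Counter


def majority_vote(outputs):
    """Return the value produced by more than half of the versions, else None."""
    if not outputs:
        return None
    value, votes = Counter(outputs).most_common(1)[0]
    return value if votes > len(outputs) / 2 else None


def consensus_vote(outputs):
    """Return the value with the largest agreement group, else None on a tie.

    Assumption: a tie between the two largest groups yields no decision here;
    a real system would need some further tie-breaking mechanism.
    """
    ranked = Counter(outputs).most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]
```

    For five versions producing `[3, 3, 7, 8, 9]`, majority voting has no decision (only two of five agree), while consensus voting selects 3.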

  18. Model-Based Fault Tolerant Control

    NASA Technical Reports Server (NTRS)

    Kumar, Aditya; Viassolo, Daniel

    2008-01-01

    The Model Based Fault Tolerant Control (MBFTC) task was conducted under the NASA Aviation Safety and Security Program. The goal of MBFTC is to develop and demonstrate real-time strategies to diagnose and accommodate anomalous aircraft engine events such as sensor faults, actuator faults, or turbine gas-path component damage that can lead to in-flight shutdowns, aborted take offs, asymmetric thrust/loss of thrust control, or engine surge/stall events. A suite of model-based fault detection algorithms were developed and evaluated. Based on the performance and maturity of the developed algorithms two approaches were selected for further analysis: (i) multiple-hypothesis testing, and (ii) neural networks; both used residuals from an Extended Kalman Filter to detect the occurrence of the selected faults. A simple fusion algorithm was implemented to combine the results from each algorithm to obtain an overall estimate of the identified fault type and magnitude. The identification of the fault type and magnitude enabled the use of an online fault accommodation strategy to correct for the adverse impact of these faults on engine operability thereby enabling continued engine operation in the presence of these faults. The performance of the fault detection and accommodation algorithm was extensively tested in a simulation environment.

  19. Fault-tolerant adaptive FIR filters using variable detection threshold

    NASA Astrophysics Data System (ADS)

    Lin, L. K.; Redinbo, G. R.

    1994-10-01

    Adaptive filters are widely used in many digital signal processing applications, where tap weights of the filters are adjusted by stochastic gradient search methods. Block adaptive filtering techniques, such as block least mean square and block conjugate gradient algorithm, were developed to speed up the convergence as well as improve the tracking capability which are two important factors in designing real-time adaptive filter systems. Even though algorithm-based fault tolerance can be used as a low-cost high level fault-tolerant technique to protect the aforementioned systems from hardware failures with minimal hardware overhead, the issue of choosing a good detection threshold remains a challenging problem. First of all, the systems usually only have limited computational resources, i.e., concurrent error detection and correction is not feasible. Secondly, any prior knowledge of input data is very difficult to get in practical settings. We propose a checksum-based fault detection scheme using two-level variable detection thresholds that are dynamically dependent on the past syndromes. Simulations show that the proposed scheme reduces the possibility of false alarms and has a high degree of fault coverage in adaptive filter systems.
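
    To make the idea of a checksum syndrome with a history-dependent threshold concrete, here is a minimal sketch. It is not the authors' two-level scheme: it uses a single threshold derived from a running average of past fault-free syndromes, and the class name `AdaptiveThreshold` and its parameters `floor`, `factor`, and `memory` are assumptions made for illustration.

```python
def block_filter(X, w):
    """Compute one block of filter outputs: y[k] = sum_i X[k][i] * w[i]."""
    return [sum(xi * wi for xi, wi in zip(row, w)) for row in X]


def syndrome(X, w, y):
    """Checksum syndrome: |sum(y) - (column sums of X) . w|.

    In exact arithmetic this is zero for a fault-free block; in floating
    point it is small but nonzero, which is why a threshold is needed.
    """
    col_sums = [sum(col) for col in zip(*X)]
    expected = sum(c * wi for c, wi in zip(col_sums, w))
    return abs(sum(y) - expected)


class AdaptiveThreshold:
    """Flag a fault when the syndrome exceeds a multiple of its recent average."""

    def __init__(self, floor=1e-9, factor=10.0, memory=0.9):
        self.floor = floor      # minimum threshold, covers the all-zero history
        self.factor = factor    # how far above the average counts as a fault
        self.memory = memory    # weight given to the past average
        self.avg = 0.0

    def check(self, s):
        threshold = max(self.floor, self.factor * self.avg)
        faulty = s > threshold
        if not faulty:
            # Only fault-free syndromes update the history.
            self.avg = self.memory * self.avg + (1 - self.memory) * s
        return faulty
```

    Corrupting a single output sample after `block_filter` has run changes `sum(y)` but not the checksum computed from the inputs, so the syndrome jumps and `check` returns `True`.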

  20. A fault-tolerant clock

    NASA Technical Reports Server (NTRS)

    Daley, W. P.; Mckenna, J. F., Jr.

    1973-01-01

    Computers must operate correctly even though one or more of their components have failed. An electronic clock has been designed to be insensitive to the occurrence of faults; it is a substantial advance over any known clock.

  1. Fault tolerant operation of switched reluctance machine

    NASA Astrophysics Data System (ADS)

    Wang, Wei

    The energy crisis and environmental challenges have driven industry towards more energy efficient solutions. With nearly 60% of electricity consumed by various electric machines in the industry sector, advancement in the efficiency of the electric drive system is of vital importance. Adjustable speed drive system (ASDS) provides excellent speed regulation and dynamic performance as well as dramatically improved system efficiency compared with conventional motors without electronic drives. Industry has witnessed tremendous growth in ASDS applications not only as a driving force but also as an electric auxiliary system for replacing bulky and low efficiency auxiliary hydraulic and mechanical systems. With the vast penetration of ASDS, its fault tolerant operation capability is more widely recognized as an important feature of drive performance especially for aerospace, automotive applications and other industrial drive applications demanding high reliability. The Switched Reluctance Machine (SRM), a low cost, highly reliable electric machine with fault tolerant operation capability, has drawn substantial attention in the past three decades. Nevertheless, SRM is not free of fault. Certain faults such as converter faults, sensor faults, winding shorts, eccentricity and position sensor faults are commonly shared among all ASDS. In this dissertation, a thorough understanding of various faults and their influence on transient and steady state performance of SRM is developed via simulation and experimental study, providing necessary knowledge for fault detection and post fault management. Lumped parameter models are established for fast real time simulation and drive control. Based on the behavior of the faults, a fault detection scheme is developed for the purpose of fast and reliable fault diagnosis. In order to improve the SRM power and torque capacity under faults, the maximum torque per ampere excitation is conceptualized and validated through theoretical analysis and

  2. Analysis of fault-tolerant neurocontrol architectures

    NASA Technical Reports Server (NTRS)

    Troudet, T.; Merrill, W.

    1992-01-01

    The fault-tolerance of analog parallel distributed implementations of a multivariable aircraft neurocontroller is analyzed by simulating weight and neuron failures in a simplified scheme of analog processing based on the functional architecture of the ETANN chip (Electrically Trainable Artificial Neural Network). The neural information processing is found to be only partially distributed throughout the set of weights of the neurocontroller synthesized with the backpropagation algorithm. Although the degree of distribution of the neural processing, and consequently the fault-tolerance of the neurocontroller, could be enhanced using Locally Distributed Weight and Neuron Approaches, a satisfactory level of fault-tolerance could only be obtained by retraining the degraded VLSI neurocontroller. The possibility of maintaining neurocontrol performance and stability in the presence of single weight or neuron failures was demonstrated through an automated retraining procedure of the neurocontroller based on a pre-programmed choice and sequence of the training parameters.

  3. A Primer on Architectural Level Fault Tolerance

    NASA Technical Reports Server (NTRS)

    Butler, Ricky W.

    2008-01-01

    This paper introduces the fundamental concepts of fault tolerant computing. Key topics covered are voting, fault detection, clock synchronization, Byzantine Agreement, diagnosis, and reliability analysis. Low level mechanisms such as Hamming codes or low level communications protocols are not covered. The paper is tutorial in nature and does not cover any topic in detail. The focus is on rationale and approach rather than detailed exposition.

  4. Experiments in fault tolerant software reliability

    NASA Technical Reports Server (NTRS)

    Mcallister, David F.; Tai, K. C.; Vouk, Mladen A.

    1987-01-01

    The reliability of voting was evaluated in a fault-tolerant software system for small output spaces. The effectiveness of the back-to-back testing process was investigated. Version 3.0 of the RSDIMU-ATS, a semi-automated test bed for certification testing of RSDIMU software, was prepared and distributed. Software reliability estimation methods based on non-random sampling are being studied. The investigation of existing fault-tolerance models was continued and formulation of new models was initiated.

  5. A general model for the study of fault tolerance and diagnosis.

    NASA Technical Reports Server (NTRS)

    Meyer, J. F.

    1973-01-01

    The concept of a 'system with faults' is introduced as a suggested point of departure for the theoretical study of fault tolerance and diagnosis in systems. The model is defined relative to a general representation scheme for systems and, depending on the choice of representation, can be used to investigate either hardware or software faults that occur during either the design or use of a system.

  6. Reliability modeling of fault-tolerant computer based systems

    NASA Technical Reports Server (NTRS)

    Bavuso, Salvatore J.

    1987-01-01

    Digital fault-tolerant computer-based systems have become commonplace in military and commercial avionics. These systems hold the promise of increased availability, reliability, and maintainability over conventional analog-based systems through the application of replicated digital computers arranged in fault-tolerant configurations. Three tightly coupled factors of paramount importance, ultimately determining the viability of these systems, are reliability, safety, and profitability. Reliability, the major driver, affects virtually every aspect of design, packaging, and field operations, and eventually produces profit for commercial applications or increased national security. However, the utilization of digital computer systems makes the task of producing credible reliability assessment a formidable one for the reliability engineer. The root of the problem lies in the digital computer's unique adaptability to changing requirements, computational power, and ability to test itself efficiently. Addressed here are the nuances of modeling the reliability of systems with large state sizes, in the Markov sense, which result from systems based on replicated redundant hardware, and the modeling of factors which can reduce reliability without concomitant depletion of hardware. Advanced fault-handling models are described and methods of acquiring and measuring parameters for these models are delineated.

  7. Distributed execution of recovery blocks - An approach for uniform treatment of hardware and software faults in real-time applications

    NASA Technical Reports Server (NTRS)

    Kim, K. H.; Welch, Howard O.

    1989-01-01

    The concept of distributed execution of recovery blocks is examined as an approach for uniform treatment of hardware and software faults. A useful characteristic of the approach is the relatively small time cost it requires. The approach is thus suitable for incorporation into real-time computer systems. A specific formulation of the approach that is aimed at minimizing the recovery time is presented, called the distributed recovery block (DRB) scheme. The DRB scheme is capable of effecting forward recovery while handling both hardware and software faults in a uniform manner. An approach to incorporating the capability for multiprocessing scheme is also discussed. Two experiments aimed at testing the execution efficiency of the scheme in real-time applications have been conducted on two different multimicrocomputer networks. The results clearly indicate the feasibility of achieving tolerance of hardware and software faults in a broad range of real-time computer systems by use of the schemes for distributed execution of recovery blocks.

  8. Towards fault-tolerant optimal control

    NASA Technical Reports Server (NTRS)

    Chizeck, H. J.; Willsky, A. S.

    1979-01-01

    The paper considers the design of fault-tolerant controllers that may endow systems with dynamic reliability. Results for jump linear quadratic Gaussian control problems are extended to include random jump costs, trajectory discontinuities, and a simple case of non-Markovian mode transitions.

  9. Fault-tolerant electrical power system

    NASA Astrophysics Data System (ADS)

    Mehdi, Ishaque S.; Weimer, Joseph A.

    1987-10-01

    An electrical system that will meet the requirements of a 1990s two-engine fighter is being developed in the Fault-Tolerant Electrical Power System (FTEPS) program, sponsored by the AFWAL Aero Propulsion Laboratory. FTEPS will demonstrate the generation and distribution of fault-tolerant, reliable, electrical power required for future aircraft. The system incorporates MIL-STD-1750A digital processors and MIL-STD-1553B data buses for control and communications. Electrical power is distributed through electrical load management centers by means of solid-state power controllers for fault protection and individual load control. The system will provide uninterruptible power to flight-critical loads such as the flight control and mission computers with sealed lead-acid batteries. Primary power is provided by four 60 kVA variable speed constant frequency generators. Buildup and testing of the FTEPS demonstrator is expected to be complete by May 1988.

  10. Development and Evaluation of Fault-Tolerant Flight Control Systems

    NASA Technical Reports Server (NTRS)

    Song, Yong D.; Gupta, Kajal (Technical Monitor)

    2004-01-01

    The research is concerned with developing a new approach to enhancing fault tolerance of flight control systems. The original motivation for fault-tolerant control comes from the need for safe operation of control elements (e.g. actuators) in the event of hardware failures in high reliability systems. One such example is a modern space vehicle subjected to actuator/sensor impairments. A major task in flight control is to revise the control policy to balance impairment detectability and to achieve sufficient robustness. This involves careful selection of types and parameters of the controllers and the impairment detecting filters used. It also involves a decision, upon the identification of some failures, on whether and how a control reconfiguration should take place in order to maintain a certain system performance level. In this project a new flight dynamic model under uncertain flight conditions is considered, in which the effects of both ramp and jump faults are reflected. Stabilization algorithms based on neural network and adaptive method are derived. The control algorithms are shown to be effective in dealing with uncertain dynamics due to external disturbances and unpredictable faults. The overall strategy is easy to set up and the computation involved is much less as compared with other strategies. Computer simulation software is developed. A series of simulation studies have been conducted with varying flight conditions.

  11. Fly-By-Light/Power-By-Wire Fault-Tolerant Fiber-Optic Backplane

    NASA Technical Reports Server (NTRS)

    Malekpour, Mahyar R.

    2002-01-01

    The design and development of a fault-tolerant fiber-optic backplane to demonstrate feasibility of such architecture is presented. The simulation results of test cases on the backplane in the event of induced faults are presented, and the fault recovery capability of the architecture is demonstrated. The architecture was designed, developed, and implemented using the Very High Speed Integrated Circuits (VHSIC) Hardware Description Language (VHDL). The architecture was synthesized and implemented in hardware using Field Programmable Gate Arrays (FPGA) on multiple prototype boards.

  12. Investigation of an advanced fault tolerant integrated avionics system

    NASA Technical Reports Server (NTRS)

    Dunn, W. R.; Cottrell, D.; Flanders, J.; Javornik, A.; Rusovick, M.

    1986-01-01

    Presented is an advanced, fault-tolerant multiprocessor avionics architecture as could be employed in an advanced rotorcraft such as LHX. The processor structure is designed to interface with existing digital avionics systems and concepts including the Army Digital Avionics System (ADAS) cockpit/display system, navaid and communications suites, integrated sensing suite, and the Advanced Digital Optical Control System (ADOCS). The report defines mission, maintenance and safety-of-flight reliability goals as might be expected for an operational LHX aircraft. Based on use of a modular, compact (16-bit) microprocessor card family, results of a preliminary study examining simplex, dual and standby-sparing architectures are presented. Given the stated constraints, it is shown that the dual architecture is best suited to meet reliability goals with minimum hardware and software overhead. The report presents hardware and software design considerations for realizing the architecture including redundancy management requirements and techniques as well as verification and validation needs and methods.

  13. Using certification trails to achieve software fault tolerance

    NASA Technical Reports Server (NTRS)

    Sullivan, Gregory F.; Masson, Gerald M.

    1993-01-01

    A conceptually novel and powerful technique to achieve fault tolerance in hardware and software systems is introduced. When used for software fault tolerance, this new technique uses time and software redundancy and can be outlined as follows. In the initial phase, a program is run to solve a problem and store the result. In addition, this program leaves behind a trail of data called a certification trail. In the second phase, another program is run which solves the original problem again. This program, however, has access to the certification trail left by the first program. Because of the availability of the certification trail, the second phase can be performed by a less complex program and can execute more quickly. In the final phase, the two results are compared; if they agree, they are accepted as correct; otherwise an error is indicated. An essential aspect of this approach is that the second program must always generate either an error indication or a correct output even when the certification trail it receives from the first program is incorrect. The certification trail approach to fault tolerance was formalized and it was illustrated by applying it to the fundamental problem of finding a minimum spanning tree. Cases in which the second phase can be run concurrently with the first and act as a monitor are discussed. The certification trail approach was compared to other approaches to fault tolerance. Because of space limitations we have omitted examples of our technique applied to the Huffman tree, and convex hull problems. These can be found in the full version of this paper.
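
    The three phases described above can be illustrated on a simpler problem than the minimum spanning tree used in the paper. The sketch below applies the idea to sorting, which is an example chosen here and not one named in the abstract: the trail is the permutation found by the first program, and the second program rebuilds the answer in linear time while checking the trail. The function names are hypothetical.

```python
def phase_one(items):
    """Solve the problem (sort) and leave a trail: the sorting permutation."""
    trail = sorted(range(len(items)), key=items.__getitem__)
    return [items[i] for i in trail], trail


def phase_two(items, trail):
    """Re-solve using the trail, in linear time, without trusting it.

    Raises ValueError unless the trail is a permutation of the indices that
    puts the items in nondecreasing order, so a bad trail can never lead to
    a wrong output being returned.
    """
    n = len(items)
    if len(trail) != n:
        raise ValueError("trail has the wrong length")
    seen = [False] * n
    out = []
    for i in trail:
        if not isinstance(i, int) or not 0 <= i < n or seen[i]:
            raise ValueError("trail is not a permutation")
        seen[i] = True
        out.append(items[i])
    if any(out[k] > out[k + 1] for k in range(n - 1)):
        raise ValueError("trail does not sort the input")
    return out


def certified_sort(items):
    """Final phase: accept the result only if both phases agree."""
    first, trail = phase_one(items)
    second = phase_two(items, trail)
    if first != second:
        raise ValueError("phase results disagree")
    return second
```

    The second phase is simpler than a full sort, yet it either returns a correctly sorted list or raises an error, which mirrors the essential property stated in the abstract.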

  14. Sequential behavior and its inherent tolerance to memory faults.

    NASA Technical Reports Server (NTRS)

    Meyer, J. F.

    1972-01-01

    Representation of a memory fault of a sequential machine M by a function mu on the states of M and the result of the fault by an appropriately determined machine M(mu). Given some sequential behavior B, its inherent tolerance to memory faults can then be measured in terms of the minimum memory redundancy required to realize B with a state-assigned machine having fault tolerance type tau and fault tolerance level t. A behavior having maximum inherent tolerance is exhibited, and it is shown that behaviors of the same size can have different inherent tolerance.

  15. Fault tolerant GPS/Inertial System design

    NASA Astrophysics Data System (ADS)

    Brown, Alison K.; Sturza, Mark A.; Deangelis, Franco; Lukaszewski, David A.

    The use of a GPS/Inertial integrated system in future launch vehicles motivates the described design of the present fault-tolerant system. The robustness of the navigation system is enhanced by integrating the GPS with an inertial fault-tolerant system. Three layers of failure detection and isolation are incorporated to determine the nature of flaws in the inertial instruments, the GPS receivers, or the integrated navigation solution. The layers are based on: (1) a high-rate parity algorithm for instrument failures; (2) a similar parity algorithm for GPS satellite or receiver failures; and (3) a GPS navigation solution to monitor inertial navigation failures. Dual failures can occur in any system component without affecting the performance of launch-vehicle navigation or guidance.

  16. Reinitialization issues in fault tolerant systems

    NASA Technical Reports Server (NTRS)

    Caglayan, A. K.; Lancraft, R. E.

    1983-01-01

    This paper is concerned with the reinitialization of fault tolerant systems in which failure detection and isolation (FDI) techniques are used, on-line, to identify and compensate for system failures. Specifically, it will focus on FDI techniques which utilize analytic redundancy, arising from a knowledge of the plant dynamics, by analyzing the residuals of a no-fail filter designed on the assumption of no failures. In these types of fault tolerant systems, system failures have to propagate through the no-fail filter dynamics in order to get detected. Therefore, the no-fail filter must be reinitialized after the isolation of a failure so that the accumulated effects of the failure are removed. In this paper, various approaches to this reinitialization problem will be discussed.

  17. A Unified Fault-Tolerance Protocol

    NASA Technical Reports Server (NTRS)

    Miner, Paul; Geser, Alfons; Pike, Lee; Maddalon, Jeffrey

    2004-01-01

    Davies and Wakerly show that Byzantine fault tolerance can be achieved by a cascade of broadcasts and middle value select functions. We present an extension of the Davies and Wakerly protocol, the unified protocol, and its proof of correctness. We prove that it satisfies validity and agreement properties for communication of exact values. We then introduce bounded communication error into the model. Inexact communication is inherent for clock synchronization protocols. We prove that validity and agreement properties hold for inexact communication, and that exact communication is a special case. As a running example, we illustrate the unified protocol using the SPIDER family of fault-tolerant architectures. In particular we demonstrate that the SPIDER interactive consistency, distributed diagnosis, and clock synchronization protocols are instances of the unified protocol.
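
    The middle value select function mentioned above can be illustrated in isolation. The sketch below is not the unified protocol itself; it only shows, assuming 2f+1 received values of which at most f are faulty, why the selected value stays within the range of the good values. The function name is hypothetical.

```python
from typing import Sequence


def middle_value_select(values: Sequence[float]) -> float:
    """Return the middle element of the received values once sorted.

    Assumption: len(values) is odd (2f + 1) and at most f values are faulty,
    so the middle element is bounded below and above by good values.
    """
    if len(values) % 2 == 0:
        raise ValueError("this sketch expects an odd number of values")
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


# Two good sources agree to within a small communication error;
# the third is arbitrarily wrong in either direction.
high_fault = middle_value_select([10.0, 1e9, 10.2])
low_fault = middle_value_select([10.0, -5.0, 10.2])
```

    In both calls the selected value is one of the good readings, which is the kind of validity property the abstract refers to for inexact communication.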

  18. A dual, fault-tolerant aerospace actuator

    NASA Technical Reports Server (NTRS)

    Siebert, C. J.

    1985-01-01

    The requirements for mechanisms used in the Space Transportation System (STS) are to provide dual fault tolerance, and if the payload equipment violates the Shuttle bay door envelope, these deployment/restow mechanisms must have independent primary and backup features. The research and development of an electromechanical actuator that meets these requirements and will be used on the Transfer Orbit Stage (TOS) program is described.

  19. Reconfigurable Fault Tolerance for FPGAs

    NASA Technical Reports Server (NTRS)

    Shuler, Robert, Jr.

    2010-01-01

    The invention allows a field-programmable gate array (FPGA) or similar device to be efficiently reconfigured in whole or in part to provide higher capacity, non-redundant operation. The redundant device consists of functional units such as adders or multipliers, configuration memory for the functional units, a programmable routing method, configuration memory for the routing method, and various other features such as block RAM, I/O (random access memory, input/output) capability, dedicated carry logic, etc. The redundant device has three identical sets of functional units and routing resources and majority voters that correct errors. The configuration memory may or may not be redundant, depending on need. For example, SRAM-based FPGAs will need some type of radiation-tolerant configuration memory, or they will need triple-redundant configuration memory. Flash or anti-fuse devices will generally not need redundant configuration memory. Some means of loading and verifying the configuration memory is also required. These are all components of the pre-existing redundant FPGA. This innovation modifies the voter to accept a MODE input, which specifies whether ordinary voting is to occur, or if redundancy is to be split. Generally, additional routing resources will also be required to pass data between sections of the device created by splitting the redundancy. In redundancy mode, the voters produce an output corresponding to the two inputs that agree, in the usual fashion. In the split mode, the voters select just one input and convey this to the output, ignoring the other inputs. In a dual-redundant system (as opposed to triple-redundant), instead of a voter, there is some means to latch or gate a state update only when both inputs agree. In this case, the invention would require modification of the latch or gate so that it would operate normally in redundant mode, and would separately latch or gate the inputs in non-redundant mode.
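
    A minimal software model of a voter with a MODE input, as described above, might look as follows. It is a sketch only: the constant names, the select argument used to pick the single input in split mode, and the bitwise word-level model are assumptions, not details taken from the invention.

```python
REDUNDANT = 0  # ordinary majority voting
SPLIT = 1      # redundancy is split; one input is passed through


def vote(a: int, b: int, c: int, mode: int = REDUNDANT, select: int = 0) -> int:
    """Bitwise model of a triple-redundant voter with a MODE input.

    In REDUNDANT mode each output bit follows the two (or three) inputs
    that agree. In SPLIT mode the voter conveys just one input, chosen
    by the hypothetical `select` argument, and ignores the others.
    """
    if mode == SPLIT:
        return (a, b, c)[select]
    return (a & b) | (a & c) | (b & c)


# Input c carries a corrupted value; the majority masks it.
masked = vote(0b1010, 0b1010, 0b0110)

# In split mode the three sections carry independent data.
passed_through = vote(1, 2, 3, mode=SPLIT, select=2)
```

    The majority expression is the standard two-of-three function applied bit by bit; split mode trades that error masking for up to three times the usable capacity.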

  20. Novel neural networks-based fault tolerant control scheme with fault alarm.

    PubMed

    Shen, Qikun; Jiang, Bin; Shi, Peng; Lim, Cheng-Chew

    2014-11-01

    In this paper, the problem of adaptive active fault-tolerant control for a class of nonlinear systems with unknown actuator fault is investigated. The actuator fault is assumed to have no traditional affine appearance of the system state variables and control input. The useful property of the basis function of the radial basis function neural network (NN), which will be used in the design of the fault tolerant controller, is explored. Based on the analysis of the design of normal and passive fault tolerant controllers, by using the implicit function theorem, a novel NN-based active fault-tolerant control scheme with fault alarm is proposed. Compared with results in the literature, the fault-tolerant control scheme can minimize the time delay between fault occurrence and accommodation, called the time delay due to fault diagnosis, and reduce the adverse effect on system performance. In addition, the FTC scheme has the advantages of a passive fault-tolerant control scheme as well as the traditional active fault-tolerant control scheme's properties. Furthermore, the fault-tolerant control scheme requires no additional fault detection and isolation model, which is necessary in the traditional active fault-tolerant control scheme. Finally, simulation results are presented to demonstrate the efficiency of the developed techniques. PMID:25014982

  1. An Integrated Fault Tolerant Robotic Controller System for High Reliability and Safety

    NASA Technical Reports Server (NTRS)

    Marzwell, Neville I.; Tso, Kam S.; Hecht, Myron

    1994-01-01

    This paper describes the concepts and features of a fault-tolerant intelligent robotic control system being developed for applications that require high dependability (reliability, availability, and safety). The system consists of two major elements: a fault-tolerant controller and an operator workstation. The fault-tolerant controller uses a strategy which allows for detection and recovery of hardware, operating system, and application software failures. The fault-tolerant controller can be used by itself in a wide variety of applications in industry, process control, and communications. The controller in combination with the operator workstation can be applied to robotic applications such as spaceborne extravehicular activities, hazardous materials handling, inspection and maintenance of high value items (e.g., space vehicles, reactor internals, or aircraft), medicine, and other tasks where a robot system failure poses a significant risk to life or property.

  2. Advanced Development for Space Robotics With Emphasis on Fault Tolerance Technology

    NASA Technical Reports Server (NTRS)

    Tesar, Delbert

    1997-01-01

    This report describes work developing fault tolerant redundant robotic architectures and adaptive control strategies for robotic manipulator systems which can dynamically accommodate drastic robot manipulator mechanism, sensor or control failures and maintain stable end-point trajectory control with minimum disturbance. Kinematic designs of redundant, modular, reconfigurable arms for fault tolerance were pursued at a fundamental level. The approach developed robotic testbeds to evaluate disturbance responses of fault tolerant concepts in robotic mechanisms and controllers. The development was implemented in various fault tolerant mechanism testbeds including duality in the joint servo motor modules, parallel and serial structural architectures, and dual arms. All have real-time adaptive controller technologies to react to mechanism or controller disturbances (failures) to perform real-time reconfiguration to continue the task operations. The developments fall into three main areas: hardware, software, and theoretical.

  3. Coordinated Fault Tolerance for High-Performance Computing

    SciTech Connect

    Dongarra, Jack; Bosilca, George; et al.

    2013-04-08

    Our work to meet our goal of end-to-end fault tolerance has focused on two areas: (1) improving fault tolerance in various software currently available and widely used throughout the HEC domain and (2) using fault information exchange and coordination to achieve holistic, systemwide fault tolerance and understanding how to design and implement interfaces for integrating fault tolerance features for multiple layers of the software stack, from the application, math libraries, and programming language runtime to other common system software such as job schedulers, resource managers, and monitoring tools.

  4. The X-38 Spacecraft Fault-Tolerant Avionics System

    NASA Technical Reports Server (NTRS)

    Kouba, Coy; Buscher, Deborah; Busa, Joseph

    2003-01-01

    In 1995 NASA began an experimental program to develop a reusable crew return vehicle (CRV) for the International Space Station. The purpose of the CRV was threefold: (i) to bring home an injured or ill crewmember; (ii) to bring home the entire crew if the Shuttle fleet was grounded; and (iii) to evacuate the crew in the case of an imminent Station threat (e.g., fire, decompression, etc.). Two approach and landing prototypes and one spacecraft demonstrator (called V201) were built at the Johnson Space Center. A series of increasingly complex ground subsystem tests was completed, and eight successful high-altitude drop tests were achieved to prove the design concept. In this program, an unprecedented amount of commercial-off-the-shelf technology was utilized in the first crewed spacecraft NASA has built since the Shuttle program. Unfortunately, in 2002 the program was canceled due to changing Agency priorities. The vehicle was 80% complete and the program was shut down in such a manner as to preserve design, development, test and engineering data. This paper describes the X-38 V201 fault-tolerant avionics system. Based on Draper Laboratory's Byzantine-resilient fault-tolerant parallel processing system and their "network element" hardware, each flight computer exchanges information on a strict timescale to process input data, compare results, and issue voted vehicle output commands. Major accomplishments achieved in this development include: (i) a space qualified two-fault tolerant design using mostly COTS (hardware and operating system); (ii) a single event upset tolerant network element board; (iii) on-the-fly recovery of a failed processor; (iv) use of synched cache; (v) realignment of memory to bring back a failed channel; (vi) flight code automatically generated from the master measurement list; and (vii) built in-house by a team of civil servants and support contractors. This paper will present an overview of the avionics system and the hardware

  5. Fault tolerant massively parallel processing architecture

    SciTech Connect

    Balasubramanian, V.; Banerjee, P.

    1987-08-01

    This paper presents two massively parallel processing architectures suitable for solving a wide variety of algorithms of divide-and-conquer type for problems such as the discrete Fourier transform, production systems, design automation, and others. The first architecture, called the Chain-structured Butterfly ARchitecture (CBAR), consists of a two-dimensional array of $N = L(\log_2 L + 1)$ processing elements (PEs) organized as $L$ levels of $\log_2 L + 1$ stages, and which has the butterfly connection between PEs in consecutive stages with straight-through feedback between PEs in the last and first stages. This connection system has the desirable property of allowing thousands of PEs to be connected with $O(N)$ connection cost, $O(\log_2(N/\log_2 N))$ communication paths, and a small number (=4) of I/O ports per PE. However, this architecture is not fault tolerant. The authors, therefore, propose a second architecture, called the REconfigurable Chain-structured Butterfly ARchitecture (RECBAR), which is a modified version of the CBAR. The RECBAR possesses all the desirable features of the CBAR, with the number of I/O ports per PE increased to six, and uses $O(\log_2(N)/N)$ overhead in PEs and approximately 50% overhead in links to achieve single-level fault tolerance. Reliability improvements of the RECBAR over the CBAR are studied. This paper also presents a distributed diagnostic and structuring algorithm for the RECBAR that enables the architecture to detect faults and structure itself accordingly within $2\log_2 L + 1$ time steps, thus making it a truly fault tolerant architecture.

  6. Real-time optimal torque control of fault-tolerant permanent magnet brushless machines

    NASA Astrophysics Data System (ADS)

    Max, L.; Wang, J.; Atallah, K.; Howe, D.

    2005-05-01

    The paper describes issues that are pertinent to control system hardware and software design for the real-time implementation of an optimal torque control strategy for fault-tolerant permanent magnet brushless ac drives, and reports experimental results. The influence of the current control loop bandwidth and pulse width modulation on the torque ripple are investigated and quantified.

  7. Fault Tolerant Magnetic Bearing for Turbomachinery

    NASA Technical Reports Server (NTRS)

    Choi, Benjamin; Provenza, Andrew

    2001-01-01

    NASA Glenn Research Center (GRC) has developed a Fault-Tolerant Magnetic Bearing Suspension rig to enhance the bearing system safety. It successfully demonstrated that using only two active poles out of eight redundant poles from each radial bearing (that is, simply 12 out of 16 poles dead) levitated the rotor and spun it without losing stability and desired position up to the maximum allowable speed of 20,000 rpm. In this paper, it is demonstrated that as long as the summation of force vectors of the attracting poles and rotor weight is zero, a fault-tolerant magnetic bearing system maintained the rotor at the desired position without losing stability even at the maximum rotor speed. A proportional-integral-derivative (PID) controller generated autonomous corrective actions with no operator's input for the fault situations without losing load capacity in terms of rotor position. This paper also deals with a centralized modal controller to better control the dynamic behavior over system modes.

  8. Software reliability through fault-avoidance and fault-tolerance

    NASA Technical Reports Server (NTRS)

    Vouk, Mladen A.; Mcallister, David F.

    1993-01-01

    Strategies and tools for the testing, risk assessment and risk control of dependable software-based systems were developed. Part of this project consists of studies to enable the transfer of technology to industry, for example the risk management techniques for safety-conscious systems. Theoretical investigations of Boolean and Relational Operator (BRO) testing strategy were conducted for condition-based testing. The Basic Graph Generation and Analysis tool (BGG) was extended to fully incorporate several variants of the BRO metric. Single- and multi-phase risk, coverage and time-based models are being developed to provide additional theoretical and empirical basis for estimation of the reliability and availability of large, highly dependable software. A model for software process and risk management was developed. The use of cause-effect graphing for software specification and validation was investigated. Lastly, advanced software fault-tolerance models were studied to provide alternatives and improvements in situations where simple software fault-tolerance strategies break down.

  9. Trends in reliability modeling technology for fault tolerant systems

    NASA Technical Reports Server (NTRS)

    Bavuso, S. J.

    1979-01-01

    Reliability modeling for fault tolerant avionic computing systems was developed. The modeling of large systems involving issues of state size and complexity, fault coverage, and practical computation was discussed. A novel technique which provides the tool for studying the reliability of systems with nonconstant failure rates is presented. The fault latency which may provide a method of obtaining vital latent fault data is measured.

  10. Fault tolerant architecture for artificial olfactory system

    NASA Astrophysics Data System (ADS)

    Lotfivand, Nasser; Nizar Hamidon, Mohd; Abdolzadeh, Vida

    2015-05-01

    In this paper, to cover and mask the faults that occur in the sensing unit of an artificial olfactory system, a novel architecture is offered. The proposed architecture is able to tolerate failures in the sensors of the array and the faults that occur are masked. The proposed architecture for extracting the correct results from the output of the sensors can provide the quality of service for generated data from the sensor array. The results of various evaluations and analysis proved that the proposed architecture has acceptable performance in comparison with the classic form of the sensor array in gas identification. According to the results, achieving a high odor discrimination based on the suggested architecture is possible.

  11. Fault-tolerant software for the FTMP

    NASA Technical Reports Server (NTRS)

    Hecht, H.; Hecht, M.

    1984-01-01

    The work reported here provides protection against software failures in the task dispatcher of the FTMP, a particularly critical portion of the system software. Faults in other system modules and application programs can be handled by similar techniques but are not covered in this effort. Goals of the work reported here are: (1) to develop provisions in the software design that will detect and mitigate software failures in the dispatcher portion of the FTMP Executive and (2) to propose the implementation of specific software reliability measures in other parts of the system. Beyond the specific support to the FTMP project, the work reported here represents a considerable advance in the practical application of the recovery block methodology for fault tolerant software design.

  12. Fault-Tolerant Coding for State Machines

    NASA Technical Reports Server (NTRS)

    Naegle, Stephanie Taft; Burke, Gary; Newell, Michael

    2008-01-01

    Two reliable fault-tolerant coding schemes have been proposed for state machines that are used in field-programmable gate arrays and application-specific integrated circuits to implement sequential logic functions. The schemes apply to strings of bits in state registers, which are typically implemented in practice as assemblies of flip-flop circuits. If a single-event upset (SEU, a radiation-induced change in the bit in one flip-flop) occurs in a state register, the state machine that contains the register could go into an erroneous state or could hang, by which is meant that the machine could remain in undefined states indefinitely. The proposed fault-tolerant coding schemes are intended to prevent the state machine from going into an erroneous or hang state when an SEU occurs. To ensure reliability of the state machine, the coding scheme for bits in the state register must satisfy the following criteria: 1. All possible states are defined. 2. An SEU brings the state machine to a known state. 3. There is no possibility of a hang state. 4. No false state is entered. 5. An SEU exerts no effect on the state machine. Fault-tolerant coding schemes that have been commonly used include binary encoding and "one-hot" encoding. Binary encoding is the simplest state machine encoding and satisfies criteria 1 through 3 if all possible states are defined. Binary encoding is a binary count of the state machine number in sequence; the table represents an eight-state example. In one-hot encoding, N bits are used to represent N states: All except one of the bits in a string are 0, and the position of the 1 in the string represents the state. With proper circuit design, one-hot encoding can satisfy criteria 1 through 4. Unfortunately, the requirement to use N bits to represent N states makes one-hot coding inefficient.
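
    The contrast between binary and one-hot encoding under a single-event upset can be made concrete with a small sketch. The helper names and the choice of state 0 as the known recovery state are assumptions for illustration; the proposed coding schemes themselves are not reproduced here.

```python
def one_hot(state: int, n_states: int) -> int:
    """Encode state k as an N-bit word with only bit k set."""
    if not 0 <= state < n_states:
        raise ValueError("state out of range")
    return 1 << state


def is_valid_one_hot(word: int, n_states: int) -> bool:
    """A one-hot word is valid when exactly one of its N bits is 1."""
    return 0 < word < (1 << n_states) and (word & (word - 1)) == 0


def flip_bit(word: int, bit: int) -> int:
    """Model an SEU as the inversion of a single register bit."""
    return word ^ (1 << bit)


def recover(word: int, n_states: int, safe_state: int = 0) -> int:
    """Return the word if valid, else force a known state (assumed: state 0)."""
    return word if is_valid_one_hot(word, n_states) else one_hot(safe_state, n_states)


N_STATES = 8

# Every single-bit upset of a one-hot word gives zero or two set bits,
# so the upset is always detectable and the machine can go to a known state.
all_detected = all(
    not is_valid_one_hot(flip_bit(one_hot(s, N_STATES), b), N_STATES)
    for s in range(N_STATES)
    for b in range(N_STATES)
)

# With 3-bit binary encoding of eight states, an upset lands on another
# defined state, so the machine silently enters a false state.
false_state = flip_bit(0b011, 2)
```

    The price of that detectability is the one the abstract notes: eight register bits for eight states instead of three.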

  13. FPGA-Based, Self-Checking, Fault-Tolerant Computers

    NASA Technical Reports Server (NTRS)

    Some, Raphael; Rennels, David

    2004-01-01

    A proposed computer architecture would exploit the capabilities of commercially available field-programmable gate arrays (FPGAs) to enable computers to detect and recover from bit errors. The main purpose of the proposed architecture is to enable fault-tolerant computing in the presence of single-event upsets (SEUs). [An SEU is a spurious bit flip (also called a soft error) caused by a single impact of ionizing radiation.] The architecture would also enable recovery from some soft errors caused by electrical transients and, to some extent, from intermittent and permanent (hard) errors caused by aging of electronic components. A typical FPGA of the current generation contains one or more complete processor cores, memories, and high-speed serial input/output (I/O) channels, making it possible to shrink a board-level processor node to a single integrated-circuit chip. Custom, highly efficient microcontrollers, general-purpose computers, custom I/O processors, and signal processors can be rapidly and efficiently implemented by use of FPGAs. Unfortunately, FPGAs are susceptible to SEUs. Prior efforts to mitigate the effects of SEUs have yielded solutions that degrade performance of the system and require support from external hardware and software. In comparison with other fault-tolerant-computing architectures (e.g., triple modular redundancy), the proposed architecture could be implemented with less circuitry and lower power demand. Moreover, the fault-tolerant computing functions would require only minimal support from circuitry outside the central processing units (CPUs) of computers, would not require any software support, and would be largely transparent to software and to other computer hardware. There would be two types of modules: a self-checking processor module and a memory system (see figure). The self-checking processor module would be implemented on a single FPGA and would be capable of detecting its own internal errors. It would contain two CPUs executing

  14. Software fault tolerance using data diversity

    NASA Technical Reports Server (NTRS)

    Knight, John C.

    1991-01-01

    Research on data diversity is discussed. Data diversity relies on a different form of redundancy from existing approaches to software fault tolerance and is substantially less expensive to implement. Data diversity can also be applied to software testing and greatly facilitates the automation of testing. Up to now it has been explored both theoretically and in a pilot study, and has been shown to be a promising technique. The effectiveness of data diversity as an error detection mechanism and the application of data diversity to differential equation solvers are discussed.

  15. Development and analysis of the Software Implemented Fault-Tolerance (SIFT) computer

    NASA Technical Reports Server (NTRS)

    Goldberg, J.; Kautz, W. H.; Melliar-Smith, P. M.; Green, M. W.; Levitt, K. N.; Schwartz, R. L.; Weinstock, C. B.

    1984-01-01

    SIFT (Software Implemented Fault Tolerance) is an experimental, fault-tolerant computer system designed to meet the extreme reliability requirements for safety-critical functions in advanced aircraft. Errors are masked by performing a majority voting operation over the results of identical computations, and faulty processors are removed from service by reassigning computations to the nonfaulty processors. This scheme has been implemented in a special architecture using a set of standard Bendix BDX930 processors, augmented by a special asynchronous-broadcast communication interface that provides direct, processor to processor communication among all processors. Fault isolation is accomplished in hardware; all other fault-tolerance functions, together with scheduling and synchronization are implemented exclusively by executive system software. The system reliability is predicted by a Markov model. Mathematical consistency of the system software with respect to the reliability model has been partially verified, using recently developed tools for machine-aided proof of program correctness.
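
    The masking-and-removal idea described above can be sketched in software. This is not SIFT's executive code; the processor names, the dictionary interface, and the strict-majority rule are assumptions chosen to keep the example small.

```python
from collections import Counter
from typing import Dict, Hashable, List, Set, Tuple


def majority_vote(results: Dict[str, Hashable]) -> Tuple[Hashable, Set[str]]:
    """Vote over results of identical computations, keyed by processor name.

    Returns the majority value and the set of processors that disagreed.
    Raises RuntimeError when no strict majority exists.
    """
    counts = Counter(results.values())
    value, votes = counts.most_common(1)[0]
    if votes * 2 <= len(results):
        raise RuntimeError("no majority among processor results")
    dissenters = {name for name, r in results.items() if r != value}
    return value, dissenters


def remove_faulty(active: List[str], faulty: Set[str]) -> List[str]:
    """Drop disagreeing processors so later computations use the rest."""
    return [name for name in active if name not in faulty]


active = ["p1", "p2", "p3"]
value, faulty = majority_vote({"p1": 42, "p2": 42, "p3": 7})
active = remove_faulty(active, faulty)
```

    After the vote the erroneous result from p3 is masked and p3 is no longer in the active set, so subsequent computations would be assigned only to p1 and p2.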

  16. Method and system for environmentally adaptive fault tolerant computing

    NASA Technical Reports Server (NTRS)

    Copenhaver, Jason L. (Inventor); Jeremy, Ramos (Inventor); Wolfe, Jeffrey M. (Inventor); Brenner, Dean (Inventor)

    2010-01-01

    A method and system for adapting fault tolerant computing. The method includes the steps of measuring an environmental condition representative of an environment. An on-board processing system's sensitivity to the measured environmental condition is measured. It is determined whether to reconfigure a fault tolerance of the on-board processing system based in part on the measured environmental condition. The fault tolerance of the on-board processing system may be reconfigured based in part on the measured environmental condition.

  17. Architectural issues in fault-tolerant, secure computing systems

    SciTech Connect

    Joseph, M.K.

    1988-01-01

    This dissertation explores several facets of the applicability of fault-tolerance techniques to secure computer design, these being: (1) how fault-tolerance techniques can be used on unsolved problems in computer security (e.g., computer viruses, and denial-of-service); (2) how fault-tolerance techniques can be used to support classical computer-security mechanisms in the presence of accidental and deliberate faults; and (3) the problems involved in designing a fault-tolerant, secure computer system (e.g., how computer security can degrade along with both the computational and fault-tolerance capabilities of a computer system). The approach taken in this research is almost as important as its results. It is different from current computer-security research in that a design paradigm for fault-tolerant computer design is used. This led to an extensive fault and error classification of many typical security threats. Throughout this work, a fault-tolerance perspective is taken. However, the author did not ignore basic computer-security technology. For some problems he investigated how to support and extend basic-security mechanism (e.g., trusted computing base), instead of trying to achieve the same result with purely fault-tolerance techniques.

  18. Design and validation of fault-tolerant flight systems

    NASA Technical Reports Server (NTRS)

    Finelli, George B.; Palumbo, Daniel L.

    1987-01-01

    NASA has undertaken the development of a methodology for the design of easily validated fault-tolerant systems which emphasizes validation processes that can be directly incorporated into the design process. Attention is presently given to the statistical issues arising in the validation of highly reliable fault-tolerant systems. Structured specification and design methodologies, mathematical proof techniques, analytical modeling, simulation/emulation, and physical testing, are all discussed. Important design factors associated with fault-tolerance are noted; synchronization and 'Byzantine resilience' must accompany fault tolerance.

  19. A fault-tolerant network architecture for integrated avionics

    NASA Technical Reports Server (NTRS)

    Butler, Bryan; Adams, Stuart

    1991-01-01

    The Army Fault-Tolerant Architecture (AFTA) under construction at the Charles Stark Draper Laboratory is an example of a highly integrated critical avionics system. The AFTA system must connect to other redundant and nonredundant systems, as well as to input/output devices. A fault-tolerant data bus (FTDB) is being developed to provide highly reliable communication between the AFTA computer and other network stations. The FTDB is being designed for Byzantine resilience and is probably capable of tolerating any single arbitrary fault. The author describes a prototype architecture for the fault-tolerant data bus.

  20. Fault tolerant architectures for integrated aircraft electronics systems

    NASA Technical Reports Server (NTRS)

    Levitt, K. N.; Melliar-Smith, P. M.; Schwartz, R. L.

    1983-01-01

    Work into possible architectures for future flight control computer systems is described. Ada for Fault-Tolerant Systems, the NETS Network Error-Tolerant System architecture, and voting in asynchronous systems are covered.

  1. Study on fault-tolerant processors for advanced launch system

    NASA Technical Reports Server (NTRS)

    Shin, Kang G.; Liu, Jyh-Charn

    1990-01-01

    Issues related to the reliability of a redundant system with large main memory are addressed. The Fault-Tolerant Processor (FTP) for the Advanced Launch System (ALS) is used as a basis for the presentation. When the system is free of latent faults, the probability of system crash due to multiple channel faults is shown to be insignificant even when voting on the outputs of computing channels is infrequent. Using channel error maskers (CEMs) is shown to improve reliability more effectively than increasing redundancy or the number of channels for applications with long mission times. Even without using a voter, most memory errors can be immediately corrected by those CEMs implemented with conventional coding techniques. In addition to their ability to enhance system reliability, CEMs (with a very low hardware overhead) can be used to dramatically reduce not only the need of memory realignment, but also the time required to realign channel memories in case, albeit rare, such a need arises. Using CEMs, two different schemes were developed to solve the memory realignment problem. In both schemes, most errors are corrected by CEMs, and the remaining errors are masked by a voter.

  2. MIL-M-38510/470 test vectors: Fault detection efficiency measurement via hardware fault simulation. [RCA 1802 microprocessor]

    NASA Technical Reports Server (NTRS)

    Timoc, C. C.

    1980-01-01

    The stuck fault detection efficiency of the test vectors developed for the MIL-M-38510/470 NASA was measured using a hardware stuck fault simulator for the 1802 microprocessor. Thirty-nine stuck faults were not detected out of a total of 874 injected into the combinatorial and sequential parts of the microprocessor. Since undetected faults can create catastrophic errors in equipment designed for high reliability applications, it is recommended that the MIL-M-38510/470 NASA be enhanced with additional test vectors so as to achieve 100% stuck fault detection efficiency.
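
    As a worked check of the figures quoted above (the symbol \eta for detection efficiency is introduced here for illustration and does not appear in the record), 39 undetected faults out of 874 injected correspond to:

```latex
\eta = \frac{874 - 39}{874} = \frac{835}{874} \approx 0.955 \quad (\text{about } 95.5\%)
```

    The recommended additional test vectors would therefore need to cover the remaining 4.5% or so of injected stuck faults.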

  3. Rule-based fault diagnosis of hall sensors and fault-tolerant control of PMSM

    NASA Astrophysics Data System (ADS)

    Song, Ziyou; Li, Jianqiu; Ouyang, Minggao; Gu, Jing; Feng, Xuning; Lu, Dongbin

    2013-07-01

    Hall sensors are widely used for estimating the rotor phase of permanent magnet synchronous motors (PMSM). Rotor position is an essential parameter of the PMSM control algorithm, so Hall sensor faults are very dangerous. However, there is scarcely any research focusing on fault diagnosis and fault-tolerant control of Hall sensors used in PMSM. From this standpoint, the Hall sensor faults which may occur during PMSM operation are theoretically analyzed. According to the analysis results, a fault diagnosis algorithm for Hall sensors, which is based on three rules, is proposed to classify the fault phenomena accurately. Rotor phase estimation algorithms based on one or two Hall sensor(s) are used to build the fault-tolerant control algorithm. The fault diagnosis algorithm can detect 60 Hall fault phenomena in total, and all detections can be completed within 1/138 of a rotor rotation period. The fault-tolerant control algorithm can achieve smooth torque production, which means the same control effect as the normal control mode (with three Hall sensors). Finally, a PMSM bench test verifies the accuracy and rapidity of the fault diagnosis and fault-tolerant control strategies. The fault diagnosis algorithm can detect all Hall sensor faults promptly, and the fault-tolerant control algorithm allows the PMSM to tolerate failure of one or two Hall sensor(s). In addition, the transitions between healthy-control and fault-tolerant control conditions are smooth, without any additional noise or harshness. The proposed algorithms can deal with the Hall sensor faults of PMSM in real applications, and can be used to realize the fault diagnosis and fault-tolerant control of PMSM.

  4. Award ER25750: Coordinated Infrastructure for Fault Tolerance Systems Indiana University Final Report

    SciTech Connect

    Lumsdaine, Andrew

    2013-03-08

    The main purpose of the Coordinated Infrastructure for Fault Tolerance in Systems initiative has been to conduct research with a goal of providing end-to-end fault tolerance on a systemwide basis for applications and other system software. While fault tolerance has been an integral part of most high-performance computing (HPC) system software developed over the past decade, it has been treated mostly as a collection of isolated stovepipes. Visibility and response to faults has typically been limited to the particular hardware and software subsystems in which they are initially observed. Little fault information is shared across subsystems, allowing little flexibility or control on a system-wide basis, making it practically impossible to provide cohesive end-to-end fault tolerance in support of scientific applications. As an example, consider faults such as communication link failures that can be seen by a network library but are not directly visible to the job scheduler, or consider faults related to node failures that can be detected by system monitoring software but are not inherently visible to the resource manager. If information about such faults could be shared by the network libraries or monitoring software, then other system software, such as a resource manager or job scheduler, could ensure that failed nodes or failed network links were excluded from further job allocations and that further diagnosis could be performed. As a founding member and one of the lead developers of the Open MPI project, our efforts over the course of this project have been focused on making Open MPI more robust to failures by supporting various fault tolerance techniques, and using fault information exchange and coordination between MPI and the HPC system software stack from the application, numeric libraries, and programming language runtime to other common system components such as job schedulers, resource managers, and monitoring tools.

  5. Exploiting data representation for fault tolerance

    SciTech Connect

    Hoemmen, Mark Frederick; Elliott, J.; Mueller, F.

    2015-01-06

    Incorrect computer hardware behavior may corrupt intermediate computations in numerical algorithms, possibly resulting in incorrect answers. Prior work models misbehaving hardware by randomly flipping bits in memory. We start by accepting this premise, and present an analytic model for the error introduced by a bit flip in an IEEE 754 floating-point number. We then relate this finding to the linear algebra concepts of normalization and matrix equilibration. In particular, we present a case study illustrating that normalizing both vector inputs of a dot product minimizes the probability of a single bit flip causing a large error in the dot product's result. Moreover, the absolute error is either less than one or very large, which allows detection of large errors. Then, we apply this to the GMRES iterative solver. We count all possible errors that can be introduced through faults in arithmetic in the computationally intensive orthogonalization phase of GMRES, and show that when the matrix is equilibrated, the absolute error is bounded above by one.
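
    The bit-flip model described above can be explored with a small sketch. It is only an illustration under stated assumptions: the helper names `flip_bit` and `flip_errors` are hypothetical, and the check is made for the single normalized value 0.5 rather than for the dot-product or GMRES settings analyzed in the record.

```python
import struct

def flip_bit(x, k):
    """Return x with bit k of its IEEE 754 binary64 encoding flipped.

    Bit 0 is the least significant mantissa bit, bit 63 is the sign bit.
    """
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << k)))
    return y

def flip_errors(x):
    """Absolute error caused by each of the 64 possible single-bit flips."""
    return [abs(flip_bit(x, k) - x) for k in range(64)]

# Illustrative value with |x| <= 1 (an assumption of this sketch).
errors = flip_errors(0.5)
small = [e for e in errors if e <= 1.0]
large = [e for e in errors if e > 1.0]
```

    For this value the errors split into two well-separated groups: flips of the sign, mantissa, and low exponent bits change the value by at most 1, while a flip of the most significant exponent bit produces an enormous value that a simple magnitude check could flag.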

  6. Exploiting data representation for fault tolerance

    DOE PAGESBeta

    Hoemmen, Mark Frederick; Elliott, J.; Sandia National Lab.; Mueller, F.

    2015-01-06

    Incorrect computer hardware behavior may corrupt intermediate computations in numerical algorithms, possibly resulting in incorrect answers. Prior work models misbehaving hardware by randomly flipping bits in memory. We start by accepting this premise, and present an analytic model for the error introduced by a bit flip in an IEEE 754 floating-point number. We then relate this finding to the linear algebra concepts of normalization and matrix equilibration. In particular, we present a case study illustrating that normalizing both vector inputs of a dot product minimizes the probability of a single bit flip causing a large error in the dot product's result. Moreover, the absolute error is either less than one or very large, which allows detection of large errors. Then, we apply this to the GMRES iterative solver. We count all possible errors that can be introduced through faults in arithmetic in the computationally intensive orthogonalization phase of GMRES, and show that when the matrix is equilibrated, the absolute error is bounded above by one.

  7. Parameter Transient Behavior Analysis on Fault Tolerant Control System

    NASA Technical Reports Server (NTRS)

    Belcastro, Christine (Technical Monitor); Shin, Jong-Yeob

    2003-01-01

    In a fault tolerant control (FTC) system, a parameter varying FTC law is reconfigured based on fault parameters estimated by fault detection and isolation (FDI) modules. FDI modules require some time to detect fault occurrences in aero-vehicle dynamics. This paper illustrates analysis of an FTC system based on estimated fault parameter transient behavior which may include false fault detections during a short time interval. Using Lyapunov function analysis, the upper bound of an induced-L2 norm of the FTC system performance is calculated as a function of a fault detection time and the exponential decay rate of the Lyapunov function.

  8. Analysis of typical fault-tolerant architectures using HARP

    NASA Technical Reports Server (NTRS)

    Bavuso, Salvatore J.; Bechta Dugan, Joanne; Trivedi, Kishor S.; Rothmann, Elizabeth M.; Smith, W. Earl

    1987-01-01

    Difficulties encountered in the modeling of fault-tolerant systems are discussed. The Hybrid Automated Reliability Predictor (HARP) approach to modeling fault-tolerant systems is described. The HARP is written in FORTRAN, consists of nearly 30,000 lines of code and comments, and is based on behavioral decomposition. Using the behavioral decomposition, the dependability model is divided into fault-occurrence/repair and fault/error-handling models; the characteristics and combining of these two models are examined. Examples in which the HARP is applied to the modeling of some typical fault-tolerant systems, including a local-area network, two fault-tolerant computer systems, and a flight control system, are presented.

  9. Abstractions for Fault-Tolerant Distributed System Verification

    NASA Technical Reports Server (NTRS)

    Pike, Lee S.; Maddalon, Jeffrey M.; Miner, Paul S.; Geser, Alfons

    2004-01-01

    Four kinds of abstraction for the design and analysis of fault tolerant distributed systems are discussed. These abstractions concern system messages, faults, fault masking voting, and communication. The abstractions are formalized in higher order logic, and are intended to facilitate specifying and verifying such systems in higher order theorem provers.

  10. Method and apparatus for fault tolerance

    NASA Technical Reports Server (NTRS)

    Masson, Gerald M. (Inventor); Sullivan, Gregory F. (Inventor)

    1993-01-01

    A method and apparatus for achieving fault tolerance in a computer system having at least a first central processing unit and a second central processing unit. The method comprises the steps of first executing a first algorithm in the first central processing unit on input which produces a first output as well as a certification trail. Next, executing a second algorithm in the second central processing unit on the input and on at least a portion of the certification trail which produces a second output. The second algorithm has a faster execution time than the first algorithm for a given input. Then, comparing the first and second outputs such that an error result is produced if the first and second outputs are not the same. The step of executing a first algorithm and the step of executing a second algorithm preferably take place over essentially the same time period.
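
    The steps of the method can be sketched with sorting as the example problem. Sorting is an assumption chosen for illustration, not something stated in the record, and the function names are hypothetical. The first algorithm sorts and emits the sorting permutation as its certification trail; the second uses the trail to rebuild the output in linear time while checking that the trail is valid.

```python
def first_algorithm(data):
    """Slow path: sort, and emit the sorting permutation as the trail."""
    trail = sorted(range(len(data)), key=data.__getitem__)
    output = [data[i] for i in trail]
    return output, trail

def second_algorithm(data, trail):
    """Fast path: rebuild the output from the trail in linear time.

    The trail is not trusted: it must be a permutation of the indices
    and must yield a non-decreasing sequence, otherwise an error is raised.
    """
    n = len(data)
    if len(trail) != n:
        raise ValueError("trail has the wrong length")
    seen = [False] * n
    output = []
    for i in trail:
        if not isinstance(i, int) or not 0 <= i < n or seen[i]:
            raise ValueError("trail is not a permutation")
        seen[i] = True
        output.append(data[i])
    for a, b in zip(output, output[1:]):
        if a > b:
            raise ValueError("trail does not sort the input")
    return output

def run_with_certification(data):
    """Execute both algorithms and compare their outputs."""
    out1, trail = first_algorithm(data)
    out2 = second_algorithm(data, trail)
    if out1 != out2:
        raise RuntimeError("outputs differ: fault detected")
    return out1
```

    In this sketch the two algorithms run one after the other in a single process; the record describes them running on separate central processing units.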

  11. Fault Tolerance and Parallel Processing for NGST

    NASA Astrophysics Data System (ADS)

    Sengupta, R.; Offenberg, J. D.; Fixsen, D. J.; Nieto-Santisteban, M. A.; Hanisch, R. J.; Stockman, H. S.; Mather, J. C.

    1999-12-01

    The Next Generation Space Telescope (NGST) Image Processing Group is developing scalable cosmic ray rejection and data compression algorithms for parallel processors as part of NASA's Remote Exploration and Experimentation (REE) Project. The primary intention of the REE project is to use commercial-off-the-shelf (COTS) technology to develop scalable, low-power, fault tolerant, high performance computers in space. NGST is one of the applications selected to demonstrate the benefit of having on-board supercomputing power. Real-time cosmic ray rejection would enable us to reduce the downlink data volume by as much as two orders of magnitude by combining multiple read-outs on the spacecraft rather than downlinking them separately. The combined read-outs can be further reduced in size by applying lossy and/or lossless data compression algorithms. This work is funded by NASA's REE project, managed by JPL.

  12. Quantum fault-tolerant thresholds for universal concatenated schemes

    NASA Astrophysics Data System (ADS)

    Chamberland, Christopher; Jochym-O'Connor, Tomas; Laflamme, Raymond

    Fault-tolerant quantum computation uses ancillary qubits in order to protect logical data qubits while allowing for the manipulation of the quantum information without severe losses in coherence. While different models for fault-tolerant quantum computation exist, determining the ancillary qubit overhead for competing schemes remains a challenging theoretical problem. In this work, we study the fault-tolerance threshold rates of different models for universal fault-tolerant quantum computation. Namely, we provide different threshold rates for the 105-qubit concatenated coding scheme for universal computation without the need for state distillation. We study two error models: adversarial noise and depolarizing noise and provide lower bounds for the threshold in each of these error regimes. Establishing the threshold rates for the concatenated coding scheme will allow for a physical quantum resource comparison between our fault-tolerant universal quantum computation model and the traditional model using magic state distillation.

  13. FTMP (Fault Tolerant Multiprocessor) programmer's manual

    NASA Technical Reports Server (NTRS)

    Feather, F. E.; Liceaga, C. A.; Padilla, P. A.

    1986-01-01

    The Fault Tolerant Multiprocessor (FTMP) computer system was constructed using the Rockwell/Collins CAPS-6 processor. It is installed in the Avionics Integration Research Laboratory (AIRLAB) of NASA Langley Research Center. It is hosted by AIRLAB's System 10, a VAX 11/750, for the loading of programs and experimentation. The FTMP support software includes a cross compiler for a high level language called Automated Engineering Design (AED) System, an assembler for the CAPS-6 processor assembly language, and a linker. Access to this support software is through an automated remote access facility on the VAX which relieves the user of the burden of learning how to use the IBM 4381. This manual is a compilation of information about the FTMP support environment. It explains the FTMP software and support environment along with many of the finer points of running programs on FTMP. This will be helpful to the researcher trying to run an experiment on FTMP and even to the person probing FTMP with fault injections. Much of the information in this manual can be found in other sources; we are only attempting to bring together the basic points in a single source. If the reader should need points clarified, there is a list of support documentation in the back of this manual.

  14. Machine-checked proofs of the design and implementation of a fault-tolerant circuit

    NASA Technical Reports Server (NTRS)

    Bevier, William R.; Young, William D.

    1990-01-01

    A formally verified implementation of the 'oral messages' algorithm of Pease, Shostak, and Lamport is described. An abstract implementation of the algorithm is verified to achieve interactive consistency in the presence of faults. This abstract characterization is then mapped down to a hardware level implementation which inherits the fault-tolerant characteristics of the abstract version. All steps in the proof were checked with the Boyer-Moore theorem prover. A significant result is the demonstration of a fault-tolerant device that is formally specified and whose implementation is proved correct with respect to this specification. A significant simplifying assumption is that the redundant processors behave synchronously. A mechanically checked proof that the oral messages algorithm is 'optimal' in the sense that no algorithm which achieves agreement via similar message passing can tolerate a larger proportion of faulty processors is also described.
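
    For orientation, the one-round case of the oral messages algorithm, usually written OM(1), can be sketched as follows. This is a simplified software illustration, not the verified hardware design from the record: the names `om1`, `majority`, `flip`, and `split`, the ATTACK/RETREAT values, and the tie-breaking default are all assumptions of the sketch. It also illustrates the well-known bound that tolerating m faulty processors with oral messages requires more than 3m processors.

```python
from collections import Counter

def majority(values, default="RETREAT"):
    """Majority value, or the default when the top two counts tie."""
    counts = Counter(values).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return default
    return counts[0][0]

def om1(n, commander_value, traitors, lie):
    """One-round oral messages with processor 0 as commander.

    lie(sender, receiver, value) gives what a traitorous sender transmits.
    Returns the value decided by each lieutenant 1..n-1.
    """
    def send(sender, receiver, value):
        return lie(sender, receiver, value) if sender in traitors else value

    lieutenants = range(1, n)
    # Round 1: the commander sends its value to every lieutenant.
    received = {i: send(0, i, commander_value) for i in lieutenants}
    # Round 2: each lieutenant relays what it received to the others,
    # then decides by majority over its own value and the relayed ones.
    decisions = {}
    for i in lieutenants:
        votes = [received[i]]
        votes += [send(j, i, received[j]) for j in lieutenants if j != i]
        decisions[i] = majority(votes)
    return decisions

def flip(sender, receiver, value):
    """Traitor behavior: always relay the opposite order."""
    return "RETREAT" if value == "ATTACK" else "ATTACK"

def split(sender, receiver, value):
    """Traitor behavior: tell odd and even receivers different things."""
    return "ATTACK" if receiver % 2 else "RETREAT"
```

    With four processors and one traitor, the loyal lieutenants agree, and they follow the commander when the commander is loyal. With only three processors, a single lying lieutenant is enough to make the loyal lieutenant abandon a loyal commander's order.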

  15. A second generation experiment in fault-tolerant software

    NASA Technical Reports Server (NTRS)

    Knight, J. C.

    1986-01-01

    The primary goal was to determine whether the application of fault tolerance to software increases its reliability if the cost of production is the same as for an equivalent non-fault-tolerant version derived from the same requirements specification. Software development protocols are discussed. The feasibility of adapting to software design fault tolerance the technique of N-fold Modular Redundancy with majority voting was studied.

  16. Focused fault injection testing of software implemented fault tolerance mechanisms of Voltan TMR nodes

    NASA Astrophysics Data System (ADS)

    Tao, S.; Ezhilchelvan, P. D.; Shrivastava, S. K.

    1995-03-01

    One way of gaining confidence in the adequacy of fault tolerance mechanisms of a system is to test the system by injecting faults and seeing how the system performs under faulty conditions. This paper presents an application of the focused fault injection method that has been developed for testing software implemented fault tolerance mechanisms of distributed systems. The method exploits the object oriented approach of software implementation to support the injection of specific classes of faults. With the focused fault injection method, the system tester is able to inject specific classes of faults (including malicious ones) such that the fault tolerance mechanisms of a target system can be tested adequately. The method has been applied to test the design and implementation of voting, clock synchronization, and ordering modules of the Voltan TMR (triple modular redundant) node. The tests performed uncovered three flaws in the system software.

  17. The IEEE eighteenth international symposium on fault-tolerant computing (Digest of Papers)

    SciTech Connect

    Not Available

    1988-01-01

    These proceedings collect papers on fault detection and computers. Topics include: software failure behavior, fault tolerant distributed programs, parallel simulation of faults, concurrent built-in self-test techniques, fault-tolerant parallel processor architectures, probabilistic fault diagnosis, fault tolerances in hypercube processors and cellular automation modeling.

  18. Fault-tolerant interconnection network and image-processing applications for the PASM parallel processing system

    SciTech Connect

    Adams, G.B. III

    1984-01-01

    The demand for very high speed data processing coupled with falling hardware costs has made large-scale parallel and distributed computer systems both desirable and feasible. Two modes of parallel processing are single instruction stream-multiple data stream (SIMD) and multiple instruction stream-multiple data stream (MIMD). PASM, a partitionable SIMD/MIMD system, is a reconfigurable multimicroprocessor system being designed for image processing and pattern recognition. An important component of these systems is the interconnection network, the mechanism for communication among the computation nodes and memories. Assuring high reliability for such complex systems is a significant task. Thus, a crucial practical aspect of an interconnection network is fault tolerance. In answer to this need, the Extra Stage Cube (ESC), a fault-tolerant, multistage cube-type interconnection network, is defined. The fault tolerance of the ESC is explored for both single and multiple faults, routing tags are defined, and consideration is given to permuting data and partitioning the ESC in the presence of faults. The ESC is compared with other fault-tolerant multistage networks. Finally, reliability of the ESC and an enhanced version of it are investigated.

  19. Fault tolerant architectures for integrated aircraft electronics systems, task 2

    NASA Technical Reports Server (NTRS)

    Levitt, K. N.; Melliar-Smith, P. M.; Schwartz, R. L.

    1984-01-01

    The architectural basis for an advanced fault tolerant on-board computer to succeed the current generation of fault tolerant computers is examined. The network error tolerant system architecture is studied with particular attention to intercluster configurations and communication protocols, and to refined reliability estimates. The diagnosis of faults, so that appropriate choices for reconfiguration can be made, is discussed. The analysis relates particularly to the recognition of transient faults in a system with tasks at many levels of priority. The demand driven data-flow architecture, which appears to have possible application in fault tolerant systems, is described, and work investigating the feasibility of automatic generation of aircraft flight control programs from abstract specifications is reported.

  20. Application-Specific Fault Tolerance via Data Access Characterization

    SciTech Connect

    Ali, Nawab; Krishnamoorthy, Sriram; Govind, Niranjan; Kowalski, Karol; Sadayappan, Ponnuswamy

    2011-08-30

    Recent trends in semiconductor technology and supercomputer design predict an increasing probability of faults during an application's execution. Designing an application that is resilient to system failures requires careful evaluation of the impact of various approaches on preserving key application state. In this paper, we present our experiences in an ongoing effort to make a large computational chemistry application fault tolerant. We construct the data access signatures of key application modules to evaluate alternative fault tolerance approaches. We present the instrumentation methodology, characterization of the application modules, and evaluation of fault tolerance techniques using the information collected. The application signatures developed capture application characteristics not traditionally revealed by performance tools. We believe these can be used in the design and evaluation of runtimes beyond fault tolerance.

  1. Steps toward fault-tolerant quantum chemistry.

    SciTech Connect

    Taube, Andrew Garvin

    2010-05-01

    Developing quantum chemistry programs on the coming generation of exascale computers will be a difficult task. The programs will need to be fault-tolerant and minimize the use of global operations. This work explores the use of a task-based model that uses a data-centric approach to allocate work to different processes as it applies to quantum chemistry. After introducing the key problems that appear when trying to parallelize a complicated quantum chemistry method such as coupled-cluster theory, we discuss the implications of that model as it pertains to the computational kernel of a coupled-cluster program - matrix multiplication. Also, we discuss the extensions that would be required to build a full coupled-cluster program using the task-based model. Current programming models for high-performance computing are fault-intolerant and use global operations. Those properties are unsustainable as computers scale to millions of CPUs; instead one must recognize that these systems will be hierarchical in structure, prone to constant faults, and global operations will be infeasible. The FAST-OS HARE project is introducing a scale-free computing model to address these issues. This model is hierarchical and fault-tolerant by design, allows for the clean overlap of computation and communication, reducing the network load, does not require checkpointing, and avoids the complexity of many HPC runtimes. Development of an algorithm within this model requires a change in focus from imperative programming to a data-centric approach. Quantum chemistry (QC) algorithms, in particular electronic structure methods, are an ideal test bed for this computing model. These methods describe the distribution of electrons in a molecule, which determine the properties of the molecule. The computational cost of these methods is high, scaling quartically or higher in the size of the molecule, which is why QC applications are major users of HPC resources. The complexity of these algorithms means that

  2. Predeployment validation of fault-tolerant systems through software-implemented fault insertion

    NASA Technical Reports Server (NTRS)

    Czeck, Edward W.; Siewiorek, Daniel P.; Segall, Zary Z.

    1989-01-01

    The fault injection-based automated testing (FIAT) environment, which can be used to experimentally characterize and evaluate distributed realtime systems under fault-free and faulted conditions, is described. A survey is presented of validation methodologies. The need for fault insertion based on validation methodologies is demonstrated. The origins and models of faults, and motivation for the FIAT concept are reviewed. FIAT employs a validation methodology which builds confidence in the system through first providing a baseline of fault-free performance data and then characterizing the behavior of the system with faults present. Fault insertion is accomplished through software and allows faults or the manifestation of faults to be inserted by either seeding faults into memory or triggering error detection mechanisms. FIAT is capable of emulating a variety of fault-tolerant strategies and architectures, can monitor system activity, and can automatically orchestrate experiments involving insertion of faults. There is a common system interface which allows ease of use to decrease experiment development and run time. Fault models chosen for experiments on FIAT have generated system responses which parallel those observed in real systems under faulty conditions. These capabilities are shown by two example experiments each using a different fault-tolerance strategy.

  3. SIFT - Design and analysis of a fault-tolerant computer for aircraft control. [Software Implemented Fault Tolerant systems]

    NASA Technical Reports Server (NTRS)

    Wensley, J. H.; Lamport, L.; Goldberg, J.; Green, M. W.; Levitt, K. N.; Melliar-Smith, P. M.; Shostak, R. E.; Weinstock, C. B.

    1978-01-01

    SIFT (Software Implemented Fault Tolerance) is an ultrareliable computer for critical aircraft control applications that achieves fault tolerance by the replication of tasks among processing units. The main processing units are off-the-shelf minicomputers, with standard microcomputers serving as the interface to the I/O system. Fault isolation is achieved by using a specially designed redundant bus system to interconnect the processing units. Error detection and analysis and system reconfiguration are performed by software. Iterative tasks are redundantly executed, and the results of each iteration are voted upon before being used. Thus, any single failure in a processing unit or bus can be tolerated with triplication of tasks, and subsequent failures can be tolerated after reconfiguration. Independent execution by separate processors means that the processors need only be loosely synchronized, and a novel fault-tolerant synchronization method is described.
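
    The voting step described above can be illustrated with a minimal majority voter. This is a generic sketch, not SIFT's actual software; the function name `vote` and its return convention are hypothetical. With three replicas, any single wrong result is outvoted and the disagreeing replica is identified, which is the information a reconfiguration step would need.

```python
from collections import Counter

def vote(results):
    """Majority-vote over replica results.

    Returns the majority value and the indices of replicas that
    disagreed with it. Raises if no strict majority exists.
    """
    value, count = Counter(results).most_common(1)[0]
    if count <= len(results) // 2:
        raise RuntimeError("no majority among replicas")
    faulty = [i for i, r in enumerate(results) if r != value]
    return value, faulty
```

    For example, `vote([42, 42, 17])` returns the value 42 and flags replica 2, while three mutually different results raise an error because triplication cannot mask more than one failure at a time.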

  4. Fault-tolerance for exascale systems.

    SciTech Connect

    Riesen, Rolf E.; Varela, Maria Ruiz; Ferreira, Kurt Brian

    2010-08-01

    Periodic, coordinated, checkpointing to disk is the most prevalent fault tolerance method used in modern large-scale, capability class, high-performance computing (HPC) systems. Previous work has shown that as the system grows in size, the inherent synchronization of coordinated checkpoint/restart (CR) limits application scalability; at large node counts the application spends most of its time checkpointing instead of executing useful work. Furthermore, a single component failure forces an application restart from the last correct checkpoint. Suggested alternatives to coordinated CR include uncoordinated CR with message logging, redundant computation, and RAID-inspired, in-memory distributed checkpointing schemes. Each of these alternatives has differing overheads that are dependent on both the scale and communication characteristics of the application. In this work, using the Structural Simulation Toolkit (SST) simulator, we compare the performance characteristics of each of these resilience methods for a number of HPC application patterns on a number of proposed exascale machines. The result of this work provides valuable guidance on the most efficient resilience methods for exascale systems.
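
    The scaling problem with coordinated checkpointing can be illustrated with a rough analytic model. This sketch is not the SST-based study from the record: it assumes independent node failures (so system MTBF is node MTBF divided by node count), uses Young's first-order approximation for the checkpoint interval, and the function names and the numbers in the usage note are hypothetical.

```python
import math

def young_interval(checkpoint_cost_s, system_mtbf_s):
    """Young's first-order estimate of the best checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost_s * system_mtbf_s)

def waste_fraction(checkpoint_cost_s, node_mtbf_s, nodes):
    """Approximate fraction of time lost to checkpoints and rework.

    First-order model: time spent writing checkpoints (cost / interval)
    plus expected recomputation after a failure (interval / (2 * MTBF)).
    The result is capped at 1.0, where the model stops being meaningful.
    """
    system_mtbf = node_mtbf_s / nodes
    tau = young_interval(checkpoint_cost_s, system_mtbf)
    return min(1.0, checkpoint_cost_s / tau + tau / (2.0 * system_mtbf))
```

    With a 10-minute checkpoint and a 5-year node MTBF (both made-up inputs), this model puts the lost fraction near 9% at 1,000 nodes and near 87% at 100,000 nodes, consistent in spirit with the observation that large machines can spend most of their time checkpointing.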

  5. Fault tolerant filtering and fault detection for quantum systems driven by fields in single photon states

    NASA Astrophysics Data System (ADS)

    Gao, Qing; Dong, Daoyi; Petersen, Ian R.; Rabitz, Herschel

    2016-06-01

    The purpose of this paper is to solve the fault tolerant filtering and fault detection problem for a class of open quantum systems driven by a continuous-mode bosonic input field in single photon states when the systems are subject to stochastic faults. Optimal estimates of both the system observables and the fault process are simultaneously calculated and characterized by a set of coupled recursive quantum stochastic differential equations.

  6. Fault tolerance in a supercomputer through dynamic repartitioning

    DOEpatents

    Chen, Dong; Coteus, Paul W.; Gara, Alan G.; Takken, Todd E.

    2007-02-27

    A multiprocessor, parallel computer is made tolerant to hardware failures by providing extra groups of redundant standby processors and by designing the system so that these extra groups of processors can be swapped with any group which experiences a hardware failure. This swapping can be under software control, thereby permitting the entire computer to sustain a hardware failure but, after swapping in the standby processors, to still appear to software as a pristine, fully functioning system.

  7. RAMP: A fault tolerant distributed microcomputer structure for aircraft navigation and control

    NASA Technical Reports Server (NTRS)

    Dunn, W. R.

    1980-01-01

    RAMP consists of distributed sets of parallel computers partitioned on the basis of software and packaging constraints. To minimize hardware and software complexity, the processors operate asynchronously. It was shown that through the design of asymptotically stable control laws, data errors due to the asynchronism were minimized. It was further shown that by designing control laws with this property and making minor hardware modifications to the RAMP modules, the system became inherently tolerant to intermittent faults. A laboratory version of RAMP was constructed and is described in the paper along with the experimental results.

  8. Final Project Report. Scalable fault tolerance runtime technology for petascale computers

    SciTech Connect

    Krishnamoorthy, Sriram; Sadayappan, P

    2015-06-16

    With the massive number of components comprising the forthcoming petascale computer systems, hardware failures will be routinely encountered during execution of large-scale applications. Due to the multidisciplinary, multiresolution, and multiscale nature of scientific problems that drive the demand for high end systems, applications place increasingly differing demands on the system resources: disk, network, memory, and CPU. In addition to MPI, future applications are expected to use advanced programming models such as those developed under the DARPA HPCS program as well as existing global address space programming models such as Global Arrays, UPC, and Co-Array Fortran. While there has been a considerable amount of work in fault tolerant MPI with a number of strategies and extensions for fault tolerance proposed, virtually none of the advanced models proposed for emerging petascale systems is currently fault aware. To achieve fault tolerance, development of underlying runtime and OS technologies able to scale to petascale level is needed. This project has evaluated a range of runtime techniques for fault tolerance for advanced programming models.

  9. Programming fault-tolerant distributed systems in Ada

    NASA Technical Reports Server (NTRS)

    Voigt, Susan J.

    1985-01-01

    Viewgraphs on the topic of programming fault-tolerant distributed systems in the Ada programming language are presented. Topics covered include project goals, Ada difficulties and solutions, testbed requirements, and virtual processors.

  10. Fault-tolerant interconnection networks for multiprocessor systems

    SciTech Connect

    Nassar, H.M.

    1989-01-01

    Interconnection networks represent the backbone of multiprocessor systems. A failure in the network, therefore, could seriously degrade the system performance. For this reason, fault tolerance has been regarded as a major consideration in interconnection network design. This thesis presents two novel techniques to provide fault tolerance capabilities to three major networks: the Baseline network, the Benes network, and the Clos network. First, the Simple Fault Tolerance Technique (SFT) is presented. The SFT technique is in fact the result of merging two widely known interconnection mechanisms: a normal interconnection network and a shared bus. This technique is most suitable for networks with small switches, such as the Baseline network and the Benes network. For the Clos network, whose switches may be large for the SFT, another technique is developed to produce the Fault-Tolerant Clos (FTC) network. In the FTC, one switch is added to each stage. The two techniques are described and thoroughly analyzed.

  11. Fault tolerant programmable digital attitude control electronics study

    NASA Technical Reports Server (NTRS)

    Sorensen, A. A.

    1974-01-01

    The attitude control electronics mechanization study to develop a fault tolerant autonomous concept for a three axis system is reported. Programmable digital electronics are compared to general purpose digital computers. The requirements, constraints, and tradeoffs are discussed. It is concluded that: (1) general fault tolerance can be achieved relatively economically, (2) recovery times of less than one second can be obtained, (3) the number of faulty behavior patterns must be limited, and (4) adjoined processes are the best indicators of faulty operation.

  12. On the design of fault-tolerant robotic manipulator systems

    NASA Astrophysics Data System (ADS)

    Tesar, Delbert

    1993-02-01

    Robotic systems are finding increasing use in space applications. Many of these devices are going to be operational on board the Space Station Freedom. Fault tolerance has been deemed necessary because of the criticality of the tasks and the inaccessibility of the systems to maintenance and repair. Design for fault tolerance in manipulator systems is an area within robotics that is without precedent in the literature. In this paper, we will attempt to lay down the foundations for such a technology. Design for fault tolerance demands new and special approaches to design, often at considerable variance from established design practices. These design aspects, together with reliability evaluation and modeling tools, are presented. Mechanical architectures that employ protective redundancies at many levels and have a modular architecture are then studied in detail. Once a mechanical architecture for fault tolerance has been derived, the chronological stages of operational fault tolerance are investigated. Failure detection, isolation, and estimation methods are surveyed, and such methods for robot sensors and actuators are derived. Failure recovery methods are also presented for each of the protective layers of redundancy. Failure recovery tactics often span all of the layers of a control hierarchy. Thus, a unified framework for decision-making and control, which orchestrates both the nominal redundancy management tasks and the failure management tasks, has been derived. The well-developed field of fault-tolerant computers is studied next, and some design principles relevant to the design of fault-tolerant robot controllers are abstracted. Conclusions are drawn, and a road map for the design of fault-tolerant manipulator systems is laid out with recommendations for a 10 DOF arm with dual actuators at each joint.

  13. Optimal Management of Redundant Control Authority for Fault Tolerance

    NASA Technical Reports Server (NTRS)

    Wu, N. Eva; Ju, Jianhong

    2000-01-01

    This paper is intended to demonstrate the feasibility of a solution to a fault tolerant control problem. It explains, through a numerical example, the design and the operation of a novel scheme for fault tolerant control. The fundamental principle of the scheme was formalized in [5] based on the notion of normalized nonspecificity. The novelty lies with the use of a reliability criterion for redundancy management, and therefore leads to a high overall system reliability.

  14. Advanced development for space robotics with emphasis on fault tolerance

    NASA Technical Reports Server (NTRS)

    Tesar, D.; Chladek, J.; Hooper, R.; Sreevijayan, D.; Kapoor, C.; Geisinger, J.; Meaney, M.; Browning, G.; Rackers, K.

    1995-01-01

    This paper describes the ongoing work in fault tolerance at the University of Texas at Austin. The paper describes the technical goals the group is striving to achieve and includes a brief description of the individual projects focusing on fault tolerance. The ultimate goal is to develop and test technology applicable to all future missions of NASA (lunar base, Mars exploration, planetary surveillance, space station, etc.).

  15. Experiments in fault tolerant software reliability

    NASA Technical Reports Server (NTRS)

    Mcallister, David F.; Vouk, Mladen A.

    1989-01-01

    Twenty functionally equivalent programs were built and tested in a multiversion software experiment. Following unit testing, all programs were subjected to an extensive system test. In the process sixty-one distinct faults were identified among the versions. Less than 12 percent of the faults exhibited varying degrees of positive correlation. The common-cause (or similar) faults spanned as many as 14 components. However, a majority of these faults were trivial, and easily detected by proper unit and/or system testing. Only two of the seven similar faults were difficult faults, and both were caused by specification ambiguities. One of these faults exhibited variable identical-and-wrong response span, i.e. response span which varied with the testing conditions and input data. Techniques that could have been used to avoid the faults are discussed. For example, it was determined that back-to-back testing of 2-tuples could have been used to eliminate about 90 percent of the faults. In addition, four of the seven similar faults could have been detected by using back-to-back testing of 5-tuples. It is believed that most, if not all, similar faults could have been avoided had the specifications been written using more formal notation, had the unit testing phase been subject to more stringent standards and controls, and had better tools for measuring the quality and adequacy of the test data (e.g. coverage) been used.
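
The back-to-back testing of 2-tuples mentioned in this record can be illustrated with a minimal sketch. It is not the experiment's actual harness: the function name `back_to_back`, the toy versions `v1` to `v3`, and the test inputs are all hypothetical, chosen only to show how disagreement between independently written versions exposes a fault.

```python
from itertools import combinations


def back_to_back(versions, test_inputs, tuple_size=2):
    """Run every group of `tuple_size` versions on the same inputs.

    Returns a dict mapping each group of version names to the list of
    inputs on which the group members produced differing outputs.
    Outputs must be hashable (an assumption of this sketch).
    """
    disagreements = {}
    for group in combinations(sorted(versions), tuple_size):
        failing = [
            x for x in test_inputs
            if len({versions[name](x) for name in group}) > 1
        ]
        if failing:
            disagreements[group] = failing
    return disagreements


# Three hypothetical "functionally equivalent" versions of absolute value.
versions = {
    "v1": abs,
    "v2": lambda x: x if x > 0 else -x,
    "v3": lambda x: x,  # seeded fault: wrong for negative inputs
}

report = back_to_back(versions, [-2, 0, 3])
# Both pairs containing v3 disagree on the input -2; v1 and v2 always agree.
```

Note that a comparison of this kind cannot see a fault on which all compared versions return the same wrong answer, which is the identical-and-wrong case the abstract singles out.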

  16. Formal Techniques for Synchronized Fault-Tolerant Systems

    NASA Technical Reports Server (NTRS)

    DiVito, Ben L.; Butler, Ricky W.

    1992-01-01

    We present the formal verification of synchronizing aspects of the Reliable Computing Platform (RCP), a fault-tolerant computing system for digital flight control applications. The RCP uses NMR-style redundancy to mask faults and internal majority voting to purge the effects of transient faults. The system design has been formally specified and verified using the EHDM verification system. Our formalization is based on an extended state machine model incorporating snapshots of local processor clocks.
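
As an illustration of the two mechanisms named in this record, masking by N-modular redundancy (NMR) and purging transient faults by internal majority voting, here is a minimal sketch. It is not the RCP design or its EHDM specification; `majority_vote` and `vote_and_purge` are hypothetical names, and replica states are reduced to single hashable values.

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value held by a strict majority of replicas, else None."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None


def vote_and_purge(states: Sequence[Hashable]) -> List[Hashable]:
    """Overwrite every replica's state with the voted value.

    A replica corrupted by a transient fault is brought back in line
    with the majority. If no strict majority exists, the states are
    returned unchanged (a simplification made for this sketch).
    """
    voted = majority_vote(states)
    if voted is None:
        return list(states)
    return [voted] * len(states)


masked = majority_vote([42, 42, 17])   # the faulty replica is outvoted
purged = vote_and_purge([42, 17, 42])  # the corrupted state is replaced
```

With three replicas the vote masks one faulty value; with N replicas it masks fewer than N/2.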

  17. Architectural concepts and redundancy techniques in fault-tolerant computers

    NASA Technical Reports Server (NTRS)

    Rennels, D. A.

    1974-01-01

    This paper presents a description of redundancy techniques employed in the design of fault-tolerant computers, and a discussion of the effects of functional requirements, technology constraints, and cost considerations which enter into the choice of these techniques. The STAR computer, developed at the Jet Propulsion Laboratory for long-duration planetary spacecraft missions, is discussed along with several later fault-tolerant computer designs. The class of computers described in this paper employs dynamic redundancy, i.e., the machine is divided into a set of submodules, each with standby spares; a special hard core monitor unit detects and diagnoses faults, and effects automated recovery by replacing failed parts.

  18. Chip level simulation of fault tolerant computers

    NASA Technical Reports Server (NTRS)

    Armstrong, J. R.

    1983-01-01

    Chip level modeling techniques, functional fault simulation, simulation software development, a more efficient, high level version of GSP, and a parallel architecture for functional simulation are discussed.

  19. ROBUS-2: A Fault-Tolerant Broadcast Communication System

    NASA Technical Reports Server (NTRS)

    Torres-Pomales, Wilfredo; Malekpour, Mahyar R.; Miner, Paul S.

    2005-01-01

    The Reliable Optical Bus (ROBUS) is the core communication system of the Scalable Processor-Independent Design for Enhanced Reliability (SPIDER), a general-purpose fault-tolerant integrated modular architecture currently under development at NASA Langley Research Center. The ROBUS is a time-division multiple access (TDMA) broadcast communication system with medium access control by means of a time-indexed communication schedule. ROBUS-2 is a developmental version of the ROBUS providing guaranteed fault-tolerant services to the attached processing elements (PEs), in the presence of a bounded number of faults. These services include message broadcast (Byzantine Agreement), dynamic communication schedule update, clock synchronization, and distributed diagnosis (group membership). The ROBUS also features fault-tolerant startup and restart capabilities. ROBUS-2 is tolerant to internal as well as PE faults, and incorporates a dynamic self-reconfiguration capability driven by the internal diagnostic system. This version of the ROBUS is intended for laboratory experimentation and demonstrations of the capability to reintegrate failed nodes, dynamically update the communication schedule, and tolerate and recover from correlated transient faults.

  20. Reconfigurable tree architectures using subtree oriented fault tolerance

    NASA Technical Reports Server (NTRS)

    Lowrie, Matthew B.

    1987-01-01

    An approach to the design of reconfigurable tree architecture is presented in which spare processors are allocated at the leaves. The approach is unique in that spares are associated with subtrees and sharing of spares between these subtrees can occur. The Subtree Oriented Fault Tolerance (SOFT) approach is more reliable than previous approaches capable of tolerating link and switch failures for both single chip and multichip tree implementations while reducing redundancy in terms of both spare processors and links. VLSI layout is O(n) for binary trees and is directly extensible to N-ary trees and fault tolerance through performance degradation.

  1. Fault-tolerant building-block computer study

    NASA Technical Reports Server (NTRS)

    Rennels, D. A.

    1978-01-01

    Ultra-reliable core computers are required for improving the reliability of complex military systems. Such computers can provide reliable fault diagnosis, failure circumvention, and, in some cases, serve as an automated repairman for their host systems. A small set of building-block circuits which can be implemented as single very large scale integration devices, and which can be used with off-the-shelf microprocessors and memories to build self checking computer modules (SCCM) is described. Each SCCM is a microcomputer which is capable of detecting its own faults during normal operation and is designed to communicate with other identical modules over one or more Mil Standard 1553A buses. Several SCCMs can be connected into a network with backup spares to provide fault-tolerant operation, i.e. automated recovery from faults. Alternative fault-tolerant SCCM configurations are discussed along with the cost and reliability associated with their implementation.

  2. Block QCA Fault-Tolerant Logic Gates

    NASA Technical Reports Server (NTRS)

    Firjany, Amir; Toomarian, Nikzad; Modarres, Katayoon

    2003-01-01

    Suitably patterned arrays (blocks) of quantum-dot cellular automata (QCA) have been proposed as fault-tolerant universal logic gates. These block QCA gates could be used to realize the potential of QCA for further miniaturization, reduction of power consumption, increase in switching speed, and increased degree of integration of very-large-scale integrated (VLSI) electronic circuits. The limitations of conventional VLSI circuitry, the basic principle of operation of QCA, and the potential advantages of QCA-based VLSI circuitry were described in several NASA Tech Briefs articles, namely Implementing Permutation Matrices by Use of Quantum Dots (NPO-20801), Vol. 25, No. 10 (October 2001), page 42; Compact Interconnection Networks Based on Quantum Dots (NPO-20855) Vol. 27, No. 1 (January 2003), page 32; Bit-Serial Adder Based on Quantum Dots (NPO-20869), Vol. 27, No. 1 (January 2003), page 35; and Hybrid VLSI/QCA Architecture for Computing FFTs (NPO-20923), which follows this article. To recapitulate the principle of operation (greatly oversimplified because of the limitation on space available for this article): A quantum-dot cellular automaton contains four quantum dots positioned at or between the corners of a square cell. The cell contains two extra mobile electrons that can tunnel (in the quantum-mechanical sense) between neighboring dots within the cell. The Coulomb repulsion between the two electrons tends to make them occupy antipodal dots in the cell. For an isolated cell, there are two energetically equivalent arrangements (denoted polarization states) of the extra electrons. The cell polarization is used to encode binary information. Because the polarization of a nonisolated cell depends on Coulomb-repulsion interactions with neighboring cells, universal logic gates and binary wires could be constructed, in principle, by arraying QCA of suitable design in suitable patterns. Heretofore, researchers have recognized two major obstacles to realization of QCA

  3. Fault tolerant hypercube computer system architecture

    NASA Technical Reports Server (NTRS)

    Madan, Herb S. (Inventor); Chow, Edward (Inventor)

    1989-01-01

    A fault-tolerant multiprocessor computer system of the hypercube type comprising a hierarchy of computers of like kind which can be functionally substituted for one another as necessary is disclosed. Communication between the working nodes is via one communications network while communications between the working nodes and watch dog nodes and load balancing nodes higher in the structure is via another communications network separate from the first. A typical branch of the hierarchy reporting to a master node or host computer comprises, a plurality of first computing nodes; a first network of message conducting paths for interconnecting the first computing nodes as a hypercube. The first network provides a path for message transfer between the first computing nodes; a first watch dog node; and a second network of message connecting paths for connecting the first computing nodes to the first watch dog node independent from the first network, the second network provides an independent path for test message and reconfiguration affecting transfers between the first computing nodes and the first switch watch dog node. There is additionally, a plurality of second computing nodes; a third network of message conducting paths for interconnecting the second computing nodes as a hypercube. The third network provides a path for message transfer between the second computing nodes; a fourth network of message conducting paths for connecting the second computing nodes to the first watch dog node independent from the third network. The fourth network provides an independent path for test message and reconfiguration affecting transfers between the second computing nodes and the first watch dog node; and a first multiplexer disposed between the first watch dog node and the second and fourth networks for allowing the first watch dog node to selectively communicate with individual ones of the computing nodes through the second and fourth networks; as well as, a second watch dog node

  4. Evaluation of reliability modeling tools for advanced fault tolerant systems

    NASA Technical Reports Server (NTRS)

    Baker, Robert; Scheper, Charlotte

    1986-01-01

    The Computer Aided Reliability Estimation (CARE III) and Automated Reliability Interactive Estimation System (ARIES 82) reliability tools for application to advanced fault tolerance aerospace systems were evaluated. To determine reliability modeling requirements, the evaluation focused on the Draper Laboratories' Advanced Information Processing System (AIPS) architecture as an example architecture for fault tolerance aerospace systems. Advantages and limitations were identified for each reliability evaluation tool. The CARE III program was designed primarily for analyzing ultrareliable flight control systems. The ARIES 82 program's primary use was to support university research and teaching. Neither CARE III nor ARIES 82 was suited for determining the reliability of complex nodal networks of the type used to interconnect processing sites in the AIPS architecture. It was concluded that ARIES was not suitable for modeling advanced fault tolerant systems. It was further concluded that subject to some limitations (the difficulty in modeling systems with unpowered spare modules, systems where equipment maintenance must be considered, systems where failure depends on the sequence in which faults occurred, and systems where multiple faults beyond double near-coincident faults must be considered), CARE III is best suited for evaluating the reliability of advanced fault tolerant systems for air transport.

  5. Fault-tolerant control of heavy-haul trains

    NASA Astrophysics Data System (ADS)

    Zhuan, Xiangtao; Xia, Xiaohua

    2010-06-01

    The fault-tolerant control (FTC) of heavy-haul trains is discussed on the basis of the speed regulation proposed in previous works. The fault modes of trains are assumed and the corresponding fault detection and isolation (FDI) are studied. The FDI of sensor faults is based on a geometric approach for residual generators. The FDI of a braking system is based on the observation of the steady-state speed. From the difference of the steady-state speeds between the fault system and the faultless system, one can get fault information. Simulation tests were conducted on the suitability of the FDIs and the redesigned speed regulators. It is shown that the proposed FTC does not explicitly worsen the performance of the speed regulator in the case of a faultless system, while it obviously improves the performance of the speed regulator in the case of a faulty system.
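
The braking-system FDI idea in this record, comparing the steady-state speed of the observed system with that of the faultless system, can be sketched as a simple residual test. This is only an illustration under stated assumptions: the averaging window, the threshold, and the function names are hypothetical, and the paper's actual train model and residual generators are not reproduced here.

```python
from typing import Sequence, Tuple


def steady_state_speed(samples: Sequence[float], window: int = 5) -> float:
    """Estimate the steady-state speed as the mean of the last `window` samples."""
    tail = list(samples)[-window:]
    return sum(tail) / len(tail)


def detect_fault(
    observed: Sequence[float],
    faultless_speed: float,
    threshold: float,
) -> Tuple[bool, float]:
    """Flag a fault when the steady-state residual exceeds `threshold`.

    The residual is the observed steady-state speed minus the speed the
    faultless system would settle at under the same command. Its sign
    and size carry the fault information.
    """
    residual = steady_state_speed(observed) - faultless_speed
    return abs(residual) > threshold, residual


healthy = detect_fault([20.0] * 5, faultless_speed=20.0, threshold=0.5)
faulty = detect_fault([18.0] * 5, faultless_speed=20.0, threshold=0.5)
```

In practice the threshold would have to be tuned against measurement noise and model mismatch so that a healthy train does not raise false alarms.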

  6. Dataflow models for fault-tolerant control systems

    NASA Technical Reports Server (NTRS)

    Papadopoulos, G. M.

    1984-01-01

    Dataflow concepts are used to generate a unified hardware/software model of redundant physical systems which are prone to faults. Basic results in input congruence and synchronization are shown to reduce to a simple model of data exchanges between processing sites. Procedures are given for the construction of congruence schemata, the distinguishing features of any correctly designed redundant system.

  7. Fault-tolerant wait-free shared objects

    NASA Technical Reports Server (NTRS)

    Jayanti, Prasad; Chandra, Tushar D.; Toueg, Sam

    1992-01-01

    A concurrent system consists of processes communicating via shared objects, such as shared variables, queues, etc. The concept of wait-freedom was introduced to cope with process failures: each process that accesses a wait-free object is guaranteed to get a response even if all the other processes crash. However, if a wait-free object 'crashes,' all the processes that access that object are prevented from making progress. In this paper, we introduce the concept of fault-tolerant wait-free objects, and study the problem of implementing them. We give a universal method to construct fault-tolerant wait-free objects, for all types of 'responsive' failures (including one in which faulty objects may 'lie'). In sharp contrast, we prove that many common and interesting types (such as queues, sets, and test&set) have no fault-tolerant wait-free implementations even under the most benign of the 'non-responsive' types of failure. We also introduce several concepts and techniques that are central to the design of fault-tolerant concurrent systems: the concepts of self-implementation and graceful degradation, and techniques to automatically increase the fault-tolerance of implementations. We prove matching lower bounds on the resource complexity of most of our algorithms.

  8. Problems related to the integration of fault tolerant aircraft electronic systems

    NASA Technical Reports Server (NTRS)

    Bannister, J. A.; Adlakha, V.; Triyedi, K.; Alspaugh, T. A., Jr.

    1982-01-01

    Problems related to the design of the hardware for an integrated aircraft electronic system are considered. Taxonomies of concurrent systems are reviewed and a new taxonomy is proposed. An informal methodology intended to identify feasible regions of the taxonomic design space is described. Specific tools are recommended for use in the methodology. Based on the methodology, a preliminary strawman integrated fault tolerant aircraft electronic system is proposed. Next, problems related to the programming and control of integrated aircraft electronic systems are discussed. Issues of system resource management, including the scheduling and allocation of real time periodic tasks in a multiprocessor environment, are treated in detail. The role of software design in integrated fault tolerant aircraft electronic systems is discussed. Conclusions and recommendations for further work are included.

  9. A benchmark for fault tolerant flight control evaluation

    NASA Astrophysics Data System (ADS)

    Smaili, H.; Breeman, J.; Lombaerts, T.; Stroosma, O.

    2013-12-01

    A large transport aircraft simulation benchmark (REconfigurable COntrol for Vehicle Emergency Return - RECOVER) has been developed within the GARTEUR (Group for Aeronautical Research and Technology in Europe) Flight Mechanics Action Group 16 (FM-AG(16)) on Fault Tolerant Control (2004-2008) for the integrated evaluation of fault detection and identification (FDI) and reconfigurable flight control strategies. The benchmark includes a suitable set of assessment criteria and failure cases, based on reconstructed accident scenarios, to assess the potential of new adaptive control strategies to improve aircraft survivability. The application of reconstruction and modeling techniques, based on accident flight data, has resulted in high-fidelity nonlinear aircraft and fault models to evaluate new Fault Tolerant Flight Control (FTFC) concepts and their real-time performance to accommodate in-flight failures.

  10. Performance of fault-tolerant diagnostics in the hypercube systems

    SciTech Connect

    Ghafoor, A.; Sole, P.

    1989-08-01

    In this paper, they introduce the concept of fault-tolerant self-diagnosis for distributed systems and show that there exists a performance tradeoff between the complexity of a self-diagnostic algorithm and the level of fault tolerance inherited by the algorithm. For the study, they select hypercube systems and show that designing an optimal algorithm for such systems has an equivalent coding theory formulation which belongs to the class of NP-hard problems. Subsequently, they propose an "efficient" diagnostic scheme for these systems and study the performance tradeoff of the proposed algorithm which is based on a combinatorial structure called Hadamard matrix. The authors make essential use of its properties of symmetrical partitioning and covering in hypercube networks. Using known translate weight distributions, they evaluated the tradeoff between the fault tolerance and traffic complexity of the proposed diagnostic algorithm for hypercubes of small sizes. An interesting compromise is exhibited for the hypercube with an arbitrary size.

  11. Multiple Embedded Processors for Fault-Tolerant Computing

    NASA Technical Reports Server (NTRS)

    Bolotin, Gary; Watson, Robert; Katanyoutanant, Sunant; Burke, Gary; Wang, Mandy

    2005-01-01

    A fault-tolerant computer architecture has been conceived in an effort to reduce vulnerability to single-event upsets (spurious bit flips caused by impingement of energetic ionizing particles or photons). As in some prior fault-tolerant architectures, the redundancy needed for fault tolerance is obtained by use of multiple processors in one computer. Unlike prior architectures, the multiple processors are embedded in a single field-programmable gate array (FPGA). What makes this new approach practical is the recent commercial availability of FPGAs that are capable of having multiple embedded processors. A working prototype (see figure) consists of two embedded IBM PowerPC 405 processor cores and a comparator built on a Xilinx Virtex-II Pro FPGA. This relatively simple instantiation of the architecture implements an error-detection scheme. A planned future version, incorporating four processors and two comparators, would correct some errors in addition to detecting them.
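
The two-processor-plus-comparator prototype described in this record detects errors by duplication with comparison. The sketch below illustrates that general technique only; it does not model the PowerPC 405 cores or the FPGA comparator, and `flip_bit`, `run_duplex`, and the injected fault are hypothetical stand-ins for a single-event upset.

```python
from typing import Callable, Optional, Tuple


def flip_bit(value: int, bit: int) -> int:
    """Model a single-event upset by inverting one bit of an integer result."""
    return value ^ (1 << bit)


def run_duplex(
    task: Callable[[int], int],
    x: int,
    fault_bit: Optional[int] = None,
) -> Tuple[int, bool]:
    """Run `task` on two simulated processors and compare the results.

    If `fault_bit` is given, that bit of the second processor's result
    is flipped. Returns (result of processor A, comparator agreement).
    """
    out_a = task(x)
    out_b = task(x)
    if fault_bit is not None:
        out_b = flip_bit(out_b, fault_bit)
    return out_a, out_a == out_b


def square(v: int) -> int:
    return v * v


clean = run_duplex(square, 7)                # (49, True)
upset = run_duplex(square, 7, fault_bit=3)   # (49, False): mismatch detected
```

A single comparator can only report that the two results differ, not which one is wrong, which matches the abstract's statement that the two-processor version detects errors while the planned four-processor, two-comparator version would also correct some of them.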

  12. Garbage collection: an exercise in distributed, fault-tolerant programming

    SciTech Connect

    Vestal, S.C.

    1987-01-01

    Two garbage-collection algorithms are presented to reclaim unused storage in object-oriented systems implemented on local area networks. The algorithms are fault-tolerant and allow parallel, incremental collection in an object address space distributed throughout the system. The two approaches allow multiple collectors, so some unused storage can be reclaimed in partitioned networks. The first method makes use of fault-tolerant reference counts together with an algorithm to collect cycles of objects that would otherwise remain unclaimed. The second method adapts a parallel collector so that it can be used to collect subspaces of the entire network address space. Throughout this work concern is with a methodology for developing distributed, parallel, fault-tolerant programs. Also, there is concern with the suitability of object-oriented systems for such applications.

  13. Measurement and analysis of operating system fault tolerance

    NASA Technical Reports Server (NTRS)

    Lee, I.; Tang, D.; Iyer, R. K.

    1992-01-01

    This paper demonstrates a methodology to model and evaluate the fault tolerance characteristics of operational software. The methodology is illustrated through case studies on three different operating systems: the Tandem GUARDIAN fault-tolerant system, the VAX/VMS distributed system, and the IBM/MVS system. Measurements are made on these systems for substantial periods to collect software error and recovery data. In addition to investigating basic dependability characteristics such as major software problems and error distributions, we develop two levels of models to describe error and recovery processes inside an operating system and on multiple instances of an operating system running in a distributed environment. Based on the models, reward analysis is conducted to evaluate the loss of service due to software errors and the effect of the fault-tolerance techniques implemented in the systems. Software error correlation in multicomputer systems is also investigated.

  14. Hardware

    NASA Technical Reports Server (NTRS)

    1999-01-01

    The full complement of EDOMP investigations called for a broad spectrum of flight hardware ranging from commercial items, modified for spaceflight, to custom designed hardware made to meet the unique requirements of testing in the space environment. In addition, baseline data collection before and after spaceflight required numerous items of ground-based hardware. Two basic categories of ground-based hardware were used in EDOMP testing before and after flight: (1) hardware used for medical baseline testing and analysis, and (2) flight-like hardware used both for astronaut training and medical testing. To ensure post-landing data collection, hardware was required at both the Kennedy Space Center (KSC) and the Dryden Flight Research Center (DFRC) landing sites. Items that were very large or sensitive to the rigors of shipping were housed permanently at the landing site test facilities. Therefore, multiple sets of hardware were required to adequately support the prime and backup landing sites plus the Johnson Space Center (JSC) laboratories. Development of flight hardware was a major element of the EDOMP. The challenges included obtaining or developing equipment that met the following criteria: (1) compact (small size and light weight), (2) battery-operated or requiring minimal spacecraft power, (3) sturdy enough to survive the rigors of spaceflight, (4) quiet enough to pass acoustics limitations, (5) shielded and filtered adequately to assure electromagnetic compatibility with spacecraft systems, (6) user-friendly in a microgravity environment, and (7) accurate and efficient operation to meet medical investigative requirements.

  15. Design methods for fault-tolerant finite state machines

    NASA Technical Reports Server (NTRS)

    Niranjan, Shailesh; Frenzel, James F.

    1993-01-01

    VLSI electronic circuits are increasingly being used in space-borne applications where high levels of radiation may induce faults, known as single event upsets. In this paper we review the classical methods of designing fault tolerant digital systems, with an emphasis on those methods which are particularly suitable for VLSI-implementation of finite state machines. Four methods are presented and will be compared in terms of design complexity, circuit size, and estimated circuit delay.

  16. Fault Tolerance Middleware for a Multi-Core System

    NASA Technical Reports Server (NTRS)

    Some, Raphael R.; Springer, Paul L.; Zima, Hans P.; James, Mark; Wagner, David A.

    2012-01-01

    Fault Tolerance Middleware (FTM) provides a framework to run on a dedicated core of a multi-core system and handles detection of single-event upsets (SEUs), and the responses to those SEUs, occurring in an application running on multiple cores of the processor. This software was written expressly for a multi-core system and can support different kinds of fault strategies, such as introspection, algorithm-based fault tolerance (ABFT), and triple modular redundancy (TMR). It focuses on providing fault tolerance for the application code, and represents the first step in a plan to eventually include fault tolerance in message passing and the FTM itself. In the multi-core system, the FTM resides on a single, dedicated core, separate from the cores used by the application. This is done in order to isolate the FTM from application faults and to allow it to swap out any application core for a substitute. The structure of the FTM consists of an interface to a fault tolerant strategy module, a responder module, a fault manager module, an error factory, and an error mapper that determines the severity of the error. In the present reference implementation, the only fault tolerant strategy implemented is introspection. The introspection code waits for an application node to send an error notification to it. It then uses the error factory to create an error object, and at this time, a severity level is assigned to the error. The introspection code uses its built-in knowledge base to generate a recommended response to the error. Responses might include ignoring the error, logging it, rolling back the application to a previously saved checkpoint, swapping in a new node to replace a bad one, or restarting the application. The original error and recommended response are passed to the top-level fault manager module, which invokes the response. The responder module also notifies the introspection module of the generated response. This provides additional information to the

  17. Fault-tolerant three-level inverter

    DOEpatents

    Edwards, John; Xu, Longya; Bhargava, Brij B.

    2006-12-05

    A method for driving a neutral point clamped three-level inverter is provided. In one exemplary embodiment, DC current is received at a neutral point-clamped three-level inverter. The inverter has a plurality of nodes including first, second and third output nodes. The inverter also has a plurality of switches. Faults are checked for in the inverter and predetermined switches are automatically activated responsive to a detected fault such that three-phase electrical power is provided at the output nodes.

  18. Single-Shot Fault-Tolerant Quantum Error Correction

    NASA Astrophysics Data System (ADS)

    Bombín, Héctor

    2015-07-01

    Conventional quantum error correcting codes require multiple rounds of measurements to detect errors with enough confidence in fault-tolerant scenarios. Here, I show that for suitable topological codes, a single round of local measurements is enough. This feature is generic and is related to self-correction and confinement phenomena in the corresponding quantum Hamiltonian model. Three-dimensional gauge color codes exhibit this single-shot feature, which also applies to initialization and gauge fixing. Assuming the time for efficient classical computations to be negligible, this yields a topological fault-tolerant quantum computing scheme where all elementary logical operations can be performed in constant time.

  19. Rule-based fault-tolerant flight control

    NASA Technical Reports Server (NTRS)

    Handelman, Dave

    1988-01-01

    Fault tolerance has always been a desirable characteristic of aircraft. The ability to withstand unexpected changes in aircraft configuration has a direct impact on the ability to complete a mission effectively and safely. The possible synergistic effects of combining techniques of modern control theory, statistical hypothesis testing, and artificial intelligence in the attempt to provide failure accommodation for aircraft are investigated. This effort has resulted in the definition of a theory for rule based control and a system for development of such a rule based controller. Although presented here in response to the goal of aircraft fault tolerance, the rule based control technique is applicable to a wide range of complex control problems.

  20. Enhanced Fault-Tolerant Quantum Computing in d-Level Systems

    NASA Astrophysics Data System (ADS)

    Campbell, Earl T.

    2014-12-01

    Error-correcting codes protect quantum information and form the basis of fault-tolerant quantum computing. Leading proposals for fault-tolerant quantum computation require codes with an exceedingly rare property, a transversal non-Clifford gate. Codes with the desired property are presented for d-level qudit systems with prime d. The codes use n = d - 1 qudits and can detect up to ~d/3 errors. We quantify the performance of these codes for one approach to quantum computation known as magic-state distillation. Unlike prior work, we find performance is always enhanced by increasing d.

  1. Fault tolerant kinematic control of hyper-redundant manipulators

    NASA Technical Reports Server (NTRS)

    Bedrossian, Nazareth S.

    1994-01-01

    Hyper-redundant spatial manipulators possess fault-tolerant features because of their redundant structure. The kinematic control of these manipulators is investigated with special emphasis on fault-tolerant control. The manipulator tasks are viewed in the end-effector space while actuator commands are in joint-space, requiring an inverse kinematic algorithm to generate joint-angle commands from the end-effector ones. The rate-inverse kinematic control algorithm presented in this paper utilizes the pseudoinverse to accommodate for joint motor failures. An optimal scale factor for the robust inverse is derived.

  2. Fault tolerant issues in the BTeV trigger

    SciTech Connect

    Jeffrey A. Appel et al.

    2002-12-03

    The BTeV trigger performs sophisticated computations using large ensembles of FPGAs, DSPs, and conventional microprocessors. This system will have between 5,000 and 10,000 computing elements and many networks and data switches. While much attention has been devoted to developing efficient algorithms, the need for fault-tolerant, fault-adaptive, and flexible techniques and software to manage this huge computing platform has been identified as one of the most challenging aspects of this project. They describe the problem and offer an approach to solving it based on a distributed, hierarchical fault management system.

  3. Programs For Modeling Fault-Tolerant Computing Systems

    NASA Technical Reports Server (NTRS)

    Butler, Ricky W.

    1991-01-01

    Pade Approximation with Scaling (PAWS) and Scaling Taylor Exponential Matrix (STEM) computer programs are software tools for design and validation. Provide flexible, user-friendly, language-based interface for input of Markov mathematical methods describing behaviors of fault-tolerant computer systems. Markov models include both recovery from faults via reconfiguration and behaviors of such systems when faults occur. PAWS and STEM produce exact solutions of probability of system failure and provide conservative estimate of number of significant digits in solution. Written in PASCAL and FORTRAN.

  4. Software reliability through fault-avoidance and fault-tolerance

    NASA Technical Reports Server (NTRS)

    Vouk, Mladen A.; Mcallister, David F.

    1990-01-01

    The use of back-to-back, or comparison, testing for regression test or porting is examined. The efficiency and the cost of the strategy are compared with manual and table-driven single version testing. Some of the key parameters that influence the efficiency and the cost of the approach are the failure identification effort during single version program testing, the extent of implemented changes, the nature of the regression test data (e.g., random), and the nature of the inter-version failure correlation and fault-masking. The advantages and disadvantages of the technique are discussed, together with some suggestions concerning its practical use.

  5. A performance assessment of a byzantine resilient fault-tolerant computer

    NASA Technical Reports Server (NTRS)

    Young, Steven D.; Elks, Carl R.; Graham, R. L.

    1989-01-01

    This report presents the results of a performance analysis of a quad-redundant Fault-Tolerant Processor (FTP). The FTP is a computing system specifically designed for applications where very high reliability is required. Examples of such applications are flight control systems, nuclear power systems, and spacecraft control systems. The FTP performance was analyzed in a hierarchical manner encompassing the hardware, the operating system, and the application. At the hardware level, the hardware organization and design were assessed in relation to system throughput and response. Analysis at the operating system level revealed that the scheduler took only 3.2 percent of each 40 ms frame, while the redundancy management software took 10.4 percent. The application level performance was analyzed via a synthetic workload and a representative flight control model. The estimated throughput for this application was found to be 317.6 KIPS if not exercising the voter. Exercising the voter to ensure fault tolerance will diminish this number linearly as the number of votes is increased. This performance analysis method was proven effective by uncovering undesirable behavior and anomalies in the FTP system.
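
    The frame figures quoted above can be turned into per-frame times with a little arithmetic. The sketch below is only an illustration: the function name is hypothetical, and treating everything outside the two quoted overheads as available to the application is an assumption, since the abstract does not list any other overheads.

```python
# Frame budget implied by the percentages quoted in the abstract.
FRAME_MS = 40.0             # frame length stated in the abstract
SCHEDULER_PCT = 3.2         # scheduler share of each frame
REDUNDANCY_MGMT_PCT = 10.4  # redundancy management share of each frame


def frame_budget(frame_ms, overhead_pcts):
    """Return (list of overhead times in ms, remaining time in ms).

    Assumption: the listed overheads are the only ones in the frame.
    """
    overheads = [frame_ms * pct / 100.0 for pct in overhead_pcts]
    return overheads, frame_ms - sum(overheads)


overheads_ms, remaining_ms = frame_budget(
    FRAME_MS, [SCHEDULER_PCT, REDUNDANCY_MGMT_PCT]
)
print("scheduler ms:", round(overheads_ms[0], 2))
print("redundancy management ms:", round(overheads_ms[1], 2))
print("remaining ms:", round(remaining_ms, 2))
```

    Under these assumptions the scheduler accounts for about 1.28 ms and redundancy management for about 4.16 ms of each frame, leaving roughly 34.56 ms.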

  6. H∞ fault-tolerant control for time-varied actuator fault of nonlinear system

    NASA Astrophysics Data System (ADS)

    Liu, Chunsheng; Jiang, Bin

    2014-12-01

    This paper studies H∞ fault-tolerant control for a class of uncertain nonlinear systems subject to time-varied actuator faults. A radial basis function neural network is utilised to approximate the unknown nonlinear functions; an updating rule is designed to estimate on-line the time-varied fault of the actuator; and the controller with state feedback and fault estimation is applied to compensate for the effects of the fault and minimise the H∞ performance criteria in order to get a desired H∞ disturbance rejection constraint. Sufficient conditions are derived which guarantee that the closed-loop system is robustly stable and satisfies the H∞ performance in both normal and fault cases. In order to reduce computing cost, a simplified algorithm of matrix Riccati inequality is given. A spacecraft model is presented to demonstrate the effectiveness of the proposed methods.

  7. Multi-version software reliability through fault-avoidance and fault-tolerance

    NASA Technical Reports Server (NTRS)

    Vouk, Mladen A.; Mcallister, David F.

    1989-01-01

    A number of experimental and theoretical issues associated with the practical use of multi-version software to provide run-time tolerance to software faults were investigated. A specialized tool was developed and evaluated for measuring testing coverage for a variety of metrics. The tool was used to collect information on the relationships between software faults and coverage provided by the testing process as measured by different metrics (including data flow metrics). Considerable correlation was found between coverage provided by some higher metrics and the elimination of faults in the code. Back-to-back testing was continued as an efficient mechanism for removal of un-correlated faults, and common-cause faults of variable span. Work on software reliability estimation methods based on non-random sampling, and on the relationship between software reliability and code coverage provided through testing, was also continued. New fault tolerance models were formulated. Simulation studies of the Acceptance Voting and Multi-stage Voting algorithms were finished and it was found that these two schemes for software fault tolerance are superior in many respects to some commonly used schemes. Particularly encouraging are the safety properties of the Acceptance testing scheme.

  8. Locating hardware faults in a data communications network of a parallel computer

    DOEpatents

    Archer, Charles J.; Megerian, Mark G.; Ratterman, Joseph D.; Smith, Brian E.

    2010-01-12

    Hardware fault location in a data communications network of a parallel computer. Such a parallel computer includes a plurality of compute nodes and a data communications network that couples the compute nodes for data communications and organizes the compute nodes as a tree. Locating hardware faults includes identifying a next compute node as a parent node and a root of a parent test tree, identifying for each child compute node of the parent node a child test tree having the child compute node as root, running a same test suite on the parent test tree and each child test tree, and identifying the parent compute node as having a defective link connected from the parent compute node to a child compute node if the test suite fails on the parent test tree and succeeds on all the child test trees.
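
    The decision rule in this abstract can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the dictionary tree representation, the function names, and the `run_test_suite` callback are all hypothetical, and returning False for a leaf node is an assumption the abstract does not address.

```python
def subtree(tree, root):
    """Collect all nodes of the test tree rooted at `root`.

    `tree` maps each node to a list of its children (hypothetical layout).
    """
    nodes, stack = [], [root]
    while stack:
        node = stack.pop()
        nodes.append(node)
        stack.extend(tree.get(node, []))
    return nodes


def parent_has_defective_link(tree, parent, run_test_suite):
    """Apply the rule from the abstract at one parent node.

    run_test_suite(nodes) -> bool is a hypothetical callback that runs the
    same test suite on a set of nodes and reports success.
    """
    children = tree.get(parent, [])
    if not children:
        return False  # assumption: a leaf has no child links to blame
    parent_ok = run_test_suite(subtree(tree, parent))
    children_ok = all(run_test_suite(subtree(tree, c)) for c in children)
    # Fails on the parent test tree but succeeds on every child test tree.
    return (not parent_ok) and children_ok


# Toy network: node 0 is the root; the link 0 -> 2 is simulated as broken.
tree = {0: [1, 2], 1: [3, 4], 2: [], 3: [], 4: []}
bad_link = (0, 2)


def fake_suite(nodes):
    """Simulated suite: fails when both ends of the bad link are in the tree."""
    return not (bad_link[0] in nodes and bad_link[1] in nodes)


print(parent_has_defective_link(tree, 0, fake_suite))
print(parent_has_defective_link(tree, 1, fake_suite))
```

    In the toy network the rule flags node 0, whose test tree contains the broken link while both child test trees pass, and does not flag node 1.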

  9. Survey of fault-tolerant multistage networks and comparison to the extra stage cube

    SciTech Connect

    Adams, G.B. III; Siegel, H.J.

    1984-01-01

    A variety of fault-tolerant multistage interconnection networks for parallel processing systems that have been proposed in the literature are surveyed. A network is fault-tolerant if it can continue to meet its fault tolerance criterion in the presence of one or more failures of the type(s) allowed by its fault model. Significant differences in fault models and fault-tolerance criteria exist among various fault-tolerant networks. This makes direct comparison of these networks difficult. In analyzing the networks, this paper compares the various models and assesses the effect of choosing a common model and criterion. Network characteristics such as degree of fault tolerance, routing control method, and permutation capability are discussed. The networks surveyed and compared to the extra stage cube are the modified baseline, augmented delta, f-network, enhanced inverse augmented data manipulator, gamma, fault-tolerant Benes, and beta-networks. 21 references.

  10. Electronic Power Switch for Fault-Tolerant Networks

    NASA Technical Reports Server (NTRS)

    Volp, J.

    1987-01-01

    Power field-effect transistors reduce energy waste and simplify interconnections. Current switch containing power field-effect transistor (PFET) placed in series with each load in fault-tolerant power-distribution system. If system includes several loads and supplies, switches placed in series with adjacent loads and supplies. System of switches protects against overloads and losses of individual power sources.

  11. Study of fault tolerant software technology for dynamic systems

    NASA Technical Reports Server (NTRS)

    Caglayan, A. K.; Zacharias, G. L.

    1985-01-01

    The major aim of this study is to investigate the feasibility of using systems-based failure detection isolation and compensation (FDIC) techniques in building fault-tolerant software and extending them, whenever possible, to the domain of software fault tolerance. First, it is shown that systems-based FDIC methods can be extended to develop software error detection techniques by using system models for software modules. In particular, it is demonstrated that systems-based FDIC techniques can yield consistency checks that are easier to implement than acceptance tests based on software specifications. Next, it is shown that systems-based failure compensation techniques can be generalized to the domain of software fault tolerance in developing software error recovery procedures. Finally, the feasibility of using fault-tolerant software in flight software is investigated. In particular, possible system and version instabilities, and functional performance degradation that may occur in N-Version programming applications to flight software are illustrated. Finally, a comparative analysis of N-Version and recovery block techniques in the context of generic blocks in flight software is presented.

  12. Data-driven Fault Tolerance for Work Stealing Computations

    SciTech Connect

    Ma, Wenjing; Krishnamoorthy, Sriram

    2012-06-25

    Checkpoint-restart approaches to fault tolerance typically roll back all the processes to the previous checkpoint in the event of a failure. Work stealing is a promising technique to dynamically tolerate variations in the execution environment, including faults, system noise, and energy constraints. In this paper, we present fault tolerance mechanisms for task parallel computations, a popular computation idiom, employing work stealing. The computation is organized as a collection of tasks with data in a global address space. The completion of data operations, rather than the actual messages, is tracked to derive an idempotent data store. This information is used to accurately identify the tasks to be re-executed, and therefore to recompute only the lost data, in the presence of random work stealing. We consider three recovery schemes that present distinct trade-offs -- lazy recovery with potentially increased re-execution cost, immediate collective recovery with associated synchronization overheads, and noncollective recovery enabled by additional communication. We employ distributed work stealing to dynamically rebalance the tasks on the live processes and evaluate the three schemes using candidate application benchmarks. We demonstrate that the overheads (space and time) of the fault tolerance mechanism are low, the cost incurred due to failures is small, and the overheads decrease with per-process work at scale.

  13. Clouds: A support architecture for fault tolerant, distributed systems

    NASA Technical Reports Server (NTRS)

    Dasgupta, P.; Leblanco, R. J., Jr.

    1986-01-01

    Clouds is a distributed operating system providing support for fault tolerance, location independence, reconfiguration, and transactions. The implementation paradigm uses objects and nested actions as building blocks. Subsystems and applications that can be supported by Clouds to further enhance the performance and utility of the system are also discussed.

  14. Cost and benefits optimization model for fault-tolerant aircraft electronic systems

    NASA Technical Reports Server (NTRS)

    1983-01-01

    The factors involved in economic assessment of fault tolerant systems (FTS) and fault tolerant flight control systems (FTFCS) are discussed. Algorithms for optimization and economic analysis of FTFCS are documented.

  15. Fault tolerance control for proton exchange membrane fuel cell systems

    NASA Astrophysics Data System (ADS)

    Wu, Xiaojuan; Zhou, Boyang

    2016-08-01

    Fault diagnosis and controller design are two important aspects of improving proton exchange membrane fuel cell (PEMFC) system durability. However, the two tasks are often performed separately. For example, many pressure and voltage controllers have been successfully built, but these controllers are designed for the normal operation of the PEMFC. When the PEMFC faces problems such as flooding or membrane drying, a controller with a specific design must be used. This paper proposes a unique scheme that simultaneously performs fault diagnosis and tolerance control for the PEMFC system. The proposed control strategy consists of a fault diagnosis, a reconfiguration mechanism and adjustable controllers. Using a back-propagation neural network, a model-based fault detection method is employed to detect the current PEMFC fault type (flooding, membrane drying or normal). According to the diagnosis results, the reconfiguration mechanism determines which backup controller is selected. Three nonlinear controllers based on feedback linearization approaches are built to adjust the voltage and pressure difference under normal, membrane drying and flooding conditions, respectively. The simulation results illustrate that the proposed fault tolerance control strategy can track the voltage and keep the pressure difference at desired levels in faulty conditions.

  16. A Self-Stabilizing Hybrid Fault-Tolerant Synchronization Protocol

    NASA Technical Reports Server (NTRS)

    Malekpour, Mahyar R.

    2015-01-01

    This paper presents a strategy for solving the Byzantine general problem for self-stabilizing a fully connected network from an arbitrary state and in the presence of any number of faults with various severities including any number of arbitrary (Byzantine) faulty nodes. The strategy consists of two parts: first, converting Byzantine faults into symmetric faults, and second, using a proven symmetric-fault tolerant algorithm to solve the general case of the problem. A protocol (algorithm) is also presented that tolerates symmetric faults, provided that there are more good nodes than faulty ones. The solution applies to realizable systems, while allowing for differences in the network elements, provided that the number of arbitrary faults is not more than a third of the network size. The only constraint on the behavior of a node is that the interactions with other nodes are restricted to defined links and interfaces. The solution does not rely on assumptions about the initial state of the system and no central clock nor centrally generated signal, pulse, or message is used. Nodes are anonymous, i.e., they do not have unique identities. A mechanical verification of a proposed protocol is also presented. A bounded model of the protocol is verified using the Symbolic Model Verifier (SMV). The model checking effort is focused on verifying correctness of the bounded model of the protocol as well as confirming claims of determinism and linear convergence with respect to the self-stabilization period.

  17. Reliability model of fault-tolerant data processing system with primary and backup nodes

    NASA Astrophysics Data System (ADS)

    Rahman, P. A.; Bobkova, E. Yu

    2016-04-01

    This paper deals with fault-tolerant data processing systems, which are widely used in the modern world of information technologies and have acceptable overhead expenses in hardware implementation. A simplified reliability model for duplex systems is given, along with an advanced model proposed by the authors for data processing systems with primary and backup nodes. The advanced model is based on a three-state model of recoverable elements, which takes into consideration the different failure rates of passive and active nodes and the finite time of node activation. A calculation formula for the availability factor of the dual-node data processing system with primary and backup nodes and calculation examples are also provided.

  18. Trojan horse attack free fault-tolerant quantum key distribution protocols

    NASA Astrophysics Data System (ADS)

    Yang, Chun-Wei; Hwang, Tzonelih

    2013-11-01

    This work proposes two quantum key distribution (QKD) protocols—each of which is robust under one kind of collective noises—collective-dephasing noise and collective-rotation noise. Due to the use of a new coding function which produces error-robust codewords allowing one-time transmission of quanta, the proposed QKD schemes are fault-tolerant and congenitally free from Trojan horse attacks without having to use any extra hardware. Moreover, by adopting two Bell state measurements instead of a 4-GHZ state joint measurement for decoding, the proposed protocols are practical in combating collective noises.

  19. Impact of coverage on the reliability of a fault tolerant computer

    NASA Technical Reports Server (NTRS)

    Bavuso, S. J.

    1975-01-01

    A mathematical reliability model is established for a reconfigurable fault tolerant avionic computer system utilizing state-of-the-art computers. System reliability is studied in light of the coverage probabilities associated with the first and second independent hardware failures. Coverage models are presented as a function of detection, isolation, and recovery probabilities. Upper and lower bounds are established for the coverage probabilities and the method for computing values for the coverage probabilities is investigated. Further, an architectural variation is proposed which is shown to enhance coverage.

  20. Simulated fault injection - A methodology to evaluate fault tolerant microprocessor architectures

    NASA Technical Reports Server (NTRS)

    Choi, Gwan S.; Iyer, Ravishankar K.; Carreno, Victor A.

    1990-01-01

    A simulation-based fault-injection method for validating fault-tolerant microprocessor architectures is described. The approach uses mixed-mode simulation (electrical/logic analysis), and injects transient errors in run-time to assess the resulting fault impact. As an example, a fault-tolerant architecture which models the digital aspects of a dual-channel real-time jet-engine controller is used. The level of effectiveness of the dual configuration with respect to single and multiple transients is measured. The results indicate 100 percent coverage of single transients. Approximately 12 percent of the multiple transients affect both channels; none result in controller failure since two additional levels of redundancy exist.

  1. Multi-fault Tolerance for Cartesian Data Distributions

    SciTech Connect

    Ali, Nawab; Krishnamoorthy, Sriram; Halappanavar, Mahantesh; Daily, Jeffrey A.

    2013-06-01

    Faults are expected to play an increasingly important role in how algorithms and applications are designed to run on future extreme-scale systems. Algorithm-based fault tolerance (ABFT) is a promising approach that involves modifications to the algorithm to recover from faults with lower overheads than replicated storage and a significant reduction in lost work compared to checkpoint-restart techniques. Fault-tolerant linear algebra (FTLA) algorithms employ additional processors that store parities along the dimensions of a matrix to tolerate multiple, simultaneous faults. Existing approaches assume regular data distributions (blocked or block-cyclic) with the failures of each data block being independent. To match the characteristics of failures on parallel computers, we extend these approaches to mapping parity blocks in several important ways. First, we handle parity computation for generalized Cartesian data distributions with each processor holding arbitrary subsets of blocks in a Cartesian-distributed array. Second, techniques to handle correlated failures, i.e., multiple processors that can be expected to fail together, are presented. Third, we handle the colocation of parity blocks with the data blocks and do not require them to be on additional processors. Several alternative approaches, based on graph matching, are presented that attempt to balance the memory overhead on processors while guaranteeing the same fault tolerance properties as existing approaches that assume independent failures on regular blocked data distributions. The evaluation of these algorithms demonstrates that the additional desirable properties are provided by the proposed approach with minimal overhead.
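
    The basic idea of storing parities along a matrix dimension can be illustrated with a simple sum checksum. The sketch below is an assumption-laden toy, not the approach evaluated in the paper: it uses one sum parity per row of a block grid, tolerates only one lost block per row, and does not show generalized Cartesian distributions, correlated failures, or the graph-matching placement. All names are hypothetical.

```python
def row_parities(grid):
    """Element-wise sum of the blocks in each row of a 2-D grid of blocks.

    Each block is a list of integers of the same length.
    """
    return [[sum(vals) for vals in zip(*row)] for row in grid]


def recover_block(grid, parities, r, c):
    """Rebuild the lost block at (r, c) from the row parity and survivors."""
    rebuilt = list(parities[r])
    for j, block in enumerate(grid[r]):
        if j == c:
            continue  # skip the lost block itself
        for k, value in enumerate(block):
            rebuilt[k] -= value
    return rebuilt


# 2 x 3 grid of blocks, each block holding two integers.
grid = [
    [[1, 2], [3, 4], [5, 6]],
    [[7, 8], [9, 10], [11, 12]],
]
parities = row_parities(grid)

# Simulate losing block (0, 1), then rebuild it from the parity.
lost = grid[0][1]
grid[0][1] = None
recovered = recover_block(grid, parities, 0, 1)
print(recovered == lost)
```

    Adding a second set of parities along the columns is the usual way to tolerate more than one simultaneous loss, which is the direction the abstract describes.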

  2. Fault tolerant highly reliable inertial navigation system

    NASA Astrophysics Data System (ADS)

    Jeerage, Mahesh; Boettcher, Kevin

    This paper describes the development of failure detection and isolation (FDI) strategies for highly reliable inertial navigation systems. FDI strategies are developed based on the generalized likelihood ratio test (GLRT). A relationship between detection threshold and false alarm rate is developed in terms of the sensor parameters. A new method for correct isolation of failed sensors is presented. An evaluation of FDI performance parameters, such as false alarm rate, wrong isolation probability, and correct isolation probability, is presented. Finally, a fault recovery scheme capable of correcting false isolation of good sensors is presented.

  3. Fault-tolerant reactor protection system

    DOEpatents

    Gaubatz, Donald C.

    1997-01-01

    A reactor protection system having four divisions, with quad redundant sensors for each scram parameter providing input to four independent microprocessor-based electronic chassis. Each electronic chassis acquires the scram parameter data from its own sensor, digitizes the information, and then transmits the sensor reading to the other three electronic chassis via optical fibers. To increase system availability and reduce false scrams, the reactor protection system employs two levels of voting on a need for reactor scram. The electronic chassis perform software divisional data processing, vote 2/3 with spare based upon information from all four sensors, and send the divisional scram signals to the hardware logic panel, which performs a 2/4 division vote on whether or not to initiate a reactor scram. Each chassis makes a divisional scram decision based on data from all sensors. Each division performs independently of the others (asynchronous operation). All communications between the divisions are asynchronous. Each chassis substitutes its own spare sensor reading in the 2/3 vote if a sensor reading from one of the other chassis is faulty or missing. Therefore the presence of at least two valid sensor readings in excess of a set point is required before terminating the output to the hardware logic of a scram inhibition signal even when one of the four sensors is faulty or when one of the divisions is out of service.
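
    The two voting levels described above can be sketched as follows. This is one possible reading of the abstract, not the patented logic: it assumes each chassis votes on the three readings received from the other chassis and keeps its own reading as the spare, that a faulty or missing reading is represented by None, and that a reading "trips" when it exceeds the set point. Function names and the numeric values are hypothetical.

```python
def divisional_scram(own_reading, other_readings, set_point):
    """2-out-of-3 vote with spare for one chassis (assumed interpretation).

    other_readings: the three readings received from the other chassis;
    None marks a faulty or missing reading.
    """
    voted = list(other_readings)
    # Substitute the local spare reading for the first faulty/missing one.
    for i, reading in enumerate(voted):
        if reading is None:
            voted[i] = own_reading
            break
    trips = sum(1 for r in voted if r is not None and r > set_point)
    return trips >= 2


def reactor_scram(division_votes):
    """2-out-of-4 vote on the divisional scram signals.

    A division that is out of service contributes False.
    """
    return sum(1 for vote in division_votes if vote) >= 2


SET_POINT = 100.0
# Sensor readings for divisions 0..3; sensor 2 is faulty (None).
readings = [105.0, 110.0, None, 90.0]

votes = []
for i, own in enumerate(readings):
    others = [r for j, r in enumerate(readings) if j != i]
    if own is None:
        votes.append(False)  # assumption: this division is out of service
    else:
        votes.append(divisional_scram(own, others, SET_POINT))

print(votes, reactor_scram(votes))
```

    With these made-up readings, two valid sensors exceed the set point, so the divisions with a working sensor substitute their spare for the missing reading and the 2-out-of-4 vote still calls for a scram.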

  4. Fault-tolerant reactor protection system

    DOEpatents

    Gaubatz, D.C.

    1997-04-15

    A reactor protection system is disclosed having four divisions, with quad redundant sensors for each scram parameter providing input to four independent microprocessor-based electronic chassis. Each electronic chassis acquires the scram parameter data from its own sensor, digitizes the information, and then transmits the sensor reading to the other three electronic chassis via optical fibers. To increase system availability and reduce false scrams, the reactor protection system employs two levels of voting on a need for reactor scram. The electronic chassis perform software divisional data processing, vote 2/3 with spare based upon information from all four sensors, and send the divisional scram signals to the hardware logic panel, which performs a 2/4 division vote on whether or not to initiate a reactor scram. Each chassis makes a divisional scram decision based on data from all sensors. Each division performs independently of the others (asynchronous operation). All communications between the divisions are asynchronous. Each chassis substitutes its own spare sensor reading in the 2/3 vote if a sensor reading from one of the other chassis is faulty or missing. Therefore the presence of at least two valid sensor readings in excess of a set point is required before terminating the output to the hardware logic of a scram inhibition signal even when one of the four sensors is faulty or when one of the divisions is out of service. 16 figs.

  5. Combining dynamical decoupling with fault-tolerant quantum computation

    SciTech Connect

    Ng, Hui Khoon; Preskill, John; Lidar, Daniel A.

    2011-07-15

    We study how dynamical decoupling (DD) pulse sequences can improve the reliability of quantum computers. We prove upper bounds on the accuracy of DD-protected quantum gates and derive sufficient conditions for DD-protected gates to outperform unprotected gates. Under suitable conditions, fault-tolerant quantum circuits constructed from DD-protected gates can tolerate stronger noise and have a lower overhead cost than fault-tolerant circuits constructed from unprotected gates. Our accuracy estimates depend on the dynamics of the bath that couples to the quantum computer and can be expressed either in terms of the operator norm of the bath's Hamiltonian or in terms of the power spectrum of bath correlations; we explain in particular how the performance of recursively generated concatenated pulse sequences can be analyzed from either viewpoint. Our results apply to Hamiltonian noise models with limited spatial correlations.

  6. Integrated sensor and actuator fault-tolerant control

    NASA Astrophysics Data System (ADS)

    Seron, María M.; De Doná, José A.; Richter, Jan H.

    2013-04-01

    We propose a fault-tolerant control scheme that deals with sensor and actuator faults through the use of a virtual actuator (VA) and a bank of virtual sensors (VSs). A novel feature of the scheme is that the VSs implicitly integrate both fault detection and isolation (FDI) and - in conjunction with the VA - controller reconfiguration tasks. The VA and the bank of VSs operate in closed-loop with an observer-based tracking controller designed for a nominal (fault free) model of the plant. A switching rule that reconfigures the VA and engages the suitable VS from the bank is based on sets defined for measurable residual signals constructed directly from the VS signals. Our method handles abrupt actuator and sensor faults of arbitrary magnitude including complete outage. The overall scheme is shown to guarantee closed-loop boundedness and setpoint tracking under all considered fault situations. Enhancements of the scheme to deal with errors in the fault detection and isolation are also proposed. Applications of the scheme to a winding machine and an interconnected tank system are presented.

  7. Fault detection, isolation and reconfiguration in FTMP Methods and experimental results. [fault tolerant multiprocessor

    NASA Technical Reports Server (NTRS)

    Lala, J. H.

    1983-01-01

    The Fault-Tolerant Multiprocessor (FTMP) is a highly reliable computer designed to meet a goal of 10 to the -10th failures per hour and built with the objective of flying an active-control transport aircraft. Fault detection, identification, and recovery software is described, and experimental results obtained by injecting faults at the pin level in the FTMP are presented. Over 21,000 faults were injected in the CPU, memory, bus interface circuits, and error detection, masking, and error reporting circuits of one LRU of the multiprocessor. Detection, isolation, and reconfiguration times were recorded for each fault, and the results were found to agree well with earlier assumptions made in reliability modeling.

  8. Advanced information processing system: The Army Fault-Tolerant Architecture detailed design overview

    NASA Technical Reports Server (NTRS)

    Harper, Richard E.; Babikyan, Carol A.; Butler, Bryan P.; Clasen, Robert J.; Harris, Chris H.; Lala, Jaynarayan H.; Masotto, Thomas K.; Nagle, Gail A.; Prizant, Mark J.; Treadwell, Steven

    1994-01-01

    The Army Avionics Research and Development Activity (AVRADA) is pursuing programs that would enable effective and efficient management of the large amounts of situational data that occur during tactical rotorcraft missions. The Computer Aided Low Altitude Night Helicopter Flight Program has identified automated Terrain Following/Terrain Avoidance, Nap of the Earth (TF/TA, NOE) operation as a key enabling technology for advanced tactical rotorcraft to enhance mission survivability and mission effectiveness. The processing of critical information at low altitudes with short reaction times is life-critical and mission-critical, necessitating an ultra-reliable, high-throughput computing platform for dependable service for flight control, fusion of sensor data, route planning, near-field/far-field navigation, and obstacle avoidance operations. To address these needs the Army Fault Tolerant Architecture (AFTA) is being designed and developed. This computer system is based upon the Fault Tolerant Parallel Processor (FTPP) developed by Charles Stark Draper Labs (CSDL). AFTA is a hard real-time, Byzantine fault-tolerant parallel processor which is programmed in the Ada language. This document describes the results of the Detailed Design (Phase 2 and 3 of a 3-year project) of the AFTA development. This document contains detailed descriptions of the program objectives, the TF/TA NOE application requirements, architecture, hardware design, operating systems design, systems performance measurements and analytical models.

  9. Multiversion software reliability through fault-avoidance and fault-tolerance

    NASA Technical Reports Server (NTRS)

    Vouk, Mladen A.; Mcallister, David F.

    1990-01-01

    In this project we have proposed to investigate a number of experimental and theoretical issues associated with the practical use of multi-version software in providing dependable software through fault-avoidance and fault-elimination, as well as run-time tolerance of software faults. In the period reported here we have been working on the following: We have continued collection of data on the relationships between software faults and reliability, and the coverage provided by the testing process as measured by different metrics (including data flow metrics). We continued work on software reliability estimation methods based on non-random sampling, and the relationship between software reliability and code coverage provided through testing. We have continued studying back-to-back testing as an efficient mechanism for removal of uncorrelated faults, and common-cause faults of variable span. We have also been studying back-to-back testing as a tool for improvement of the software change process, including regression testing. We continued investigating existing fault-tolerance models and worked on the formulation of new ones. In particular, we have partly finished evaluation of Consensus Voting in the presence of correlated failures, and are in the process of finishing evaluation of Consensus Recovery Block (CRB) under failure correlation. We find both approaches far superior to commonly employed fixed agreement number voting (usually majority voting). We have also finished a cost analysis of the CRB approach.
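
    The contrast between fixed-agreement (majority) voting and Consensus Voting in the abstract above can be illustrated with a minimal Python sketch. It assumes exact-match comparison of version outputs and a first-occurrence tie-break; the function names are hypothetical and the tie-break is an assumption, not the rule used in the report.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value a strict majority of versions agree on, else None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None

def consensus_vote(outputs):
    """Return the value backed by the largest agreement group.

    Ties are broken by first occurrence here (an assumption made for
    this sketch; other tie-break rules are possible).
    """
    counts = Counter(outputs)
    best = max(counts.values())
    for value in outputs:
        if counts[value] == best:
            return value

# Five versions, no strict majority, but one agreement group of size two.
outputs = [1, 2, 3, 1, 4]
no_majority = majority_vote(outputs)
plurality = consensus_vote(outputs)
```

    In this example the majority voter yields no answer, while the consensus voter still returns the value of the largest agreement group.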

  10. Algorithm-dependent fault tolerance for distributed computing

    SciTech Connect

    P. D. Hough; M. E. Goldsby; E. J. Walsh

    2000-02-01

    Large-scale distributed systems assembled from commodity parts, like CPlant, have become common tools in the distributed computing world. Because of their size and diversity of parts, these systems are prone to failures. Applications that are being run on these systems have not been equipped to efficiently deal with failures, nor is there vendor support for fault tolerance. Thus, when a failure occurs, the application crashes. While most programmers make use of checkpoints to allow for restarting of their applications, this is cumbersome and incurs substantial overhead. In many cases, there are more efficient and more elegant ways in which to address failures. The goal of this project is to develop a software architecture for the detection of and recovery from faults in a cluster computing environment. The detection phase relies on the latest techniques developed in the fault tolerance community. Recovery is being addressed in an application-dependent manner, thus allowing the programmer to take advantage of algorithmic characteristics to reduce the overhead of fault tolerance. This architecture will allow large-scale applications to be more robust in high-performance computing environments that are comprised of clusters of commodity computers such as CPlant and SMP clusters.

  11. Active fault tolerant control of a flexible beam

    NASA Astrophysics Data System (ADS)

    Bai, Yuanqiang; Grigoriadis, Karolos M.; Song, Gangbing

    2007-04-01

    This paper presents the development and application of an H∞ fault detection and isolation (FDI) filter and fault tolerant controller (FTC) for smart structures. A linear matrix inequality (LMI) formulation is obtained to design the full order robust H∞ filter to estimate the faulty input signals. A fault tolerant H∞ controller is designed for the combined system of plant and filter which minimizes the control objective selected in the presence of disturbances and faults. A cantilevered flexible beam bonded with piezoceramic smart materials, in particular the PZT (Lead Zirconate Titanate), in the form of a patch is used in the validation of the FDI filter and FTC controller design. These PZT patches are surface-bonded on the beam and perform as actuators and sensors. A real-time data acquisition and control system is used to record the experimental data and to implement the designed FDI filter and FTC. To assist the control system design, system identification is conducted for the first mode of the smart structural system. The state space model from system identification is used for the H∞ FDI filter design. The controller was designed based on minimization of the control effort and displacement of the beam. The residuals obtained from the filter through experiments clearly identify the fault signals. The experimental results of the proposed FTC controller show its effectiveness for the vibration suppression of the beam for the faulty system when the piezoceramic actuator has a partial failure.

  12. Development and evaluation of a Fault-Tolerant Multiprocessor (FTMP) computer. Volume 2: FTMP software

    NASA Technical Reports Server (NTRS)

    Lala, J. H.; Smith, T. B., III

    1983-01-01

    The software developed for the Fault-Tolerant Multiprocessor (FTMP) is described. The FTMP executive is a timer-interrupt driven dispatcher that schedules iterative tasks which run at 3.125, 12.5, and 25 Hz. Major tasks which run under the executive include system configuration control, flight control, and display. The flight control task includes autopilot and autoland functions for a jet transport aircraft. System Displays include status displays of all hardware elements (processors, memories, I/O ports, buses), failure log displays showing transient and hard faults, and an autopilot display. All software is in a higher order language (AED, an ALGOL derivative). The executive is a fully distributed general purpose executive which automatically balances the load among available processor triads. Provisions for graceful performance degradation under processing overload are an integral part of the scheduling algorithms.
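
    As a rough illustration of how a timer-driven dispatcher can serve the three iteration rates named above, the Python sketch below assumes a 25 Hz base frame and derives the slower rate groups by dividing a frame counter. The frame layout and all names are assumptions made for illustration, not the FTMP executive's actual design.

```python
BASE_RATE_HZ = 25.0  # assumed base frame rate (the fastest rate in the abstract)

# Frames between successive runs: 25 Hz -> 1, 12.5 Hz -> 2, 3.125 Hz -> 8.
RATE_GROUPS = {rate: int(BASE_RATE_HZ / rate) for rate in (25.0, 12.5, 3.125)}

def tasks_due(frame):
    """Return the rates (in Hz) whose tasks are dispatched in this frame."""
    return [rate for rate, period in RATE_GROUPS.items() if frame % period == 0]

def schedule(num_frames):
    """Return the list of rate groups dispatched in each of the first frames."""
    return [tasks_due(frame) for frame in range(num_frames)]

# One major cycle of eight 40 ms frames (0.32 s at the assumed base rate).
major_cycle = schedule(8)
```

    Over one eight-frame major cycle the 25 Hz group runs eight times, the 12.5 Hz group four times, and the 3.125 Hz group once.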

  13. Sliding mode based fault detection, reconstruction and fault tolerant control scheme for motor systems.

    PubMed

    Mekki, Hemza; Benzineb, Omar; Boukhetala, Djamel; Tadjine, Mohamed; Benbouzid, Mohamed

    2015-07-01

    The fault-tolerant control problem belongs to the domain of complex control systems in which inter-control-disciplinary information and expertise are required. This paper proposes an improved fault detection, reconstruction and fault-tolerant control (FTC) scheme for motor systems (MS) with typical faults. For this purpose, a sliding mode controller (SMC) with an integral sliding surface is adopted. This controller can make the system output track the desired position reference signal in finite time and obtain a better dynamic response and anti-disturbance performance. However, this controller cannot deal directly with total system failures. An appropriate combination of the adopted SMC and a sliding mode observer (SMO) is therefore designed to detect and reconstruct the faults on-line and to give a sensorless control strategy which can achieve tolerance to a wide class of total additive failures. The closed-loop stability is proved using Lyapunov stability theory. Simulation results in healthy and faulty conditions confirm the reliability of the suggested framework. PMID:25747198

  14. Fault Tolerance in ZigBee Wireless Sensor Networks

    NASA Technical Reports Server (NTRS)

    Alena, Richard; Gilstrap, Ray; Baldwin, Jarren; Stone, Thom; Wilson, Pete

    2011-01-01

    Wireless sensor networks (WSN) based on the IEEE 802.15.4 Personal Area Network standard are finding increasing use in the home automation and emerging smart energy markets. The network and application layers, based on the ZigBee 2007 PRO Standard, provide a convenient framework for component-based software that supports customer solutions from multiple vendors. This technology is supported by System-on-a-Chip solutions, resulting in extremely small and low-power nodes. The Wireless Connections in Space Project addresses the aerospace flight domain for both flight-critical and non-critical avionics. WSNs provide the inherent fault tolerance required for aerospace applications utilizing such technology. The team from Ames Research Center has developed techniques for assessing the fault tolerance of ZigBee WSNs challenged by radio frequency (RF) interference or WSN node failure.

  15. Fault-tolerant clock synchronization validation methodology. [in computer systems]

    NASA Technical Reports Server (NTRS)

    Butler, Ricky W.; Palumbo, Daniel L.; Johnson, Sally C.

    1987-01-01

    A validation method for the synchronization subsystem of a fault-tolerant computer system is presented. The high reliability requirement of flight-crucial systems precludes the use of most traditional validation methods. The method presented utilizes formal design proof to uncover design and coding errors and experimentation to validate the assumptions of the design proof. The experimental method is described and illustrated by validating the clock synchronization system of the Software Implemented Fault Tolerance computer. The design proof of the algorithm includes a theorem that defines the maximum skew between any two nonfaulty clocks in the system in terms of specific system parameters. Most of these parameters are deterministic. One crucial parameter is the upper bound on the clock read error, which is stochastic. The probability that this upper bound is exceeded is calculated from data obtained by the measurement of system parameters. This probability is then included in a detailed reliability analysis of the system.

  16. A Fault Tolerant System for an Integrated Avionics Sensor Configuration

    NASA Technical Reports Server (NTRS)

    Caglayan, A. K.; Lancraft, R. E.

    1984-01-01

    An aircraft sensor fault tolerant system methodology for the Transport Systems Research Vehicle in a Microwave Landing System (MLS) environment is described. The fault tolerant system provides reliable estimates in the presence of possible failures both in ground-based navigation aids, and in on-board flight control and inertial sensors. Sensor failures are identified by utilizing the analytic relationships between the various sensors arising from the aircraft point mass equations of motion. The estimation and failure detection performance of the software implementation (called FINDS) of the developed system was analyzed on a nonlinear digital simulation of the research aircraft. Simulation results showing the detection performance of FINDS, using a dual redundant sensor complement, are presented for bias, hardover, null, ramp, increased noise and scale factor failures. In general, the results show that FINDS can distinguish between normal operating sensor errors and failures while providing an excellent detection speed for bias failures in the MLS, indicated airspeed, attitude and radar altimeter sensors.

  17. Fault-tolerant software for aircraft control systems

    NASA Technical Reports Server (NTRS)

    1978-01-01

    Concepts for software to implement real time aircraft control systems on a centralized digital computer were discussed. A fault tolerant software structure employing functionally redundant routines with concurrent error detection was proposed for critical control functions involving safety of flight and landing. A degraded recovery block concept was devised to allow collocation of critical and noncritical software modules within the same control structure. The additional computer resources required to implement the proposed software structure for a representative set of aircraft control functions were discussed. It was estimated that approximately 30 percent more memory space is required to implement the total set of control functions. A reliability model for the fault tolerant software was described and parametric estimates of failure rate were made.

  18. FTMP - A highly reliable Fault-Tolerant Multiprocessor for aircraft

    NASA Technical Reports Server (NTRS)

    Hopkins, A. L., Jr.; Smith, T. B., III; Lala, J. H.

    1978-01-01

    The FTMP (Fault-Tolerant Multiprocessor) is a complex multiprocessor computer that employs a form of redundancy related to systems considered by Mathur (1971), in which each major module can substitute for any other module of the same type. Despite the conceptual simplicity of the redundancy form, the implementation has many intricacies owing partly to the low target failure rate, and partly to the difficulty of eliminating single-fault vulnerability. An extensive analysis of the computer through the use of such modeling techniques as Markov processes and combinatorial mathematics shows that for random hard faults the computer can meet its requirements. It is also shown that the maintenance scheduled at intervals of 200 hr or more can be adequate most of the time.

  19. Fault-tolerant Landau-Zener quantum gates

    SciTech Connect

    Hicke, C.; Santos, L. F.; Dykman, M. I.

    2006-01-15

    We present a method to perform fault-tolerant single-qubit gate operations using Landau-Zener tunneling. In a single Landau-Zener pulse, the qubit transition frequency is varied in time so that it passes through the frequency of the radiation field. We show that a simple three-pulse sequence allows eliminating errors in the gate up to the third order in errors in the qubit energies or the radiation frequency.

  20. Decomposition in reliability analysis of fault-tolerant systems

    NASA Technical Reports Server (NTRS)

    Trivedi, K. S.; Geist, R. M.

    1983-01-01

    The existing approaches to reliability modeling are briefly reviewed. An examination of the limitations of the existing approaches in modeling ultrareliable fault-tolerant systems illustrates the need to use decomposition techniques. The notion of behavioral decomposition is introduced for dealing with reliability models with a large number of states, and a series of examples is presented. The CARE (computer-aided reliability estimation) and HARP (hybrid automated reliability predictor) approaches to reliability are discussed.

  1. Fault tolerant sequential circuits using sequence invariant state machines

    NASA Technical Reports Server (NTRS)

    Alahmad, M.; Whitaker, S.

    1991-01-01

    The idea of introducing redundancy to improve the reliability of digital systems originates from papers published in the 1950s. Since then, redundancy has been recognized as a realistic means for constructing reliable systems. A method using redundancy to reconfigure the Sequence Invariant State Machine (SISM) to achieve fault tolerance is introduced. This new architecture is most useful in space applications, where recovery rather than replacement of faulty modules is the only means of maintenance.

  2. Bounded-time fault-tolerant rule-based systems

    NASA Technical Reports Server (NTRS)

    Browne, James C.; Emerson, Allen; Gouda, Mohamed; Miranker, Daniel; Mok, Aloysius; Rosier, Louis

    1990-01-01

    Two systems concepts are introduced: bounded response-time and self-stabilization in the context of rule-based programs. These concepts are essential for the design of rule-based programs which must be highly fault tolerant and perform in a real time environment. The mechanical analysis of programs for these two properties is discussed. The techniques are used to analyze a NASA application.

  3. A fault-tolerant one-way quantum computer

    SciTech Connect

    Raussendorf, R.; Harrington, J.; Goyal, K.

    2006-09-15

    We describe a fault-tolerant one-way quantum computer on cluster states in three dimensions. The presented scheme uses methods of topological error correction resulting from a link between cluster states and surface codes. The error threshold is 1.4% for local depolarizing error and 0.11% for each source in an error model with preparation-, gate-, storage-, and measurement errors.

  4. The art of fault-tolerant system reliability modeling

    NASA Technical Reports Server (NTRS)

    Butler, Ricky W.; Johnson, Sally C.

    1990-01-01

    A step-by-step tutorial of the methods and tools used for the reliability analysis of fault-tolerant systems is presented. Emphasis is on the representation of architectural features in mathematical models. Details of the mathematical solution of complex reliability models are not presented. Instead the use of several recently developed computer programs--SURE, ASSIST, STEM, PAWS--which automate the generation and solution of these models is described.

  5. Validation of a fault-tolerant clock synchronization system

    NASA Technical Reports Server (NTRS)

    Butler, R. W.; Johnson, S. C.

    1984-01-01

    A validation method for the synchronization subsystem of a fault tolerant computer system is investigated. The method combines formal design verification with experimental testing. The design proof reduces the correctness of the clock synchronization system to the correctness of a set of axioms which are experimentally validated. Since the reliability requirements are often extreme, requiring the estimation of extremely large quantiles, an asymptotic approach to estimation in the tail of a distribution is employed.

  6. Development and evaluation of a Fault-Tolerant Multiprocessor (FTMP) computer. Volume 3: FTMP test and evaluation

    NASA Technical Reports Server (NTRS)

    Lala, J. H.; Smith, T. B., III

    1983-01-01

    The experimental test and evaluation of the Fault-Tolerant Multiprocessor (FTMP) is described. Major objectives of this exercise include expanding validation envelope, building confidence in the system, revealing any weaknesses in the architectural concepts and in their execution in hardware and software, and in general, stressing the hardware and software. To this end, pin-level faults were injected into one LRU of the FTMP and the FTMP response was measured in terms of fault detection, isolation, and recovery times. A total of 21,055 stuck-at-0, stuck-at-1 and invert-signal faults were injected in the CPU, memory, bus interface circuits, Bus Guardian Units, and voters and error latches. Of these, 17,418 were detected. At least 80 percent of undetected faults are estimated to be on unused pins. The multiprocessor identified all detected faults correctly and recovered successfully in each case. Total recovery time for all faults averaged a little over one second. This can be reduced to half a second by including appropriate self-tests.
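
    The fault-injection figures quoted above imply a few derived quantities that the abstract does not state explicitly. The short Python sketch below works them out; the variable names are hypothetical, and the 80 percent figure is treated as the lower bound given in the abstract.

```python
injected = 21_055   # total pin-level faults injected (from the abstract)
detected = 17_418   # faults detected (from the abstract)

undetected = injected - detected
detection_fraction = detected / injected

# "At least 80 percent of undetected faults are estimated to be on unused pins."
unused_pin_lower_bound = 0.80 * undetected

# Remaining undetected faults that may lie on pins actually in use (upper bound).
used_pin_upper_bound = undetected - unused_pin_lower_bound
used_pin_fraction_of_all = used_pin_upper_bound / injected
```

    Under these figures about 83 percent of injected faults were detected, and at most roughly 3.5 percent of all injected faults were both undetected and located on pins in use.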

  7. Scalable and Fault Tolerant Failure Detection and Consensus

    SciTech Connect

    Katti, Amogh; Di Fatta, Giuseppe; Naughton III, Thomas J; Engelmann, Christian

    2015-01-01

    Future extreme-scale high-performance computing systems will be required to work under frequent component failures. The MPI Forum's User Level Failure Mitigation proposal has introduced an operation, MPI_Comm_shrink, to synchronize the alive processes on the list of failed processes, so that applications can continue to execute even in the presence of failures by adopting algorithm-based fault tolerance techniques. This MPI_Comm_shrink operation requires a fault tolerant failure detection and consensus algorithm. This paper presents and compares two novel failure detection and consensus algorithms. The proposed algorithms are based on Gossip protocols and are inherently fault-tolerant and scalable. The proposed algorithms were implemented and tested using the Extreme-scale Simulator. The results show that in both algorithms the number of Gossip cycles to achieve global consensus scales logarithmically with system size. The second algorithm also shows better scalability in terms of memory and network bandwidth usage and a perfect synchronization in achieving global consensus.
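
    To give a feel for why gossip-based dissemination of a failed-process list needs a number of cycles that grows slowly with system size, the Python sketch below simulates a generic push-gossip protocol. It is not either of the two algorithms from the paper: the single initial detector, the uniform random peer choice, and the function name are all assumptions made for illustration.

```python
import random

def gossip_cycles(num_procs, failed, seed=0):
    """Count push-gossip cycles until every alive process knows all of `failed`."""
    rng = random.Random(seed)           # seeded so that runs are repeatable
    target = set(failed)
    alive = [p for p in range(num_procs) if p not in target]
    known = {p: set() for p in alive}
    known[alive[0]] = set(target)       # assume one process detects the failures first
    cycles = 0
    while any(known[p] != target for p in alive):
        # Every alive process pushes its current list to one random alive peer.
        updates = [(rng.choice(alive), set(known[p])) for p in alive]
        for peer, info in updates:
            known[peer] |= info
        cycles += 1
    return cycles

cycles_64 = gossip_cycles(64, {3})
```

    Because the number of informed processes can at most double per cycle, this push scheme needs at least log2 of the number of alive processes; the simulation lets one observe how close a given run comes to that bound.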

  8. Reliability of Fault Tolerant Control Systems. Part 2

    NASA Technical Reports Server (NTRS)

    Wu, N. Eva

    2000-01-01

    This paper reports Part II of a two part effort that is intended to delineate the relationship between reliability and fault tolerant control in a quantitative manner. Reliability properties peculiar to fault-tolerant control systems are emphasized, such as the presence of analytic redundancy in high proportion, the dependence of failures on control performance, and high risks associated with decisions in redundancy management due to multiple sources of uncertainties and sometimes large processing requirements. As a consequence, coverage of failures through redundancy management can be severely limited. The paper proposes to formulate the fault tolerant control problem as an optimization problem that maximizes coverage of failures through redundancy management. Coverage modeling is attempted in a way that captures its dependence on the control performance and on the diagnostic resolution. Under the proposed redundancy management policy, it is shown that an enhanced overall system reliability can be achieved with a control law of a superior robustness, with an estimator of a higher resolution, and with a control performance requirement of a lesser stringency.

  9. Faster quantum chemistry simulation on fault-tolerant quantum computers

    NASA Astrophysics Data System (ADS)

    Cody Jones, N.; Whitfield, James D.; McMahon, Peter L.; Yung, Man-Hong; Van Meter, Rodney; Aspuru-Guzik, Alán; Yamamoto, Yoshihisa

    2012-11-01

    Quantum computers can in principle simulate quantum physics exponentially faster than their classical counterparts, but some technical hurdles remain. We propose methods which substantially improve the performance of a particular form of simulation, ab initio quantum chemistry, on fault-tolerant quantum computers; these methods generalize readily to other quantum simulation problems. Quantum teleportation plays a key role in these improvements and is used extensively as a computing resource. To improve execution time, we examine techniques for constructing arbitrary gates which perform substantially faster than circuits based on the conventional Solovay-Kitaev algorithm (Dawson and Nielsen 2006 Quantum Inform. Comput. 6 81). For a given approximation error ɛ, arbitrary single-qubit gates can be produced fault-tolerantly and using a restricted set of gates in time which is O(log ɛ) or O(log log ɛ); with sufficient parallel preparation of ancillas, constant average depth is possible using a method we call programmable ancilla rotations. Moreover, we construct and analyze efficient implementations of first- and second-quantized simulation algorithms using the fault-tolerant arbitrary gates and other techniques, such as implementing various subroutines in constant time. A specific example we analyze is the ground-state energy calculation for lithium hydride.

  10. Active Fault Tolerant Control for Ultrasonic Piezoelectric Motor

    NASA Astrophysics Data System (ADS)

    Boukhnifer, Moussa

    2012-07-01

    Ultrasonic piezoelectric motor technology is an important system component in integrated mechatronics devices working under extreme operating conditions. Due to these constraints, robustness and performance of the control interfaces should be taken into account in the motor design. In this paper, we apply a new architecture for fault tolerant control using Youla parameterization to an ultrasonic piezoelectric motor. The distinguishing feature of the proposed controller architecture is that it shows structurally how the controller design for performance and robustness may be done separately, which has the potential to overcome the conflict between performance and robustness in the traditional feedback framework. The fault tolerant control architecture includes two parts: one part for performance and the other part for robustness. The controller design works in such a way that the feedback control system is controlled solely by the proportional plus double-integral (PI2) performance controller for a nominal model without disturbances, and the H∞ robustification controller is activated only in the presence of uncertainties or external disturbances. The simulation results demonstrate the effectiveness of the proposed fault tolerant control architecture.

  11. Resource requirements for a fault-tolerant quantum Fourier transform

    NASA Astrophysics Data System (ADS)

    Goto, Hayato

    2014-11-01

    We investigate resource requirements for a fault-tolerant quantum Fourier transform. The quantum Fourier transform is a basic subroutine for quantum algorithms which provide an exponential speedup over known classical ones, such as Shor's algorithm for factoring. To implement single-qubit rotations required for a quantum Fourier transform in a fault-tolerant manner, we consider two types of approaches: gate synthesis and state distillation. While the gate synthesis approximates single-qubit rotations with basic quantum operations, the state distillation allows one to perform single-qubit rotations for a quantum Fourier transform exactly. It is unknown, however, which approach is better for a quantum Fourier transform. Here we develop a state-distillation method optimized for a quantum Fourier transform and compare this performance with those of state-of-the-art techniques for gate synthesis without and with ancillary states (ancillas). The performance is evaluated with the resource requirement for a quantum Fourier transform. The resource is measured by the total number of π/8 gates, denoted by T, which is called the T count. Contrary to the expectation, the T count for the state distillation is considerably larger than those for the ancilla-free and ancilla-assisted gate synthesis. Thus, we conclude that the ancilla-assisted gate synthesis is a better approach to a fault-tolerant quantum Fourier transform.

  12. Fault-Tolerant, Radiation-Hard DSP

    NASA Technical Reports Server (NTRS)

    Czajkowski, David

    2011-01-01

    Commercial digital signal processors (DSPs) for use in high-speed satellite computers are challenged by the damaging effects of space radiation, mainly single event upsets (SEUs) and single event functional interrupts (SEFIs). Innovations have been developed for mitigating the effects of SEUs and SEFIs, enabling the use of very-high-speed commercial DSPs with improved SEU tolerances. Time-triple modular redundancy (TTMR) is a method of applying traditional triple modular redundancy on a single processor, exploiting the VLIW (very long instruction word) class of parallel processors. TTMR improves SEU rates substantially. SEFIs are solved by a SEFI-hardened core circuit, external to the microprocessor. It monitors the health of the processor, and if a SEFI occurs, forces the processor to return to performance through a series of escalating events. TTMR and hardened-core solutions were developed for both DSPs and reconfigurable field-programmable gate arrays (FPGAs). This includes advancement of TTMR algorithms for DSPs and reconfigurable FPGAs, plus a rad-hard, hardened-core integrated circuit that services both the DSP and FPGA. Additionally, a combined DSP and FPGA board architecture was fully developed into a rad-hard engineering product. This technology enables use of commercial off-the-shelf (COTS) DSPs in computers for satellite and other space applications, allowing rapid deployment at a much lower cost. Traditional rad-hard space computers are very expensive and typically have long lead times. These computers are either based on traditional rad-hard processors, which have extremely low computational performance, or triple modular redundant (TMR) FPGA arrays, which suffer from power and complexity issues. Even more frustrating is that the TMR arrays of FPGAs require a fixed, external rad-hard voting element, thereby causing them to lose much of their reconfiguration capability and in some cases significant speed reduction. The benefits of COTS high

  13. Verification of the FtCayuga fault-tolerant microprocessor system. Volume 1: A case study in theorem prover-based verification

    NASA Technical Reports Server (NTRS)

    Srivas, Mandayam; Bickford, Mark

    1991-01-01

    The design and formal verification of a hardware system for a task that is an important component of a fault tolerant computer architecture for flight control systems is presented. The hardware system implements an algorithm for obtaining interactive consistency (Byzantine agreement) among four microprocessors as a special instruction on the processors. The property verified ensures that an execution of the special instruction by the processors correctly accomplishes interactive consistency, provided certain preconditions hold. An assumption is made that the processors execute synchronously. For verification, the authors used a computer aided design hardware design verification tool, Spectool, and the theorem prover, Clio. A major contribution of the work is the demonstration of a significant fault tolerant hardware design that is mechanically verified by a theorem prover.
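
    Interactive consistency among four processors with one possible fault can be sketched with the classic two-round oral-messages exchange, OM(1), of Lamport, Shostak and Pease. The Python sketch below is a software illustration under that assumption; the abstract does not say which algorithm the hardware implements, and the `lie` hook modelling the faulty processor is hypothetical.

```python
from collections import Counter

def majority(values):
    """Strict majority of the values, or None (a common default) if there is none."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) // 2 else None

def om1(commander, value, n, faulty, lie):
    """Run OM(1) for one commander; return {lieutenant: decided value}.

    `faulty` is the index of the single faulty processor (or None), and
    `lie(src, dst)` is a hypothetical hook giving what it sends to `dst`.
    """
    lieutenants = [p for p in range(n) if p != commander]
    # Round 1: the commander sends its value to every lieutenant.
    received = {p: lie(commander, p) if commander == faulty else value
                for p in lieutenants}
    # Round 2: every lieutenant relays what it received to the others,
    # then each one takes the majority of the values it now holds.
    decided = {}
    for p in lieutenants:
        votes = [received[p]]
        for q in lieutenants:
            if q != p:
                votes.append(lie(q, p) if q == faulty else received[q])
        decided[p] = majority(votes)
    return decided

def interactive_consistency(values, faulty, lie):
    """Return {processor: agreed vector} for every nonfaulty processor."""
    n = len(values)
    vectors = {p: [None] * n for p in range(n) if p != faulty}
    for c in range(n):
        decided = om1(c, values[c], n, faulty, lie)
        for p in vectors:
            vectors[p][c] = values[c] if p == c else decided[p]
    return vectors

# Processor 2 is faulty and tells every receiver something different.
vectors = interactive_consistency([10, 11, 12, 13], 2, lambda src, dst: 100 + dst)
```

    In this run the three nonfaulty processors end up with identical vectors whose entries for nonfaulty sources equal the values those sources actually sent, which are the two conditions interactive consistency requires.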

  14. A scheme for fault tolerance in earth sensors

    NASA Technical Reports Server (NTRS)

    Murugesan, S.; Goel, P. S.

    1989-01-01

    A system is presented that uses dual-redundant earth sensors to measure pitch and roll errors of a three-axis stabilized spacecraft, with provision for (1) autonomously detecting and identifying a faulty earth sensor, and (2) automatically selecting the outputs of the fault-free sensor for closed-loop attitude control, before failures cause major problems. A brief description is given of the system, and various failure modes of earth sensors and their effects are discussed. Novel techniques and algorithms for automatic fault detection, identification, and reconfiguration (FDIR) of dual-redundant earth sensors are developed. The algorithms are validated through computer simulations, and the results are presented. The proposed scheme can easily be implemented without much penalty on hardware, power consumption, and processing time.

  15. Algorithm-Based Fault Tolerance for Numerical Subroutines

    NASA Technical Reports Server (NTRS)

    Tumon, Michael; Granat, Robert; Lou, John

    2007-01-01

    A software library implements a new methodology of detecting faults in numerical subroutines, thus enabling application programs that contain the subroutines to recover transparently from single-event upsets. The software library in question is fault-detecting middleware that is wrapped around the numerical subroutines. Conventional serial versions (based on LAPACK and FFTW) and a parallel version (based on ScaLAPACK) exist. The source code of the application program that contains the numerical subroutines is not modified, and the middleware is transparent to the user. The methodology used is a type of algorithm-based fault tolerance (ABFT). In ABFT, a checksum is computed before a computation and compared with the checksum of the computational result; an error is declared if the difference between the checksums exceeds some threshold. Novel normalization methods are used in the checksum comparison to ensure correct fault detections independent of algorithm inputs. In tests of this software reported in the peer-reviewed literature, this library was shown to enable detection of 99.9 percent of significant faults while generating no false alarms.
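
    The checksum comparison described above can be sketched for matrix multiplication, where the column sums of the product A·B can be predicted from A and B before the product is formed. This is only an illustration of the ABFT idea in pure Python, not the library's code: the function names, the `fault` injection hook, and the simple magnitude-based threshold (a stand-in for the library's normalization methods, which the abstract does not specify) are assumptions.

```python
def matmul(A, B):
    """Plain dense matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def column_sums(M):
    """Sum of each column of M."""
    return [sum(col) for col in zip(*M)]


def abft_matmul(A, B, rel_tol=1e-9, fault=None):
    """Multiply A by B and check the result against a precomputed checksum.

    fault: hypothetical hook (i, j, delta) that corrupts one element of the
           result, standing in for a single-event upset.
    Returns (C, ok) where ok is False if a fault was detected.
    """
    # Checksum predicted before the computation: (1^T A) B equals 1^T (A B).
    expected = matmul([column_sums(A)], B)[0]

    C = matmul(A, B)
    if fault is not None:
        i, j, delta = fault
        C[i][j] += delta

    actual = column_sums(C)
    # Assumed threshold: scale the tolerance by the checksum magnitude.
    scale = max(1.0, max(abs(x) for x in expected))
    ok = all(abs(e - a) <= rel_tol * scale for e, a in zip(expected, actual))
    return C, ok
```

    For example, `abft_matmul([[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]])` reports no fault, while passing `fault=(0, 1, 0.5)` makes the column checksums disagree and the check fail.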

  16. Disturbance observer based fault estimation and dynamic output feedback fault tolerant control for fuzzy systems with local nonlinear models.

    PubMed

    Han, Jian; Zhang, Huaguang; Wang, Yingchun; Liu, Yang

    2015-11-01

    This paper addresses the problems of fault estimation (FE) and fault tolerant control (FTC) for fuzzy systems that simultaneously have local nonlinear models, external disturbances, and sensor and actuator faults. A disturbance observer (DO) and an FE observer are designed simultaneously. Compared with existing results, the proposed observer has a wider range of application. Using the estimation information, a novel fuzzy dynamic output feedback fault tolerant controller (DOFFTC) is designed. The controller can be used for fuzzy systems with unmeasurable local nonlinear models, mismatched input disturbances, and measurement outputs affected by sensor faults and disturbances. Finally, a simulation shows the effectiveness of the proposed methods. PMID:26456728

  17. Fault-tolerance and two-level pipelining in VLSI systolic arrays

    SciTech Connect

    Kung, H.T.; Lam, M.S.

    1984-01-01

    The authors address two important issues in systolic array designs: fault-tolerance and two-level pipelining. The proposed systolic fault-tolerant scheme maintains the original data flow pattern by bypassing defective cells with a few registers. As a result, many of the desirable properties of systolic arrays (such as local and regular communication between cells) are preserved. Two-level pipelining refers to the use of pipelined functional units in the implementation of systolic cells. The paper addresses the problem of efficiently utilizing pipelined units to increase the overall system throughput. The authors show that both of these problems can be reduced to the same mathematical problem of incorporating extra delays on certain data paths in originally correct systolic designs. They introduce the mathematical notion of a cut, which enables them to handle this problem effectively. The results obtained by applying the techniques described are encouraging. When applied to systolic arrays without feedback cycles, the arrays can tolerate large numbers of failures (with the addition of very little hardware) while maintaining the original throughput. Furthermore, all of the pipeline stages in the cells can be kept fully utilized through the addition of a small number of delay registers. However, adding delays to systolic arrays with cycles typically induces a significant decrease in throughput. In response to this, they have derived a new class of systolic algorithms in which the data cycle around a ring of processing cells. The systolic ring architecture has the property that its performance degrades gracefully as cells fail. Using the cut theory for arrays without feedback and the ring architecture approach for those with feedback, they have effective fault-tolerant and two-level pipelining schemes for most systolic arrays. 24 references.

  18. SIFT - Multiprocessor architecture for Software Implemented Fault Tolerance flight control and avionics computers

    NASA Technical Reports Server (NTRS)

    Forman, P.; Moses, K.

    1979-01-01

    A brief description of a SIFT (Software Implemented Fault Tolerance) Flight Control Computer, with emphasis on implementation, is presented. A multiprocessor system that relies on software-implemented fault detection and reconfiguration algorithms is described. A high level of reliability and fault tolerance is achieved by the replication of computing tasks among processing units.

  19. Fault Detection, Isolation and Recovery (FDIR) Portable Liquid Oxygen Hardware Demonstrator

    NASA Technical Reports Server (NTRS)

    Oostdyk, Rebecca L.; Perotti, Jose M.

    2011-01-01

    The Fault Detection, Isolation and Recovery (FDIR) hardware demonstration will highlight the effort being conducted by Constellation's Ground Operations (GO) to provide the Launch Control System (LCS) with system-level health management during vehicle processing and countdown activities. A proof-of-concept demonstration of the FDIR prototype established the capability of the software to provide real-time fault detection and isolation using generated Liquid Hydrogen data. The FDIR portable testbed unit (presented here) aims to enhance FDIR by providing a dynamic simulation of Constellation subsystems that feed the FDIR software live data based on Liquid Oxygen system properties. The LO2 cryogenic ground system has key properties that are analogous to the properties of an electronic circuit. The LO2 system is modeled using electrical components, and an equivalent circuit is designed on a printed circuit board to simulate the live data. The portable testbed is also equipped with data acquisition and communication hardware to relay the measurements to the FDIR application running on a PC. This portable testbed is well suited to FDIR software testing, troubleshooting, and training, among other uses.

  20. A fault-tolerant control architecture for unmanned aerial vehicles

    NASA Astrophysics Data System (ADS)

    Drozeski, Graham R.

    Research has presented several approaches to achieve varying degrees of fault-tolerance in unmanned aircraft. Approaches in reconfigurable flight control are generally divided into two categories: those which incorporate multiple non-adaptive controllers and switch between them based on the output of a fault detection and identification element, and those that employ a single adaptive controller capable of compensating for a variety of fault modes. Regardless of the approach for reconfigurable flight control, certain fault modes dictate system restructuring in order to prevent a catastrophic failure. System restructuring enables active control of actuation not employed by the nominal system to recover controllability of the aircraft. After system restructuring, continued operation requires the generation of flight paths that adhere to an altered flight envelope. The control architecture developed in this research employs a multi-tiered hierarchy to allow unmanned aircraft to generate and track safe flight paths despite the occurrence of potentially catastrophic faults. The hierarchical architecture increases the level of autonomy of the system by integrating five functionalities with the baseline system: fault detection and identification, active system restructuring, reconfigurable flight control, reconfigurable path planning, and mission adaptation. Fault detection and identification algorithms continually monitor aircraft performance and issue fault declarations. When the severity of a fault exceeds the capability of the baseline flight controller, active system restructuring expands the controllability of the aircraft using unconventional control strategies not exploited by the baseline controller. Each of the reconfigurable flight controllers and the baseline controller employs a proven adaptive neural network control strategy. A reconfigurable path planner employs an adaptive model of the vehicle to re-shape the desired flight path.
Generation of the revised

  1. Modeling and measurement of fault-tolerant multiprocessors

    NASA Technical Reports Server (NTRS)

    Shin, K. G.; Woodbury, M. H.; Lee, Y. H.

    1985-01-01

    The workload effects on computer performance are addressed first for a highly reliable unibus multiprocessor used in real-time control. As an approach to studying these effects, a modified Stochastic Petri Net (SPN) is used to describe the synchronous operation of the multiprocessor system. From this model the vital components affecting performance can be determined. However, because of the complexity in solving the modified SPN, a simpler model, i.e., a closed priority queuing network, is constructed that represents the same critical aspects. The use of this model for a specific application requires the partitioning of the workload into job classes. It is shown that the steady state solution of the queuing model directly produces useful results. The use of this model in evaluating an existing system, the Fault Tolerant Multiprocessor (FTMP) at the NASA AIRLAB, is outlined with some experimental results. Also addressed is the technique of measuring fault latency, an important microscopic system parameter. Most related works have assumed no or a negligible fault latency and then performed approximate analyses. To eliminate this deficiency, a new methodology for indirectly measuring fault latency is presented.

  2. Fault-Tolerant Control Based on Hybrid Redundancy

    NASA Astrophysics Data System (ADS)

    Takagi, Taro; Takahashi, Masanori

    This paper presents a new fault-tolerant control system (FTCS) against actuator failures. The proposed FTCS is based on a hybrid of static and dynamic redundancies. The redundancy mode is selected solely by a switching logic that is designed from the control performance. Hence, no fault detector is utilized. For all switched modes, a unity high-gain feedback controller with a parallel feedforward compensator is introduced to attain stabilization and asymptotic tracking. Because the controller has high robustness with respect to uncertainties, the FTCS can cope with variations in dynamics that are caused by the failure. In this paper, several simulation results for connected vehicles are shown to confirm the effectiveness of the FTCS.

  3. Gain-Scheduled Fault Tolerance Control Under False Identification

    NASA Technical Reports Server (NTRS)

    Shin, Jong-Yeob; Belcastro, Christine (Technical Monitor)

    2006-01-01

    An active fault tolerant control (FTC) law is generally sensitive to false identification since the control gain is reconfigured for fault occurrence. In the conventional FTC law design procedure, dynamic variations due to false identification are not considered. In this paper, an FTC synthesis method is developed that incorporates possible variations of closed-loop dynamics under false identification into the control design procedure. An active FTC synthesis problem is formulated as an LMI optimization problem to minimize the upper bound of the induced-L2 norm, which can represent the worst-case performance degradation due to false identification. The developed synthesis method is applied to control of the longitudinal motions of FASER (Free-flying Airplane for Subscale Experimental Research). The designed FTC law of the airplane is simulated for pitch angle command tracking under a false identification case.

  4. Hypothetical Scenario Generator for Fault-Tolerant Diagnosis

    NASA Technical Reports Server (NTRS)

    James, Mark

    2007-01-01

    The Hypothetical Scenario Generator for Fault-tolerant Diagnostics (HSG) is an algorithm being developed in conjunction with other components of artificial-intelligence systems for automated diagnosis and prognosis of faults in spacecraft, aircraft, and other complex engineering systems. By incorporating prognostic capabilities along with advanced diagnostic capabilities, these developments hold promise to increase the safety and affordability of the affected engineering systems by making it possible to obtain timely and accurate information on the statuses of the systems and predicting impending failures well in advance. The HSG is a specific instance of a hypothetical-scenario generator that implements an innovative approach for performing diagnostic reasoning when data are missing. The special purpose served by the HSG is to (1) look for all possible ways in which the present state of the engineering system can be mapped with respect to a given model and (2) generate a prioritized set of future possible states and the scenarios of which they are parts.

  5. An improved fault-tolerant control scheme for PWM inverter-fed induction motor-based EVs.

    PubMed

    Tabbache, Bekheïra; Benbouzid, Mohamed; Kheloui, Abdelaziz; Bourgeot, Jean-Matthieu; Mamoune, Abdeslam

    2013-11-01

    This paper proposes an improved fault-tolerant control scheme for PWM inverter-fed induction motor-based electric vehicles. The proposed strategy deals with the mitigation of power switch (IGBT) failures within a reconfigurable induction motor control. To increase the vehicle powertrain reliability regarding IGBT open-circuit failures, 4-wire and 4-leg PWM inverter topologies are investigated and their performances discussed in a vehicle context. The proposed fault-tolerant topologies require only minimum hardware modifications to the conventional off-the-shelf six-switch three-phase drive, mitigating the IGBT failures by specific inverter control. Indeed, the two topologies exploit the induction motor neutral accessibility for fault-tolerant purposes. The 4-wire topology then uses classical hysteresis controllers to account for the IGBT failures. The 4-leg topology, meanwhile, uses a specific 3D space vector PWM to handle vehicle requirements in terms of size (DC bus capacitors) and cost (number of IGBTs). Experiments on an induction motor drive and simulations on an electric vehicle are carried out using a European urban driving cycle to show that the proposed fault-tolerant control approach is effective and provides a simple configuration with high performance in terms of speed and torque responses. PMID:23916869

  6. Minimizing resource overheads for fault-tolerant preparation of encoded states of the Steane code

    PubMed Central

    Goto, Hayato

    2016-01-01

    The seven-qubit quantum error-correcting code originally proposed by Steane is one of the best known quantum codes. The Steane code has a desirable property that most basic operations can be performed easily in a fault-tolerant manner. A major obstacle to fault-tolerant quantum computation with the Steane code is fault-tolerant preparation of encoded states, which requires large computational resources. Here we propose efficient state preparation methods for zero and magic states encoded with the Steane code, where the zero state is one of the computational basis states and the magic state allows us to achieve universality in fault-tolerant quantum computation. The methods minimize resource overheads for the fault-tolerant state preparation, and therefore reduce necessary resources for quantum computation with the Steane code. Thus, the present results will open a new possibility for efficient fault-tolerant quantum computation. PMID:26812959
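
    For readers unfamiliar with the structure of the Steane code, the sketch below uses its standard construction from the classical [7,4] Hamming code to list the eight computational-basis strings that are superposed (with equal amplitude) in the encoded zero state, and to show how a single bit flip produces a nonzero syndrome. It is a purely classical illustration and does not implement the state-preparation methods proposed in the paper; the names `H`, `row_space`, and `syndrome` are assumptions of this example.

```python
from itertools import product

# Parity-check matrix of the [7,4] Hamming code; column k (counting from 1)
# is the binary representation of k, most significant bit in the first row.
H = [
    (0, 0, 0, 1, 1, 1, 1),
    (0, 1, 1, 0, 0, 1, 1),
    (1, 0, 1, 0, 1, 0, 1),
]


def row_space(rows):
    """All GF(2) linear combinations of the given rows, sorted."""
    words = set()
    for coeffs in product((0, 1), repeat=len(rows)):
        word = tuple(
            sum(c * row[k] for c, row in zip(coeffs, rows)) % 2
            for k in range(len(rows[0]))
        )
        words.add(word)
    return sorted(words)


def syndrome(word):
    """Parity checks of a 7-bit word against H (all zero for a codeword)."""
    return tuple(sum(h * w for h, w in zip(row, word)) % 2 for row in H)


# Basis strings appearing in the encoded zero state.
zero_support = row_space(H)
```

    Every string in `zero_support` has weight 0 or 4 and a zero syndrome. Flipping the bit at index k of any of them yields a syndrome that reads as k + 1 in binary, which is what allows a single bit-flip error to be located.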

  7. Sliding mode fault detection and fault-tolerant control of smart dampers in semi-active control of building structures

    NASA Astrophysics Data System (ADS)

    Yeganeh Fallah, Arash; Taghikhany, Touraj

    2015-12-01

    Recent decades have witnessed much interest in the application of active and semi-active control strategies for seismic protection of civil infrastructures. However, the reliability of these systems is still in doubt as there remains the possibility of malfunctioning of their critical components (i.e. actuators and sensors) during an earthquake. This paper focuses on the application of the sliding mode method due to the inherent robustness of its fault detection observer and fault-tolerant control. The robust sliding mode observer estimates the state of the system and reconstructs the actuators’ faults which are used for calculating a fault distribution matrix. Then the fault-tolerant sliding mode controller reconfigures itself by the fault distribution matrix and accommodates the fault effect on the system. Numerical simulation of a three-story structure with magneto-rheological dampers demonstrates the effectiveness of the proposed fault-tolerant control system. It was shown that the fault-tolerant control system maintains the performance of the structure at an acceptable level in the post-fault case.

  8. Software Implemented Fault-Tolerant (SIFT) user's guide

    NASA Technical Reports Server (NTRS)

    Green, D. F., Jr.; Palumbo, D. L.; Baltrus, D. W.

    1984-01-01

    Program development for a Software Implemented Fault Tolerant (SIFT) computer system is accomplished in the NASA LaRC AIRLAB facility using a DEC VAX-11 to interface with eight Bendix BDX 930 flight control processors. The interface software which provides this SIFT program development capability was developed by AIRLAB personnel. This technical memorandum describes the application and design of this software in detail, and is intended to assist both the user in performance of SIFT research and the systems programmer responsible for maintaining and/or upgrading the SIFT programming environment.

  9. Computer-aided reliability estimation. [for fault-tolerant systems]

    NASA Technical Reports Server (NTRS)

    Stiffler, J. J.

    1977-01-01

    Computer-aided reliability estimation (CARE) programs are developed to improve the tools available for estimating the reliability of fault-tolerant systems. A description is presented of a program, called CARE II, which was developed after the first program reported by Mathur (1971). Attention is given to the CARE II reliability model, the CARE II coverage model, and CARE II limitations which are to be rectified in CARE III. It is pointed out that the present coverage model in CARE II is extremely versatile. The major limitation is related to the burden placed on the user to determine the basic parameters from which the coverage calculations are made.

  10. Implementing fault tolerance in a superconducting quantum circuit

    NASA Astrophysics Data System (ADS)

    Barends, Rami

    2015-03-01

    The surface code error correction scheme is appealing for superconducting circuits as the fundamental operations have been demonstrated at the fault-tolerant threshold. Here, we present experimental results on the repetition code, a one-dimensional primitive of the surface code which can detect bit-flip errors, implemented on a device consisting of nine Xmon transmon qubits. We discuss the basic mechanics of error detection, show preservation of a Greenberger-Horne-Zeilinger state, and show suppression of environmentally-induced error.
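
    The basic mechanics of bit-flip detection in a repetition code can be illustrated with a classical toy model: neighbouring data bits are compared by parity checks, and a flipped bit shows up as a change in the adjacent checks. The sketch below is not a simulation of the superconducting device. The layout of five data bits with four parity checks is an assumption about how nine qubits might be arranged, and all function names are illustrative.

```python
def parity_checks(data):
    """Parity of each pair of neighbouring data bits."""
    return [a ^ b for a, b in zip(data, data[1:])]


def flip(data, index):
    """Return a copy of data with one bit flipped (a bit-flip error)."""
    out = list(data)
    out[index] ^= 1
    return out


def majority_decode(data):
    """Recover the encoded bit by majority vote over the data bits."""
    return int(sum(data) > len(data) / 2)


# Assumed layout: five data bits encoding logical 0, four parity checks.
encoded = [0, 0, 0, 0, 0]
```

    A flip of the middle bit, `flip(encoded, 2)`, triggers the two neighbouring checks, and a flip at the end of the chain triggers only one. In both cases `majority_decode` still returns the encoded value 0.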