hardware fault tolerance: Topics by Science.gov

Sample records for hardware fault tolerance

Analysis of a hardware and software fault tolerant processor for critical applications

NASA Technical Reports Server (NTRS)

Dugan, Joanne B.

1993-01-01

Computer systems for critical applications must be designed to tolerate software faults as well as hardware faults. A unified approach to tolerating hardware and software faults is characterized by classifying faults in terms of duration (transient or permanent) rather than source (hardware or software). Errors arising from transient faults can be handled through masking or voting, but errors arising from permanent faults require system reconfiguration to bypass the failed component. Most errors which are caused by software faults can be considered transient, in that they are input-dependent. Software faults are triggered by a particular set of inputs. Quantitative dependability analysis of systems which exhibit a unified approach to fault tolerance can be performed by a hierarchical combination of fault tree and Markov models. A methodology for analyzing hardware and software fault tolerant systems is applied to the analysis of a hypothetical system, loosely based on the Fault Tolerant Parallel Processor. The models consider both transient and permanent faults, hardware and software faults, independent and related software faults, automatic recovery, and reconfiguration.
Study of a unified hardware and software fault-tolerant architecture

NASA Technical Reports Server (NTRS)

Lala, Jaynarayan; Alger, Linda; Friend, Steven; Greeley, Gregory; Sacco, Stephen; Adams, Stuart

1989-01-01

A unified architectural concept, called the Fault Tolerant Processor Attached Processor (FTP-AP), that can tolerate hardware as well as software faults is proposed for applications requiring ultrareliable computation capability. An emulation of the FTP-AP architecture, consisting of a breadboard Motorola 68010-based quadruply redundant Fault Tolerant Processor, four VAX 750s as attached processors, and four versions of a transport aircraft yaw damper control law, is used as a testbed in the AIRLAB to examine a number of critical issues. Solutions of several basic problems associated with N-Version software are proposed and implemented on the testbed. This includes a confidence voter to resolve coincident errors in N-Version software. A reliability model of N-Version software that is based upon the recent understanding of software failure mechanisms is also developed. The basic FTP-AP architectural concept appears suitable for hosting N-Version application software while at the same time tolerating hardware failures. Architectural enhancements for greater efficiency, software reliability modeling, and N-Version issues that merit further research are identified.
Design and evaluation of a fault-tolerant multiprocessor using hardware recovery blocks

NASA Technical Reports Server (NTRS)

Lee, Y. H.; Shin, K. G.

1982-01-01

A fault-tolerant multiprocessor with a rollback recovery mechanism is discussed. The rollback mechanism is based on the hardware recovery block which is a hardware equivalent to the software recovery block. The hardware recovery block is constructed by consecutive state-save operations and several state-save units in every processor and memory module. When a fault is detected, the multiprocessor reconfigures itself to replace the faulty component and then the process originally assigned to the faulty component retreats to one of the previously saved states in order to resume fault-free execution. A mathematical model is proposed to calculate both the coverage of multi-step rollback recovery and the risk of restart. A performance evaluation in terms of task execution time is also presented.
Fault Tolerant Characteristics of Artificial Neural Network Electronic Hardware

NASA Technical Reports Server (NTRS)

Zee, Frank

1995-01-01

The fault tolerant characteristics of analog-VLSI artificial neural network (with 32 neurons and 532 synapses) chips are studied by exposing them to high energy electrons, high energy protons, and gamma ionizing radiations under biased and unbiased conditions. The biased chips became nonfunctional after receiving a cumulative dose of less than 20 krads, while the unbiased chips only started to show degradation with a cumulative dose of over 100 krads. As the total radiation dose increased, all the components demonstrated graceful degradation. The analog sigmoidal function of the neuron became steeper (increase in gain), current leakage from the synapses progressively shifted the sigmoidal curve, and the digital memory of the synapses and the memory addressing circuits began to gradually fail. From these radiation experiments, we can learn how to modify certain designs of the neural network electronic hardware without using radiation-hardening techniques to increase its reliability and fault tolerance.
A methodology for testing fault-tolerant software

NASA Technical Reports Server (NTRS)

Andrews, D. M.; Mahmood, A.; Mccluskey, E. J.

1985-01-01

A methodology for testing fault tolerant software is presented. There are problems associated with testing fault tolerant software because many errors are masked or corrected by voters, limiter, or automatic channel synchronization. This methodology illustrates how the same strategies used for testing fault tolerant hardware can be applied to testing fault tolerant software. For example, one strategy used in testing fault tolerant hardware is to disable the redundancy during testing. A similar testing strategy is proposed for software, namely, to move the major emphasis on testing earlier in the development cycle (before the redundancy is in place) thus reducing the possibility that undetected errors will be masked when limiters and voters are added.
Rapid recovery from transient faults in the fault-tolerant processor with fault-tolerant shared memory

NASA Technical Reports Server (NTRS)

Harper, Richard E.; Butler, Bryan P.

1990-01-01

The Draper fault-tolerant processor with fault-tolerant shared memory (FTP/FTSM), which is designed to allow application tasks to continue execution during the memory alignment process, is described. Processor performance is not affected by memory alignment. In addition, the FTP/FTSM incorporates a hardware scrubber device to perform the memory alignment quickly during unused memory access cycles. The FTP/FTSM architecture is described, followed by an estimate of the time required for channel reintegration.
A fault-tolerant intelligent robotic control system

NASA Technical Reports Server (NTRS)

Marzwell, Neville I.; Tso, Kam Sing

1993-01-01

This paper describes the concept, design, and features of a fault-tolerant intelligent robotic control system being developed for space and commercial applications that require high dependability. The comprehensive strategy integrates system level hardware/software fault tolerance with task level handling of uncertainties and unexpected events for robotic control. The underlying architecture for system level fault tolerance is the distributed recovery block which protects against application software, system software, hardware, and network failures. Task level fault tolerance provisions are implemented in a knowledge-based system which utilizes advanced automation techniques such as rule-based and model-based reasoning to monitor, diagnose, and recover from unexpected events. The two level design provides tolerance of two or more faults occurring serially at any level of command, control, sensing, or actuation. The potential benefits of such a fault tolerant robotic control system include: (1) a minimized potential for damage to humans, the work site, and the robot itself; (2) continuous operation with a minimum of uncommanded motion in the presence of failures; and (3) more reliable autonomous operation providing increased efficiency in the execution of robotic tasks and decreased demand on human operators for controlling and monitoring the robotic servicing routines.
Study of fault-tolerant software technology

NASA Technical Reports Server (NTRS)

Slivinski, T.; Broglio, C.; Wild, C.; Goldberg, J.; Levitt, K.; Hitt, E.; Webb, J.

1984-01-01

Presented is an overview of the current state of the art of fault-tolerant software and an analysis of quantitative techniques and models developed to assess its impact. It examines research efforts as well as experience gained from commercial application of these techniques. The paper also addresses the computer architecture and design implications on hardware, operating systems and programming languages (including Ada) of using fault-tolerant software in real-time aerospace applications. It concludes that fault-tolerant software has progressed beyond the pure research state. The paper also finds that, although not perfectly matched, newer architectural and language capabilities provide many of the notations and functions needed to effectively and efficiently implement software fault-tolerance.
Fault recovery characteristics of the fault tolerant multi-processor

NASA Technical Reports Server (NTRS)

Padilla, Peter A.

1990-01-01

The fault handling performance of the fault tolerant multiprocessor (FTMP) was investigated. Fault handling errors detected during fault injection experiments were characterized. In these fault injection experiments, the FTMP disabled a working unit instead of the faulted unit once every 500 faults, on the average. System design weaknesses allow active faults to exercise a part of the fault management software that handles byzantine or lying faults. It is pointed out that these weak areas in the FTMP's design increase the probability that, for any hardware fault, a good LRU (line replaceable unit) is mistakenly disabled by the fault management software. It is concluded that fault injection can help detect and analyze the behavior of a system in the ultra-reliable regime. Although fault injection testing cannot be exhaustive, it has been demonstrated that it provides a unique capability to unmask problems and to characterize the behavior of a fault-tolerant system.
Sequoia: A fault-tolerant tightly coupled multiprocessor for transaction processing

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bernstein, P.A.

1988-02-01

The Sequoia computer is a tightly coupled multiprocessor, and thus attains the performance advantages of this style of architecture. It avoids most of the fault-tolerance disadvantages of tight coupling by using a new fault-tolerance design. The Sequoia architecture is similar to other multimicroprocessor architectures, such as those of Encore and Sequent, in that it gives dozens of microprocessors shared access to a large main memory. It resembles the Stratus architecture in its extensive use of hardware fault-detection techniques. It resembles Stratus and Auragen in its ability to quickly recover all processes after a single point failure, transparently to the user.more » However, Sequoia is unique in its combination of a large-scale tightly coupled architecture with a hardware approach to fault tolerance. This article gives an overview of how the hardware architecture and operating systems (OS) work together to provide a high degree of fault tolerance with good system performance.« less
Advanced information processing system: The Army fault tolerant architecture conceptual study. Volume 2: Army fault tolerant architecture design and analysis

NASA Technical Reports Server (NTRS)

Harper, R. E.; Alger, L. S.; Babikyan, C. A.; Butler, B. P.; Friend, S. A.; Ganska, R. J.; Lala, J. H.; Masotto, T. K.; Meyer, A. J.; Morton, D. P.

1992-01-01

Described here is the Army Fault Tolerant Architecture (AFTA) hardware architecture and components and the operating system. The architectural and operational theory of the AFTA Fault Tolerant Data Bus is discussed. The test and maintenance strategy developed for use in fielded AFTA installations is presented. An approach to be used in reducing the probability of AFTA failure due to common mode faults is described. Analytical models for AFTA performance, reliability, availability, life cycle cost, weight, power, and volume are developed. An approach is presented for using VHSIC Hardware Description Language (VHDL) to describe and design AFTA's developmental hardware. A plan is described for verifying and validating key AFTA concepts during the Dem/Val phase. Analytical models and partial mission requirements are used to generate AFTA configurations for the TF/TA/NOE and Ground Vehicle missions.
The Design of a Fault-Tolerant COTS-Based Bus Architecture

NASA Technical Reports Server (NTRS)

Chau, Savio N.; Alkalai, Leon; Burt, John B.; Tai, Ann T.

1999-01-01

In this paper, we report our experiences and findings on the design of a fault-tolerant bus architecture comprised of two COTS buses, the IEEE 1394 and the 12C. This fault-tolerant bus is the backbone system bus for the avionics architecture of the X2000 program at the Jet Propulsion Laboratory. COTS buses are attractive because of the availability of low cost commercial products. However, they are not specifically designed for highly reliable applications such as long-life deep-space missions. The X2000 design team has devised a multi-level fault tolerance approach to compensate for this shortcoming of COTS buses. First, the approach enhances the fault tolerance capabilities of the IEEE 1394 and 12 C buses by adding a layer of fault handling hardware and software. Second, algorithms are developed to enable the IEEE 1394 and the 12 C buses assist each other to isolate and recovery from faults. Third, the set of IEEE 1394 and 12 C buses is duplicated to further enhance system reliability. The X2000 design team has paid special attention to guarantee that all fault tolerance provisions will not cause the bus design to deviate from the commercial standard specifications. Otherwise, the economic attractiveness of using COTS will be diminished. The hardware and software design of the X2000 fault-tolerant bus are being implemented and flight hardware will be delivered to the ST4 and Europa Orbiter missions.
Spacecraft fault tolerance: The Magellan experience

NASA Technical Reports Server (NTRS)

Kasuda, Rick; Packard, Donna Sexton

1993-01-01

Interplanetary and earth orbiting missions are now imposing unique fault tolerant requirements upon spacecraft design. Mission success is the prime motivator for building spacecraft with fault tolerant systems. The Magellan spacecraft had many such requirements imposed upon its design. Magellan met these requirements by building redundancy into all the major subsystem components and designing the onboard hardware and software with the capability to detect a fault, isolate it to a component, and issue commands to achieve a back-up configuration. This discussion is limited to fault protection, which is the autonomous capability to respond to a fault. The Magellan fault protection design is discussed, as well as the developmental and flight experiences and a summary of the lessons learned.
Design study of Software-Implemented Fault-Tolerance (SIFT) computer

NASA Technical Reports Server (NTRS)

Wensley, J. H.; Goldberg, J.; Green, M. W.; Kutz, W. H.; Levitt, K. N.; Mills, M. E.; Shostak, R. E.; Whiting-Okeefe, P. M.; Zeidler, H. M.

1982-01-01

Software-implemented fault tolerant (SIFT) computer design for commercial aviation is reported. A SIFT design concept is addressed. Alternate strategies for physical implementation are considered. Hardware and software design correctness is addressed. System modeling and effectiveness evaluation are considered from a fault-tolerant point of view.
Abnormal fault-recovery characteristics of the fault-tolerant multiprocessor uncovered using a new fault-injection methodology

NASA Technical Reports Server (NTRS)

Padilla, Peter A.

1991-01-01

An investigation was made in AIRLAB of the fault handling performance of the Fault Tolerant MultiProcessor (FTMP). Fault handling errors detected during fault injection experiments were characterized. In these fault injection experiments, the FTMP disabled a working unit instead of the faulted unit once in every 500 faults, on the average. System design weaknesses allow active faults to exercise a part of the fault management software that handles Byzantine or lying faults. Byzantine faults behave such that the faulted unit points to a working unit as the source of errors. The design's problems involve: (1) the design and interface between the simplex error detection hardware and the error processing software, (2) the functional capabilities of the FTMP system bus, and (3) the communication requirements of a multiprocessor architecture. These weak areas in the FTMP's design increase the probability that, for any hardware fault, a good line replacement unit (LRU) is mistakenly disabled by the fault management software.
Fault Tolerant Real-Time Systems

DTIC Science & Technology

1993-09-30

The ART (Advanced Real-Time Technology) Project of Carnegie Mellon University is engaged in wide ranging research on hard real - time systems . The...including hardware and software fault tolerance using temporal redundancy and analytic redundancy to permit the construction of real - time systems whose
Software fault tolerance in computer operating systems

NASA Technical Reports Server (NTRS)

Iyer, Ravishankar K.; Lee, Inhwan

1994-01-01

This chapter provides data and analysis of the dependability and fault tolerance for three operating systems: the Tandem/GUARDIAN fault-tolerant system, the VAX/VMS distributed system, and the IBM/MVS system. Based on measurements from these systems, basic software error characteristics are investigated. Fault tolerance in operating systems resulting from the use of process pairs and recovery routines is evaluated. Two levels of models are developed to analyze error and recovery processes inside an operating system and interactions among multiple instances of an operating system running in a distributed environment. The measurements show that the use of process pairs in Tandem systems, which was originally intended for tolerating hardware faults, allows the system to tolerate about 70% of defects in system software that result in processor failures. The loose coupling between processors which results in the backup execution (the processor state and the sequence of events occurring) being different from the original execution is a major reason for the measured software fault tolerance. The IBM/MVS system fault tolerance almost doubles when recovery routines are provided, in comparison to the case in which no recovery routines are available. However, even when recovery routines are provided, there is almost a 50% chance of system failure when critical system jobs are involved.
Implementation of an experimental fault-tolerant memory system

NASA Technical Reports Server (NTRS)

Carter, W. C.; Mccarthy, C. E.

1976-01-01

The experimental fault-tolerant memory system described in this paper has been designed to enable the modular addition of spares, to validate the theoretical fault-secure and self-testing properties of the translator/corrector, to provide a basis for experiments using the new testing and correction processes for recovery, and to determine the practicality of such systems. The hardware design and implementation are described, together with methods of fault insertion. The hardware/software interface, including a restricted single error correction/double error detection (SEC/DED) code, is specified. Procedures are carefully described which, (1) test for specified physical faults, (2) ensure that single error corrections are not miscorrections due to triple faults, and (3) enable recovery from double errors.
Fault-tolerant, high-level quantum circuits: form, compilation and description

NASA Astrophysics Data System (ADS)

Paler, Alexandru; Polian, Ilia; Nemoto, Kae; Devitt, Simon J.

2017-06-01

Fault-tolerant quantum error correction is a necessity for any quantum architecture destined to tackle interesting, large-scale problems. Its theoretical formalism has been well founded for nearly two decades. However, we still do not have an appropriate compiler to produce a fault-tolerant, error-corrected description from a higher-level quantum circuit for state-of the-art hardware models. There are many technical hurdles, including dynamic circuit constructions that occur when constructing fault-tolerant circuits with commonly used error correcting codes. We introduce a package that converts high-level quantum circuits consisting of commonly used gates into a form employing all decompositions and ancillary protocols needed for fault-tolerant error correction. We call this form the (I)initialisation, (C)NOT, (M)measurement form (ICM) and consists of an initialisation layer of qubits into one of four distinct states, a massive, deterministic array of CNOT operations and a series of time-ordered X- or Z-basis measurements. The form allows a more flexible approach towards circuit optimisation. At the same time, the package outputs a standard circuit or a canonical geometric description which is a necessity for operating current state-of-the-art hardware architectures using topological quantum codes.
A verified design of a fault-tolerant clock synchronization circuit: Preliminary investigations

NASA Technical Reports Server (NTRS)

Miner, Paul S.

1992-01-01

Schneider demonstrates that many fault tolerant clock synchronization algorithms can be represented as refinements of a single proven correct paradigm. Shankar provides mechanical proof that Schneider's schema achieves Byzantine fault tolerant clock synchronization provided that 11 constraints are satisfied. Some of the constraints are assumptions about physical properties of the system and cannot be established formally. Proofs are given that the fault tolerant midpoint convergence function satisfies three of the constraints. A hardware design is presented, implementing the fault tolerant midpoint function, which is shown to satisfy the remaining constraints. The synchronization circuit will recover completely from transient faults provided the maximum fault assumption is not violated. The initialization protocol for the circuit also provides a recovery mechanism from total system failure caused by correlated transient faults.

Error Mitigation of Point-to-Point Communication for Fault-Tolerant Computing

NASA Technical Reports Server (NTRS)

Akamine, Robert L.; Hodson, Robert F.; LaMeres, Brock J.; Ray, Robert E.

2011-01-01

Fault tolerant systems require the ability to detect and recover from physical damage caused by the hardware s environment, faulty connectors, and system degradation over time. This ability applies to military, space, and industrial computing applications. The integrity of Point-to-Point (P2P) communication, between two microcontrollers for example, is an essential part of fault tolerant computing systems. In this paper, different methods of fault detection and recovery are presented and analyzed.
Adding Fault Tolerance to NPB Benchmarks Using ULFM

DOE Office of Scientific and Technical Information (OSTI.GOV)

Parchman, Zachary W; Vallee, Geoffroy R; Naughton III, Thomas J

2016-01-01

In the world of high-performance computing, fault tolerance and application resilience are becoming some of the primary concerns because of increasing hardware failures and memory corruptions. While the research community has been investigating various options, from system-level solutions to application-level solutions, standards such as the Message Passing Interface (MPI) are also starting to include such capabilities. The current proposal for MPI fault tolerant is centered around the User-Level Failure Mitigation (ULFM) concept, which provides means for fault detection and recovery of the MPI layer. This approach does not address application-level recovery, which is currently left to application developers. In thismore » work, we present a mod- ification of some of the benchmarks of the NAS parallel benchmark (NPB) to include support of the ULFM capabilities as well as application-level strategies and mechanisms for application-level failure recovery. As such, we present: (i) an application-level library to checkpoint and restore data, (ii) extensions of NPB benchmarks for fault tolerance based on different strategies, (iii) a fault injection tool, and (iv) some preliminary results that show the impact of such fault tolerant strategies on the application execution.« less
Fly-By-Light/Power-By-Wire Fault-Tolerant Fiber-Optic Backplane

NASA Technical Reports Server (NTRS)

Malekpour, Mahyar R.

2002-01-01

The design and development of a fault-tolerant fiber-optic backplane to demonstrate feasibility of such architecture is presented. The simulation results of test cases on the backplane in the advent of induced faults are presented, and the fault recovery capability of the architecture is demonstrated. The architecture was designed, developed, and implemented using the Very High Speed Integrated Circuits (VHSIC) Hardware Description Language (VHDL). The architecture was synthesized and implemented in hardware using Field Programmable Gate Arrays (FPGA) on multiple prototype boards.
Fault Tolerance for VLSI Multicomputers

DTIC Science & Technology

1985-08-01

that consists of hundreds or thousands of VLSI computation nodes interconnected by dedicated links. Some important applications of high-end computers...technology, and intended applications . A proposed fault tolerance scheme combines hardware that performs error detection and system-level protocols for...order to recover from the error and resume correct operation, a valid system state must be restored. A low-overhead, application -transparent error
Fault-Tolerant Computing: An Overview

DTIC Science & Technology

1991-06-01

Addison Wesley:, Reading, MA) 1984. [8] J. Wakerly , Error Detecting Codes, Self-Checking Circuits and Applications , (Elsevier North Holland, Inc.- New York... applicable to bit-sliced organi- zations of hardware. In the first time step, the normal computation is performed on the operands and the results...for error detection and fault tolerance in parallel processor systems while perform- ing specific computation-intensive applications [111. Contrary to
Distributed asynchronous microprocessor architectures in fault tolerant integrated flight systems

NASA Technical Reports Server (NTRS)

Dunn, W. R.

1983-01-01

The paper discusses the implementation of fault tolerant digital flight control and navigation systems for rotorcraft application. It is shown that in implementing fault tolerance at the systems level using advanced LSI/VLSI technology, aircraft physical layout and flight systems requirements tend to define a system architecture of distributed, asynchronous microprocessors in which fault tolerance can be achieved locally through hardware redundancy and/or globally through application of analytical redundancy. The effects of asynchronism on the execution of dynamic flight software is discussed. It is shown that if the asynchronous microprocessors have knowledge of time, these errors can be significantly reduced through appropiate modifications of the flight software. Finally, the papear extends previous work to show that through the combined use of time referencing and stable flight algorithms, individual microprocessors can be configured to autonomously tolerate intermittent faults.
Computer hardware fault administration

DOEpatents

Archer, Charles J.; Megerian, Mark G.; Ratterman, Joseph D.; Smith, Brian E.

2010-09-14

Computer hardware fault administration carried out in a parallel computer, where the parallel computer includes a plurality of compute nodes. The compute nodes are coupled for data communications by at least two independent data communications networks, where each data communications network includes data communications links connected to the compute nodes. Typical embodiments carry out hardware fault administration by identifying a location of a defective link in the first data communications network of the parallel computer and routing communications data around the defective link through the second data communications network of the parallel computer.
Application of Fault-Tolerant Computing For Spacecraft Using Commercial-Off-The-Shelf Microprocessors

DTIC Science & Technology

2000-06-01

real - time operating system and design of a human-computer interface (HCI) for a triple modular redundant (TMR) fault-tolerant microprocessor for use in space-based applications. Once disadvantage of using COTS hardware components is their susceptibility to the radiation effects present in the space environment. and specifically, radiation-induced single-event upsets (SEUs). In the event of an SEU, a fault-tolerant system can mitigate the effects of the upset and continue to process from the last known correct system state. The TMR basic hardware
Characterization of the faulted behavior of digital computers and fault tolerant systems

NASA Technical Reports Server (NTRS)

Bavuso, Salvatore J.; Miner, Paul S.

1989-01-01

A development status evaluation is presented for efforts conducted at NASA-Langley since 1977, toward the characterization of the latent fault in digital fault-tolerant systems. Attention is given to the practical, high speed, generalized gate-level logic system simulator developed, as well as to the validation methodology used for the simulator, on the basis of faultable software and hardware simulations employing a prototype MIL-STD-1750A processor. After validation, latency tests will be performed.
A fault tolerant 80960 engine controller

NASA Technical Reports Server (NTRS)

Reichmuth, D. M.; Gage, M. L.; Paterson, E. S.; Kramer, D. D.

1993-01-01

The paper describes the design of the 80960 Fault Tolerant Engine Controller for the supervision of engine operations, which was designed for the NASA Marshall Space Center. Consideration is given to the major electronic components of the controller, including the engine controller, effectors, and the sensors, as well as to the controller hardware, the controller module and the communications module, and the controller software. The architecture of the controller hardware allows modifications to be made to fit the requirements of any new propulsion systems. Multiple flow diagrams are presented illustrating the controller's operations.
Fault tolerance in computational grids: perspectives, challenges, and issues.

PubMed

Haider, Sajjad; Nazir, Babar

2016-01-01

Computational grids are established with the intention of providing shared access to hardware and software based resources with special reference to increased computational capabilities. Fault tolerance is one of the most important issues faced by the computational grids. The main contribution of this survey is the creation of an extended classification of problems that incur in the computational grid environments. The proposed classification will help researchers, developers, and maintainers of grids to understand the types of issues to be anticipated. Moreover, different types of problems, such as omission, interaction, and timing related have been identified that need to be handled on various layers of the computational grid. In this survey, an analysis and examination is also performed pertaining to the fault tolerance and fault detection mechanisms. Our conclusion is that a dependable and reliable grid can only be established when more emphasis is on fault identification. Moreover, our survey reveals that adaptive and intelligent fault identification, and tolerance techniques can improve the dependability of grid working environments.
A Log-Scaling Fault Tolerant Agreement Algorithm for a Fault Tolerant MPI

DOE Office of Scientific and Technical Information (OSTI.GOV)

Hursey, Joshua J; Naughton, III, Thomas J; Vallee, Geoffroy R

The lack of fault tolerance is becoming a limiting factor for application scalability in HPC systems. The MPI does not provide standardized fault tolerance interfaces and semantics. The MPI Forum's Fault Tolerance Working Group is proposing a collective fault tolerant agreement algorithm for the next MPI standard. Such algorithms play a central role in many fault tolerant applications. This paper combines a log-scaling two-phase commit agreement algorithm with a reduction operation to provide the necessary functionality for the new collective without any additional messages. Error handling mechanisms are described that preserve the fault tolerance properties while maintaining overall scalability.
A Voyager attitude control perspective on fault tolerant systems

NASA Technical Reports Server (NTRS)

Rasmussen, R. D.; Litty, E. C.

1981-01-01

In current spacecraft design, a trend can be observed to achieve greater fault tolerance through the application of on-board software dedicated to detecting and isolating failures. Whether fault tolerance through software can meet the desired objectives depends on very careful consideration and control of the system in which the software is imbedded. The considered investigation has the objective to provide some of the insight needed for the required analysis of the system. A description is given of the techniques which have been developed in this connection during the development of the Voyager spacecraft. The Voyager Galileo Attitude and Articulation Control Subsystem (AACS) fault tolerant design is discussed to emphasize basic lessons learned from this experience. The central driver of hardware redundancy implementation on Voyager was known as the 'single point failure criterion'.
[Advanced Development for Space Robotics With Emphasis on Fault Tolerance Technology

NASA Technical Reports Server (NTRS)

Tesar, Delbert

1997-01-01

This report describes work developing fault tolerant redundant robotic architectures and adaptive control strategies for robotic manipulator systems which can dynamically accommodate drastic robot manipulator mechanism, sensor or control failures and maintain stable end-point trajectory control with minimum disturbance. Kinematic designs of redundant, modular, reconfigurable arms for fault tolerance were pursued at a fundamental level. The approach developed robotic testbeds to evaluate disturbance responses of fault tolerant concepts in robotic mechanisms and controllers. The development was implemented in various fault tolerant mechanism testbeds including duality in the joint servo motor modules, parallel and serial structural architectures, and dual arms. All have real-time adaptive controller technologies to react to mechanism or controller disturbances (failures) to perform real-time reconfiguration to continue the task operations. The developments fall into three main areas: hardware, software, and theoretical.
SFT: Scalable Fault Tolerance

DOE Office of Scientific and Technical Information (OSTI.GOV)

Petrini, Fabrizio; Nieplocha, Jarek; Tipparaju, Vinod

2006-04-15

In this paper we will present a new technology that we are currently developing within the SFT: Scalable Fault Tolerance FastOS project which seeks to implement fault tolerance at the operating system level. Major design goals include dynamic reallocation of resources to allow continuing execution in the presence of hardware failures, very high scalability, high efficiency (low overhead), and transparency—requiring no changes to user applications. Our technology is based on a global coordination mechanism, that enforces transparent recovery lines in the system, and TICK, a lightweight, incremental checkpointing software architecture implemented as a Linux kernel module. TICK is completely user-transparentmore » and does not require any changes to user code or system libraries; it is highly responsive: an interrupt, such as a timer interrupt, can trigger a checkpoint in as little as 2.5μs; and it supports incremental and full checkpoints with minimal overhead—less than 6% with full checkpointing to disk performed as frequently as once per minute.« less
Fault-tolerant clock synchronization in distributed systems

NASA Technical Reports Server (NTRS)

Ramanathan, Parameswaran; Shin, Kang G.; Butler, Ricky W.

1990-01-01

Existing fault-tolerant clock synchronization algorithms are compared and contrasted. These include the following: software synchronization algorithms, such as convergence-averaging, convergence-nonaveraging, and consistency algorithms, as well as probabilistic synchronization; hardware synchronization algorithms; and hybrid synchronization. The worst-case clock skews guaranteed by representative algorithms are compared, along with other important aspects such as time, message, and cost overhead imposed by the algorithms. More recent developments such as hardware-assisted software synchronization and algorithms for synchronizing large, partially connected distributed systems are especially emphasized.
Reliability modeling of fault-tolerant computer based systems

NASA Technical Reports Server (NTRS)

Bavuso, Salvatore J.

1987-01-01

Digital fault-tolerant computer-based systems have become commonplace in military and commercial avionics. These systems hold the promise of increased availability, reliability, and maintainability over conventional analog-based systems through the application of replicated digital computers arranged in fault-tolerant configurations. Three tightly coupled factors of paramount importance, ultimately determining the viability of these systems, are reliability, safety, and profitability. Reliability, the major driver affects virtually every aspect of design, packaging, and field operations, and eventually produces profit for commercial applications or increased national security. However, the utilization of digital computer systems makes the task of producing credible reliability assessment a formidable one for the reliability engineer. The root of the problem lies in the digital computer's unique adaptability to changing requirements, computational power, and ability to test itself efficiently. Addressed here are the nuances of modeling the reliability of systems with large state sizes, in the Markov sense, which result from systems based on replicated redundant hardware and to discuss the modeling of factors which can reduce reliability without concomitant depletion of hardware. Advanced fault-handling models are described and methods of acquiring and measuring parameters for these models are delineated.
Development and evaluation of a fault-tolerant multiprocessor (FTMP) computer. Volume 1: FTMP principles of operation

NASA Technical Reports Server (NTRS)

Smith, T. B., Jr.; Lala, J. H.

1983-01-01

The basic organization of the fault tolerant multiprocessor, (FTMP) is that of a general purpose homogeneous multiprocessor. Three processors operate on a shared system (memory and I/O) bus. Replication and tight synchronization of all elements and hardware voting is employed to detect and correct any single fault. Reconfiguration is then employed to repair a fault. Multiple faults may be tolerated as a sequence of single faults with repair between fault occurrences.
An Integrated Fault Tolerant Robotic Controller System for High Reliability and Safety

NASA Technical Reports Server (NTRS)

Marzwell, Neville I.; Tso, Kam S.; Hecht, Myron

1994-01-01

This paper describes the concepts and features of a fault-tolerant intelligent robotic control system being developed for applications that require high dependability (reliability, availability, and safety). The system consists of two major elements: a fault-tolerant controller and an operator workstation. The fault-tolerant controller uses a strategy which allows for detection and recovery of hardware, operating system, and application software failures.The fault-tolerant controller can be used by itself in a wide variety of applications in industry, process control, and communications. The controller in combination with the operator workstation can be applied to robotic applications such as spaceborne extravehicular activities, hazardous materials handling, inspection and maintenance of high value items (e.g., space vehicles, reactor internals, or aircraft), medicine, and other tasks where a robot system failure poses a significant risk to life or property.
Final Project Report. Scalable fault tolerance runtime technology for petascale computers

DOE Office of Scientific and Technical Information (OSTI.GOV)

Krishnamoorthy, Sriram; Sadayappan, P

With the massive number of components comprising the forthcoming petascale computer systems, hardware failures will be routinely encountered during execution of large-scale applications. Due to the multidisciplinary, multiresolution, and multiscale nature of scientific problems that drive the demand for high end systems, applications place increasingly differing demands on the system resources: disk, network, memory, and CPU. In addition to MPI, future applications are expected to use advanced programming models such as those developed under the DARPA HPCS program as well as existing global address space programming models such as Global Arrays, UPC, and Co-Array Fortran. While there has been amore » considerable amount of work in fault tolerant MPI with a number of strategies and extensions for fault tolerance proposed, virtually none of advanced models proposed for emerging petascale systems is currently fault aware. To achieve fault tolerance, development of underlying runtime and OS technologies able to scale to petascale level is needed. This project has evaluated range of runtime techniques for fault tolerance for advanced programming models.« less

The X-38 Spacecraft Fault-Tolerant Avionics System

NASA Technical Reports Server (NTRS)

Kouba,Coy; Buscher, Deborah; Busa, Joseph

2003-01-01

In 1995 NASA began an experimental program to develop a reusable crew return vehicle (CRV) for the International Space Station. The purpose of the CRV was threefold: (i) to bring home an injured or ill crewmember; (ii) to bring home the entire crew if the Shuttle fleet was grounded; and (iii) to evacuate the crew in the case of an imminent Station threat (i.e., fire, decompression, etc). Built at the Johnson Space Center, were two approach and landing prototypes and one spacecraft demonstrator (called V201). A series of increasingly complex ground subsystem tests were completed, and eight successful high-altitude drop tests were achieved to prove the design concept. In this program, an unprecedented amount of commercial-off-the-shelf technology was utilized in this first crewed spacecraft NASA has built since the Shuttle program. Unfortunately, in 2002 the program was canceled due to changing Agency priorities. The vehicle was 80% complete and the program was shut down in such a manner as to preserve design, development, test and engineering data. This paper describes the X-38 V201 fault-tolerant avionics system. Based on Draper Laboratory's Byzantine-resilient fault-tolerant parallel processing system and their "network element" hardware, each flight computer exchanges information on a strict timescale to process input data, compare results, and issue voted vehicle output commands. Major accomplishments achieved in this development include: (i) a space qualified two-fault tolerant design using mostly COTS (hardware and operating system); (ii) a single event upset tolerant network element board, (iii) on-the-fly recovery of a failed processor; (iv) use of synched cache; (v) realignment of memory to bring back a failed channel; (vi) flight code automatically generated from the master measurement list; and (vii) built in-house by a team of civil servants and support contractors. This paper will present an overview of the avionics system and the hardware
Development and analysis of the Software Implemented Fault-Tolerance (SIFT) computer

NASA Technical Reports Server (NTRS)

Goldberg, J.; Kautz, W. H.; Melliar-Smith, P. M.; Green, M. W.; Levitt, K. N.; Schwartz, R. L.; Weinstock, C. B.

1984-01-01

SIFT (Software Implemented Fault Tolerance) is an experimental, fault-tolerant computer system designed to meet the extreme reliability requirements for safety-critical functions in advanced aircraft. Errors are masked by performing a majority voting operation over the results of identical computations, and faulty processors are removed from service by reassigning computations to the nonfaulty processors. This scheme has been implemented in a special architecture using a set of standard Bendix BDX930 processors, augmented by a special asynchronous-broadcast communication interface that provides direct, processor to processor communication among all processors. Fault isolation is accomplished in hardware; all other fault-tolerance functions, together with scheduling and synchronization are implemented exclusively by executive system software. The system reliability is predicted by a Markov model. Mathematical consistency of the system software with respect to the reliability model has been partially verified, using recently developed tools for machine-aided proof of program correctness.
Fault-tolerant software - Experiment with the sift operating system. [Software Implemented Fault Tolerance computer

NASA Technical Reports Server (NTRS)

Brunelle, J. E.; Eckhardt, D. E., Jr.

1985-01-01

Results are presented of an experiment conducted in the NASA Avionics Integrated Research Laboratory (AIRLAB) to investigate the implementation of fault-tolerant software techniques on fault-tolerant computer architectures, in particular the Software Implemented Fault Tolerance (SIFT) computer. The N-version programming and recovery block techniques were implemented on a portion of the SIFT operating system. The results indicate that, to effectively implement fault-tolerant software design techniques, system requirements will be impacted and suggest that retrofitting fault-tolerant software on existing designs will be inefficient and may require system modification.
Fault-tolerant processing system

NASA Technical Reports Server (NTRS)

Palumbo, Daniel L. (Inventor)

1996-01-01

A fault-tolerant, fiber optic interconnect, or backplane, which serves as a via for data transfer between modules. Fault tolerance algorithms are embedded in the backplane by dividing the backplane into a read bus and a write bus and placing a redundancy management unit (RMU) between the read bus and the write bus so that all data transmitted by the write bus is subjected to the fault tolerance algorithms before the data is passed for distribution to the read bus. The RMU provides both backplane control and fault tolerance.
Hardware Fault Simulator for Microprocessors

NASA Technical Reports Server (NTRS)

Hess, L. M.; Timoc, C. C.

1983-01-01

Breadboarded circuit is faster and more thorough than software simulator. Elementary fault simulator for AND gate uses three gates and shaft register to simulate stuck-at-one or stuck-at-zero conditions at inputs and output. Experimental results showed hardware fault simulator for microprocessor gave faster results than software simulator, by two orders of magnitude, with one test being applied every 4 microseconds.
FTAPE: A fault injection tool to measure fault tolerance

NASA Technical Reports Server (NTRS)

Tsai, Timothy K.; Iyer, Ravishankar K.

1995-01-01

The paper introduces FTAPE (Fault Tolerance And Performance Evaluator), a tool that can be used to compare fault-tolerant computers. The tool combines system-wide fault injection with a controllable workload. A workload generator is used to create high stress conditions for the machine. Faults are injected based on this workload activity in order to ensure a high level of fault propagation. The errors/fault ratio and performance degradation are presented as measures of fault tolerance.
Transparent Ada rendezvous in a fault tolerant distributed system

NASA Technical Reports Server (NTRS)

Racine, Roger

1986-01-01

There are many problems associated with distributing an Ada program over a loosely coupled communication network. Some of these problems involve the various aspects of the distributed rendezvous. The problems addressed involve supporting the delay statement in a selective call and supporting the else clause in a selective call. Most of these difficulties are compounded by the need for an efficient communication system. The difficulties are compounded even more by considering the possibility of hardware faults occurring while the program is running. With a hardware fault tolerant computer system, it is possible to design a distribution scheme and communication software which is efficient and allows Ada semantics to be preserved. An Ada design for the communications software of one such system will be presented, including a description of the services provided in the seven layers of an International Standards Organization (ISO) Open System Interconnect (OSI) model communications system. The system capabilities (hardware and software) that allow this communication system will also be described.
Physical fault tolerance of nanoelectronics.

PubMed

Szkopek, Thomas; Roychowdhury, Vwani P; Antoniadis, Dimitri A; Damoulakis, John N

2011-04-29

The error rate in complementary transistor circuits is suppressed exponentially in electron number, arising from an intrinsic physical implementation of fault-tolerant error correction. Contrariwise, explicit assembly of gates into the most efficient known fault-tolerant architecture is characterized by a subexponential suppression of error rate with electron number, and incurs significant overhead in wiring and complexity. We conclude that it is more efficient to prevent logical errors with physical fault tolerance than to correct logical errors with fault-tolerant architecture.
Using certification trails to achieve software fault tolerance

NASA Technical Reports Server (NTRS)

Sullivan, Gregory F.; Masson, Gerald M.

1993-01-01

A conceptually novel and powerful technique to achieve fault tolerance in hardware and software systems is introduced. When used for software fault tolerance, this new technique uses time and software redundancy and can be outlined as follows. In the initial phase, a program is run to solve a problem and store the result. In addition, this program leaves behind a trail of data called a certification trail. In the second phase, another program is run which solves the original problem again. This program, however, has access to the certification trail left by the first program. Because of the availability of the certification trail, the second phase can be performed by a less complex program and can execute more quickly. In the final phase, the two results are accepted as correct; otherwise an error is indicated. An essential aspect of this approach is that the second program must always generate either an error indication or a correct output even when the certification trail it receives from the first program is incorrect. The certification trail approach to fault tolerance was formalized and it was illustrated by applying it to the fundamental problem of finding a minimum spanning tree. Cases in which the second phase can be run concorrectly with the first and act as a monitor are discussed. The certification trail approach was compared to other approaches to fault tolerance. Because of space limitations we have omitted examples of our technique applied to the Huffman tree, and convex hull problems. These can be found in the full version of this paper.
The Design and Semi-Physical Simulation Test of Fault-Tolerant Controller for Aero Engine

NASA Astrophysics Data System (ADS)

Liu, Yuan; Zhang, Xin; Zhang, Tianhong

2017-11-01

A new fault-tolerant control method for aero engine is proposed, which can accurately diagnose the sensor fault by Kalman filter banks and reconstruct the signal by real-time on-board adaptive model combing with a simplified real-time model and an improved Kalman filter. In order to verify the feasibility of the method proposed, a semi-physical simulation experiment has been carried out. Besides the real I/O interfaces, controller hardware and the virtual plant model, semi-physical simulation system also contains real fuel system. Compared with the hardware-in-the-loop (HIL) simulation, semi-physical simulation system has a higher degree of confidence. In order to meet the needs of semi-physical simulation, a rapid prototyping controller with fault-tolerant control ability based on NI CompactRIO platform is designed and verified on the semi-physical simulation test platform. The result shows that the controller can realize the aero engine control safely and reliably with little influence on controller performance in the event of fault on sensor.
Verification of the FtCayuga fault-tolerant microprocessor system. Volume 1: A case study in theorem prover-based verification

NASA Technical Reports Server (NTRS)

Srivas, Mandayam; Bickford, Mark

1991-01-01

The design and formal verification of a hardware system for a task that is an important component of a fault tolerant computer architecture for flight control systems is presented. The hardware system implements an algorithm for obtaining interactive consistancy (byzantine agreement) among four microprocessors as a special instruction on the processors. The property verified insures that an execution of the special instruction by the processors correctly accomplishes interactive consistency, provided certain preconditions hold. An assumption is made that the processors execute synchronously. For verification, the authors used a computer aided design hardware design verification tool, Spectool, and the theorem prover, Clio. A major contribution of the work is the demonstration of a significant fault tolerant hardware design that is mechanically verified by a theorem prover.
Fault tree models for fault tolerant hypercube multiprocessors

NASA Technical Reports Server (NTRS)

Boyd, Mark A.; Tuazon, Jezus O.

1991-01-01

Three candidate fault tolerant hypercube architectures are modeled, their reliability analyses are compared, and the resulting implications of these methods of incorporating fault tolerance into hypercube multiprocessors are discussed. In the course of performing the reliability analyses, the use of HARP and fault trees in modeling sequence dependent system behaviors is demonstrated.
Machine-checked proofs of the design and implementation of a fault-tolerant circuit

NASA Technical Reports Server (NTRS)

Bevier, William R.; Young, William D.

1990-01-01

A formally verified implementation of the 'oral messages' algorithm of Pease, Shostak, and Lamport is described. An abstract implementation of the algorithm is verified to achieve interactive consistency in the presence of faults. This abstract characterization is then mapped down to a hardware level implementation which inherits the fault-tolerant characteristics of the abstract version. All steps in the proof were checked with the Boyer-Moore theorem prover. A significant results is the demonstration of a fault-tolerant device that is formally specified and whose implementation is proved correct with respect to this specification. A significant simplifying assumption is that the redundant processors behave synchronously. A mechanically checked proof that the oral messages algorithm is 'optimal' in the sense that no algorithm which achieves agreement via similar message passing can tolerate a larger proportion of faulty processor is also described.
Ultrareliable fault-tolerant control systems

NASA Technical Reports Server (NTRS)

Webster, L. D.; Slykhouse, R. A.; Booth, L. A., Jr.; Carson, T. M.; Davis, G. J.; Howard, J. C.

1984-01-01

It is demonstrated that fault-tolerant computer systems, such as on the Shuttles, based on redundant, independent operation are a viable alternative in fault tolerant system designs. The ultrareliable fault-tolerant control system (UFTCS) was developed and tested in laboratory simulations of an UH-1H helicopter. UFTCS includes asymptotically stable independent control elements in a parallel, cross-linked system environment. Static redundancy provides the fault tolerance. A polling is performed among the computers, with results allowing for time-delay channel variations with tight bounds. When compared with the laboratory and actual flight data for the helicopter, the probability of a fault was, for the first 10 hr of flight given a quintuple computer redundancy, found to be 1 in 290 billion. Two weeks of untended Space Station operations would experience a fault probability of 1 in 24 million. Techniques for avoiding channel divergence problems are identified.
Problems related to the integration of fault tolerant aircraft electronic systems

NASA Technical Reports Server (NTRS)

Bannister, J. A.; Adlakha, V.; Triyedi, K.; Alspaugh, T. A., Jr.

1982-01-01

Problems related to the design of the hardware for an integrated aircraft electronic system are considered. Taxonomies of concurrent systems are reviewed and a new taxonomy is proposed. An informal methodology intended to identify feasible regions of the taxonomic design space is described. Specific tools are recommended for use in the methodology. Based on the methodology, a preliminary strawman integrated fault tolerant aircraft electronic system is proposed. Next, problems related to the programming and control of inegrated aircraft electronic systems are discussed. Issues of system resource management, including the scheduling and allocation of real time periodic tasks in a multiprocessor environment, are treated in detail. The role of software design in integrated fault tolerant aircraft electronic systems is discussed. Conclusions and recommendations for further work are included.
Award ER25750: Coordinated Infrastructure for Fault Tolerance Systems Indiana University Final Report

DOE Office of Scientific and Technical Information (OSTI.GOV)

Lumsdaine, Andrew

2013-03-08

The main purpose of the Coordinated Infrastructure for Fault Tolerance in Systems initiative has been to conduct research with a goal of providing end-to-end fault tolerance on a systemwide basis for applications and other system software. While fault tolerance has been an integral part of most high-performance computing (HPC) system software developed over the past decade, it has been treated mostly as a collection of isolated stovepipes. Visibility and response to faults has typically been limited to the particular hardware and software subsystems in which they are initially observed. Little fault information is shared across subsystems, allowing little flexibility ormore » control on a system-wide basis, making it practically impossible to provide cohesive end-to-end fault tolerance in support of scientific applications. As an example, consider faults such as communication link failures that can be seen by a network library but are not directly visible to the job scheduler, or consider faults related to node failures that can be detected by system monitoring software but are not inherently visible to the resource manager. If information about such faults could be shared by the network libraries or monitoring software, then other system software, such as a resource manager or job scheduler, could ensure that failed nodes or failed network links were excluded from further job allocations and that further diagnosis could be performed. As a founding member and one of the lead developers of the Open MPI project, our efforts over the course of this project have been focused on making Open MPI more robust to failures by supporting various fault tolerance techniques, and using fault information exchange and coordination between MPI and the HPC system software stack from the application, numeric libraries, and programming language runtime to other common system components such as jobs schedulers, resource managers, and monitoring tools.« less
Advanced cloud fault tolerance system

NASA Astrophysics Data System (ADS)

Sumangali, K.; Benny, Niketa

2017-11-01

Cloud computing has become a prevalent on-demand service on the internet to store, manage and process data. A pitfall that accompanies cloud computing is the failures that can be encountered in the cloud. To overcome these failures, we require a fault tolerance mechanism to abstract faults from users. We have proposed a fault tolerant architecture, which is a combination of proactive and reactive fault tolerance. This architecture essentially increases the reliability and the availability of the cloud. In the future, we would like to compare evaluations of our proposed architecture with existing architectures and further improve it.
Hardware fault insertion and instrumentation system: Mechanization and validation

NASA Technical Reports Server (NTRS)

Benson, J. W.

1987-01-01

Automated test capability for extensive low-level hardware fault insertion testing is developed. The test capability is used to calibrate fault detection coverage and associated latency times as relevant to projecting overall system reliability. Described are modifications made to the NASA Ames Reconfigurable Flight Control System (RDFCS) Facility to fully automate the total test loop involving the Draper Laboratories' Fault Injector Unit. The automated capability provided included the application of sequences of simulated low-level hardware faults, the precise measurement of fault latency times, the identification of fault symptoms, and bulk storage of test case results. A PDP-11/60 served as a test coordinator, and a PDP-11/04 as an instrumentation device. The fault injector was controlled by applications test software in the PDP-11/60, rather than by manual commands from a terminal keyboard. The time base was especially developed for this application to use a variety of signal sources in the system simulator.
An approach to secure weather and climate models against hardware faults

NASA Astrophysics Data System (ADS)

Düben, Peter D.; Dawson, Andrew

2017-03-01

Enabling Earth System models to run efficiently on future supercomputers is a serious challenge for model development. Many publications study efficient parallelization to allow better scaling of performance on an increasing number of computing cores. However, one of the most alarming threats for weather and climate predictions on future high performance computing architectures is widely ignored: the presence of hardware faults that will frequently hit large applications as we approach exascale supercomputing. Changes in the structure of weather and climate models that would allow them to be resilient against hardware faults are hardly discussed in the model development community. In this paper, we present an approach to secure the dynamical core of weather and climate models against hardware faults using a backup system that stores coarse resolution copies of prognostic variables. Frequent checks of the model fields on the backup grid allow the detection of severe hardware faults, and prognostic variables that are changed by hardware faults on the model grid can be restored from the backup grid to continue model simulations with no significant delay. To justify the approach, we perform model simulations with a C-grid shallow water model in the presence of frequent hardware faults. As long as the backup system is used, simulations do not crash and a high level of model quality can be maintained. The overhead due to the backup system is reasonable and additional storage requirements are small. Runtime is increased by only 13 % for the shallow water model.
An approach to secure weather and climate models against hardware faults

NASA Astrophysics Data System (ADS)

Düben, Peter; Dawson, Andrew

2017-04-01

Enabling Earth System models to run efficiently on future supercomputers is a serious challenge for model development. Many publications study efficient parallelisation to allow better scaling of performance on an increasing number of computing cores. However, one of the most alarming threats for weather and climate predictions on future high performance computing architectures is widely ignored: the presence of hardware faults that will frequently hit large applications as we approach exascale supercomputing. Changes in the structure of weather and climate models that would allow them to be resilient against hardware faults are hardly discussed in the model development community. We present an approach to secure the dynamical core of weather and climate models against hardware faults using a backup system that stores coarse resolution copies of prognostic variables. Frequent checks of the model fields on the backup grid allow the detection of severe hardware faults, and prognostic variables that are changed by hardware faults on the model grid can be restored from the backup grid to continue model simulations with no significant delay. To justify the approach, we perform simulations with a C-grid shallow water model in the presence of frequent hardware faults. As long as the backup system is used, simulations do not crash and a high level of model quality can be maintained. The overhead due to the backup system is reasonable and additional storage requirements are small. Runtime is increased by only 13% for the shallow water model.

FPGA-Based, Self-Checking, Fault-Tolerant Computers

NASA Technical Reports Server (NTRS)

Some, Raphael; Rennels, David

2004-01-01

A proposed computer architecture would exploit the capabilities of commercially available field-programmable gate arrays (FPGAs) to enable computers to detect and recover from bit errors. The main purpose of the proposed architecture is to enable fault-tolerant computing in the presence of single-event upsets (SEUs). [An SEU is a spurious bit flip (also called a soft error) caused by a single impact of ionizing radiation.] The architecture would also enable recovery from some soft errors caused by electrical transients and, to some extent, from intermittent and permanent (hard) errors caused by aging of electronic components. A typical FPGA of the current generation contains one or more complete processor cores, memories, and highspeed serial input/output (I/O) channels, making it possible to shrink a board-level processor node to a single integrated-circuit chip. Custom, highly efficient microcontrollers, general-purpose computers, custom I/O processors, and signal processors can be rapidly and efficiently implemented by use of FPGAs. Unfortunately, FPGAs are susceptible to SEUs. Prior efforts to mitigate the effects of SEUs have yielded solutions that degrade performance of the system and require support from external hardware and software. In comparison with other fault-tolerant- computing architectures (e.g., triple modular redundancy), the proposed architecture could be implemented with less circuitry and lower power demand. Moreover, the fault-tolerant computing functions would require only minimal support from circuitry outside the central processing units (CPUs) of computers, would not require any software support, and would be largely transparent to software and to other computer hardware. There would be two types of modules: a self-checking processor module and a memory system (see figure). The self-checking processor module would be implemented on a single FPGA and would be capable of detecting its own internal errors. It would contain two CPUs executing
Relaxed fault-tolerant hardware implementation of neural networks in the presence of multiple transient errors.

PubMed

Mahdiani, Hamid Reza; Fakhraie, Sied Mehdi; Lucas, Caro

2012-08-01

Reliability should be identified as the most important challenge in future nano-scale very large scale integration (VLSI) implementation technologies for the development of complex integrated systems. Normally, fault tolerance (FT) in a conventional system is achieved by increasing its redundancy, which also implies higher implementation costs and lower performance that sometimes makes it even infeasible. In contrast to custom approaches, a new class of applications is categorized in this paper, which is inherently capable of absorbing some degrees of vulnerability and providing FT based on their natural properties. Neural networks are good indicators of imprecision-tolerant applications. We have also proposed a new class of FT techniques called relaxed fault-tolerant (RFT) techniques which are developed for VLSI implementation of imprecision-tolerant applications. The main advantage of RFT techniques with respect to traditional FT solutions is that they exploit inherent FT of different applications to reduce their implementation costs while improving their performance. To show the applicability as well as the efficiency of the RFT method, the experimental results for implementation of a face-recognition computationally intensive neural network and its corresponding RFT realization are presented in this paper. The results demonstrate promising higher performance of artificial neural network VLSI solutions for complex applications in faulty nano-scale implementation environments.
Fault tolerant software modules for SIFT

NASA Technical Reports Server (NTRS)

Hecht, M.; Hecht, H.

1982-01-01

The implementation of software fault tolerance is investigated for critical modules of the Software Implemented Fault Tolerance (SIFT) operating system to support the computational and reliability requirements of advanced fly by wire transport aircraft. Fault tolerant designs generated for the error reported and global executive are examined. A description of the alternate routines, implementation requirements, and software validation are included.
Coordinated Fault Tolerance for High-Performance Computing

DOE Office of Scientific and Technical Information (OSTI.GOV)

Dongarra, Jack; Bosilca, George; et al.

2013-04-08

Our work to meet our goal of end-to-end fault tolerance has focused on two areas: (1) improving fault tolerance in various software currently available and widely used throughout the HEC domain and (2) using fault information exchange and coordination to achieve holistic, systemwide fault tolerance and understanding how to design and implement interfaces for integrating fault tolerance features for multiple layers of the software stack—from the application, math libraries, and programming language runtime to other common system software such as jobs schedulers, resource managers, and monitoring tools.
Quantitative fault tolerant control design for a hydraulic actuator with a leaking piston seal

NASA Astrophysics Data System (ADS)

Karpenko, Mark

Hydraulic actuators are complex fluid power devices whose performance can be degraded in the presence of system faults. In this thesis a linear, fixed-gain, fault tolerant controller is designed that can maintain the positioning performance of an electrohydraulic actuator operating under load with a leaking piston seal and in the presence of parametric uncertainties. Developing a control system tolerant to this class of internal leakage fault is important since a leaking piston seal can be difficult to detect, unless the actuator is disassembled. The designed fault tolerant control law is of low-order, uses only the actuator position as feedback, and can: (i) accommodate nonlinearities in the hydraulic functions, (ii) maintain robustness against typical uncertainties in the hydraulic system parameters, and (iii) keep the positioning performance of the actuator within prescribed tolerances despite an internal leakage fault that can bypass up to 40% of the rated servovalve flow across the actuator piston. Experimental tests verify the functionality of the fault tolerant control under normal and faulty operating conditions. The fault tolerant controller is synthesized based on linear time-invariant equivalent (LTIE) models of the hydraulic actuator using the quantitative feedback theory (QFT) design technique. A numerical approach for identifying LTIE frequency response functions of hydraulic actuators from acceptable input-output responses is developed so that linearizing the hydraulic functions can be avoided. The proposed approach can properly identify the features of the hydraulic actuator frequency response that are important for control system design and requires no prior knowledge about the asymptotic behavior or structure of the LTIE transfer functions. A distributed hardware-in-the-loop (HIL) simulation architecture is constructed that enables the performance of the proposed fault tolerant control law to be further substantiated, under realistic operating
Fault-tolerant rotary actuator

DOEpatents

Tesar, Delbert

2006-10-17

A fault-tolerant actuator module, in a single containment shell, containing two actuator subsystems that are either asymmetrically or symmetrically laid out is provided. Fault tolerance in the actuators of the present invention is achieved by the employment of dual sets of equal resources. Dual resources are integrated into single modules, with each having the external appearance and functionality of a single set of resources.
RAMP: A fault tolerant distributed microcomputer structure for aircraft navigation and control

NASA Technical Reports Server (NTRS)

Dunn, W. R.

1980-01-01

RAMP consists of distributed sets of parallel computers partioned on the basis of software and packaging constraints. To minimize hardware and software complexity, the processors operate asynchronously. It was shown that through the design of asymptotically stable control laws, data errors due to the asynchronism were minimized. It was further shown that by designing control laws with this property and making minor hardware modifications to the RAMP modules, the system became inherently tolerant to intermittent faults. A laboratory version of RAMP was constructed and is described in the paper along with the experimental results.
Rule-based fault diagnosis of hall sensors and fault-tolerant control of PMSM

NASA Astrophysics Data System (ADS)

Song, Ziyou; Li, Jianqiu; Ouyang, Minggao; Gu, Jing; Feng, Xuning; Lu, Dongbin

2013-07-01

Hall sensor is widely used for estimating rotor phase of permanent magnet synchronous motor(PMSM). And rotor position is an essential parameter of PMSM control algorithm, hence it is very dangerous if Hall senor faults occur. But there is scarcely any research focusing on fault diagnosis and fault-tolerant control of Hall sensor used in PMSM. From this standpoint, the Hall sensor faults which may occur during the PMSM operating are theoretically analyzed. According to the analysis results, the fault diagnosis algorithm of Hall sensor, which is based on three rules, is proposed to classify the fault phenomena accurately. The rotor phase estimation algorithms, based on one or two Hall sensor(s), are initialized to engender the fault-tolerant control algorithm. The fault diagnosis algorithm can detect 60 Hall fault phenomena in total as well as all detections can be fulfilled in 1/138 rotor rotation period. The fault-tolerant control algorithm can achieve a smooth torque production which means the same control effect as normal control mode (with three Hall sensors). Finally, the PMSM bench test verifies the accuracy and rapidity of fault diagnosis and fault-tolerant control strategies. The fault diagnosis algorithm can detect all Hall sensor faults promptly and fault-tolerant control algorithm allows the PMSM to face failure conditions of one or two Hall sensor(s). In addition, the transitions between health-control and fault-tolerant control conditions are smooth without any additional noise and harshness. Proposed algorithms can deal with the Hall sensor faults of PMSM in real applications, and can be provided to realize the fault diagnosis and fault-tolerant control of PMSM.
Intelligent fault-tolerant controllers

NASA Technical Reports Server (NTRS)

Huang, Chien Y.

1987-01-01

A system with fault tolerant controls is one that can detect, isolate, and estimate failures and perform necessary control reconfiguration based on this new information. Artificial intelligence (AI) is concerned with semantic processing, and it has evolved to include the topics of expert systems and machine learning. This research represents an attempt to apply AI to fault tolerant controls, hence, the name intelligent fault tolerant control (IFTC). A generic solution to the problem is sought, providing a system based on logic in addition to analytical tools, and offering machine learning capabilities. The advantages are that redundant system specific algorithms are no longer needed, that reasonableness is used to quickly choose the correct control strategy, and that the system can adapt to new situations by learning about its effects on system dynamics.
Fault-Tolerant Heat Exchanger

NASA Technical Reports Server (NTRS)

Izenson, Michael G.; Crowley, Christopher J.

2005-01-01

A compact, lightweight heat exchanger has been designed to be fault-tolerant in the sense that a single-point leak would not cause mixing of heat-transfer fluids. This particular heat exchanger is intended to be part of the temperature-regulation system for habitable modules of the International Space Station and to function with water and ammonia as the heat-transfer fluids. The basic fault-tolerant design is adaptable to other heat-transfer fluids and heat exchangers for applications in which mixing of heat-transfer fluids would pose toxic, explosive, or other hazards: Examples could include fuel/air heat exchangers for thermal management on aircraft, process heat exchangers in the cryogenic industry, and heat exchangers used in chemical processing. The reason this heat exchanger can tolerate a single-point leak is that the heat-transfer fluids are everywhere separated by a vented volume and at least two seals. The combination of fault tolerance, compactness, and light weight is implemented in a unique heat-exchanger core configuration: Each fluid passage is entirely surrounded by a vented region bridged by solid structures through which heat is conducted between the fluids. Precise, proprietary fabrication techniques make it possible to manufacture the vented regions and heat-conducting structures with very small dimensions to obtain a very large coefficient of heat transfer between the two fluids. A large heat-transfer coefficient favors compact design by making it possible to use a relatively small core for a given heat-transfer rate. Calculations and experiments have shown that in most respects, the fault-tolerant heat exchanger can be expected to equal or exceed the performance of the non-fault-tolerant heat exchanger that it is intended to supplant (see table). The only significant disadvantages are a slight weight penalty and a small decrease in the mass-specific heat transfer.
Development and Evaluation of Fault-Tolerant Flight Control Systems

NASA Technical Reports Server (NTRS)

Song, Yong D.; Gupta, Kajal (Technical Monitor)

2004-01-01

The research is concerned with developing a new approach to enhancing fault tolerance of flight control systems. The original motivation for fault-tolerant control comes from the need for safe operation of control elements (e.g. actuators) in the event of hardware failures in high reliability systems. One such example is modem space vehicle subjected to actuator/sensor impairments. A major task in flight control is to revise the control policy to balance impairment detectability and to achieve sufficient robustness. This involves careful selection of types and parameters of the controllers and the impairment detecting filters used. It also involves a decision, upon the identification of some failures, on whether and how a control reconfiguration should take place in order to maintain a certain system performance level. In this project new flight dynamic model under uncertain flight conditions is considered, in which the effects of both ramp and jump faults are reflected. Stabilization algorithms based on neural network and adaptive method are derived. The control algorithms are shown to be effective in dealing with uncertain dynamics due to external disturbances and unpredictable faults. The overall strategy is easy to set up and the computation involved is much less as compared with other strategies. Computer simulation software is developed. A serious of simulation studies have been conducted with varying flight conditions.
A Fault-tolerant RISC Microprocessor for Spacecraft Applications

NASA Technical Reports Server (NTRS)

Timoc, Constantin; Benz, Harry

1990-01-01

Viewgraphs on a fault-tolerant RISC microprocessor for spacecraft applications are presented. Topics covered include: reduced instruction set computer; fault tolerant registers; fault tolerant ALU; and double rail CMOS logic.
A distributed fault-tolerant signal processor /FTSP/

NASA Astrophysics Data System (ADS)

Bonneau, R. J.; Evett, R. C.; Young, M. J.

1980-01-01

A digital fault-tolerant signal processor (FTSP), an example of a self-repairing programmable system is analyzed. The design configuration is discussed in terms of fault tolerance, system-level fault detection, isolation and common memory. Special attention is given to the FDIR (fault detection isolation and reconfiguration) logic, noting that the reconfiguration decisions are based on configuration, summary status, end-around tests, and north marker/synchro data. Several mechanisms of fault detection are described which initiate reconfiguration at different levels. It is concluded that the reliability of a signal processor can be significantly enhanced by the use of fault-tolerant techniques.
Development and evaluation of a Fault-Tolerant Multiprocessor (FTMP) computer. Volume 3: FTMP test and evaluation

NASA Technical Reports Server (NTRS)

Lala, J. H.; Smith, T. B., III

1983-01-01

The experimental test and evaluation of the Fault-Tolerant Multiprocessor (FTMP) is described. Major objectives of this exercise include expanding validation envelope, building confidence in the system, revealing any weaknesses in the architectural concepts and in their execution in hardware and software, and in general, stressing the hardware and software. To this end, pin-level faults were injected into one LRU of the FTMP and the FTMP response was measured in terms of fault detection, isolation, and recovery times. A total of 21,055 stuck-at-0, stuck-at-1 and invert-signal faults were injected in the CPU, memory, bus interface circuits, Bus Guardian Units, and voters and error latches. Of these, 17,418 were detected. At least 80 percent of undetected faults are estimated to be on unused pins. The multiprocessor identified all detected faults correctly and recovered successfully in each case. Total recovery time for all faults averaged a little over one second. This can be reduced to half a second by including appropriate self-tests.
An improved fault-tolerant control scheme for PWM inverter-fed induction motor-based EVs.

PubMed

Tabbache, Bekheïra; Benbouzid, Mohamed; Kheloui, Abdelaziz; Bourgeot, Jean-Matthieu; Mamoune, Abdeslam

2013-11-01

This paper proposes an improved fault-tolerant control scheme for PWM inverter-fed induction motor-based electric vehicles. The proposed strategy deals with power switch (IGBTs) failures mitigation within a reconfigurable induction motor control. To increase the vehicle powertrain reliability regarding IGBT open-circuit failures, 4-wire and 4-leg PWM inverter topologies are investigated and their performances discussed in a vehicle context. The proposed fault-tolerant topologies require only minimum hardware modifications to the conventional off-the-shelf six-switch three-phase drive, mitigating the IGBTs failures by specific inverter control. Indeed, the two topologies exploit the induction motor neutral accessibility for fault-tolerant purposes. The 4-wire topology uses then classical hysteresis controllers to account for the IGBT failures. The 4-leg topology, meanwhile, uses a specific 3D space vector PWM to handle vehicle requirements in terms of size (DC bus capacitors) and cost (IGBTs number). Experiments on an induction motor drive and simulations on an electric vehicle are carried-out using a European urban driving cycle to show that the proposed fault-tolerant control approach is effective and provides a simple configuration with high performance in terms of speed and torque responses. Copyright © 2013 ISA. Published by Elsevier Ltd. All rights reserved.
Agent Based Fault Tolerance for the Mobile Environment

NASA Astrophysics Data System (ADS)

Park, Taesoon

This paper presents a fault-tolerance scheme based on mobile agents for the reliable mobile computing systems. Mobility of the agent is suitable to trace the mobile hosts and the intelligence of the agent makes it efficient to support the fault tolerance services. This paper presents two approaches to implement the mobile agent based fault tolerant service and their performances are evaluated and compared with other fault-tolerant schemes.
Eigenstructure Assignment for Fault Tolerant Flight Control Design

NASA Technical Reports Server (NTRS)

Sobel, Kenneth; Joshi, Suresh (Technical Monitor)

2002-01-01

In recent years, fault tolerant flight control systems have gained an increased interest for high performance military aircraft as well as civil aircraft. Fault tolerant control systems can be described as either active or passive. An active fault tolerant control system has to either reconfigure or adapt the controller in response to a failure. One approach is to reconfigure the controller based upon detection and identification of the failure. Another approach is to use direct adaptive control to adjust the controller without explicitly identifying the failure. In contrast, a passive fault tolerant control system uses a fixed controller which achieves acceptable performance for a presumed set of failures. We have obtained a passive fault tolerant flight control law for the F/A-18 aircraft which achieves acceptable handling qualities for a class of control surface failures. The class of failures includes the symmetric failure of any one control surface being stuck at its trim value. A comparison was made of an eigenstructure assignment gain designed for the unfailed aircraft with a fault tolerant multiobjective optimization gain. We have shown that time responses for the unfailed aircraft using the eigenstructure assignment gain and the fault tolerant gain are identical. Furthermore, the fault tolerant gain achieves MIL-F-8785C specifications for all failure conditions.
Analysis of typical fault-tolerant architectures using HARP

NASA Technical Reports Server (NTRS)

Bavuso, Salvatore J.; Bechta Dugan, Joanne; Trivedi, Kishor S.; Rothmann, Elizabeth M.; Smith, W. Earl

1987-01-01

Difficulties encountered in the modeling of fault-tolerant systems are discussed. The Hybrid Automated Reliability Predictor (HARP) approach to modeling fault-tolerant systems is described. The HARP is written in FORTRAN, consists of nearly 30,000 lines of codes and comments, and is based on behavioral decomposition. Using the behavioral decomposition, the dependability model is divided into fault-occurrence/repair and fault/error-handling models; the characteristics and combining of these two models are examined. Examples in which the HARP is applied to the modeling of some typical fault-tolerant systems, including a local-area network, two fault-tolerant computer systems, and a flight control system, are presented.
Fault-tolerant dynamic task graph scheduling

DOE Office of Scientific and Technical Information (OSTI.GOV)

Kurt, Mehmet C.; Krishnamoorthy, Sriram; Agrawal, Kunal

2014-11-16

In this paper, we present an approach to fault tolerant execution of dynamic task graphs scheduled using work stealing. In particular, we focus on selective and localized recovery of tasks in the presence of soft faults. We elicit from the user the basic task graph structure in terms of successor and predecessor relationships. The work stealing-based algorithm to schedule such a task graph is augmented to enable recovery when the data and meta-data associated with a task get corrupted. We use this redundancy, and the knowledge of the task graph structure, to selectively recover from faults with low space andmore » time overheads. We show that the fault tolerant design retains the essential properties of the underlying work stealing-based task scheduling algorithm, and that the fault tolerant execution is asymptotically optimal when task re-execution is taken into account. Experimental evaluation demonstrates the low cost of recovery under various fault scenarios.« less
Minimalist fault-tolerance techniques for mitigating single-event effects in non-radiation-hardened microcontrollers

NASA Astrophysics Data System (ADS)

Caldwell, Douglas Wyche

Commercial microcontrollers--monolithic integrated circuits containing microprocessor, memory and various peripheral functions--such as are used in industrial, automotive and military applications, present spacecraft avionics system designers an appealing mix of higher performance and lower power together with faster system-development time and lower unit costs. However, these parts are not radiation-hardened for application in the space environment and Single-Event Effects (SEE) caused by high-energy, ionizing radiation present a significant challenge. Mitigating these effects with techniques which require minimal additional support logic, and thereby preserve the high functional density of these devices, can allow their benefits to be realized. This dissertation uses fault-tolerance to mitigate the transient errors and occasional latchups that non-hardened microcontrollers can experience in the space radiation environment. Space systems requirements and the historical use of fault-tolerant computers in spacecraft provide context. Space radiation and its effects in semiconductors define the fault environment. A reference architecture is presented which uses two or three microcontrollers with a combination of hardware and software voting techniques to mitigate SEE. A prototypical spacecraft function (an inertial measurement unit) is used to illustrate the techniques and to explore how real application requirements impact the fault-tolerance approach. Low-cost approaches which leverage features of existing commercial microcontrollers are analyzed. A high-speed serial bus is used for voting among redundant devices and a novel wire-OR output voting scheme exploits the bidirectional controls of I/O pins. A hardware testbed and prototype software were constructed to evaluate two- and three-processor configurations. Simulated Single-Event Upsets (SEUs) were injected at high rates and the response of the system monitored. The resulting statistics were used to evaluate

Fault detection and fault tolerance in robotics

NASA Technical Reports Server (NTRS)

Visinsky, Monica; Walker, Ian D.; Cavallaro, Joseph R.

1992-01-01

Robots are used in inaccessible or hazardous environments in order to alleviate some of the time, cost and risk involved in preparing men to endure these conditions. In order to perform their expected tasks, the robots are often quite complex, thus increasing their potential for failures. If men must be sent into these environments to repair each component failure in the robot, the advantages of using the robot are quickly lost. Fault tolerant robots are needed which can effectively cope with failures and continue their tasks until repairs can be realistically scheduled. Before fault tolerant capabilities can be created, methods of detecting and pinpointing failures must be perfected. This paper develops a basic fault tree analysis of a robot in order to obtain a better understanding of where failures can occur and how they contribute to other failures in the robot. The resulting failure flow chart can also be used to analyze the resiliency of the robot in the presence of specific faults. By simulating robot failures and fault detection schemes, the problems involved in detecting failures for robots are explored in more depth.
Multi-version software reliability through fault-avoidance and fault-tolerance

NASA Technical Reports Server (NTRS)

Vouk, Mladen A.; Mcallister, David F.

1989-01-01

A number of experimental and theoretical issues associated with the practical use of multi-version software to provide run-time tolerance to software faults were investigated. A specialized tool was developed and evaluated for measuring testing coverage for a variety of metrics. The tool was used to collect information on the relationships between software faults and coverage provided by the testing process as measured by different metrics (including data flow metrics). Considerable correlation was found between coverage provided by some higher metrics and the elimination of faults in the code. Back-to-back testing was continued as an efficient mechanism for removal of un-correlated faults, and common-cause faults of variable span. Software reliability estimation methods was also continued based on non-random sampling, and the relationship between software reliability and code coverage provided through testing. New fault tolerance models were formulated. Simulation studies of the Acceptance Voting and Multi-stage Voting algorithms were finished and it was found that these two schemes for software fault tolerance are superior in many respects to some commonly used schemes. Particularly encouraging are the safety properties of the Acceptance testing scheme.
Software-implemented fault insertion: An FTMP example

NASA Technical Reports Server (NTRS)

Czeck, Edward W.; Siewiorek, Daniel P.; Segall, Zary Z.

1987-01-01

This report presents a model for fault insertion through software; describes its implementation on a fault-tolerant computer, FTMP; presents a summary of fault detection, identification, and reconfiguration data collected with software-implemented fault insertion; and compares the results to hardware fault insertion data. Experimental results show detection time to be a function of time of insertion and system workload. For the fault detection time, there is no correlation between software-inserted faults and hardware-inserted faults; this is because hardware-inserted faults must manifest as errors before detection, whereas software-inserted faults immediately exercise the error detection mechanisms. In summary, the software-implemented fault insertion is able to be used as an evaluation technique for the fault-handling capabilities of a system in fault detection, identification and recovery. Although the software-inserted faults do not map directly to hardware-inserted faults, experiments show software-implemented fault insertion is capable of emulating hardware fault insertion, with greater ease and automation.
Fault-tolerant wait-free shared objects

NASA Technical Reports Server (NTRS)

Jayanti, Prasad; Chandra, Tushar D.; Toueg, Sam

1992-01-01

A concurrent system consists of processes communicating via shared objects, such as shared variables, queues, etc. The concept of wait-freedom was introduced to cope with process failures: each process that accesses a wait-free object is guaranteed to get a response even if all the other processes crash. However, if a wait-free object 'crashes,' all the processes that access that object are prevented from making progress. In this paper, we introduce the concept of fault-tolerant wait-free objects, and study the problem of implementing them. We give a universal method to construct fault-tolerant wait-free objects, for all types of 'responsive' failures (including one in which faulty objects may 'lie'). In sharp contrast, we prove that many common and interesting types (such as queues, sets, and test&set) have no fault-tolerant wait-free implementations even under the most benign of the 'non-responsive' types of failure. We also introduce several concepts and techniques that are central to the design of fault-tolerant concurrent systems: the concepts of self-implementation and graceful degradation, and techniques to automatically increase the fault-tolerance of implementations. We prove matching lower bounds on the resource complexity of most of our algorithms.
Fault tolerant linear actuator

DOEpatents

Tesar, Delbert

2004-09-14

In varying embodiments, the fault tolerant linear actuator of the present invention is a new and improved linear actuator with fault tolerance and positional control that may incorporate velocity summing, force summing, or a combination of the two. In one embodiment, the invention offers a velocity summing arrangement with a differential gear between two prime movers driving a cage, which then drives a linear spindle screw transmission. Other embodiments feature two prime movers driving separate linear spindle screw transmissions, one internal and one external, in a totally concentric and compact integrated module.
Fault Tolerant Homopolar Magnetic Bearings

NASA Technical Reports Server (NTRS)

Li, Ming-Hsiu; Palazzolo, Alan; Kenny, Andrew; Provenza, Andrew; Beach, Raymond; Kascak, Albert

2003-01-01

Magnetic suspensions (MS) satisfy the long life and low loss conditions demanded by satellite and ISS based flywheels used for Energy Storage and Attitude Control (ACESE) service. This paper summarizes the development of a novel MS that improves reliability via fault tolerant operation. Specifically, flux coupling between poles of a homopolar magnetic bearing is shown to deliver desired forces even after termination of coil currents to a subset of failed poles . Linear, coordinate decoupled force-voltage relations are also maintained before and after failure by bias linearization. Current distribution matrices (CDM) which adjust the currents and fluxes following a pole set failure are determined for many faulted pole combinations. The CDM s and the system responses are obtained utilizing 1D magnetic circuit models with fringe and leakage factors derived from detailed, 3D, finite element field models. Reliability results are presented vs. detection/correction delay time and individual power amplifier reliability for 4, 6, and 7 pole configurations. Reliability is shown for two success criteria, i.e. (a) no catcher bearing contact following pole failures and (b) re-levitation off of the catcher bearings following pole failures. An advantage of the method presented over other redundant operation approaches is a significantly reduced requirement for backup hardware such as additional actuators or power amplifiers.
Fault tolerant control laws

NASA Technical Reports Server (NTRS)

Ly, U. L.; Ho, J. K.

1986-01-01

A systematic procedure for the synthesis of fault tolerant control laws to actuator failure has been presented. Two design methods were used to synthesize fault tolerant controllers: the conventional LQ design method and a direct feedback controller design method SANDY. The latter method is used primarily to streamline the full-state Q feedback design into a practical implementable output feedback controller structure. To achieve robustness to control actuator failure, the redundant surfaces are properly balanced according to their control effectiveness. A simple gain schedule based on the landing gear up/down logic involving only three gains was developed to handle three design flight conditions: Mach .25 and Mach .60 at 5000 ft and Mach .90 at 20,000 ft. The fault tolerant control law developed in this study provides good stability augmentation and performance for the relaxed static stability aircraft. The augmented aircraft responses are found to be invariant to the presence of a failure. Furthermore, single-loop stability margins of +6 dB in gain and +30 deg in phase were achieved along with -40 dB/decade rolloff at high frequency.
What does fault tolerant Deep Learning need from MPI?

DOE Office of Scientific and Technical Information (OSTI.GOV)

Amatya, Vinay C.; Vishnu, Abhinav; Siegel, Charles M.

Deep Learning (DL) algorithms have become the {\\em de facto} Machine Learning (ML) algorithm for large scale data analysis. DL algorithms are computationally expensive -- even distributed DL implementations which use MPI require days of training (model learning) time on commonly studied datasets. Long running DL applications become susceptible to faults -- requiring development of a fault tolerant system infrastructure, in addition to fault tolerant DL algorithms. This raises an important question: {\\em What is needed from MPI for designing fault tolerant DL implementations?} In this paper, we address this problem for permanent faults. We motivate the need for amore » fault tolerant MPI specification by an in-depth consideration of recent innovations in DL algorithms and their properties, which drive the need for specific fault tolerance features. We present an in-depth discussion on the suitability of different parallelism types (model, data and hybrid); a need (or lack thereof) for check-pointing of any critical data structures; and most importantly, consideration for several fault tolerance proposals (user-level fault mitigation (ULFM), Reinit) in MPI and their applicability to fault tolerant DL implementations. We leverage a distributed memory implementation of Caffe, currently available under the Machine Learning Toolkit for Extreme Scale (MaTEx). We implement our approaches by extending MaTEx-Caffe for using ULFM-based implementation. Our evaluation using the ImageNet dataset and AlexNet neural network topology demonstrates the effectiveness of the proposed fault tolerant DL implementation using OpenMPI based ULFM.« less
Sequential behavior and its inherent tolerance to memory faults.

NASA Technical Reports Server (NTRS)

Meyer, J. F.

1972-01-01

Representation of a memory fault of a sequential machine M by a function mu on the states of M and the result of the fault by an appropriately determined machine M(mu). Given some sequential behavior B, its inherent tolerance to memory faults can then be measured in terms of the minimum memory redundancy required to realize B with a state-assigned machine having fault tolerance type tau and fault tolerance level t. A behavior having maximum inherent tolerance is exhibited, and it is shown that behaviors of the same size can have different inherent tolerance.
Software Fault Tolerance: A Tutorial

NASA Technical Reports Server (NTRS)

Torres-Pomales, Wilfredo

2000-01-01

Because of our present inability to produce error-free software, software fault tolerance is and will continue to be an important consideration in software systems. The root cause of software design errors is the complexity of the systems. Compounding the problems in building correct software is the difficulty in assessing the correctness of software for highly complex systems. After a brief overview of the software development processes, we note how hard-to-detect design faults are likely to be introduced during development and how software faults tend to be state-dependent and activated by particular input sequences. Although component reliability is an important quality measure for system level analysis, software reliability is hard to characterize and the use of post-verification reliability estimates remains a controversial issue. For some applications software safety is more important than reliability, and fault tolerance techniques used in those applications are aimed at preventing catastrophes. Single version software fault tolerance techniques discussed include system structuring and closure, atomic actions, inline fault detection, exception handling, and others. Multiversion techniques are based on the assumption that software built differently should fail differently and thus, if one of the redundant versions fails, it is expected that at least one of the other versions will provide an acceptable output. Recovery blocks, N-version programming, and other multiversion techniques are reviewed.
Predeployment validation of fault-tolerant systems through software-implemented fault insertion

NASA Technical Reports Server (NTRS)

Czeck, Edward W.; Siewiorek, Daniel P.; Segall, Zary Z.

1989-01-01

Fault injection-based automated testing (FIAT) environment, which can be used to experimentally characterize and evaluate distributed realtime systems under fault-free and faulted conditions is described. A survey is presented of validation methodologies. The need for fault insertion based on validation methodologies is demonstrated. The origins and models of faults, and motivation for the FIAT concept are reviewed. FIAT employs a validation methodology which builds confidence in the system through first providing a baseline of fault-free performance data and then characterizing the behavior of the system with faults present. Fault insertion is accomplished through software and allows faults or the manifestation of faults to be inserted by either seeding faults into memory or triggering error detection mechanisms. FIAT is capable of emulating a variety of fault-tolerant strategies and architectures, can monitor system activity, and can automatically orchestrate experiments involving insertion of faults. There is a common system interface which allows ease of use to decrease experiment development and run time. Fault models chosen for experiments on FIAT have generated system responses which parallel those observed in real systems under faulty conditions. These capabilities are shown by two example experiments each using a different fault-tolerance strategy.
ROBUS-2: A Fault-Tolerant Broadcast Communication System

NASA Technical Reports Server (NTRS)

Torres-Pomales, Wilfredo; Malekpour, Mahyar R.; Miner, Paul S.

2005-01-01

The Reliable Optical Bus (ROBUS) is the core communication system of the Scalable Processor-Independent Design for Enhanced Reliability (SPIDER), a general-purpose fault-tolerant integrated modular architecture currently under development at NASA Langley Research Center. The ROBUS is a time-division multiple access (TDMA) broadcast communication system with medium access control by means of time-indexed communication schedule. ROBUS-2 is a developmental version of the ROBUS providing guaranteed fault-tolerant services to the attached processing elements (PEs), in the presence of a bounded number of faults. These services include message broadcast (Byzantine Agreement), dynamic communication schedule update, clock synchronization, and distributed diagnosis (group membership). The ROBUS also features fault-tolerant startup and restart capabilities. ROBUS-2 is tolerant to internal as well as PE faults, and incorporates a dynamic self-reconfiguration capability driven by the internal diagnostic system. This version of the ROBUS is intended for laboratory experimentation and demonstrations of the capability to reintegrate failed nodes, dynamically update the communication schedule, and tolerate and recover from correlated transient faults.
Fault Injection Campaign for a Fault Tolerant Duplex Framework

NASA Technical Reports Server (NTRS)

Sacco, Gian Franco; Ferraro, Robert D.; von llmen, Paul; Rennels, Dave A.

2007-01-01

Fault tolerance is an efficient approach adopted to avoid or reduce the damage of a system failure. In this work we present the results of a fault injection campaign we conducted on the Duplex Framework (DF). The DF is a software developed by the UCLA group [1, 2] that uses a fault tolerant approach and allows to run two replicas of the same process on two different nodes of a commercial off-the-shelf (COTS) computer cluster. A third process running on a different node, constantly monitors the results computed by the two replicas, and eventually restarts the two replica processes if an inconsistency in their computation is detected. This approach is very cost efficient and can be adopted to control processes on spacecrafts where the fault rate produced by cosmic rays is not very high.
Catastrophic Fault Recovery with Self-Reconfigurable Chips

NASA Technical Reports Server (NTRS)

Zheng, Will Hua; Marzwell, Neville I.; Chau, Savio N.

2006-01-01

Mission critical systems typically employ multi-string redundancy to cope with possible hardware failure. Such systems are only as fault tolerant as there are many redundant strings. Once a particular critical component exhausts its redundant spares, the multi-string architecture cannot tolerate any further hardware failure. This paper aims at addressing such catastrophic faults through the use of 'Self-Reconfigurable Chips' as a last resort effort to 'repair' a faulty critical component.
Impact of coverage on the reliability of a fault tolerant computer

NASA Technical Reports Server (NTRS)

Bavuso, S. J.

1975-01-01

A mathematical reliability model is established for a reconfigurable fault tolerant avionic computer system utilizing state-of-the-art computers. System reliability is studied in light of the coverage probabilities associated with the first and second independent hardware failures. Coverage models are presented as a function of detection, isolation, and recovery probabilities. Upper and lower bonds are established for the coverage probabilities and the method for computing values for the coverage probabilities is investigated. Further, an architectural variation is proposed which is shown to enhance coverage.
Hardware Evolution of Control Electronics

NASA Technical Reports Server (NTRS)

Gwaltney, David; Steincamp, Jim; Corder, Eric; King, Ken; Ferguson, M. I.; Dutton, Ken

2003-01-01

The evolution of closed-loop motor speed controllers implemented on the JPL FPTA2 is presented. The response of evolved controller to sinusoidal commands, controller reconfiguration for fault tolerance,and hardware evolution are described.
Advanced information processing system: The Army Fault-Tolerant Architecture detailed design overview

NASA Technical Reports Server (NTRS)

Harper, Richard E.; Babikyan, Carol A.; Butler, Bryan P.; Clasen, Robert J.; Harris, Chris H.; Lala, Jaynarayan H.; Masotto, Thomas K.; Nagle, Gail A.; Prizant, Mark J.; Treadwell, Steven

1994-01-01

The Army Avionics Research and Development Activity (AVRADA) is pursuing programs that would enable effective and efficient management of large amounts of situational data that occurs during tactical rotorcraft missions. The Computer Aided Low Altitude Night Helicopter Flight Program has identified automated Terrain Following/Terrain Avoidance, Nap of the Earth (TF/TA, NOE) operation as key enabling technology for advanced tactical rotorcraft to enhance mission survivability and mission effectiveness. The processing of critical information at low altitudes with short reaction times is life-critical and mission-critical necessitating an ultra-reliable/high throughput computing platform for dependable service for flight control, fusion of sensor data, route planning, near-field/far-field navigation, and obstacle avoidance operations. To address these needs the Army Fault Tolerant Architecture (AFTA) is being designed and developed. This computer system is based upon the Fault Tolerant Parallel Processor (FTPP) developed by Charles Stark Draper Labs (CSDL). AFTA is hard real-time, Byzantine, fault-tolerant parallel processor which is programmed in the ADA language. This document describes the results of the Detailed Design (Phase 2 and 3 of a 3-year project) of the AFTA development. This document contains detailed descriptions of the program objectives, the TF/TA NOE application requirements, architecture, hardware design, operating systems design, systems performance measurements and analytical models.
Advanced information processing system: Fault injection study and results

NASA Technical Reports Server (NTRS)

Burkhardt, Laura F.; Masotto, Thomas K.; Lala, Jaynarayan H.

1992-01-01

The objective of the AIPS program is to achieve a validated fault tolerant distributed computer system. The goals of the AIPS fault injection study were: (1) to present the fault injection study components addressing the AIPS validation objective; (2) to obtain feedback for fault removal from the design implementation; (3) to obtain statistical data regarding fault detection, isolation, and reconfiguration responses; and (4) to obtain data regarding the effects of faults on system performance. The parameters are described that must be varied to create a comprehensive set of fault injection tests, the subset of test cases selected, the test case measurements, and the test case execution. Both pin level hardware faults using a hardware fault injector and software injected memory mutations were used to test the system. An overview is provided of the hardware fault injector and the associated software used to carry out the experiments. Detailed specifications are given of fault and test results for the I/O Network and the AIPS Fault Tolerant Processor, respectively. The results are summarized and conclusions are given.
Distributed Fault-Tolerant Control of Networked Uncertain Euler-Lagrange Systems Under Actuator Faults.

PubMed

Chen, Gang; Song, Yongduan; Lewis, Frank L

2016-05-03

This paper investigates the distributed fault-tolerant control problem of networked Euler-Lagrange systems with actuator and communication link faults. An adaptive fault-tolerant cooperative control scheme is proposed to achieve the coordinated tracking control of networked uncertain Lagrange systems on a general directed communication topology, which contains a spanning tree with the root node being the active target system. The proposed algorithm is capable of compensating for the actuator bias fault, the partial loss of effectiveness actuation fault, the communication link fault, the model uncertainty, and the external disturbance simultaneously. The control scheme does not use any fault detection and isolation mechanism to detect, separate, and identify the actuator faults online, which largely reduces the online computation and expedites the responsiveness of the controller. To validate the effectiveness of the proposed method, a test-bed of multiple robot-arm cooperative control system is developed for real-time verification. Experiments on the networked robot-arms are conduced and the results confirm the benefits and the effectiveness of the proposed distributed fault-tolerant control algorithms.
Simulated fault injection - A methodology to evaluate fault tolerant microprocessor architectures

NASA Technical Reports Server (NTRS)

Choi, Gwan S.; Iyer, Ravishankar K.; Carreno, Victor A.

1990-01-01

A simulation-based fault-injection method for validating fault-tolerant microprocessor architectures is described. The approach uses mixed-mode simulation (electrical/logic analysis), and injects transient errors in run-time to assess the resulting fault impact. As an example, a fault-tolerant architecture which models the digital aspects of a dual-channel real-time jet-engine controller is used. The level of effectiveness of the dual configuration with respect to single and multiple transients is measured. The results indicate 100 percent coverage of single transients. Approximately 12 percent of the multiple transients affect both channels; none result in controller failure since two additional levels of redundancy exist.

Software fault tolerance for real-time avionics systems

NASA Technical Reports Server (NTRS)

Anderson, T.; Knight, J. C.

1983-01-01

Avionics systems have very high reliability requirements and are therefore prime candidates for the inclusion of fault tolerance techniques. In order to provide tolerance to software faults, some form of state restoration is usually advocated as a means of recovery. State restoration can be very expensive for systems which utilize concurrent processes. The concurrency present in most avionics systems and the further difficulties introduced by timing constraints imply that providing tolerance for software faults may be inordinately expensive or complex. A straightforward pragmatic approach to software fault tolerance which is believed to be applicable to many real-time avionics systems is proposed. A classification system for software errors is presented together with approaches to recovery and continued service for each error type.
Investigation of an advanced fault tolerant integrated avionics system

NASA Technical Reports Server (NTRS)

Dunn, W. R.; Cottrell, D.; Flanders, J.; Javornik, A.; Rusovick, M.

1986-01-01

Presented is an advanced, fault-tolerant multiprocessor avionics architecture as could be employed in an advanced rotorcraft such as LHX. The processor structure is designed to interface with existing digital avionics systems and concepts including the Army Digital Avionics System (ADAS) cockpit/display system, navaid and communications suites, integrated sensing suite, and the Advanced Digital Optical Control System (ADOCS). The report defines mission, maintenance and safety-of-flight reliability goals as might be expected for an operational LHX aircraft. Based on use of a modular, compact (16-bit) microprocessor card family, results of a preliminary study examining simplex, dual and standby-sparing architectures is presented. Given the stated constraints, it is shown that the dual architecture is best suited to meet reliability goals with minimum hardware and software overhead. The report presents hardware and software design considerations for realizing the architecture including redundancy management requirements and techniques as well as verification and validation needs and methods.
Fault Tolerant Frequent Pattern Mining

DOE Office of Scientific and Technical Information (OSTI.GOV)

Shohdy, Sameh; Vishnu, Abhinav; Agrawal, Gagan

FP-Growth algorithm is a Frequent Pattern Mining (FPM) algorithm that has been extensively used to study correlations and patterns in large scale datasets. While several researchers have designed distributed memory FP-Growth algorithms, it is pivotal to consider fault tolerant FP-Growth, which can address the increasing fault rates in large scale systems. In this work, we propose a novel parallel, algorithm-level fault-tolerant FP-Growth algorithm. We leverage algorithmic properties and MPI advanced features to guarantee an O(1) space complexity, achieved by using the dataset memory space itself for checkpointing. We also propose a recovery algorithm that can use in-memory and disk-based checkpointing,more » though in many cases the recovery can be completed without any disk access, and incurring no memory overhead for checkpointing. We evaluate our FT algorithm on a large scale InfiniBand cluster with several large datasets using up to 2K cores. Our evaluation demonstrates excellent efficiency for checkpointing and recovery in comparison to the disk-based approach. We have also observed 20x average speed-up in comparison to Spark, establishing that a well designed algorithm can easily outperform a solution based on a general fault-tolerant programming model.« less
Study on fault-tolerant processors for advanced launch system

NASA Technical Reports Server (NTRS)

Shin, Kang G.; Liu, Jyh-Charn

1990-01-01

Issues related to the reliability of a redundant system with large main memory are addressed. The Fault-Tolerant Processor (FTP) for the Advanced Launch System (ALS) is used as a basis for the presentation. When the system is free of latent faults, the probability of system crash due to multiple channel faults is shown to be insignificant even when voting on the outputs of computing channels is infrequent. Using channel error maskers (CEMs) is shown to improve reliability more effectively than increasing redundancy or the number of channels for applications with long mission times. Even without using a voter, most memory errors can be immediately corrected by those CEMs implemented with conventional coding techniques. In addition to their ability to enhance system reliability, CEMs (with a very low hardware overhead) can be used to dramatically reduce not only the need of memory realignment, but also the time required to realign channel memories in case, albeit rare, such a need arises. Using CEMs, two different schemes were developed to solve the memory realignment problem. In both schemes, most errors are corrected by CEMs, and the remaining errors are masked by a voter.
Fault-tolerant locomotion of the hexapod robot.

PubMed

Yang, J M; Kim, J H

1998-01-01

In this paper, we propose a scheme for fault detection and tolerance of the hexapod robot locomotion on even terrain. The fault stability margin is defined to represent potential stability which a gait can have in case a sudden fault event occurs to one leg. Based on this, the fault-tolerant quadruped periodic gaits of the hexapod walking over perfectly even terrain are derived. It is demonstrated that the derived quadruped gait is the optimal one the hexapod can have maintaining fault stability margin nonnegative and a geometric condition should be satisfied for the optimal locomotion. By this scheme, when one leg is in failure, the hexapod robot has the modified tripod gait to continue the optimal locomotion.
Parallel and distributed computation for fault-tolerant object recognition

NASA Technical Reports Server (NTRS)

Wechsler, Harry

1988-01-01

The distributed associative memory (DAM) model is suggested for distributed and fault-tolerant computation as it relates to object recognition tasks. The fault-tolerance is with respect to geometrical distortions (scale and rotation), noisy inputs, occulsion/overlap, and memory faults. An experimental system was developed for fault-tolerant structure recognition which shows the feasibility of such an approach. The approach is futher extended to the problem of multisensory data integration and applied successfully to the recognition of colored polyhedral objects.
Fault-tolerant building-block computer study

NASA Technical Reports Server (NTRS)

Rennels, D. A.

1978-01-01

Ultra-reliable core computers are required for improving the reliability of complex military systems. Such computers can provide reliable fault diagnosis, failure circumvention, and, in some cases serve as an automated repairman for their host systems. A small set of building-block circuits which can be implemented as single very large integration devices, and which can be used with off-the-shelf microprocessors and memories to build self checking computer modules (SCCM) is described. Each SCCM is a microcomputer which is capable of detecting its own faults during normal operation and is described to communicate with other identical modules over one or more Mil Standard 1553A buses. Several SCCMs can be connected into a network with backup spares to provide fault-tolerant operation, i.e. automated recovery from faults. Alternative fault-tolerant SCCM configurations are discussed along with the cost and reliability associated with their implementation.
On the design of fault-tolerant robotic manipulator systems

NASA Technical Reports Server (NTRS)

Tesar, Delbert

1993-01-01

Robotic systems are finding increasing use in space applications. Many of these devices are going to be operational on board the Space Station Freedom. Fault tolerance has been deemed necessary because of the criticality of the tasks and the inaccessibility of the systems to maintenance and repair. Design for fault tolerance in manipulator systems is an area within robotics that is without precedence in the literature. In this paper, we will attempt to lay down the foundations for such a technology. Design for fault tolerance demands new and special approaches to design, often at considerable variance from established design practices. These design aspects, together with reliability evaluation and modeling tools, are presented. Mechanical architectures that employ protective redundancies at many levels and have a modular architecture are then studied in detail. Once a mechanical architecture for fault tolerance has been derived, the chronological stages of operational fault tolerance are investigated. Failure detection, isolation, and estimation methods are surveyed, and such methods for robot sensors and actuators are derived. Failure recovery methods are also presented for each of the protective layers of redundancy. Failure recovery tactics often span all of the layers of a control hierarchy. Thus, a unified framework for decision-making and control, which orchestrates both the nominal redundancy management tasks and the failure management tasks, has been derived. The well-developed field of fault-tolerant computers is studied next, and some design principles relevant to the design of fault-tolerant robot controllers are abstracted. Conclusions are drawn, and a road map for the design of fault-tolerant manipulator systems is laid out with recommendations for a 10 DOF arm with dual actuators at each joint.
Reliability of Fault Tolerant Control Systems. Part 1

NASA Technical Reports Server (NTRS)

Wu, N. Eva

2001-01-01

This paper reports Part I of a two part effort, that is intended to delineate the relationship between reliability and fault tolerant control in a quantitative manner. Reliability analysis of fault-tolerant control systems is performed using Markov models. Reliability properties, peculiar to fault-tolerant control systems are emphasized. As a consequence, coverage of failures through redundancy management can be severely limited. It is shown that in the early life of a syi1ein composed of highly reliable subsystems, the reliability of the overall system is affine with respect to coverage, and inadequate coverage induces dominant single point failures. The utility of some existing software tools for assessing the reliability of fault tolerant control systems is also discussed. Coverage modeling is attempted in Part II in a way that captures its dependence on the control performance and on the diagnostic resolution.
Experiments in fault tolerant software reliability

NASA Technical Reports Server (NTRS)

Mcallister, David F.; Tai, K. C.; Vouk, Mladen A.

1987-01-01

The reliability of voting was evaluated in a fault-tolerant software system for small output spaces. The effectiveness of the back-to-back testing process was investigated. Version 3.0 of the RSDIMU-ATS, a semi-automated test bed for certification testing of RSDIMU software, was prepared and distributed. Software reliability estimation methods based on non-random sampling are being studied. The investigation of existing fault-tolerance models was continued and formulation of new models was initiated.
Verification of the FtCayuga fault-tolerant microprocessor system. Volume 2: Formal specification and correctness theorems

NASA Technical Reports Server (NTRS)

Bickford, Mark; Srivas, Mandayam

1991-01-01

Presented here is a formal specification and verification of a property of a quadruplicately redundant fault tolerant microprocessor system design. A complete listing of the formal specification of the system and the correctness theorems that are proved are given. The system performs the task of obtaining interactive consistency among the processors using a special instruction on the processors. The design is based on an algorithm proposed by Pease, Shostak, and Lamport. The property verified insures that an execution of the special instruction by the processors correctly accomplishes interactive consistency, providing certain preconditions hold, using a computer aided design verification tool, Spectool, and the theorem prover, Clio. A major contribution of the work is the demonstration of a significant fault tolerant hardware design that is mechanically verified by a theorem prover.
Analysis of fault-tolerant neurocontrol architectures

NASA Technical Reports Server (NTRS)

Troudet, T.; Merrill, W.

1992-01-01

The fault-tolerance of analog parallel distributed implementations of a multivariable aircraft neurocontroller is analyzed by simulating weight and neuron failures in a simplified scheme of analog processing based on the functional architecture of the ETANN chip (Electrically Trainable Artificial Neural Network). The neural information processing is found to be only partially distributed throughout the set of weights of the neurocontroller synthesized with the backpropagation algorithm. Although the degree of distribution of the neural processing, and consequently the fault-tolerance of the neurocontroller, could be enhanced using Locally Distributed Weight and Neuron Approaches, a satisfactory level of fault-tolerance could only be obtained by retraining the degrated VLSI neurocontroller. The possibility of maintaining neurocontrol performance and stability in the presence of single weight of neuron failures was demonstrated through an automated retraining procedure of the neurocontroller based on a pre-programmed choice and sequence of the training parameters.
Privacy-Assured Aggregation Protocol for Smart Metering: A Proactive Fault-Tolerant Approach [Proactive Fault-Tolerant Aggregation Protocol for Privacy-Assured Smart Metering

DOE PAGES

Won, Jongho; Ma, Chris Y. T.; Yau, David K. Y.; ...

2016-06-01

Smart meters are integral to demand response in emerging smart grids, by reporting the electricity consumption of users to serve application needs. But reporting real-time usage information for individual households raises privacy concerns. Existing techniques to guarantee differential privacy (DP) of smart meter users either are not fault tolerant or achieve (possibly partial) fault tolerance at high communication overheads. In this paper, we propose a fault-tolerant protocol for smart metering that can handle general communication failures while ensuring DP with significantly improved efficiency and lower errors compared with the state of the art. Our protocol handles fail-stop faults proactively bymore » using a novel design of future ciphertexts, and distributes trust among the smart meters by sharing secret keys among them. We prove the DP properties of our protocol and analyze its advantages in fault tolerance, accuracy, and communication efficiency relative to competing techniques. We illustrate our analysis by simulations driven by real-world traces of electricity consumption.« less
Privacy-Assured Aggregation Protocol for Smart Metering: A Proactive Fault-Tolerant Approach [Proactive Fault-Tolerant Aggregation Protocol for Privacy-Assured Smart Metering

DOE Office of Scientific and Technical Information (OSTI.GOV)

Won, Jongho; Ma, Chris Y. T.; Yau, David K. Y.

Smart meters are integral to demand response in emerging smart grids, by reporting the electricity consumption of users to serve application needs. But reporting real-time usage information for individual households raises privacy concerns. Existing techniques to guarantee differential privacy (DP) of smart meter users either are not fault tolerant or achieve (possibly partial) fault tolerance at high communication overheads. In this paper, we propose a fault-tolerant protocol for smart metering that can handle general communication failures while ensuring DP with significantly improved efficiency and lower errors compared with the state of the art. Our protocol handles fail-stop faults proactively bymore » using a novel design of future ciphertexts, and distributes trust among the smart meters by sharing secret keys among them. We prove the DP properties of our protocol and analyze its advantages in fault tolerance, accuracy, and communication efficiency relative to competing techniques. We illustrate our analysis by simulations driven by real-world traces of electricity consumption.« less
A second generation experiment in fault-tolerant software

NASA Technical Reports Server (NTRS)

Knight, J. C.

1986-01-01

The primary goal was to determine whether the application of fault tolerance to software increases its reliability if the cost of production is the same as for an equivalent nonfault tolerance version derived from the same requirements specification. Software development protocols are discussed. The feasibility of adapting to software design fault tolerance the technique of N-fold Modular Redundancy with majority voting was studied.
Fault Tolerance Middleware for a Multi-Core System

NASA Technical Reports Server (NTRS)

Some, Raphael R.; Springer, Paul L.; Zima, Hans P.; James, Mark; Wagner, David A.

2012-01-01

Fault Tolerance Middleware (FTM) provides a framework to run on a dedicated core of a multi-core system and handles detection of single-event upsets (SEUs), and the responses to those SEUs, occurring in an application running on multiple cores of the processor. This software was written expressly for a multi-core system and can support different kinds of fault strategies, such as introspection, algorithm-based fault tolerance (ABFT), and triple modular redundancy (TMR). It focuses on providing fault tolerance for the application code, and represents the first step in a plan to eventually include fault tolerance in message passing and the FTM itself. In the multi-core system, the FTM resides on a single, dedicated core, separate from the cores used by the application. This is done in order to isolate the FTM from application faults and to allow it to swap out any application core for a substitute. The structure of the FTM consists of an interface to a fault tolerant strategy module, a responder module, a fault manager module, an error factory, and an error mapper that determines the severity of the error. In the present reference implementation, the only fault tolerant strategy implemented is introspection. The introspection code waits for an application node to send an error notification to it. It then uses the error factory to create an error object, and at this time, a severity level is assigned to the error. The introspection code uses its built-in knowledge base to generate a recommended response to the error. Responses might include ignoring the error, logging it, rolling back the application to a previously saved checkpoint, swapping in a new node to replace a bad one, or restarting the application. The original error and recommended response are passed to the top-level fault manager module, which invokes the response. The responder module also notifies the introspection module of the generated response. This provides additional information to the
Fault tolerant architectures for integrated aircraft electronics systems, task 2

NASA Technical Reports Server (NTRS)

Levitt, K. N.; Melliar-Smith, P. M.; Schwartz, R. L.

1984-01-01

The architectural basis for an advanced fault tolerant on-board computer to succeed the current generation of fault tolerant computers is examined. The network error tolerant system architecture is studied with particular attention to intercluster configurations and communication protocols, and to refined reliability estimates. The diagnosis of faults, so that appropriate choices for reconfiguration can be made is discussed. The analysis relates particularly to the recognition of transient faults in a system with tasks at many levels of priority. The demand driven data-flow architecture, which appears to have possible application in fault tolerant systems is described and work investigating the feasibility of automatic generation of aircraft flight control programs from abstract specifications is reported.
Adaptive Control Allocation for Fault Tolerant Overactuated Autonomous Vehicles

DTIC Science & Technology

2007-11-01

Tolerant Overactuated Autonomous Vehicles Casavola, A.; Garone, E. (2007) Adaptive Control Allocation for Fault Tolerant Overactuated Autonomous ...Adaptive Control Allocation for Fault Tolerant Overactuated Autonomous Vehicles 5a. CONTRACT NUMBER 5b. GRANT NUMBER 5c. PROGRAM ELEMENT NUMBER 6...Tolerant Overactuated Autonomous Vehicles 3.2 - 2 RTO-MP-AVT-145 UNCLASSIFIED/UNLIMITED Control allocation problem (CAP) - Given a virtual input v(t
Method and system for environmentally adaptive fault tolerant computing

NASA Technical Reports Server (NTRS)

Copenhaver, Jason L. (Inventor); Jeremy, Ramos (Inventor); Wolfe, Jeffrey M. (Inventor); Brenner, Dean (Inventor)

2010-01-01

A method and system for adapting fault tolerant computing. The method includes the steps of measuring an environmental condition representative of an environment. An on-board processing system's sensitivity to the measured environmental condition is measured. It is determined whether to reconfigure a fault tolerance of the on-board processing system based in part on the measured environmental condition. The fault tolerance of the on-board processing system may be reconfigured based in part on the measured environmental condition.
SABRE: a bio-inspired fault-tolerant electronic architecture.

PubMed

Bremner, P; Liu, Y; Samie, M; Dragffy, G; Pipe, A G; Tempesti, G; Timmis, J; Tyrrell, A M

2013-03-01

As electronic devices become increasingly complex, ensuring their reliable, fault-free operation is becoming correspondingly more challenging. It can be observed that, in spite of their complexity, biological systems are highly reliable and fault tolerant. Hence, we are motivated to take inspiration for biological systems in the design of electronic ones. In SABRE (self-healing cellular architectures for biologically inspired highly reliable electronic systems), we have designed a bio-inspired fault-tolerant hierarchical architecture for this purpose. As in biology, the foundation for the whole system is cellular in nature, with each cell able to detect faults in its operation and trigger intra-cellular or extra-cellular repair as required. At the next level in the hierarchy, arrays of cells are configured and controlled as function units in a transport triggered architecture (TTA), which is able to perform partial-dynamic reconfiguration to rectify problems that cannot be solved at the cellular level. Each TTA is, in turn, part of a larger multi-processor system which employs coarser grain reconfiguration to tolerate faults that cause a processor to fail. In this paper, we describe the details of operation of each layer of the SABRE hierarchy, and how these layers interact to provide a high systemic level of fault tolerance.

Algorithm-Based Fault Tolerance Integrated with Replication

NASA Technical Reports Server (NTRS)

Some, Raphael; Rennels, David

2008-01-01

In a proposed approach to programming and utilization of commercial off-the-shelf computing equipment, a combination of algorithm-based fault tolerance (ABFT) and replication would be utilized to obtain high degrees of fault tolerance without incurring excessive costs. The basic idea of the proposed approach is to integrate ABFT with replication such that the algorithmic portions of computations would be protected by ABFT, and the logical portions by replication. ABFT is an extremely efficient, inexpensive, high-coverage technique for detecting and mitigating faults in computer systems used for algorithmic computations, but does not protect against errors in logical operations surrounding algorithms.
Abstractions for Fault-Tolerant Distributed System Verification

NASA Technical Reports Server (NTRS)

Pike, Lee S.; Maddalon, Jeffrey M.; Miner, Paul S.; Geser, Alfons

2004-01-01

Four kinds of abstraction for the design and analysis of fault tolerant distributed systems are discussed. These abstractions concern system messages, faults, fault masking voting, and communication. The abstractions are formalized in higher order logic, and are intended to facilitate specifying and verifying such systems in higher order theorem provers.
Fault-tolerant communication channel structures

NASA Technical Reports Server (NTRS)

Tai, Ann T. (Inventor); Alkalai, Leon (Inventor); Chau, Savio N. (Inventor)

2006-01-01

Systems and techniques for implementing fault-tolerant communication channels and features in communication systems. Selected commercial-off-the-shelf devices can be integrated in such systems to reduce the cost.
Assessing the Progress of Trapped-Ion Processors Towards Fault-Tolerant Quantum Computation

NASA Astrophysics Data System (ADS)

Bermudez, A.; Xu, X.; Nigmatullin, R.; O'Gorman, J.; Negnevitsky, V.; Schindler, P.; Monz, T.; Poschinger, U. G.; Hempel, C.; Home, J.; Schmidt-Kaler, F.; Biercuk, M.; Blatt, R.; Benjamin, S.; Müller, M.

2017-10-01

A quantitative assessment of the progress of small prototype quantum processors towards fault-tolerant quantum computation is a problem of current interest in experimental and theoretical quantum information science. We introduce a necessary and fair criterion for quantum error correction (QEC), which must be achieved in the development of these quantum processors before their sizes are sufficiently big to consider the well-known QEC threshold. We apply this criterion to benchmark the ongoing effort in implementing QEC with topological color codes using trapped-ion quantum processors and, more importantly, to guide the future hardware developments that will be required in order to demonstrate beneficial QEC with small topological quantum codes. In doing so, we present a thorough description of a realistic trapped-ion toolbox for QEC and a physically motivated error model that goes beyond standard simplifications in the QEC literature. We focus on laser-based quantum gates realized in two-species trapped-ion crystals in high-optical aperture segmented traps. Our large-scale numerical analysis shows that, with the foreseen technological improvements described here, this platform is a very promising candidate for fault-tolerant quantum computation.
A Unified Fault-Tolerance Protocol

NASA Technical Reports Server (NTRS)

Miner, Paul; Gedser, Alfons; Pike, Lee; Maddalon, Jeffrey

2004-01-01

Davies and Wakerly show that Byzantine fault tolerance can be achieved by a cascade of broadcasts and middle value select functions. We present an extension of the Davies and Wakerly protocol, the unified protocol, and its proof of correctness. We prove that it satisfies validity and agreement properties for communication of exact values. We then introduce bounded communication error into the model. Inexact communication is inherent for clock synchronization protocols. We prove that validity and agreement properties hold for inexact communication, and that exact communication is a special case. As a running example, we illustrate the unified protocol using the SPIDER family of fault-tolerant architectures. In particular we demonstrate that the SPIDER interactive consistency, distributed diagnosis, and clock synchronization protocols are instances of the unified protocol.
Multiple Embedded Processors for Fault-Tolerant Computing

NASA Technical Reports Server (NTRS)

Bolotin, Gary; Watson, Robert; Katanyoutanant, Sunant; Burke, Gary; Wang, Mandy

2005-01-01

A fault-tolerant computer architecture has been conceived in an effort to reduce vulnerability to single-event upsets (spurious bit flips caused by impingement of energetic ionizing particles or photons). As in some prior fault-tolerant architectures, the redundancy needed for fault tolerance is obtained by use of multiple processors in one computer. Unlike prior architectures, the multiple processors are embedded in a single field-programmable gate array (FPGA). What makes this new approach practical is the recent commercial availability of FPGAs that are capable of having multiple embedded processors. A working prototype (see figure) consists of two embedded IBM PowerPC 405 processor cores and a comparator built on a Xilinx Virtex-II Pro FPGA. This relatively simple instantiation of the architecture implements an error-detection scheme. A planned future version, incorporating four processors and two comparators, would correct some errors in addition to detecting them.
Sliding Mode Fault Tolerant Control with Adaptive Diagnosis for Aircraft Engines

NASA Astrophysics Data System (ADS)

Xiao, Lingfei; Du, Yanbin; Hu, Jixiang; Jiang, Bin

2018-03-01

In this paper, a novel sliding mode fault tolerant control method is presented for aircraft engine systems with uncertainties and disturbances on the basis of adaptive diagnostic observer. By taking both sensors faults and actuators faults into account, the general model of aircraft engine control systems which is subjected to uncertainties and disturbances, is considered. Then, the corresponding augmented dynamic model is established in order to facilitate the fault diagnosis and fault tolerant controller design. Next, a suitable detection observer is designed to detect the faults effectively. Through creating an adaptive diagnostic observer and based on sliding mode strategy, the sliding mode fault tolerant controller is constructed. Robust stabilization is discussed and the closed-loop system can be stabilized robustly. It is also proven that the adaptive diagnostic observer output errors and the estimations of faults converge to a set exponentially, and the converge rate greater than some value which can be adjusted by choosing designable parameters properly. The simulation on a twin-shaft aircraft engine verifies the applicability of the proposed fault tolerant control method.
Model-Based Fault Tolerant Control

NASA Technical Reports Server (NTRS)

Kumar, Aditya; Viassolo, Daniel

2008-01-01

The Model Based Fault Tolerant Control (MBFTC) task was conducted under the NASA Aviation Safety and Security Program. The goal of MBFTC is to develop and demonstrate real-time strategies to diagnose and accommodate anomalous aircraft engine events such as sensor faults, actuator faults, or turbine gas-path component damage that can lead to in-flight shutdowns, aborted take offs, asymmetric thrust/loss of thrust control, or engine surge/stall events. A suite of model-based fault detection algorithms were developed and evaluated. Based on the performance and maturity of the developed algorithms two approaches were selected for further analysis: (i) multiple-hypothesis testing, and (ii) neural networks; both used residuals from an Extended Kalman Filter to detect the occurrence of the selected faults. A simple fusion algorithm was implemented to combine the results from each algorithm to obtain an overall estimate of the identified fault type and magnitude. The identification of the fault type and magnitude enabled the use of an online fault accommodation strategy to correct for the adverse impact of these faults on engine operability thereby enabling continued engine operation in the presence of these faults. The performance of the fault detection and accommodation algorithm was extensively tested in a simulation environment.
Probabilistic evaluation of on-line checks in fault-tolerant multiprocessor systems

NASA Technical Reports Server (NTRS)

Nair, V. S. S.; Hoskote, Yatin V.; Abraham, Jacob A.

1992-01-01

The analysis of fault-tolerant multiprocessor systems that use concurrent error detection (CED) schemes is much more difficult than the analysis of conventional fault-tolerant architectures. Various analytical techniques have been proposed to evaluate CED schemes deterministically. However, these approaches are based on worst-case assumptions related to the failure of system components. Often, the evaluation results do not reflect the actual fault tolerance capabilities of the system. A probabilistic approach to evaluate the fault detecting and locating capabilities of on-line checks in a system is developed. The various probabilities associated with the checking schemes are identified and used in the framework of the matrix-based model. Based on these probabilistic matrices, estimates for the fault tolerance capabilities of various systems are derived analytically.
Sliding mode based fault detection, reconstruction and fault tolerant control scheme for motor systems.

PubMed

Mekki, Hemza; Benzineb, Omar; Boukhetala, Djamel; Tadjine, Mohamed; Benbouzid, Mohamed

2015-07-01

The fault-tolerant control problem belongs to the domain of complex control systems in which inter-control-disciplinary information and expertise are required. This paper proposes an improved faults detection, reconstruction and fault-tolerant control (FTC) scheme for motor systems (MS) with typical faults. For this purpose, a sliding mode controller (SMC) with an integral sliding surface is adopted. This controller can make the output of system to track the desired position reference signal in finite-time and obtain a better dynamic response and anti-disturbance performance. But this controller cannot deal directly with total system failures. However an appropriate combination of the adopted SMC and sliding mode observer (SMO), later it is designed to on-line detect and reconstruct the faults and also to give a sensorless control strategy which can achieve tolerance to a wide class of total additive failures. The closed-loop stability is proved, using the Lyapunov stability theory. Simulation results in healthy and faulty conditions confirm the reliability of the suggested framework. Copyright © 2015 ISA. Published by Elsevier Ltd. All rights reserved.
Coordinated Fault-Tolerance for High-Performance Computing Final Project Report

DOE Office of Scientific and Technical Information (OSTI.GOV)

Panda, Dhabaleswar Kumar; Beckman, Pete

2011-07-28

With the Coordinated Infrastructure for Fault Tolerance Systems (CIFTS, as the original project came to be called) project, our aim has been to understand and tackle the following broad research questions, the answers to which will help the HEC community analyze and shape the direction of research in the field of fault tolerance and resiliency on future high-end leadership systems. Will availability of global fault information, obtained by fault information exchange between the different HEC software on a system, allow individual system software to better detect, diagnose, and adaptively respond to faults? If fault-awareness is raised throughout the system throughmore » fault information exchange, is it possible to get all system software working together to provide a more comprehensive end-to-end fault management on the system? What are the missing fault-tolerance features that widely used HEC system software lacks today that would inhibit such software from taking advantage of systemwide global fault information? What are the practical limitations of a systemwide approach for end-to-end fault management based on fault awareness and coordination? What mechanisms, tools, and technologies are needed to bring about fault awareness and coordination of responses on a leadership-class system? What standards, outreach, and community interaction are needed for adoption of the concept of fault awareness and coordination for fault management on future systems? Keeping our overall objectives in mind, the CIFTS team has taken a parallel fourfold approach. Our central goal was to design and implement a light-weight, scalable infrastructure with a simple, standardized interface to allow communication of fault-related information through the system and facilitate coordinated responses. This work led to the development of the Fault Tolerance Backplane (FTB) publish-subscribe API specification, together with a reference implementation and several experimental implementations on
Software reliability through fault-avoidance and fault-tolerance

NASA Technical Reports Server (NTRS)

Vouk, Mladen A.; Mcallister, David F.

1993-01-01

Strategies and tools for the testing, risk assessment and risk control of dependable software-based systems were developed. Part of this project consists of studies to enable the transfer of technology to industry, for example the risk management techniques for safety-concious systems. Theoretical investigations of Boolean and Relational Operator (BRO) testing strategy were conducted for condition-based testing. The Basic Graph Generation and Analysis tool (BGG) was extended to fully incorporate several variants of the BRO metric. Single- and multi-phase risk, coverage and time-based models are being developed to provide additional theoretical and empirical basis for estimation of the reliability and availability of large, highly dependable software. A model for software process and risk management was developed. The use of cause-effect graphing for software specification and validation was investigated. Lastly, advanced software fault-tolerance models were studied to provide alternatives and improvements in situations where simple software fault-tolerance strategies break down.
Energy-efficient fault tolerance in multiprocessor real-time systems

NASA Astrophysics Data System (ADS)

Guo, Yifeng

The recent progress in the multiprocessor/multicore systems has important implications for real-time system design and operation. From vehicle navigation to space applications as well as industrial control systems, the trend is to deploy multiple processors in real-time systems: systems with 4 -- 8 processors are common, and it is expected that many-core systems with dozens of processing cores will be available in near future. For such systems, in addition to general temporal requirement common for all real-time systems, two additional operational objectives are seen as critical: energy efficiency and fault tolerance. An intriguing dimension of the problem is that energy efficiency and fault tolerance are typically conflicting objectives, due to the fact that tolerating faults (e.g., permanent/transient) often requires extra resources with high energy consumption potential. In this dissertation, various techniques for energy-efficient fault tolerance in multiprocessor real-time systems have been investigated. First, the Reliability-Aware Power Management (RAPM) framework, which can preserve the system reliability with respect to transient faults when Dynamic Voltage Scaling (DVS) is applied for energy savings, is extended to support parallel real-time applications with precedence constraints. Next, the traditional Standby-Sparing (SS) technique for dual processor systems, which takes both transient and permanent faults into consideration while saving energy, is generalized to support multiprocessor systems with arbitrary number of identical processors. Observing the inefficient usage of slack time in the SS technique, a Preference-Oriented Scheduling Framework is designed to address the problem where tasks are given preferences for being executed as soon as possible (ASAP) or as late as possible (ALAP). A preference-oriented earliest deadline (POED) scheduler is proposed and its application in multiprocessor systems for energy-efficient fault tolerance is
A benchmark for fault tolerant flight control evaluation

NASA Astrophysics Data System (ADS)

Smaili, H.; Breeman, J.; Lombaerts, T.; Stroosma, O.

2013-12-01

A large transport aircraft simulation benchmark (REconfigurable COntrol for Vehicle Emergency Return - RECOVER) has been developed within the GARTEUR (Group for Aeronautical Research and Technology in Europe) Flight Mechanics Action Group 16 (FM-AG(16)) on Fault Tolerant Control (2004 2008) for the integrated evaluation of fault detection and identification (FDI) and reconfigurable flight control strategies. The benchmark includes a suitable set of assessment criteria and failure cases, based on reconstructed accident scenarios, to assess the potential of new adaptive control strategies to improve aircraft survivability. The application of reconstruction and modeling techniques, based on accident flight data, has resulted in high-fidelity nonlinear aircraft and fault models to evaluate new Fault Tolerant Flight Control (FTFC) concepts and their real-time performance to accommodate in-flight failures.
The cost of software fault tolerance

NASA Technical Reports Server (NTRS)

Migneault, G. E.

1982-01-01

The proposed use of software fault tolerance techniques as a means of reducing software costs in avionics and as a means of addressing the issue of system unreliability due to faults in software is examined. A model is developed to provide a view of the relationships among cost, redundancy, and reliability which suggests strategies for software development and maintenance which are not conventional.
Design and analysis of linear fault-tolerant permanent-magnet vernier machines.

PubMed

Xu, Liang; Ji, Jinghua; Liu, Guohai; Du, Yi; Liu, Hu

2014-01-01

This paper proposes a new linear fault-tolerant permanent-magnet (PM) vernier (LFTPMV) machine, which can offer high thrust by using the magnetic gear effect. Both PMs and windings of the proposed machine are on short mover, while the long stator is only manufactured from iron. Hence, the proposed machine is very suitable for long stroke system applications. The key of this machine is that the magnetizer splits the two movers with modular and complementary structures. Hence, the proposed machine offers improved symmetrical and sinusoidal back electromotive force waveform and reduced detent force. Furthermore, owing to the complementary structure, the proposed machine possesses favorable fault-tolerant capability, namely, independent phases. In particular, differing from the existing fault-tolerant machines, the proposed machine offers fault tolerance without sacrificing thrust density. This is because neither fault-tolerant teeth nor the flux-barriers are adopted. The electromagnetic characteristics of the proposed machine are analyzed using the time-stepping finite-element method, which verifies the effectiveness of the theoretical analysis.
Critical fault patterns determination in fault-tolerant computer systems

NASA Technical Reports Server (NTRS)

Mccluskey, E. J.; Losq, J.

1978-01-01

The method proposed tries to enumerate all the critical fault-patterns (successive occurrences of failures) without analyzing every single possible fault. The conditions for the system to be operating in a given mode can be expressed in terms of the static states. Thus, one can find all the system states that correspond to a given critical mode of operation. The next step consists in analyzing the fault-detection mechanisms, the diagnosis algorithm and the process of switch control. From them, one can find all the possible system configurations that can result from a failure occurrence. Thus, one can list all the characteristics, with respect to detection, diagnosis, and switch control, that failures must have to constitute critical fault-patterns. Such an enumeration of the critical fault-patterns can be directly used to evaluate the overall system tolerance to failures. Present research is focused on how to efficiently make use of these system-level characteristics to enumerate all the failures that verify these characteristics.
Study of fault tolerant software technology for dynamic systems

NASA Technical Reports Server (NTRS)

Caglayan, A. K.; Zacharias, G. L.

1985-01-01

The major aim of this study is to investigate the feasibility of using systems-based failure detection isolation and compensation (FDIC) techniques in building fault-tolerant software and extending them, whenever possible, to the domain of software fault tolerance. First, it is shown that systems-based FDIC methods can be extended to develop software error detection techniques by using system models for software modules. In particular, it is demonstrated that systems-based FDIC techniques can yield consistency checks that are easier to implement than acceptance tests based on software specifications. Next, it is shown that systems-based failure compensation techniques can be generalized to the domain of software fault tolerance in developing software error recovery procedures. Finally, the feasibility of using fault-tolerant software in flight software is investigated. In particular, possible system and version instabilities, and functional performance degradation that may occur in N-Version programming applications to flight software are illustrated. Finally, a comparative analysis of N-Version and recovery block techniques in the context of generic blocks in flight software is presented.
Measurement and analysis of operating system fault tolerance

NASA Technical Reports Server (NTRS)

Lee, I.; Tang, D.; Iyer, R. K.

1992-01-01

This paper demonstrates a methodology to model and evaluate the fault tolerance characteristics of operational software. The methodology is illustrated through case studies on three different operating systems: the Tandem GUARDIAN fault-tolerant system, the VAX/VMS distributed system, and the IBM/MVS system. Measurements are made on these systems for substantial periods to collect software error and recovery data. In addition to investigating basic dependability characteristics such as major software problems and error distributions, we develop two levels of models to describe error and recovery processes inside an operating system and on multiple instances of an operating system running in a distributed environment. Based on the models, reward analysis is conducted to evaluate the loss of service due to software errors and the effect of the fault-tolerance techniques implemented in the systems. Software error correlation in multicomputer systems is also investigated.
Reconfigurable tree architectures using subtree oriented fault tolerance

NASA Technical Reports Server (NTRS)

Lowrie, Matthew B.

1987-01-01

An approach to the design of reconfigurable tree architecture is presented in which spare processors are allocated at the leaves. The approach is unique in that spares are associated with subtrees and sharing of spares between these subtrees can occur. The Subtree Oriented Fault Tolerance (SOFT) approach is more reliable than previous approaches capable of tolerating link and switch failures for both single chip and multichip tree implementations while reducing redundancy in terms of both spare processors and links. VLSI layout is 0(n) for binary trees and is directly extensible to N-ary trees and fault tolerance through performance degradation.

Fault-tolerant Control of a Cyber-physical System

NASA Astrophysics Data System (ADS)

Roxana, Rusu-Both; Eva-Henrietta, Dulf

2017-10-01

Cyber-physical systems represent a new emerging field in automatic control. The fault system is a key component, because modern, large scale processes must meet high standards of performance, reliability and safety. Fault propagation in large scale chemical processes can lead to loss of production, energy, raw materials and even environmental hazard. The present paper develops a multi-agent fault-tolerant control architecture using robust fractional order controllers for a (13C) cryogenic separation column cascade. The JADE (Java Agent DEvelopment Framework) platform was used to implement the multi-agent fault tolerant control system while the operational model of the process was implemented in Matlab/SIMULINK environment. MACSimJX (Multiagent Control Using Simulink with Jade Extension) toolbox was used to link the control system and the process model. In order to verify the performance and to prove the feasibility of the proposed control architecture several fault simulation scenarios were performed.
cost and benefits optimization model for fault-tolerant aircraft electronic systems

NASA Technical Reports Server (NTRS)

1983-01-01

The factors involved in economic assessment of fault tolerant systems (FTS) and fault tolerant flight control systems (FTFCS) are discussed. Algorithms for optimization and economic analysis of FTFCS are documented.
Formal Techniques for Synchronized Fault-Tolerant Systems

NASA Technical Reports Server (NTRS)

DiVito, Ben L.; Butler, Ricky W.

1992-01-01

We present the formal verification of synchronizing aspects of the Reliable Computing Platform (RCP), a fault-tolerant computing system for digital flight control applications. The RCP uses NMR-style redundancy to mask faults and internal majority voting to purge the effects of transient faults. The system design has been formally specified and verified using the EHDM verification system. Our formalization is based on an extended state machine model incorporating snapshots of local processors clocks.
Fault tolerant architectures for integrated aircraft electronics systems

NASA Technical Reports Server (NTRS)

Levitt, K. N.; Melliar-Smith, P. M.; Schwartz, R. L.

1983-01-01

Work into possible architectures for future flight control computer systems is described. Ada for Fault-Tolerant Systems, the NETS Network Error-Tolerant System architecture, and voting in asynchronous systems are covered.
Mission Management Computer and Sequencing Hardware for RLV-TD HEX-01 Mission

NASA Astrophysics Data System (ADS)

Gupta, Sukrat; Raj, Remya; Mathew, Asha Mary; Koshy, Anna Priya; Paramasivam, R.; Mookiah, T.

2017-12-01

Reusable Launch Vehicle-Technology Demonstrator Hypersonic Experiment (RLV-TD HEX-01) mission posed some unique challenges in the design and development of avionics hardware. This work presents the details of mission critical avionics hardware mainly Mission Management Computer (MMC) and sequencing hardware. The Navigation, Guidance and Control (NGC) chain for RLV-TD is dual redundant with cross-strapped Remote Terminals (RTs) interfaced through MIL-STD-1553B bus. MMC is Bus Controller on the 1553 bus, which does the function of GPS aided navigation, guidance, digital autopilot and sequencing for the RLV-TD launch vehicle in different periodicities (10, 20, 500 ms). Digital autopilot execution in MMC with a periodicity of 10 ms (in ascent phase) is introduced for the first time and successfully demonstrated in the flight. MMC is built around Intel i960 processor and has inbuilt fault tolerance features like ECC for memories. Fault Detection and Isolation schemes are implemented to isolate the failed MMC. The sequencing hardware comprises Stage Processing System (SPS) and Command Execution Module (CEM). SPS is `RT' on the 1553 bus which receives the sequencing and control related commands from MMCs and posts to downstream modules after proper error handling for final execution. SPS is designed as a high reliability system by incorporating various fault tolerance and fault detection features. CEM is a relay based module for sequence command execution.
Design and Analysis of Linear Fault-Tolerant Permanent-Magnet Vernier Machines

PubMed Central

Xu, Liang; Liu, Guohai; Du, Yi; Liu, Hu

2014-01-01

This paper proposes a new linear fault-tolerant permanent-magnet (PM) vernier (LFTPMV) machine, which can offer high thrust by using the magnetic gear effect. Both PMs and windings of the proposed machine are on short mover, while the long stator is only manufactured from iron. Hence, the proposed machine is very suitable for long stroke system applications. The key of this machine is that the magnetizer splits the two movers with modular and complementary structures. Hence, the proposed machine offers improved symmetrical and sinusoidal back electromotive force waveform and reduced detent force. Furthermore, owing to the complementary structure, the proposed machine possesses favorable fault-tolerant capability, namely, independent phases. In particular, differing from the existing fault-tolerant machines, the proposed machine offers fault tolerance without sacrificing thrust density. This is because neither fault-tolerant teeth nor the flux-barriers are adopted. The electromagnetic characteristics of the proposed machine are analyzed using the time-stepping finite-element method, which verifies the effectiveness of the theoretical analysis. PMID:24982959
Gyro-based Maximum-Likelihood Thruster Fault Detection and Identification

NASA Technical Reports Server (NTRS)

Wilson, Edward; Lages, Chris; Mah, Robert; Clancy, Daniel (Technical Monitor)

2002-01-01

When building smaller, less expensive spacecraft, there is a need for intelligent fault tolerance vs. increased hardware redundancy. If fault tolerance can be achieved using existing navigation sensors, cost and vehicle complexity can be reduced. A maximum likelihood-based approach to thruster fault detection and identification (FDI) for spacecraft is developed here and applied in simulation to the X-38 space vehicle. The system uses only gyro signals to detect and identify hard, abrupt, single and multiple jet on- and off-failures. Faults are detected within one second and identified within one to five accords,
Evaluation of reliability modeling tools for advanced fault tolerant systems

NASA Technical Reports Server (NTRS)

Baker, Robert; Scheper, Charlotte

1986-01-01

The Computer Aided Reliability Estimation (CARE III) and Automated Reliability Interactice Estimation System (ARIES 82) reliability tools for application to advanced fault tolerance aerospace systems were evaluated. To determine reliability modeling requirements, the evaluation focused on the Draper Laboratories' Advanced Information Processing System (AIPS) architecture as an example architecture for fault tolerance aerospace systems. Advantages and limitations were identified for each reliability evaluation tool. The CARE III program was designed primarily for analyzing ultrareliable flight control systems. The ARIES 82 program's primary use was to support university research and teaching. Both CARE III and ARIES 82 were not suited for determining the reliability of complex nodal networks of the type used to interconnect processing sites in the AIPS architecture. It was concluded that ARIES was not suitable for modeling advanced fault tolerant systems. It was further concluded that subject to some limitations (the difficulty in modeling systems with unpowered spare modules, systems where equipment maintenance must be considered, systems where failure depends on the sequence in which faults occurred, and systems where multiple faults greater than a double near coincident faults must be considered), CARE III is best suited for evaluating the reliability of advanced tolerant systems for air transport.
Development and evaluation of a Fault-Tolerant Multiprocessor (FTMP) computer. Volume 2: FTMP software

NASA Technical Reports Server (NTRS)

Lala, J. H.; Smith, T. B., III

1983-01-01

The software developed for the Fault-Tolerant Multiprocessor (FTMP) is described. The FTMP executive is a timer-interrupt driven dispatcher that schedules iterative tasks which run at 3.125, 12.5, and 25 Hz. Major tasks which run under the executive include system configuration control, flight control, and display. The flight control task includes autopilot and autoland functions for a jet transport aircraft. System Displays include status displays of all hardware elements (processors, memories, I/O ports, buses), failure log displays showing transient and hard faults, and an autopilot display. All software is in a higher order language (AED, an ALGOL derivative). The executive is a fully distributed general purpose executive which automatically balances the load among available processor triads. Provisions for graceful performance degradation under processing overload are an integral part of the scheduling algorithms.
Fault tolerant operation of switched reluctance machine

NASA Astrophysics Data System (ADS)

Wang, Wei

The energy crisis and environmental challenges have driven industry towards more energy efficient solutions. With nearly 60% of electricity consumed by various electric machines in industry sector, advancement in the efficiency of the electric drive system is of vital importance. Adjustable speed drive system (ASDS) provides excellent speed regulation and dynamic performance as well as dramatically improved system efficiency compared with conventional motors without electronics drives. Industry has witnessed tremendous grow in ASDS applications not only as a driving force but also as an electric auxiliary system for replacing bulky and low efficiency auxiliary hydraulic and mechanical systems. With the vast penetration of ASDS, its fault tolerant operation capability is more widely recognized as an important feature of drive performance especially for aerospace, automotive applications and other industrial drive applications demanding high reliability. The Switched Reluctance Machine (SRM), a low cost, highly reliable electric machine with fault tolerant operation capability, has drawn substantial attention in the past three decades. Nevertheless, SRM is not free of fault. Certain faults such as converter faults, sensor faults, winding shorts, eccentricity and position sensor faults are commonly shared among all ASDS. In this dissertation, a thorough understanding of various faults and their influence on transient and steady state performance of SRM is developed via simulation and experimental study, providing necessary knowledge for fault detection and post fault management. Lumped parameter models are established for fast real time simulation and drive control. Based on the behavior of the faults, a fault detection scheme is developed for the purpose of fast and reliable fault diagnosis. In order to improve the SRM power and torque capacity under faults, the maximum torque per ampere excitation are conceptualized and validated through theoretical analysis and
Fault tolerant programmable digital attitude control electronics study

NASA Technical Reports Server (NTRS)

Sorensen, A. A.

1974-01-01

The attitude control electronics mechanization study to develop a fault tolerant autonomous concept for a three axis system is reported. Programmable digital electronics are compared to general purpose digital computers. The requirements, constraints, and tradeoffs are discussed. It is concluded that: (1) general fault tolerance can be achieved relatively economically, (2) recovery times of less than one second can be obtained, (3) the number of faulty behavior patterns must be limited, and (4) adjoined processes are the best indicators of faulty operation.
VLSI Implementation of Fault Tolerance Multiplier based on Reversible Logic Gate

NASA Astrophysics Data System (ADS)

Ahmad, Nabihah; Hakimi Mokhtar, Ahmad; Othman, Nurmiza binti; Fhong Soon, Chin; Rahman, Ab Al Hadi Ab

2017-08-01

Multiplier is one of the essential component in the digital world such as in digital signal processing, microprocessor, quantum computing and widely used in arithmetic unit. Due to the complexity of the multiplier, tendency of errors are very high. This paper aimed to design a 2×2 bit Fault Tolerance Multiplier based on Reversible logic gate with low power consumption and high performance. This design have been implemented using 90nm Complemetary Metal Oxide Semiconductor (CMOS) technology in Synopsys Electronic Design Automation (EDA) Tools. Implementation of the multiplier architecture is by using the reversible logic gates. The fault tolerance multiplier used the combination of three reversible logic gate which are Double Feynman gate (F2G), New Fault Tolerance (NFT) gate and Islam Gate (IG) with the area of 160μm x 420.3μm (67.25 mm2). This design achieved a low power consumption of 122.85μW and propagation delay of 16.99ns. The fault tolerance multiplier proposed achieved a low power consumption and high performance which suitable for application of modern computing as it has a fault tolerance capabilities.
Fault-tolerant cooperative output regulation for multi-vehicle systems with sensor faults

NASA Astrophysics Data System (ADS)

Qin, Liguo; He, Xiao; Zhou, D. H.

2017-10-01

This paper presents a unified framework of fault diagnosis and fault-tolerant cooperative output regulation (FTCOR) for a linear discrete-time multi-vehicle system with sensor faults. The FTCOR control law is designed through three steps. A cooperative output regulation (COR) controller is designed based on the internal mode principle when there are no sensor faults. A sufficient condition on the existence of the COR controller is given based on the discrete-time algebraic Riccati equation (DARE). Then, a decentralised fault diagnosis scheme is designed to cope with sensor faults occurring in followers. A residual generator is developed to detect sensor faults of each follower, and a bank of fault-matching estimators are proposed to isolate and estimate sensor faults of each follower. Unlike the current distributed fault diagnosis for multi-vehicle systems, the presented decentralised fault diagnosis scheme in each vehicle reduces the communication and computation load by only using the information of the vehicle. By combing the sensor fault estimation and the COR control law, an FTCOR controller is proposed. Finally, the simulation results demonstrate the effectiveness of the FTCOR controller.
An optimized implementation of a fault-tolerant clock synchronization circuit

NASA Technical Reports Server (NTRS)

Torres-Pomales, Wilfredo

1995-01-01

A fault-tolerant clock synchronization circuit was designed and tested. A comparison to a previous design and the procedure followed to achieve the current optimization are included. The report also includes a description of the system and the results of tests performed to study the synchronization and fault-tolerant characteristics of the implementation.
Fault-tolerant measurement-based quantum computing with continuous-variable cluster states.

PubMed

Menicucci, Nicolas C

2014-03-28

A long-standing open question about Gaussian continuous-variable cluster states is whether they enable fault-tolerant measurement-based quantum computation. The answer is yes. Initial squeezing in the cluster above a threshold value of 20.5 dB ensures that errors from finite squeezing acting on encoded qubits are below the fault-tolerance threshold of known qubit-based error-correcting codes. By concatenating with one of these codes and using ancilla-based error correction, fault-tolerant measurement-based quantum computation of theoretically indefinite length is possible with finitely squeezed cluster states.
The Design of a Fault-Tolerant COTS-Based Bus Architecture for Space Applications

NASA Technical Reports Server (NTRS)

Chau, Savio N.; Alkalai, Leon; Tai, Ann T.

2000-01-01

The high-performance, scalability and miniaturization requirements together with the power, mass and cost constraints mandate the use of commercial-off-the-shelf (COTS) components and standards in the X2000 avionics system architecture for deep-space missions. In this paper, we report our experiences and findings on the design of an IEEE 1394 compliant fault-tolerant COTS-based bus architecture. While the COTS standard IEEE 1394 adequately supports power management, high performance and scalability, its topological criteria impose restrictions on fault tolerance realization. To circumvent the difficulties, we derive a "stack-tree" topology that not only complies with the IEEE 1394 standard but also facilitates fault tolerance realization in a spaceborne system with limited dedicated resource redundancies. Moreover, by exploiting pertinent standard features of the 1394 interface which are not purposely designed for fault tolerance, we devise a comprehensive set of fault detection mechanisms to support the fault-tolerant bus architecture.
Fault tolerance of artificial neural networks with applications in critical systems

NASA Technical Reports Server (NTRS)

Protzel, Peter W.; Palumbo, Daniel L.; Arras, Michael K.

1992-01-01

This paper investigates the fault tolerance characteristics of time continuous recurrent artificial neural networks (ANN) that can be used to solve optimization problems. The principle of operations and performance of these networks are first illustrated by using well-known model problems like the traveling salesman problem and the assignment problem. The ANNs are then subjected to 13 simultaneous 'stuck at 1' or 'stuck at 0' faults for network sizes of up to 900 'neurons'. The effects of these faults is demonstrated and the cause for the observed fault tolerance is discussed. An application is presented in which a network performs a critical task for a real-time distributed processing system by generating new task allocations during the reconfiguration of the system. The performance degradation of the ANN under the presence of faults is investigated by large-scale simulations, and the potential benefits of delegating a critical task to a fault tolerant network are discussed.
Hybrid routing technique for a fault-tolerant, integrated information network

NASA Technical Reports Server (NTRS)

Meredith, B. D.

1986-01-01

The evolutionary growth of the space station and the diverse activities onboard are expected to require a hierarchy of integrated, local area networks capable of supporting data, voice, and video communications. In addition, fault-tolerant network operation is necessary to protect communications between critical systems attached to the net and to relieve the valuable human resources onboard the space station of time-critical data system repair tasks. A key issue for the design of the fault-tolerant, integrated network is the development of a robust routing algorithm which dynamically selects the optimum communication paths through the net. A routing technique is described that adapts to topological changes in the network to support fault-tolerant operation and system evolvability.
Flight elements: Fault detection and fault management

NASA Technical Reports Server (NTRS)

Lum, H.; Patterson-Hine, A.; Edge, J. T.; Lawler, D.

1990-01-01

Fault management for an intelligent computational system must be developed using a top down integrated engineering approach. An approach proposed includes integrating the overall environment involving sensors and their associated data; design knowledge capture; operations; fault detection, identification, and reconfiguration; testability; causal models including digraph matrix analysis; and overall performance impacts on the hardware and software architecture. Implementation of the concept to achieve a real time intelligent fault detection and management system will be accomplished via the implementation of several objectives, which are: Development of fault tolerant/FDIR requirement and specification from a systems level which will carry through from conceptual design through implementation and mission operations; Implementation of monitoring, diagnosis, and reconfiguration at all system levels providing fault isolation and system integration; Optimize system operations to manage degraded system performance through system integration; and Lower development and operations costs through the implementation of an intelligent real time fault detection and fault management system and an information management system.
Locating hardware faults in a data communications network of a parallel computer

DOEpatents

Archer, Charles J.; Megerian, Mark G.; Ratterman, Joseph D.; Smith, Brian E.

2010-01-12

Hardware faults location in a data communications network of a parallel computer. Such a parallel computer includes a plurality of compute nodes and a data communications network that couples the compute nodes for data communications and organizes the compute node as a tree. Locating hardware faults includes identifying a next compute node as a parent node and a root of a parent test tree, identifying for each child compute node of the parent node a child test tree having the child compute node as root, running a same test suite on the parent test tree and each child test tree, and identifying the parent compute node as having a defective link connected from the parent compute node to a child compute node if the test suite fails on the parent test tree and succeeds on all the child test trees.

Optimal Management of Redundant Control Authority for Fault Tolerance

NASA Technical Reports Server (NTRS)

Wu, N. Eva; Ju, Jianhong

2000-01-01

This paper is intended to demonstrate the feasibility of a solution to a fault tolerant control problem. It explains, through a numerical example, the design and the operation of a novel scheme for fault tolerant control. The fundamental principle of the scheme was formalized in [5] based on the notion of normalized nonspecificity. The novelty lies with the use of a reliability criterion for redundancy management, and therefore leads to a high overall system reliability.
A fault-tolerant strategy based on SMC for current-controlled converters

NASA Astrophysics Data System (ADS)

Azer, Peter M.; Marei, Mostafa I.; Sattar, Ahmed A.

2018-05-01

The sliding mode control (SMC) is used to control variable structure systems such as power electronics converters. This paper presents a fault-tolerant strategy based on the SMC for current-controlled AC-DC converters. The proposed SMC is based on three sliding surfaces for the three legs of the AC-DC converter. Two sliding surfaces are assigned to control the phase currents since the input three-phase currents are balanced. Hence, the third sliding surface is considered as an extra degree of freedom which is utilised to control the neutral voltage. This action is utilised to enhance the performance of the converter during open-switch faults. The proposed fault-tolerant strategy is based on allocating the sliding surface of the faulty leg to control the neutral voltage. Consequently, the current waveform is improved. The behaviour of the current-controlled converter during different types of open-switch faults is analysed. Double switch faults include three cases: two upper switch fault; upper and lower switch fault at different legs; and two switches of the same leg. The dynamic performance of the proposed system is evaluated during healthy and open-switch fault operations. Simulation results exhibit the various merits of the proposed SMC-based fault-tolerant strategy.
Reliability of Fault Tolerant Control Systems. Part 2

NASA Technical Reports Server (NTRS)

Wu, N. Eva

2000-01-01

This paper reports Part II of a two part effort that is intended to delineate the relationship between reliability and fault tolerant control in a quantitative manner. Reliability properties peculiar to fault-tolerant control systems are emphasized, such as the presence of analytic redundancy in high proportion, the dependence of failures on control performance, and high risks associated with decisions in redundancy management due to multiple sources of uncertainties and sometimes large processing requirements. As a consequence, coverage of failures through redundancy management can be severely limited. The paper proposes to formulate the fault tolerant control problem as an optimization problem that maximizes coverage of failures through redundancy management. Coverage modeling is attempted in a way that captures its dependence on the control performance and on the diagnostic resolution. Under the proposed redundancy management policy, it is shown that an enhanced overall system reliability can be achieved with a control law of a superior robustness, with an estimator of a higher resolution, and with a control performance requirement of a lesser stringency.
Fault tolerant high-performance PACS network design and implementation

NASA Astrophysics Data System (ADS)

Chimiak, William J.; Boehme, Johannes M.

1998-07-01

The Wake Forest University School of Medicine and the Wake Forest University/Baptist Medical Center (WFUBMC) are implementing a second generation PACS. The first generation PACS provided helpful information about the functional and temporal requirements of the system. It highlighted the importance of image retrieval speed, system availability, RIS/HIS integration, the ability to rapidly view images on any PACS workstation, network bandwidth, equipment redundancy, and the ability for the system to evolve using standards-based components. This paper deals with the network design and implementation of the PACS. The physical layout of the hospital areas served by the PACS, the choice of network equipment and installation issues encountered are addressed. Efforts to optimize fault tolerance are discussed. The PACS network is a gigabit, mixed-media network based on LAN emulation over ATM (LANE) with a rapid migration from LANE to Multiple Protocols Over ATM (MPOA) planned. Two fault-tolerant backbone ATM switches serve to distribute network accesses with two load-balancing 622 megabit per second (Mbps) OC-12 interconnections. The switch was sized to be upgradable to provide a 2.54 Gbps OC-48 interconnection with an OC-12 interconnection as a load-balancing backup. Modalities connect with legacy network interface cards to a switched-ethernet device. This device has two 155 Mbps OC-3 load-balancing uplinks to each of the backbone ATM switches of the PACS. This provides a fault-tolerant logical connection to the modality servers which pass verified DICOM images to the PACS servers and proper PACS diagnostic workstations. Where fiber pulls were prohibitively expensive, edge ATM switches were installed with an OC-12 uplink to a backbone ATM switches. The PACS and data base servers are fault-tolerant, hot-swappable Sun Enterprise Servers with an OC-12 connection to a backbone ATM switch and a fast-ethernet connection to a back-up network. The workstations come with 10
Data-based fault-tolerant control for affine nonlinear systems with actuator faults.

PubMed

Xie, Chun-Hua; Yang, Guang-Hong

2016-09-01

This paper investigates the fault-tolerant control (FTC) problem for unknown nonlinear systems with actuator faults including stuck, outage, bias and loss of effectiveness. The upper bounds of stuck faults, bias faults and loss of effectiveness faults are unknown. A new data-based FTC scheme is proposed. It consists of the online estimations of the bounds and a state-dependent function. The estimations are adjusted online to compensate automatically the actuator faults. The state-dependent function solved by using real system data helps to stabilize the system. Furthermore, all signals in the resulting closed-loop system are uniformly bounded and the states converge asymptotically to zero. Compared with the existing results, the proposed approach is data-based. Finally, two simulation examples are provided to show the effectiveness of the proposed approach. Copyright © 2016 ISA. Published by Elsevier Ltd. All rights reserved.
A Primer on Architectural Level Fault Tolerance

NASA Technical Reports Server (NTRS)

Butler, Ricky W.

2008-01-01

This paper introduces the fundamental concepts of fault tolerant computing. Key topics covered are voting, fault detection, clock synchronization, Byzantine Agreement, diagnosis, and reliability analysis. Low level mechanisms such as Hamming codes or low level communications protocols are not covered. The paper is tutorial in nature and does not cover any topic in detail. The focus is on rationale and approach rather than detailed exposition.
Fault-tolerant arithmetic via time-shared TMR

NASA Astrophysics Data System (ADS)

Swartzlander, Earl E.

1999-11-01

Fault tolerance is increasingly important as society has come to depend on computers for more and more aspects of daily life. The current concern about the Y2K problems indicates just how much we depend on accurate computers. This paper describes work on time- shared TMR, a technique which is used to provide arithmetic operations that produce correct results in spite of circuit faults.
Refinement for fault-tolerance: An aircraft hand-off protocol

NASA Technical Reports Server (NTRS)

Marzullo, Keith; Schneider, Fred B.; Dehn, Jon

1994-01-01

Part of the Advanced Automation System (AAS) for air-traffic control is a protocol to permit flight hand-off from one air-traffic controller to another. The protocol must be fault-tolerant and, therefore, is subtle -- an ideal candidate for the application of formal methods. This paper describes a formal method for deriving fault-tolerant protocols that is based on refinement and proof outlines. The AAS hand-off protocol was actually derived using this method; that derivation is given.
Validation Methods for Fault-Tolerant avionics and control systems, working group meeting 1

NASA Technical Reports Server (NTRS)

1979-01-01

The proceedings of the first working group meeting on validation methods for fault tolerant computer design are presented. The state of the art in fault tolerant computer validation was examined in order to provide a framework for future discussions concerning research issues for the validation of fault tolerant avionics and flight control systems. The development of positions concerning critical aspects of the validation process are given.
Parameter Transient Behavior Analysis on Fault Tolerant Control System

NASA Technical Reports Server (NTRS)

Belcastro, Christine (Technical Monitor); Shin, Jong-Yeob

2003-01-01

In a fault tolerant control (FTC) system, a parameter varying FTC law is reconfigured based on fault parameters estimated by fault detection and isolation (FDI) modules. FDI modules require some time to detect fault occurrences in aero-vehicle dynamics. This paper illustrates analysis of a FTC system based on estimated fault parameter transient behavior which may include false fault detections during a short time interval. Using Lyapunov function analysis, the upper bound of an induced-L2 norm of the FTC system performance is calculated as a function of a fault detection time and the exponential decay rate of the Lyapunov function.
Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing

DTIC Science & Technology

2012-12-14

Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing Matei Zaharia Tathagata Das Haoyuan Li Timothy Hunter Scott Shenker Ion...SUBTITLE Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing 5a. CONTRACT NUMBER 5b. GRANT NUMBER 5c. PROGRAM ELEMENT NUMBER...time. However, current programming models for distributed stream processing are relatively low-level often leaving the user to worry about consistency of
Advanced development for space robotics with emphasis on fault tolerance

NASA Technical Reports Server (NTRS)

Tesar, D.; Chladek, J.; Hooper, R.; Sreevijayan, D.; Kapoor, C.; Geisinger, J.; Meaney, M.; Browning, G.; Rackers, K.

1995-01-01

This paper describes the ongoing work in fault tolerance at the University of Texas at Austin. The paper describes the technical goals the group is striving to achieve and includes a brief description of the individual projects focusing on fault tolerance. The ultimate goal is to develop and test technology applicable to all future missions of NASA (lunar base, Mars exploration, planetary surveillance, space station, etc.).
Advanced information processing system - Status report. [for fault tolerant and damage tolerant data processing for aerospace vehicles

NASA Technical Reports Server (NTRS)

Brock, L. D.; Lala, J.

1986-01-01

The Advanced Information Processing System (AIPS) is designed to provide a fault tolerant and damage tolerant data processing architecture for a broad range of aerospace vehicles. The AIPS architecture also has attributes to enhance system effectiveness such as graceful degradation, growth and change tolerance, integrability, etc. Two key building blocks being developed by the AIPS program are a fault and damage tolerant processor and communication network. A proof-of-concept system is now being built and will be tested to demonstrate the validity and performance of the AIPS concepts.
Fault tolerance in a supercomputer through dynamic repartitioning

DOEpatents

Chen, Dong; Coteus, Paul W.; Gara, Alan G.; Takken, Todd E.

2007-02-27

A multiprocessor, parallel computer is made tolerant to hardware failures by providing extra groups of redundant standby processors and by designing the system so that these extra groups of processors can be swapped with any group which experiences a hardware failure. This swapping can be under software control, thereby permitting the entire computer to sustain a hardware failure but, after swapping in the standby processors, to still appear to software as a pristine, fully functioning system.
Nonuniform code concatenation for universal fault-tolerant quantum computing

NASA Astrophysics Data System (ADS)

Nikahd, Eesa; Sedighi, Mehdi; Saheb Zamani, Morteza

2017-09-01

Using transversal gates is a straightforward and efficient technique for fault-tolerant quantum computing. Since transversal gates alone cannot be computationally universal, they must be combined with other approaches such as magic state distillation, code switching, or code concatenation to achieve universality. In this paper we propose an alternative approach for universal fault-tolerant quantum computing, mainly based on the code concatenation approach proposed in [T. Jochym-O'Connor and R. Laflamme, Phys. Rev. Lett. 112, 010505 (2014), 10.1103/PhysRevLett.112.010505], but in a nonuniform fashion. The proposed approach is described based on nonuniform concatenation of the 7-qubit Steane code with the 15-qubit Reed-Muller code, as well as the 5-qubit code with the 15-qubit Reed-Muller code, which lead to two 49-qubit and 47-qubit codes, respectively. These codes can correct any arbitrary single physical error with the ability to perform a universal set of fault-tolerant gates, without using magic state distillation.
Experimental Demonstration of Fault-Tolerant State Preparation with Superconducting Qubits.

PubMed

Takita, Maika; Cross, Andrew W; Córcoles, A D; Chow, Jerry M; Gambetta, Jay M

2017-11-03

Robust quantum computation requires encoding delicate quantum information into degrees of freedom that are hard for the environment to change. Quantum encodings have been demonstrated in many physical systems by observing and correcting storage errors, but applications require not just storing information; we must accurately compute even with faulty operations. The theory of fault-tolerant quantum computing illuminates a way forward by providing a foundation and collection of techniques for limiting the spread of errors. Here we implement one of the smallest quantum codes in a five-qubit superconducting transmon device and demonstrate fault-tolerant state preparation. We characterize the resulting code words through quantum process tomography and study the free evolution of the logical observables. Our results are consistent with fault-tolerant state preparation in a protected qubit subspace.
Fault Detection, Isolation and Recovery (FDIR) Portable Liquid Oxygen Hardware Demonstrator

NASA Technical Reports Server (NTRS)

Oostdyk, Rebecca L.; Perotti, Jose M.

2011-01-01

The Fault Detection, Isolation and Recovery (FDIR) hardware demonstration will highlight the effort being conducted by Constellation's Ground Operations (GO) to provide the Launch Control System (LCS) with system-level health management during vehicle processing and countdown activities. A proof-of-concept demonstration of the FDIR prototype established the capability of the software to provide real-time fault detection and isolation using generated Liquid Hydrogen data. The FDIR portable testbed unit (presented here) aims to enhance FDIR by providing a dynamic simulation of Constellation subsystems that feed the FDIR software live data based on Liquid Oxygen system properties. The LO2 cryogenic ground system has key properties that are analogous to the properties of an electronic circuit. The LO2 system is modeled using electrical components and an equivalent circuit is designed on a printed circuit board to simulate the live data. The portable testbed is also be equipped with data acquisition and communication hardware to relay the measurements to the FDIR application running on a PC. This portable testbed is an ideal capability to perform FDIR software testing, troubleshooting, training among others.
Dataflow models for fault-tolerant control systems

NASA Technical Reports Server (NTRS)

Papadopoulos, G. M.

1984-01-01

Dataflow concepts are used to generate a unified hardware/software model of redundant physical systems which are prone to faults. Basic results in input congruence and synchronization are shown to reduce to a simple model of data exchanges between processing sites. Procedures are given for the construction of congruence schemata, the distinguishing features of any correctly designed redundant system.
Quantum neuromorphic hardware for quantum artificial intelligence

NASA Astrophysics Data System (ADS)

Prati, Enrico

2017-08-01

The development of machine learning methods based on deep learning boosted the field of artificial intelligence towards unprecedented achievements and application in several fields. Such prominent results were made in parallel with the first successful demonstrations of fault tolerant hardware for quantum information processing. To which extent deep learning can take advantage of the existence of a hardware based on qubits behaving as a universal quantum computer is an open question under investigation. Here I review the convergence between the two fields towards implementation of advanced quantum algorithms, including quantum deep learning.
Active Fault Tolerant Control for Ultrasonic Piezoelectric Motor

NASA Astrophysics Data System (ADS)

Boukhnifer, Moussa

2012-07-01

Ultrasonic piezoelectric motor technology is an important system component in integrated mechatronics devices working on extreme operating conditions. Due to these constraints, robustness and performance of the control interfaces should be taken into account in the motor design. In this paper, we apply a new architecture for a fault tolerant control using Youla parameterization for an ultrasonic piezoelectric motor. The distinguished feature of proposed controller architecture is that it shows structurally how the controller design for performance and robustness may be done separately which has the potential to overcome the conflict between performance and robustness in the traditional feedback framework. A fault tolerant control architecture includes two parts: one part for performance and the other part for robustness. The controller design works in such a way that the feedback control system will be solely controlled by the proportional plus double-integral PI2 performance controller for a nominal model without disturbances and H∞ robustification controller will only be activated in the presence of the uncertainties or an external disturbances. The simulation results demonstrate the effectiveness of the proposed fault tolerant control architecture.

Proactive Fault Tolerance for HPC with Xen Virtualization

DOE Office of Scientific and Technical Information (OSTI.GOV)

Nagarajan, Arun Babu; Mueller, Frank; Engelmann, Christian

2007-01-01

with thousands of processors. At such large counts of compute nodes, faults are becoming common place. Current techniques to tolerate faults focus on reactive schemes to recover from faults and generally rely on a checkpoint/restart mechanism. Yet, in today's systems, node failures can often be anticipated by detecting a deteriorating health status. Instead of a reactive scheme for fault tolerance (FT), we are promoting a proactive one where processes automatically migrate from unhealthy nodes to healthy ones. Our approach relies on operating system virtualization techniques exemplied by but not limited to Xen. This paper contributes an automatic and transparent mechanismmore » for proactive FT for arbitrary MPI applications. It leverages virtualization techniques combined with health monitoring and load-based migration. We exploit Xen's live migration mechanism for a guest operating system (OS) to migrate an MPI task from a health-deteriorating node to a healthy one without stopping the MPI task during most of the migration. Our proactive FT daemon orchestrates the tasks of health monitoring, load determination and initiation of guest OS migration. Experimental results demonstrate that live migration hides migration costs and limits the overhead to only a few seconds making it an attractive approach to realize FT in HPC systems. Overall, our enhancements make proactive FT a valuable asset for long-running MPI application that is complementary to reactive FT using full checkpoint/ restart schemes since checkpoint frequencies can be reduced as fewer unanticipated failures are encountered. In the context of OS virtualization, we believe that this is the rst comprehensive study of proactive fault tolerance where live migration is actually triggered by health monitoring.« less
Adaptive robust fault-tolerant control for linear MIMO systems with unmatched uncertainties

NASA Astrophysics Data System (ADS)

Zhang, Kangkang; Jiang, Bin; Yan, Xing-Gang; Mao, Zehui

2017-10-01

In this paper, two novel fault-tolerant control design approaches are proposed for linear MIMO systems with actuator additive faults, multiplicative faults and unmatched uncertainties. For time-varying multiplicative and additive faults, new adaptive laws and additive compensation functions are proposed. A set of conditions is developed such that the unmatched uncertainties are compensated by actuators in control. On the other hand, for unmatched uncertainties with their projection in unmatched space being not zero, based on a (vector) relative degree condition, additive functions are designed to compensate for the uncertainties from output channels in the presence of actuator faults. The developed fault-tolerant control schemes are applied to two aircraft systems to demonstrate the efficiency of the proposed approaches.
Fault tolerant filtering and fault detection for quantum systems driven by fields in single photon states

DOE Office of Scientific and Technical Information (OSTI.GOV)

Gao, Qing, E-mail: qing.gao.chance@gmail.com; Dong, Daoyi, E-mail: daoyidong@gmail.com; Petersen, Ian R., E-mail: i.r.petersen@gmai.com

The purpose of this paper is to solve the fault tolerant filtering and fault detection problem for a class of open quantum systems driven by a continuous-mode bosonic input field in single photon states when the systems are subject to stochastic faults. Optimal estimates of both the system observables and the fault process are simultaneously calculated and characterized by a set of coupled recursive quantum stochastic differential equations.
Enhanced fault-tolerant quantum computing in d-level systems.

PubMed

Campbell, Earl T

2014-12-05

Error-correcting codes protect quantum information and form the basis of fault-tolerant quantum computing. Leading proposals for fault-tolerant quantum computation require codes with an exceedingly rare property, a transversal non-Clifford gate. Codes with the desired property are presented for d-level qudit systems with prime d. The codes use n=d-1 qudits and can detect up to ∼d/3 errors. We quantify the performance of these codes for one approach to quantum computation known as magic-state distillation. Unlike prior work, we find performance is always enhanced by increasing d.
A Novel Dual Separate Paths (DSP) Algorithm Providing Fault-Tolerant Communication for Wireless Sensor Networks.

PubMed

Tien, Nguyen Xuan; Kim, Semog; Rhee, Jong Myung; Park, Sang Yoon

2017-07-25

Fault tolerance has long been a major concern for sensor communications in fault-tolerant cyber physical systems (CPSs). Network failure problems often occur in wireless sensor networks (WSNs) due to various factors such as the insufficient power of sensor nodes, the dislocation of sensor nodes, the unstable state of wireless links, and unpredictable environmental interference. Fault tolerance is thus one of the key requirements for data communications in WSN applications. This paper proposes a novel path redundancy-based algorithm, called dual separate paths (DSP), that provides fault-tolerant communication with the improvement of the network traffic performance for WSN applications, such as fault-tolerant CPSs. The proposed DSP algorithm establishes two separate paths between a source and a destination in a network based on the network topology information. These paths are node-disjoint paths and have optimal path distances. Unicast frames are delivered from the source to the destination in the network through the dual paths, providing fault-tolerant communication and reducing redundant unicast traffic for the network. The DSP algorithm can be applied to wired and wireless networks, such as WSNs, to provide seamless fault-tolerant communication for mission-critical and life-critical applications such as fault-tolerant CPSs. The analyzed and simulated results show that the DSP-based approach not only provides fault-tolerant communication, but also improves network traffic performance. For the case study in this paper, when the DSP algorithm was applied to high-availability seamless redundancy (HSR) networks, the proposed DSP-based approach reduced the network traffic by 80% to 88% compared with the standard HSR protocol, thus improving network traffic performance.
A novel N-input voting algorithm for X-by-wire fault-tolerant systems.

PubMed

Karimi, Abbas; Zarafshan, Faraneh; Al-Haddad, S A R; Ramli, Abdul Rahman

2014-01-01

Voting is an important operation in multichannel computation paradigm and realization of ultrareliable and real-time control systems that arbitrates among the results of N redundant variants. These systems include N-modular redundant (NMR) hardware systems and diversely designed software systems based on N-version programming (NVP). Depending on the characteristics of the application and the type of selected voter, the voting algorithms can be implemented for either hardware or software systems. In this paper, a novel voting algorithm is introduced for real-time fault-tolerant control systems, appropriate for applications in which N is large. Then, its behavior has been software implemented in different scenarios of error-injection on the system inputs. The results of analyzed evaluations through plots and statistical computations have demonstrated that this novel algorithm does not have the limitations of some popular voting algorithms such as median and weighted; moreover, it is able to significantly increase the reliability and availability of the system in the best case to 2489.7% and 626.74%, respectively, and in the worst case to 3.84% and 1.55%, respectively.
14 CFR Special Federal Aviation... - Fuel Tank System Fault Tolerance Evaluation Requirements

Code of Federal Regulations, 2014 CFR

2014-01-01

... 14 Aeronautics and Space 1 2014-01-01 2014-01-01 false Fuel Tank System Fault Tolerance Evaluation Requirements Federal Special Federal Aviation Regulation No. 88 Aeronautics and Space FEDERAL AVIATION..., SFAR No. 88 Special Federal Aviation Regulation No. 88—Fuel Tank System Fault Tolerance Evaluation...
14 CFR Special Federal Aviation... - Fuel Tank System Fault Tolerance Evaluation Requirements

Code of Federal Regulations, 2011 CFR

2011-01-01

... 14 Aeronautics and Space 1 2011-01-01 2011-01-01 false Fuel Tank System Fault Tolerance Evaluation Requirements Federal Special Federal Aviation Regulation No. 88 Aeronautics and Space FEDERAL AVIATION..., SFAR No. 88 Special Federal Aviation Regulation No. 88—Fuel Tank System Fault Tolerance Evaluation...
14 CFR Special Federal Aviation... - Fuel Tank System Fault Tolerance Evaluation Requirements

Code of Federal Regulations, 2012 CFR

2012-01-01

... 14 Aeronautics and Space 1 2012-01-01 2012-01-01 false Fuel Tank System Fault Tolerance Evaluation Requirements Federal Special Federal Aviation Regulation No. 88 Aeronautics and Space FEDERAL AVIATION..., SFAR No. 88 Special Federal Aviation Regulation No. 88—Fuel Tank System Fault Tolerance Evaluation...
14 CFR Special Federal Aviation... - Fuel Tank System Fault Tolerance Evaluation Requirements

Code of Federal Regulations, 2010 CFR

2010-01-01

... 14 Aeronautics and Space 1 2010-01-01 2010-01-01 false Fuel Tank System Fault Tolerance Evaluation Requirements Federal Special Federal Aviation Regulation No. 88 Aeronautics and Space FEDERAL AVIATION..., SFAR No. 88 Special Federal Aviation Regulation No. 88—Fuel Tank System Fault Tolerance Evaluation...
14 CFR Special Federal Aviation... - Fuel Tank System Fault Tolerance Evaluation Requirements

Code of Federal Regulations, 2013 CFR

2013-01-01

... 14 Aeronautics and Space 1 2013-01-01 2013-01-01 false Fuel Tank System Fault Tolerance Evaluation Requirements Federal Special Federal Aviation Regulation No. 88 Aeronautics and Space FEDERAL AVIATION..., SFAR No. 88 Special Federal Aviation Regulation No. 88—Fuel Tank System Fault Tolerance Evaluation...
Fault-tolerant onboard digital information switching and routing for communications satellites

NASA Technical Reports Server (NTRS)

Shalkhauser, Mary JO; Quintana, Jorge A.; Soni, Nitin J.; Kim, Heechul

1993-01-01

The NASA Lewis Research Center is developing an information-switching processor for future meshed very-small-aperture terminal (VSAT) communications satellites. The information-switching processor will switch and route baseband user data onboard the VSAT satellite to connect thousands of Earth terminals. Fault tolerance is a critical issue in developing information-switching processor circuitry that will provide and maintain reliable communications services. In parallel with the conceptual development of the meshed VSAT satellite network architecture, NASA designed and built a simple test bed for developing and demonstrating baseband switch architectures and fault-tolerance techniques. The meshed VSAT architecture and the switching demonstration test bed are described, and the initial switching architecture and the fault-tolerance techniques that were developed and tested are discussed.
Combining dynamical decoupling with fault-tolerant quantum computation

DOE Office of Scientific and Technical Information (OSTI.GOV)

Ng, Hui Khoon; Preskill, John; Lidar, Daniel A.

2011-07-15

We study how dynamical decoupling (DD) pulse sequences can improve the reliability of quantum computers. We prove upper bounds on the accuracy of DD-protected quantum gates and derive sufficient conditions for DD-protected gates to outperform unprotected gates. Under suitable conditions, fault-tolerant quantum circuits constructed from DD-protected gates can tolerate stronger noise and have a lower overhead cost than fault-tolerant circuits constructed from unprotected gates. Our accuracy estimates depend on the dynamics of the bath that couples to the quantum computer and can be expressed either in terms of the operator norm of the bath's Hamiltonian or in terms of themore » power spectrum of bath correlations; we explain in particular how the performance of recursively generated concatenated pulse sequences can be analyzed from either viewpoint. Our results apply to Hamiltonian noise models with limited spatial correlations.« less
A Self-Stabilizing Hybrid Fault-Tolerant Synchronization Protocol

NASA Technical Reports Server (NTRS)

Malekpour, Mahyar R.

2015-01-01

This paper presents a strategy for solving the Byzantine general problem for self-stabilizing a fully connected network from an arbitrary state and in the presence of any number of faults with various severities including any number of arbitrary (Byzantine) faulty nodes. The strategy consists of two parts: first, converting Byzantine faults into symmetric faults, and second, using a proven symmetric-fault tolerant algorithm to solve the general case of the problem. A protocol (algorithm) is also present that tolerates symmetric faults, provided that there are more good nodes than faulty ones. The solution applies to realizable systems, while allowing for differences in the network elements, provided that the number of arbitrary faults is not more than a third of the network size. The only constraint on the behavior of a node is that the interactions with other nodes are restricted to defined links and interfaces. The solution does not rely on assumptions about the initial state of the system and no central clock nor centrally generated signal, pulse, or message is used. Nodes are anonymous, i.e., they do not have unique identities. A mechanical verification of a proposed protocol is also present. A bounded model of the protocol is verified using the Symbolic Model Verifier (SMV). The model checking effort is focused on verifying correctness of the bounded model of the protocol as well as confirming claims of determinism and linear convergence with respect to the self-stabilization period.
Evaluating and extending user-level fault tolerance in MPI applications

DOE PAGES

Laguna, Ignacio; Richards, David F.; Gamblin, Todd; ...

2016-01-11

The user-level failure mitigation (ULFM) interface has been proposed to provide fault-tolerant semantics in the Message Passing Interface (MPI). Previous work presented performance evaluations of ULFM; yet questions related to its programability and applicability, especially to non-trivial, bulk synchronous applications, remain unanswered. In this article, we present our experiences on using ULFM in a case study with a large, highly scalable, bulk synchronous molecular dynamics application to shed light on the advantages and difficulties of this interface to program fault-tolerant MPI applications. We found that, although ULFM is suitable for master–worker applications, it provides few benefits for more common bulkmore » synchronous MPI applications. Furthermore, to address these limitations, we introduce a new, simpler fault-tolerant interface for complex, bulk synchronous MPI programs with better applicability and support than ULFM for application-level recovery mechanisms, such as global rollback.« less
A fault tolerant gait for a hexapod robot over uneven terrain.

PubMed

Yang, J M; Kim, J H

2000-01-01

The fault tolerant gait of legged robots in static walking is a gait which maintains its stability against a fault event preventing a leg from having the support state. In this paper, a fault tolerant quadruped gait is proposed for a hexapod traversing uneven terrain with forbidden regions, which do not offer viable footholds but can be stepped over. By comparing performance of straight-line motion and crab walking over even terrain, it is shown that the proposed gait has better mobility and terrain adaptability than previously developed gaits. Based on the proposed gait, we present a method for the generation of the fault tolerant locomotion of a hexapod over uneven terrain with forbidden regions. The proposed method minimizes the number of legs on the ground during walking, and foot adjustment algorithm is used for avoiding steps on forbidden regions. The effectiveness of the proposed strategy over uneven terrain is demonstrated with a computer simulation.
Universal fault-tolerant quantum computation with only transversal gates and error correction.

PubMed

Paetznick, Adam; Reichardt, Ben W

2013-08-30

Transversal implementations of encoded unitary gates are highly desirable for fault-tolerant quantum computation. Though transversal gates alone cannot be computationally universal, they can be combined with specially distilled resource states in order to achieve universality. We show that "triorthogonal" stabilizer codes, introduced for state distillation by Bravyi and Haah [Phys. Rev. A 86, 052329 (2012)], admit transversal implementation of the controlled-controlled-Z gate. We then construct a universal set of fault-tolerant gates without state distillation by using only transversal controlled-controlled-Z, transversal Hadamard, and fault-tolerant error correction. We also adapt the distillation procedure of Bravyi and Haah to Toffoli gates, improving on existing Toffoli distillation schemes.
Design of on-board Bluetooth wireless network system based on fault-tolerant technology

NASA Astrophysics Data System (ADS)

You, Zheng; Zhang, Xiangqi; Yu, Shijie; Tian, Hexiang

2007-11-01

In this paper, the Bluetooth wireless data transmission technology is applied in on-board computer system, to realize wireless data transmission between peripherals of the micro-satellite integrating electronic system, and in view of the high demand of reliability of a micro-satellite, a design of Bluetooth wireless network based on fault-tolerant technology is introduced. The reliability of two fault-tolerant systems is estimated firstly using Markov model, then the structural design of this fault-tolerant system is introduced; several protocols are established to make the system operate correctly, some related problems are listed and analyzed, with emphasis on Fault Auto-diagnosis System, Active-standby switch design and Data-Integrity process.
Development and evaluation of a fault-tolerant multiprocessor (FTMP) computer. Volume 4: FTMP executive summary

NASA Technical Reports Server (NTRS)

Smith, T. B., III; Lala, J. H.

1984-01-01

The FTMP architecture is a high reliability computer concept modeled after a homogeneous multiprocessor architecture. Elements of the FTMP are operated in tight synchronism with one another and hardware fault-detection and fault-masking is provided which is transparent to the software. Operating system design and user software design is thus greatly simplified. Performance of the FTMP is also comparable to that of a simplex equivalent due to the efficiency of fault handling hardware. The FTMP project constructed an engineering module of the FTMP, programmed the machine and extensively tested the architecture through fault injection and other stress testing. This testing confirmed the soundness of the FTMP concepts.
Fault tolerant control of multivariable processes using auto-tuning PID controller.

PubMed

Yu, Ding-Li; Chang, T K; Yu, Ding-Wen

2005-02-01

Fault tolerant control of dynamic processes is investigated in this paper using an auto-tuning PID controller. A fault tolerant control scheme is proposed composing an auto-tuning PID controller based on an adaptive neural network model. The model is trained online using the extended Kalman filter (EKF) algorithm to learn system post-fault dynamics. Based on this model, the PID controller adjusts its parameters to compensate the effects of the faults, so that the control performance is recovered from degradation. The auto-tuning algorithm for the PID controller is derived with the Lyapunov method and therefore, the model predicted tracking error is guaranteed to converge asymptotically. The method is applied to a simulated two-input two-output continuous stirred tank reactor (CSTR) with various faults, which demonstrate the applicability of the developed scheme to industrial processes.

Fault-Tolerant Coding for State Machines

NASA Technical Reports Server (NTRS)

Naegle, Stephanie Taft; Burke, Gary; Newell, Michael

2008-01-01

Two reliable fault-tolerant coding schemes have been proposed for state machines that are used in field-programmable gate arrays and application-specific integrated circuits to implement sequential logic functions. The schemes apply to strings of bits in state registers, which are typically implemented in practice as assemblies of flip-flop circuits. If a single-event upset (SEU, a radiation-induced change in the bit in one flip-flop) occurs in a state register, the state machine that contains the register could go into an erroneous state or could hang, by which is meant that the machine could remain in undefined states indefinitely. The proposed fault-tolerant coding schemes are intended to prevent the state machine from going into an erroneous or hang state when an SEU occurs. To ensure reliability of the state machine, the coding scheme for bits in the state register must satisfy the following criteria: 1. All possible states are defined. 2. An SEU brings the state machine to a known state. 3. There is no possibility of a hang state. 4. No false state is entered. 5. An SEU exerts no effect on the state machine. Fault-tolerant coding schemes that have been commonly used include binary encoding and "one-hot" encoding. Binary encoding is the simplest state machine encoding and satisfies criteria 1 through 3 if all possible states are defined. Binary encoding is a binary count of the state machine number in sequence; the table represents an eight-state example. In one-hot encoding, N bits are used to represent N states: All except one of the bits in a string are 0, and the position of the 1 in the string represents the state. With proper circuit design, one-hot encoding can satisfy criteria 1 through 4. Unfortunately, the requirement to use N bits to represent N states makes one-hot coding inefficient.
High-Intensity Radiated Field Fault-Injection Experiment for a Fault-Tolerant Distributed Communication System

NASA Technical Reports Server (NTRS)

Yates, Amy M.; Torres-Pomales, Wilfredo; Malekpour, Mahyar R.; Gonzalez, Oscar R.; Gray, W. Steven

2010-01-01

Safety-critical distributed flight control systems require robustness in the presence of faults. In general, these systems consist of a number of input/output (I/O) and computation nodes interacting through a fault-tolerant data communication system. The communication system transfers sensor data and control commands and can handle most faults under typical operating conditions. However, the performance of the closed-loop system can be adversely affected as a result of operating in harsh environments. In particular, High-Intensity Radiated Field (HIRF) environments have the potential to cause random fault manifestations in individual avionic components and to generate simultaneous system-wide communication faults that overwhelm existing fault management mechanisms. This paper presents the design of an experiment conducted at the NASA Langley Research Center's HIRF Laboratory to statistically characterize the faults that a HIRF environment can trigger on a single node of a distributed flight control system.
Fault diagnosis and fault-tolerant finite control set-model predictive control of a multiphase voltage-source inverter supplying BLDC motor.

PubMed

Salehifar, Mehdi; Moreno-Equilaz, Manuel

2016-01-01

Due to its fault tolerance, a multiphase brushless direct current (BLDC) motor can meet high reliability demand for application in electric vehicles. The voltage-source inverter (VSI) supplying the motor is subjected to open circuit faults. Therefore, it is necessary to design a fault-tolerant (FT) control algorithm with an embedded fault diagnosis (FD) block. In this paper, finite control set-model predictive control (FCS-MPC) is developed to implement the fault-tolerant control algorithm of a five-phase BLDC motor. The developed control method is fast, simple, and flexible. A FD method based on available information from the control block is proposed; this method is simple, robust to common transients in motor and able to localize multiple open circuit faults. The proposed FD and FT control algorithm are embedded in a five-phase BLDC motor drive. In order to validate the theory presented, simulation and experimental results are conducted on a five-phase two-level VSI supplying a five-phase BLDC motor. Copyright © 2015 ISA. Published by Elsevier Ltd. All rights reserved.
Step-by-step magic state encoding for efficient fault-tolerant quantum computation.

PubMed

Goto, Hayato

2014-12-16

Quantum error correction allows one to make quantum computers fault-tolerant against unavoidable errors due to decoherence and imperfect physical gate operations. However, the fault-tolerant quantum computation requires impractically large computational resources for useful applications. This is a current major obstacle to the realization of a quantum computer. In particular, magic state distillation, which is a standard approach to universality, consumes the most resources in fault-tolerant quantum computation. For the resource problem, here we propose step-by-step magic state encoding for concatenated quantum codes, where magic states are encoded step by step from the physical level to the logical one. To manage errors during the encoding, we carefully use error detection. Since the sizes of intermediate codes are small, it is expected that the resource overheads will become lower than previous approaches based on the distillation at the logical level. Our simulation results suggest that the resource requirements for a logical magic state will become comparable to those for a single logical controlled-NOT gate. Thus, the present method opens a new possibility for efficient fault-tolerant quantum computation.
Provable Transient Recovery for Frame-Based, Fault-Tolerant Computing Systems

NASA Technical Reports Server (NTRS)

DiVito, Ben L.; Butler, Ricky W.

1992-01-01

We present a formal verification of the transient fault recovery aspects of the Reliable Computing Platform (RCP), a fault-tolerant computing system architecture for digital flight control applications. The RCP uses NMR-style redundancy to mask faults and internal majority voting to purge the effects of transient faults. The system design has been formally specified and verified using the EHDM verification system. Our formalization accommodates a wide variety of voting schemes for purging the effects of transients.
Fault tolerant system based on IDDQ testing

NASA Astrophysics Data System (ADS)

Guibane, Badi; Hamdi, Belgacem; Mtibaa, Abdellatif; Bensalem, Brahim

2018-06-01

Offline test is essential to ensure good manufacturing quality. However, for permanent or transient faults that occur during the use of the integrated circuit in an application, an online integrated test is needed as well. This procedure should ensure the detection and possibly the correction or the masking of these faults. This requirement of self-correction is sometimes necessary, especially in critical applications that require high security such as automotive, space or biomedical applications. We propose a fault-tolerant design for analogue and mixed-signal design complementary metal oxide (CMOS) circuits based on the quiescent current supply (IDDQ) testing. A defect can cause an increase in current consumption. IDDQ testing technique is based on the measurement of power supply current to distinguish between functional and failed circuits. The technique has been an effective testing method for detecting physical defects such as gate-oxide shorts, floating gates (open) and bridging defects in CMOS integrated circuits. An architecture called BICS (Built In Current Sensor) is used for monitoring the supply current (IDDQ) of the connected integrated circuit. If the measured current is not within the normal range, a defect is signalled and the system switches connection from the defective to a functional integrated circuit. The fault-tolerant technique is composed essentially by a double mirror built-in current sensor, allowing the detection of abnormal current consumption and blocks allowing the connection to redundant circuits, if a defect occurs. Spices simulations are performed to valid the proposed design.
Fault-tolerant composite Householder reflection

NASA Astrophysics Data System (ADS)

Torosov, Boyan T.; Kyoseva, Elica; Vitanov, Nikolay V.

2015-07-01

We propose a fault-tolerant implementation of the quantum Householder reflection, which is a key operation in various quantum algorithms, quantum-state engineering, generation of arbitrary unitaries, and entanglement characterization. We construct this operation using the modular approach of composite pulses and a relation between the Householder reflection and the quantum phase gate. The proposed implementation is highly insensitive to variations in the experimental parameters, which makes it suitable for high-fidelity quantum information processing.
Evaluation of fault-tolerant parallel-processor architectures over long space missions

NASA Technical Reports Server (NTRS)

Johnson, Sally C.

1989-01-01

The impact of a five year space mission environment on fault-tolerant parallel processor architectures is examined. The target application is a Strategic Defense Initiative (SDI) satellite requiring 256 parallel processors to provide the computation throughput. The reliability requirements are that the system still be operational after five years with .99 probability and that the probability of system failure during one-half hour of full operation be less than 10(-7). The fault tolerance features an architecture must possess to meet these reliability requirements are presented, many potential architectures are briefly evaluated, and one candidate architecture, the Charles Stark Draper Laboratory's Fault-Tolerant Parallel Processor (FTPP) is evaluated in detail. A methodology for designing a preliminary system configuration to meet the reliability and performance requirements of the mission is then presented and demonstrated by designing an FTPP configuration.
ROSE::FTTransform - A Source-to-Source Translation Framework for Exascale Fault-Tolerance Research

DOE Office of Scientific and Technical Information (OSTI.GOV)

Lidman, J; Quinlan, D; Liao, C

2012-03-26

Exascale computing systems will require sufficient resilience to tolerate numerous types of hardware faults while still assuring correct program execution. Such extreme-scale machines are expected to be dominated by processors driven at lower voltages (near the minimum 0.5 volts for current transistors). At these voltage levels, the rate of transient errors increases dramatically due to the sensitivity to transient and geographically localized voltage drops on parts of the processor chip. To achieve power efficiency, these processors are likely to be streamlined and minimal, and thus they cannot be expected to handle transient errors entirely in hardware. Here we present anmore » open, compiler-based framework to automate the armoring of High Performance Computing (HPC) software to protect it from these types of transient processor errors. We develop an open infrastructure to support research work in this area, and we define tools that, in the future, may provide more complete automated and/or semi-automated solutions to support software resiliency on future exascale architectures. Results demonstrate that our approach is feasible, pragmatic in how it can be separated from the software development process, and reasonably efficient (0% to 30% overhead for the Jacobi iteration on common hardware; and 20%, 40%, 26%, and 2% overhead for a randomly selected subset of benchmarks from the Livermore Loops [1]).« less
Algorithm-Based Fault Tolerance for Numerical Subroutines

NASA Technical Reports Server (NTRS)

Tumon, Michael; Granat, Robert; Lou, John

2007-01-01

A software library implements a new methodology of detecting faults in numerical subroutines, thus enabling application programs that contain the subroutines to recover transparently from single-event upsets. The software library in question is fault-detecting middleware that is wrapped around the numericalsubroutines. Conventional serial versions (based on LAPACK and FFTW) and a parallel version (based on ScaLAPACK) exist. The source code of the application program that contains the numerical subroutines is not modified, and the middleware is transparent to the user. The methodology used is a type of algorithm- based fault tolerance (ABFT). In ABFT, a checksum is computed before a computation and compared with the checksum of the computational result; an error is declared if the difference between the checksums exceeds some threshold. Novel normalization methods are used in the checksum comparison to ensure correct fault detections independent of algorithm inputs. In tests of this software reported in the peer-reviewed literature, this library was shown to enable detection of 99.9 percent of significant faults while generating no false alarms.
Full-Authority Fault-Tolerant Electronic Engine Control System for Variable Cycle Engines.

DTIC Science & Technology

1982-04-01

single internally self-checked VLSI micro - processor . The selected configuration is an externally checked pair of com- mercially available...Electronic Engine Control FPMH Failures per Million Hours FTMP Fault Tolerant Multi- Processor FTSC Fault Tolerant Spaceborn Computer GRAMP Generalized...Removal * MTBR Mean Time Between Repair MTTF Mean Time to Failure xiii List of Abbreviations (continued) - NH High Pressure Rotor Speed O&S Operating
Fault Tolerance in ZigBee Wireless Sensor Networks

NASA Technical Reports Server (NTRS)

Alena, Richard; Gilstrap, Ray; Baldwin, Jarren; Stone, Thom; Wilson, Pete

2011-01-01

Wireless sensor networks (WSN) based on the IEEE 802.15.4 Personal Area Network standard are finding increasing use in the home automation and emerging smart energy markets. The network and application layers, based on the ZigBee 2007 PRO Standard, provide a convenient framework for component-based software that supports customer solutions from multiple vendors. This technology is supported by System-on-a-Chip solutions, resulting in extremely small and low-power nodes. The Wireless Connections in Space Project addresses the aerospace flight domain for both flight-critical and non-critical avionics. WSNs provide the inherent fault tolerance required for aerospace applications utilizing such technology. The team from Ames Research Center has developed techniques for assessing the fault tolerance of ZigBee WSNs challenged by radio frequency (RF) interference or WSN node failure.
Advanced Information Processing System (AIPS)-based fault tolerant avionics architecture for launch vehicles

NASA Technical Reports Server (NTRS)

Lala, Jaynarayan H.; Harper, Richard E.; Jaskowiak, Kenneth R.; Rosch, Gene; Alger, Linda S.; Schor, Andrei L.

1990-01-01

An avionics architecture for the advanced launch system (ALS) that uses validated hardware and software building blocks developed under the advanced information processing system program is presented. The AIPS for ALS architecture defined is preliminary, and reliability requirements can be met by the AIPS hardware and software building blocks that are built using the state-of-the-art technology available in the 1992-93 time frame. The level of detail in the architecture definition reflects the level of detail available in the ALS requirements. As the avionics requirements are refined, the architecture can also be refined and defined in greater detail with the help of analysis and simulation tools. A useful methodology is demonstrated for investigating the impact of the avionics suite to the recurring cost of the ALS. It is shown that allowing the vehicle to launch with selected detected failures can potentially reduce the recurring launch costs. A comparative analysis shows that validated fault-tolerant avionics built out of Class B parts can result in lower life-cycle-cost in comparison to simplex avionics built out of Class S parts or other redundant architectures.
Fault-tolerance in Two-dimensional Topological Systems

NASA Astrophysics Data System (ADS)

Anderson, Jonas T.

This thesis is a collection of ideas with the general goal of building, at least in the abstract, a local fault-tolerant quantum computer. The connection between quantum information and topology has proven to be an active area of research in several fields. The introduction of the toric code by Alexei Kitaev demonstrated the usefulness of topology for quantum memory and quantum computation. Many quantum codes used for quantum memory are modeled by spin systems on a lattice, with operators that extract syndrome information placed on vertices or faces of the lattice. It is natural to wonder whether the useful codes in such systems can be classified. This thesis presents work that leverages ideas from topology and graph theory to explore the space of such codes. Homological stabilizer codes are introduced and it is shown that, under a set of reasonable assumptions, any qubit homological stabilizer code is equivalent to either a toric code or a color code. Additionally, the toric code and the color code correspond to distinct classes of graphs. Many systems have been proposed as candidate quantum computers. It is very desirable to design quantum computing architectures with two-dimensional layouts and low complexity in parity-checking circuitry. Kitaev's surface codes provided the first example of codes satisfying this property. They provided a new route to fault tolerance with more modest overheads and thresholds approaching 1%. The recently discovered color codes share many properties with the surface codes, such as the ability to perform syndrome extraction locally in two dimensions. Some families of color codes admit a transversal implementation of the entire Clifford group. This work investigates color codes on the 4.8.8 lattice known as triangular codes. I develop a fault-tolerant error-correction strategy for these codes in which repeated syndrome measurements on this lattice generate a three-dimensional space-time combinatorial structure. I then develop an
Step-by-step magic state encoding for efficient fault-tolerant quantum computation

PubMed Central

Goto, Hayato

2014-01-01

Quantum error correction allows one to make quantum computers fault-tolerant against unavoidable errors due to decoherence and imperfect physical gate operations. However, the fault-tolerant quantum computation requires impractically large computational resources for useful applications. This is a current major obstacle to the realization of a quantum computer. In particular, magic state distillation, which is a standard approach to universality, consumes the most resources in fault-tolerant quantum computation. For the resource problem, here we propose step-by-step magic state encoding for concatenated quantum codes, where magic states are encoded step by step from the physical level to the logical one. To manage errors during the encoding, we carefully use error detection. Since the sizes of intermediate codes are small, it is expected that the resource overheads will become lower than previous approaches based on the distillation at the logical level. Our simulation results suggest that the resource requirements for a logical magic state will become comparable to those for a single logical controlled-NOT gate. Thus, the present method opens a new possibility for efficient fault-tolerant quantum computation. PMID:25511387
Fault-Tolerant Control For A Robotic Inspection System

NASA Technical Reports Server (NTRS)

Tso, Kam Sing

1995-01-01

Report describes first phase of continuing program of research on fault-tolerant control subsystem of telerobotic visual-inspection system. Goal of program to develop robotic system for remotely controlled visual inspection of structures in outer space.
Cascading Policies Provide Fault Tolerance for Pervasive Clinical Communications.

PubMed

Williams, Rose; Jalan, Srikant; Stern, Edie; Lussier, Yves A

2005-03-21

We implemented an end-to-end notification system that pushed urgent clinical laboratory results to Blackberry 7510 devices over the Nextel cellular network. We designed our system to use user roles and notification policies to abstract and execute clinical notification procedures. We anticipated some problems with dropped and non-delivered messages when the device was out-of-network, however, we did not expect the same problems in other situations like device reconnection to the network. We addressed these problems by creating cascading "fault tolerance" policies to drive notification escalation when messages timed-out or delivery failed. This paper describes our experience in providing an adaptable, fault tolerant pervasive notification system for delivering secure, critical, time-sensitive patient laboratory results.
Advanced I&C for Fault-Tolerant Supervisory Control of Small Modular Reactors

DOE Office of Scientific and Technical Information (OSTI.GOV)

Cole, Daniel G.

In this research, we have developed a supervisory control approach to enable automated control of SMRs. By design the supervisory control system has an hierarchical, interconnected, adaptive control architecture. A considerable advantage to this architecture is that it allows subsystems to communicate at different/finer granularity, facilitates monitoring of process at the modular and plant levels, and enables supervisory control. We have investigated the deployment of automation, monitoring, and data collection technologies to enable operation of multiple SMRs. Each unit's controller collects and transfers information from local loops and optimize that unit’s parameters. Information is passed from the each SMR unitmore » controller to the supervisory controller, which supervises the actions of SMR units and manage plant processes. The information processed at the supervisory level will provide operators the necessary information needed for reactor, unit, and plant operation. In conjunction with the supervisory effort, we have investigated techniques for fault-tolerant networks, over which information is transmitted between local loops and the supervisory controller to maintain a safe level of operational normalcy in the presence of anomalies. The fault-tolerance of the supervisory control architecture, the network that supports it, and the impact of fault-tolerance on multi-unit SMR plant control has been a second focus of this research. To this end, we have investigated the deployment of advanced automation, monitoring, and data collection and communications technologies to enable operation of multiple SMRs. We have created a fault-tolerant multi-unit SMR supervisory controller that collects and transfers information from local loops, supervise their actions, and adaptively optimize the controller parameters. The goal of this research has been to develop the methodologies and procedures for fault-tolerant supervisory control of small modular reactors. To
Optimal fault-tolerant control strategy of a solid oxide fuel cell system

NASA Astrophysics Data System (ADS)

Wu, Xiaojuan; Gao, Danhui

2017-10-01

For solid oxide fuel cell (SOFC) development, load tracking, heat management, air excess ratio constraint, high efficiency, low cost and fault diagnosis are six key issues. However, no literature studies the control techniques combining optimization and fault diagnosis for the SOFC system. An optimal fault-tolerant control strategy is presented in this paper, which involves four parts: a fault diagnosis module, a switching module, two backup optimizers and a controller loop. The fault diagnosis part is presented to identify the SOFC current fault type, and the switching module is used to select the appropriate backup optimizer based on the diagnosis result. NSGA-II and TOPSIS are employed to design the two backup optimizers under normal and air compressor fault states. PID algorithm is proposed to design the control loop, which includes a power tracking controller, an anode inlet temperature controller, a cathode inlet temperature controller and an air excess ratio controller. The simulation results show the proposed optimal fault-tolerant control method can track the power, temperature and air excess ratio at the desired values, simultaneously achieving the maximum efficiency and the minimum unit cost in the case of SOFC normal and even in the air compressor fault.
General linear codes for fault-tolerant matrix operations on processor arrays

NASA Technical Reports Server (NTRS)

Nair, V. S. S.; Abraham, J. A.

1988-01-01

Various checksum codes have been suggested for fault-tolerant matrix computations on processor arrays. Use of these codes is limited due to potential roundoff and overflow errors. Numerical errors may also be misconstrued as errors due to physical faults in the system. In this a set of linear codes is identified which can be used for fault-tolerant matrix operations such as matrix addition, multiplication, transposition, and LU-decomposition, with minimum numerical error. Encoding schemes are given for some of the example codes which fall under the general set of codes. With the help of experiments, a rule of thumb for the selection of a particular code for a given application is derived.

Fault diagnostic instrumentation design for environmental control and life support systems

NASA Technical Reports Server (NTRS)

Yang, P. Y.; You, K. C.; Wynveen, R. A.; Powell, J. D., Jr.

1979-01-01

As a development phase moves toward flight hardware, the system availability becomes an important design aspect which requires high reliability and maintainability. As part of continous development efforts, a program to evaluate, design, and demonstrate advanced instrumentation fault diagnostics was successfully completed. Fault tolerance designs for reliability and other instrumenation capabilities to increase maintainability were evaluated and studied.
Intrinsic Hardware Evolution for the Design and Reconfiguration of Analog Speed Controllers for a DC Motor

NASA Technical Reports Server (NTRS)

Gwaltney, David A.; Ferguson, Michael I.

2003-01-01

Evolvable hardware provides the capability to evolve analog circuits to produce amplifier and filter functions. Conventional analog controller designs employ these same functions. Analog controllers for the control of the shaft speed of a DC motor are evolved on an evolvable hardware platform utilizing a second generation Field Programmable Transistor Array (FPTA2). The performance of an evolved controller is compared to that of a conventional proportional-integral (PI) controller. It is shown that hardware evolution is able to create a compact design that provides good performance, while using considerably less functional electronic components than the conventional design. Additionally, the use of hardware evolution to provide fault tolerance by reconfiguring the design is explored. Experimental results are presented showing that significant recovery of capability can be made in the face of damaging induced faults.
Robust Gain-Scheduled Fault Tolerant Control for a Transport Aircraft

NASA Technical Reports Server (NTRS)

Shin, Jong-Yeob; Gregory, Irene

2007-01-01

This paper presents an application of robust gain-scheduled control concepts using a linear parameter-varying (LPV) control synthesis method to design fault tolerant controllers for a civil transport aircraft. To apply the robust LPV control synthesis method, the nonlinear dynamics must be represented by an LPV model, which is developed using the function substitution method over the entire flight envelope. The developed LPV model associated with the aerodynamic coefficient uncertainties represents nonlinear dynamics including those outside the equilibrium manifold. Passive and active fault tolerant controllers (FTC) are designed for the longitudinal dynamics of the Boeing 747-100/200 aircraft in the presence of elevator failure. Both FTC laws are evaluated in the full nonlinear aircraft simulation in the presence of the elevator fault and the results are compared to show pros and cons of each control law.
Modeling the Fault Tolerant Capability of a Flight Control System: An Exercise in SCR Specification

NASA Technical Reports Server (NTRS)

Alexander, Chris; Cortellessa, Vittorio; DelGobbo, Diego; Mili, Ali; Napolitano, Marcello

2000-01-01

In life-critical and mission-critical applications, it is important to make provisions for a wide range of contingencies, by providing means for fault tolerance. In this paper, we discuss the specification of a flight control system that is fault tolerant with respect to sensor faults. Redundancy is provided by analytical relations that hold between sensor readings; depending on the conditions, this redundancy can be used to detect, identify and accommodate sensor faults.
Fault tolerant attitude sensing and force feedback control for unmanned aerial vehicles

NASA Astrophysics Data System (ADS)

Jagadish, Chirag

Two aspects of an unmanned aerial vehicle are studied in this work. One is fault tolerant attitude determination and the other is to provide force feedback to the joy-stick of the UAV so as to prevent faulty inputs from the pilot. Determination of attitude plays an important role in control of aerial vehicles. One way of defining the attitude is through Euler angles. These angles can be determined based on the measurements of the projections of the gravity and earth magnetic fields on the three body axes of the vehicle. Attitude determination in unmanned aerial vehicles poses additional challenges due to limitations of space, payload, power and cost. Therefore it provides for almost no room for any bulky sensors or extra sensor hardware for backup and as such leaves no room for sensor fault issues either. In the face of these limitations, this study proposes a fault tolerant computing of Euler angles by utilizing multiple different computation methods, with each method utilizing a different subset of the available sensor measurement data. Twenty-five such methods have been presented in this document. The capability of computing the Euler angles in multiple ways provides a diversified redundancy required for fault tolerance. The proposed approach can identify certain sets of sensor failures and even separate the reference fields from the disturbances. A bank-to-turn maneuver of the NASA GTM UAV is used to demonstrate the fault tolerance provided by the proposed method as well as to demonstrate the method of determining the correct Euler angles despite interferences by inertial acceleration disturbances. Attitude computation is essential for stability. But as of today most UAVs are commanded remotely by human pilots. While basic stability control is entrusted to machine or the on-board automatic controller, overall guidance is usually with humans. It is therefore the pilot who sets the command/references through a joy-stick. While this is a good compromise between
Proactive Fault Tolerance Using Preemptive Migration

DOE Office of Scientific and Technical Information (OSTI.GOV)

Engelmann, Christian; Vallee, Geoffroy R; Naughton, III, Thomas J

2009-01-01

Proactive fault tolerance (FT) in high-performance computing is a concept that prevents compute node failures from impacting running parallel applications by preemptively migrating application parts away from nodes that are about to fail. This paper provides a foundation for proactive FT by defining its architecture and classifying implementation options. This paper further relates prior work to the presented architecture and classification, and discusses the challenges ahead for needed supporting technologies.
Fault-tolerant computer study. [logic designs for building block circuits

NASA Technical Reports Server (NTRS)

Rennels, D. A.; Avizienis, A. A.; Ercegovac, M. D.

1981-01-01

A set of building block circuits is described which can be used with commercially available microprocessors and memories to implement fault tolerant distributed computer systems. Each building block circuit is intended for VLSI implementation as a single chip. Several building blocks and associated processor and memory chips form a self checking computer module with self contained input output and interfaces to redundant communications buses. Fault tolerance is achieved by connecting self checking computer modules into a redundant network in which backup buses and computer modules are provided to circumvent failures. The requirements and design methodology which led to the definition of the building block circuits are discussed.
Self-stabilizing byzantine-fault-tolerant clock synchronization system and method

NASA Technical Reports Server (NTRS)

Malekpour, Mahyar R. (Inventor)

2012-01-01

Systems and methods for rapid Byzantine-fault-tolerant self-stabilizing clock synchronization are provided. The systems and methods are based on a protocol comprising a state machine and a set of monitors that execute once every local oscillator tick. The protocol is independent of specific application specific requirements. The faults are assumed to be arbitrary and/or malicious. All timing measures of variables are based on the node's local clock and thus no central clock or externally generated pulse is used. Instances of the protocol are shown to tolerate bursts of transient failures and deterministically converge with a linear convergence time with respect to the synchronization period as predicted.
Fault-tolerant linear optical quantum computing with small-amplitude coherent States.

PubMed

Lund, A P; Ralph, T C; Haselgrove, H L

2008-01-25

Quantum computing using two coherent states as a qubit basis is a proposed alternative architecture with lower overheads but has been questioned as a practical way of performing quantum computing due to the fragility of diagonal states with large coherent amplitudes. We show that using error correction only small amplitudes (alpha>1.2) are required for fault-tolerant quantum computing. We study fault tolerance under the effects of small amplitudes and loss using a Monte Carlo simulation. The first encoding level resources are orders of magnitude lower than the best single photon scheme.
Adaptive sensor-fault tolerant control for a class of multivariable uncertain nonlinear systems.

PubMed

Khebbache, Hicham; Tadjine, Mohamed; Labiod, Salim; Boulkroune, Abdesselem

2015-03-01

This paper deals with the active fault tolerant control (AFTC) problem for a class of multiple-input multiple-output (MIMO) uncertain nonlinear systems subject to sensor faults and external disturbances. The proposed AFTC method can tolerate three additive (bias, drift and loss of accuracy) and one multiplicative (loss of effectiveness) sensor faults. By employing backstepping technique, a novel adaptive backstepping-based AFTC scheme is developed using the fact that sensor faults and system uncertainties (including external disturbances and unexpected nonlinear functions caused by sensor faults) can be on-line estimated and compensated via robust adaptive schemes. The stability analysis of the closed-loop system is rigorously proven using a Lyapunov approach. The effectiveness of the proposed controller is illustrated by two simulation examples. Copyright © 2014 ISA. Published by Elsevier Ltd. All rights reserved.
Towards scalable Byzantine fault-tolerant replication

NASA Astrophysics Data System (ADS)

Zbierski, Maciej

2017-08-01

Byzantine fault-tolerant (BFT) replication is a powerful technique, enabling distributed systems to remain available and correct even in the presence of arbitrary faults. Unfortunately, existing BFT replication protocols are mostly load-unscalable, i.e. they fail to respond with adequate performance increase whenever new computational resources are introduced into the system. This article proposes a universal architecture facilitating the creation of load-scalable distributed services based on BFT replication. The suggested approach exploits parallel request processing to fully utilize the available resources, and uses a load balancer module to dynamically adapt to the properties of the observed client workload. The article additionally provides a discussion on selected deployment scenarios, and explains how the proposed architecture could be used to increase the dependability of contemporary large-scale distributed systems.
Fault Tolerant Software Technology for Distributed Computer Systems

DTIC Science & Technology

1989-03-01

RAY.) &-TR-88-296 I Fin;.’ Technical Report ,r 19,39 i A28 3329 F’ULT TOLERANT SOFTWARE TECHNOLOGY FOR DISTRIBUTED COMPUTER SYSTEMS Georgia Institute...GrfisABN 34-70IiWftlI NO0. IN?3. NO IACCESSION NO. 158 21 7 11. TITLE (Incld security Cassification) FAULT TOLERANT SOFTWARE FOR DISTRIBUTED COMPUTER ...Technology for Distributed Computing Systems," a two year effort performed at Georgia Institute of Technology as part of the Clouds Project. The Clouds
Locating hardware faults in a parallel computer

DOEpatents

Archer, Charles J.; Megerian, Mark G.; Ratterman, Joseph D.; Smith, Brian E.

2010-04-13

Locating hardware faults in a parallel computer, including defining within a tree network of the parallel computer two or more sets of non-overlapping test levels of compute nodes of the network that together include all the data communications links of the network, each non-overlapping test level comprising two or more adjacent tiers of the tree; defining test cells within each non-overlapping test level, each test cell comprising a subtree of the tree including a subtree root compute node and all descendant compute nodes of the subtree root compute node within a non-overlapping test level; performing, separately on each set of non-overlapping test levels, an uplink test on all test cells in a set of non-overlapping test levels; and performing, separately from the uplink tests and separately on each set of non-overlapping test levels, a downlink test on all test cells in a set of non-overlapping test levels.
High-Threshold Fault-Tolerant Quantum Computation with Analog Quantum Error Correction

NASA Astrophysics Data System (ADS)

Fukui, Kosuke; Tomita, Akihisa; Okamoto, Atsushi; Fujii, Keisuke

2018-04-01

To implement fault-tolerant quantum computation with continuous variables, the Gottesman-Kitaev-Preskill (GKP) qubit has been recognized as an important technological element. However, it is still challenging to experimentally generate the GKP qubit with the required squeezing level, 14.8 dB, of the existing fault-tolerant quantum computation. To reduce this requirement, we propose a high-threshold fault-tolerant quantum computation with GKP qubits using topologically protected measurement-based quantum computation with the surface code. By harnessing analog information contained in the GKP qubits, we apply analog quantum error correction to the surface code. Furthermore, we develop a method to prevent the squeezing level from decreasing during the construction of the large-scale cluster states for the topologically protected, measurement-based, quantum computation. We numerically show that the required squeezing level can be relaxed to less than 10 dB, which is within the reach of the current experimental technology. Hence, this work can considerably alleviate this experimental requirement and take a step closer to the realization of large-scale quantum computation.
Exploiting data representation for fault tolerance

DOE PAGES

Hoemmen, Mark Frederick; Elliott, J.; Sandia National Lab.; ...

2015-01-06

Incorrect computer hardware behavior may corrupt intermediate computations in numerical algorithms, possibly resulting in incorrect answers. Prior work models misbehaving hardware by randomly flipping bits in memory. We start by accepting this premise, and present an analytic model for the error introduced by a bit flip in an IEEE 754 floating-point number. We then relate this finding to the linear algebra concepts of normalization and matrix equilibration. In particular, we present a case study illustrating that normalizing both vector inputs of a dot product minimizes the probability of a single bit flip causing a large error in the dot product'smore » result. Moreover, the absolute error is either less than one or very large, which allows detection of large errors. Then, we apply this to the GMRES iterative solver. We count all possible errors that can be introduced through faults in arithmetic in the computationally intensive orthogonalization phase of GMRES, and show that when the matrix is equilibrated, the absolute error is bounded above by one.« less
H∞ robust fault-tolerant controller design for an autonomous underwater vehicle's navigation control system

NASA Astrophysics Data System (ADS)

Cheng, Xiang-Qin; Qu, Jing-Yuan; Yan, Zhe-Ping; Bian, Xin-Qian

2010-03-01

In order to improve the security and reliability for autonomous underwater vehicle (AUV) navigation, an H∞ robust fault-tolerant controller was designed after analyzing variations in state-feedback gain. Operating conditions and the design method were then analyzed so that the control problem could be expressed as a mathematical optimization problem. This permitted the use of linear matrix inequalities (LMI) to solve for the H∞ controller for the system. When considering different actuator failures, these conditions were then also mathematically expressed, allowing the H∞ robust controller to solve for these events and thus be fault-tolerant. Finally, simulation results showed that the H∞ robust fault-tolerant controller could provide precise AUV navigation control with strong robustness.
Self-adaptive Fault-Tolerance of HLA-Based Simulations in the Grid Environment

NASA Astrophysics Data System (ADS)

Huang, Jijie; Chai, Xudong; Zhang, Lin; Li, Bo Hu

The objects of a HLA-based simulation can access model services to update their attributes. However, the grid server may be overloaded and refuse the model service to handle objects accesses. Because these objects have been accessed this model service during last simulation loop and their medium state are stored in this server, this may terminate the simulation. A fault-tolerance mechanism must be introduced into simulations. But the traditional fault-tolerance methods cannot meet the above needs because the transmission latency between a federate and the RTI in grid environment varies from several hundred milliseconds to several seconds. By adding model service URLs to the OMT and expanding the HLA services and model services with some interfaces, this paper proposes a self-adaptive fault-tolerance mechanism of simulations according to the characteristics of federates accessing model services. Benchmark experiments indicate that the expanded HLA/RTI can make simulations self-adaptively run in the grid environment.
Fault-Tolerant Algorithms for Connectivity Restoration in Wireless Sensor Networks.

PubMed

Zeng, Yali; Xu, Li; Chen, Zhide

2015-12-22

As wireless sensor network (WSN) is often deployed in a hostile environment, nodes in the networks are prone to large-scale failures, resulting in the network not working normally. In this case, an effective restoration scheme is needed to restore the faulty network timely. Most of existing restoration schemes consider more about the number of deployed nodes or fault tolerance alone, but fail to take into account the fact that network coverage and topology quality are also important to a network. To address this issue, we present two algorithms named Full 2-Connectivity Restoration Algorithm (F2CRA) and Partial 3-Connectivity Restoration Algorithm (P3CRA), which restore a faulty WSN in different aspects. F2CRA constructs the fan-shaped topology structure to reduce the number of deployed nodes, while P3CRA constructs the dual-ring topology structure to improve the fault tolerance of the network. F2CRA is suitable when the restoration cost is given the priority, and P3CRA is suitable when the network quality is considered first. Compared with other algorithms, these two algorithms ensure that the network has stronger fault-tolerant function, larger coverage area and better balanced load after the restoration.
Highly fault-tolerant parallel computation

DOE Office of Scientific and Technical Information (OSTI.GOV)

Spielman, D.A.

We re-introduce the coded model of fault-tolerant computation in which the input and output of a computational device are treated as words in an error-correcting code. A computational device correctly computes a function in the coded model if its input and output, once decoded, are a valid input and output of the function. In the coded model, it is reasonable to hope to simulate all computational devices by devices whose size is greater by a constant factor but which are exponentially reliable even if each of their components can fail with some constant probability. We consider fine-grained parallel computations inmore » which each processor has a constant probability of producing the wrong output at each time step. We show that any parallel computation that runs for time t on w processors can be performed reliably on a faulty machine in the coded model using w log{sup O(l)} w processors and time t log{sup O(l)} w. The failure probability of the computation will be at most t {center_dot} exp(-w{sup 1/4}). The codes used to communicate with our fault-tolerant machines are generalized Reed-Solomon codes and can thus be encoded and decoded in O(n log{sup O(1)} n) sequential time and are independent of the machine they are used to communicate with. We also show how coded computation can be used to self-correct many linear functions in parallel with arbitrarily small overhead.« less
Mechanically verified hardware implementing an 8-bit parallel IO Byzantine agreement processor

NASA Technical Reports Server (NTRS)

Moore, J. Strother

1992-01-01

Consider a network of four processors that use the Oral Messages (Byzantine Generals) Algorithm of Pease, Shostak, and Lamport to achieve agreement in the presence of faults. Bevier and Young have published a functional description of a single processor that, when interconnected appropriately with three identical others, implements this network under the assumption that the four processors step in synchrony. By formalizing the original Pease, et al work, Bevier and Young mechanically proved that such a network achieves fault tolerance. We develop, formalize, and discuss a hardware design that has been mechanically proven to implement their processor. In particular, we formally define mapping functions from the abstract state space of the Bevier-Young processor to a concrete state space of a hardware module and state a theorem that expresses the claim that the hardware correctly implements the processor. We briefly discuss the Brock-Hunt Formal Hardware Description Language which permits designs both to be proved correct with the Boyer-Moore theorem prover and to be expressed in a commercially supported hardware description language for additional electrical analysis and layout. We briefly describe our implementation.

Modeling and measurement of fault-tolerant multiprocessors

NASA Technical Reports Server (NTRS)

Shin, K. G.; Woodbury, M. H.; Lee, Y. H.

1985-01-01

The workload effects on computer performance are addressed first for a highly reliable unibus multiprocessor used in real-time control. As an approach to studing these effects, a modified Stochastic Petri Net (SPN) is used to describe the synchronous operation of the multiprocessor system. From this model the vital components affecting performance can be determined. However, because of the complexity in solving the modified SPN, a simpler model, i.e., a closed priority queuing network, is constructed that represents the same critical aspects. The use of this model for a specific application requires the partitioning of the workload into job classes. It is shown that the steady state solution of the queuing model directly produces useful results. The use of this model in evaluating an existing system, the Fault Tolerant Multiprocessor (FTMP) at the NASA AIRLAB, is outlined with some experimental results. Also addressed is the technique of measuring fault latency, an important microscopic system parameter. Most related works have assumed no or a negligible fault latency and then performed approximate analyses. To eliminate this deficiency, a new methodology for indirectly measuring fault latency is presented.
Fault tolerant features and experiments of ANTS distributed real-time system

NASA Astrophysics Data System (ADS)

Dominic-Savio, Patrick; Lo, Jien-Chung; Tufts, Donald W.

1995-01-01

The ANTS project at the University of Rhode Island introduces the concept of Active Nodal Task Seeking (ANTS) as a way to efficiently design and implement dependable, high-performance, distributed computing. This paper presents the fault tolerant design features that have been incorporated in the ANTS experimental system implementation. The results of performance evaluations and fault injection experiments are reported. The fault-tolerant version of ANTS categorizes all computing nodes into three groups. They are: the up-and-running green group, the self-diagnosing yellow group and the failed red group. Each available computing node will be placed in the yellow group periodically for a routine diagnosis. In addition, for long-life missions, ANTS uses a monitoring scheme to identify faulty computing nodes. In this monitoring scheme, the communication pattern of each computing node is monitored by two other nodes.
Fault-Tolerant, Real-Time, Multi-Core Computer System

NASA Technical Reports Server (NTRS)

Gostelow, Kim P.

2012-01-01

A document discusses a fault-tolerant, self-aware, low-power, multi-core computer for space missions with thousands of simple cores, achieving speed through concurrency. The proposed machine decides how to achieve concurrency in real time, rather than depending on programmers. The driving features of the system are simple hardware that is modular in the extreme, with no shared memory, and software with significant runtime reorganizing capability. The document describes a mechanism for moving ongoing computations and data that is based on a functional model of execution. Because there is no shared memory, the processor connects to its neighbors through a high-speed data link. Messages are sent to a neighbor switch, which in turn forwards that message on to its neighbor until reaching the intended destination. Except for the neighbor connections, processors are isolated and independent of each other. The processors on the periphery also connect chip-to-chip, thus building up a large processor net. There is no particular topology to the larger net, as a function at each processor allows it to forward a message in the correct direction. Some chip-to-chip connections are not necessarily nearest neighbors, providing short cuts for some of the longer physical distances. The peripheral processors also provide the connections to sensors, actuators, radios, science instruments, and other devices with which the computer system interacts.
A Self-Stabilizing Byzantine-Fault-Tolerant Clock Synchronization Protocol

NASA Technical Reports Server (NTRS)

Malekpour, Mahyar R.

2009-01-01

This report presents a rapid Byzantine-fault-tolerant self-stabilizing clock synchronization protocol that is independent of application-specific requirements. It is focused on clock synchronization of a system in the presence of Byzantine faults after the cause of any transient faults has dissipated. A model of this protocol is mechanically verified using the Symbolic Model Verifier (SMV) [SMV] where the entire state space is examined and proven to self-stabilize in the presence of one arbitrary faulty node. Instances of the protocol are proven to tolerate bursts of transient failures and deterministically converge with a linear convergence time with respect to the synchronization period. This protocol does not rely on assumptions about the initial state of the system other than the presence of sufficient number of good nodes. All timing measures of variables are based on the node s local clock, and no central clock or externally generated pulse is used. The Byzantine faulty behavior modeled here is a node with arbitrarily malicious behavior that is allowed to influence other nodes at every clock tick. The only constraint is that the interactions are restricted to defined interfaces.
SFTP: A Secure and Fault-Tolerant Paradigm against Blackhole Attack in MANET

NASA Astrophysics Data System (ADS)

KumarRout, Jitendra; Kumar Bhoi, Sourav; Kumar Panda, Sanjaya

2013-02-01

Security issues in MANET are a challenging task nowadays. MANETs are vulnerable to passive attacks and active attacks because of a limited number of resources and lack of centralized authority. Blackhole attack is an attack in network layer which degrade the network performance by dropping the packets. In this paper, we have proposed a Secure Fault-Tolerant Paradigm (SFTP) which checks the Blackhole attack in the network. The three phases used in SFTP algorithm are designing of coverage area to find the area of coverage, Network Connection algorithm to design a fault-tolerant model and Route Discovery algorithm to discover the route and data delivery from source to destination. SFTP gives better network performance by making the network fault free.
Fenix, A Fault Tolerant Programming Framework for MPI Applications

DOE Office of Scientific and Technical Information (OSTI.GOV)

Gamel, Marc; Teranihi, Keita; Valenzuela, Eric

2016-10-05

Fenix provides APIs to allow the users to add fault tolerance capability to MPI-based parallel programs in a transparent manner. Fenix-enabled programs can run through process failures during program execution using a pool of spare processes accommodated by Fenix.
A survey of NASA and military standards on fault tolerance and reliability applied to robotics

NASA Technical Reports Server (NTRS)

Cavallaro, Joseph R.; Walker, Ian D.

1994-01-01

There is currently increasing interest and activity in the area of reliability and fault tolerance for robotics. This paper discusses the application of Standards in robot reliability, and surveys the literature of relevant existing standards. A bibliography of relevant Military and NASA standards for reliability and fault tolerance is included.
Fault tolerance with noisy and slow measurements and preparation.

PubMed

Paz-Silva, Gerardo A; Brennen, Gavin K; Twamley, Jason

2010-09-03

It is not so well known that measurement-free quantum error correction protocols can be designed to achieve fault-tolerant quantum computing. Despite their potential advantages in terms of the relaxation of accuracy, speed, and addressing requirements, they have usually been overlooked since they are expected to yield a very bad threshold. We show that this is not the case. We design fault-tolerant circuits for the 9-qubit Bacon-Shor code and find an error threshold for unitary gates and preparation of p((p,g)thresh)=3.76×10(-5) (30% of the best known result for the same code using measurement) while admitting up to 1/3 error rates for measurements and allocating no constraints on measurement speed. We further show that demanding gate error rates sufficiently below the threshold pushes the preparation threshold up to p((p)thresh)=1/3.
Concurrent development of fault management hardware and software in the SSM/PMAD. [Space Station Module/Power Management And Distribution

NASA Technical Reports Server (NTRS)

Freeman, Kenneth A.; Walsh, Rick; Weeks, David J.

1988-01-01

Space Station issues in fault management are discussed. The system background is described with attention given to design guidelines and power hardware. A contractually developed fault management system, FRAMES, is integrated with the energy management functions, the control switchgear, and the scheduling and operations management functions. The constraints that shaped the FRAMES system and its implementation are considered.
Performance and economy of a fault-tolerant multiprocessor

NASA Technical Reports Server (NTRS)

Lala, J. H.; Smith, C. J.

1979-01-01

The FTMP (Fault-Tolerant Multiprocessor) is one of two central aircraft fault-tolerant architectures now in the prototype phase under NASA sponsorship. The intended application of the computer includes such critical real-time tasks as 'fly-by-wire' active control and completely automatic Category III landings of commercial aircraft. The FTMP architecture is briefly described and it is shown that it is a viable solution to the multi-faceted problems of safety, speed, and cost. Three job dispatch strategies are described, and their results with respect to job-starting delay are presented. The first strategy is a simple First-Come-First-Serve (FCFS) job dispatch executive. The other two schedulers are an adaptive FCFS and an interrupt driven scheduler. Three failure modes are discussed, and the FTMP survival probability in the face of random hard failures is evaluated. It is noted that the hourly cost of operating two FTMPs in a transport aircraft can be as little as one-to-two percent of the total flight-hour cost of the aircraft.
Fault-tolerant quantum computation with nondeterministic entangling gates

NASA Astrophysics Data System (ADS)

Auger, James M.; Anwar, Hussain; Gimeno-Segovia, Mercedes; Stace, Thomas M.; Browne, Dan E.

2018-03-01

Performing entangling gates between physical qubits is necessary for building a large-scale universal quantum computer, but in some physical implementations—for example, those that are based on linear optics or networks of ion traps—entangling gates can only be implemented probabilistically. In this work, we study the fault-tolerant performance of a topological cluster state scheme with local nondeterministic entanglement generation, where failed entangling gates (which correspond to bonds on the lattice representation of the cluster state) lead to a defective three-dimensional lattice with missing bonds. We present two approaches for dealing with missing bonds; the first is a nonadaptive scheme that requires no additional quantum processing, and the second is an adaptive scheme in which qubits can be measured in an alternative basis to effectively remove them from the lattice, hence eliminating their damaging effect and leading to better threshold performance. We find that a fault-tolerance threshold can still be observed with a bond-loss rate of 6.5% for the nonadaptive scheme, and a bond-loss rate as high as 14.5% for the adaptive scheme.
Fault tolerance issues in nanoelectronics

NASA Astrophysics Data System (ADS)

Spagocci, S. M.

The astonishing success story of microelectronics cannot go on indefinitely. In fact, once devices reach the few-atom scale (nanoelectronics), transient quantum effects are expected to impair their behaviour. Fault tolerant techniques will then be required. The aim of this thesis is to investigate the problem of transient errors in nanoelectronic devices. Transient error rates for a selection of nanoelectronic gates, based upon quantum cellular automata and single electron devices, in which the electrostatic interaction between electrons is used to create Boolean circuits, are estimated. On the bases of such results, various fault tolerant solutions are proposed, for both logic and memory nanochips. As for logic chips, traditional techniques are found to be unsuitable. A new technique, in which the voting approach of triple modular redundancy (TMR) is extended by cascading TMR units composed of nanogate clusters, is proposed and generalised to other voting approaches. For memory chips, an error correcting code approach is found to be suitable. Various codes are considered and a lookup table approach is proposed for encoding and decoding. We are then able to give estimations for the redundancy level to be provided on nanochips, so as to make their mean time between failures acceptable. It is found that, for logic chips, space redundancies up to a few tens are required, if mean times between failures have to be of the order of a few years. Space redundancy can also be traded for time redundancy. As for memory chips, mean times between failures of the order of a few years are found to imply both space and time redundancies of the order of ten.
Robot Position Sensor Fault Tolerance

NASA Technical Reports Server (NTRS)

Aldridge, Hal A.

1997-01-01

Robot systems in critical applications, such as those in space and nuclear environments, must be able to operate during component failure to complete important tasks. One failure mode that has received little attention is the failure of joint position sensors. Current fault tolerant designs require the addition of directly redundant position sensors which can affect joint design. A new method is proposed that utilizes analytical redundancy to allow for continued operation during joint position sensor failure. Joint torque sensors are used with a virtual passive torque controller to make the robot joint stable without position feedback and improve position tracking performance in the presence of unknown link dynamics and end-effector loading. Two Cartesian accelerometer based methods are proposed to determine the position of the joint. The joint specific position determination method utilizes two triaxial accelerometers attached to the link driven by the joint with the failed position sensor. The joint specific method is not computationally complex and the position error is bounded. The system wide position determination method utilizes accelerometers distributed on different robot links and the end-effector to determine the position of sets of multiple joints. The system wide method requires fewer accelerometers than the joint specific method to make all joint position sensors fault tolerant but is more computationally complex and has lower convergence properties. Experiments were conducted on a laboratory manipulator. Both position determination methods were shown to track the actual position satisfactorily. A controller using the position determination methods and the virtual passive torque controller was able to servo the joints to a desired position during position sensor failure.
Low-Power Fault Tolerance for Spacecraft FPGA-Based Numerical Computing

DTIC Science & Technology

2006-09-01

Ranganathan , “Power Management – Guest Lecture for CS4135, NPS,” Naval Postgraduate School, Nov 2004 [32] R. L. Phelps, “Operational Experiences with the...4302, and to the Office of Management and Budget, Paperwork Reduction Project (0704-0188) Washington DC 20503. 1. AGENCY USE ONLY (Leave blank) 2...undesirable, are not necessarily harmful. Our intent is to prevent errors by properly managing faults. This research focuses on developing fault-tolerant
SUMC fault tolerant computer system

NASA Technical Reports Server (NTRS)

1980-01-01

The results of the trade studies are presented. These trades cover: establishing the basic configuration, establishing the CPU/memory configuration, establishing an approach to crosstrapping interfaces, defining the requirements of the redundancy management unit (RMU), establishing a spare plane switching strategy for the fault-tolerant memory (FTM), and identifying the most cost effective way of extending the memory addressing capability beyond the 64 K-bytes (K=1024) of SUMC-II B. The results of the design are compiled in Contract End Item (CEI) Specification for the NASA Standard Spacecraft Computer II (NSSC-II), IBM 7934507. The implementation of the FTM and memory address expansion.
FTMP - A highly reliable Fault-Tolerant Multiprocessor for aircraft

NASA Technical Reports Server (NTRS)

Hopkins, A. L., Jr.; Smith, T. B., III; Lala, J. H.

1978-01-01

The FTMP (Fault-Tolerant Multiprocessor) is a complex multiprocessor computer that employs a form of redundancy related to systems considered by Mathur (1971), in which each major module can substitute for any other module of the same type. Despite the conceptual simplicity of the redundancy form, the implementation has many intricacies owing partly to the low target failure rate, and partly to the difficulty of eliminating single-fault vulnerability. An extensive analysis of the computer through the use of such modeling techniques as Markov processes and combinatorial mathematics shows that for random hard faults the computer can meet its requirements. It is also shown that the maintenance scheduled at intervals of 200 hr or more can be adequate most of the time.
Using concatenated quantum codes for universal fault-tolerant quantum gates.

PubMed

Jochym-O'Connor, Tomas; Laflamme, Raymond

2014-01-10

We propose a method for universal fault-tolerant quantum computation using concatenated quantum error correcting codes. The concatenation scheme exploits the transversal properties of two different codes, combining them to provide a means to protect against low-weight arbitrary errors. We give the required properties of the error correcting codes to ensure universal fault tolerance and discuss a particular example using the 7-qubit Steane and 15-qubit Reed-Muller codes. Namely, other than computational basis state preparation as required by the DiVincenzo criteria, our scheme requires no special ancillary state preparation to achieve universality, as opposed to schemes such as magic state distillation. We believe that optimizing the codes used in such a scheme could provide a useful alternative to state distillation schemes that exhibit high overhead costs.
Fault Tolerant State Machines

NASA Technical Reports Server (NTRS)

Burke, Gary R.; Taft, Stephanie

2004-01-01

State machines are commonly used to control sequential logic in FPGAs and ASKS. An errant state machine can cause considerable damage to the device it is controlling. For example in space applications, the FPGA might be controlling Pyros, which when fired at the wrong time will cause a mission failure. Even a well designed state machine can be subject to random errors us a result of SEUs from the radiation environment in space. There are various ways to encode the states of a state machine, and the type of encoding makes a large difference in the susceptibility of the state machine to radiation. In this paper we compare 4 methods of state machine encoding and find which method gives the best fault tolerance, as well as determining the resources needed for each method.
Integral Sliding Mode Fault-Tolerant Control for Uncertain Linear Systems Over Networks With Signals Quantization.

PubMed

Hao, Li-Ying; Park, Ju H; Ye, Dan

2017-09-01

In this paper, a new robust fault-tolerant compensation control method for uncertain linear systems over networks is proposed, where only quantized signals are assumed to be available. This approach is based on the integral sliding mode (ISM) method where two kinds of integral sliding surfaces are constructed. One is the continuous-state-dependent surface with the aim of sliding mode stability analysis and the other is the quantization-state-dependent surface, which is used for ISM controller design. A scheme that combines the adaptive ISM controller and quantization parameter adjustment strategy is then proposed. Through utilizing H ∞ control analytical technique, once the system is in the sliding mode, the nature of performing disturbance attenuation and fault tolerance from the initial time can be found without requiring any fault information. Finally, the effectiveness of our proposed ISM control fault-tolerant schemes against quantization errors is demonstrated in the simulation.
Design and Experimental Validation for Direct-Drive Fault-Tolerant Permanent-Magnet Vernier Machines

PubMed Central

Liu, Guohai; Yang, Junqin; Chen, Ming; Chen, Qian

2014-01-01

A fault-tolerant permanent-magnet vernier (FT-PMV) machine is designed for direct-drive applications, incorporating the merits of high torque density and high reliability. Based on the so-called magnetic gearing effect, PMV machines have the ability of high torque density by introducing the flux-modulation poles (FMPs). This paper investigates the fault-tolerant characteristic of PMV machines and provides a design method, which is able to not only meet the fault-tolerant requirements but also keep the ability of high torque density. The operation principle of the proposed machine has been analyzed. The design process and optimization are presented specifically, such as the combination of slots and poles, the winding distribution, and the dimensions of PMs and teeth. By using the time-stepping finite element method (TS-FEM), the machine performances are evaluated. Finally, the FT-PMV machine is manufactured, and the experimental results are presented to validate the theoretical analysis. PMID:25045729

Design and experimental validation for direct-drive fault-tolerant permanent-magnet vernier machines.

PubMed

Liu, Guohai; Yang, Junqin; Chen, Ming; Chen, Qian

2014-01-01

A fault-tolerant permanent-magnet vernier (FT-PMV) machine is designed for direct-drive applications, incorporating the merits of high torque density and high reliability. Based on the so-called magnetic gearing effect, PMV machines have the ability of high torque density by introducing the flux-modulation poles (FMPs). This paper investigates the fault-tolerant characteristic of PMV machines and provides a design method, which is able to not only meet the fault-tolerant requirements but also keep the ability of high torque density. The operation principle of the proposed machine has been analyzed. The design process and optimization are presented specifically, such as the combination of slots and poles, the winding distribution, and the dimensions of PMs and teeth. By using the time-stepping finite element method (TS-FEM), the machine performances are evaluated. Finally, the FT-PMV machine is manufactured, and the experimental results are presented to validate the theoretical analysis.
Interface For Fault-Tolerant Control System

NASA Technical Reports Server (NTRS)

Shaver, Charles; Williamson, Michael

1989-01-01

Interface unit and controller emulator developed for research on electronic helicopter-flight-control systems equipped with artificial intelligence. Interface unit interrupt-driven system designed to link microprocessor-based, quadruply-redundant, asynchronous, ultra-reliable, fault-tolerant control system (controller) with electronic servocontrol unit that controls set of hydraulic actuators. Receives digital feedforward messages from, and transmits digital feedback messages to, controller through differential signal lines or fiber-optic cables (thus far only differential signal lines have been used). Analog signals transmitted to and from servocontrol unit via coaxial cables.
Software fault tolerance using data diversity

NASA Technical Reports Server (NTRS)

Knight, John C.

1991-01-01

Research on data diversity is discussed. Data diversity relies on a different form of redundancy from existing approaches to software fault tolerance and is substantially less expensive to implement. Data diversity can also be applied to software testing and greatly facilitates the automation of testing. Up to now it has been explored both theoretically and in a pilot study, and has been shown to be a promising technique. The effectiveness of data diversity as an error detection mechanism and the application of data diversity to differential equation solvers are discussed.
About problematic peculiarities of Fault Tolerance digital regulation organization

NASA Astrophysics Data System (ADS)

Rakov, V. I.; Zakharova, O. V.

2018-05-01

The solution of problems concerning estimation of working capacity of regulation chains and possibilities of preventing situations of its violation in three directions are offered. The first direction is working out (creating) the methods of representing the regulation loop (circuit) by means of uniting (combining) diffuse components and forming algorithmic tooling for building predicates of serviceability assessment separately for the components and the for regulation loops (circuits, contours) in general. The second direction is creating methods of Fault Tolerance redundancy in the process of complex assessment of current values of control actions, closure errors and their regulated parameters. The third direction is creating methods of comparing the processes of alteration (change) of control actions, errors of closure and regulating parameters with their standard models or their surroundings. This direction allows one to develop methods and algorithmic tool means, aimed at preventing loss of serviceability and effectiveness of not only a separate digital regulator, but also the whole complex of Fault Tolerance regulation.
Fault-tolerant clock synchronization validation methodology. [in computer systems

NASA Technical Reports Server (NTRS)

Butler, Ricky W.; Palumbo, Daniel L.; Johnson, Sally C.

1987-01-01

A validation method for the synchronization subsystem of a fault-tolerant computer system is presented. The high reliability requirement of flight-crucial systems precludes the use of most traditional validation methods. The method presented utilizes formal design proof to uncover design and coding errors and experimentation to validate the assumptions of the design proof. The experimental method is described and illustrated by validating the clock synchronization system of the Software Implemented Fault Tolerance computer. The design proof of the algorithm includes a theorem that defines the maximum skew between any two nonfaulty clocks in the system in terms of specific system parameters. Most of these parameters are deterministic. One crucial parameter is the upper bound on the clock read error, which is stochastic. The probability that this upper bound is exceeded is calculated from data obtained by the measurement of system parameters. This probability is then included in a detailed reliability analysis of the system.
Verification of fault-tolerant clock synchronization systems. M.S. Thesis - College of William and Mary, 1992

NASA Technical Reports Server (NTRS)

Miner, Paul S.

1993-01-01

A critical function in a fault-tolerant computer architecture is the synchronization of the redundant computing elements. The synchronization algorithm must include safeguards to ensure that failed components do not corrupt the behavior of good clocks. Reasoning about fault-tolerant clock synchronization is difficult because of the possibility of subtle interactions involving failed components. Therefore, mechanical proof systems are used to ensure that the verification of the synchronization system is correct. In 1987, Schneider presented a general proof of correctness for several fault-tolerant clock synchronization algorithms. Subsequently, Shankar verified Schneider's proof by using the mechanical proof system EHDM. This proof ensures that any system satisfying its underlying assumptions will provide Byzantine fault-tolerant clock synchronization. The utility of Shankar's mechanization of Schneider's theory for the verification of clock synchronization systems is explored. Some limitations of Shankar's mechanically verified theory were encountered. With minor modifications to the theory, a mechanically checked proof is provided that removes these limitations. The revised theory also allows for proven recovery from transient faults. Use of the revised theory is illustrated with the verification of an abstract design of a clock synchronization system.
Adaptive Fault-Tolerant Control of Uncertain Nonlinear Large-Scale Systems With Unknown Dead Zone.

PubMed

Chen, Mou; Tao, Gang

2016-08-01

In this paper, an adaptive neural fault-tolerant control scheme is proposed and analyzed for a class of uncertain nonlinear large-scale systems with unknown dead zone and external disturbances. To tackle the unknown nonlinear interaction functions in the large-scale system, the radial basis function neural network (RBFNN) is employed to approximate them. To further handle the unknown approximation errors and the effects of the unknown dead zone and external disturbances, integrated as the compounded disturbances, the corresponding disturbance observers are developed for their estimations. Based on the outputs of the RBFNN and the disturbance observer, the adaptive neural fault-tolerant control scheme is designed for uncertain nonlinear large-scale systems by using a decentralized backstepping technique. The closed-loop stability of the adaptive control system is rigorously proved via Lyapunov analysis and the satisfactory tracking performance is achieved under the integrated effects of unknown dead zone, actuator fault, and unknown external disturbances. Simulation results of a mass-spring-damper system are given to illustrate the effectiveness of the proposed adaptive neural fault-tolerant control scheme for uncertain nonlinear large-scale systems.
Toward a Fault Tolerant Architecture for Vital Medical-Based Wearable Computing.

PubMed

Abdali-Mohammadi, Fardin; Bajalan, Vahid; Fathi, Abdolhossein

2015-12-01

Advancements in computers and electronic technologies have led to the emergence of a new generation of efficient small intelligent systems. The products of such technologies might include Smartphones and wearable devices, which have attracted the attention of medical applications. These products are used less in critical medical applications because of their resource constraint and failure sensitivity. This is due to the fact that without safety considerations, small-integrated hardware will endanger patients' lives. Therefore, proposing some principals is required to construct wearable systems in healthcare so that the existing concerns are dealt with. Accordingly, this paper proposes an architecture for constructing wearable systems in critical medical applications. The proposed architecture is a three-tier one, supporting data flow from body sensors to cloud. The tiers of this architecture include wearable computers, mobile computing, and mobile cloud computing. One of the features of this architecture is its high possible fault tolerance due to the nature of its components. Moreover, the required protocols are presented to coordinate the components of this architecture. Finally, the reliability of this architecture is assessed by simulating the architecture and its components, and other aspects of the proposed architecture are discussed.
Galileo spacecraft power distribution and autonomous fault recovery

NASA Technical Reports Server (NTRS)

Detwiler, R. C.

1982-01-01

There is a trend in current spacecraft design to achieve greater fault tolerance through the implemenation of on-board software dedicated to detecting and isolating failures. A combination of hardware and software is utilized in the Galileo power system for autonomous fault recovery. Galileo is a dual-spun spacecraft designed to carry a number of scientific instruments into a series of orbits around the planet Jupiter. In addition to its self-contained scientific payload, it will also carry a probe system which will be separated from the spacecraft some 150 days prior to Jupiter encounter. The Galileo spacecraft is scheduled to be launched in 1985. Attention is given to the power system, the fault protection requirements, and the power fault recovery implementation.
Fault-tolerant power distribution system

NASA Technical Reports Server (NTRS)

Volp, Jeffrey A. (Inventor)

1987-01-01

A fault-tolerant power distribution system which includes a plurality of power sources and a plurality of nodes responsive thereto for supplying power to one or more loads associated with each node. Each node includes a plurality of switching circuits, each of which preferably uses a power field effect transistor which provides a diode operation when power is first applied to the nodes and which thereafter provides bi-directional current flow through the switching circuit in a manner such that a low voltage drop is produced in each direction. Each switching circuit includes circuitry for disabling the power field effect transistor when the current in the switching circuit exceeds a preselected value.
Economic modeling of fault tolerant flight control systems in commercial applications

NASA Technical Reports Server (NTRS)

Finelli, G. B.

1982-01-01

This paper describes the current development of a comprehensive model which will supply the assessment and analysis capability to investigate the economic viability of Fault Tolerant Flight Control Systems (FTFCS) for commercial aircraft of the 1990's and beyond. An introduction to the unique attributes of fault tolerance and how they will influence aircraft operations and consequent airline costs and benefits is presented. Specific modeling issues and elements necessary for accurate assessment of all costs affected by ownership and operation of FTFCS are delineated. Trade-off factors are presented, aimed at exposing economically optimal realizations of system implementations, resource allocation, and operating policies. A trade-off example is furnished to graphically display some of the analysis capabilities of the comprehensive simulation model now being developed.
Verifiable fault tolerance in measurement-based quantum computation

NASA Astrophysics Data System (ADS)

Fujii, Keisuke; Hayashi, Masahito

2017-09-01

Quantum systems, in general, cannot be simulated efficiently by a classical computer, and hence are useful for solving certain mathematical problems and simulating quantum many-body systems. This also implies, unfortunately, that verification of the output of the quantum systems is not so trivial, since predicting the output is exponentially hard. As another problem, the quantum system is very delicate for noise and thus needs an error correction. Here, we propose a framework for verification of the output of fault-tolerant quantum computation in a measurement-based model. In contrast to existing analyses on fault tolerance, we do not assume any noise model on the resource state, but an arbitrary resource state is tested by using only single-qubit measurements to verify whether or not the output of measurement-based quantum computation on it is correct. Verifiability is equipped by a constant time repetition of the original measurement-based quantum computation in appropriate measurement bases. Since full characterization of quantum noise is exponentially hard for large-scale quantum computing systems, our framework provides an efficient way to practically verify the experimental quantum error correction.
2009 fault tolerance for extreme-scale computing workshop, Albuquerque, NM - March 19-20, 2009.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Katz, D. S.; Daly, J.; DeBardeleben, N.

2009-02-01

This is a report on the third in a series of petascale workshops co-sponsored by Blue Waters and TeraGrid to address challenges and opportunities for making effective use of emerging extreme-scale computing. This workshop was held to discuss fault tolerance on large systems for running large, possibly long-running applications. The main point of the workshop was to have systems people, middleware people (including fault-tolerance experts), and applications people talk about the issues and figure out what needs to be done, mostly at the middleware and application levels, to run such applications on the emerging petascale systems, without having faults causemore » large numbers of application failures. The workshop found that there is considerable interest in fault tolerance, resilience, and reliability of high-performance computing (HPC) systems in general, at all levels of HPC. The only way to recover from faults is through the use of some redundancy, either in space or in time. Redundancy in time, in the form of writing checkpoints to disk and restarting at the most recent checkpoint after a fault that cause an application to crash/halt, is the most common tool used in applications today, but there are questions about how long this can continue to be a good solution as systems and memories grow faster than I/O bandwidth to disk. There is interest in both modifications to this, such as checkpoints to memory, partial checkpoints, and message logging, and alternative ideas, such as in-memory recovery using residues. We believe that systematic exploration of these ideas holds the most promise for the scientific applications community. Fault tolerance has been an issue of discussion in the HPC community for at least the past 10 years; but much like other issues, the community has managed to put off addressing it during this period. There is a growing recognition that as systems continue to grow to petascale and beyond, the field is approaching the point where we don
Fault-tolerance of a neural network solving the traveling salesman problem

NASA Technical Reports Server (NTRS)

Protzel, P.; Palumbo, D.; Arras, M.

1989-01-01

This study presents the results of a fault-injection experiment that stimulates a neural network solving the Traveling Salesman Problem (TSP). The network is based on a modified version of Hopfield's and Tank's original method. We define a performance characteristic for the TSP that allows an overall assessment of the solution quality for different city-distributions and problem sizes. Five different 10-, 20-, and 30- city cases are sued for the injection of up to 13 simultaneous stuck-at-0 and stuck-at-1 faults. The results of more than 4000 simulation-runs show the extreme fault-tolerance of the network, especially with respect to stuck-at-0 faults. One possible explanation for the overall surprising result is the redundancy of the problem representation.
Fault tolerant, radiation hard, high performance digital signal processor

NASA Technical Reports Server (NTRS)

Holmann, Edgar; Linscott, Ivan R.; Maurer, Michael J.; Tyler, G. L.; Libby, Vibeke

1990-01-01

An architecture has been developed for a high-performance VLSI digital signal processor that is highly reliable, fault-tolerant, and radiation-hard. The signal processor, part of a spacecraft receiver designed to support uplink radio science experiments at the outer planets, organizes the connections between redundant arithmetic resources, register files, and memory through a shuffle exchange communication network. The configuration of the network and the state of the processor resources are all under microprogram control, which both maps the resources according to algorithmic needs and reconfigures the processing should a failure occur. In addition, the microprogram is reloadable through the uplink to accommodate changes in the science objectives throughout the course of the mission. The processor will be implemented with silicon compiler tools, and its design will be verified through silicon compilation simulation at all levels from the resources to full functionality. By blending reconfiguration with redundancy the processor implementation is fault-tolerant and reliable, and possesses the long expected lifetime needed for a spacecraft mission to the outer planets.
Distributed Evaluation Functions for Fault Tolerant Multi-Rover Systems

NASA Technical Reports Server (NTRS)

Agogino, Adrian; Turner, Kagan

2005-01-01

The ability to evolve fault tolerant control strategies for large collections of agents is critical to the successful application of evolutionary strategies to domains where failures are common. Furthermore, while evolutionary algorithms have been highly successful in discovering single-agent control strategies, extending such algorithms to multiagent domains has proven to be difficult. In this paper we present a method for shaping evaluation functions for agents that provide control strategies that both are tolerant to different types of failures and lead to coordinated behavior in a multi-agent setting. This method neither relies of a centralized strategy (susceptible to single point of failures) nor a distributed strategy where each agent uses a system wide evaluation function (severe credit assignment problem). In a multi-rover problem, we show that agents using our agent-specific evaluation perform up to 500% better than agents using the system evaluation. In addition we show that agents are still able to maintain a high level of performance when up to 60% of the agents fail due to actuator, communication or controller faults.
Measurement and analysis of workload effects on fault latency in real-time systems

NASA Technical Reports Server (NTRS)

Woodbury, Michael H.; Shin, Kang G.

1990-01-01

The authors demonstrate the need to address fault latency in highly reliable real-time control computer systems. It is noted that the effectiveness of all known recovery mechanisms is greatly reduced in the presence of multiple latent faults. The presence of multiple latent faults increases the possibility of multiple errors, which could result in coverage failure. The authors present experimental evidence indicating that the duration of fault latency is dependent on workload. A synthetic workload generator is used to vary the workload, and a hardware fault injector is applied to inject transient faults of varying durations. This method makes it possible to derive the distribution of fault latency duration. Experimental results obtained from the fault-tolerant multiprocessor at the NASA Airlab are presented and discussed.
Experimental Robot Position Sensor Fault Tolerance Using Accelerometers and Joint Torque Sensors

NASA Technical Reports Server (NTRS)

Aldridge, Hal A.; Juang, Jer-Nan

1997-01-01

Robot systems in critical applications, such as those in space and nuclear environments, must be able to operate during component failure to complete important tasks. One failure mode that has received little attention is the failure of joint position sensors. Current fault tolerant designs require the addition of directly redundant position sensors which can affect joint design. The proposed method uses joint torque sensors found in most existing advanced robot designs along with easily locatable, lightweight accelerometers to provide a joint position sensor fault recovery mode. This mode uses the torque sensors along with a virtual passive control law for stability and accelerometers for joint position information. Two methods for conversion from Cartesian acceleration to joint position based on robot kinematics, not integration, are presented. The fault tolerant control method was tested on several joints of a laboratory robot. The controllers performed well with noisy, biased data and a model with uncertain parameters.
Fault-tolerant quantum error detection.

PubMed

Linke, Norbert M; Gutierrez, Mauricio; Landsman, Kevin A; Figgatt, Caroline; Debnath, Shantanu; Brown, Kenneth R; Monroe, Christopher

2017-10-01

Quantum computers will eventually reach a size at which quantum error correction becomes imperative. Quantum information can be protected from qubit imperfections and flawed control operations by encoding a single logical qubit in multiple physical qubits. This redundancy allows the extraction of error syndromes and the subsequent detection or correction of errors without destroying the logical state itself through direct measurement. We show the encoding and syndrome measurement of a fault-tolerantly prepared logical qubit via an error detection protocol on four physical qubits, represented by trapped atomic ions. This demonstrates the robustness of a logical qubit to imperfections in the very operations used to encode it. The advantage persists in the face of large added error rates and experimental calibration errors.
Fault-tolerant quantum error detection

PubMed Central

Linke, Norbert M.; Gutierrez, Mauricio; Landsman, Kevin A.; Figgatt, Caroline; Debnath, Shantanu; Brown, Kenneth R.; Monroe, Christopher

2017-01-01

Quantum computers will eventually reach a size at which quantum error correction becomes imperative. Quantum information can be protected from qubit imperfections and flawed control operations by encoding a single logical qubit in multiple physical qubits. This redundancy allows the extraction of error syndromes and the subsequent detection or correction of errors without destroying the logical state itself through direct measurement. We show the encoding and syndrome measurement of a fault-tolerantly prepared logical qubit via an error detection protocol on four physical qubits, represented by trapped atomic ions. This demonstrates the robustness of a logical qubit to imperfections in the very operations used to encode it. The advantage persists in the face of large added error rates and experimental calibration errors. PMID:29062889

The art of fault-tolerant system reliability modeling

NASA Technical Reports Server (NTRS)

Butler, Ricky W.; Johnson, Sally C.

1990-01-01

A step-by-step tutorial of the methods and tools used for the reliability analysis of fault-tolerant systems is presented. Emphasis is on the representation of architectural features in mathematical models. Details of the mathematical solution of complex reliability models are not presented. Instead the use of several recently developed computer programs--SURE, ASSIST, STEM, PAWS--which automate the generation and solution of these models is described.
Generating a fault-tolerant global clock using high-speed control signals for the MetaNet architecture

DOE Office of Scientific and Technical Information (OSTI.GOV)

Ofek, Y.

1994-05-01

This work describes a new technique, based on exchanging control signals between neighboring nodes, for constructing a stable and fault-tolerant global clock in a distributed system with an arbitrary topology. It is shown that it is possible to construct a global clock reference with time step that is much smaller than the propagation delay over the network's links. The synchronization algorithm ensures that the global clock tick' has a stable periodicity, and therefore, it is possible to tolerate failures of links and clocks that operate faster and/or slower than nominally specified, as well as hard failures. The approach taken inmore » this work is to generate a global clock from the ensemble of the local transmission clocks and not to directly synchronize these high-speed clocks. The steady-state algorithm, which generates the global clock, is executed in hardware by the network interface of each node. At the network interface, it is possible to measure accurately the propagation delay between neighboring nodes with a small error or uncertainty and thereby to achieve global synchronization that is proportional to these error measurements. It is shown that the local clock drift (or rate uncertainty) has only a secondary effect on the maximum global clock rate. The synchronization algorithm can tolerate any physical failure. 18 refs.« less
Adaptive extended-state observer-based fault tolerant attitude control for spacecraft with reaction wheels

NASA Astrophysics Data System (ADS)

Ran, Dechao; Chen, Xiaoqian; de Ruiter, Anton; Xiao, Bing

2018-04-01

This study presents an adaptive second-order sliding control scheme to solve the attitude fault tolerant control problem of spacecraft subject to system uncertainties, external disturbances and reaction wheel faults. A novel fast terminal sliding mode is preliminarily designed to guarantee that finite-time convergence of the attitude errors can be achieved globally. Based on this novel sliding mode, an adaptive second-order observer is then designed to reconstruct the system uncertainties and the actuator faults. One feature of the proposed observer is that the design of the observer does not necessitate any priori information of the upper bounds of the system uncertainties and the actuator faults. In view of the reconstructed information supplied by the designed observer, a second-order sliding mode controller is developed to accomplish attitude maneuvers with great robustness and precise tracking accuracy. Theoretical stability analysis proves that the designed fault tolerant control scheme can achieve finite-time stability of the closed-loop system, even in the presence of reaction wheel faults and system uncertainties. Numerical simulations are also presented to demonstrate the effectiveness and superiority of the proposed control scheme over existing methodologies.
A Decentralized Adaptive Approach to Fault Tolerant Flight Control

NASA Technical Reports Server (NTRS)

Wu, N. Eva; Nikulin, Vladimir; Heimes, Felix; Shormin, Victor

2000-01-01

This paper briefly reports some results of our study on the application of a decentralized adaptive control approach to a 6 DOF nonlinear aircraft model. The simulation results showed the potential of using this approach to achieve fault tolerant control. Based on this observation and some analysis, the paper proposes a multiple channel adaptive control scheme that makes use of the functionally redundant actuating and sensing capabilities in the model, and explains how to implement the scheme to tolerate actuator and sensor failures. The conditions, under which the scheme is applicable, are stated in the paper.
From experiment to design -- Fault characterization and detection in parallel computer systems using computational accelerators

NASA Astrophysics Data System (ADS)

Yim, Keun Soo

program states that included dynamically allocated memory (to be spatially comprehensive). In GPUs, we used fault injection studies to demonstrate the importance of detecting silent data corruption (SDC) errors that are mainly due to the lack of fine-grained protections and the massive use of fault-insensitive data. This dissertation also presents transparent fault tolerance frameworks and techniques that are directly applicable to hybrid computers built using only commercial off-the-shelf hardware components. This dissertation shows that by developing understanding of the failure characteristics and error propagation paths of target programs, we were able to create fault tolerance frameworks and techniques that can quickly detect and recover from hardware faults with low performance and hardware overheads.
A Fault Tolerant System for an Integrated Avionics Sensor Configuration

NASA Technical Reports Server (NTRS)

Caglayan, A. K.; Lancraft, R. E.

1984-01-01

An aircraft sensor fault tolerant system methodology for the Transport Systems Research Vehicle in a Microwave Landing System (MLS) environment is described. The fault tolerant system provides reliable estimates in the presence of possible failures both in ground-based navigation aids, and in on-board flight control and inertial sensors. Sensor failures are identified by utilizing the analytic relationships between the various sensors arising from the aircraft point mass equations of motion. The estimation and failure detection performance of the software implementation (called FINDS) of the developed system was analyzed on a nonlinear digital simulation of the research aircraft. Simulation results showing the detection performance of FINDS, using a dual redundant sensor compliment, are presented for bias, hardover, null, ramp, increased noise and scale factor failures. In general, the results show that FINDS can distinguish between normal operating sensor errors and failures while providing an excellent detection speed for bias failures in the MLS, indicated airspeed, attitude and radar altimeter sensors.
Transistor Level Circuit Experiments using Evolvable Hardware

NASA Technical Reports Server (NTRS)

Stoica, A.; Zebulum, R. S.; Keymeulen, D.; Ferguson, M. I.; Daud, Taher; Thakoor, A.

2005-01-01

The Jet Propulsion Laboratory (JPL) performs research in fault tolerant, long life, and space survivable electronics for the National Aeronautics and Space Administration (NASA). With that focus, JPL has been involved in Evolvable Hardware (EHW) technology research for the past several years. We have advanced the technology not only by simulation and evolution experiments, but also by designing, fabricating, and evolving a variety of transistor-based analog and digital circuits at the chip level. EHW refers to self-configuration of electronic hardware by evolutionary/genetic search mechanisms, thereby maintaining existing functionality in the presence of degradations due to aging, temperature, and radiation. In addition, EHW has the capability to reconfigure itself for new functionality when required for mission changes or encountered opportunities. Evolution experiments are performed using a genetic algorithm running on a DSP as the reconfiguration mechanism and controlling the evolvable hardware mounted on a self-contained circuit board. Rapid reconfiguration allows convergence to circuit solutions in the order of seconds. The paper illustrates hardware evolution results of electronic circuits and their ability to perform under 230 C temperature as well as radiations of up to 250 kRad.
Validation Methods Research for Fault-Tolerant Avionics and Control Systems: Working Group Meeting, 2

NASA Technical Reports Server (NTRS)

Gault, J. W. (Editor); Trivedi, K. S. (Editor); Clary, J. B. (Editor)

1980-01-01

The validation process comprises the activities required to insure the agreement of system realization with system specification. A preliminary validation methodology for fault tolerant systems documented. A general framework for a validation methodology is presented along with a set of specific tasks intended for the validation of two specimen system, SIFT and FTMP. Two major areas of research are identified. First, are those activities required to support the ongoing development of the validation process itself, and second, are those activities required to support the design, development, and understanding of fault tolerant systems.
Redundant and fault-tolerant algorithms for real-time measurement and control systems for weapon equipment.

PubMed

Li, Dan; Hu, Xiaoguang

2017-03-01

Because of the high availability requirements from weapon equipment, an in-depth study has been conducted on the real-time fault-tolerance of the widely applied Compact PCI (CPCI) bus measurement and control system. A redundancy design method that uses heartbeat detection to connect the primary and alternate devices has been developed. To address the low successful execution rate and relatively large waste of time slices in the primary version of the task software, an improved algorithm for real-time fault-tolerant scheduling is proposed based on the Basic Checking available time Elimination idle time (BCE) algorithm, applying a single-neuron self-adaptive proportion sum differential (PSD) controller. The experimental validation results indicate that this system has excellent redundancy and fault-tolerance, and the newly developed method can effectively improve the system availability. Copyright © 2017 ISA. Published by Elsevier Ltd. All rights reserved.
A fault-tolerant multiprocessor architecture for aircraft, volume 1. [autopilot configuration

NASA Technical Reports Server (NTRS)

Smith, T. B.; Hopkins, A. L.; Taylor, W.; Ausrotas, R. A.; Lala, J. H.; Hanley, L. D.; Martin, J. H.

1978-01-01

A fault-tolerant multiprocessor architecture is reported. This architecture, together with a comprehensive information system architecture, has important potential for future aircraft applications. A preliminary definition and assessment of a suitable multiprocessor architecture for such applications is developed.
Fault Tolerant Cache Schemes

NASA Astrophysics Data System (ADS)

Tu, H.-Yu.; Tasneem, Sarah

Most of modern microprocessors employ on—chip cache memories to meet the memory bandwidth demand. These caches are now occupying a greater real es tate of chip area. Also, continuous down scaling of transistors increases the possi bility of defects in the cache area which already starts to occupies more than 50% of chip area. For this reason, various techniques have been proposed to tolerate defects in cache blocks. These techniques can be classified into three different cat egories, namely, cache line disabling, replacement with spare block, and decoder reconfiguration without spare blocks. This chapter examines each of those fault tol erant techniques with a fixed typical size and organization of L1 cache, through extended simulation using SPEC2000 benchmark on individual techniques. The de sign and characteristics of each technique are summarized with a view to evaluate the scheme. We then present our simulation results and comparative study of the three different methods.
Hypothetical Scenario Generator for Fault-Tolerant Diagnosis

NASA Technical Reports Server (NTRS)

James, Mark

2007-01-01

The Hypothetical Scenario Generator for Fault-tolerant Diagnostics (HSG) is an algorithm being developed in conjunction with other components of artificial- intelligence systems for automated diagnosis and prognosis of faults in spacecraft, aircraft, and other complex engineering systems. By incorporating prognostic capabilities along with advanced diagnostic capabilities, these developments hold promise to increase the safety and affordability of the affected engineering systems by making it possible to obtain timely and accurate information on the statuses of the systems and predicting impending failures well in advance. The HSG is a specific instance of a hypothetical- scenario generator that implements an innovative approach for performing diagnostic reasoning when data are missing. The special purpose served by the HSG is to (1) look for all possible ways in which the present state of the engineering system can be mapped with respect to a given model and (2) generate a prioritized set of future possible states and the scenarios of which they are parts.
Software Implemented Fault-Tolerant (SIFT) user's guide

NASA Technical Reports Server (NTRS)

Green, D. F., Jr.; Palumbo, D. L.; Baltrus, D. W.

1984-01-01

Program development for a Software Implemented Fault Tolerant (SIFT) computer system is accomplished in the NASA LaRC AIRLAB facility using a DEC VAX-11 to interface with eight Bendix BDX 930 flight control processors. The interface software which provides this SIFT program development capability was developed by AIRLAB personnel. This technical memorandum describes the application and design of this software in detail, and is intended to assist both the user in performance of SIFT research and the systems programmer responsible for maintaining and/or upgrading the SIFT programming environment.
Scalable Replay with Partial-Order Dependencies for Message-Logging Fault Tolerance

DOE Office of Scientific and Technical Information (OSTI.GOV)

Lifflander, Jonathan; Meneses, Esteban; Menon, Harshita

2014-09-22

Deterministic replay of a parallel application is commonly used for discovering bugs or to recover from a hard fault with message-logging fault tolerance. For message passing programs, a major source of overhead during forward execution is recording the order in which messages are sent and received. During replay, this ordering must be used to deterministically reproduce the execution. Previous work in replay algorithms often makes minimal assumptions about the programming model and application in order to maintain generality. However, in many cases, only a partial order must be recorded due to determinism intrinsic in the code, ordering constraints imposed bymore » the execution model, and events that are commutative (their relative execution order during replay does not need to be reproduced exactly). In this paper, we present a novel algebraic framework for reasoning about the minimum dependencies required to represent the partial order for different concurrent orderings and interleavings. By exploiting this theory, we improve on an existing scalable message-logging fault tolerance scheme. The improved scheme scales to 131,072 cores on an IBM BlueGene/P with up to 2x lower overhead than one that records a total order.« less
Proprioceptive Sensors' Fault Tolerant Control Strategy for an Autonomous Vehicle.

PubMed

Boukhari, Mohamed Riad; Chaibet, Ahmed; Boukhnifer, Moussa; Glaser, Sébastien

2018-06-09

In this contribution, a fault-tolerant control strategy for the longitudinal dynamics of an autonomous vehicle is presented. The aim is to be able to detect potential failures of the vehicle's speed sensor and then to keep the vehicle in a safe state. For this purpose, the separation principle, composed of a static output feedback controller and fault estimation observers, is designed. Indeed, two observer techniques were proposed: the proportional and integral observer and the descriptor observer. The effectiveness of the proposed scheme is validated by means of the experimental demonstrator of the VEDECOM (Véhicle Décarboné et Communinicant) Institut.
Fault-Tolerant Control of ANPC Three-Level Inverter Based on Order-Reduction Optimal Control Strategy under Multi-Device Open-Circuit Fault.

PubMed

Xu, Shi-Zhou; Wang, Chun-Jie; Lin, Fang-Li; Li, Shi-Xiang

2017-10-31

The multi-device open-circuit fault is a common fault of ANPC (Active Neutral-Point Clamped) three-level inverter and effect the operation stability of the whole system. To improve the operation stability, this paper summarized the main solutions currently firstly and analyzed all the possible states of multi-device open-circuit fault. Secondly, an order-reduction optimal control strategy was proposed under multi-device open-circuit fault to realize fault-tolerant control based on the topology and control requirement of ANPC three-level inverter and operation stability. This control strategy can solve the faults with different operation states, and can works in order-reduction state under specific open-circuit faults with specific combined devices, which sacrifices the control quality to obtain the stability priority control. Finally, the simulation and experiment proved the effectiveness of the proposed strategy.
Interplanetary Radiation and Fault Tolerant Mini-Star Tracker System

NASA Technical Reports Server (NTRS)

Rakoczy, John; Paceley, Pete

2015-01-01

The Charles Stark Draper Laboratory, Inc. is partnering with the NASA Marshall Space Flight Center (MSFC) Engineering Directorate's Avionics Design Division and Flight Mechanics & Analysis Division to develop and test a prototype small, low-weight, low-power, radiation-hardened, fault-tolerant mini-star tracker (fig. 1). The project is expected to enable Draper Laboratory and its small business partner, L-1 Standards and Technologies, Inc., to develop a new guidance, navigation, and control sensor product for the growing small sat technology market. The project also addresses MSFC's need for sophisticated small sat technologies to support a variety of science missions in Earth orbit and beyond. The prototype star tracker will be tested on the night sky on MSFC's Automated Lunar and Meteor Observatory (ALAMO) telescope. The specific goal of the project is to address the need for a compact, low size, weight, and power, yet radiation hardened and fault tolerant star tracker system that can be used as a stand-alone attitude determination system or incorporated into a complete attitude determination and control system for emerging interplanetary and operational CubeSat and small sat missions.
Reliability model derivation of a fault-tolerant, dual, spare-switching, digital computer system

NASA Technical Reports Server (NTRS)

1974-01-01

A computer based reliability projection aid, tailored specifically for application in the design of fault-tolerant computer systems, is described. Its more pronounced characteristics include the facility for modeling systems with two distinct operational modes, measuring the effect of both permanent and transient faults, and calculating conditional system coverage factors. The underlying conceptual principles, mathematical models, and computer program implementation are presented.
ALLIANCE: An architecture for fault tolerant, cooperative control of heterogeneous mobile robots

DOE Office of Scientific and Technical Information (OSTI.GOV)

Parker, L.E.

1995-02-01

This research addresses the problem of achieving fault tolerant cooperation within small- to medium-sized teams of heterogeneous mobile robots. The author describes a novel behavior-based, fully distributed architecture, called ALLIANCE, that utilizes adaptive action selection to achieve fault tolerant cooperative control in robot missions involving loosely coupled, largely independent tasks. The robots in this architecture possess a variety of high-level functions that they can perform during a mission, and must at all times select an appropriate action based on the requirements of the mission, the activities of other robots, the current environmental conditions, and their own internal states. Since suchmore » cooperative teams often work in dynamic and unpredictable environments, the software architecture allows the team members to respond robustly and reliably to unexpected environmental changes and modifications in the robot team that may occur due to mechanical failure, the learning of new skills, or the addition or removal of robots from the team by human intervention. After presenting ALLIANCE, the author describes in detail experimental results of an implementation of this architecture on a team of physical mobile robots performing a cooperative box pushing demonstration. These experiments illustrate the ability of ALLIANCE to achieve adaptive, fault-tolerant cooperative control amidst dynamic changes in the capabilities of the robot team.« less
A Fault Tolerant Self-Routing Computer Network Topology

DTIC Science & Technology

1987-01-01

Herr and Thomas J. Plevyak, *ISDN: The Opportunity Beginso, IEEECommunicationsMaqaz I t, pp. 6-10, November 1986. 5. Mario Gerla and Rodolfo A . Pazos ...WOLAVER a Dean for Research anProfessional Development Air Force Institute Bf Technology Wright-Patterson AFB OH 45433-6583 19. KEY WORDS (Continue...DD I 1473 EDITION OF I NOV 65 IS OBSOLETE UM!C[ASSIFIEy SECURITY CLASSIFICATION OF THIS PAGE (When Data Entered) 41 ,." 5.’ A Fault Tolerant Self

Gain-Scheduled Fault Tolerance Control Under False Identification

NASA Technical Reports Server (NTRS)

Shin, Jong-Yeob; Belcastro, Christine (Technical Monitor)

2006-01-01

An active fault tolerant control (FTC) law is generally sensitive to false identification since the control gain is reconfigured for fault occurrence. In the conventional FTC law design procedure, dynamic variations due to false identification are not considered. In this paper, an FTC synthesis method is developed in order to consider possible variations of closed-loop dynamics under false identification into the control design procedure. An active FTC synthesis problem is formulated into an LMI optimization problem to minimize the upper bound of the induced-L2 norm which can represent the worst-case performance degradation due to false identification. The developed synthesis method is applied for control of the longitudinal motions of FASER (Free-flying Airplane for Subscale Experimental Research). The designed FTC law of the airplane is simulated for pitch angle command tracking under a false identification case.
Software reliability models for fault-tolerant avionics computers and related topics

NASA Technical Reports Server (NTRS)

Miller, Douglas R.

1987-01-01

Software reliability research is briefly described. General research topics are reliability growth models, quality of software reliability prediction, the complete monotonicity property of reliability growth, conceptual modelling of software failure behavior, assurance of ultrahigh reliability, and analysis techniques for fault-tolerant systems.
Advanced information processing system for advanced launch system: Hardware technology survey and projections

NASA Technical Reports Server (NTRS)

Cole, Richard

1991-01-01

The major goals of this effort are as follows: (1) to examine technology insertion options to optimize Advanced Information Processing System (AIPS) performance in the Advanced Launch System (ALS) environment; (2) to examine the AIPS concepts to ensure that valuable new technologies are not excluded from the AIPS/ALS implementations; (3) to examine advanced microprocessors applicable to AIPS/ALS, (4) to examine radiation hardening technologies applicable to AIPS/ALS; (5) to reach conclusions on AIPS hardware building blocks implementation technologies; and (6) reach conclusions on appropriate architectural improvements. The hardware building blocks are the Fault-Tolerant Processor, the Input/Output Sequencers (IOS), and the Intercomputer Interface Sequencers (ICIS).
Fault tolerant and lifetime control architecture for autonomous vehicles

NASA Astrophysics Data System (ADS)

Bogdanov, Alexander; Chen, Yi-Liang; Sundareswaran, Venkataraman; Altshuler, Thomas

2008-04-01

Increased vehicle autonomy, survivability and utility can provide an unprecedented impact on mission success and are one of the most desirable improvements for modern autonomous vehicles. We propose a general architecture of intelligent resource allocation, reconfigurable control and system restructuring for autonomous vehicles. The architecture is based on fault-tolerant control and lifetime prediction principles, and it provides improved vehicle survivability, extended service intervals, greater operational autonomy through lower rate of time-critical mission failures and lesser dependence on supplies and maintenance. The architecture enables mission distribution, adaptation and execution constrained on vehicle and payload faults and desirable lifetime. The proposed architecture will allow managing missions more efficiently by weighing vehicle capabilities versus mission objectives and replacing the vehicle only when it is necessary.
Summary: Experimental validation of real-time fault-tolerant systems

NASA Technical Reports Server (NTRS)

Iyer, R. K.; Choi, G. S.

1992-01-01

Testing and validation of real-time systems is always difficult to perform since neither the error generation process nor the fault propagation problem is easy to comprehend. There is no better substitute to results based on actual measurements and experimentation. Such results are essential for developing a rational basis for evaluation and validation of real-time systems. However, with physical experimentation, controllability and observability are limited to external instrumentation that can be hooked-up to the system under test. And this process is quite a difficult, if not impossible, task for a complex system. Also, to set up such experiments for measurements, physical hardware must exist. On the other hand, a simulation approach allows flexibility that is unequaled by any other existing method for system evaluation. A simulation methodology for system evaluation was successfully developed and implemented and the environment was demonstrated using existing real-time avionic systems. The research was oriented toward evaluating the impact of permanent and transient faults in aircraft control computers. Results were obtained for the Bendix BDX 930 system and Hamilton Standard EEC131 jet engine controller. The studies showed that simulated fault injection is valuable, in the design stage, to evaluate the susceptibility of computing sytems to different types of failures.
Fault tolerant data management system

NASA Technical Reports Server (NTRS)

Gustin, W. M.; Smither, M. A.

1972-01-01

Described in detail are: (1) results obtained in modifying the onboard data management system software to a multiprocessor fault tolerant system; (2) a functional description of the prototype buffer I/O units; (3) description of modification to the ACADC and stimuli generating unit of the DTS; and (4) summaries and conclusions on techniques implemented in the rack and prototype buffers. Also documented is the work done in investigating techniques of high speed (5 Mbps) digital data transmission in the data bus environment. The application considered is a multiport data bus operating with the following constraints: no preferred stations; random bus access by all stations; all stations equally likely to source or sink data; no limit to the number of stations along the bus; no branching of the bus; and no restriction on station placement along the bus.
Fault tolerant multi-sensor fusion based on the information gain

NASA Astrophysics Data System (ADS)

Hage, Joelle Al; El Najjar, Maan E.; Pomorski, Denis

2017-01-01

In the last decade, multi-robot systems are used in several applications like for example, the army, the intervention areas presenting danger to human life, the management of natural disasters, the environmental monitoring, exploration and agriculture. The integrity of localization of the robots must be ensured in order to achieve their mission in the best conditions. Robots are equipped with proprioceptive (encoders, gyroscope) and exteroceptive sensors (Kinect). However, these sensors could be affected by various faults types that can be assimilated to erroneous measurements, bias, outliers, drifts,… In absence of a sensor fault diagnosis step, the integrity and the continuity of the localization are affected. In this work, we present a muti-sensors fusion approach with Fault Detection and Exclusion (FDE) based on the information theory. In this context, we are interested by the information gain given by an observation which may be relevant when dealing with the fault tolerance aspect. Moreover, threshold optimization based on the quantity of information given by a decision on the true hypothesis is highlighted.
A survey of provably correct fault-tolerant clock synchronization techniques

NASA Technical Reports Server (NTRS)

Butler, Ricky W.

1988-01-01

Six provably correct fault-tolerant clock synchronization algorithms are examined. These algorithms are all presented in the same notation to permit easier comprehension and comparison. The advantages and disadvantages of the different techniques are examined and issues related to the implementation of these algorithms are discussed. The paper argues for the use of such algorithms in life-critical applications.
An approximation formula for a class of fault-tolerant computers

NASA Technical Reports Server (NTRS)

White, A. L.

1986-01-01

An approximation formula is derived for the probability of failure for fault-tolerant process-control computers. These computers use redundancy and reconfiguration to achieve high reliability. Finite-state Markov models capture the dynamic behavior of component failure and system recovery, and the approximation formula permits an estimation of system reliability by an easy examination of the model.
A fault-tolerant control architecture for unmanned aerial vehicles

NASA Astrophysics Data System (ADS)

Drozeski, Graham R.

Research has presented several approaches to achieve varying degrees of fault-tolerance in unmanned aircraft. Approaches in reconfigurable flight control are generally divided into two categories: those which incorporate multiple non-adaptive controllers and switch between them based on the output of a fault detection and identification element, and those that employ a single adaptive controller capable of compensating for a variety of fault modes. Regardless of the approach for reconfigurable flight control, certain fault modes dictate system restructuring in order to prevent a catastrophic failure. System restructuring enables active control of actuation not employed by the nominal system to recover controllability of the aircraft. After system restructuring, continued operation requires the generation of flight paths that adhere to an altered flight envelope. The control architecture developed in this research employs a multi-tiered hierarchy to allow unmanned aircraft to generate and track safe flight paths despite the occurrence of potentially catastrophic faults. The hierarchical architecture increases the level of autonomy of the system by integrating five functionalities with the baseline system: fault detection and identification, active system restructuring, reconfigurable flight control; reconfigurable path planning, and mission adaptation. Fault detection and identification algorithms continually monitor aircraft performance and issue fault declarations. When the severity of a fault exceeds the capability of the baseline flight controller, active system restructuring expands the controllability of the aircraft using unconventional control strategies not exploited by the baseline controller. Each of the reconfigurable flight controllers and the baseline controller employ a proven adaptive neural network control strategy. A reconfigurable path planner employs an adaptive model of the vehicle to re-shape the desired flight path. Generation of the revised
The use of automatic programming techniques for fault tolerant computing systems

NASA Technical Reports Server (NTRS)

Wild, C.

1985-01-01

It is conjectured that the production of software for ultra-reliable computing systems such as required by Space Station, aircraft, nuclear power plants and the like will require a high degree of automation as well as fault tolerance. In this paper, the relationship between automatic programming techniques and fault tolerant computing systems is explored. Initial efforts in the automatic synthesis of code from assertions to be used for error detection as well as the automatic generation of assertions and test cases from abstract data type specifications is outlined. Speculation on the ability to generate truly diverse designs capable of recovery from errors by exploring alternate paths in the program synthesis tree is discussed. Some initial thoughts on the use of knowledge based systems for the global detection of abnormal behavior using expectations and the goal-directed reconfiguration of resources to meet critical mission objectives are given. One of the sources of information for these systems would be the knowledge captured during the automatic programming process.
Trust index based fault tolerant multiple event localization algorithm for WSNs.

PubMed

Xu, Xianghua; Gao, Xueyong; Wan, Jian; Xiong, Naixue

2011-01-01

This paper investigates the use of wireless sensor networks for multiple event source localization using binary information from the sensor nodes. The events could continually emit signals whose strength is attenuated inversely proportional to the distance from the source. In this context, faults occur due to various reasons and are manifested when a node reports a wrong decision. In order to reduce the impact of node faults on the accuracy of multiple event localization, we introduce a trust index model to evaluate the fidelity of information which the nodes report and use in the event detection process, and propose the Trust Index based Subtract on Negative Add on Positive (TISNAP) localization algorithm, which reduces the impact of faulty nodes on the event localization by decreasing their trust index, to improve the accuracy of event localization and performance of fault tolerance for multiple event source localization. The algorithm includes three phases: first, the sink identifies the cluster nodes to determine the number of events occurred in the entire region by analyzing the binary data reported by all nodes; then, it constructs the likelihood matrix related to the cluster nodes and estimates the location of all events according to the alarmed status and trust index of the nodes around the cluster nodes. Finally, the sink updates the trust index of all nodes according to the fidelity of their information in the previous reporting cycle. The algorithm improves the accuracy of localization and performance of fault tolerance in multiple event source localization. The experiment results show that when the probability of node fault is close to 50%, the algorithm can still accurately determine the number of the events and have better accuracy of localization compared with other algorithms.
Trust Index Based Fault Tolerant Multiple Event Localization Algorithm for WSNs

PubMed Central

Xu, Xianghua; Gao, Xueyong; Wan, Jian; Xiong, Naixue

2011-01-01

This paper investigates the use of wireless sensor networks for multiple event source localization using binary information from the sensor nodes. The events could continually emit signals whose strength is attenuated inversely proportional to the distance from the source. In this context, faults occur due to various reasons and are manifested when a node reports a wrong decision. In order to reduce the impact of node faults on the accuracy of multiple event localization, we introduce a trust index model to evaluate the fidelity of information which the nodes report and use in the event detection process, and propose the Trust Index based Subtract on Negative Add on Positive (TISNAP) localization algorithm, which reduces the impact of faulty nodes on the event localization by decreasing their trust index, to improve the accuracy of event localization and performance of fault tolerance for multiple event source localization. The algorithm includes three phases: first, the sink identifies the cluster nodes to determine the number of events occurred in the entire region by analyzing the binary data reported by all nodes; then, it constructs the likelihood matrix related to the cluster nodes and estimates the location of all events according to the alarmed status and trust index of the nodes around the cluster nodes. Finally, the sink updates the trust index of all nodes according to the fidelity of their information in the previous reporting cycle. The algorithm improves the accuracy of localization and performance of fault tolerance in multiple event source localization. The experiment results show that when the probability of node fault is close to 50%, the algorithm can still accurately determine the number of the events and have better accuracy of localization compared with other algorithms. PMID:22163972
Dual-quaternion based fault-tolerant control for spacecraft formation flying with finite-time convergence.

PubMed

Dong, Hongyang; Hu, Qinglei; Ma, Guangfu

2016-03-01

Study results of developing control system for spacecraft formation proximity operations between a target and a chaser are presented. In particular, a coupled model using dual quaternion is employed to describe the proximity problem of spacecraft formation, and a nonlinear adaptive fault-tolerant feedback control law is developed to enable the chaser spacecraft to track the position and attitude of the target even though its actuator occurs fault. Multiple-task capability of the proposed control system is further demonstrated in the presence of disturbances and parametric uncertainties as well. In addition, the practical finite-time stability feature of the closed-loop system is guaranteed theoretically under the designed control law. Numerical simulation of the proposed method is presented to demonstrate the advantages with respect to interference suppression, fast tracking, fault tolerant and practical finite-time stability. Copyright © 2015 ISA. Published by Elsevier Ltd. All rights reserved.
Neural-Network-Based Adaptive Decentralized Fault-Tolerant Control for a Class of Interconnected Nonlinear Systems.

PubMed

Li, Xiao-Jian; Yang, Guang-Hong

2018-01-01

This paper is concerned with the adaptive decentralized fault-tolerant tracking control problem for a class of uncertain interconnected nonlinear systems with unknown strong interconnections. An algebraic graph theory result is introduced to address the considered interconnections. In addition, to achieve the desirable tracking performance, a neural-network-based robust adaptive decentralized fault-tolerant control (FTC) scheme is given to compensate the actuator faults and system uncertainties. Furthermore, via the Lyapunov analysis method, it is proven that all the signals of the resulting closed-loop system are semiglobally bounded, and the tracking errors of each subsystem exponentially converge to a compact set, whose radius is adjustable by choosing different controller design parameters. Finally, the effectiveness and advantages of the proposed FTC approach are illustrated with two simulated examples.
FTMP (Fault Tolerant Multiprocessor) programmer's manual

NASA Technical Reports Server (NTRS)

Feather, F. E.; Liceaga, C. A.; Padilla, P. A.

1986-01-01

The Fault Tolerant Multiprocessor (FTMP) computer system was constructed using the Rockwell/Collins CAPS-6 processor. It is installed in the Avionics Integration Research Laboratory (AIRLAB) of NASA Langley Research Center. It is hosted by AIRLAB's System 10, a VAX 11/750, for the loading of programs and experimentation. The FTMP support software includes a cross compiler for a high level language called Automated Engineering Design (AED) System, an assembler for the CAPS-6 processor assembly language, and a linker. Access to this support software is through an automated remote access facility on the VAX which relieves the user of the burden of learning how to use the IBM 4381. This manual is a compilation of information about the FTMP support environment. It explains the FTMP software and support environment along many of the finer points of running programs on FTMP. This will be helpful to the researcher trying to run an experiment on FTMP and even to the person probing FTMP with fault injections. Much of the information in this manual can be found in other sources; we are only attempting to bring together the basic points in a single source. If the reader should need points clarified, there is a list of support documentation in the back of this manual.
Investigation of the applicability of a functional programming model to fault-tolerant parallel processing for knowledge-based systems

NASA Technical Reports Server (NTRS)

Harper, Richard

1989-01-01

In a fault-tolerant parallel computer, a functional programming model can facilitate distributed checkpointing, error recovery, load balancing, and graceful degradation. Such a model has been implemented on the Draper Fault-Tolerant Parallel Processor (FTPP). When used in conjunction with the FTPP's fault detection and masking capabilities, this implementation results in a graceful degradation of system performance after faults. Three graceful degradation algorithms have been implemented and are presented. A user interface has been implemented which requires minimal cognitive overhead by the application programmer, masking such complexities as the system's redundancy, distributed nature, variable complement of processing resources, load balancing, fault occurrence and recovery. This user interface is described and its use demonstrated. The applicability of the functional programming style to the Activation Framework, a paradigm for intelligent systems, is then briefly described.
High Speed Operation and Testing of a Fault Tolerant Magnetic Bearing

NASA Technical Reports Server (NTRS)

DeWitt, Kenneth; Clark, Daniel

2004-01-01

Research activities undertaken to upgrade the fault-tolerant facility, continue testing high-speed fault-tolerant operation, and assist in the commission of the high temperature (1000 degrees F) thrust magnetic bearing as described. The fault-tolerant magnetic bearing test facility was upgraded to operate to 40,000 RPM. The necessary upgrades included new state-of-the art position sensors with high frequency modulation and new power edge filtering of amplifier outputs. A comparison study of the new sensors and the previous system was done as well as a noise assessment of the sensor-to-controller signals. Also a comparison study of power edge filtering for amplifier-to-actuator signals was done; this information is valuable for all position sensing and motor actuation applications. After these facility upgrades were completed, the rig is believed to have capabilities for 40,000 RPM operation, though this has yet to be demonstrated. Other upgrades included verification and upgrading of safety shielding, and upgrading control algorithms. The rig will now also be used to demonstrate motoring capabilities and control algorithms are in the process of being created. Recently an extreme temperature thrust magnetic bearing was designed from the ground up. The thrust bearing was designed to fit within the existing high temperature facility. The retrofit began near the end of the summer, 04, and continues currently. Contract staff authored a NASA-TM entitled "An Overview of Magnetic Bearing Technology for Gas Turbine Engines", containing a compilation of bearing data as it pertains to operation in the regime of the gas turbine engine and a presentation of how magnetic bearings can become a viable candidate for use in future engine technology.
Cost and benefits design optimization model for fault tolerant flight control systems

NASA Technical Reports Server (NTRS)

Rose, J.

1982-01-01

Requirements and specifications for a method of optimizing the design of fault-tolerant flight control systems are provided. Algorithms that could be used for developing new and modifying existing computer programs are also provided, with recommendations for follow-on work.
The Design of Fault Tolerant Quantum Dot Cellular Automata Based Logic

NASA Technical Reports Server (NTRS)

Armstrong, C. Duane; Humphreys, William M.; Fijany, Amir

2002-01-01

As transistor geometries are reduced, quantum effects begin to dominate device performance. At some point, transistors cease to have the properties that make them useful computational components. New computing elements must be developed in order to keep pace with Moore s Law. Quantum dot cellular automata (QCA) represent an alternative paradigm to transistor-based logic. QCA architectures that are robust to manufacturing tolerances and defects must be developed. We are developing software that allows the exploration of fault tolerant QCA gate architectures by automating the specification, simulation, analysis and documentation processes.

An Autonomous Distributed Fault-Tolerant Local Positioning System

NASA Technical Reports Server (NTRS)

Malekpour, Mahyar R.

2017-01-01

We describe a fault-tolerant, GPS-independent (Global Positioning System) distributed autonomous positioning system for static/mobile objects and present solutions for providing highly-accurate geo-location data for the static/mobile objects in dynamic environments. The reliability and accuracy of a positioning system fundamentally depends on two factors; its timeliness in broadcasting signals and the knowledge of its geometry, i.e., locations and distances of the beacons. Existing distributed positioning systems either synchronize to a common external source like GPS or establish their own time synchrony using a scheme similar to a master-slave by designating a particular beacon as the master and other beacons synchronize to it, resulting in a single point of failure. Another drawback of existing positioning systems is their lack of addressing various fault manifestations, in particular, communication link failures, which, as in wireless networks, are increasingly dominating the process failures and are typically transient and mobile, in the sense that they typically affect different messages to/from different processes over time.
Roads towards fault-tolerant universal quantum computation

NASA Astrophysics Data System (ADS)

Campbell, Earl T.; Terhal, Barbara M.; Vuillot, Christophe

2017-09-01

A practical quantum computer must not merely store information, but also process it. To prevent errors introduced by noise from multiplying and spreading, a fault-tolerant computational architecture is required. Current experiments are taking the first steps toward noise-resilient logical qubits. But to convert these quantum devices from memories to processors, it is necessary to specify how a universal set of gates is performed on them. The leading proposals for doing so, such as magic-state distillation and colour-code techniques, have high resource demands. Alternative schemes, such as those that use high-dimensional quantum codes in a modular architecture, have potential benefits, but need to be explored further.
Roads towards fault-tolerant universal quantum computation.

PubMed

Campbell, Earl T; Terhal, Barbara M; Vuillot, Christophe

2017-09-13

A practical quantum computer must not merely store information, but also process it. To prevent errors introduced by noise from multiplying and spreading, a fault-tolerant computational architecture is required. Current experiments are taking the first steps toward noise-resilient logical qubits. But to convert these quantum devices from memories to processors, it is necessary to specify how a universal set of gates is performed on them. The leading proposals for doing so, such as magic-state distillation and colour-code techniques, have high resource demands. Alternative schemes, such as those that use high-dimensional quantum codes in a modular architecture, have potential benefits, but need to be explored further.
Validation of fault-free behavior of a reliable multiprocessor system - FTMP: A case study. [Fault-Tolerant Multi-Processor avionics

NASA Technical Reports Server (NTRS)

Clune, E.; Segall, Z.; Siewiorek, D.

1984-01-01

A program of experiments has been conducted at NASA-Langley to test the fault-free performance of a Fault-Tolerant Multiprocessor (FTMP) avionics system for next-generation aircraft. Baseline measurements of an operating FTMP system were obtained with respect to the following parameters: instruction execution time, frame size, and the variation of clock ticks. The mechanisms of frame stretching were also investigated. The experimental results are summarized in a table. Areas of interest for future tests are identified, with emphasis given to the implementation of a synthetic workload generation mechanism on FTMP.
Fault tolerant onboard packet switch architecture for communication satellites: Shared memory per beam approach

NASA Technical Reports Server (NTRS)

Shalkhauser, Mary JO; Quintana, Jorge A.; Soni, Nitin J.

1994-01-01

The NASA Lewis Research Center is developing a multichannel communication signal processing satellite (MCSPS) system which will provide low data rate, direct to user, commercial communications services. The focus of current space segment developments is a flexible, high-throughput, fault tolerant onboard information switching processor. This information switching processor (ISP) is a destination-directed packet switch which performs both space and time switching to route user information among numerous user ground terminals. Through both industry study contracts and in-house investigations, several packet switching architectures were examined. A contention-free approach, the shared memory per beam architecture, was selected for implementation. The shared memory per beam architecture, fault tolerance insertion, implementation, and demonstration plans are described.
Sensor fault-tolerant control for gear-shifting engaging process of automated manual transmission

NASA Astrophysics Data System (ADS)

Li, Liang; He, Kai; Wang, Xiangyu; Liu, Yahui

2018-01-01

Angular displacement sensor on the actuator of automated manual transmission (AMT) is sensitive to fault, and the sensor fault will disturb its normal control, which affects the entire gear-shifting process of AMT and results in awful riding comfort. In order to solve this problem, this paper proposes a method of fault-tolerant control for AMT gear-shifting engaging process. By using the measured current of actuator motor and angular displacement of actuator, the gear-shifting engaging load torque table is built and updated before the occurrence of the sensor fault. Meanwhile, residual between estimated and measured angular displacements is used to detect the sensor fault. Once the residual exceeds a determined fault threshold, the sensor fault is detected. Then, switch control is triggered, and the current observer and load torque table estimates an actual gear-shifting position to replace the measured one to continue controlling the gear-shifting process. Numerical and experiment tests are carried out to evaluate the reliability and feasibility of proposed methods, and the results show that the performance of estimation and control is satisfactory.
Design of a fault tolerant airborne digital computer. Volume 1: Architecture

NASA Technical Reports Server (NTRS)

Wensley, J. H.; Levitt, K. N.; Green, M. W.; Goldberg, J.; Neumann, P. G.

1973-01-01

This volume is concerned with the architecture of a fault tolerant digital computer for an advanced commercial aircraft. All of the computations of the aircraft, including those presently carried out by analogue techniques, are to be carried out in this digital computer. Among the important qualities of the computer are the following: (1) The capacity is to be matched to the aircraft environment. (2) The reliability is to be selectively matched to the criticality and deadline requirements of each of the computations. (3) The system is to be readily expandable. contractible, and (4) The design is to appropriate to post 1975 technology. Three candidate architectures are discussed and assessed in terms of the above qualities. Of the three candidates, a newly conceived architecture, Software Implemented Fault Tolerance (SIFT), provides the best match to the above qualities. In addition SIFT is particularly simple and believable. The other candidates, Bus Checker System (BUCS), also newly conceived in this project, and the Hopkins multiprocessor are potentially more efficient than SIFT in the use of redundancy, but otherwise are not as attractive.
Distributed fault-tolerant time-varying formation control for high-order linear multi-agent systems with actuator failures.

PubMed

Hua, Yongzhao; Dong, Xiwang; Li, Qingdong; Ren, Zhang

2017-11-01

This paper investigates the fault-tolerant time-varying formation control problems for high-order linear multi-agent systems in the presence of actuator failures. Firstly, a fully distributed formation control protocol is presented to compensate for the influences of both bias fault and loss of effectiveness fault. Using the adaptive online updating strategies, no global knowledge about the communication topology is required and the bounds of actuator failures can be unknown. Then an algorithm is proposed to determine the control parameters of the fault-tolerant formation protocol, where the time-varying formation feasible conditions and an approach to expand the feasible formation set are given. Furthermore, the stability of the proposed algorithm is proven based on the Lyapunov-like theory. Finally, two simulation examples are given to demonstrate the effectiveness of the theoretical results. Copyright © 2017 ISA. Published by Elsevier Ltd. All rights reserved.
Observer-Based Adaptive Fault-Tolerant Tracking Control of Nonlinear Nonstrict-Feedback Systems.

PubMed

Wu, Chengwei; Liu, Jianxing; Xiong, Yongyang; Wu, Ligang

2017-06-28

This paper studies an output-based adaptive fault-tolerant control problem for nonlinear systems with nonstrict-feedback form. Neural networks are utilized to identify the unknown nonlinear characteristics in the system. An observer and a general fault model are constructed to estimate the unavailable states and describe the fault, respectively. Adaptive parameters are constructed to overcome the difficulties in the design process for nonstrict-feedback systems. Meanwhile, dynamic surface control technique is introduced to avoid the problem of ''explosion of complexity''. Furthermore, based on adaptive backstepping control method, an output-based adaptive neural tracking control strategy is developed for the considered system against actuator fault, which can ensure that all the signals in the resulting closed-loop system are bounded, and the system output signal can be regulated to follow the response of the given reference signal with a small error. Finally, the simulation results are provided to validate the effectiveness of the control strategy proposed in this paper.
An Analysis of Failure Handling in Chameleon, A Framework for Supporting Cost-Effective Fault Tolerant Services

NASA Technical Reports Server (NTRS)

Haakensen, Erik Edward

1998-01-01

The desire for low-cost reliable computing is increasing. Most current fault tolerant computing solutions are not very flexible, i.e., they cannot adapt to reliability requirements of newly emerging applications in business, commerce, and manufacturing. It is important that users have a flexible, reliable platform to support both critical and noncritical applications. Chameleon, under development at the Center for Reliable and High-Performance Computing at the University of Illinois, is a software framework. for supporting cost-effective adaptable networked fault tolerant service. This thesis details a simulation of fault injection, detection, and recovery in Chameleon. The simulation was written in C++ using the DEPEND simulation library. The results obtained from the simulation included the amount of overhead incurred by the fault detection and recovery mechanisms supported by Chameleon. In addition, information about fault scenarios from which Chameleon cannot recover was gained. The results of the simulation showed that both critical and noncritical applications can be executed in the Chameleon environment with a fairly small amount of overhead. No single point of failure from which Chameleon could not recover was found. Chameleon was also found to be capable of recovering from several multiple failure scenarios.
Robust fault tolerant control based on sliding mode method for uncertain linear systems with quantization.

PubMed

Hao, Li-Ying; Yang, Guang-Hong

2013-09-01

This paper is concerned with the problem of robust fault-tolerant compensation control problem for uncertain linear systems subject to both state and input signal quantization. By incorporating novel matrix full-rank factorization technique with sliding surface design successfully, the total failure of certain actuators can be coped with, under a special actuator redundancy assumption. In order to compensate for quantization errors, an adjustment range of quantization sensitivity for a dynamic uniform quantizer is given through the flexible choices of design parameters. Comparing with the existing results, the derived inequality condition leads to the fault tolerance ability stronger and much wider scope of applicability. With a static adjustment policy of quantization sensitivity, an adaptive sliding mode controller is then designed to maintain the sliding mode, where the gain of the nonlinear unit vector term is updated automatically to compensate for the effects of actuator faults, quantization errors, exogenous disturbances and parameter uncertainties without the need for a fault detection and isolation (FDI) mechanism. Finally, the effectiveness of the proposed design method is illustrated via a model of a rocket fairing structural-acoustic. Copyright © 2013 ISA. Published by Elsevier Ltd. All rights reserved.
Techniques for modeling the reliability of fault-tolerant systems with the Markov state-space approach

NASA Technical Reports Server (NTRS)

Butler, Ricky W.; Johnson, Sally C.

1995-01-01

This paper presents a step-by-step tutorial of the methods and the tools that were used for the reliability analysis of fault-tolerant systems. The approach used in this paper is the Markov (or semi-Markov) state-space method. The paper is intended for design engineers with a basic understanding of computer architecture and fault tolerance, but little knowledge of reliability modeling. The representation of architectural features in mathematical models is emphasized. This paper does not present details of the mathematical solution of complex reliability models. Instead, it describes the use of several recently developed computer programs SURE, ASSIST, STEM, and PAWS that automate the generation and the solution of these models.
Influence of slot number and pole number in fault-tolerant brushless dc motors having unequal tooth widths

NASA Astrophysics Data System (ADS)

Ishak, D.; Zhu, Z. Q.; Howe, D.

2005-05-01

The electromagnetic performance of fault-tolerant three-phase permanent magnet brushless dc motors, in which the wound teeth are wider than the unwound teeth and their tooth tips span approximately one pole pitch and which have similar numbers of slots and poles, is investigated. It is shown that they have a more trapezoidal phase back-emf wave form, a higher torque capability, and a lower torque ripple than similar fault-tolerant machines with equal tooth widths. However, these benefits gradually diminish as the pole number is increased, due to the effect of interpole leakage flux.
COTS-Based Fault Tolerance in Deep Space: Qualitative and Quantitative Analyses of a Bus Network Architecture

NASA Technical Reports Server (NTRS)

Tai, Ann T.; Chau, Savio N.; Alkalai, Leon

2000-01-01

Using COTS products, standards and intellectual properties (IPs) for all the system and component interfaces is a crucial step toward significant reduction of both system cost and development cost as the COTS interfaces enable other COTS products and IPs to be readily accommodated by the target system architecture. With respect to the long-term survivable systems for deep-space missions, the major challenge for us is, under stringent power and mass constraints, to achieve ultra-high reliability of the system comprising COTS products and standards that are not developed for mission-critical applications. The spirit of our solution is to exploit the pertinent standard features of a COTS product to circumvent its shortcomings, though these standard features may not be originally designed for highly reliable systems. In this paper, we discuss our experiences and findings on the design of an IEEE 1394 compliant fault-tolerant COTS-based bus architecture. We first derive and qualitatively analyze a -'stacktree topology" that not only complies with IEEE 1394 but also enables the implementation of a fault-tolerant bus architecture without node redundancy. We then present a quantitative evaluation that demonstrates significant reliability improvement from the COTS-based fault tolerance.
A fault-tolerant information processing concept for space vehicles.

NASA Technical Reports Server (NTRS)

Hopkins, A. L., Jr.

1971-01-01

A distributed fault-tolerant information processing system is proposed, comprising a central multiprocessor, dedicated local processors, and multiplexed input-output buses connecting them together. The processors in the multiprocessor are duplicated for error detection, which is felt to be less expensive than using coded redundancy of comparable effectiveness. Error recovery is made possible by a triplicated scratchpad memory in each processor. The main multiprocessor memory uses replicated memory for error detection and correction. Local processors use any of three conventional redundancy techniques: voting, duplex pairs with backup, and duplex pairs in independent subsystems.
Advanced reliability modeling of fault-tolerant computer-based systems

NASA Technical Reports Server (NTRS)

Bavuso, S. J.

1982-01-01

Two methodologies for the reliability assessment of fault tolerant digital computer based systems are discussed. The computer-aided reliability estimation 3 (CARE 3) and gate logic software simulation (GLOSS) are assessment technologies that were developed to mitigate a serious weakness in the design and evaluation process of ultrareliable digital systems. The weak link is based on the unavailability of a sufficiently powerful modeling technique for comparing the stochastic attributes of one system against others. Some of the more interesting attributes are reliability, system survival, safety, and mission success.
Method and apparatus for fault tolerance

NASA Technical Reports Server (NTRS)

Masson, Gerald M. (Inventor); Sullivan, Gregory F. (Inventor)

1993-01-01

A method and apparatus for achieving fault tolerance in a computer system having at least a first central processing unit and a second central processing unit. The method comprises the steps of first executing a first algorithm in the first central processing unit on input which produces a first output as well as a certification trail. Next, executing a second algorithm in the second central processing unit on the input and on at least a portion of the certification trail which produces a second output. The second algorithm has a faster execution time than the first algorithm for a given input. Then, comparing the first and second outputs such that an error result is produced if the first and second outputs are not the same. The step of executing a first algorithm and the step of executing a second algorithm preferably takes place over essentially the same time period.
Implementation Of The Configurable Fault Tolerant System Experiment On NPSAT 1

DTIC Science & Technology

2016-03-01

REPORT TYPE AND DATES COVERED Master’s thesis 4. TITLE AND SUBTITLE IMPLEMENTATION OF THE CONFIGURABLE FAULT TOLERANT SYSTEM EXPERIMENT ON NPSAT...open-source microprocessor without interlocked pipeline stages (MIPS) based processor softcore, a cached memory structure capable of accessing double...data rate type three and secure digital card memories, an interface to the main satellite bus, and XILINX’s soft error mitigation softcore. The
Fault-tolerant control of large space structures using the stable factorization approach

NASA Technical Reports Server (NTRS)

Razavi, H. C.; Mehra, R. K.; Vidyasagar, M.

1986-01-01

Large space structures are characterized by the following features: they are in general infinite-dimensional systems, and have large numbers of undamped or lightly damped poles. Any attempt to apply linear control theory to large space structures must therefore take into account these features. Phase I consisted of an attempt to apply the recently developed Stable Factorization (SF) design philosophy to problems of large space structures, with particular attention to the aspects of robustness and fault tolerance. The final report on the Phase I effort consists of four sections, each devoted to one task. The first three sections report theoretical results, while the last consists of a design example. Significant results were obtained in all four tasks of the project. More specifically, an innovative approach to order reduction was obtained, stabilizing controller structures for plants with an infinite number of unstable poles were determined under some conditions, conditions for simultaneous stabilizability of an infinite number of plants were explored, and a fault tolerance controller design that stabilizes a flexible structure model was obtained which is robust against one failure condition.
Reconfigurable fault tolerant avionics system

NASA Astrophysics Data System (ADS)

Ibrahim, M. M.; Asami, K.; Cho, Mengu

This paper presents the design of a reconfigurable avionics system based on modern Static Random Access Memory (SRAM)-based Field Programmable Gate Array (FPGA) to be used in future generations of nano satellites. A major concern in satellite systems and especially nano satellites is to build robust systems with low-power consumption profiles. The system is designed to be flexible by providing the capability of reconfiguring itself based on its orbital position. As Single Event Upsets (SEU) do not have the same severity and intensity in all orbital locations, having the maximum at the South Atlantic Anomaly (SAA) and the polar cusps, the system does not have to be fully protected all the time in its orbit. An acceptable level of protection against high-energy cosmic rays and charged particles roaming in space is provided within the majority of the orbit through software fault tolerance. Check pointing and roll back, besides control flow assertions, is used for that level of protection. In the minority part of the orbit where severe SEUs are expected to exist, a reconfiguration for the system FPGA is initiated where the processor systems are triplicated and protection through Triple Modular Redundancy (TMR) with feedback is provided. This technique of reconfiguring the system as per the level of the threat expected from SEU-induced faults helps in reducing the average dynamic power consumption of the system to one-third of its maximum. This technique can be viewed as a smart protection through system reconfiguration. The system is built on the commercial version of the (XC5VLX50) Xilinx Virtex5 FPGA on bulk silicon with 324 IO. Simulations of orbit SEU rates were carried out using the SPENVIS web-based software package.

Fault-Tolerant and Elastic Streaming MapReduce with Decentralized Coordination

DOE Office of Scientific and Technical Information (OSTI.GOV)

Kumbhare, Alok; Frincu, Marc; Simmhan, Yogesh

2015-06-29

The MapReduce programming model, due to its simplicity and scalability, has become an essential tool for processing large data volumes in distributed environments. Recent Stream Processing Systems (SPS) extend this model to provide low-latency analysis of high-velocity continuous data streams. However, integrating MapReduce with streaming poses challenges: first, the runtime variations in data characteristics such as data-rates and key-distribution cause resource overload, that inturn leads to fluctuations in the Quality of the Service (QoS); and second, the stateful reducers, whose state depends on the complete tuple history, necessitates efficient fault-recovery mechanisms to maintain the desired QoS in the presence ofmore » resource failures. We propose an integrated streaming MapReduce architecture leveraging the concept of consistent hashing to support runtime elasticity along with locality-aware data and state replication to provide efficient load-balancing with low-overhead fault-tolerance and parallel fault-recovery from multiple simultaneous failures. Our evaluation on a private cloud shows up to 2:8 improvement in peak throughput compared to Apache Storm SPS, and a low recovery latency of 700 -1500 ms from multiple failures.« less
Adaptive Fuzzy Output-Constrained Fault-Tolerant Control of Nonlinear Stochastic Large-Scale Systems With Actuator Faults.

PubMed

Li, Yongming; Ma, Zhiyao; Tong, Shaocheng

2017-09-01

The problem of adaptive fuzzy output-constrained tracking fault-tolerant control (FTC) is investigated for the large-scale stochastic nonlinear systems of pure-feedback form. The nonlinear systems considered in this paper possess the unstructured uncertainties, unknown interconnected terms and unknown nonaffine nonlinear faults. The fuzzy logic systems are employed to identify the unknown lumped nonlinear functions so that the problems of structured uncertainties can be solved. An adaptive fuzzy state observer is designed to solve the nonmeasurable state problem. By combining the barrier Lyapunov function theory, adaptive decentralized and stochastic control principles, a novel fuzzy adaptive output-constrained FTC approach is constructed. All the signals in the closed-loop system are proved to be bounded in probability and the system outputs are constrained in a given compact set. Finally, the applicability of the proposed controller is well carried out by a simulation example.
An improved ant colony optimization algorithm with fault tolerance for job scheduling in grid computing systems

PubMed Central

Idris, Hajara; Junaidu, Sahalu B.; Adewumi, Aderemi O.

2017-01-01

The Grid scheduler, schedules user jobs on the best available resource in terms of resource characteristics by optimizing job execution time. Resource failure in Grid is no longer an exception but a regular occurring event as resources are increasingly being used by the scientific community to solve computationally intensive problems which typically run for days or even months. It is therefore absolutely essential that these long-running applications are able to tolerate failures and avoid re-computations from scratch after resource failure has occurred, to satisfy the user’s Quality of Service (QoS) requirement. Job Scheduling with Fault Tolerance in Grid Computing using Ant Colony Optimization is proposed to ensure that jobs are executed successfully even when resource failure has occurred. The technique employed in this paper, is the use of resource failure rate, as well as checkpoint-based roll back recovery strategy. Check-pointing aims at reducing the amount of work that is lost upon failure of the system by immediately saving the state of the system. A comparison of the proposed approach with an existing Ant Colony Optimization (ACO) algorithm is discussed. The experimental results of the implemented Fault Tolerance scheduling algorithm show that there is an improvement in the user’s QoS requirement over the existing ACO algorithm, which has no fault tolerance integrated in it. The performance evaluation of the two algorithms was measured in terms of the three main scheduling performance metrics: makespan, throughput and average turnaround time. PMID:28545075
A hybrid robust fault tolerant control based on adaptive joint unscented Kalman filter.

PubMed

Shabbouei Hagh, Yashar; Mohammadi Asl, Reza; Cocquempot, Vincent

2017-01-01

In this paper, a new hybrid robust fault tolerant control scheme is proposed. A robust H ∞ control law is used in non-faulty situation, while a Non-Singular Terminal Sliding Mode (NTSM) controller is activated as soon as an actuator fault is detected. Since a linear robust controller is designed, the system is first linearized through the feedback linearization method. To switch from one controller to the other, a fuzzy based switching system is used. An Adaptive Joint Unscented Kalman Filter (AJUKF) is used for fault detection and diagnosis. The proposed method is based on the simultaneous estimation of the system states and parameters. In order to show the efficiency of the proposed scheme, a simulated 3-DOF robotic manipulator is used. Copyright © 2016 ISA. Published by Elsevier Ltd. All rights reserved.
Holonomic surface codes for fault-tolerant quantum computation

NASA Astrophysics Data System (ADS)

Zhang, Jiang; Devitt, Simon J.; You, J. Q.; Nori, Franco

2018-02-01

Surface codes can protect quantum information stored in qubits from local errors as long as the per-operation error rate is below a certain threshold. Here we propose holonomic surface codes by harnessing the quantum holonomy of the system. In our scheme, the holonomic gates are built via auxiliary qubits rather than the auxiliary levels in multilevel systems used in conventional holonomic quantum computation. The key advantage of our approach is that the auxiliary qubits are in their ground state before and after each gate operation, so they are not involved in the operation cycles of surface codes. This provides an advantageous way to implement surface codes for fault-tolerant quantum computation.
Fault tolerant vector control of induction motor drive

NASA Astrophysics Data System (ADS)

Odnokopylov, G.; Bragin, A.

2014-10-01

For electric composed of technical objects hazardous industries, such as nuclear, military, chemical, etc. an urgent task is to increase their resiliency and survivability. The construction principle of vector control system fault-tolerant asynchronous electric. Displaying recovery efficiency three-phase induction motor drive in emergency mode using two-phase vector control system. The process of formation of a simulation model of the asynchronous electric unbalance in emergency mode. When modeling used coordinate transformation, providing emergency operation electric unbalance work. The results of modeling transient phase loss motor stator. During a power failure phase induction motor cannot save circular rotating field in the air gap of the motor and ensure the restoration of its efficiency at rated torque and speed.
A Test Generation Framework for Distributed Fault-Tolerant Algorithms

NASA Technical Reports Server (NTRS)

Goodloe, Alwyn; Bushnell, David; Miner, Paul; Pasareanu, Corina S.

2009-01-01

Heavyweight formal methods such as theorem proving have been successfully applied to the analysis of safety critical fault-tolerant systems. Typically, the models and proofs performed during such analysis do not inform the testing process of actual implementations. We propose a framework for generating test vectors from specifications written in the Prototype Verification System (PVS). The methodology uses a translator to produce a Java prototype from a PVS specification. Symbolic (Java) PathFinder is then employed to generate a collection of test cases. A small example is employed to illustrate how the framework can be used in practice.
Fault-tolerant logical gates in quantum error-correcting codes

NASA Astrophysics Data System (ADS)

Pastawski, Fernando; Yoshida, Beni

2015-01-01

Recently, S. Bravyi and R. König [Phys. Rev. Lett. 110, 170503 (2013), 10.1103/PhysRevLett.110.170503] have shown that there is a trade-off between fault-tolerantly implementable logical gates and geometric locality of stabilizer codes. They consider locality-preserving operations which are implemented by a constant-depth geometrically local circuit and are thus fault tolerant by construction. In particular, they show that, for local stabilizer codes in D spatial dimensions, locality-preserving gates are restricted to a set of unitary gates known as the D th level of the Clifford hierarchy. In this paper, we explore this idea further by providing several extensions and applications of their characterization to qubit stabilizer and subsystem codes. First, we present a no-go theorem for self-correcting quantum memory. Namely, we prove that a three-dimensional stabilizer Hamiltonian with a locality-preserving implementation of a non-Clifford gate cannot have a macroscopic energy barrier. This result implies that non-Clifford gates do not admit such implementations in Haah's cubic code and Michnicki's welded code. Second, we prove that the code distance of a D -dimensional local stabilizer code with a nontrivial locality-preserving m th -level Clifford logical gate is upper bounded by O (LD +1 -m) . For codes with non-Clifford gates (m >2 ), this improves the previous best bound by S. Bravyi and B. Terhal [New. J. Phys. 11, 043029 (2009), 10.1088/1367-2630/11/4/043029]. Topological color codes, introduced by H. Bombin and M. A. Martin-Delgado [Phys. Rev. Lett. 97, 180501 (2006), 10.1103/PhysRevLett.97.180501; Phys. Rev. Lett. 98, 160502 (2007), 10.1103/PhysRevLett.98.160502; Phys. Rev. B 75, 075103 (2007), 10.1103/PhysRevB.75.075103], saturate the bound for m =D . Third, we prove that the qubit erasure threshold for codes with a nontrivial transversal m th -level Clifford logical gate is upper bounded by 1 /m . This implies that no family of fault-tolerant codes with
Experiments in fault tolerant software reliability

NASA Technical Reports Server (NTRS)

Mcallister, David F.; Vouk, Mladen A.

1989-01-01

Twenty functionally equivalent programs were built and tested in a multiversion software experiment. Following unit testing, all programs were subjected to an extensive system test. In the process sixty-one distinct faults were identified among the versions. Less than 12 percent of the faults exhibited varying degrees of positive correlation. The common-cause (or similar) faults spanned as many as 14 components. However, a majority of these faults were trivial, and easily detected by proper unit and/or system testing. Only two of the seven similar faults were difficult faults, and both were caused by specification ambiguities. One of these faults exhibited variable identical-and-wrong response span, i.e. response span which varied with the testing conditions and input data. Techniques that could have been used to avoid the faults are discussed. For example, it was determined that back-to-back testing of 2-tuples could have been used to eliminate about 90 percent of the faults. In addition, four of the seven similar faults could have been detected by using back-to-back testing of 5-tuples. It is believed that most, if not all, similar faults could have been avoided had the specifications been written using more formal notation, the unit testing phase was subject to more stringent standards and controls, and better tools for measuring the quality and adequacy of the test data (e.g. coverage) were used.
Different-Level Simultaneous Minimization Scheme for Fault Tolerance of Redundant Manipulator Aided with Discrete-Time Recurrent Neural Network

PubMed Central

Jin, Long; Liao, Bolin; Liu, Mei; Xiao, Lin; Guo, Dongsheng; Yan, Xiaogang

2017-01-01

By incorporating the physical constraints in joint space, a different-level simultaneous minimization scheme, which takes both the robot kinematics and robot dynamics into account, is presented and investigated for fault-tolerant motion planning of redundant manipulator in this paper. The scheme is reformulated as a quadratic program (QP) with equality and bound constraints, which is then solved by a discrete-time recurrent neural network. Simulative verifications based on a six-link planar redundant robot manipulator substantiate the efficacy and accuracy of the presented acceleration fault-tolerant scheme, the resultant QP and the corresponding discrete-time recurrent neural network. PMID:28955217
Indirect adaptive fuzzy fault-tolerant tracking control for MIMO nonlinear systems with actuator and sensor failures.

PubMed

Bounemeur, Abdelhamid; Chemachema, Mohamed; Essounbouli, Najib

2018-05-10

In this paper, an active fuzzy fault tolerant tracking control (AFFTTC) scheme is developed for a class of multi-input multi-output (MIMO) unknown nonlinear systems in the presence of unknown actuator faults, sensor failures and external disturbance. The developed control scheme deals with four kinds of faults for both sensors and actuators. The bias, drift, and loss of accuracy additive faults are considered along with the loss of effectiveness multiplicative fault. A fuzzy adaptive controller based on back-stepping design is developed to deal with actuator failures and unknown system dynamics. However, an additional robust control term is added to deal with sensor faults, approximation errors, and external disturbances. Lyapunov theory is used to prove the stability of the closed loop system. Numerical simulations on a quadrotor are presented to show the effectiveness of the proposed approach. Copyright © 2018 ISA. Published by Elsevier Ltd. All rights reserved.
Design and analysis of new fault-tolerant permanent magnet motors for four-wheel-driving electric vehicles

NASA Astrophysics Data System (ADS)

Liu, Guohai; Gong, Wensheng; Chen, Qian; Jian, Linni; Shen, Yue; Zhao, Wenxiang

2012-04-01

In this paper, a novel in-wheel permanent-magnet (PM) motor for four-wheel-driving electrical vehicles is proposed. It adopts an outer-rotor topology, which can help generate a large drive torque, in order to achieve prominent dynamic performance of the vehicle. Moreover, by adopting single-layer concentrated-windings, fault-tolerant teeth, and the optimal combination of slot and pole numbers, the proposed motor inherently offers negligible electromagnetic coupling between different phase windings, hence, it possesses a fault-tolerant characteristic. Meanwhile, the phase back electromotive force waveforms can be designed to be sinusoidal by employing PMs with a trapezoidal shape, eccentric armature teeth, and unequal tooth widths. The electromagnetic performance is comprehensively investigated and the optimal design is conducted by using the finite-element method.
Noise Threshold and Resource Cost of Fault-Tolerant Quantum Computing with Majorana Fermions in Hybrid Systems.

PubMed

Li, Ying

2016-09-16

Fault-tolerant quantum computing in systems composed of both Majorana fermions and topologically unprotected quantum systems, e.g., superconducting circuits or quantum dots, is studied in this Letter. Errors caused by topologically unprotected quantum systems need to be corrected with error-correction schemes, for instance, the surface code. We find that the error-correction performance of such a hybrid topological quantum computer is not superior to a normal quantum computer unless the topological charge of Majorana fermions is insusceptible to noise. If errors changing the topological charge are rare, the fault-tolerance threshold is much higher than the threshold of a normal quantum computer and a surface-code logical qubit could be encoded in only tens of topological qubits instead of about 1,000 normal qubits.
A review of fault tolerant control strategies applied to proton exchange membrane fuel cell systems

NASA Astrophysics Data System (ADS)

Dijoux, Etienne; Steiner, Nadia Yousfi; Benne, Michel; Péra, Marie-Cécile; Pérez, Brigitte Grondin

2017-08-01

Fuel cells are powerful systems for power generation. They have a good efficiency and do not generate greenhouse gases. This technology involves a lot of scientific fields, which leads to the appearance of strongly inter-dependent parameters. This makes the system particularly hard to control and increases fault's occurrence frequency. These two issues call for the necessity to maintain the system performance at the expected level, even in faulty operating conditions. It is called "fault tolerant control" (FTC). The present paper aims to give the state of the art of FTC applied to the proton exchange membrane fuel cell (PEMFC). The FTC approach is composed of two parts. First, a diagnosis part allows the identification and the isolation of a fault; it requires a good a priori knowledge of all the possible faults. Then, a control part allows an optimal control strategy to find the best operating point to recover/mitigate the fault; it requires the knowledge of the degradation phenomena and their mitigation strategies.
Diagnosing a Failed Proof in Fault-Tolerance: A Disproving Challenge Problem

NASA Technical Reports Server (NTRS)

Pike, Lee; Miner, Paul; Torres-Pomales, Wilfredo

2006-01-01

This paper proposes a challenge problem in disproving. We describe a fault-tolerant distributed protocol designed at NASA for use in a fly-by-wire system for next-generation commercial aircraft. An early design of the protocol contains a subtle bug that is highly unlikely to be caught in fault injection testing. We describe a failed proof of the protocol's correctness in a mechanical theorem prover (PVS) with a complex unfinished proof conjecture. We use a model checking suite (SAL) to generate a concrete counterexample to the unproven conjecture to demonstrate the existence of a bug. However, we argue that the effort required in our approach is too high and propose what conditions a better solution would satisfy. We carefully describe the protocol and bug to provide a challenging but feasible case study for disproving research.
Evolution of shuttle avionics redundancy management/fault tolerance

NASA Technical Reports Server (NTRS)

Boykin, J. C.; Thibodeau, J. R.; Schneider, H. E.

1985-01-01

The challenge of providing redundancy management (RM) and fault tolerance to meet the Shuttle Program requirements of fail operational/fail safe for the avionics systems was complicated by the critical program constraints of weight, cost, and schedule. The basic and sometimes false effectivity of less than pure RM designs is addressed. Evolution of the multiple input selection filter (the heart of the RM function) is discussed with emphasis on the subtle interactions of the flight control system that were found to be potentially catastrophic. Several other general RM development problems are discussed, with particular emphasis on the inertial measurement unit RM, indicative of the complexity of managing that three string system and its critical interfaces with the guidance and control systems.
Position, Attitude, and Fault-Tolerant Control of Tilting-Rotor Quadcopter

NASA Astrophysics Data System (ADS)

Kumar, Rumit

The aim of this thesis is to present algorithms for autonomous control of tilt-rotor quadcopter UAV. In particular, this research work describes position, attitude and fault tolerant control in tilt-rotor quadcopter. Quadcopters are one of the most popular and reliable unmanned aerial systems because of the design simplicity, hovering capabilities and minimal operational cost. Numerous applications for quadcopters have been explored all over the world but very little work has been done to explore design enhancements and address the fault-tolerant capabilities of the quadcopters. The tilting rotor quadcopter is a structural advancement of traditional quadcopter and it provides additional actuated controls as the propeller motors are actuated for tilt which can be utilized to improve efficiency of the aerial vehicle during flight. The tilting rotor quadcopter design is accomplished by using an additional servo motor for each rotor that enables the rotor to tilt about the axis of the quadcopter arm. Tilting rotor quadcopter is a more agile version of conventional quadcopter and it is a fully actuated system. The tilt-rotor quadcopter is capable of following complex trajectories with ease. The control strategy in this work is to use the propeller tilts for position and orientation control during autonomous flight of the quadcopter. In conventional quadcopters, two propellers rotate in clockwise direction and other two propellers rotate in counter clockwise direction to cancel out the effective yawing moment of the system. The variation in rotational speeds of these four propellers is utilized for maneuvering. On the other hand, this work incorporates use of varying propeller rotational speeds along with tilting of the propellers for maneuvering during flight. The rotational motion of propellers work in sync with propeller tilts to control the position and orientation of the UAV during the flight. A PD flight controller is developed to achieve various modes of the
Adaptive and technology-independent architecture for fault-tolerant distributed AAL solutions.

PubMed

Schmidt, Michael; Obermaisser, Roman

2018-04-01

Today's architectures for Ambient Assisted Living (AAL) must cope with a variety of challenges like flawless sensor integration and time synchronization (e.g. for sensor data fusion) while abstracting from the underlying technologies at the same time. Furthermore, an architecture for AAL must be capable to manage distributed application scenarios in order to support elderly people in all situations of their everyday life. This encompasses not just life at home but in particular the mobility of elderly people (e.g. when going for a walk or having sports) as well. Within this paper we will introduce a novel architecture for distributed AAL solutions whose design follows a modern Microservices approach by providing small core services instead of a monolithic application framework. The architecture comprises core services for sensor integration, and service discovery while supporting several communication models (periodic, sporadic, streaming). We extend the state-of-the-art by introducing a fault-tolerance model for our architecture on the basis of a fault-hypothesis describing the fault-containment regions (FCRs) with their respective failure modes and failure rates in order to support safety-critical AAL applications. Copyright © 2017 Elsevier Ltd. All rights reserved.
Fault Model Development for Fault Tolerant VLSI Design

DTIC Science & Technology

1988-05-01

0 % .%. . BEIDGING FAULTS A bridging fault in a digital circuit connects two or more conducting paths of the circuit. The resistance...Melvin Breuer and Arthur Friedman, "Diagnosis and Reliable Design of Digital Systems", Computer Science Press, Inc., 1976. 4. [Chandramouli,1983] R...2138 AEDC LIBARY (TECH REPORTS FILE) MS-O0 ARNOLD AFS TN 37389-9998 USAG1 Attn: ASH-PCA-CRT Ft Huachuca AZ 85613-6000 DOT LIBRARY/iQA SECTION - ATTN
Verification of a Byzantine-Fault-Tolerant Self-stabilizing Protocol for Clock Synchronization

NASA Technical Reports Server (NTRS)

Malekpour, Mahyar R.

2008-01-01

This paper presents the mechanical verification of a simplified model of a rapid Byzantine-fault-tolerant self-stabilizing protocol for distributed clock synchronization systems. This protocol does not rely on any assumptions about the initial state of the system except for the presence of sufficient good nodes, thus making the weakest possible assumptions and producing the strongest results. This protocol tolerates bursts of transient failures, and deterministically converges within a time bound that is a linear function of the self-stabilization period. A simplified model of the protocol is verified using the Symbolic Model Verifier (SMV). The system under study consists of 4 nodes, where at most one of the nodes is assumed to be Byzantine faulty. The model checking effort is focused on verifying correctness of the simplified model of the protocol in the presence of a permanent Byzantine fault as well as confirmation of claims of determinism and linear convergence with respect to the self-stabilization period. Although model checking results of the simplified model of the protocol confirm the theoretical predictions, these results do not necessarily confirm that the protocol solves the general case of this problem. Modeling challenges of the protocol and the system are addressed. A number of abstractions are utilized in order to reduce the state space.

A Fault Oblivious Extreme-Scale Execution Environment

DOE Office of Scientific and Technical Information (OSTI.GOV)

McKie, Jim

The FOX project, funded under the ASCR X-stack I program, developed systems software and runtime libraries for a new approach to the data and work distribution for massively parallel, fault oblivious application execution. Our work was motivated by the premise that exascale computing systems will provide a thousand-fold increase in parallelism and a proportional increase in failure rate relative to today’s machines. To deliver the capability of exascale hardware, the systems software must provide the infrastructure to support existing applications while simultaneously enabling efficient execution of new programming models that naturally express dynamic, adaptive, irregular computation; coupled simulations; and massivemore » data analysis in a highly unreliable hardware environment with billions of threads of execution. Our OS research has prototyped new methods to provide efficient resource sharing, synchronization, and protection in a many-core compute node. We have experimented with alternative task/dataflow programming models and shown scalability in some cases to hundreds of thousands of cores. Much of our software is in active development through open source projects. Concepts from FOX are being pursued in next generation exascale operating systems. Our OS work focused on adaptive, application tailored OS services optimized for multi → many core processors. We developed a new operating system NIX that supports role-based allocation of cores to processes which was released to open source. We contributed to the IBM FusedOS project, which promoted the concept of latency-optimized and throughput-optimized cores. We built a task queue library based on distributed, fault tolerant key-value store and identified scaling issues. A second fault tolerant task parallel library was developed, based on the Linda tuple space model, that used low level interconnect primitives for optimized communication. We designed fault tolerance mechanisms for task parallel
Parallel and fault-tolerant algorithms for hypercube multiprocessors

DOE Office of Scientific and Technical Information (OSTI.GOV)

Aykanat, C.

1988-01-01

Several techniques for increasing the performance of parallel algorithms on distributed-memory message-passing multi-processor systems are investigated. These techniques are effectively implemented for the parallelization of the Scaled Conjugate Gradient (SCG) algorithm on a hypercube connected message-passing multi-processor. Significant performance improvement is achieved by using these techniques. The SCG algorithm is used for the solution phase of an FE modeling system. Almost linear speed-up is achieved, and it is shown that hypercube topology is scalable for an FE class of problem. The SCG algorithm is also shown to be suitable for vectorization, and near supercomputer performance is achieved on a vectormore » hypercube multiprocessor by exploiting both parallelization and vectorization. Fault-tolerance issues for the parallel SCG algorithm and for the hypercube topology are also addressed.« less
Prescribed-performance fault-tolerant control for feedback linearisable systems with an aircraft application

NASA Astrophysics Data System (ADS)

Gao, Gang; Wang, Jinzhi; Wang, Xianghua

2017-05-01

This paper investigates fault-tolerant control (FTC) for feedback linearisable systems (FLSs) and its application to an aircraft. To ensure desired transient and steady-state behaviours of the tracking error under actuator faults, the dynamic effect caused by the actuator failures on the error dynamics of a transformed model is analysed, and three control strategies are designed. The first FTC strategy is proposed as a robust controller, which relies on the explicit information about several parameters of the actuator faults. To eliminate the need for these parameters and the input chattering phenomenon, the robust control law is later combined with the adaptive technique to generate the adaptive FTC law. Next, the adaptive control law is further improved to achieve the prescribed performance under more severe input disturbance. Finally, the proposed control laws are applied to an air-breathing hypersonic vehicle (AHV) subject to actuator failures, which confirms the effectiveness of the proposed strategies.
Fault-tolerant simple quantum-bit commitment unbreakable by individual attacks

NASA Astrophysics Data System (ADS)

Shimizu, Kaoru; Imoto, Nobuyuki

2002-03-01

This paper proposes a simple scheme for quantum-bit commitment that is secure against individual particle attacks, where a sender is unable to use quantum logical operations to manipulate multiparticle entanglement for performing quantum collective and coherent attacks. Our scheme employs a cryptographic quantum communication channel defined in a four-dimensional Hilbert space and can be implemented by using single-photon interference. For an ideal case of zero-loss and noiseless quantum channels, our basic scheme relies only on the physical features of quantum states. Moreover, as long as the bit-flip error rates are sufficiently small (less than a few percent), we can improve our scheme and make it fault tolerant by adopting simple error-correcting codes with a short length. Compared with the well-known Brassard-Crepeau-Jozsa-Langlois 1993 (BCJL93) protocol, our scheme is mathematically far simpler, more efficient in terms of transmitted photon number, and better tolerant of bit-flip errors.
Robust adaptive fault-tolerant control for leader-follower flocking of uncertain multi-agent systems with actuator failure.

PubMed

Yazdani, Sahar; Haeri, Mohammad

2017-11-01

In this work, we study the flocking problem of multi-agent systems with uncertain dynamics subject to actuator failure and external disturbances. By considering some standard assumptions, we propose a robust adaptive fault tolerant protocol for compensating of the actuator bias fault, the partial loss of actuator effectiveness fault, the model uncertainties, and external disturbances. Under the designed protocol, velocity convergence of agents to that of virtual leader is guaranteed while the connectivity preservation of network and collision avoidance among agents are ensured as well. Copyright © 2017 ISA. Published by Elsevier Ltd. All rights reserved.
A Performance Prediction Model for a Fault-Tolerant Computer During Recovery and Restoration

NASA Technical Reports Server (NTRS)

Obando, Rodrigo A.; Stoughton, John W.

1995-01-01

The modeling and design of a fault-tolerant multiprocessor system is addressed. Of interest is the behavior of the system during recovery and restoration after a fault has occurred. The multiprocessor systems are based on the Algorithm to Architecture Mapping Model (ATAMM) and the fault considered is the death of a processor. The developed model is useful in the determination of performance bounds of the system during recovery and restoration. The performance bounds include time to recover from the fault, time to restore the system, and determination of any permanent delay in the input to output latency after the system has regained steady state. Implementation of an ATAMM based computer was developed for a four-processor generic VHSIC spaceborne computer (GVSC) as the target system. A simulation of the GVSC was also written on the code used in the ATAMM Multicomputer Operating System (AMOS). The simulation is used to verify the new model for tracking the propagation of the delay through the system and predicting the behavior of the transient state of recovery and restoration. The model is shown to accurately predict the transient behavior of an ATAMM based multicomputer during recovery and restoration.
Tools and Techniques for Adding Fault Tolerance to Distributed and Parallel Programs

DTIC Science & Technology

1991-12-07

is rapidly approaching dimensions where fault tolerance can no longer be ignored. No matter how reliable the i .nd~ividual components May be, the...The scale of parallel computing systems is rapidly approaching dimensions where 41to’- erance can no longer be ignored. No matter how relitble the...those employed in the Tandem [71 and Stratus [35] systems, is clearly impractical. * No matter how reliable the individual components are, the sheer
Design Trade-off Between Performance and Fault-Tolerance of Space Onboard Computers

NASA Astrophysics Data System (ADS)

Gorbunov, M. S.; Antonov, A. A.

2017-01-01

It is well known that there is a trade-off between performance and power consumption in onboard computers. The fault-tolerance is another important factor affecting performance, chip area and power consumption. Involving special SRAM cells and error-correcting codes is often too expensive with relation to the performance needed. We discuss the possibility of finding the optimal solutions for modern onboard computer for scientific apparatus focusing on multi-level cache memory design.
Reset Tree-Based Optical Fault Detection

PubMed Central

Lee, Dong-Geon; Choi, Dooho; Seo, Jungtaek; Kim, Howon

2013-01-01

In this paper, we present a new reset tree-based scheme to protect cryptographic hardware against optical fault injection attacks. As one of the most powerful invasive attacks on cryptographic hardware, optical fault attacks cause semiconductors to misbehave by injecting high-energy light into a decapped integrated circuit. The contaminated result from the affected chip is then used to reveal secret information, such as a key, from the cryptographic hardware. Since the advent of such attacks, various countermeasures have been proposed. Although most of these countermeasures are strong, there is still the possibility of attack. In this paper, we present a novel optical fault detection scheme that utilizes the buffers on a circuit's reset signal tree as a fault detection sensor. To evaluate our proposal, we model radiation-induced currents into circuit components and perform a SPICE simulation. The proposed scheme is expected to be used as a supplemental security tool. PMID:23698267
Fault-tolerant conversion between adjacent Reed-Muller quantum codes based on gauge fixing

NASA Astrophysics Data System (ADS)

Quan, Dong-Xiao; Zhu, Li-Li; Pei, Chang-Xing; Sanders, Barry C.

2018-03-01

We design forward and backward fault-tolerant conversion circuits, which convert between the Steane code and the 15-qubit Reed-Muller quantum code so as to provide a universal transversal gate set. In our method, only seven out of a total 14 code stabilizers need to be measured, and we further enhance the circuit by simplifying some stabilizers; thus, we need only to measure eight weight-4 stabilizers for one round of forward conversion and seven weight-4 stabilizers for one round of backward conversion. For conversion, we treat random single-qubit errors and their influence on syndromes of gauge operators, and our novel single-step process enables more efficient fault-tolerant conversion between these two codes. We make our method quite general by showing how to convert between any two adjacent Reed-Muller quantum codes \\overline{\\textsf{RM}}(1,m) and \\overline{\\textsf{RM}}≤ft(1,m+1\\right) , for which we need only measure stabilizers whose number scales linearly with m rather than exponentially with m obtained in previous work. We provide the explicit mathematical expression for the necessary stabilizers and the concomitant resources required.
Efficient preparation of large-block-code ancilla states for fault-tolerant quantum computation

NASA Astrophysics Data System (ADS)

Zheng, Yi-Cong; Lai, Ching-Yi; Brun, Todd A.

2018-03-01

Fault-tolerant quantum computation (FTQC) schemes that use multiqubit large block codes can potentially reduce the resource overhead to a great extent. A major obstacle is the requirement for a large number of clean ancilla states of different types without correlated errors inside each block. These ancilla states are usually logical stabilizer states of the data-code blocks, which are generally difficult to prepare if the code size is large. Previously, we have proposed an ancilla distillation protocol for Calderbank-Shor-Steane (CSS) codes by classical error-correcting codes. It was assumed that the quantum gates in the distillation circuit were perfect; however, in reality, noisy quantum gates may introduce correlated errors that are not treatable by the protocol. In this paper, we show that additional postselection by another classical error-detecting code can be applied to remove almost all correlated errors. Consequently, the revised protocol is fully fault tolerant and capable of preparing a large set of stabilizer states sufficient for FTQC using large block codes. At the same time, the yield rate can be boosted from O (t-2) to O (1 ) in practice for an [[n ,k ,d =2 t +1
Intelligent on-line fault tolerant control for unanticipated catastrophic failures.

PubMed

Yen, Gary G; Ho, Liang-Wei

2004-10-01

As dynamic systems become increasingly complex, experience rapidly changing environments, and encounter a greater variety of unexpected component failures, solving the control problems of such systems is a grand challenge for control engineers. Traditional control design techniques are not adequate to cope with these systems, which may suffer from unanticipated dynamic failures. In this research work, we investigate the on-line fault tolerant control problem and propose an intelligent on-line control strategy to handle the desired trajectories tracking problem for systems suffering from various unanticipated catastrophic faults. Through theoretical analysis, the sufficient condition of system stability has been derived and two different on-line control laws have been developed. The approach of the proposed intelligent control strategy is to continuously monitor the system performance and identify what the system's current state is by using a fault detection method based upon our best knowledge of the nominal system and nominal controller. Once a fault is detected, the proposed intelligent controller will adjust its control signal to compensate for the unknown system failure dynamics by using an artificial neural network as an on-line estimator to approximate the unexpected and unknown failure dynamics. The first control law is derived directly from the Lyapunov stability theory, while the second control law is derived based upon the discrete-time sliding mode control technique. Both control laws have been implemented in a variety of failure scenarios to validate the proposed intelligent control scheme. The simulation results, including a three-tank benchmark problem, comply with theoretical analysis and demonstrate a significant improvement in trajectory following performance based upon the proposed intelligent control strategy.
A design fix to supervisory control for fault-tolerant scheduling of real-time multiprocessor systems with aperiodic tasks

NASA Astrophysics Data System (ADS)

Devaraj, Rajesh; Sarkar, Arnab; Biswas, Santosh

2015-11-01

In the article 'Supervisory control for fault-tolerant scheduling of real-time multiprocessor systems with aperiodic tasks', Park and Cho presented a systematic way of computing a largest fault-tolerant and schedulable language that provides information on whether the scheduler (i.e., supervisor) should accept or reject a newly arrived aperiodic task. The computation of such a language is mainly dependent on the task execution model presented in their paper. However, the task execution model is unable to capture the situation when the fault of a processor occurs even before the task has arrived. Consequently, a task execution model that does not capture this fact may possibly be assigned for execution on a faulty processor. This problem has been illustrated with an appropriate example. Then, the task execution model of Park and Cho has been modified to strengthen the requirement that none of the tasks are assigned for execution on a faulty processor.
SPANNER: A Self-Repairing Spiking Neural Network Hardware Architecture.

PubMed

Liu, Junxiu; Harkin, Jim; Maguire, Liam P; McDaid, Liam J; Wade, John J

2018-04-01

Recent research has shown that a glial cell of astrocyte underpins a self-repair mechanism in the human brain, where spiking neurons provide direct and indirect feedbacks to presynaptic terminals. These feedbacks modulate the synaptic transmission probability of release (PR). When synaptic faults occur, the neuron becomes silent or near silent due to the low PR of synapses; whereby the PRs of remaining healthy synapses are then increased by the indirect feedback from the astrocyte cell. In this paper, a novel hardware architecture of Self-rePAiring spiking Neural NEtwoRk (SPANNER) is proposed, which mimics this self-repairing capability in the human brain. This paper demonstrates that the hardware can self-detect and self-repair synaptic faults without the conventional components for the fault detection and fault repairing. Experimental results show that SPANNER can maintain the system performance with fault densities of up to 40%, and more importantly SPANNER has only a 20% performance degradation when the self-repairing architecture is significantly damaged at a fault density of 80%.
Integrated Data and Control Level Fault Tolerance Techniques for Signal Processing Computer Design

DTIC Science & Technology

1990-09-01

TOLERANCE TECHNIQUES FOR SIGNAL PROCESSING COMPUTER DESIGN G. Robert Redinbo I. INTRODUCTION High-speed signal processing is an important application of...techniques and mathematical approaches will be expanded later to the situation where hardware errors and roundoff and quantization noise affect all...detect errors equal in number to the degree of g(X), the maximum permitted by the Singleton bound [13]. Real cyclic codes, primarily applicable to
A General theory of Signal Integration for Fault-Tolerant Dynamic Distributed Sensor Networks

DTIC Science & Technology

1993-10-01

related to a) the architecture and fault- tolerance of the distributed sensor network, b) the proper synchronisation of sensor signals, c) the...Computational complexities of the problem of distributed detection. 5) Issues related to recording of events and synchronization in distributed sensor...Intervals for Synchronization in Real Time Distributed Systems", Submitted to Electronic Encyclopedia. 3. V. G. Hegde and S. S. Iyengar "Efficient
Superconducting quantum circuits at the surface code threshold for fault tolerance.

PubMed

Barends, R; Kelly, J; Megrant, A; Veitia, A; Sank, D; Jeffrey, E; White, T C; Mutus, J; Fowler, A G; Campbell, B; Chen, Y; Chen, Z; Chiaro, B; Dunsworth, A; Neill, C; O'Malley, P; Roushan, P; Vainsencher, A; Wenner, J; Korotkov, A N; Cleland, A N; Martinis, John M

2014-04-24

A quantum computer can solve hard problems, such as prime factoring, database searching and quantum simulation, at the cost of needing to protect fragile quantum states from error. Quantum error correction provides this protection by distributing a logical state among many physical quantum bits (qubits) by means of quantum entanglement. Superconductivity is a useful phenomenon in this regard, because it allows the construction of large quantum circuits and is compatible with microfabrication. For superconducting qubits, the surface code approach to quantum computing is a natural choice for error correction, because it uses only nearest-neighbour coupling and rapidly cycled entangling gates. The gate fidelity requirements are modest: the per-step fidelity threshold is only about 99 per cent. Here we demonstrate a universal set of logic gates in a superconducting multi-qubit processor, achieving an average single-qubit gate fidelity of 99.92 per cent and a two-qubit gate fidelity of up to 99.4 per cent. This places Josephson quantum computing at the fault-tolerance threshold for surface code error correction. Our quantum processor is a first step towards the surface code, using five qubits arranged in a linear array with nearest-neighbour coupling. As a further demonstration, we construct a five-qubit Greenberger-Horne-Zeilinger state using the complete circuit and full set of gates. The results demonstrate that Josephson quantum computing is a high-fidelity technology, with a clear path to scaling up to large-scale, fault-tolerant quantum circuits.
Event-triggered decentralized adaptive fault-tolerant control of uncertain interconnected nonlinear systems with actuator failures.

PubMed

Choi, Yun Ho; Yoo, Sung Jin

2018-06-01

This paper investigates the event-triggered decentralized adaptive tracking problem of a class of uncertain interconnected nonlinear systems with unexpected actuator failures. It is assumed that local control signals are transmitted to local actuators with time-varying faults whenever predefined conditions for triggering events are satisfied. Compared with the existing control-input-based event-triggering strategy for adaptive control of uncertain nonlinear systems, the aim of this paper is to propose a tracking-error-based event-triggering strategy in the decentralized adaptive fault-tolerant tracking framework. The proposed approach can relax drastic changes in control inputs caused by actuator faults in the existing triggering strategy. The stability of the proposed event-triggering control system is analyzed in the Lyapunov sense. Finally, simulation comparisons of the proposed and existing approaches are provided to show the effectiveness of the proposed theoretical result in the presence of actuator faults. Copyright © 2018 ISA. Published by Elsevier Ltd. All rights reserved.
Fault Diagnostics for Turbo-Shaft Engine Sensors Based on a Simplified On-Board Model

PubMed Central

Lu, Feng; Huang, Jinquan; Xing, Yaodong

2012-01-01

Combining a simplified on-board turbo-shaft model with sensor fault diagnostic logic, a model-based sensor fault diagnosis method is proposed. The existing fault diagnosis method for turbo-shaft engine key sensors is mainly based on a double redundancies technique, and this can't be satisfied in some occasions as lack of judgment. The simplified on-board model provides the analytical third channel against which the dual channel measurements are compared, while the hardware redundancy will increase the structure complexity and weight. The simplified turbo-shaft model contains the gas generator model and the power turbine model with loads, this is built up via dynamic parameters method. Sensor fault detection, diagnosis (FDD) logic is designed, and two types of sensor failures, such as the step faults and the drift faults, are simulated. When the discrepancy among the triplex channels exceeds a tolerance level, the fault diagnosis logic determines the cause of the difference. Through this approach, the sensor fault diagnosis system achieves the objectives of anomaly detection, sensor fault diagnosis and redundancy recovery. Finally, experiments on this method are carried out on a turbo-shaft engine, and two types of faults under different channel combinations are presented. The experimental results show that the proposed method for sensor fault diagnostics is efficient. PMID:23112645
Fault diagnostics for turbo-shaft engine sensors based on a simplified on-board model.

PubMed

Lu, Feng; Huang, Jinquan; Xing, Yaodong

2012-01-01

Combining a simplified on-board turbo-shaft model with sensor fault diagnostic logic, a model-based sensor fault diagnosis method is proposed. The existing fault diagnosis method for turbo-shaft engine key sensors is mainly based on a double redundancies technique, and this can't be satisfied in some occasions as lack of judgment. The simplified on-board model provides the analytical third channel against which the dual channel measurements are compared, while the hardware redundancy will increase the structure complexity and weight. The simplified turbo-shaft model contains the gas generator model and the power turbine model with loads, this is built up via dynamic parameters method. Sensor fault detection, diagnosis (FDD) logic is designed, and two types of sensor failures, such as the step faults and the drift faults, are simulated. When the discrepancy among the triplex channels exceeds a tolerance level, the fault diagnosis logic determines the cause of the difference. Through this approach, the sensor fault diagnosis system achieves the objectives of anomaly detection, sensor fault diagnosis and redundancy recovery. Finally, experiments on this method are carried out on a turbo-shaft engine, and two types of faults under different channel combinations are presented. The experimental results show that the proposed method for sensor fault diagnostics is efficient.

Formal specification and mechanical verification of SIFT - A fault-tolerant flight control system

NASA Technical Reports Server (NTRS)

Melliar-Smith, P. M.; Schwartz, R. L.

1982-01-01

The paper describes the methodology being employed to demonstrate rigorously that the SIFT (software-implemented fault-tolerant) computer meets its requirements. The methodology uses a hierarchy of design specifications, expressed in the mathematical domain of multisorted first-order predicate calculus. The most abstract of these, from which almost all details of mechanization have been removed, represents the requirements on the system for reliability and intended functionality. Successive specifications in the hierarchy add design and implementation detail until the PASCAL programs implementing the SIFT executive are reached. A formal proof that a SIFT system in a 'safe' state operates correctly despite the presence of arbitrary faults has been completed all the way from the most abstract specifications to the PASCAL program.
Rigorously modeling self-stabilizing fault-tolerant circuits: An ultra-robust clocking scheme for systems-on-chip.

PubMed

Dolev, Danny; Függer, Matthias; Posch, Markus; Schmid, Ulrich; Steininger, Andreas; Lenzen, Christoph

2014-06-01

We present the first implementation of a distributed clock generation scheme for Systems-on-Chip that recovers from an unbounded number of arbitrary transient faults despite a large number of arbitrary permanent faults. We devise self-stabilizing hardware building blocks and a hybrid synchronous/asynchronous state machine enabling metastability-free transitions of the algorithm's states. We provide a comprehensive modeling approach that permits to prove, given correctness of the constructed low-level building blocks, the high-level properties of the synchronization algorithm (which have been established in a more abstract model). We believe this approach to be of interest in its own right, since this is the first technique permitting to mathematically verify, at manageable complexity, high-level properties of a fault-prone system in terms of its very basic components. We evaluate a prototype implementation, which has been designed in VHDL, using the Petrify tool in conjunction with some extensions, and synthesized for an Altera Cyclone FPGA.
Rigorously modeling self-stabilizing fault-tolerant circuits: An ultra-robust clocking scheme for systems-on-chip☆

PubMed Central

Dolev, Danny; Függer, Matthias; Posch, Markus; Schmid, Ulrich; Steininger, Andreas; Lenzen, Christoph

2014-01-01

We present the first implementation of a distributed clock generation scheme for Systems-on-Chip that recovers from an unbounded number of arbitrary transient faults despite a large number of arbitrary permanent faults. We devise self-stabilizing hardware building blocks and a hybrid synchronous/asynchronous state machine enabling metastability-free transitions of the algorithm's states. We provide a comprehensive modeling approach that permits to prove, given correctness of the constructed low-level building blocks, the high-level properties of the synchronization algorithm (which have been established in a more abstract model). We believe this approach to be of interest in its own right, since this is the first technique permitting to mathematically verify, at manageable complexity, high-level properties of a fault-prone system in terms of its very basic components. We evaluate a prototype implementation, which has been designed in VHDL, using the Petrify tool in conjunction with some extensions, and synthesized for an Altera Cyclone FPGA. PMID:26516290
Advanced information processing system: The Army fault tolerant architecture conceptual study. Volume 1: Army fault tolerant architecture overview

NASA Technical Reports Server (NTRS)

Harper, R. E.; Alger, L. S.; Babikyan, C. A.; Butler, B. P.; Friend, S. A.; Ganska, R. J.; Lala, J. H.; Masotto, T. K.; Meyer, A. J.; Morton, D. P.

1992-01-01

Digital computing systems needed for Army programs such as the Computer-Aided Low Altitude Helicopter Flight Program and the Armored Systems Modernization (ASM) vehicles may be characterized by high computational throughput and input/output bandwidth, hard real-time response, high reliability and availability, and maintainability, testability, and producibility requirements. In addition, such a system should be affordable to produce, procure, maintain, and upgrade. To address these needs, the Army Fault Tolerant Architecture (AFTA) is being designed and constructed under a three-year program comprised of a conceptual study, detailed design and fabrication, and demonstration and validation phases. Described here are the results of the conceptual study phase of the AFTA development. Given here is an introduction to the AFTA program, its objectives, and key elements of its technical approach. A format is designed for representing mission requirements in a manner suitable for first order AFTA sizing and analysis, followed by a discussion of the current state of mission requirements acquisition for the targeted Army missions. An overview is given of AFTA's architectural theory of operation.
Integrated multiple-model adaptive fault identification and reconfigurable fault-tolerant control for Lead-Wing close formation systems

NASA Astrophysics Data System (ADS)

Liu, Chun; Jiang, Bin; Zhang, Ke

2018-03-01

This paper investigates the attitude and position tracking control problem for Lead-Wing close formation systems in the presence of loss of effectiveness and lock-in-place or hardover failure. In close formation flight, Wing unmanned aerial vehicle movements are influenced by vortex effects of the neighbouring Lead unmanned aerial vehicle. This situation allows modelling of aerodynamic coupling vortex-effects and linearisation based on optimal close formation geometry. Linearised Lead-Wing close formation model is transformed into nominal robust H-infinity models with respect to Mach hold, Heading hold, and Altitude hold autopilots; static feedback H-infinity controller is designed to guarantee effective tracking of attitude and position while manoeuvring Lead unmanned aerial vehicle. Based on H-infinity control design, an integrated multiple-model adaptive fault identification and reconfigurable fault-tolerant control scheme is developed to guarantee asymptotic stability of close-loop systems, error signal boundedness, and attitude and position tracking properties. Simulation results for Lead-Wing close formation systems validate the efficiency of the proposed integrated multiple-model adaptive control algorithm.
A fault-tolerant addressable spin qubit in a natural silicon quantum dot

PubMed Central

Takeda, Kenta; Kamioka, Jun; Otsuka, Tomohiro; Yoneda, Jun; Nakajima, Takashi; Delbecq, Matthieu R.; Amaha, Shinichi; Allison, Giles; Kodera, Tetsuo; Oda, Shunri; Tarucha, Seigo

2016-01-01

Fault-tolerant quantum computing requires high-fidelity qubits. This has been achieved in various solid-state systems, including isotopically purified silicon, but is yet to be accomplished in industry-standard natural (unpurified) silicon, mainly as a result of the dephasing caused by residual nuclear spins. This high fidelity can be achieved by speeding up the qubit operation and/or prolonging the dephasing time, that is, increasing the Rabi oscillation quality factor Q (the Rabi oscillation decay time divided by the π rotation time). In isotopically purified silicon quantum dots, only the second approach has been used, leaving the qubit operation slow. We apply the first approach to demonstrate an addressable fault-tolerant qubit using a natural silicon double quantum dot with a micromagnet that is optimally designed for fast spin control. This optimized design allows access to Rabi frequencies up to 35 MHz, which is two orders of magnitude greater than that achieved in previous studies. We find the optimum Q = 140 in such high-frequency range at a Rabi frequency of 10 MHz. This leads to a qubit fidelity of 99.6% measured via randomized benchmarking, which is the highest reported for natural silicon qubits and comparable to that obtained in isotopically purified silicon quantum dot–based qubits. This result can inspire contributions to quantum computing from industrial communities. PMID:27536725
A fault-tolerant addressable spin qubit in a natural silicon quantum dot.

PubMed

Takeda, Kenta; Kamioka, Jun; Otsuka, Tomohiro; Yoneda, Jun; Nakajima, Takashi; Delbecq, Matthieu R; Amaha, Shinichi; Allison, Giles; Kodera, Tetsuo; Oda, Shunri; Tarucha, Seigo

2016-08-01

Fault-tolerant quantum computing requires high-fidelity qubits. This has been achieved in various solid-state systems, including isotopically purified silicon, but is yet to be accomplished in industry-standard natural (unpurified) silicon, mainly as a result of the dephasing caused by residual nuclear spins. This high fidelity can be achieved by speeding up the qubit operation and/or prolonging the dephasing time, that is, increasing the Rabi oscillation quality factor Q (the Rabi oscillation decay time divided by the π rotation time). In isotopically purified silicon quantum dots, only the second approach has been used, leaving the qubit operation slow. We apply the first approach to demonstrate an addressable fault-tolerant qubit using a natural silicon double quantum dot with a micromagnet that is optimally designed for fast spin control. This optimized design allows access to Rabi frequencies up to 35 MHz, which is two orders of magnitude greater than that achieved in previous studies. We find the optimum Q = 140 in such high-frequency range at a Rabi frequency of 10 MHz. This leads to a qubit fidelity of 99.6% measured via randomized benchmarking, which is the highest reported for natural silicon qubits and comparable to that obtained in isotopically purified silicon quantum dot-based qubits. This result can inspire contributions to quantum computing from industrial communities.
A framework for software fault tolerance in real-time systems

NASA Technical Reports Server (NTRS)

Anderson, T.; Knight, J. C.

1983-01-01

A classification scheme for errors and a technique for the provision of software fault tolerance in cyclic real-time systems is presented. The technique requires that the process structure of a system be represented by a synchronization graph which is used by an executive as a specification of the relative times at which they will communicate during execution. Communication between concurrent processes is severely limited and may only take place between processes engaged in an exchange. A history of error occurrences is maintained by an error handler. When an error is detected, the error handler classifies it using the error history information and then initiates appropriate recovery action.
Fault-tolerant Remote Quantum Entanglement Establishment for Secure Quantum Communications

NASA Astrophysics Data System (ADS)

Tsai, Chia-Wei; Lin, Jason

2016-07-01

This work presents a strategy for constructing long-distance quantum communications among a number of remote users through collective-noise channel. With the assistance of semi-honest quantum certificate authorities (QCAs), the remote users can share a secret key through fault-tolerant entanglement swapping. The proposed protocol is feasible for large-scale distributed quantum networks with numerous users. Each pair of communicating parties only needs to establish the quantum channels and the classical authenticated channels with his/her local QCA. Thus, it enables any user to communicate freely without point-to-point pre-establishing any communication channels, which is efficient and feasible for practical environments.
Analysis and design of algorithm-based fault-tolerant systems

NASA Technical Reports Server (NTRS)

Nair, V. S. Sukumaran

1990-01-01

An important consideration in the design of high performance multiprocessor systems is to ensure the correctness of the results computed in the presence of transient and intermittent failures. Concurrent error detection and correction have been applied to such systems in order to achieve reliability. Algorithm Based Fault Tolerance (ABFT) was suggested as a cost-effective concurrent error detection scheme. The research was motivated by the complexity involved in the analysis and design of ABFT systems. To that end, a matrix-based model was developed and, based on that, algorithms for both the design and analysis of ABFT systems are formulated. These algorithms are less complex than the existing ones. In order to reduce the complexity further, a hierarchical approach is developed for the analysis of large systems.
Runtime Speculative Software-Only Fault Tolerance

DTIC Science & Technology

2012-06-01

reliability of RSFT, a in-depth analysis on its window of vulnerability is also discussed and measured via simulated fault injection. The performance...propagation of faults through the entire program. For optimal performance, these techniques have to use herotic alias analysis to find the minimum set of...affect program output. No program source code or alias analysis is needed to analyze the fault propagation ahead of time. 2.3 Limitations of Existing
Hierarchical Simulation to Assess Hardware and Software Dependability

NASA Technical Reports Server (NTRS)

Ries, Gregory Lawrence

1997-01-01

This thesis presents a method for conducting hierarchical simulations to assess system hardware and software dependability. The method is intended to model embedded microprocessor systems. A key contribution of the thesis is the idea of using fault dictionaries to propagate fault effects upward from the level of abstraction where a fault model is assumed to the system level where the ultimate impact of the fault is observed. A second important contribution is the analysis of the software behavior under faults as well as the hardware behavior. The simulation method is demonstrated and validated in four case studies analyzing Myrinet, a commercial, high-speed networking system. One key result from the case studies shows that the simulation method predicts the same fault impact 87.5% of the time as is obtained by similar fault injections into a real Myrinet system. Reasons for the remaining discrepancy are examined in the thesis. A second key result shows the reduction in the number of simulations needed due to the fault dictionary method. In one case study, 500 faults were injected at the chip level, but only 255 propagated to the system level. Of these 255 faults, 110 shared identical fault dictionary entries at the system level and so did not need to be resimulated. The necessary number of system-level simulations was therefore reduced from 500 to 145. Finally, the case studies show how the simulation method can be used to improve the dependability of the target system. The simulation analysis was used to add recovery to the target software for the most common fault propagation mechanisms that would cause the software to hang. After the modification, the number of hangs was reduced by 60% for fault injections into the real system.
Integral Sensor Fault Detection and Isolation for Railway Traction Drive.

PubMed

Garramiola, Fernando; Del Olmo, Jon; Poza, Javier; Madina, Patxi; Almandoz, Gaizka

2018-05-13

Due to the increasing importance of reliability and availability of electric traction drives in Railway applications, early detection of faults has become an important key for Railway traction drive manufacturers. Sensor faults are important sources of failures. Among the different fault diagnosis approaches, in this article an integral diagnosis strategy for sensors in traction drives is presented. Such strategy is composed of an observer-based approach for direct current (DC)-link voltage and catenary current sensors, a frequency analysis approach for motor current phase sensors and a hardware redundancy solution for speed sensors. None of them requires any hardware change requirement in the actual traction drive. All the fault detection and isolation approaches have been validated in a Hardware-in-the-loop platform comprising a Real Time Simulator and a commercial Traction Control Unit for a tram. In comparison to safety-critical systems in Aerospace applications, Railway applications do not need instantaneous detection, and the diagnosis is validated in a short time period for reliable decision. Combining the different approaches and existing hardware redundancy, an integral fault diagnosis solution is provided, to detect and isolate faults in all the sensors installed in the traction drive.
Integral Sensor Fault Detection and Isolation for Railway Traction Drive

PubMed Central

del Olmo, Jon; Poza, Javier; Madina, Patxi; Almandoz, Gaizka

2018-01-01

Due to the increasing importance of reliability and availability of electric traction drives in Railway applications, early detection of faults has become an important key for Railway traction drive manufacturers. Sensor faults are important sources of failures. Among the different fault diagnosis approaches, in this article an integral diagnosis strategy for sensors in traction drives is presented. Such strategy is composed of an observer-based approach for direct current (DC)-link voltage and catenary current sensors, a frequency analysis approach for motor current phase sensors and a hardware redundancy solution for speed sensors. None of them requires any hardware change requirement in the actual traction drive. All the fault detection and isolation approaches have been validated in a Hardware-in-the-loop platform comprising a Real Time Simulator and a commercial Traction Control Unit for a tram. In comparison to safety-critical systems in Aerospace applications, Railway applications do not need instantaneous detection, and the diagnosis is validated in a short time period for reliable decision. Combining the different approaches and existing hardware redundancy, an integral fault diagnosis solution is provided, to detect and isolate faults in all the sensors installed in the traction drive. PMID:29757251
Inexact hardware for modelling weather & climate

NASA Astrophysics Data System (ADS)

Düben, Peter D.; McNamara, Hugh; Palmer, Tim

2014-05-01

The use of stochastic processing hardware and low precision arithmetic in atmospheric models is investigated. Stochastic processors allow hardware-induced faults in calculations, sacrificing exact calculations in exchange for improvements in performance and potentially accuracy and a reduction in power consumption. A similar trade-off is achieved using low precision arithmetic, with improvements in computation and communication speed and savings in storage and memory requirements. As high-performance computing becomes more massively parallel and power intensive, these two approaches may be important stepping stones in the pursuit of global cloud resolving atmospheric modelling. The impact of both, hardware induced faults and low precision arithmetic is tested in the dynamical core of a global atmosphere model. Our simulations show that both approaches to inexact calculations do not substantially affect the quality of the model simulations, provided they are restricted to act only on smaller scales. This suggests that inexact calculations at the small scale could reduce computation and power costs without adversely affecting the quality of the simulations.
A second generation experiment in fault-tolerant software

NASA Technical Reports Server (NTRS)

Knight, J. C.

1986-01-01

Information was collected on the efficacy of fault-tolerant software by conducting two large-scale controlled experiments. In the first, an empirical study of multi-version software (MVS) was conducted. The second experiment is an empirical evaluation of self testing as a method of error detection (STED). The purpose ot the MVS experiment was to obtain empirical measurement of the performance of multi-version systems. Twenty versions of a program were prepared at four different sites under reasonably realistic development conditions from the same specifications. The purpose of the STED experiment was to obtain empirical measurements of the performance of assertions in error detection. Eight versions of a program were modified to include assertions at two different sites under controlled conditions. The overall structure of the testing environment for the MVS experiment and its status are described. Work to date in the STED experiment is also presented.
Modular Adder Designs Using Optimal Reversible and Fault Tolerant Gates in Field-Coupled QCA Nanocomputing

NASA Astrophysics Data System (ADS)

Bilal, Bisma; Ahmed, Suhaib; Kakkar, Vipan

2018-02-01

The challenges which the CMOS technology is facing toward the end of the technology roadmap calls for an investigation of various logical and technological solutions to CMOS at the nano scale. Two such paradigms which are considered in this paper are the reversible logic and the quantum-dot cellular automata (QCA) nanotechnology. Firstly, a new 3 × 3 reversible and universal gate, RG-QCA, is proposed and implemented in QCA technology using conventional 3-input majority voter based logic. Further the gate is optimized by using explicit interaction of cells and this optimized gate is then used to design an optimized modular full adder in QCA. Another configuration of RG-QCA gate, CRG-QCA, is then proposed which is a 4 × 4 gate and includes the fault tolerant characteristics and parity preserving nature. The proposed CRG-QCA gate is then tested to design a fault tolerant full adder circuit. Extensive comparisons of gate and adder circuits are drawn with the existing literature and it is envisaged that our proposed designs perform better and are cost efficient in QCA technology.
Software-Implemented Fault Tolerance in Communications Systems

NASA Technical Reports Server (NTRS)

Gantenbein, Rex E.

1994-01-01

Software-implemented fault tolerance (SIFT) is used in many computer-based command, control, and communications (C(3)) systems to provide the nearly continuous availability that they require. In the communications subsystem of Space Station Alpha, SIFT algorithms are used to detect and recover from failures in the data and command link between the Station and its ground support. The paper presents a review of these algorithms and discusses how such techniques can be applied to similar systems found in applications such as manufacturing control, military communications, and programmable devices such as pacemakers. With support from the Tracking and Communication Division of NASA's Johnson Space Center, researchers at the University of Wyoming are developing a testbed for evaluating the effectiveness of these algorithms prior to their deployment. This testbed will be capable of simulating a variety of C(3) system failures and recording the response of the Space Station SIFT algorithms to these failures. The design of this testbed and the applicability of the approach in other environments is described.
High Speed, High Temperature, Fault Tolerant Operation of a Combination Magnetic-Hydrostatic Bearing Rotor Support System for Turbomachinery

NASA Technical Reports Server (NTRS)

Jansen, Mark; Montague, Gerald; Provenza, Andrew; Palazzolo, Alan

2004-01-01

Closed loop operation of a single, high temperature magnetic radial bearing to 30,000 RPM (2.25 million DN) and 540 C (1000 F) is discussed. Also, high temperature, fault tolerant operation for the three axis system is examined. A novel, hydrostatic backup bearing system was employed to attain high speed, high temperature, lubrication free support of the entire rotor system. The hydrostatic bearings were made of a high lubricity material and acted as journal-type backup bearings. New, high temperature displacement sensors were successfully employed to monitor shaft position throughout the entire temperature range and are described in this paper. Control of the system was accomplished through a stand alone, high speed computer controller and it was used to run both the fault-tolerant PID and active vibration control algorithms.
MAGMA: A Liquid Software Approach to Fault Tolerance, Computer Network Security, and Survivable Networking

DTIC Science & Technology

2001-12-01

and Lieutenant Namik Kaplan , Turkish Navy. Maj Tiefert’s thesis, “Modeling Control Channel Dynamics of SAAM using NS Network Simulation”, helped lay...DEC99] Deconinck , Dr. ir. Geert, Fault Tolerant Systems, ESAT / Division ACCA , Katholieke Universiteit Leuven, October 1999. [FRE00] Freed...Systems”, Addison-Wesley, 1989. [KAP99] Kaplan , Namik, “Prototyping of an Active and Lightweight Router,” March 1999 [KAT99] Kati, Effraim

Implementation of a Configurable Fault Tolerant Processor (CFTP) Using Internal Triple Modular Redundancy (TMR)

DTIC Science & Technology

2005-12-01

Upsets in SRAM FPGAs,” Military and Aerospace Applications of Programmable Logic Devices, September 2002. 8. Wakerly , John F,. “Microcomputer...change. The goal of the Configurable Fault Tolerant Processor (CFTP) Project is to explore, develop and demonstrate the applicability of using off-the...develop and demonstrate the applicability of using commercial-of-the-shelf (COTS) Field Programmable Gate Arrays (FPGA) in the design of
Experimental magic state distillation for fault-tolerant quantum computing.

PubMed

Souza, Alexandre M; Zhang, Jingfu; Ryan, Colm A; Laflamme, Raymond

2011-01-25

Any physical quantum device for quantum information processing (QIP) is subject to errors in implementation. In order to be reliable and efficient, quantum computers will need error-correcting or error-avoiding methods. Fault-tolerance achieved through quantum error correction will be an integral part of quantum computers. Of the many methods that have been discovered to implement it, a highly successful approach has been to use transversal gates and specific initial states. A critical element for its implementation is the availability of high-fidelity initial states, such as |0〉 and the 'magic state'. Here, we report an experiment, performed in a nuclear magnetic resonance (NMR) quantum processor, showing sufficient quantum control to improve the fidelity of imperfect initial magic states by distilling five of them into one with higher fidelity.
Buffered coscheduling for parallel programming and enhanced fault tolerance

DOEpatents

Petrini, Fabrizio [Los Alamos, NM; Feng, Wu-chun [Los Alamos, NM

2006-01-31

A computer implemented method schedules processor jobs on a network of parallel machine processors or distributed system processors. Control information communications generated by each process performed by each processor during a defined time interval is accumulated in buffers, where adjacent time intervals are separated by strobe intervals for a global exchange of control information. A global exchange of the control information communications at the end of each defined time interval is performed during an intervening strobe interval so that each processor is informed by all of the other processors of the number of incoming jobs to be received by each processor in a subsequent time interval. The buffered coscheduling method of this invention also enhances the fault tolerance of a network of parallel machine processors or distributed system processors
Design of a fault tolerant airborne digital computer. Volume 2: Computational requirements and technology

NASA Technical Reports Server (NTRS)

Ratner, R. S.; Shapiro, E. B.; Zeidler, H. M.; Wahlstrom, S. E.; Clark, C. B.; Goldberg, J.

1973-01-01

This final report summarizes the work on the design of a fault tolerant digital computer for aircraft. Volume 2 is composed of two parts. Part 1 is concerned with the computational requirements associated with an advanced commercial aircraft. Part 2 reviews the technology that will be available for the implementation of the computer in the 1975-1985 period. With regard to the computation task 26 computations have been categorized according to computational load, memory requirements, criticality, permitted down-time, and the need to save data in order to effect a roll-back. The technology part stresses the impact of large scale integration (LSI) on the realization of logic and memory. Also considered was module interconnection possibilities so as to minimize fault propagation.
Two fault tolerant toggle-hook release

NASA Technical Reports Server (NTRS)

Graves, Thomas Joseph (Inventor); Brown, Christopher William (Inventor)

1991-01-01

A coupling device is disclosed which is mechanically two fault tolerant for release. The device comprises a fastener plate and fastener body, each of which is attachable to a different one of a pair of structures to be joined. The fastener plate and body are coupled by an elongate toggle mounted at one end in a socket on the fastener plate for universal pivotal movement thereon. The other end of the toggle is received in an opening in the fastener body and adapted for limited pivotal movement therein. The toggle is adapted to be restrained by three latch hooks arranged in symmetrical equiangular spacing about the axis of the toggle, each hook being mounted on the fastener body for pivotal movement between an unlatching non-contact position with respect to the toggle and a latching position in engagement with a latching surface of the toggle. The device includes releasable lock means for locking each latch hook in its latching position whereby the toggle couples the fastener plate to the fastener body and means for releasing the lock means to unlock each said latch hook from the latch position whereby the unlocking of at least one of the latch hooks from its latching position results in the decoupling of the fastener plate from the fastener body.
Fault tolerant testbed evaluation, phase 1

NASA Technical Reports Server (NTRS)

Caluori, V., Jr.; Newberry, T.

1993-01-01

In recent years, avionics systems development costs have become the driving factor in the development of space systems, military aircraft, and commercial aircraft. A method of reducing avionics development costs is to utilize state-of-the-art software application generator (autocode) tools and methods. The recent maturity of application generator technology has the potential to dramatically reduce development costs by eliminating software development steps that have historically introduced errors and the need for re-work. Application generator tools have been demonstrated to be an effective method for autocoding non-redundant, relatively low-rate input/output (I/O) applications on the Space Station Freedom (SSF) program; however, they have not been demonstrated for fault tolerant, high-rate I/O, flight critical environments. This contract will evaluate the use of application generators in these harsh environments. Using Boeing's quad-redundant avionics system controller as the target system, Space Shuttle Guidance, Navigation, and Control (GN&C) software will be autocoded, tested, and evaluated in the Johnson (Space Center) Avionics Engineering Laboratory (JAEL). The response of the autocoded system will be shown to match the response of the existing Shuttle General Purpose Computers (GPC's), thereby demonstrating the viability of using autocode techniques in the development of future avionics systems.
Fault Detection and Correction for the Solar Dynamics Observatory Attitude Control System

NASA Technical Reports Server (NTRS)

Starin, Scott R.; Vess, Melissa F.; Kenney, Thomas M.; Maldonado, Manuel D.; Morgenstern, Wendy M.

2007-01-01

The Solar Dynamics Observatory is an Explorer-class mission that will launch in early 2009. The spacecraft will operate in a geosynchronous orbit, sending data 24 hours a day to a devoted ground station in White Sands, New Mexico. It will carry a suite of instruments designed to observe the Sun in multiple wavelengths at unprecedented resolution. The Atmospheric Imaging Assembly includes four telescopes with focal plane CCDs that can image the full solar disk in four different visible wavelengths. The Extreme-ultraviolet Variability Experiment will collect time-correlated data on the activity of the Sun's corona. The Helioseismic and Magnetic Imager will enable study of pressure waves moving through the body of the Sun. The attitude control system on Solar Dynamics Observatory is responsible for four main phases of activity. The physical safety of the spacecraft after separation must be guaranteed. Fine attitude determination and control must be sufficient for instrument calibration maneuvers. The mission science mode requires 2-arcsecond control according to error signals provided by guide telescopes on the Atmospheric Imaging Assembly, one of the three instruments to be carried. Lastly, accurate execution of linear and angular momentum changes to the spacecraft must be provided for momentum management and orbit maintenance. In thsp aper, single-fault tolerant fault detection and correction of the Solar Dynamics Observatory attitude control system is described. The attitude control hardware suite for the mission is catalogued, with special attention to redundancy at the hardware level. Four reaction wheels are used where any three are satisfactory. Four pairs of redundant thrusters are employed for orbit change maneuvers and momentum management. Three two-axis gyroscopes provide full redundancy for rate sensing. A digital Sun sensor and two autonomous star trackers provide two-out-of-three redundancy for fine attitude determination. The use of software to maximize
Robust fault-tolerant tracking control design for spacecraft under control input saturation.

PubMed

Bustan, Danyal; Pariz, Naser; Sani, Seyyed Kamal Hosseini

2014-07-01

In this paper, a continuous globally stable tracking control algorithm is proposed for a spacecraft in the presence of unknown actuator failure, control input saturation, uncertainty in inertial matrix and external disturbances. The design method is based on variable structure control and has the following properties: (1) fast and accurate response in the presence of bounded disturbances; (2) robust to the partial loss of actuator effectiveness; (3) explicit consideration of control input saturation; and (4) robust to uncertainty in inertial matrix. In contrast to traditional fault-tolerant control methods, the proposed controller does not require knowledge of the actuator faults and is implemented without explicit fault detection and isolation processes. In the proposed controller a single parameter is adjusted dynamically in such a way that it is possible to prove that both attitude and angular velocity errors will tend to zero asymptotically. The stability proof is based on a Lyapunov analysis and the properties of the singularity free quaternion representation of spacecraft dynamics. Results of numerical simulations state that the proposed controller is successful in achieving high attitude performance in the presence of external disturbances, actuator failures, and control input saturation. Copyright © 2014 ISA. Published by Elsevier Ltd. All rights reserved.
A Fault Tolerance Mechanism for On-Road Sensor Networks

PubMed Central

Feng, Lei; Guo, Shaoyong; Sun, Jialu; Yu, Peng; Li, Wenjing

2016-01-01

On-Road Sensor Networks (ORSNs) play an important role in capturing traffic flow data for predicting short-term traffic patterns, driving assistance and self-driving vehicles. However, this kind of network is prone to large-scale communication failure if a few sensors physically fail. In this paper, to ensure that the network works normally, an effective fault-tolerance mechanism for ORSNs which mainly consists of backup on-road sensor deployment, redundant cluster head deployment and an adaptive failure detection and recovery method is proposed. Firstly, based on the N − x principle and the sensors’ failure rate, this paper formulates the backup sensor deployment problem in the form of a two-objective optimization, which explains the trade-off between the cost and fault resumption. In consideration of improving the network resilience further, this paper introduces a redundant cluster head deployment model according to the coverage constraint. Then a common solving method combining integer-continuing and sequential quadratic programming is explored to determine the optimal location of these two deployment problems. Moreover, an Adaptive Detection and Resume (ADR) protocol is deigned to recover the system communication through route and cluster adjustment if there is a backup on-road sensor mismatch. The final experiments show that our proposed mechanism can achieve an average 90% recovery rate and reduce the average number of failed sensors at most by 35.7%. PMID:27918483
Tutorial: Advanced fault tree applications using HARP

NASA Technical Reports Server (NTRS)

Dugan, Joanne Bechta; Bavuso, Salvatore J.; Boyd, Mark A.

1993-01-01

Reliability analysis of fault tolerant computer systems for critical applications is complicated by several factors. These modeling difficulties are discussed and dynamic fault tree modeling techniques for handling them are described and demonstrated. Several advanced fault tolerant computer systems are described, and fault tree models for their analysis are presented. HARP (Hybrid Automated Reliability Predictor) is a software package developed at Duke University and NASA Langley Research Center that is capable of solving the fault tree models presented.
Fault-tolerant three-level inverter

DOEpatents

Edwards, John; Xu, Longya; Bhargava, Brij B.

2006-12-05

A method for driving a neutral point clamped three-level inverter is provided. In one exemplary embodiment, DC current is received at a neutral point-clamped three-level inverter. The inverter has a plurality of nodes including first, second and third output nodes. The inverter also has a plurality of switches. Faults are checked for in the inverter and predetermined switches are automatically activated responsive to a detected fault such that three-phase electrical power is provided at the output nodes.
Implementation of a Fault Tolerant Control Unit within an FPGA for Space Applications

DTIC Science & Technology

2006-12-01

Conference 2002, September 2002. [20] M. Alderighi, A. Candelori, F. Casini, S. D’Angelo, M. Mancini, A. Paccagnella, S. Pastore , G.R. Sechi, “Heavy...Luigi Carro and Ricardo Reis , “Designing and Testing Fault-Tolerant Techniques for SRAM-based FPGAs,” in Proc. 1st Conference on Computer Frontiers, pp...susceptibility,” in IEEE Proc. 12th IEEE Intl. Symposium on On-Line Testing, pp. 89-91, 2006. [45] Fernanda Lima, Luigi Carro and Ricardo Reis
Fault tolerance techniques to assure data integrity in high-volume PACS image archives

NASA Astrophysics Data System (ADS)

He, Yutao; Huang, Lu J.; Valentino, Daniel J.; Wingate, W. Keith; Avizienis, Algirdas

1995-05-01

Picture archiving and communication systems (PACS) perform the systematic acquisition, archiving, and presentation of large quantities of radiological image and text data. In the UCLA Radiology PACS, for example, the volume of image data archived currently exceeds 2500 gigabytes. Furthermore, the distributed heterogeneous PACS is expected to have near real-time response, be continuously available, and assure the integrity and privacy of patient data. The off-the-shelf subsystems that compose the current PACS cannot meet these expectations; therefore fault tolerance techniques had to be incorporated into the system. This paper is to report our first-step efforts towards the goal and is organized as follows: First we discuss data integrity and identify fault classes under the PACS operational environment, then we describe auditing and accounting schemes developed for error-detection and analyze operational data collected. Finally, we outline plans for future research.
Fault tolerant computing: A preamble for assuring viability of large computer systems

NASA Technical Reports Server (NTRS)

Lim, R. S.

1977-01-01

The need for fault-tolerant computing is addressed from the viewpoints of (1) why it is needed, (2) how to apply it in the current state of technology, and (3) what it means in the context of the Phoenix computer system and other related systems. To this end, the value of concurrent error detection and correction is described. User protection, program retry, and repair are among the factors considered. The technology of algebraic codes to protect memory systems and arithmetic codes to protect memory systems and arithmetic codes to protect arithmetic operations is discussed.
Adaptive-gain fast super-twisting sliding mode fault tolerant control for a reusable launch vehicle in reentry phase.

PubMed

Zhang, Yao; Tang, Shengjing; Guo, Jie

2017-11-01

In this paper, a novel adaptive-gain fast super-twisting (AGFST) sliding mode attitude control synthesis is carried out for a reusable launch vehicle subject to actuator faults and unknown disturbances. According to the fast nonsingular terminal sliding mode surface (FNTSMS) and adaptive-gain fast super-twisting algorithm, an adaptive fault tolerant control law for the attitude stabilization is derived to protect against the actuator faults and unknown uncertainties. Firstly, a second-order nonlinear control-oriented model for the RLV is established by feedback linearization method. And on the basis a fast nonsingular terminal sliding mode (FNTSM) manifold is designed, which provides fast finite-time global convergence and avoids singularity problem as well as chattering phenomenon. Based on the merits of the standard super-twisting (ST) algorithm and fast reaching law with adaption, a novel adaptive-gain fast super-twisting (AGFST) algorithm is proposed for the finite-time fault tolerant attitude control problem of the RLV without any knowledge of the bounds of uncertainties and actuator faults. The important feature of the AGFST algorithm includes non-overestimating the values of the control gains and faster convergence speed than the standard ST algorithm. A formal proof of the finite-time stability of the closed-loop system is derived using the Lyapunov function technique. An estimation of the convergence time and accurate expression of convergence region are also provided. Finally, simulations are presented to illustrate the effectiveness and superiority of the proposed control scheme. Copyright © 2017 ISA. Published by Elsevier Ltd. All rights reserved.
Fractional-order active fault-tolerant force-position controller design for the legged robots using saturated actuator with unknown bias and gain degradation

NASA Astrophysics Data System (ADS)

Farid, Yousef; Majd, Vahid Johari; Ehsani-Seresht, Abbas

2018-05-01

In this paper, a novel fault accommodation strategy is proposed for the legged robots subject to the actuator faults including actuation bias and effective gain degradation as well as the actuator saturation. First, the combined dynamics of two coupled subsystems consisting of the dynamics of the legs subsystem and the body subsystem are developed. Then, the interaction of the robot with the environment is formulated as the contact force optimization problem with equality and inequality constraints. The desired force is obtained by a dynamic model. A robust super twisting fault estimator is proposed to precisely estimate the defective torque amplitude of the faulty actuator in finite time. Defining a novel fractional sliding surface, a fractional nonsingular terminal sliding mode control law is developed. Moreover, by introducing a suitable auxiliary system and using its state vector in the designed controller, the proposed fault-tolerant control (FTC) scheme guarantees the finite-time stability of the closed-loop control system. The robustness and finite-time convergence of the proposed control law is established using the Lyapunov stability theory. Finally, numerical simulations are performed on a quadruped robot to demonstrate the stable walking of the robot with and without actuator faults, and actuator saturation constraints, and the results are compared to results with an integer order fault-tolerant controller.
Bound states for magic state distillation in fault-tolerant quantum computation.

PubMed

Campbell, Earl T; Browne, Dan E

2010-01-22

Magic state distillation is an important primitive in fault-tolerant quantum computation. The magic states are pure nonstabilizer states which can be distilled from certain mixed nonstabilizer states via Clifford group operations alone. Because of the Gottesman-Knill theorem, mixtures of Pauli eigenstates are not expected to be magic state distillable, but it has been an open question whether all mixed states outside this set may be distilled. In this Letter we show that, when resources are finitely limited, nondistillable states exist outside the stabilizer octahedron. In analogy with the bound entangled states, which arise in entanglement theory, we call such states bound states for magic state distillation.
Fault tolerance analysis and applications to microwave modules and MMIC's

NASA Astrophysics Data System (ADS)

Boggan, Garry H.

A project whose objective was to provide an overview of built-in-test (BIT) considerations applicable to microwave systems, modules, and MMICs (monolithic microwave integrated circuits) is discussed. Available analytical techniques and software for assessing system failure characteristics were researched, and the resulting investigation provides a review of two techniques which have applicability to microwave systems design. A system-level approach to fault tolerance and redundancy management is presented in its relationship to the subsystem/element design. An overview of the microwave BIT focus from the Air Force Integrated Diagnostics program is presented. The technical reports prepared by the GIMADS team were reviewed for applicability to microwave modules and components. A review of MIMIC (millimeter and microwave integrated circuit) program activities relative to BIT/BITE is given.
Power maximization of variable-speed variable-pitch wind turbines using passive adaptive neural fault tolerant control

NASA Astrophysics Data System (ADS)

Habibi, Hamed; Rahimi Nohooji, Hamed; Howard, Ian

2017-09-01

Power maximization has always been a practical consideration in wind turbines. The question of how to address optimal power capture, especially when the system dynamics are nonlinear and the actuators are subject to unknown faults, is significant. This paper studies the control methodology for variable-speed variable-pitch wind turbines including the effects of uncertain nonlinear dynamics, system fault uncertainties, and unknown external disturbances. The nonlinear model of the wind turbine is presented, and the problem of maximizing extracted energy is formulated by designing the optimal desired states. With the known system, a model-based nonlinear controller is designed; then, to handle uncertainties, the unknown nonlinearities of the wind turbine are estimated by utilizing radial basis function neural networks. The adaptive neural fault tolerant control is designed passively to be robust on model uncertainties, disturbances including wind speed and model noises, and completely unknown actuator faults including generator torque and pitch actuator torque. The Lyapunov direct method is employed to prove that the closed-loop system is uniformly bounded. Simulation studies are performed to verify the effectiveness of the proposed method.
Observer-based distributed adaptive fault-tolerant containment control of multi-agent systems with general linear dynamics.

PubMed

Ye, Dan; Chen, Mengmeng; Li, Kui

2017-11-01

In this paper, we consider the distributed containment control problem of multi-agent systems with actuator bias faults based on observer method. The objective is to drive the followers into the convex hull spanned by the dynamic leaders, where the input is unknown but bounded. By constructing an observer to estimate the states and bias faults, an effective distributed adaptive fault-tolerant controller is developed. Different from the traditional method, an auxiliary controller gain is designed to deal with the unknown inputs and bias faults together. Moreover, the coupling gain can be adjusted online through the adaptive mechanism without using the global information. Furthermore, the proposed control protocol can guarantee that all the signals of the closed-loop systems are bounded and all the followers converge to the convex hull with bounded residual errors formed by the dynamic leaders. Finally, a decoupled linearized longitudinal motion model of the F-18 aircraft is used to demonstrate the effectiveness. Copyright © 2017 ISA. Published by Elsevier Ltd. All rights reserved.

Specification, Synthesis, and Verification of Software-based Control Protocols for Fault-Tolerant Space Systems

DTIC Science & Technology

2016-08-16

Force Research Laboratory Space Vehicles Directorate AFRL /RVSV 3550 Aberdeen Ave, SE 11. SPONSOR/MONITOR’S REPORT Kirtland AFB, NM 87117-5776 NUMBER...Ft Belvoir, VA 22060-6218 1 cy AFRL /RVIL Kirtland AFB, NM 87117-5776 2 cys Official Record Copy AFRL /RVSV/Richard S. Erwin 1 cy... AFRL -RV-PS- AFRL -RV-PS- TR-2016-0112 TR-2016-0112 SPECIFICATION, SYNTHESIS, AND VERIFICATION OF SOFTWARE-BASED CONTROL PROTOCOLS FOR FAULT-TOLERANT
Fault-Tolerant Software-Defined Radio on Manycore

NASA Technical Reports Server (NTRS)

Ricketts, Scott

2015-01-01

Software-defined radio (SDR) platforms generally rely on field-programmable gate arrays (FPGAs) and digital signal processors (DSPs), but such architectures require significant software development. In addition, application demands for radiation mitigation and fault tolerance exacerbate programming challenges. MaXentric Technologies, LLC, has developed a manycore-based SDR technology that provides 100 times the throughput of conventional radiationhardened general purpose processors. Manycore systems (30-100 cores and beyond) have the potential to provide high processing performance at error rates that are equivalent to current space-deployed uniprocessor systems. MaXentric's innovation is a highly flexible radio, providing over-the-air reconfiguration; adaptability; and uninterrupted, real-time, multimode operation. The technology is also compliant with NASA's Space Telecommunications Radio System (STRS) architecture. In addition to its many uses within NASA communications, the SDR can also serve as a highly programmable research-stage prototyping device for new waveforms and other communications technologies. It can also support noncommunication codes on its multicore processor, collocated with the communications workload-reducing the size, weight, and power of the overall system by aggregating processing jobs to a single board computer.
Detection of faults and software reliability analysis

NASA Technical Reports Server (NTRS)

Knight, J. C.

1986-01-01

Multiversion or N-version programming was proposed as a method of providing fault tolerance in software. The approach requires the separate, independent preparation of multiple versions of a piece of software for some application. Specific topics addressed are: failure probabilities in N-version systems, consistent comparison in N-version systems, descriptions of the faults found in the Knight and Leveson experiment, analytic models of comparison testing, characteristics of the input regions that trigger faults, fault tolerance through data diversity, and the relationship between failures caused by automatically seeded faults.
An Investigative Redesign of the ECG and EMG Signal Conditioning Circuits for Two-fault Tolerance and Circuit Improvement

NASA Technical Reports Server (NTRS)

Obrien, Edward M.

1991-01-01

An investigation was undertaken to make the elctrocardiography (ECG) and the electromyography (EMG) signal conditioning circuits two-fault tolerant and to update the circuitry. The present signal conditioning circuits provide at least one level of subject protection against electrical shock hazard but at a level of 100 micro-A (for voltages of up to 200 V). However, it is necessary to provide catastrophic fault tolerance protection for the astronauts and to provide protection at a current level of less that 100 micro-A. For this study, protection at the 10 micro-A level was sought. This is the generally accepted value below which no possibility of microshock exists. Only the possibility of macroshock exists in the case of the signal conditioners. However, this extra amount of protection is desirable. The initial part deals with current limiter circuits followed by an investigation into the signal conditioner specifications and circuit design.
Fault-tolerant continuous flow systems modelling

NASA Astrophysics Data System (ADS)

Tolbi, B.; Tebbikh, H.; Alla, H.

2017-01-01

This paper presents a structural modelling of faults with hybrid Petri nets (HPNs) for the analysis of a particular class of hybrid dynamic systems, continuous flow systems. HPNs are first used for the behavioural description of continuous flow systems without faults. Then, faults' modelling is considered using a structural method without having to rebuild the model to new. A translation method is given in hierarchical way, it gives a hybrid automata (HA) from an elementary HPN. This translation preserves the behavioural semantics (timed bisimilarity), and reflects the temporal behaviour by giving semantics for each model in terms of timed transition systems. Thus, advantages of the power modelling of HPNs and the analysis ability of HA are taken. A simple example is used to illustrate the ideas.
Design of a fault-tolerant reversible control unit in molecular quantum-dot cellular automata

NASA Astrophysics Data System (ADS)

Bahadori, Golnaz; Houshmand, Monireh; Zomorodi-Moghadam, Mariam

Quantum-dot cellular automata (QCA) is a promising emerging nanotechnology that has been attracting considerable attention due to its small feature size, ultra-low power consuming, and high clock frequency. Therefore, there have been many efforts to design computational units based on this technology. Despite these advantages of the QCA-based nanotechnologies, their implementation is susceptible to a high error rate. On the other hand, using the reversible computing leads to zero bit erasures and no energy dissipation. As the reversible computation does not lose information, the fault detection happens with a high probability. In this paper, first we propose a fault-tolerant control unit using reversible gates which improves on the previous design. The proposed design is then synthesized to the QCA technology and is simulated by the QCADesigner tool. Evaluation results indicate the performance of the proposed approach.
Low cost management of replicated data in fault-tolerant distributed systems

NASA Technical Reports Server (NTRS)

Joseph, Thomas A.; Birman, Kenneth P.

1990-01-01

Many distributed systems replicate data for fault tolerance or availability. In such systems, a logical update on a data item results in a physical update on a number of copies. The synchronization and communication required to keep the copies of replicated data consistent introduce a delay when operations are performed. A technique is described that relaxes the usual degree of synchronization, permitting replicated data items to be updated concurrently with other operations, while at the same time ensuring that correctness is not violated. The additional concurrency thus obtained results in better response time when performing operations on replicated data. How this technique performs in conjunction with a roll-back and a roll-forward failure recovery mechanism is also discussed.
Fault-tolerant Greenberger-Horne-Zeilinger paradox based on non-Abelian anyons.

PubMed

Deng, Dong-Ling; Wu, Chunfeng; Chen, Jing-Ling; Oh, C H

2010-08-06

We propose a scheme to test the Greenberger-Horne-Zeilinger paradox based on braidings of non-Abelian anyons, which are exotic quasiparticle excitations of topological states of matter. Because topological ordered states are robust against local perturbations, this scheme is in some sense "fault-tolerant" and might close the detection inefficiency loophole problem in previous experimental tests of the Greenberger-Horne-Zeilinger paradox. In turn, the construction of the Greenberger-Horne-Zeilinger paradox reveals the nonlocal property of non-Abelian anyons. Our results indicate that the non-Abelian fractional statistics is a pure quantum effect and cannot be described by local realistic theories. Finally, we present a possible experimental implementation of the scheme based on the anyonic interferometry technologies.
Fault Mitigation Schemes for Future Spaceflight Multicore Processors

NASA Technical Reports Server (NTRS)

Alexander, James W.; Clement, Bradley J.; Gostelow, Kim P.; Lai, John Y.

2012-01-01

Future planetary exploration missions demand significant advances in on-board computing capabilities over current avionics architectures based on a single-core processing element. The state-of-the-art multi-core processor provides much promise in meeting such challenges while introducing new fault tolerance problems when applied to space missions. Software-based schemes are being presented in this paper that can achieve system-level fault mitigation beyond that provided by radiation-hard-by-design (RHBD). For mission and time critical applications such as the Terrain Relative Navigation (TRN) for planetary or small body navigation, and landing, a range of fault tolerance methods can be adapted by the application. The software methods being investigated include Error Correction Code (ECC) for data packet routing between cores, virtual network routing, Triple Modular Redundancy (TMR), and Algorithm-Based Fault Tolerance (ABFT). A robust fault tolerance framework that provides fail-operational behavior under hard real-time constraints and graceful degradation will be demonstrated using TRN executing on a commercial Tilera(R) processor with simulated fault injections.
Making classical ground-state spin computing fault-tolerant.

PubMed

Crosson, I J; Bacon, D; Brown, K R

2010-09-01

We examine a model of classical deterministic computing in which the ground state of the classical system is a spatial history of the computation. This model is relevant to quantum dot cellular automata as well as to recent universal adiabatic quantum computing constructions. In its most primitive form, systems constructed in this model cannot compute in an error-free manner when working at nonzero temperature. However, by exploiting a mapping between the partition function for this model and probabilistic classical circuits we are able to show that it is possible to make this model effectively error-free. We achieve this by using techniques in fault-tolerant classical computing and the result is that the system can compute effectively error-free if the temperature is below a critical temperature. We further link this model to computational complexity and show that a certain problem concerning finite temperature classical spin systems is complete for the complexity class Merlin-Arthur. This provides an interesting connection between the physical behavior of certain many-body spin systems and computational complexity.
Design and implementation of a fault-tolerant and dynamic metadata database for clinical trials

NASA Astrophysics Data System (ADS)

Lee, J.; Zhou, Z.; Talini, E.; Documet, J.; Liu, B.

2007-03-01

In recent imaging-based clinical trials, quantitative image analysis (QIA) and computer-aided diagnosis (CAD) methods are increasing in productivity due to higher resolution imaging capabilities. A radiology core doing clinical trials have been analyzing more treatment methods and there is a growing quantity of metadata that need to be stored and managed. These radiology centers are also collaborating with many off-site imaging field sites and need a way to communicate metadata between one another in a secure infrastructure. Our solution is to implement a data storage grid with a fault-tolerant and dynamic metadata database design to unify metadata from different clinical trial experiments and field sites. Although metadata from images follow the DICOM standard, clinical trials also produce metadata specific to regions-of-interest and quantitative image analysis. We have implemented a data access and integration (DAI) server layer where multiple field sites can access multiple metadata databases in the data grid through a single web-based grid service. The centralization of metadata database management simplifies the task of adding new databases into the grid and also decreases the risk of configuration errors seen in peer-to-peer grids. In this paper, we address the design and implementation of a data grid metadata storage that has fault-tolerance and dynamic integration for imaging-based clinical trials.
Model Checking a Byzantine-Fault-Tolerant Self-Stabilizing Protocol for Distributed Clock Synchronization Systems

NASA Technical Reports Server (NTRS)

Malekpour, Mahyar R.

2007-01-01

This report presents the mechanical verification of a simplified model of a rapid Byzantine-fault-tolerant self-stabilizing protocol for distributed clock synchronization systems. This protocol does not rely on any assumptions about the initial state of the system. This protocol tolerates bursts of transient failures, and deterministically converges within a time bound that is a linear function of the self-stabilization period. A simplified model of the protocol is verified using the Symbolic Model Verifier (SMV) [SMV]. The system under study consists of 4 nodes, where at most one of the nodes is assumed to be Byzantine faulty. The model checking effort is focused on verifying correctness of the simplified model of the protocol in the presence of a permanent Byzantine fault as well as confirmation of claims of determinism and linear convergence with respect to the self-stabilization period. Although model checking results of the simplified model of the protocol confirm the theoretical predictions, these results do not necessarily confirm that the protocol solves the general case of this problem. Modeling challenges of the protocol and the system are addressed. A number of abstractions are utilized in order to reduce the state space. Also, additional innovative state space reduction techniques are introduced that can be used in future verification efforts applied to this and other protocols.
A highly reliable, high performance open avionics architecture for real time Nap-of-the-Earth operations

NASA Technical Reports Server (NTRS)

Harper, Richard E.; Elks, Carl

1995-01-01

An Army Fault Tolerant Architecture (AFTA) has been developed to meet real-time fault tolerant processing requirements of future Army applications. AFTA is the enabling technology that will allow the Army to configure existing processors and other hardware to provide high throughput and ultrahigh reliability necessary for TF/TA/NOE flight control and other advanced Army applications. A comprehensive conceptual study of AFTA has been completed that addresses a wide range of issues including requirements, architecture, hardware, software, testability, producibility, analytical models, validation and verification, common mode faults, VHDL, and a fault tolerant data bus. A Brassboard AFTA for demonstration and validation has been fabricated, and two operating systems and a flight-critical Army application have been ported to it. Detailed performance measurements have been made of fault tolerance and operating system overheads while AFTA was executing the flight application in the presence of faults.
Combining Topological Hardware and Topological Software: Color-Code Quantum Computing with Topological Superconductor Networks

NASA Astrophysics Data System (ADS)

Litinski, Daniel; Kesselring, Markus S.; Eisert, Jens; von Oppen, Felix

2017-07-01

We present a scalable architecture for fault-tolerant topological quantum computation using networks of voltage-controlled Majorana Cooper pair boxes and topological color codes for error correction. Color codes have a set of transversal gates which coincides with the set of topologically protected gates in Majorana-based systems, namely, the Clifford gates. In this way, we establish color codes as providing a natural setting in which advantages offered by topological hardware can be combined with those arising from topological error-correcting software for full-fledged fault-tolerant quantum computing. We provide a complete description of our architecture, including the underlying physical ingredients. We start by showing that in topological superconductor networks, hexagonal cells can be employed to serve as physical qubits for universal quantum computation, and we present protocols for realizing topologically protected Clifford gates. These hexagonal-cell qubits allow for a direct implementation of open-boundary color codes with ancilla-free syndrome read-out and logical T gates via magic-state distillation. For concreteness, we describe how the necessary operations can be implemented using networks of Majorana Cooper pair boxes, and we give a feasibility estimate for error correction in this architecture. Our approach is motivated by nanowire-based networks of topological superconductors, but it could also be realized in alternative settings such as quantum-Hall-superconductor hybrids.
Optimal Fault-Tolerant Control for Discrete-Time Nonlinear Strict-Feedback Systems Based on Adaptive Critic Design.

PubMed

Wang, Zhanshan; Liu, Lei; Wu, Yanming; Zhang, Huaguang

2018-06-01

This paper investigates the problem of optimal fault-tolerant control (FTC) for a class of unknown nonlinear discrete-time systems with actuator fault in the framework of adaptive critic design (ACD). A pivotal highlight is the adaptive auxiliary signal of the actuator fault, which is designed to offset the effect of the fault. The considered systems are in strict-feedback forms and involve unknown nonlinear functions, which will result in the causal problem. To solve this problem, the original nonlinear systems are transformed into a novel system by employing the diffeomorphism theory. Besides, the action neural networks (ANNs) are utilized to approximate a predefined unknown function in the backstepping design procedure. Combined the strategic utility function and the ACD technique, a reinforcement learning algorithm is proposed to set up an optimal FTC, in which the critic neural networks (CNNs) provide an approximate structure of the cost function. In this case, it not only guarantees the stability of the systems, but also achieves the optimal control performance as well. In the end, two simulation examples are used to show the effectiveness of the proposed optimal FTC strategy.
Backstepping Design of Adaptive Neural Fault-Tolerant Control for MIMO Nonlinear Systems.

PubMed

Gao, Hui; Song, Yongduan; Wen, Changyun

In this paper, an adaptive controller is developed for a class of multi-input and multioutput nonlinear systems with neural networks (NNs) used as a modeling tool. It is shown that all the signals in the closed-loop system with the proposed adaptive neural controller are globally uniformly bounded for any external input in . In our control design, the upper bound of the NN modeling error and the gains of external disturbance are characterized by unknown upper bounds, which is more rational to establish the stability in the adaptive NN control. Filter-based modification terms are used in the update laws of unknown parameters to improve the transient performance. Finally, fault-tolerant control is developed to accommodate actuator failure. An illustrative example applying the adaptive controller to control a rigid robot arm shows the validation of the proposed controller.In this paper, an adaptive controller is developed for a class of multi-input and multioutput nonlinear systems with neural networks (NNs) used as a modeling tool. It is shown that all the signals in the closed-loop system with the proposed adaptive neural controller are globally uniformly bounded for any external input in . In our control design, the upper bound of the NN modeling error and the gains of external disturbance are characterized by unknown upper bounds, which is more rational to establish the stability in the adaptive NN control. Filter-based modification terms are used in the update laws of unknown parameters to improve the transient performance. Finally, fault-tolerant control is developed to accommodate actuator failure. An illustrative example applying the adaptive controller to control a rigid robot arm shows the validation of the proposed controller.
Optimizing the Reliability and Performance of Service Composition Applications with Fault Tolerance in Wireless Sensor Networks

PubMed Central

Wu, Zhao; Xiong, Naixue; Huang, Yannong; Xu, Degang; Hu, Chunyang

2015-01-01

The services composition technology provides flexible methods for building service composition applications (SCAs) in wireless sensor networks (WSNs). The high reliability and high performance of SCAs help services composition technology promote the practical application of WSNs. The optimization methods for reliability and performance used for traditional software systems are mostly based on the instantiations of software components, which are inapplicable and inefficient in the ever-changing SCAs in WSNs. In this paper, we consider the SCAs with fault tolerance in WSNs. Based on a Universal Generating Function (UGF) we propose a reliability and performance model of SCAs in WSNs, which generalizes a redundancy optimization problem to a multi-state system. Based on this model, an efficient optimization algorithm for reliability and performance of SCAs in WSNs is developed based on a Genetic Algorithm (GA) to find the optimal structure of SCAs with fault-tolerance in WSNs. In order to examine the feasibility of our algorithm, we have evaluated the performance. Furthermore, the interrelationships between the reliability, performance and cost are investigated. In addition, a distinct approach to determine the most suitable parameters in the suggested algorithm is proposed. PMID:26561818
Damage Tolerance of Composites

NASA Technical Reports Server (NTRS)

Hodge, Andy

2007-01-01

Fracture control requirements have been developed to address damage tolerance of composites for manned space flight hardware. The requirements provide the framework for critical and noncritical hardware assessment and testing. The need for damage threat assessments, impact damage protection plans, and nondestructive evaluation are also addressed. Hardware intended to be damage tolerant have extensive coupon, sub-element, and full-scale testing requirements in-line with the Building Block Approach concept from the MIL-HDBK-17, Department of Defense Composite Materials Handbook.
Reliability and maintainability assessment factors for reliable fault-tolerant systems

NASA Technical Reports Server (NTRS)

Bavuso, S. J.

1984-01-01

A long term goal of the NASA Langley Research Center is the development of a reliability assessment methodology of sufficient power to enable the credible comparison of the stochastic attributes of one ultrareliable system design against others. This methodology, developed over a 10 year period, is a combined analytic and simulative technique. An analytic component is the Computer Aided Reliability Estimation capability, third generation, or simply CARE III. A simulative component is the Gate Logic Software Simulator capability, or GLOSS. The numerous factors that potentially have a degrading effect on system reliability and the ways in which these factors that are peculiar to highly reliable fault tolerant systems are accounted for in credible reliability assessments. Also presented are the modeling difficulties that result from their inclusion and the ways in which CARE III and GLOSS mitigate the intractability of the heretofore unworkable mathematics.
Fault tolerant system with imperfect coverage, reboot and server vacation

NASA Astrophysics Data System (ADS)

Jain, Madhu; Meena, Rakesh Kumar

2017-06-01

This study is concerned with the performance modeling of a fault tolerant system consisting of operating units supported by a combination of warm and cold spares. The on-line as well as warm standby units are subject to failures and are send for the repair to a repair facility having single repairman which is prone to failure. If the failed unit is not detected, the system enters into an unsafe state from which it is cleared by the reboot and recovery action. The server is allowed to go for vacation if there is no failed unit present in the system. Markov model is developed to obtain the transient probabilities associated with the system states. Runge-Kutta method is used to evaluate the system state probabilities and queueing measures. To explore the sensitivity and cost associated with the system, numerical simulation is conducted.

A fault-tolerant avionics suite for an entry research vehicle

NASA Technical Reports Server (NTRS)

Dzwonczyk, Mark; Stone, Howard

1988-01-01

A highly-reliable avionics suite has been designed for an Entry Research Vehicle. The autonomous spacecraft would be deployed from the Space Shuttle Orbiter and perform a variety of aerodynamic and propulsive maneuvers which may be required for future space transportation system vehicles. The flight electronics consist of a central fault-tolerant processor, which is resilient to all first failures, reliably cross-strapped to redundant and distributed sets of sensors and effectors. This paper describes the preliminary design and analysis of the architecture which resulted from a fifteen month study by the Charles Stark Draper Laboratory for the NASA Langley Research Center. After a brief introduction to the design task, the architecture of the central flight computer and its interface to the vehicle are discussed. Following this, the method and results of the baseline reliability study for the avionic suite are presented.
A fault-tolerant avionics suite for an entry research vehicle

NASA Astrophysics Data System (ADS)

Dzwonczyk, Mark; Stone, Howard

A highly-reliable avionics suite has been designed for an Entry Research Vehicle. The autonomous spacecraft would be deployed from the Space Shuttle Orbiter and perform a variety of aerodynamic and propulsive maneuvers which may be required for future space transportation system vehicles. The flight electronics consist of a central fault-tolerant processor, which is resilient to all first failures, reliably cross-strapped to redundant and distributed sets of sensors and effectors. This paper describes the preliminary design and analysis of the architecture which resulted from a fifteen month study by the Charles Stark Draper Laboratory for the NASA Langley Research Center. After a brief introduction to the design task, the architecture of the central flight computer and its interface to the vehicle are discussed. Following this, the method and results of the baseline reliability study for the avionic suite are presented.
Disjointness of Stabilizer Codes and Limitations on Fault-Tolerant Logical Gates

NASA Astrophysics Data System (ADS)

Jochym-O'Connor, Tomas; Kubica, Aleksander; Yoder, Theodore J.

2018-04-01

Stabilizer codes are among the most successful quantum error-correcting codes, yet they have important limitations on their ability to fault tolerantly compute. Here, we introduce a new quantity, the disjointness of the stabilizer code, which, roughly speaking, is the number of mostly nonoverlapping representations of any given nontrivial logical Pauli operator. The notion of disjointness proves useful in limiting transversal gates on any error-detecting stabilizer code to a finite level of the Clifford hierarchy. For code families, we can similarly restrict logical operators implemented by constant-depth circuits. For instance, we show that it is impossible, with a constant-depth but possibly geometrically nonlocal circuit, to implement a logical non-Clifford gate on the standard two-dimensional surface code.
A Byzantine-Fault Tolerant Self-Stabilizing Protocol for Distributed Clock Synchronization Systems

NASA Technical Reports Server (NTRS)

Malekpour, Mahyar R.

2006-01-01

Embedded distributed systems have become an integral part of safety-critical computing applications, necessitating system designs that incorporate fault tolerant clock synchronization in order to achieve ultra-reliable assurance levels. Many efficient clock synchronization protocols do not, however, address Byzantine failures, and most protocols that do tolerate Byzantine failures do not self-stabilize. Of the Byzantine self-stabilizing clock synchronization algorithms that exist in the literature, they are based on either unjustifiably strong assumptions about initial synchrony of the nodes or on the existence of a common pulse at the nodes. The Byzantine self-stabilizing clock synchronization protocol presented here does not rely on any assumptions about the initial state of the clocks. Furthermore, there is neither a central clock nor an externally generated pulse system. The proposed protocol converges deterministically, is scalable, and self-stabilizes in a short amount of time. The convergence time is linear with respect to the self-stabilization period. Proofs of the correctness of the protocol as well as the results of formal verification efforts are reported.
Safety Verification of a Fault Tolerant Reconfigurable Autonomous Goal-Based Robotic Control System

NASA Technical Reports Server (NTRS)

Braman, Julia M. B.; Murray, Richard M; Wagner, David A.

2007-01-01

Fault tolerance and safety verification of control systems are essential for the success of autonomous robotic systems. A control architecture called Mission Data System (MDS), developed at the Jet Propulsion Laboratory, takes a goal-based control approach. In this paper, a method for converting goal network control programs into linear hybrid systems is developed. The linear hybrid system can then be verified for safety in the presence of failures using existing symbolic model checkers. An example task is simulated in MDS and successfully verified using HyTech, a symbolic model checking software for linear hybrid systems.
Simulation verification techniques study: Simulation self test hardware design and techniques report

NASA Technical Reports Server (NTRS)

1974-01-01

The final results are presented of the hardware verification task. The basic objectives of the various subtasks are reviewed along with the ground rules under which the overall task was conducted and which impacted the approach taken in deriving techniques for hardware self test. The results of the first subtask and the definition of simulation hardware are presented. The hardware definition is based primarily on a brief review of the simulator configurations anticipated for the shuttle training program. The results of the survey of current self test techniques are presented. The data sources that were considered in the search for current techniques are reviewed, and results of the survey are presented in terms of the specific types of tests that are of interest for training simulator applications. Specifically, these types of tests are readiness tests, fault isolation tests and incipient fault detection techniques. The most applicable techniques were structured into software flows that are then referenced in discussions of techniques for specific subsystems.
Research, Development and Testing of a Fault-Tolerant FPGA-Based Sequencer for CubeSat Launching Applications

DTIC Science & Technology

2013-03-01

amounts of time and effort to implement. Future testing with commercial, fault-tolerant synthesis software, under a radiation environment, will yield ...initial viewpoint of the author is to take the flash-based FPGA route. This will yield a simple, reconfigurable circuit while providing the added...structure seen in Figure 30. Each of these full adder blocks were replaced in subsequent iterations to yield proper comparison with this baseline
Plan for the Characterization of HIRF Effects on a Fault-Tolerant Computer Communication System

NASA Technical Reports Server (NTRS)

Torres-Pomales, Wilfredo; Malekpour, Mahyar R.; Miner, Paul S.; Koppen, Sandra V.

2008-01-01

This report presents the plan for the characterization of the effects of high intensity radiated fields on a prototype implementation of a fault-tolerant data communication system. Various configurations of the communication system will be tested. The prototype system is implemented using off-the-shelf devices. The system will be tested in a closed-loop configuration with extensive real-time monitoring. This test is intended to generate data suitable for the design of avionics health management systems, as well as redundancy management mechanisms and policies for robust distributed processing architectures.
Model prototype utilization in the analysis of fault tolerant control and data processing systems

NASA Astrophysics Data System (ADS)

Kovalev, I. V.; Tsarev, R. Yu; Gruzenkin, D. V.; Prokopenko, A. V.; Knyazkov, A. N.; Laptenok, V. D.

2016-04-01

The procedure assessing the profit of control and data processing system implementation is presented in the paper. The reasonability of model prototype creation and analysis results from the implementing of the approach of fault tolerance provision through the inclusion of structural and software assessment redundancy. The developed procedure allows finding the best ratio between the development cost and the analysis of model prototype and earnings from the results of this utilization and information produced. The suggested approach has been illustrated by the model example of profit assessment and analysis of control and data processing system.
Fault Injection and Monitoring Capability for a Fault-Tolerant Distributed Computation System

NASA Technical Reports Server (NTRS)

Torres-Pomales, Wilfredo; Yates, Amy M.; Malekpour, Mahyar R.

2010-01-01

The Configurable Fault-Injection and Monitoring System (CFIMS) is intended for the experimental characterization of effects caused by a variety of adverse conditions on a distributed computation system running flight control applications. A product of research collaboration between NASA Langley Research Center and Old Dominion University, the CFIMS is the main research tool for generating actual fault response data with which to develop and validate analytical performance models and design methodologies for the mitigation of fault effects in distributed flight control systems. Rather than a fixed design solution, the CFIMS is a flexible system that enables the systematic exploration of the problem space and can be adapted to meet the evolving needs of the research. The CFIMS has the capabilities of system-under-test (SUT) functional stimulus generation, fault injection and state monitoring, all of which are supported by a configuration capability for setting up the system as desired for a particular experiment. This report summarizes the work accomplished so far in the development of the CFIMS concept and documents the first design realization.
Sliding mode fault tolerant control dealing with modeling uncertainties and actuator faults.

PubMed

Wang, Tao; Xie, Wenfang; Zhang, Youmin

2012-05-01

In this paper, two sliding mode control algorithms are developed for nonlinear systems with both modeling uncertainties and actuator faults. The first algorithm is developed under an assumption that the uncertainty bounds are known. Different design parameters are utilized to deal with modeling uncertainties and actuator faults, respectively. The second algorithm is an adaptive version of the first one, which is developed to accommodate uncertainties and faults without utilizing exact bounds information. The stability of the overall control systems is proved by using a Lyapunov function. The effectiveness of the developed algorithms have been verified on a nonlinear longitudinal model of Boeing 747-100/200. Copyright © 2012 ISA. Published by Elsevier Ltd. All rights reserved.
Interface Circuits for Self-Checking Microprocessors

NASA Technical Reports Server (NTRS)

Rennels, D. A.; Chandramouli, R.

1986-01-01

Fault-tolerant-microcomputer concept based on enhancing "simple" computer with redundancy and self-checking logic circuits detect hardware faults. Interface and checking logic and redundant processors confer on 16-bit microcomputer ability to check itself for hardware faults. Checking circuitry also checks itself. Concept of self-checking complementary pairs (SCCP's) employed throughout ICL unit.
Adaptive robust fault tolerant control design for a class of nonlinear uncertain MIMO systems with quantization.

PubMed

Ao, Wei; Song, Yongdong; Wen, Changyun

2017-05-01

In this paper, we investigate the adaptive control problem for a class of nonlinear uncertain MIMO systems with actuator faults and quantization effects. Under some mild conditions, an adaptive robust fault-tolerant control is developed to compensate the affects of uncertainties, actuator failures and errors caused by quantization, and a range of the parameters for these quantizers is established. Furthermore, a Lyapunov-like approach is adopted to demonstrate that the ultimately uniformly bounded output tracking error is guaranteed by the controller, and the signals of the closed-loop system are ensured to be bounded, even in the presence of at most m-q actuators stuck or outage. Finally, numerical simulations are provided to verify and illustrate the effectiveness of the proposed adaptive schemes. Copyright © 2017 ISA. Published by Elsevier Ltd. All rights reserved.
High Temperature, Permanent Magnet Biased, Fault Tolerant, Homopolar Magnetic Bearing Development

NASA Technical Reports Server (NTRS)

Palazzolo, Alan; Tucker, Randall; Kenny, Andrew; Kang, Kyung-Dae; Ghandi, Varun; Liu, Jinfang; Choi, Heeju; Provenza, Andrew

2008-01-01

This paper summarizes the development of a magnetic bearing designed to operate at 1,000 F. A novel feature of this high temperature magnetic bearing is its homopolar construction which incorporates state of the art high temperature, 1,000 F, permanent magnets. A second feature is its fault tolerance capability which provides the desired control forces with over one-half of the coils failed. The construction and design methodology of the bearing is outlined and test results are shown. The agreement between a 3D finite element, magnetic field based prediction for force is shown to be in good agreement with predictions at room and high temperature. A 5 axis test rig will be complete soon to provide a means to test the magnetic bearings at high temperature and speed.
Scalable and fault tolerant orthogonalization based on randomized distributed data aggregation

PubMed Central

Gansterer, Wilfried N.; Niederbrucker, Gerhard; Straková, Hana; Schulze Grotthoff, Stefan

2013-01-01

The construction of distributed algorithms for matrix computations built on top of distributed data aggregation algorithms with randomized communication schedules is investigated. For this purpose, a new aggregation algorithm for summing or averaging distributed values, the push-flow algorithm, is developed, which achieves superior resilience properties with respect to failures compared to existing aggregation methods. It is illustrated that on a hypercube topology it asymptotically requires the same number of iterations as the optimal all-to-all reduction operation and that it scales well with the number of nodes. Orthogonalization is studied as a prototypical matrix computation task. A new fault tolerant distributed orthogonalization method rdmGS, which can produce accurate results even in the presence of node failures, is built on top of distributed data aggregation algorithms. PMID:24748902
Design-for-Hardware-Trust Techniques, Detection Strategies and Metrics for Hardware Trojans

DTIC Science & Technology

2015-12-14

down both rising and falling transitions. For Trojan detection , one fault , slow-‐to-‐rise or slow-‐to...in Jan. 2016. Through the course of this project we developed novel hardware Trojan detection techniques based on clock sweeping. The technique takes...algorithms to detect minor changes due to Trojan and compared them with those changes made by process variations. This technique was implemented on
On the use of inexact, pruned hardware in atmospheric modelling

PubMed Central

Düben, Peter D.; Joven, Jaume; Lingamneni, Avinash; McNamara, Hugh; De Micheli, Giovanni; Palem, Krishna V.; Palmer, T. N.

2014-01-01

Inexact hardware design, which advocates trading the accuracy of computations in exchange for significant savings in area, power and/or performance of computing hardware, has received increasing prominence in several error-tolerant application domains, particularly those involving perceptual or statistical end-users. In this paper, we evaluate inexact hardware for its applicability in weather and climate modelling. We expand previous studies on inexact techniques, in particular probabilistic pruning, to floating point arithmetic units and derive several simulated set-ups of pruned hardware with reasonable levels of error for applications in atmospheric modelling. The set-up is tested on the Lorenz ‘96 model, a toy model for atmospheric dynamics, using software emulation for the proposed hardware. The results show that large parts of the computation tolerate the use of pruned hardware blocks without major changes in the quality of short- and long-time diagnostics, such as forecast errors and probability density functions. This could open the door to significant savings in computational cost and to higher resolution simulations with weather and climate models. PMID:24842031
RAID Unbound: Storage Fault Tolerance in a Distributed Environment

NASA Technical Reports Server (NTRS)

Ritchie, Brian

1996-01-01

Mirroring, data replication, backup, and more recently, redundant arrays of independent disks (RAID) are all technologies used to protect and ensure access to critical company data. A new set of problems has arisen as data becomes more and more geographically distributed. Each of the technologies listed above provides important benefits; but each has failed to adapt fully to the realities of distributed computing. The key to data high availability and protection is to take the technologies' strengths and 'virtualize' them across a distributed network. RAID and mirroring offer high data availability, which data replication and backup provide strong data protection. If we take these concepts at a very granular level (defining user, record, block, file, or directory types) and them liberate them from the physical subsystems with which they have traditionally been associated, we have the opportunity to create a highly scalable network wide storage fault tolerance. The network becomes the virtual storage space in which the traditional concepts of data high availability and protection are implemented without their corresponding physical constraints.
Fault-tolerant quantum blind signature protocols against collective noise

NASA Astrophysics Data System (ADS)

Zhang, Ming-Hui; Li, Hui-Fang

2016-10-01

This work proposes two fault-tolerant quantum blind signature protocols based on the entanglement swapping of logical Bell states, which are robust against two kinds of collective noises: the collective-dephasing noise and the collective-rotation noise, respectively. Both of the quantum blind signature protocols are constructed from four-qubit decoherence-free (DF) states, i.e., logical Bell qubits. The initial message is encoded on the logical Bell qubits with logical unitary operations, which will not destroy the anti-noise trait of the logical Bell qubits. Based on the fundamental property of quantum entanglement swapping, the receiver simply performs two Bell-state measurements (rather than four-qubit joint measurements) on the logical Bell qubits to verify the signature, which makes the protocols more convenient in a practical application. Different from the existing quantum signature protocols, our protocols can offer the high fidelity of quantum communication with the employment of logical qubits. Moreover, we hereinafter prove the security of the protocols against some individual eavesdropping attacks, and we show that our protocols have the characteristics of unforgeability, undeniability and blindness.
Fault tolerance in space-based digital signal processing and switching systems: Protecting up-link processing resources, demultiplexer, demodulator, and decoder

NASA Technical Reports Server (NTRS)

Redinbo, Robert

1994-01-01

Fault tolerance features in the first three major subsystems appearing in the next generation of communications satellites are described. These satellites will contain extensive but efficient high-speed processing and switching capabilities to support the low signal strengths associated with very small aperture terminals. The terminals' numerous data channels are combined through frequency division multiplexing (FDM) on the up-links and are protected individually by forward error-correcting (FEC) binary convolutional codes. The front-end processing resources, demultiplexer, demodulators, and FEC decoders extract all data channels which are then switched individually, multiplexed, and remodulated before retransmission to earth terminals through narrow beam spot antennas. Algorithm based fault tolerance (ABFT) techniques, which relate real number parity values with data flows and operations, are used to protect the data processing operations. The additional checking features utilize resources that can be substituted for normal processing elements when resource reconfiguration is required to replace a failed unit.

Network-Physics(NP) Bec DIGITAL(#)-VULNERABILITY Versus Fault-Tolerant Analog

NASA Astrophysics Data System (ADS)

Alexander, G. K.; Hathaway, M.; Schmidt, H. E.; Siegel, E.

2011-03-01

Siegel[AMS Joint Mtg.(2002)-Abs.973-60-124] digits logarithmic-(Newcomb(1881)-Weyl(1914; 1916)-Benford(1938)-"NeWBe"/"OLDbe")-law algebraic-inversion to ONLY BEQS BEC:Quanta/Bosons= digits: Synthesis reveals EMP-like SEVERE VULNERABILITY of ONLY DIGITAL-networks(VS. FAULT-TOLERANT ANALOG INvulnerability) via Barabasi "Network-Physics" relative-``statics''(VS.dynamics-[Willinger-Alderson-Doyle(Not.AMS(5/09)]-]critique); (so called)"Quantum-computing is simple-arithmetic(sans division/ factorization); algorithmic-complexities: INtractibility/ UNdecidability/ INefficiency/NONcomputability / HARDNESS(so MIScalled) "noise"-induced-phase-transitions(NITS) ACCELERATION: Cook-Levin theorem Reducibility is Renormalization-(Semi)-Group fixed-points; number-Randomness DEFINITION via WHAT? Query(VS. Goldreich[Not.AMS(02)] How? mea culpa)can ONLY be MBCS "hot-plasma" versus digit-clumping NON-random BEC; Modular-arithmetic Congruences= Signal X Noise PRODUCTS = clock-model; NON-Shor[Physica A,341,586(04)] BEC logarithmic-law inversion factorization:Watkins number-thy. U stat.-phys.); P=/=NP TRIVIAL Proof: Euclid!!! [(So Miscalled) computational-complexity J-O obviation via geometry.
Reliability and coverage analysis of non-repairable fault-tolerant memory systems

NASA Technical Reports Server (NTRS)

Cox, G. W.; Carroll, B. D.

1976-01-01

A method was developed for the construction of probabilistic state-space models for nonrepairable systems. Models were developed for several systems which achieved reliability improvement by means of error-coding, modularized sparing, massive replication and other fault-tolerant techniques. From the models developed, sets of reliability and coverage equations for the systems were developed. Comparative analyses of the systems were performed using these equation sets. In addition, the effects of varying subunit reliabilities on system reliability and coverage were described. The results of these analyses indicated that a significant gain in system reliability may be achieved by use of combinations of modularized sparing, error coding, and software error control. For sufficiently reliable system subunits, this gain may far exceed the reliability gain achieved by use of massive replication techniques, yet result in a considerable saving in system cost.
A New On-Line Diagnosis Protocol for the SPIDER Family of Byzantine Fault Tolerant Architectures

NASA Technical Reports Server (NTRS)

Geser, Alfons; Miner, Paul S.

2004-01-01

This paper presents the formal verification of a new protocol for online distributed diagnosis for the SPIDER family of architectures. An instance of the Scalable Processor-Independent Design for Electromagnetic Resilience (SPIDER) architecture consists of a collection of processing elements communicating over a Reliable Optical Bus (ROBUS). The ROBUS is a specialized fault-tolerant device that guarantees Interactive Consistency, Distributed Diagnosis (Group Membership), and Synchronization in the presence of a bounded number of physical faults. Formal verification of the original SPIDER diagnosis protocol provided a detailed understanding that led to the discovery of a significantly more efficient protocol. The original protocol was adapted from the formally verified protocol used in the MAFT architecture. It required O(N) message exchanges per defendant to correctly diagnose failures in a system with N nodes. The new protocol achieves the same diagnostic fidelity, but only requires O(1) exchanges per defendant. This paper presents this new diagnosis protocol and a formal proof of its correctness using PVS.
Integrated Hardware and Software for No-Loss Computing

NASA Technical Reports Server (NTRS)

James, Mark

2007-01-01

When an algorithm is distributed across multiple threads executing on many distinct processors, a loss of one of those threads or processors can potentially result in the total loss of all the incremental results up to that point. When implementation is massively hardware distributed, then the probability of a hardware failure during the course of a long execution is potentially high. Traditionally, this problem has been addressed by establishing checkpoints where the current state of some or part of the execution is saved. Then in the event of a failure, this state information can be used to recompute that point in the execution and resume the computation from that point. A serious problem arises when one distributes a problem across multiple threads and physical processors is that one increases the likelihood of the algorithm failing due to no fault of the scientist but as a result of hardware faults coupled with operating system problems. With good reason, scientists expect their computing tools to serve them and not the other way around. What is novel here is a unique combination of hardware and software that reformulates an application into monolithic structure that can be monitored in real-time and dynamically reconfigured in the event of a failure. This unique reformulation of hardware and software will provide advanced aeronautical technologies to meet the challenges of next-generation systems in aviation, for civilian and scientific purposes, in our atmosphere and in atmospheres of other worlds. In particular, with respect to NASA s manned flight to Mars, this technology addresses the critical requirements for improving safety and increasing reliability of manned spacecraft.
Survivable algorithms and redundancy management in NASA's distributed computing systems

NASA Technical Reports Server (NTRS)

Malek, Miroslaw

1992-01-01

The design of survivable algorithms requires a solid foundation for executing them. While hardware techniques for fault-tolerant computing are relatively well understood, fault-tolerant operating systems, as well as fault-tolerant applications (survivable algorithms), are, by contrast, little understood, and much more work in this field is required. We outline some of our work that contributes to the foundation of ultrareliable operating systems and fault-tolerant algorithm design. We introduce our consensus-based framework for fault-tolerant system design. This is followed by a description of a hierarchical partitioning method for efficient consensus. A scheduler for redundancy management is introduced, and application-specific fault tolerance is described. We give an overview of our hybrid algorithm technique, which is an alternative to the formal approach given.
Development of an interface for an ultrareliable fault-tolerant control system and an electronic servo-control unit

NASA Technical Reports Server (NTRS)

Shaver, Charles; Williamson, Michael

1986-01-01

The NASA Ames Research Center sponsors a research program for the investigation of Intelligent Flight Control Actuation systems. The use of artificial intelligence techniques in conjunction with algorithmic techniques for autonomous, decentralized fault management of flight-control actuation systems is explored under this program. The design, development, and operation of the interface for laboratory investigation of this program is documented. The interface, architecturally based on the Intel 8751 microcontroller, is an interrupt-driven system designed to receive a digital message from an ultrareliable fault-tolerant control system (UFTCS). The interface links the UFTCS to an electronic servo-control unit, which controls a set of hydraulic actuators. It was necessary to build a UFTCS emulator (also based on the Intel 8751) to provide signal sources for testing the equipment.
A tutorial on the CARE III approach to reliability modeling. [of fault tolerant avionics and control systems

NASA Technical Reports Server (NTRS)

Trivedi, K. S.; Geist, R. M.

1981-01-01

The CARE 3 reliability model for aircraft avionics and control systems is described by utilizing a number of examples which frequently use state-of-the-art mathematical modeling techniques as a basis for their exposition. Behavioral decomposition followed by aggregration were used in an attempt to deal with reliability models with a large number of states. A comprehensive set of models of the fault-handling processes in a typical fault-tolerant system was used. These models were semi-Markov in nature, thus removing the usual restrictions of exponential holding times within the coverage model. The aggregate model is a non-homogeneous Markov chain, thus allowing the times to failure to posses Weibull-like distributions. Because of the departures from traditional models, the solution method employed is that of Kolmogorov integral equations, which are evaluated numerically.
Fault trees and sequence dependencies

NASA Technical Reports Server (NTRS)

Dugan, Joanne Bechta; Boyd, Mark A.; Bavuso, Salvatore J.

1990-01-01

One of the frequently cited shortcomings of fault-tree models, their inability to model so-called sequence dependencies, is discussed. Several sources of such sequence dependencies are discussed, and new fault-tree gates to capture this behavior are defined. These complex behaviors can be included in present fault-tree models because they utilize a Markov solution. The utility of the new gates is demonstrated by presenting several models of the fault-tolerant parallel processor, which include both hot and cold spares.
Autonomous Dynamically Self-Organizing and Self-Healing Distributed Hardware Architecture - the eDNA Concept

NASA Technical Reports Server (NTRS)

Boesen, Michael Reibel; Madsen, Jan; Keymeulen, Didier

2011-01-01

This paper presents the current state of the autonomous dynamically self-organizing and self-healing electronic DNA (eDNA) hardware architecture (patent pending). In its current prototype state, the eDNA architecture is capable of responding to multiple injected faults by autonomously reconfiguring itself to accommodate the fault and keep the application running. This paper will also disclose advanced features currently available in the simulation model only. These features are future work and will soon be implemented in hardware. Finally we will describe step-by-step how an application is implemented on the eDNA architecture.
A modified NARMAX model-based self-tuner with fault tolerance for unknown nonlinear stochastic hybrid systems with an input-output direct feed-through term.

PubMed

Tsai, Jason S-H; Hsu, Wen-Teng; Lin, Long-Guei; Guo, Shu-Mei; Tann, Joseph W

2014-01-01

A modified nonlinear autoregressive moving average with exogenous inputs (NARMAX) model-based state-space self-tuner with fault tolerance is proposed in this paper for the unknown nonlinear stochastic hybrid system with a direct transmission matrix from input to output. Through the off-line observer/Kalman filter identification method, one has a good initial guess of modified NARMAX model to reduce the on-line system identification process time. Then, based on the modified NARMAX-based system identification, a corresponding adaptive digital control scheme is presented for the unknown continuous-time nonlinear system, with an input-output direct transmission term, which also has measurement and system noises and inaccessible system states. Besides, an effective state space self-turner with fault tolerance scheme is presented for the unknown multivariable stochastic system. A quantitative criterion is suggested by comparing the innovation process error estimated by the Kalman filter estimation algorithm, so that a weighting matrix resetting technique by adjusting and resetting the covariance matrices of parameter estimate obtained by the Kalman filter estimation algorithm is utilized to achieve the parameter estimation for faulty system recovery. Consequently, the proposed method can effectively cope with partially abrupt and/or gradual system faults and input failures by the fault detection. Copyright © 2013 ISA. Published by Elsevier Ltd. All rights reserved.
An experimental investigation of fault tolerant software structures in an avionics application

NASA Technical Reports Server (NTRS)

Caglayan, Alper K.; Eckhardt, Dave E., Jr.

1989-01-01

The objective of this experimental investigation is to compare the functional performance and software reliability of competing fault tolerant software structures utilizing software diversity. In this experiment, three versions of the redundancy management software for a skewed sensor array have been developed using three diverse failure detection and isolation algorithms and incorporated into various N-version, recovery block and hybrid software structures. The empirical results show that, for maximum functional performance improvement in the selected application domain, the results of diverse algorithms should be voted before being processed by multiple versions without enforced diversity. Results also suggest that when the reliability gain with an N-version structure is modest, recovery block structures are more feasible since higher reliability can be obtained using an acceptance check with a modest reliability.
Fault tolerance in an inner-outer solver: A GVR-enabled case study

DOE PAGES

Zhang, Ziming; Chien, Andrew A.; Teranishi, Keita

2015-04-18

Resilience is a major challenge for large-scale systems. It is particularly important for iterative linear solvers, since they take much of the time of many scientific applications. We show that single bit flip errors in the Flexible GMRES iterative linear solver can lead to high computational overhead or even failure to converge to the right answer. Informed by these results, we design and evaluate several strategies for fault tolerance in both inner and outer solvers appropriate across a range of error rates. We implement them, extending Trilinos’ solver library with the Global View Resilience (GVR) programming model, which provides multi-streammore » snapshots, multi-version data structures with portable and rich error checking/recovery. Lastly, experimental results validate correct execution with low performance overhead under varied error conditions.« less
Direct Fault Tolerant RLV Altitude Control: A Singular Perturbation Approach

NASA Technical Reports Server (NTRS)

Zhu, J. J.; Lawrence, D. A.; Fisher, J.; Shtessel, Y. B.; Hodel, A. S.; Lu, P.; Jackson, Scott (Technical Monitor)

2002-01-01

In this paper, we present a direct fault tolerant control (DFTC) technique, where by "direct" we mean that no explicit fault identification is used. The technique will be presented for the attitude controller (autopilot) for a reusable launch vehicle (RLV), although in principle it can be applied to many other applications. Any partial or complete failure of control actuators and effectors will be inferred from saturation of one or more commanded control signals generated by the controller. The saturation causes a reduction in the effective gain, or bandwidth of the feedback loop, which can be modeled as an increase in singular perturbation in the loop. In order to maintain stability, the bandwidth of the nominal (reduced-order) system will be reduced proportionally according to the singular perturbation theory. The presented DFTC technique automatically handles momentary saturations and integrator windup caused by excessive disturbances, guidance command or dispersions under normal vehicle conditions. For multi-input, multi-output (MIMO) systems with redundant control effectors, such as the RLV attitude control system, an algorithm is presented for determining the direction of bandwidth cutback using the method of minimum-time optimal control with constrained control in order to maintain the best performance that is possible with the reduced control authority. Other bandwidth cutback logic, such as one that preserves the commanded direction of the bandwidth or favors a preferred direction when the commanded direction cannot be achieved, is also discussed. In this extended abstract, a simplistic example is proved to demonstrate the idea. In the final paper, test results on the high fidelity 6-DOF X-33 model with severe dispersions will be presented.
Reliable and Fault-Tolerant Software-Defined Network Operations Scheme for Remote 3D Printing

NASA Astrophysics Data System (ADS)

Kim, Dongkyun; Gil, Joon-Min

2015-03-01

The recent wide expansion of applicable three-dimensional (3D) printing and software-defined networking (SDN) technologies has led to a great deal of attention being focused on efficient remote control of manufacturing processes. SDN is a renowned paradigm for network softwarization, which has helped facilitate remote manufacturing in association with high network performance, since SDN is designed to control network paths and traffic flows, guaranteeing improved quality of services by obtaining network requests from end-applications on demand through the separated SDN controller or control plane. However, current SDN approaches are generally focused on the controls and automation of the networks, which indicates that there is a lack of management plane development designed for a reliable and fault-tolerant SDN environment. Therefore, in addition to the inherent advantage of SDN, this paper proposes a new software-defined network operations center (SD-NOC) architecture to strengthen the reliability and fault-tolerance of SDN in terms of network operations and management in particular. The cooperation and orchestration between SDN and SD-NOC are also introduced for the SDN failover processes based on four principal SDN breakdown scenarios derived from the failures of the controller, SDN nodes, and connected links. The abovementioned SDN troubles significantly reduce the network reachability to remote devices (e.g., 3D printers, super high-definition cameras, etc.) and the reliability of relevant control processes. Our performance consideration and analysis results show that the proposed scheme can shrink operations and management overheads of SDN, which leads to the enhancement of responsiveness and reliability of SDN for remote 3D printing and control processes.
A method for joint routing, wavelength dimensioning and fault tolerance for any set of simultaneous failures on dynamic WDM optical networks

NASA Astrophysics Data System (ADS)

Jara, Nicolás; Vallejos, Reinaldo; Rubino, Gerardo

2017-11-01

The design of optical networks decomposes into different tasks, where the engineers must basically organize the way the main system's resources are used, minimizing the design and operation costs and respecting critical performance constraints. More specifically, network operators face the challenge of solving routing and wavelength dimensioning problems while aiming to simultaneously minimize the network cost and to ensure that the network performance meets the level established in the Service Level Agreement (SLA). We call this the Routing and Wavelength Dimensioning (R&WD) problem. Another important problem to be solved is how to deal with failures of links when the network is operating. When at least one link fails, a high rate of data loss may occur. To avoid it, the network must be designed in such a manner that upon one or multiple failures, the affected connections can still communicate using alternative routes, a mechanism known as Fault Tolerance (FT). When the mechanism allows to deal with an arbitrary number of faults, we speak about Multiple Fault Tolerance (MFT). The different tasks before mentioned are usually solved separately, or in some cases by pairs, leading to solutions that are not necessarily close to optimal ones. This paper proposes a novel method to simultaneously solve all of them, that is, the Routing, the Wavelength Dimensioning, and the Multiple Fault Tolerance problems. The method allows to obtain: a) all the primary routes by which each connection normally transmits its information, b) the additional routes, called secondary routes, used to keep each user connected in cases where one or more simultaneous failures occur, and c) the number of wavelengths available at each link of the network, calculated such that the blocking probability of each connection is lower than a pre-determined threshold (which is a network design parameter), despite the occurrence of simultaneous link failures. The solution obtained by the new algorithm is
Implementing a strand of a scalable fault-tolerant quantum computing fabric.

PubMed

Chow, Jerry M; Gambetta, Jay M; Magesan, Easwar; Abraham, David W; Cross, Andrew W; Johnson, B R; Masluk, Nicholas A; Ryan, Colm A; Smolin, John A; Srinivasan, Srikanth J; Steffen, M

2014-06-24

With favourable error thresholds and requiring only nearest-neighbour interactions on a lattice, the surface code is an error-correcting code that has garnered considerable attention. At the heart of this code is the ability to perform a low-weight parity measurement of local code qubits. Here we demonstrate high-fidelity parity detection of two code qubits via measurement of a third syndrome qubit. With high-fidelity gates, we generate entanglement distributed across three superconducting qubits in a lattice where each code qubit is coupled to two bus resonators. Via high-fidelity measurement of the syndrome qubit, we deterministically entangle the code qubits in either an even or odd parity Bell state, conditioned on the syndrome qubit state. Finally, to fully characterize this parity readout, we develop a measurement tomography protocol. The lattice presented naturally extends to larger networks of qubits, outlining a path towards fault-tolerant quantum computing.
Fault-tolerance thresholds for the surface code with fabrication errors

NASA Astrophysics Data System (ADS)

Auger, James M.; Anwar, Hussain; Gimeno-Segovia, Mercedes; Stace, Thomas M.; Browne, Dan E.

2017-10-01

The construction of topological error correction codes requires the ability to fabricate a lattice of physical qubits embedded on a manifold with a nontrivial topology such that the quantum information is encoded in the global degrees of freedom (i.e., the topology) of the manifold. However, the manufacturing of large-scale topological devices will undoubtedly suffer from fabrication errors—permanent faulty components such as missing physical qubits or failed entangling gates—introducing permanent defects into the topology of the lattice and hence significantly reducing the distance of the code and the quality of the encoded logical qubits. In this work we investigate how fabrication errors affect the performance of topological codes, using the surface code as the test bed. A known approach to mitigate defective lattices involves the use of primitive swap gates in a long sequence of syndrome extraction circuits. Instead, we show that in the presence of fabrication errors the syndrome can be determined using the supercheck operator approach and the outcome of the defective gauge stabilizer generators without any additional computational overhead or use of swap gates. We report numerical fault-tolerance thresholds in the presence of both qubit fabrication and gate fabrication errors using a circuit-based noise model and the minimum-weight perfect-matching decoder. Our numerical analysis is most applicable to two-dimensional chip-based technologies, but the techniques presented here can be readily extended to other topological architectures. We find that in the presence of 8 % qubit fabrication errors, the surface code can still tolerate a computational error rate of up to 0.1 % .
The Development of Design Tools for Fault Tolerant Quantum Dot Cellular Automata Based Logic

NASA Technical Reports Server (NTRS)

Armstrong, Curtis D.; Humphreys, William M.

2003-01-01

We are developing software to explore the fault tolerance of quantum dot cellular automata gate architectures in the presence of manufacturing variations and device defects. The Topology Optimization Methodology using Applied Statistics (TOMAS) framework extends the capabilities of the A Quantum Interconnected Network Array Simulator (AQUINAS) by adding front-end and back-end software and creating an environment that integrates all of these components. The front-end tools establish all simulation parameters, configure the simulation system, automate the Monte Carlo generation of simulation files, and execute the simulation of these files. The back-end tools perform automated data parsing, statistical analysis and report generation.
Detection of faults and software reliability analysis

NASA Technical Reports Server (NTRS)

Knight, J. C.

1987-01-01

Specific topics briefly addressed include: the consistent comparison problem in N-version system; analytic models of comparison testing; fault tolerance through data diversity; and the relationship between failures caused by automatically seeded faults.
Synchronization and fault-masking in redundant real-time systems

NASA Technical Reports Server (NTRS)

Krishna, C. M.; Shin, K. G.; Butler, R. W.

1983-01-01

A real time computer may fail because of massive component failures or not responding quickly enough to satisfy real time requirements. An increase in redundancy - a conventional means of improving reliability - can improve the former but can - in some cases - degrade the latter considerably due to the overhead associated with redundancy management, namely the time delay resulting from synchronization and voting/interactive consistency techniques. The implications of synchronization and voting/interactive consistency algorithms in N-modular clusters on reliability are considered. All these studies were carried out in the context of real time applications. As a demonstrative example, we have analyzed results from experiments conducted at the NASA Airlab on the Software Implemented Fault Tolerance (SIFT) computer. This analysis has indeed indicated that in most real time applications, it is better to employ hardware synchronization instead of software synchronization and not allow reconfiguration.

Hierarchical specification of the SIFT fault tolerant flight control system

NASA Technical Reports Server (NTRS)

Melliar-Smith, P. M.; Schwartz, R. L.

1981-01-01

The specification and mechanical verification of the Software Implemented Fault Tolerance (SIFT) flight control system is described. The methodology employed in the verification effort is discussed, and a description of the hierarchical models of the SIFT system is given. To meet the objective of NASA for the reliability of safety critical flight control systems, the SIFT computer must achieve a reliability well beyond the levels at which reliability can be actually measured. The methodology employed to demonstrate rigorously that the SIFT computer meets as reliability requirements is described. The hierarchy of design specifications from very abstract descriptions of system function down to the actual implementation is explained. The most abstract design specifications can be used to verify that the system functions correctly and with the desired reliability since almost all details of the realization were abstracted out. A succession of lower level models refine these specifications to the level of the actual implementation, and can be used to demonstrate that the implementation has the properties claimed of the abstract design specifications.
On-board fault management for autonomous spacecraft

NASA Technical Reports Server (NTRS)

Fesq, Lorraine M.; Stephan, Amy; Doyle, Susan C.; Martin, Eric; Sellers, Suzanne

1991-01-01

The dynamic nature of the Cargo Transfer Vehicle's (CTV) mission and the high level of autonomy required mandate a complete fault management system capable of operating under uncertain conditions. Such a fault management system must take into account the current mission phase and the environment (including the target vehicle), as well as the CTV's state of health. This level of capability is beyond the scope of current on-board fault management systems. This presentation will discuss work in progress at TRW to apply artificial intelligence to the problem of on-board fault management. The goal of this work is to develop fault management systems. This presentation will discuss work in progress at TRW to apply artificial intelligence to the problem of on-board fault management. The goal of this work is to develop fault management systems that can meet the needs of spacecraft that have long-range autonomy requirements. We have implemented a model-based approach to fault detection and isolation that does not require explicit characterization of failures prior to launch. It is thus able to detect failures that were not considered in the failure and effects analysis. We have applied this technique to several different subsystems and tested our approach against both simulations and an electrical power system hardware testbed. We present findings from simulation and hardware tests which demonstrate the ability of our model-based system to detect and isolate failures, and describe our work in porting the Ada version of this system to a flight-qualified processor. We also discuss current research aimed at expanding our system to monitor the entire spacecraft.
A hierarchical approach to reliability modeling of fault-tolerant systems. M.S. Thesis

NASA Technical Reports Server (NTRS)

Gossman, W. E.

1986-01-01

A methodology for performing fault tolerant system reliability analysis is presented. The method decomposes a system into its subsystems, evaluates vent rates derived from the subsystem's conditional state probability vector and incorporates those results into a hierarchical Markov model of the system. This is done in a manner that addresses failure sequence dependence associated with the system's redundancy management strategy. The method is derived for application to a specific system definition. Results are presented that compare the hierarchical model's unreliability prediction to that of a more complicated tandard Markov model of the system. The results for the example given indicate that the hierarchical method predicts system unreliability to a desirable level of accuracy while achieving significant computational savings relative to component level Markov model of the system.
An Autonomous Self-Aware and Adaptive Fault Tolerant Routing Technique for Wireless Sensor Networks.

PubMed

Abba, Sani; Lee, Jeong-A

2015-08-18

We propose an autonomous self-aware and adaptive fault-tolerant routing technique (ASAART) for wireless sensor networks. We address the limitations of self-healing routing (SHR) and self-selective routing (SSR) techniques for routing sensor data. We also examine the integration of autonomic self-aware and adaptive fault detection and resiliency techniques for route formation and route repair to provide resilience to errors and failures. We achieved this by using a combined continuous and slotted prioritized transmission back-off delay to obtain local and global network state information, as well as multiple random functions for attaining faster routing convergence and reliable route repair despite transient and permanent node failure rates and efficient adaptation to instantaneous network topology changes. The results of simulations based on a comparison of the ASAART with the SHR and SSR protocols for five different simulated scenarios in the presence of transient and permanent node failure rates exhibit a greater resiliency to errors and failure and better routing performance in terms of the number of successfully delivered network packets, end-to-end delay, delivered MAC layer packets, packet error rate, as well as efficient energy conservation in a highly congested, faulty, and scalable sensor network.
Extreme temperature robust optical sensor designs and fault-tolerant signal processing

DOEpatents

Riza, Nabeel Agha [Oviedo, FL; Perez, Frank [Tujunga, CA

2012-01-17

Silicon Carbide (SiC) probe designs for extreme temperature and pressure sensing uses a single crystal SiC optical chip encased in a sintered SiC material probe. The SiC chip may be protected for high temperature only use or exposed for both temperature and pressure sensing. Hybrid signal processing techniques allow fault-tolerant extreme temperature sensing. Wavelength peak-to-peak (or null-to-null) collective spectrum spread measurement to detect wavelength peak/null shift measurement forms a coarse-fine temperature measurement using broadband spectrum monitoring. The SiC probe frontend acts as a stable emissivity Black-body radiator and monitoring the shift in radiation spectrum enables a pyrometer. This application combines all-SiC pyrometry with thick SiC etalon laser interferometry within a free-spectral range to form a coarse-fine temperature measurement sensor. RF notch filtering techniques improve the sensitivity of the temperature measurement where fine spectral shift or spectrum measurements are needed to deduce temperature.
Validation Methods Research for Fault-Tolerant Avionics and Control Systems Sub-Working Group Meeting. CARE 3 peer review

NASA Technical Reports Server (NTRS)

Trivedi, K. S. (Editor); Clary, J. B. (Editor)

1980-01-01

A computer aided reliability estimation procedure (CARE 3), developed to model the behavior of ultrareliable systems required by flight-critical avionics and control systems, is evaluated. The mathematical models, numerical method, and fault-tolerant architecture modeling requirements are examined, and the testing and characterization procedures are discussed. Recommendations aimed at enhancing CARE 3 are presented; in particular, the need for a better exposition of the method and the user interface is emphasized.
ECFS: A decentralized, distributed and fault-tolerant FUSE filesystem for the LHCb online farm

NASA Astrophysics Data System (ADS)

Rybczynski, Tomasz; Bonaccorsi, Enrico; Neufeld, Niko

2014-06-01

The LHCb experiment records millions of proton collisions every second, but only a fraction of them are useful for LHCb physics. In order to filter out the "bad events" a large farm of x86-servers (~2000 nodes) has been put in place. These servers boot from and run from NFS, however they use their local disk to temporarily store data, which cannot be processed in real-time ("data-deferring"). These events are subsequently processed, when there are no live-data coming in. The effective CPU power is thus greatly increased. This gain in CPU power depends critically on the availability of the local disks. For cost and power-reasons, mirroring (RAID-1) is not used, leading to a lot of operational headache with failing disks and disk-errors or server failures induced by faulty disks. To mitigate these problems and increase the reliability of the LHCb farm, while at same time keeping cost and power-consumption low, an extensive research and study of existing highly available and distributed file systems has been done. While many distributed file systems are providing reliability by "file replication", none of the evaluated ones supports erasure algorithms. A decentralised, distributed and fault-tolerant "write once read many" file system has been designed and implemented as a proof of concept providing fault tolerance without using expensive - in terms of disk space - file replication techniques and providing a unique namespace as a main goals. This paper describes the design and the implementation of the Erasure Codes File System (ECFS) and presents the specialised FUSE interface for Linux. Depending on the encoding algorithm ECFS will use a certain number of target directories as a backend to store the segments that compose the encoded data. When target directories are mounted via nfs/autofs - ECFS will act as a file-system over network/block-level raid over multiple servers.
Fault Tolerant Hardware/Software Architecture for Flight Critical Function

DTIC Science & Technology

1985-09-01

Applications Studies Programme. The results of AGARD work are reported to the member nations and the NATO Authorities through the AGARD series of...systems, and is being advocated as a defense against design deficiencies which can plague software. - -- -- z--mm-L ___ K A critical application area for...day of the lecture series concludes with part I of a paper on the ;use of the Ada programming language In flight critical applications . Ada has been
Design of Test Articles and Monitoring System for the Characterization of HIRF Effects on a Fault-Tolerant Computer Communication System

NASA Technical Reports Server (NTRS)

Torres-Pomales, Wilfredo; Malekpour, Mahyar R.; Miner, Paul S.; Koppen, Sandra V.

2008-01-01

This report describes the design of the test articles and monitoring systems developed to characterize the response of a fault-tolerant computer communication system when stressed beyond the theoretical limits for guaranteed correct performance. A high-intensity radiated electromagnetic field (HIRF) environment was selected as the means of injecting faults, as such environments are known to have the potential to cause arbitrary and coincident common-mode fault manifestations that can overwhelm redundancy management mechanisms. The monitors generate stimuli for the systems-under-test (SUTs) and collect data in real-time on the internal state and the response at the external interfaces. A real-time health assessment capability was developed to support the automation of the test. A detailed description of the nature and structure of the collected data is included. The goal of the report is to provide insight into the design and operation of these systems, and to serve as a reference document for use in post-test analyses.
Fault-tolerant optimised tracking control for unknown discrete-time linear systems using a combined reinforcement learning and residual compensation methodology

NASA Astrophysics Data System (ADS)

Han, Ke-Zhen; Feng, Jian; Cui, Xiaohong

2017-10-01

This paper considers the fault-tolerant optimised tracking control (FTOTC) problem for unknown discrete-time linear system. A research scheme is proposed on the basis of data-based parity space identification, reinforcement learning and residual compensation techniques. The main characteristic of this research scheme lies in the parity-space-identification-based simultaneous tracking control and residual compensation. The specific technical line consists of four main contents: apply subspace aided method to design observer-based residual generator; use reinforcement Q-learning approach to solve optimised tracking control policy; rely on robust H∞ theory to achieve noise attenuation; adopt fault estimation triggered by residual generator to perform fault compensation. To clarify the design and implementation procedures, an integrated algorithm is further constructed to link up these four functional units. The detailed analysis and proof are subsequently given to explain the guaranteed FTOTC performance of the proposed conclusions. Finally, a case simulation is provided to verify its effectiveness.
Experimental fault-tolerant universal quantum gates with solid-state spins under ambient conditions

PubMed Central

Rong, Xing; Geng, Jianpei; Shi, Fazhan; Liu, Ying; Xu, Kebiao; Ma, Wenchao; Kong, Fei; Jiang, Zhen; Wu, Yang; Du, Jiangfeng

2015-01-01

Quantum computation provides great speedup over its classical counterpart for certain problems. One of the key challenges for quantum computation is to realize precise control of the quantum system in the presence of noise. Control of the spin-qubits in solids with the accuracy required by fault-tolerant quantum computation under ambient conditions remains elusive. Here, we quantitatively characterize the source of noise during quantum gate operation and demonstrate strategies to suppress the effect of these. A universal set of logic gates in a nitrogen-vacancy centre in diamond are reported with an average single-qubit gate fidelity of 0.999952 and two-qubit gate fidelity of 0.992. These high control fidelities have been achieved at room temperature in naturally abundant 13C diamond via composite pulses and an optimized control method. PMID:26602456
Application of a Resource Theory for Magic States to Fault-Tolerant Quantum Computing.

PubMed

Howard, Mark; Campbell, Earl

2017-03-03

Motivated by their necessity for most fault-tolerant quantum computation schemes, we formulate a resource theory for magic states. First, we show that robustness of magic is a well-behaved magic monotone that operationally quantifies the classical simulation overhead for a Gottesman-Knill-type scheme using ancillary magic states. Our framework subsequently finds immediate application in the task of synthesizing non-Clifford gates using magic states. When magic states are interspersed with Clifford gates, Pauli measurements, and stabilizer ancillas-the most general synthesis scenario-then the class of synthesizable unitaries is hard to characterize. Our techniques can place nontrivial lower bounds on the number of magic states required for implementing a given target unitary. Guided by these results, we have found new and optimal examples of such synthesis.
Fault-tolerant reactor protection system

DOEpatents

Gaubatz, Donald C.

1997-01-01

A reactor protection system having four divisions, with quad redundant sensors for each scram parameter providing input to four independent microprocessor-based electronic chassis. Each electronic chassis acquires the scram parameter data from its own sensor, digitizes the information, and then transmits the sensor reading to the other three electronic chassis via optical fibers. To increase system availability and reduce false scrams, the reactor protection system employs two levels of voting on a need for reactor scram. The electronic chassis perform software divisional data processing, vote 2/3 with spare based upon information from all four sensors, and send the divisional scram signals to the hardware logic panel, which performs a 2/4 division vote on whether or not to initiate a reactor scram. Each chassis makes a divisional scram decision based on data from all sensors. Each division performs independently of the others (asynchronous operation). All communications between the divisions are asynchronous. Each chassis substitutes its own spare sensor reading in the 2/3 vote if a sensor reading from one of the other chassis is faulty or missing. Therefore the presence of at least two valid sensor readings in excess of a set point is required before terminating the output to the hardware logic of a scram inhibition signal even when one of the four sensors is faulty or when one of the divisions is out of service.
Depth optimal sorting networks resistant to k passive faults

DOE Office of Scientific and Technical Information (OSTI.GOV)

Piotrow, M.

In this paper, we study the problem of constructing a sorting network that is tolerant to faults and whose running time (i.e. depth) is as small as possible. We consider the scenario of worst-case comparator faults and follow the model of passive comparator failure proposed by Yao and Yao, in which a faulty comparator outputs directly its inputs without comparison. Our main result is the first construction of an N-input, k-fault-tolerant sorting network that is of an asymptotically optimal depth {theta}(log N+k). That improves over the recent result of Leighton and Ma, whose network is of depth O(log N +more » k log log N/log k). Actually, we present a fault-tolerant correction network that can be added after any N-input sorting network to correct its output in the presence of at most k faulty comparators. Since the depth of the network is O(log N + k) and the constants hidden behind the {open_quotes}O{close_quotes} notation are not big, the construction can be of practical use. Developing the techniques necessary to show the main result, we construct a fault-tolerant network for the insertion problem. As a by-product, we get an N-input, O(log N)-depth INSERT-network that is tolerant to random faults, thereby answering a question posed by Ma in his PhD thesis. The results are based on a new notion of constant delay comparator networks, that is, networks in which each register is used (compared) only in a period of time of a constant length. Copies of such networks can be put one after another with only a constant increase in depth per copy.« less
Advanced Information Processing System - Fault detection and error handling

NASA Technical Reports Server (NTRS)

Lala, J. H.

1985-01-01

The Advanced Information Processing System (AIPS) is designed to provide a fault tolerant and damage tolerant data processing architecture for a broad range of aerospace vehicles, including tactical and transport aircraft, and manned and autonomous spacecraft. A proof-of-concept (POC) system is now in the detailed design and fabrication phase. This paper gives an overview of a preliminary fault detection and error handling philosophy in AIPS.
Data-based fault-tolerant control of high-speed trains with traction/braking notch nonlinearities and actuator failures.

PubMed

Song, Qi; Song, Yong-Duan

2011-12-01

This paper investigates the position and velocity tracking control problem of high-speed trains with multiple vehicles connected through couplers. A dynamic model reflecting nonlinear and elastic impacts between adjacent vehicles as well as traction/braking nonlinearities and actuation faults is derived. Neuroadaptive fault-tolerant control algorithms are developed to account for various factors such as input nonlinearities, actuator failures, and uncertain impacts of in-train forces in the system simultaneously. The resultant control scheme is essentially independent of system model and is primarily data-driven because with the appropriate input-output data, the proposed control algorithms are capable of automatically generating the intermediate control parameters, neuro-weights, and the compensation signals, literally producing the traction/braking force based upon input and response data only--the whole process does not require precise information on system model or system parameter, nor human intervention. The effectiveness of the proposed approach is also confirmed through numerical simulations.
Active fault tolerant control based on interval type-2 fuzzy sliding mode controller and non linear adaptive observer for 3-DOF laboratory helicopter.

PubMed

Zeghlache, Samir; Benslimane, Tarak; Bouguerra, Abderrahmen

2017-11-01

In this paper, a robust controller for a three degree of freedom (3 DOF) helicopter control is proposed in presence of actuator and sensor faults. For this purpose, Interval type-2 fuzzy logic control approach (IT2FLC) and sliding mode control (SMC) technique are used to design a controller, named active fault tolerant interval type-2 Fuzzy Sliding mode controller (AFTIT2FSMC) based on non-linear adaptive observer to estimate and detect the system faults for each subsystem of the 3-DOF helicopter. The proposed control scheme allows avoiding difficult modeling, attenuating the chattering effect of the SMC, reducing the rules number of the fuzzy controller. Exponential stability of the closed loop is guaranteed by using the Lyapunov method. The simulation results show that the AFTIT2FSMC can greatly alleviate the chattering effect, providing good tracking performance, even in presence of actuator and sensor faults. Copyright © 2017 ISA. Published by Elsevier Ltd. All rights reserved.
An Autonomous Self-Aware and Adaptive Fault Tolerant Routing Technique for Wireless Sensor Networks

PubMed Central

Abba, Sani; Lee, Jeong-A

2015-01-01

We propose an autonomous self-aware and adaptive fault-tolerant routing technique (ASAART) for wireless sensor networks. We address the limitations of self-healing routing (SHR) and self-selective routing (SSR) techniques for routing sensor data. We also examine the integration of autonomic self-aware and adaptive fault detection and resiliency techniques for route formation and route repair to provide resilience to errors and failures. We achieved this by using a combined continuous and slotted prioritized transmission back-off delay to obtain local and global network state information, as well as multiple random functions for attaining faster routing convergence and reliable route repair despite transient and permanent node failure rates and efficient adaptation to instantaneous network topology changes. The results of simulations based on a comparison of the ASAART with the SHR and SSR protocols for five different simulated scenarios in the presence of transient and permanent node failure rates exhibit a greater resiliency to errors and failure and better routing performance in terms of the number of successfully delivered network packets, end-to-end delay, delivered MAC layer packets, packet error rate, as well as efficient energy conservation in a highly congested, faulty, and scalable sensor network. PMID:26295236
Multiple Fault Isolation in Redundant Systems

NASA Technical Reports Server (NTRS)

Pattipati, Krishna R.; Patterson-Hine, Ann; Iverson, David

1997-01-01

Fault diagnosis in large-scale systems that are products of modern technology present formidable challenges to manufacturers and users. This is due to large number of failure sources in such systems and the need to quickly isolate and rectify failures with minimal down time. In addition, for fault-tolerant systems and systems with infrequent opportunity for maintenance (e.g., Hubble telescope, space station), the assumption of at most a single fault in the system is unrealistic. In this project, we have developed novel block and sequential diagnostic strategies to isolate multiple faults in the shortest possible time without making the unrealistic single fault assumption.
Multiple Fault Isolation in Redundant Systems

NASA Technical Reports Server (NTRS)

Pattipati, Krishna R.

1997-01-01

Fault diagnosis in large-scale systems that are products of modem technology present formidable challenges to manufacturers and users. This is due to large number of failure sources in such systems and the need to quickly isolate and rectify failures with minimal down time. In addition, for fault-tolerant systems and systems with infrequent opportunity for maintenance (e.g., Hubble telescope, space station), the assumption of at most a single fault in the system is unrealistic. In this project, we have developed novel block and sequential diagnostic strategies to isolate multiple faults in the shortest possible time without making the unrealistic single fault assumption.

Some links on this page may take you to non-federal websites. Their policies may differ from this site.