Science.gov

Sample records for hardware fault tolerance

  1. Fault Tolerant Characteristics of Artificial Neural Network Electronic Hardware

    NASA Technical Reports Server (NTRS)

    Zee, Frank

    1995-01-01

    The fault tolerant characteristics of analog-VLSI artificial neural network (with 32 neurons and 532 synapses) chips are studied by exposing them to high energy electrons, high energy protons, and gamma ionizing radiations under biased and unbiased conditions. The biased chips became nonfunctional after receiving a cumulative dose of less than 20 krads, while the unbiased chips only started to show degradation with a cumulative dose of over 100 krads. As the total radiation dose increased, all the components demonstrated graceful degradation. The analog sigmoidal function of the neuron became steeper (increase in gain), current leakage from the synapses progressively shifted the sigmoidal curve, and the digital memory of the synapses and the memory addressing circuits began to gradually fail. From these radiation experiments, we can learn how to modify certain designs of the neural network electronic hardware without using radiation-hardening techniques to increase its reliability and fault tolerance.

  2. Hardware reconfiguration for fault-tolerant processor arrays

    SciTech Connect

    Chean, M.

    1989-01-01

    In large VLSI/WSI arrays, improved reliability and yield can be obtained through reconfiguration techniques. In fault tolerance design, redundancy is used to offset faults when they occur in the arrays. Since redundant components are themselves susceptible to faults, their number must be a minimum. This also implies that an efficient reconfiguration scheme is preferred, i.e., one that can use as many spare components as possible so that unnecessary waste of spares is reduced. In this thesis, hardware reconfiguration for fault-tolerant processor arrays is discussed. First, a taxonomy for reconfiguration techniques is introduced, and several schemes are surveyed and classified. This taxonomy can be used to introduce, explain, compare, study, and classify new reconfiguration schemes. Next, an extension to reconfiguration technique is presented. Two special cases of the scheme are simulated and their results compared and studied. Finally, a new approach to hardware reconfiguration, called FUSS (Full Use of Suitable Spares), is proposed for VLSI/WSI fault-tolerant processor arrays. FUSS uses an indicator vector, the surplus vector, to guide the replacement of faulty processors within an array. Analytical study of the general FUSS algorithm shows that a linear relationship between the array size and the area of interconnect is required for the reconfiguration to be 100% successful. In an instance of FUSS, called simple FUSS, reconfiguration is done by simply shifting up or down faulty processors along their corresponding columns according to the surplus vector's entries. The surplus vector is progressively updated after each column is reconfigured. The reconfiguration is successful when the surplus vector becomes the null vector. Simulations show that when the number of faulty processors is equal to that of spare processors, simple FUSS can achieve a probability of survival as high as 99%

  3. Fault Tolerant Hardware/Software Architecture for Flight Critical Function

    DTIC Science & Technology

    1985-09-01

    performance demands, are becoming complex and sophisticated. Demands for higher accuracy, improved reliability /survivabiiity, all-weather operation, and more...catastrophic). Flight critical architectures are more complex than fault-tolerant computers. In addition to airborne computers, overall reliability ... reliable data communications, and multi-computer operation using the Ada language. First Day The first paper presents a description of two recent GEC

  4. Hardware Acquisition for the Enhancement of a Fault Tolerance/Distributed Computing Laboratory.

    DTIC Science & Technology

    2014-09-26

    8217.).-..-e:j N/A N/A 4. TITLE (rn Subtitle) S. TYPE OF REPORT & PERIOD COVERED Hardware Acquisition for the Enhancement of aFMa 1 92 a 4 9 Fault Tolerance...Identify by block number) -. Fault tolerance, multiprocessor, distributed data processing, software support, testbed, computer-aided design. R0 A*TRACT r(c...m - evwao lf i ceeawuy and Identify by block number) A VAX 11/780 computer was obtained to provide a software development envi- ronment for the Fault

  5. Design and evaluation of a fault-tolerant multiprocessor using hardware recovery blocks

    NASA Technical Reports Server (NTRS)

    Lee, Y. H.; Shin, K. G.

    1982-01-01

    A fault-tolerant multiprocessor with a rollback recovery mechanism is discussed. The rollback mechanism is based on the hardware recovery block which is a hardware equivalent to the software recovery block. The hardware recovery block is constructed by consecutive state-save operations and several state-save units in every processor and memory module. When a fault is detected, the multiprocessor reconfigures itself to replace the faulty component and then the process originally assigned to the faulty component retreats to one of the previously saved states in order to resume fault-free execution. A mathematical model is proposed to calculate both the coverage of multi-step rollback recovery and the risk of restart. A performance evaluation in terms of task execution time is also presented.

  6. Study of a unified hardware and software fault-tolerant architecture

    NASA Technical Reports Server (NTRS)

    Lala, Jaynarayan; Alger, Linda; Friend, Steven; Greeley, Gregory; Sacco, Stephen; Adams, Stuart

    1989-01-01

    A unified architectural concept, called the Fault Tolerant Processor Attached Processor (FTP-AP), that can tolerate hardware as well as software faults is proposed for applications requiring ultrareliable computation capability. An emulation of the FTP-AP architecture, consisting of a breadboard Motorola 68010-based quadruply redundant Fault Tolerant Processor, four VAX 750s as attached processors, and four versions of a transport aircraft yaw damper control law, is used as a testbed in the AIRLAB to examine a number of critical issues. Solutions of several basic problems associated with N-Version software are proposed and implemented on the testbed. This includes a confidence voter to resolve coincident errors in N-Version software. A reliability model of N-Version software that is based upon the recent understanding of software failure mechanisms is also developed. The basic FTP-AP architectural concept appears suitable for hosting N-Version application software while at the same time tolerating hardware failures. Architectural enhancements for greater efficiency, software reliability modeling, and N-Version issues that merit further research are identified.

  7. The full-use-of-suitable-spares (FUSS) approach to hardware reconfiguration for fault-tolerant processor arrays

    SciTech Connect

    Chean, M. ); Fortes, J.A.B. . School of Electrical Engineering)

    1990-04-01

    A general approach to hardware reconfiguration is proposed for VLSI/WSI fault-tolerant processor arrays. The technique, called FUSS (full use of suitable spares), uses an indicator vector, the surplus vector, to guide the replacement of faulty processors within an array. Analytical study of the general FUSS algorithm shows that there is a linear relationship between the array size and the area of interconnect required for reconfiguration to be 100% successful.

  8. Computer hardware fault administration

    DOEpatents

    Archer, Charles J.; Megerian, Mark G.; Ratterman, Joseph D.; Smith, Brian E.

    2010-09-14

    Computer hardware fault administration carried out in a parallel computer, where the parallel computer includes a plurality of compute nodes. The compute nodes are coupled for data communications by at least two independent data communications networks, where each data communications network includes data communications links connected to the compute nodes. Typical embodiments carry out hardware fault administration by identifying a location of a defective link in the first data communications network of the parallel computer and routing communications data around the defective link through the second data communications network of the parallel computer.

  9. Relaxed fault-tolerant hardware implementation of neural networks in the presence of multiple transient errors.

    PubMed

    Mahdiani, Hamid Reza; Fakhraie, Sied Mehdi; Lucas, Caro

    2012-08-01

    Reliability should be identified as the most important challenge in future nano-scale very large scale integration (VLSI) implementation technologies for the development of complex integrated systems. Normally, fault tolerance (FT) in a conventional system is achieved by increasing its redundancy, which also implies higher implementation costs and lower performance that sometimes makes it even infeasible. In contrast to custom approaches, a new class of applications is categorized in this paper, which is inherently capable of absorbing some degrees of vulnerability and providing FT based on their natural properties. Neural networks are good indicators of imprecision-tolerant applications. We have also proposed a new class of FT techniques called relaxed fault-tolerant (RFT) techniques which are developed for VLSI implementation of imprecision-tolerant applications. The main advantage of RFT techniques with respect to traditional FT solutions is that they exploit inherent FT of different applications to reduce their implementation costs while improving their performance. To show the applicability as well as the efficiency of the RFT method, the experimental results for implementation of a face-recognition computationally intensive neural network and its corresponding RFT realization are presented in this paper. The results demonstrate promising higher performance of artificial neural network VLSI solutions for complex applications in faulty nano-scale implementation environments.

  10. An Efficient Hardware-Software Approach to Network Fault Tolerance with InfiniBand

    SciTech Connect

    Vishnu, Abhinav; Krishnan, Manoj Kumar; Panda, Dhabaleswar K.

    2009-09-01

    In the last decade or so, clusters have observed a tremendous rise in popularity due to excellent price to performance ratio. A variety of Interconnects have been proposed during this period, with InfiniBand leading the way due to its high performance and open standard. Increasing size of the InfiniBand clusters has reduced the mean time between failures of various components of these clusters tremendously. In this paper, we specifically focus on the network component failure and propose a hybrid hardware-software approach to handling network faults. The hybrid approach leverages the user-transparent network fault detection and recovery using Automatic Path Migration (APM), and the software approach is used in the wake of APM failure. Using Global Arrays as the programming model, we implement this approach with Aggregate Remote Memory Copy Interface (ARMCI), the runtime system of Global Arrays. We evaluate our approach using various benchmarks (siosi7, pentane, h2o7 and siosi3) with NWChem, a very popular {\\em ab initio} quantum chemistry application. Using the proposed approach, the applications run to completion without restart on emulated network faults and acceptable overhead for benchmarks executing for a longer period of time.

  11. SFT: Scalable Fault Tolerance

    SciTech Connect

    Petrini, Fabrizio; Nieplocha, Jarek; Tipparaju, Vinod

    2006-04-15

    In this paper we will present a new technology that we are currently developing within the SFT: Scalable Fault Tolerance FastOS project which seeks to implement fault tolerance at the operating system level. Major design goals include dynamic reallocation of resources to allow continuing execution in the presence of hardware failures, very high scalability, high efficiency (low overhead), and transparency—requiring no changes to user applications. Our technology is based on a global coordination mechanism, that enforces transparent recovery lines in the system, and TICK, a lightweight, incremental checkpointing software architecture implemented as a Linux kernel module. TICK is completely user-transparent and does not require any changes to user code or system libraries; it is highly responsive: an interrupt, such as a timer interrupt, can trigger a checkpoint in as little as 2.5μs; and it supports incremental and full checkpoints with minimal overhead—less than 6% with full checkpointing to disk performed as frequently as once per minute.

  12. Fault tolerant control of spacecraft

    NASA Astrophysics Data System (ADS)

    Godard

    Autonomous multiple spacecraft formation flying space missions demand the development of reliable control systems to ensure rapid, accurate, and effective response to various attitude and formation reconfiguration commands. Keeping in mind the complexities involved in the technology development to enable spacecraft formation flying, this thesis presents the development and validation of a fault tolerant control algorithm that augments the AOCS on-board a spacecraft to ensure that these challenging formation flying missions will fly successfully. Taking inspiration from the existing theory of nonlinear control, a fault-tolerant control system for the RyePicoSat missions is designed to cope with actuator faults whilst maintaining the desirable degree of overall stability and performance. Autonomous fault tolerant adaptive control scheme for spacecraft equipped with redundant actuators and robust control of spacecraft in underactuated configuration, represent the two central themes of this thesis. The developed algorithms are validated using a hardware-in-the-loop simulation. A reaction wheel testbed is used to validate the proposed fault tolerant attitude control scheme. A spacecraft formation flying experimental testbed is used to verify the performance of the proposed robust control scheme for underactuated spacecraft configurations. The proposed underactuated formation flying concept leads to more than 60% savings in fuel consumption when compared to a fully actuated spacecraft formation configuration. We also developed a novel attitude control methodology that requires only a single thruster to stabilize three axis attitude and angular velocity components of a spacecraft. Numerical simulations and hardware-in-the-loop experimental results along with rigorous analytical stability analysis shows that the proposed methodology will greatly enhance the reliability of the spacecraft, while allowing for potentially significant overall mission cost reduction.

  13. Fault tolerant magnetic bearings

    SciTech Connect

    Maslen, E.H.; Sortore, C.K.; Gillies, G.T.; Williams, R.D.; Fedigan, S.J.; Aimone, R.J.

    1999-07-01

    A fault tolerant magnetic bearing system was developed and demonstrated on a large flexible-rotor test rig. The bearing system comprises a high speed, fault tolerant digital controller, three high capacity radial magnetic bearings, one thrust bearing, conventional variable reluctance position sensors, and an array of commercial switching amplifiers. Controller fault tolerance is achieved through a very high speed voting mechanism which implements triple modular redundancy with a powered spare CPU, thereby permitting failure of up to three CPU modules without system failure. Amplifier/cabling/coil fault tolerance is achieved by using a separate power amplifier for each bearing coil and permitting amplifier reconfiguration by the controller upon detection of faults. This allows hot replacement of failed amplifiers without any system degradation and without providing any excess amplifier kVA capacity over the nominal system requirement. Implemented on a large (2440 mm in length) flexible rotor, the system shows excellent rejection of faults including the failure of three CPUs as well as failure of two adjacent amplifiers (or cabling) controlling an entire stator quadrant.

  14. Fault tolerant linear actuator

    DOEpatents

    Tesar, Delbert

    2004-09-14

    In varying embodiments, the fault tolerant linear actuator of the present invention is a new and improved linear actuator with fault tolerance and positional control that may incorporate velocity summing, force summing, or a combination of the two. In one embodiment, the invention offers a velocity summing arrangement with a differential gear between two prime movers driving a cage, which then drives a linear spindle screw transmission. Other embodiments feature two prime movers driving separate linear spindle screw transmissions, one internal and one external, in a totally concentric and compact integrated module.

  15. Study of fault-tolerant software technology

    NASA Technical Reports Server (NTRS)

    Slivinski, T.; Broglio, C.; Wild, C.; Goldberg, J.; Levitt, K.; Hitt, E.; Webb, J.

    1984-01-01

    Presented is an overview of the current state of the art of fault-tolerant software and an analysis of quantitative techniques and models developed to assess its impact. It examines research efforts as well as experience gained from commercial application of these techniques. The paper also addresses the computer architecture and design implications on hardware, operating systems and programming languages (including Ada) of using fault-tolerant software in real-time aerospace applications. It concludes that fault-tolerant software has progressed beyond the pure research state. The paper also finds that, although not perfectly matched, newer architectural and language capabilities provide many of the notations and functions needed to effectively and efficiently implement software fault-tolerance.

  16. Fault tolerant control laws

    NASA Technical Reports Server (NTRS)

    Ly, U. L.; Ho, J. K.

    1986-01-01

    A systematic procedure for the synthesis of fault tolerant control laws to actuator failure has been presented. Two design methods were used to synthesize fault tolerant controllers: the conventional LQ design method and a direct feedback controller design method SANDY. The latter method is used primarily to streamline the full-state Q feedback design into a practical implementable output feedback controller structure. To achieve robustness to control actuator failure, the redundant surfaces are properly balanced according to their control effectiveness. A simple gain schedule based on the landing gear up/down logic involving only three gains was developed to handle three design flight conditions: Mach .25 and Mach .60 at 5000 ft and Mach .90 at 20,000 ft. The fault tolerant control law developed in this study provides good stability augmentation and performance for the relaxed static stability aircraft. The augmented aircraft responses are found to be invariant to the presence of a failure. Furthermore, single-loop stability margins of +6 dB in gain and +30 deg in phase were achieved along with -40 dB/decade rolloff at high frequency.

  17. Software fault tolerance in computer operating systems

    NASA Technical Reports Server (NTRS)

    Iyer, Ravishankar K.; Lee, Inhwan

    1994-01-01

    This chapter provides data and analysis of the dependability and fault tolerance for three operating systems: the Tandem/GUARDIAN fault-tolerant system, the VAX/VMS distributed system, and the IBM/MVS system. Based on measurements from these systems, basic software error characteristics are investigated. Fault tolerance in operating systems resulting from the use of process pairs and recovery routines is evaluated. Two levels of models are developed to analyze error and recovery processes inside an operating system and interactions among multiple instances of an operating system running in a distributed environment. The measurements show that the use of process pairs in Tandem systems, which was originally intended for tolerating hardware faults, allows the system to tolerate about 70% of defects in system software that result in processor failures. The loose coupling between processors which results in the backup execution (the processor state and the sequence of events occurring) being different from the original execution is a major reason for the measured software fault tolerance. The IBM/MVS system fault tolerance almost doubles when recovery routines are provided, in comparison to the case in which no recovery routines are available. However, even when recovery routines are provided, there is almost a 50% chance of system failure when critical system jobs are involved.

  18. Abnormal fault-recovery characteristics of the fault-tolerant multiprocessor uncovered using a new fault-injection methodology

    NASA Technical Reports Server (NTRS)

    Padilla, Peter A.

    1991-01-01

    An investigation was made in AIRLAB of the fault handling performance of the Fault Tolerant MultiProcessor (FTMP). Fault handling errors detected during fault injection experiments were characterized. In these fault injection experiments, the FTMP disabled a working unit instead of the faulted unit once in every 500 faults, on the average. System design weaknesses allow active faults to exercise a part of the fault management software that handles Byzantine or lying faults. Byzantine faults behave such that the faulted unit points to a working unit as the source of errors. The design's problems involve: (1) the design and interface between the simplex error detection hardware and the error processing software, (2) the functional capabilities of the FTMP system bus, and (3) the communication requirements of a multiprocessor architecture. These weak areas in the FTMP's design increase the probability that, for any hardware fault, a good line replacement unit (LRU) is mistakenly disabled by the fault management software.

  19. Fault Tolerant Homopolar Magnetic Bearings

    NASA Technical Reports Server (NTRS)

    Li, Ming-Hsiu; Palazzolo, Alan; Kenny, Andrew; Provenza, Andrew; Beach, Raymond; Kascak, Albert

    2003-01-01

    Magnetic suspensions (MS) satisfy the long life and low loss conditions demanded by satellite and ISS based flywheels used for Energy Storage and Attitude Control (ACESE) service. This paper summarizes the development of a novel MS that improves reliability via fault tolerant operation. Specifically, flux coupling between poles of a homopolar magnetic bearing is shown to deliver desired forces even after termination of coil currents to a subset of failed poles . Linear, coordinate decoupled force-voltage relations are also maintained before and after failure by bias linearization. Current distribution matrices (CDM) which adjust the currents and fluxes following a pole set failure are determined for many faulted pole combinations. The CDM s and the system responses are obtained utilizing 1D magnetic circuit models with fringe and leakage factors derived from detailed, 3D, finite element field models. Reliability results are presented vs. detection/correction delay time and individual power amplifier reliability for 4, 6, and 7 pole configurations. Reliability is shown for two success criteria, i.e. (a) no catcher bearing contact following pole failures and (b) re-levitation off of the catcher bearings following pole failures. An advantage of the method presented over other redundant operation approaches is a significantly reduced requirement for backup hardware such as additional actuators or power amplifiers.

  20. Implementing fault-tolerant sensors

    NASA Technical Reports Server (NTRS)

    Marzullo, Keith

    1989-01-01

    One aspect of fault tolerance in process control programs is the ability to tolerate sensor failure. A methodology is presented for transforming a process control program that cannot tolerate sensor failures to one that can. Additionally, a hierarchy of failure models is identified.

  1. FTAPE: A fault injection tool to measure fault tolerance

    NASA Technical Reports Server (NTRS)

    Tsai, Timothy K.; Iyer, Ravishankar K.

    1994-01-01

    The paper introduces FTAPE (Fault Tolerance And Performance Evaluator), a tool that can be used to compare fault-tolerant computers. The tool combines system-wide fault injection with a controllable workload. A workload generator is used to create high stress conditions for the machine. Faults are injected based on this workload activity in order to ensure a high level of fault propagation. The errors/fault ratio and performance degradation are presented as measures of fault tolerance.

  2. FTAPE: A fault injection tool to measure fault tolerance

    NASA Technical Reports Server (NTRS)

    Tsai, Timothy K.; Iyer, Ravishankar K.

    1995-01-01

    The paper introduces FTAPE (Fault Tolerance And Performance Evaluator), a tool that can be used to compare fault-tolerant computers. The tool combines system-wide fault injection with a controllable workload. A workload generator is used to create high stress conditions for the machine. Faults are injected based on this workload activity in order to ensure a high level of fault propagation. The errors/fault ratio and performance degradation are presented as measures of fault tolerance.

  3. Fault-tolerant rotary actuator

    DOEpatents

    Tesar, Delbert

    2006-10-17

    A fault-tolerant actuator module, in a single containment shell, containing two actuator subsystems that are either asymmetrically or symmetrically laid out is provided. Fault tolerance in the actuators of the present invention is achieved by the employment of dual sets of equal resources. Dual resources are integrated into single modules, with each having the external appearance and functionality of a single set of resources.

  4. Locating hardware faults in a parallel computer

    DOEpatents

    Archer, Charles J.; Megerian, Mark G.; Ratterman, Joseph D.; Smith, Brian E.

    2010-04-13

    Locating hardware faults in a parallel computer, including defining within a tree network of the parallel computer two or more sets of non-overlapping test levels of compute nodes of the network that together include all the data communications links of the network, each non-overlapping test level comprising two or more adjacent tiers of the tree; defining test cells within each non-overlapping test level, each test cell comprising a subtree of the tree including a subtree root compute node and all descendant compute nodes of the subtree root compute node within a non-overlapping test level; performing, separately on each set of non-overlapping test levels, an uplink test on all test cells in a set of non-overlapping test levels; and performing, separately from the uplink tests and separately on each set of non-overlapping test levels, a downlink test on all test cells in a set of non-overlapping test levels.

  5. Fault-tolerant PACS server

    NASA Astrophysics Data System (ADS)

    Cao, Fei; Liu, Brent J.; Huang, H. K.; Zhou, Michael Z.; Zhang, Jianguo; Zhang, X. C.; Mogel, Greg T.

    2002-05-01

    Failure of a PACS archive server could cripple an entire PACS operation. Last year we demonstrated that it was possible to design a fault-tolerant (FT) server with 99.999% uptime. The FT design was based on a triple modular redundancy with a simple majority vote to automatically detect and mask a faulty module. The purpose of this presentation is to report on its continuous developments in integrating with external mass storage devices, and to delineate laboratory failover experiments. An FT PACS Simulator with generic PACS software has been used in the experiment. To simulate a PACS clinical operation, image examinations are transmitted continuously from the modality simulator to the DICOM gateway and then to the FT PACS server and workstations. The hardware failures in network, FT server module, disk, RAID, and DLT are manually induced to observe the failover recovery of the FT PACS to resume its normal data flow. We then test and evaluate the FT PACS server in its reliability, functionality, and performance.

  6. Fault detection and fault tolerance in robotics

    NASA Technical Reports Server (NTRS)

    Visinsky, Monica; Walker, Ian D.; Cavallaro, Joseph R.

    1992-01-01

    Robots are used in inaccessible or hazardous environments in order to alleviate some of the time, cost and risk involved in preparing men to endure these conditions. In order to perform their expected tasks, the robots are often quite complex, thus increasing their potential for failures. If men must be sent into these environments to repair each component failure in the robot, the advantages of using the robot are quickly lost. Fault tolerant robots are needed which can effectively cope with failures and continue their tasks until repairs can be realistically scheduled. Before fault tolerant capabilities can be created, methods of detecting and pinpointing failures must be perfected. This paper develops a basic fault tree analysis of a robot in order to obtain a better understanding of where failures can occur and how they contribute to other failures in the robot. The resulting failure flow chart can also be used to analyze the resiliency of the robot in the presence of specific faults. By simulating robot failures and fault detection schemes, the problems involved in detecting failures for robots are explored in more depth.

  7. Software Fault Tolerance: A Tutorial

    NASA Technical Reports Server (NTRS)

    Torres-Pomales, Wilfredo

    2000-01-01

    Because of our present inability to produce error-free software, software fault tolerance is and will continue to be an important consideration in software systems. The root cause of software design errors is the complexity of the systems. Compounding the problems in building correct software is the difficulty in assessing the correctness of software for highly complex systems. After a brief overview of the software development processes, we note how hard-to-detect design faults are likely to be introduced during development and how software faults tend to be state-dependent and activated by particular input sequences. Although component reliability is an important quality measure for system level analysis, software reliability is hard to characterize and the use of post-verification reliability estimates remains a controversial issue. For some applications software safety is more important than reliability, and fault tolerance techniques used in those applications are aimed at preventing catastrophes. Single version software fault tolerance techniques discussed include system structuring and closure, atomic actions, inline fault detection, exception handling, and others. Multiversion techniques are based on the assumption that software built differently should fail differently and thus, if one of the redundant versions fails, it is expected that at least one of the other versions will provide an acceptable output. Recovery blocks, N-version programming, and other multiversion techniques are reviewed.

  8. Fault-Tolerant Heat Exchanger

    NASA Technical Reports Server (NTRS)

    Izenson, Michael G.; Crowley, Christopher J.

    2005-01-01

    A compact, lightweight heat exchanger has been designed to be fault-tolerant in the sense that a single-point leak would not cause mixing of heat-transfer fluids. This particular heat exchanger is intended to be part of the temperature-regulation system for habitable modules of the International Space Station and to function with water and ammonia as the heat-transfer fluids. The basic fault-tolerant design is adaptable to other heat-transfer fluids and heat exchangers for applications in which mixing of heat-transfer fluids would pose toxic, explosive, or other hazards: Examples could include fuel/air heat exchangers for thermal management on aircraft, process heat exchangers in the cryogenic industry, and heat exchangers used in chemical processing. The reason this heat exchanger can tolerate a single-point leak is that the heat-transfer fluids are everywhere separated by a vented volume and at least two seals. The combination of fault tolerance, compactness, and light weight is implemented in a unique heat-exchanger core configuration: Each fluid passage is entirely surrounded by a vented region bridged by solid structures through which heat is conducted between the fluids. Precise, proprietary fabrication techniques make it possible to manufacture the vented regions and heat-conducting structures with very small dimensions to obtain a very large coefficient of heat transfer between the two fluids. A large heat-transfer coefficient favors compact design by making it possible to use a relatively small core for a given heat-transfer rate. Calculations and experiments have shown that in most respects, the fault-tolerant heat exchanger can be expected to equal or exceed the performance of the non-fault-tolerant heat exchanger that it is intended to supplant (see table). The only significant disadvantages are a slight weight penalty and a small decrease in the mass-specific heat transfer.

  9. Measuring fault tolerance with the FTAPE fault injection tool

    NASA Technical Reports Server (NTRS)

    Tsai, Timothy K.; Iyer, Ravishankar K.

    1995-01-01

    This paper describes FTAPE (Fault Tolerance And Performance Evaluator), a tool that can be used to compare fault-tolerant computers. The major parts of the tool include a system-wide fault-injector, a workload generator, and a workload activity measurement tool. The workload creates high stress conditions on the machine. Using stress-based injection, the fault injector is able to utilize knowledge of the workload activity to ensure a high level of fault propagation. The errors/fault ratio, performance degradation, and number of system crashes are presented as measures of fault tolerance.

  10. Fault Tolerant Frequent Pattern Mining

    SciTech Connect

    Shohdy, Sameh; Vishnu, Abhinav; Agrawal, Gagan

    2016-12-19

    FP-Growth algorithm is a Frequent Pattern Mining (FPM) algorithm that has been extensively used to study correlations and patterns in large scale datasets. While several researchers have designed distributed memory FP-Growth algorithms, it is pivotal to consider fault tolerant FP-Growth, which can address the increasing fault rates in large scale systems. In this work, we propose a novel parallel, algorithm-level fault-tolerant FP-Growth algorithm. We leverage algorithmic properties and MPI advanced features to guarantee an O(1) space complexity, achieved by using the dataset memory space itself for checkpointing. We also propose a recovery algorithm that can use in-memory and disk-based checkpointing, though in many cases the recovery can be completed without any disk access, and incurring no memory overhead for checkpointing. We evaluate our FT algorithm on a large scale InfiniBand cluster with several large datasets using up to 2K cores. Our evaluation demonstrates excellent efficiency for checkpointing and recovery in comparison to the disk-based approach. We have also observed 20x average speed-up in comparison to Spark, establishing that a well designed algorithm can easily outperform a solution based on a general fault-tolerant programming model.

  11. A fault-tolerant intelligent robotic control system

    NASA Technical Reports Server (NTRS)

    Marzwell, Neville I.; Tso, Kam Sing

    1993-01-01

    This paper describes the concept, design, and features of a fault-tolerant intelligent robotic control system being developed for space and commercial applications that require high dependability. The comprehensive strategy integrates system level hardware/software fault tolerance with task level handling of uncertainties and unexpected events for robotic control. The underlying architecture for system level fault tolerance is the distributed recovery block which protects against application software, system software, hardware, and network failures. Task level fault tolerance provisions are implemented in a knowledge-based system which utilizes advanced automation techniques such as rule-based and model-based reasoning to monitor, diagnose, and recover from unexpected events. The two level design provides tolerance of two or more faults occurring serially at any level of command, control, sensing, or actuation. The potential benefits of such a fault tolerant robotic control system include: (1) a minimized potential for damage to humans, the work site, and the robot itself; (2) continuous operation with a minimum of uncommanded motion in the presence of failures; and (3) more reliable autonomous operation providing increased efficiency in the execution of robotic tasks and decreased demand on human operators for controlling and monitoring the robotic servicing routines.

  12. Fault tree models for fault tolerant hypercube multiprocessors

    NASA Technical Reports Server (NTRS)

    Boyd, Mark A.; Tuazon, Jezus O.

    1991-01-01

    Three candidate fault tolerant hypercube architectures are modeled, their reliability analyses are compared, and the resulting implications of these methods of incorporating fault tolerance into hypercube multiprocessors are discussed. In the course of performing the reliability analyses, the use of HARP and fault trees in modeling sequence dependent system behaviors is demonstrated.

  13. Fault Tolerance of Neural Networks

    DTIC Science & Technology

    1994-07-01

    Systematic Ap - proach, Proc. Government Microcircuit Application Conf. (GOMAC), San Diego, Nov. 1986. [10] D.E.Goldberg, Genetic Algorithms in Search...s l m n ttempt to develop fault tolerant neural networks. The lows. Given a well-trained network, we first eliminate temp todevlopfaut tlernt eurl ...both ap - proaches, and this resulted in very slight improve- ments over the addition/deletion procedure. 103 Fisher’s Iris data in average case Fisher’s

  14. Advanced information processing system: The Army fault tolerant architecture conceptual study. Volume 2: Army fault tolerant architecture design and analysis

    NASA Technical Reports Server (NTRS)

    Harper, R. E.; Alger, L. S.; Babikyan, C. A.; Butler, B. P.; Friend, S. A.; Ganska, R. J.; Lala, J. H.; Masotto, T. K.; Meyer, A. J.; Morton, D. P.

    1992-01-01

    Described here is the Army Fault Tolerant Architecture (AFTA) hardware architecture and components and the operating system. The architectural and operational theory of the AFTA Fault Tolerant Data Bus is discussed. The test and maintenance strategy developed for use in fielded AFTA installations is presented. An approach to be used in reducing the probability of AFTA failure due to common mode faults is described. Analytical models for AFTA performance, reliability, availability, life cycle cost, weight, power, and volume are developed. An approach is presented for using VHSIC Hardware Description Language (VHDL) to describe and design AFTA's developmental hardware. A plan is described for verifying and validating key AFTA concepts during the Dem/Val phase. Analytical models and partial mission requirements are used to generate AFTA configurations for the TF/TA/NOE and Ground Vehicle missions.

  15. A Log-Scaling Fault Tolerant Agreement Algorithm for a Fault Tolerant MPI

    SciTech Connect

    Hursey, Joshua J; Naughton, III, Thomas J; Vallee, Geoffroy R; Graham, Richard L

    2011-01-01

    The lack of fault tolerance is becoming a limiting factor for application scalability in HPC systems. The MPI does not provide standardized fault tolerance interfaces and semantics. The MPI Forum's Fault Tolerance Working Group is proposing a collective fault tolerant agreement algorithm for the next MPI standard. Such algorithms play a central role in many fault tolerant applications. This paper combines a log-scaling two-phase commit agreement algorithm with a reduction operation to provide the necessary functionality for the new collective without any additional messages. Error handling mechanisms are described that preserve the fault tolerance properties while maintaining overall scalability.

  16. Design study of Software-Implemented Fault-Tolerance (SIFT) computer

    NASA Technical Reports Server (NTRS)

    Wensley, J. H.; Goldberg, J.; Green, M. W.; Kutz, W. H.; Levitt, K. N.; Mills, M. E.; Shostak, R. E.; Whiting-Okeefe, P. M.; Zeidler, H. M.

    1982-01-01

    Software-implemented fault tolerant (SIFT) computer design for commercial aviation is reported. A SIFT design concept is addressed. Alternate strategies for physical implementation are considered. Hardware and software design correctness is addressed. System modeling and effectiveness evaluation are considered from a fault-tolerant point of view.

  17. Development and evaluation of a fault-tolerant multiprocessor (FTMP) computer. Volume 1: FTMP principles of operation

    NASA Technical Reports Server (NTRS)

    Smith, T. B., Jr.; Lala, J. H.

    1983-01-01

    The basic organization of the fault tolerant multiprocessor, (FTMP) is that of a general purpose homogeneous multiprocessor. Three processors operate on a shared system (memory and I/O) bus. Replication and tight synchronization of all elements and hardware voting is employed to detect and correct any single fault. Reconfiguration is then employed to repair a fault. Multiple faults may be tolerated as a sequence of single faults with repair between fault occurrences.

  18. Parametric Modeling and Fault Tolerant Control

    NASA Technical Reports Server (NTRS)

    Wu, N. Eva; Ju, Jianhong

    2000-01-01

    Fault tolerant control is considered for a nonlinear aircraft model expressed as a linear parameter-varying system. By proper parameterization of foreseeable faults, the linear parameter-varying system can include fault effects as additional varying parameters. A recently developed technique in fault effect parameter estimation allows us to assume that estimates of the fault effect parameters are available on-line. Reconfigurability is calculated for this model with respect to the loss of control effectiveness to assess the potentiality of the model to tolerate such losses prior to control design. The control design is carried out by applying a polytopic method to the aircraft model. An error bound on fault effect parameter estimation is provided, within which the Lyapunov stability of the closed-loop system is robust. Our simulation results show that as long as the fault parameter estimates are sufficiently accurate, the polytopic controller can provide satisfactory fault-tolerance.

  19. On fault-tolerant structure, distributed fault-diagnosis, reconfiguration, and recovery of the array processors

    SciTech Connect

    Hosseini, S.H.

    1989-07-01

    The increasing need for the design of high-performance computers has led to the design of special purpose computers such as array processors. This paper studies the design of fault-tolerant array processors. First, it is shown how hardware redundancy can be employed in the existing structures in order to make them capable of withstanding the failure of some of the array links and processors. Then distributed fault-tolerance schemes are introduced for the diagnosis of the faulty elements, reconfiguration, and recovery of the array. Fault tolerance is maintained by the cooperation of processors in a decentralized form of control without the participation of any type of hardcore or fault-free central controller such as a host computer.

  20. Sequoia: A fault-tolerant tightly coupled multiprocessor for transaction processing

    SciTech Connect

    Bernstein, P.A.

    1988-02-01

    The Sequoia computer is a tightly coupled multiprocessor, and thus attains the performance advantages of this style of architecture. It avoids most of the fault-tolerance disadvantages of tight coupling by using a new fault-tolerance design. The Sequoia architecture is similar to other multimicroprocessor architectures, such as those of Encore and Sequent, in that it gives dozens of microprocessors shared access to a large main memory. It resembles the Stratus architecture in its extensive use of hardware fault-detection techniques. It resembles Stratus and Auragen in its ability to quickly recover all processes after a single point failure, transparently to the user. However, Sequoia is unique in its combination of a large-scale tightly coupled architecture with a hardware approach to fault tolerance. This article gives an overview of how the hardware architecture and operating systems (OS) work together to provide a high degree of fault tolerance with good system performance.

  1. Fault-tolerant software - Experiment with the sift operating system. [Software Implemented Fault Tolerance computer

    NASA Technical Reports Server (NTRS)

    Brunelle, J. E.; Eckhardt, D. E., Jr.

    1985-01-01

    Results are presented of an experiment conducted in the NASA Avionics Integrated Research Laboratory (AIRLAB) to investigate the implementation of fault-tolerant software techniques on fault-tolerant computer architectures, in particular the Software Implemented Fault Tolerance (SIFT) computer. The N-version programming and recovery block techniques were implemented on a portion of the SIFT operating system. The results indicate that, to effectively implement fault-tolerant software design techniques, system requirements will be impacted and suggest that retrofitting fault-tolerant software on existing designs will be inefficient and may require system modification.

  2. CIFTS : A coordinated infrastructure for fault-tolerant systems.

    SciTech Connect

    Gupta, R.; Beckman, P.; Park, B. H.; Lusk, E.; Hargrove, P.; Geist, A.; Panda, D. K.; Lumsdaine, A.; Dongarra, J.; ORNL; LBNL; Ohio State Univ.; Indiana Univ.; Univ. of Tennessee

    2009-01-01

    In the next few years SciDAC applications will utilize petascale systems with tens to hundreds of thousands of processors, hundreds of I/O nodes, and thousands of disks. This leap of two orders of magnitude in scale from today's typical systems is causing a critical gap in fault management of these systems. The fault management issues for these emerging systems are well beyond the scope of today's common infrastructure and practice. Currently, systems software components for large-scale machines remain largely independent in their fault awareness and notification strategies. Faults can arise not just from the hardware but also from the OS, middleware, libraries, and application levels. Petascale applications that are evolving to utilize these platforms face many new challenges. With the CIFTS initiative, we aim to provide a coordinated infrastructure that will enable Fault Tolerant Systems to adapt to faults occuring in the operating environment in a holistic manner. Our approach will be to design a reference implementation of a fault awareness and notification backplane to provide common uniform event handling and notification mechanisms for fault-aware libraries and middleware; create an interface specification that allows libraries, run- time systems, and applications to connect to and use the fault-tolerance backplane; and extend key libraries and applications to validate the interface choices and to form the critical mass necessary for adoption in the community.

  3. Robot Position Sensor Fault Tolerance

    NASA Technical Reports Server (NTRS)

    Aldridge, Hal A.

    1997-01-01

    Robot systems in critical applications, such as those in space and nuclear environments, must be able to operate during component failure to complete important tasks. One failure mode that has received little attention is the failure of joint position sensors. Current fault tolerant designs require the addition of directly redundant position sensors which can affect joint design. A new method is proposed that utilizes analytical redundancy to allow for continued operation during joint position sensor failure. Joint torque sensors are used with a virtual passive torque controller to make the robot joint stable without position feedback and improve position tracking performance in the presence of unknown link dynamics and end-effector loading. Two Cartesian accelerometer based methods are proposed to determine the position of the joint. The joint specific position determination method utilizes two triaxial accelerometers attached to the link driven by the joint with the failed position sensor. The joint specific method is not computationally complex and the position error is bounded. The system wide position determination method utilizes accelerometers distributed on different robot links and the end-effector to determine the position of sets of multiple joints. The system wide method requires fewer accelerometers than the joint specific method to make all joint position sensors fault tolerant but is more computationally complex and has lower convergence properties. Experiments were conducted on a laboratory manipulator. Both position determination methods were shown to track the actual position satisfactorily. A controller using the position determination methods and the virtual passive torque controller was able to servo the joints to a desired position during position sensor failure.

  4. An Aspect-Oriented Approach to Assessing Fault Tolerance

    DTIC Science & Technology

    2014-10-01

    this paper, we present a fault tolerance assessment framework designed for distributed systems that provides automated injection of faults without... fault tolerance techniques work. Ensuring fault tolerance in military communication systems is particularly important due to the inevitability of...changes to client or server code and automated assessment of whether the injected faults are tolerated. The framework applies aspect-oriented

  5. Adding Fault Tolerance to NPB Benchmarks Using ULFM

    SciTech Connect

    Parchman, Zachary W; Vallee, Geoffroy R; Naughton III, Thomas J; Engelmann, Christian; Bernholdt, David E; Scott, Stephen L

    2016-01-01

    In the world of high-performance computing, fault tolerance and application resilience are becoming some of the primary concerns because of increasing hardware failures and memory corruptions. While the research community has been investigating various options, from system-level solutions to application-level solutions, standards such as the Message Passing Interface (MPI) are also starting to include such capabilities. The current proposal for MPI fault tolerant is centered around the User-Level Failure Mitigation (ULFM) concept, which provides means for fault detection and recovery of the MPI layer. This approach does not address application-level recovery, which is currently left to application developers. In this work, we present a mod- ification of some of the benchmarks of the NAS parallel benchmark (NPB) to include support of the ULFM capabilities as well as application-level strategies and mechanisms for application-level failure recovery. As such, we present: (i) an application-level library to checkpoint and restore data, (ii) extensions of NPB benchmarks for fault tolerance based on different strategies, (iii) a fault injection tool, and (iv) some preliminary results that show the impact of such fault tolerant strategies on the application execution.

  6. Transparent Ada rendezvous in a fault tolerant distributed system

    NASA Technical Reports Server (NTRS)

    Racine, Roger

    1986-01-01

    There are many problems associated with distributing an Ada program over a loosely coupled communication network. Some of these problems involve the various aspects of the distributed rendezvous. The problems addressed involve supporting the delay statement in a selective call and supporting the else clause in a selective call. Most of these difficulties are compounded by the need for an efficient communication system. The difficulties are compounded even more by considering the possibility of hardware faults occurring while the program is running. With a hardware fault tolerant computer system, it is possible to design a distribution scheme and communication software which is efficient and allows Ada semantics to be preserved. An Ada design for the communications software of one such system will be presented, including a description of the services provided in the seven layers of an International Standards Organization (ISO) Open System Interconnect (OSI) model communications system. The system capabilities (hardware and software) that allow this communication system will also be described.

  7. Fault-tolerant dynamic task graph scheduling

    SciTech Connect

    Kurt, Mehmet C.; Krishnamoorthy, Sriram; Agrawal, Kunal; Agrawal, Gagan

    2014-11-16

    In this paper, we present an approach to fault tolerant execution of dynamic task graphs scheduled using work stealing. In particular, we focus on selective and localized recovery of tasks in the presence of soft faults. We elicit from the user the basic task graph structure in terms of successor and predecessor relationships. The work stealing-based algorithm to schedule such a task graph is augmented to enable recovery when the data and meta-data associated with a task get corrupted. We use this redundancy, and the knowledge of the task graph structure, to selectively recover from faults with low space and time overheads. We show that the fault tolerant design retains the essential properties of the underlying work stealing-based task scheduling algorithm, and that the fault tolerant execution is asymptotically optimal when task re-execution is taken into account. Experimental evaluation demonstrates the low cost of recovery under various fault scenarios.

  8. Tuning of fault tolerant control design parameters.

    PubMed

    DeLima, Pedro G; Yen, Gary G

    2008-01-01

    This paper presents two major contributions in the field of fault tolerant control. First, it gathers points of concern typical to most fault tolerant control applications and translates the chosen performance metrics into a set of six practical design specifications. Second, it proposes initialization and tuning procedures through which a particular fault tolerant control architecture not only can be set to comply with the required specifications, but also can be tuned online to compensate for a total of twelve properties, such as the noise rejection levels for fault detection and diagnosis signals. The proposed design is realized over a powerful architecture that combines the flexibility of adaptive critic designs with the long term memory and learning capabilities of a supervisor. This paper presents a practical design procedure to facilitate the applications of a fundamentally sound fault tolerant control architecture in real-world problems.

  9. Error Mitigation of Point-to-Point Communication for Fault-Tolerant Computing

    NASA Technical Reports Server (NTRS)

    Akamine, Robert L.; Hodson, Robert F.; LaMeres, Brock J.; Ray, Robert E.

    2011-01-01

    Fault tolerant systems require the ability to detect and recover from physical damage caused by the hardware s environment, faulty connectors, and system degradation over time. This ability applies to military, space, and industrial computing applications. The integrity of Point-to-Point (P2P) communication, between two microcontrollers for example, is an essential part of fault tolerant computing systems. In this paper, different methods of fault detection and recovery are presented and analyzed.

  10. An aircraft sensor fault tolerant system

    NASA Technical Reports Server (NTRS)

    Caglayan, A. K.; Lancraft, R. E.

    1982-01-01

    The design of a sensor fault tolerant system which uses analytical redundancy for the Terminal Configured Vehicle (TCV) research aircraft in a Microwave Landing System (MLS) environment was studied. The fault tolerant system provides reliable estimates for aircraft position, velocity, and attitude in the presence of possible failures in navigation aid instruments and onboard sensors. The estimates, provided by the fault tolerant system, are used by the automated guidance and control system to land the aircraft along a prescribed path. Sensor failures are identified by utilizing the analytic relationship between the various sensor outputs arising from the aircraft equations of motion.

  11. Fault-tolerant communication channel structures

    NASA Technical Reports Server (NTRS)

    Alkalai, Leon (Inventor); Chau, Savio N. (Inventor); Tai, Ann T. (Inventor)

    2006-01-01

    Systems and techniques for implementing fault-tolerant communication channels and features in communication systems. Selected commercial-off-the-shelf devices can be integrated in such systems to reduce the cost.

  12. Performance Analysis on Fault Tolerant Control System

    NASA Technical Reports Server (NTRS)

    Shin, Jong-Yeob; Belcastro, Christine

    2005-01-01

    In a fault tolerant control (FTC) system, a parameter varying FTC law is reconfigured based on fault parameters estimated by fault detection and isolation (FDI) modules. FDI modules require some time to detect fault occurrences in aero-vehicle dynamics. In this paper, an FTC analysis framework is provided to calculate the upper bound of an induced-L(sub 2) norm of an FTC system with existence of false identification and detection time delay. The upper bound is written as a function of a fault detection time and exponential decay rates and has been used to determine which FTC law produces less performance degradation (tracking error) due to false identification. The analysis framework is applied for an FTC system of a HiMAT (Highly Maneuverable Aircraft Technology) vehicle. Index Terms fault tolerant control system, linear parameter varying system, HiMAT vehicle.

  13. The cost of software fault tolerance

    NASA Technical Reports Server (NTRS)

    Migneault, G. E.

    1982-01-01

    The proposed use of software fault tolerance techniques as a means of reducing software costs in avionics and as a means of addressing the issue of system unreliability due to faults in software is examined. A model is developed to provide a view of the relationships among cost, redundancy, and reliability which suggests strategies for software development and maintenance which are not conventional.

  14. A verified design of a fault-tolerant clock synchronization circuit: Preliminary investigations

    NASA Technical Reports Server (NTRS)

    Miner, Paul S.

    1992-01-01

    Schneider demonstrates that many fault tolerant clock synchronization algorithms can be represented as refinements of a single proven correct paradigm. Shankar provides mechanical proof that Schneider's schema achieves Byzantine fault tolerant clock synchronization provided that 11 constraints are satisfied. Some of the constraints are assumptions about physical properties of the system and cannot be established formally. Proofs are given that the fault tolerant midpoint convergence function satisfies three of the constraints. A hardware design is presented, implementing the fault tolerant midpoint function, which is shown to satisfy the remaining constraints. The synchronization circuit will recover completely from transient faults provided the maximum fault assumption is not violated. The initialization protocol for the circuit also provides a recovery mechanism from total system failure caused by correlated transient faults.

  15. The Design of a Fault-Tolerant COTS-Based Bus Architecture

    NASA Technical Reports Server (NTRS)

    Chau, Savio N.; Alkalai, Leon; Burt, John B.; Tai, Ann T.

    1999-01-01

    In this paper, we report our experiences and findings on the design of a fault-tolerant bus architecture comprised of two COTS buses, the IEEE 1394 and the 12C. This fault-tolerant bus is the backbone system bus for the avionics architecture of the X2000 program at the Jet Propulsion Laboratory. COTS buses are attractive because of the availability of low cost commercial products. However, they are not specifically designed for highly reliable applications such as long-life deep-space missions. The X2000 design team has devised a multi-level fault tolerance approach to compensate for this shortcoming of COTS buses. First, the approach enhances the fault tolerance capabilities of the IEEE 1394 and 12 C buses by adding a layer of fault handling hardware and software. Second, algorithms are developed to enable the IEEE 1394 and the 12 C buses assist each other to isolate and recovery from faults. Third, the set of IEEE 1394 and 12 C buses is duplicated to further enhance system reliability. The X2000 design team has paid special attention to guarantee that all fault tolerance provisions will not cause the bus design to deviate from the commercial standard specifications. Otherwise, the economic attractiveness of using COTS will be diminished. The hardware and software design of the X2000 fault-tolerant bus are being implemented and flight hardware will be delivered to the ST4 and Europa Orbiter missions.

  16. Fault tolerance and testing for WSI systems

    NASA Astrophysics Data System (ADS)

    Ptak, Alan W.; McLeod, R. D.

    Fault tolerance and testing for wafer scale integration (WSI) processor arrays using boundary scan and built-in self-test (BIST) technology are discussed. A test strategy for verification of all components within an integrated circuit wafer is presented, and a fault tolerance technique using semi-concurrent fault detection is described. The test strategy consists of four steps taken to verify test bus continuity, boundary scan register continuity, interconnection network connectivity, and processor element integrity. The component-level area overhead for boundary scan and BIST is modest for present-day fabrication processes, and will diminish to an insignificant level as integrated circuit fabrication technology continues to improve.

  17. Fault-tolerant multichannel demultiplexer subsystems

    NASA Technical Reports Server (NTRS)

    Redinbo, Robert

    1991-01-01

    Fault tolerance in future processing and switching communication satellites is addressed by showing new methods for detecting hardware failures in the first major subsystem, the multichannel demultiplexer. An efficient method for demultiplexing frequency slotted channels uses multirate filter banks which contain fast Fourier transform processing. All numerical processing is performed at a lower rate commensurate with the small bandwidth of each bandbase channel. The integrity of the demultiplexing operations is protected by using real number convolutional codes to compute comparable parity values which detect errors at the data sample level. High rate, systematic convolutional codes produce parity values at a much reduced rate, and protection is achieved by generating parity values in two ways and comparing them. Parity values corresponding to each output channel are generated in parallel by a subsystem, operating even slower and in parallel with the demultiplexer that is virtually identical to the original structure. These parity calculations may be time shared with the same processing resources because they are so similar.

  18. Fault tolerance in computational grids: perspectives, challenges, and issues.

    PubMed

    Haider, Sajjad; Nazir, Babar

    2016-01-01

    Computational grids are established with the intention of providing shared access to hardware and software based resources with special reference to increased computational capabilities. Fault tolerance is one of the most important issues faced by the computational grids. The main contribution of this survey is the creation of an extended classification of problems that incur in the computational grid environments. The proposed classification will help researchers, developers, and maintainers of grids to understand the types of issues to be anticipated. Moreover, different types of problems, such as omission, interaction, and timing related have been identified that need to be handled on various layers of the computational grid. In this survey, an analysis and examination is also performed pertaining to the fault tolerance and fault detection mechanisms. Our conclusion is that a dependable and reliable grid can only be established when more emphasis is on fault identification. Moreover, our survey reveals that adaptive and intelligent fault identification, and tolerance techniques can improve the dependability of grid working environments.

  19. Fault Injection Campaign for a Fault Tolerant Duplex Framework

    NASA Technical Reports Server (NTRS)

    Sacco, Gian Franco; Ferraro, Robert D.; von llmen, Paul; Rennels, Dave A.

    2007-01-01

    Fault tolerance is an efficient approach adopted to avoid or reduce the damage of a system failure. In this work we present the results of a fault injection campaign we conducted on the Duplex Framework (DF). The DF is a software developed by the UCLA group [1, 2] that uses a fault tolerant approach and allows to run two replicas of the same process on two different nodes of a commercial off-the-shelf (COTS) computer cluster. A third process running on a different node, constantly monitors the results computed by the two replicas, and eventually restarts the two replica processes if an inconsistency in their computation is detected. This approach is very cost efficient and can be adopted to control processes on spacecrafts where the fault rate produced by cosmic rays is not very high.

  20. Model-Based Fault Tolerant Control

    NASA Technical Reports Server (NTRS)

    Kumar, Aditya; Viassolo, Daniel

    2008-01-01

    The Model Based Fault Tolerant Control (MBFTC) task was conducted under the NASA Aviation Safety and Security Program. The goal of MBFTC is to develop and demonstrate real-time strategies to diagnose and accommodate anomalous aircraft engine events such as sensor faults, actuator faults, or turbine gas-path component damage that can lead to in-flight shutdowns, aborted take offs, asymmetric thrust/loss of thrust control, or engine surge/stall events. A suite of model-based fault detection algorithms were developed and evaluated. Based on the performance and maturity of the developed algorithms two approaches were selected for further analysis: (i) multiple-hypothesis testing, and (ii) neural networks; both used residuals from an Extended Kalman Filter to detect the occurrence of the selected faults. A simple fusion algorithm was implemented to combine the results from each algorithm to obtain an overall estimate of the identified fault type and magnitude. The identification of the fault type and magnitude enabled the use of an online fault accommodation strategy to correct for the adverse impact of these faults on engine operability thereby enabling continued engine operation in the presence of these faults. The performance of the fault detection and accommodation algorithm was extensively tested in a simulation environment.

  1. Fault tolerant operation of switched reluctance machine

    NASA Astrophysics Data System (ADS)

    Wang, Wei

    The energy crisis and environmental challenges have driven industry towards more energy efficient solutions. With nearly 60% of electricity consumed by various electric machines in industry sector, advancement in the efficiency of the electric drive system is of vital importance. Adjustable speed drive system (ASDS) provides excellent speed regulation and dynamic performance as well as dramatically improved system efficiency compared with conventional motors without electronics drives. Industry has witnessed tremendous grow in ASDS applications not only as a driving force but also as an electric auxiliary system for replacing bulky and low efficiency auxiliary hydraulic and mechanical systems. With the vast penetration of ASDS, its fault tolerant operation capability is more widely recognized as an important feature of drive performance especially for aerospace, automotive applications and other industrial drive applications demanding high reliability. The Switched Reluctance Machine (SRM), a low cost, highly reliable electric machine with fault tolerant operation capability, has drawn substantial attention in the past three decades. Nevertheless, SRM is not free of fault. Certain faults such as converter faults, sensor faults, winding shorts, eccentricity and position sensor faults are commonly shared among all ASDS. In this dissertation, a thorough understanding of various faults and their influence on transient and steady state performance of SRM is developed via simulation and experimental study, providing necessary knowledge for fault detection and post fault management. Lumped parameter models are established for fast real time simulation and drive control. Based on the behavior of the faults, a fault detection scheme is developed for the purpose of fast and reliable fault diagnosis. In order to improve the SRM power and torque capacity under faults, the maximum torque per ampere excitation are conceptualized and validated through theoretical analysis and

  2. Fault-tolerant quantum computation with asymmetric Bacon-Shor codes

    NASA Astrophysics Data System (ADS)

    Brooks, Peter; Preskill, John

    2013-03-01

    We develop a scheme for fault-tolerant quantum computation based on asymmetric Bacon-Shor codes, which works effectively against highly biased noise dominated by dephasing. We find the optimal Bacon-Shor block size as a function of the noise strength and the noise bias, and estimate the logical error rate and overhead cost achieved by this optimal code. Our fault-tolerant gadgets, based on gate teleportation, are well suited for hardware platforms with geometrically local gates in two dimensions.

  3. Analysis of fault-tolerant neurocontrol architectures

    NASA Technical Reports Server (NTRS)

    Troudet, T.; Merrill, W.

    1992-01-01

    The fault-tolerance of analog parallel distributed implementations of a multivariable aircraft neurocontroller is analyzed by simulating weight and neuron failures in a simplified scheme of analog processing based on the functional architecture of the ETANN chip (Electrically Trainable Artificial Neural Network). The neural information processing is found to be only partially distributed throughout the set of weights of the neurocontroller synthesized with the backpropagation algorithm. Although the degree of distribution of the neural processing, and consequently the fault-tolerance of the neurocontroller, could be enhanced using Locally Distributed Weight and Neuron Approaches, a satisfactory level of fault-tolerance could only be obtained by retraining the degrated VLSI neurocontroller. The possibility of maintaining neurocontrol performance and stability in the presence of single weight of neuron failures was demonstrated through an automated retraining procedure of the neurocontroller based on a pre-programmed choice and sequence of the training parameters.

  4. Using Relocatable Bitstreams For Fault Tolerance

    DTIC Science & Technology

    2007-03-01

    fault tolerant, increasing their dependability and availability, by allowing an FPGA to restore its functionality after a fault has been detected ...device to be programmed, thus providing direct support for dynamic reconfiguration [GLS99]. All action in JBits must be specified in the source code ... FPGA families, including the Virtex-II Pro, and provides a router based on JHDLBits, an open source project that connects JHDL and JBits. JBits 3.0

  5. A Primer on Architectural Level Fault Tolerance

    NASA Technical Reports Server (NTRS)

    Butler, Ricky W.

    2008-01-01

    This paper introduces the fundamental concepts of fault tolerant computing. Key topics covered are voting, fault detection, clock synchronization, Byzantine Agreement, diagnosis, and reliability analysis. Low level mechanisms such as Hamming codes or low level communications protocols are not covered. The paper is tutorial in nature and does not cover any topic in detail. The focus is on rationale and approach rather than detailed exposition.

  6. Network fault tolerance in LA-MPI

    SciTech Connect

    Aulwes, R. T.; Daniel, D. J.; Desai, N. N.; Graham, R. L.; Risinger, L. D.; Sukalski, M. W.; Taylor, M. A.

    2003-01-01

    LA-MPI is a high-performance, network-fault-tolerant implementation of MPl designcd for terascale clusters that are inherently unreliable due to their very large number of system components and to trade-offs between cost and pcrformance. This paper reviews the architectural design of LA-MPI, focusing on our approach to guaranteeing data integrity. We discuss our network data path abstraction that makes LA-MPI highly portable, givcs high-performance through mcssage striping, and niost importantly provides the basis for network fault tolerance. Finally we include some performance numbers for the Quadrics and UDP network paths.

  7. Experiments in fault tolerant software reliability

    NASA Technical Reports Server (NTRS)

    Mcallister, David F.; Tai, K. C.; Vouk, Mladen A.

    1987-01-01

    The reliability of voting was evaluated in a fault-tolerant software system for small output spaces. The effectiveness of the back-to-back testing process was investigated. Version 3.0 of the RSDIMU-ATS, a semi-automated test bed for certification testing of RSDIMU software, was prepared and distributed. Software reliability estimation methods based on non-random sampling are being studied. The investigation of existing fault-tolerance models was continued and formulation of new models was initiated.

  8. Flexible fault tolerance in configurable middleware for embedded systems

    SciTech Connect

    Dorow, Kevin E.

    2003-11-03

    MicroQoSCORBA (MQC) is a middleware platform that focuses on embedded applications by providing a very fine level of configurability of its internal orthogonal components. Using this configurability, a developer can generate a customized middleware instantiation that is tailored to both the requirements and constraints of a specific embedded application and the embedded hardware. One of the key components provided by MQC is a set of fault-tolerant mechanisms, which allow for support of applications that require a higher level of reliability. This document provides a detailed description of the algorithms and protocols selected for these mechanisms, along with a discussion of their implementation and incorporation into the MQC platform.

  9. Integrated design of fault reconstruction and fault-tolerant control against actuator faults using learning observers

    NASA Astrophysics Data System (ADS)

    Jia, Qingxian; Chen, Wen; Zhang, Yingchun; Li, Huayi

    2016-12-01

    This paper addresses the problem of integrated fault reconstruction and fault-tolerant control in linear systems subject to actuator faults via learning observers (LOs). A reconfigurable fault-tolerant controller is designed based on the constructed LO to compensate for the influence of actuator faults by stabilising the closed-loop system. An integrated design of the proposed LO and the fault-tolerant controller is explored such that their performance can be simultaneously considered and their coupling problem can be effectively solved. In addition, such an integrated design is formulated in terms of linear matrix inequalities (LMIs) that can be conveniently solved in a unified framework using LMI optimisation technique. At last, simulation studies on a micro-satellite attitude control system are provided to verify the effectiveness of the proposed approach.

  10. Fly-By-Light/Power-By-Wire Fault-Tolerant Fiber-Optic Backplane

    NASA Technical Reports Server (NTRS)

    Malekpour, Mahyar R.

    2002-01-01

    The design and development of a fault-tolerant fiber-optic backplane to demonstrate feasibility of such architecture is presented. The simulation results of test cases on the backplane in the advent of induced faults are presented, and the fault recovery capability of the architecture is demonstrated. The architecture was designed, developed, and implemented using the Very High Speed Integrated Circuits (VHSIC) Hardware Description Language (VHDL). The architecture was synthesized and implemented in hardware using Field Programmable Gate Arrays (FPGA) on multiple prototype boards.

  11. Development and Evaluation of Fault-Tolerant Flight Control Systems

    NASA Technical Reports Server (NTRS)

    Song, Yong D.; Gupta, Kajal (Technical Monitor)

    2004-01-01

    The research is concerned with developing a new approach to enhancing fault tolerance of flight control systems. The original motivation for fault-tolerant control comes from the need for safe operation of control elements (e.g. actuators) in the event of hardware failures in high reliability systems. One such example is modem space vehicle subjected to actuator/sensor impairments. A major task in flight control is to revise the control policy to balance impairment detectability and to achieve sufficient robustness. This involves careful selection of types and parameters of the controllers and the impairment detecting filters used. It also involves a decision, upon the identification of some failures, on whether and how a control reconfiguration should take place in order to maintain a certain system performance level. In this project new flight dynamic model under uncertain flight conditions is considered, in which the effects of both ramp and jump faults are reflected. Stabilization algorithms based on neural network and adaptive method are derived. The control algorithms are shown to be effective in dealing with uncertain dynamics due to external disturbances and unpredictable faults. The overall strategy is easy to set up and the computation involved is much less as compared with other strategies. Computer simulation software is developed. A serious of simulation studies have been conducted with varying flight conditions.

  12. Continuous reconfiguration: fault tolerance without a ripple

    SciTech Connect

    Bortner, R.A.

    1983-01-01

    The concepts of the continuously reconfiguring flight control system (crm/sup 2/fcs) and the impact of its architecture upon fault tolerance and reliability are covered. Some of the topics discussed are continuous reconfiguration, autonomous control, virtual common memory and the fault filter. Continuous reconfiguration is defined. An example is discussed with an explanation of transparent failure. Autonomous control is the scheme for controlling a continually reconfiguring system. The process of volunteering is also discussed. The virtual common memory is the common memory architecture used in the continuously reconfiguring system. Its physical implementation is explained. The fault filter is the method used to detect and deal with faulty processors. The different levels and the types of faults each handles are examined. 1 ref.

  13. Investigation of an advanced fault tolerant integrated avionics system

    NASA Technical Reports Server (NTRS)

    Dunn, W. R.; Cottrell, D.; Flanders, J.; Javornik, A.; Rusovick, M.

    1986-01-01

    Presented is an advanced, fault-tolerant multiprocessor avionics architecture as could be employed in an advanced rotorcraft such as LHX. The processor structure is designed to interface with existing digital avionics systems and concepts including the Army Digital Avionics System (ADAS) cockpit/display system, navaid and communications suites, integrated sensing suite, and the Advanced Digital Optical Control System (ADOCS). The report defines mission, maintenance and safety-of-flight reliability goals as might be expected for an operational LHX aircraft. Based on use of a modular, compact (16-bit) microprocessor card family, results of a preliminary study examining simplex, dual and standby-sparing architectures is presented. Given the stated constraints, it is shown that the dual architecture is best suited to meet reliability goals with minimum hardware and software overhead. The report presents hardware and software design considerations for realizing the architecture including redundancy management requirements and techniques as well as verification and validation needs and methods.

  14. Using certification trails to achieve software fault tolerance

    NASA Technical Reports Server (NTRS)

    Sullivan, Gregory F.; Masson, Gerald M.

    1993-01-01

    A conceptually novel and powerful technique to achieve fault tolerance in hardware and software systems is introduced. When used for software fault tolerance, this new technique uses time and software redundancy and can be outlined as follows. In the initial phase, a program is run to solve a problem and store the result. In addition, this program leaves behind a trail of data called a certification trail. In the second phase, another program is run which solves the original problem again. This program, however, has access to the certification trail left by the first program. Because of the availability of the certification trail, the second phase can be performed by a less complex program and can execute more quickly. In the final phase, the two results are accepted as correct; otherwise an error is indicated. An essential aspect of this approach is that the second program must always generate either an error indication or a correct output even when the certification trail it receives from the first program is incorrect. The certification trail approach to fault tolerance was formalized and it was illustrated by applying it to the fundamental problem of finding a minimum spanning tree. Cases in which the second phase can be run concorrectly with the first and act as a monitor are discussed. The certification trail approach was compared to other approaches to fault tolerance. Because of space limitations we have omitted examples of our technique applied to the Huffman tree, and convex hull problems. These can be found in the full version of this paper.

  15. Critical fault patterns determination in fault-tolerant computer systems

    NASA Technical Reports Server (NTRS)

    Mccluskey, E. J.; Losq, J.

    1978-01-01

    The method proposed tries to enumerate all the critical fault-patterns (successive occurrences of failures) without analyzing every single possible fault. The conditions for the system to be operating in a given mode can be expressed in terms of the static states. Thus, one can find all the system states that correspond to a given critical mode of operation. The next step consists in analyzing the fault-detection mechanisms, the diagnosis algorithm and the process of switch control. From them, one can find all the possible system configurations that can result from a failure occurrence. Thus, one can list all the characteristics, with respect to detection, diagnosis, and switch control, that failures must have to constitute critical fault-patterns. Such an enumeration of the critical fault-patterns can be directly used to evaluate the overall system tolerance to failures. Present research is focused on how to efficiently make use of these system-level characteristics to enumerate all the failures that verify these characteristics.

  16. An Integrated Fault Tolerant Robotic Controller System for High Reliability and Safety

    NASA Technical Reports Server (NTRS)

    Marzwell, Neville I.; Tso, Kam S.; Hecht, Myron

    1994-01-01

    This paper describes the concepts and features of a fault-tolerant intelligent robotic control system being developed for applications that require high dependability (reliability, availability, and safety). The system consists of two major elements: a fault-tolerant controller and an operator workstation. The fault-tolerant controller uses a strategy which allows for detection and recovery of hardware, operating system, and application software failures.The fault-tolerant controller can be used by itself in a wide variety of applications in industry, process control, and communications. The controller in combination with the operator workstation can be applied to robotic applications such as spaceborne extravehicular activities, hazardous materials handling, inspection and maintenance of high value items (e.g., space vehicles, reactor internals, or aircraft), medicine, and other tasks where a robot system failure poses a significant risk to life or property.

  17. [Advanced Development for Space Robotics With Emphasis on Fault Tolerance Technology

    NASA Technical Reports Server (NTRS)

    Tesar, Delbert

    1997-01-01

    This report describes work developing fault tolerant redundant robotic architectures and adaptive control strategies for robotic manipulator systems which can dynamically accommodate drastic robot manipulator mechanism, sensor or control failures and maintain stable end-point trajectory control with minimum disturbance. Kinematic designs of redundant, modular, reconfigurable arms for fault tolerance were pursued at a fundamental level. The approach developed robotic testbeds to evaluate disturbance responses of fault tolerant concepts in robotic mechanisms and controllers. The development was implemented in various fault tolerant mechanism testbeds including duality in the joint servo motor modules, parallel and serial structural architectures, and dual arms. All have real-time adaptive controller technologies to react to mechanism or controller disturbances (failures) to perform real-time reconfiguration to continue the task operations. The developments fall into three main areas: hardware, software, and theoretical.

  18. Novel neural networks-based fault tolerant control scheme with fault alarm.

    PubMed

    Shen, Qikun; Jiang, Bin; Shi, Peng; Lim, Cheng-Chew

    2014-11-01

    In this paper, the problem of adaptive active fault-tolerant control for a class of nonlinear systems with unknown actuator fault is investigated. The actuator fault is assumed to have no traditional affine appearance of the system state variables and control input. The useful property of the basis function of the radial basis function neural network (NN), which will be used in the design of the fault tolerant controller, is explored. Based on the analysis of the design of normal and passive fault tolerant controllers, by using the implicit function theorem, a novel NN-based active fault-tolerant control scheme with fault alarm is proposed. Comparing with results in the literature, the fault-tolerant control scheme can minimize the time delay between fault occurrence and accommodation that is called the time delay due to fault diagnosis, and reduce the adverse effect on system performance. In addition, the FTC scheme has the advantages of a passive fault-tolerant control scheme as well as the traditional active fault-tolerant control scheme's properties. Furthermore, the fault-tolerant control scheme requires no additional fault detection and isolation model which is necessary in the traditional active fault-tolerant control scheme. Finally, simulation results are presented to demonstrate the efficiency of the developed techniques.

  19. A Unified Fault-Tolerance Protocol

    NASA Technical Reports Server (NTRS)

    Miner, Paul; Gedser, Alfons; Pike, Lee; Maddalon, Jeffrey

    2004-01-01

    Davies and Wakerly show that Byzantine fault tolerance can be achieved by a cascade of broadcasts and middle value select functions. We present an extension of the Davies and Wakerly protocol, the unified protocol, and its proof of correctness. We prove that it satisfies validity and agreement properties for communication of exact values. We then introduce bounded communication error into the model. Inexact communication is inherent for clock synchronization protocols. We prove that validity and agreement properties hold for inexact communication, and that exact communication is a special case. As a running example, we illustrate the unified protocol using the SPIDER family of fault-tolerant architectures. In particular we demonstrate that the SPIDER interactive consistency, distributed diagnosis, and clock synchronization protocols are instances of the unified protocol.

  20. Fault tolerant GPS/Inertial System design

    NASA Astrophysics Data System (ADS)

    Brown, Alison K.; Sturza, Mark A.; Deangelis, Franco; Lukaszewski, David A.

    The use of a GPS/Inertial integrated system in future launch vehicles motivates the described design of the present fault-tolerant system. The robustness of the navigation system is enhanced by integrating the GPS with an inertial fault-tolerant system. Three layers of failure detection and isolation are incorporated to determine the nature of flaws in the inertial instruments, the GPS receivers, or the integrated navigation solution. The layers are based on: (1) a high-rate parity algorithm for instrument failures; (2) a similar parity algorithm for GPS satellite or receiver failures; and (3) a GPS navigation solution to monitor inertial navigation failures. Dual failures of any system component can occur in any system component without affecting the performance of launch-vehicle navigation or guidance.

  1. A dual, fault-tolerant aerospace actuator

    NASA Technical Reports Server (NTRS)

    Siebert, C. J.

    1985-01-01

    The requirements for mechanisms used in the Space Transportation System (STS) are to provide dual fault tolerance, and if the payload equipment violates the Shuttle bay door envelope, these deployment/restow mechanisms must have independent primary and backup features. The research and development of an electromechanical actuator that meets these requirements and will be used on the Transfer Orbit Stage (TOS) program is described.

  2. Strategies for Fault Tolerance in Multicomponent Applications

    SciTech Connect

    Shet, Aniruddha G; Elwasif, Wael R; Foley, Samantha S; Park, Byung H; Bernholdt, David E; Bramley, Randall B

    2011-01-01

    This paper discusses on-going work with the Integrated Plasma Simulator (IPS), a framework for coupled multiphysics simulations of plasmas, to allow simulations to run through the loss of nodes on which the simulation is executing. While many different techniques are available to improve the fault tolerance of computational science applications on high-performance computer systems, checkpoint/restart (C/R) remains virtually the only one that see widespread use in practice. Our focus here is to augment the traditional C/R approach with additional techniques that can provide a more localized and tailored response to faults based on the ability to restart failed tasks on an individual basis, and the use of information external to the application itself in order to guide decision-making, in many cases avoiding the need to stop and restart the entire simulation. This capability involves several features within the IPS framework, and leverages the Fault Tolerance Backplane, a publish/subscribe event service to disseminate fault-related information throughout HPC systems, to obtain information from the Reliability, Availability and Serviceability (RAS) subsystem of the HPC system. This work is described in the context of Cray XT-series computer systems for concreteness, but is applicable to other environments as well. As part of the analysis of this work, we discuss the requirements to generalize this approach to other complex simulation applications beyond the Integrated Plasma Simulator.

  3. Reconfigurable Fault Tolerance for FPGAs

    NASA Technical Reports Server (NTRS)

    Shuler, Robert, Jr.

    2010-01-01

    The invention allows a field-programmable gate array (FPGA) or similar device to be efficiently reconfigured in whole or in part to provide higher capacity, non-redundant operation. The redundant device consists of functional units such as adders or multipliers, configuration memory for the functional units, a programmable routing method, configuration memory for the routing method, and various other features such as block RAM, I/O (random access memory, input/output) capability, dedicated carry logic, etc. The redundant device has three identical sets of functional units and routing resources and majority voters that correct errors. The configuration memory may or may not be redundant, depending on need. For example, SRAM-based FPGAs will need some type of radiation-tolerant configuration memory, or they will need triple-redundant configuration memory. Flash or anti-fuse devices will generally not need redundant configuration memory. Some means of loading and verifying the configuration memory is also required. These are all components of the pre-existing redundant FPGA. This innovation modifies the voter to accept a MODE input, which specifies whether ordinary voting is to occur, or if redundancy is to be split. Generally, additional routing resources will also be required to pass data between sections of the device created by splitting the redundancy. In redundancy mode, the voters produce an output corresponding to the two inputs that agree, in the usual fashion. In the split mode, the voters select just one input and convey this to the output, ignoring the other inputs. In a dual-redundant system (as opposed to triple-redundant), instead of a voter, there is some means to latch or gate a state update only when both inputs agree. In this case, the invention would require modification of the latch or gate so that it would operate normally in redundant mode, and would separately latch or gate the inputs in non-redundant mode.

  4. Real-time optimal torque control of fault-tolerant permanent magnet brushless machines

    NASA Astrophysics Data System (ADS)

    Max, L.; Wang, J.; Atallah, K.; Howe, D.

    2005-05-01

    The paper describes issues that are pertinent to control system hardware and software design for the real-time implementation of an optimal torque control strategy for fault-tolerant permanent magnet brushless ac drives, and reports experimental results. The influence of the current control loop bandwidth and pulse width modulation on the torque ripple are investigated and quantified.

  5. The X-38 Spacecraft Fault-Tolerant Avionics System

    NASA Technical Reports Server (NTRS)

    Kouba,Coy; Buscher, Deborah; Busa, Joseph

    2003-01-01

    In 1995 NASA began an experimental program to develop a reusable crew return vehicle (CRV) for the International Space Station. The purpose of the CRV was threefold: (i) to bring home an injured or ill crewmember; (ii) to bring home the entire crew if the Shuttle fleet was grounded; and (iii) to evacuate the crew in the case of an imminent Station threat (i.e., fire, decompression, etc). Built at the Johnson Space Center, were two approach and landing prototypes and one spacecraft demonstrator (called V201). A series of increasingly complex ground subsystem tests were completed, and eight successful high-altitude drop tests were achieved to prove the design concept. In this program, an unprecedented amount of commercial-off-the-shelf technology was utilized in this first crewed spacecraft NASA has built since the Shuttle program. Unfortunately, in 2002 the program was canceled due to changing Agency priorities. The vehicle was 80% complete and the program was shut down in such a manner as to preserve design, development, test and engineering data. This paper describes the X-38 V201 fault-tolerant avionics system. Based on Draper Laboratory's Byzantine-resilient fault-tolerant parallel processing system and their "network element" hardware, each flight computer exchanges information on a strict timescale to process input data, compare results, and issue voted vehicle output commands. Major accomplishments achieved in this development include: (i) a space qualified two-fault tolerant design using mostly COTS (hardware and operating system); (ii) a single event upset tolerant network element board, (iii) on-the-fly recovery of a failed processor; (iv) use of synched cache; (v) realignment of memory to bring back a failed channel; (vi) flight code automatically generated from the master measurement list; and (vii) built in-house by a team of civil servants and support contractors. This paper will present an overview of the avionics system and the hardware

  6. Coordinated Fault Tolerance for High-Performance Computing

    SciTech Connect

    Dongarra, Jack; Bosilca, George; et al.

    2013-04-08

    Our work to meet our goal of end-to-end fault tolerance has focused on two areas: (1) improving fault tolerance in various software currently available and widely used throughout the HEC domain and (2) using fault information exchange and coordination to achieve holistic, systemwide fault tolerance and understanding how to design and implement interfaces for integrating fault tolerance features for multiple layers of the software stack—from the application, math libraries, and programming language runtime to other common system software such as jobs schedulers, resource managers, and monitoring tools.

  7. Fault Tolerant Magnetic Bearing for Turbomachinery

    NASA Technical Reports Server (NTRS)

    Choi, Benjamin; Provenza, Andrew

    2001-01-01

    NASA Glenn Research Center (GRC) has developed a Fault-Tolerant Magnetic Bearing Suspension rig to enhance the bearing system safety. It successfully demonstrated that using only two active poles out of eight redundant poles from each radial bearing (that is, simply 12 out of 16 poles dead) levitated the rotor and spun it without losing stability and desired position up to the maximum allowable speed of 20,000 rpm. In this paper, it is demonstrated that as far as the summation of force vectors of the attracting poles and rotor weight is zero, a fault-tolerant magnetic bearing system maintained the rotor at the desired position without losing stability even at the maximum rotor speed. A proportional-integral-derivative (PID) controller generated autonomous corrective actions with no operator's input for the fault situations without losing load capacity in terms of rotor position. This paper also deals with a centralized modal controller to better control the dynamic behavior over system modes.

  8. MIL-M-38510/470 test vectors: Fault detection efficiency measurement via hardware fault simulation. [rca 1802 microprocessor

    NASA Technical Reports Server (NTRS)

    Timoc, C. C.

    1980-01-01

    The stuck fault detection efficiency of the test vectors developed for the MIL-M-38510/470 NASA was measured using a hardware stuck fault simulator for the 1802 microprocessor. Thirty-nine stuck faults were not detected out of a total of 874 injected into the combinatorial and sequential parts of the microprocessor. Since undetected faults can create catastrophic errors in equipment designed for high reliability applications, it is recommended that the MIL-M-38510/470 NASA be enhanced with additional test vectors so as to achieve 100% stuck fault detection efficiency.

  9. Software reliability through fault-avoidance and fault-tolerance

    NASA Technical Reports Server (NTRS)

    Vouk, Mladen A.; Mcallister, David F.

    1993-01-01

    Strategies and tools for the testing, risk assessment and risk control of dependable software-based systems were developed. Part of this project consists of studies to enable the transfer of technology to industry, for example the risk management techniques for safety-concious systems. Theoretical investigations of Boolean and Relational Operator (BRO) testing strategy were conducted for condition-based testing. The Basic Graph Generation and Analysis tool (BGG) was extended to fully incorporate several variants of the BRO metric. Single- and multi-phase risk, coverage and time-based models are being developed to provide additional theoretical and empirical basis for estimation of the reliability and availability of large, highly dependable software. A model for software process and risk management was developed. The use of cause-effect graphing for software specification and validation was investigated. Lastly, advanced software fault-tolerance models were studied to provide alternatives and improvements in situations where simple software fault-tolerance strategies break down.

  10. FPGA-Based, Self-Checking, Fault-Tolerant Computers

    NASA Technical Reports Server (NTRS)

    Some, Raphael; Rennels, David

    2004-01-01

    A proposed computer architecture would exploit the capabilities of commercially available field-programmable gate arrays (FPGAs) to enable computers to detect and recover from bit errors. The main purpose of the proposed architecture is to enable fault-tolerant computing in the presence of single-event upsets (SEUs). [An SEU is a spurious bit flip (also called a soft error) caused by a single impact of ionizing radiation.] The architecture would also enable recovery from some soft errors caused by electrical transients and, to some extent, from intermittent and permanent (hard) errors caused by aging of electronic components. A typical FPGA of the current generation contains one or more complete processor cores, memories, and highspeed serial input/output (I/O) channels, making it possible to shrink a board-level processor node to a single integrated-circuit chip. Custom, highly efficient microcontrollers, general-purpose computers, custom I/O processors, and signal processors can be rapidly and efficiently implemented by use of FPGAs. Unfortunately, FPGAs are susceptible to SEUs. Prior efforts to mitigate the effects of SEUs have yielded solutions that degrade performance of the system and require support from external hardware and software. In comparison with other fault-tolerant- computing architectures (e.g., triple modular redundancy), the proposed architecture could be implemented with less circuitry and lower power demand. Moreover, the fault-tolerant computing functions would require only minimal support from circuitry outside the central processing units (CPUs) of computers, would not require any software support, and would be largely transparent to software and to other computer hardware. There would be two types of modules: a self-checking processor module and a memory system (see figure). The self-checking processor module would be implemented on a single FPGA and would be capable of detecting its own internal errors. It would contain two CPUs executing

  11. Development and analysis of the Software Implemented Fault-Tolerance (SIFT) computer

    NASA Technical Reports Server (NTRS)

    Goldberg, J.; Kautz, W. H.; Melliar-Smith, P. M.; Green, M. W.; Levitt, K. N.; Schwartz, R. L.; Weinstock, C. B.

    1984-01-01

    SIFT (Software Implemented Fault Tolerance) is an experimental, fault-tolerant computer system designed to meet the extreme reliability requirements for safety-critical functions in advanced aircraft. Errors are masked by performing a majority voting operation over the results of identical computations, and faulty processors are removed from service by reassigning computations to the nonfaulty processors. This scheme has been implemented in a special architecture using a set of standard Bendix BDX930 processors, augmented by a special asynchronous-broadcast communication interface that provides direct, processor to processor communication among all processors. Fault isolation is accomplished in hardware; all other fault-tolerance functions, together with scheduling and synchronization are implemented exclusively by executive system software. The system reliability is predicted by a Markov model. Mathematical consistency of the system software with respect to the reliability model has been partially verified, using recently developed tools for machine-aided proof of program correctness.

  12. Fault-Tolerant Coding for State Machines

    NASA Technical Reports Server (NTRS)

    Naegle, Stephanie Taft; Burke, Gary; Newell, Michael

    2008-01-01

    Two reliable fault-tolerant coding schemes have been proposed for state machines that are used in field-programmable gate arrays and application-specific integrated circuits to implement sequential logic functions. The schemes apply to strings of bits in state registers, which are typically implemented in practice as assemblies of flip-flop circuits. If a single-event upset (SEU, a radiation-induced change in the bit in one flip-flop) occurs in a state register, the state machine that contains the register could go into an erroneous state or could hang, by which is meant that the machine could remain in undefined states indefinitely. The proposed fault-tolerant coding schemes are intended to prevent the state machine from going into an erroneous or hang state when an SEU occurs. To ensure reliability of the state machine, the coding scheme for bits in the state register must satisfy the following criteria: 1. All possible states are defined. 2. An SEU brings the state machine to a known state. 3. There is no possibility of a hang state. 4. No false state is entered. 5. An SEU exerts no effect on the state machine. Fault-tolerant coding schemes that have been commonly used include binary encoding and "one-hot" encoding. Binary encoding is the simplest state machine encoding and satisfies criteria 1 through 3 if all possible states are defined. Binary encoding is a binary count of the state machine number in sequence; the table represents an eight-state example. In one-hot encoding, N bits are used to represent N states: All except one of the bits in a string are 0, and the position of the 1 in the string represents the state. With proper circuit design, one-hot encoding can satisfy criteria 1 through 4. Unfortunately, the requirement to use N bits to represent N states makes one-hot coding inefficient.

  13. Software fault tolerance using data diversity

    NASA Technical Reports Server (NTRS)

    Knight, John C.

    1991-01-01

    Research on data diversity is discussed. Data diversity relies on a different form of redundancy from existing approaches to software fault tolerance and is substantially less expensive to implement. Data diversity can also be applied to software testing and greatly facilitates the automation of testing. Up to now it has been explored both theoretically and in a pilot study, and has been shown to be a promising technique. The effectiveness of data diversity as an error detection mechanism and the application of data diversity to differential equation solvers are discussed.

  14. Method and system for environmentally adaptive fault tolerant computing

    NASA Technical Reports Server (NTRS)

    Copenhaver, Jason L. (Inventor); Jeremy, Ramos (Inventor); Wolfe, Jeffrey M. (Inventor); Brenner, Dean (Inventor)

    2010-01-01

    A method and system for adapting fault tolerant computing. The method includes the steps of measuring an environmental condition representative of an environment. An on-board processing system's sensitivity to the measured environmental condition is measured. It is determined whether to reconfigure a fault tolerance of the on-board processing system based in part on the measured environmental condition. The fault tolerance of the on-board processing system may be reconfigured based in part on the measured environmental condition.

  15. Study on fault-tolerant processors for advanced launch system

    NASA Technical Reports Server (NTRS)

    Shin, Kang G.; Liu, Jyh-Charn

    1990-01-01

    Issues related to the reliability of a redundant system with large main memory are addressed. The Fault-Tolerant Processor (FTP) for the Advanced Launch System (ALS) is used as a basis for the presentation. When the system is free of latent faults, the probability of system crash due to multiple channel faults is shown to be insignificant even when voting on the outputs of computing channels is infrequent. Using channel error maskers (CEMs) is shown to improve reliability more effectively than increasing redundancy or the number of channels for applications with long mission times. Even without using a voter, most memory errors can be immediately corrected by those CEMs implemented with conventional coding techniques. In addition to their ability to enhance system reliability, CEMs (with a very low hardware overhead) can be used to dramatically reduce not only the need of memory realignment, but also the time required to realign channel memories in case, albeit rare, such a need arises. Using CEMs, two different schemes were developed to solve the memory realignment problem. In both schemes, most errors are corrected by CEMs, and the remaining errors are masked by a voter.

  16. Award ER25750: Coordinated Infrastructure for Fault Tolerance Systems Indiana University Final Report

    SciTech Connect

    Lumsdaine, Andrew

    2013-03-08

    The main purpose of the Coordinated Infrastructure for Fault Tolerance in Systems initiative has been to conduct research with a goal of providing end-to-end fault tolerance on a systemwide basis for applications and other system software. While fault tolerance has been an integral part of most high-performance computing (HPC) system software developed over the past decade, it has been treated mostly as a collection of isolated stovepipes. Visibility and response to faults has typically been limited to the particular hardware and software subsystems in which they are initially observed. Little fault information is shared across subsystems, allowing little flexibility or control on a system-wide basis, making it practically impossible to provide cohesive end-to-end fault tolerance in support of scientific applications. As an example, consider faults such as communication link failures that can be seen by a network library but are not directly visible to the job scheduler, or consider faults related to node failures that can be detected by system monitoring software but are not inherently visible to the resource manager. If information about such faults could be shared by the network libraries or monitoring software, then other system software, such as a resource manager or job scheduler, could ensure that failed nodes or failed network links were excluded from further job allocations and that further diagnosis could be performed. As a founding member and one of the lead developers of the Open MPI project, our efforts over the course of this project have been focused on making Open MPI more robust to failures by supporting various fault tolerance techniques, and using fault information exchange and coordination between MPI and the HPC system software stack from the application, numeric libraries, and programming language runtime to other common system components such as jobs schedulers, resource managers, and monitoring tools.

  17. Fault tolerant architectures for integrated aircraft electronics systems

    NASA Technical Reports Server (NTRS)

    Levitt, K. N.; Melliar-Smith, P. M.; Schwartz, R. L.

    1983-01-01

    Work into possible architectures for future flight control computer systems is described. Ada for Fault-Tolerant Systems, the NETS Network Error-Tolerant System architecture, and voting in asynchronous systems are covered.

  18. Exploiting data representation for fault tolerance

    DOE PAGES

    Hoemmen, Mark Frederick; Elliott, J.; Sandia National Lab.; ...

    2015-01-06

    Incorrect computer hardware behavior may corrupt intermediate computations in numerical algorithms, possibly resulting in incorrect answers. Prior work models misbehaving hardware by randomly flipping bits in memory. We start by accepting this premise, and present an analytic model for the error introduced by a bit flip in an IEEE 754 floating-point number. We then relate this finding to the linear algebra concepts of normalization and matrix equilibration. In particular, we present a case study illustrating that normalizing both vector inputs of a dot product minimizes the probability of a single bit flip causing a large error in the dot product'smore » result. Moreover, the absolute error is either less than one or very large, which allows detection of large errors. Then, we apply this to the GMRES iterative solver. We count all possible errors that can be introduced through faults in arithmetic in the computationally intensive orthogonalization phase of GMRES, and show that when the matrix is equilibrated, the absolute error is bounded above by one.« less

  19. Exploiting data representation for fault tolerance

    SciTech Connect

    Hoemmen, Mark Frederick; Elliott, J.; Mueller, F.

    2015-01-06

    Incorrect computer hardware behavior may corrupt intermediate computations in numerical algorithms, possibly resulting in incorrect answers. Prior work models misbehaving hardware by randomly flipping bits in memory. We start by accepting this premise, and present an analytic model for the error introduced by a bit flip in an IEEE 754 floating-point number. We then relate this finding to the linear algebra concepts of normalization and matrix equilibration. In particular, we present a case study illustrating that normalizing both vector inputs of a dot product minimizes the probability of a single bit flip causing a large error in the dot product's result. Moreover, the absolute error is either less than one or very large, which allows detection of large errors. Then, we apply this to the GMRES iterative solver. We count all possible errors that can be introduced through faults in arithmetic in the computationally intensive orthogonalization phase of GMRES, and show that when the matrix is equilibrated, the absolute error is bounded above by one.

  20. Parameter Transient Behavior Analysis on Fault Tolerant Control System

    NASA Technical Reports Server (NTRS)

    Belcastro, Christine (Technical Monitor); Shin, Jong-Yeob

    2003-01-01

    In a fault tolerant control (FTC) system, a parameter varying FTC law is reconfigured based on fault parameters estimated by fault detection and isolation (FDI) modules. FDI modules require some time to detect fault occurrences in aero-vehicle dynamics. This paper illustrates analysis of a FTC system based on estimated fault parameter transient behavior which may include false fault detections during a short time interval. Using Lyapunov function analysis, the upper bound of an induced-L2 norm of the FTC system performance is calculated as a function of a fault detection time and the exponential decay rate of the Lyapunov function.

  1. Eigenstructure Assignment for Fault Tolerant Flight Control Design

    NASA Technical Reports Server (NTRS)

    Sobel, Kenneth; Joshi, Suresh (Technical Monitor)

    2002-01-01

    In recent years, fault tolerant flight control systems have gained an increased interest for high performance military aircraft as well as civil aircraft. Fault tolerant control systems can be described as either active or passive. An active fault tolerant control system has to either reconfigure or adapt the controller in response to a failure. One approach is to reconfigure the controller based upon detection and identification of the failure. Another approach is to use direct adaptive control to adjust the controller without explicitly identifying the failure. In contrast, a passive fault tolerant control system uses a fixed controller which achieves acceptable performance for a presumed set of failures. We have obtained a passive fault tolerant flight control law for the F/A-18 aircraft which achieves acceptable handling qualities for a class of control surface failures. The class of failures includes the symmetric failure of any one control surface being stuck at its trim value. A comparison was made of an eigenstructure assignment gain designed for the unfailed aircraft with a fault tolerant multiobjective optimization gain. We have shown that time responses for the unfailed aircraft using the eigenstructure assignment gain and the fault tolerant gain are identical. Furthermore, the fault tolerant gain achieves MIL-F-8785C specifications for all failure conditions.

  2. Parallel and distributed computation for fault-tolerant object recognition

    NASA Technical Reports Server (NTRS)

    Wechsler, Harry

    1988-01-01

    The distributed associative memory (DAM) model is suggested for distributed and fault-tolerant computation as it relates to object recognition tasks. The fault-tolerance is with respect to geometrical distortions (scale and rotation), noisy inputs, occulsion/overlap, and memory faults. An experimental system was developed for fault-tolerant structure recognition which shows the feasibility of such an approach. The approach is futher extended to the problem of multisensory data integration and applied successfully to the recognition of colored polyhedral objects.

  3. Abstractions for Fault-Tolerant Distributed System Verification

    NASA Technical Reports Server (NTRS)

    Pike, Lee S.; Maddalon, Jeffrey M.; Miner, Paul S.; Geser, Alfons

    2004-01-01

    Four kinds of abstraction for the design and analysis of fault tolerant distributed systems are discussed. These abstractions concern system messages, faults, fault masking voting, and communication. The abstractions are formalized in higher order logic, and are intended to facilitate specifying and verifying such systems in higher order theorem provers.

  4. Machine-checked proofs of the design and implementation of a fault-tolerant circuit

    NASA Technical Reports Server (NTRS)

    Bevier, William R.; Young, William D.

    1990-01-01

    A formally verified implementation of the 'oral messages' algorithm of Pease, Shostak, and Lamport is described. An abstract implementation of the algorithm is verified to achieve interactive consistency in the presence of faults. This abstract characterization is then mapped down to a hardware level implementation which inherits the fault-tolerant characteristics of the abstract version. All steps in the proof were checked with the Boyer-Moore theorem prover. A significant results is the demonstration of a fault-tolerant device that is formally specified and whose implementation is proved correct with respect to this specification. A significant simplifying assumption is that the redundant processors behave synchronously. A mechanically checked proof that the oral messages algorithm is 'optimal' in the sense that no algorithm which achieves agreement via similar message passing can tolerate a larger proportion of faulty processor is also described.

  5. Analysis of typical fault-tolerant architectures using HARP

    NASA Technical Reports Server (NTRS)

    Bavuso, Salvatore J.; Bechta Dugan, Joanne; Trivedi, Kishor S.; Rothmann, Elizabeth M.; Smith, W. Earl

    1987-01-01

    Difficulties encountered in the modeling of fault-tolerant systems are discussed. The Hybrid Automated Reliability Predictor (HARP) approach to modeling fault-tolerant systems is described. The HARP is written in FORTRAN, consists of nearly 30,000 lines of codes and comments, and is based on behavioral decomposition. Using the behavioral decomposition, the dependability model is divided into fault-occurrence/repair and fault/error-handling models; the characteristics and combining of these two models are examined. Examples in which the HARP is applied to the modeling of some typical fault-tolerant systems, including a local-area network, two fault-tolerant computer systems, and a flight control system, are presented.

  6. FTMP (Fault Tolerant Multiprocessor) programmer's manual

    NASA Technical Reports Server (NTRS)

    Feather, F. E.; Liceaga, C. A.; Padilla, P. A.

    1986-01-01

    The Fault Tolerant Multiprocessor (FTMP) computer system was constructed using the Rockwell/Collins CAPS-6 processor. It is installed in the Avionics Integration Research Laboratory (AIRLAB) of NASA Langley Research Center. It is hosted by AIRLAB's System 10, a VAX 11/750, for the loading of programs and experimentation. The FTMP support software includes a cross compiler for a high level language called Automated Engineering Design (AED) System, an assembler for the CAPS-6 processor assembly language, and a linker. Access to this support software is through an automated remote access facility on the VAX which relieves the user of the burden of learning how to use the IBM 4381. This manual is a compilation of information about the FTMP support environment. It explains the FTMP software and support environment along many of the finer points of running programs on FTMP. This will be helpful to the researcher trying to run an experiment on FTMP and even to the person probing FTMP with fault injections. Much of the information in this manual can be found in other sources; we are only attempting to bring together the basic points in a single source. If the reader should need points clarified, there is a list of support documentation in the back of this manual.

  7. An integrated study of fault tolerance in computing systems

    SciTech Connect

    Lin, Tein-Hsiang.

    1988-01-01

    A general framework for the design and analysis of distributed fault-tolerant systems is proposed including fault/error occurrence and detection, error propagation, fault location, retry, system reconfiguration, damage assessment, and error recovery. Detection mechanisms are usually assumed to be so perfect that problems within a particular phase of fault tolerance can be studied without considering its interplay with other phases. This dissertation shows that the assumption of imperfect detection mechanisms will greatly influence fault diagnosis, rollback recovery, and checkpointing. Two additional related problems are studied. One is concerned with the use of retry following a fault detection and the other with the optimal placement of checkpoints in a real-time task with or without the perfect detection assumption. A fault-classification scheme is developed for on-line estimation of fault parameters.

  8. The IEEE eighteenth international symposium on fault-tolerant computing (Digest of Papers)

    SciTech Connect

    Not Available

    1988-01-01

    These proceedings collect papers on fault detection and computers. Topics include: software failure behavior, fault tolerant distributed programs, parallel simulation of faults, concurrent built-in self-test techniques, fault-tolerant parallel processor architectures, probabilistic fault diagnosis, fault tolerances in hypercube processors and cellular automation modeling.

  9. A second generation experiment in fault-tolerant software

    NASA Technical Reports Server (NTRS)

    Knight, J. C.

    1986-01-01

    The primary goal was to determine whether the application of fault tolerance to software increases its reliability if the cost of production is the same as for an equivalent nonfault tolerance version derived from the same requirements specification. Software development protocols are discussed. The feasibility of adapting to software design fault tolerance the technique of N-fold Modular Redundancy with majority voting was studied.

  10. Application-Specific Fault Tolerance via Data Access Characterization

    SciTech Connect

    Ali, Nawab; Krishnamoorthy, Sriram; Govind, Niranjan; Kowalski, Karol; Sadayappan, Ponnuswamy

    2011-08-30

    Recent trends in semiconductor technology and supercomputer design predict an increasing probability of faults during an application's execution. Designing an application that is resilient to system failures requires careful evaluation of the impact of various approaches on preserving key application state. In this paper, we present our experiences in an ongoing effort to make a large computational chemistry application fault tolerant. We construct the data access signatures of key application modules to evaluate alternative fault tolerance approaches. We present the instrumentation methodology, characterization of the application modules, and evaluation of fault tolerance techniques using the information collected. The application signatures developed capture application characteristics not traditionally revealed by performance tools. We believe these can be used in the design and evaluation of runtimes beyond fault tolerance.

  11. Fault tolerant architectures for integrated aircraft electronics systems, task 2

    NASA Technical Reports Server (NTRS)

    Levitt, K. N.; Melliar-Smith, P. M.; Schwartz, R. L.

    1984-01-01

    The architectural basis for an advanced fault tolerant on-board computer to succeed the current generation of fault tolerant computers is examined. The network error tolerant system architecture is studied with particular attention to intercluster configurations and communication protocols, and to refined reliability estimates. The diagnosis of faults, so that appropriate choices for reconfiguration can be made is discussed. The analysis relates particularly to the recognition of transient faults in a system with tasks at many levels of priority. The demand driven data-flow architecture, which appears to have possible application in fault tolerant systems is described and work investigating the feasibility of automatic generation of aircraft flight control programs from abstract specifications is reported.

  12. A fault-tolerant software strategy for digital systems

    NASA Technical Reports Server (NTRS)

    Hitt, E. F.; Webb, J. J.

    1984-01-01

    Techniques developed for producing fault-tolerant software are described. Tolerance is required because of the impossibility of defining fault-free software. Faults are caused by humans and can appear anywhere in the software life cycle. Tolerance is effected through error detection, damage assessment, recovery, and fault treatment, followed by return of the system to service. Multiversion software comprises two or more versions of the software yielding solutions which are examined by a decision algorithm. Errors can also be detected by extrapolation from previous results or by the acceptability of results. Violations of timing specifications can reveal errors, or the system can roll back to an error-free state when a defect is detected. The software, when used in flight control systems, must not impinge on time-critical responses. Efforts are still needed to reduce the costs of developing the fault-tolerant systems.

  13. Steps toward fault-tolerant quantum chemistry.

    SciTech Connect

    Taube, Andrew Garvin

    2010-05-01

    Developing quantum chemistry programs on the coming generation of exascale computers will be a difficult task. The programs will need to be fault-tolerant and minimize the use of global operations. This work explores the use a task-based model that uses a data-centric approach to allocate work to different processes as it applies to quantum chemistry. After introducing the key problems that appear when trying to parallelize a complicated quantum chemistry method such as coupled-cluster theory, we discuss the implications of that model as it pertains to the computational kernel of a coupled-cluster program - matrix multiplication. Also, we discuss the extensions that would required to build a full coupled-cluster program using the task-based model. Current programming models for high-performance computing are fault-intolerant and use global operations. Those properties are unsustainable as computers scale to millions of CPUs; instead one must recognize that these systems will be hierarchical in structure, prone to constant faults, and global operations will be infeasible. The FAST-OS HARE project is introducing a scale-free computing model to address these issues. This model is hierarchical and fault-tolerant by design, allows for the clean overlap of computation and communication, reducing the network load, does not require checkpointing, and avoids the complexity of many HPC runtimes. Development of an algorithm within this model requires a change in focus from imperative programming to a data-centric approach. Quantum chemistry (QC) algorithms, in particular electronic structure methods, are an ideal test bed for this computing model. These methods describe the distribution of electrons in a molecule, which determine the properties of the molecule. The computational cost of these methods is high, scaling quartically or higher in the size of the molecule, which is why QC applications are major users of HPC resources. The complexity of these algorithms means that

  14. Scalability, performance, and fault tolerance of PACS architectures

    NASA Astrophysics Data System (ADS)

    Blume, Hartwig R.; Prior, Fred W.; di Pierro, Milan C.; Goble, John C.; Lodgberg, Jonas; Kenney, Robert S.; Goeringer, Fred

    1998-07-01

    Three data-base architectures may be distinguished among Picture Archiving and Communication Systems (PACSs): (1) Configurations with logically and physically centralized data- base and file server, (2) systems with physically distributed file servers and a logically centralized data-base, and (3) installations with logically and physically distributed data- bases and file servers. A brief overview of these architectures and their scaleability, performance, and fault- tolerance is given. A PACS for an existing large university hospital is designed for the first as well as the second architecture using given image production data and workflow. We evaluate the fault-tolerance of the two architectures. By modeling the work-flow and employing queuing theory, solutions with practically realizable data transfer requirements are found for both architectures. With today's performance and cost of computers, storage, and information management technologies, the second and third architectures are preferably implemented, depending on the size of the installation. The architectures offer almost unlimited scaleability, very high fault-tolerance, and optimized workflow. We describe a modern commercial PACS that adheres to the open-systems concept and consists of software application programs that run, independent of specific computer and network components, on off-the-shelf hardware and under standard multi-platform operating systems and utilize commercial data-base management systems and network managers. The system is based on the second architecture with multiple islands of functionality, each with servers and archive modules and a physically distributed data-base. Our PACS architecture supports browser technology: Workstations use the data-base to determine the location of needed information and then, through the image browser, mount the appropriate file server for access. The architecture supports a concept similar to domain name server (DNS) directory services on the

  15. Fault Tolerant Statistical Signal Processing Algorithms for Parallel Architectures.

    DTIC Science & Technology

    2014-09-26

    AD-fi57 393 FAULT TOLERANT STATISTICAL SIGNAL PROCESSING ALGORITHMS i/i FOR PARALLEL ARCH U) JOHNS HOPKINS UNIV BALTIMORE MD DEPT OF ELECTRICAL...COVERED * ’ Fault Tolerant Statistical Signal Processing Technical A l g o r i t h m s f o r P a r a l l e l A r c h i t e c t u r e s a ._ P E R F O R M I...Identify by block number) , Fault Tolerance, Signal Processing, Parallel Architecture 0 20. ABSTRACT (Continue on reveree side It neceseary and identify by

  16. RAMP: A fault tolerant distributed microcomputer structure for aircraft navigation and control

    NASA Technical Reports Server (NTRS)

    Dunn, W. R.

    1980-01-01

    RAMP consists of distributed sets of parallel computers partioned on the basis of software and packaging constraints. To minimize hardware and software complexity, the processors operate asynchronously. It was shown that through the design of asymptotically stable control laws, data errors due to the asynchronism were minimized. It was further shown that by designing control laws with this property and making minor hardware modifications to the RAMP modules, the system became inherently tolerant to intermittent faults. A laboratory version of RAMP was constructed and is described in the paper along with the experimental results.

  17. Fault tolerance in a supercomputer through dynamic repartitioning

    DOEpatents

    Chen, Dong; Coteus, Paul W.; Gara, Alan G.; Takken, Todd E.

    2007-02-27

    A multiprocessor, parallel computer is made tolerant to hardware failures by providing extra groups of redundant standby processors and by designing the system so that these extra groups of processors can be swapped with any group which experiences a hardware failure. This swapping can be under software control, thereby permitting the entire computer to sustain a hardware failure but, after swapping in the standby processors, to still appear to software as a pristine, fully functioning system.

  18. SIFT - Design and analysis of a fault-tolerant computer for aircraft control. [Software Implemented Fault Tolerant systems

    NASA Technical Reports Server (NTRS)

    Wensley, J. H.; Lamport, L.; Goldberg, J.; Green, M. W.; Levitt, K. N.; Melliar-Smith, P. M.; Shostak, R. E.; Weinstock, C. B.

    1978-01-01

    SIFT (Software Implemented Fault Tolerance) is an ultrareliable computer for critical aircraft control applications that achieves fault tolerance by the replication of tasks among processing units. The main processing units are off-the-shelf minicomputers, with standard microcomputers serving as the interface to the I/O system. Fault isolation is achieved by using a specially designed redundant bus system to interconnect the processing units. Error detection and analysis and system reconfiguration are performed by software. Iterative tasks are redundantly executed, and the results of each iteration are voted upon before being used. Thus, any single failure in a processing unit or bus can be tolerated with triplication of tasks, and subsequent failures can be tolerated after reconfiguration. Independent execution by separate processors means that the processors need only be loosely synchronized, and a novel fault-tolerant synchronization method is described.

  19. Final Project Report. Scalable fault tolerance runtime technology for petascale computers

    SciTech Connect

    Krishnamoorthy, Sriram; Sadayappan, P

    2015-06-16

    With the massive number of components comprising the forthcoming petascale computer systems, hardware failures will be routinely encountered during execution of large-scale applications. Due to the multidisciplinary, multiresolution, and multiscale nature of scientific problems that drive the demand for high end systems, applications place increasingly differing demands on the system resources: disk, network, memory, and CPU. In addition to MPI, future applications are expected to use advanced programming models such as those developed under the DARPA HPCS program as well as existing global address space programming models such as Global Arrays, UPC, and Co-Array Fortran. While there has been a considerable amount of work in fault tolerant MPI with a number of strategies and extensions for fault tolerance proposed, virtually none of advanced models proposed for emerging petascale systems is currently fault aware. To achieve fault tolerance, development of underlying runtime and OS technologies able to scale to petascale level is needed. This project has evaluated range of runtime techniques for fault tolerance for advanced programming models.

  20. Fault tolerant testbed evaluation, phase 1

    NASA Technical Reports Server (NTRS)

    Caluori, V., Jr.; Newberry, T.

    1993-01-01

    In recent years, avionics systems development costs have become the driving factor in the development of space systems, military aircraft, and commercial aircraft. A method of reducing avionics development costs is to utilize state-of-the-art software application generator (autocode) tools and methods. The recent maturity of application generator technology has the potential to dramatically reduce development costs by eliminating software development steps that have historically introduced errors and the need for re-work. Application generator tools have been demonstrated to be an effective method for autocoding non-redundant, relatively low-rate input/output (I/O) applications on the Space Station Freedom (SSF) program; however, they have not been demonstrated for fault tolerant, high-rate I/O, flight critical environments. This contract will evaluate the use of application generators in these harsh environments. Using Boeing's quad-redundant avionics system controller as the target system, Space Shuttle Guidance, Navigation, and Control (GN&C) software will be autocoded, tested, and evaluated in the Johnson (Space Center) Avionics Engineering Laboratory (JAEL). The response of the autocoded system will be shown to match the response of the existing Shuttle General Purpose Computers (GPC's), thereby demonstrating the viability of using autocode techniques in the development of future avionics systems.

  1. Two fault tolerant toggle-hook release

    NASA Technical Reports Server (NTRS)

    Graves, Thomas Joseph (Inventor); Brown, Christopher William (Inventor)

    1991-01-01

    A coupling device is disclosed which is mechanically two fault tolerant for release. The device comprises a fastener plate and fastener body, each of which is attachable to a different one of a pair of structures to be joined. The fastener plate and body are coupled by an elongate toggle mounted at one end in a socket on the fastener plate for universal pivotal movement thereon. The other end of the toggle is received in an opening in the fastener body and adapted for limited pivotal movement therein. The toggle is adapted to be restrained by three latch hooks arranged in symmetrical equiangular spacing about the axis of the toggle, each hook being mounted on the fastener body for pivotal movement between an unlatching non-contact position with respect to the toggle and a latching position in engagement with a latching surface of the toggle. The device includes releasable lock means for locking each latch hook in its latching position whereby the toggle couples the fastener plate to the fastener body and means for releasing the lock means to unlock each said latch hook from the latch position whereby the unlocking of at least one of the latch hooks from its latching position results in the decoupling of the fastener plate from the fastener body.

  2. Fault tolerant testbed evaluation, phase 1

    NASA Astrophysics Data System (ADS)

    Caluori, V., Jr.; Newberry, T.

    1993-09-01

    In recent years, avionics systems development costs have become the driving factor in the development of space systems, military aircraft, and commercial aircraft. A method of reducing avionics development costs is to utilize state-of-the-art software application generator (autocode) tools and methods. The recent maturity of application generator technology has the potential to dramatically reduce development costs by eliminating software development steps that have historically introduced errors and the need for re-work. Application generator tools have been demonstrated to be an effective method for autocoding non-redundant, relatively low-rate input/output (I/O) applications on the Space Station Freedom (SSF) program; however, they have not been demonstrated for fault tolerant, high-rate I/O, flight critical environments. This contract will evaluate the use of application generators in these harsh environments. Using Boeing's quad-redundant avionics system controller as the target system, Space Shuttle Guidance, Navigation, and Control (GN&C) software will be autocoded, tested, and evaluated in the Johnson (Space Center) Avionics Engineering Laboratory (JAEL). The response of the autocoded system will be shown to match the response of the existing Shuttle General Purpose Computers (GPC's), thereby demonstrating the viability of using autocode techniques in the development of future avionics systems.

  3. Shared memory for a fault-tolerant computer

    NASA Technical Reports Server (NTRS)

    Gilley, G. C. (Inventor)

    1976-01-01

    A system is described for sharing a memory in a fault-tolerant computer. The memory is under the direct control and monitoring of error detecting and error diagnostic units in the fault-tolerant computer. This computer verifies that data to and from the memory is legally encoded and verifies that words read from the memory at a desired address are, in fact, actually delivered from that desired address. The means are provided for a second processor, which is independent of the direct control and monitoring of the error checking and diagnostic units of the fault-tolerant computer, and to share the memory of the fault-tolerant computer. Circuitry is included to verify that: (1) the processor has properly accessed a desired memory location in the memory; (2) a data word read-out from the memory is properly coded; and (3) no inactive memory was erroneously outputting data onto the shared memory bus.

  4. Software fault tolerance for real-time avionics systems

    NASA Technical Reports Server (NTRS)

    Anderson, T.; Knight, J. C.

    1983-01-01

    Avionics systems have very high reliability requirements and are therefore prime candidates for the inclusion of fault tolerance techniques. In order to provide tolerance to software faults, some form of state restoration is usually advocated as a means of recovery. State restoration can be very expensive for systems which utilize concurrent processes. The concurrency present in most avionics systems and the further difficulties introduced by timing constraints imply that providing tolerance for software faults may be inordinately expensive or complex. A straightforward pragmatic approach to software fault tolerance which is believed to be applicable to many real-time avionics systems is proposed. A classification system for software errors is presented together with approaches to recovery and continued service for each error type.

  5. Survey of Fault Tolerant Computer Security and Computer Safety.

    DTIC Science & Technology

    introduction to the report as a whole. Contents: Fundamental Concepts of Fault-Tolerant Computing; Survey of Device and System Testing; Computer Security in Defense Systems; and State of the Art of Safety for Computer Controlled Systems .

  6. Reliability of Fault Tolerant Control Systems. Part 1

    NASA Technical Reports Server (NTRS)

    Wu, N. Eva

    2001-01-01

    This paper reports Part I of a two part effort, that is intended to delineate the relationship between reliability and fault tolerant control in a quantitative manner. Reliability analysis of fault-tolerant control systems is performed using Markov models. Reliability properties, peculiar to fault-tolerant control systems are emphasized. As a consequence, coverage of failures through redundancy management can be severely limited. It is shown that in the early life of a syi1ein composed of highly reliable subsystems, the reliability of the overall system is affine with respect to coverage, and inadequate coverage induces dominant single point failures. The utility of some existing software tools for assessing the reliability of fault tolerant control systems is also discussed. Coverage modeling is attempted in Part II in a way that captures its dependence on the control performance and on the diagnostic resolution.

  7. Fault tolerant programmable digital attitude control electronics study

    NASA Technical Reports Server (NTRS)

    Sorensen, A. A.

    1974-01-01

    The attitude control electronics mechanization study to develop a fault tolerant autonomous concept for a three axis system is reported. Programmable digital electronics are compared to general purpose digital computers. The requirements, constraints, and tradeoffs are discussed. It is concluded that: (1) general fault tolerance can be achieved relatively economically, (2) recovery times of less than one second can be obtained, (3) the number of faulty behavior patterns must be limited, and (4) adjoined processes are the best indicators of faulty operation.

  8. Optimal Management of Redundant Control Authority for Fault Tolerance

    NASA Technical Reports Server (NTRS)

    Wu, N. Eva; Ju, Jianhong

    2000-01-01

    This paper is intended to demonstrate the feasibility of a solution to a fault tolerant control problem. It explains, through a numerical example, the design and the operation of a novel scheme for fault tolerant control. The fundamental principle of the scheme was formalized in [5] based on the notion of normalized nonspecificity. The novelty lies with the use of a reliability criterion for redundancy management, and therefore leads to a high overall system reliability.

  9. Advanced development for space robotics with emphasis on fault tolerance

    NASA Technical Reports Server (NTRS)

    Tesar, D.; Chladek, J.; Hooper, R.; Sreevijayan, D.; Kapoor, C.; Geisinger, J.; Meaney, M.; Browning, G.; Rackers, K.

    1995-01-01

    This paper describes the ongoing work in fault tolerance at the University of Texas at Austin. The paper describes the technical goals the group is striving to achieve and includes a brief description of the individual projects focusing on fault tolerance. The ultimate goal is to develop and test technology applicable to all future missions of NASA (lunar base, Mars exploration, planetary surveillance, space station, etc.).

  10. On the design of fault-tolerant robotic manipulator systems

    NASA Technical Reports Server (NTRS)

    Tesar, Delbert

    1993-01-01

    Robotic systems are finding increasing use in space applications. Many of these devices are going to be operational on board the Space Station Freedom. Fault tolerance has been deemed necessary because of the criticality of the tasks and the inaccessibility of the systems to maintenance and repair. Design for fault tolerance in manipulator systems is an area within robotics that is without precedence in the literature. In this paper, we will attempt to lay down the foundations for such a technology. Design for fault tolerance demands new and special approaches to design, often at considerable variance from established design practices. These design aspects, together with reliability evaluation and modeling tools, are presented. Mechanical architectures that employ protective redundancies at many levels and have a modular architecture are then studied in detail. Once a mechanical architecture for fault tolerance has been derived, the chronological stages of operational fault tolerance are investigated. Failure detection, isolation, and estimation methods are surveyed, and such methods for robot sensors and actuators are derived. Failure recovery methods are also presented for each of the protective layers of redundancy. Failure recovery tactics often span all of the layers of a control hierarchy. Thus, a unified framework for decision-making and control, which orchestrates both the nominal redundancy management tasks and the failure management tasks, has been derived. The well-developed field of fault-tolerant computers is studied next, and some design principles relevant to the design of fault-tolerant robot controllers are abstracted. Conclusions are drawn, and a road map for the design of fault-tolerant manipulator systems is laid out with recommendations for a 10 DOF arm with dual actuators at each joint.

  11. Fault-tolerant computation with higher-dimensional systems

    NASA Technical Reports Server (NTRS)

    Gottesman, D.

    1998-01-01

    Instead of a quantum computer where the fundamental units are 2-dimensional qubits, the author considers a quantum computer made up of d-dimensional systems. There is a straightforward generalization of the class of stabilizer codes to d-dimensional systems, and he discusses the theory of fault-tolerant computation using such codes. He proves that universal fault-tolerant computation is possible with any higher-dimensional stabilizer code for prime d.

  12. Formal Techniques for Synchronized Fault-Tolerant Systems

    NASA Technical Reports Server (NTRS)

    DiVito, Ben L.; Butler, Ricky W.

    1992-01-01

    We present the formal verification of synchronizing aspects of the Reliable Computing Platform (RCP), a fault-tolerant computing system for digital flight control applications. The RCP uses NMR-style redundancy to mask faults and internal majority voting to purge the effects of transient faults. The system design has been formally specified and verified using the EHDM verification system. Our formalization is based on an extended state machine model incorporating snapshots of local processors clocks.

  13. Experiments in fault tolerant software reliability

    NASA Technical Reports Server (NTRS)

    Mcallister, David F.; Vouk, Mladen A.

    1989-01-01

    Twenty functionally equivalent programs were built and tested in a multiversion software experiment. Following unit testing, all programs were subjected to an extensive system test. In the process sixty-one distinct faults were identified among the versions. Less than 12 percent of the faults exhibited varying degrees of positive correlation. The common-cause (or similar) faults spanned as many as 14 components. However, a majority of these faults were trivial, and easily detected by proper unit and/or system testing. Only two of the seven similar faults were difficult faults, and both were caused by specification ambiguities. One of these faults exhibited variable identical-and-wrong response span, i.e. response span which varied with the testing conditions and input data. Techniques that could have been used to avoid the faults are discussed. For example, it was determined that back-to-back testing of 2-tuples could have been used to eliminate about 90 percent of the faults. In addition, four of the seven similar faults could have been detected by using back-to-back testing of 5-tuples. It is believed that most, if not all, similar faults could have been avoided had the specifications been written using more formal notation, the unit testing phase was subject to more stringent standards and controls, and better tools for measuring the quality and adequacy of the test data (e.g. coverage) were used.

  14. Fault-tolerant wait-free shared objects

    NASA Technical Reports Server (NTRS)

    Jayanti, Prasad; Chandra, Tushar Deepak; Toueg, Sam

    1992-01-01

    A concurrent system consists of processes and shared objects. Previous research focused on the problem of tolerating process failure. We study the complementary problem of tolerating failures. We divide object failures into two broad classes: responsive and non-responsive. With responsive failures, a faulty object responds to every invocation, but responses may be incorrect. With non-responsive failures, a faulty object may also 'hang' without responding. For each class, we consider crash, and arbitrary types of failures. For each type of failure, we are seeking a universal implementation for fault-tolerant wait-free shared objects. We present (deterministic) implementations for all types of responsive failures, including arbitrary failures. In contrast, we show that even the most benign type of non-responsive failures requires the use of randomization. Of special interest is the problem of implementing fault-tolerant objects using only objects of the same type. We present such fault-tolerant self-implementations for many common object types. Graceful degradation is a desirable property of fault-tolerant implementations: the implemented object never fails more severely than the base objects it is derived from, even if all the base objects fail. For several failure models, we show whether this property can be achieved, and, if so, how. In addition to the above possibility/impossibility results, we also consider the resources complexity of fault-tolerant implementations. In many cases, we present lower bounds and give matching algorithms.

  15. Algorithm-Based Fault Tolerance Integrated with Replication

    NASA Technical Reports Server (NTRS)

    Some, Raphael; Rennels, David

    2008-01-01

    In a proposed approach to programming and utilization of commercial off-the-shelf computing equipment, a combination of algorithm-based fault tolerance (ABFT) and replication would be utilized to obtain high degrees of fault tolerance without incurring excessive costs. The basic idea of the proposed approach is to integrate ABFT with replication such that the algorithmic portions of computations would be protected by ABFT, and the logical portions by replication. ABFT is an extremely efficient, inexpensive, high-coverage technique for detecting and mitigating faults in computer systems used for algorithmic computations, but does not protect against errors in logical operations surrounding algorithms.

  16. Fault tolerance through reconfiguration in VLSI and WSI arrays

    SciTech Connect

    Negrini, R.; Sami, M.G.; Stefanelli, R. )

    1989-01-01

    This book discusses the research in fault tolerance. The authors focus in particular on reconfiguration techniques and present their results in the reconfiguration of processing arrays. Contents include: Introduction; Typical Processing Arrays; Failure Mechanisms and Fault Models; Basic Problems of Fault-Tolerance Through Array Configuration; Technologies Supporting Reconfiguration; Testing; Reconfiguration: An Introduction; The Diogenes Approach; Reconfiguration for Linear Arrays; Graph-Theoretical Approaches to Reconfiguration; Local Reconfiguration; Global Reconfiguration Techniques: Row/Column Elimination; Global Mapping: Index Mapping Reconfiguration Techniques; Reconfiguration Based on Request-Acknowledge Local Protocols; Reconfiguration of Multiple-Pipeline Structures; Some Extensions Toward Time-Redundancy; Appendix: Reliability Prediction of Arrays.

  17. Chip level simulation of fault tolerant computers

    NASA Technical Reports Server (NTRS)

    Armstrong, J. R.

    1983-01-01

    Chip level modeling techniques, functional fault simulation, simulation software development, a more efficient, high level version of GSP, and a parallel architecture for functional simulation are discussed.

  18. A Byzantine resilient processor with an encoded fault-tolerant shared memory

    NASA Technical Reports Server (NTRS)

    Butler, Bryan; Harper, Richard

    1990-01-01

    The memory requirements for ultra-reliable computers are expected to increase due to future increases in mission functionality and operating-system requirements. This increase will have a negative effect on the reliability and cost of the system. Increased memory size will also reduce the ability to reintegrate a channel after a transient fault, since the time required to reintegrate a channel in a conventional fault-tolerant processor is dominated by memory realignment time. A Byzantine Resilient Fault-Tolerant Processor with Fault-Tolerant Shared Memory (FTP/FTSM) is presented as a solution to these problems. The FTSM uses an encoded memory system, which reduces the memory requirement by one-half compared to a conventional quad-FTP design. This increases the reliability and decreases the cost of the system. The realignment problem is also addressed by the FTSM. Because any single error is corrected upon a read from the FTSM, a faulty channel's corrupted memory does not need realignment before reintegration of the faulty channel. A combination of correct-on-access and background scrubbing is proposed to prevent the accumulation of transient errors in the memory. With a hardware-implemented scrubber, the scrubbing cycle time, and therefore the memory fault latency, can be upper-bounded at a small value. This technique increases the reliability of the memory system and facilitates validation of its reliability model.

  19. Reconfigurable fault-tolerant multiprocessor system for real-time control

    SciTech Connect

    Kao, M.L.

    1986-01-01

    Real-time control applications place stringent constraints in computers controlling them since the failure of a computer could result in costly damages and even loss of human lives. Fault-tolerant computers, therefore, have been always in high demand in critical avionic and aerospace applications. However, the use of redundancy techniques to achieve fault tolerance in industrial applications has only recently become feasible due to the rapid decrease in cost and increase in performance of microprocessors. As more and more robots are being built to replace human beings in dangerous and difficult tasks, the need for a reliable computer for robotics control increases. This need, in particular, motivated the research described in this dissertation - the design and implementation of a reconfigurable fault-tolerant multiprocessor system (the FREMP system). The FREMP system consists of four processing units (PUs) and three common parallel buses. Each PU is a combination of an Intel 86/30 single board computer and a custom fault detection/masking circuit board (FDM board). A hardware/software combined scheme was devised to detect faults and correct errors. This scheme has shown to be more efficient than software voting while maintaining the flexibility of software approaches. Time-frame scheduling was adopted to schedule tasks for execution.

  20. ROBUS-2: A Fault-Tolerant Broadcast Communication System

    NASA Technical Reports Server (NTRS)

    Torres-Pomales, Wilfredo; Malekpour, Mahyar R.; Miner, Paul S.

    2005-01-01

    The Reliable Optical Bus (ROBUS) is the core communication system of the Scalable Processor-Independent Design for Enhanced Reliability (SPIDER), a general-purpose fault-tolerant integrated modular architecture currently under development at NASA Langley Research Center. The ROBUS is a time-division multiple access (TDMA) broadcast communication system with medium access control by means of time-indexed communication schedule. ROBUS-2 is a developmental version of the ROBUS providing guaranteed fault-tolerant services to the attached processing elements (PEs), in the presence of a bounded number of faults. These services include message broadcast (Byzantine Agreement), dynamic communication schedule update, clock synchronization, and distributed diagnosis (group membership). The ROBUS also features fault-tolerant startup and restart capabilities. ROBUS-2 is tolerant to internal as well as PE faults, and incorporates a dynamic self-reconfiguration capability driven by the internal diagnostic system. This version of the ROBUS is intended for laboratory experimentation and demonstrations of the capability to reintegrate failed nodes, dynamically update the communication schedule, and tolerate and recover from correlated transient faults.

  1. Fault tolerant photodiode and photogate active pixel sensors

    NASA Astrophysics Data System (ADS)

    Jung, Cory; Chapman, Glenn H.; La Haye, Michelle L.; Djaja, Sunjaya; Cheung, Desmond Y. H.; Lin, Henry; Loo, Edward; Audet, Yves R.

    2005-03-01

    As the pixel counts of digital imagers increase, the challenge of maintaining high yields and ensuring reliability over an imager"s lifetime increases. A fault tolerant active pixel sensor (APS) has been designed to meet this need by splitting an APS in half and operating both halves in parallel. The fault tolerant APS will perform normally in the no defect case and will produce approximately half the output for single defects. Thus, the entire signal can be recovered by multiplying the output by two. Since pixels containing multiple defects are rare, this design can correct for most defects allowing for higher production yields. Fault tolerant photodiode and photogate APS" were fabricated in 0.18-micron technology. Testing showed that the photodiode APS could correct for optically induced and electrically induced faults, within experimental error. The photogate APS was only tested for optically induced defects and also corrects for defects within experimental error. Further testing showed that the sensitivity of fault tolerant pixels was approximately 2-3 times more sensitive than the normal pixels. HSpice simulations of the fault tolerant APS circuit did not show increased sensitivity, however an equivalent normal APS circuit with twice width readout and row transistors was 1.90 times more sensitive than a normal pixel.

  2. Hardware

    NASA Technical Reports Server (NTRS)

    1999-01-01

    The full complement of EDOMP investigations called for a broad spectrum of flight hardware ranging from commercial items, modified for spaceflight, to custom designed hardware made to meet the unique requirements of testing in the space environment. In addition, baseline data collection before and after spaceflight required numerous items of ground-based hardware. Two basic categories of ground-based hardware were used in EDOMP testing before and after flight: (1) hardware used for medical baseline testing and analysis, and (2) flight-like hardware used both for astronaut training and medical testing. To ensure post-landing data collection, hardware was required at both the Kennedy Space Center (KSC) and the Dryden Flight Research Center (DFRC) landing sites. Items that were very large or sensitive to the rigors of shipping were housed permanently at the landing site test facilities. Therefore, multiple sets of hardware were required to adequately support the prime and backup landing sites plus the Johnson Space Center (JSC) laboratories. Development of flight hardware was a major element of the EDOMP. The challenges included obtaining or developing equipment that met the following criteria: (1) compact (small size and light weight), (2) battery-operated or requiring minimal spacecraft power, (3) sturdy enough to survive the rigors of spaceflight, (4) quiet enough to pass acoustics limitations, (5) shielded and filtered adequately to assure electromagnetic compatibility with spacecraft systems, (6) user-friendly in a microgravity environment, and (7) accurate and efficient operation to meet medical investigative requirements.

  3. Fault-tolerant building-block computer study

    NASA Technical Reports Server (NTRS)

    Rennels, D. A.

    1978-01-01

    Ultra-reliable core computers are required for improving the reliability of complex military systems. Such computers can provide reliable fault diagnosis, failure circumvention, and, in some cases serve as an automated repairman for their host systems. A small set of building-block circuits which can be implemented as single very large integration devices, and which can be used with off-the-shelf microprocessors and memories to build self checking computer modules (SCCM) is described. Each SCCM is a microcomputer which is capable of detecting its own faults during normal operation and is described to communicate with other identical modules over one or more Mil Standard 1553A buses. Several SCCMs can be connected into a network with backup spares to provide fault-tolerant operation, i.e. automated recovery from faults. Alternative fault-tolerant SCCM configurations are discussed along with the cost and reliability associated with their implementation.

  4. Reconfigurable tree architectures using subtree oriented fault tolerance

    NASA Technical Reports Server (NTRS)

    Lowrie, Matthew B.

    1987-01-01

    An approach to the design of reconfigurable tree architecture is presented in which spare processors are allocated at the leaves. The approach is unique in that spares are associated with subtrees and sharing of spares between these subtrees can occur. The Subtree Oriented Fault Tolerance (SOFT) approach is more reliable than previous approaches capable of tolerating link and switch failures for both single chip and multichip tree implementations while reducing redundancy in terms of both spare processors and links. VLSI layout is 0(n) for binary trees and is directly extensible to N-ary trees and fault tolerance through performance degradation.

  5. Runtime Speculative Software-Only Fault Tolerance

    DTIC Science & Technology

    2012-06-01

    5.6.2 Memory consumption . . . . . . . . . . . . . . . . . . . . . . . . 61 5.6.3 Power consumption...Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . 68 6.2.2 Physical Memory Usage . . . . . . . . . . . . . . . . . . . . . . . 69 6.2.3 Power ...overhead for RSFT with and without fault recovery. . . . 70 6.5 Physical memory overhead for RSFT with and without fault recovery. . . . 72 6.6 Power

  6. Block QCA Fault-Tolerant Logic Gates

    NASA Technical Reports Server (NTRS)

    Firjany, Amir; Toomarian, Nikzad; Modarres, Katayoon

    2003-01-01

    Suitably patterned arrays (blocks) of quantum-dot cellular automata (QCA) have been proposed as fault-tolerant universal logic gates. These block QCA gates could be used to realize the potential of QCA for further miniaturization, reduction of power consumption, increase in switching speed, and increased degree of integration of very-large-scale integrated (VLSI) electronic circuits. The limitations of conventional VLSI circuitry, the basic principle of operation of QCA, and the potential advantages of QCA-based VLSI circuitry were described in several NASA Tech Briefs articles, namely Implementing Permutation Matrices by Use of Quantum Dots (NPO-20801), Vol. 25, No. 10 (October 2001), page 42; Compact Interconnection Networks Based on Quantum Dots (NPO-20855) Vol. 27, No. 1 (January 2003), page 32; Bit-Serial Adder Based on Quantum Dots (NPO-20869), Vol. 27, No. 1 (January 2003), page 35; and Hybrid VLSI/QCA Architecture for Computing FFTs (NPO-20923), which follows this article. To recapitulate the principle of operation (greatly oversimplified because of the limitation on space available for this article): A quantum-dot cellular automata contains four quantum dots positioned at or between the corners of a square cell. The cell contains two extra mobile electrons that can tunnel (in the quantummechanical sense) between neighboring dots within the cell. The Coulomb repulsion between the two electrons tends to make them occupy antipodal dots in the cell. For an isolated cell, there are two energetically equivalent arrangements (denoted polarization states) of the extra electrons. The cell polarization is used to encode binary information. Because the polarization of a nonisolated cell depends on Coulomb-repulsion interactions with neighboring cells, universal logic gates and binary wires could be constructed, in principle, by arraying QCA of suitable design in suitable patterns. Heretofore, researchers have recognized two major obstacles to realization of QCA

  7. Fault tolerant hypercube computer system architecture

    NASA Technical Reports Server (NTRS)

    Madan, Herb S. (Inventor); Chow, Edward (Inventor)

    1989-01-01

    A fault-tolerant multiprocessor computer system of the hypercube type comprising a hierarchy of computers of like kind which can be functionally substituted for one another as necessary is disclosed. Communication between the working nodes is via one communications network while communications between the working nodes and watch dog nodes and load balancing nodes higher in the structure is via another communications network separate from the first. A typical branch of the hierarchy reporting to a master node or host computer comprises, a plurality of first computing nodes; a first network of message conducting paths for interconnecting the first computing nodes as a hypercube. The first network provides a path for message transfer between the first computing nodes; a first watch dog node; and a second network of message connecting paths for connecting the first computing nodes to the first watch dog node independent from the first network, the second network provides an independent path for test message and reconfiguration affecting transfers between the first computing nodes and the first switch watch dog node. There is additionally, a plurality of second computing nodes; a third network of message conducting paths for interconnecting the second computing nodes as a hypercube. The third network provides a path for message transfer between the second computing nodes; a fourth network of message conducting paths for connecting the second computing nodes to the first watch dog node independent from the third network. The fourth network provides an independent path for test message and reconfiguration affecting transfers between the second computing nodes and the first watch dog node; and a first multiplexer disposed between the first watch dog node and the second and fourth networks for allowing the first watch dog node to selectively communicate with individual ones of the computing nodes through the second and fourth networks; as well as, a second watch dog node

  8. Fault-tolerant control of heavy-haul trains

    NASA Astrophysics Data System (ADS)

    Zhuan, Xiangtao; Xia, Xiaohua

    2010-06-01

    The fault-tolerant control (FTC) of heavy-haul trains is discussed on the basis of the speed regulation proposed in previous works. The fault modes of trains are assumed and the corresponding fault detection and isolation (FDI) are studied. The FDI of sensor faults is based on a geometric approach for residual generators. The FDI of a braking system is based on the observation of the steady-state speed. From the difference of the steady-state speeds between the fault system and the faultless system, one can get fault information. Simulation tests were conducted on the suitability of the FDIs and the redesigned speed regulators. It is shown that the proposed FTC does not explicitly worsen the performance of the speed regulator in the case of a faultless system, while it obviously improves the performance of the speed regulator in the case of a faulty system.

  9. Distributed Fault-Tolerant Control of Networked Uncertain Euler-Lagrange Systems Under Actuator Faults.

    PubMed

    Chen, Gang; Song, Yongduan; Lewis, Frank L

    2016-05-03

    This paper investigates the distributed fault-tolerant control problem of networked Euler-Lagrange systems with actuator and communication link faults. An adaptive fault-tolerant cooperative control scheme is proposed to achieve the coordinated tracking control of networked uncertain Lagrange systems on a general directed communication topology, which contains a spanning tree with the root node being the active target system. The proposed algorithm is capable of compensating for the actuator bias fault, the partial loss of effectiveness actuation fault, the communication link fault, the model uncertainty, and the external disturbance simultaneously. The control scheme does not use any fault detection and isolation mechanism to detect, separate, and identify the actuator faults online, which largely reduces the online computation and expedites the responsiveness of the controller. To validate the effectiveness of the proposed method, a test-bed of multiple robot-arm cooperative control system is developed for real-time verification. Experiments on the networked robot-arms are conduced and the results confirm the benefits and the effectiveness of the proposed distributed fault-tolerant control algorithms.

  10. Fault tolerant, reliable and scalable scientific ballooning control software

    NASA Astrophysics Data System (ADS)

    Stewart, Michael F.; Ellison, Steven B.; Isbert, Joachim; Granger, Doug; Guzik, T. Gregory; Wefel, John P.

    The Universal Balloon Control Software package (UBCS) was first designed and developed for the ATIC experiment in 1997 and has evolved over the years into a highly reliable and adaptable control system. The system has logged thousands of hours of operation time on ATIC with few reboots and has been adapted for the HASP balloon payload which has had two successful flights in 2006 and 2007. The goal was to develop a UBCS that was fault tolerant and auto-recoverable while at the same time extremely reliable and scalable. In order to meet these goals, we designed a modular software system where each process was able to run in parallel with other processes on the same or different CPUs. These modular processes needed to be relatively independent; so that one process didn't rely on another in order to function. We chose QNX 4.25 as the operating system because of its multi-tasking abilities and the level of abstraction offered in communication between processes. Another key component in the UBCS, called the Buffer Process Group (BPG), was developed to de-couple processes from one another allowing each to operate independently. The BPG is a client/server process data port with a standardized interface allowing any given server to load records for access by an independent client at any given time. The BPG is capable of handling many data servers and clients simultaneously. Examples of data servers are the data acquisition process and housekeeping processes and examples of data clients are the archive process, the down link telemetry processes and the ground display processes. Together, the BPG process and the QNX 4.25 OS allow the UBCS to meet all of its design goals. In particular they allow the system to be highly fault tolerant and recoverable. A monitoring process is able to restart failed processes and reboot the computers on which they reside, if necessary. This allows the UBCS to recover from software errors or bugs as well as hardware glitches such as temporary

  11. SABRE: a bio-inspired fault-tolerant electronic architecture.

    PubMed

    Bremner, P; Liu, Y; Samie, M; Dragffy, G; Pipe, A G; Tempesti, G; Timmis, J; Tyrrell, A M

    2013-03-01

    As electronic devices become increasingly complex, ensuring their reliable, fault-free operation is becoming correspondingly more challenging. It can be observed that, in spite of their complexity, biological systems are highly reliable and fault tolerant. Hence, we are motivated to take inspiration for biological systems in the design of electronic ones. In SABRE (self-healing cellular architectures for biologically inspired highly reliable electronic systems), we have designed a bio-inspired fault-tolerant hierarchical architecture for this purpose. As in biology, the foundation for the whole system is cellular in nature, with each cell able to detect faults in its operation and trigger intra-cellular or extra-cellular repair as required. At the next level in the hierarchy, arrays of cells are configured and controlled as function units in a transport triggered architecture (TTA), which is able to perform partial-dynamic reconfiguration to rectify problems that cannot be solved at the cellular level. Each TTA is, in turn, part of a larger multi-processor system which employs coarser grain reconfiguration to tolerate faults that cause a processor to fail. In this paper, we describe the details of operation of each layer of the SABRE hierarchy, and how these layers interact to provide a high systemic level of fault tolerance.

  12. Evaluation of reliability modeling tools for advanced fault tolerant systems

    NASA Technical Reports Server (NTRS)

    Baker, Robert; Scheper, Charlotte

    1986-01-01

    The Computer Aided Reliability Estimation (CARE III) and Automated Reliability Interactice Estimation System (ARIES 82) reliability tools for application to advanced fault tolerance aerospace systems were evaluated. To determine reliability modeling requirements, the evaluation focused on the Draper Laboratories' Advanced Information Processing System (AIPS) architecture as an example architecture for fault tolerance aerospace systems. Advantages and limitations were identified for each reliability evaluation tool. The CARE III program was designed primarily for analyzing ultrareliable flight control systems. The ARIES 82 program's primary use was to support university research and teaching. Both CARE III and ARIES 82 were not suited for determining the reliability of complex nodal networks of the type used to interconnect processing sites in the AIPS architecture. It was concluded that ARIES was not suitable for modeling advanced fault tolerant systems. It was further concluded that subject to some limitations (the difficulty in modeling systems with unpowered spare modules, systems where equipment maintenance must be considered, systems where failure depends on the sequence in which faults occurred, and systems where multiple faults greater than a double near coincident faults must be considered), CARE III is best suited for evaluating the reliability of advanced tolerant systems for air transport.

  13. Fault-tolerant wait-free shared objects

    NASA Technical Reports Server (NTRS)

    Jayanti, Prasad; Chandra, Tushar D.; Toueg, Sam

    1992-01-01

    A concurrent system consists of processes communicating via shared objects, such as shared variables, queues, etc. The concept of wait-freedom was introduced to cope with process failures: each process that accesses a wait-free object is guaranteed to get a response even if all the other processes crash. However, if a wait-free object 'crashes,' all the processes that access that object are prevented from making progress. In this paper, we introduce the concept of fault-tolerant wait-free objects, and study the problem of implementing them. We give a universal method to construct fault-tolerant wait-free objects, for all types of 'responsive' failures (including one in which faulty objects may 'lie'). In sharp contrast, we prove that many common and interesting types (such as queues, sets, and test&set) have no fault-tolerant wait-free implementations even under the most benign of the 'non-responsive' types of failure. We also introduce several concepts and techniques that are central to the design of fault-tolerant concurrent systems: the concepts of self-implementation and graceful degradation, and techniques to automatically increase the fault-tolerance of implementations. We prove matching lower bounds on the resource complexity of most of our algorithms.

  14. Design of a dual fault tolerant space shuttle payload deployment actuator

    NASA Technical Reports Server (NTRS)

    Teske, D. R.

    1985-01-01

    As the Shuttle Transportation System (STS) becomes operational, the number and variety of payloads will increase. The need to deploy these cargo elements will require a variety of unique actuator designs, all of which will have to conform with STS safety policy. For those missions where payload operations extend beyond the payload bay door envelope, this policy deems the prevention of door closure as a catastrophic hazard. As such, it must be controlled by independent, primary and backup methods. The combination of these methods must be two fault tolerant. The design of such an actuator is described. The device consists of a single linear ballscrew with two ballnuts, each bellnut forming an independent actuator using the common ballscrew. The design requirements, concept development, hardware configuration, and fault tolerance rationale are highlighted.

  15. Problems related to the integration of fault tolerant aircraft electronic systems

    NASA Technical Reports Server (NTRS)

    Bannister, J. A.; Adlakha, V.; Triyedi, K.; Alspaugh, T. A., Jr.

    1982-01-01

    Problems related to the design of the hardware for an integrated aircraft electronic system are considered. Taxonomies of concurrent systems are reviewed and a new taxonomy is proposed. An informal methodology intended to identify feasible regions of the taxonomic design space is described. Specific tools are recommended for use in the methodology. Based on the methodology, a preliminary strawman integrated fault tolerant aircraft electronic system is proposed. Next, problems related to the programming and control of inegrated aircraft electronic systems are discussed. Issues of system resource management, including the scheduling and allocation of real time periodic tasks in a multiprocessor environment, are treated in detail. The role of software design in integrated fault tolerant aircraft electronic systems is discussed. Conclusions and recommendations for further work are included.

  16. Locating hardware faults in a data communications network of a parallel computer

    DOEpatents

    Archer, Charles J.; Megerian, Mark G.; Ratterman, Joseph D.; Smith, Brian E.

    2010-01-12

    Hardware faults location in a data communications network of a parallel computer. Such a parallel computer includes a plurality of compute nodes and a data communications network that couples the compute nodes for data communications and organizes the compute node as a tree. Locating hardware faults includes identifying a next compute node as a parent node and a root of a parent test tree, identifying for each child compute node of the parent node a child test tree having the child compute node as root, running a same test suite on the parent test tree and each child test tree, and identifying the parent compute node as having a defective link connected from the parent compute node to a child compute node if the test suite fails on the parent test tree and succeeds on all the child test trees.

  17. An observer based approach for achieving fault diagnosis and fault tolerant control of systems modeled as hybrid Petri nets.

    PubMed

    Renganathan, K; Bhaskar, VidhyaCharan

    2011-07-01

    In this paper, we propose an approach for achieving detection and identification of faults, and provide fault tolerant control for systems that are modeled using timed hybrid Petri nets. For this purpose, an observer based technique is adopted which is useful in detection of faults, such as sensor faults, actuator faults, signal conditioning faults, etc. The concepts of estimation, reachability and diagnosability have been considered for analyzing faulty behaviors, and based on the detected faults, different schemes are proposed for achieving fault tolerant control using optimization techniques. These concepts are applied to a typical three tank system and numerical results are obtained.

  18. A benchmark for fault tolerant flight control evaluation

    NASA Astrophysics Data System (ADS)

    Smaili, H.; Breeman, J.; Lombaerts, T.; Stroosma, O.

    2013-12-01

    A large transport aircraft simulation benchmark (REconfigurable COntrol for Vehicle Emergency Return - RECOVER) has been developed within the GARTEUR (Group for Aeronautical Research and Technology in Europe) Flight Mechanics Action Group 16 (FM-AG(16)) on Fault Tolerant Control (2004 2008) for the integrated evaluation of fault detection and identification (FDI) and reconfigurable flight control strategies. The benchmark includes a suitable set of assessment criteria and failure cases, based on reconstructed accident scenarios, to assess the potential of new adaptive control strategies to improve aircraft survivability. The application of reconstruction and modeling techniques, based on accident flight data, has resulted in high-fidelity nonlinear aircraft and fault models to evaluate new Fault Tolerant Flight Control (FTFC) concepts and their real-time performance to accommodate in-flight failures.

  19. Measurement and analysis of operating system fault tolerance

    NASA Technical Reports Server (NTRS)

    Lee, I.; Tang, D.; Iyer, R. K.

    1992-01-01

    This paper demonstrates a methodology to model and evaluate the fault tolerance characteristics of operational software. The methodology is illustrated through case studies on three different operating systems: the Tandem GUARDIAN fault-tolerant system, the VAX/VMS distributed system, and the IBM/MVS system. Measurements are made on these systems for substantial periods to collect software error and recovery data. In addition to investigating basic dependability characteristics such as major software problems and error distributions, we develop two levels of models to describe error and recovery processes inside an operating system and on multiple instances of an operating system running in a distributed environment. Based on the models, reward analysis is conducted to evaluate the loss of service due to software errors and the effect of the fault-tolerance techniques implemented in the systems. Software error correlation in multicomputer systems is also investigated.

  20. Hybrid Fault Tolerant Peer to Peer Video Streaming Architecture

    NASA Astrophysics Data System (ADS)

    Öztoprak, Kasim; Akar, Gözde Bozdagi

    In this paper, we propose a fault tolerant hybrid p2p-CDN video streaming arhitecture to overcome the problems caused by peer behavior in peer-to-peer (P2P) video streaming systems. Although there are several studies modeling and analytically investigating peer behaviors in P2P video streaming systems, they do not come up with a solution to guarantee the required Quality of the Services (QoS). Therefore, in this study a hybrid geographical location-time and interest based clustering algorithm is proposed to improve the success ratio and reduce the delivery time of required content. A Hybrid Fault Tolerant Video Streaming System (HFTS) over P2P networks conforming the required QoS and Fault Tolerance is also offered. The simulations indicate that the required QoS can be achieved in streaming video applications using the proposed hybrid approach.

  1. Multiple Embedded Processors for Fault-Tolerant Computing

    NASA Technical Reports Server (NTRS)

    Bolotin, Gary; Watson, Robert; Katanyoutanant, Sunant; Burke, Gary; Wang, Mandy

    2005-01-01

    A fault-tolerant computer architecture has been conceived in an effort to reduce vulnerability to single-event upsets (spurious bit flips caused by impingement of energetic ionizing particles or photons). As in some prior fault-tolerant architectures, the redundancy needed for fault tolerance is obtained by use of multiple processors in one computer. Unlike prior architectures, the multiple processors are embedded in a single field-programmable gate array (FPGA). What makes this new approach practical is the recent commercial availability of FPGAs that are capable of having multiple embedded processors. A working prototype (see figure) consists of two embedded IBM PowerPC 405 processor cores and a comparator built on a Xilinx Virtex-II Pro FPGA. This relatively simple instantiation of the architecture implements an error-detection scheme. A planned future version, incorporating four processors and two comparators, would correct some errors in addition to detecting them.

  2. Design methods for fault-tolerant finite state machines

    NASA Technical Reports Server (NTRS)

    Niranjan, Shailesh; Frenzel, James F.

    1993-01-01

    VLSI electronic circuits are increasingly being used in space-borne applications where high levels of radiation may induce faults, known as single event upsets. In this paper we review the classical methods of designing fault tolerant digital systems, with an emphasis on those methods which are particularly suitable for VLSI-implementation of finite state machines. Four methods are presented and will be compared in terms of design complexity, circuit size, and estimated circuit delay.

  3. Fault Tolerance Middleware for a Multi-Core System

    NASA Technical Reports Server (NTRS)

    Some, Raphael R.; Springer, Paul L.; Zima, Hans P.; James, Mark; Wagner, David A.

    2012-01-01

    Fault Tolerance Middleware (FTM) provides a framework to run on a dedicated core of a multi-core system and handles detection of single-event upsets (SEUs), and the responses to those SEUs, occurring in an application running on multiple cores of the processor. This software was written expressly for a multi-core system and can support different kinds of fault strategies, such as introspection, algorithm-based fault tolerance (ABFT), and triple modular redundancy (TMR). It focuses on providing fault tolerance for the application code, and represents the first step in a plan to eventually include fault tolerance in message passing and the FTM itself. In the multi-core system, the FTM resides on a single, dedicated core, separate from the cores used by the application. This is done in order to isolate the FTM from application faults and to allow it to swap out any application core for a substitute. The structure of the FTM consists of an interface to a fault tolerant strategy module, a responder module, a fault manager module, an error factory, and an error mapper that determines the severity of the error. In the present reference implementation, the only fault tolerant strategy implemented is introspection. The introspection code waits for an application node to send an error notification to it. It then uses the error factory to create an error object, and at this time, a severity level is assigned to the error. The introspection code uses its built-in knowledge base to generate a recommended response to the error. Responses might include ignoring the error, logging it, rolling back the application to a previously saved checkpoint, swapping in a new node to replace a bad one, or restarting the application. The original error and recommended response are passed to the top-level fault manager module, which invokes the response. The responder module also notifies the introspection module of the generated response. This provides additional information to the

  4. Fault-tolerant three-level inverter

    DOEpatents

    Edwards, John; Xu, Longya; Bhargava, Brij B.

    2006-12-05

    A method for driving a neutral point clamped three-level inverter is provided. In one exemplary embodiment, DC current is received at a neutral point-clamped three-level inverter. The inverter has a plurality of nodes including first, second and third output nodes. The inverter also has a plurality of switches. Faults are checked for in the inverter and predetermined switches are automatically activated responsive to a detected fault such that three-phase electrical power is provided at the output nodes.

  5. Fault tolerant kinematic control of hyper-redundant manipulators

    NASA Technical Reports Server (NTRS)

    Bedrossian, Nazareth S.

    1994-01-01

    Hyper-redundant spatial manipulators possess fault-tolerant features because of their redundant structure. The kinematic control of these manipulators is investigated with special emphasis on fault-tolerant control. The manipulator tasks are viewed in the end-effector space while actuator commands are in joint-space, requiring an inverse kinematic algorithm to generate joint-angle commands from the end-effector ones. The rate-inverse kinematic control algorithm presented in this paper utilizes the pseudoinverse to accommodate for joint motor failures. An optimal scale factor for the robust inverse is derived.

  6. Single-Shot Fault-Tolerant Quantum Error Correction

    NASA Astrophysics Data System (ADS)

    Bombín, Héctor

    2015-07-01

    Conventional quantum error correcting codes require multiple rounds of measurements to detect errors with enough confidence in fault-tolerant scenarios. Here, I show that for suitable topological codes, a single round of local measurements is enough. This feature is generic and is related to self-correction and confinement phenomena in the corresponding quantum Hamiltonian model. Three-dimensional gauge color codes exhibit this single-shot feature, which also applies to initialization and gauge fixing. Assuming the time for efficient classical computations to be negligible, this yields a topological fault-tolerant quantum computing scheme where all elementary logical operations can be performed in constant time.

  7. A performance assessment of a byzantine resilient fault-tolerant computer

    NASA Technical Reports Server (NTRS)

    Young, Steven D.; Elks, Carl R.; Graham, R. L.

    1989-01-01

    This report presents the results of a performance analysis of a quad-redundant Fault-Tolerant Processor (FTP). The FTP is a computing system specifically designed for applications where very high reliability is required. Examples of such applications are flight control systems, nuclear power systems, and spacecraft control systems. The FTP performance was analyzed in a hierarchical manner encompassing the hardware, the operating system, and the application. At the hardware level, the hardware organization and design was assessed in relation to system throughput and response. Analysis at the operating system level revealed that the scheduler took only 3.2 percent of each 40ms frame, while the redundancy management software took 10.4 percent. The application level performance was analyzed via a synthetic workload and a representative flight control model. The estimated throughput for this application was found to be 317.6 KIPS if not exercising the voter. Exercising the voter to ensure fault tolerance will diminish this number linearly as the number of votes is increased. This performance analysis method was proven effective by uncovering undesirable behavior and anomalies in the FTP system.

  8. Fault Model Development for Fault Tolerant VLSI Design

    DTIC Science & Technology

    1988-05-01

    it minimizes the number of bridging 5 % -W V,. Pi’%A faults but because of the ease with which the layout principles can be automated . This implies a...diffusion over a significant portion. Thus, it turns out .. 4 that the layout chosen on the basis of easy automation is also efficient in terms of...34, Proo. 24th ACM/IEEE . Design Automation Conference, June 1987, pp 244-250. 106 ii * . .A 16. [Reddy,19861 Sudhakar M. Reddy and Madhukar M. Reddy

  9. Software reliability through fault-avoidance and fault-tolerance

    NASA Technical Reports Server (NTRS)

    Vouk, Mladen A.; Mcallister, David F.

    1990-01-01

    The use of back-to-back, or comparison, testing for regression test or porting is examined. The efficiency and the cost of the strategy is compared with manual and table-driven single version testing. Some of the key parameters that influence the efficiency and the cost of the approach are the failure identification effort during single version program testing, the extent of implemented changes, the nature of the regression test data (e.g., random), and the nature of the inter-version failure correlation and fault-masking. The advantages and disadvantages of the technique are discussed, together with some suggestions concerning its practical use.

  10. Data-based fault-tolerant control for affine nonlinear systems with actuator faults.

    PubMed

    Xie, Chun-Hua; Yang, Guang-Hong

    2016-09-01

    This paper investigates the fault-tolerant control (FTC) problem for unknown nonlinear systems with actuator faults including stuck, outage, bias and loss of effectiveness. The upper bounds of stuck faults, bias faults and loss of effectiveness faults are unknown. A new data-based FTC scheme is proposed. It consists of the online estimations of the bounds and a state-dependent function. The estimations are adjusted online to compensate automatically the actuator faults. The state-dependent function solved by using real system data helps to stabilize the system. Furthermore, all signals in the resulting closed-loop system are uniformly bounded and the states converge asymptotically to zero. Compared with the existing results, the proposed approach is data-based. Finally, two simulation examples are provided to show the effectiveness of the proposed approach.

  11. Multi-version software reliability through fault-avoidance and fault-tolerance

    NASA Technical Reports Server (NTRS)

    Vouk, Mladen A.; Mcallister, David F.

    1989-01-01

    A number of experimental and theoretical issues associated with the practical use of multi-version software to provide run-time tolerance to software faults were investigated. A specialized tool was developed and evaluated for measuring testing coverage for a variety of metrics. The tool was used to collect information on the relationships between software faults and coverage provided by the testing process as measured by different metrics (including data flow metrics). Considerable correlation was found between coverage provided by some higher metrics and the elimination of faults in the code. Back-to-back testing was continued as an efficient mechanism for removal of un-correlated faults, and common-cause faults of variable span. Software reliability estimation methods was also continued based on non-random sampling, and the relationship between software reliability and code coverage provided through testing. New fault tolerance models were formulated. Simulation studies of the Acceptance Voting and Multi-stage Voting algorithms were finished and it was found that these two schemes for software fault tolerance are superior in many respects to some commonly used schemes. Particularly encouraging are the safety properties of the Acceptance testing scheme.

  12. Reliability model of fault-tolerant data processing system with primary and backup nodes

    NASA Astrophysics Data System (ADS)

    Rahman, P. A.; Bobkova, E. Yu

    2016-04-01

    This paper deals with the fault-tolerant data processing systems, which are widely used in modern world of information technologies and have acceptable overhead expenses in hardware implementation. A simplified reliability model for duplex systems and the offered by authors advanced model for data processing systems with primary and backup nodes based on a three-state model of recoverable elements, which takes into consideration different failure rates of passive and active nodes and finite time of node activation, are also given. A calculation formula for the availability factor of the dual-node data processing system with primary and backup nodes and calculation examples are also provided.

  13. Trojan horse attack free fault-tolerant quantum key distribution protocols

    NASA Astrophysics Data System (ADS)

    Yang, Chun-Wei; Hwang, Tzonelih

    2013-11-01

    This work proposes two quantum key distribution (QKD) protocols—each of which is robust under one kind of collective noises—collective-dephasing noise and collective-rotation noise. Due to the use of a new coding function which produces error-robust codewords allowing one-time transmission of quanta, the proposed QKD schemes are fault-tolerant and congenitally free from Trojan horse attacks without having to use any extra hardware. Moreover, by adopting two Bell state measurements instead of a 4-GHZ state joint measurement for decoding, the proposed protocols are practical in combating collective noises.

  14. Universal Fault-Tolerant Gates on Concatenated Stabilizer Codes

    NASA Astrophysics Data System (ADS)

    Yoder, Theodore J.; Takagi, Ryuji; Chuang, Isaac L.

    2016-07-01

    It is an oft-cited fact that no quantum code can support a set of fault-tolerant logical gates that is both universal and transversal. This no-go theorem is generally responsible for the interest in alternative universality constructions including magic state distillation. Widely overlooked, however, is the possibility of nontransversal, yet still fault-tolerant, gates that work directly on small quantum codes. Here, we demonstrate precisely the existence of such gates. In particular, we show how the limits of nontransversality can be overcome by performing rounds of intermediate error correction to create logical gates on stabilizer codes that use no ancillas other than those required for syndrome measurement. Moreover, the logical gates we construct, the most prominent examples being Toffoli and controlled-controlled-Z , often complete universal gate sets on their codes. We detail such universal constructions for the smallest quantum codes, the 5-qubit and 7-qubit codes, and then proceed to generalize the approach. One remarkable result of this generalization is that any nondegenerate stabilizer code with a complete set of fault-tolerant single-qubit Clifford gates has a universal set of fault-tolerant gates. Another is the interaction of logical qubits across different stabilizer codes, which, for instance, implies a broadly applicable method of code switching.

  15. Study of fault tolerant software technology for dynamic systems

    NASA Technical Reports Server (NTRS)

    Caglayan, A. K.; Zacharias, G. L.

    1985-01-01

    The major aim of this study is to investigate the feasibility of using systems-based failure detection isolation and compensation (FDIC) techniques in building fault-tolerant software and extending them, whenever possible, to the domain of software fault tolerance. First, it is shown that systems-based FDIC methods can be extended to develop software error detection techniques by using system models for software modules. In particular, it is demonstrated that systems-based FDIC techniques can yield consistency checks that are easier to implement than acceptance tests based on software specifications. Next, it is shown that systems-based failure compensation techniques can be generalized to the domain of software fault tolerance in developing software error recovery procedures. Finally, the feasibility of using fault-tolerant software in flight software is investigated. In particular, possible system and version instabilities, and functional performance degradation that may occur in N-Version programming applications to flight software are illustrated. Finally, a comparative analysis of N-Version and recovery block techniques in the context of generic blocks in flight software is presented.

  16. Electronic Power Switch for Fault-Tolerant Networks

    NASA Technical Reports Server (NTRS)

    Volp, J.

    1987-01-01

    Power field-effect transistors reduce energy waste and simplify interconnections. Current switch containing power field-effect transistor (PFET) placed in series with each load in fault-tolerant power-distribution system. If system includes several loads and supplies, switches placed in series with adjacent loads and supplies. System of switches protects against overloads and losses of individual power sources.

  17. Fenix, A Fault Tolerant Programming Framework for MPI Applications

    SciTech Connect

    Gamel, Marc; Teranihi, Keita; Valenzuela, Eric; Heroux, Michael; Parashar, Manish

    2016-10-05

    Fenix provides APIs to allow the users to add fault tolerance capability to MPI-based parallel programs in a transparent manner. Fenix-enabled programs can run through process failures during program execution using a pool of spare processes accommodated by Fenix.

  18. Microprocessor-based fault-tolerant nuclear turbine governor

    SciTech Connect

    Tone, Y.; Nakamura, H.; Yokota, Y.

    1986-02-01

    A new microprocessor-based fault-tolerant nuclear turbine governor has been developed. Hierarchically distributed configuration and asynchronous triplicated architecture with middle value voting logic maximizes the plant availability. Problem-oriented language is provided for design ease and program maintainability. The turbine governor with these features is described with test results.

  19. cost and benefits optimization model for fault-tolerant aircraft electronic systems

    NASA Technical Reports Server (NTRS)

    1983-01-01

    The factors involved in economic assessment of fault tolerant systems (FTS) and fault tolerant flight control systems (FTFCS) are discussed. Algorithms for optimization and economic analysis of FTFCS are documented.

  20. Adaptive fuzzy fault-tolerant output feedback control of uncertain nonlinear systems with actuator faults

    NASA Astrophysics Data System (ADS)

    Huo, Baoyu; Tong, Shaocheng; Li, Yongming

    2013-12-01

    This article develops an adaptive fuzzy control method for accommodating actuator faults in a class of unknown nonlinear systems with unmeasured states. The considered faults are modelled as both loss of effectiveness and lock-in-place (stuck at unknown place). With the help of fuzzy logic systems to approximate the unknown nonlinear functions, a fuzzy adaptive observer is developed for estimating the unmeasured states. Combining the backstepping technique with the nonlinear tolerant-fault control theory, a novel adaptive fuzzy faults-tolerant control approach is constructed. It is proved that the proposed control approach can guarantee that all the signals of the resulting closed-loop system are bounded and the tracking error between the system output and the reference signal converges to a small neighbourhood of zero by appropriate choice of the design parameters. Simulation results are provided to show the effectiveness of the control approach.

  1. Fault tolerance control for proton exchange membrane fuel cell systems

    NASA Astrophysics Data System (ADS)

    Wu, Xiaojuan; Zhou, Boyang

    2016-08-01

    Fault diagnosis and controller design are two important aspects to improve proton exchange membrane fuel cell (PEMFC) system durability. However, the two tasks are often separately performed. For example, many pressure and voltage controllers have been successfully built. However, these controllers are designed based on the normal operation of PEMFC. When PEMFC faces problems such as flooding or membrane drying, a controller with a specific design must be used. This paper proposes a unique scheme that simultaneously performs fault diagnosis and tolerance control for the PEMFC system. The proposed control strategy consists of a fault diagnosis, a reconfiguration mechanism and adjustable controllers. Using a back-propagation neural network, a model-based fault detection method is employed to detect the PEMFC current fault type (flooding, membrane drying or normal). According to the diagnosis results, the reconfiguration mechanism determines which backup controllers to be selected. Three nonlinear controllers based on feedback linearization approaches are respectively built to adjust the voltage and pressure difference in the case of normal, membrane drying and flooding conditions. The simulation results illustrate that the proposed fault tolerance control strategy can track the voltage and keep the pressure difference at desired levels in faulty conditions.

  2. A Self-Stabilizing Hybrid Fault-Tolerant Synchronization Protocol

    NASA Technical Reports Server (NTRS)

    Malekpour, Mahyar R.

    2015-01-01

    This paper presents a strategy for solving the Byzantine general problem for self-stabilizing a fully connected network from an arbitrary state and in the presence of any number of faults with various severities including any number of arbitrary (Byzantine) faulty nodes. The strategy consists of two parts: first, converting Byzantine faults into symmetric faults, and second, using a proven symmetric-fault tolerant algorithm to solve the general case of the problem. A protocol (algorithm) is also present that tolerates symmetric faults, provided that there are more good nodes than faulty ones. The solution applies to realizable systems, while allowing for differences in the network elements, provided that the number of arbitrary faults is not more than a third of the network size. The only constraint on the behavior of a node is that the interactions with other nodes are restricted to defined links and interfaces. The solution does not rely on assumptions about the initial state of the system and no central clock nor centrally generated signal, pulse, or message is used. Nodes are anonymous, i.e., they do not have unique identities. A mechanical verification of a proposed protocol is also present. A bounded model of the protocol is verified using the Symbolic Model Verifier (SMV). The model checking effort is focused on verifying correctness of the bounded model of the protocol as well as confirming claims of determinism and linear convergence with respect to the self-stabilization period.

  3. Graceful fault tolerance in large networks of microcomputers

    SciTech Connect

    Agrawal, B.K.

    1984-01-01

    This work considers the problem of fault diagnosis in a network of distributed multicomputers, and a strategy for repeated reconfiguration is presented in detail to help improve the degree of fault tolerance. The overall system diagnosability is shown to be enhanced further by constructing a large network with small well-known graphs as its basis and then applying reconfiguration techniques locally in various system partitions and exchanging diagnostic information globally. A detailed description of this new attractive approach is presented along with the diagnostic algorithm suitable for large networks of microcomputers in VLSI based distributed systems. A systematic procedure for defining near-optimal fault-tolerance graph theoretic networks is investigated which is well suited for multicomputer structures. A distributed algorithm along with a new system diagnostic theory is proposed.

  4. A Convex Approach to Fault Tolerant Control

    NASA Technical Reports Server (NTRS)

    Maghami, Peiman G.; Cox, David E.; Bauer, Frank (Technical Monitor)

    2002-01-01

    The design of control laws for dynamic systems with the potential for actuator failures is considered in this work. The use of Linear Matrix Inequalities allows more freedom in controller design criteria than typically available with robust control. This work proposes an extension of fault-scheduled control design techniques that can find a fixed controller with provable performance over a set of plants. Through convexity of the objective function, performance bounds on this set of plants implies performance bounds on a range of systems defined by a convex hull. This is used to incorporate performance bounds for a variety of soft and hard failures into the control design problem.

  5. Fault-tolerant reactor protection system

    DOEpatents

    Gaubatz, Donald C.

    1997-01-01

    A reactor protection system having four divisions, with quad redundant sensors for each scram parameter providing input to four independent microprocessor-based electronic chassis. Each electronic chassis acquires the scram parameter data from its own sensor, digitizes the information, and then transmits the sensor reading to the other three electronic chassis via optical fibers. To increase system availability and reduce false scrams, the reactor protection system employs two levels of voting on a need for reactor scram. The electronic chassis perform software divisional data processing, vote 2/3 with spare based upon information from all four sensors, and send the divisional scram signals to the hardware logic panel, which performs a 2/4 division vote on whether or not to initiate a reactor scram. Each chassis makes a divisional scram decision based on data from all sensors. Each division performs independently of the others (asynchronous operation). All communications between the divisions are asynchronous. Each chassis substitutes its own spare sensor reading in the 2/3 vote if a sensor reading from one of the other chassis is faulty or missing. Therefore the presence of at least two valid sensor readings in excess of a set point is required before terminating the output to the hardware logic of a scram inhibition signal even when one of the four sensors is faulty or when one of the divisions is out of service.

  6. Fault-tolerant reactor protection system

    DOEpatents

    Gaubatz, D.C.

    1997-04-15

    A reactor protection system is disclosed having four divisions, with quad redundant sensors for each scram parameter providing input to four independent microprocessor-based electronic chassis. Each electronic chassis acquires the scram parameter data from its own sensor, digitizes the information, and then transmits the sensor reading to the other three electronic chassis via optical fibers. To increase system availability and reduce false scrams, the reactor protection system employs two levels of voting on a need for reactor scram. The electronic chassis perform software divisional data processing, vote 2/3 with spare based upon information from all four sensors, and send the divisional scram signals to the hardware logic panel, which performs a 2/4 division vote on whether or not to initiate a reactor scram. Each chassis makes a divisional scram decision based on data from all sensors. Each division performs independently of the others (asynchronous operation). All communications between the divisions are asynchronous. Each chassis substitutes its own spare sensor reading in the 2/3 vote if a sensor reading from one of the other chassis is faulty or missing. Therefore the presence of at least two valid sensor readings in excess of a set point is required before terminating the output to the hardware logic of a scram inhibition signal even when one of the four sensors is faulty or when one of the divisions is out of service. 16 figs.

  7. Advanced information processing system: The Army Fault-Tolerant Architecture detailed design overview

    NASA Technical Reports Server (NTRS)

    Harper, Richard E.; Babikyan, Carol A.; Butler, Bryan P.; Clasen, Robert J.; Harris, Chris H.; Lala, Jaynarayan H.; Masotto, Thomas K.; Nagle, Gail A.; Prizant, Mark J.; Treadwell, Steven

    1994-01-01

    The Army Avionics Research and Development Activity (AVRADA) is pursuing programs that would enable effective and efficient management of large amounts of situational data that occurs during tactical rotorcraft missions. The Computer Aided Low Altitude Night Helicopter Flight Program has identified automated Terrain Following/Terrain Avoidance, Nap of the Earth (TF/TA, NOE) operation as key enabling technology for advanced tactical rotorcraft to enhance mission survivability and mission effectiveness. The processing of critical information at low altitudes with short reaction times is life-critical and mission-critical necessitating an ultra-reliable/high throughput computing platform for dependable service for flight control, fusion of sensor data, route planning, near-field/far-field navigation, and obstacle avoidance operations. To address these needs the Army Fault Tolerant Architecture (AFTA) is being designed and developed. This computer system is based upon the Fault Tolerant Parallel Processor (FTPP) developed by Charles Stark Draper Labs (CSDL). AFTA is hard real-time, Byzantine, fault-tolerant parallel processor which is programmed in the ADA language. This document describes the results of the Detailed Design (Phase 2 and 3 of a 3-year project) of the AFTA development. This document contains detailed descriptions of the program objectives, the TF/TA NOE application requirements, architecture, hardware design, operating systems design, systems performance measurements and analytical models.

  8. Development and application of diagnostic systems to achieve fault tolerance

    SciTech Connect

    King, R.W.; Singer, R.M.

    1989-01-01

    Much work is currently being done to develop and apply diagnostic systems that are tolerant to faulted conditions in the process being monitored and in the sensors that measure the critical parameters associated with the process. A fault-tolerant diagnostic system based on state-determination, pattern-recognition techniques is currently undergoing testing and evaluation in certain applications at the EBR-II reactor. Testing and operational experience with the system to date has shown a high degree of tolerance to sensor failures, while being sensitive to very slight changes in the plant operational state. This paper briefly mentions related work being done by others, and describes in more detail the pattern-recognition system and the results of the testing and operational experience with the system at EBR-II. 9 refs., 10 figs.

  9. Fault detection, isolation and reconfiguration in FTMP Methods and experimental results. [fault tolerant multiprocessor

    NASA Technical Reports Server (NTRS)

    Lala, J. H.

    1983-01-01

    The Fault-Tolerant Multiprocessor (FTMP) is a highly reliable computer designed to meet a goal of 10 to the -10th failures per hour and built with the objective of flying an active-control transport aircraft. Fault detection, identification, and recovery software is described, and experimental results obtained by injecting faults in the pin level in the FTMP are presented. Over 21,000 faults were injected in the CPU, memory, bus interface circuits, and error detection, masking, and error reporting circuits of one LRU of the multiprocessor. Detection, isolation, and reconfiguration times were recorded for each fault, and the results were found to agree well with earlier assumptions made in reliability modeling.

  10. Better Fault Tolerance via Application Enhanced Networks

    DTIC Science & Technology

    2005-09-01

    TIME TO CONSTRUCT AND DRAW A SURFACE APPROXIMATION OF MOUNT RAINIER EAST AS A FUNCTION OF THE ERROR PER DISTANCE TOLERANCE. THE GRAPH ALSO SHOWS THE...resolution 7.5 minute digital elevation model (DEM) of Mount Rainier , Washington has 977 x 1405 = 1,372,685 data points. Combining several of these map...x 467 data points), and a larger elevation model (figure 14), Mount Rainier East (977 x 1405 data points). In both cases, we measured the time to

  11. Multiversion software reliability through fault-avoidance and fault-tolerance

    NASA Technical Reports Server (NTRS)

    Vouk, Mladen A.; Mcallister, David F.

    1990-01-01

    In this project we have proposed to investigate a number of experimental and theoretical issues associated with the practical use of multi-version software in providing dependable software through fault-avoidance and fault-elimination, as well as run-time tolerance of software faults. In the period reported here we have working on the following: We have continued collection of data on the relationships between software faults and reliability, and the coverage provided by the testing process as measured by different metrics (including data flow metrics). We continued work on software reliability estimation methods based on non-random sampling, and the relationship between software reliability and code coverage provided through testing. We have continued studying back-to-back testing as an efficient mechanism for removal of uncorrelated faults, and common-cause faults of variable span. We have also been studying back-to-back testing as a tool for improvement of the software change process, including regression testing. We continued investigating existing, and worked on formulation of new fault-tolerance models. In particular, we have partly finished evaluation of Consensus Voting in the presence of correlated failures, and are in the process of finishing evaluation of Consensus Recovery Block (CRB) under failure correlation. We find both approaches far superior to commonly employed fixed agreement number voting (usually majority voting). We have also finished a cost analysis of the CRB approach.

  12. Integrated sensor and actuator fault-tolerant control

    NASA Astrophysics Data System (ADS)

    Seron, María M.; De Doná, José A.; Richter, Jan H.

    2013-04-01

    We propose a fault-tolerant control scheme that deals with sensor and actuator faults through the use of a virtual actuator (VA) and a bank of virtual sensors (VSs). A novel feature of the scheme is that the VSs implicitly integrate both fault detection and isolation (FDI) and - in conjunction with the VA - controller reconfiguration tasks. The VA and the bank of VSs operate in closed-loop with an observer-based tracking controller designed for a nominal (fault free) model of the plant. A switching rule that reconfigures the VA and engages the suitable VS from the bank is based on sets defined for measurable residual signals constructed directly from the VS signals. Our method handles abrupt actuator and sensor faults of arbitrary magnitude including complete outage. The overall scheme is shown to guarantee closed-loop boundedness and setpoint tracking under all considered fault situations. Enhancements of the scheme to deal with errors in the fault detection and isolation are also proposed. Applications of the scheme to a winding machine and an interconnected tank system are presented.

  13. Fault-tolerant, embedded CLIPS applications

    NASA Technical Reports Server (NTRS)

    Hicks, Jaye; Matthews, John

    1990-01-01

    The enhancements to CLIPS4.3 presented in this paper provide an embedded CLIPS application with the ability to continue operation with minimal to no loss of information in the event of a hardware or a software failure. Given an arbitrary failure, the CLIPS application's environment (fact-list, agenda, and pattern matching network) will be reconstructed to the point at which the failure was experienced. The environment reconstruction is based on state files to which the application periodically checks environment information (fact-list and agenda). The routine for checkpointing the state of the application is as efficient as possible so that the overhead introduced to normal execution of the application is minimal. The only assumptions made by the CLIPS application are that it is running under an operating system that guarantees it access to uncorrupt state files and that the application will be automatically restarted should it terminate abnormally.

  14. Development of software fault-tolerance techniques

    NASA Technical Reports Server (NTRS)

    Melliar-Smith, P. M.

    1983-01-01

    As computers become more widely used, and in particular as they become used in more safety critical applications, the reliability of the computer system and its software becomes more important. There is also an increasing need for high levels of reliability in applications involving very large numbers of inexpensive units where recall of the units would be disproportionately expensive. The nature of faults and the assumptions made by different approaches to correct operation are considered. The recovery block approach is described and a probabilistic analysis of its effectiveness, with and without correlated design errors is provided. Mechanisms for generating acceptance tests from specifications, and for providing recovery in the presence of asynchrony, are described. An analysis of, and design for, the provision of recovery blocks in the microprogram of the Bendix BDX930 processor is provided. An example of the use of recovery blocks in a simple operating system is also provided.

  15. Fault-Tolerant Local-Area Network

    NASA Technical Reports Server (NTRS)

    Morales, Sergio; Friedman, Gary L.

    1988-01-01

    Local-area network (LAN) for computers prevents single-point failure from interrupting communication between nodes of network. Includes two complete cables, LAN 1 and LAN 2. Microprocessor-based slave switches link cables to network-node devices as work stations, print servers, and file servers. Slave switches respond to commands from master switch, connecting nodes to two cable networks or disconnecting them so they are completely isolated. System monitor and control computer (SMC) acts as gateway, allowing nodes on either cable to communicate with each other and ensuring that LAN 1 and LAN 2 are fully used when functioning properly. Network monitors and controls itself, automatically routes traffic for efficient use of resources, and isolates and corrects its own faults, with potential dramatic reduction in time out of service.

  16. Algorithm-dependent fault tolerance for distributed computing

    SciTech Connect

    P. D. Hough; M. e. Goldsby; E. J. Walsh

    2000-02-01

    Large-scale distributed systems assembled from commodity parts, like CPlant, have become common tools in the distributed computing world. Because of their size and diversity of parts, these systems are prone to failures. Applications that are being run on these systems have not been equipped to efficiently deal with failures, nor is there vendor support for fault tolerance. Thus, when a failure occurs, the application crashes. While most programmers make use of checkpoints to allow for restarting of their applications, this is cumbersome and incurs substantial overhead. In many cases, there are more efficient and more elegant ways in which to address failures. The goal of this project is to develop a software architecture for the detection of and recovery from faults in a cluster computing environment. The detection phase relies on the latest techniques developed in the fault tolerance community. Recovery is being addressed in an application-dependent manner, thus allowing the programmer to take advantage of algorithmic characteristics to reduce the overhead of fault tolerance. This architecture will allow large-scale applications to be more robust in high-performance computing environments that are comprised of clusters of commodity computers such as CPlant and SMP clusters.

  17. Development and evaluation of a Fault-Tolerant Multiprocessor (FTMP) computer. Volume 2: FTMP software

    NASA Technical Reports Server (NTRS)

    Lala, J. H.; Smith, T. B., III

    1983-01-01

    The software developed for the Fault-Tolerant Multiprocessor (FTMP) is described. The FTMP executive is a timer-interrupt driven dispatcher that schedules iterative tasks which run at 3.125, 12.5, and 25 Hz. Major tasks which run under the executive include system configuration control, flight control, and display. The flight control task includes autopilot and autoland functions for a jet transport aircraft. System Displays include status displays of all hardware elements (processors, memories, I/O ports, buses), failure log displays showing transient and hard faults, and an autopilot display. All software is in a higher order language (AED, an ALGOL derivative). The executive is a fully distributed general purpose executive which automatically balances the load among available processor triads. Provisions for graceful performance degradation under processing overload are an integral part of the scheduling algorithms.

  18. Fault tolerant multiphase electrical drives: the impact of design

    NASA Astrophysics Data System (ADS)

    Semail, E.; Kestelyn, X.; Locment, F.

    2008-08-01

    This paper deals with fault tolerant multiphase electrical drives. The quality of the torque of a vector-controlled Permanent Magnet (PM) Synchronous Machine supplied by a multi-leg Voltage Source Inverter (VSI) is examined in normal operation and when one or two phases are open-circuited. It is then deduced that a seven-phase machine is a good compromise allowing high torque-to-volume density and easy control with smooth torque in fault operation. Experimental results confirm the predicted characteristics. This article has been submitted as part of “IET Colloquium on Reliability in Electromagnetic Systems”, 24 and 25 May 2007, Paris

  19. Fault tolerance and vehicle health management aspects of launch vehicle PMAD system design

    NASA Astrophysics Data System (ADS)

    Jackson, William E.; Johnson, Stephen

    Attention is given to the future needs of the USAF, NASA, and project launch vehicle PMAD (power management and distribution). It is found that customers and projects desire increased PMAD reliability (fault avoidance and fault tolerance coverage features), fault tolerance (coverage and latency features), and availability (fault avoidance and fault tolerance latency features). PMAD/VHM (vehicle health management) architectures need to be intelligent to provide improvements in process failures during manufacturing, building, assembly, and checkout phases. PMAD architectures driven by fault tolerance, reliability, and availability need to be modular-redundant at the function level.

  20. Sliding mode based fault detection, reconstruction and fault tolerant control scheme for motor systems.

    PubMed

    Mekki, Hemza; Benzineb, Omar; Boukhetala, Djamel; Tadjine, Mohamed; Benbouzid, Mohamed

    2015-07-01

    The fault-tolerant control problem belongs to the domain of complex control systems in which inter-control-disciplinary information and expertise are required. This paper proposes an improved faults detection, reconstruction and fault-tolerant control (FTC) scheme for motor systems (MS) with typical faults. For this purpose, a sliding mode controller (SMC) with an integral sliding surface is adopted. This controller can make the output of system to track the desired position reference signal in finite-time and obtain a better dynamic response and anti-disturbance performance. But this controller cannot deal directly with total system failures. However an appropriate combination of the adopted SMC and sliding mode observer (SMO), later it is designed to on-line detect and reconstruct the faults and also to give a sensorless control strategy which can achieve tolerance to a wide class of total additive failures. The closed-loop stability is proved, using the Lyapunov stability theory. Simulation results in healthy and faulty conditions confirm the reliability of the suggested framework.

  1. Development and evaluation of a Fault-Tolerant Multiprocessor (FTMP) computer. Volume 3: FTMP test and evaluation

    NASA Technical Reports Server (NTRS)

    Lala, J. H.; Smith, T. B., III

    1983-01-01

    The experimental test and evaluation of the Fault-Tolerant Multiprocessor (FTMP) is described. Major objectives of this exercise include expanding validation envelope, building confidence in the system, revealing any weaknesses in the architectural concepts and in their execution in hardware and software, and in general, stressing the hardware and software. To this end, pin-level faults were injected into one LRU of the FTMP and the FTMP response was measured in terms of fault detection, isolation, and recovery times. A total of 21,055 stuck-at-0, stuck-at-1 and invert-signal faults were injected in the CPU, memory, bus interface circuits, Bus Guardian Units, and voters and error latches. Of these, 17,418 were detected. At least 80 percent of undetected faults are estimated to be on unused pins. The multiprocessor identified all detected faults correctly and recovered successfully in each case. Total recovery time for all faults averaged a little over one second. This can be reduced to half a second by including appropriate self-tests.

  2. Verification of the FtCayuga fault-tolerant microprocessor system. Volume 1: A case study in theorem prover-based verification

    NASA Technical Reports Server (NTRS)

    Srivas, Mandayam; Bickford, Mark

    1991-01-01

    The design and formal verification of a hardware system for a task that is an important component of a fault tolerant computer architecture for flight control systems is presented. The hardware system implements an algorithm for obtaining interactive consistancy (byzantine agreement) among four microprocessors as a special instruction on the processors. The property verified insures that an execution of the special instruction by the processors correctly accomplishes interactive consistency, provided certain preconditions hold. An assumption is made that the processors execute synchronously. For verification, the authors used a computer aided design hardware design verification tool, Spectool, and the theorem prover, Clio. A major contribution of the work is the demonstration of a significant fault tolerant hardware design that is mechanically verified by a theorem prover.

  3. Fault Detection, Isolation and Recovery (FDIR) Portable Liquid Oxygen Hardware Demonstrator

    NASA Technical Reports Server (NTRS)

    Oostdyk, Rebecca L.; Perotti, Jose M.

    2011-01-01

    The Fault Detection, Isolation and Recovery (FDIR) hardware demonstration will highlight the effort being conducted by Constellation's Ground Operations (GO) to provide the Launch Control System (LCS) with system-level health management during vehicle processing and countdown activities. A proof-of-concept demonstration of the FDIR prototype established the capability of the software to provide real-time fault detection and isolation using generated Liquid Hydrogen data. The FDIR portable testbed unit (presented here) aims to enhance FDIR by providing a dynamic simulation of Constellation subsystems that feed the FDIR software live data based on Liquid Oxygen system properties. The LO2 cryogenic ground system has key properties that are analogous to the properties of an electronic circuit. The LO2 system is modeled using electrical components and an equivalent circuit is designed on a printed circuit board to simulate the live data. The portable testbed is also be equipped with data acquisition and communication hardware to relay the measurements to the FDIR application running on a PC. This portable testbed is an ideal capability to perform FDIR software testing, troubleshooting, training among others.

  4. A Decentralized Adaptive Approach to Fault Tolerant Flight Control

    NASA Technical Reports Server (NTRS)

    Wu, N. Eva; Nikulin, Vladimir; Heimes, Felix; Shormin, Victor

    2000-01-01

    This paper briefly reports some results of our study on the application of a decentralized adaptive control approach to a 6 DOF nonlinear aircraft model. The simulation results showed the potential of using this approach to achieve fault tolerant control. Based on this observation and some analysis, the paper proposes a multiple channel adaptive control scheme that makes use of the functionally redundant actuating and sensing capabilities in the model, and explains how to implement the scheme to tolerate actuator and sensor failures. The conditions, under which the scheme is applicable, are stated in the paper.

  5. A Fault Tolerant System for an Integrated Avionics Sensor Configuration

    NASA Technical Reports Server (NTRS)

    Caglayan, A. K.; Lancraft, R. E.

    1984-01-01

    An aircraft sensor fault tolerant system methodology for the Transport Systems Research Vehicle in a Microwave Landing System (MLS) environment is described. The fault tolerant system provides reliable estimates in the presence of possible failures both in ground-based navigation aids, and in on-board flight control and inertial sensors. Sensor failures are identified by utilizing the analytic relationships between the various sensors arising from the aircraft point mass equations of motion. The estimation and failure detection performance of the software implementation (called FINDS) of the developed system was analyzed on a nonlinear digital simulation of the research aircraft. Simulation results showing the detection performance of FINDS, using a dual redundant sensor compliment, are presented for bias, hardover, null, ramp, increased noise and scale factor failures. In general, the results show that FINDS can distinguish between normal operating sensor errors and failures while providing an excellent detection speed for bias failures in the MLS, indicated airspeed, attitude and radar altimeter sensors.

  6. Fault Tolerance in ZigBee Wireless Sensor Networks

    NASA Technical Reports Server (NTRS)

    Alena, Richard; Gilstrap, Ray; Baldwin, Jarren; Stone, Thom; Wilson, Pete

    2011-01-01

    Wireless sensor networks (WSN) based on the IEEE 802.15.4 Personal Area Network standard are finding increasing use in the home automation and emerging smart energy markets. The network and application layers, based on the ZigBee 2007 PRO Standard, provide a convenient framework for component-based software that supports customer solutions from multiple vendors. This technology is supported by System-on-a-Chip solutions, resulting in extremely small and low-power nodes. The Wireless Connections in Space Project addresses the aerospace flight domain for both flight-critical and non-critical avionics. WSNs provide the inherent fault tolerance required for aerospace applications utilizing such technology. The team from Ames Research Center has developed techniques for assessing the fault tolerance of ZigBee WSNs challenged by radio frequency (RF) interference or WSN node failure.

  7. Fault Tolerance with Noisy and Slow Measurements and Preparation

    NASA Astrophysics Data System (ADS)

    Paz-Silva, Gerardo A.; Brennen, Gavin K.; Twamley, Jason

    2010-09-01

    It is not so well known that measurement-free quantum error correction protocols can be designed to achieve fault-tolerant quantum computing. Despite their potential advantages in terms of the relaxation of accuracy, speed, and addressing requirements, they have usually been overlooked since they are expected to yield a very bad threshold. We show that this is not the case. We design fault-tolerant circuits for the 9-qubit Bacon-Shor code and find an error threshold for unitary gates and preparation of p(p,g)thresh=3.76×10-5 (30% of the best known result for the same code using measurement) while admitting up to 1/3 error rates for measurements and allocating no constraints on measurement speed. We further show that demanding gate error rates sufficiently below the threshold pushes the preparation threshold up to p(p)thresh=1/3.

  8. Decomposition in reliability analysis of fault-tolerant systems

    NASA Technical Reports Server (NTRS)

    Trivedi, K. S.; Geist, R. M.

    1983-01-01

    The existing approaches to reliability modeling are briefly reviewed. An examination of the limitations of the existing approaches in modeling ultrareliable fault-tolerant systems illustrates the need to use decomposition techniques. The notion of behavioral decomposition is introduced for dealing with reliability models with a large number of states, and a series of examples is presented. The CARE (computer-aided reliability estimation) and HARP (hybrid automated reliability predictor) approaches to reliability are discussed.

  9. Combining Dynamical Decoupling with Fault-Tolerant Quantum Computation

    DTIC Science & Technology

    2009-11-17

    ar X iv :0 91 1. 32 02 v1 [ qu an t- ph ] 1 7 N ov 2 00 9 Combining dynamical decoupling with fault-tolerant quantum computation Hui Khoon Ng,1...Daniel A. Lidar,2 and John Preskill1 1Institute for Quantum Information, California Institute of Technology, Pasadena, CA 91125, USA 2Departments...of Chemistry, Electrical Engineering, and Physics, and Center for Quantum Information Science & Technology, University of Southern California, Los

  10. Fault-tolerant Landau-Zener quantum gates

    SciTech Connect

    Hicke, C.; Santos, L. F.; Dykman, M. I.

    2006-01-15

    We present a method to perform fault-tolerant single-qubit gate operations using Landau-Zener tunneling. In a single Landau-Zener pulse, the qubit transition frequency is varied in time so that it passes through the frequency of the radiation field. We show that a simple three-pulse sequence allows eliminating errors in the gate up to the third order in errors in the qubit energies or the radiation frequency.

  11. Scalable and Fault Tolerant Failure Detection and Consensus

    SciTech Connect

    Katti, Amogh; Di Fatta, Giuseppe; Naughton III, Thomas J; Engelmann, Christian

    2015-01-01

    Future extreme-scale high-performance computing systems will be required to work under frequent component failures. The MPI Forum's User Level Failure Mitigation proposal has introduced an operation, MPI_Comm_shrink, to synchronize the alive processes on the list of failed processes, so that applications can continue to execute even in the presence of failures by adopting algorithm-based fault tolerance techniques. This MPI_Comm_shrink operation requires a fault tolerant failure detection and consensus algorithm. This paper presents and compares two novel failure detection and consensus algorithms. The proposed algorithms are based on Gossip protocols and are inherently fault-tolerant and scalable. The proposed algorithms were implemented and tested using the Extreme-scale Simulator. The results show that in both algorithms the number of Gossip cycles to achieve global consensus scales logarithmically with system size. The second algorithm also shows better scalability in terms of memory and network bandwidth usage and a perfect synchronization in achieving global consensus.

  12. Reliability of Fault Tolerant Control Systems. Part 2

    NASA Technical Reports Server (NTRS)

    Wu, N. Eva

    2000-01-01

    This paper reports Part II of a two part effort that is intended to delineate the relationship between reliability and fault tolerant control in a quantitative manner. Reliability properties peculiar to fault-tolerant control systems are emphasized, such as the presence of analytic redundancy in high proportion, the dependence of failures on control performance, and high risks associated with decisions in redundancy management due to multiple sources of uncertainties and sometimes large processing requirements. As a consequence, coverage of failures through redundancy management can be severely limited. The paper proposes to formulate the fault tolerant control problem as an optimization problem that maximizes coverage of failures through redundancy management. Coverage modeling is attempted in a way that captures its dependence on the control performance and on the diagnostic resolution. Under the proposed redundancy management policy, it is shown that an enhanced overall system reliability can be achieved with a control law of a superior robustness, with an estimator of a higher resolution, and with a control performance requirement of a lesser stringency.

  13. Redundant finite rings for fault-tolerant signal processors

    NASA Astrophysics Data System (ADS)

    Jullien, Graham A.; Bizzan, S. S.; Wigley, Neil M.; Miller, W. C.

    1994-10-01

    Redundant Residue Number Systems (RRNS) have been proposed as suitable candidates for fault tolerance in compute intensive applications. The redundancy is based on multiple projections to moduli sub-sets and conducting a search for results that lie in a so-called illegitimate range. This paper presents RRNS fault tolerant procedures for a recently introduced finite polynomial ring mapping procedure (modulus replication RNS). The mapping technique dispenses with the need for many relatively prime ring moduli, which is a major draw-back with conventional RRNS systems. Although double, triple, and quadrupole modular redundancy can be implemented in the polynomial mapping structure, polynomial coefficient circuitry, or the independent direct product ring computational channels, for error detection and/or correction, this paper discusses the implementation of redundant rings which are generated by (1) redundant residues, (2) spare general computational channels, or (3) a combination of the two. The first architecture is suitable for RNS embedding in the MRRNS, and the second for single moduli mappings. The combination architecture allows a trade-off between the two extremes. The application area is in fault tolerant compute intensive DSP arrays.

  14. Active Fault Tolerant Control for Ultrasonic Piezoelectric Motor

    NASA Astrophysics Data System (ADS)

    Boukhnifer, Moussa

    2012-07-01

    Ultrasonic piezoelectric motor technology is an important system component in integrated mechatronics devices working on extreme operating conditions. Due to these constraints, robustness and performance of the control interfaces should be taken into account in the motor design. In this paper, we apply a new architecture for a fault tolerant control using Youla parameterization for an ultrasonic piezoelectric motor. The distinguished feature of proposed controller architecture is that it shows structurally how the controller design for performance and robustness may be done separately which has the potential to overcome the conflict between performance and robustness in the traditional feedback framework. A fault tolerant control architecture includes two parts: one part for performance and the other part for robustness. The controller design works in such a way that the feedback control system will be solely controlled by the proportional plus double-integral PI2 performance controller for a nominal model without disturbances and H∞ robustification controller will only be activated in the presence of the uncertainties or an external disturbances. The simulation results demonstrate the effectiveness of the proposed fault tolerant control architecture.

  15. Fault-Tolerant, Radiation-Hard DSP

    NASA Technical Reports Server (NTRS)

    Czajkowski, David

    2011-01-01

    Commercial digital signal processors (DSPs) for use in high-speed satellite computers are challenged by the damaging effects of space radiation, mainly single event upsets (SEUs) and single event functional interrupts (SEFIs). Innovations have been developed for mitigating the effects of SEUs and SEFIs, enabling the use of very-highspeed commercial DSPs with improved SEU tolerances. Time-triple modular redundancy (TTMR) is a method of applying traditional triple modular redundancy on a single processor, exploiting the VLIW (very long instruction word) class of parallel processors. TTMR improves SEU rates substantially. SEFIs are solved by a SEFI-hardened core circuit, external to the microprocessor. It monitors the health of the processor, and if a SEFI occurs, forces the processor to return to performance through a series of escalating events. TTMR and hardened-core solutions were developed for both DSPs and reconfigurable field-programmable gate arrays (FPGAs). This includes advancement of TTMR algorithms for DSPs and reconfigurable FPGAs, plus a rad-hard, hardened-core integrated circuit that services both the DSP and FPGA. Additionally, a combined DSP and FPGA board architecture was fully developed into a rad-hard engineering product. This technology enables use of commercial off-the-shelf (COTS) DSPs in computers for satellite and other space applications, allowing rapid deployment at a much lower cost. Traditional rad-hard space computers are very expensive and typically have long lead times. These computers are either based on traditional rad-hard processors, which have extremely low computational performance, or triple modular redundant (TMR) FPGA arrays, which suffer from power and complexity issues. Even more frustrating is that the TMR arrays of FPGAs require a fixed, external rad-hard voting element, thereby causing them to lose much of their reconfiguration capability and in some cases significant speed reduction. The benefits of COTS high

  16. Algorithm-Based Fault Tolerance for Numerical Subroutines

    NASA Technical Reports Server (NTRS)

    Tumon, Michael; Granat, Robert; Lou, John

    2007-01-01

    A software library implements a new methodology of detecting faults in numerical subroutines, thus enabling application programs that contain the subroutines to recover transparently from single-event upsets. The software library in question is fault-detecting middleware that is wrapped around the numericalsubroutines. Conventional serial versions (based on LAPACK and FFTW) and a parallel version (based on ScaLAPACK) exist. The source code of the application program that contains the numerical subroutines is not modified, and the middleware is transparent to the user. The methodology used is a type of algorithm- based fault tolerance (ABFT). In ABFT, a checksum is computed before a computation and compared with the checksum of the computational result; an error is declared if the difference between the checksums exceeds some threshold. Novel normalization methods are used in the checksum comparison to ensure correct fault detections independent of algorithm inputs. In tests of this software reported in the peer-reviewed literature, this library was shown to enable detection of 99.9 percent of significant faults while generating no false alarms.

  17. Disturbance observer based fault estimation and dynamic output feedback fault tolerant control for fuzzy systems with local nonlinear models.

    PubMed

    Han, Jian; Zhang, Huaguang; Wang, Yingchun; Liu, Yang

    2015-11-01

    This paper addresses the problems of fault estimation (FE) and fault tolerant control (FTC) for fuzzy systems with local nonlinear models, external disturbances, sensor and actuator faults, simultaneously. Disturbance observer (DO) and FE observer are designed, simultaneously. Compared with the existing results, the proposed observer is with a wider application range. Using the estimation information, a novel fuzzy dynamic output feedback fault tolerant controller (DOFFTC) is designed. The controller can be used for the fuzzy systems with unmeasurable local nonlinear models, mismatched input disturbances, and measurement output affecting by sensor faults and disturbances. At last, the simulation shows the effectiveness of the proposed methods.

  18. SIFT - Multiprocessor architecture for Software Implemented Fault Tolerance flight control and avionics computers

    NASA Technical Reports Server (NTRS)

    Forman, P.; Moses, K.

    1979-01-01

    A brief description of a SIFT (Software Implemented Fault Tolerance) Flight Control Computer with emphasis on implementation is presented. A multiprocessor system that relies on software-implemented fault detection and reconfiguration algorithms is described. A high level reliability and fault tolerance is achieved by the replication of computing tasks among processing units.

  19. A fault-tolerant control architecture for unmanned aerial vehicles

    NASA Astrophysics Data System (ADS)

    Drozeski, Graham R.

    Research has presented several approaches to achieve varying degrees of fault-tolerance in unmanned aircraft. Approaches in reconfigurable flight control are generally divided into two categories: those which incorporate multiple non-adaptive controllers and switch between them based on the output of a fault detection and identification element, and those that employ a single adaptive controller capable of compensating for a variety of fault modes. Regardless of the approach for reconfigurable flight control, certain fault modes dictate system restructuring in order to prevent a catastrophic failure. System restructuring enables active control of actuation not employed by the nominal system to recover controllability of the aircraft. After system restructuring, continued operation requires the generation of flight paths that adhere to an altered flight envelope. The control architecture developed in this research employs a multi-tiered hierarchy to allow unmanned aircraft to generate and track safe flight paths despite the occurrence of potentially catastrophic faults. The hierarchical architecture increases the level of autonomy of the system by integrating five functionalities with the baseline system: fault detection and identification, active system restructuring, reconfigurable flight control; reconfigurable path planning, and mission adaptation. Fault detection and identification algorithms continually monitor aircraft performance and issue fault declarations. When the severity of a fault exceeds the capability of the baseline flight controller, active system restructuring expands the controllability of the aircraft using unconventional control strategies not exploited by the baseline controller. Each of the reconfigurable flight controllers and the baseline controller employ a proven adaptive neural network control strategy. A reconfigurable path planner employs an adaptive model of the vehicle to re-shape the desired flight path. Generation of the revised

  20. An improved fault-tolerant control scheme for PWM inverter-fed induction motor-based EVs.

    PubMed

    Tabbache, Bekheïra; Benbouzid, Mohamed; Kheloui, Abdelaziz; Bourgeot, Jean-Matthieu; Mamoune, Abdeslam

    2013-11-01

    This paper proposes an improved fault-tolerant control scheme for PWM inverter-fed induction motor-based electric vehicles. The proposed strategy deals with power switch (IGBTs) failures mitigation within a reconfigurable induction motor control. To increase the vehicle powertrain reliability regarding IGBT open-circuit failures, 4-wire and 4-leg PWM inverter topologies are investigated and their performances discussed in a vehicle context. The proposed fault-tolerant topologies require only minimum hardware modifications to the conventional off-the-shelf six-switch three-phase drive, mitigating the IGBTs failures by specific inverter control. Indeed, the two topologies exploit the induction motor neutral accessibility for fault-tolerant purposes. The 4-wire topology uses then classical hysteresis controllers to account for the IGBT failures. The 4-leg topology, meanwhile, uses a specific 3D space vector PWM to handle vehicle requirements in terms of size (DC bus capacitors) and cost (IGBTs number). Experiments on an induction motor drive and simulations on an electric vehicle are carried-out using a European urban driving cycle to show that the proposed fault-tolerant control approach is effective and provides a simple configuration with high performance in terms of speed and torque responses.

  1. Standardized Fault-Tolerant Sensing Nodes for an Intelligent Turbine Engine Control System (Preprint)

    DTIC Science & Technology

    2013-05-01

    Fault detection and isolation is improved. For example, when a thermocouple is malfunctioning, the problem...converting the analog voltage to a temperature (or pressure, or position etc.). Any electronic interface must have fault detection and isolation built...added benefits – Increase Noise Immunity – Sensor Calibration Made Easy – Fault Detection and Isolation – Fault Tolerance – Prognostic Functions

  2. Modeling and measurement of fault-tolerant multiprocessors

    NASA Technical Reports Server (NTRS)

    Shin, K. G.; Woodbury, M. H.; Lee, Y. H.

    1985-01-01

    The workload effects on computer performance are addressed first for a highly reliable unibus multiprocessor used in real-time control. As an approach to studing these effects, a modified Stochastic Petri Net (SPN) is used to describe the synchronous operation of the multiprocessor system. From this model the vital components affecting performance can be determined. However, because of the complexity in solving the modified SPN, a simpler model, i.e., a closed priority queuing network, is constructed that represents the same critical aspects. The use of this model for a specific application requires the partitioning of the workload into job classes. It is shown that the steady state solution of the queuing model directly produces useful results. The use of this model in evaluating an existing system, the Fault Tolerant Multiprocessor (FTMP) at the NASA AIRLAB, is outlined with some experimental results. Also addressed is the technique of measuring fault latency, an important microscopic system parameter. Most related works have assumed no or a negligible fault latency and then performed approximate analyses. To eliminate this deficiency, a new methodology for indirectly measuring fault latency is presented.

  3. Hypothetical Scenario Generator for Fault-Tolerant Diagnosis

    NASA Technical Reports Server (NTRS)

    James, Mark

    2007-01-01

    The Hypothetical Scenario Generator for Fault-tolerant Diagnostics (HSG) is an algorithm being developed in conjunction with other components of artificial- intelligence systems for automated diagnosis and prognosis of faults in spacecraft, aircraft, and other complex engineering systems. By incorporating prognostic capabilities along with advanced diagnostic capabilities, these developments hold promise to increase the safety and affordability of the affected engineering systems by making it possible to obtain timely and accurate information on the statuses of the systems and predicting impending failures well in advance. The HSG is a specific instance of a hypothetical- scenario generator that implements an innovative approach for performing diagnostic reasoning when data are missing. The special purpose served by the HSG is to (1) look for all possible ways in which the present state of the engineering system can be mapped with respect to a given model and (2) generate a prioritized set of future possible states and the scenarios of which they are parts.

  4. Fault tolerance techniques for embedded telemetry system: case study

    NASA Astrophysics Data System (ADS)

    Krosman, Kazimierz; Sosnowski, Janusz

    2016-09-01

    This paper presents software methods of improving fault tolerance in embedded systems. These methods have been adapted to a telemetry system dedicated to tracking vehicles for logistics purposes. The developed telemetry system allows us to monitor vehicle position and some technical parameters via GSM communication. It comprises the capability of remote software reconfiguration. To evaluate dependability of the system we use a fault injection technique based on simulating bit-flip errors within memory cells. For this purpose an original testbed has been developed. It provides not only the capability of disturbing internal state of the tested system (via JTAG interface) but also the possibility of controlling system input states and observing its behavior (in particular output signals) according to specified test scenarios. The whole test process is automatized. The paper presents a case study related to a commercial product but the described methodology and techniques can be extended for other embedded systems.

  5. Gain-Scheduled Fault Tolerance Control Under False Identification

    NASA Technical Reports Server (NTRS)

    Shin, Jong-Yeob; Belcastro, Christine (Technical Monitor)

    2006-01-01

    An active fault tolerant control (FTC) law is generally sensitive to false identification since the control gain is reconfigured for fault occurrence. In the conventional FTC law design procedure, dynamic variations due to false identification are not considered. In this paper, an FTC synthesis method is developed in order to consider possible variations of closed-loop dynamics under false identification into the control design procedure. An active FTC synthesis problem is formulated into an LMI optimization problem to minimize the upper bound of the induced-L2 norm which can represent the worst-case performance degradation due to false identification. The developed synthesis method is applied for control of the longitudinal motions of FASER (Free-flying Airplane for Subscale Experimental Research). The designed FTC law of the airplane is simulated for pitch angle command tracking under a false identification case.

  6. Sliding mode fault detection and fault-tolerant control of smart dampers in semi-active control of building structures

    NASA Astrophysics Data System (ADS)

    Yeganeh Fallah, Arash; Taghikhany, Touraj

    2015-12-01

    Recent decades have witnessed much interest in the application of active and semi-active control strategies for seismic protection of civil infrastructures. However, the reliability of these systems is still in doubt as there remains the possibility of malfunctioning of their critical components (i.e. actuators and sensors) during an earthquake. This paper focuses on the application of the sliding mode method due to the inherent robustness of its fault detection observer and fault-tolerant control. The robust sliding mode observer estimates the state of the system and reconstructs the actuators’ faults which are used for calculating a fault distribution matrix. Then the fault-tolerant sliding mode controller reconfigures itself by the fault distribution matrix and accommodates the fault effect on the system. Numerical simulation of a three-story structure with magneto-rheological dampers demonstrates the effectiveness of the proposed fault-tolerant control system. It was shown that the fault-tolerant control system maintains the performance of the structure at an acceptable level in the post-fault case.

  7. Minimizing resource overheads for fault-tolerant preparation of encoded states of the Steane code.

    PubMed

    Goto, Hayato

    2016-01-27

    The seven-qubit quantum error-correcting code originally proposed by Steane is one of the best known quantum codes. The Steane code has a desirable property that most basic operations can be performed easily in a fault-tolerant manner. A major obstacle to fault-tolerant quantum computation with the Steane code is fault-tolerant preparation of encoded states, which requires large computational resources. Here we propose efficient state preparation methods for zero and magic states encoded with the Steane code, where the zero state is one of the computational basis states and the magic state allows us to achieve universality in fault-tolerant quantum computation. The methods minimize resource overheads for the fault-tolerant state preparation, and therefore reduce necessary resources for quantum computation with the Steane code. Thus, the present results will open a new possibility for efficient fault-tolerant quantum computation.

  8. Design and simulation of advanced fault tolerant flight control schemes

    NASA Astrophysics Data System (ADS)

    Gururajan, Srikanth

    This research effort describes the design and simulation of a distributed Neural Network (NN) based fault tolerant flight control scheme and the interface of the scheme within a simulation/visualization environment. The goal of the fault tolerant flight control scheme is to recover an aircraft from failures to its sensors or actuators. A commercially available simulation package, Aviator Visual Design Simulator (AVDS), was used for the purpose of simulation and visualization of the aircraft dynamics and the performance of the control schemes. For the purpose of the sensor failure detection, identification and accommodation (SFDIA) task, it is assumed that the pitch, roll and yaw rate gyros onboard are without physical redundancy. The task is accomplished through the use of a Main Neural Network (MNN) and a set of three De-Centralized Neural Networks (DNNs), providing analytical redundancy for the pitch, roll and yaw gyros. The purpose of the MNN is to detect a sensor failure while the purpose of the DNNs is to identify the failed sensor and then to provide failure accommodation. The actuator failure detection, identification and accommodation (AFDIA) scheme also features the MNN, for detection of actuator failures, along with three Neural Network Controllers (NNCs) for providing the compensating control surface deflections to neutralize the failure induced pitching, rolling and yawing moments. All NNs continue to train on-line, in addition to an offline trained baseline network structure, using the Extended Back-Propagation Algorithm (EBPA), with the flight data provided by the AVDS simulation package. The above mentioned adaptive flight control schemes have been traditionally implemented sequentially on a single computer. This research addresses the implementation of these fault tolerant flight control schemes on parallel and distributed computer architectures, using Berkeley Software Distribution (BSD) sockets and Message Passing Interface (MPI) for inter

  9. Modeling the Fault Tolerant Capability of a Flight Control System: An Exercise in SCR Specification

    NASA Technical Reports Server (NTRS)

    Alexander, Chris; Cortellessa, Vittorio; DelGobbo, Diego; Mili, Ali; Napolitano, Marcello

    2000-01-01

    In life-critical and mission-critical applications, it is important to make provisions for a wide range of contingencies, by providing means for fault tolerance. In this paper, we discuss the specification of a flight control system that is fault tolerant with respect to sensor faults. Redundancy is provided by analytical relations that hold between sensor readings; depending on the conditions, this redundancy can be used to detect, identify and accommodate sensor faults.

  10. Reliability analysis of fault-tolerant reconfigurable nano-architectures

    SciTech Connect

    Bhaduri, D.; Graham, P. S.; Shukla, S. K.

    2004-01-01

    Manufacturing defects and transient errors will be abundant in high - density reconfigurable nano-scale designs. Recently, we have automated a computational scheme based on Markov Random Field (MRF) and Belief Propagation algorithms in a tool named NANOLAB to evaluate the reliability of nano architectures. In this paper, we show how our methodology can be exploited to design defect- and fault-tolerant programmable logic architectures. The effectiveness of such automation is illustrated by analyzing reconfigurable Boolean networks formed using different industry-based configurable logic blocks (CLBs), both in the presence of thermal perturbations and signal noise.

  11. A Test Generation Framework for Distributed Fault-Tolerant Algorithms

    NASA Technical Reports Server (NTRS)

    Goodloe, Alwyn; Bushnell, David; Miner, Paul; Pasareanu, Corina S.

    2009-01-01

    Heavyweight formal methods such as theorem proving have been successfully applied to the analysis of safety critical fault-tolerant systems. Typically, the models and proofs performed during such analysis do not inform the testing process of actual implementations. We propose a framework for generating test vectors from specifications written in the Prototype Verification System (PVS). The methodology uses a translator to produce a Java prototype from a PVS specification. Symbolic (Java) PathFinder is then employed to generate a collection of test cases. A small example is employed to illustrate how the framework can be used in practice.

  12. Software Implemented Fault-Tolerant (SIFT) user's guide

    NASA Technical Reports Server (NTRS)

    Green, D. F., Jr.; Palumbo, D. L.; Baltrus, D. W.

    1984-01-01

    Program development for a Software Implemented Fault Tolerant (SIFT) computer system is accomplished in the NASA LaRC AIRLAB facility using a DEC VAX-11 to interface with eight Bendix BDX 930 flight control processors. The interface software which provides this SIFT program development capability was developed by AIRLAB personnel. This technical memorandum describes the application and design of this software in detail, and is intended to assist both the user in performance of SIFT research and the systems programmer responsible for maintaining and/or upgrading the SIFT programming environment.

  13. Subsystem fault tolerance with the Bacon-Shor code.

    PubMed

    Aliferis, Panos; Cross, Andrew W

    2007-06-01

    We discuss how the presence of gauge subsystems in the Bacon-Shor code [D. Bacon, Phys. Rev. A 73, 012340 (2006)10.1103/PhysRevA.73.012340 (2006)] leads to remarkably simple and efficient methods for fault-tolerant error correction (FTEC). Most notably, FTEC does not require entangled ancillary states, and it can be implemented with nearest-neighbor two-qubit measurements. By using these methods, we prove a lower bound on the quantum accuracy threshold, 1.94 x 10(-4) for adversarial stochastic noise, that improves previous lower bounds by nearly an order of magnitude.

  14. Validation Methods for Fault-Tolerant avionics and control systems, working group meeting 1

    NASA Technical Reports Server (NTRS)

    1979-01-01

    The proceedings of the first working group meeting on validation methods for fault tolerant computer design are presented. The state of the art in fault tolerant computer validation was examined in order to provide a framework for future discussions concerning research issues for the validation of fault tolerant avionics and flight control systems. The development of positions concerning critical aspects of the validation process are given.

  15. ROSE::FTTransform - A Source-to-Source Translation Framework for Exascale Fault-Tolerance Research

    SciTech Connect

    Lidman, J; Quinlan, D; Liao, C; McKee, S

    2012-03-26

    Exascale computing systems will require sufficient resilience to tolerate numerous types of hardware faults while still assuring correct program execution. Such extreme-scale machines are expected to be dominated by processors driven at lower voltages (near the minimum 0.5 volts for current transistors). At these voltage levels, the rate of transient errors increases dramatically due to the sensitivity to transient and geographically localized voltage drops on parts of the processor chip. To achieve power efficiency, these processors are likely to be streamlined and minimal, and thus they cannot be expected to handle transient errors entirely in hardware. Here we present an open, compiler-based framework to automate the armoring of High Performance Computing (HPC) software to protect it from these types of transient processor errors. We develop an open infrastructure to support research work in this area, and we define tools that, in the future, may provide more complete automated and/or semi-automated solutions to support software resiliency on future exascale architectures. Results demonstrate that our approach is feasible, pragmatic in how it can be separated from the software development process, and reasonably efficient (0% to 30% overhead for the Jacobi iteration on common hardware; and 20%, 40%, 26%, and 2% overhead for a randomly selected subset of benchmarks from the Livermore Loops [1]).

  16. The BTeV DAQ and Trigger System - Some throughput, usability and fault tolerance aspects

    SciTech Connect

    Erik Edward Gottschalk et al.

    2001-08-20

    As presented at the last CHEP conference, the BTeV triggering and data collection pose a significant challenge in construction and operation, generating 1.5 Terabytes/second of raw data from over 30 million detector channels. We report on facets of the DAQ and trigger farms. We report on the current design of the DAQ, especially its partitioning features to support commissioning of the detector. We are exploring collaborations with computer science groups experienced in fault tolerant and dynamic real-time and embedded systems to develop a system to provide the extreme flexibility and high availability required of the heterogeneous trigger farm ({approximately} ten thousand DSPs and commodity processors). We describe directions in the following areas: system modeling and analysis using the Model Integrated Computing approach to assist in the creation of domain-specific modeling, analysis, and program synthesis environments for building complex, large-scale computer-based systems; System Configuration Management to include compilable design specifications for configurable hardware components, schedules, and communication maps; Runtime Environment and Hierarchical Fault Detection/Management--a system-wide infrastructure for rapidly detecting, isolating, filtering, and reporting faults which will be encapsulated in intelligent active entities (agents) to run on DSPs, L2/3 processors, and other supporting processors throughout the system.

  17. Verification of the FtCayuga fault-tolerant microprocessor system. Volume 2: Formal specification and correctness theorems

    NASA Technical Reports Server (NTRS)

    Bickford, Mark; Srivas, Mandayam

    1991-01-01

    Presented here is a formal specification and verification of a property of a quadruplicately redundant fault tolerant microprocessor system design. A complete listing of the formal specification of the system and the correctness theorems that are proved are given. The system performs the task of obtaining interactive consistency among the processors using a special instruction on the processors. The design is based on an algorithm proposed by Pease, Shostak, and Lamport. The property verified insures that an execution of the special instruction by the processors correctly accomplishes interactive consistency, providing certain preconditions hold, using a computer aided design verification tool, Spectool, and the theorem prover, Clio. A major contribution of the work is the demonstration of a significant fault tolerant hardware design that is mechanically verified by a theorem prover.

  18. Multi-Core Technology for and Fault Tolerant High-Performance Spacecraft Computer Systems

    NASA Astrophysics Data System (ADS)

    Behr, Peter M.; Haulsen, Ivo; Van Kampenhout, J. Reinier; Pletner, Samuel

    2012-08-01

    The current architectural trends in the field of multi-core processors can provide an enormous increase in processing power by exploiting the parallelism available in many applications. In particular because of their high energy efficiency, it is obvious that multi-core processor-based systems will also be used in future space missions. In this paper we present the system architecture of a powerful optical sensor system based on the eight core multi-core processor P4080 from Freescale. The fault tolerant structure and the highly effective FDIR concepts implemented on different hardware and software levels of the system are described in detail. The space application scenario and thus the main requirements for the sensor system have been defined by a complex tracking sensor application for autonomous landing or docking manoeuvres.

  19. Gapped boundaries, group cohomology and fault-tolerant logical gates

    NASA Astrophysics Data System (ADS)

    Yoshida, Beni

    2017-02-01

    This paper attempts to establish the connection among classifications of gapped boundaries in topological phases of matter, bosonic symmetry-protected topological (SPT) phases and fault-tolerantly implementable logical gates in quantum error-correcting codes. We begin by presenting constructions of gapped boundaries for the d-dimensional quantum double model by using d-cocycles functions (d ≥ 2). We point out that the system supports m-dimensional excitations (m < d), which we shall call fluctuating charges, that are superpositions of point-like electric charges characterized by m-dimensional bosonic SPT wavefunctions. There exist gapped boundaries where electric charges or magnetic fluxes may not condense by themselves, but may condense only when accompanied by fluctuating charges. Magnetic fluxes and codimension-2 fluctuating charges exhibit non-trivial multi-excitation braiding statistics, involving more than two excitations. The statistical angle can be computed by taking slant products of underlying cocycle functions sequentially. We find that excitations that may condense into a gapped boundary can be characterized by trivial multi-excitation braiding statistics, generalizing the notion of the Lagrangian subgroup. As an application, we construct fault-tolerantly implementable logical gates for the d-dimensional quantum double model by using d-cocycle functions. Namely, corresponding logical gates belong to the dth level of the Clifford hierarchy, but are outside of the (d - 1) th level, if cocycle functions have non-trivial sequences of slant products.

  20. Fault tolerant, radiation hard, high performance digital signal processor

    NASA Technical Reports Server (NTRS)

    Holmann, Edgar; Linscott, Ivan R.; Maurer, Michael J.; Tyler, G. L.; Libby, Vibeke

    1990-01-01

    An architecture has been developed for a high-performance VLSI digital signal processor that is highly reliable, fault-tolerant, and radiation-hard. The signal processor, part of a spacecraft receiver designed to support uplink radio science experiments at the outer planets, organizes the connections between redundant arithmetic resources, register files, and memory through a shuffle exchange communication network. The configuration of the network and the state of the processor resources are all under microprogram control, which both maps the resources according to algorithmic needs and reconfigures the processing should a failure occur. In addition, the microprogram is reloadable through the uplink to accommodate changes in the science objectives throughout the course of the mission. The processor will be implemented with silicon compiler tools, and its design will be verified through silicon compilation simulation at all levels from the resources to full functionality. By blending reconfiguration with redundancy the processor implementation is fault-tolerant and reliable, and possesses the long expected lifetime needed for a spacecraft mission to the outer planets.

  1. Distributed Evaluation Functions for Fault Tolerant Multi-Rover Systems

    NASA Technical Reports Server (NTRS)

    Agogino, Adrian; Turner, Kagan

    2005-01-01

    The ability to evolve fault tolerant control strategies for large collections of agents is critical to the successful application of evolutionary strategies to domains where failures are common. Furthermore, while evolutionary algorithms have been highly successful in discovering single-agent control strategies, extending such algorithms to multiagent domains has proven to be difficult. In this paper we present a method for shaping evaluation functions for agents that provide control strategies that both are tolerant to different types of failures and lead to coordinated behavior in a multi-agent setting. This method neither relies of a centralized strategy (susceptible to single point of failures) nor a distributed strategy where each agent uses a system wide evaluation function (severe credit assignment problem). In a multi-rover problem, we show that agents using our agent-specific evaluation perform up to 500% better than agents using the system evaluation. In addition we show that agents are still able to maintain a high level of performance when up to 60% of the agents fail due to actuator, communication or controller faults.

  2. Performance and economy of a fault-tolerant multiprocessor

    NASA Technical Reports Server (NTRS)

    Lala, J. H.; Smith, C. J.

    1979-01-01

    The FTMP (Fault-Tolerant Multiprocessor) is one of two central aircraft fault-tolerant architectures now in the prototype phase under NASA sponsorship. The intended application of the computer includes such critical real-time tasks as 'fly-by-wire' active control and completely automatic Category III landings of commercial aircraft. The FTMP architecture is briefly described and it is shown that it is a viable solution to the multi-faceted problems of safety, speed, and cost. Three job dispatch strategies are described, and their results with respect to job-starting delay are presented. The first strategy is a simple First-Come-First-Serve (FCFS) job dispatch executive. The other two schedulers are an adaptive FCFS and an interrupt driven scheduler. Three failure modes are discussed, and the FTMP survival probability in the face of random hard failures is evaluated. It is noted that the hourly cost of operating two FTMPs in a transport aircraft can be as little as one-to-two percent of the total flight-hour cost of the aircraft.

  3. The Design of a Fault-Tolerant COTS-Based Bus Architecture for Space Applications

    NASA Technical Reports Server (NTRS)

    Chau, Savio N.; Alkalai, Leon; Tai, Ann T.

    2000-01-01

    The high-performance, scalability and miniaturization requirements together with the power, mass and cost constraints mandate the use of commercial-off-the-shelf (COTS) components and standards in the X2000 avionics system architecture for deep-space missions. In this paper, we report our experiences and findings on the design of an IEEE 1394 compliant fault-tolerant COTS-based bus architecture. While the COTS standard IEEE 1394 adequately supports power management, high performance and scalability, its topological criteria impose restrictions on fault tolerance realization. To circumvent the difficulties, we derive a "stack-tree" topology that not only complies with the IEEE 1394 standard but also facilitates fault tolerance realization in a spaceborne system with limited dedicated resource redundancies. Moreover, by exploiting pertinent standard features of the 1394 interface which are not purposely designed for fault tolerance, we devise a comprehensive set of fault detection mechanisms to support the fault-tolerant bus architecture.

  4. Robust and fault tolerant control of modular and reconfigurable robots

    NASA Astrophysics Data System (ADS)

    Abdul, Sajan

    Modular and reconfigurable robot has been one of the main areas of robotics research in recent years due to its wide range of applications, especially in aerospace sector. Dynamic control of manipulators can be performed using joint torque sensing with little information of the link dynamics. From the modular robot perspective, this advantage offered by the torque sensor can be taken to enhance the modularity of the control system. Known modular robots though boast novel and diverse mechanical design on joint modules in one way or another, they still require the whole robot dynamic model for motion control, and modularity offered in the mechanical side does not offer any advantage in the control design. In this work, a modular distributed control technique is formulated for modular and reconfigurable robots that can instantly adapt to robot reconfigurations. Under this control methodology, a modular and reconfigurable robot is stabilized joint by joint, and modules can be added or removed without the need of re-tuning the controller. Model uncertainties associated with load and links are compensated by the use of joint torque sensors. Other model uncertainties at each joint module are compensated by a decomposition based robust controller for each module. The proposed distributed control technique offers a 'modular' approach, featuring a unique joint-by-joint control synthesis of the joint modules. Fault tolerance and fault detection are formulated as a decentralized control problem for modular and reconfigurable robots in this thesis work. The modularity of the system is exploited to derive a strategy dependent only on a single joint module, while eliminating the need for the motion states of other joint modules. While the traditional fault tolerant and detection schemes are suitable for robots with the whole dynamic model, this proposed technique is ideal for modular and reconfigurable robots because of its modular nature. The proposed methods have been

  5. Development and evaluation of a fault-tolerant multiprocessor (FTMP) computer. Volume 4: FTMP executive summary

    NASA Technical Reports Server (NTRS)

    Smith, T. B., III; Lala, J. H.

    1984-01-01

    The FTMP architecture is a high reliability computer concept modeled after a homogeneous multiprocessor architecture. Elements of the FTMP are operated in tight synchronism with one another and hardware fault-detection and fault-masking is provided which is transparent to the software. Operating system design and user software design is thus greatly simplified. Performance of the FTMP is also comparable to that of a simplex equivalent due to the efficiency of fault handling hardware. The FTMP project constructed an engineering module of the FTMP, programmed the machine and extensively tested the architecture through fault injection and other stress testing. This testing confirmed the soundness of the FTMP concepts.

  6. Fault tolerant cooperative control for UAV rendezvous problem subject to actuator faults

    NASA Astrophysics Data System (ADS)

    Jiang, T.; Meskin, N.; Sobhani-Tehrani, E.; Khorasani, K.; Rabbath, C. A.

    2007-04-01

    This paper investigates the problem of fault tolerant cooperative control for UAV rendezvous problem in which multiple UAVs are required to arrive at their designated target despite presence of a fault in the thruster of any UAV. An integrated hierarchical scheme is proposed and developed that consists of a cooperative rendezvous planning algorithm at the team level and a nonlinear fault detection and isolation (FDI) subsystem at individual UAV's actuator/sensor level. Furthermore, a rendezvous re-planning strategy is developed that interfaces the rendezvous planning algorithm with the low-level FDI. A nonlinear geometric approach is used for the FDI subsystem that can detect and isolate faults in various UAV actuators including thrusters and control surfaces. The developed scheme is implemented for a rendezvous scenario with three Aerosonde UAVs, a single target, and presence of a priori known threats. Simulation results reveal the effectiveness of our proposed scheme in fulfilling the rendezvous mission objective that is specified as a successful intercept of Aerosondes at their designated target, despite the presence of severe loss of effectiveness in Aerosondes engine thrusters.

  7. A universal, fault-tolerant, non-linear analytic network for modeling and fault detection

    SciTech Connect

    Mott, J.E. ); King, R.W.; Monson, L.R.; Olson, D.L.; Staffon, J.D. )

    1992-03-06

    The similarities and differences of a universal network to normal neural networks are outlined. The description and application of a universal network is discussed by showing how a simple linear system is modeled by normal techniques and by universal network techniques. A full implementation of the universal network as universal process modeling software on a dedicated computer system at EBR-II is described and example results are presented. It is concluded that the universal network provides different feature recognition capabilities than a neural network and that the universal network can provide extremely fast, accurate, and fault-tolerant estimation, validation, and replacement of signals in a real system.

  8. 14 CFR Special Federal Aviation... - Fuel Tank System Fault Tolerance Evaluation Requirements

    Code of Federal Regulations, 2011 CFR

    2011-01-01

    ... 14 Aeronautics and Space 1 2011-01-01 2011-01-01 false Fuel Tank System Fault Tolerance Evaluation Requirements Federal Special Federal Aviation Regulation No. 88 Aeronautics and Space FEDERAL AVIATION..., SFAR No. 88 Special Federal Aviation Regulation No. 88—Fuel Tank System Fault Tolerance...

  9. 14 CFR Special Federal Aviation... - Fuel Tank System Fault Tolerance Evaluation Requirements

    Code of Federal Regulations, 2012 CFR

    2012-01-01

    ... 14 Aeronautics and Space 1 2012-01-01 2012-01-01 false Fuel Tank System Fault Tolerance Evaluation Requirements Federal Special Federal Aviation Regulation No. 88 Aeronautics and Space FEDERAL AVIATION..., SFAR No. 88 Special Federal Aviation Regulation No. 88—Fuel Tank System Fault Tolerance...

  10. 14 CFR Special Federal Aviation... - Fuel Tank System Fault Tolerance Evaluation Requirements

    Code of Federal Regulations, 2014 CFR

    2014-01-01

    ... 14 Aeronautics and Space 1 2014-01-01 2014-01-01 false Fuel Tank System Fault Tolerance Evaluation Requirements Federal Special Federal Aviation Regulation No. 88 Aeronautics and Space FEDERAL AVIATION..., SFAR No. 88 Special Federal Aviation Regulation No. 88—Fuel Tank System Fault Tolerance...

  11. 14 CFR Special Federal Aviation... - Fuel Tank System Fault Tolerance Evaluation Requirements

    Code of Federal Regulations, 2010 CFR

    2010-01-01

    ... 14 Aeronautics and Space 1 2010-01-01 2010-01-01 false Fuel Tank System Fault Tolerance Evaluation Requirements Federal Special Federal Aviation Regulation No. 88 Aeronautics and Space FEDERAL AVIATION..., SFAR No. 88 Special Federal Aviation Regulation No. 88—Fuel Tank System Fault Tolerance...

  12. 14 CFR Special Federal Aviation... - Fuel Tank System Fault Tolerance Evaluation Requirements

    Code of Federal Regulations, 2013 CFR

    2013-01-01

    ... 14 Aeronautics and Space 1 2013-01-01 2013-01-01 false Fuel Tank System Fault Tolerance Evaluation Requirements Federal Special Federal Aviation Regulation No. 88 Aeronautics and Space FEDERAL AVIATION..., SFAR No. 88 Special Federal Aviation Regulation No. 88—Fuel Tank System Fault Tolerance...

  13. A survey of NASA and military standards on fault tolerance and reliability applied to robotics

    NASA Technical Reports Server (NTRS)

    Cavallaro, Joseph R.; Walker, Ian D.

    1994-01-01

    There is currently increasing interest and activity in the area of reliability and fault tolerance for robotics. This paper discusses the application of Standards in robot reliability, and surveys the literature of relevant existing standards. A bibliography of relevant Military and NASA standards for reliability and fault tolerance is included.

  14. Design and evaluation of fault-tolerant VLSI/WSI processor arrays. Final technical report, 1 July 1985-31 December 1987

    SciTech Connect

    Fortes, J.A.

    1987-12-31

    This document is the final report of work performed under the project entitled Design and Evaluation of Fault-Tolerant VLSI/WSI Processor Arrays supported by the Innovative Science and Technology Office of the Strategic Defense Initiative Organization and administered through the Office of Naval Research under Contract No. 00014-85-k-0588. With the concurrence of Dr. Clifford Lau, the Scientific Officer for this project, this final report consists of reprints of publications reporting work performed under the project. In the attached list of publications are papers where fault-tolerant systems for processor arrays are proposed and studied. Studies on algorithmic and software aspects relevant to the systems are also reported, as well as hardware and reconfigurability issues for fault-tolerant processor arrays.

  15. High-Intensity Radiated Field Fault-Injection Experiment for a Fault-Tolerant Distributed Communication System

    NASA Technical Reports Server (NTRS)

    Yates, Amy M.; Torres-Pomales, Wilfredo; Malekpour, Mahyar R.; Gonzalez, Oscar R.; Gray, W. Steven

    2010-01-01

    Safety-critical distributed flight control systems require robustness in the presence of faults. In general, these systems consist of a number of input/output (I/O) and computation nodes interacting through a fault-tolerant data communication system. The communication system transfers sensor data and control commands and can handle most faults under typical operating conditions. However, the performance of the closed-loop system can be adversely affected as a result of operating in harsh environments. In particular, High-Intensity Radiated Field (HIRF) environments have the potential to cause random fault manifestations in individual avionic components and to generate simultaneous system-wide communication faults that overwhelm existing fault management mechanisms. This paper presents the design of an experiment conducted at the NASA Langley Research Center's HIRF Laboratory to statistically characterize the faults that a HIRF environment can trigger on a single node of a distributed flight control system.

  16. Sensor Fault and Delay Tolerant Control for Networked Control Systems Subject to External Disturbances.

    PubMed

    Han, Shi-Yuan; Chen, Yue-Hui; Tang, Gong-You

    2017-03-28

    In this paper, the problem of sensor fault and delay tolerant control problem for a class of networked control systems under external disturbances is investigated. More precisely, the dynamic characteristics of the external disturbance and sensor fault are described as the output of exogenous systems first. The original sensor fault and delay tolerant control problem is reformulated as an equivalence problem with designed available system output and reformed performance index. The feedforward and feedback sensor fault tolerant controller (FFSFTC) can be obtained by utilizing the solutions of Riccati matrix equation and Stein matrix equation. Based on the designed fault diagnoser, the proposed FFSFTC is further reconstructed to compensate for the sensor fault and delayed measurement effects. Finally, numerical examples are provided to illustrate the effectiveness of our proposed FFSFTC with different cases with various types of sensor faults, measurement delays and external disturbances.

  17. Fault-Tolerant Conversion between the Steane and Reed-Muller Quantum Codes

    NASA Astrophysics Data System (ADS)

    Anderson, Jonas T.; Duclos-Cianci, Guillaume; Poulin, David

    2014-08-01

    Steane's 7-qubit quantum error-correcting code admits a set of fault-tolerant gates that generate the Clifford group, which in itself is not universal for quantum computation. The 15-qubit Reed-Muller code also does not admit a universal fault-tolerant gate set but possesses fault-tolerant T and control-control-Z gates. Combined with the Clifford group, either of these two gates generates a universal set. Here, we combine these two features by demonstrating how to fault-tolerantly convert between these two codes, providing a new method to realize universal fault-tolerant quantum computation. One interpretation of our result is that both codes correspond to the same subsystem code in different gauges. Our scheme extends to the entire family of quantum Reed-Muller codes.

  18. Fault tolerant system with imperfect coverage, reboot and server vacation

    NASA Astrophysics Data System (ADS)

    Jain, Madhu; Meena, Rakesh Kumar

    2016-12-01

    This study is concerned with the performance modeling of a fault tolerant system consisting of operating units supported by a combination of warm and cold spares. The on-line as well as warm standby units are subject to failures and are send for the repair to a repair facility having single repairman which is prone to failure. If the failed unit is not detected, the system enters into an unsafe state from which it is cleared by the reboot and recovery action. The server is allowed to go for vacation if there is no failed unit present in the system. Markov model is developed to obtain the transient probabilities associated with the system states. Runge-Kutta method is used to evaluate the system state probabilities and queueing measures. To explore the sensitivity and cost associated with the system, numerical simulation is conducted.

  19. Experimental magic state distillation for fault-tolerant quantum computing.

    PubMed

    Souza, Alexandre M; Zhang, Jingfu; Ryan, Colm A; Laflamme, Raymond

    2011-01-25

    Any physical quantum device for quantum information processing (QIP) is subject to errors in implementation. In order to be reliable and efficient, quantum computers will need error-correcting or error-avoiding methods. Fault-tolerance achieved through quantum error correction will be an integral part of quantum computers. Of the many methods that have been discovered to implement it, a highly successful approach has been to use transversal gates and specific initial states. A critical element for its implementation is the availability of high-fidelity initial states, such as |0〉 and the 'magic state'. Here, we report an experiment, performed in a nuclear magnetic resonance (NMR) quantum processor, showing sufficient quantum control to improve the fidelity of imperfect initial magic states by distilling five of them into one with higher fidelity.

  20. A Universal Operator Theoretic Framework for Quantum Fault Tolerance.

    NASA Astrophysics Data System (ADS)

    Gilbert, Gerald; Calderbank, Robert; Aggarwal, Vaneet; Hamrick, Michael; Weinstein, Yaakov

    2008-03-01

    We introduce a universal operator theoretic framework for quantum fault tolerance. This incorporates a top-down approach that implements a system-level criterion based on specification of the full system dynamics, applied at every level of error correction concatenation. This leads to more accurate determinations of error thresholds than could previously be obtained. The basis for the approach is the Quantum Computer Condition (QCC), an inequality governing the evolution of a quantum computer. In addition to more accurate determination of error threshold values, we show that the QCC provides a means to systematically determine optimality (or non-optimality) of different choices of error correction coding and error avoidance strategies. This is possible because, as we show, all known coding schemes are actually special cases of the QCC. We demonstrate this by introducing a new, operator theoretic form of entanglement assisted quantum error correction.

  1. Fault tolerant vector control of induction motor drive

    NASA Astrophysics Data System (ADS)

    Odnokopylov, G.; Bragin, A.

    2014-10-01

    For electric composed of technical objects hazardous industries, such as nuclear, military, chemical, etc. an urgent task is to increase their resiliency and survivability. The construction principle of vector control system fault-tolerant asynchronous electric. Displaying recovery efficiency three-phase induction motor drive in emergency mode using two-phase vector control system. The process of formation of a simulation model of the asynchronous electric unbalance in emergency mode. When modeling used coordinate transformation, providing emergency operation electric unbalance work. The results of modeling transient phase loss motor stator. During a power failure phase induction motor cannot save circular rotating field in the air gap of the motor and ensure the restoration of its efficiency at rated torque and speed.

  2. Buffered coscheduling for parallel programming and enhanced fault tolerance

    SciTech Connect

    Petrini, Fabrizio; Feng, Wu-chun

    2006-01-31

    A computer implemented method schedules processor jobs on a network of parallel machine processors or distributed system processors. Control information communications generated by each process performed by each processor during a defined time interval is accumulated in buffers, where adjacent time intervals are separated by strobe intervals for a global exchange of control information. A global exchange of the control information communications at the end of each defined time interval is performed during an intervening strobe interval so that each processor is informed by all of the other processors of the number of incoming jobs to be received by each processor in a subsequent time interval. The buffered coscheduling method of this invention also enhances the fault tolerance of a network of parallel machine processors or distributed system processors

  3. Evolution of shuttle avionics redundancy management/fault tolerance

    NASA Technical Reports Server (NTRS)

    Boykin, J. C.; Thibodeau, J. R.; Schneider, H. E.

    1985-01-01

    The challenge of providing redundancy management (RM) and fault tolerance to meet the Shuttle Program requirements of fail operational/fail safe for the avionics systems was complicated by the critical program constraints of weight, cost, and schedule. The basic and sometimes false effectivity of less than pure RM designs is addressed. Evolution of the multiple input selection filter (the heart of the RM function) is discussed with emphasis on the subtle interactions of the flight control system that were found to be potentially catastrophic. Several other general RM development problems are discussed, with particular emphasis on the inertial measurement unit RM, indicative of the complexity of managing that three string system and its critical interfaces with the guidance and control systems.

  4. Validating Requirements for Fault Tolerant Systems Using Model Checking

    NASA Technical Reports Server (NTRS)

    Schneider, Francis; Easterbrook, Steve M.; Callahan, John R.; Holzmann, Gerard J.

    1997-01-01

    Model checking is shown to be an effective tool in validating the behavior of a fault tolerant embedded spacecraft controller. The case study presented here shows that by judiciously abstracting away extraneous complexity, the state space of the model could be exhaustively searched allowing critical functional requirements to be validated down to the design level. Abstracting away detail not germane to the problem of interest leaves by definition a partial specification behind. The success of this procedure shows that it is feasible to effectively validate a partial specification with this technique. Three anomalies were found in the system one of which is an error in the detailed requirements, and the other two are missing/ambiguous requirements. Because the method allows validation of partial specifications, it also is an effective methodology towards maintaining fidelity between a co-evolving specification and an implementation.

  5. Scheme for fault-tolerant holonomic computation on stabilizer codes

    NASA Astrophysics Data System (ADS)

    Oreshkov, Ognyan; Brun, Todd A.; Lidar, Daniel A.

    2009-08-01

    This paper generalizes and expands upon the work [O. Oreshkov, T. A. Brun, and D. A. Lidar, Phys. Rev. Lett. 102, 070502 (2009)] where we introduced a scheme for fault-tolerant holonomic quantum computation (HQC) on stabilizer codes. HQC is an all-geometric strategy based on non-Abelian adiabatic holonomies, which is known to be robust against various types of errors in the control parameters. The scheme we present shows that HQC is a scalable method of computation and opens the possibility for combining the benefits of error correction with the inherent resilience of the holonomic approach. We show that with the Bacon-Shor code the scheme can be implemented using Hamiltonian operators of weights 2 and 3.

  6. Fault tolerant quantum random number generator certified by Majorana fermions

    NASA Astrophysics Data System (ADS)

    Deng, Dong-Ling; Duan, Lu-Ming

    2013-03-01

    Braiding of Majorana fermions gives accurate topological quantum operations that are intrinsically robust to noise and imperfection, providing a natural method to realize fault-tolerant quantum information processing. Unfortunately, it is known that braiding of Majorana fermions is not sufficient for implementation of universal quantum computation. Here we show that topological manipulation of Majorana fermions provides the full set of operations required to generate random numbers by way of quantum mechanics and to certify its genuine randomness through violation of a multipartite Bell inequality. The result opens a new perspective to apply Majorana fermions for robust generation of certified random numbers, which has important applications in cryptography and other related areas. This work was supported by the NBRPC (973 Program) 2011CBA00300 (2011CBA00302), the IARPA MUSIQC program, the ARO and the AFOSR MURI program.

  7. Advanced information processing system - Status report. [for fault tolerant and damage tolerant data processing for aerospace vehicles

    NASA Technical Reports Server (NTRS)

    Brock, L. D.; Lala, J.

    1986-01-01

    The Advanced Information Processing System (AIPS) is designed to provide a fault tolerant and damage tolerant data processing architecture for a broad range of aerospace vehicles. The AIPS architecture also has attributes to enhance system effectiveness such as graceful degradation, growth and change tolerance, integrability, etc. Two key building blocks being developed by the AIPS program are a fault and damage tolerant processor and communication network. A proof-of-concept system is now being built and will be tested to demonstrate the validity and performance of the AIPS concepts.

  8. Analysis of GPS Abnormal Conditions within Fault Tolerant Control Laws

    NASA Astrophysics Data System (ADS)

    Al-Sinbol, Gahssan

    The Global Position System (GPS) is a critical element for the functionality of autonomous flying vehicles. The GPS operation at normal and abnormal conditions directly impacts the trajectory tracking performance of the autonomous Unmanned Aerial Vehicles (UAVs) controllers. The effects of GPS parameter variation must be well understood and user-friendly computational tools must be developed to facilitate the design and evaluation of fault tolerant control laws. This thesis presents the development of a simplified GPS error model in Matlab/Simulink and its use performing a sensitivity analysis of GPS parameters effect under system normal and abnormal operation on different UAV trajectory tracking controllers. The model statistically generates position and velocity errors, simulates the effect of GPS satellite configuration on the position and velocity measurement accuracy, and implements a set of failures to the GPS readings. The model and its graphical user interface was integrated within the WVU UAV simulation environment as a masked Simulink block. The effects on the controllers' trajectory tracking performance of the following GPS parameters were investigated within normal operation ranges and outside: time delay, update rate, error standard deviation, bias, and major position and velocity failures. Several sets of control laws with fixed and adaptive parameters and of different levels of complexity have been used in this investigation. A complex performance index formulated in terms of tracking errors and control activity was used for control laws performance evaluation. The composition of various metrics within the performance index was performed using fixed and variable weights depending on the local characteristics of the commanded trajectory. This study has revealed that GPS error parameters have a significant impact on control laws performance. The proposed GPS model has proved to be a valuable, flexible tool for testing and evaluation of the fault

  9. A Fault Tolerance Mechanism for On-Road Sensor Networks

    PubMed Central

    Feng, Lei; Guo, Shaoyong; Sun, Jialu; Yu, Peng; Li, Wenjing

    2016-01-01

    On-Road Sensor Networks (ORSNs) play an important role in capturing traffic flow data for predicting short-term traffic patterns, driving assistance and self-driving vehicles. However, this kind of network is prone to large-scale communication failure if a few sensors physically fail. In this paper, to ensure that the network works normally, an effective fault-tolerance mechanism for ORSNs which mainly consists of backup on-road sensor deployment, redundant cluster head deployment and an adaptive failure detection and recovery method is proposed. Firstly, based on the N − x principle and the sensors’ failure rate, this paper formulates the backup sensor deployment problem in the form of a two-objective optimization, which explains the trade-off between the cost and fault resumption. In consideration of improving the network resilience further, this paper introduces a redundant cluster head deployment model according to the coverage constraint. Then a common solving method combining integer-continuing and sequential quadratic programming is explored to determine the optimal location of these two deployment problems. Moreover, an Adaptive Detection and Resume (ADR) protocol is deigned to recover the system communication through route and cluster adjustment if there is a backup on-road sensor mismatch. The final experiments show that our proposed mechanism can achieve an average 90% recovery rate and reduce the average number of failed sensors at most by 35.7%. PMID:27918483

  10. Advanced Information Processing System (AIPS)-based fault tolerant avionics architecture for launch vehicles

    NASA Technical Reports Server (NTRS)

    Lala, Jaynarayan H.; Harper, Richard E.; Jaskowiak, Kenneth R.; Rosch, Gene; Alger, Linda S.; Schor, Andrei L.

    1990-01-01

    An avionics architecture for the advanced launch system (ALS) that uses validated hardware and software building blocks developed under the advanced information processing system program is presented. The AIPS for ALS architecture defined is preliminary, and reliability requirements can be met by the AIPS hardware and software building blocks that are built using the state-of-the-art technology available in the 1992-93 time frame. The level of detail in the architecture definition reflects the level of detail available in the ALS requirements. As the avionics requirements are refined, the architecture can also be refined and defined in greater detail with the help of analysis and simulation tools. A useful methodology is demonstrated for investigating the impact of the avionics suite to the recurring cost of the ALS. It is shown that allowing the vehicle to launch with selected detected failures can potentially reduce the recurring launch costs. A comparative analysis shows that validated fault-tolerant avionics built out of Class B parts can result in lower life-cycle-cost in comparison to simplex avionics built out of Class S parts or other redundant architectures.

  11. Investigation of an Advanced Fault Tolerant Integrated Avionics System

    DTIC Science & Technology

    1986-03-01

    Fault Detection and Isolation 50 5.4.2 Cockpit Fault Monitoring and Reconfiguration 53 Logical...Management Design Considerations 5.2.2.1 Authority Hierarchy Redundancy management involves not only fault detection and isolation but action to deselect... Fault Detection and Isolation in the event of a fault in an active channel, three events must transpire: a) The fault must be detected, b) The

  12. Provable Transient Recovery for Frame-Based, Fault-Tolerant Computing Systems

    NASA Technical Reports Server (NTRS)

    DiVito, Ben L.; Butler, Ricky W.

    1992-01-01

    We present a formal verification of the transient fault recovery aspects of the Reliable Computing Platform (RCP), a fault-tolerant computing system architecture for digital flight control applications. The RCP uses NMR-style redundancy to mask faults and internal majority voting to purge the effects of transient faults. The system design has been formally specified and verified using the EHDM verification system. Our formalization accommodates a wide variety of voting schemes for purging the effects of transients.

  13. Efficient BDD-Based Planning for Non-Deterministic, Fault-Tolerant, and Adversarial Domains

    DTIC Science & Technology

    2003-06-01

    operator ( ). If the power is on (that is, AS> is = >@.A ), they increase or decrease the position, otherwise they cause no position change...tolerant planning and adversarial planning. Fault tolerant plan- ning addresses domains where non-determinism is caused by rare errors. The current...for synthesizing fault tolerant plans. Adversarial planning considers situations where non-determinism is caused by uncontrollable, but known

  14. Fault tolerance of artificial neural networks with applications in critical systems

    NASA Technical Reports Server (NTRS)

    Protzel, Peter W.; Palumbo, Daniel L.; Arras, Michael K.

    1992-01-01

    This paper investigates the fault tolerance characteristics of time continuous recurrent artificial neural networks (ANN) that can be used to solve optimization problems. The principle of operations and performance of these networks are first illustrated by using well-known model problems like the traveling salesman problem and the assignment problem. The ANNs are then subjected to 13 simultaneous 'stuck at 1' or 'stuck at 0' faults for network sizes of up to 900 'neurons'. The effects of these faults is demonstrated and the cause for the observed fault tolerance is discussed. An application is presented in which a network performs a critical task for a real-time distributed processing system by generating new task allocations during the reconfiguration of the system. The performance degradation of the ANN under the presence of faults is investigated by large-scale simulations, and the potential benefits of delegating a critical task to a fault tolerant network are discussed.

  15. Fault-tolerant VHDL descriptions: a case study for SEU-tolerant digital library

    NASA Astrophysics Data System (ADS)

    Tomczak, M.; Swiercz, B.; Napieralski, A.

    2006-10-01

    This paper presents a new cost-effective method of designing Single Event Upset (SEU)-tolerant digital systems based on Commercial-Off-The-Shelf (COTS) Field-Programmable-Gate-Array (FPGA) devices. The project was carried out in cooperation of Technical University of Lodz (TUL) with Deutsches Elektronen-Synchrotron (DESY). DESY is a high-energy particle physics research centre, located in Hamburg, Germany, and has been chosen as a home site for a new generation particle collider - X-Ray Free Electron Laser (X-FEL) accelerator. A need of implementing digital control systems inside accelerators main tunnel, brought a new hardware approach to low-cost design reliable compex circuits with respect to Single Event Effects (SEEs). The goal was to develop a high performance method without modifications in the FPGA architecture and without high area penalties. A SEU-tolerant, digital library has been created. From basic gates, through combinational and sequential cells to some more sophisticated units like memory blocks, code converters or arithmetical functions cells, in all elements upset detection and mitigation schemes have been implemented. The library was described in Very High Speed Integrated Circuit Hardware Description Language (VHDL).

  16. Actuator fault tolerant multi-controller scheme using set separation based diagnosis

    NASA Astrophysics Data System (ADS)

    Seron, María M.; De Doná, José A.

    2010-11-01

    We present a fault tolerant control strategy based on a new principle for actuator fault diagnosis. The scheme employs a standard bank of observers which match the different fault situations that can occur in the plant. Each of these observers has an associated estimation error with distinctive dynamics when an estimator matches the current fault situation of the plant. Based on the information from each observer, a fault detection and isolation (FDI) module is able to reconfigure the control loop by selecting the appropriate control law from a bank of controllers, each of them designed to stabilise and achieve reference tracking for one of the given fault models. The main contribution of this article is to propose a new FDI principle which exploits the separation of sets that characterise healthy system operation from sets that characterise transitions from healthy to faulty behaviour. The new principle allows to provide pre-checkable conditions for guaranteed fault tolerance of the overall multi-controller scheme.

  17. Fault-tolerant measurement-based quantum computing with continuous-variable cluster states.

    PubMed

    Menicucci, Nicolas C

    2014-03-28

    A long-standing open question about Gaussian continuous-variable cluster states is whether they enable fault-tolerant measurement-based quantum computation. The answer is yes. Initial squeezing in the cluster above a threshold value of 20.5 dB ensures that errors from finite squeezing acting on encoded qubits are below the fault-tolerance threshold of known qubit-based error-correcting codes. By concatenating with one of these codes and using ancilla-based error correction, fault-tolerant measurement-based quantum computation of theoretically indefinite length is possible with finitely squeezed cluster states.

  18. Survey and future directions of fault-tolerant distributed computing on board spacecraft

    NASA Astrophysics Data System (ADS)

    Fayyaz, Muhammad; Vladimirova, Tanya

    2016-12-01

    Current and future space missions demand highly reliable on-board computing systems, which are capable of carrying out high-performance data processing. At present, no single computing scheme satisfies both, the highly reliable operation requirement and the high-performance computing requirement. The aim of this paper is to review existing systems and offer a new approach to addressing the problem. In the first part of the paper, a detailed survey of fault-tolerant distributed computing systems for space applications is presented. Fault types and assessment criteria for fault-tolerant systems are introduced. Redundancy schemes for distributed systems are analyzed. A review of the state-of-the-art on fault-tolerant distributed systems is presented and limitations of current approaches are discussed. In the second part of the paper, a new fault-tolerant distributed computing platform with wireless links among the computing nodes is proposed. Novel algorithms, enabling important aspects of the architecture, such as time slot priority adaptive fault-tolerant channel access and fault-tolerant distributed computing using task migration are introduced.

  19. Design of a fault-tolerant decision-making system for biomedical applications.

    PubMed

    Faust, Oliver; Acharya, U Rajendra; Sputh, Bernhard H C; Tamura, Toshiyo

    2013-01-01

    This paper describes the design of a fault-tolerant classification system for medical applications. The design process follows the systems engineering methodology: in the agreement phase, we make the case for fault tolerance in diagnosis systems for biomedical applications. The argument extends the idea that machine diagnosis systems mimic the functionality of human decision-making, but in many cases they do not achieve the fault tolerance of the human brain. After making the case for fault tolerance, both requirements and specification for the fault-tolerant system are introduced before the implementation is discussed. The system is tested with fault and use cases to build up trust in the implemented system. This structured approach aided in the realisation of the fault-tolerant classification system. During the specification phase, we produced a formal model that enabled us to discuss what fault tolerance, reliability and safety mean for this particular classification system. Furthermore, such a formal basis for discussion is extremely useful during the initial stages of the design, because it helps to avoid big mistakes caused by a lack of overview later on in the project. During the implementation, we practiced component reuse by incorporating a reliable classification block, which was developed during a previous project, into the current design. Using a well-structured approach and practicing component reuse we follow best practice for both research and industry projects, which enabled us to realise the fault-tolerant classification system on time and within budget. This system can serve in a wide range of future health care systems.

  20. Design of fault-tolerant circuits for photovoltaic concentrators

    NASA Astrophysics Data System (ADS)

    Gonzalez, C. C.; Ross, R. G., Jr.

    1988-07-01

    A methodology for fault-tolerant design of photovoltaic concentrator module and array circuitry presented. Results are provided in the form of example analyses and a complete set of curves giving array power loss versus fraction of open-circuit failures for a broad variety of series-parallel configurations with bypass diodes. Specific curves are provided for single, four, eight, and sixteen-parallel-string source circuits with varying bypass diode frequencies. An example case is presented in a step-by-step fashion to assist the module or array designer in using the above mentioned curves to calculate expected power loss for other concentrator designs. Optimum circuit configurations must also reflect the costs of incorporating circuit redundancy features and the life-cycle tradeoffs associated with repair and replacement of failed modules. To this end, module replacement strategies are also investigated based on a set of projected module and array costs. The results highlight circuit design configurations and module replacement strategies that maximize the array benefit to cost ratio.

  1. Reactive system verification case study: Fault-tolerant transputer communication

    NASA Technical Reports Server (NTRS)

    Crane, D. Francis; Hamory, Philip J.

    1993-01-01

    A reactive program is one which engages in an ongoing interaction with its environment. A system which is controlled by an embedded reactive program is called a reactive system. Examples of reactive systems are aircraft flight management systems, bank automatic teller machine (ATM) networks, airline reservation systems, and computer operating systems. Reactive systems are often naturally modeled (for logical design purposes) as a composition of autonomous processes which progress concurrently and which communicate to share information and/or to coordinate activities. Formal (i.e., mathematical) frameworks for system verification are tools used to increase the users' confidence that a system design satisfies its specification. A framework for reactive system verification includes formal languages for system modeling and for behavior specification and decision procedures and/or proof-systems for verifying that the system model satisfies the system specifications. Using the Ostroff framework for reactive system verification, an approach to achieving fault-tolerant communication between transputers was shown to be effective. The key components of the design, the decoupler processes, may be viewed as discrete-event-controllers introduced to constrain system behavior such that system specifications are satisfied. The Ostroff framework was also effective. The expressiveness of the modeling language permitted construction of a faithful model of the transputer network. The relevant specifications were readily expressed in the specification language. The set of decision procedures provided was adequate to verify the specifications of interest. The need for improved support for system behavior visualization is emphasized.

  2. ALLIANCE: An architecture for fault tolerant multi-robot cooperation

    SciTech Connect

    Parker, L.E.

    1995-02-01

    ALLIANCE is a software architecture that facilitates the fault tolerant cooperative control of teams of heterogeneous mobile robots performing missions composed of loosely coupled, largely independent subtasks. ALLIANCE allows teams of robots, each of which possesses a variety of high-level functions that it can perform during a mission, to individually select appropriate actions throughout the mission based on the requirements of the mission, the activities of other robots, the current environmental conditions, and the robot`s own internal states. ALLIANCE is a fully distributed, behavior-based architecture that incorporates the use of mathematically modeled motivations (such as impatience and acquiescence) within each robot to achieve adaptive action selection. Since cooperative robotic teams usually work in dynamic and unpredictable environments, this software architecture allows the robot team members to respond robustly, reliably, flexibly, and coherently to unexpected environmental changes and modifications in the robot team that may occur due to mechanical failure, the learning of new skills, or the addition or removal of robots from the team by human intervention. The feasibility of this architecture is demonstrated in an implementation on a team of mobile robots performing a laboratory version of hazardous waste cleanup.

  3. Cluster-based architecture for fault-tolerant quantum computation

    SciTech Connect

    Fujii, Keisuke; Yamamoto, Katsuji

    2010-04-15

    We present a detailed description of an architecture for fault-tolerant quantum computation, which is based on the cluster model of encoded qubits. In this cluster-based architecture, concatenated computation is implemented in a quite different way from the usual circuit-based architecture where physical gates are recursively replaced by logical gates with error-correction gadgets. Instead, some relevant cluster states, say fundamental clusters, are recursively constructed through verification and postselection in advance for the higher-level one-way computation, which namely provides error-precorrection of gate operations. A suitable code such as the Steane seven-qubit code is adopted for transversal operations. This concatenated construction of verified fundamental clusters has a simple transversal structure of logical errors, and achieves a high noise threshold {approx}3% for computation by using appropriate verification procedures. Since the postselection is localized within each fundamental cluster with the help of deterministic bare controlled-Z gates without verification, divergence of resources is restrained, which reconciles postselection with scalability.

  4. Cluster-based architecture for fault-tolerant quantum computation

    NASA Astrophysics Data System (ADS)

    Fujii, Keisuke; Yamamoto, Katsuji

    2010-04-01

    We present a detailed description of an architecture for fault-tolerant quantum computation, which is based on the cluster model of encoded qubits. In this cluster-based architecture, concatenated computation is implemented in a quite different way from the usual circuit-based architecture where physical gates are recursively replaced by logical gates with error-correction gadgets. Instead, some relevant cluster states, say fundamental clusters, are recursively constructed through verification and postselection in advance for the higher-level one-way computation, which namely provides error-precorrection of gate operations. A suitable code such as the Steane seven-qubit code is adopted for transversal operations. This concatenated construction of verified fundamental clusters has a simple transversal structure of logical errors, and achieves a high noise threshold ~3% for computation by using appropriate verification procedures. Since the postselection is localized within each fundamental cluster with the help of deterministic bare controlled-Z gates without verification, divergence of resources is restrained, which reconciles postselection with scalability.

  5. Lightweight storage and overlay networks for fault tolerance.

    SciTech Connect

    Oldfield, Ron A.

    2010-01-01

    The next generation of capability-class, massively parallel processing (MPP) systems is expected to have hundreds of thousands to millions of processors, In such environments, it is critical to have fault-tolerance mechanisms, including checkpoint/restart, that scale with the size of applications and the percentage of the system on which the applications execute. For application-driven, periodic checkpoint operations, the state-of-the-art does not provide a scalable solution. For example, on today's massive-scale systems that execute applications which consume most of the memory of the employed compute nodes, checkpoint operations generate I/O that consumes nearly 80% of the total I/O usage. Motivated by this observation, this project aims to improve I/O performance for application-directed checkpoints through the use of lightweight storage architectures and overlay networks. Lightweight storage provide direct access to underlying storage devices. Overlay networks provide caching and processing capabilities in the compute-node fabric. The combination has potential to signifcantly reduce I/O overhead for large-scale applications. This report describes our combined efforts to model and understand overheads for application-directed checkpoints, as well as implementation and performance analysis of a checkpoint service that uses available compute nodes as a network cache for checkpoint operations.

  6. Fault-Tolerant Software-Defined Radio on Manycore

    NASA Technical Reports Server (NTRS)

    Ricketts, Scott

    2015-01-01

    Software-defined radio (SDR) platforms generally rely on field-programmable gate arrays (FPGAs) and digital signal processors (DSPs), but such architectures require significant software development. In addition, application demands for radiation mitigation and fault tolerance exacerbate programming challenges. MaXentric Technologies, LLC, has developed a manycore-based SDR technology that provides 100 times the throughput of conventional radiationhardened general purpose processors. Manycore systems (30-100 cores and beyond) have the potential to provide high processing performance at error rates that are equivalent to current space-deployed uniprocessor systems. MaXentric's innovation is a highly flexible radio, providing over-the-air reconfiguration; adaptability; and uninterrupted, real-time, multimode operation. The technology is also compliant with NASA's Space Telecommunications Radio System (STRS) architecture. In addition to its many uses within NASA communications, the SDR can also serve as a highly programmable research-stage prototyping device for new waveforms and other communications technologies. It can also support noncommunication codes on its multicore processor, collocated with the communications workload-reducing the size, weight, and power of the overall system by aggregating processing jobs to a single board computer.

  7. Fault-tolerant quantum blind signature protocols against collective noise

    NASA Astrophysics Data System (ADS)

    Zhang, Ming-Hui; Li, Hui-Fang

    2016-10-01

    This work proposes two fault-tolerant quantum blind signature protocols based on the entanglement swapping of logical Bell states, which are robust against two kinds of collective noises: the collective-dephasing noise and the collective-rotation noise, respectively. Both of the quantum blind signature protocols are constructed from four-qubit decoherence-free (DF) states, i.e., logical Bell qubits. The initial message is encoded on the logical Bell qubits with logical unitary operations, which will not destroy the anti-noise trait of the logical Bell qubits. Based on the fundamental property of quantum entanglement swapping, the receiver simply performs two Bell-state measurements (rather than four-qubit joint measurements) on the logical Bell qubits to verify the signature, which makes the protocols more convenient in a practical application. Different from the existing quantum signature protocols, our protocols can offer the high fidelity of quantum communication with the employment of logical qubits. Moreover, we hereinafter prove the security of the protocols against some individual eavesdropping attacks, and we show that our protocols have the characteristics of unforgeability, undeniability and blindness.

  8. Fault-tolerant authenticated quantum dialogue using logical Bell states

    NASA Astrophysics Data System (ADS)

    Ye, Tian-Yu

    2015-09-01

    Two fault-tolerant authenticated quantum dialogue protocols are proposed in this paper by employing logical Bell states as the quantum resource, which combat the collective-dephasing noise and the collective-rotation noise, respectively. The two proposed protocols each can accomplish the mutual identity authentication and the dialogue between two participants simultaneously and securely over one kind of collective noise channels. In each of two proposed protocols, the information transmitted through the classical channel is assumed to be eavesdroppable and modifiable. The key for choosing the measurement bases of sample logical qubits is pre-shared privately between two participants. The Bell state measurements rather than the four-qubit joint measurements are adopted for decoding. The two participants share the initial states of message logical Bell states with resort to the direct transmission of auxiliary logical Bell states so that the information leakage problem is avoided. The impersonation attack, the man-in-the-middle attack, the modification attack and the Trojan horse attacks from Eve all are detectable.

  9. Fault tolerant channel-encrypting quantum dialogue against collective noise

    NASA Astrophysics Data System (ADS)

    Ye, TianYu

    2015-04-01

    In this paper, two fault tolerant channel-encrypting quantum dialogue (QD) protocols against collective noise are presented. One is against collective-dephasing noise, while the other is against collective-rotation noise. The decoherent-free states, each of which is composed of two physical qubits, act as traveling states combating collective noise. Einstein-Podolsky-Rosen pairs, which play the role of private quantum key, are securely shared between two participants over a collective-noise channel in advance. Through encryption and decryption with private quantum key, the initial state of each traveling two-photon logical qubit is privately shared between two participants. Due to quantum encryption sharing of the initial state of each traveling logical qubit, the issue of information leakage is overcome. The private quantum key can be repeatedly used after rotation as long as the rotation angle is properly chosen, making quantum resource economized. As a result, their information-theoretical efficiency is nearly up to 66.7%. The proposed QD protocols only need single-photon measurements rather than two-photon joint measurements for quantum measurements. Security analysis shows that an eavesdropper cannot obtain anything useful about secret messages during the dialogue process without being discovered. Furthermore, the proposed QD protocols can be implemented with current techniques in experiment.

  10. Fault-tolerant error correction with the gauge color code

    NASA Astrophysics Data System (ADS)

    Brown, Benjamin J.; Nickerson, Naomi H.; Browne, Dan E.

    2016-07-01

    The constituent parts of a quantum computer are inherently vulnerable to errors. To this end, we have developed quantum error-correcting codes to protect quantum information from noise. However, discovering codes that are capable of a universal set of computational operations with the minimal cost in quantum resources remains an important and ongoing challenge. One proposal of significant recent interest is the gauge color code. Notably, this code may offer a reduced resource cost over other well-studied fault-tolerant architectures by using a new method, known as gauge fixing, for performing the non-Clifford operations that are essential for universal quantum computation. Here we examine the gauge color code when it is subject to noise. Specifically, we make use of single-shot error correction to develop a simple decoding algorithm for the gauge color code, and we numerically analyse its performance. Remarkably, we find threshold error rates comparable to those of other leading proposals. Our results thus provide the first steps of a comparative study between the gauge color code and other promising computational architectures.

  11. Quantum error correction and fault-tolerant quantum computation

    NASA Astrophysics Data System (ADS)

    Lai, Ching-Yi

    Quantum computers need to be protected by quantum error-correcting codes against decoherence. One of the most interesting and useful classes of quantum codes is the class of quantum stabilizer codes. Entanglement-assisted (EA) quantum codes are a class of stabilizer codes that make use of preshared entanglement between the sender and the receiver. We provide several code constructions for entanglement-assisted quantum codes. The MacWilliams identity for quantum codes leads to linear programming bounds on the minimum distance. We find new constraints on the simplified stabilizer group and the logical group, which help improve the linear programming bounds on entanglement-assisted quantum codes. The results also can be applied to standard stabilizer codes. In the real world, quantum gates are faulty. To implement quantum computation fault-tolerantly, quantum codes with certain properties are needed. We first analyze Knill's postselection scheme in a two-dimensional architecture. The error performance of this scheme is better than other known concatenated codes. Then we propose several methods to protect syndrome extraction against measurement errors.

  12. Proposal of fault-tolerant tomographic image reconstruction

    NASA Astrophysics Data System (ADS)

    Kudo, Hiroyuki; Takaki, Keita; Yamazaki, Fukashi; Nemoto, Takuya

    2016-10-01

    This paper deals with tomographic image reconstruction under the situation where some of projection data bins are contaminated with abnormal data. Such situations occur in various instances of tomography. We propose a new reconstruction algorithm called the Fault-Tolerant reconstruction outlined as follows. The least-squares (L2- norm) error function || Ax- b||22 used in ordinary iterative reconstructions is sensitive to the existence of abnormal data. The proposed algorithm utilizes the L1-norm error function || Ax- b||11 instead of the L2-norm, and we develop a row-action-type iterative algorithm using the proximal splitting framework in convex optimization fields. We also propose an improved version of the L1-norm reconstruction called the L1-TV reconstruction, in which a weak Total Variation (TV) penalty is added to the cost function. Simulation results demonstrate that reconstructed images with the L2-norm were severely damaged by the effect of abnormal bins, whereas images with the L1-norm and L1-TV reconstructions were robust to the existence of abnormal bins.

  13. Fault-tolerant error correction with the gauge color code

    PubMed Central

    Brown, Benjamin J.; Nickerson, Naomi H.; Browne, Dan E.

    2016-01-01

    The constituent parts of a quantum computer are inherently vulnerable to errors. To this end, we have developed quantum error-correcting codes to protect quantum information from noise. However, discovering codes that are capable of a universal set of computational operations with the minimal cost in quantum resources remains an important and ongoing challenge. One proposal of significant recent interest is the gauge color code. Notably, this code may offer a reduced resource cost over other well-studied fault-tolerant architectures by using a new method, known as gauge fixing, for performing the non-Clifford operations that are essential for universal quantum computation. Here we examine the gauge color code when it is subject to noise. Specifically, we make use of single-shot error correction to develop a simple decoding algorithm for the gauge color code, and we numerically analyse its performance. Remarkably, we find threshold error rates comparable to those of other leading proposals. Our results thus provide the first steps of a comparative study between the gauge color code and other promising computational architectures. PMID:27470619

  14. Fault-tolerant error correction with the gauge color code.

    PubMed

    Brown, Benjamin J; Nickerson, Naomi H; Browne, Dan E

    2016-07-29

    The constituent parts of a quantum computer are inherently vulnerable to errors. To this end, we have developed quantum error-correcting codes to protect quantum information from noise. However, discovering codes that are capable of a universal set of computational operations with the minimal cost in quantum resources remains an important and ongoing challenge. One proposal of significant recent interest is the gauge color code. Notably, this code may offer a reduced resource cost over other well-studied fault-tolerant architectures by using a new method, known as gauge fixing, for performing the non-Clifford operations that are essential for universal quantum computation. Here we examine the gauge color code when it is subject to noise. Specifically, we make use of single-shot error correction to develop a simple decoding algorithm for the gauge color code, and we numerically analyse its performance. Remarkably, we find threshold error rates comparable to those of other leading proposals. Our results thus provide the first steps of a comparative study between the gauge color code and other promising computational architectures.

  15. RAID Unbound: Storage Fault Tolerance in a Distributed Environment

    NASA Technical Reports Server (NTRS)

    Ritchie, Brian

    1996-01-01

    Mirroring, data replication, backup, and more recently, redundant arrays of independent disks (RAID) are all technologies used to protect and ensure access to critical company data. A new set of problems has arisen as data becomes more and more geographically distributed. Each of the technologies listed above provides important benefits; but each has failed to adapt fully to the realities of distributed computing. The key to data high availability and protection is to take the technologies' strengths and 'virtualize' them across a distributed network. RAID and mirroring offer high data availability, which data replication and backup provide strong data protection. If we take these concepts at a very granular level (defining user, record, block, file, or directory types) and them liberate them from the physical subsystems with which they have traditionally been associated, we have the opportunity to create a highly scalable network wide storage fault tolerance. The network becomes the virtual storage space in which the traditional concepts of data high availability and protection are implemented without their corresponding physical constraints.

  16. Privacy-Assured Aggregation Protocol for Smart Metering: A Proactive Fault-Tolerant Approach [Proactive Fault-Tolerant Aggregation Protocol for Privacy-Assured Smart Metering

    SciTech Connect

    Won, Jongho; Ma, Chris Y. T.; Yau, David K. Y.; Rao, Nageswara S. V.

    2016-06-01

    Smart meters are integral to demand response in emerging smart grids, by reporting the electricity consumption of users to serve application needs. But reporting real-time usage information for individual households raises privacy concerns. Existing techniques to guarantee differential privacy (DP) of smart meter users either are not fault tolerant or achieve (possibly partial) fault tolerance at high communication overheads. In this paper, we propose a fault-tolerant protocol for smart metering that can handle general communication failures while ensuring DP with significantly improved efficiency and lower errors compared with the state of the art. Our protocol handles fail-stop faults proactively by using a novel design of future ciphertexts, and distributes trust among the smart meters by sharing secret keys among them. We prove the DP properties of our protocol and analyze its advantages in fault tolerance, accuracy, and communication efficiency relative to competing techniques. We illustrate our analysis by simulations driven by real-world traces of electricity consumption.

  17. Privacy-Assured Aggregation Protocol for Smart Metering: A Proactive Fault-Tolerant Approach [Proactive Fault-Tolerant Aggregation Protocol for Privacy-Assured Smart Metering

    DOE PAGES

    Won, Jongho; Ma, Chris Y. T.; Yau, David K. Y.; ...

    2016-06-01

    Smart meters are integral to demand response in emerging smart grids, by reporting the electricity consumption of users to serve application needs. But reporting real-time usage information for individual households raises privacy concerns. Existing techniques to guarantee differential privacy (DP) of smart meter users either are not fault tolerant or achieve (possibly partial) fault tolerance at high communication overheads. In this paper, we propose a fault-tolerant protocol for smart metering that can handle general communication failures while ensuring DP with significantly improved efficiency and lower errors compared with the state of the art. Our protocol handles fail-stop faults proactively bymore » using a novel design of future ciphertexts, and distributes trust among the smart meters by sharing secret keys among them. We prove the DP properties of our protocol and analyze its advantages in fault tolerance, accuracy, and communication efficiency relative to competing techniques. We illustrate our analysis by simulations driven by real-world traces of electricity consumption.« less

  18. A fault-tolerant multiprocessor architecture for aircraft, volume 1. [autopilot configuration

    NASA Technical Reports Server (NTRS)

    Smith, T. B.; Hopkins, A. L.; Taylor, W.; Ausrotas, R. A.; Lala, J. H.; Hanley, L. D.; Martin, J. H.

    1978-01-01

    A fault-tolerant multiprocessor architecture is reported. This architecture, together with a comprehensive information system architecture, has important potential for future aircraft applications. A preliminary definition and assessment of a suitable multiprocessor architecture for such applications is developed.

  19. Sensor fault detection and isolation over wireless sensor network based on hardware redundancy

    NASA Astrophysics Data System (ADS)

    Hao, Jingjing; Kinnaert, Michel

    2017-01-01

    In order to diagnose sensor faults with small magnitude in wireless sensor networks, distinguishability measures are defined to indicate the performance for fault detection and isolation (FDI) at each node. A systematic method is then proposed to determine the information to be exchanged between nodes to achieve FDI specifications while limiting the computation complexity and communication cost.

  20. Reliability model derivation of a fault-tolerant, dual, spare-switching, digital computer system

    NASA Technical Reports Server (NTRS)

    1974-01-01

    A computer based reliability projection aid, tailored specifically for application in the design of fault-tolerant computer systems, is described. Its more pronounced characteristics include the facility for modeling systems with two distinct operational modes, measuring the effect of both permanent and transient faults, and calculating conditional system coverage factors. The underlying conceptual principles, mathematical models, and computer program implementation are presented.

  1. Advanced information processing system: The Army fault tolerant architecture conceptual study. Volume 1: Army fault tolerant architecture overview

    NASA Technical Reports Server (NTRS)

    Harper, R. E.; Alger, L. S.; Babikyan, C. A.; Butler, B. P.; Friend, S. A.; Ganska, R. J.; Lala, J. H.; Masotto, T. K.; Meyer, A. J.; Morton, D. P.

    1992-01-01

    Digital computing systems needed for Army programs such as the Computer-Aided Low Altitude Helicopter Flight Program and the Armored Systems Modernization (ASM) vehicles may be characterized by high computational throughput and input/output bandwidth, hard real-time response, high reliability and availability, and maintainability, testability, and producibility requirements. In addition, such a system should be affordable to produce, procure, maintain, and upgrade. To address these needs, the Army Fault Tolerant Architecture (AFTA) is being designed and constructed under a three-year program comprised of a conceptual study, detailed design and fabrication, and demonstration and validation phases. Described here are the results of the conceptual study phase of the AFTA development. Given here is an introduction to the AFTA program, its objectives, and key elements of its technical approach. A format is designed for representing mission requirements in a manner suitable for first order AFTA sizing and analysis, followed by a discussion of the current state of mission requirements acquisition for the targeted Army missions. An overview is given of AFTA's architectural theory of operation.

  2. Parameter Estimation Analysis for Hybrid Adaptive Fault Tolerant Control

    NASA Astrophysics Data System (ADS)

    Eshak, Peter B.

    Research efforts have increased in recent years toward the development of intelligent fault tolerant control laws, which are capable of helping the pilot to safely maintain aircraft control at post failure conditions. Researchers at West Virginia University (WVU) have been actively involved in the development of fault tolerant adaptive control laws in all three major categories: direct, indirect, and hybrid. The first implemented design to provide adaptation was a direct adaptive controller, which used artificial neural networks to generate augmentation commands in order to reduce the modeling error. Indirect adaptive laws were implemented in another controller, which utilized online PID to estimate and update the controller parameter. Finally, a new controller design was introduced, which integrated both direct and indirect control laws. This controller is known as hybrid adaptive controller. This last control design outperformed the two earlier designs in terms of less NNs effort and better tracking quality. The performance of online PID has an important role in the quality of the hybrid controller; therefore, the quality of the estimation will be of a great importance. Unfortunately, PID is not perfect and the online estimation process has some inherited issues; the online PID estimates are primarily affected by delays and biases. In order to ensure updating reliable estimates to the controller, the estimator consumes some time to converge. Moreover, the estimator will often converge to a biased value. This thesis conducts a sensitivity analysis for the estimation issues, delay and bias, and their effect on the tracking quality. In addition, the performance of the hybrid controller as compared to direct adaptive controller is explored. In order to serve this purpose, a simulation environment in MATLAB/SIMULINK has been created. The simulation environment is customized to provide the user with the flexibility to add different combinations of biases and delays to

  3. Adaptive fault-tolerant routing in hypercube multicomputers

    NASA Technical Reports Server (NTRS)

    Chen, Ming-Syan; Shin, Kang G.

    1990-01-01

    A connected hypercube with faulty links and/or nodes is called an injured hypercube. To enable any non-faulty node to communicate with any other non-faulty node in an injured hypercube, the information on component failures has to be made available to non-faulty nodes so as to route messages around the faulty components. A distributed adaptive fault tolerant routing scheme is proposed for an injured hypercube in which each node is required to know only the condition of its own links. Despite its simplicity, this scheme is shown to be capable of routing messages successfully in an injured hypercube as long as the number of faulty components is less than n. Moreover, it is proved that this scheme routes messages via shortest paths with a rather high probabiltiy and the expected length of a resulting path is very close to that of a shortest path. Since the assumption that the number of faulty components is less than n in an n-dimensional hypercube might limit the usefulness of the above scheme, a routing scheme is introduced based on depth-first search which works in the presence of an arbitrary number of faulty components. Due to the insufficient information on faulty components, the paths chosen by the above scheme may not always be the shortest. To guarantee that all messages be routed via shortest paths, it is proposed that every mode be equipped with more information than that on its own links. The effects of this additional information on routing efficiency are analyzed, and the additional information to be kept at each node for the shortest path routing is determined. Several examples and remarks are also given to illustrate the results.

  4. Fault-tolerant cooperative tasking for multi-agent systems

    NASA Astrophysics Data System (ADS)

    Karimadini, Mohammad; Lin, Hai

    2011-12-01

    A natural way for cooperative tasking in multi-agent systems is through a top-down design by decomposing a global task into subtasks for each individual agent such that the accomplishments of these subtasks will guarantee the achievement of the global task. In our previous works [Karimadini, M., and Lin, H. (2011c), 'Guaranteed Global Performance Through Local Coordinations', Automatica, 47, 890--898; Karimadini, M., and Lin, H. (2011a), 'Cooperative Tasking for Deterministic Specification Automata', submitted for publication, online available at: http://arxiv.org/abs/1101.2002], we presented necessary and sufficient conditions on the decomposability of a global task automaton between cooperative agents. As a follow-up work, this article deals with the robustness issues of the proposed top-down design approach with respect to event failures in the multi-agent systems. The main concern under event failure is whether a previously decomposable task can still be achieved collectively by the agents, and if not, we would like to investigate that under what conditions the global task could be robustly accomplished. This is actually the fault-tolerance issue of the top-down design, and the results provide designers with hints on which events are fragile with respect to failures, and whether redundancies are needed. The main objective of this article is to identify necessary and sufficient conditions on failed events under which a decomposable global task can still be achieved successfully. For such a purpose, a notion called passivity is introduced to characterise the type of event failures. The passivity is found to reflect the redundancy of communication links over shared events, based on which necessary and sufficient conditions for the reliability of cooperative tasking under event failures are derived, followed by illustrative examples and remarks for the derived conditions.

  5. Fault detection and fault tolerant control of a smart base isolation system with magneto-rheological damper

    NASA Astrophysics Data System (ADS)

    Wang, Han; Song, Gangbing

    2011-08-01

    Fault detection and isolation (FDI) in real-time systems can provide early warnings for faulty sensors and actuator signals to prevent events that lead to catastrophic failures. The main objective of this paper is to develop FDI and fault tolerant control techniques for base isolation systems with magneto-rheological (MR) dampers. Thus, this paper presents a fixed-order FDI filter design procedure based on linear matrix inequalities (LMI). The necessary and sufficient conditions for the existence of a solution for detecting and isolating faults using the H_{\\infty } formulation is provided in the proposed filter design. Furthermore, an FDI-filter-based fuzzy fault tolerant controller (FFTC) for a base isolation structure model was designed to preserve the pre-specified performance of the system in the presence of various unknown faults. Simulation and experimental results demonstrated that the designed filter can successfully detect and isolate faults from displacement sensors and accelerometers while maintaining excellent performance of the base isolation technology under faulty conditions.

  6. Fault-Tolerant, Real-Time, Multi-Core Computer System

    NASA Technical Reports Server (NTRS)

    Gostelow, Kim P.

    2012-01-01

    A document discusses a fault-tolerant, self-aware, low-power, multi-core computer for space missions with thousands of simple cores, achieving speed through concurrency. The proposed machine decides how to achieve concurrency in real time, rather than depending on programmers. The driving features of the system are simple hardware that is modular in the extreme, with no shared memory, and software with significant runtime reorganizing capability. The document describes a mechanism for moving ongoing computations and data that is based on a functional model of execution. Because there is no shared memory, the processor connects to its neighbors through a high-speed data link. Messages are sent to a neighbor switch, which in turn forwards that message on to its neighbor until reaching the intended destination. Except for the neighbor connections, processors are isolated and independent of each other. The processors on the periphery also connect chip-to-chip, thus building up a large processor net. There is no particular topology to the larger net, as a function at each processor allows it to forward a message in the correct direction. Some chip-to-chip connections are not necessarily nearest neighbors, providing short cuts for some of the longer physical distances. The peripheral processors also provide the connections to sensors, actuators, radios, science instruments, and other devices with which the computer system interacts.

  7. Toward a Fault Tolerant Architecture for Vital Medical-Based Wearable Computing.

    PubMed

    Abdali-Mohammadi, Fardin; Bajalan, Vahid; Fathi, Abdolhossein

    2015-12-01

    Advancements in computers and electronic technologies have led to the emergence of a new generation of efficient small intelligent systems. The products of such technologies might include Smartphones and wearable devices, which have attracted the attention of medical applications. These products are used less in critical medical applications because of their resource constraint and failure sensitivity. This is due to the fact that without safety considerations, small-integrated hardware will endanger patients' lives. Therefore, proposing some principals is required to construct wearable systems in healthcare so that the existing concerns are dealt with. Accordingly, this paper proposes an architecture for constructing wearable systems in critical medical applications. The proposed architecture is a three-tier one, supporting data flow from body sensors to cloud. The tiers of this architecture include wearable computers, mobile computing, and mobile cloud computing. One of the features of this architecture is its high possible fault tolerance due to the nature of its components. Moreover, the required protocols are presented to coordinate the components of this architecture. Finally, the reliability of this architecture is assessed by simulating the architecture and its components, and other aspects of the proposed architecture are discussed.

  8. Performance and Fault-Tolerance of Neural Networks for Optimization

    DTIC Science & Technology

    1991-06-01

    following, we will first consider two types of faults of the active elements that correspond to the high- est failure rate. These are commonly called "stuck... types . We use the same locations for stuck-at-1 and stuck-at-0 faults, in order to compare the effect of a different fault type . Otherwise it would...not be possible to tell whether different results are caused by the different locations or by the different fault types . This means that the above

  9. Design of on-board Bluetooth wireless network system based on fault-tolerant technology

    NASA Astrophysics Data System (ADS)

    You, Zheng; Zhang, Xiangqi; Yu, Shijie; Tian, Hexiang

    2007-11-01

    In this paper, the Bluetooth wireless data transmission technology is applied in on-board computer system, to realize wireless data transmission between peripherals of the micro-satellite integrating electronic system, and in view of the high demand of reliability of a micro-satellite, a design of Bluetooth wireless network based on fault-tolerant technology is introduced. The reliability of two fault-tolerant systems is estimated firstly using Markov model, then the structural design of this fault-tolerant system is introduced; several protocols are established to make the system operate correctly, some related problems are listed and analyzed, with emphasis on Fault Auto-diagnosis System, Active-standby switch design and Data-Integrity process.

  10. Adaptive Fault Tolerance for Many-Core Based Space-Borne Computing

    NASA Technical Reports Server (NTRS)

    James, Mark; Springer, Paul; Zima, Hans

    2010-01-01

    This paper describes an approach to providing software fault tolerance for future deep-space robotic NASA missions, which will require a high degree of autonomy supported by an enhanced on-board computational capability. Such systems have become possible as a result of the emerging many-core technology, which is expected to offer 1024-core chips by 2015. We discuss the challenges and opportunities of this new technology, focusing on introspection-based adaptive fault tolerance that takes into account the specific requirements of applications, guided by a fault model. Introspection supports runtime monitoring of the program execution with the goal of identifying, locating, and analyzing errors. Fault tolerance assertions for the introspection system can be provided by the user, domain-specific knowledge, or via the results of static or dynamic program analysis. This work is part of an on-going project at the Jet Propulsion Laboratory in Pasadena, California.

  11. Fault Tolerance Assistant (FTA): An Exception Handling Programming Model for MPI Applications

    SciTech Connect

    Fang, Aiman; Laguna, Ignacio; Sato, Kento; Islam, Tanzima; Mohror, Kathryn

    2016-05-23

    Future high-performance computing systems may face frequent failures with their rapid increase in scale and complexity. Resilience to faults has become a major challenge for large-scale applications running on supercomputers, which demands fault tolerance support for prevalent MPI applications. Among failure scenarios, process failures are one of the most severe issues as they usually lead to termination of applications. However, the widely used MPI implementations do not provide mechanisms for fault tolerance. We propose FTA-MPI (Fault Tolerance Assistant MPI), a programming model that provides support for failure detection, failure notification and recovery. Specifically, FTA-MPI exploits a try/catch model that enables failure localization and transparent recovery of process failures in MPI applications. We demonstrate FTA-MPI with synthetic applications and a molecular dynamics code CoMD, and show that FTA-MPI provides high programmability for users and enables convenient and flexible recovery of process failures.

  12. Fault-Tolerant Consensus of Multi-Agent System With Distributed Adaptive Protocol.

    PubMed

    Chen, Shun; Ho, Daniel W C; Li, Lulu; Liu, Ming

    2015-10-01

    In this paper, fault-tolerant consensus in multi-agent system using distributed adaptive protocol is investigated. Firstly, distributed adaptive online updating strategies for some parameters are proposed based on local information of the network structure. Then, under the online updating parameters, a distributed adaptive protocol is developed to compensate the fault effects and the uncertainty effects in the leaderless multi-agent system. Based on the local state information of neighboring agents, a distributed updating protocol gain is developed which leads to a fully distributed continuous adaptive fault-tolerant consensus protocol design for the leaderless multi-agent system. Furthermore, a distributed fault-tolerant leader-follower consensus protocol for multi-agent system is constructed by the proposed adaptive method. Finally, a simulation example is given to illustrate the effectiveness of the theoretical analysis.

  13. Fault-tolerant control for a class of non-linear systems with dead-zone

    NASA Astrophysics Data System (ADS)

    Chen, Mou; Jiang, Bin; Guo, William W.

    2016-05-01

    In this paper, a fault-tolerant control scheme is proposed for a class of single-input and single-output non-linear systems with the unknown time-varying system fault and the dead-zone. The non-linear state observer is designed for the non-linear system using differential mean value theorem, and the non-linear fault estimator that estimates the unknown time-varying system fault is developed. On the basis of the designed fault estimator, the observer-based fault-tolerant tracking control is then developed using the backstepping technique for non-linear systems with the dead-zone. The stability of the whole closed-loop system is rigorously proved via Lyapunov analysis and the satisfactory tracking control performance is guaranteed in the presence of the unknown time-varying system fault and the dead-zone. Numerical simulation results are presented to illustrate the effectiveness of the proposed backstepping fault-tolerant control scheme for non-linear systems.

  14. Fault tolerant control of multivariable processes using auto-tuning PID controller.

    PubMed

    Yu, Ding-Li; Chang, T K; Yu, Ding-Wen

    2005-02-01

    Fault tolerant control of dynamic processes is investigated in this paper using an auto-tuning PID controller. A fault tolerant control scheme is proposed composing an auto-tuning PID controller based on an adaptive neural network model. The model is trained online using the extended Kalman filter (EKF) algorithm to learn system post-fault dynamics. Based on this model, the PID controller adjusts its parameters to compensate the effects of the faults, so that the control performance is recovered from degradation. The auto-tuning algorithm for the PID controller is derived with the Lyapunov method and therefore, the model predicted tracking error is guaranteed to converge asymptotically. The method is applied to a simulated two-input two-output continuous stirred tank reactor (CSTR) with various faults, which demonstrate the applicability of the developed scheme to industrial processes.

  15. Fault tolerant control for switching discrete-time systems with delays: an improved cone complementarity approach

    NASA Astrophysics Data System (ADS)

    Benzaouia, Abdellah; Ouladsine, Mustapha; Ananou, Bouchra

    2014-10-01

    In this paper, fault tolerant control problem for discrete-time switching systems with delay is studied. Sufficient conditions of building an observer are obtained by using multiple Lyapunov function. These conditions are worked out in a new way, using cone complementarity technique, to obtain new LMIs with slack variables and multiple weighted residual matrices. The obtained results are applied on a numerical example showing fault detection, localisation of fault and reconfiguration of the control to maintain asymptotic stability even in the presence of a permanent sensor fault.

  16. Coordinated Fault-Tolerance for High-Performance Computing Final Project Report

    SciTech Connect

    Panda, Dhabaleswar Kumar; Beckman, Pete

    2011-07-28

    With the Coordinated Infrastructure for Fault Tolerance Systems (CIFTS, as the original project came to be called) project, our aim has been to understand and tackle the following broad research questions, the answers to which will help the HEC community analyze and shape the direction of research in the field of fault tolerance and resiliency on future high-end leadership systems. Will availability of global fault information, obtained by fault information exchange between the different HEC software on a system, allow individual system software to better detect, diagnose, and adaptively respond to faults? If fault-awareness is raised throughout the system through fault information exchange, is it possible to get all system software working together to provide a more comprehensive end-to-end fault management on the system? What are the missing fault-tolerance features that widely used HEC system software lacks today that would inhibit such software from taking advantage of systemwide global fault information? What are the practical limitations of a systemwide approach for end-to-end fault management based on fault awareness and coordination? What mechanisms, tools, and technologies are needed to bring about fault awareness and coordination of responses on a leadership-class system? What standards, outreach, and community interaction are needed for adoption of the concept of fault awareness and coordination for fault management on future systems? Keeping our overall objectives in mind, the CIFTS team has taken a parallel fourfold approach. Our central goal was to design and implement a light-weight, scalable infrastructure with a simple, standardized interface to allow communication of fault-related information through the system and facilitate coordinated responses. This work led to the development of the Fault Tolerance Backplane (FTB) publish-subscribe API specification, together with a reference implementation and several experimental implementations on top of

  17. The superior fault tolerance of artificial neural network training with a fault/noise injection-based genetic algorithm.

    PubMed

    Su, Feng; Yuan, Peijiang; Wang, Yangzhen; Zhang, Chen

    2016-10-01

    Artificial neural networks (ANNs) are powerful computational tools that are designed to replicate the human brain and adopted to solve a variety of problems in many different fields. Fault tolerance (FT), an important property of ANNs, ensures their reliability when significant portions of a network are lost. In this paper, a fault/noise injection-based (FIB) genetic algorithm (GA) is proposed to construct fault-tolerant ANNs. The FT performance of an FIB-GA was compared with that of a common genetic algorithm, the back-propagation algorithm, and the modification of weights algorithm. The FIB-GA showed a slower fitting speed when solving the exclusive OR (XOR) problem and the overlapping classification problem, but it significantly reduced the errors in cases of single or multiple faults in ANN weights or nodes. Further analysis revealed that the fit weights showed no correlation with the fitting errors in the ANNs constructed with the FIB-GA, suggesting a relatively even distribution of the various fitting parameters. In contrast, the output weights in the training of ANNs implemented with the use the other three algorithms demonstrated a positive correlation with the errors. Our findings therefore indicate that a combination of the fault/noise injection-based method and a GA is capable of introducing FT to ANNs and imply that the distributed ANNs demonstrate superior FT performance.

  18. Error latency estimation using functional fault modeling

    NASA Technical Reports Server (NTRS)

    Manthani, S. R.; Saxena, N. R.; Robinson, J. P.

    1983-01-01

    A complete modeling of faults at gate level for a fault tolerant computer is both infeasible and uneconomical. Functional fault modeling is an approach where units are characterized at an intermediate level and then combined to determine fault behavior. The applicability of functional fault modeling to the FTMP is studied. Using this model a forecast of error latency is made for some functional blocks. This approach is useful in representing larger sections of the hardware and aids in uncovering system level deficiencies.

  19. Autonomous Decentralized Loop network - ADL aiming at fault-tolerance

    NASA Astrophysics Data System (ADS)

    Kanbe, Seiichiro; Ashida, Akira; Tanaka, Toshiyuki; Mori, Kinji; Ihara, Hirokazu

    An Autonomous Decentralized System (ADS) network is proposed which provides fault detection, fault recovery, transmission, and maintenance for a space system in a distributed manner. An Autonomous Decentralized Loop (ADL) network system is presented as an application of ADS. The ADL system construction, communication protocol, transmission control, and fault detection and recovery are examined. The ADS features autonomous nodes which allow no subsystem to be down without advance notice. The functional availability of ADL is compared with that of a two-redundant loop.

  20. Design and analysis of linear fault-tolerant permanent-magnet vernier machines.

    PubMed

    Xu, Liang; Ji, Jinghua; Liu, Guohai; Du, Yi; Liu, Hu

    2014-01-01

    This paper proposes a new linear fault-tolerant permanent-magnet (PM) vernier (LFTPMV) machine, which can offer high thrust by using the magnetic gear effect. Both PMs and windings of the proposed machine are on short mover, while the long stator is only manufactured from iron. Hence, the proposed machine is very suitable for long stroke system applications. The key of this machine is that the magnetizer splits the two movers with modular and complementary structures. Hence, the proposed machine offers improved symmetrical and sinusoidal back electromotive force waveform and reduced detent force. Furthermore, owing to the complementary structure, the proposed machine possesses favorable fault-tolerant capability, namely, independent phases. In particular, differing from the existing fault-tolerant machines, the proposed machine offers fault tolerance without sacrificing thrust density. This is because neither fault-tolerant teeth nor the flux-barriers are adopted. The electromagnetic characteristics of the proposed machine are analyzed using the time-stepping finite-element method, which verifies the effectiveness of the theoretical analysis.

  1. Design and Analysis of Linear Fault-Tolerant Permanent-Magnet Vernier Machines

    PubMed Central

    Xu, Liang; Liu, Guohai; Du, Yi; Liu, Hu

    2014-01-01

    This paper proposes a new linear fault-tolerant permanent-magnet (PM) vernier (LFTPMV) machine, which can offer high thrust by using the magnetic gear effect. Both PMs and windings of the proposed machine are on short mover, while the long stator is only manufactured from iron. Hence, the proposed machine is very suitable for long stroke system applications. The key of this machine is that the magnetizer splits the two movers with modular and complementary structures. Hence, the proposed machine offers improved symmetrical and sinusoidal back electromotive force waveform and reduced detent force. Furthermore, owing to the complementary structure, the proposed machine possesses favorable fault-tolerant capability, namely, independent phases. In particular, differing from the existing fault-tolerant machines, the proposed machine offers fault tolerance without sacrificing thrust density. This is because neither fault-tolerant teeth nor the flux-barriers are adopted. The electromagnetic characteristics of the proposed machine are analyzed using the time-stepping finite-element method, which verifies the effectiveness of the theoretical analysis. PMID:24982959

  2. Evolutionary Based Techniques for Fault Tolerant Field Programmable Gate Arrays

    NASA Technical Reports Server (NTRS)

    Larchev, Gregory V.; Lohn, Jason D.

    2006-01-01

    The use of SRAM-based Field Programmable Gate Arrays (FPGAs) is becoming more and more prevalent in space applications. Commercial-grade FPGAs are potentially susceptible to permanently debilitating Single-Event Latchups (SELs). Repair methods based on Evolutionary Algorithms may be applied to FPGA circuits to enable successful fault recovery. This paper presents the experimental results of applying such methods to repair four commonly used circuits (quadrature decoder, 3-by-3-bit multiplier, 3-by-3-bit adder, 440-7 decoder) into which a number of simulated faults have been introduced. The results suggest that evolutionary repair techniques can improve the process of fault recovery when used instead of or as a supplement to Triple Modular Redundancy (TMR), which is currently the predominant method for mitigating FPGA faults.

  3. The Impact of a Fault Tolerant MPI on Scalable Systems Services and Applications

    SciTech Connect

    Graham, Richard L; Hursey, Joshua J; Vallee, Geoffroy R; Naughton, III, Thomas J; Boehm, Swen

    2012-01-01

    Exascale targeted scientific applications must be prepared for a highly concurrent computing environment where failure will be a regular event during execution. Natural and algorithm-based fault tolerance (ABFT) techniques can often manage failures more efficiently than traditional checkpoint/restart techniques alone. Central to many petascale applications is an MPI standard that lacks support for ABFT. The Run-Through Stabilization (RTS) proposal, under consideration for MPI 3, allows an application to continue execution when processes fail. The requirements of scalable, fault tolerant MPI implementations and applications will stress the capabilities of many system services. System services must evolve to efficiently support such applications and libraries in the presence of system component failures. This paper discusses how the RTS proposal impacts system services, highlighting specific requirements. Early experimentation results from Cray systems at ORNL using prototype MPI and runtime implementations are presented. Additionally, this paper outlines fault tolerance techniques targeted at leadership class applications.

  4. Fault-tolerant onboard digital information switching and routing for communications satellites

    NASA Technical Reports Server (NTRS)

    Shalkhauser, Mary JO; Quintana, Jorge A.; Soni, Nitin J.; Kim, Heechul

    1993-01-01

    The NASA Lewis Research Center is developing an information-switching processor for future meshed very-small-aperture terminal (VSAT) communications satellites. The information-switching processor will switch and route baseband user data onboard the VSAT satellite to connect thousands of Earth terminals. Fault tolerance is a critical issue in developing information-switching processor circuitry that will provide and maintain reliable communications services. In parallel with the conceptual development of the meshed VSAT satellite network architecture, NASA designed and built a simple test bed for developing and demonstrating baseband switch architectures and fault-tolerance techniques. The meshed VSAT architecture and the switching demonstration test bed are described, and the initial switching architecture and the fault-tolerance techniques that were developed and tested are discussed.

  5. Evaluation of fault-tolerant parallel-processor architectures over long space missions

    NASA Technical Reports Server (NTRS)

    Johnson, Sally C.

    1989-01-01

    The impact of a five year space mission environment on fault-tolerant parallel processor architectures is examined. The target application is a Strategic Defense Initiative (SDI) satellite requiring 256 parallel processors to provide the computation throughput. The reliability requirements are that the system still be operational after five years with .99 probability and that the probability of system failure during one-half hour of full operation be less than 10(-7). The fault tolerance features an architecture must possess to meet these reliability requirements are presented, many potential architectures are briefly evaluated, and one candidate architecture, the Charles Stark Draper Laboratory's Fault-Tolerant Parallel Processor (FTPP) is evaluated in detail. A methodology for designing a preliminary system configuration to meet the reliability and performance requirements of the mission is then presented and demonstrated by designing an FTPP configuration.

  6. Hybrid routing technique for a fault-tolerant, integrated information network

    NASA Technical Reports Server (NTRS)

    Meredith, B. D.

    1986-01-01

    The evolutionary growth of the space station and the diverse activities onboard are expected to require a hierarchy of integrated, local area networks capable of supporting data, voice, and video communications. In addition, fault-tolerant network operation is necessary to protect communications between critical systems attached to the net and to relieve the valuable human resources onboard the space station of time-critical data system repair tasks. A key issue for the design of the fault-tolerant, integrated network is the development of a robust routing algorithm which dynamically selects the optimum communication paths through the net. A routing technique is described that adapts to topological changes in the network to support fault-tolerant operation and system evolvability.

  7. Evaluating and extending user-level fault tolerance in MPI applications

    DOE PAGES

    Laguna, Ignacio; Richards, David F.; Gamblin, Todd; ...

    2016-01-11

    The user-level failure mitigation (ULFM) interface has been proposed to provide fault-tolerant semantics in the Message Passing Interface (MPI). Previous work presented performance evaluations of ULFM; yet questions related to its programability and applicability, especially to non-trivial, bulk synchronous applications, remain unanswered. In this article, we present our experiences on using ULFM in a case study with a large, highly scalable, bulk synchronous molecular dynamics application to shed light on the advantages and difficulties of this interface to program fault-tolerant MPI applications. We found that, although ULFM is suitable for master–worker applications, it provides few benefits for more common bulkmore » synchronous MPI applications. Furthermore, to address these limitations, we introduce a new, simpler fault-tolerant interface for complex, bulk synchronous MPI programs with better applicability and support than ULFM for application-level recovery mechanisms, such as global rollback.« less

  8. Determining and improving the fault tolerance of multilayer perceptrons in a pattern-recognition application.

    PubMed

    Emmerson, M D; Damper, R I

    1993-01-01

    We investigate empirically the performance under damage conditions of single- and multilayer perceptrons (MLP's), with various numbers of hidden units, in a representative pattern-recognition task. While some degree of graceful degradation was observed, the single-layer perceptron was considerably less fault tolerant than any of the multilayer perceptrons, including one with fewer adjustable weights. Our initial hypothesis that fault tolerance would be significantly improved for multilayer nets with larger numbers of hidden units proved incorrect. Indeed, there appeared to be a liability to having excess hidden units. A simple technique (called augmentation) is described, which was successful in translating excess hidden units into improved fault tolerance. Finally, our results were supported by applying singular value decomposition (SVD) analysis to the MLP's internal representations.

  9. The method providing fault-tolerance for information and control systems of the industrial mechatronic objects

    NASA Astrophysics Data System (ADS)

    Melnik, E. V.; Klimenko, A. B.; Korobkin, V. V.

    2017-02-01

    The paper deals with the provision of information and control system fault-tolerance. Nowadays, a huge quantity of industrial mechatronic objects operate within hazardous environments, where the human is not supposed to be. So the question of fault-tolerant information and control system design and development becomes the cornerstone of a large amount of industrial mechatronic objects. Within this paper, a new complex method of providing the reconfigurable systems fault-tolerance is represented. It bases on performance redundancy and decentralized dispatching principles. The key term within the method presented is a ‘configuration’, so the model of the configuration forming problem is represented too, and simulation results are given and discussed briefly.

  10. Evaluating and extending user-level fault tolerance in MPI applications

    SciTech Connect

    Laguna, Ignacio; Richards, David F.; Gamblin, Todd; Schulz, Martin; de Supinski, Bronis R.; Mohror, Kathryn; Pritchard, Howard

    2016-01-11

    The user-level failure mitigation (ULFM) interface has been proposed to provide fault-tolerant semantics in the Message Passing Interface (MPI). Previous work presented performance evaluations of ULFM; yet questions related to its programability and applicability, especially to non-trivial, bulk synchronous applications, remain unanswered. In this article, we present our experiences on using ULFM in a case study with a large, highly scalable, bulk synchronous molecular dynamics application to shed light on the advantages and difficulties of this interface to program fault-tolerant MPI applications. We found that, although ULFM is suitable for master–worker applications, it provides few benefits for more common bulk synchronous MPI applications. Furthermore, to address these limitations, we introduce a new, simpler fault-tolerant interface for complex, bulk synchronous MPI programs with better applicability and support than ULFM for application-level recovery mechanisms, such as global rollback.

  11. Cascade structures of fault-tolerant control schemes with the static and dynamic output controllers

    NASA Astrophysics Data System (ADS)

    Krokavec, Dušan; Filasová, Anna

    2017-01-01

    An enhanced approach to fault-tolerant control design is proposed in the paper for linear systems subject to cascade control strategy, while static and dynamic output controllers are employed to maintain the stability of the overall interconnected control structure. The controller gains are solved simultaneously using two-step linear matrix inequality formulation, conditioned by linear matrix equalities. A simulation example, subject to a system matrix parameter fault, demonstrates the effiectiveness of the proposed method of design and cascade control technique.

  12. A novel mathematical setup for fault tolerant control systems with state-dependent failure process

    NASA Astrophysics Data System (ADS)

    Chitraganti, S.; Aberkane, S.; Aubrun, C.

    2014-12-01

    In this paper, we consider a fault tolerant control system (FTCS) with state- dependent failures and provide a tractable mathematical model to handle the state-dependent failures. By assuming abrupt changes in system parameters, we use a jump process modelling of failure process and the fault detection and isolation (FDI) process. In particular, we assume that the failure rates of the failure process vary according to which set the state of the system belongs to.

  13. Fault-Tolerant, Multiple-Zone Temperature Control

    NASA Technical Reports Server (NTRS)

    Granger, James; Franklin, Brian; Michalik, Martin; Yates, Phillip; Peterson, Erik; Borders, James

    2008-01-01

    A computer program has been written as an essential part of an electronic temperature control system for a spaceborne instrument that contains several zones. The system was developed because the temperature and the rate of change of temperature in each zone are required to be maintained to within limits that amount to degrees of precision thought to be unattainable by use of simple bimetallic thermostats. The software collects temperature readings from six platinum resistance thermometers, calculates temperature errors from the readings, and implements a proportional + integral + derivative (PID) control algorithm that adjusts heater power levels. The software accepts, via a serial port, commands to change its operational parameters. The software attempts to detect and mitigate a host of potential faults. It is robust to many kinds of faults in that it can maintain PID control in the presence of those faults.

  14. Energy-efficient fault tolerance in multiprocessor real-time systems

    NASA Astrophysics Data System (ADS)

    Guo, Yifeng

    The recent progress in the multiprocessor/multicore systems has important implications for real-time system design and operation. From vehicle navigation to space applications as well as industrial control systems, the trend is to deploy multiple processors in real-time systems: systems with 4 -- 8 processors are common, and it is expected that many-core systems with dozens of processing cores will be available in near future. For such systems, in addition to general temporal requirement common for all real-time systems, two additional operational objectives are seen as critical: energy efficiency and fault tolerance. An intriguing dimension of the problem is that energy efficiency and fault tolerance are typically conflicting objectives, due to the fact that tolerating faults (e.g., permanent/transient) often requires extra resources with high energy consumption potential. In this dissertation, various techniques for energy-efficient fault tolerance in multiprocessor real-time systems have been investigated. First, the Reliability-Aware Power Management (RAPM) framework, which can preserve the system reliability with respect to transient faults when Dynamic Voltage Scaling (DVS) is applied for energy savings, is extended to support parallel real-time applications with precedence constraints. Next, the traditional Standby-Sparing (SS) technique for dual processor systems, which takes both transient and permanent faults into consideration while saving energy, is generalized to support multiprocessor systems with arbitrary number of identical processors. Observing the inefficient usage of slack time in the SS technique, a Preference-Oriented Scheduling Framework is designed to address the problem where tasks are given preferences for being executed as soon as possible (ASAP) or as late as possible (ALAP). A preference-oriented earliest deadline (POED) scheduler is proposed and its application in multiprocessor systems for energy-efficient fault tolerance is

  15. Fault tolerance control of phase current in permanent magnet synchronous motor control system

    NASA Astrophysics Data System (ADS)

    Chen, Kele; Chen, Ke; Chen, Xinglong; Li, Jinying

    2014-08-01

    As the Photoelectric tracking system develops from earth based platform to all kinds of moving platform such as plane based, ship based, car based, satellite based and missile based, the fault tolerance control system of phase current sensor is studied in order to detect and control of failure of phase current sensor on a moving platform. By using a DC-link current sensor and the switching state of the corresponding SVPWM inverter, the failure detection and fault control of three phase current sensor is achieved. Under such conditions as one failure, two failures and three failures, fault tolerance is able to be controlled. The reason why under the method, there exists error between fault tolerance control and actual phase current, is analyzed, and solution to weaken the error is provided. The experiment based on permanent magnet synchronous motor system is conducted, and the method is proven to be capable of detecting the failure of phase current sensor effectively and precisely, and controlling the fault tolerance simultaneously. With this method, even though all the three phase current sensors malfunction, the moving platform can still work by reconstructing the phase current of the motor.

  16. Refinement for fault-tolerance: An aircraft hand-off protocol

    NASA Technical Reports Server (NTRS)

    Marzullo, Keith; Schneider, Fred B.; Dehn, Jon

    1994-01-01

    Part of the Advanced Automation System (AAS) for air-traffic control is a protocol to permit flight hand-off from one air-traffic controller to another. The protocol must be fault-tolerant and, therefore, is subtle -- an ideal candidate for the application of formal methods. This paper describes a formal method for deriving fault-tolerant protocols that is based on refinement and proof outlines. The AAS hand-off protocol was actually derived using this method; that derivation is given.

  17. Enhanced fault-tolerant quantum computing in d-level systems.

    PubMed

    Campbell, Earl T

    2014-12-05

    Error-correcting codes protect quantum information and form the basis of fault-tolerant quantum computing. Leading proposals for fault-tolerant quantum computation require codes with an exceedingly rare property, a transversal non-Clifford gate. Codes with the desired property are presented for d-level qudit systems with prime d. The codes use n=d-1 qudits and can detect up to ∼d/3 errors. We quantify the performance of these codes for one approach to quantum computation known as magic-state distillation. Unlike prior work, we find performance is always enhanced by increasing d.

  18. Fault-tolerant linear optical quantum computing with small-amplitude coherent States.

    PubMed

    Lund, A P; Ralph, T C; Haselgrove, H L

    2008-01-25

    Quantum computing using two coherent states as a qubit basis is a proposed alternative architecture with lower overheads but has been questioned as a practical way of performing quantum computing due to the fragility of diagonal states with large coherent amplitudes. We show that using error correction only small amplitudes (alpha>1.2) are required for fault-tolerant quantum computing. We study fault tolerance under the effects of small amplitudes and loss using a Monte Carlo simulation. The first encoding level resources are orders of magnitude lower than the best single photon scheme.

  19. The SIFT computer and its development. [Software Implemented Fault Tolerance for aircraft control

    NASA Technical Reports Server (NTRS)

    Goldberg, J.

    1981-01-01

    Software Implemented Fault Tolerance (SIFT) is an aircraft control computer designed to allow failure probability of less than 10 to the -10th/hour. The system is based on advanced fault-tolerance computing and validation methodology. Since confirmation of reliability by observation is essentially impossible, system reliability is estimated by a Markov model. A mathematical proof is used to justify the validity of the Markov model. System design is represented by a hierarchy of abstract models, and the design proof comprises mathematical proofs that each model is, in fact, an elaboration of the next more abstract model.

  20. Fault-tolerant computer study. [logic designs for building block circuits

    NASA Technical Reports Server (NTRS)

    Rennels, D. A.; Avizienis, A. A.; Ercegovac, M. D.

    1981-01-01

    A set of building block circuits is described which can be used with commercially available microprocessors and memories to implement fault tolerant distributed computer systems. Each building block circuit is intended for VLSI implementation as a single chip. Several building blocks and associated processor and memory chips form a self checking computer module with self contained input output and interfaces to redundant communications buses. Fault tolerance is achieved by connecting self checking computer modules into a redundant network in which backup buses and computer modules are provided to circumvent failures. The requirements and design methodology which led to the definition of the building block circuits are discussed.

  1. Validation Methods Research for Fault-Tolerant Avionics and Control Systems: Working Group Meeting, 2

    NASA Technical Reports Server (NTRS)

    Gault, J. W. (Editor); Trivedi, K. S. (Editor); Clary, J. B. (Editor)

    1980-01-01

    The validation process comprises the activities required to insure the agreement of system realization with system specification. A preliminary validation methodology for fault tolerant systems documented. A general framework for a validation methodology is presented along with a set of specific tasks intended for the validation of two specimen system, SIFT and FTMP. Two major areas of research are identified. First, are those activities required to support the ongoing development of the validation process itself, and second, are those activities required to support the design, development, and understanding of fault tolerant systems.

  2. Self-stabilizing byzantine-fault-tolerant clock synchronization system and method

    NASA Technical Reports Server (NTRS)

    Malekpour, Mahyar R. (Inventor)

    2012-01-01

    Systems and methods for rapid Byzantine-fault-tolerant self-stabilizing clock synchronization are provided. The systems and methods are based on a protocol comprising a state machine and a set of monitors that execute once every local oscillator tick. The protocol is independent of specific application specific requirements. The faults are assumed to be arbitrary and/or malicious. All timing measures of variables are based on the node's local clock and thus no central clock or externally generated pulse is used. Instances of the protocol are shown to tolerate bursts of transient failures and deterministically converge with a linear convergence time with respect to the synchronization period as predicted.

  3. Fault-tolerant controlled deterministic secure quantum communication using EPR states against collective noise

    NASA Astrophysics Data System (ADS)

    Kao, Shih-Hung; Yang, Chun-Wei; Hwang, Tzonelih

    2016-11-01

    This paper proposes two new fault-tolerant controlled deterministic secure quantum communication (CDSQC) protocols based only on Einstein-Podolsky-Rosen (EPR) entangled states. The proposed protocols are designed to be robust against the collective-dephasing noise and the collective-rotation noise, respectively. Compared to the existing fault-tolerant controlled quantum communication protocols, the proposed protocols not only can do without a quantum channel between the receiver and the controller as the state-of-the-art protocols do, but also have the advantage that the number of quantum particles required in the CDSQC protocols is reduced owing to the use of the simplest entangled states.

  4. A fault-tolerant voltage measurement method for series connected battery packs

    NASA Astrophysics Data System (ADS)

    Xia, Bing; Mi, Chris

    2016-03-01

    This paper proposes a fault-tolerant voltage measurement method for battery management systems. Instead of measuring the voltage of individual cells, the proposed method measures the voltage sum of multiple battery cells without additional voltage sensors. A matrix interpretation is developed to demonstrate the viability of the proposed sensor topology to distinguish between sensor faults and cell faults. A methodology is introduced to isolate sensor and cell faults by locating abnormal signals. A measurement electronic circuit is proposed to implement the design concept. Simulation and experiment results support the mathematical analysis and validate the feasibility and robustness of the proposed method. In addition, the measurement problem is generalized and the condition for valid sensor topology is discovered. The tuning of design parameters are analyzed based on fault detection reliability and noise levels.

  5. Generating a fault-tolerant global clock using high-speed control signals for the MetaNet architecture

    SciTech Connect

    Ofek, Y. )

    1994-05-01

    This work describes a new technique, based on exchanging control signals between neighboring nodes, for constructing a stable and fault-tolerant global clock in a distributed system with an arbitrary topology. It is shown that it is possible to construct a global clock reference with time step that is much smaller than the propagation delay over the network's links. The synchronization algorithm ensures that the global clock tick' has a stable periodicity, and therefore, it is possible to tolerate failures of links and clocks that operate faster and/or slower than nominally specified, as well as hard failures. The approach taken in this work is to generate a global clock from the ensemble of the local transmission clocks and not to directly synchronize these high-speed clocks. The steady-state algorithm, which generates the global clock, is executed in hardware by the network interface of each node. At the network interface, it is possible to measure accurately the propagation delay between neighboring nodes with a small error or uncertainty and thereby to achieve global synchronization that is proportional to these error measurements. It is shown that the local clock drift (or rate uncertainty) has only a secondary effect on the maximum global clock rate. The synchronization algorithm can tolerate any physical failure. 18 refs.

  6. Fault tolerant formation control of nonholonomic mobile robots using online approximators

    NASA Astrophysics Data System (ADS)

    Thumati, Balaje T.; Dierks, Travis A.; Jagannathan, S.

    2010-04-01

    For unmanned systems, it is desirable to have some sort of fault tolerant ability in order to accomplish the mission. Therefore, in this paper, the fault tolerant control of a formation of nonholonomic mobile robots in the presence unknown faults is undertaken. Initially, a kinematic/torque leader-follower formation control law is developed for the robots under the assumption of normal operation, and the stability of the formation is verified using Lyapunov theory. Subsequently, the control law for the formation is modified by incorporating an additional term, and this new control law compensates the effects of the faults. Moreover, the faults could be incipient or abrupt in nature. The additional term used in the modified control law is a function of the unknown fault dynamics which are recovered using the online learning capabilities of online approximators. Additionally, asymptotic convergence of the FDA scheme and the formation errors in the presence of faults is shown using Lyapunov theory. Finally, numerical results are provided to verify the theoretical conjectures.

  7. Formal specification of requirements for analytical redundancy-based fault-tolerant flight control systems

    NASA Astrophysics Data System (ADS)

    Del Gobbo, Diego

    2000-10-01

    Flight control systems are undergoing a rapid process of automation. The use of Fly-By-Wire digital flight control systems in commercial aviation (Airbus 320 and Boeing FBW-B777) is a clear sign of this trend. The increased automation goes in parallel with an increased complexity of flight control systems with obvious consequences on reliability and safety. Flight control systems must meet strict fault-tolerance requirements. The standard solution to achieving fault tolerance capability relies on multi-string architectures. On the other hand, multi-string architectures further increase the complexity of the system inducing a reduction of overall reliability. In the past two decades a variety of techniques based on analytical redundancy have been suggested for fault diagnosis purposes. While research on analytical redundancy has obtained desirable results, a design methodology involving requirements specification and feasibility analysis of analytical redundancy based fault tolerant flight control systems is missing. The main objective of this research work is to describe within a formal framework the implications of adopting analytical redundancy as a basis to achieve fault tolerance. The research activity involves analysis of the analytical redundancy approach, analysis of flight control system informal requirements, and re-engineering (modeling and specification) of the fault tolerance requirements. The USAF military specification MIL-F-9490D and supporting documents are adopted as source for the flight control informal requirements. The De Havilland DHC-2 general aviation aircraft equipped with standard autopilot control functions is adopted as pilot application. Relational algebra is adopted as formal framework for the specification of the requirements. The detailed analysis and formalization of the requirements resulted in a better definition of the fault tolerance problem in the framework of analytical redundancy. Fault tolerance requirements and related

  8. Flight elements: Fault detection and fault management

    NASA Technical Reports Server (NTRS)

    Lum, H.; Patterson-Hine, A.; Edge, J. T.; Lawler, D.

    1990-01-01

    Fault management for an intelligent computational system must be developed using a top down integrated engineering approach. An approach proposed includes integrating the overall environment involving sensors and their associated data; design knowledge capture; operations; fault detection, identification, and reconfiguration; testability; causal models including digraph matrix analysis; and overall performance impacts on the hardware and software architecture. Implementation of the concept to achieve a real time intelligent fault detection and management system will be accomplished via the implementation of several objectives, which are: Development of fault tolerant/FDIR requirement and specification from a systems level which will carry through from conceptual design through implementation and mission operations; Implementation of monitoring, diagnosis, and reconfiguration at all system levels providing fault isolation and system integration; Optimize system operations to manage degraded system performance through system integration; and Lower development and operations costs through the implementation of an intelligent real time fault detection and fault management system and an information management system.

  9. LQCD workflow execution framework: Models, provenance and fault-tolerance

    NASA Astrophysics Data System (ADS)

    Piccoli, Luciano; Dubey, Abhishek; Simone, James N.; Kowalkowlski, James B.

    2010-04-01

    Large computing clusters used for scientific processing suffer from systemic failures when operated over long continuous periods for executing workflows. Diagnosing job problems and faults leading to eventual failures in this complex environment is difficult, specifically when the success of an entire workflow might be affected by a single job failure. In this paper, we introduce a model-based, hierarchical, reliable execution framework that encompass workflow specification, data provenance, execution tracking and online monitoring of each workflow task, also referred to as participants. The sequence of participants is described in an abstract parameterized view, which is translated into a concrete data dependency based sequence of participants with defined arguments. As participants belonging to a workflow are mapped onto machines and executed, periodic and on-demand monitoring of vital health parameters on allocated nodes is enabled according to pre-specified rules. These rules specify conditions that must be true pre-execution, during execution and post-execution. Monitoring information for each participant is propagated upwards through the reflex and healing architecture, which consists of a hierarchical network of decentralized fault management entities, called reflex engines. They are instantiated as state machines or timed automatons that change state and initiate reflexive mitigation action(s) upon occurrence of certain faults. We describe how this cluster reliability framework is combined with the workflow execution framework using formal rules and actions specified within a structure of first order predicate logic that enables a dynamic management design that reduces manual administrative workload, and increases cluster-productivity.

  10. Fault-tolerant quantum computation with asymmetric Bacon-Shor codes

    NASA Astrophysics Data System (ADS)

    Brooks, Peter; Preskill, John

    2012-02-01

    Bacon-Shor codes are quantum subsystem codes which are constructed by combining together two quantum repetition codes, one protecting against Z (phase) errors and the other protecting against X (bit flip) errors. In many situations, for example flux qubits, the noise is biased such that faults that produce Z errors are much more common than faults that produce X errors; in these cases it is natural to consider an asymmetric Bacon-Shor code where the code protecting against Z errors is longer than the code protecting against X errors. This work describes fault-tolerant constructions for gadgets that achieve universal fault-tolerant quantum computation using asymmetric Bacon-Shor codes. Gadgets take advantage of the Bacon-Shor structure by breaking up into parallel smaller gadgets that act on a single row or column, with majority voting of the separate results. For a bias of ɛ/ɛ' = 10^4, we prove a threshold around 2.5 x10-3. The effective error strength is shown to decrease rapidly (faster than polynomial) with decreasing ɛ. Therefore it may be practical to use Bacon-Shor codes directly with no additional concatenation. This could greatly reduce the resource overhead required for fault-tolerant computation with biased noise.

  11. Step-by-step magic state encoding for efficient fault-tolerant quantum computation.

    PubMed

    Goto, Hayato

    2014-12-16

    Quantum error correction allows one to make quantum computers fault-tolerant against unavoidable errors due to decoherence and imperfect physical gate operations. However, the fault-tolerant quantum computation requires impractically large computational resources for useful applications. This is a current major obstacle to the realization of a quantum computer. In particular, magic state distillation, which is a standard approach to universality, consumes the most resources in fault-tolerant quantum computation. For the resource problem, here we propose step-by-step magic state encoding for concatenated quantum codes, where magic states are encoded step by step from the physical level to the logical one. To manage errors during the encoding, we carefully use error detection. Since the sizes of intermediate codes are small, it is expected that the resource overheads will become lower than previous approaches based on the distillation at the logical level. Our simulation results suggest that the resource requirements for a logical magic state will become comparable to those for a single logical controlled-NOT gate. Thus, the present method opens a new possibility for efficient fault-tolerant quantum computation.

  12. Software reliability models for fault-tolerant avionics computers and related topics

    NASA Technical Reports Server (NTRS)

    Miller, Douglas R.

    1987-01-01

    Software reliability research is briefly described. General research topics are reliability growth models, quality of software reliability prediction, the complete monotonicity property of reliability growth, conceptual modelling of software failure behavior, assurance of ultrahigh reliability, and analysis techniques for fault-tolerant systems.

  13. Fault-tolerant system considerations for a redundant strapdown inertial measurement unit

    NASA Technical Reports Server (NTRS)

    Motyka, P.; Ornedo, R.; Mangoubi, R.

    1984-01-01

    The development and evaluation of a fault-tolerant system for the Redundant Strapdown Inertial Measurement Unit (RSDIMU) being developed and evaluated by the NASA Langley Research Center was continued. The RSDIMU consists of four two-degree-of-freedom gyros and accelerometers mounted on the faces of a semi-octahedron which can be separated into two halves for damage protection. Compensated and uncompensated fault-tolerant system failure decision algorithms were compared. An algorithm to compensate for sensor noise effects in the fault-tolerant system thresholds was evaluated via simulation. The effects of sensor location and magnitude of the vehicle structural modes on system performance were assessed. A threshold generation algorithm, which incorporates noise compensation and filtered parity equation residuals for structural mode compensation, was evaluated. The effects of the fault-tolerant system on navigational accuracy were also considered. A sensor error parametric study was performed in an attempt to improve the soft failure detection capability without obtaining false alarms. Also examined was an FDI system strategy based on the pairwise comparison of sensor measurements. This strategy has the specific advantage of, in many instances, successfully detecting and isolating up to two simultaneously occurring failures.

  14. Design of a 2*2 fault-tolerant switching element

    SciTech Connect

    Woei Lin; Chuan-lin Wu

    1982-01-01

    The architecture of a 2*2 fault-tolerant switching element which can be used to modularly construct interconnection networks for multiprocessing and local computer networking is described. The switching element uses distributed control and circuit switching. Its good gate-to-pin ratio can facilitate VLSI implementation. 18 references.

  15. Cost and benefits design optimization model for fault tolerant flight control systems

    NASA Technical Reports Server (NTRS)

    Rose, J.

    1982-01-01

    Requirements and specifications for a method of optimizing the design of fault-tolerant flight control systems are provided. Algorithms that could be used for developing new and modifying existing computer programs are also provided, with recommendations for follow-on work.

  16. A survey of provably correct fault-tolerant clock synchronization techniques

    NASA Technical Reports Server (NTRS)

    Butler, Ricky W.

    1988-01-01

    Six provably correct fault-tolerant clock synchronization algorithms are examined. These algorithms are all presented in the same notation to permit easier comprehension and comparison. The advantages and disadvantages of the different techniques are examined and issues related to the implementation of these algorithms are discussed. The paper argues for the use of such algorithms in life-critical applications.

  17. Fault tolerant quantum key distribution protocol with collective random unitary noise

    NASA Astrophysics Data System (ADS)

    Wang, Xiang-Bin

    2005-11-01

    We propose an easy implementable prepare-and-measure protocol for robust quantum key distribution with photon polarization. The protocol is fault tolerant against collective random unitary channel noise. The protocol does not need any collective quantum measurement or quantum memory. A security proof and a specific linear optical realization using spontaneous parametric down conversion are given.

  18. An adaptive fault-tolerant event detection scheme for wireless sensor networks.

    PubMed

    Yim, Sung-Jib; Choi, Yoon-Hwa

    2010-01-01

    In this paper, we present an adaptive fault-tolerant event detection scheme for wireless sensor networks. Each sensor node detects an event locally in a distributed manner by using the sensor readings of its neighboring nodes. Confidence levels of sensor nodes are used to dynamically adjust the threshold for decision making, resulting in consistent performance even with increasing number of faulty nodes. In addition, the scheme employs a moving average filter to tolerate most transient faults in sensor readings, reducing the effective fault probability. Only three bits of data are exchanged to reduce the communication overhead in detecting events. Simulation results show that event detection accuracy and false alarm rate are kept very high and low, respectively, even in the case where 50% of the sensor nodes are faulty.

  19. Robust Gain-Scheduled Fault Tolerant Control for a Transport Aircraft

    NASA Technical Reports Server (NTRS)

    Shin, Jong-Yeob; Gregory, Irene

    2007-01-01

    This paper presents an application of robust gain-scheduled control concepts using a linear parameter-varying (LPV) control synthesis method to design fault tolerant controllers for a civil transport aircraft. To apply the robust LPV control synthesis method, the nonlinear dynamics must be represented by an LPV model, which is developed using the function substitution method over the entire flight envelope. The developed LPV model associated with the aerodynamic coefficient uncertainties represents nonlinear dynamics including those outside the equilibrium manifold. Passive and active fault tolerant controllers (FTC) are designed for the longitudinal dynamics of the Boeing 747-100/200 aircraft in the presence of elevator failure. Both FTC laws are evaluated in the full nonlinear aircraft simulation in the presence of the elevator fault and the results are compared to show pros and cons of each control law.

  20. SFTP: A Secure and Fault-Tolerant Paradigm against Blackhole Attack in MANET

    NASA Astrophysics Data System (ADS)

    KumarRout, Jitendra; Kumar Bhoi, Sourav; Kumar Panda, Sanjaya

    2013-02-01

    Security issues in MANET are a challenging task nowadays. MANETs are vulnerable to passive attacks and active attacks because of a limited number of resources and lack of centralized authority. Blackhole attack is an attack in network layer which degrade the network performance by dropping the packets. In this paper, we have proposed a Secure Fault-Tolerant Paradigm (SFTP) which checks the Blackhole attack in the network. The three phases used in SFTP algorithm are designing of coverage area to find the area of coverage, Network Connection algorithm to design a fault-tolerant model and Route Discovery algorithm to discover the route and data delivery from source to destination. SFTP gives better network performance by making the network fault free.

  1. Experimental Robot Position Sensor Fault Tolerance Using Accelerometers and Joint Torque Sensors

    NASA Technical Reports Server (NTRS)

    Aldridge, Hal A.; Juang, Jer-Nan

    1997-01-01

    Robot systems in critical applications, such as those in space and nuclear environments, must be able to operate during component failure to complete important tasks. One failure mode that has received little attention is the failure of joint position sensors. Current fault tolerant designs require the addition of directly redundant position sensors which can affect joint design. The proposed method uses joint torque sensors found in most existing advanced robot designs along with easily locatable, lightweight accelerometers to provide a joint position sensor fault recovery mode. This mode uses the torque sensors along with a virtual passive control law for stability and accelerometers for joint position information. Two methods for conversion from Cartesian acceleration to joint position based on robot kinematics, not integration, are presented. The fault tolerant control method was tested on several joints of a laboratory robot. The controllers performed well with noisy, biased data and a model with uncertain parameters.

  2. General linear codes for fault-tolerant matrix operations on processor arrays

    NASA Technical Reports Server (NTRS)

    Nair, V. S. S.; Abraham, J. A.

    1988-01-01

    Various checksum codes have been suggested for fault-tolerant matrix computations on processor arrays. Use of these codes is limited due to potential roundoff and overflow errors. Numerical errors may also be misconstrued as errors due to physical faults in the system. In this a set of linear codes is identified which can be used for fault-tolerant matrix operations such as matrix addition, multiplication, transposition, and LU-decomposition, with minimum numerical error. Encoding schemes are given for some of the example codes which fall under the general set of codes. With the help of experiments, a rule of thumb for the selection of a particular code for a given application is derived.

  3. Fault-Tolerant Tracker for Interconnected Large-Scale Nonlinear Systems with Input Constraint

    NASA Astrophysics Data System (ADS)

    Shiu, Y. C.; Tsai, J. S. H.; Guo, S. M.; Shieh, L. S.; Han, Z.

    This paper presents the decentralized fault-tolerant tracker based on the model predictive control (MPC) for a class of unknown interconnected large-scale sampled-data nonlinear systems. Due to the computational requirements of MPC and the system information is unknown, the observer/Kalman filter identification (OKID) method is utilized to determine decentralized appropriate (low-) order discrete-time linear models. Then, to overcome the effect of modeling error on the identified linear model of each subsystem, the improved observers with the high-gain property based on the digital redesign approach will be presented. Once fault is detected in each decentralized controller, one of the backup control configurations in each decentralized subsystem is switched to using the soft switching approach. Thus, the decentralized fault-tolerant control with the closed-loop decoupling property can be achieved through the above approach with high-gain property decentralized observer/tracker.

  4. A set-associative, fault-tolerant cache design

    NASA Technical Reports Server (NTRS)

    Lamet, Dan; Frenzel, James F.

    1992-01-01

    The design of a defect-tolerant control circuit for a set-associative cache memory is presented. The circuit maintains the stack ordering necessary for implementing the Least Recently Used (LRU) replacement algorithm. A discussion of programming techniques for bypassing defective blocks is included.

  5. Error channels and the threshold for fault-tolerant quantum computation

    NASA Astrophysics Data System (ADS)

    Eastin, Bryan

    The threshold for fault-tolerant quantum computation depends on the available resources, including knowledge about the error model. I investigate the utility of such knowledge by designing a fault-tolerant procedure tailored to a restricted stochastic Pauli channel and studying the corresponding threshold for quantum computation. Surprisingly, I find that tailoring yields, at best, modest gains in the threshold, while substantial losses occur for error models only marginally different from the assumed channel. This result is shown to derive from the fact that the ancillae used in threshold estimation are of exceedingly high quality and, thus, difficult to improve upon. Motivated by this discovery, I propose a tractable algebraic algorithm for predicting the outcome of threshold estimates, one which approximates ancillae as having independent and identically distributed errors on their constituent qubits. In the limit of an infinitely large code, the algorithm simplifies tremendously, yielding a rigorous threshold bound given the availability of ancillae with i.i.d. errors. I use this bound as a metric to judge the relative performance of various fault-tolerant procedures in combination with different error models. Modest gains in the threshold are observed for certain restricted error models, and, for the assumed ancillae, Knill's fault-tolerant method is found to be superior to that of Steane. My algorithm generally yields high threshold bounds, reflecting the computational value of large, low-error ancillae. In an effort to render these bounds achievable, I develop a novel procedure for directly constructing large ancillae. Numerically, the scaling and average error properties of this procedure are found to be encouraging, and, though it is not fault-tolerant, I prove that each error can spread to only one additional location. Promising means of improving the ancillae are proposed, and I discuss briefly the challenges associated with preparing the cat states

  6. Results and perspectives on fault tolerant control for a class of hybrid systems

    NASA Astrophysics Data System (ADS)

    Jiang, Bin; Yang, Hao; Cocquempot, Vincent

    2011-02-01

    This article addresses the fault tolerant control (FTC) issue for a class of hybrid systems (HS) modelled by hybrid automata. Two kinds of faults are considered: continuous fault that affects each continuous system mode; discrete fault that affects the switching conditions. In these two faulty cases, the FTC design has two main objectives: (1) maintain the continuous performances including various stabilities of the origin and the output tracking/regulation behaviours along the trajectories of HS; (2) maintain the discrete specifications that have to be followed by HS, e.g. a desired switching sequence. The following three FTC methodologies are considered: FTC for HS with continuous stability goal; FTC for HS with discrete specifications; supervisory FTC design via hybrid control techniques. Some perspectives are also provided. This article provides the readers a survey on the main techniques that can be used to achieve these FTC goals of HS.

  7. A robust adaptive nonlinear fault-tolerant controller via norm estimation for reusable launch vehicles

    NASA Astrophysics Data System (ADS)

    Hu, Chaofang; Gao, Zhifei; Ren, Yanli; Liu, Yunbing

    2016-11-01

    In this paper, a reusable launch vehicle (RLV) attitude control problem with actuator faults is addressed via the robust adaptive nonlinear fault-tolerant control (FTC) with norm estimation. Firstly, the accurate tracking task of attitude angles in the presence of parameter uncertainties and external disturbances is considered. A fault-free controller is proposed using dynamic surface control (DSC) combined with fuzzy adaptive approach. Furthermore, the minimal learning parameter strategy via norm estimation technique is introduced to reduce the multi-parameter adaptive computation burden of fuzzy approximation of the lump uncertainties. Secondly, a compensation controller is designed to handle the partial loss fault of actuator effectiveness. The unknown maximum eigenvalue of actuator efficiency loss factors is estimated online. Moreover, stability analysis guarantees that all signals of the closed-loop control system are semi-global uniformly ultimately bounded. Finally, illustrative simulations show the effectiveness of the proposed method.

  8. A Self-Stabilizing Hybrid-Fault Tolerant Synchronization Protocol

    NASA Technical Reports Server (NTRS)

    Malekpour, Mahyar R.

    2014-01-01

    In this report we present a strategy for solving the Byzantine general problem for self-stabilizing a fully connected network from an arbitrary state and in the presence of any number of faults with various severities including any number of arbitrary (Byzantine) faulty nodes. Our solution applies to realizable systems, while allowing for differences in the network elements, provided that the number of arbitrary faults is not more than a third of the network size. The only constraint on the behavior of a node is that the interactions with other nodes are restricted to defined links and interfaces. Our solution does not rely on assumptions about the initial state of the system and no central clock nor centrally generated signal, pulse, or message is used. Nodes are anonymous, i.e., they do not have unique identities. We also present a mechanical verification of a proposed protocol. A bounded model of the protocol is verified using the Symbolic Model Verifier (SMV). The model checking effort is focused on verifying correctness of the bounded model of the protocol as well as confirming claims of determinism and linear convergence with respect to the self-stabilization period. We believe that our proposed solution solves the general case of the clock synchronization problem.

  9. Fault Injection and Monitoring Capability for a Fault-Tolerant Distributed Computation System

    NASA Technical Reports Server (NTRS)

    Torres-Pomales, Wilfredo; Yates, Amy M.; Malekpour, Mahyar R.

    2010-01-01

    The Configurable Fault-Injection and Monitoring System (CFIMS) is intended for the experimental characterization of effects caused by a variety of adverse conditions on a distributed computation system running flight control applications. A product of research collaboration between NASA Langley Research Center and Old Dominion University, the CFIMS is the main research tool for generating actual fault response data with which to develop and validate analytical performance models and design methodologies for the mitigation of fault effects in distributed flight control systems. Rather than a fixed design solution, the CFIMS is a flexible system that enables the systematic exploration of the problem space and can be adapted to meet the evolving needs of the research. The CFIMS has the capabilities of system-under-test (SUT) functional stimulus generation, fault injection and state monitoring, all of which are supported by a configuration capability for setting up the system as desired for a particular experiment. This report summarizes the work accomplished so far in the development of the CFIMS concept and documents the first design realization.

  10. Fault tolerance in onboard processors - Protecting efficient FDM demultiplexers

    NASA Astrophysics Data System (ADS)

    Redinbo, Robert

    1992-03-01

    The application of convolutional codes to protect demultiplexer filter banks is demonstrated analytically for efficient implementations. An overview is given of the parameters for the efficient implementations of filter banks, and real convolutional codes are discussed in terms of DSP operations. Methods for composite filtering and parity generation are outlined, and attention is given to the protection of polyphase filter demultiplexing systems. Real convolutional codes can be applied to protect demultiplexer filter banks by employing two forms of low-rate parity calculation to each filter bank. The parity values are computed either by the output with an FIR parity filter or in parallel with the normal processing by a composite filter. Hardware similarities between the filter bank and the main demultiplexer bank permit efficient redeployment of the processing resources to the main processing function in any configuration.

  11. Resource Costs for Fault-Tolerant Linear Optical Quantum Computing

    NASA Astrophysics Data System (ADS)

    Li, Ying; Humphreys, Peter C.; Mendoza, Gabriel J.; Benjamin, Simon C.

    2015-10-01

    Linear optical quantum computing (LOQC) seems attractively simple: Information is borne entirely by light and processed by components such as beam splitters, phase shifters, and detectors. However, this very simplicity leads to limitations, such as the lack of deterministic entangling operations, which are compensated for by using substantial hardware overheads. Here, we quantify the resource costs for full-scale LOQC by proposing a specific protocol based on the surface code. With the caveat that our protocol can be further optimized, we report that the required number of physical components is at least 5 orders of magnitude greater than in comparable matter-based systems. Moreover, the resource requirements grow further if the per-component photon-loss rate is worse than 1 0-3 or the per-component noise rate is worse than 1 0-5. We identify the performance of switches in the network as the single most influential factor influencing resource scaling.

  12. To err is robotic, to tolerate immunological: fault detection in multirobot systems.

    PubMed

    Tarapore, Danesh; Lima, Pedro U; Carneiro, Jorge; Christensen, Anders Lyhne

    2015-02-02

    Fault detection and fault tolerance represent two of the most important and largely unsolved issues in the field of multirobot systems (MRS). Efficient, long-term operation requires an accurate, timely detection, and accommodation of abnormally behaving robots. Most existing approaches to fault-tolerance prescribe a characterization of normal robot behaviours, and train a model to recognize these behaviours. Behaviours unrecognized by the model are consequently labelled abnormal or faulty. MRS employing these models do not transition well to scenarios involving temporal variations in behaviour (e.g., online learning of new behaviours, or in response to environment perturbations). The vertebrate immune system is a complex distributed system capable of learning to tolerate the organism's tissues even when they change during puberty or metamorphosis, and to mount specific responses to invading pathogens, all without the need of a genetically hardwired characterization of normality. We present a generic abnormality detection approach based on a model of the adaptive immune system, and evaluate the approach in a swarm of robots. Our results reveal the robust detection of abnormal robots simulating common electro-mechanical and software faults, irrespective of temporal changes in swarm behaviour. Abnormality detection is shown to be scalable in terms of the number of robots in the swarm, and in terms of the size of the behaviour classification space.

  13. Adaptive Fault-Tolerant Control of Uncertain Nonlinear Large-Scale Systems With Unknown Dead Zone.

    PubMed

    Chen, Mou; Tao, Gang

    2016-08-01

    In this paper, an adaptive neural fault-tolerant control scheme is proposed and analyzed for a class of uncertain nonlinear large-scale systems with unknown dead zone and external disturbances. To tackle the unknown nonlinear interaction functions in the large-scale system, the radial basis function neural network (RBFNN) is employed to approximate them. To further handle the unknown approximation errors and the effects of the unknown dead zone and external disturbances, integrated as the compounded disturbances, the corresponding disturbance observers are developed for their estimations. Based on the outputs of the RBFNN and the disturbance observer, the adaptive neural fault-tolerant control scheme is designed for uncertain nonlinear large-scale systems by using a decentralized backstepping technique. The closed-loop stability of the adaptive control system is rigorously proved via Lyapunov analysis and the satisfactory tracking performance is achieved under the integrated effects of unknown dead zone, actuator fault, and unknown external disturbances. Simulation results of a mass-spring-damper system are given to illustrate the effectiveness of the proposed adaptive neural fault-tolerant control scheme for uncertain nonlinear large-scale systems.

  14. Fault-tolerant quantum computation with a soft-decision decoder for error correction and detection by teleportation.

    PubMed

    Goto, Hayato; Uchikawa, Hironori

    2013-01-01

    Fault-tolerant quantum computation with quantum error-correcting codes has been considerably developed over the past decade. However, there are still difficult issues, particularly on the resource requirement. For further improvement of fault-tolerant quantum computation, here we propose a soft-decision decoder for quantum error correction and detection by teleportation. This decoder can achieve almost optimal performance for the depolarizing channel. Applying this decoder to Knill's C4/C6 scheme for fault-tolerant quantum computation, which is one of the best schemes so far and relies heavily on error correction and detection by teleportation, we dramatically improve its performance. This leads to substantial reduction of resources.

  15. Automatic specification of reliability models for fault-tolerant computers

    NASA Technical Reports Server (NTRS)

    Liceaga, Carlos A.; Siewiorek, Daniel P.

    1993-01-01

    The calculation of reliability measures using Markov models is required for life-critical processor-memory-switch structures that have standby redundancy or that are subject to transient or intermittent faults or repair. The task of specifying these models is tedious and prone to human error because of the large number of states and transitions required in any reasonable system. Therefore, model specification is a major analysis bottleneck, and model verification is a major validation problem. The general unfamiliarity of computer architects with Markov modeling techniques further increases the necessity of automating the model specification. Automation requires a general system description language (SDL). For practicality, this SDL should also provide a high level of abstraction and be easy to learn and use. The first attempt to define and implement an SDL with those characteristics is presented. A program named Automated Reliability Modeling (ARM) was constructed as a research vehicle. The ARM program uses a graphical interface as its SDL, and it outputs a Markov reliability model specification formulated for direct use by programs that generate and evaluate the model.

  16. Horizontal Fault Tolerance in a Fully Distributed Loosely Coupled Environment

    DTIC Science & Technology

    1990-08-01

    iig ’t, d rI’ cu i w-ts L I -clit 0ur t M~liwo upA’ hitwe , d 11’crt I, w ffiersn 0~is P~r ~ is- I . , 41i C 11i, A ’/A/ ia tId TO01 I M t . I, ti1lrI t...ABSTRACT NS- 154001 -280-5500 Standard Form 298 (Rev 2-89) liii / ~riW iib ’ d bj Nd~t 510 139.Iil GENERAL INSTRUCTIONS FOR COMPLETING SF 298 The Report...TOLERANCE IN A FULLY DISTRIBUTED LOOSELY COUPLED FNVIRONMENT A Dissertation Accession For by NT-1 .A&IDTIC TA1B 0U j i, j j +j ( , k n c e d 0 ] PETER

  17. Sensor and sensorless fault tolerant control for induction motors using a wavelet index.

    PubMed

    Gaeid, Khalaf Salloum; Ping, Hew Wooi; Khalid, Mustafa; Masaoud, Ammar

    2012-01-01

    Fault Tolerant Control (FTC) systems are crucial in industry to ensure safe and reliable operation, especially of motor drives. This paper proposes the use of multiple controllers for a FTC system of an induction motor drive, selected based on a switching mechanism. The system switches between sensor vector control, sensorless vector control, closed-loop voltage by frequency (V/f) control and open loop V/f control. Vector control offers high performance, while V/f is a simple, low cost strategy with high speed and satisfactory performance. The faults dealt with are speed sensor failures, stator winding open circuits, shorts and minimum voltage faults. In the event of compound faults, a protection unit halts motor operation. The faults are detected using a wavelet index. For the sensorless vector control, a novel Boosted Model Reference Adaptive System (BMRAS) to estimate the motor speed is presented, which reduces tuning time. Both simulation results and experimental results with an induction motor drive show the scheme to be a fast and effective one for fault detection, while the control methods transition smoothly and ensure the effectiveness of the FTC system. The system is also shown to be flexible, reverting rapidly back to the dominant controller if the motor returns to a healthy state.

  18. Sensor and Sensorless Fault Tolerant Control for Induction Motors Using a Wavelet Index

    PubMed Central

    Gaeid, Khalaf Salloum; Ping, Hew Wooi; Khalid, Mustafa; Masaoud, Ammar

    2012-01-01

    Fault Tolerant Control (FTC) systems are crucial in industry to ensure safe and reliable operation, especially of motor drives. This paper proposes the use of multiple controllers for a FTC system of an induction motor drive, selected based on a switching mechanism. The system switches between sensor vector control, sensorless vector control, closed-loop voltage by frequency (V/f) control and open loop V/f control. Vector control offers high performance, while V/f is a simple, low cost strategy with high speed and satisfactory performance. The faults dealt with are speed sensor failures, stator winding open circuits, shorts and minimum voltage faults. In the event of compound faults, a protection unit halts motor operation. The faults are detected using a wavelet index. For the sensorless vector control, a novel Boosted Model Reference Adaptive System (BMRAS) to estimate the motor speed is presented, which reduces tuning time. Both simulation results and experimental results with an induction motor drive show the scheme to be a fast and effective one for fault detection, while the control methods transition smoothly and ensure the effectiveness of the FTC system. The system is also shown to be flexible, reverting rapidly back to the dominant controller if the motor returns to a healthy state. PMID:22666016

  19. Catastrophic Fault Recovery with Self-Reconfigurable Chips

    NASA Technical Reports Server (NTRS)

    Zheng, Will Hua; Marzwell, Neville I.; Chau, Savio N.

    2006-01-01

    Mission critical systems typically employ multi-string redundancy to cope with possible hardware failure. Such systems are only as fault tolerant as there are many redundant strings. Once a particular critical component exhausts its redundant spares, the multi-string architecture cannot tolerate any further hardware failure. This paper aims at addressing such catastrophic faults through the use of 'Self-Reconfigurable Chips' as a last resort effort to 'repair' a faulty critical component.

  20. Wireless Avionics Packet to Support Fault Tolerance for Flight Applications

    NASA Technical Reports Server (NTRS)

    Block, Gary L.; Whitaker, William D.; Dillon, James W.; Lux, James P.; Ahmad, Mohammad

    2009-01-01

    In this protocol and packet format, data traffic is monitored by all network interfaces to determine the health of transmitter and subsystems. When failures are detected, the network inter face applies its recover y policies to provide continued service despite the presence of faults. The protocol, packet format, and inter face are independent of the data link technology used. The current demonstration system supports both commercial off-the-shelf wireless connections and wired Ethernet connections. Other technologies such as 1553 or serial data links can be used for the network backbone. The Wireless Avionics packet is divided into three parts: a header, a data payload, and a checksum. The header has the following components: magic number, version, quality of service, time to live, sending transceiver, function code, payload length, source Application Data Interface (ADI) address, destination ADI address, sending node address, target node address, and a sequence number. The magic number is used to identify WAV packets, and allows the packet format to be updated in the future. The quality of service field allows routing decisions to be made based on this value and can be used to route critical management data over a dedicated channel. The time to live value is used to discard misrouted packets while the source transceiver is updated at each hop. This information is used to monitor the health of each transceiver in the network. To identify the packet type, the function code is used. Besides having a regular data packet, the system supports diagnostic packets for fault detection and isolation. The payload length specifies the number of data bytes in the payload, and this supports variable-length packets in the network. The source ADI is the address of the originating interface. This can be used by the destination application to identify the originating source of the packet where the address consists of a subnet, subsystem class within the subnet, a subsystem unit

  1. The Design of Fault Tolerant Quantum Dot Cellular Automata Based Logic

    NASA Technical Reports Server (NTRS)

    Armstrong, C. Duane; Humphreys, William M.; Fijany, Amir

    2002-01-01

    As transistor geometries are reduced, quantum effects begin to dominate device performance. At some point, transistors cease to have the properties that make them useful computational components. New computing elements must be developed in order to keep pace with Moore s Law. Quantum dot cellular automata (QCA) represent an alternative paradigm to transistor-based logic. QCA architectures that are robust to manufacturing tolerances and defects must be developed. We are developing software that allows the exploration of fault tolerant QCA gate architectures by automating the specification, simulation, analysis and documentation processes.

  2. 2009 fault tolerance for extreme-scale computing workshop, Albuquerque, NM - March 19-20, 2009.

    SciTech Connect

    Katz, D. S.; Daly, J.; DeBardeleben, N.; Elnozahy, M.; Kramer, B.; Lathrop, S.; Nystrom, N.; Milfeld, K.; Sanielevici, S.; Scott, S.; Votta, L.; Louisiana State Univ.; Center for Exceptional Computing; LANL; IBM; Univ. of Illinois; Shodor Foundation; Pittsburgh Supercomputer Center; Texas Advanced Computing Center; ORNL; Sun Microsystems

    2009-02-01

    This is a report on the third in a series of petascale workshops co-sponsored by Blue Waters and TeraGrid to address challenges and opportunities for making effective use of emerging extreme-scale computing. This workshop was held to discuss fault tolerance on large systems for running large, possibly long-running applications. The main point of the workshop was to have systems people, middleware people (including fault-tolerance experts), and applications people talk about the issues and figure out what needs to be done, mostly at the middleware and application levels, to run such applications on the emerging petascale systems, without having faults cause large numbers of application failures. The workshop found that there is considerable interest in fault tolerance, resilience, and reliability of high-performance computing (HPC) systems in general, at all levels of HPC. The only way to recover from faults is through the use of some redundancy, either in space or in time. Redundancy in time, in the form of writing checkpoints to disk and restarting at the most recent checkpoint after a fault that cause an application to crash/halt, is the most common tool used in applications today, but there are questions about how long this can continue to be a good solution as systems and memories grow faster than I/O bandwidth to disk. There is interest in both modifications to this, such as checkpoints to memory, partial checkpoints, and message logging, and alternative ideas, such as in-memory recovery using residues. We believe that systematic exploration of these ideas holds the most promise for the scientific applications community. Fault tolerance has been an issue of discussion in the HPC community for at least the past 10 years; but much like other issues, the community has managed to put off addressing it during this period. There is a growing recognition that as systems continue to grow to petascale and beyond, the field is approaching the point where we don't have

  3. Fault-tolerance techniques for high-speed fiber-optic networks

    NASA Technical Reports Server (NTRS)

    Deruiter, John

    1991-01-01

    Four fiber optic network topologies (linear bus, ring, central star, and distributed star) are discussed relative to their application to high data throughput, fault tolerant networks. The topologies are also examined in terms of redundancy and the need to provide for single point, failure free (or better) system operation. Linear bus topology, although traditionally the method of choice for wire systems, presents implementation problems when larger fiber optic systems are considered. Ring topology works well for high speed systems when coupled with a token passing protocol, but it requires a significant increase in protocol complexity to manage system reconfiguration due to ring and node failures. Star topologies offer a natural fault tolerance, without added protocol complexity, while still providing high data throughput capability.

  4. Bio-inspired WSN architecture: event detection and loacalization in a fault tolerant WSN

    NASA Astrophysics Data System (ADS)

    Alayev, Yosef; Damarla, Thyagaraju

    2009-05-01

    One can think of human body as a sensory network. In particular, skin has several neurons that provide the sense of touch with different sensitivities, and neurons for communicating the sensory signals to the brain. Even though skin might occasionally experience some lacerations, it performs remarkably well (fault tolerant) with the failure of some sensors. One of the challenges in collaborative wireless sensor networks (WSN) is fault tolerant detection and localization of targets. In this paper we present a biologically inspired architecture model for WSN. Diagnosis of sensors in WSN model presented here is derived from the concept of the immune system. We present an architecture for WSN for detection and localization of multiple targets inspired by human nervous system. We show that the advantages of such bio-inspired networks are reduced data for communication, self-diagnosis to detect faulty sensors in real-time and the ability to localize events. We present the results of our algorithms on simulation data.

  5. Communications protocols for a fault tolerant, integrated local area network for Space Station applications

    NASA Technical Reports Server (NTRS)

    Meredith, B. D.

    1984-01-01

    The evolutionary growth of the Space Station and the diverse activities onboard are expected to require a hierarchy of integrated,local area networks capable of supporting data, voice and video communications. In addition, fault tolerant network operation is necessary to protect communications between critical systems attached to the net and to relieve the valuable human resources onboard Space Station of day-to-day data system repair tasks. An experimental, local area network is being developed which will serve as a testbed for investigating candidate algorithms and technologies for a fault tolerant, integrated network. The establishment of a set of rules or protocols which govern communications on the net is essential to obtain orderly and reliable operation. A hierarchy of protocols for the experimental network is presented and procedures for data and control communications are described.

  6. Real-number codes for fault-tolerant matrix operations on processor arrays

    NASA Technical Reports Server (NTRS)

    Nair, V. S. S.; Abraham, Jacob A.

    1990-01-01

    A generalization of existing real number codes is proposed. It is proven that linearity is a necessary and sufficient condition for codes used for fault-tolerant matrix operations such as matrix addition, multiplication, transposition, and LU decomposition. It is also proven that for every linear code defined over a finite field, there exists a corresponding linear real-number code with similar error detecting capabilities. Encoding schemes are given for some of the example codes which fall under the general set of real-number codes. With the help of experiments, a rule is derived for the selection of a particular code for a given application. The performance overhead of fault tolerance schemes using the generalized encoding schemes is shown to be very low, and this is substantiated through simulation experiments.

  7. Reconciling fault-tolerant distributed algorithms and real-time computing.

    PubMed

    Moser, Heinrich; Schmid, Ulrich

    We present generic transformations, which allow to translate classic fault-tolerant distributed algorithms and their correctness proofs into a real-time distributed computing model (and vice versa). Owing to the non-zero-time, non-preemptible state transitions employed in our real-time model, scheduling and queuing effects (which are inherently abstracted away in classic zero step-time models, sometimes leading to overly optimistic time complexity results) can be accurately modeled. Our results thus make fault-tolerant distributed algorithms amenable to a sound real-time analysis, without sacrificing the wealth of algorithms and correctness proofs established in classic distributed computing research. By means of an example, we demonstrate that real-time algorithms generated by transforming classic algorithms can be competitive even w.r.t. optimal real-time algorithms, despite their comparatively simple real-time analysis.

  8. Fault tolerant onboard packet switch architecture for communication satellites: Shared memory per beam approach

    NASA Technical Reports Server (NTRS)

    Shalkhauser, Mary JO; Quintana, Jorge A.; Soni, Nitin J.

    1994-01-01

    The NASA Lewis Research Center is developing a multichannel communication signal processing satellite (MCSPS) system which will provide low data rate, direct to user, commercial communications services. The focus of current space segment developments is a flexible, high-throughput, fault tolerant onboard information switching processor. This information switching processor (ISP) is a destination-directed packet switch which performs both space and time switching to route user information among numerous user ground terminals. Through both industry study contracts and in-house investigations, several packet switching architectures were examined. A contention-free approach, the shared memory per beam architecture, was selected for implementation. The shared memory per beam architecture, fault tolerance insertion, implementation, and demonstration plans are described.

  9. Observer-based fault-tolerant control for a class of nonlinear networked control systems

    NASA Astrophysics Data System (ADS)

    Mahmoud, M. S.; Memon, A. M.; Shi, Peng

    2014-08-01

    This paper presents a fault-tolerant control (FTC) scheme for nonlinear systems which are connected in a networked control system. The nonlinear system is first transformed into two subsystems such that the unobservable part is affected by a fault and the observable part is unaffected. An observer is then designed which gives state estimates using a Luenberger observer and also estimates unknown parameter of the system; this helps in fault estimation. The FTC is applied in the presence of sampling due to the presence of a network in the loop. The controller gain is obtained using linear-quadratic regulator technique. The methodology is applied on a mechatronic system and the results show satisfactory performance.

  10. A hybrid robust fault tolerant control based on adaptive joint unscented Kalman filter.

    PubMed

    Shabbouei Hagh, Yashar; Mohammadi Asl, Reza; Cocquempot, Vincent

    2017-01-01

    In this paper, a new hybrid robust fault tolerant control scheme is proposed. A robust H∞ control law is used in non-faulty situation, while a Non-Singular Terminal Sliding Mode (NTSM) controller is activated as soon as an actuator fault is detected. Since a linear robust controller is designed, the system is first linearized through the feedback linearization method. To switch from one controller to the other, a fuzzy based switching system is used. An Adaptive Joint Unscented Kalman Filter (AJUKF) is used for fault detection and diagnosis. The proposed method is based on the simultaneous estimation of the system states and parameters. In order to show the efficiency of the proposed scheme, a simulated 3-DOF robotic manipulator is used.

  11. A Communication Framework for Fault-Tolerant Parallel Execution

    NASA Astrophysics Data System (ADS)

    Kanna, Nagarajan; Subhlok, Jaspal; Gabriel, Edgar; Rohit, Eshwar; Anderson, David

    PC grids represent massive computation capacity at a low cost, but are challenging to employ for parallel computing because of variable and unpredictable performance and availability. A communicating parallel program must employ checkpoint-restart and/or process redundancy to make continuous forward progress in such an unreliable environment. A communication model based on one-sided Put/Get calls, pioneered by the Linda system, is a good match as processes can execute their communication operations independently and asynchronously. However, Linda and its many variants are not designed for communicating processes that are replicated or independently restarted from checkpoints. The key problem is that a single logical operation that impacts the global program state may be executed by different instances of the same process at different times leading to semantic inconsistency. This paper presents the design, execution model, implementation, and validation of a communication layer for robust execution on volatile nodes. The research leads to a practical way to employ idle PCs for latency tolerant parallel computing applications.

  12. A study of standard building blocks for the design of fault-tolerant distributed computer systems

    NASA Technical Reports Server (NTRS)

    Rennels, D. A.; Avizienis, A.; Ercegovac, M.

    1978-01-01

    This paper presents the results of a study that has established a standard set of four semiconductor VLSI building-block circuits. These circuits can be assembled with off-the-shelf microprocessors and semiconductor memory modules into fault-tolerant distributed computer configurations. The resulting multi-computer architecture uses self-checking computer modules backed up by a limited number of spares. A redundant bus system is employed for communication between computer modules.

  13. Fault Tolerance for Fight-Through: A Basis for Strategic Survival

    DTIC Science & Technology

    2011-11-01

    response , including the time for reviewing instructions, searching existing data sources, gathering and maintaining the data needed, and completing...NOV 2011 2. REPORT TYPE N/A 3. DATES COVERED - 4. TITLE AND SUBTITLE Fault Tolerance for Fight -Through: A Basis for Strategic Survival 5a... RESPONSIBLE PERSON a. REPORT unclassified b. ABSTRACT unclassified c. THIS PAGE unclassified Standard Form 298 (Rev. 8-98) Prescribed by ANSI Std

  14. A Formal Method for Developing Provably Correct Fault-Tolerant Systems Using Partial Refinement and Composition

    DTIC Science & Technology

    2009-01-01

    which is defined in Table 5. Verify the ASW Properties. The property checker Salsa [8] easily verifies that the specification of the ASW’s normal behavior... Salsa to check the fault-tolerant specification FT for all properties listed in Table 4. All but P2 were shown to be invariants. Thus the required...Conf., 2000. 8. R. Bharadwaj and S. Sims. Salsa : Combining constraint solvers with BDDs for automatic invariant checking. In Proc. Tools and Algorithms

  15. A Formal Method for Developing Provably Correct Fault-Tolerant Systems Using Partial Refinement and Composition

    DTIC Science & Technology

    2009-01-01

    constructs the state invariant H1, which is defined in Table 5. Verify the ASW Properties. The property checker Salsa [8] easily verifies that the...as auxiliary invariants, we used Salsa to check the fault-tolerant specification FT for all properties listed in Table 4. All but P2 were shown to be...Proc. 19th Digital Avionics Sys. Conf., 2000. 8. R. Bharadwaj and S. Sims. Salsa : Combining constraint solvers with BDDs for automatic invariant

  16. Exploiting Data-Flow for Fault-Tolerance in a Wide-Area Parallel System

    DTIC Science & Technology

    1996-01-01

    collection of information is estimated to average 1 hour per response, including the time for reviewing instructions, searching existing data sources...gathering and maintaining the data needed, and completing and reviewing the collection of information . Send comments regarding this burden estimate or any...Exploiting Data-Flow for Fault-Tolerance in a Wide-Area Parallel System ’ Anh Nguyen-Tuong, Andrew S. Grimshaw and Mark Hyett University of Virginia

  17. A WSI approach towards defect/fault-tolerant reconfigurable serial systems

    NASA Astrophysics Data System (ADS)

    Chen, Wei; Mavor, John; Denyer, Peter B.; Renshaw, David

    1988-06-01

    A superchip for realizing ultra-large-scale integrated (ULSI) systems based on a wafer-scale integrated (WSI) circuit concept, which incorporates defect/fault tolerance and system reconfiguration, is introduced. The key features of the central architectural component, a large crossbar switch matrix, are described. A prototype has been fabricated in silicon technology. Hypothetical processor examples demonstrate the power of the superchip approach, and design/performance figures are discussed.

  18. Design Trade-off Between Performance and Fault-Tolerance of Space Onboard Computers

    NASA Astrophysics Data System (ADS)

    Gorbunov, M. S.; Antonov, A. A.

    2017-01-01

    It is well known that there is a trade-off between performance and power consumption in onboard computers. The fault-tolerance is another important factor affecting performance, chip area and power consumption. Involving special SRAM cells and error-correcting codes is often too expensive with relation to the performance needed. We discuss the possibility of finding the optimal solutions for modern onboard computer for scientific apparatus focusing on multi-level cache memory design.

  19. ALLIANCE: An architecture for fault tolerant, cooperative control of heterogeneous mobile robots

    SciTech Connect

    Parker, L.E.

    1995-02-01

    This research addresses the problem of achieving fault tolerant cooperation within small- to medium-sized teams of heterogeneous mobile robots. The author describes a novel behavior-based, fully distributed architecture, called ALLIANCE, that utilizes adaptive action selection to achieve fault tolerant cooperative control in robot missions involving loosely coupled, largely independent tasks. The robots in this architecture possess a variety of high-level functions that they can perform during a mission, and must at all times select an appropriate action based on the requirements of the mission, the activities of other robots, the current environmental conditions, and their own internal states. Since such cooperative teams often work in dynamic and unpredictable environments, the software architecture allows the team members to respond robustly and reliably to unexpected environmental changes and modifications in the robot team that may occur due to mechanical failure, the learning of new skills, or the addition or removal of robots from the team by human intervention. After presenting ALLIANCE, the author describes in detail experimental results of an implementation of this architecture on a team of physical mobile robots performing a cooperative box pushing demonstration. These experiments illustrate the ability of ALLIANCE to achieve adaptive, fault-tolerant cooperative control amidst dynamic changes in the capabilities of the robot team.

  20. Design and experimental validation for direct-drive fault-tolerant permanent-magnet vernier machines.

    PubMed

    Liu, Guohai; Yang, Junqin; Chen, Ming; Chen, Qian

    2014-01-01

    A fault-tolerant permanent-magnet vernier (FT-PMV) machine is designed for direct-drive applications, incorporating the merits of high torque density and high reliability. Based on the so-called magnetic gearing effect, PMV machines have the ability of high torque density by introducing the flux-modulation poles (FMPs). This paper investigates the fault-tolerant characteristic of PMV machines and provides a design method, which is able to not only meet the fault-tolerant requirements but also keep the ability of high torque density. The operation principle of the proposed machine has been analyzed. The design process and optimization are presented specifically, such as the combination of slots and poles, the winding distribution, and the dimensions of PMs and teeth. By using the time-stepping finite element method (TS-FEM), the machine performances are evaluated. Finally, the FT-PMV machine is manufactured, and the experimental results are presented to validate the theoretical analysis.

  1. Design and Experimental Validation for Direct-Drive Fault-Tolerant Permanent-Magnet Vernier Machines

    PubMed Central

    Liu, Guohai; Yang, Junqin; Chen, Ming; Chen, Qian

    2014-01-01

    A fault-tolerant permanent-magnet vernier (FT-PMV) machine is designed for direct-drive applications, incorporating the merits of high torque density and high reliability. Based on the so-called magnetic gearing effect, PMV machines have the ability of high torque density by introducing the flux-modulation poles (FMPs). This paper investigates the fault-tolerant characteristic of PMV machines and provides a design method, which is able to not only meet the fault-tolerant requirements but also keep the ability of high torque density. The operation principle of the proposed machine has been analyzed. The design process and optimization are presented specifically, such as the combination of slots and poles, the winding distribution, and the dimensions of PMs and teeth. By using the time-stepping finite element method (TS-FEM), the machine performances are evaluated. Finally, the FT-PMV machine is manufactured, and the experimental results are presented to validate the theoretical analysis. PMID:25045729

  2. Experimental fault tolerant universal quantum gates with solid-state spins under ambient conditions

    NASA Astrophysics Data System (ADS)

    Rong, Xing

    Quantum computation provides great speedup over classical counterpart for certain problems, such as quantum simulations, prime factoring and database searching. One of the challenges for realizing quantum computation is to execute precise control of the quantum system in the presence of noise. Recently, high fidelity control of spin-qubits has been achieved in several quantum systems. However, control of the spin-qubits with the accuracy required by the fault tolerant quantum computation under ambient conditions remains exclusive. Here we demonstrate a universal set of logic gates in nitrogen-vacancy centers with an average single-qubit gate fidelity of 0.99995 and two qubit gate fidelity of 0.992. These high control fidelities have been achieved in the C naturally abundant diamonds at room temperature via composite pulses and optimal control method. This experimental implementation of quantum gates with fault tolerant control fidelity sets an important step towards the fault-tolerant quantum computation under ambient conditions. National Key Basic Research Program of China (Grant No. 2013CB921800).

  3. Fault Tolerant Attitude Control for spacecraft with SGCMGs under actuator partial failure and actuator saturation

    NASA Astrophysics Data System (ADS)

    Zhang, Fuzhen; Jin, Lei; Xu, Shijie

    2017-03-01

    A Fault Tolerant Attitude Control algorithm for the spacecraft using Single Gimbal Control Moment Gyros (SGCMGs) as actuator is proposed. The controller is designed using the sliding mode control theory to control the gimbal rate directly and there is no singular point in the control algorithm, which means that we don't need to design the steering laws again and the singularity problems can be avoided. Also the gimbal rate saturation is considered when designing the control method. The adaptive control algorithm is used to estimate the disturbance and the boundary of the fault and saturation, which means that no prior information of the fault is needed. Although the controller is designed based on the SGCMGs, it can also be employed when reaction wheels work as the actuator of the spacecraft. Also the complete failure of several SGCMGs is allowed. It is proved based on the Lyapunov stability theorem that the designed control algorithm can achieve the attitude asymptotic stability both on the fault or fault-free condition. The simulation results show that the proposed method has a strong robustness.

  4. Trust index based fault tolerant multiple event localization algorithm for WSNs.

    PubMed

    Xu, Xianghua; Gao, Xueyong; Wan, Jian; Xiong, Naixue

    2011-01-01

    This paper investigates the use of wireless sensor networks for multiple event source localization using binary information from the sensor nodes. The events could continually emit signals whose strength is attenuated inversely proportional to the distance from the source. In this context, faults occur due to various reasons and are manifested when a node reports a wrong decision. In order to reduce the impact of node faults on the accuracy of multiple event localization, we introduce a trust index model to evaluate the fidelity of information which the nodes report and use in the event detection process, and propose the Trust Index based Subtract on Negative Add on Positive (TISNAP) localization algorithm, which reduces the impact of faulty nodes on the event localization by decreasing their trust index, to improve the accuracy of event localization and performance of fault tolerance for multiple event source localization. The algorithm includes three phases: first, the sink identifies the cluster nodes to determine the number of events occurred in the entire region by analyzing the binary data reported by all nodes; then, it constructs the likelihood matrix related to the cluster nodes and estimates the location of all events according to the alarmed status and trust index of the nodes around the cluster nodes. Finally, the sink updates the trust index of all nodes according to the fidelity of their information in the previous reporting cycle. The algorithm improves the accuracy of localization and performance of fault tolerance in multiple event source localization. The experiment results show that when the probability of node fault is close to 50%, the algorithm can still accurately determine the number of the events and have better accuracy of localization compared with other algorithms.

  5. Fault diagnosis and fault-tolerant finite control set-model predictive control of a multiphase voltage-source inverter supplying BLDC motor.

    PubMed

    Salehifar, Mehdi; Moreno-Equilaz, Manuel

    2016-01-01

    Due to its fault tolerance, a multiphase brushless direct current (BLDC) motor can meet high reliability demand for application in electric vehicles. The voltage-source inverter (VSI) supplying the motor is subjected to open circuit faults. Therefore, it is necessary to design a fault-tolerant (FT) control algorithm with an embedded fault diagnosis (FD) block. In this paper, finite control set-model predictive control (FCS-MPC) is developed to implement the fault-tolerant control algorithm of a five-phase BLDC motor. The developed control method is fast, simple, and flexible. A FD method based on available information from the control block is proposed; this method is simple, robust to common transients in motor and able to localize multiple open circuit faults. The proposed FD and FT control algorithm are embedded in a five-phase BLDC motor drive. In order to validate the theory presented, simulation and experimental results are conducted on a five-phase two-level VSI supplying a five-phase BLDC motor.

  6. Design and verification of a multiple fault tolerant control system for STS applications using computer simulation

    NASA Technical Reports Server (NTRS)

    Szatkowski, G. P.; Karas, J. C.

    1981-01-01

    General Dynamics/Convair is under NASA contract to integrate the Centaur upper stage into the space transportation system for future planetary missions. This requires that control of all safety critical functions be two-failure tolerant. The control system developed consists of five asynchronous computers, each contributing at their outputs to a 3-out-of-5 voting plane. Subsystem control is based on an end function redundancy management scheme. Analysis of multiple component failures and worst-case time-phase asynchrony among the computers is performed by a real-time computer simulation. The simulation emulates the hardware and subsystem interfaces, wire by wire, providing assessibility to any component for the insertion of preprogrammed failures. Observability is provided via a graphics system and diagnostic software. The simulation provides an engineering tool where the integrity of control system hardware and imbedded software can be demonstrated.

  7. Fault tolerant synchronization of chaotic heavy symmetric gyroscope systems versus external disturbances via Lyapunov rule-based fuzzy control.

    PubMed

    Farivar, Faezeh; Shoorehdeli, Mahdi Aliyari

    2012-01-01

    In this paper, fault tolerant synchronization of chaotic gyroscope systems versus external disturbances via Lyapunov rule-based fuzzy control is investigated. Taking the general nature of faults in the slave system into account, a new synchronization scheme, namely, fault tolerant synchronization, is proposed, by which the synchronization can be achieved no matter whether the faults and disturbances occur or not. By making use of a slave observer and a Lyapunov rule-based fuzzy control, fault tolerant synchronization can be achieved. Two techniques are considered as control methods: classic Lyapunov-based control and Lyapunov rule-based fuzzy control. On the basis of Lyapunov stability theory and fuzzy rules, the nonlinear controller and some generic sufficient conditions for global asymptotic synchronization are obtained. The fuzzy rules are directly constructed subject to a common Lyapunov function such that the error dynamics of two identical chaotic motions of symmetric gyros satisfy stability in the Lyapunov sense. Two proposed methods are compared. The Lyapunov rule-based fuzzy control can compensate for the actuator faults and disturbances occurring in the slave system. Numerical simulation results demonstrate the validity and feasibility of the proposed method for fault tolerant synchronization.

  8. Energy dissipation and error probability in fault-tolerant binary switching

    PubMed Central

    Fashami, Mohammad Salehi; Atulasimha, Jayasimha; Bandyopadhyay, Supriyo

    2013-01-01

    The potential energy profile of an ideal binary switch is a symmetric double well. Switching between the wells without energy dissipation requires time-modulating the height of the potential barrier separating the wells and tilting the profile towards the desired well at the precise juncture when the barrier disappears. This, however, demands perfect timing synchronization and is therefore fault-intolerant even in the absence of noise. A fault-tolerant strategy that requires no time modulation of the barrier (and hence no timing synchronization) switches by tilting the profile by an amount at least equal to the barrier height and dissipates at least that amount of energy in abrupt switching. Here, we present a third strategy that requires a time modulated barrier but no timing synchronization. It is therefore fault-tolerant, error-free in the absence of thermal noise, and yet it dissipates arbitrarily small energy in a noise-free environment since an arbitrarily small tilt is required for slow switching. This case is exemplified with stress induced switching of a shape-anisotropic single-domain soft nanomagnet dipole-coupled to a hard magnet. When thermal noise is present, we show analytically that the minimum energy dissipated to switch in this scheme is ~2kTln(1/p) [p = switching error probability]. PMID:24220310

  9. Scalable Replay with Partial-Order Dependencies for Message-Logging Fault Tolerance

    SciTech Connect

    Lifflander, Jonathan; Meneses, Esteban; Menon, Harshita; Miller, Phil; Krishnamoorthy, Sriram; Kale, Laxmikant

    2014-09-22

    Deterministic replay of a parallel application is commonly used for discovering bugs or to recover from a hard fault with message-logging fault tolerance. For message passing programs, a major source of overhead during forward execution is recording the order in which messages are sent and received. During replay, this ordering must be used to deterministically reproduce the execution. Previous work in replay algorithms often makes minimal assumptions about the programming model and application in order to maintain generality. However, in many cases, only a partial order must be recorded due to determinism intrinsic in the code, ordering constraints imposed by the execution model, and events that are commutative (their relative execution order during replay does not need to be reproduced exactly). In this paper, we present a novel algebraic framework for reasoning about the minimum dependencies required to represent the partial order for different concurrent orderings and interleavings. By exploiting this theory, we improve on an existing scalable message-logging fault tolerance scheme. The improved scheme scales to 131,072 cores on an IBM BlueGene/P with up to 2x lower overhead than one that records a total order.

  10. A Self-Stabilizing Byzantine-Fault-Tolerant Clock Synchronization Protocol

    NASA Technical Reports Server (NTRS)

    Malekpour, Mahyar R.

    2009-01-01

    This report presents a rapid Byzantine-fault-tolerant self-stabilizing clock synchronization protocol that is independent of application-specific requirements. It is focused on clock synchronization of a system in the presence of Byzantine faults after the cause of any transient faults has dissipated. A model of this protocol is mechanically verified using the Symbolic Model Verifier (SMV) [SMV] where the entire state space is examined and proven to self-stabilize in the presence of one arbitrary faulty node. Instances of the protocol are proven to tolerate bursts of transient failures and deterministically converge with a linear convergence time with respect to the synchronization period. This protocol does not rely on assumptions about the initial state of the system other than the presence of sufficient number of good nodes. All timing measures of variables are based on the node s local clock, and no central clock or externally generated pulse is used. The Byzantine faulty behavior modeled here is a node with arbitrarily malicious behavior that is allowed to influence other nodes at every clock tick. The only constraint is that the interactions are restricted to defined interfaces.

  11. Investigation of the applicability of a functional programming model to fault-tolerant parallel processing for knowledge-based systems

    NASA Technical Reports Server (NTRS)

    Harper, Richard

    1989-01-01

    In a fault-tolerant parallel computer, a functional programming model can facilitate distributed checkpointing, error recovery, load balancing, and graceful degradation. Such a model has been implemented on the Draper Fault-Tolerant Parallel Processor (FTPP). When used in conjunction with the FTPP's fault detection and masking capabilities, this implementation results in a graceful degradation of system performance after faults. Three graceful degradation algorithms have been implemented and are presented. A user interface has been implemented which requires minimal cognitive overhead by the application programmer, masking such complexities as the system's redundancy, distributed nature, variable complement of processing resources, load balancing, fault occurrence and recovery. This user interface is described and its use demonstrated. The applicability of the functional programming style to the Activation Framework, a paradigm for intelligent systems, is then briefly described.

  12. A Performance Prediction Model for a Fault-Tolerant Computer During Recovery and Restoration

    NASA Technical Reports Server (NTRS)

    Obando, Rodrigo A.; Stoughton, John W.

    1995-01-01

    The modeling and design of a fault-tolerant multiprocessor system is addressed. Of interest is the behavior of the system during recovery and restoration after a fault has occurred. The multiprocessor systems are based on the Algorithm to Architecture Mapping Model (ATAMM) and the fault considered is the death of a processor. The developed model is useful in the determination of performance bounds of the system during recovery and restoration. The performance bounds include time to recover from the fault, time to restore the system, and determination of any permanent delay in the input to output latency after the system has regained steady state. Implementation of an ATAMM based computer was developed for a four-processor generic VHSIC spaceborne computer (GVSC) as the target system. A simulation of the GVSC was also written on the code used in the ATAMM Multicomputer Operating System (AMOS). The simulation is used to verify the new model for tracking the propagation of the delay through the system and predicting the behavior of the transient state of recovery and restoration. The model is shown to accurately predict the transient behavior of an ATAMM based multicomputer during recovery and restoration.

  13. Fault tolerant multi-sensor fusion based on the information gain

    NASA Astrophysics Data System (ADS)

    Hage, Joelle Al; El Najjar, Maan E.; Pomorski, Denis

    2017-01-01

    In the last decade, multi-robot systems are used in several applications like for example, the army, the intervention areas presenting danger to human life, the management of natural disasters, the environmental monitoring, exploration and agriculture. The integrity of localization of the robots must be ensured in order to achieve their mission in the best conditions. Robots are equipped with proprioceptive (encoders, gyroscope) and exteroceptive sensors (Kinect). However, these sensors could be affected by various faults types that can be assimilated to erroneous measurements, bias, outliers, drifts,… In absence of a sensor fault diagnosis step, the integrity and the continuity of the localization are affected. In this work, we present a muti-sensors fusion approach with Fault Detection and Exclusion (FDE) based on the information theory. In this context, we are interested by the information gain given by an observation which may be relevant when dealing with the fault tolerance aspect. Moreover, threshold optimization based on the quantity of information given by a decision on the true hypothesis is highlighted.

  14. The genotypic complexity of evolved fault-tolerant and noise-robust circuits.

    PubMed

    Hartmann, Morten; Haddow, Pauline C; Lehre, Per Kristian

    2007-02-01

    Noise and component failure is an increasingly difficult problem in modern electronic design. Bio-inspired techniques is one approach that is applied in an effort to solve such issues, motivated by the strong robustness and adaptivity often observed in nature. Circuits investigated herein are designed to be tolerant to faults or robust to noise, using an evolutionary algorithm. A major challenge is to improve the scalability of the approach. Earlier results have indicated that the evolved circuits may be suited for the application of artificial development, an approach to indirect mapping from genotype to phenotype that may improve scalability. Those observations were based on the genotypic complexity of evolved circuits. Herein, we measure the genotypic complexity of circuits evolved for tolerance to faults or noise, in order to uncover how that tolerance affects the complexity of the circuits. The complexity is analysed and discussed with regards to how it relates to the potential benefits to the evolutionary process of introducing an indirect genotype-phenotype mapping such as artificial development.

  15. Summary: Experimental validation of real-time fault-tolerant systems

    NASA Technical Reports Server (NTRS)

    Iyer, R. K.; Choi, G. S.

    1992-01-01

    Testing and validation of real-time systems is always difficult to perform since neither the error generation process nor the fault propagation problem is easy to comprehend. There is no better substitute to results based on actual measurements and experimentation. Such results are essential for developing a rational basis for evaluation and validation of real-time systems. However, with physical experimentation, controllability and observability are limited to external instrumentation that can be hooked-up to the system under test. And this process is quite a difficult, if not impossible, task for a complex system. Also, to set up such experiments for measurements, physical hardware must exist. On the other hand, a simulation approach allows flexibility that is unequaled by any other existing method for system evaluation. A simulation methodology for system evaluation was successfully developed and implemented and the environment was demonstrated using existing real-time avionic systems. The research was oriented toward evaluating the impact of permanent and transient faults in aircraft control computers. Results were obtained for the Bendix BDX 930 system and Hamilton Standard EEC131 jet engine controller. The studies showed that simulated fault injection is valuable, in the design stage, to evaluate the susceptibility of computing sytems to different types of failures.

  16. Diagnosing a Failed Proof in Fault-Tolerance: A Disproving Challenge Problem

    NASA Technical Reports Server (NTRS)

    Pike, Lee; Miner, Paul; Torres-Pomales, Wilfredo

    2006-01-01

    This paper proposes a challenge problem in disproving. We describe a fault-tolerant distributed protocol designed at NASA for use in a fly-by-wire system for next-generation commercial aircraft. An early design of the protocol contains a subtle bug that is highly unlikely to be caught in fault injection testing. We describe a failed proof of the protocol's correctness in a mechanical theorem prover (PVS) with a complex unfinished proof conjecture. We use a model checking suite (SAL) to generate a concrete counterexample to the unproven conjecture to demonstrate the existence of a bug. However, we argue that the effort required in our approach is too high and propose what conditions a better solution would satisfy. We carefully describe the protocol and bug to provide a challenging but feasible case study for disproving research.

  17. Fault tolerance techniques to assure data integrity in high-volume PACS image archives

    NASA Astrophysics Data System (ADS)

    He, Yutao; Huang, Lu J.; Valentino, Daniel J.; Wingate, W. Keith; Avizienis, Algirdas

    1995-05-01

    Picture archiving and communication systems (PACS) perform the systematic acquisition, archiving, and presentation of large quantities of radiological image and text data. In the UCLA Radiology PACS, for example, the volume of image data archived currently exceeds 2500 gigabytes. Furthermore, the distributed heterogeneous PACS is expected to have near real-time response, be continuously available, and assure the integrity and privacy of patient data. The off-the-shelf subsystems that compose the current PACS cannot meet these expectations; therefore fault tolerance techniques had to be incorporated into the system. This paper is to report our first-step efforts towards the goal and is organized as follows: First we discuss data integrity and identify fault classes under the PACS operational environment, then we describe auditing and accounting schemes developed for error-detection and analyze operational data collected. Finally, we outline plans for future research.

  18. Design of a fault tolerant airborne digital computer. Volume 2: Computational requirements and technology

    NASA Technical Reports Server (NTRS)

    Ratner, R. S.; Shapiro, E. B.; Zeidler, H. M.; Wahlstrom, S. E.; Clark, C. B.; Goldberg, J.

    1973-01-01

    This final report summarizes the work on the design of a fault tolerant digital computer for aircraft. Volume 2 is composed of two parts. Part 1 is concerned with the computational requirements associated with an advanced commercial aircraft. Part 2 reviews the technology that will be available for the implementation of the computer in the 1975-1985 period. With regard to the computation task 26 computations have been categorized according to computational load, memory requirements, criticality, permitted down-time, and the need to save data in order to effect a roll-back. The technology part stresses the impact of large scale integration (LSI) on the realization of logic and memory. Also considered was module interconnection possibilities so as to minimize fault propagation.

  19. A pattern-recognition-based, fault-tolerant monitoring and diagnostic technique

    SciTech Connect

    Singer, R.M.; Gross, K.C.; King, R.W.

    1995-06-01

    A properly designed monitoring and diagnostic system must be capable of detecting and distinguishing sensor and process malfunctions in the presence of signal noise, varying process states and multiple faults. The technique presented in this paper addresses these objectives through the implementation of a multivariate state estimation algorithm based upon pattern recognition methodology coupled with a statistically-based hypothesis test. Utilizing a residual signal vector generated from the difference between the estimated and measured current states of a process, disturbances are detected and identified with statistical hypothesis testing. Since the hypothesis testing utilizes the inherent noise on the signals to obtain a conclusion and the state estimation algorithm requires only a majority of the sensors to be functioning to ascertain the current state, this technique has proven to be quite robust and fault-tolerant. Several examples of its application are presented.

  20. On the design of fault-tolerant two-dimensional systolic arrays for yield enhancement

    SciTech Connect

    Kim, J.H.; Reddy, S.M.

    1989-04-01

    The continuing growth of interest in systolic arrays poses the problem of ensuring an acceptable yield. In this paper, the authors propose a unified approach to the design of fault-tolerant systolic arrays incorporating design for testability, a testing scheme, a reconfiguration algorithm, time complexity analysis of the proposed reconfiguration algorithm, and yield analysis. A main feature of the proposed designs is that multiple PE's in a 2-D array can be tested simultaneously, thus reducing the testing time significantly. Another feature is that with introduction of delay registers, the proposed reconfiguration algorithm reconfigures a faulty 2-D systolic array into a fault-free array without reducing throughput. The overall aim of this paper is to provide a design for a 2-D systolic array that produces high yield in VLSI/WSI implementations.

  1. A multi-layer robust adaptive fault tolerant control system for high performance aircraft

    NASA Astrophysics Data System (ADS)

    Huo, Ying

    Modern high-performance aircraft demand advanced fault-tolerant flight control strategies. Not only the control effector failures, but the aerodynamic type failures like wing-body damages often result in substantially deteriorate performance because of low available redundancy. As a result the remaining control actuators may yield substantially lower maneuvering capabilities which do not authorize the accomplishment of the air-craft's original specified mission. The problem is to solve the control reconfiguration on available control redundancies when the mission modification is urged to save the aircraft. The proposed robust adaptive fault-tolerant control (RAFTC) system consists of a multi-layer reconfigurable flight controller architecture. It contains three layers accounting for different types and levels of failures including sensor, actuator, and fuselage damages. In case of the nominal operation with possible minor failure(s) a standard adaptive controller stands to achieve the control allocation. This is referred to as the first layer, the controller layer. The performance adjustment is accounted for in the second layer, the reference layer, whose role is to adjust the reference model in the controller design with a degraded transit performance. The upmost mission adjust is in the third layer, the mission layer, when the original mission is not feasible with greatly restricted control capabilities. The modified mission is achieved through the optimization of the command signal which guarantees the boundedness of the closed-loop signals. The main distinguishing feature of this layer is the the mission decision property based on the current available resources. The contribution of the research is the multi-layer fault-tolerant architecture that can address the complete failure scenarios and their accommodations in realities. Moreover, the emphasis is on the mission design capabilities which may guarantee the stability of the aircraft with restricted post

  2. High Speed Operation and Testing of a Fault Tolerant Magnetic Bearing

    NASA Technical Reports Server (NTRS)

    DeWitt, Kenneth; Clark, Daniel

    2004-01-01

    Research activities undertaken to upgrade the fault-tolerant facility, continue testing high-speed fault-tolerant operation, and assist in the commission of the high temperature (1000 degrees F) thrust magnetic bearing as described. The fault-tolerant magnetic bearing test facility was upgraded to operate to 40,000 RPM. The necessary upgrades included new state-of-the art position sensors with high frequency modulation and new power edge filtering of amplifier outputs. A comparison study of the new sensors and the previous system was done as well as a noise assessment of the sensor-to-controller signals. Also a comparison study of power edge filtering for amplifier-to-actuator signals was done; this information is valuable for all position sensing and motor actuation applications. After these facility upgrades were completed, the rig is believed to have capabilities for 40,000 RPM operation, though this has yet to be demonstrated. Other upgrades included verification and upgrading of safety shielding, and upgrading control algorithms. The rig will now also be used to demonstrate motoring capabilities and control algorithms are in the process of being created. Recently an extreme temperature thrust magnetic bearing was designed from the ground up. The thrust bearing was designed to fit within the existing high temperature facility. The retrofit began near the end of the summer, 04, and continues currently. Contract staff authored a NASA-TM entitled "An Overview of Magnetic Bearing Technology for Gas Turbine Engines", containing a compilation of bearing data as it pertains to operation in the regime of the gas turbine engine and a presentation of how magnetic bearings can become a viable candidate for use in future engine technology.

  3. Fault-tolerant analysis and control of SSRMS-type manipulators with single-joint failure

    NASA Astrophysics Data System (ADS)

    She, Yu; Xu, Wenfu; Su, Haijun; Liang, Bin; Shi, Hongliang

    2016-03-01

    Several space manipulators, whose configurations are similar to that of the Space Station Remote Manipulator System (SSRMS, also called Canadarm2), are playing important roles in the construction and maintenance of the International Space Station. Working in the harsh orbital environment, they are at high risk of single-joint failure. Fault-tolerant capability is critical for those manipulators to complete their on-orbital tasks. In this paper, we analysed and compared the manipulation capability of SSRMS-type manipulators with joints locked at arbitrary positions, and proposed efficient path planning via a fault-tolerant control method. First, a unified kinematic model of this type of manipulators was established. Second, the manipulation capability of the original 7-DOF (degrees of freedom) redundant manipulator was analysed and compared with its degraded 6-DOF counterparts formed by different joint locking configurations. Then, we identified those joints with large sensitivity to fault tolerance performance. The influences of different positions of all joints were also determined by numerical computation. Based on the analysis, the relatively safe and dangerous regions for each joint failure were identified. Finally, we proposed a path planning strategy and realized by a H∞ controller which enables the failure joint locked in the safe region, and simulations were carried on a degraded 3-DOF planar redundant manipulator to verify the planning strategy and control approach. This paper provided important analysis results and efficient methods to address the possible problems of SSRMS-type manipulators caused by single-joint failure that can be extended to other types of manipulators. Moreover, the proposed method is useful for designing the optimal configuration of a redundant manipulator.

  4. Plan for the Characterization of HIRF Effects on a Fault-Tolerant Computer Communication System

    NASA Technical Reports Server (NTRS)

    Torres-Pomales, Wilfredo; Malekpour, Mahyar R.; Miner, Paul S.; Koppen, Sandra V.

    2008-01-01

    This report presents the plan for the characterization of the effects of high intensity radiated fields on a prototype implementation of a fault-tolerant data communication system. Various configurations of the communication system will be tested. The prototype system is implemented using off-the-shelf devices. The system will be tested in a closed-loop configuration with extensive real-time monitoring. This test is intended to generate data suitable for the design of avionics health management systems, as well as redundancy management mechanisms and policies for robust distributed processing architectures.

  5. SIFT - A preliminary evaluation. [Software Implemented Fault Tolerant computer for aircraft control

    NASA Technical Reports Server (NTRS)

    Palumbo, D. L.; Butler, R. W.

    1983-01-01

    This paper presents the results of a performance evaluation of the SIFT computer system conducted in the NASA AIRLAB facility. The essential system functions are described and compared to both earlier design proposals and subsequent design improvements. The functions supporting fault tolerance are found to consume significant computing resources. With SIFT's specimen task load, scheduled at a 30-Hz rate, the executive tasks such as reconfiguration, clock synchronization and interactive consistency, require 55 percent of the available task slots. Other system overhead (e.g., voting and scheduling) use an average of 50 percent of each remaining task slot.

  6. Fault-tolerant Remote Quantum Entanglement Establishment for Secure Quantum Communications

    NASA Astrophysics Data System (ADS)

    Tsai, Chia-Wei; Lin, Jason

    2016-07-01

    This work presents a strategy for constructing long-distance quantum communications among a number of remote users through collective-noise channel. With the assistance of semi-honest quantum certificate authorities (QCAs), the remote users can share a secret key through fault-tolerant entanglement swapping. The proposed protocol is feasible for large-scale distributed quantum networks with numerous users. Each pair of communicating parties only needs to establish the quantum channels and the classical authenticated channels with his/her local QCA. Thus, it enables any user to communicate freely without point-to-point pre-establishing any communication channels, which is efficient and feasible for practical environments.

  7. The Development of Design Tools for Fault Tolerant Quantum Dot Cellular Automata Based Logic

    NASA Technical Reports Server (NTRS)

    Armstrong, Curtis D.; Humphreys, William M.

    2003-01-01

    We are developing software to explore the fault tolerance of quantum dot cellular automata gate architectures in the presence of manufacturing variations and device defects. The Topology Optimization Methodology using Applied Statistics (TOMAS) framework extends the capabilities of the A Quantum Interconnected Network Array Simulator (AQUINAS) by adding front-end and back-end software and creating an environment that integrates all of these components. The front-end tools establish all simulation parameters, configure the simulation system, automate the Monte Carlo generation of simulation files, and execute the simulation of these files. The back-end tools perform automated data parsing, statistical analysis and report generation.

  8. Reliability calculation using randomization for Markovian fault-tolerant computing systems

    NASA Technical Reports Server (NTRS)

    Miller, D. R.

    1982-01-01

    The randomization technique for computing transient probabilities of Markov processes is presented. The technique is applied to a Markov process model of a simplified fault tolerant computer system for illustrative purposes. It is applicable to much larger and more complex models. Transient state probabilities are computed, from which reliabilities are derived. An accelerated version of the randomization algorithm is developed which exploits ''stiffness' of the models to gain increased efficiency. A great advantage of the randomization approach is that it easily allows probabilities and reliabilities to be computed to any predetermined accuracy.

  9. Lithium Ion Battery (LIB) Charger: Spacesuit Battery Charger Design with 2-Fault Tolerance to Catastrophic Hazards

    NASA Technical Reports Server (NTRS)

    Darcy, Eric; Davies, Frank

    2009-01-01

    Charger design that is 2-fault tolerant to catastrophic has been achieved for the Spacesuit Li-ion Battery with key features. Power supply control circuit and 2 microprocessors independently control against overcharge. 3 microprocessor control against undercharge (false positive: Go for EVA) conditions. 2 independent channels provide functional redundancy. Capable of charge balancing cell banks in series. Cell manufacturing and performance uniformity is excellent with both designs. Once a few outliers are removed, LV cells are slightly more uniform than MoliJ cells. If cell balance feature of charger is ever invoked, it will be an indication of a significant degradation issue, not a nominal condition.

  10. A simple executive for a fault-tolerant, real-time multiprocessor.

    NASA Technical Reports Server (NTRS)

    Filene, R. J.; Green, A. I.

    1971-01-01

    Description of a simple executive for operation with a fault-tolerant multiprocessor that is oriented toward application in an environment where the primary function is to provide real-time control. The primary executive function is to accept requests for jobs placed by other jobs or from peripheral equipment and then schedule their initiation in accordance with the request parameters. The executive is also brought into action when a processor fails, so that appropriate disposition may be made of the job that was running on the failed processor. Many architectural features intended to support this executive concept are included.

  11. Safety Verification of a Fault Tolerant Reconfigurable Autonomous Goal-Based Robotic Control System

    NASA Technical Reports Server (NTRS)

    Braman, Julia M. B.; Murray, Richard M; Wagner, David A.

    2007-01-01

    Fault tolerance and safety verification of control systems are essential for the success of autonomous robotic systems. A control architecture called Mission Data System (MDS), developed at the Jet Propulsion Laboratory, takes a goal-based control approach. In this paper, a method for converting goal network control programs into linear hybrid systems is developed. The linear hybrid system can then be verified for safety in the presence of failures using existing symbolic model checkers. An example task is simulated in MDS and successfully verified using HyTech, a symbolic model checking software for linear hybrid systems.

  12. Results of investigations of Ethernet network fault-tolerance parameters by using additional analysis subsystem

    NASA Astrophysics Data System (ADS)

    Sultanov, Albert H.; Gayfulin, Renat R.; Vinogradova, Irina L.

    2008-04-01

    Fiber optic telecommunication systems with duplex data transmitting over single fiber require reflection minimization. Moreover reflections may be so high that causes system deactivating by misoperation of conventional alarm, and system can not automatically adjudge the collision, so operator manual control is required. In this paper we proposed technical solution of mentioned problem based on additional analysis subsystem, realized on the installed Ufa-city fiber optic CTV system "Crystal". Experience of it's maintenance and results of investigations of the fault tolerance parameters are represented

  13. Advanced reliability modeling of fault-tolerant computer-based systems

    NASA Technical Reports Server (NTRS)

    Bavuso, S. J.

    1982-01-01

    Two methodologies for the reliability assessment of fault tolerant digital computer based systems are discussed. The computer-aided reliability estimation 3 (CARE 3) and gate logic software simulation (GLOSS) are assessment technologies that were developed to mitigate a serious weakness in the design and evaluation process of ultrareliable digital systems. The weak link is based on the unavailability of a sufficiently powerful modeling technique for comparing the stochastic attributes of one system against others. Some of the more interesting attributes are reliability, system survival, safety, and mission success.

  14. Fault-tolerant quantum random-number generator certified by Majorana fermions

    NASA Astrophysics Data System (ADS)

    Deng, Dong-Ling; Duan, Lu-Ming

    2013-07-01

    Braiding of Majorana fermions gives accurate topological quantum operations that are intrinsically robust to noise and imperfection, providing a natural method to realize fault-tolerant quantum information processing. Unfortunately, it is known that braiding of Majorana fermions is not sufficient for the implementation of universal quantum computation. Here we show that topological manipulation of Majorana fermions provides the full set of operations required to generate random numbers by way of quantum mechanics and to certify its genuine randomness through violation of a multipartite Bell inequality. The result opens a perspective to apply Majorana fermions for the robust generation of certified random numbers, which has important applications in cryptography and other related areas.

  15. Fault Tolerant Magnetic Bearing Testing and Conical Magnetic Bearing Development for Extreme Temperature Environments

    NASA Technical Reports Server (NTRS)

    Keith, Theo G., Jr.; Clark, Daniel

    2004-01-01

    During the six month tenure of the grant, activities included continued research of hydrostatic bearings as a viable backup-bearing solution for a magnetically levitated shaft system in extreme temperature environments (1000 F), developmental upgrades of the fault-tolerant magnetic bearing rig at the NASA Glenn Research Center, and assisting in the development of a conical magnetic bearing for extreme temperature environments, particularly turbomachinery. It leveraged work from the ongoing Smart Efficient Components (SEC) and the Turbine-Based Combined Cycle (TBCC) program at NASA Glenn Research Center. The effort was useful in providing technology for more efficient and powerful gas turbine engines.

  16. Performance evaluation of fault-tolerant systems with application to the IUS

    NASA Technical Reports Server (NTRS)

    Missana, J.-O.; Walker, Bruce K.

    1988-01-01

    A new method for quantitatively evaluating the performance of fault-tolerant systems is developed and applied to an example. The method assumes that the random failure and diagnostic decision behavior of the system can be modeled by a finite state Markov process. A performance value must be assigned to each of the states of the model. The method then generates the moments of the probability mass function of the cumulative performance and uses these to generate a maximum entropy approximation to the performance PMF. Some computational considerations are discussed. The method is applied to a typical mission for the Inertial Upper Stage to examine the attitude accuracy performance of the inertial system.

  17. Overhead and noise threshold of fault-tolerant quantum error correction

    NASA Astrophysics Data System (ADS)

    Steane, Andrew M.

    2003-10-01

    Fault-tolerant quantum error correction (QEC) networks are studied by a combination of numerical and approximate analytical treatments. The probability of failure of the recovery operation is calculated for a variety of Calderbank-Shor-Steane codes, including large block codes and concatenated codes. Recent insights into the syndrome extraction process, which render the whole process more efficient and more noise tolerant, are incorporated. The average number of recoveries that can be completed without failure is thus estimated as a function of various parameters. The main parameters are the gate γ and memory ɛ failure rates, the physical scale-up of the computer size, and the time tm required for measurements and classical processing. The achievable computation size is given as a surface in parameter space. This indicates the noise threshold as well as other information. It is found that concatenated codes based on the [[23,1,7

  18. Finite Time Fault Tolerant Control for Robot Manipulators Using Time Delay Estimation and Continuous Nonsingular Fast Terminal Sliding Mode Control.

    PubMed

    Van, Mien; Ge, Shuzhi Sam; Ren, Hongliang

    2016-04-28

    In this paper, a novel finite time fault tolerant control (FTC) is proposed for uncertain robot manipulators with actuator faults. First, a finite time passive FTC (PFTC) based on a robust nonsingular fast terminal sliding mode control (NFTSMC) is investigated. Be analyzed for addressing the disadvantages of the PFTC, an AFTC are then investigated by combining NFTSMC with a simple fault diagnosis scheme. In this scheme, an online fault estimation algorithm based on time delay estimation (TDE) is proposed to approximate actuator faults. The estimated fault information is used to detect, isolate, and accommodate the effect of the faults in the system. Then, a robust AFTC law is established by combining the obtained fault information and a robust NFTSMC. Finally, a high-order sliding mode (HOSM) control based on super-twisting algorithm is employed to eliminate the chattering. In comparison to the PFTC and other state-of-the-art approaches, the proposed AFTC scheme possess several advantages such as high precision, strong robustness, no singularity, less chattering, and fast finite-time convergence due to the combined NFTSMC and HOSM control, and requires no prior knowledge of the fault due to TDE-based fault estimation. Finally, simulation results are obtained to verify the effectiveness of the proposed strategy.

  19. Optimizing the Reliability and Performance of Service Composition Applications with Fault Tolerance in Wireless Sensor Networks.

    PubMed

    Wu, Zhao; Xiong, Naixue; Huang, Yannong; Xu, Degang; Hu, Chunyang

    2015-11-06

    The services composition technology provides flexible methods for building service composition applications (SCAs) in wireless sensor networks (WSNs). The high reliability and high performance of SCAs help services composition technology promote the practical application of WSNs. The optimization methods for reliability and performance used for traditional software systems are mostly based on the instantiations of software components, which are inapplicable and inefficient in the ever-changing SCAs in WSNs. In this paper, we consider the SCAs with fault tolerance in WSNs. Based on a Universal Generating Function (UGF) we propose a reliability and performance model of SCAs in WSNs, which generalizes a redundancy optimization problem to a multi-state system. Based on this model, an efficient optimization algorithm for reliability and performance of SCAs in WSNs is developed based on a Genetic Algorithm (GA) to find the optimal structure of SCAs with fault-tolerance in WSNs. In order to examine the feasibility of our algorithm, we have evaluated the performance. Furthermore, the interrelationships between the reliability, performance and cost are investigated. In addition, a distinct approach to determine the most suitable parameters in the suggested algorithm is proposed.

  20. Optimizing the Reliability and Performance of Service Composition Applications with Fault Tolerance in Wireless Sensor Networks

    PubMed Central

    Wu, Zhao; Xiong, Naixue; Huang, Yannong; Xu, Degang; Hu, Chunyang

    2015-01-01

    The services composition technology provides flexible methods for building service composition applications (SCAs) in wireless sensor networks (WSNs). The high reliability and high performance of SCAs help services composition technology promote the practical application of WSNs. The optimization methods for reliability and performance used for traditional software systems are mostly based on the instantiations of software components, which are inapplicable and inefficient in the ever-changing SCAs in WSNs. In this paper, we consider the SCAs with fault tolerance in WSNs. Based on a Universal Generating Function (UGF) we propose a reliability and performance model of SCAs in WSNs, which generalizes a redundancy optimization problem to a multi-state system. Based on this model, an efficient optimization algorithm for reliability and performance of SCAs in WSNs is developed based on a Genetic Algorithm (GA) to find the optimal structure of SCAs with fault-tolerance in WSNs. In order to examine the feasibility of our algorithm, we have evaluated the performance. Furthermore, the interrelationships between the reliability, performance and cost are investigated. In addition, a distinct approach to determine the most suitable parameters in the suggested algorithm is proposed. PMID:26561818

  1. A fault-tolerant addressable spin qubit in a natural silicon quantum dot.

    PubMed

    Takeda, Kenta; Kamioka, Jun; Otsuka, Tomohiro; Yoneda, Jun; Nakajima, Takashi; Delbecq, Matthieu R; Amaha, Shinichi; Allison, Giles; Kodera, Tetsuo; Oda, Shunri; Tarucha, Seigo

    2016-08-01

    Fault-tolerant quantum computing requires high-fidelity qubits. This has been achieved in various solid-state systems, including isotopically purified silicon, but is yet to be accomplished in industry-standard natural (unpurified) silicon, mainly as a result of the dephasing caused by residual nuclear spins. This high fidelity can be achieved by speeding up the qubit operation and/or prolonging the dephasing time, that is, increasing the Rabi oscillation quality factor Q (the Rabi oscillation decay time divided by the π rotation time). In isotopically purified silicon quantum dots, only the second approach has been used, leaving the qubit operation slow. We apply the first approach to demonstrate an addressable fault-tolerant qubit using a natural silicon double quantum dot with a micromagnet that is optimally designed for fast spin control. This optimized design allows access to Rabi frequencies up to 35 MHz, which is two orders of magnitude greater than that achieved in previous studies. We find the optimum Q = 140 in such high-frequency range at a Rabi frequency of 10 MHz. This leads to a qubit fidelity of 99.6% measured via randomized benchmarking, which is the highest reported for natural silicon qubits and comparable to that obtained in isotopically purified silicon quantum dot-based qubits. This result can inspire contributions to quantum computing from industrial communities.

  2. Kinematic analysis and fault-tolerant trajectory planning of space manipulator under a single joint failure.

    PubMed

    Mu, Zonggao; Han, Liang; Xu, Wenfu; Li, Bing; Liang, Bin

    2016-01-01

    A space manipulator plays an important role in spacecraft capturing, repairing, maintenance, and so on. However, the harsh space environment will cause its joints fail to work. For a non-redundant manipulator, single joint locked failure will cause it to lose one degree of freedom (DOF), hence reducing its movement ability. In this paper, the key problems related to the fault-tolerant including kinematics, workspace, and trajectory planning of a non-redundant space manipulator under single joint failure are handled. First, the analytical inverse kinematics equations are derived for the 5-DOF manipulator formed by locking the failure joint of the original 6-DOF manipulator. Then, the reachable end-effector pose (position and orientation) is determined. Further, we define the missions can be completed by the 5-DOF manipulator. According to the constraints of the on-orbital mission, we determine the grasp envelope required for the end-effector. Combining the manipulability of the manipulator and the performance of its end-effector, a fault tolerance parameter is defined and a planning method is proposed to generate the reasonable trajectory, based on which the 5-DOF manipulator can complete the desired tasks. Finally, typical cases are simulated and the simulation results verify the proposed method.

  3. Fault Tolerance Implementation within SRAM Based FPGA Designs based upon Single Event Upset Occurrence Rates

    NASA Technical Reports Server (NTRS)

    Berg, Melanie

    2006-01-01

    Emerging technology is enabling the design community to consistently expand the amount of functionality that can be implemented within Integrated Circuits (ICs). As the number of gates placed within an FPGA increases, the complexity of the design can grow exponentially. Consequently, the ability to create reliable circuits has become an incredibly difficult task. In order to ease the complexity of design completion, the commercial design community has developed a very rigid (but effective) design methodology based on synchronous circuit techniques. In order to create faster, smaller and lower power circuits, transistor geometries and core voltages have decreased. In environments that contain ionizing energy, such a combination will increase the probability of Single Event Upsets (SEUs) and will consequently affect the state space of a circuit. In order to combat the effects of radiation, the aerospace community has developed several "Hardened by Design" (fault tolerant) design schemes. This paper will address design mitigation schemes targeted for SRAM Based FPGA CMOS devices. Because some mitigation schemes may be over zealous (too much power, area, complexity, etc.. . .), the designer should be conscious that system requirements can ease the amount of mitigation necessary for acceptable operation. Therefore, various degrees of Fault Tolerance will be demonstrated along with an analysis of its effectiveness.

  4. Minimum sliding mode error feedback control for fault tolerant reconfigurable satellite formations with J2 perturbations

    NASA Astrophysics Data System (ADS)

    Cao, Lu; Chen, Xiaoqian; Misra, Arun K.

    2014-03-01

    Minimum Sliding Mode Error Feedback Control (MSMEFC) is proposed to improve the control precision of spacecraft formations based on the conventional sliding mode control theory. This paper proposes a new approach to estimate and offset the system model errors, which include various kinds of uncertainties and disturbances, as well as smoothes out the effect of nonlinear switching control terms. To facilitate the analysis, the concept of equivalent control error is introduced, which is the key to the utilization of MSMEFC. A cost function is formulated on the basis of the principle of minimum sliding mode error; then the equivalent control error is estimated and fed back to the conventional sliding mode control. It is shown that the sliding mode after the MSMEFC will approximate to the ideal sliding mode, resulting in improved control performance and quality. The new methodology is applied to spacecraft formation flying. It guarantees global asymptotic convergence of the relative tracking error in the presence of J2 perturbations. In addition, some fault tolerant situations such as thruster failure for a period of time, thruster degradation and so on, are also considered to verify the effectiveness of MSMEFC. Numerical simulations are performed to demonstrate the efficacy of the proposed methodology to maintain and reconfigure the satellite formation with the existence of initial offsets and J2 perturbation effects, even in the fault-tolerant cases.

  5. Evaluation of Simple Causal Message Logging for Large-Scale Fault Tolerant HPC Systems

    SciTech Connect

    Bronevetsky, G; Meneses, E; Kale, L V

    2011-02-25

    The era of petascale computing brought machines with hundreds of thousands of processors. The next generation of exascale supercomputers will make available clusters with millions of processors. In those machines, mean time between failures will range from a few minutes to few tens of minutes, making the crash of a processor the common case, instead of a rarity. Parallel applications running on those large machines will need to simultaneously survive crashes and maintain high productivity. To achieve that, fault tolerance techniques will have to go beyond checkpoint/restart, which requires all processors to roll back in case of a failure. Incorporating some form of message logging will provide a framework where only a subset of processors are rolled back after a crash. In this paper, we discuss why a simple causal message logging protocol seems a promising alternative to provide fault tolerance in large supercomputers. As opposed to pessimistic message logging, it has low latency overhead, especially in collective communication operations. Besides, it saves messages when more than one thread is running per processor. Finally, we demonstrate that a simple causal message logging protocol has a faster recovery and a low performance penalty when compared to checkpoint/restart. Running NAS Parallel Benchmarks (CG, MG and BT) on 1024 processors, simple causal message logging has a latency overhead below 5%.

  6. Reverse Computation for Rollback-based Fault Tolerance in Large Parallel Systems

    SciTech Connect

    Perumalla, Kalyan S; Park, Alfred J

    2013-01-01

    Reverse computation is presented here as an important future direction in addressing the challenge of fault tolerant execution on very large cluster platforms for parallel computing. As the scale of parallel jobs increases, traditional checkpointing approaches suffer scalability problems ranging from computational slowdowns to high congestion at the persistent stores for checkpoints. Reverse computation can overcome such problems and is also better suited for parallel computing on newer architectures with smaller, cheaper or energy-efficient memories and file systems. Initial evidence for the feasibility of reverse computation in large systems is presented with detailed performance data from a particle simulation scaling to 65,536 processor cores and 950 accelerators (GPUs). Reverse computation is observed to deliver very large gains relative to checkpointing schemes when nodes rely on their host processors/memory to tolerate faults at their accelerators. A comparison between reverse computation and checkpointing with measurements such as cache miss ratios, TLB misses and memory usage indicates that reverse computation is hard to ignore as a future alternative to be pursued in emerging architectures.

  7. A fault-tolerant addressable spin qubit in a natural silicon quantum dot

    PubMed Central

    Takeda, Kenta; Kamioka, Jun; Otsuka, Tomohiro; Yoneda, Jun; Nakajima, Takashi; Delbecq, Matthieu R.; Amaha, Shinichi; Allison, Giles; Kodera, Tetsuo; Oda, Shunri; Tarucha, Seigo

    2016-01-01

    Fault-tolerant quantum computing requires high-fidelity qubits. This has been achieved in various solid-state systems, including isotopically purified silicon, but is yet to be accomplished in industry-standard natural (unpurified) silicon, mainly as a result of the dephasing caused by residual nuclear spins. This high fidelity can be achieved by speeding up the qubit operation and/or prolonging the dephasing time, that is, increasing the Rabi oscillation quality factor Q (the Rabi oscillation decay time divided by the π rotation time). In isotopically purified silicon quantum dots, only the second approach has been used, leaving the qubit operation slow. We apply the first approach to demonstrate an addressable fault-tolerant qubit using a natural silicon double quantum dot with a micromagnet that is optimally designed for fast spin control. This optimized design allows access to Rabi frequencies up to 35 MHz, which is two orders of magnitude greater than that achieved in previous studies. We find the optimum Q = 140 in such high-frequency range at a Rabi frequency of 10 MHz. This leads to a qubit fidelity of 99.6% measured via randomized benchmarking, which is the highest reported for natural silicon qubits and comparable to that obtained in isotopically purified silicon quantum dot–based qubits. This result can inspire contributions to quantum computing from industrial communities. PMID:27536725

  8. Model Checking a Byzantine-Fault-Tolerant Self-Stabilizing Protocol for Distributed Clock Synchronization Systems

    NASA Technical Reports Server (NTRS)

    Malekpour, Mahyar R.

    2007-01-01

    This report presents the mechanical verification of a simplified model of a rapid Byzantine-fault-tolerant self-stabilizing protocol for distributed clock synchronization systems. This protocol does not rely on any assumptions about the initial state of the system. This protocol tolerates bursts of transient failures, and deterministically converges within a time bound that is a linear function of the self-stabilization period. A simplified model of the protocol is verified using the Symbolic Model Verifier (SMV) [SMV]. The system under study consists of 4 nodes, where at most one of the nodes is assumed to be Byzantine faulty. The model checking effort is focused on verifying correctness of the simplified model of the protocol in the presence of a permanent Byzantine fault as well as confirmation of claims of determinism and linear convergence with respect to the self-stabilization period. Although model checking results of the simplified model of the protocol confirm the theoretical predictions, these results do not necessarily confirm that the protocol solves the general case of this problem. Modeling challenges of the protocol and the system are addressed. A number of abstractions are utilized in order to reduce the state space. Also, additional innovative state space reduction techniques are introduced that can be used in future verification efforts applied to this and other protocols.

  9. Verification of a Byzantine-Fault-Tolerant Self-stabilizing Protocol for Clock Synchronization

    NASA Technical Reports Server (NTRS)

    Malekpour, Mahyar R.

    2008-01-01

    This paper presents the mechanical verification of a simplified model of a rapid Byzantine-fault-tolerant self-stabilizing protocol for distributed clock synchronization systems. This protocol does not rely on any assumptions about the initial state of the system except for the presence of sufficient good nodes, thus making the weakest possible assumptions and producing the strongest results. This protocol tolerates bursts of transient failures, and deterministically converges within a time bound that is a linear function of the self-stabilization period. A simplified model of the protocol is verified using the Symbolic Model Verifier (SMV). The system under study consists of 4 nodes, where at most one of the nodes is assumed to be Byzantine faulty. The model checking effort is focused on verifying correctness of the simplified model of the protocol in the presence of a permanent Byzantine fault as well as confirmation of claims of determinism and linear convergence with respect to the self-stabilization period. Although model checking results of the simplified model of the protocol confirm the theoretical predictions, these results do not necessarily confirm that the protocol solves the general case of this problem. Modeling challenges of the protocol and the system are addressed. A number of abstractions are utilized in order to reduce the state space.

  10. Universal fault-tolerant adiabatic quantum computing with quantum dots or donors

    NASA Astrophysics Data System (ADS)

    Landahl, Andrew

    I will present a conceptual design for an adiabatic quantum computer that can achieve arbitrarily accurate universal fault-tolerant quantum computations with a constant energy gap and nearest-neighbor interactions. This machine can run any quantum algorithm known today or discovered in the future, in principle. The key theoretical idea is adiabatic deformation of degenerate ground spaces formed by topological quantum error-correcting codes. An open problem with the design is making the four-body interactions and measurements it uses more technologically accessible. I will present some partial solutions, including one in which interactions between quantum dots or donors in a two-dimensional array can emulate the desired interactions in second-order perturbation theory. I will conclude with some open problems, including the challenge of reformulating Kitaev's gadget perturbation theory technique so that it preserves fault tolerance. Sandia National Laboratories is a multi-program laboratory managed and operated by Sandia Corporation, a wholly owned subsidiary of Lockheed Martin Corporation, for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

  11. Design of a fault tolerant airborne digital computer. Volume 1: Architecture

    NASA Technical Reports Server (NTRS)

    Wensley, J. H.; Levitt, K. N.; Green, M. W.; Goldberg, J.; Neumann, P. G.

    1973-01-01

    This volume is concerned with the architecture of a fault tolerant digital computer for an advanced commercial aircraft. All of the computations of the aircraft, including those presently carried out by analogue techniques, are to be carried out in this digital computer. Among the important qualities of the computer are the following: (1) The capacity is to be matched to the aircraft environment. (2) The reliability is to be selectively matched to the criticality and deadline requirements of each of the computations. (3) The system is to be readily expandable. contractible, and (4) The design is to appropriate to post 1975 technology. Three candidate architectures are discussed and assessed in terms of the above qualities. Of the three candidates, a newly conceived architecture, Software Implemented Fault Tolerance (SIFT), provides the best match to the above qualities. In addition SIFT is particularly simple and believable. The other candidates, Bus Checker System (BUCS), also newly conceived in this project, and the Hopkins multiprocessor are potentially more efficient than SIFT in the use of redundancy, but otherwise are not as attractive.

  12. Fault-tolerant control of large space structures using the stable factorization approach

    NASA Technical Reports Server (NTRS)

    Razavi, H. C.; Mehra, R. K.; Vidyasagar, M.

    1986-01-01

    Large space structures are characterized by the following features: they are in general infinite-dimensional systems, and have large numbers of undamped or lightly damped poles. Any attempt to apply linear control theory to large space structures must therefore take into account these features. Phase I consisted of an attempt to apply the recently developed Stable Factorization (SF) design philosophy to problems of large space structures, with particular attention to the aspects of robustness and fault tolerance. The final report on the Phase I effort consists of four sections, each devoted to one task. The first three sections report theoretical results, while the last consists of a design example. Significant results were obtained in all four tasks of the project. More specifically, an innovative approach to order reduction was obtained, stabilizing controller structures for plants with an infinite number of unstable poles were determined under some conditions, conditions for simultaneous stabilizability of an infinite number of plants were explored, and a fault tolerance controller design that stabilizes a flexible structure model was obtained which is robust against one failure condition.

  13. A Multiconstrained Grid Scheduling Algorithm with Load Balancing and Fault Tolerance.

    PubMed

    Keerthika, P; Suresh, P

    2015-01-01

    Grid environment consists of millions of dynamic and heterogeneous resources. A grid environment which deals with computing resources is computational grid and is meant for applications that involve larger computations. A scheduling algorithm is said to be efficient if and only if it performs better resource allocation even in case of resource failure. Allocation of resources is a tedious issue since it has to consider several requirements such as system load, processing cost and time, user's deadline, and resource failure. This work attempts to design a resource allocation algorithm which is budget constrained and also targets load balancing, fault tolerance, and user satisfaction by considering the above requirements. The proposed Multiconstrained Load Balancing Fault Tolerant algorithm (MLFT) reduces the schedule makespan, schedule cost, and task failure rate and improves resource utilization. The proposed MLFT algorithm is evaluated using Gridsim toolkit and the results are compared with the recent algorithms which separately concentrate on all these factors. The comparison results ensure that the proposed algorithm works better than its counterparts.

  14. Fault-Tolerant and Elastic Streaming MapReduce with Decentralized Coordination

    SciTech Connect

    Kumbhare, Alok; Frincu, Marc; Simmhan, Yogesh; Prasanna, Viktor K.

    2015-06-29

    The MapReduce programming model, due to its simplicity and scalability, has become an essential tool for processing large data volumes in distributed environments. Recent Stream Processing Systems (SPS) extend this model to provide low-latency analysis of high-velocity continuous data streams. However, integrating MapReduce with streaming poses challenges: first, the runtime variations in data characteristics such as data-rates and key-distribution cause resource overload, that inturn leads to fluctuations in the Quality of the Service (QoS); and second, the stateful reducers, whose state depends on the complete tuple history, necessitates efficient fault-recovery mechanisms to maintain the desired QoS in the presence of resource failures. We propose an integrated streaming MapReduce architecture leveraging the concept of consistent hashing to support runtime elasticity along with locality-aware data and state replication to provide efficient load-balancing with low-overhead fault-tolerance and parallel fault-recovery from multiple simultaneous failures. Our evaluation on a private cloud shows up to 2:8 improvement in peak throughput compared to Apache Storm SPS, and a low recovery latency of 700 -1500 ms from multiple failures.

  15. Redundant and fault-tolerant algorithms for real-time measurement and control systems for weapon equipment.

    PubMed

    Li, Dan; Hu, Xiaoguang

    2017-03-01

    Because of the high availability requirements from weapon equipment, an in-depth study has been conducted on the real-time fault-tolerance of the widely applied Compact PCI (CPCI) bus measurement and control system. A redundancy design method that uses heartbeat detection to connect the primary and alternate devices has been developed. To address the low successful execution rate and relatively large waste of time slices in the primary version of the task software, an improved algorithm for real-time fault-tolerant scheduling is proposed based on the Basic Checking available time Elimination idle time (BCE) algorithm, applying a single-neuron self-adaptive proportion sum differential (PSD) controller. The experimental validation results indicate that this system has excellent redundancy and fault-tolerance, and the newly developed method can effectively improve the system availability.

  16. Filtering and fault tolerant control of parameter-varying time-delay systems and applications

    NASA Astrophysics Data System (ADS)

    Mohammadpour Velni, Javad

    This dissertation addresses some open problems in control systems theory. The problems considered include the dynamic controller and filter design for Linear Parameter Varying (LPV) time-delay systems, the reconfigurable control design in Fault Tolerant Control Systems (FTCS) and fault diagnostics in Diesel engines. In the first part of this thesis, we investigate the problem of designing parameter-dependent filters for output estimation of LPV time-delay systems. The filters are designed such that the filtering error system guarantees an optimum level of H2 or Hinfinity performance. A state-delay term is included in the filter dynamics to reduce the design conservatism and improve the performance. The Linear Matrix Inequality (LMI)-based synthesis conditions developed for the filter design purposes are categorized into the rate-dependent and delay-dependent conditions which could handle the time-varying state-delay and bounded small delay cases, respectively. Among these two, the latter one is shown to provide a significant reduction in the conservativeness in the filter design. The second part of the thesis examines the analysis and synthesis of Fault Tolerant Control (FTC) systems in an LPV framework. For reconfigurable control design purposes, the information from Fault Detection and Isolation (FDI) module, that provides an estimate of the fault parameters, is utilized to schedule the controller matrices. We will also present a formulation that incorporates the factor of detection delay in the FTC supervisory system. It is shown that including this delay in the synthesis conditions leads to improved performance and reduced control effort. For analysis of the FTC systems including time-delay, where the fault parameters might be identified inaccurately, we first introduce the notion of brief instability for LPV time-delay systems. In these systems it is possible that the output trajectory converges to zero even though there are parameter trajectories for which

  17. Verification of fault-tolerant clock synchronization systems. M.S. Thesis - College of William and Mary, 1992

    NASA Technical Reports Server (NTRS)

    Miner, Paul S.

    1993-01-01

    A critical function in a fault-tolerant computer architecture is the synchronization of the redundant computing elements. The synchronization algorithm must include safeguards to ensure that failed components do not corrupt the behavior of good clocks. Reasoning about fault-tolerant clock synchronization is difficult because of the possibility of subtle interactions involving failed components. Therefore, mechanical proof systems are used to ensure that the verification of the synchronization system is correct. In 1987, Schneider presented a general proof of correctness for several fault-tolerant clock synchronization algorithms. Subsequently, Shankar verified Schneider's proof by using the mechanical proof system EHDM. This proof ensures that any system satisfying its underlying assumptions will provide Byzantine fault-tolerant clock synchronization. The utility of Shankar's mechanization of Schneider's theory for the verification of clock synchronization systems is explored. Some limitations of Shankar's mechanically verified theory were encountered. With minor modifications to the theory, a mechanically checked proof is provided that removes these limitations. The revised theory also allows for proven recovery from transient faults. Use of the revised theory is illustrated with the verification of an abstract design of a clock synchronization system.

  18. Influence of slot number and pole number in fault-tolerant brushless dc motors having unequal tooth widths

    NASA Astrophysics Data System (ADS)

    Ishak, D.; Zhu, Z. Q.; Howe, D.

    2005-05-01

    The electromagnetic performance of fault-tolerant three-phase permanent magnet brushless dc motors, in which the wound teeth are wider than the unwound teeth and their tooth tips span approximately one pole pitch and which have similar numbers of slots and poles, is investigated. It is shown that they have a more trapezoidal phase back-emf wave form, a higher torque capability, and a lower torque ripple than similar fault-tolerant machines with equal tooth widths. However, these benefits gradually diminish as the pole number is increased, due to the effect of interpole leakage flux.

  19. Validation of fault-free behavior of a reliable multiprocessor system - FTMP: A case study. [Fault-Tolerant Multi-Processor avionics

    NASA Technical Reports Server (NTRS)

    Clune, E.; Segall, Z.; Siewiorek, D.

    1984-01-01

    A program of experiments has been conducted at NASA-Langley to test the fault-free performance of a Fault-Tolerant Multiprocessor (FTMP) avionics system for next-generation aircraft. Baseline measurements of an operating FTMP system were obtained with respect to the following parameters: instruction execution time, frame size, and the variation of clock ticks. The mechanisms of frame stretching were also investigated. The experimental results are summarized in a table. Areas of interest for future tests are identified, with emphasis given to the implementation of a synthetic workload generation mechanism on FTMP.

  20. Fault-tolerant logical gates in quantum error-correcting codes

    NASA Astrophysics Data System (ADS)

    Pastawski, Fernando; Yoshida, Beni

    2015-01-01

    Recently, S. Bravyi and R. König [Phys. Rev. Lett. 110, 170503 (2013), 10.1103/PhysRevLett.110.170503] have shown that there is a trade-off between fault-tolerantly implementable logical gates and geometric locality of stabilizer codes. They consider locality-preserving operations which are implemented by a constant-depth geometrically local circuit and are thus fault tolerant by construction. In particular, they show that, for local stabilizer codes in D spatial dimensions, locality-preserving gates are restricted to a set of unitary gates known as the D th level of the Clifford hierarchy. In this paper, we explore this idea further by providing several extensions and applications of their characterization to qubit stabilizer and subsystem codes. First, we present a no-go theorem for self-correcting quantum memory. Namely, we prove that a three-dimensional stabilizer Hamiltonian with a locality-preserving implementation of a non-Clifford gate cannot have a macroscopic energy barrier. This result implies that non-Clifford gates do not admit such implementations in Haah's cubic code and Michnicki's welded code. Second, we prove that the code distance of a D -dimensional local stabilizer code with a nontrivial locality-preserving m th -level Clifford logical gate is upper bounded by O (LD +1 -m) . For codes with non-Clifford gates (m >2 ), this improves the previous best bound by S. Bravyi and B. Terhal [New. J. Phys. 11, 043029 (2009), 10.1088/1367-2630/11/4/043029]. Topological color codes, introduced by H. Bombin and M. A. Martin-Delgado [Phys. Rev. Lett. 97, 180501 (2006), 10.1103/PhysRevLett.97.180501; Phys. Rev. Lett. 98, 160502 (2007), 10.1103/PhysRevLett.98.160502; Phys. Rev. B 75, 075103 (2007), 10.1103/PhysRevB.75.075103], saturate the bound for m =D . Third, we prove that the qubit erasure threshold for codes with a nontrivial transversal m th -level Clifford logical gate is upper bounded by 1 /m . This implies that no family of fault-tolerant codes with

  1. Advanced information processing system: Fault injection study and results

    NASA Technical Reports Server (NTRS)

    Burkhardt, Laura F.; Masotto, Thomas K.; Lala, Jaynarayan H.

    1992-01-01

    The objective of the AIPS program is to achieve a validated fault tolerant distributed computer system. The goals of the AIPS fault injection study were: (1) to present the fault injection study components addressing the AIPS validation objective; (2) to obtain feedback for fault removal from the design implementation; (3) to obtain statistical data regarding fault detection, isolation, and reconfiguration responses; and (4) to obtain data regarding the effects of faults on system performance. The parameters are described that must be varied to create a comprehensive set of fault injection tests, the subset of test cases selected, the test case measurements, and the test case execution. Both pin level hardware faults using a hardware fault injector and software injected memory mutations were used to test the system. An overview is provided of the hardware fault injector and the associated software used to carry out the experiments. Detailed specifications are given of fault and test results for the I/O Network and the AIPS Fault Tolerant Processor, respectively. The results are summarized and conclusions are given.

  2. Demonstration of qubit operations below a rigorous fault tolerance threshold with gate set tomography

    PubMed Central

    Blume-Kohout, Robin; Gamble, John King; Nielsen, Erik; Rudinger, Kenneth; Mizrahi, Jonathan; Fortier, Kevin; Maunz, Peter

    2017-01-01

    Quantum information processors promise fast algorithms for problems inaccessible to classical computers. But since qubits are noisy and error-prone, they will depend on fault-tolerant quantum error correction (FTQEC) to compute reliably. Quantum error correction can protect against general noise if—and only if—the error in each physical qubit operation is smaller than a certain threshold. The threshold for general errors is quantified by their diamond norm. Until now, qubits have been assessed primarily by randomized benchmarking, which reports a different error rate that is not sensitive to all errors, and cannot be compared directly to diamond norm thresholds. Here we use gate set tomography to completely characterize operations on a trapped-Yb+-ion qubit and demonstrate with greater than 95% confidence that they satisfy a rigorous threshold for FTQEC (diamond norm ≤6.7 × 10−4). PMID:28198466

  3. Demonstration of qubit operations below a rigorous fault tolerance threshold with gate set tomography

    NASA Astrophysics Data System (ADS)

    Blume-Kohout, Robin; Gamble, John King; Nielsen, Erik; Rudinger, Kenneth; Mizrahi, Jonathan; Fortier, Kevin; Maunz, Peter

    2017-02-01

    Quantum information processors promise fast algorithms for problems inaccessible to classical computers. But since qubits are noisy and error-prone, they will depend on fault-tolerant quantum error correction (FTQEC) to compute reliably. Quantum error correction can protect against general noise if--and only if--the error in each physical qubit operation is smaller than a certain threshold. The threshold for general errors is quantified by their diamond norm. Until now, qubits have been assessed primarily by randomized benchmarking, which reports a different error rate that is not sensitive to all errors, and cannot be compared directly to diamond norm thresholds. Here we use gate set tomography to completely characterize operations on a trapped-Yb+-ion qubit and demonstrate with greater than 95% confidence that they satisfy a rigorous threshold for FTQEC (diamond norm <=6.7 × 10-4).

  4. Physical implementation of a Majorana fermion surface code for fault-tolerant quantum computation

    NASA Astrophysics Data System (ADS)

    Vijay, Sagar; Fu, Liang

    2016-12-01

    We propose a physical realization of a commuting Hamiltonian of interacting Majorana fermions realizing Z 2 topological order, using an array of Josephson-coupled topological superconductor islands. The required multi-body interaction Hamiltonian is naturally generated by a combination of charging energy induced quantum phase-slips on the superconducting islands and electron tunneling between islands. Our setup improves on a recent proposal for implementing a Majorana fermion surface code (Vijay et al 2015 Phys. Rev. X 5 041038), a ‘hybrid’ approach to fault-tolerant quantum computation that combines (1) the engineering of a stabilizer Hamiltonian with a topologically ordered ground state with (2) projective stabilizer measurements to implement error correction and a universal set of logical gates. Our hybrid strategy has advantages over the traditional surface code architecture in error suppression and single-step stabilizer measurements, and is widely applicable to implementing stabilizer codes for quantum computation.

  5. Fault-tolerant strategies for an implantable centrifugal blood pump using a radially controlled magnetic bearing.

    PubMed

    Pai, Chi Nan; Shinshi, Tadahiko

    2011-10-01

    In our laboratory, an implantable centrifugal blood pump (CBP) with a two degrees-of-freedom radially controlled magnetic bearing (MB) to support the impeller without contact has been developed to assist the pumping function of the weakened heart ventricle. In order to maintain the function of the CBP after damage to the electromagnets (EMs) of the MB, fault-tolerant strategies for the CBP are proposed in this study. Using a redundant MB design, magnetic levitation of the impeller was maintained with damage to up to two out of a total of four EMs of the MB; with damage to three EMs, contact-free support of the impeller was achieved using hydrodynamic and electromagnetic forces; and with damage to all four EMs, the pump operating point, of 5 l/min against 100 mmHg, was achieved using the motor for rotation of the impeller, with contact between the impeller and the stator.

  6. Backstepping Design of Adaptive Neural Fault-Tolerant Control for MIMO Nonlinear Systems.

    PubMed

    Gao, Hui; Song, Yongduan; Wen, Changyun

    2016-08-24

    In this paper, an adaptive controller is developed for a class of multi-input and multioutput nonlinear systems with neural networks (NNs) used as a modeling tool. It is shown that all the signals in the closed-loop system with the proposed adaptive neural controller are globally uniformly bounded for any external input in L[₀,∞]. In our control design, the upper bound of the NN modeling error and the gains of external disturbance are characterized by unknown upper bounds, which is more rational to establish the stability in the adaptive NN control. Filter-based modification terms are used in the update laws of unknown parameters to improve the transient performance. Finally, fault-tolerant control is developed to accommodate actuator failure. An illustrative example applying the adaptive controller to control a rigid robot arm shows the validation of the proposed controller.

  7. Evaluation Applied to Reliability Analysis of Reconfigurable, Highly Reliable, Fault-Tolerant, Computing Systems for Avionics

    NASA Technical Reports Server (NTRS)

    Migneault, G. E.

    1979-01-01

    Emulation techniques are proposed as a solution to a difficulty arising in the analysis of the reliability of highly reliable computer systems for future commercial aircraft. The difficulty, viz., the lack of credible precision in reliability estimates obtained by analytical modeling techniques are established. The difficulty is shown to be an unavoidable consequence of: (1) a high reliability requirement so demanding as to make system evaluation by use testing infeasible, (2) a complex system design technique, fault tolerance, (3) system reliability dominated by errors due to flaws in the system definition, and (4) elaborate analytical modeling techniques whose precision outputs are quite sensitive to errors of approximation in their input data. The technique of emulation is described, indicating how its input is a simple description of the logical structure of a system and its output is the consequent behavior. The use of emulation techniques is discussed for pseudo-testing systems to evaluate bounds on the parameter values needed for the analytical techniques.

  8. Optimised configuration of sensors for fault tolerant control of an electro-magnetic suspension system

    NASA Astrophysics Data System (ADS)

    Michail, K.; Zolotas, A. C.; Goodall, R. M.; Whidborne, J. F.

    2012-10-01

    For any given system the number and location of sensors can affect the closed-loop performance as well as the reliability of the system. Hence, one problem in control system design is the selection of the sensors in some optimum sense that considers both the system performance and reliability. Although some methods have been proposed that deal with some of the aforementioned aspects, in this work, a design framework dealing with both control and reliability aspects is presented. The proposed framework is able to identify the best sensor set for which optimum performance is achieved even under single or multiple sensor failures with minimum sensor redundancy. The proposed systematic framework combines linear quadratic Gaussian control, fault tolerant control and multiobjective optimisation. The efficacy of the proposed framework is shown via appropriate simulations on an electro-magnetic suspension system.

  9. A Markov model reduction technique for fault tolerant processor reliability analysis

    NASA Technical Reports Server (NTRS)

    Schor, Andrei L.; Rosch, Gene

    1991-01-01

    A fault tolerant processor (FTP) plays a key role in many high performance, safety-critical control system applications. Realistic modeling of an FTP is crucial to gaining a high degree of confidence in the reliability and safety analysis of such a system. While fidelity is clearly a major consideration, a practical model must also be kept to a moderate size to allow its incorporation into the overall system model. This paper presents a systematic reduction technique that starts from a complex, detailed model of a triple, redundant FTP and produces a low order approximation of very high accuracy. The existence of two distinct time scales represents the key to the success of the technique. No eigenvalue solution or coordinate transformation are needed. The reduced model captures all the important features of the detailed model, is amenable to an analytical solution and provides insight into the reconfiguration behavior of an FTP.

  10. Decentralized Fault-Tolerant Control of Inland Navigation Networks: a Challenge

    NASA Astrophysics Data System (ADS)

    Segovia, P.; Rajaoarisoa, L.; Nejjari, F.; Blesa, J.; Puig, V.; Duviella, E.

    2017-01-01

    Inland waterways are large-scale networks used principally for navigation. Even if the transport planning is an important issue, the water resource management is a crucial point. Indeed, navigation is not possible when there is too little or too much water inside the waterways. Hence, the water resource management of waterways has to be particularly efficient in a context of climate change and increase of water demand. This management has to be done by considering different time and space scales and still requires the development of new methodologies and tools in the topics of the Control and Informatics communities. This work addresses the problem of waterways management in terms of modeling, control, diagnosis and fault-tolerant control by focusing in the inland waterways of the north of France. A review of proposed tools and the ongoing research topics are provided in this paper.

  11. Implementing a strand of a scalable fault-tolerant quantum computing fabric.

    PubMed

    Chow, Jerry M; Gambetta, Jay M; Magesan, Easwar; Abraham, David W; Cross, Andrew W; Johnson, B R; Masluk, Nicholas A; Ryan, Colm A; Smolin, John A; Srinivasan, Srikanth J; Steffen, M

    2014-06-24

    With favourable error thresholds and requiring only nearest-neighbour interactions on a lattice, the surface code is an error-correcting code that has garnered considerable attention. At the heart of this code is the ability to perform a low-weight parity measurement of local code qubits. Here we demonstrate high-fidelity parity detection of two code qubits via measurement of a third syndrome qubit. With high-fidelity gates, we generate entanglement distributed across three superconducting qubits in a lattice where each code qubit is coupled to two bus resonators. Via high-fidelity measurement of the syndrome qubit, we deterministically entangle the code qubits in either an even or odd parity Bell state, conditioned on the syndrome qubit state. Finally, to fully characterize this parity readout, we develop a measurement tomography protocol. The lattice presented naturally extends to larger networks of qubits, outlining a path towards fault-tolerant quantum computing.

  12. Application of a Resource Theory for Magic States to Fault-Tolerant Quantum Computing.

    PubMed

    Howard, Mark; Campbell, Earl

    2017-03-03

    Motivated by their necessity for most fault-tolerant quantum computation schemes, we formulate a resource theory for magic states. First, we show that robustness of magic is a well-behaved magic monotone that operationally quantifies the classical simulation overhead for a Gottesman-Knill-type scheme using ancillary magic states. Our framework subsequently finds immediate application in the task of synthesizing non-Clifford gates using magic states. When magic states are interspersed with Clifford gates, Pauli measurements, and stabilizer ancillas-the most general synthesis scenario-then the class of synthesizable unitaries is hard to characterize. Our techniques can place nontrivial lower bounds on the number of magic states required for implementing a given target unitary. Guided by these results, we have found new and optimal examples of such synthesis.

  13. Low cost management of replicated data in fault-tolerant distributed systems

    NASA Technical Reports Server (NTRS)

    Joseph, Thomas A.; Birman, Kenneth P.

    1990-01-01

    Many distributed systems replicate data for fault tolerance or availability. In such systems, a logical update on a data item results in a physical update on a number of copies. The synchronization and communication required to keep the copies of replicated data consistent introduce a delay when operations are performed. A technique is described that relaxes the usual degree of synchronization, permitting replicated data items to be updated concurrently with other operations, while at the same time ensuring that correctness is not violated. The additional concurrency thus obtained results in better response time when performing operations on replicated data. How this technique performs in conjunction with a roll-back and a roll-forward failure recovery mechanism is also discussed.

  14. Analysis and design of algorithm-based fault-tolerant systems

    NASA Technical Reports Server (NTRS)

    Nair, V. S. Sukumaran

    1990-01-01

    An important consideration in the design of high performance multiprocessor systems is to ensure the correctness of the results computed in the presence of transient and intermittent failures. Concurrent error detection and correction have been applied to such systems in order to achieve reliability. Algorithm Based Fault Tolerance (ABFT) was suggested as a cost-effective concurrent error detection scheme. The research was motivated by the complexity involved in the analysis and design of ABFT systems. To that end, a matrix-based model was developed and, based on that, algorithms for both the design and analysis of ABFT systems are formulated. These algorithms are less complex than the existing ones. In order to reduce the complexity further, a hierarchical approach is developed for the analysis of large systems.

  15. General computational framework for distributed sensing and fault-tolerant sensor integration

    SciTech Connect

    Iyengar, S.S.; Prasad, L.

    1995-04-01

    An abstract framework to address the problem of fault-tolerance integration of information provided by multiple sensors was proposed. The problem addressed involves combining interval estimates of sensor outputs into a best intersection estimate of outputs. To test the theoretical analysis of the proposed computational framework, a modular parameter-driven simulator SIMDSN was developed. The simulator uses the well-known Monte-Carlo technique to generate random correct and tamely faulty intervals. The results of the proposed algorithm and Marzullo`s were compared. It is concluded that based on the simulation, the union of the disjoint intervals determined by the algorithm and the gaps between these intervals together can never exceed the width of the final output of Marzullo`s method. Thus, while Marzullo`s method specifies a rather large interval as the final output of the integration process, the proposed method subdivides this interval into disjoint intervals each with an associated reliability measure. 7 refs.

  16. Methods and apparatuses for self-generating fault-tolerant keys in spread-spectrum systems

    DOEpatents

    Moradi, Hussein; Farhang, Behrouz; Subramanian, Vijayarangam

    2015-12-15

    Self-generating fault-tolerant keys for use in spread-spectrum systems are disclosed. At a communication device, beacon signals are received from another communication device and impulse responses are determined from the beacon signals. The impulse responses are circularly shifted to place a largest sample at a predefined position. The impulse responses are converted to a set of frequency responses in a frequency domain. The frequency responses are shuffled with a predetermined shuffle scheme to develop a set of shuffled frequency responses. A set of phase differences is determined as a difference between an angle of the frequency response and an angle of the shuffled frequency response at each element of the corresponding sets. Each phase difference is quantized to develop a set of secret-key quantized phases and a set of spreading codes is developed wherein each spreading code includes a corresponding phase of the set of secret-key quantized phases.

  17. Development of N-version software samples for an experiment in software fault tolerance

    NASA Technical Reports Server (NTRS)

    Lauterbach, L.

    1987-01-01

    The report documents the task planning and software development phases of an effort to obtain twenty versions of code independently designed and developed from a common specification. These versions were created for use in future experiments in software fault tolerance, in continuation of the experimental series underway at the Systems Validation Methods Branch (SVMB) at NASA Langley Research Center. The 20 versions were developed under controlled conditions at four U.S. universities, by 20 teams of two researchers each. The versions process raw data from a modified Redundant Strapped Down Inertial Measurement Unit (RSDIMU). The specifications, and over 200 questions submitted by the developers concerning the specifications, are included as appendices to this report. Design documents, and design and code walkthrough reports for each version, were also obtained in this task for use in future studies.

  18. Fault tolerance in an inner-outer solver: A GVR-enabled case study

    SciTech Connect

    Zhang, Ziming; Chien, Andrew A.; Teranishi, Keita

    2015-04-18

    Resilience is a major challenge for large-scale systems. It is particularly important for iterative linear solvers, since they take much of the time of many scientific applications. We show that single bit flip errors in the Flexible GMRES iterative linear solver can lead to high computational overhead or even failure to converge to the right answer. Informed by these results, we design and evaluate several strategies for fault tolerance in both inner and outer solvers appropriate across a range of error rates. We implement them, extending Trilinos’ solver library with the Global View Resilience (GVR) programming model, which provides multi-stream snapshots, multi-version data structures with portable and rich error checking/recovery. Lastly, experimental results validate correct execution with low performance overhead under varied error conditions.

  19. Fault tolerance in an inner-outer solver: A GVR-enabled case study

    DOE PAGES

    Zhang, Ziming; Chien, Andrew A.; Teranishi, Keita

    2015-04-18

    Resilience is a major challenge for large-scale systems. It is particularly important for iterative linear solvers, since they take much of the time of many scientific applications. We show that single bit flip errors in the Flexible GMRES iterative linear solver can lead to high computational overhead or even failure to converge to the right answer. Informed by these results, we design and evaluate several strategies for fault tolerance in both inner and outer solvers appropriate across a range of error rates. We implement them, extending Trilinos’ solver library with the Global View Resilience (GVR) programming model, which provides multi-streammore » snapshots, multi-version data structures with portable and rich error checking/recovery. Lastly, experimental results validate correct execution with low performance overhead under varied error conditions.« less

  20. Demonstration of qubit operations below a rigorous fault tolerance threshold with gate set tomography

    DOE PAGES

    Blume-Kohout, Robin; Gamble, John King; Nielsen, Erik; ...

    2017-02-15

    Quantum information processors promise fast algorithms for problems inaccessible to classical computers. But since qubits are noisy and error-prone, they will depend on fault-tolerant quantum error correction (FTQEC) to compute reliably. Quantum error correction can protect against general noise if—and only if—the error in each physical qubit operation is smaller than a certain threshold. The threshold for general errors is quantified by their diamond norm. Until now, qubits have been assessed primarily by randomized benchmarking, which reports a different error rate that is not sensitive to all errors, and cannot be compared directly to diamond norm thresholds. Finally, we usemore » gate set tomography to completely characterize operations on a trapped-Yb+-ion qubit and demonstrate with greater than 95% confidence that they satisfy a rigorous threshold for FTQEC (diamond norm ≤6.7 × 10-4).« less

  1. High Temperature, Permanent Magnet Biased, Fault Tolerant, Homopolar Magnetic Bearing Development

    NASA Technical Reports Server (NTRS)

    Palazzolo, Alan; Tucker, Randall; Kenny, Andrew; Kang, Kyung-Dae; Ghandi, Varun; Liu, Jinfang; Choi, Heeju; Provenza, Andrew

    2008-01-01

    This paper summarizes the development of a magnetic bearing designed to operate at 1,000 F. A novel feature of this high temperature magnetic bearing is its homopolar construction which incorporates state of the art high temperature, 1,000 F, permanent magnets. A second feature is its fault tolerance capability which provides the desired control forces with over one-half of the coils failed. The construction and design methodology of the bearing is outlined and test results are shown. The agreement between a 3D finite element, magnetic field based prediction for force is shown to be in good agreement with predictions at room and high temperature. A 5 axis test rig will be complete soon to provide a means to test the magnetic bearings at high temperature and speed.

  2. Demonstration of qubit operations below a rigorous fault tolerance threshold with gate set tomography.

    PubMed

    Blume-Kohout, Robin; Gamble, John King; Nielsen, Erik; Rudinger, Kenneth; Mizrahi, Jonathan; Fortier, Kevin; Maunz, Peter

    2017-02-15

    Quantum information processors promise fast algorithms for problems inaccessible to classical computers. But since qubits are noisy and error-prone, they will depend on fault-tolerant quantum error correction (FTQEC) to compute reliably. Quantum error correction can protect against general noise if-and only if-the error in each physical qubit operation is smaller than a certain threshold. The threshold for general errors is quantified by their diamond norm. Until now, qubits have been assessed primarily by randomized benchmarking, which reports a different error rate that is not sensitive to all errors, and cannot be compared directly to diamond norm thresholds. Here we use gate set tomography to completely characterize operations on a trapped-Yb(+)-ion qubit and demonstrate with greater than 95% confidence that they satisfy a rigorous threshold for FTQEC (diamond norm ≤6.7 × 10(-4)).

  3. An extension to Schneider's general paradigm for fault-tolerant clock synchronization

    NASA Technical Reports Server (NTRS)

    Miner, Paul S.

    1992-01-01

    In 1987, Schneider presented a general paradigm that provides a single proof of a number of fault tolerant clock synchronization algorithms. His proof was subsequently subjected to the rigor of mechanical verification by Shankar. However, both Schneider and Shankar assumed a condition Shankar refers to as a bounded delay. This condition states that the elapsed time between synchronization events (i.e., the time that the local process applies an adjustment to its logical clock) is bounded. This property is really a result of the algorithm and should not be assumed in a proof of correctness. This paper remedies this by providing a proof of this property in the context of the general paradigm proposed by Schneider. The argument given is a generalization of Welch and Lynch's proof of a related property for their algorithm.

  4. Condition for fault-tolerant quantum computation with a cavity-QED scheme

    SciTech Connect

    Goto, Hayato; Ichimura, Kouichi

    2010-09-15

    A condition for fault-tolerant quantum computation (FTQC) with cavity schemes is discussed. It is shown that the condition is very hard if the standard error threshold of FTQC is simply applied. To relax the condition, we propose to combine the cavity-quantum-electrodynamics (QED) scheme proposed by Duan et al. [Phys. Rev. A 72, 032333 (2005)] and Xiao et al. [Phys. Rev. A 70, 042314 (2004)] with the recently proposed FTQC scheme with probabilistic two-qubit gates [Goto and Ichimura, Phys. Rev. A 80, 040303(R) (2009)]. It is shown that the condition for FTQC is dramatically relaxed compared to the case of the standard threshold. The optimization of the cavity-QED scheme is also discussed.

  5. A discussion on sensor recovery techniques for fault tolerant multisensor schemes

    NASA Astrophysics Data System (ADS)

    Stoican, Florin; Olaru, Sorin; De Doná, José A.; Seron, María M.

    2014-08-01

    The present paper deals with the interplay between healthy and faulty sensor functioning in a multisensor scheme based on a switching control strategy. Fault tolerance guarantees have been recently obtained in this framework based upon the characterisation of invariant sets for state estimations in healthy and faulty functioning. A source of conservativeness of this approach is related to the issue of sensor recovery. A common working hypothesis has been to assume that once a sensor switches to faulty functioning it can no longer be used by the control mechanism even if at an ulterior moment it switches back to healthy functioning. In the current paper, we present necessary and sufficient conditions for the acknowledgement of sensor recovery and we propose and compare different techniques for the reintegration of sensors in the closed-loop decision-making mechanism.

  6. Application of a Resource Theory for Magic States to Fault-Tolerant Quantum Computing

    NASA Astrophysics Data System (ADS)

    Howard, Mark; Campbell, Earl

    2017-03-01

    Motivated by their necessity for most fault-tolerant quantum computation schemes, we formulate a resource theory for magic states. First, we show that robustness of magic is a well-behaved magic monotone that operationally quantifies the classical simulation overhead for a Gottesman-Knill-type scheme using ancillary magic states. Our framework subsequently finds immediate application in the task of synthesizing non-Clifford gates using magic states. When magic states are interspersed with Clifford gates, Pauli measurements, and stabilizer ancillas—the most general synthesis scenario—then the class of synthesizable unitaries is hard to characterize. Our techniques can place nontrivial lower bounds on the number of magic states required for implementing a given target unitary. Guided by these results, we have found new and optimal examples of such synthesis.

  7. An experimental investigation of fault tolerant software structures in an avionics application

    NASA Technical Reports Server (NTRS)

    Caglayan, Alper K.; Eckhardt, Dave E., Jr.

    1989-01-01

    The objective of this experimental investigation is to compare the functional performance and software reliability of competing fault tolerant software structures utilizing software diversity. In this experiment, three versions of the redundancy management software for a skewed sensor array have been developed using three diverse failure detection and isolation algorithms and incorporated into various N-version, recovery block and hybrid software structures. The empirical results show that, for maximum functional performance improvement in the selected application domain, the results of diverse algorithms should be voted before being processed by multiple versions without enforced diversity. Results also suggest that when the reliability gain with an N-version structure is modest, recovery block structures are more feasible since higher reliability can be obtained using an acceptance check with a modest reliability.

  8. An approach to experimental evaluation of real-time fault-tolerant distributed computing schemes

    NASA Technical Reports Server (NTRS)

    Kim, K. H.

    1989-01-01

    A testbed-based approach to the evaluation of fault-tolerant distributed computing schemes is discussed. The approach is based on experimental incorporation of system structuring and design techniques into real-time distributed-computing testbeds centered around tightly coupled microcomputer networks. The effectiveness of this approach has been experimentally confirmed. Primary advantages of this approach include the accuracy of the timing and logical-complexity data and the degree of assurance of the practical effectiveness of the scheme evaluated. Various design issues encountered in the course of establishing the network testbed facilities are discussed, along with their augmentation to support some experiments. The shortcomings of the testbeds are also discussed together with the desired extensions of the testbeds.

  9. Scalable and fault tolerant orthogonalization based on randomized distributed data aggregation

    PubMed Central

    Gansterer, Wilfried N.; Niederbrucker, Gerhard; Straková, Hana; Schulze Grotthoff, Stefan

    2013-01-01

    The construction of distributed algorithms for matrix computations built on top of distributed data aggregation algorithms with randomized communication schedules is investigated. For this purpose, a new aggregation algorithm for summing or averaging distributed values, the push-flow algorithm, is developed, which achieves superior resilience properties with respect to failures compared to existing aggregation methods. It is illustrated that on a hypercube topology it asymptotically requires the same number of iterations as the optimal all-to-all reduction operation and that it scales well with the number of nodes. Orthogonalization is studied as a prototypical matrix computation task. A new fault tolerant distributed orthogonalization method rdmGS, which can produce accurate results even in the presence of node failures, is built on top of distributed data aggregation algorithms. PMID:24748902

  10. Steps towards fault-tolerant quantum operations in trapped-ion Quantum information experiments

    NASA Astrophysics Data System (ADS)

    Ozeri, R.; Langer, C.; Jost, J. D.; Blakestad, R. B.; Britton, J.; Chiaverini, J.; Hume, D.; Itano, W. M.; Knill, E.; Leibfried, D.; Reichle, R.; Seidelin, S.; Wesenberg, J. H.; Wineland, D. J.

    2006-05-01

    Fault-tolerant Quantum Information Processing (QIP) requires that the error in a quantum gate be smaller than a certain threshold, currently believed to be on the ˜10-4 level. Here we discuss progress toward realizing such low error rates in trapped-ion QIP experiments at NIST. Memory coherence times are extended using a qubit transition which, to first order, is independent of the magnetic field. The fundamental limits to laser driven quantum gates are investigated by studying the effect of spontaneous scattering of photons on hyperfine coherence. It is shown that the error due to the scattering of photons can be, at least in principle, reduced to very low values.

  11. Hybrid magic state distillation for universal fault-tolerant quantum computation

    NASA Astrophysics Data System (ADS)

    Zheng, Wenqiang; Yu, Yafei; Pan, Jian; Zhang, Jingfu; Li, Jun; Li, Zhaokai; Suter, Dieter; Zhou, Xianyi; Peng, Xinhua; Du, Jiangfeng

    2015-02-01

    A set of stabilizer operations augmented by some special initial states known as "magic states" gives the possibility of universal fault-tolerant quantum computation. However, magic state preparation inevitably involves nonideal operations that introduce noise. The most common method to eliminate the noise is magic state distillation (MSD) by stabilizer operations. Here we propose a hybrid MSD protocol by connecting a four-qubit H -type MSD with a five-qubit T -type MSD in order to overcome the shortcomings of the previous MSD protocols. The hybrid MSD protocol further integrates distillable ranges of different existing MSD protocols and provides considerable improvement in qubit cost. Moreover, we experimentally demonstrate the four-qubit H -type MSD protocol using nuclear magnetic resonance technology, together with the previous five-qubit MSD experiment, to show the feasibility of the hybrid MSD protocol.

  12. Reliable and Fault-Tolerant Software-Defined Network Operations Scheme for Remote 3D Printing

    NASA Astrophysics Data System (ADS)

    Kim, Dongkyun; Gil, Joon-Min

    2015-03-01

    The recent wide expansion of applicable three-dimensional (3D) printing and software-defined networking (SDN) technologies has led to a great deal of attention being focused on efficient remote control of manufacturing processes. SDN is a renowned paradigm for network softwarization, which has helped facilitate remote manufacturing in association with high network performance, since SDN is designed to control network paths and traffic flows, guaranteeing improved quality of services by obtaining network requests from end-applications on demand through the separated SDN controller or control plane. However, current SDN approaches are generally focused on the controls and automation of the networks, which indicates that there is a lack of management plane development designed for a reliable and fault-tolerant SDN environment. Therefore, in addition to the inherent advantage of SDN, this paper proposes a new software-defined network operations center (SD-NOC) architecture to strengthen the reliability and fault-tolerance of SDN in terms of network operations and management in particular. The cooperation and orchestration between SDN and SD-NOC are also introduced for the SDN failover processes based on four principal SDN breakdown scenarios derived from the failures of the controller, SDN nodes, and connected links. The abovementioned SDN troubles significantly reduce the network reachability to remote devices (e.g., 3D printers, super high-definition cameras, etc.) and the reliability of relevant control processes. Our performance consideration and analysis results show that the proposed scheme can shrink operations and management overheads of SDN, which leads to the enhancement of responsiveness and reliability of SDN for remote 3D printing and control processes.

  13. Superconducting quantum circuits at the surface code threshold for fault tolerance.

    PubMed

    Barends, R; Kelly, J; Megrant, A; Veitia, A; Sank, D; Jeffrey, E; White, T C; Mutus, J; Fowler, A G; Campbell, B; Chen, Y; Chen, Z; Chiaro, B; Dunsworth, A; Neill, C; O'Malley, P; Roushan, P; Vainsencher, A; Wenner, J; Korotkov, A N; Cleland, A N; Martinis, John M

    2014-04-24

    A quantum computer can solve hard problems, such as prime factoring, database searching and quantum simulation, at the cost of needing to protect fragile quantum states from error. Quantum error correction provides this protection by distributing a logical state among many physical quantum bits (qubits) by means of quantum entanglement. Superconductivity is a useful phenomenon in this regard, because it allows the construction of large quantum circuits and is compatible with microfabrication. For superconducting qubits, the surface code approach to quantum computing is a natural choice for error correction, because it uses only nearest-neighbour coupling and rapidly cycled entangling gates. The gate fidelity requirements are modest: the per-step fidelity threshold is only about 99 per cent. Here we demonstrate a universal set of logic gates in a superconducting multi-qubit processor, achieving an average single-qubit gate fidelity of 99.92 per cent and a two-qubit gate fidelity of up to 99.4 per cent. This places Josephson quantum computing at the fault-tolerance threshold for surface code error correction. Our quantum processor is a first step towards the surface code, using five qubits arranged in a linear array with nearest-neighbour coupling. As a further demonstration, we construct a five-qubit Greenberger-Horne-Zeilinger state using the complete circuit and full set of gates. The results demonstrate that Josephson quantum computing is a high-fidelity technology, with a clear path to scaling up to large-scale, fault-tolerant quantum circuits.

  14. Design of fault tolerant control system for individual blade control helicopters

    NASA Astrophysics Data System (ADS)

    Tamayo, Sergio

    This dissertation presents the development of a fault tolerant control scheme for helicopters fitted with individually controlled blades. This novel approach attempts to improve fault tolerant capabilities of helicopter control system by increasing control redundancy using additional actuators for individual blade input and software re-mixing to obtain nominal or close to nominal conditions under failure. An advanced interactive simulation environment has been developed including modeling of sensor failure, swashplate actuator failure, individual blade actuator failure, and blade delamination to support the design, testing, and evaluation of the control laws. This simulation environment is based on the blade element theory for the calculation of forces and moments generated by the main rotor. This discretized model allows for individual blade analysis, which in turn allows measuring the consequences of a stuck blade, or loss of the surface area of the blade itself, with respect to the dynamics of the whole helicopter. The control laws are based on non-linear dynamic inversion and artificial neural network augmentation, which is a mix of linear and nonlinear methods that compensates for model inaccuracies due to linearization or failure. A stability analysis based on the Lyapunov function approach has shown that bounded tracking error is guaranteed, and under specific circumstances, global stability is guaranteed as well. An analysis over the degrees of freedom of the mechanical system and its impact over the helicopter handling qualities is also performed to measure the degree of redundancy achieved with the addition of individual blade actuators as compared to a classic swashplate helicopter configuration. Mathematical analysis and numerical simulation, using reconfiguration of the individual blade control under failure have shown that this control architecture can potentially improve the survivability of the aircraft and reduce pilot workload under failure

  15. Physiological hemostasis based intelligent integrated cooperative controller for precise fault-tolerant control of redundant parallel manipulator

    NASA Astrophysics Data System (ADS)

    Hao, Kuangrong; Guo, Chongbin; Ding, Yongsheng

    2014-10-01

    This paper focuses on precise fault-tolerant control for actual redundant parallel manipulator. Based on kinematic redundancy, some unnoticed influences such as mechanical clearance have been considered to design a more precise and intelligent fault-tolerant plan for actual plants. According to regulation principles in human hemostasis system, a bio-inspired intelligent integrated cooperative controller (BIICC) is developed including system structure, algorithm and step in parameter tuning. The proposed BIICC optimises partial error signal and improves control performance in each sub-channel. Moreover, the new controller transfers and disposes cooperative control signals among different sub-channels to achieve an intelligent integrated fault-tolerant system. The proposed BIICC is applied to an actual 2-DOF (degrees of freedom) redundant parallel manipulator where the feasibility of the new controller is demonstrated. The BIICC is beneficial to control precision and fault-tolerant capability of redundant plant. The improvements are more obvious in cases where extra actuators of redundant manipulator are broken.

  16. Adaptive backstepping fault-tolerant control for flexible spacecraft with unknown bounded disturbances and actuator failures.

    PubMed

    Jiang, Ye; Hu, Qinglei; Ma, Guangfu

    2010-01-01

    In this paper, a robust adaptive fault-tolerant control approach to attitude tracking of flexible spacecraft is proposed for use in situations when there are reaction wheel/actuator failures, persistent bounded disturbances and unknown inertia parameter uncertainties. The controller is designed based on an adaptive backstepping sliding mode control scheme, and a sufficient condition under which this control law can render the system semi-globally input-to-state stable is also provided such that the closed-loop system is robust with respect to any disturbance within a quantifiable restriction on the amplitude, as well as the set of initial conditions, if the control gains are designed appropriately. Moreover, in the design, the control law does not need a fault detection and isolation mechanism even if the failure time instants, patterns and values on actuator failures are also unknown for the designers, as motivated from a practical spacecraft control application. In addition to detailed derivations of the new controller design and a rigorous sketch of all the associated stability and attitude error convergence proofs, illustrative simulation results of an application to flexible spacecraft show that high precise attitude control and vibration suppression are successfully achieved using various scenarios of controlling effective failures.

  17. Validation of a fault-tolerant multiprocessor: Baseline experiments and workload implementation

    NASA Technical Reports Server (NTRS)

    Feather, Frank; Siewiorek, Daniel; Segall, Zary

    1985-01-01

    In the future, aircraft must employ highly reliable multiprocessors in order to achieve flight safety. Such computers must be experimentally validated before they are deployed. This project outlines a methodology for validating reliable multiprocessors. The methodology begins with baseline experiments, which tests a single phenomenon. As experiments progress, tools for performance testing are developed. The methodology is used, in part, on the Fault Tolerant Multiprocessor (FTMP) at NASA-Langley's AIRLAB facility. Experiments are designed to evaluate the fault-free performance of the system. Presented are the results of interrupt baseline experiments performed on FTMP. Interrupt causing exception conditions were tested, and several were found to have unimplemented interrupt handling software while one had an unimplemented interrupt vector. A synthetic workload model for realtime multiprocessors is then developed as an application level performance analysis tool. Details of the workload implementation and calibration are presented. Both the experimental methodology and the synthetic workload model are general enough to be applicable to reliable multiprocessors beside FTMP.

  18. An Autonomous Self-Aware and Adaptive Fault Tolerant Routing Technique for Wireless Sensor Networks

    PubMed Central

    Abba, Sani; Lee, Jeong-A

    2015-01-01

    We propose an autonomous self-aware and adaptive fault-tolerant routing technique (ASAART) for wireless sensor networks. We address the limitations of self-healing routing (SHR) and self-selective routing (SSR) techniques for routing sensor data. We also examine the integration of autonomic self-aware and adaptive fault detection and resiliency techniques for route formation and route repair to provide resilience to errors and failures. We achieved this by using a combined continuous and slotted prioritized transmission back-off delay to obtain local and global network state information, as well as multiple random functions for attaining faster routing convergence and reliable route repair despite transient and permanent node failure rates and efficient adaptation to instantaneous network topology changes. The results of simulations based on a comparison of the ASAART with the SHR and SSR protocols for five different simulated scenarios in the presence of transient and permanent node failure rates exhibit a greater resiliency to errors and failure and better routing performance in terms of the number of successfully delivered network packets, end-to-end delay, delivered MAC layer packets, packet error rate, as well as efficient energy conservation in a highly congested, faulty, and scalable sensor network. PMID:26295236

  19. A New On-Line Diagnosis Protocol for the SPIDER Family of Byzantine Fault Tolerant Architectures

    NASA Technical Reports Server (NTRS)

    Geser, Alfons; Miner, Paul S.

    2004-01-01

    This paper presents the formal verification of a new protocol for online distributed diagnosis for the SPIDER family of architectures. An instance of the Scalable Processor-Independent Design for Electromagnetic Resilience (SPIDER) architecture consists of a collection of processing elements communicating over a Reliable Optical Bus (ROBUS). The ROBUS is a specialized fault-tolerant device that guarantees Interactive Consistency, Distributed Diagnosis (Group Membership), and Synchronization in the presence of a bounded number of physical faults. Formal verification of the original SPIDER diagnosis protocol provided a detailed understanding that led to the discovery of a significantly more efficient protocol. The original protocol was adapted from the formally verified protocol used in the MAFT architecture. It required O(N) message exchanges per defendant to correctly diagnose failures in a system with N nodes. The new protocol achieves the same diagnostic fidelity, but only requires O(1) exchanges per defendant. This paper presents this new diagnosis protocol and a formal proof of its correctness using PVS.

  20. Dual-quaternion based fault-tolerant control for spacecraft formation flying with finite-time convergence.

    PubMed

    Dong, Hongyang; Hu, Qinglei; Ma, Guangfu

    2016-03-01

    Study results of developing control system for spacecraft formation proximity operations between a target and a chaser are presented. In particular, a coupled model using dual quaternion is employed to describe the proximity problem of spacecraft formation, and a nonlinear adaptive fault-tolerant feedback control law is developed to enable the chaser spacecraft to track the position and attitude of the target even though its actuator occurs fault. Multiple-task capability of the proposed control system is further demonstrated in the presence of disturbances and parametric uncertainties as well. In addition, the practical finite-time stability feature of the closed-loop system is guaranteed theoretically under the designed control law. Numerical simulation of the proposed method is presented to demonstrate the advantages with respect to interference suppression, fast tracking, fault tolerant and practical finite-time stability.