Parallel and distributed computation for fault-tolerant object recognition
NASA Technical Reports Server (NTRS)
Wechsler, Harry
1988-01-01
The distributed associative memory (DAM) model is suggested for distributed and fault-tolerant computation as it relates to object recognition tasks. The fault-tolerance is with respect to geometrical distortions (scale and rotation), noisy inputs, occlusion/overlap, and memory faults. An experimental system was developed for fault-tolerant structure recognition which shows the feasibility of such an approach. The approach is further extended to the problem of multisensory data integration and applied successfully to the recognition of colored polyhedral objects.
Fault Tolerant Software Technology for Distributed Computer Systems
1989-03-01
RADC-TR-88-296, Final Technical Report, 1989: "Fault Tolerant Software Technology for Distributed Computer Systems," a two-year effort performed at the Georgia Institute of Technology as part of the Clouds Project.
Fault tolerant features and experiments of ANTS distributed real-time system
NASA Astrophysics Data System (ADS)
Dominic-Savio, Patrick; Lo, Jien-Chung; Tufts, Donald W.
1995-01-01
The ANTS project at the University of Rhode Island introduces the concept of Active Nodal Task Seeking (ANTS) as a way to efficiently design and implement dependable, high-performance, distributed computing. This paper presents the fault tolerant design features that have been incorporated in the ANTS experimental system implementation. The results of performance evaluations and fault injection experiments are reported. The fault-tolerant version of ANTS categorizes all computing nodes into three groups. They are: the up-and-running green group, the self-diagnosing yellow group and the failed red group. Each available computing node will be placed in the yellow group periodically for a routine diagnosis. In addition, for long-life missions, ANTS uses a monitoring scheme to identify faulty computing nodes. In this monitoring scheme, the communication pattern of each computing node is monitored by two other nodes.
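The grouping scheme above can be made concrete with a small sketch. The following Python fragment is a hypothetical illustration (the names Node, rotate_diagnosis, and monitor_assignment are invented here, not taken from the ANTS implementation) of how nodes might rotate through the yellow group for routine diagnosis while each node's communication is watched by two peers:

```python
GREEN, YELLOW, RED = "green", "yellow", "red"

class Node:
    def __init__(self, name):
        self.name = name
        self.group = GREEN              # up-and-running by default

    def self_diagnose(self):
        """Placeholder for a routine self-test; returns True if healthy."""
        return True

def rotate_diagnosis(nodes, cycle):
    """Periodically place one node in the yellow group for diagnosis,
    then return it to green (healthy) or demote it to red (failed)."""
    candidate = nodes[cycle % len(nodes)]
    candidate.group = YELLOW
    candidate.group = GREEN if candidate.self_diagnose() else RED

def monitor_assignment(nodes):
    """For long-life missions: each node's communication pattern is
    monitored by two other nodes (here, its two ring successors)."""
    n = len(nodes)
    return {nodes[i].name: (nodes[(i + 1) % n].name, nodes[(i + 2) % n].name)
            for i in range(n)}

nodes = [Node(f"n{i}") for i in range(5)]
for cycle in range(10):                 # every node gets diagnosed twice
    rotate_diagnosis(nodes, cycle)
print(monitor_assignment(nodes))
```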
Survivable algorithms and redundancy management in NASA's distributed computing systems
NASA Technical Reports Server (NTRS)
Malek, Miroslaw
1992-01-01
The design of survivable algorithms requires a solid foundation for executing them. While hardware techniques for fault-tolerant computing are relatively well understood, fault-tolerant operating systems, as well as fault-tolerant applications (survivable algorithms), are, by contrast, little understood, and much more work in this field is required. We outline some of our work that contributes to the foundation of ultrareliable operating systems and fault-tolerant algorithm design. We introduce our consensus-based framework for fault-tolerant system design. This is followed by a description of a hierarchical partitioning method for efficient consensus. A scheduler for redundancy management is introduced, and application-specific fault tolerance is described. We give an overview of our hybrid algorithm technique, which is an alternative to the formal approach given.
Redundancy management for efficient fault recovery in NASA's distributed computing system
NASA Technical Reports Server (NTRS)
Malek, Miroslaw; Pandya, Mihir; Yau, Kitty
1991-01-01
The management of redundancy in computer systems was studied and guidelines were provided for the development of NASA's fault-tolerant distributed systems. Fault recovery and reconfiguration mechanisms were examined. A theoretical foundation was laid for redundancy management by efficient reconfiguration methods and algorithmic diversity. Algorithms were developed to optimize the resources for embedding computational graphs of tasks in the system architecture and for reconfiguring these tasks after a failure has occurred. Computational structures represented by paths and complete binary trees were considered, with mesh and hypercube architectures targeted for their embeddings. The innovative concept of the Hybrid Algorithm Technique was introduced. This new technique provides a mechanism for obtaining fault tolerance while exhibiting improved performance.
Fault-tolerant computer study. [logic designs for building block circuits
NASA Technical Reports Server (NTRS)
Rennels, D. A.; Avizienis, A. A.; Ercegovac, M. D.
1981-01-01
A set of building block circuits is described which can be used with commercially available microprocessors and memories to implement fault-tolerant distributed computer systems. Each building block circuit is intended for VLSI implementation as a single chip. Several building blocks and associated processor and memory chips form a self-checking computer module with self-contained input/output and interfaces to redundant communications buses. Fault tolerance is achieved by connecting self-checking computer modules into a redundant network in which backup buses and computer modules are provided to circumvent failures. The requirements and design methodology which led to the definition of the building block circuits are discussed.
Chen, Gang; Song, Yongduan; Lewis, Frank L
2016-05-03
This paper investigates the distributed fault-tolerant control problem of networked Euler-Lagrange systems with actuator and communication link faults. An adaptive fault-tolerant cooperative control scheme is proposed to achieve the coordinated tracking control of networked uncertain Lagrange systems on a general directed communication topology, which contains a spanning tree with the root node being the active target system. The proposed algorithm is capable of simultaneously compensating for the actuator bias fault, the partial loss-of-effectiveness actuation fault, the communication link fault, the model uncertainty, and the external disturbance. The control scheme does not use any fault detection and isolation mechanism to detect, separate, and identify the actuator faults online, which largely reduces the online computation and expedites the responsiveness of the controller. To validate the effectiveness of the proposed method, a test bed of a multiple-robot-arm cooperative control system is developed for real-time verification. Experiments on the networked robot arms are conducted and the results confirm the benefits and the effectiveness of the proposed distributed fault-tolerant control algorithms.
A General theory of Signal Integration for Fault-Tolerant Dynamic Distributed Sensor Networks
1993-10-01
Topics addressed include: (a) the architecture and fault tolerance of the distributed sensor network; (b) the proper synchronization of sensor signals; (c) the computational complexity of the distributed detection problem; and (d) issues related to the recording of events and synchronization in real-time distributed sensor systems.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sunderam, Vaidy S.
2007-01-09
The Harness project has developed novel software frameworks for the execution of high-end simulations in a fault-tolerant manner on distributed resources. The H2O subsystem comprises the kernel of the Harness framework, and controls the key functions of resource management across multiple administrative domains, especially issues of access and allocation. It is based on a “pluggable” architecture that enables the aggregated use of distributed heterogeneous resources for high performance computing. The major contributions of the Harness II project significantly enhance the overall computational productivity of high-end scientific applications by enabling robust, failure-resilient computations on cooperatively pooled resource collections.
Guest Editor's Introduction: Special section on dependable distributed systems
NASA Astrophysics Data System (ADS)
Fetzer, Christof
1999-09-01
We rely more and more on computers. For example, the Internet reshapes the way we do business. A 'computer outage' can cost a company a substantial amount of money, not only in business lost during the outage but also in the negative publicity the company receives. This is especially true for Internet companies. After recent computer outages of Internet companies, we have seen a drastic fall in the shares of the affected companies. There are multiple causes for computer outages. Although computer hardware is becoming more reliable, hardware-related outages remain an important issue. For example, some of the recent computer outages of companies were caused by failed memory and system boards, and even by crashed disks - a failure type which can easily be masked using disk mirroring. Transient hardware failures might also look like software failures and, hence, might be incorrectly classified as such. However, many outages are software related. Faulty system software, middleware, and application software can crash a system. Dependable computing systems are systems we can rely on. Dependable systems are, by definition, reliable, available, safe and secure [3]. This special section focuses on issues related to dependable distributed systems. Distributed systems have the potential to be more dependable than a single computer because the probability that all computers in a distributed system fail is smaller than the probability that a single computer fails. However, if a distributed system is not built well, it is potentially less dependable than a single computer since the probability that at least one computer in a distributed system fails is higher than the probability that one computer fails. For example, if the crash of any computer in a distributed system can bring the complete system to a halt, the system is less dependable than a single-computer system. Building dependable distributed systems is an extremely difficult task. There is no silver bullet solution. Instead, one has to apply a variety of engineering techniques [2]: fault-avoidance (minimize the occurrence of faults, e.g. by using a proper design process), fault-removal (remove faults before they occur, e.g. by testing), fault-evasion (predict faults by monitoring and reconfigure the system before failures occur), and fault-tolerance (mask and/or contain failures). Building a system from scratch is an expensive and time-consuming effort. To reduce the cost of building dependable distributed systems, one would choose to use commercial off-the-shelf (COTS) components whenever possible. The use of COTS components has several potential advantages beyond minimizing costs. For example, through the widespread use of a COTS component, design failures might be detected and fixed before the component is used in a dependable system. Custom-designed components have to mature without the widespread in-field testing of COTS components. COTS components have various potential disadvantages when used in dependable systems. For example, minimizing the time to market might lead to the release of components with inherent design faults (e.g. use of 'shortcuts' that only work most of the time). In addition, the components might be more complex than needed and, hence, potentially have more design faults than simpler components.
However, given economic constraints and the ability to cope with some of the problems using fault-evasion and fault-tolerance, only for a small percentage of systems can one justify not using COTS components. Distributed systems built from current COTS components are asynchronous systems in the sense that there exists no a priori known bound on the transmission delay of messages or the execution time of processes. When designing a distributed algorithm, one would like to make sure (e.g. by testing or verification) that it is correct, i.e. satisfies its specification. Many distributed algorithms make use of consensus (eventually all non-crashed processes have to agree on a value), leader election (a crashed leader is eventually replaced by a new leader, but at any time there is at most one leader) or a group membership detection service (a crashed process is eventually suspected to have crashed but only crashed processes are suspected). From a theoretical point of view, the service specifications given for such services are not implementable in asynchronous systems. In particular, for each implementation one can derive a counterexample in which the service violates its specification. From a practical point of view, the consensus, the leader election, and the membership detection problem are solvable in asynchronous distributed systems. In this special section, Raynal and Tronel bridge this gap by showing how to implement the group membership detection problem with a negligible probability of failure [1] in an asynchronous system. The group membership detection problem is specified by a liveness condition (L) and a safety property (S): (L) if a process p crashes, then eventually every non-crashed process q has to suspect that p has crashed; and (S) if a process q suspects p, then p has indeed crashed. One can show that either (L) or (S) is implementable, but one cannot implement both (L) and (S) at the same time in an asynchronous system. In practice, one only needs to implement (L) and (S) such that the probability that (L) or (S) is violated becomes negligible. Raynal and Tronel propose and analyse a protocol that implements (L) with certainty and that can be tuned such that the probability that (S) is violated becomes negligible. Designing and implementing distributed fault-tolerant protocols for asynchronous systems is a difficult but not an impossible task. A fault-tolerant protocol has to detect and mask certain failure classes, e.g. crash failures and message omission failures. There is a trade-off between the performance of a fault-tolerant protocol and the failure classes the protocol can tolerate. One wants to tolerate as many failure classes as needed to satisfy the stochastic requirements of the protocol [1] while still maintaining a sufficient performance. Since clients of a protocol have different requirements with respect to the performance/fault-tolerance trade-off, one would like to be able to customize protocols such that one can select an appropriate performance/fault-tolerance trade-off. In this special section, Hiltunen et al describe how one can compose protocols from micro-protocols in their Cactus system. They show how a group RPC system can be tailored to the needs of a client. In particular, they show how considering additional failure classes affects the performance of a group RPC system.
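The liveness/safety trade-off described above can be illustrated with an ordinary heartbeat-based failure detector: liveness (L) holds because a crashed process stops sending heartbeats and is eventually suspected, while the probability of violating safety (S) shrinks as the timeout grows. The Python sketch below is a generic illustration with assumed parameters, not the protocol of Raynal and Tronel:

```python
import time

class HeartbeatDetector:
    """Suspects a process once no heartbeat arrives within `timeout` seconds.

    Liveness (L): a crashed process stops sending, so it is eventually
    suspected. Safety (S): a live but slow process may be falsely
    suspected; raising `timeout` shrinks that probability toward a
    negligible level, at the cost of slower detection.
    """

    def __init__(self, processes, timeout=5.0):
        self.timeout = timeout
        self.last_seen = {p: time.monotonic() for p in processes}

    def heartbeat(self, p):
        """Record a heartbeat received from process p."""
        self.last_seen[p] = time.monotonic()

    def suspects(self):
        """Return the set of processes whose heartbeats are overdue."""
        now = time.monotonic()
        return {p for p, t in self.last_seen.items()
                if now - t > self.timeout}

d = HeartbeatDetector(["p1", "p2"], timeout=2.0)
d.heartbeat("p1")
time.sleep(0.1)
print(d.suspects())   # empty unless some heartbeat is overdue
```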
References
[1] Cristian F 1991 Understanding fault-tolerant distributed systems Communications of the ACM 34 (2) 56-78
[2] Heimerdinger W L and Weinstock C B 1992 A conceptual framework for system fault tolerance Technical Report CMU/SEI-92-TR-33
[3] Laprie J C (ed) 1992 Dependability: Basic Concepts and Terminology (Vienna: Springer)
Advanced information processing system
NASA Technical Reports Server (NTRS)
Lala, J. H.
1984-01-01
Design and performance details of the advanced information processing system (AIPS) for fault and damage tolerant data processing on aircraft and spacecraft are presented. AIPS comprises several computers distributed throughout the vehicle and linked by a damage tolerant data bus. Most I/O functions are available to all the computers, which run in a TDMA mode. Each computer performs separate specific tasks in normal operation and assumes other tasks in degraded modes. Redundant software assures that all fault monitoring, logging and reporting are automated, together with control functions. Redundant duplex links and damage-spread limitation provide the fault tolerance. Details of an advanced design of a laboratory-scale proof-of-concept system are described, including functional operations.
What does fault tolerant Deep Learning need from MPI?
DOE Office of Scientific and Technical Information (OSTI.GOV)
Amatya, Vinay C.; Vishnu, Abhinav; Siegel, Charles M.
Deep Learning (DL) algorithms have become the de facto Machine Learning (ML) algorithm for large scale data analysis. DL algorithms are computationally expensive -- even distributed DL implementations which use MPI require days of training (model learning) time on commonly studied datasets. Long running DL applications become susceptible to faults -- requiring development of a fault tolerant system infrastructure, in addition to fault tolerant DL algorithms. This raises an important question: What is needed from MPI for designing fault tolerant DL implementations? In this paper, we address this problem for permanent faults. We motivate the need for a fault tolerant MPI specification by an in-depth consideration of recent innovations in DL algorithms and their properties, which drive the need for specific fault tolerance features. We present an in-depth discussion on the suitability of different parallelism types (model, data and hybrid); a need (or lack thereof) for check-pointing of any critical data structures; and most importantly, consideration for several fault tolerance proposals (user-level fault mitigation (ULFM), Reinit) in MPI and their applicability to fault tolerant DL implementations. We leverage a distributed memory implementation of Caffe, currently available under the Machine Learning Toolkit for Extreme Scale (MaTEx). We implement our approaches by extending MaTEx-Caffe for using ULFM-based implementation. Our evaluation using the ImageNet dataset and AlexNet neural network topology demonstrates the effectiveness of the proposed fault tolerant DL implementation using OpenMPI based ULFM.
NASA Technical Reports Server (NTRS)
Harper, Richard
1989-01-01
In a fault-tolerant parallel computer, a functional programming model can facilitate distributed checkpointing, error recovery, load balancing, and graceful degradation. Such a model has been implemented on the Draper Fault-Tolerant Parallel Processor (FTPP). When used in conjunction with the FTPP's fault detection and masking capabilities, this implementation results in a graceful degradation of system performance after faults. Three graceful degradation algorithms have been implemented and are presented. A user interface has been implemented which requires minimal cognitive overhead by the application programmer, masking such complexities as the system's redundancy, distributed nature, variable complement of processing resources, load balancing, fault occurrence and recovery. This user interface is described and its use demonstrated. The applicability of the functional programming style to the Activation Framework, a paradigm for intelligent systems, is then briefly described.
NASA Technical Reports Server (NTRS)
Yates, Amy M.; Torres-Pomales, Wilfredo; Malekpour, Mahyar R.; Gonzalez, Oscar R.; Gray, W. Steven
2010-01-01
Safety-critical distributed flight control systems require robustness in the presence of faults. In general, these systems consist of a number of input/output (I/O) and computation nodes interacting through a fault-tolerant data communication system. The communication system transfers sensor data and control commands and can handle most faults under typical operating conditions. However, the performance of the closed-loop system can be adversely affected as a result of operating in harsh environments. In particular, High-Intensity Radiated Field (HIRF) environments have the potential to cause random fault manifestations in individual avionic components and to generate simultaneous system-wide communication faults that overwhelm existing fault management mechanisms. This paper presents the design of an experiment conducted at the NASA Langley Research Center's HIRF Laboratory to statistically characterize the faults that a HIRF environment can trigger on a single node of a distributed flight control system.
Software fault tolerance in computer operating systems
NASA Technical Reports Server (NTRS)
Iyer, Ravishankar K.; Lee, Inhwan
1994-01-01
This chapter provides data and analysis of the dependability and fault tolerance for three operating systems: the Tandem/GUARDIAN fault-tolerant system, the VAX/VMS distributed system, and the IBM/MVS system. Based on measurements from these systems, basic software error characteristics are investigated. Fault tolerance in operating systems resulting from the use of process pairs and recovery routines is evaluated. Two levels of models are developed to analyze error and recovery processes inside an operating system and interactions among multiple instances of an operating system running in a distributed environment. The measurements show that the use of process pairs in Tandem systems, which was originally intended for tolerating hardware faults, allows the system to tolerate about 70% of defects in system software that result in processor failures. The loose coupling between processors, which results in the backup execution (the processor state and the sequence of events occurring) differing from the original execution, is a major reason for the measured software fault tolerance. The IBM/MVS system fault tolerance almost doubles when recovery routines are provided, in comparison to the case in which no recovery routines are available. However, even when recovery routines are provided, there is almost a 50% chance of system failure when critical system jobs are involved.
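The role of loose coupling in the measured fault tolerance can be sketched as follows. The Python fragment below is a schematic primary/backup skeleton (not GUARDIAN code; the timing-dependent defect is modeled by a random draw): because the backup resumes from checkpointed state in a different execution context, a transient software defect that crashed the primary often does not recur.

```python
import random

def risky_step(state, context):
    """A step with a timing-dependent defect: it crashes only for some
    execution contexts, modeled here by a small random value."""
    if context < 0.1:
        raise RuntimeError("latent software defect triggered")
    return state + 1

def run_with_process_pair(steps, seed=1):
    rng = random.Random(seed)
    state = 0
    for _ in range(steps):
        checkpoint = state                           # primary checkpoints state
        try:
            state = risky_step(state, rng.random())  # primary executes
        except RuntimeError:
            # The backup resumes from the checkpoint in a different
            # execution context (a fresh draw), so the same transient
            # defect usually does not recur; retry until it succeeds.
            while True:
                try:
                    state = risky_step(checkpoint, rng.random())
                    break
                except RuntimeError:
                    pass
    return state

print(run_with_process_pair(20))   # reaches 20 despite primary crashes
```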
A distributed programming environment for Ada
NASA Technical Reports Server (NTRS)
Brennan, Peter; Mcdonnell, Tom; Mcfarland, Gregory; Timmins, Lawrence J.; Litke, John D.
1986-01-01
Despite considerable commercial exploitation of fault tolerance systems, significant and difficult research problems remain in such areas as fault detection and correction. A research project is described which constructs a distributed computing test bed for loosely coupled computers. The project is constructing a tool kit to support research into distributed control algorithms, including a distributed Ada compiler, distributed debugger, test harnesses, and environment monitors. The Ada compiler is being written in Ada and will implement distributed computing at the subsystem level. The design goal is to provide a variety of control mechanics for distributed programming while retaining total transparency at the code level.
NASA Technical Reports Server (NTRS)
Brunelle, J. E.; Eckhardt, D. E., Jr.
1985-01-01
Results are presented of an experiment conducted in the NASA Avionics Integrated Research Laboratory (AIRLAB) to investigate the implementation of fault-tolerant software techniques on fault-tolerant computer architectures, in particular the Software Implemented Fault Tolerance (SIFT) computer. The N-version programming and recovery block techniques were implemented on a portion of the SIFT operating system. The results indicate that, to effectively implement fault-tolerant software design techniques, system requirements will be impacted and suggest that retrofitting fault-tolerant software on existing designs will be inefficient and may require system modification.
Scalable and fault tolerant orthogonalization based on randomized distributed data aggregation
Gansterer, Wilfried N.; Niederbrucker, Gerhard; Straková, Hana; Schulze Grotthoff, Stefan
2013-01-01
The construction of distributed algorithms for matrix computations built on top of distributed data aggregation algorithms with randomized communication schedules is investigated. For this purpose, a new aggregation algorithm for summing or averaging distributed values, the push-flow algorithm, is developed, which achieves superior resilience properties with respect to failures compared to existing aggregation methods. It is illustrated that on a hypercube topology it asymptotically requires the same number of iterations as the optimal all-to-all reduction operation and that it scales well with the number of nodes. Orthogonalization is studied as a prototypical matrix computation task. A new fault tolerant distributed orthogonalization method rdmGS, which can produce accurate results even in the presence of node failures, is built on top of distributed data aggregation algorithms. PMID:24748902
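The push-flow algorithm itself is developed in the paper; as a simpler stand-in, the sketch below implements the closely related push-sum gossip scheme (a standard randomized aggregation method, assumed here to run on a fully connected topology), which conveys how distributed averaging with randomized communication schedules behaves:

```python
import random

def push_sum(values, rounds=50, seed=0):
    """Randomized gossip averaging: each node halves its (sum, weight)
    pair and pushes one half to a random peer; the ratio sum/weight at
    every node converges to the global average."""
    rng = random.Random(seed)
    n = len(values)
    s = list(values)          # running sums
    w = [1.0] * n             # running weights
    for _ in range(rounds):
        inbox = [(0.0, 0.0)] * n
        for i in range(n):
            target = rng.randrange(n)
            half_s, half_w = s[i] / 2, w[i] / 2
            s[i], w[i] = half_s, half_w            # keep one half locally
            inbox[target] = (inbox[target][0] + half_s,
                             inbox[target][1] + half_w)
        for i in range(n):                          # deliver pushed halves
            s[i] += inbox[i][0]
            w[i] += inbox[i][1]
    return [si / wi for si, wi in zip(s, w)]

print(push_sum([1.0, 2.0, 3.0, 4.0]))   # all entries approach 2.5
```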
Responsive systems - The challenge for the nineties
NASA Technical Reports Server (NTRS)
Malek, Miroslaw
1990-01-01
A concept of responsive computer systems will be introduced. The emerging responsive systems demand fault-tolerant and real-time performance in parallel and distributed computing environments. The design methodologies for fault-tolerant, real time and responsive systems will be presented. Novel techniques of introducing redundancy for improved performance and dependability will be illustrated. The methods of system responsiveness evaluation will be proposed. The issues of determinism, closed and open systems will also be discussed from the perspective of responsive systems design.
Computer Sciences and Data Systems, volume 1
NASA Technical Reports Server (NTRS)
1987-01-01
Topics addressed include: software engineering; university grants; institutes; concurrent processing; sparse distributed memory; distributed operating systems; intelligent data management processes; expert system for image analysis; fault tolerant software; and architecture research.
Towards scalable Byzantine fault-tolerant replication
NASA Astrophysics Data System (ADS)
Zbierski, Maciej
2017-08-01
Byzantine fault-tolerant (BFT) replication is a powerful technique, enabling distributed systems to remain available and correct even in the presence of arbitrary faults. Unfortunately, existing BFT replication protocols are mostly load-unscalable, i.e. they fail to respond with adequate performance increase whenever new computational resources are introduced into the system. This article proposes a universal architecture facilitating the creation of load-scalable distributed services based on BFT replication. The suggested approach exploits parallel request processing to fully utilize the available resources, and uses a load balancer module to dynamically adapt to the properties of the observed client workload. The article additionally provides a discussion on selected deployment scenarios, and explains how the proposed architecture could be used to increase the dependability of contemporary large-scale distributed systems.
A Fault-tolerant RISC Microprocessor for Spacecraft Applications
NASA Technical Reports Server (NTRS)
Timoc, Constantin; Benz, Harry
1990-01-01
Viewgraphs on a fault-tolerant RISC microprocessor for spacecraft applications are presented. Topics covered include: reduced instruction set computer; fault tolerant registers; fault tolerant ALU; and double rail CMOS logic.
A Parameter Communication Optimization Strategy for Distributed Machine Learning in Sensors.
Zhang, Jilin; Tu, Hangdi; Ren, Yongjian; Wan, Jian; Zhou, Li; Li, Mingwei; Wang, Jue; Yu, Lifeng; Zhao, Chang; Zhang, Lei
2017-09-21
In order to utilize the distributed characteristic of sensors, distributed machine learning has become the mainstream approach, but the differing computing capabilities of sensors and network delays greatly influence the accuracy and the convergence rate of the machine learning model. Our paper describes a reasonable parameter communication optimization strategy to balance the training overhead and the communication overhead. We extend the fault tolerance of iterative-convergent machine learning algorithms and propose the Dynamic Finite Fault Tolerance (DFFT). Based on the DFFT, we implement a parameter communication optimization strategy for distributed machine learning, named Dynamic Synchronous Parallel Strategy (DSP), which uses a performance monitoring model to dynamically adjust the parameter synchronization strategy between worker nodes and the Parameter Server (PS). This strategy makes full use of the computing power of each sensor, ensures the accuracy of the machine learning model, and avoids situations in which model training is disturbed by tasks unrelated to the sensors.
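A rough sketch of the kind of bounded-staleness rule a strategy like DSP might adjust dynamically is given below; the function names and the single-threshold adaptation are hypothetical simplifications, not the authors' implementation:

```python
def may_proceed(worker_clock, all_clocks, staleness_bound):
    """Stale-synchronous rule: a worker may start iteration t only if the
    slowest worker has finished at least t - staleness_bound iterations."""
    return worker_clock - min(all_clocks) <= staleness_bound

def adjust_bound(round_times, base_bound=1, slow_factor=2.0):
    """Crude stand-in for a performance monitor: widen the staleness
    bound when measured round times diverge, tighten when they agree."""
    spread = max(round_times) / max(min(round_times), 1e-9)
    return base_bound + (1 if spread > slow_factor else 0)

clocks = [4, 5, 3]                       # iterations completed by three workers
bound = adjust_bound([1.0, 1.1, 2.6])    # one worker is much slower
print(bound, may_proceed(5, clocks, bound))
```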
RAMP: A fault tolerant distributed microcomputer structure for aircraft navigation and control
NASA Technical Reports Server (NTRS)
Dunn, W. R.
1980-01-01
RAMP consists of distributed sets of parallel computers partitioned on the basis of software and packaging constraints. To minimize hardware and software complexity, the processors operate asynchronously. It was shown that through the design of asymptotically stable control laws, data errors due to the asynchronism were minimized. It was further shown that by designing control laws with this property and making minor hardware modifications to the RAMP modules, the system became inherently tolerant to intermittent faults. A laboratory version of RAMP was constructed and is described in the paper along with the experimental results.
Tools and Techniques for Adding Fault Tolerance to Distributed and Parallel Programs
1991-12-07
The scale of parallel computing systems is rapidly approaching dimensions where fault tolerance can no longer be ignored. No matter how reliable the individual components may be, the sheer number of components makes failures inevitable, and relying solely on hardware-based fault tolerance, such as that employed in the Tandem [7] and Stratus [35] systems, is clearly impractical.
A Conceptual Design for a Reliable Optical Bus (ROBUS)
NASA Technical Reports Server (NTRS)
Miner, Paul S.; Malekpour, Mahyar; Torres, Wilfredo
2002-01-01
The Scalable Processor-Independent Design for Electromagnetic Resilience (SPIDER) is a new family of fault-tolerant architectures under development at NASA Langley Research Center (LaRC). The SPIDER is a general-purpose computational platform suitable for use in ultra-reliable embedded control applications. The design scales from a small configuration supporting a single aircraft function to a large distributed configuration capable of supporting several functions simultaneously. SPIDER consists of a collection of simplex processing elements communicating via a Reliable Optical Bus (ROBUS). The ROBUS is an ultra-reliable, time-division multiple access broadcast bus with strictly enforced write access (no babbling idiots) providing basic fault-tolerant services using formally verified fault-tolerance protocols including Interactive Consistency (Byzantine Agreement), Internal Clock Synchronization, and Distributed Diagnosis. The conceptual design of the ROBUS is presented in this paper including requirements, topology, protocols, and the block-level design. Verification activities, including the use of formal methods, are also discussed.
Validation Methods for Fault-Tolerant avionics and control systems, working group meeting 1
NASA Technical Reports Server (NTRS)
1979-01-01
The proceedings of the first working group meeting on validation methods for fault tolerant computer design are presented. The state of the art in fault tolerant computer validation was examined in order to provide a framework for future discussions concerning research issues for the validation of fault tolerant avionics and flight control systems. The development of positions concerning critical aspects of the validation process are given.
Design study of Software-Implemented Fault-Tolerance (SIFT) computer
NASA Technical Reports Server (NTRS)
Wensley, J. H.; Goldberg, J.; Green, M. W.; Kutz, W. H.; Levitt, K. N.; Mills, M. E.; Shostak, R. E.; Whiting-Okeefe, P. M.; Zeidler, H. M.
1982-01-01
Software-implemented fault tolerant (SIFT) computer design for commercial aviation is reported. A SIFT design concept is addressed. Alternate strategies for physical implementation are considered. Hardware and software design correctness is addressed. System modeling and effectiveness evaluation are considered from a fault-tolerant point of view.
Ultrareliable fault-tolerant control systems
NASA Technical Reports Server (NTRS)
Webster, L. D.; Slykhouse, R. A.; Booth, L. A., Jr.; Carson, T. M.; Davis, G. J.; Howard, J. C.
1984-01-01
It is demonstrated that fault-tolerant computer systems based on redundant, independent operation, such as those on the Shuttles, are a viable alternative in fault-tolerant system design. The ultrareliable fault-tolerant control system (UFTCS) was developed and tested in laboratory simulations of a UH-1H helicopter. UFTCS includes asymptotically stable independent control elements in a parallel, cross-linked system environment. Static redundancy provides the fault tolerance. Polling is performed among the computers, with results allowing for time-delay channel variations within tight bounds. When compared with the laboratory and actual flight data for the helicopter, the probability of a fault was, for the first 10 hr of flight given quintuple computer redundancy, found to be 1 in 290 billion. Two weeks of untended Space Station operations would experience a fault probability of 1 in 24 million. Techniques for avoiding channel divergence problems are identified.
Error Mitigation of Point-to-Point Communication for Fault-Tolerant Computing
NASA Technical Reports Server (NTRS)
Akamine, Robert L.; Hodson, Robert F.; LaMeres, Brock J.; Ray, Robert E.
2011-01-01
Fault tolerant systems require the ability to detect and recover from physical damage caused by the hardware's environment, faulty connectors, and system degradation over time. This ability applies to military, space, and industrial computing applications. The integrity of Point-to-Point (P2P) communication, between two microcontrollers for example, is an essential part of fault tolerant computing systems. In this paper, different methods of fault detection and recovery are presented and analyzed.
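One standard detection-and-recovery method in this setting is to checksum each message and retransmit on mismatch. The sketch below uses the Python standard library's zlib.crc32; the frame layout and retry policy are assumptions for illustration, not a method from the paper:

```python
import zlib

def frame(payload: bytes) -> bytes:
    """Append a CRC-32 trailer to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def check(framed: bytes):
    """Return the payload if its CRC-32 trailer matches, else None."""
    payload, trailer = framed[:-4], framed[-4:]
    return payload if zlib.crc32(payload).to_bytes(4, "big") == trailer else None

def send_with_retry(payload, channel, max_retries=3):
    """Retransmit until the receiver's CRC check passes."""
    for _ in range(max_retries):
        result = check(channel(frame(payload)))
        if result is not None:
            return result
    raise IOError("unrecoverable link fault")

faults = iter([True, False])            # corrupt the first transmission only
def channel(framed):
    if next(faults, False):
        return bytes([framed[0] ^ 0xFF]) + framed[1:]   # simulate bit flips
    return framed

print(send_with_retry(b"sensor data", channel))   # recovered on the retry
```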
Advanced cloud fault tolerance system
NASA Astrophysics Data System (ADS)
Sumangali, K.; Benny, Niketa
2017-11-01
Cloud computing has become a prevalent on-demand service on the internet to store, manage and process data. A pitfall that accompanies cloud computing is the possibility of failures in the cloud. To overcome these failures, we require a fault tolerance mechanism that abstracts faults away from users. We have proposed a fault tolerant architecture, which is a combination of proactive and reactive fault tolerance. This architecture essentially increases the reliability and the availability of the cloud. In the future, we would like to compare evaluations of our proposed architecture with existing architectures and further improve it.
Information Weighted Consensus for Distributed Estimation in Vision Networks
ERIC Educational Resources Information Center
Kamal, Ahmed Tashrif
2013-01-01
Due to their high fault-tolerance, ease of installation and scalability to large networks, distributed algorithms have recently gained immense popularity in the sensor networks community, especially in computer vision. Multi-target tracking in a camera network is one of the fundamental problems in this domain. Distributed estimation algorithms…
Fault tolerance in computational grids: perspectives, challenges, and issues.
Haider, Sajjad; Nazir, Babar
2016-01-01
Computational grids are established with the intention of providing shared access to hardware and software based resources with special reference to increased computational capabilities. Fault tolerance is one of the most important issues faced by the computational grids. The main contribution of this survey is the creation of an extended classification of problems that occur in computational grid environments. The proposed classification will help researchers, developers, and maintainers of grids to understand the types of issues to be anticipated. Moreover, different types of problems, such as omission, interaction, and timing related, have been identified that need to be handled on various layers of the computational grid. In this survey, an analysis and examination are also performed pertaining to the fault tolerance and fault detection mechanisms. Our conclusion is that a dependable and reliable grid can only be established when more emphasis is placed on fault identification. Moreover, our survey reveals that adaptive and intelligent fault identification and tolerance techniques can improve the dependability of grid working environments.
Fault tolerant architectures for integrated aircraft electronics systems, task 2
NASA Technical Reports Server (NTRS)
Levitt, K. N.; Melliar-Smith, P. M.; Schwartz, R. L.
1984-01-01
The architectural basis for an advanced fault tolerant on-board computer to succeed the current generation of fault tolerant computers is examined. The network error tolerant system architecture is studied with particular attention to intercluster configurations and communication protocols, and to refined reliability estimates. The diagnosis of faults, so that appropriate choices for reconfiguration can be made is discussed. The analysis relates particularly to the recognition of transient faults in a system with tasks at many levels of priority. The demand driven data-flow architecture, which appears to have possible application in fault tolerant systems is described and work investigating the feasibility of automatic generation of aircraft flight control programs from abstract specifications is reported.
Fault-tolerant measurement-based quantum computing with continuous-variable cluster states.
Menicucci, Nicolas C
2014-03-28
A long-standing open question about Gaussian continuous-variable cluster states is whether they enable fault-tolerant measurement-based quantum computation. The answer is yes. Initial squeezing in the cluster above a threshold value of 20.5 dB ensures that errors from finite squeezing acting on encoded qubits are below the fault-tolerance threshold of known qubit-based error-correcting codes. By concatenating with one of these codes and using ancilla-based error correction, fault-tolerant measurement-based quantum computation of theoretically indefinite length is possible with finitely squeezed cluster states.
Method and system for environmentally adaptive fault tolerant computing
NASA Technical Reports Server (NTRS)
Copenhaver, Jason L. (Inventor); Jeremy, Ramos (Inventor); Wolfe, Jeffrey M. (Inventor); Brenner, Dean (Inventor)
2010-01-01
A method and system for adapting fault tolerant computing. The method includes the steps of measuring an environmental condition representative of an environment. An on-board processing system's sensitivity to the measured environmental condition is measured. It is determined whether to reconfigure a fault tolerance of the on-board processing system based in part on the measured environmental condition. The fault tolerance of the on-board processing system may be reconfigured based in part on the measured environmental condition.
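A minimal sketch of such an adaptation loop is shown below; the threshold name and values are hypothetical, and the simplex/TMR choice merely stands in for whatever reconfiguration the on-board system supports:

```python
# Hypothetical threshold for illustration; a real value would come from
# the platform's environmental characterization.
RADIATION_TMR_THRESHOLD = 50.0      # arbitrary units

def choose_fault_tolerance(radiation_level, sensitivity):
    """Reconfigure redundancy from the measured environment: run cheap
    simplex mode in benign conditions, switch to triple modular
    redundancy (TMR) when the expected upset rate, i.e. the measured
    environment scaled by the processor's sensitivity, crosses the
    threshold."""
    expected_upset_rate = radiation_level * sensitivity
    return "TMR" if expected_upset_rate > RADIATION_TMR_THRESHOLD else "simplex"

print(choose_fault_tolerance(radiation_level=120.0, sensitivity=0.6))  # TMR
print(choose_fault_tolerance(radiation_level=10.0, sensitivity=0.6))   # simplex
```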
Step-by-step magic state encoding for efficient fault-tolerant quantum computation
Goto, Hayato
2014-01-01
Quantum error correction allows one to make quantum computers fault-tolerant against unavoidable errors due to decoherence and imperfect physical gate operations. However, the fault-tolerant quantum computation requires impractically large computational resources for useful applications. This is a current major obstacle to the realization of a quantum computer. In particular, magic state distillation, which is a standard approach to universality, consumes the most resources in fault-tolerant quantum computation. For the resource problem, here we propose step-by-step magic state encoding for concatenated quantum codes, where magic states are encoded step by step from the physical level to the logical one. To manage errors during the encoding, we carefully use error detection. Since the sizes of intermediate codes are small, it is expected that the resource overheads will become lower than previous approaches based on the distillation at the logical level. Our simulation results suggest that the resource requirements for a logical magic state will become comparable to those for a single logical controlled-NOT gate. Thus, the present method opens a new possibility for efficient fault-tolerant quantum computation. PMID:25511387
FTAPE: A fault injection tool to measure fault tolerance
NASA Technical Reports Server (NTRS)
Tsai, Timothy K.; Iyer, Ravishankar K.
1995-01-01
The paper introduces FTAPE (Fault Tolerance And Performance Evaluator), a tool that can be used to compare fault-tolerant computers. The tool combines system-wide fault injection with a controllable workload. A workload generator is used to create high stress conditions for the machine. Faults are injected based on this workload activity in order to ensure a high level of fault propagation. The errors/fault ratio and performance degradation are presented as measures of fault tolerance.
Algorithm-Based Fault Tolerance Integrated with Replication
NASA Technical Reports Server (NTRS)
Some, Raphael; Rennels, David
2008-01-01
In a proposed approach to programming and utilization of commercial off-the-shelf computing equipment, a combination of algorithm-based fault tolerance (ABFT) and replication would be utilized to obtain high degrees of fault tolerance without incurring excessive costs. The basic idea of the proposed approach is to integrate ABFT with replication such that the algorithmic portions of computations would be protected by ABFT, and the logical portions by replication. ABFT is an extremely efficient, inexpensive, high-coverage technique for detecting and mitigating faults in computer systems used for algorithmic computations, but does not protect against errors in logical operations surrounding algorithms.
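The classic instance of ABFT, due to Huang and Abraham, protects matrix multiplication with checksum rows; the Python sketch below illustrates the detection idea on a small example (the numeric tolerance is an assumption):

```python
def matmul(A, B):
    """Plain matrix product for lists of lists."""
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def with_column_checksum(A):
    """Append a row holding each column's sum (the checksum row)."""
    return A + [[sum(col) for col in zip(*A)]]

def abft_check(C_ext):
    """In C = A_ext * B, the last row must equal the column sums of the
    data rows; a mismatch flags a fault in that column's computation."""
    data, checksum = C_ext[:-1], C_ext[-1]
    recomputed = [sum(col) for col in zip(*data)]
    return [j for j, (a, b) in enumerate(zip(recomputed, checksum))
            if abs(a - b) > 1e-9]

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = matmul(with_column_checksum(A), B)
C[0][1] += 0.5                      # inject a fault into one element
print(abft_check(C))                # -> [1]: column 1 is inconsistent
```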
Agent Based Fault Tolerance for the Mobile Environment
NASA Astrophysics Data System (ADS)
Park, Taesoon
This paper presents a fault-tolerance scheme based on mobile agents for the reliable mobile computing systems. Mobility of the agent is suitable to trace the mobile hosts and the intelligence of the agent makes it efficient to support the fault tolerance services. This paper presents two approaches to implement the mobile agent based fault tolerant service and their performances are evaluated and compared with other fault-tolerant schemes.
Fault Injection and Monitoring Capability for a Fault-Tolerant Distributed Computation System
NASA Technical Reports Server (NTRS)
Torres-Pomales, Wilfredo; Yates, Amy M.; Malekpour, Mahyar R.
2010-01-01
The Configurable Fault-Injection and Monitoring System (CFIMS) is intended for the experimental characterization of effects caused by a variety of adverse conditions on a distributed computation system running flight control applications. A product of research collaboration between NASA Langley Research Center and Old Dominion University, the CFIMS is the main research tool for generating actual fault response data with which to develop and validate analytical performance models and design methodologies for the mitigation of fault effects in distributed flight control systems. Rather than a fixed design solution, the CFIMS is a flexible system that enables the systematic exploration of the problem space and can be adapted to meet the evolving needs of the research. The CFIMS has the capabilities of system-under-test (SUT) functional stimulus generation, fault injection and state monitoring, all of which are supported by a configuration capability for setting up the system as desired for a particular experiment. This report summarizes the work accomplished so far in the development of the CFIMS concept and documents the first design realization.
Transparent Ada rendezvous in a fault tolerant distributed system
NASA Technical Reports Server (NTRS)
Racine, Roger
1986-01-01
There are many problems associated with distributing an Ada program over a loosely coupled communication network. Some of these problems involve the various aspects of the distributed rendezvous. The problems addressed involve supporting the delay statement in a selective call and supporting the else clause in a selective call. Most of these difficulties are compounded by the need for an efficient communication system. The difficulties are compounded even more by considering the possibility of hardware faults occurring while the program is running. With a hardware fault tolerant computer system, it is possible to design a distribution scheme and communication software which is efficient and allows Ada semantics to be preserved. An Ada design for the communications software of one such system will be presented, including a description of the services provided in the seven layers of an International Standards Organization (ISO) Open System Interconnect (OSI) model communications system. The system capabilities (hardware and software) that allow this communication system will also be described.
Provable Transient Recovery for Frame-Based, Fault-Tolerant Computing Systems
NASA Technical Reports Server (NTRS)
DiVito, Ben L.; Butler, Ricky W.
1992-01-01
We present a formal verification of the transient fault recovery aspects of the Reliable Computing Platform (RCP), a fault-tolerant computing system architecture for digital flight control applications. The RCP uses NMR-style redundancy to mask faults and internal majority voting to purge the effects of transient faults. The system design has been formally specified and verified using the EHDM verification system. Our formalization accommodates a wide variety of voting schemes for purging the effects of transients.
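The purging mechanism can be shown in miniature: if every replica overwrites its local state with the majority-voted value at the end of each frame, a transiently corrupted replica is restored after one good frame. The sketch below is a simplified generic model, not the formally verified RCP design:

```python
from collections import Counter

def majority(values):
    """Return the strict-majority value among replica outputs, or None."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) // 2 else None

def frame(states, inputs):
    """One computation frame: each replica updates its state, then all
    replicas overwrite local state with the voted value, so the effect
    of a transient fault disappears after a single good frame."""
    computed = [s + inputs for s in states]      # replicated computation
    voted = majority(computed)
    return [voted] * len(computed)

states = [10, 10, 10, 99]         # one replica hit by a transient upset
states = frame(states, inputs=1)
print(states)                      # [11, 11, 11, 11]: transient purged
```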
Fault-tolerant cooperative output regulation for multi-vehicle systems with sensor faults
NASA Astrophysics Data System (ADS)
Qin, Liguo; He, Xiao; Zhou, D. H.
2017-10-01
This paper presents a unified framework of fault diagnosis and fault-tolerant cooperative output regulation (FTCOR) for a linear discrete-time multi-vehicle system with sensor faults. The FTCOR control law is designed through three steps. A cooperative output regulation (COR) controller is designed based on the internal mode principle when there are no sensor faults. A sufficient condition on the existence of the COR controller is given based on the discrete-time algebraic Riccati equation (DARE). Then, a decentralised fault diagnosis scheme is designed to cope with sensor faults occurring in followers. A residual generator is developed to detect sensor faults of each follower, and a bank of fault-matching estimators are proposed to isolate and estimate sensor faults of each follower. Unlike the current distributed fault diagnosis for multi-vehicle systems, the presented decentralised fault diagnosis scheme in each vehicle reduces the communication and computation load by only using the information of the vehicle. By combing the sensor fault estimation and the COR control law, an FTCOR controller is proposed. Finally, the simulation results demonstrate the effectiveness of the FTCOR controller.
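The residual generator mentioned above follows the standard observer-based pattern: compare each measured output with the output predicted from a state estimate and raise an alarm when the residual crosses a threshold. The sketch below is a scalar illustration with assumed dynamics, gain, and threshold, not the paper's design:

```python
def detect_sensor_fault(measurements, a=0.9, c=1.0, L=0.5, threshold=0.3):
    """Scalar Luenberger-style residual generator for x(k+1) = a x(k),
    y(k) = c x(k). The residual r = y - c*x_hat jumps when a sensor
    bias appears, so thresholding |r| flags the fault's onset."""
    x_hat = measurements[0] / c           # initialize from the first sample
    alarms = []
    for y in measurements:
        r = y - c * x_hat                 # residual
        alarms.append(abs(r) > threshold)
        x_hat = a * x_hat + L * r         # observer update
    return alarms

true_x, ys = 1.0, []
for k in range(10):
    bias = 0.8 if k >= 6 else 0.0         # sensor bias fault appears at k = 6
    ys.append(true_x + bias)
    true_x *= 0.9
print(detect_sensor_fault(ys))            # alarms raised at the fault onset
```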
The embedded operating system project
NASA Technical Reports Server (NTRS)
Campbell, R. H.
1985-01-01
The design and construction of embedded operating systems for real-time advanced aerospace applications was investigated. The applications require reliable operating system support that must accommodate computer networks. Problems that arise in the construction of such operating systems, reconfiguration, consistency and recovery in a distributed system, and the issues of real-time processing are reported. A thesis that provides theoretical foundations for the use of atomic actions to support fault tolerance and data consistency in real-time object-based system is included. The following items are addressed: (1) atomic actions and fault-tolerance issues; (2) operating system structure; (3) program development; (4) a reliable compiler for path Pascal; and (5) mediators, a mechanism for scheduling distributed system processes.
Airborne Advanced Reconfigurable Computer System (ARCS)
NASA Technical Reports Server (NTRS)
Bjurman, B. E.; Jenkins, G. M.; Masreliez, C. J.; Mcclellan, K. L.; Templeman, J. E.
1976-01-01
A digital computer subsystem fault-tolerant concept was defined, and the potential benefits and costs of such a subsystem were assessed when used as the central element of a new transport's flight control system. The derived advanced reconfigurable computer system (ARCS) is a triple-redundant computer subsystem that automatically reconfigures, under multiple fault conditions, from triplex to duplex to simplex operation, with redundancy recovery if the fault condition is transient. The study included criteria development covering factors at the aircraft's operation level that would influence the design of a fault-tolerant system for commercial airline use. A new reliability analysis tool was developed for evaluating redundant, fault-tolerant system availability and survivability; and a stringent digital system software design methodology was used to achieve design/implementation visibility.
Advanced information processing system: Fault injection study and results
NASA Technical Reports Server (NTRS)
Burkhardt, Laura F.; Masotto, Thomas K.; Lala, Jaynarayan H.
1992-01-01
The objective of the AIPS program is to achieve a validated fault tolerant distributed computer system. The goals of the AIPS fault injection study were: (1) to present the fault injection study components addressing the AIPS validation objective; (2) to obtain feedback for fault removal from the design implementation; (3) to obtain statistical data regarding fault detection, isolation, and reconfiguration responses; and (4) to obtain data regarding the effects of faults on system performance. The parameters are described that must be varied to create a comprehensive set of fault injection tests, the subset of test cases selected, the test case measurements, and the test case execution. Both pin level hardware faults using a hardware fault injector and software injected memory mutations were used to test the system. An overview is provided of the hardware fault injector and the associated software used to carry out the experiments. Detailed specifications are given of fault and test results for the I/O Network and the AIPS Fault Tolerant Processor, respectively. The results are summarized and conclusions are given.
Development and analysis of the Software Implemented Fault-Tolerance (SIFT) computer
NASA Technical Reports Server (NTRS)
Goldberg, J.; Kautz, W. H.; Melliar-Smith, P. M.; Green, M. W.; Levitt, K. N.; Schwartz, R. L.; Weinstock, C. B.
1984-01-01
SIFT (Software Implemented Fault Tolerance) is an experimental, fault-tolerant computer system designed to meet the extreme reliability requirements for safety-critical functions in advanced aircraft. Errors are masked by performing a majority voting operation over the results of identical computations, and faulty processors are removed from service by reassigning computations to the nonfaulty processors. This scheme has been implemented in a special architecture using a set of standard Bendix BDX930 processors, augmented by a special asynchronous-broadcast communication interface that provides direct, processor to processor communication among all processors. Fault isolation is accomplished in hardware; all other fault-tolerance functions, together with scheduling and synchronization are implemented exclusively by executive system software. The system reliability is predicted by a Markov model. Mathematical consistency of the system software with respect to the reliability model has been partially verified, using recently developed tools for machine-aided proof of program correctness.
Fault tolerant software modules for SIFT
NASA Technical Reports Server (NTRS)
Hecht, M.; Hecht, H.
1982-01-01
The implementation of software fault tolerance is investigated for critical modules of the Software Implemented Fault Tolerance (SIFT) operating system to support the computational and reliability requirements of advanced fly-by-wire transport aircraft. Fault-tolerant designs generated for the error reporter and the global executive are examined. A description of the alternate routines, implementation requirements, and software validation are included.
Universal fault-tolerant quantum computation with only transversal gates and error correction.
Paetznick, Adam; Reichardt, Ben W
2013-08-30
Transversal implementations of encoded unitary gates are highly desirable for fault-tolerant quantum computation. Though transversal gates alone cannot be computationally universal, they can be combined with specially distilled resource states in order to achieve universality. We show that "triorthogonal" stabilizer codes, introduced for state distillation by Bravyi and Haah [Phys. Rev. A 86, 052329 (2012)], admit transversal implementation of the controlled-controlled-Z gate. We then construct a universal set of fault-tolerant gates without state distillation by using only transversal controlled-controlled-Z, transversal Hadamard, and fault-tolerant error correction. We also adapt the distillation procedure of Bravyi and Haah to Toffoli gates, improving on existing Toffoli distillation schemes.
NASA Astrophysics Data System (ADS)
Wang, Rui
It is known that high intensity radiated fields (HIRF) can produce upsets in digital electronics, and thereby degrade the performance of digital flight control systems. Such upsets, either from natural or man-made sources, can change data values on digital buses and memory and affect CPU instruction execution. HIRF environments are also known to trigger common-mode faults, affecting multiple fault containment regions nearly simultaneously, and hence reducing the benefits of n-modular redundancy and other fault-tolerant computing techniques. Thus, it is important to develop models which describe the integration of the embedded digital system, where the control law is implemented, as well as the dynamics of the closed-loop system. In this dissertation, theoretical tools are presented to analyze the relationship between the design choices for a class of distributed recoverable computing platforms and the tracking performance degradation of a digital flight control system implemented on such a platform while operating in a HIRF environment. Specifically, a tractable hybrid performance model is developed for a digital flight control system implemented on a computing platform inspired largely by the NASA family of fault-tolerant, reconfigurable computer architectures known as SPIDER (scalable processor-independent design for enhanced reliability). The focus will be on the SPIDER implementation, which uses the computer communication system known as ROBUS-2 (reliable optical bus). A physical HIRF experiment was conducted at the NASA Langley Research Center in order to validate the theoretical tracking performance degradation predictions for a distributed Boeing 747 flight control system subject to a HIRF environment. An extrapolation of these results for scenarios that could not be physically tested is also presented.
Reliability model derivation of a fault-tolerant, dual, spare-switching, digital computer system
NASA Technical Reports Server (NTRS)
1974-01-01
A computer-based reliability projection aid, tailored specifically for application in the design of fault-tolerant computer systems, is described. Its more pronounced characteristics include the facility for modeling systems with two distinct operational modes, measuring the effect of both permanent and transient faults, and calculating conditional system coverage factors. The underlying conceptual principles, mathematical models, and computer program implementation are presented.
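For a sense of how coverage factors enter such models, a textbook illustration (not the report's own model): a duplex system whose units each have reliability R(t), and whose fault handling succeeds with probability c (the coverage), survives with

$$R_{\mathrm{sys}}(t) = R(t)^2 + 2\,c\,R(t)\bigl(1 - R(t)\bigr),$$

so any shortfall in coverage (c < 1) directly caps the benefit of the second unit, no matter how reliable it is.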
Abstractions for Fault-Tolerant Distributed System Verification
NASA Technical Reports Server (NTRS)
Pike, Lee S.; Maddalon, Jeffrey M.; Miner, Paul S.; Geser, Alfons
2004-01-01
Four kinds of abstraction for the design and analysis of fault-tolerant distributed systems are discussed. These abstractions concern system messages, faults, fault-masking voting, and communication. The abstractions are formalized in higher-order logic and are intended to facilitate specifying and verifying such systems in higher-order theorem provers.
1988-08-20
Increasing Reliability of Multiversion Fault-Tolerant Software Design by Modularization
Miyashita, Junryo; Department of Computer Science, California State University at San Bernardino
... They shall be referred to as "multiversion fault-tolerant software design". One problem of developing multiple versions of a program is the high cost ...
Coordinated Fault Tolerance for High-Performance Computing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dongarra, Jack; Bosilca, George; et al.
2013-04-08
Our work toward the goal of end-to-end fault tolerance has focused on two areas: (1) improving fault tolerance in various software packages currently available and widely used throughout the HEC domain, and (2) using fault information exchange and coordination to achieve holistic, system-wide fault tolerance, together with understanding how to design and implement interfaces for integrating fault-tolerance features across multiple layers of the software stack, from the application, math libraries, and programming language runtime to other common system software such as job schedulers, resource managers, and monitoring tools.
Tutorial: Advanced fault tree applications using HARP
NASA Technical Reports Server (NTRS)
Dugan, Joanne Bechta; Bavuso, Salvatore J.; Boyd, Mark A.
1993-01-01
Reliability analysis of fault tolerant computer systems for critical applications is complicated by several factors. These modeling difficulties are discussed and dynamic fault tree modeling techniques for handling them are described and demonstrated. Several advanced fault tolerant computer systems are described, and fault tree models for their analysis are presented. HARP (Hybrid Automated Reliability Predictor) is a software package developed at Duke University and NASA Langley Research Center that is capable of solving the fault tree models presented.
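HARP's dynamic gates add sequence dependence (and hence the Markov solution mentioned above), but the static fault-tree core underlying such tools is easy to state. A minimal Python sketch, assuming independent basic events (illustrative only, not HARP itself):

```python
import math

def ft_prob(node, basic):
    """Top-event probability of a *static* fault tree.
    node  : ('basic', name) or (gate, [children]) with gate 'AND'/'OR'.
    basic : failure probability of each basic event.
    Assumes statistically independent basic events."""
    if node[0] == 'basic':
        return basic[node[1]]
    gate, children = node
    probs = [ft_prob(c, basic) for c in children]
    if gate == 'AND':                      # fails only if all children fail
        return math.prod(probs)
    if gate == 'OR':                       # fails if any child fails
        return 1.0 - math.prod(1.0 - p for p in probs)
    raise ValueError(gate)

# Loss of a redundant processor pair OR loss of the shared bus.
tree = ('OR', [('AND', [('basic', 'cpu1'), ('basic', 'cpu2')]),
               ('basic', 'bus')])
print(ft_prob(tree, {'cpu1': 1e-3, 'cpu2': 1e-3, 'bus': 1e-5}))
```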
High Speed Computing, LANs, and WANs
NASA Technical Reports Server (NTRS)
Bergman, Larry A.; Monacos, Steve
1994-01-01
Optical fiber networks may one day offer potential capacities exceeding 10 terabits/sec. This paper describes present gigabit network techniques for distributed computing as illustrated by the CASA gigabit testbed, and then explores future all-optic network architectures that offer increased capacity, more optimized level of service for a given application, high fault tolerance, and dynamic reconfigurability.
Fault-tolerant building-block computer study
NASA Technical Reports Server (NTRS)
Rennels, D. A.
1978-01-01
Ultra-reliable core computers are required for improving the reliability of complex military systems. Such computers can provide reliable fault diagnosis and failure circumvention and, in some cases, serve as an automated repairman for their host systems. A small set of building-block circuits which can be implemented as single very large scale integration (VLSI) devices, and which can be used with off-the-shelf microprocessors and memories to build self-checking computer modules (SCCMs), is described. Each SCCM is a microcomputer which is capable of detecting its own faults during normal operation and is designed to communicate with other identical modules over one or more MIL-STD-1553A buses. Several SCCMs can be connected into a network with backup spares to provide fault-tolerant operation, i.e., automated recovery from faults. Alternative fault-tolerant SCCM configurations are discussed along with the cost and reliability associated with their implementation.
Plan for the Characterization of HIRF Effects on a Fault-Tolerant Computer Communication System
NASA Technical Reports Server (NTRS)
Torres-Pomales, Wilfredo; Malekpour, Mahyar R.; Miner, Paul S.; Koppen, Sandra V.
2008-01-01
This report presents the plan for the characterization of the effects of high intensity radiated fields on a prototype implementation of a fault-tolerant data communication system. Various configurations of the communication system will be tested. The prototype system is implemented using off-the-shelf devices. The system will be tested in a closed-loop configuration with extensive real-time monitoring. This test is intended to generate data suitable for the design of avionics health management systems, as well as redundancy management mechanisms and policies for robust distributed processing architectures.
Napolitano, Jr., Leonard M.
1995-01-01
The Lambda network is a single stage, packet-switched interprocessor communication network for a distributed memory, parallel processor computer. Its design arises from the desired network characteristics of minimizing mean and maximum packet transfer time, local routing, expandability, deadlock avoidance, and fault tolerance. The network is based on fixed degree nodes and has mean and maximum packet transfer distances where n is the number of processors. The routing method is detailed, as are methods for expandability, deadlock avoidance, and fault tolerance.
Fault-tolerant linear optical quantum computing with small-amplitude coherent states.
Lund, A P; Ralph, T C; Haselgrove, H L
2008-01-25
Quantum computing using two coherent states as a qubit basis is a proposed alternative architecture with lower overheads, but it has been questioned as a practical way of performing quantum computing due to the fragility of diagonal states with large coherent amplitudes. We show that with error correction only small amplitudes (α > 1.2) are required for fault-tolerant quantum computing. We study fault tolerance under the effects of small amplitudes and loss using a Monte Carlo simulation. The first-encoding-level resources are orders of magnitude lower than those of the best single-photon scheme.
Enhanced fault-tolerant quantum computing in d-level systems.
Campbell, Earl T
2014-12-05
Error-correcting codes protect quantum information and form the basis of fault-tolerant quantum computing. Leading proposals for fault-tolerant quantum computation require codes with an exceedingly rare property, a transversal non-Clifford gate. Codes with the desired property are presented for d-level qudit systems with prime d. The codes use n=d-1 qudits and can detect up to ∼d/3 errors. We quantify the performance of these codes for one approach to quantum computation known as magic-state distillation. Unlike prior work, we find performance is always enhanced by increasing d.
Message Efficient Checkpointing and Rollback Recovery in Heterogeneous Mobile Networks
NASA Astrophysics Data System (ADS)
Jaggi, Parmeet Kaur; Singh, Awadhesh Kumar
2016-06-01
Heterogeneous networks provide an appealing way of expanding the computing capability of mobile networks by combining infrastructure-less mobile ad-hoc networks with infrastructure-based cellular mobile networks. The nodes in such a network range from low-power nodes to macro base stations and thus vary greatly in their capabilities, such as computation power and battery power. The nodes are susceptible to different types of transient and permanent failures, and therefore the algorithms designed for such networks need to be fault-tolerant. The article presents a checkpointing algorithm for the rollback recovery of mobile hosts in a heterogeneous mobile network. Checkpointing is a well-established approach to providing fault tolerance in static and cellular mobile distributed systems. However, the use of checkpointing for fault tolerance in a heterogeneous environment remains to be explored. The proposed protocol is based on Netzer and Xu's results on zigzag paths and zigzag cycles. Considering the heterogeneity prevalent in the network, an uncoordinated checkpointing technique is employed; yet useless checkpoints are avoided without incurring a high message overhead.
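The hazard with uncoordinated checkpoints is the domino effect: a rollback can orphan received messages and force further rollbacks. A toy fixpoint computation of a recovery line (a deliberate simplification; the Netzer-Xu zigzag analysis is what characterizes exactly which checkpoints can ever be useful):

```python
def recovery_line(num_intervals, failed, messages):
    """num_intervals[p]: intervals executed by p (interval i lies between
    checkpoints i and i+1; checkpoint 0 is the initial state).
    failed: processes that crashed mid-interval and restart from their
    last checkpoint, discarding their current interval.
    messages: (sender, send_interval, receiver, recv_interval) tuples.
    Returns line[p], the checkpoint p restarts from; intervals >= line[p]
    are discarded."""
    line = {p: n - 1 if p in failed else n for p, n in num_intervals.items()}
    changed = True
    while changed:                  # propagate rollbacks to a fixpoint
        changed = False
        for s, si, r, ri in messages:
            # Orphaned message: sent in a discarded interval of s, yet its
            # receipt survives at r. The receiver must roll back past it.
            if si >= line[s] and ri < line[r]:
                line[r] = ri
                changed = True
    return line

# P1 fails; a message it sent after its last checkpoint orphans P0's receipt.
print(recovery_line({'P0': 3, 'P1': 2}, {'P1'},
                    [('P1', 1, 'P0', 2)]))    # -> {'P0': 2, 'P1': 1}
```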
NASA Technical Reports Server (NTRS)
Butler, Ricky W.; Divito, Ben L.; Holloway, C. Michael
1994-01-01
In this paper the design and formal verification of the lower levels of the Reliable Computing Platform (RCP), a fault-tolerant computing system for digital flight control applications, are presented. The RCP uses NMR-style redundancy to mask faults and internal majority voting to flush the effects of transient faults. Two new layers of the RCP hierarchy are introduced: the Minimal Voting refinement (DA_minv) of the Distributed Asynchronous (DA) model and the Local Executive (LE) Model. Both the DA_minv model and the LE model are specified formally and have been verified using the Ehdm verification system. All specifications and proofs are available electronically via the Internet using anonymous FTP or World Wide Web (WWW) access.
Analysis of typical fault-tolerant architectures using HARP
NASA Technical Reports Server (NTRS)
Bavuso, Salvatore J.; Bechta Dugan, Joanne; Trivedi, Kishor S.; Rothmann, Elizabeth M.; Smith, W. Earl
1987-01-01
Difficulties encountered in the modeling of fault-tolerant systems are discussed. The Hybrid Automated Reliability Predictor (HARP) approach to modeling fault-tolerant systems is described. The HARP is written in FORTRAN, consists of nearly 30,000 lines of code and comments, and is based on behavioral decomposition. Using the behavioral decomposition, the dependability model is divided into fault-occurrence/repair and fault/error-handling models; the characteristics and combining of these two models are examined. Examples in which the HARP is applied to the modeling of some typical fault-tolerant systems, including a local-area network, two fault-tolerant computer systems, and a flight control system, are presented.
Optimal design and use of retry in fault tolerant real-time computer systems
NASA Technical Reports Server (NTRS)
Lee, Y. H.; Shin, K. G.
1983-01-01
A new method to determine an optimal retry policy, and to use retry for fault characterization, is presented. An optimal retry policy for a given fault characteristic, which determines the maximum allowable retry durations so as to minimize the total task completion time, was derived. The combined fault characterization and retry decision, in which the characteristics of the fault are estimated simultaneously with the determination of the optimal retry policy, was carried out. Two solution approaches were developed, one based on point estimation and the other on Bayes sequential decision. Maximum likelihood estimators are used for the first approach, and backward induction for testing hypotheses in the second. Numerical examples in which all the durations associated with faults have monotone hazard functions, e.g., exponential, Weibull, and gamma distributions, are presented. These are standard distributions commonly used in fault modeling and analysis.
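The monotone-hazard assumption is what keeps the optimal stopping rule well behaved. For example, the Weibull hazard

$$h(t) = \frac{\beta}{\eta}\left(\frac{t}{\eta}\right)^{\beta-1}$$

is increasing for shape parameter β > 1, constant (the exponential case) for β = 1, and decreasing for β < 1, so all three example distributions fit the framework.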
NASA Astrophysics Data System (ADS)
Lidar, Daniel A.; Brun, Todd A.
2013-09-01
Prologue; Preface; Part I. Background: 1. Introduction to decoherence and noise in open quantum systems Daniel Lidar and Todd Brun; 2. Introduction to quantum error correction Dave Bacon; 3. Introduction to decoherence-free subspaces and noiseless subsystems Daniel Lidar; 4. Introduction to quantum dynamical decoupling Lorenza Viola; 5. Introduction to quantum fault tolerance Panos Aliferis; Part II. Generalized Approaches to Quantum Error Correction: 6. Operator quantum error correction David Kribs and David Poulin; 7. Entanglement-assisted quantum error-correcting codes Todd Brun and Min-Hsiu Hsieh; 8. Continuous-time quantum error correction Ognyan Oreshkov; Part III. Advanced Quantum Codes: 9. Quantum convolutional codes Mark Wilde; 10. Non-additive quantum codes Markus Grassl and Martin Rötteler; 11. Iterative quantum coding systems David Poulin; 12. Algebraic quantum coding theory Andreas Klappenecker; 13. Optimization-based quantum error correction Andrew Fletcher; Part IV. Advanced Dynamical Decoupling: 14. High order dynamical decoupling Zhen-Yu Wang and Ren-Bao Liu; 15. Combinatorial approaches to dynamical decoupling Martin Rötteler and Pawel Wocjan; Part V. Alternative Quantum Computation Approaches: 16. Holonomic quantum computation Paolo Zanardi; 17. Fault tolerance for holonomic quantum computation Ognyan Oreshkov, Todd Brun and Daniel Lidar; 18. Fault tolerant measurement-based quantum computing Debbie Leung; Part VI. Topological Methods: 19. Topological codes Héctor Bombín; 20. Fault tolerant topological cluster state quantum computing Austin Fowler and Kovid Goyal; Part VII. Applications and Implementations: 21. Experimental quantum error correction Dave Bacon; 22. Experimental dynamical decoupling Lorenza Viola; 23. Architectures Jacob Taylor; 24. Error correction in quantum communication Mark Wilde; Part VIII. Critical Evaluation of Fault Tolerance: 25. Hamiltonian methods in QEC and fault tolerance Eduardo Novais, Eduardo Mucciolo and Harold Baranger; 26. Critique of fault-tolerant quantum information processing Robert Alicki; References; Index.
Experimental Demonstration of Fault-Tolerant State Preparation with Superconducting Qubits.
Takita, Maika; Cross, Andrew W; Córcoles, A D; Chow, Jerry M; Gambetta, Jay M
2017-11-03
Robust quantum computation requires encoding delicate quantum information into degrees of freedom that are hard for the environment to change. Quantum encodings have been demonstrated in many physical systems by observing and correcting storage errors, but applications require not just storing information; we must accurately compute even with faulty operations. The theory of fault-tolerant quantum computing illuminates a way forward by providing a foundation and collection of techniques for limiting the spread of errors. Here we implement one of the smallest quantum codes in a five-qubit superconducting transmon device and demonstrate fault-tolerant state preparation. We characterize the resulting code words through quantum process tomography and study the free evolution of the logical observables. Our results are consistent with fault-tolerant state preparation in a protected qubit subspace.
Fault Injection Campaign for a Fault Tolerant Duplex Framework
NASA Technical Reports Server (NTRS)
Sacco, Gian Franco; Ferraro, Robert D.; von llmen, Paul; Rennels, Dave A.
2007-01-01
Fault tolerance is an efficient approach adopted to avoid or reduce the damage of a system failure. In this work we present the results of a fault injection campaign we conducted on the Duplex Framework (DF). The DF is a software framework developed by the UCLA group [1, 2] that takes a fault-tolerant approach, running two replicas of the same process on two different nodes of a commercial off-the-shelf (COTS) computer cluster. A third process, running on a different node, constantly monitors the results computed by the two replicas and restarts them if an inconsistency in their computation is detected. This approach is very cost-efficient and can be adopted to control processes on spacecraft, where the fault rate produced by cosmic rays is not very high.
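The monitor's logic reduces to a compare-and-restart loop. A schematic Python sketch (hypothetical names, not the UCLA implementation):

```python
def monitor(next_a, next_b, restart_pair, on_result, steps):
    """Compare results from two replicas of the same process; on any
    mismatch, assume one replica is faulty and restart both from a
    known-good state (a duplex detects, but cannot by itself identify,
    the faulty replica, hence both are restarted)."""
    for _ in range(steps):
        a, b = next_a(), next_b()
        if a == b:
            on_result(a)          # agreed result is forwarded
        else:
            restart_pair()        # divergence detected

# Toy run: replica B produces one corrupted value at step 2.
a_vals = iter([1, 2, 3])
b_vals = iter([1, 9, 3])
monitor(lambda: next(a_vals), lambda: next(b_vals),
        lambda: print("mismatch: restarting both replicas"),
        lambda r: print("result", r), steps=3)
```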
Computer-Aided Reliability Estimation
NASA Technical Reports Server (NTRS)
Bavuso, S. J.; Stiffler, J. J.; Bryant, L. A.; Petersen, P. L.
1986-01-01
CARE III (Computer-Aided Reliability Estimation, Third Generation) helps estimate the reliability of complex, redundant, fault-tolerant systems. The program is specifically designed for the evaluation of fault-tolerant avionics systems; however, CARE III is general enough for use in the evaluation of other systems as well.
Fault-tolerant processing system
NASA Technical Reports Server (NTRS)
Palumbo, Daniel L. (Inventor)
1996-01-01
A fault-tolerant, fiber-optic interconnect, or backplane, serves as a conduit for data transfer between modules. Fault-tolerance algorithms are embedded in the backplane by dividing it into a read bus and a write bus and placing a redundancy management unit (RMU) between them, so that all data transmitted on the write bus are subjected to the fault-tolerance algorithms before being passed for distribution to the read bus. The RMU provides both backplane control and fault tolerance.
NASA Technical Reports Server (NTRS)
Smith, T. B., Jr.; Lala, J. H.
1983-01-01
The basic organization of the fault-tolerant multiprocessor (FTMP) is that of a general-purpose homogeneous multiprocessor. Three processors operate on a shared system (memory and I/O) bus. Replication and tight synchronization of all elements, together with hardware voting, are employed to detect and correct any single fault. Reconfiguration is then employed to repair a fault. Multiple faults may be tolerated as a sequence of single faults, with repair between fault occurrences.
Paralex: An Environment for Parallel Programming in Distributed Systems
1991-12-07
... programming for distributed systems is comparable to assembly language programming for traditional sequential systems - the user must resort to low-level primitives to accomplish data encoding/decoding, communication, remote execution, synchronization, failure detection, and recovery. It is our belief that ... synchronization. Finally, composing parallel programs by interconnecting sequential computations allows automatic support for heterogeneity and fault tolerance
High-Threshold Fault-Tolerant Quantum Computation with Analog Quantum Error Correction
NASA Astrophysics Data System (ADS)
Fukui, Kosuke; Tomita, Akihisa; Okamoto, Atsushi; Fujii, Keisuke
2018-04-01
To implement fault-tolerant quantum computation with continuous variables, the Gottesman-Kitaev-Preskill (GKP) qubit has been recognized as an important technological element. However, it is still challenging to experimentally generate a GKP qubit with the squeezing level, 14.8 dB, required by existing fault-tolerant schemes. To reduce this requirement, we propose a high-threshold fault-tolerant quantum computation with GKP qubits using topologically protected measurement-based quantum computation with the surface code. By harnessing analog information contained in the GKP qubits, we apply analog quantum error correction to the surface code. Furthermore, we develop a method to prevent the squeezing level from decreasing during the construction of the large-scale cluster states for the topologically protected measurement-based quantum computation. We numerically show that the required squeezing level can be relaxed to less than 10 dB, which is within the reach of current experimental technology. Hence, this work can considerably alleviate this experimental requirement and take a step closer to the realization of large-scale quantum computation.
MAX - An advanced parallel computer for space applications
NASA Technical Reports Server (NTRS)
Lewis, Blair F.; Bunker, Robert L.
1991-01-01
MAX is a fault-tolerant multicomputer hardware and software architecture designed to meet the needs of NASA spacecraft systems. It consists of conventional computing modules (computers) connected via a dual network topology. One network is used to transfer data among the computers and between computers and I/O devices. This network's topology is arbitrary. The second network operates as a broadcast medium for operating system synchronization messages and supports the operating system's Byzantine resilience. A fully distributed operating system supports multitasking in an asynchronous event and data driven environment. A large grain dataflow paradigm is used to coordinate the multitasking and provide easy control of concurrency. It is the basis of the system's fault tolerance and allows both static and dynamical location of tasks. Redundant execution of tasks with software voting of results may be specified for critical tasks. The dataflow paradigm also supports simplified software design, test and maintenance. A unique feature is a method for reliably patching code in an executing dataflow application.
Fault tolerant architectures for integrated aircraft electronics systems
NASA Technical Reports Server (NTRS)
Levitt, K. N.; Melliar-Smith, P. M.; Schwartz, R. L.
1983-01-01
Work on possible architectures for future flight control computer systems is described. Ada for Fault-Tolerant Systems, the NETS Network Error-Tolerant System architecture, and voting in asynchronous systems are covered.
A formally verified algorithm for interactive consistency under a hybrid fault model
NASA Technical Reports Server (NTRS)
Lincoln, Patrick; Rushby, John
1993-01-01
Consistent distribution of single-source data to replicated computing channels is a fundamental problem in fault-tolerant system design. The 'Oral Messages' (OM) algorithm solves this problem of Interactive Consistency (Byzantine Agreement) assuming that all faults are worst-case. Thambidurai and Park introduced a 'hybrid' fault model that distinguishes three fault modes: asymmetric (Byzantine), symmetric, and benign; they also exhibited, along with an informal 'proof of correctness', a modified version of OM. Unfortunately, their algorithm is flawed. The discipline of mechanically checked formal verification eventually enabled us to develop a correct algorithm for Interactive Consistency under the hybrid fault model. This algorithm withstands a asymmetric, s symmetric, and b benign faults simultaneously, using m+1 rounds, provided n > 2a + 2s + b + m and m ≥ a. We present this algorithm, discuss its subtle points, and describe its formal specification and verification in PVS. We argue that formal verification systems such as PVS are now sufficiently effective that their application to fault-tolerance algorithms should be considered routine.
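For orientation, the classical OM(m) recursion (the all-Byzantine baseline that the hybrid algorithm refines) can be simulated in a few lines. A toy Python version, under the assumptions that faults are modeled as bit-flipping senders and names are hypothetical; it satisfies interactive consistency whenever n > 3m:

```python
from collections import Counter

def om(commander, lieutenants, value, m, traitor):
    """Toy simulation of the OM(m) algorithm of Lamport, Shostak, and
    Pease. Returns {lieutenant: decided value}. Faulty processors here
    simply flip every bit they send; real Byzantine behavior is richer."""
    def send(src, v):
        return (1 - v) if traitor(src) else v
    received = {l: send(commander, value) for l in lieutenants}
    if m == 0:
        return received
    decided = {}
    for l in lieutenants:
        # l's votes: the commander's value, plus what each other
        # lieutenant o relays to l via a round of OM(m-1).
        votes = [received[l]]
        for o in lieutenants:
            if o == l:
                continue
            sub = om(o, [x for x in lieutenants if x != o],
                     received[o], m - 1, traitor)
            votes.append(sub[l])
        decided[l] = Counter(votes).most_common(1)[0][0]
    return decided

# n = 4, m = 1 tolerates one traitor (n > 3m): loyal lieutenants L1 and
# L2 agree on the loyal commander's value despite faulty L3.
result = om('C', ['L1', 'L2', 'L3'], 1, 1, traitor=lambda p: p == 'L3')
assert result['L1'] == result['L2'] == 1
```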
Formal Techniques for Synchronized Fault-Tolerant Systems
NASA Technical Reports Server (NTRS)
DiVito, Ben L.; Butler, Ricky W.
1992-01-01
We present the formal verification of synchronizing aspects of the Reliable Computing Platform (RCP), a fault-tolerant computing system for digital flight control applications. The RCP uses NMR-style redundancy to mask faults and internal majority voting to purge the effects of transient faults. The system design has been formally specified and verified using the EHDM verification system. Our formalization is based on an extended state machine model incorporating snapshots of local processors' clocks.
Buffered coscheduling for parallel programming and enhanced fault tolerance
Petrini, Fabrizio [Los Alamos, NM; Feng, Wu-chun [Los Alamos, NM
2006-01-31
A computer-implemented method schedules processor jobs on a network of parallel machine processors or distributed system processors. Control-information communications generated by each process performed by each processor during a defined time interval are accumulated in buffers, where adjacent time intervals are separated by strobe intervals for a global exchange of control information. A global exchange of the control-information communications at the end of each defined time interval is performed during an intervening strobe interval, so that each processor is informed by all of the other processors of the number of incoming jobs to be received in a subsequent time interval. The buffered coscheduling method of this invention also enhances the fault tolerance of a network of parallel machine processors or distributed system processors.
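The mechanism is easiest to see as a single strobe-interval exchange. A schematic Python sketch (hypothetical data layout, not the patented implementation):

```python
def strobe_exchange(outgoing):
    """One global exchange at a strobe interval: outgoing[p] holds the
    (destination, job) pairs processor p buffered during the preceding
    time interval. The exchange returns, for every processor, the number
    of incoming jobs to expect in the next interval, so each can schedule
    its communication and computation deterministically."""
    incoming = {p: 0 for p in outgoing}
    for buffered in outgoing.values():
        for dest, _job in buffered:
            incoming[dest] += 1
    return incoming

# P0 buffered two jobs for P1; P1 buffered one for P0.
print(strobe_exchange({'P0': [('P1', 'a'), ('P1', 'b')],
                       'P1': [('P0', 'c')]}))    # {'P0': 1, 'P1': 2}
```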
Measurement and analysis of workload effects on fault latency in real-time systems
NASA Technical Reports Server (NTRS)
Woodbury, Michael H.; Shin, Kang G.
1990-01-01
The authors demonstrate the need to address fault latency in highly reliable real-time control computer systems. It is noted that the effectiveness of all known recovery mechanisms is greatly reduced in the presence of multiple latent faults. The presence of multiple latent faults increases the possibility of multiple errors, which could result in coverage failure. The authors present experimental evidence indicating that the duration of fault latency is dependent on workload. A synthetic workload generator is used to vary the workload, and a hardware fault injector is applied to inject transient faults of varying durations. This method makes it possible to derive the distribution of fault latency duration. Experimental results obtained from the fault-tolerant multiprocessor at the NASA Airlab are presented and discussed.
Testing For EM Upsets In Aircraft Control Computers
NASA Technical Reports Server (NTRS)
Belcastro, Celeste M.
1994-01-01
Effects of transient electrical signals evaluated in laboratory tests. Method of evaluating nominally fault-tolerant, aircraft-type digital-computer-based control system devised. Provides for evaluation of susceptibility of system to upset and evaluation of integrity of control when system subjected to transient electrical signals like those induced by electromagnetic (EM) source, in this case lightning. Beyond aerospace applications, fault-tolerant control systems becoming more widespread in industry, such as in automobiles. Method supports practical, systematic tests for evaluation of designs of fault-tolerant control systems.
Napolitano, L.M. Jr.
1995-11-28
The Lambda network is a single stage, packet-switched interprocessor communication network for a distributed memory, parallel processor computer. Its design arises from the desired network characteristics of minimizing mean and maximum packet transfer time, local routing, expandability, deadlock avoidance, and fault tolerance. The network is based on fixed degree nodes and has mean and maximum packet transfer distances where n is the number of processors. The routing method is detailed, as are methods for expandability, deadlock avoidance, and fault tolerance. 14 figs.
VLSI Implementation of Fault Tolerance Multiplier based on Reversible Logic Gate
NASA Astrophysics Data System (ADS)
Ahmad, Nabihah; Hakimi Mokhtar, Ahmad; Othman, Nurmiza binti; Fhong Soon, Chin; Rahman, Ab Al Hadi Ab
2017-08-01
The multiplier is one of the essential components in the digital world, used in digital signal processing, microprocessors, quantum computing, and arithmetic units generally. Due to the complexity of the multiplier, the likelihood of errors is high. This paper presents the design of a 2×2-bit fault-tolerance multiplier based on reversible logic gates with low power consumption and high performance. The design has been implemented using 90 nm Complementary Metal Oxide Semiconductor (CMOS) technology in the Synopsys Electronic Design Automation (EDA) tools. The multiplier architecture is implemented using reversible logic gates: a combination of the Double Feynman gate (F2G), the New Fault Tolerance (NFT) gate, and the Islam Gate (IG), with an area of 160 μm × 420.3 μm (67,248 μm²). The design achieves a low power consumption of 122.85 μW and a propagation delay of 16.99 ns. The proposed fault-tolerance multiplier thus achieves low power consumption and high performance, making it suitable for modern computing applications, as it also provides fault-tolerance capabilities.
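Reversible gates are bijections on their input bits, which is what a quick check can verify. A small Python sketch of the Double Feynman gate, assuming the common definition (A, B, C) -> (A, A⊕B, A⊕C); the helper name is ours, not the paper's:

```python
from itertools import product

def f2g(a, b, c):
    """Double Feynman gate (F2G), one of the reversible primitives
    combined in the multiplier: (A, B, C) -> (A, A^B, A^C)."""
    return a, a ^ b, a ^ c

# Reversibility check: a reversible gate must be a bijection, so all
# 2**3 input patterns must map to distinct outputs.
outputs = {f2g(*bits) for bits in product((0, 1), repeat=3)}
assert len(outputs) == 8
```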
Analysis of a hardware and software fault tolerant processor for critical applications
NASA Technical Reports Server (NTRS)
Dugan, Joanne B.
1993-01-01
Computer systems for critical applications must be designed to tolerate software faults as well as hardware faults. A unified approach to tolerating hardware and software faults is characterized by classifying faults in terms of duration (transient or permanent) rather than source (hardware or software). Errors arising from transient faults can be handled through masking or voting, but errors arising from permanent faults require system reconfiguration to bypass the failed component. Most errors which are caused by software faults can be considered transient, in that they are input-dependent. Software faults are triggered by a particular set of inputs. Quantitative dependability analysis of systems which exhibit a unified approach to fault tolerance can be performed by a hierarchical combination of fault tree and Markov models. A methodology for analyzing hardware and software fault tolerant systems is applied to the analysis of a hypothetical system, loosely based on the Fault Tolerant Parallel Processor. The models consider both transient and permanent faults, hardware and software faults, independent and related software faults, automatic recovery, and reconfiguration.
A distributed fault-tolerant signal processor /FTSP/
NASA Astrophysics Data System (ADS)
Bonneau, R. J.; Evett, R. C.; Young, M. J.
1980-01-01
A digital fault-tolerant signal processor (FTSP), an example of a self-repairing programmable system, is analyzed. The design configuration is discussed in terms of fault tolerance, system-level fault detection and isolation, and common memory. Special attention is given to the FDIR (fault detection, isolation, and reconfiguration) logic, noting that reconfiguration decisions are based on configuration, summary status, end-around tests, and north marker/synchro data. Several mechanisms of fault detection are described which initiate reconfiguration at different levels. It is concluded that the reliability of a signal processor can be significantly enhanced by the use of fault-tolerant techniques.
Distributed asynchronous microprocessor architectures in fault tolerant integrated flight systems
NASA Technical Reports Server (NTRS)
Dunn, W. R.
1983-01-01
The paper discusses the implementation of fault-tolerant digital flight control and navigation systems for rotorcraft applications. It is shown that in implementing fault tolerance at the systems level using advanced LSI/VLSI technology, aircraft physical layout and flight systems requirements tend to define a system architecture of distributed, asynchronous microprocessors in which fault tolerance can be achieved locally through hardware redundancy and/or globally through application of analytical redundancy. The effects of asynchronism on the execution of dynamic flight software are discussed. It is shown that if the asynchronous microprocessors have knowledge of time, these errors can be significantly reduced through appropriate modifications of the flight software. Finally, the paper extends previous work to show that through the combined use of time referencing and stable flight algorithms, individual microprocessors can be configured to autonomously tolerate intermittent faults.
Nonuniform code concatenation for universal fault-tolerant quantum computing
NASA Astrophysics Data System (ADS)
Nikahd, Eesa; Sedighi, Mehdi; Saheb Zamani, Morteza
2017-09-01
Using transversal gates is a straightforward and efficient technique for fault-tolerant quantum computing. Since transversal gates alone cannot be computationally universal, they must be combined with other approaches such as magic state distillation, code switching, or code concatenation to achieve universality. In this paper we propose an alternative approach for universal fault-tolerant quantum computing, mainly based on the code concatenation approach proposed in [T. Jochym-O'Connor and R. Laflamme, Phys. Rev. Lett. 112, 010505 (2014), 10.1103/PhysRevLett.112.010505], but in a nonuniform fashion. The proposed approach is described based on nonuniform concatenation of the 7-qubit Steane code with the 15-qubit Reed-Muller code, as well as the 5-qubit code with the 15-qubit Reed-Muller code, which lead to two 49-qubit and 47-qubit codes, respectively. These codes can correct any arbitrary single physical error with the ability to perform a universal set of fault-tolerant gates, without using magic state distillation.
Multiple Embedded Processors for Fault-Tolerant Computing
NASA Technical Reports Server (NTRS)
Bolotin, Gary; Watson, Robert; Katanyoutanant, Sunant; Burke, Gary; Wang, Mandy
2005-01-01
A fault-tolerant computer architecture has been conceived in an effort to reduce vulnerability to single-event upsets (spurious bit flips caused by impingement of energetic ionizing particles or photons). As in some prior fault-tolerant architectures, the redundancy needed for fault tolerance is obtained by use of multiple processors in one computer. Unlike prior architectures, the multiple processors are embedded in a single field-programmable gate array (FPGA). What makes this new approach practical is the recent commercial availability of FPGAs that are capable of having multiple embedded processors. A working prototype (see figure) consists of two embedded IBM PowerPC 405 processor cores and a comparator built on a Xilinx Virtex-II Pro FPGA. This relatively simple instantiation of the architecture implements an error-detection scheme. A planned future version, incorporating four processors and two comparators, would correct some errors in addition to detecting them.
Fault-tolerant arithmetic via time-shared TMR
NASA Astrophysics Data System (ADS)
Swartzlander, Earl E.
1999-11-01
Fault tolerance is increasingly important as society has come to depend on computers for more and more aspects of daily life. The current concern about Y2K problems indicates just how much we depend on accurate computers. This paper describes work on time-shared TMR, a technique used to provide arithmetic operations that produce correct results in spite of circuit faults.
Analysis of fault-tolerant neurocontrol architectures
NASA Technical Reports Server (NTRS)
Troudet, T.; Merrill, W.
1992-01-01
The fault tolerance of analog parallel distributed implementations of a multivariable aircraft neurocontroller is analyzed by simulating weight and neuron failures in a simplified scheme of analog processing based on the functional architecture of the ETANN chip (Electrically Trainable Artificial Neural Network). The neural information processing is found to be only partially distributed throughout the set of weights of the neurocontroller synthesized with the backpropagation algorithm. Although the degree of distribution of the neural processing, and consequently the fault tolerance of the neurocontroller, could be enhanced using locally distributed weight and neuron approaches, a satisfactory level of fault tolerance could only be obtained by retraining the degraded VLSI neurocontroller. The possibility of maintaining neurocontrol performance and stability in the presence of single weight or neuron failures was demonstrated through an automated retraining procedure based on a pre-programmed choice and sequence of the training parameters.
Windows .NET Network Distributed Basic Local Alignment Search Toolkit (W.ND-BLAST)
Dowd, Scot E; Zaragoza, Joaquin; Rodriguez, Javier R; Oliver, Melvin J; Payton, Paxton R
2005-01-01
Background: BLAST is one of the most common and useful tools for genetic research. This paper describes a software application we have termed the Windows .NET Distributed Basic Local Alignment Search Toolkit (W.ND-BLAST), which enhances the BLAST utility by improving usability, fault recovery, and scalability in a Windows desktop environment. Our goal was to develop an easy-to-use, fault-tolerant, high-throughput BLAST solution that incorporates a comprehensive BLAST result viewer with curation and annotation functionality. Results: W.ND-BLAST is a comprehensive Windows-based software toolkit that targets researchers, including those with minimal computer skills, and provides the ability to increase the performance of BLAST by distributing BLAST queries to any number of Windows-based machines across local area networks (LANs). W.ND-BLAST provides intuitive Graphic User Interfaces (GUIs) for BLAST database creation, BLAST execution, BLAST output evaluation, and BLAST result exportation. The software also provides several layers of fault tolerance and fault recovery to prevent loss of data if nodes or master machines fail. This paper lays out the functionality of W.ND-BLAST. W.ND-BLAST displays close to 100% performance efficiency when distributing tasks to 12 remote computers of the same performance class. A high-throughput BLAST job which took 662.68 minutes (11 hours) on one average machine was completed in 44.97 minutes when distributed to 17 nodes, which included lower-performance-class machines. Finally, there are comprehensive high-throughput BLAST Output Viewer (BOV) and Annotation Engine components, which provide comprehensive exportation of BLAST hits to text files, annotated FASTA files, tables, or association files. Conclusion: W.ND-BLAST provides an interactive tool that allows scientists to easily utilize their available computing resources for high-throughput and comprehensive sequence analyses. The install package for W.ND-BLAST is freely downloadable from . With registration the software is free; installation, networking, and usage instructions are provided, as well as a support forum. PMID:15819992
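The fault-recovery layers described above amount to reassigning a failed node's unfinished queries. A generic Python sketch of that pattern (our own illustration, not W.ND-BLAST's code; written sequentially for clarity, where a real master would dispatch to nodes in parallel):

```python
def run_distributed(tasks, nodes, execute):
    """Assign tasks to nodes and redo any task whose node fails
    (at-least-once semantics, acceptable for idempotent jobs such as
    BLAST queries)."""
    healthy, pending, results = list(nodes), list(tasks), {}
    while pending:
        if not healthy:
            raise RuntimeError(f"all nodes failed; {len(pending)} tasks left")
        task = pending.pop()
        node = healthy[len(results) % len(healthy)]
        try:
            results[task] = execute(node, task)   # raises if the node dies
        except Exception:
            healthy.remove(node)                  # retire the failed node
            pending.append(task)                  # redo its task elsewhere
    return results

# Toy run: node 'n2' is lost; its task is transparently redone on 'n1'.
def execute(node, task):
    if node == 'n2':
        raise RuntimeError("node lost")
    return f"{task} on {node}"

print(run_distributed(['q1', 'q2', 'q3'], ['n1', 'n2'], execute))
```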
The process group approach to reliable distributed computing
NASA Technical Reports Server (NTRS)
Birman, Kenneth P.
1991-01-01
The difficulty of developing reliable distributed software is an impediment to applying distributed computing technology in many settings. Experience with the ISIS system suggests that a structured approach based on virtually synchronous process groups yields systems which are substantially easier to develop, fault-tolerant, and self-managing. Six years of research on ISIS are reviewed, describing the model, the types of applications to which ISIS was applied, and some of the reasoning that underlies a recent effort to redesign and reimplement ISIS as a much smaller, lightweight system.
An approximation formula for a class of fault-tolerant computers
NASA Technical Reports Server (NTRS)
White, A. L.
1986-01-01
An approximation formula is derived for the probability of failure for fault-tolerant process-control computers. These computers use redundancy and reconfiguration to achieve high reliability. Finite-state Markov models capture the dynamic behavior of component failure and system recovery, and the approximation formula permits an estimation of system reliability by an easy examination of the model.
2000-06-01
... real-time operating system and design of a human-computer interface (HCI) for a triple modular redundant (TMR) fault-tolerant microprocessor for use in space-based applications. One disadvantage of using COTS hardware components is their susceptibility to the radiation effects present in the space environment, specifically radiation-induced single-event upsets (SEUs). In the event of an SEU, a fault-tolerant system can mitigate the effects of the upset and continue to process from the last known correct system state. The TMR basic hardware ...
NASA Technical Reports Server (NTRS)
Brenner, Richard; Lala, Jaynarayan H.; Nagle, Gail A.; Schor, Andrei; Turkovich, John
1994-01-01
This program demonstrated the integration of a number of technologies that can increase the availability and reliability of launch vehicles while lowering costs. Availability is increased with an advanced guidance algorithm that adapts trajectories in real-time. Reliability is increased with fault-tolerant computers and communication protocols. Costs are reduced by automatically generating code and documentation. This program was realized through the cooperative efforts of academia, industry, and government. The NASA-LaRC coordinated the effort, while Draper performed the integration. Georgia Institute of Technology supplied a weak Hamiltonian finite element method for optimal control problems. Martin Marietta used MATLAB to apply this method to a launch vehicle (FENOC). Draper supplied the fault-tolerant computing and software automation technology. The fault-tolerant technology includes sequential and parallel fault-tolerant processors (FTP & FTPP) and authentication protocols (AP) for communication. Fault-tolerant technology was incrementally incorporated. Development culminated with a heterogeneous network of workstations and fault-tolerant computers using AP. Draper's software automation system, ASTER, was used to specify a static guidance system based on FENOC, navigation, flight control (GN&C), models, and the interface to a user interface for mission control. ASTER generated Ada code for GN&C and C code for models. An algebraic transform engine (ATE) was developed to automatically translate MATLAB scripts into ASTER.
RAID Unbound: Storage Fault Tolerance in a Distributed Environment
NASA Technical Reports Server (NTRS)
Ritchie, Brian
1996-01-01
Mirroring, data replication, backup, and, more recently, redundant arrays of independent disks (RAID) are all technologies used to protect and ensure access to critical company data. A new set of problems has arisen as data becomes more and more geographically distributed. Each of the technologies listed above provides important benefits, but each has failed to adapt fully to the realities of distributed computing. The key to data high availability and protection is to take the technologies' strengths and 'virtualize' them across a distributed network. RAID and mirroring offer high data availability, while data replication and backup provide strong data protection. If we take these concepts at a very granular level (defining user, record, block, file, or directory types) and then liberate them from the physical subsystems with which they have traditionally been associated, we have the opportunity to create highly scalable, network-wide storage fault tolerance. The network becomes the virtual storage space in which the traditional concepts of data high availability and protection are implemented without their corresponding physical constraints.
Li, Ying
2016-09-16
Fault-tolerant quantum computing in systems composed of both Majorana fermions and topologically unprotected quantum systems, e.g., superconducting circuits or quantum dots, is studied in this Letter. Errors caused by topologically unprotected quantum systems need to be corrected with error-correction schemes, for instance, the surface code. We find that the error-correction performance of such a hybrid topological quantum computer is not superior to a normal quantum computer unless the topological charge of Majorana fermions is insusceptible to noise. If errors changing the topological charge are rare, the fault-tolerance threshold is much higher than the threshold of a normal quantum computer and a surface-code logical qubit could be encoded in only tens of topological qubits instead of about 1,000 normal qubits.
Definition and trade-off study of reconfigurable airborne digital computer system organizations
NASA Technical Reports Server (NTRS)
Conn, R. B.
1974-01-01
A highly reliable, fault-tolerant, reconfigurable computer system for aircraft applications was developed. The development and application of reliability and fault-tolerance assessment techniques are described. Particular emphasis is placed on the needs of an all-digital, fly-by-wire control system appropriate for a passenger-carrying airplane.
Measurement and analysis of operating system fault tolerance
NASA Technical Reports Server (NTRS)
Lee, I.; Tang, D.; Iyer, R. K.
1992-01-01
This paper demonstrates a methodology to model and evaluate the fault tolerance characteristics of operational software. The methodology is illustrated through case studies on three different operating systems: the Tandem GUARDIAN fault-tolerant system, the VAX/VMS distributed system, and the IBM/MVS system. Measurements are made on these systems for substantial periods to collect software error and recovery data. In addition to investigating basic dependability characteristics such as major software problems and error distributions, we develop two levels of models to describe error and recovery processes inside an operating system and on multiple instances of an operating system running in a distributed environment. Based on the models, reward analysis is conducted to evaluate the loss of service due to software errors and the effect of the fault-tolerance techniques implemented in the systems. Software error correlation in multicomputer systems is also investigated.
Final Project Report. Scalable fault tolerance runtime technology for petascale computers
DOE Office of Scientific and Technical Information (OSTI.GOV)
Krishnamoorthy, Sriram; Sadayappan, P
With the massive number of components comprising the forthcoming petascale computer systems, hardware failures will be routinely encountered during execution of large-scale applications. Due to the multidisciplinary, multiresolution, and multiscale nature of scientific problems that drive the demand for high end systems, applications place increasingly differing demands on the system resources: disk, network, memory, and CPU. In addition to MPI, future applications are expected to use advanced programming models such as those developed under the DARPA HPCS program as well as existing global address space programming models such as Global Arrays, UPC, and Co-Array Fortran. While there has been a considerable amount of work in fault-tolerant MPI, with a number of strategies and extensions for fault tolerance proposed, virtually none of the advanced models proposed for emerging petascale systems is currently fault aware. To achieve fault tolerance, development of underlying runtime and OS technologies able to scale to the petascale level is needed. This project has evaluated a range of runtime techniques for fault tolerance for advanced programming models.
Fault-Tolerant Computing: An Overview
1991-06-01
... (Addison-Wesley, Reading, MA) 1984. [8] J. Wakerly, Error Detecting Codes, Self-Checking Circuits and Applications (Elsevier North-Holland, New York) ... applicable to bit-sliced organizations of hardware. In the first time step, the normal computation is performed on the operands and the results ... for error detection and fault tolerance in parallel processor systems while performing specific computation-intensive applications [11]. ...
Reliable communication in the presence of failures
NASA Technical Reports Server (NTRS)
Birman, Kenneth P.; Joseph, Thomas A.
1987-01-01
The design and correctness of a communication facility for a distributed computer system are reported on. The facility provides support for fault-tolerant process groups in the form of a family of reliable multicast protocols that can be used in both local- and wide-area networks. These protocols attain high levels of concurrency while respecting application-specific delivery-ordering constraints, and have varying cost and performance depending on the degree of ordering desired. In particular, a protocol that enforces causal delivery orderings is introduced and shown to be a valuable alternative to conventional asynchronous communication protocols. The facility also ensures that the processes belonging to a fault-tolerant process group will observe consistent orderings of events affecting the group as a whole, including process failures, recoveries, migration, and dynamic changes to group properties like member rankings. A review of several uses for the protocols in the ISIS system, which supports fault-tolerant resilient objects and bulletin boards, illustrates the significant simplification of higher-level algorithms made possible by our approach.
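The causal-delivery rule used by protocols in this family can be sketched with vector clocks: a message is deliverable only when it is the next one from its sender and everything the sender had already seen has been delivered locally. A minimal Python illustration (schematic, not the ISIS protocol itself):

```python
def can_deliver(msg_vc, sender, local_vc):
    """Causal delivery test: a message with vector timestamp msg_vc from
    `sender` is deliverable at a process with vector clock local_vc iff
    it is the sender's next message and every message the sender had
    observed before sending has already been delivered here."""
    return (msg_vc[sender] == local_vc[sender] + 1 and
            all(msg_vc[k] <= local_vc[k] for k in msg_vc if k != sender))

# This process has delivered one broadcast from q. q's second broadcast
# is deliverable now; a message from r that causally follows an
# undelivered event is held back.
local = {'q': 1, 'r': 0}
assert can_deliver({'q': 2, 'r': 0}, 'q', local)
assert not can_deliver({'q': 2, 'r': 1}, 'r', local)
```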
NASA Technical Reports Server (NTRS)
Lala, J. H.; Smith, T. B., III
1983-01-01
The software developed for the Fault-Tolerant Multiprocessor (FTMP) is described. The FTMP executive is a timer-interrupt-driven dispatcher that schedules iterative tasks which run at 3.125, 12.5, and 25 Hz. Major tasks which run under the executive include system configuration control, flight control, and display. The flight control task includes autopilot and autoland functions for a jet transport aircraft. System displays include status displays of all hardware elements (processors, memories, I/O ports, buses), failure-log displays showing transient and hard faults, and an autopilot display. All software is in a higher-order language (AED, an ALGOL derivative). The executive is a fully distributed, general-purpose executive which automatically balances the load among available processor triads. Provisions for graceful performance degradation under processing overload are an integral part of the scheduling algorithms.
Study of fault-tolerant software technology
NASA Technical Reports Server (NTRS)
Slivinski, T.; Broglio, C.; Wild, C.; Goldberg, J.; Levitt, K.; Hitt, E.; Webb, J.
1984-01-01
Presented is an overview of the current state of the art of fault-tolerant software and an analysis of quantitative techniques and models developed to assess its impact. It examines research efforts as well as experience gained from commercial application of these techniques. The paper also addresses the computer architecture and design implications on hardware, operating systems and programming languages (including Ada) of using fault-tolerant software in real-time aerospace applications. It concludes that fault-tolerant software has progressed beyond the pure research state. The paper also finds that, although not perfectly matched, newer architectural and language capabilities provide many of the notations and functions needed to effectively and efficiently implement software fault-tolerance.
Using Performance Tools to Support Experiments in HPC Resilience
DOE Office of Scientific and Technical Information (OSTI.GOV)
Naughton, III, Thomas J; Boehm, Swen; Engelmann, Christian
2014-01-01
The high performance computing (HPC) community is working to address fault tolerance and resilience concerns for current and future large scale computing platforms. This is driving enhancements in the programming environments, specifically research on enhancing message passing libraries to support fault tolerant computing capabilities. The community has also recognized that tools for resilience experimentation are greatly lacking. However, we argue that there are several parallels between performance tools and resilience tools. As such, we believe the rich set of HPC performance-focused tools can be extended (repurposed) to benefit the resilience community. In this paper, we describe the initial motivation to leverage standard HPC performance analysis techniques to aid in developing diagnostic tools to assist fault tolerance experiments for HPC applications. These diagnosis procedures help to provide context for the system when the errors (failures) occurred. We describe our initial work in leveraging an MPI performance trace tool to assist in providing global context during fault injection experiments. Such tools will assist the HPC resilience community as they extend existing and new application codes to support fault tolerance.
ISIS: A System for Fault-Tolerant Distributed Computing
1986-04-01
Secure key storage and distribution
Agrawal, Punit
2015-06-02
This disclosure describes a distributed, fault-tolerant security system that enables the secure storage and distribution of private keys. In one implementation, the security system includes a plurality of computing resources that independently store private keys provided by publishers and encrypted using a single security-system public key. To protect against malicious activity, the security-system private key necessary to decrypt the publication private keys is not stored at any of the computing resources. Rather, portions, or shares, of the security-system private key are stored at each of the computing resources within the security system, and multiple security systems must communicate and share partial decryptions in order to decrypt the stored private key.
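The disclosure does not specify the sharing scheme, but the classic way to realize "shares of a key, none of which suffices alone" is Shamir's threshold scheme. A toy Python sketch with illustrative parameters (not the patented construction):

```python
import random

PRIME = 2**127 - 1   # a Mersenne prime, large enough for a toy example

def split(secret, n, k):
    """Shamir k-of-n sharing over GF(PRIME): the secret is the constant
    term of a random degree-(k-1) polynomial; shares are points on it."""
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(k - 1)]
    def poly(x):
        return sum(c * pow(x, i, PRIME) for i, c in enumerate(coeffs)) % PRIME
    return [(x, poly(x)) for x in range(1, n + 1)]

def combine(shares):
    """Lagrange interpolation at x = 0 recovers the secret from any k shares."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num = den = 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % PRIME
                den = den * (xi - xj) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret

shares = split(123456789, n=5, k=3)
assert combine(shares[:3]) == 123456789   # any 3 of the 5 shares suffice
```

Fewer than k shares reveal nothing about the key, which matches the disclosure's requirement that no single computing resource can decrypt on its own.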
Superconducting quantum circuits at the surface code threshold for fault tolerance.
Barends, R; Kelly, J; Megrant, A; Veitia, A; Sank, D; Jeffrey, E; White, T C; Mutus, J; Fowler, A G; Campbell, B; Chen, Y; Chen, Z; Chiaro, B; Dunsworth, A; Neill, C; O'Malley, P; Roushan, P; Vainsencher, A; Wenner, J; Korotkov, A N; Cleland, A N; Martinis, John M
2014-04-24
A quantum computer can solve hard problems, such as prime factoring, database searching and quantum simulation, at the cost of needing to protect fragile quantum states from error. Quantum error correction provides this protection by distributing a logical state among many physical quantum bits (qubits) by means of quantum entanglement. Superconductivity is a useful phenomenon in this regard, because it allows the construction of large quantum circuits and is compatible with microfabrication. For superconducting qubits, the surface code approach to quantum computing is a natural choice for error correction, because it uses only nearest-neighbour coupling and rapidly cycled entangling gates. The gate fidelity requirements are modest: the per-step fidelity threshold is only about 99 per cent. Here we demonstrate a universal set of logic gates in a superconducting multi-qubit processor, achieving an average single-qubit gate fidelity of 99.92 per cent and a two-qubit gate fidelity of up to 99.4 per cent. This places Josephson quantum computing at the fault-tolerance threshold for surface code error correction. Our quantum processor is a first step towards the surface code, using five qubits arranged in a linear array with nearest-neighbour coupling. As a further demonstration, we construct a five-qubit Greenberger-Horne-Zeilinger state using the complete circuit and full set of gates. The results demonstrate that Josephson quantum computing is a high-fidelity technology, with a clear path to scaling up to large-scale, fault-tolerant quantum circuits.
``DMS-R, the Brain of the ISS'', 10 Years of Continuous Successful Operation in Space
NASA Astrophysics Data System (ADS)
Wolff, Bernd; Scheffers, Peter
2012-08-01
Space industries on both sides of the Atlantic were faced with a new situation of collaboration at the beginning of the 1990s. In 1995, industrial cooperation between ASTRIUM ST, Bremen and RSC-E, Moscow started, aiming at outfitting the Russian Service Module ZVEZDA of the ISS with computers. The requested equipment had to provide not only redundancy but fault tolerance and high availability. The design and development of two fault-tolerant computers (FTCs), responsible for telemetry (Telemetry Computer: TC) and central control (CC), as well as the man-machine interface CPC, were contracted to ASTRIUM ST, Bremen. The computer system is responsible, e.g., for the life support system and ISS re-boost control. In July 2000, the integration of the Russian Service Module ZVEZDA with the Russian ZARYA FGB and the American Node 1 bore witness to transatlantic and European cooperation. The Russian Service Module ZVEZDA provides several basic functions, such as avionics control, Environmental Control and Life Support (ECLS) in the ISS, and control of the docked Automated Transfer Vehicle (ATV), which includes re-boost of the ISS. If these elementary functions fail or do not work reliably, the effects for the ISS will be catastrophic with respect to safety (manned space) and the ISS mission. For that reason the responsible computer system, Data Management System - Russia (DMS-R), is also called "the brain of the ISS". The Russian Service Module ZVEZDA, including DMS-R, was launched on 12 July 2000; DMS-R was operational also during launch and docking. The talk provides information about the definition, design, and development of DMS-R, the integration of DMS-R in the Russian Service Module, and the maintenance of the system in space. Besides the technical aspects, the German-Russian cooperation is also an important subject of this talk. An outlook finalises the talk, covering further development activities and applications of fault-tolerant systems. The importance of the DMS-R equipment for the ISS with respect to availability and reliability is reported in paragraph 1.2, which describes a serious incident. The DMS-R architecture, consisting of two fault-tolerant computers, their interconnection via the MIL-STD-1553 bus, and the Control Post Computer (CPC) as man-machine interface, is given in figure 1. The main data transfer within the ISS, and therefore also within the Russian segment, is managed by the MIL-STD-1553 bus. The focus of this paper is neither the operational concept nor the fault-tolerant design according to the Byzantine theorem, but the architectural embedment. One fault-tolerant computer consists of up to four fault containment regions (FCRs), comparing input and output data and deciding by majority voting whether a faulty FCR has to be isolated. For this purpose all data have to pass the so-called fault management element and are distributed to the other participants in the computer pool (FTC). Each fault containment region is connected to the avionics buses of the vehicle avionics system. In case of a faulty FCR (a wrong calculation result detected by the other FCRs or by built-in self-detection), the dedicated FCR will reset itself or will be reset by the others. The bus controller functions of the isolated FCR will be taken over, according to a specific deterministic scheme, by another FCR. The FTC data throughput will be maintained, and FTC operation will continue without interruption.
Each FCR consists of an application CPU board (ALB), the fault management layer (FML), the avionics bus interface board (AVI), and a power supply (PSU), sharing a VME data bus. The FML is fully transparent, in terms of I/O accessibility, to the application S/W, and autonomously votes both the data received from the avionics buses and the data transmitted by the application.
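To make the voting mechanism concrete, here is a minimal Python sketch of majority voting with isolation of a dissenting region. The class name `FCR`, the pool of four regions, and the isolate-on-disagreement logic are illustrative assumptions for exposition, not the DMS-R implementation.

```python
from collections import Counter

class FCR:
    """Illustrative fault containment region: computes a result, can be isolated."""
    def __init__(self, name, compute):
        self.name = name
        self.compute = compute      # the replicated application function
        self.isolated = False

def vote(fcrs, inputs):
    """Majority-vote the outputs of the active FCRs; isolate any dissenter."""
    active = [f for f in fcrs if not f.isolated]
    outputs = {f.name: f.compute(inputs) for f in active}
    # The most common output wins, provided it is a strict majority.
    winner, count = Counter(outputs.values()).most_common(1)[0]
    if count <= len(active) // 2:
        raise RuntimeError("no majority; the fault cannot be masked")
    for f in active:
        if outputs[f.name] != winner:
            f.isolated = True       # faulty region resets / is reset by the others
    return winner

# Example: four regions, one producing a wrong result.
good = lambda x: x + 1
bad = lambda x: x + 2
pool = [FCR("A", good), FCR("B", good), FCR("C", bad), FCR("D", good)]
print(vote(pool, 41))                          # -> 42; region C is isolated
print([f.name for f in pool if f.isolated])    # -> ['C']
```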
Full-Authority Fault-Tolerant Electronic Engine Control System for Variable Cycle Engines.
1982-04-01
Candidate configurations included a single internally self-checked VLSI microprocessor; the selected configuration is an externally checked pair of commercially available processors. The remainder of the excerpt is an abbreviation list (EEC: Electronic Engine Control; FPMH: Failures per Million Hours; FTMP: Fault Tolerant Multiprocessor; FTSC: Fault Tolerant Spaceborne Computer; MTBR: Mean Time Between Repair; MTTF: Mean Time to Failure).
Copilot: Monitoring Embedded Systems
NASA Technical Reports Server (NTRS)
Pike, Lee; Wegmann, Nis; Niller, Sebastian; Goodloe, Alwyn
2012-01-01
Runtime verification (RV) is a natural fit for ultra-critical systems, where correctness is imperative. In ultra-critical systems, even if the software is fault-free, because of the inherent unreliability of commodity hardware and the adversity of operational environments, processing units (and their hosted software) are replicated, and fault-tolerant algorithms are used to compare the outputs. We investigate both software monitoring in distributed fault-tolerant systems and the implementation of fault-tolerance mechanisms using RV techniques. We describe the Copilot language and compiler, specifically designed for generating monitors for distributed, hard real-time systems. We also describe two case studies in which we generated Copilot monitors for avionics systems.
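To illustrate the kind of property such a monitor checks, the following Python sketch implements a simple agreement monitor over replicated sensor streams. This is a generic illustration of the RV idea only, with hypothetical names and tolerance; Copilot itself is a Haskell-based stream language that compiles to monitors in C, and its syntax is not shown here.

```python
def agreement_monitor(streams, tolerance):
    """Check, at each step, that replicated outputs agree within a tolerance.

    streams: list of equally long lists, one per replicated processing unit.
    Yields (step, ok, values) so a caller can trigger recovery when ok is False.
    """
    for step, values in enumerate(zip(*streams)):
        ok = max(values) - min(values) <= tolerance
        yield step, ok, values

# Three replicas of an altitude estimate; replica 3 glitches at step 2.
replicas = [
    [100.0, 101.0, 102.0, 103.0],
    [100.1, 101.1, 102.1, 103.0],
    [100.0, 101.0, 250.0, 103.1],
]
for step, ok, values in agreement_monitor(replicas, tolerance=1.0):
    if not ok:
        print(f"step {step}: disagreement {values} -> raise fault response")
```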
Implementing a strand of a scalable fault-tolerant quantum computing fabric.
Chow, Jerry M; Gambetta, Jay M; Magesan, Easwar; Abraham, David W; Cross, Andrew W; Johnson, B R; Masluk, Nicholas A; Ryan, Colm A; Smolin, John A; Srinivasan, Srikanth J; Steffen, M
2014-06-24
With favourable error thresholds and requiring only nearest-neighbour interactions on a lattice, the surface code is an error-correcting code that has garnered considerable attention. At the heart of this code is the ability to perform a low-weight parity measurement of local code qubits. Here we demonstrate high-fidelity parity detection of two code qubits via measurement of a third syndrome qubit. With high-fidelity gates, we generate entanglement distributed across three superconducting qubits in a lattice where each code qubit is coupled to two bus resonators. Via high-fidelity measurement of the syndrome qubit, we deterministically entangle the code qubits in either an even or odd parity Bell state, conditioned on the syndrome qubit state. Finally, to fully characterize this parity readout, we develop a measurement tomography protocol. The lattice presented naturally extends to larger networks of qubits, outlining a path towards fault-tolerant quantum computing.
Interconnection requirements in avionic systems
NASA Astrophysics Data System (ADS)
Vergnolle, Claude; Houssay, Bruno
1991-04-01
The future aircraft generation will have thousands of smart electromagnetic sensors distributed all over the airframe. Each sensor is connected by fiber links to the mainframe computer in charge of real-time signal correlation. Such a computer must be compactly built and massively parallel: it needs 3-D optical free-space interconnects between neighbouring boards and reconfigurable interconnects via a holographic backplane. The optical interconnect facilities will also be used to build a fault-tolerant computer through large redundancy.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sadayappan, Ponnuswamy
Exascale computing systems will provide a thousand-fold increase in parallelism and a proportional increase in failure rate relative to today's machines. Systems software for exascale machines must provide the infrastructure to support existing applications while simultaneously enabling efficient execution of new programming models that naturally express dynamic, adaptive, irregular computation; coupled simulations; and massive data analysis in a highly unreliable hardware environment with billions of threads of execution. We propose a new approach to the data and work distribution model provided by system software based on the unifying formalism of an abstract file system. The proposed hierarchical data model provides simple, familiar visibility and access to data structures through the file system hierarchy, while providing fault tolerance through selective redundancy. The hierarchical task model features work queues whose form and organization are represented as file system objects. Data and work are both first class entities. By exposing the relationships between data and work to the runtime system, information is available to optimize execution time and provide fault tolerance. The data distribution scheme provides replication (where desirable and possible) for fault tolerance and efficiency, and it is hierarchical to make it possible to take advantage of locality. The user, tools, and applications, including legacy applications, can interface with the data, work queues, and one another through the abstract file model. This runtime environment will provide multiple interfaces to support traditional Message Passing Interface applications, languages developed under DARPA's High Productivity Computing Systems program, as well as other, experimental programming models. We will validate our runtime system with pilot codes on existing platforms and will use simulation to validate for exascale-class platforms. In this final report, we summarize research results from the work done at the Ohio State University towards the larger goals of the project listed above.
European Science Notes Information Bulletin Reports on Current European/ Middle Eastern Science
1991-04-01
Topics include fault-tolerance technology and VLSI/WSI implementation; the 10th IFAC Workshop on Distributed Computer Control Systems; optimal designs, commercial and experimental; and catalysts that would facilitate cooperation between applications experts and computer architects in designing and implementing a new generation of parallel computers.
NASA Technical Reports Server (NTRS)
Miner, Paul S.
1993-01-01
A critical function in a fault-tolerant computer architecture is the synchronization of the redundant computing elements. The synchronization algorithm must include safeguards to ensure that failed components do not corrupt the behavior of good clocks. Reasoning about fault-tolerant clock synchronization is difficult because of the possibility of subtle interactions involving failed components. Therefore, mechanical proof systems are used to ensure that the verification of the synchronization system is correct. In 1987, Schneider presented a general proof of correctness for several fault-tolerant clock synchronization algorithms. Subsequently, Shankar verified Schneider's proof by using the mechanical proof system EHDM. This proof ensures that any system satisfying its underlying assumptions will provide Byzantine fault-tolerant clock synchronization. The utility of Shankar's mechanization of Schneider's theory for the verification of clock synchronization systems is explored. Some limitations of Shankar's mechanically verified theory were encountered. With minor modifications to the theory, a mechanically checked proof is provided that removes these limitations. The revised theory also allows for proven recovery from transient faults. Use of the revised theory is illustrated with the verification of an abstract design of a clock synchronization system.
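Schneider's schema abstracts over concrete convergence functions such as the fault-tolerant midpoint. As a hedged illustration of that family (not Shankar's EHDM formalization or the verified design above), the Python sketch below shows the classic fault-tolerant midpoint: discarding the f lowest and f highest readings bounds the influence of up to f Byzantine clocks.

```python
def ft_midpoint(readings, f):
    """Fault-tolerant midpoint convergence function.

    readings: clock values observed from all n nodes (n > 3f for Byzantine faults).
    Discarding the f smallest and f largest readings removes any value a faulty
    clock could use to drag the correction arbitrarily far.
    """
    n = len(readings)
    assert n > 3 * f, "Byzantine fault tolerance requires n > 3f"
    trimmed = sorted(readings)[f:n - f]
    return (trimmed[0] + trimmed[-1]) / 2.0

# Four clocks, at most one Byzantine; the outlier 9999 barely shifts the midpoint.
print(ft_midpoint([100.0, 100.2, 99.9, 9999.0], f=1))   # -> 100.1
```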
Modeling and Simulation Reliable Spacecraft On-Board Computing
NASA Technical Reports Server (NTRS)
Park, Nohpill
1999-01-01
The proposed project will investigate modeling and simulation-driven testing and fault tolerance schemes for Spacecraft On-Board Computing, thereby achieving reliable spacecraft telecommunication. A spacecraft communication system has inherent capabilities of providing multipoint and broadcast transmission, connectivity between any two distant nodes within a wide-area coverage, quick network configuration /reconfiguration, rapid allocation of space segment capacity, and distance-insensitive cost. To realize the capabilities above mentioned, both the size and cost of the ground-station terminals have to be reduced by using reliable, high-throughput, fast and cost-effective on-board computing system which has been known to be a critical contributor to the overall performance of space mission deployment. Controlled vulnerability of mission data (measured in sensitivity), improved performance (measured in throughput and delay) and fault tolerance (measured in reliability) are some of the most important features of these systems. The system should be thoroughly tested and diagnosed before employing a fault tolerance into the system. Testing and fault tolerance strategies should be driven by accurate performance models (i.e. throughput, delay, reliability and sensitivity) to find an optimal solution in terms of reliability and cost. The modeling and simulation tools will be integrated with a system architecture module, a testing module and a module for fault tolerance all of which interacting through a centered graphical user interface.
Design of on-board Bluetooth wireless network system based on fault-tolerant technology
NASA Astrophysics Data System (ADS)
You, Zheng; Zhang, Xiangqi; Yu, Shijie; Tian, Hexiang
2007-11-01
In this paper, Bluetooth wireless data transmission technology is applied in an on-board computer system to realize wireless data transmission between peripherals of the micro-satellite integrated electronic system, and, in view of the high reliability demanded of a micro-satellite, a design of a Bluetooth wireless network based on fault-tolerant technology is introduced. The reliability of two fault-tolerant systems is first estimated using a Markov model; then the structural design of this fault-tolerant system is introduced. Several protocols are established to make the system operate correctly, and some related problems are listed and analyzed, with emphasis on the fault auto-diagnosis system, the active-standby switchover design, and data-integrity processing.
A Fault Oblivious Extreme-Scale Execution Environment
DOE Office of Scientific and Technical Information (OSTI.GOV)
McKie, Jim
The FOX project, funded under the ASCR X-stack I program, developed systems software and runtime libraries for a new approach to the data and work distribution for massively parallel, fault oblivious application execution. Our work was motivated by the premise that exascale computing systems will provide a thousand-fold increase in parallelism and a proportional increase in failure rate relative to today's machines. To deliver the capability of exascale hardware, the systems software must provide the infrastructure to support existing applications while simultaneously enabling efficient execution of new programming models that naturally express dynamic, adaptive, irregular computation; coupled simulations; and massive data analysis in a highly unreliable hardware environment with billions of threads of execution. Our OS research has prototyped new methods to provide efficient resource sharing, synchronization, and protection in a many-core compute node. We have experimented with alternative task/dataflow programming models and shown scalability in some cases to hundreds of thousands of cores. Much of our software is in active development through open source projects. Concepts from FOX are being pursued in next generation exascale operating systems. Our OS work focused on adaptive, application-tailored OS services optimized for multi- to many-core processors. We developed a new operating system, NIX, that supports role-based allocation of cores to processes and was released to open source. We contributed to the IBM FusedOS project, which promoted the concept of latency-optimized and throughput-optimized cores. We built a task queue library based on a distributed, fault-tolerant key-value store and identified scaling issues. A second fault-tolerant task-parallel library was developed, based on the Linda tuple space model, that used low-level interconnect primitives for optimized communication. We designed fault tolerance mechanisms for task-parallel computations employing work stealing for load balancing that scaled to the largest existing supercomputers. Finally, we implemented the Elastic Building Blocks runtime, a library to manage object-oriented distributed software components. To support the research, we won two INCITE awards for time on Intrepid (BG/P) and Mira (BG/Q). Much of our work has had impact in the OS and runtime community through the ASCR Exascale OS/R workshop and report, leading to the research agenda of the Exascale OS/R program. Our project was, however, also affected by attrition of multiple PIs. While the PIs continued to participate and offer guidance as time permitted, losing these key individuals was unfortunate both for the project and for the DOE HPC community.
The embedded operating system project
NASA Technical Reports Server (NTRS)
Campbell, R. H.
1984-01-01
This progress report describes research towards the design and construction of embedded operating systems for real-time advanced aerospace applications. The applications concerned require reliable operating system support that must accommodate networks of computers. The report addresses the problems of constructing such operating systems, the communications media, reconfiguration, consistency and recovery in a distributed system, and the issues of real-time processing. A discussion is included on suitable theoretical foundations for the use of atomic actions to support fault tolerance and data consistency in real-time object-based systems. In particular, this report addresses: atomic actions, fault tolerance, operating system structure, program development, reliability and availability, and networking issues. This document reports the status of various experiments designed and conducted to investigate embedded operating system design issues.
Partitioning in Avionics Architectures: Requirements, Mechanisms, and Assurance
NASA Technical Reports Server (NTRS)
Rushby, John
1999-01-01
Automated aircraft control has traditionally been divided into distinct "functions" that are implemented separately (e.g., autopilot, autothrottle, flight management); each function has its own fault-tolerant computer system, and dependencies among different functions are generally limited to the exchange of sensor and control data. A by-product of this "federated" architecture is that faults are strongly contained within the computer system of the function where they occur and cannot readily propagate to affect the operation of other functions. More modern avionics architectures contemplate supporting multiple functions on a single, shared, fault-tolerant computer system where natural fault containment boundaries are less sharply defined. Partitioning uses appropriate hardware and software mechanisms to restore strong fault containment to such integrated architectures. This report examines the requirements for partitioning, mechanisms for their realization, and issues in providing assurance for partitioning. Because partitioning shares some concerns with computer security, security models are reviewed and compared with the concerns of partitioning.
Combining dynamical decoupling with fault-tolerant quantum computation
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ng, Hui Khoon; Preskill, John; Lidar, Daniel A.
2011-07-15
We study how dynamical decoupling (DD) pulse sequences can improve the reliability of quantum computers. We prove upper bounds on the accuracy of DD-protected quantum gates and derive sufficient conditions for DD-protected gates to outperform unprotected gates. Under suitable conditions, fault-tolerant quantum circuits constructed from DD-protected gates can tolerate stronger noise and have a lower overhead cost than fault-tolerant circuits constructed from unprotected gates. Our accuracy estimates depend on the dynamics of the bath that couples to the quantum computer and can be expressed either in terms of the operator norm of the bath's Hamiltonian or in terms of the power spectrum of bath correlations; we explain in particular how the performance of recursively generated concatenated pulse sequences can be analyzed from either viewpoint. Our results apply to Hamiltonian noise models with limited spatial correlations.
NASA Technical Reports Server (NTRS)
Ratner, R. S.; Shapiro, E. B.; Zeidler, H. M.; Wahlstrom, S. E.; Clark, C. B.; Goldberg, J.
1973-01-01
This final report summarizes the work on the design of a fault-tolerant digital computer for aircraft. Volume 2 is composed of two parts. Part 1 is concerned with the computational requirements associated with an advanced commercial aircraft. Part 2 reviews the technology that will be available for the implementation of the computer in the 1975-1985 period. With regard to the computation task, 26 computations have been categorized according to computational load, memory requirements, criticality, permitted down-time, and the need to save data in order to effect a roll-back. The technology part stresses the impact of large-scale integration (LSI) on the realization of logic and memory. Also considered were module interconnection possibilities so as to minimize fault propagation.
Fault Tolerance for VLSI Multicomputers
1985-08-01
A fault-tolerance approach is described for a multicomputer that consists of hundreds or thousands of VLSI computation nodes interconnected by dedicated links. The proposed fault-tolerance scheme combines hardware that performs error detection with system-level protocols for recovery. In order to recover from an error and resume correct operation, a valid system state must be restored; a low-overhead, application-transparent error recovery approach is considered.
A Byzantine-Fault Tolerant Self-Stabilizing Protocol for Distributed Clock Synchronization Systems
NASA Technical Reports Server (NTRS)
Malekpour, Mahyar R.
2006-01-01
Embedded distributed systems have become an integral part of safety-critical computing applications, necessitating system designs that incorporate fault-tolerant clock synchronization in order to achieve ultra-reliable assurance levels. Many efficient clock synchronization protocols do not, however, address Byzantine failures, and most protocols that do tolerate Byzantine failures do not self-stabilize. The Byzantine self-stabilizing clock synchronization algorithms that exist in the literature are based either on unjustifiably strong assumptions about initial synchrony of the nodes or on the existence of a common pulse at the nodes. The Byzantine self-stabilizing clock synchronization protocol presented here does not rely on any assumptions about the initial state of the clocks. Furthermore, there is neither a central clock nor an externally generated pulse system. The proposed protocol converges deterministically, is scalable, and self-stabilizes in a short amount of time. The convergence time is linear with respect to the self-stabilization period. Proofs of the correctness of the protocol as well as the results of formal verification efforts are reported.
Applications of an architecture design and assessment system (ADAS)
NASA Technical Reports Server (NTRS)
Gray, F. Gail; Debrunner, Linda S.; White, Tennis S.
1988-01-01
A new Architecture Design and Assessment System (ADAS) tool package is introduced, and a range of possible applications is illustrated. ADAS was used to evaluate the performance of an advanced fault-tolerant computer architecture in a modern flight control application. Bottlenecks were identified and possible solutions suggested. The tool was also used to inject faults into the architecture and evaluate the synchronization algorithm, and improvements are suggested. Finally, ADAS was used as a front end research tool to aid in the design of reconfiguration algorithms in a distributed array architecture.
NASA Technical Reports Server (NTRS)
Avizienis, A.; Gunningberg, P.; Kelly, J. P. J.; Strigini, L.; Traverse, P. J.; Tso, K. S.; Voges, U.
1986-01-01
To establish a long-term research facility for experimental investigations of design diversity as a means of achieving fault-tolerant systems, a distributed testbed for multiple-version software was designed. It is part of a local network, which utilizes the Locus distributed operating system to operate a set of 20 VAX 11/750 computers. It is used in experiments to measure the efficacy of design diversity and to investigate reliability increases under large-scale, controlled experimental conditions.
FTMP - A highly reliable Fault-Tolerant Multiprocessor for aircraft
NASA Technical Reports Server (NTRS)
Hopkins, A. L., Jr.; Smith, T. B., III; Lala, J. H.
1978-01-01
The FTMP (Fault-Tolerant Multiprocessor) is a complex multiprocessor computer that employs a form of redundancy related to systems considered by Mathur (1971), in which each major module can substitute for any other module of the same type. Despite the conceptual simplicity of the redundancy form, the implementation has many intricacies owing partly to the low target failure rate, and partly to the difficulty of eliminating single-fault vulnerability. An extensive analysis of the computer through the use of such modeling techniques as Markov processes and combinatorial mathematics shows that for random hard faults the computer can meet its requirements. It is also shown that the maintenance scheduled at intervals of 200 hr or more can be adequate most of the time.
ROBUS-2: A Fault-Tolerant Broadcast Communication System
NASA Technical Reports Server (NTRS)
Torres-Pomales, Wilfredo; Malekpour, Mahyar R.; Miner, Paul S.
2005-01-01
The Reliable Optical Bus (ROBUS) is the core communication system of the Scalable Processor-Independent Design for Enhanced Reliability (SPIDER), a general-purpose fault-tolerant integrated modular architecture currently under development at NASA Langley Research Center. The ROBUS is a time-division multiple access (TDMA) broadcast communication system with medium access control by means of a time-indexed communication schedule. ROBUS-2 is a developmental version of the ROBUS providing guaranteed fault-tolerant services to the attached processing elements (PEs) in the presence of a bounded number of faults. These services include message broadcast (Byzantine Agreement), dynamic communication schedule update, clock synchronization, and distributed diagnosis (group membership). The ROBUS also features fault-tolerant startup and restart capabilities. ROBUS-2 is tolerant to internal as well as PE faults, and incorporates a dynamic self-reconfiguration capability driven by the internal diagnostic system. This version of the ROBUS is intended for laboratory experimentation and demonstrations of the capability to reintegrate failed nodes, dynamically update the communication schedule, and tolerate and recover from correlated transient faults.
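The medium access rule of such a time-indexed schedule reduces to a table lookup. The Python sketch below is a hedged illustration with a hypothetical slot table and slot width, not the ROBUS-2 schedule format.

```python
SCHEDULE = ["PE1", "PE2", "PE1", "PE3"]   # hypothetical time-indexed broadcast schedule
SLOT_US = 250                              # hypothetical slot width in microseconds

def may_transmit(node, t_us):
    """A node may broadcast only in its own slot of the repeating TDMA schedule."""
    slot = (t_us // SLOT_US) % len(SCHEDULE)
    return SCHEDULE[slot] == node

print(may_transmit("PE2", 300))   # True: 300 us falls in slot 1, owned by PE2
print(may_transmit("PE3", 300))   # False: PE3 must stay silent in this slot
```

Because every node derives the same slot index from its (synchronized) clock, no arbitration traffic is needed and a babbling node can be identified by the slot in which it speaks.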
Discretized Streams: A Fault-Tolerant Model for Scalable Stream Processing
2012-12-14
Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, Ion ... However, current programming models for distributed stream processing are relatively low-level, often leaving the user to worry about consistency of ...
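The discretization idea itself is easy to sketch: divide the live stream into short, deterministic mini-batches, so that lost state can be recomputed from the batch inputs after a failure. The following Python sketch illustrates that idea under assumed names; it is not the Spark Streaming API.

```python
def discretize(timestamped_events, interval):
    """Group a stream into consecutive mini-batches of fixed length 'interval'.

    Each batch is an immutable input to a deterministic batch computation, so a
    lost result can be recomputed from its inputs (lineage) after a failure.
    """
    batches = {}
    for t, event in timestamped_events:
        batches.setdefault(int(t // interval), []).append(event)
    return [batches.get(i, []) for i in range(max(batches) + 1)]

events = [(0.1, "a"), (0.4, "b"), (1.2, "c"), (2.7, "d")]
for i, batch in enumerate(discretize(events, interval=1.0)):
    print(f"batch {i}: count = {len(batch)}")   # deterministic per-batch computation
```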
Reliability modeling of fault-tolerant computer based systems
NASA Technical Reports Server (NTRS)
Bavuso, Salvatore J.
1987-01-01
Digital fault-tolerant computer-based systems have become commonplace in military and commercial avionics. These systems hold the promise of increased availability, reliability, and maintainability over conventional analog-based systems through the application of replicated digital computers arranged in fault-tolerant configurations. Three tightly coupled factors of paramount importance, ultimately determining the viability of these systems, are reliability, safety, and profitability. Reliability, the major driver, affects virtually every aspect of design, packaging, and field operations, and eventually produces profit for commercial applications or increased national security. However, the utilization of digital computer systems makes the task of producing credible reliability assessments a formidable one for the reliability engineer. The root of the problem lies in the digital computer's unique adaptability to changing requirements, its computational power, and its ability to test itself efficiently. Addressed here are the nuances of modeling the reliability of systems with large state sizes, in the Markov sense, which result from systems based on replicated redundant hardware, and the modeling of factors which can reduce reliability without concomitant depletion of hardware. Advanced fault-handling models are described, and methods of acquiring and measuring parameters for these models are delineated.
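As a small, concrete instance of such a model, the sketch below evaluates the textbook Markov chain for a triple modular redundant (TMR) configuration with perfect voting; the failure rate chosen is illustrative only, not a parameter from the paper.

```python
import math

def tmr_reliability(lam, t):
    """Closed-form reliability of a triple modular redundant (TMR) system.

    With per-channel failure rate lam and perfect voting, the Markov chain
    3-good -> 2-good -> failed yields R(t) = 3e^(-2*lam*t) - 2e^(-3*lam*t),
    i.e., the probability that at least 2 of 3 channels still work.
    """
    r = math.exp(-lam * t)          # single-channel reliability
    return 3 * r**2 - 2 * r**3

lam = 1e-4   # failures per hour (illustrative)
for t in (10, 100, 1000):
    print(t, tmr_reliability(lam, t), math.exp(-lam * t))  # TMR vs. simplex
```

For short missions TMR beats a single channel by orders of magnitude, while for very long missions the masked-failure state is eventually exhausted; this crossover is exactly the kind of behavior the Markov models above quantify.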
Time Triggered Protocol (TTP) for Integrated Modular Avionics
NASA Technical Reports Server (NTRS)
Motzet, Guenter; Gwaltney, David A.; Bauer, Guenther; Jakovljevic, Mirko; Gagea, Leonard
2006-01-01
Traditional avionics computing systems are federated, with each system provided on a number of dedicated hardware units. Federated applications are physically separated from one another, and analysis of the systems is undertaken individually. Integrated Modular Avionics (IMA) takes these federated functions and integrates them on a common computing platform, in a tightly deterministic distributed real-time network of computing modules in which the different applications can run. IMA supports different levels of criticality in the same computing resource and provides a platform for implementation of fault tolerance through hardware and application redundancy. Modular implementation has distinct benefits in design, testing, and system maintainability. This paper covers the requirements for fault-tolerant bus systems used to provide reliable communication between IMA computing modules. An overview of the Time Triggered Protocol (TTP) specification and implementation as a reliable solution for IMA systems is presented. Application examples in aircraft avionics and a development system for future space application are covered. The commercially available TTP controller can also be implemented in an FPGA, and the results from implementation studies are covered. Finally, future directions for the application of TTP and related development activities are presented.
Design Trade-off Between Performance and Fault-Tolerance of Space Onboard Computers
NASA Astrophysics Data System (ADS)
Gorbunov, M. S.; Antonov, A. A.
2017-01-01
It is well known that there is a trade-off between performance and power consumption in onboard computers. Fault tolerance is another important factor affecting performance, chip area, and power consumption. Using special SRAM cells and error-correcting codes is often too expensive in relation to the performance needed. We discuss the possibility of finding optimal solutions for a modern onboard computer for scientific apparatus, focusing on multi-level cache memory design.
NASA Technical Reports Server (NTRS)
Haakensen, Erik Edward
1998-01-01
The desire for low-cost reliable computing is increasing. Most current fault-tolerant computing solutions are not very flexible, i.e., they cannot adapt to the reliability requirements of newly emerging applications in business, commerce, and manufacturing. It is important that users have a flexible, reliable platform to support both critical and noncritical applications. Chameleon, under development at the Center for Reliable and High-Performance Computing at the University of Illinois, is a software framework for supporting cost-effective, adaptable, networked fault-tolerant service. This thesis details a simulation of fault injection, detection, and recovery in Chameleon. The simulation was written in C++ using the DEPEND simulation library. The results obtained from the simulation included the amount of overhead incurred by the fault detection and recovery mechanisms supported by Chameleon. In addition, information about fault scenarios from which Chameleon cannot recover was gained. The results of the simulation showed that both critical and noncritical applications can be executed in the Chameleon environment with a fairly small amount of overhead. No single point of failure from which Chameleon could not recover was found. Chameleon was also found to be capable of recovering from several multiple-failure scenarios.
A forward view on reliable computers for flight control
NASA Technical Reports Server (NTRS)
Goldberg, J.; Wensley, J. H.
1976-01-01
The requirements for fault-tolerant computers for flight control of commercial aircraft are examined; it is concluded that the reliability requirements far exceed those typically quoted for space missions. Examination of circuit technology and alternative computer architectures indicates that the desired reliability can be achieved with several different computer structures, though there are obvious advantages to those that are more economic, more reliable, and, very importantly, more certifiable as to fault tolerance. Progress in this field is expected to bring about better computer systems that are more rigorously designed and analyzed even though computational requirements are expected to increase significantly.
Preliminary design of the redundant software experiment
NASA Technical Reports Server (NTRS)
Campbell, Roy; Deimel, Lionel; Eckhardt, Dave, Jr.; Kelly, John; Knight, John; Lauterbach, Linda; Lee, Larry; Mcallister, Dave; Mchugh, John
1985-01-01
The goal of the present experiment is to characterize the fault distributions of highly reliable software replicates, constructed using techniques and environments similar to those used in contemporary industrial software facilities. The fault distributions and their effect on the reliability of fault-tolerant configurations of the software will be determined through extensive life testing of the replicates against carefully constructed, randomly generated test data. Each detected error will be carefully analyzed to provide insight into its nature and cause. A direct objective is to develop techniques for reducing the intensity of coincident errors, thus increasing the reliability gain which can be achieved with fault tolerance. Data on the reliability gains realized and the cost of the fault-tolerant configurations can be used to design a companion experiment to determine the cost effectiveness of the fault-tolerant strategy. Finally, the data and analysis produced by this experiment will be valuable to the software engineering community as a whole because they will provide useful insight into the nature and cause of hard-to-find, subtle faults which escape standard software engineering validation techniques and thus persist far into the software life cycle.
Fault tolerant programmable digital attitude control electronics study
NASA Technical Reports Server (NTRS)
Sorensen, A. A.
1974-01-01
The attitude control electronics mechanization study to develop a fault tolerant autonomous concept for a three axis system is reported. Programmable digital electronics are compared to general purpose digital computers. The requirements, constraints, and tradeoffs are discussed. It is concluded that: (1) general fault tolerance can be achieved relatively economically, (2) recovery times of less than one second can be obtained, (3) the number of faulty behavior patterns must be limited, and (4) adjoined processes are the best indicators of faulty operation.
Experiments in fault tolerant software reliability
NASA Technical Reports Server (NTRS)
Mcallister, David F.; Tai, K. C.; Vouk, Mladen A.
1987-01-01
The reliability of voting was evaluated in a fault-tolerant software system for small output spaces. The effectiveness of the back-to-back testing process was investigated. Version 3.0 of the RSDIMU-ATS, a semi-automated test bed for certification testing of RSDIMU software, was prepared and distributed. Software reliability estimation methods based on non-random sampling are being studied. The investigation of existing fault-tolerance models was continued and formulation of new models was initiated.
Scalable cloud without dedicated storage
NASA Astrophysics Data System (ADS)
Batkovich, D. V.; Kompaniets, M. V.; Zarochentsev, A. K.
2015-05-01
We present a prototype of a scalable computing cloud. It is intended to be deployed on the basis of a cluster without the separate dedicated storage. The dedicated storage is replaced by the distributed software storage. In addition, all cluster nodes are used both as computing nodes and as storage nodes. This solution increases utilization of the cluster resources as well as improves fault tolerance and performance of the distributed storage. Another advantage of this solution is high scalability with a relatively low initial and maintenance cost. The solution is built on the basis of the open source components like OpenStack, CEPH, etc.
Impact of coverage on the reliability of a fault tolerant computer
NASA Technical Reports Server (NTRS)
Bavuso, S. J.
1975-01-01
A mathematical reliability model is established for a reconfigurable fault-tolerant avionic computer system utilizing state-of-the-art computers. System reliability is studied in light of the coverage probabilities associated with the first and second independent hardware failures. Coverage models are presented as a function of detection, isolation, and recovery probabilities. Upper and lower bounds are established for the coverage probabilities, and a method for computing values for the coverage probabilities is investigated. Further, an architectural variation is proposed which is shown to enhance coverage.
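A minimal sketch of how coverage enters such a model, under illustrative assumptions: for a duplex (two-channel) system with exponential channel failures, the system survives a mission if both channels work, or if exactly one fails and that failure is covered, i.e., detected, isolated, and recovered from with probability c.

```python
import math

def duplex_reliability(lam, t, c):
    """Reliability of a two-channel reconfigurable system with coverage c.

    Survive if both channels are up, or if exactly one has failed AND that
    failure was covered (detected, isolated, recovered) with probability c.
    """
    r = math.exp(-lam * t)          # single-channel reliability
    return r**2 + 2 * c * r * (1 - r)

# Coverage dominates: small changes in c swing the failure probability widely.
for c in (0.90, 0.99, 0.999):
    print(c, 1 - duplex_reliability(1e-4, 1000, c))
```

This is why the paper's bounds on coverage matter: in the high-reliability regime, uncovered first failures, not exhausted hardware, set the system failure probability.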
Cost and benefits design optimization model for fault tolerant flight control systems
NASA Technical Reports Server (NTRS)
Rose, J.
1982-01-01
Requirements and specifications for a method of optimizing the design of fault-tolerant flight control systems are provided. Algorithms that could be used for developing new and modifying existing computer programs are also provided, with recommendations for follow-on work.
Software reliability models for fault-tolerant avionics computers and related topics
NASA Technical Reports Server (NTRS)
Miller, Douglas R.
1987-01-01
Software reliability research is briefly described. General research topics are reliability growth models, quality of software reliability prediction, the complete monotonicity property of reliability growth, conceptual modelling of software failure behavior, assurance of ultrahigh reliability, and analysis techniques for fault-tolerant systems.
Error rates and resource overheads of encoded three-qubit gates
NASA Astrophysics Data System (ADS)
Takagi, Ryuji; Yoder, Theodore J.; Chuang, Isaac L.
2017-10-01
A non-Clifford gate is required for universal quantum computation, and, typically, this is the most error-prone and resource-intensive logical operation on an error-correcting code. Small, single-qubit rotations are popular choices for this non-Clifford gate, but certain three-qubit gates, such as Toffoli or controlled-controlled-Z (ccz), are equivalent options that are also more suited for implementing some quantum algorithms, for instance, those with coherent classical subroutines. Here, we calculate error rates and resource overheads for implementing logical ccz with pieceable fault tolerance, a nontransversal method for implementing logical gates. We provide a comparison with a nonlocal magic-state scheme on a concatenated code and a local magic-state scheme on the surface code. We find the pieceable fault-tolerance scheme particularly advantaged over magic states on concatenated codes and in certain regimes over magic states on the surface code. Our results suggest that pieceable fault tolerance is a promising candidate for fault tolerance in a near-future quantum computer.
Hyperswitch Communication Network Computer
NASA Technical Reports Server (NTRS)
Peterson, John C.; Chow, Edward T.; Priel, Moshe; Upchurch, Edwin T.
1993-01-01
Hyperswitch Communications Network (HCN) computer is prototype multiple-processor computer being developed. Incorporates improved version of hyperswitch communication network described in "Hyperswitch Network For Hypercube Computer" (NPO-16905). Designed to support high-level software and expansion of itself. HCN computer is message-passing, multiple-instruction/multiple-data computer offering significant advantages over older single-processor and bus-based multiple-processor computers with respect to price/performance ratio, reliability, availability, and manufacturing. Design of HCN operating-system software provides flexible computing environment accommodating both parallel and distributed processing. Also achieves balance among the following competing factors: performance in processing and communications, ease of use, and tolerance of (and recovery from) faults.
Proactive Fault Tolerance Using Preemptive Migration
DOE Office of Scientific and Technical Information (OSTI.GOV)
Engelmann, Christian; Vallee, Geoffroy R; Naughton, III, Thomas J
2009-01-01
Proactive fault tolerance (FT) in high-performance computing is a concept that prevents compute node failures from impacting running parallel applications by preemptively migrating application parts away from nodes that are about to fail. This paper provides a foundation for proactive FT by defining its architecture and classifying implementation options. This paper further relates prior work to the presented architecture and classification, and discusses the challenges ahead for needed supporting technologies.
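A compact sketch of the core policy, under assumed names and a hypothetical health predictor: when a node's predicted survival probability drops below a threshold, its application parts are migrated to the least-loaded healthy nodes before the failure occurs. This illustrates the preemptive-migration concept only, not a specific implementation option from the paper's classification.

```python
def migration_plan(nodes, health, threshold, load):
    """Illustrative proactive-FT policy: move work off nodes predicted to fail.

    health: node -> predicted probability of surviving the next interval.
    load:   node -> list of application parts currently hosted there.
    Returns a list of (task, from_node, to_node) migrations.
    Assumes at least one healthy node remains to receive the work.
    """
    failing = [n for n in nodes if health[n] < threshold]
    healthy = [n for n in nodes if health[n] >= threshold]
    plan = []
    for src in failing:
        for task in list(load[src]):
            dst = min(healthy, key=lambda n: len(load[n]))  # least-loaded target
            load[dst].append(task)
            load[src].remove(task)
            plan.append((task, src, dst))
    return plan

load = {"n1": ["a", "b"], "n2": ["c"], "n3": []}
health = {"n1": 0.4, "n2": 0.99, "n3": 0.98}   # n1 is predicted to fail
print(migration_plan(["n1", "n2", "n3"], health, 0.9, load))
```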
NASA Technical Reports Server (NTRS)
Divito, Ben L.; Butler, Ricky W.; Caldwell, James L.
1990-01-01
A high-level design is presented for a reliable computing platform for real-time control applications. Design tradeoffs and analyses related to the development of the fault-tolerant computing platform are discussed. The architecture is formalized and shown to satisfy a key correctness property. The reliable computing platform uses replicated processors and majority voting to achieve fault tolerance. Under the assumption of a majority of processors working in each frame, it is shown that the replicated system computes the same results as a single processor system not subject to failures. Sufficient conditions are obtained to establish that the replicated system recovers from transient faults within a bounded amount of time. Three different voting schemes are examined and proved to satisfy the bounded recovery time conditions.
The Design of Fault Tolerant Quantum Dot Cellular Automata Based Logic
NASA Technical Reports Server (NTRS)
Armstrong, C. Duane; Humphreys, William M.; Fijany, Amir
2002-01-01
As transistor geometries are reduced, quantum effects begin to dominate device performance. At some point, transistors cease to have the properties that make them useful computational components. New computing elements must be developed in order to keep pace with Moore's Law. Quantum dot cellular automata (QCA) represent an alternative paradigm to transistor-based logic. QCA architectures that are robust to manufacturing tolerances and defects must be developed. We are developing software that allows the exploration of fault-tolerant QCA gate architectures by automating the specification, simulation, analysis, and documentation processes.
Hua, Yongzhao; Dong, Xiwang; Li, Qingdong; Ren, Zhang
2017-11-01
This paper investigates the fault-tolerant time-varying formation control problems for high-order linear multi-agent systems in the presence of actuator failures. Firstly, a fully distributed formation control protocol is presented to compensate for the influences of both bias fault and loss of effectiveness fault. Using the adaptive online updating strategies, no global knowledge about the communication topology is required and the bounds of actuator failures can be unknown. Then an algorithm is proposed to determine the control parameters of the fault-tolerant formation protocol, where the time-varying formation feasible conditions and an approach to expand the feasible formation set are given. Furthermore, the stability of the proposed algorithm is proven based on the Lyapunov-like theory. Finally, two simulation examples are given to demonstrate the effectiveness of the theoretical results. Copyright © 2017 ISA. Published by Elsevier Ltd. All rights reserved.
A Primer on Architectural Level Fault Tolerance
NASA Technical Reports Server (NTRS)
Butler, Ricky W.
2008-01-01
This paper introduces the fundamental concepts of fault tolerant computing. Key topics covered are voting, fault detection, clock synchronization, Byzantine Agreement, diagnosis, and reliability analysis. Low level mechanisms such as Hamming codes or low level communications protocols are not covered. The paper is tutorial in nature and does not cover any topic in detail. The focus is on rationale and approach rather than detailed exposition.
Sequoia: A fault-tolerant tightly coupled multiprocessor for transaction processing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bernstein, P.A.
1988-02-01
The Sequoia computer is a tightly coupled multiprocessor, and thus attains the performance advantages of this style of architecture. It avoids most of the fault-tolerance disadvantages of tight coupling by using a new fault-tolerance design. The Sequoia architecture is similar to other multimicroprocessor architectures, such as those of Encore and Sequent, in that it gives dozens of microprocessors shared access to a large main memory. It resembles the Stratus architecture in its extensive use of hardware fault-detection techniques. It resembles Stratus and Auragen in its ability to quickly recover all processes after a single-point failure, transparently to the user. However, Sequoia is unique in its combination of a large-scale tightly coupled architecture with a hardware approach to fault tolerance. This article gives an overview of how the hardware architecture and operating system (OS) work together to provide a high degree of fault tolerance with good system performance.
Algorithm-Based Fault Tolerance for Numerical Subroutines
NASA Technical Reports Server (NTRS)
Tumon, Michael; Granat, Robert; Lou, John
2007-01-01
A software library implements a new methodology for detecting faults in numerical subroutines, thus enabling application programs that contain the subroutines to recover transparently from single-event upsets. The software library in question is fault-detecting middleware that is wrapped around the numerical subroutines. Conventional serial versions (based on LAPACK and FFTW) and a parallel version (based on ScaLAPACK) exist. The source code of the application program that contains the numerical subroutines is not modified, and the middleware is transparent to the user. The methodology used is a type of algorithm-based fault tolerance (ABFT). In ABFT, a checksum is computed before a computation and compared with the checksum of the computational result; an error is declared if the difference between the checksums exceeds some threshold. Novel normalization methods are used in the checksum comparison to ensure correct fault detection independent of algorithm inputs. In tests of this software reported in the peer-reviewed literature, the library was shown to enable detection of 99.9 percent of significant faults while generating no false alarms.
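The checksum idea is easy to demonstrate. The sketch below applies the generic ABFT column-checksum technique to a matrix-vector product; it illustrates the method in principle, but the function name and the fixed tolerance are assumptions, not this library's API (which, as noted above, uses more sophisticated normalization).

```python
def abft_matvec(A, x, tol=1e-9):
    """Matrix-vector product with an algorithm-based fault tolerance check.

    Augment A with a column-sum row; after computing y = A x, the sum of y
    must equal the column-sum row times x (up to roundoff). A mismatch flags
    a fault somewhere in the computation.
    """
    n = len(A)
    colsum = [sum(A[i][j] for i in range(n)) for j in range(len(A[0]))]
    y = [sum(A[i][j] * x[j] for j in range(len(x))) for i in range(n)]
    expected = sum(cs * xj for cs, xj in zip(colsum, x))
    if abs(sum(y) - expected) > tol:
        raise ArithmeticError("ABFT checksum mismatch: possible single-event upset")
    return y

A = [[1.0, 2.0], [3.0, 4.0]]
print(abft_matvec(A, [1.0, 1.0]))   # -> [3.0, 7.0], checksum verified
```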
Fault tolerance of artificial neural networks with applications in critical systems
NASA Technical Reports Server (NTRS)
Protzel, Peter W.; Palumbo, Daniel L.; Arras, Michael K.
1992-01-01
This paper investigates the fault tolerance characteristics of time-continuous recurrent artificial neural networks (ANNs) that can be used to solve optimization problems. The principle of operation and performance of these networks are first illustrated by using well-known model problems like the traveling salesman problem and the assignment problem. The ANNs are then subjected to 13 simultaneous 'stuck at 1' or 'stuck at 0' faults for network sizes of up to 900 'neurons'. The effects of these faults are demonstrated, and the cause of the observed fault tolerance is discussed. An application is presented in which a network performs a critical task for a real-time distributed processing system by generating new task allocations during the reconfiguration of the system. The performance degradation of the ANN in the presence of faults is investigated by large-scale simulations, and the potential benefits of delegating a critical task to a fault-tolerant network are discussed.
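A minimal fault-injection harness conveys the experimental setup. The Python sketch below forces stuck-at-0/1 outputs on randomly chosen units of a simple thresholded layer and counts corrupted outputs; it is an illustrative harness only, not the time-continuous recurrent networks studied in the paper.

```python
import random

def evaluate(neuron_inputs, stuck):
    """Evaluate threshold 'neurons', forcing stuck-at faults where injected.

    neuron_inputs: list of weighted-sum inputs, one per neuron.
    stuck: dict neuron_index -> 0 or 1 forcing that neuron's output.
    """
    out = []
    for i, s in enumerate(neuron_inputs):
        if i in stuck:
            out.append(stuck[i])            # stuck-at-0 / stuck-at-1 fault
        else:
            out.append(1 if s > 0 else 0)   # fault-free threshold unit
    return out

# Inject 13 simultaneous stuck-at faults into a 900-'neuron' layer and
# count how many outputs differ from the fault-free evaluation.
random.seed(1)
sums = [random.uniform(-1, 1) for _ in range(900)]
faults = {i: random.randint(0, 1) for i in random.sample(range(900), 13)}
clean, faulty = evaluate(sums, {}), evaluate(sums, faults)
print(sum(c != f for c, f in zip(clean, faulty)), "outputs corrupted of 900")
```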
NASA Technical Reports Server (NTRS)
Butler, Ricky W.; Johnson, Sally C.
1995-01-01
This paper presents a step-by-step tutorial of the methods and the tools that were used for the reliability analysis of fault-tolerant systems. The approach used in this paper is the Markov (or semi-Markov) state-space method. The paper is intended for design engineers with a basic understanding of computer architecture and fault tolerance, but little knowledge of reliability modeling. The representation of architectural features in mathematical models is emphasized. This paper does not present details of the mathematical solution of complex reliability models. Instead, it describes the use of several recently developed computer programs SURE, ASSIST, STEM, and PAWS that automate the generation and the solution of these models.
Won, Jongho; Ma, Chris Y. T.; Yau, David K. Y.; ...
2016-06-01
Smart meters are integral to demand response in emerging smart grids, by reporting the electricity consumption of users to serve application needs. But reporting real-time usage information for individual households raises privacy concerns. Existing techniques to guarantee differential privacy (DP) of smart meter users either are not fault tolerant or achieve (possibly partial) fault tolerance at high communication overheads. In this paper, we propose a fault-tolerant protocol for smart metering that can handle general communication failures while ensuring DP with significantly improved efficiency and lower errors compared with the state of the art. Our protocol handles fail-stop faults proactively by using a novel design of future ciphertexts, and distributes trust among the smart meters by sharing secret keys among them. We prove the DP properties of our protocol and analyze its advantages in fault tolerance, accuracy, and communication efficiency relative to competing techniques. We illustrate our analysis by simulations driven by real-world traces of electricity consumption.
A research program in empirical computer science
NASA Technical Reports Server (NTRS)
Knight, J. C.
1991-01-01
During the grant reporting period our primary activities have been to begin preparation for the establishment of a research program in experimental computer science. The focus of research in this program will be safety-critical systems. Many questions that arise in the effort to improve software dependability can only be addressed empirically. For example, there is no way to predict the performance of the various proposed approaches to building fault-tolerant software. Performance models, though valuable, are parameterized and cannot be used to make quantitative predictions without experimental determination of underlying distributions. In the past, experimentation has been able to shed some light on the practical benefits and limitations of software fault tolerance. It is common, also, for experimentation to reveal new questions or new aspects of problems that were previously unknown. A good example is the Consistent Comparison Problem that was revealed by experimentation and subsequently studied in depth. The result was a clear understanding of a previously unknown problem with software fault tolerance. The purpose of a research program in empirical computer science is to perform controlled experiments in the area of real-time, embedded control systems. The goal of the various experiments will be to determine better approaches to the construction of the software for computing systems that have to be relied upon. As such it will validate research concepts from other sources, provide new research results, and facilitate the transition of research results from concepts to practical procedures that can be applied with low risk to NASA flight projects. The target of experimentation will be the production software development activities undertaken by any organization prepared to contribute to the research program. Experimental goals, procedures, data analysis and result reporting will be performed for the most part by the University of Virginia.
Advanced Information Processing System (AIPS)
NASA Technical Reports Server (NTRS)
Pitts, Felix L.
1993-01-01
Advanced Information Processing System (AIPS) is a computer systems philosophy, a set of validated hardware building blocks, and a set of validated services as embodied in system software. The goal of AIPS is to provide the knowledge base which will allow achievement of validated fault-tolerant distributed computer system architectures, suitable for a broad range of applications, having failure probability requirements of 10E-9 at 10 hours. A background and description is given, followed by program accomplishments, the current focus, applications, technology transfer, FY92 accomplishments, and funding.
Verifiable fault tolerance in measurement-based quantum computation
NASA Astrophysics Data System (ADS)
Fujii, Keisuke; Hayashi, Masahito
2017-09-01
Quantum systems, in general, cannot be simulated efficiently by a classical computer, and hence are useful for solving certain mathematical problems and simulating quantum many-body systems. This also implies, unfortunately, that verification of the output of the quantum systems is not so trivial, since predicting the output is exponentially hard. As another problem, a quantum system is very sensitive to noise and thus needs error correction. Here, we propose a framework for verification of the output of fault-tolerant quantum computation in a measurement-based model. In contrast to existing analyses on fault tolerance, we do not assume any noise model on the resource state; rather, an arbitrary resource state is tested by using only single-qubit measurements to verify whether or not the output of measurement-based quantum computation on it is correct. Verifiability is provided by a constant-time repetition of the original measurement-based quantum computation in appropriate measurement bases. Since full characterization of quantum noise is exponentially hard for large-scale quantum computing systems, our framework provides an efficient way to practically verify the experimental quantum error correction.
Design of a fault tolerant airborne digital computer. Volume 1: Architecture
NASA Technical Reports Server (NTRS)
Wensley, J. H.; Levitt, K. N.; Green, M. W.; Goldberg, J.; Neumann, P. G.
1973-01-01
This volume is concerned with the architecture of a fault-tolerant digital computer for an advanced commercial aircraft. All of the computations of the aircraft, including those presently carried out by analogue techniques, are to be carried out in this digital computer. Among the important qualities of the computer are the following: (1) The capacity is to be matched to the aircraft environment. (2) The reliability is to be selectively matched to the criticality and deadline requirements of each of the computations. (3) The system is to be readily expandable and contractible. (4) The design is to be appropriate to post-1975 technology. Three candidate architectures are discussed and assessed in terms of the above qualities. Of the three candidates, a newly conceived architecture, Software Implemented Fault Tolerance (SIFT), provides the best match to the above qualities. In addition, SIFT is particularly simple and believable. The other candidates, the Bus Checker System (BUCS), also newly conceived in this project, and the Hopkins multiprocessor, are potentially more efficient than SIFT in the use of redundancy, but otherwise are not as attractive.
Robust Routing Protocol For Digital Messages
NASA Technical Reports Server (NTRS)
Marvit, Maclen
1994-01-01
Refinement of digital-message-routing protocol increases fault tolerance of polled networks. AbNET-3 is latest of generic AbNET protocols for transmission of messages among computing nodes. AbNET concept described in "Multiple-Ring Digital Communication Network" (NPO-18133). Specifically aimed at increasing fault tolerance of network in broadcast mode, in which one node broadcasts message to and receives responses from all other nodes. Communication in network of computers maintained even when links fail.
Room temperature high-fidelity holonomic single-qubit gate on a solid-state spin.
Arroyo-Camejo, Silvia; Lazariev, Andrii; Hell, Stefan W; Balasubramanian, Gopalakrishnan
2014-09-12
At its most fundamental level, circuit-based quantum computation relies on the application of controlled phase shift operations on quantum registers. While these operations are generally compromised by noise and imperfections, quantum gates based on geometric phase shifts can provide intrinsically fault-tolerant quantum computing. Here we demonstrate the high-fidelity realization of a recently proposed fast (non-adiabatic) and universal (non-Abelian) holonomic single-qubit gate, using an individual solid-state spin qubit under ambient conditions. This fault-tolerant quantum gate provides an elegant means for achieving the fidelity threshold indispensable for implementing quantum error correction protocols. Since we employ a spin qubit associated with a nitrogen-vacancy colour centre in diamond, this system is based on integrable and scalable hardware exhibiting strong analogy to current silicon technology. This quantum gate realization is a promising step towards viable, fault-tolerant quantum computing under ambient conditions.
Fault-tolerant clock synchronization validation methodology. [in computer systems
NASA Technical Reports Server (NTRS)
Butler, Ricky W.; Palumbo, Daniel L.; Johnson, Sally C.
1987-01-01
A validation method for the synchronization subsystem of a fault-tolerant computer system is presented. The high reliability requirement of flight-crucial systems precludes the use of most traditional validation methods. The method presented utilizes formal design proof to uncover design and coding errors and experimentation to validate the assumptions of the design proof. The experimental method is described and illustrated by validating the clock synchronization system of the Software Implemented Fault Tolerance computer. The design proof of the algorithm includes a theorem that defines the maximum skew between any two nonfaulty clocks in the system in terms of specific system parameters. Most of these parameters are deterministic. One crucial parameter is the upper bound on the clock read error, which is stochastic. The probability that this upper bound is exceeded is calculated from data obtained by the measurement of system parameters. This probability is then included in a detailed reliability analysis of the system.
A fault-tolerant intelligent robotic control system
NASA Technical Reports Server (NTRS)
Marzwell, Neville I.; Tso, Kam Sing
1993-01-01
This paper describes the concept, design, and features of a fault-tolerant intelligent robotic control system being developed for space and commercial applications that require high dependability. The comprehensive strategy integrates system-level hardware/software fault tolerance with task-level handling of uncertainties and unexpected events for robotic control. The underlying architecture for system-level fault tolerance is the distributed recovery block, which protects against application software, system software, hardware, and network failures. Task-level fault tolerance provisions are implemented in a knowledge-based system which utilizes advanced automation techniques such as rule-based and model-based reasoning to monitor, diagnose, and recover from unexpected events. The two-level design provides tolerance of two or more faults occurring serially at any level of command, control, sensing, or actuation. The potential benefits of such a fault-tolerant robotic control system include: (1) a minimized potential for damage to humans, the work site, and the robot itself; (2) continuous operation with a minimum of uncommanded motion in the presence of failures; and (3) more reliable autonomous operation providing increased efficiency in the execution of robotic tasks and decreased demand on human operators for controlling and monitoring the robotic servicing routines.
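The distributed recovery block mentioned above builds on the classic recovery-block pattern. The following is a minimal single-node sketch of that pattern under stated assumptions (the routines and acceptance test are hypothetical), not the paper's implementation:

```python
# Recovery-block sketch: try the primary routine, check its output with an
# acceptance test, and fall back to an alternate routine on failure.
import math

def recovery_block(primary, alternates, acceptance_test, *args):
    for routine in (primary, *alternates):
        try:
            result = routine(*args)
        except Exception:
            continue                  # a crash counts as a failed attempt
        if acceptance_test(result):
            return result             # first acceptable result wins
    raise RuntimeError("all alternates exhausted")

# Hypothetical example: a solver with a safe fallback.
value = recovery_block(
    math.sqrt,                        # primary (raises for x < 0)
    [lambda x: math.sqrt(abs(x))],    # alternate
    lambda r: r >= 0.0,               # acceptance test
    -4.0,
)
print(value)  # 2.0, produced by the alternate
```

In the distributed variant, the primary and alternate execute on different nodes, so the same pattern also masks hardware and network failures.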
Characterization of the faulted behavior of digital computers and fault tolerant systems
NASA Technical Reports Server (NTRS)
Bavuso, Salvatore J.; Miner, Paul S.
1989-01-01
A development status evaluation is presented for efforts conducted at NASA-Langley since 1977, toward the characterization of the latent fault in digital fault-tolerant systems. Attention is given to the practical, high speed, generalized gate-level logic system simulator developed, as well as to the validation methodology used for the simulator, on the basis of faultable software and hardware simulations employing a prototype MIL-STD-1750A processor. After validation, latency tests will be performed.
Idris, Hajara; Junaidu, Sahalu B.; Adewumi, Aderemi O.
2017-01-01
The Grid scheduler schedules user jobs on the best available resource in terms of resource characteristics by optimizing job execution time. Resource failure in a Grid is no longer an exception but a regularly occurring event, as resources are increasingly being used by the scientific community to solve computationally intensive problems which typically run for days or even months. It is therefore absolutely essential that these long-running applications are able to tolerate failures and avoid re-computations from scratch after a resource failure has occurred, to satisfy the user’s Quality of Service (QoS) requirement. Job Scheduling with Fault Tolerance in Grid Computing using Ant Colony Optimization is proposed to ensure that jobs are executed successfully even when resource failure has occurred. The technique employed in this paper is the use of resource failure rate, as well as a checkpoint-based rollback recovery strategy. Checkpointing aims at reducing the amount of work that is lost upon failure of the system by immediately saving the state of the system. A comparison of the proposed approach with an existing Ant Colony Optimization (ACO) algorithm is discussed. The experimental results of the implemented fault tolerance scheduling algorithm show that there is an improvement in the user’s QoS requirement over the existing ACO algorithm, which has no fault tolerance integrated in it. The performance evaluation of the two algorithms was measured in terms of the three main scheduling performance metrics: makespan, throughput, and average turnaround time. PMID:28545075
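A minimal sketch of the checkpoint-based rollback recovery idea used here, with an illustrative failure model, job state, and file name (none taken from the paper), might look like:

```python
# Checkpoint/rollback sketch: persist job state periodically so that, after
# a simulated resource failure, work resumes from the last checkpoint
# instead of restarting from scratch.
import pickle
import random

def run_with_checkpoints(steps, every, failure_rate, path="job.ckpt"):
    with open(path, "wb") as f:               # initial checkpoint
        pickle.dump((0, {"sum": 0}), f)
    start, state = 0, {"sum": 0}
    while True:
        try:
            for i in range(start, steps):
                if random.random() < failure_rate:
                    raise RuntimeError(f"simulated failure at step {i}")
                state["sum"] += i
                if (i + 1) % every == 0:      # save progress
                    with open(path, "wb") as f:
                        pickle.dump((i + 1, state), f)
            return state
        except RuntimeError:                  # roll back, do not restart
            with open(path, "rb") as f:
                start, state = pickle.load(f)

random.seed(7)
print(run_with_checkpoints(1_000, every=100, failure_rate=0.001))  # {'sum': 499500}
```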
Fault Mitigation Schemes for Future Spaceflight Multicore Processors
NASA Technical Reports Server (NTRS)
Alexander, James W.; Clement, Bradley J.; Gostelow, Kim P.; Lai, John Y.
2012-01-01
Future planetary exploration missions demand significant advances in on-board computing capabilities over current avionics architectures based on a single-core processing element. The state-of-the-art multi-core processor provides much promise in meeting such challenges while introducing new fault tolerance problems when applied to space missions. Software-based schemes are being presented in this paper that can achieve system-level fault mitigation beyond that provided by radiation-hard-by-design (RHBD). For mission and time critical applications such as the Terrain Relative Navigation (TRN) for planetary or small body navigation, and landing, a range of fault tolerance methods can be adapted by the application. The software methods being investigated include Error Correction Code (ECC) for data packet routing between cores, virtual network routing, Triple Modular Redundancy (TMR), and Algorithm-Based Fault Tolerance (ABFT). A robust fault tolerance framework that provides fail-operational behavior under hard real-time constraints and graceful degradation will be demonstrated using TRN executing on a commercial Tilera(R) processor with simulated fault injections.
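Of the methods listed, software TMR is the simplest to sketch. The following illustrative Python fragment (not the flight software; the replicas and fault are simulated) runs three replicas of a computation and outvotes a single faulty result:

```python
# Software TMR sketch: run three replicas and majority-vote their outputs,
# so one corrupted result is outvoted by the two agreeing replicas.
from collections import Counter

def tmr(replicas, *args):
    """Majority vote over three independently computed results."""
    results = [r(*args) for r in replicas]
    winner, votes = Counter(results).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority: replicas disagree pairwise")
    return winner

# Hypothetical replicas; the second is corrupted by a simulated upset.
f = lambda x: x * x
faulty = lambda x: x * x + 1          # bit-flip-like corruption
print(tmr([f, faulty, f], 12))        # 144: the faulty replica is outvoted
```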
Coordinated Fault-Tolerance for High-Performance Computing Final Project Report
DOE Office of Scientific and Technical Information (OSTI.GOV)
Panda, Dhabaleswar Kumar; Beckman, Pete
2011-07-28
With the Coordinated Infrastructure for Fault Tolerance Systems (CIFTS, as the original project came to be called) project, our aim has been to understand and tackle the following broad research questions, the answers to which will help the HEC community analyze and shape the direction of research in the field of fault tolerance and resiliency on future high-end leadership systems. Will availability of global fault information, obtained by fault information exchange between the different HEC software on a system, allow individual system software to better detect, diagnose, and adaptively respond to faults? If fault-awareness is raised throughout the system through fault information exchange, is it possible to get all system software working together to provide a more comprehensive end-to-end fault management on the system? What are the missing fault-tolerance features that widely used HEC system software lacks today that would inhibit such software from taking advantage of systemwide global fault information? What are the practical limitations of a systemwide approach for end-to-end fault management based on fault awareness and coordination? What mechanisms, tools, and technologies are needed to bring about fault awareness and coordination of responses on a leadership-class system? What standards, outreach, and community interaction are needed for adoption of the concept of fault awareness and coordination for fault management on future systems? Keeping our overall objectives in mind, the CIFTS team has taken a parallel fourfold approach. Our central goal was to design and implement a lightweight, scalable infrastructure with a simple, standardized interface to allow communication of fault-related information through the system and facilitate coordinated responses. This work led to the development of the Fault Tolerance Backplane (FTB) publish-subscribe API specification, together with a reference implementation and several experimental implementations on top of existing publish-subscribe tools. We enhanced the intrinsic fault tolerance capabilities of representative implementations of a variety of key HPC software subsystems and integrated them with the FTB. Targeted software subsystems included MPI communication libraries, checkpoint/restart libraries, resource managers and job schedulers, and system monitoring tools. Leveraging the aforementioned infrastructure, as well as developing and utilizing additional tools, we have examined issues associated with expanded, end-to-end fault response from both system and application viewpoints. From the standpoint of system operations, we have investigated log and root cause analysis, anomaly detection and fault prediction, and generalized notification mechanisms. Our applications work has included libraries for fault-tolerant linear algebra, application frameworks for coupled multiphysics applications, and external frameworks to support the monitoring and response for general applications. Our final goal was to engage the high-end computing community to increase awareness of tools and issues around coordinated end-to-end fault management.
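The FTB concept can be pictured as a small publish-subscribe bus. The sketch below is illustrative only and does not reproduce the actual FTB API; the class, topic names, and handlers are assumptions:

```python
# Publish-subscribe fault-notification sketch: components publish fault
# events on named topics; subscribed components react in a coordinated way.
from collections import defaultdict

class FaultBackplane:
    def __init__(self):
        self._subs = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subs[topic].append(handler)

    def publish(self, topic, event):
        for handler in self._subs[topic]:   # fan out to all subscribers
            handler(event)

bus = FaultBackplane()
# Hypothetical subsystems coordinating a response to a node failure.
bus.subscribe("node.failure", lambda e: print("scheduler: drain", e["node"]))
bus.subscribe("node.failure", lambda e: print("mpi: shrink communicator"))
bus.publish("node.failure", {"node": "n042", "cause": "heartbeat timeout"})
```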
Fault tolerant computing: A preamble for assuring viability of large computer systems
NASA Technical Reports Server (NTRS)
Lim, R. S.
1977-01-01
The need for fault-tolerant computing is addressed from the viewpoints of (1) why it is needed, (2) how to apply it in the current state of technology, and (3) what it means in the context of the Phoenix computer system and other related systems. To this end, the value of concurrent error detection and correction is described. User protection, program retry, and repair are among the factors considered. The technology of algebraic codes to protect memory systems and arithmetic codes to protect arithmetic operations is discussed.
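Arithmetic codes of the kind mentioned can be illustrated with a simple AN code: operands are scaled by a constant A, so every valid sum remains a multiple of A and a residue check flags faults. A minimal sketch (A = 3 is chosen for illustration, not taken from the report):

```python
# AN-code sketch: encode x as A*x; a sum that is not a multiple of A
# reveals an arithmetic fault via a residue check.
A = 3  # code constant

def encode(x):
    return A * x

def checked_add(cx, cy):
    s = cx + cy
    if s % A != 0:                    # residue check
        raise RuntimeError("arithmetic fault detected")
    return s

a, b = encode(17), encode(25)
assert checked_add(a, b) // A == 42   # fault-free addition decodes correctly
try:
    checked_add(a, b ^ 1)             # simulate a single-bit fault
except RuntimeError as e:
    print(e)
```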
Advanced reliability modeling of fault-tolerant computer-based systems
NASA Technical Reports Server (NTRS)
Bavuso, S. J.
1982-01-01
Two methodologies for the reliability assessment of fault tolerant digital computer based systems are discussed. The computer-aided reliability estimation 3 (CARE 3) and gate logic software simulation (GLOSS) are assessment technologies that were developed to mitigate a serious weakness in the design and evaluation process of ultrareliable digital systems. The weak link is based on the unavailability of a sufficiently powerful modeling technique for comparing the stochastic attributes of one system against others. Some of the more interesting attributes are reliability, system survival, safety, and mission success.
Using concatenated quantum codes for universal fault-tolerant quantum gates.
Jochym-O'Connor, Tomas; Laflamme, Raymond
2014-01-10
We propose a method for universal fault-tolerant quantum computation using concatenated quantum error correcting codes. The concatenation scheme exploits the transversal properties of two different codes, combining them to provide a means to protect against low-weight arbitrary errors. We give the required properties of the error correcting codes to ensure universal fault tolerance and discuss a particular example using the 7-qubit Steane and 15-qubit Reed-Muller codes. Namely, other than computational basis state preparation as required by the DiVincenzo criteria, our scheme requires no special ancillary state preparation to achieve universality, as opposed to schemes such as magic state distillation. We believe that optimizing the codes used in such a scheme could provide a useful alternative to state distillation schemes that exhibit high overhead costs.
A fault tolerant gait for a hexapod robot over uneven terrain.
Yang, J M; Kim, J H
2000-01-01
The fault tolerant gait of legged robots in static walking is a gait which maintains its stability against a fault event preventing a leg from having the support state. In this paper, a fault tolerant quadruped gait is proposed for a hexapod traversing uneven terrain with forbidden regions, which do not offer viable footholds but can be stepped over. By comparing the performance of straight-line motion and crab walking over even terrain, it is shown that the proposed gait has better mobility and terrain adaptability than previously developed gaits. Based on the proposed gait, we present a method for the generation of the fault tolerant locomotion of a hexapod over uneven terrain with forbidden regions. The proposed method minimizes the number of legs on the ground during walking, and a foot adjustment algorithm is used for avoiding steps on forbidden regions. The effectiveness of the proposed strategy over uneven terrain is demonstrated with a computer simulation.
Low-Power Fault Tolerance for Spacecraft FPGA-Based Numerical Computing
2006-09-01
… undesirable, are not necessarily harmful. Our intent is to prevent errors by properly managing faults. This research focuses on developing fault-tolerant …
NASA Technical Reports Server (NTRS)
Harper, Richard E.; Babikyan, Carol A.; Butler, Bryan P.; Clasen, Robert J.; Harris, Chris H.; Lala, Jaynarayan H.; Masotto, Thomas K.; Nagle, Gail A.; Prizant, Mark J.; Treadwell, Steven
1994-01-01
The Army Avionics Research and Development Activity (AVRADA) is pursuing programs that would enable effective and efficient management of the large amounts of situational data that occur during tactical rotorcraft missions. The Computer Aided Low Altitude Night Helicopter Flight Program has identified automated Terrain Following/Terrain Avoidance, Nap of the Earth (TF/TA, NOE) operation as a key enabling technology for advanced tactical rotorcraft to enhance mission survivability and mission effectiveness. The processing of critical information at low altitudes with short reaction times is life-critical and mission-critical, necessitating an ultra-reliable/high-throughput computing platform for dependable service for flight control, fusion of sensor data, route planning, near-field/far-field navigation, and obstacle avoidance operations. To address these needs, the Army Fault Tolerant Architecture (AFTA) is being designed and developed. This computer system is based upon the Fault Tolerant Parallel Processor (FTPP) developed by Charles Stark Draper Labs (CSDL). AFTA is a hard real-time, Byzantine fault-tolerant parallel processor which is programmed in the Ada language. This document describes the results of the Detailed Design (Phases 2 and 3 of a 3-year project) of the AFTA development and contains detailed descriptions of the program objectives, the TF/TA NOE application requirements, architecture, hardware design, operating systems design, systems performance measurements, and analytical models.
General linear codes for fault-tolerant matrix operations on processor arrays
NASA Technical Reports Server (NTRS)
Nair, V. S. S.; Abraham, J. A.
1988-01-01
Various checksum codes have been suggested for fault-tolerant matrix computations on processor arrays. Use of these codes is limited due to potential roundoff and overflow errors. Numerical errors may also be misconstrued as errors due to physical faults in the system. In this paper, a set of linear codes is identified which can be used for fault-tolerant matrix operations such as matrix addition, multiplication, transposition, and LU-decomposition, with minimum numerical error. Encoding schemes are given for some of the example codes which fall under the general set of codes. With the help of experiments, a rule of thumb for the selection of a particular code for a given application is derived.
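The flavor of such checksum schemes is easy to sketch: append checksums to the operands so that the checksums of the product verify the result, with a tolerance to absorb the roundoff the paper warns about. This is an illustrative sketch of checksum-encoded multiplication, not the paper's code set:

```python
# Checksum-encoded matrix multiply: a column-checksum A times a
# row-checksum B yields a full-checksum product whose margins verify it.
import numpy as np

def checked_matmul(A, B, tol=1e-9):
    Af = np.vstack([A, A.sum(axis=0, keepdims=True)])   # column-checksum A
    Bf = np.hstack([B, B.sum(axis=1, keepdims=True)])   # row-checksum B
    C = Af @ Bf                                         # full-checksum product
    data = C[:-1, :-1]
    if not (np.allclose(C[-1, :-1], data.sum(axis=0), atol=tol)
            and np.allclose(C[:-1, -1], data.sum(axis=1), atol=tol)):
        raise RuntimeError("checksum mismatch: fault detected in multiply")
    return data

rng = np.random.default_rng(0)
A, B = rng.random((4, 4)), rng.random((4, 4))
assert np.allclose(checked_matmul(A, B), A @ B)
```

The tolerance parameter is exactly where the roundoff-versus-fault ambiguity mentioned in the abstract shows up: set it too tight and numerical error masquerades as a physical fault.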
On the design of fault-tolerant robotic manipulator systems
NASA Technical Reports Server (NTRS)
Tesar, Delbert
1993-01-01
Robotic systems are finding increasing use in space applications. Many of these devices are going to be operational on board the Space Station Freedom. Fault tolerance has been deemed necessary because of the criticality of the tasks and the inaccessibility of the systems to maintenance and repair. Design for fault tolerance in manipulator systems is an area within robotics that is without precedence in the literature. In this paper, we will attempt to lay down the foundations for such a technology. Design for fault tolerance demands new and special approaches to design, often at considerable variance from established design practices. These design aspects, together with reliability evaluation and modeling tools, are presented. Mechanical architectures that employ protective redundancies at many levels and have a modular architecture are then studied in detail. Once a mechanical architecture for fault tolerance has been derived, the chronological stages of operational fault tolerance are investigated. Failure detection, isolation, and estimation methods are surveyed, and such methods for robot sensors and actuators are derived. Failure recovery methods are also presented for each of the protective layers of redundancy. Failure recovery tactics often span all of the layers of a control hierarchy. Thus, a unified framework for decision-making and control, which orchestrates both the nominal redundancy management tasks and the failure management tasks, has been derived. The well-developed field of fault-tolerant computers is studied next, and some design principles relevant to the design of fault-tolerant robot controllers are abstracted. Conclusions are drawn, and a road map for the design of fault-tolerant manipulator systems is laid out with recommendations for a 10 DOF arm with dual actuators at each joint.
Adding Fault Tolerance to NPB Benchmarks Using ULFM
DOE Office of Scientific and Technical Information (OSTI.GOV)
Parchman, Zachary W; Vallee, Geoffroy R; Naughton III, Thomas J
2016-01-01
In the world of high-performance computing, fault tolerance and application resilience are becoming some of the primary concerns because of increasing hardware failures and memory corruptions. While the research community has been investigating various options, from system-level solutions to application-level solutions, standards such as the Message Passing Interface (MPI) are also starting to include such capabilities. The current proposal for MPI fault tolerant is centered around the User-Level Failure Mitigation (ULFM) concept, which provides means for fault detection and recovery of the MPI layer. This approach does not address application-level recovery, which is currently left to application developers. In thismore » work, we present a mod- ification of some of the benchmarks of the NAS parallel benchmark (NPB) to include support of the ULFM capabilities as well as application-level strategies and mechanisms for application-level failure recovery. As such, we present: (i) an application-level library to checkpoint and restore data, (ii) extensions of NPB benchmarks for fault tolerance based on different strategies, (iii) a fault injection tool, and (iv) some preliminary results that show the impact of such fault tolerant strategies on the application execution.« less
Evaluation of reliability modeling tools for advanced fault tolerant systems
NASA Technical Reports Server (NTRS)
Baker, Robert; Scheper, Charlotte
1986-01-01
The Computer Aided Reliability Estimation (CARE III) and Automated Reliability Interactive Estimation System (ARIES 82) reliability tools were evaluated for application to advanced fault-tolerant aerospace systems. To determine reliability modeling requirements, the evaluation focused on the Draper Laboratories' Advanced Information Processing System (AIPS) architecture as an example architecture for fault-tolerant aerospace systems. Advantages and limitations were identified for each reliability evaluation tool. The CARE III program was designed primarily for analyzing ultrareliable flight control systems. The ARIES 82 program's primary use was to support university research and teaching. Both CARE III and ARIES 82 were not suited for determining the reliability of complex nodal networks of the type used to interconnect processing sites in the AIPS architecture. It was concluded that ARIES was not suitable for modeling advanced fault-tolerant systems. It was further concluded that, subject to some limitations (the difficulty in modeling systems with unpowered spare modules, systems where equipment maintenance must be considered, systems where failure depends on the sequence in which faults occurred, and systems where multiple faults beyond double near-coincident faults must be considered), CARE III is best suited for evaluating the reliability of advanced fault-tolerant systems for air transport.
Sputnik: ad hoc distributed computation.
Völkel, Gunnar; Lausser, Ludwig; Schmid, Florian; Kraus, Johann M; Kestler, Hans A
2015-04-15
In bioinformatic applications, computationally demanding algorithms are often parallelized to speed up computation. Nevertheless, setting up computational environments for distributed computation is often tedious. The aim of this project was lightweight, ad hoc setup of fault-tolerant computation requiring only a Java runtime and no administrator rights, while utilizing all CPU cores as effectively as possible. The Sputnik framework provides ad hoc distributed computation on the Java Virtual Machine and fully uses all supplied CPU cores. It provides a graphical user interface for deployment setup and a web user interface displaying the current status of computation jobs. Neither a permanent setup nor administrator privileges are required. We demonstrate the utility of our approach on feature selection of microarray data. The Sputnik framework is available on Github http://github.com/sysbio-bioinf/sputnik under the Eclipse Public License. Contact: hkestler@fli-leibniz.de or hans.kestler@uni-ulm.de. Supplementary data are available at Bioinformatics online.
NASA Astrophysics Data System (ADS)
Devaraj, Rajesh; Sarkar, Arnab; Biswas, Santosh
2015-11-01
In the article 'Supervisory control for fault-tolerant scheduling of real-time multiprocessor systems with aperiodic tasks', Park and Cho presented a systematic way of computing a largest fault-tolerant and schedulable language that provides information on whether the scheduler (i.e., supervisor) should accept or reject a newly arrived aperiodic task. The computation of such a language is mainly dependent on the task execution model presented in their paper. However, the task execution model is unable to capture the situation when the fault of a processor occurs even before the task has arrived. Consequently, under a task execution model that does not capture this fact, a task may be assigned for execution on a faulty processor. This problem is illustrated with an appropriate example, and the task execution model of Park and Cho is then modified to enforce the requirement that none of the tasks are assigned for execution on a faulty processor.
NASA Technical Reports Server (NTRS)
Srivas, Mandayam; Bickford, Mark
1991-01-01
The design and formal verification of a hardware system for a task that is an important component of a fault tolerant computer architecture for flight control systems is presented. The hardware system implements an algorithm for obtaining interactive consistency (Byzantine agreement) among four microprocessors as a special instruction on the processors. The property verified ensures that an execution of the special instruction by the processors correctly accomplishes interactive consistency, provided certain preconditions hold. An assumption is made that the processors execute synchronously. For verification, the authors used a computer-aided hardware design verification tool, Spectool, and the theorem prover, Clio. A major contribution of the work is the demonstration of a significant fault tolerant hardware design that is mechanically verified by a theorem prover.
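A software simulation conveys what the special instruction accomplishes. The sketch below is illustrative, not the verified hardware: it runs one exchange-and-vote round of interactive consistency among four synchronous nodes, adopting a default value when a faulty transmitter prevents a majority. All names and the fault model are assumptions:

```python
# Interactive consistency among N=4 nodes tolerating one arbitrary fault:
# each source broadcasts, receivers relay what they heard, and every node
# majority-votes per source (falling back to DEFAULT on no majority).
from collections import Counter

N = 4
DEFAULT = 0          # adopted when a faulty source yields no majority

def send(src, dst, value, faulty):
    # A faulty sender may transmit different values to different receivers.
    return value + dst + 1 if src in faulty else value

def interactive_consistency(private, faulty=frozenset()):
    vectors = {}
    for node in range(N):
        vectors[node] = {}
        for src in range(N):
            direct = send(src, node, private[src], faulty)
            relays = [send(r, node, send(src, r, private[src], faulty), faulty)
                      for r in range(N) if r not in (src, node)]
            value, votes = Counter([direct] + relays).most_common(1)[0]
            vectors[node][src] = value if votes >= 2 else DEFAULT
    return vectors

vectors = interactive_consistency([10, 20, 30, 40], faulty={2})
healthy = [vectors[n] for n in range(N) if n != 2]
assert all(v == healthy[0] for v in healthy)   # all non-faulty nodes agree
print(healthy[0])   # {0: 10, 1: 20, 2: 0, 3: 40}
```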
Design and Implementation of a Distributed Version of the NASA Engine Performance Program
NASA Technical Reports Server (NTRS)
Cours, Jeffrey T.
1994-01-01
Distributed NEPP is a new version of the NASA Engine Performance Program that runs in parallel on a collection of Unix workstations connected through a network. The program is fault-tolerant, efficient, and shows significant speed-up in a multi-user, heterogeneous environment. This report describes the issues involved in designing distributed NEPP, the algorithms the program uses, and the performance distributed NEPP achieves. It develops an analytical model to predict and measure the performance of the simple distribution, multiple distribution, and fault-tolerant distribution algorithms that distributed NEPP incorporates. Finally, the appendices explain how to use distributed NEPP and document the organization of the program's source code.
The use of automatic programming techniques for fault tolerant computing systems
NASA Technical Reports Server (NTRS)
Wild, C.
1985-01-01
It is conjectured that the production of software for ultra-reliable computing systems such as required by Space Station, aircraft, nuclear power plants and the like will require a high degree of automation as well as fault tolerance. In this paper, the relationship between automatic programming techniques and fault tolerant computing systems is explored. Initial efforts in the automatic synthesis of code from assertions to be used for error detection as well as the automatic generation of assertions and test cases from abstract data type specifications is outlined. Speculation on the ability to generate truly diverse designs capable of recovery from errors by exploring alternate paths in the program synthesis tree is discussed. Some initial thoughts on the use of knowledge based systems for the global detection of abnormal behavior using expectations and the goal-directed reconfiguration of resources to meet critical mission objectives are given. One of the sources of information for these systems would be the knowledge captured during the automatic programming process.
A fault-tolerant avionics suite for an entry research vehicle
NASA Technical Reports Server (NTRS)
Dzwonczyk, Mark; Stone, Howard
1988-01-01
A highly-reliable avionics suite has been designed for an Entry Research Vehicle. The autonomous spacecraft would be deployed from the Space Shuttle Orbiter and perform a variety of aerodynamic and propulsive maneuvers which may be required for future space transportation system vehicles. The flight electronics consist of a central fault-tolerant processor, which is resilient to all first failures, reliably cross-strapped to redundant and distributed sets of sensors and effectors. This paper describes the preliminary design and analysis of the architecture which resulted from a fifteen month study by the Charles Stark Draper Laboratory for the NASA Langley Research Center. After a brief introduction to the design task, the architecture of the central flight computer and its interface to the vehicle are discussed. Following this, the method and results of the baseline reliability study for the avionic suite are presented.
A fault-tolerant avionics suite for an entry research vehicle
NASA Astrophysics Data System (ADS)
Dzwonczyk, Mark; Stone, Howard
A highly-reliable avionics suite has been designed for an Entry Research Vehicle. The autonomous spacecraft would be deployed from the Space Shuttle Orbiter and perform a variety of aerodynamic and propulsive maneuvers which may be required for future space transportation system vehicles. The flight electronics consist of a central fault-tolerant processor, which is resilient to all first failures, reliably cross-strapped to redundant and distributed sets of sensors and effectors. This paper describes the preliminary design and analysis of the architecture which resulted from a fifteen month study by the Charles Stark Draper Laboratory for the NASA Langley Research Center. After a brief introduction to the design task, the architecture of the central flight computer and its interface to the vehicle are discussed. Following this, the method and results of the baseline reliability study for the avionic suite are presented.
Evaluation of fault-tolerant parallel-processor architectures over long space missions
NASA Technical Reports Server (NTRS)
Johnson, Sally C.
1989-01-01
The impact of a five year space mission environment on fault-tolerant parallel processor architectures is examined. The target application is a Strategic Defense Initiative (SDI) satellite requiring 256 parallel processors to provide the computation throughput. The reliability requirements are that the system still be operational after five years with 0.99 probability and that the probability of system failure during one-half hour of full operation be less than 10^-7. The fault tolerance features an architecture must possess to meet these reliability requirements are presented, many potential architectures are briefly evaluated, and one candidate architecture, the Charles Stark Draper Laboratory's Fault-Tolerant Parallel Processor (FTPP), is evaluated in detail. A methodology for designing a preliminary system configuration to meet the reliability and performance requirements of the mission is then presented and demonstrated by designing an FTPP configuration.
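A back-of-the-envelope computation shows the kind of sizing question such a methodology addresses. The per-processor failure rate below is an assumption for illustration, not a figure from the study, and the model ignores coverage and repair:

```python
# Sizing sketch: how many spares must back 256 active processors for 0.99
# five-year survival, given an assumed constant per-processor failure rate?
from math import comb, exp

LAMBDA = 1e-6 * 24 * 365     # assumed rate: 1e-6 failures/hr, per year
T = 5.0                      # mission duration in years
p = exp(-LAMBDA * T)         # probability one processor survives the mission

def system_reliability(active, spares):
    n = active + spares      # system works if at least `active` survive
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(active, n + 1))

spares = 0
while system_reliability(256, spares) < 0.99:
    spares += 1
print(spares, system_reliability(256, spares))
```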
Neuromorphic Computing – From Materials Research to Systems Architecture Roundtable
DOE Office of Scientific and Technical Information (OSTI.GOV)
Schuller, Ivan K.; Stevens, Rick; Pino, Robinson
2015-10-29
Computation in its many forms is the engine that fuels our modern civilization. Modern computation, based on the von Neumann architecture, has allowed, until now, the development of continuous improvements, as predicted by Moore's law. However, computation using current architectures and materials will inevitably, within the next 10 years, reach a limit because of fundamental scientific reasons. DOE convened a roundtable of experts in neuromorphic computing systems, materials science, and computer science in Washington on October 29-30, 2015 to address the following basic questions: Can brain-like (“neuromorphic”) computing devices based on new material concepts and systems be developed to dramatically outperform conventional CMOS-based technology? If so, what are the basic research challenges for materials science and computing? The overarching answer that emerged was: The development of novel functional materials and devices incorporated into unique architectures will allow a revolutionary technological leap toward the implementation of a fully “neuromorphic” computer. To address this challenge, the following issues were considered: the main differences between neuromorphic and conventional computing as related to signaling models, timing/clock, non-volatile memory, architecture, fault tolerance, integrated memory and compute, noise tolerance, analog vs. digital, and in situ learning; the new neuromorphic architectures needed to produce lower energy consumption, potential novel nanostructured materials, and enhanced computation; the device and materials properties needed to implement functions such as hysteresis, stability, and fault tolerance; and comparisons of different implementations, including spin torque, memristors, resistive switching, phase change, and optical schemes, for enhanced breakthroughs in performance, cost, fault tolerance, and/or manufacturability.
NASA Technical Reports Server (NTRS)
Chow, Edward T.; Schatzel, Donald V.; Whitaker, William D.; Sterling, Thomas
2008-01-01
A Spaceborne Processor Array in Multifunctional Structure (SPAMS) can lower the total mass of the electronic and structural overhead of spacecraft, resulting in reduced launch costs, while increasing the science return through dynamic onboard computing. SPAMS integrates the multifunctional structure (MFS) and the Gilgamesh Memory, Intelligence, and Network Device (MIND) multi-core in-memory computer architecture into a single-system super-architecture. This transforms every inch of a spacecraft into a sharable, interconnected, smart computing element to increase computing performance while simultaneously reducing mass. The MIND in-memory architecture provides a foundation for high-performance, low-power, and fault-tolerant computing. The MIND chip has an internal structure that includes memory, processing, and communication functionality. The Gilgamesh is a scalable system comprising multiple MIND chips interconnected to operate as a single, tightly coupled, parallel computer. The array of MIND components shares a global, virtual name space for program variables and tasks that are allocated at run time to the distributed physical memory and processing resources. Individual processor-memory nodes can be activated or powered down at run time to provide active power management and to configure around faults. A SPAMS system is comprised of a distributed Gilgamesh array built into MFS, interfaces into instrument and communication subsystems, a mass storage interface, and a radiation-hardened flight computer.
Predeployment validation of fault-tolerant systems through software-implemented fault insertion
NASA Technical Reports Server (NTRS)
Czeck, Edward W.; Siewiorek, Daniel P.; Segall, Zary Z.
1989-01-01
The fault injection-based automated testing (FIAT) environment, which can be used to experimentally characterize and evaluate distributed real-time systems under fault-free and faulted conditions, is described. A survey is presented of validation methodologies, and the need for fault insertion based on these methodologies is demonstrated. The origins and models of faults, and the motivation for the FIAT concept, are reviewed. FIAT employs a validation methodology which builds confidence in the system by first providing a baseline of fault-free performance data and then characterizing the behavior of the system with faults present. Fault insertion is accomplished through software and allows faults or the manifestation of faults to be inserted by either seeding faults into memory or triggering error detection mechanisms. FIAT is capable of emulating a variety of fault-tolerant strategies and architectures, can monitor system activity, and can automatically orchestrate experiments involving insertion of faults. A common system interface allows ease of use and decreases experiment development and run time. Fault models chosen for experiments on FIAT have generated system responses which parallel those observed in real systems under faulty conditions. These capabilities are shown by two example experiments, each using a different fault-tolerance strategy.
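The memory-seeding style of fault insertion is straightforward to sketch. The fragment below is illustrative only, not FIAT code: it flips one bit of a data word held in a bytearray "memory" and checks whether a naive detection mechanism notices:

```python
# Software-implemented fault insertion sketch: seed a single-bit fault into
# a memory image and see whether a simple checksum detects the corruption.
def flip_bit(memory: bytearray, byte_index: int, bit: int) -> None:
    memory[byte_index] ^= 1 << bit       # seed the fault

memory = bytearray((42).to_bytes(4, "little"))   # a 32-bit word holding 42
checksum = sum(memory) % 256                     # naive detection mechanism

flip_bit(memory, byte_index=0, bit=3)            # inject the fault
detected = (sum(memory) % 256) != checksum
value = int.from_bytes(memory, "little")
print(f"value after fault: {value}, detected: {detected}")   # 34, True
```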
Scientific Services on the Cloud
NASA Astrophysics Data System (ADS)
Chapman, David; Joshi, Karuna P.; Yesha, Yelena; Halem, Milt; Yesha, Yaacov; Nguyen, Phuong
Scientific computing was one of the very first applications of parallel and distributed computation. To this day, scientific applications remain some of the most compute intensive, and have inspired the creation of petaflop compute infrastructure such as the Oak Ridge Jaguar and Los Alamos RoadRunner. Large dedicated hardware infrastructure has become both a blessing and a curse to the scientific community. Scientists are interested in cloud computing for much the same reason as businesses and other professionals: the hardware is provided, maintained, and administrated by a third party; software abstraction and virtualization provide reliability and fault tolerance; and graduated fees allow for multi-scale prototyping and execution. Cloud computing resources are only a few clicks away, and by far the easiest high performance distributed platform to gain access to. There may still be dedicated infrastructure for ultra-scale science, but the cloud can easily play a major part in the scientific computing initiative.
2009 fault tolerance for extreme-scale computing workshop, Albuquerque, NM - March 19-20, 2009.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Katz, D. S.; Daly, J.; DeBardeleben, N.
2009-02-01
This is a report on the third in a series of petascale workshops co-sponsored by Blue Waters and TeraGrid to address challenges and opportunities for making effective use of emerging extreme-scale computing. This workshop was held to discuss fault tolerance on large systems for running large, possibly long-running applications. The main point of the workshop was to have systems people, middleware people (including fault-tolerance experts), and applications people talk about the issues and figure out what needs to be done, mostly at the middleware and application levels, to run such applications on the emerging petascale systems without having faults cause large numbers of application failures. The workshop found that there is considerable interest in fault tolerance, resilience, and reliability of high-performance computing (HPC) systems in general, at all levels of HPC. The only way to recover from faults is through the use of some redundancy, either in space or in time. Redundancy in time, in the form of writing checkpoints to disk and restarting at the most recent checkpoint after a fault that causes an application to crash or halt, is the most common tool used in applications today, but there are questions about how long this can continue to be a good solution as systems and memories grow faster than I/O bandwidth to disk. There is interest both in modifications to this, such as checkpoints to memory, partial checkpoints, and message logging, and in alternative ideas, such as in-memory recovery using residues. We believe that systematic exploration of these ideas holds the most promise for the scientific applications community. Fault tolerance has been an issue of discussion in the HPC community for at least the past 10 years; but much like other issues, the community has managed to put off addressing it during this period. There is a growing recognition that as systems continue to grow to petascale and beyond, the field is approaching the point where we don't have any choice but to address this through R&D efforts.
Organization of the secure distributed computing based on multi-agent system
NASA Astrophysics Data System (ADS)
Khovanskov, Sergey; Rumyantsev, Konstantin; Khovanskova, Vera
2018-04-01
Nowadays, the development of methods for distributed computing receives much attention, and one such method is the use of multi-agent systems. Distributed computing organized over conventional networked computers can be exposed to security threats arising from the computational processes themselves. The authors have developed a unified agent algorithm for a control system governing the operation of computing network nodes, with networked PCs used as the computing nodes. The proposed multi-agent control system makes it possible, in a short time, to harness the processing power of the computers on any existing network to solve large tasks by creating a distributed computing system. Agents on a computer network can configure the distributed computing system, distribute the computational load among the computers they operate, and optimize the distributed computing system according to the computing power of the computers on the network. The number of computers connected to the network can be increased by connecting new computers to the system, which leads to an increase in overall processing power. Adding a central agent to the multi-agent system increases the security of the distributed computing. This organization of the distributed computing system reduces problem-solving time and increases the fault tolerance (vitality) of computing processes in a changing computing environment (dynamic changes in the number of computers on the network). The developed multi-agent system also detects falsification of the results of the distributed computation, which could otherwise lead to wrong decisions, and it checks and corrects wrong results.
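The falsification-detection idea can be sketched as duplicated task assignment with a cross-check; the node behaviors below are simulated, and the arbiter step is an assumed recovery policy for illustration, not the authors' protocol:

```python
# Duplicated-execution sketch: assign the same subtask to two nodes and
# accept the result only if they match; a third, trusted recomputation
# outvotes a falsifying node on disagreement.
def verify_by_duplication(task, node_a, node_b, arbiter):
    ra, rb = node_a(task), node_b(task)
    if ra == rb:
        return ra
    rc = arbiter(task)                 # trusted recomputation
    return ra if rc == ra else rb      # outvote the falsifying node

honest = lambda xs: sum(xs)
malicious = lambda xs: sum(xs) + 1     # falsifies its result
print(verify_by_duplication([1, 2, 3], honest, malicious, honest))  # 6
```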
SARA - SURE/ASSIST RELIABILITY ANALYSIS WORKSTATION (VAX VMS VERSION)
NASA Technical Reports Server (NTRS)
Butler, R. W.
1994-01-01
SARA, the SURE/ASSIST Reliability Analysis Workstation, is a bundle of programs used to solve reliability problems. The mathematical approach chosen to solve a reliability problem may vary with the size and nature of the problem. The Systems Validation Methods group at NASA Langley Research Center has created a set of four software packages that form the basis for a reliability analysis workstation, including three for use in analyzing reconfigurable, fault-tolerant systems and one for analyzing non-reconfigurable systems. The SARA bundle includes the three for reconfigurable, fault-tolerant systems: SURE reliability analysis program (COSMIC program LAR-13789, LAR-14921); the ASSIST specification interface program (LAR-14193, LAR-14923), and PAWS/STEM reliability analysis programs (LAR-14165, LAR-14920). As indicated by the program numbers in parentheses, each of these three packages is also available separately in two machine versions. The fourth package, which is only available separately, is FTC, the Fault Tree Compiler (LAR-14586, LAR-14922). FTC is used to calculate the top-event probability for a fault tree which describes a non-reconfigurable system. PAWS/STEM and SURE are analysis programs which utilize different solution methods, but have a common input language, the SURE language. ASSIST is a preprocessor that generates SURE language from a more abstract definition. ASSIST, SURE, and PAWS/STEM are described briefly in the following paragraphs. For additional details about the individual packages, including pricing, please refer to their respective abstracts. ASSIST, the Abstract Semi-Markov Specification Interface to the SURE Tool program, allows a reliability engineer to describe the failure behavior of a fault-tolerant computer system in an abstract, high-level language. The ASSIST program then automatically generates a corresponding semi-Markov model. A one-page ASSIST-language description may result in a semi-Markov model with thousands of states and transitions. The ASSIST program also includes model-reduction techniques to facilitate efficient modeling of large systems. The semi-Markov model generated by ASSIST is in the format needed for input to SURE and PAWS/STEM. The Semi-Markov Unreliability Range Evaluator, SURE, is an analysis tool for reconfigurable, fault-tolerant systems. SURE provides an efficient means for calculating accurate upper and lower bounds for the death state probabilities for a large class of semi-Markov models, not just those which can be reduced to critical-pair architectures. The calculated bounds are close enough (usually within 5 percent of each other) for use in reliability studies of ultra-reliable computer systems. The SURE bounding theorems have algebraic solutions and are consequently computationally efficient even for large and complex systems. SURE can optionally regard a specified parameter as a variable over a range of values, enabling an automatic sensitivity analysis. SURE output is tabular. The PAWS/STEM package includes two programs for the creation and evaluation of pure Markov models describing the behavior of fault-tolerant reconfigurable computer systems: the Pade Approximation with Scaling (PAWS) and Scaled Taylor Exponential Matrix (STEM) programs. PAWS and STEM produce exact solutions for the probability of system failure and provide a conservative estimate of the number of significant digits in the solution. Markov models of fault-tolerant architectures inevitably lead to numerically stiff differential equations. 
Both PAWS and STEM have the capability to solve numerically stiff models. These complementary programs use separate methods to determine the matrix exponential in the solution of the model's system of differential equations. In general, PAWS is better suited to evaluate small and dense models. STEM operates at lower precision, but works faster than PAWS for larger models. The programs that comprise the SARA package were originally developed for use on DEC VAX series computers running VMS and were later ported for use on Sun series computers running SunOS. They are written in C-language, Pascal, and FORTRAN 77. An ANSI compliant C compiler is required in order to compile the C portion of the Sun version source code. The Pascal and FORTRAN code can be compiled on Sun computers using Sun Pascal and Sun Fortran. For the VMS version, VAX C, VAX PASCAL, and VAX FORTRAN can be used to recompile the source code. The standard distribution medium for the VMS version of SARA (COS-10041) is a 9-track 1600 BPI magnetic tape in VMSINSTAL format. It is also available on a TK50 tape cartridge in VMSINSTAL format. Executables are included. The standard distribution medium for the Sun version of SARA (COS-10039) is a .25 inch streaming magnetic tape cartridge in UNIX tar format. Both Sun3 and Sun4 executables are included. Electronic copies of the ASSIST user's manual in TeX and PostScript formats are provided on the distribution medium. DEC, VAX, VMS, and TK50 are registered trademarks of Digital Equipment Corporation. Sun, Sun3, Sun4, and SunOS are trademarks of Sun Microsystems, Inc. TeX is a trademark of the American Mathematical Society. PostScript is a registered trademark of Adobe Systems Incorporated.
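The kind of pure Markov model PAWS and STEM evaluate can be sketched with an off-the-shelf matrix exponential. The structure and rates below are assumptions for illustration, not a model from the package; note how the widely separated reconfiguration and failure rates make the model numerically stiff, exactly as the text observes:

```python
# Markov unreliability sketch: a triad that reconfigures to a simplex after
# one failure; probability mass in the absorbing death state is the system
# unreliability P(t)[3], computed via the matrix exponential p0 @ expm(Q t).
import numpy as np
from scipy.linalg import expm

lam, mu = 1e-4, 360.0   # assumed: failures/hr; reconfigurations/hr (~10 s)
# States: 0 triad, 1 one fault pending reconfiguration, 2 simplex, 3 failed.
Q = np.array([
    [-3 * lam,         3 * lam,  0.0,   0.0    ],
    [ 0.0,   -(2 * lam + mu),    mu,    2 * lam],   # 2nd near-coincident fault
    [ 0.0,             0.0,     -lam,   lam    ],
    [ 0.0,             0.0,      0.0,   0.0    ],   # death state (absorbing)
])
p0 = np.array([1.0, 0.0, 0.0, 0.0])
for t in (1.0, 10.0):                      # mission times in hours
    p = p0 @ expm(Q * t)
    print(f"t={t:5.1f} h  P(failure) = {p[3]:.3e}")
```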
SARA - SURE/ASSIST RELIABILITY ANALYSIS WORKSTATION (UNIX VERSION)
NASA Technical Reports Server (NTRS)
Butler, R. W.
1994-01-01
SARA, the SURE/ASSIST Reliability Analysis Workstation, is a bundle of programs used to solve reliability problems. The mathematical approach chosen to solve a reliability problem may vary with the size and nature of the problem. The Systems Validation Methods group at NASA Langley Research Center has created a set of four software packages that form the basis for a reliability analysis workstation, including three for use in analyzing reconfigurable, fault-tolerant systems and one for analyzing non-reconfigurable systems. The SARA bundle includes the three for reconfigurable, fault-tolerant systems: SURE reliability analysis program (COSMIC program LAR-13789, LAR-14921); the ASSIST specification interface program (LAR-14193, LAR-14923), and PAWS/STEM reliability analysis programs (LAR-14165, LAR-14920). As indicated by the program numbers in parentheses, each of these three packages is also available separately in two machine versions. The fourth package, which is only available separately, is FTC, the Fault Tree Compiler (LAR-14586, LAR-14922). FTC is used to calculate the top-event probability for a fault tree which describes a non-reconfigurable system. PAWS/STEM and SURE are analysis programs which utilize different solution methods, but have a common input language, the SURE language. ASSIST is a preprocessor that generates SURE language from a more abstract definition. ASSIST, SURE, and PAWS/STEM are described briefly in the following paragraphs. For additional details about the individual packages, including pricing, please refer to their respective abstracts. ASSIST, the Abstract Semi-Markov Specification Interface to the SURE Tool program, allows a reliability engineer to describe the failure behavior of a fault-tolerant computer system in an abstract, high-level language. The ASSIST program then automatically generates a corresponding semi-Markov model. A one-page ASSIST-language description may result in a semi-Markov model with thousands of states and transitions. The ASSIST program also includes model-reduction techniques to facilitate efficient modeling of large systems. The semi-Markov model generated by ASSIST is in the format needed for input to SURE and PAWS/STEM. The Semi-Markov Unreliability Range Evaluator, SURE, is an analysis tool for reconfigurable, fault-tolerant systems. SURE provides an efficient means for calculating accurate upper and lower bounds for the death state probabilities for a large class of semi-Markov models, not just those which can be reduced to critical-pair architectures. The calculated bounds are close enough (usually within 5 percent of each other) for use in reliability studies of ultra-reliable computer systems. The SURE bounding theorems have algebraic solutions and are consequently computationally efficient even for large and complex systems. SURE can optionally regard a specified parameter as a variable over a range of values, enabling an automatic sensitivity analysis. SURE output is tabular. The PAWS/STEM package includes two programs for the creation and evaluation of pure Markov models describing the behavior of fault-tolerant reconfigurable computer systems: the Pade Approximation with Scaling (PAWS) and Scaled Taylor Exponential Matrix (STEM) programs. PAWS and STEM produce exact solutions for the probability of system failure and provide a conservative estimate of the number of significant digits in the solution. Markov models of fault-tolerant architectures inevitably lead to numerically stiff differential equations. 
Both PAWS and STEM have the capability to solve numerically stiff models. These complementary programs use separate methods to determine the matrix exponential in the solution of the model's system of differential equations. In general, PAWS is better suited to evaluate small and dense models. STEM operates at lower precision, but works faster than PAWS for larger models. The programs that comprise the SARA package were originally developed for use on DEC VAX series computers running VMS and were later ported for use on Sun series computers running SunOS. They are written in C-language, Pascal, and FORTRAN 77. An ANSI compliant C compiler is required in order to compile the C portion of the Sun version source code. The Pascal and FORTRAN code can be compiled on Sun computers using Sun Pascal and Sun Fortran. For the VMS version, VAX C, VAX PASCAL, and VAX FORTRAN can be used to recompile the source code. The standard distribution medium for the VMS version of SARA (COS-10041) is a 9-track 1600 BPI magnetic tape in VMSINSTAL format. It is also available on a TK50 tape cartridge in VMSINSTAL format. Executables are included. The standard distribution medium for the Sun version of SARA (COS-10039) is a .25 inch streaming magnetic tape cartridge in UNIX tar format. Both Sun3 and Sun4 executables are included. Electronic copies of the ASSIST user's manual in TeX and PostScript formats are provided on the distribution medium. DEC, VAX, VMS, and TK50 are registered trademarks of Digital Equipment Corporation. Sun, Sun3, Sun4, and SunOS are trademarks of Sun Microsystems, Inc. TeX is a trademark of the American Mathematical Society. PostScript is a registered trademark of Adobe Systems Incorporated.
NASA Technical Reports Server (NTRS)
Stiffler, J. J.; Bryant, L. A.; Guccione, L.
1979-01-01
A computer program was developed as a general purpose reliability tool for fault tolerant avionics systems. The computer program requirements, together with several appendices containing computer printouts, are presented.
Roads towards fault-tolerant universal quantum computation
NASA Astrophysics Data System (ADS)
Campbell, Earl T.; Terhal, Barbara M.; Vuillot, Christophe
2017-09-01
A practical quantum computer must not merely store information, but also process it. To prevent errors introduced by noise from multiplying and spreading, a fault-tolerant computational architecture is required. Current experiments are taking the first steps toward noise-resilient logical qubits. But to convert these quantum devices from memories to processors, it is necessary to specify how a universal set of gates is performed on them. The leading proposals for doing so, such as magic-state distillation and colour-code techniques, have high resource demands. Alternative schemes, such as those that use high-dimensional quantum codes in a modular architecture, have potential benefits, but need to be explored further.
Roads towards fault-tolerant universal quantum computation.
Campbell, Earl T; Terhal, Barbara M; Vuillot, Christophe
2017-09-13
A practical quantum computer must not merely store information, but also process it. To prevent errors introduced by noise from multiplying and spreading, a fault-tolerant computational architecture is required. Current experiments are taking the first steps toward noise-resilient logical qubits. But to convert these quantum devices from memories to processors, it is necessary to specify how a universal set of gates is performed on them. The leading proposals for doing so, such as magic-state distillation and colour-code techniques, have high resource demands. Alternative schemes, such as those that use high-dimensional quantum codes in a modular architecture, have potential benefits, but need to be explored further.
Study of a unified hardware and software fault-tolerant architecture
NASA Technical Reports Server (NTRS)
Lala, Jaynarayan; Alger, Linda; Friend, Steven; Greeley, Gregory; Sacco, Stephen; Adams, Stuart
1989-01-01
A unified architectural concept, called the Fault Tolerant Processor Attached Processor (FTP-AP), that can tolerate hardware as well as software faults is proposed for applications requiring ultrareliable computation capability. An emulation of the FTP-AP architecture, consisting of a breadboard Motorola 68010-based quadruply redundant Fault Tolerant Processor, four VAX 750s as attached processors, and four versions of a transport aircraft yaw damper control law, is used as a testbed in the AIRLAB to examine a number of critical issues. Solutions of several basic problems associated with N-Version software are proposed and implemented on the testbed. This includes a confidence voter to resolve coincident errors in N-Version software. A reliability model of N-Version software that is based upon the recent understanding of software failure mechanisms is also developed. The basic FTP-AP architectural concept appears suitable for hosting N-Version application software while at the same time tolerating hardware failures. Architectural enhancements for greater efficiency, software reliability modeling, and N-Version issues that merit further research are identified.
Fault-tolerant clock synchronization in distributed systems
NASA Technical Reports Server (NTRS)
Ramanathan, Parameswaran; Shin, Kang G.; Butler, Ricky W.
1990-01-01
Existing fault-tolerant clock synchronization algorithms are compared and contrasted. These include the following: software synchronization algorithms, such as convergence-averaging, convergence-nonaveraging, and consistency algorithms, as well as probabilistic synchronization; hardware synchronization algorithms; and hybrid synchronization. The worst-case clock skews guaranteed by representative algorithms are compared, along with other important aspects such as time, message, and cost overhead imposed by the algorithms. More recent developments such as hardware-assisted software synchronization and algorithms for synchronizing large, partially connected distributed systems are especially emphasized.
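The convergence-averaging family surveyed here is simple to sketch: each clock reads the others, discards readings outside an acceptance window around its own value, and averages the rest. The sketch below is illustrative, with assumed clock values and window, not any one surveyed algorithm:

```python
# Convergence-averaging round sketch: average only those readings within
# DELTA of one's own clock, so a wildly faulty clock is simply ignored.
DELTA = 10.0   # acceptance window (same units as the clock values)

def converge(clocks):
    new = []
    for own in clocks:
        accepted = [c for c in clocks if abs(c - own) <= DELTA]
        new.append(sum(accepted) / len(accepted))
    return new

clocks = [100.0, 103.0, 98.0, 500.0]      # the last clock is faulty
for _ in range(3):
    clocks = converge(clocks)
print(clocks)   # healthy clocks pulled together; the faulty one excluded
```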
Robot Position Sensor Fault Tolerance
NASA Technical Reports Server (NTRS)
Aldridge, Hal A.
1997-01-01
Robot systems in critical applications, such as those in space and nuclear environments, must be able to operate during component failure to complete important tasks. One failure mode that has received little attention is the failure of joint position sensors. Current fault tolerant designs require the addition of directly redundant position sensors which can affect joint design. A new method is proposed that utilizes analytical redundancy to allow for continued operation during joint position sensor failure. Joint torque sensors are used with a virtual passive torque controller to make the robot joint stable without position feedback and improve position tracking performance in the presence of unknown link dynamics and end-effector loading. Two Cartesian accelerometer based methods are proposed to determine the position of the joint. The joint specific position determination method utilizes two triaxial accelerometers attached to the link driven by the joint with the failed position sensor. The joint specific method is not computationally complex and the position error is bounded. The system wide position determination method utilizes accelerometers distributed on different robot links and the end-effector to determine the position of sets of multiple joints. The system wide method requires fewer accelerometers than the joint specific method to make all joint position sensors fault tolerant but is more computationally complex and has lower convergence properties. Experiments were conducted on a laboratory manipulator. Both position determination methods were shown to track the actual position satisfactorily. A controller using the position determination methods and the virtual passive torque controller was able to servo the joints to a desired position during position sensor failure.
Hyperswitch communication network
NASA Technical Reports Server (NTRS)
Peterson, J.; Pniel, M.; Upchurch, E.
1991-01-01
The Hyperswitch Communication Network (HCN) is a large scale parallel computer prototype being developed at JPL. Commercial versions of the HCN computer are planned. The HCN computer being designed is a message passing multiple instruction multiple data (MIMD) computer, and offers many advantages in price-performance ratio, reliability and availability, and manufacturing over traditional uniprocessors and bus-based multiprocessors. The HCN operating system provides a uniquely flexible environment that combines parallel processing and distributed processing. This programming paradigm can achieve a balance among the following competing factors: performance in processing and communications, user friendliness, and fault tolerance. The prototype is being designed to accommodate a maximum of 64 state-of-the-art microprocessors. The HCN is classified as a distributed supercomputer. The HCN system is described, and the performance/cost analysis and other competing factors within the system design are reviewed.
Experimental analysis of computer system dependability
NASA Technical Reports Server (NTRS)
Iyer, Ravishankar K.; Tang, Dong
1993-01-01
This paper reviews an area which has evolved over the past 15 years: experimental analysis of computer system dependability. Methodologies and advances are discussed for three basic approaches used in the area: simulated fault injection, physical fault injection, and measurement-based analysis. The three approaches are suited, respectively, to dependability evaluation in the three phases of a system's life: the design phase, the prototype phase, and the operational phase. Before the discussion of these phases, several statistical techniques used in the area are introduced. For each phase, a classification of research methods or study topics is outlined, followed by a discussion of these methods or topics as well as representative studies. The statistical techniques introduced include the estimation of parameters and confidence intervals, probability distribution characterization, and several multivariate analysis methods. Importance sampling, a statistical technique used to accelerate Monte Carlo simulation, is also introduced. The discussion of simulated fault injection covers electrical-level, logic-level, and function-level fault injection methods as well as representative simulation environments such as FOCUS and DEPEND. The discussion of physical fault injection covers hardware, software, and radiation fault injection methods as well as several software and hybrid tools including FIAT, FERRARI, HYBRID, and FINE. The discussion of measurement-based analysis covers measurement and data processing techniques, basic error characterization, dependency analysis, Markov reward modeling, software dependability, and fault diagnosis. The discussion involves several important issues studied in the area, including fault models, fast simulation techniques, workload/failure dependency, correlated failures, and software fault tolerance.
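Importance sampling, introduced above as a way to accelerate Monte Carlo simulation of rare failures, can be illustrated with a toy model (hypothetical numbers throughout): a system that fails when at least 2 of its 5 components fail, each with probability 1e-4. Sampling failures at an inflated probability q and reweighting by the likelihood ratio resolves the rare event with far fewer trials than crude Monte Carlo.

import random

def is_estimate(n_samples, p=1e-4, q=0.05, n_comp=5, k_fail=2, seed=1):
    """Importance-sampling estimate of P(at least k_fail of n_comp
    components fail) when each component fails with probability p."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        k = sum(rng.random() < q for _ in range(n_comp))
        if k >= k_fail:                       # rare event of interest
            # Likelihood ratio of the sample under p versus under q.
            total += (p / q) ** k * ((1 - p) / (1 - q)) ** (n_comp - k)
    return total / n_samples

# Crude Monte Carlo would need on the order of 10^8 trials to see this
# event; biased sampling resolves it with 10^5 (true value ~1.0e-7).
print(is_estimate(100_000))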
Determination of the optimal tolerance for MLC positioning in sliding window and VMAT techniques
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hernandez, V., E-mail: vhernandezmasgrau@gmail.com; Abella, R.; Calvo, J. F.
2015-04-15
Purpose: Several authors have recommended a 2 mm tolerance for multileaf collimator (MLC) positioning in sliding window treatments. In volumetric modulated arc therapy (VMAT) treatments, however, the optimal tolerance for MLC positioning remains unknown. In this paper, the authors present the results of a multicenter study to determine the optimal tolerance for both techniques. Methods: The procedure used is based on dynalog file analysis. The study was carried out using seven Varian linear accelerators from five different centers. Dynalogs were collected from over 100 000 clinical treatments and in-house software was used to compute the number of tolerance faults as a function of the user-defined tolerance. Thus, the optimal value for this tolerance, defined as the lowest achievable value, was investigated. Results: Dynalog files accurately predict the number of tolerance faults as a function of the tolerance value, especially for low fault incidences. All MLCs behaved similarly and the Millennium120 and the HD120 models yielded comparable results. In sliding window techniques, the number of beams with an incidence of hold-offs >1% rapidly decreases for a tolerance of 1.5 mm. In VMAT techniques, the number of tolerance faults sharply drops for tolerances around 2 mm. For a tolerance of 2.5 mm, less than 0.1% of the VMAT arcs presented tolerance faults. Conclusions: Dynalog analysis provides a feasible method for investigating the optimal tolerance for MLC positioning in dynamic fields. In sliding window treatments, the tolerance of 2 mm was found to be adequate, although it can be reduced to 1.5 mm. In VMAT treatments, the typically used 5 mm tolerance is excessively high. Instead, a tolerance of 2.5 mm is recommended.
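The core of the described dynalog analysis, counting tolerance faults as a function of the user-defined tolerance, can be sketched schematically as follows (the arrays are synthetic stand-ins, not dynalog data, and the 0.7 mm noise level is invented for illustration):

import numpy as np

def fault_fraction(planned, actual, tolerances):
    # Fraction of leaf-position samples whose deviation from the plan
    # exceeds each candidate tolerance value.
    dev = np.abs(np.asarray(planned) - np.asarray(actual))
    return {t: float(np.mean(dev > t)) for t in tolerances}

rng = np.random.default_rng(0)
planned = rng.uniform(-50.0, 50.0, 10_000)            # leaf positions (mm)
actual = planned + rng.normal(0.0, 0.7, 10_000)       # assumed positioning noise
for t, frac in fault_fraction(planned, actual, [1.0, 1.5, 2.0, 2.5]).items():
    print(f"tolerance {t} mm: {100 * frac:.2f}% of samples out of tolerance")

Sweeping the tolerance this way and looking for the knee where the fault incidence collapses is, at toy scale, the same logic as the multicenter optimization described above.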
An Architectural Concept for Intrusion Tolerance in Air Traffic Networks
NASA Technical Reports Server (NTRS)
Maddalon, Jeffrey M.; Miner, Paul S.
2003-01-01
The goal of an intrusion tolerant network is to continue to provide predictable and reliable communication in the presence of a limited number of compromised network components. The behavior of a compromised network component ranges from a node that no longer responds to a node that is under the control of a malicious entity that is actively trying to cause other nodes to fail. Most current data communication networks do not include support for tolerating unconstrained misbehavior of components in the network. However, the fault tolerance community has developed protocols that provide both predictable and reliable communication in the presence of the worst possible behavior of a limited number of nodes in the system. One may view a malicious entity in a communication network as a node that has failed and is behaving in an arbitrary manner. NASA/Langley Research Center has developed one such fault-tolerant computing platform called SPIDER (Scalable Processor-Independent Design for Electromagnetic Resilience). The protocols and interconnection mechanisms of SPIDER may be adapted to large-scale, distributed communication networks such as would be required for future Air Traffic Management systems. The predictability and reliability guarantees provided by the SPIDER protocols have been formally verified. This analysis can be readily adapted to similar network structures.
NASA Technical Reports Server (NTRS)
Shooman, Martin L.
1991-01-01
Many of the most challenging reliability problems of our present decade involve complex distributed systems such as interconnected telephone switching computers, air traffic control centers, aircraft and space vehicles, and local area and wide area computer networks. In addition to the challenge of complexity, modern fault-tolerant computer systems require very high levels of reliability, e.g., avionic computers with MTTF goals of one billion hours. Most analysts find that it is too difficult to model such complex systems without computer aided design programs. In response to this need, NASA has developed a suite of computer aided reliability modeling programs beginning with CARE 3 and including a group of new programs such as: HARP, HARP-PC, the Reliability Analysts Workbench (a combination of the model solvers SURE, STEM, and PAWS with the common front-end model ASSIST), and the Fault Tree Compiler. The HARP program is studied, and how well the user can model systems with it is investigated. One of the important objectives is to study how user-friendly this program is, e.g., how easy it is to model the system, provide the input information, and interpret the results. The experiences of the author and his graduate students who used HARP in two graduate courses are described. Some brief comparisons were made with the ARIES program, which the students also used. Theoretical studies of the modeling techniques used in HARP are also included. Of course, no answer can be any more accurate than the fidelity of the model; thus an Appendix is included which discusses modeling accuracy. A broad viewpoint is taken and all problems which occurred in the use of HARP are discussed. Such problems include: computer system problems, installation manual problems, user manual problems, program inconsistencies, program limitations, confusing notation, long run times, and accuracy problems.
The implementation and use of Ada on distributed systems with high reliability requirements
NASA Technical Reports Server (NTRS)
Knight, J. C.
1986-01-01
The use and implementation of Ada in distributed environments in which reliability is the primary concern were investigated. A distributed system, programmed entirely in Ada, was studied to assess the use of individual tasks without concern for the processor used. Continued development and testing of the fault-tolerant Ada testbed; development of suggested changes to Ada to cope with the failures of interest; design of approaches to fault-tolerant software in real-time systems, and the integration of these ideas into Ada; and the preparation of various papers and presentations were discussed.
NASA Astrophysics Data System (ADS)
de Laat, Cees; Develder, Chris; Jukan, Admela; Mambretti, Joe
This topic is devoted to communication issues in scalable compute and storage systems, such as parallel computers, networks of workstations, and clusters. All aspects of communication in modern systems were solicited, including advances in the design, implementation, and evaluation of interconnection networks, network interfaces, system and storage area networks, on-chip interconnects, communication protocols, routing and communication algorithms, and communication aspects of parallel and distributed algorithms. In total, 15 papers were submitted to this topic, of which we selected the 7 strongest. We grouped the papers into two sessions of 3 papers each, and one paper was selected for the best paper session. We noted a number of papers dealing with changing topologies, stability, and forwarding convergence in source-routing-based cluster interconnect network architectures; we grouped these for the first session. The authors of the paper titled “Implementing a Change Assimilation Mechanism for Source Routing Interconnects” propose a mechanism that can obtain the new topology, and compute and distribute a new set of fabric paths to the source routed network end points to minimize the impact on the forwarding service. The article entitled “Dependability Analysis of a Fault-tolerant Network Reconfiguration Strategy” reports on a case study analyzing the effects of network size, mean time to node failure, mean time to node repair, mean time to network repair, and coverage of the failure when using a 2D mesh network with a fault-tolerant mechanism (similar to the one used in the BlueGene/L system) that is able to remove rows and/or columns in the presence of failures. The last paper in this session, “RecTOR: A New and Efficient Method for Dynamic Network Reconfiguration”, presents a new dynamic reconfiguration method that ensures deadlock-freedom during the reconfiguration without causing performance degradation such as increased latency or decreased throughput. The second session groups 3 papers presenting methods, protocols, and architectures that enhance network capabilities. The paper titled “NIC-assisted Cache-Efficient Receive Stack for Message Passing over Ethernet” presents the addition of multiqueue support in the Open-MX receive stack so that all incoming packets for the same process are treated on the same core. It then introduces the idea of binding the target end process near its dedicated receive queue. In general this multiqueue receive stack performs better than the original single queue stack, especially on large communication patterns where multiple processes are involved and manual binding is difficult. The authors of “A Multipath Fault-Tolerant Routing Method for High-Speed Interconnection Networks” focus on the problem of fault tolerance for high-speed interconnection networks by designing a fault-tolerant routing method. The goal was to tolerate a certain number of link and node failures, considering their impact and occurrence probability. Their experiments show that their method allows applications to successfully finalize their execution in the presence of several faults, with an average performance value of 97% with respect to the fault-free scenarios. The paper “Hardware implementation study of the Self-Clocked Fair Queuing Credit Aware (SCFQ-CA) and Deficit Round Robin Credit Aware (DRR-CA) scheduling algorithms” proposes specific implementations of the two schedulers taking into account the characteristics of current high-performance networks.
A comparison is presented on the complexity of these two algorithms in terms of silicon area and computation delay. Finally, we selected one paper for the best paper session: “A Case Study of Communication Optimizations on 3D Mesh Interconnects”. In this paper the authors present topology-aware mapping as a technique to optimize communication on 3-dimensional mesh interconnects and hence improve performance. Results are presented for OpenAtom on up to 16,384 processors of Blue Gene/L, 8,192 processors of Blue Gene/P, and 2,048 processors of Cray XT3.
A Poisson process approximation for generalized K-S confidence regions
NASA Technical Reports Server (NTRS)
Arsham, H.; Miller, D. R.
1982-01-01
One-sided confidence regions for continuous cumulative distribution functions are constructed using empirical cumulative distribution functions and the generalized Kolmogorov-Smirnov distance. The width of the confidence band becomes narrower in the right or left tail of the distribution. To avoid tedious computation of confidence levels and critical values, an approximation based on the Poisson process is introduced. This approximation provides a conservative confidence region; moreover, the approximation error decreases monotonically to 0 as sample size increases. Critical values necessary for implementation are given. Applications are made to the areas of risk analysis, investment modeling, reliability assessment, and analysis of fault tolerant systems.
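For orientation, the objects involved can be sketched with the classical one-sided Dvoretzky-Kiefer-Wolfowitz (DKW) bound, which yields a constant-width one-sided band around the empirical CDF; this is a cruder stand-in for the paper's Poisson-process construction, whose tail-narrowing bands are not reproduced here.

import math
import numpy as np

def one_sided_band(sample, alpha=0.05):
    # Lower confidence band for the true CDF from the empirical CDF:
    # P(sup_x (F_n(x) - F(x)) > eps) <= exp(-2 n eps^2) = alpha.
    x = np.sort(np.asarray(sample))
    n = len(x)
    ecdf = np.arange(1, n + 1) / n
    eps = math.sqrt(math.log(1.0 / alpha) / (2.0 * n))
    lower = np.clip(ecdf - eps, 0.0, 1.0)
    return x, ecdf, lower, eps

rng = np.random.default_rng(1)
xs, F_n, lo, eps = one_sided_band(rng.exponential(size=200), alpha=0.05)
print(f"n=200: the 95% one-sided band sits {eps:.3f} below the empirical CDF")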
Unconditional security of quantum key distribution over arbitrarily long distances
Lo; Chau
1999-03-26
Quantum key distribution is widely thought to offer unconditional security in communication between two users. Unfortunately, a widely accepted proof of its security in the presence of source, device, and channel noises has been missing. This long-standing problem is solved here by showing that, given fault-tolerant quantum computers, quantum key distribution over an arbitrarily long distance of a realistic noisy channel can be made unconditionally secure. The proof is reduced from a noisy quantum scheme to a noiseless quantum scheme and then from a noiseless quantum scheme to a noiseless classical scheme, which can then be tackled by classical probability theory.
NASA Technical Reports Server (NTRS)
Harper, R. E.; Alger, L. S.; Babikyan, C. A.; Butler, B. P.; Friend, S. A.; Ganska, R. J.; Lala, J. H.; Masotto, T. K.; Meyer, A. J.; Morton, D. P.
1992-01-01
Digital computing systems needed for Army programs such as the Computer-Aided Low Altitude Helicopter Flight Program and the Armored Systems Modernization (ASM) vehicles may be characterized by high computational throughput and input/output bandwidth, hard real-time response, high reliability and availability, and maintainability, testability, and producibility requirements. In addition, such a system should be affordable to produce, procure, maintain, and upgrade. To address these needs, the Army Fault Tolerant Architecture (AFTA) is being designed and constructed under a three-year program comprising a conceptual study, detailed design and fabrication, and demonstration and validation phases. Described here are the results of the conceptual study phase of the AFTA development. Given here is an introduction to the AFTA program, its objectives, and key elements of its technical approach. A format is designed for representing mission requirements in a manner suitable for first order AFTA sizing and analysis, followed by a discussion of the current state of mission requirements acquisition for the targeted Army missions. An overview is given of AFTA's architectural theory of operation.
Assessing the Progress of Trapped-Ion Processors Towards Fault-Tolerant Quantum Computation
NASA Astrophysics Data System (ADS)
Bermudez, A.; Xu, X.; Nigmatullin, R.; O'Gorman, J.; Negnevitsky, V.; Schindler, P.; Monz, T.; Poschinger, U. G.; Hempel, C.; Home, J.; Schmidt-Kaler, F.; Biercuk, M.; Blatt, R.; Benjamin, S.; Müller, M.
2017-10-01
A quantitative assessment of the progress of small prototype quantum processors towards fault-tolerant quantum computation is a problem of current interest in experimental and theoretical quantum information science. We introduce a necessary and fair criterion for quantum error correction (QEC), which must be achieved in the development of these quantum processors before they become large enough for the well-known QEC threshold to apply. We apply this criterion to benchmark the ongoing effort in implementing QEC with topological color codes using trapped-ion quantum processors and, more importantly, to guide the future hardware developments that will be required in order to demonstrate beneficial QEC with small topological quantum codes. In doing so, we present a thorough description of a realistic trapped-ion toolbox for QEC and a physically motivated error model that goes beyond standard simplifications in the QEC literature. We focus on laser-based quantum gates realized in two-species trapped-ion crystals in high-optical aperture segmented traps. Our large-scale numerical analysis shows that, with the foreseen technological improvements described here, this platform is a very promising candidate for fault-tolerant quantum computation.
The use of programmable logic controllers (PLC) for rocket engine component testing
NASA Technical Reports Server (NTRS)
Nail, William; Scheuermann, Patrick; Witcher, Kern
1991-01-01
Application of PLCs to rocket engine component testing at the new Stennis Space Center Component Test Facility is suggested as an alternative to dedicated specialized computers. The PLC systems are characterized by rugged design, intuitive software, fault tolerance, flexibility, multiple end device options, networking capability, and built-in diagnostics. A distributed PLC-based system is projected to be used for testing LH2/LOx turbopumps required for the ALS/NLS rocket engines.
2013-11-01
big data with R is relatively new. RHadoop is a mature product from Revolution Analytics that uses R with Hadoop Streaming [15] and provides... agnostic all-data summaries or computations, in which case we use MapReduce directly. 2.3 D&R Software Environment. In this work, we use the Hadoop... job scheduling and tracking, data distribution, system architecture, heterogeneity, and fault-tolerance. Hadoop also provides a distributed key-value
NASA Technical Reports Server (NTRS)
Butler, Ricky W.; Divito, Ben L.
1992-01-01
The design and formal verification of the Reliable Computing Platform (RCP), a fault-tolerant computing system for digital flight control applications, is presented. The RCP uses N-Modular Redundant (NMR) style redundancy to mask faults and internal majority voting to flush the effects of transient faults. The system is formally specified and verified using the Ehdm verification system. A major goal of this work is to provide the system with significant capability to withstand the effects of High Intensity Radiated Fields (HIRF).
NASA Astrophysics Data System (ADS)
Yim, Keun Soo
This dissertation summarizes experimental validation and co-design studies conducted to optimize the fault detection capabilities and overheads in hybrid computer systems (e.g., using CPUs and Graphics Processing Units, or GPUs), and consequently to improve the scalability of parallel computer systems using computational accelerators. The experimental validation studies were conducted to help us understand the failure characteristics of CPU-GPU hybrid computer systems under various types of hardware faults. The main characterization targets were faults that are difficult to detect and/or recover from, e.g., faults that cause long latency failures (Ch. 3), faults in dynamically allocated resources (Ch. 4), faults in GPUs (Ch. 5), faults in MPI programs (Ch. 6), and microarchitecture-level faults with specific timing features (Ch. 7). The co-design studies were based on the characterization results. One of the co-designed systems has a set of source-to-source translators that customize and strategically place error detectors in the source code of target GPU programs (Ch. 5). Another co-designed system uses an extension card to learn the normal behavioral and semantic execution patterns of message-passing processes executing on CPUs, and to detect abnormal behaviors of those parallel processes (Ch. 6). The third co-designed system is a co-processor that has a set of new instructions in order to support software-implemented fault detection techniques (Ch. 7). The work described in this dissertation gains more importance because heterogeneous processors have become an essential component of state-of-the-art supercomputers. GPUs were used in three of the five fastest supercomputers that were operating in 2011. Our work included comprehensive fault characterization studies in CPU-GPU hybrid computers. In CPUs, we monitored the target systems for a long period of time after injecting faults (a temporally comprehensive experiment), and injected faults into various types of program states that included dynamically allocated memory (to be spatially comprehensive). In GPUs, we used fault injection studies to demonstrate the importance of detecting silent data corruption (SDC) errors that are mainly due to the lack of fine-grained protections and the massive use of fault-insensitive data. This dissertation also presents transparent fault tolerance frameworks and techniques that are directly applicable to hybrid computers built using only commercial off-the-shelf hardware components. This dissertation shows that by developing understanding of the failure characteristics and error propagation paths of target programs, we were able to create fault tolerance frameworks and techniques that can quickly detect and recover from hardware faults with low performance and hardware overheads.
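A flavor of the fault-injection experiments described above can be given with a minimal, generic sketch (an illustration of the technique, not the dissertation's injection tooling): flip one random bit in the IEEE-754 representation of one element of a program's data and compare the result against a golden run, the basic recipe behind silent-data-corruption studies.

import random
import struct

def flip_bit(value, bit):
    # Flip one bit of the 64-bit IEEE-754 encoding of a Python float.
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (out,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return out

def inject(data, rng):
    # Inject a single random bit flip into a random element, returning
    # (index, bit) so the outcome can be traced back to the fault.
    i = rng.randrange(len(data))
    b = rng.randrange(64)
    data[i] = flip_bit(data[i], b)
    return i, b

rng = random.Random(42)
data = [1.0] * 8
golden = sum(data)                      # fault-free reference result
i, b = inject(data, rng)
faulty = sum(data)                      # run completes, result may differ
print(f"flipped bit {b} of element {i}: {golden} -> {faulty}")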
Highly fault-tolerant parallel computation
DOE Office of Scientific and Technical Information (OSTI.GOV)
Spielman, D.A.
We re-introduce the coded model of fault-tolerant computation in which the input and output of a computational device are treated as words in an error-correcting code. A computational device correctly computes a function in the coded model if its input and output, once decoded, are a valid input and output of the function. In the coded model, it is reasonable to hope to simulate all computational devices by devices whose size is greater by a constant factor but which are exponentially reliable even if each of their components can fail with some constant probability. We consider fine-grained parallel computations in which each processor has a constant probability of producing the wrong output at each time step. We show that any parallel computation that runs for time t on w processors can be performed reliably on a faulty machine in the coded model using w log^{O(1)} w processors and time t log^{O(1)} w. The failure probability of the computation will be at most t · exp(-w^{1/4}). The codes used to communicate with our fault-tolerant machines are generalized Reed-Solomon codes and can thus be encoded and decoded in O(n log^{O(1)} n) sequential time and are independent of the machine they are used to communicate with. We also show how coded computation can be used to self-correct many linear functions in parallel with arbitrarily small overhead.
Design and Implementation of Replicated Object Layer
NASA Technical Reports Server (NTRS)
Koka, Sudhir
1996-01-01
One of the widely used techniques for construction of fault tolerant applications is the replication of resources, so that if one copy fails sufficient copies may still remain operational to allow the application to continue to function. This thesis involves the design and implementation of an object oriented framework for replicating data on multiple sites and across different platforms. Our approach, called the Replicated Object Layer (ROL), provides a mechanism for consistent replication of data over dynamic networks. ROL uses the Reliable Multicast Protocol (RMP) as a communication protocol that provides for reliable delivery, serialization, and fault tolerance. Besides providing type registration, this layer facilitates distributed atomic transactions on replicated data. A novel algorithm called the RMP Commit Protocol, which commits transactions efficiently in a reliable multicast environment, is presented. ROL provides recovery procedures to ensure that site and communication failures do not corrupt persistent data, and makes the system tolerant to network partitions. ROL will facilitate building distributed fault tolerant applications by performing the burdensome details of replica consistency operations and making them completely transparent to the application. Replicated databases are a major class of applications which could be built on top of ROL.
Design of a fault-tolerant reversible control unit in molecular quantum-dot cellular automata
NASA Astrophysics Data System (ADS)
Bahadori, Golnaz; Houshmand, Monireh; Zomorodi-Moghadam, Mariam
Quantum-dot cellular automata (QCA) is a promising emerging nanotechnology that has been attracting considerable attention due to its small feature size, ultra-low power consumption, and high clock frequency. Therefore, there have been many efforts to design computational units based on this technology. Despite these advantages of the QCA-based nanotechnologies, their implementation is susceptible to a high error rate. On the other hand, using reversible computing leads to zero bit erasures and no energy dissipation. Because reversible computation does not lose information, faults can be detected with a high probability. In this paper, first we propose a fault-tolerant control unit using reversible gates which improves on the previous design. The proposed design is then synthesized to the QCA technology and is simulated by the QCADesigner tool. Evaluation results indicate the performance of the proposed approach.
Interface Circuits for Self-Checking Microprocessors
NASA Technical Reports Server (NTRS)
Rennels, D. A.; Chandramouli, R.
1986-01-01
A fault-tolerant microcomputer concept is based on enhancing a "simple" computer with redundancy and self-checking logic circuits that detect hardware faults. Interface and checking logic (ICL) and redundant processors give a 16-bit microcomputer the ability to check itself for hardware faults. The checking circuitry also checks itself. The concept of self-checking complementary pairs (SCCPs) is employed throughout the ICL unit.
NASA Technical Reports Server (NTRS)
Hruby, R. J.; Bjorkman, W. S.
1977-01-01
Flight test results of the strapdown inertial reference unit (SIRU) navigation system are presented. The fault-tolerant SIRU navigation system features a redundant inertial sensor unit and dual computers. System software provides for detection and isolation of inertial sensor failures and continued operation in the event of failures. Flight test results include assessments of the system's navigational performance and fault tolerance.
The art of fault-tolerant system reliability modeling
NASA Technical Reports Server (NTRS)
Butler, Ricky W.; Johnson, Sally C.
1990-01-01
A step-by-step tutorial of the methods and tools used for the reliability analysis of fault-tolerant systems is presented. Emphasis is on the representation of architectural features in mathematical models. Details of the mathematical solution of complex reliability models are not presented. Instead the use of several recently developed computer programs--SURE, ASSIST, STEM, PAWS--which automate the generation and solution of these models is described.
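A toy instance of the kind of model these tools generate and solve may help: a continuous-time Markov chain for a triple-modular-redundant (TMR) triad with imperfect fault coverage, integrated here by plain Euler steps. The rates, coverage, and mission time are hypothetical, and a solver such as SURE or STEM uses far more careful numerics than this sketch.

import numpy as np

lam, c, T = 1e-4, 0.999, 1000.0   # failure rate (per hour), coverage, mission time

# State 0: all three modules good.  State 1: one failure, masked by voting.
# State 2: system failure (absorbing).  Uncovered faults go straight to state 2.
Q = np.array([
    [-3 * lam,  3 * lam * c,  3 * lam * (1 - c)],
    [0.0,      -2 * lam,      2 * lam],
    [0.0,       0.0,          0.0],
])

p = np.array([1.0, 0.0, 0.0])      # start with everything working
dt = 0.1
for _ in range(int(T / dt)):       # Euler integration of dp/dt = p Q
    p = p + dt * (p @ Q)

print(f"mission reliability at {T:.0f} h: {p[0] + p[1]:.6f}")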
SUMC fault tolerant computer system
NASA Technical Reports Server (NTRS)
1980-01-01
The results of the trade studies are presented. These trades cover: establishing the basic configuration, establishing the CPU/memory configuration, establishing an approach to crosstrapping interfaces, defining the requirements of the redundancy management unit (RMU), establishing a spare plane switching strategy for the fault-tolerant memory (FTM), and identifying the most cost effective way of extending the memory addressing capability beyond the 64 K-bytes (K=1024) of SUMC-II B. The results of the design are compiled in the Contract End Item (CEI) Specification for the NASA Standard Spacecraft Computer II (NSSC-II), IBM 7934507. The implementation of the FTM and the memory address expansion are also described.
Holonomic surface codes for fault-tolerant quantum computation
NASA Astrophysics Data System (ADS)
Zhang, Jiang; Devitt, Simon J.; You, J. Q.; Nori, Franco
2018-02-01
Surface codes can protect quantum information stored in qubits from local errors as long as the per-operation error rate is below a certain threshold. Here we propose holonomic surface codes by harnessing the quantum holonomy of the system. In our scheme, the holonomic gates are built via auxiliary qubits rather than the auxiliary levels in multilevel systems used in conventional holonomic quantum computation. The key advantage of our approach is that the auxiliary qubits are in their ground state before and after each gate operation, so they are not involved in the operation cycles of surface codes. This provides an advantageous way to implement surface codes for fault-tolerant quantum computation.
Design of a modular digital computer system, CDRL no. D001, final design plan
NASA Technical Reports Server (NTRS)
Easton, R. A.
1975-01-01
The engineering breadboard implementation for the CDRL no. D001 modular digital computer system developed during design of the logic system was documented. This effort followed the architecture study completed and documented previously, and was intended to verify the concepts of a fault tolerant, automatically reconfigurable, modular version of the computer system conceived during the architecture study. The system has a microprogrammed 32 bit word length, general register architecture and an instruction set consisting of a subset of the IBM System 360 instruction set plus additional fault tolerance firmware. The following areas were covered: breadboard packaging, central control element, central processing element, memory, input/output processor, and maintenance/status panel and electronics.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Lumsdaine, Andrew
2013-03-08
The main purpose of the Coordinated Infrastructure for Fault Tolerance in Systems initiative has been to conduct research with a goal of providing end-to-end fault tolerance on a systemwide basis for applications and other system software. While fault tolerance has been an integral part of most high-performance computing (HPC) system software developed over the past decade, it has been treated mostly as a collection of isolated stovepipes. Visibility and response to faults has typically been limited to the particular hardware and software subsystems in which they are initially observed. Little fault information is shared across subsystems, allowing little flexibility or control on a system-wide basis, making it practically impossible to provide cohesive end-to-end fault tolerance in support of scientific applications. As an example, consider faults such as communication link failures that can be seen by a network library but are not directly visible to the job scheduler, or consider faults related to node failures that can be detected by system monitoring software but are not inherently visible to the resource manager. If information about such faults could be shared by the network libraries or monitoring software, then other system software, such as a resource manager or job scheduler, could ensure that failed nodes or failed network links were excluded from further job allocations and that further diagnosis could be performed. As a founding member and one of the lead developers of the Open MPI project, our efforts over the course of this project have been focused on making Open MPI more robust to failures by supporting various fault tolerance techniques, and using fault information exchange and coordination between MPI and the HPC system software stack from the application, numeric libraries, and programming language runtime to other common system components such as job schedulers, resource managers, and monitoring tools.
Aerospace Applications of Weibull and Monte Carlo Simulation with Importance Sampling
NASA Technical Reports Server (NTRS)
Bavuso, Salvatore J.
1998-01-01
Recent developments in reliability modeling and computer technology have made it practical to use the Weibull time to failure distribution to model the system reliability of complex fault-tolerant computer-based systems. These system models are becoming increasingly popular in space systems applications as a result of mounting data that support a Weibull failure distribution with decreasing failure rate and the expectation of increased system reliability. This presentation introduces the new reliability modeling developments and demonstrates their application to a novel space system application. The application is a proposed guidance, navigation, and control (GN&C) system for use in a long duration manned spacecraft for a possible Mars mission. Comparisons to the constant failure rate model are presented and the ramifications of doing so are discussed.
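The contrast drawn here between the Weibull and constant-failure-rate models reduces to two reliability functions, R(t) = exp(-(t/eta)^beta) versus R(t) = exp(-lambda*t). A small sketch with hypothetical parameters (matched to give the same reliability at 1000 h) shows why a decreasing failure rate (beta < 1) predicts higher late-mission reliability than the exponential model:

import math

def weibull_rel(t, beta, eta):
    # Weibull reliability; beta < 1 gives a decreasing failure rate.
    return math.exp(-((t / eta) ** beta))

def exponential_rel(t, lam):
    # Constant-failure-rate (exponential) reliability for comparison.
    return math.exp(-lam * t)

eta, beta = 2.0e5, 0.8                       # hypothetical Weibull parameters
lam = (1000.0 / eta) ** beta / 1000.0        # match the models at t = 1000 h
for t in (1000.0, 5000.0, 10000.0):
    print(t, round(weibull_rel(t, beta, eta), 4), round(exponential_rel(t, lam), 4))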
Ye, Dan; Chen, Mengmeng; Li, Kui
2017-11-01
In this paper, we consider the distributed containment control problem of multi-agent systems with actuator bias faults based on an observer method. The objective is to drive the followers into the convex hull spanned by the dynamic leaders, where the input is unknown but bounded. By constructing an observer to estimate the states and bias faults, an effective distributed adaptive fault-tolerant controller is developed. Different from the traditional method, an auxiliary controller gain is designed to deal with the unknown inputs and bias faults together. Moreover, the coupling gain can be adjusted online through the adaptive mechanism without using the global information. Furthermore, the proposed control protocol can guarantee that all the signals of the closed-loop systems are bounded and all the followers converge to the convex hull formed by the dynamic leaders with bounded residual errors. Finally, a decoupled linearized longitudinal motion model of the F-18 aircraft is used to demonstrate the effectiveness.
A Performance Prediction Model for a Fault-Tolerant Computer During Recovery and Restoration
NASA Technical Reports Server (NTRS)
Obando, Rodrigo A.; Stoughton, John W.
1995-01-01
The modeling and design of a fault-tolerant multiprocessor system is addressed. Of interest is the behavior of the system during recovery and restoration after a fault has occurred. The multiprocessor systems are based on the Algorithm to Architecture Mapping Model (ATAMM), and the fault considered is the death of a processor. The developed model is useful in the determination of performance bounds of the system during recovery and restoration. The performance bounds include the time to recover from the fault, the time to restore the system, and the determination of any permanent delay in the input-to-output latency after the system has regained steady state. An implementation of an ATAMM-based computer was developed for a four-processor generic VHSIC spaceborne computer (GVSC) as the target system. A simulation of the GVSC was also written based on the code used in the ATAMM Multicomputer Operating System (AMOS). The simulation is used to verify the new model for tracking the propagation of the delay through the system and predicting the behavior of the transient state of recovery and restoration. The model is shown to accurately predict the transient behavior of an ATAMM based multicomputer during recovery and restoration.
Quantitative fault tolerant control design for a hydraulic actuator with a leaking piston seal
NASA Astrophysics Data System (ADS)
Karpenko, Mark
Hydraulic actuators are complex fluid power devices whose performance can be degraded in the presence of system faults. In this thesis a linear, fixed-gain, fault tolerant controller is designed that can maintain the positioning performance of an electrohydraulic actuator operating under load with a leaking piston seal and in the presence of parametric uncertainties. Developing a control system tolerant to this class of internal leakage fault is important since a leaking piston seal can be difficult to detect unless the actuator is disassembled. The designed fault tolerant control law is of low order, uses only the actuator position as feedback, and can: (i) accommodate nonlinearities in the hydraulic functions, (ii) maintain robustness against typical uncertainties in the hydraulic system parameters, and (iii) keep the positioning performance of the actuator within prescribed tolerances despite an internal leakage fault that can bypass up to 40% of the rated servovalve flow across the actuator piston. Experimental tests verify the functionality of the fault tolerant control under normal and faulty operating conditions. The fault tolerant controller is synthesized based on linear time-invariant equivalent (LTIE) models of the hydraulic actuator using the quantitative feedback theory (QFT) design technique. A numerical approach for identifying LTIE frequency response functions of hydraulic actuators from acceptable input-output responses is developed so that linearizing the hydraulic functions can be avoided. The proposed approach can properly identify the features of the hydraulic actuator frequency response that are important for control system design and requires no prior knowledge about the asymptotic behavior or structure of the LTIE transfer functions. A distributed hardware-in-the-loop (HIL) simulation architecture is constructed that enables the performance of the proposed fault tolerant control law to be further substantiated under realistic operating conditions. Using the HIL framework, the fault tolerant hydraulic actuator is operated as a flight control actuator against the real-time numerical simulation of a high-performance jet aircraft. A robust electrohydraulic loading system is also designed using QFT so that the in-flight aerodynamic load can be experimentally replicated. The results of the HIL experiments show that using the fault tolerant controller to compensate for the internal leakage fault at the actuator level can benefit the flight performance of the airplane.
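The numerical identification of LTIE frequency responses from input-output records can be illustrated generically (this sketch identifies a known first-order lag standing in for the actuator; it is not the thesis's identification procedure): excite the system, record input and output, and form the empirical frequency response as a cross-spectrum to auto-spectrum ratio.

import numpy as np

fs, T = 200.0, 50.0
t = np.arange(0.0, T, 1.0 / fs)
u = np.sin(2 * np.pi * (0.1 + 0.2 * t) * t)        # slow chirp excitation

tau = 0.5                                           # stand-in plant: y' = (u - y)/tau
y = np.zeros_like(u)
for k in range(1, len(u)):
    y[k] = y[k - 1] + (u[k - 1] - y[k - 1]) / (tau * fs)

U, Y = np.fft.rfft(u), np.fft.rfft(y)
f = np.fft.rfftfreq(len(u), 1.0 / fs)
H = (np.conj(U) * Y) / (np.conj(U) * U + 1e-12)     # empirical FRF estimate
k = np.argmin(np.abs(f - 1.0 / (2 * np.pi * tau)))  # corner-frequency bin
print(f"|H| at the corner frequency: {abs(H[k]):.2f} (ideal is about 0.71)")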
Middleware for big data processing: test results
NASA Astrophysics Data System (ADS)
Gankevich, I.; Gaiduchok, V.; Korkhov, V.; Degtyarev, A.; Bogdanov, A.
2017-12-01
Dealing with large volumes of data is resource-consuming work which is more and more often delegated not only to a single computer but also to a whole distributed computing system at once. As the number of computers in a distributed system increases, the amount of effort put into effective management of the system grows. When the system reaches some critical size, much effort should be put into improving its fault tolerance. It is difficult to estimate when some particular distributed system needs such facilities for a given workload, so instead they should be implemented in a middleware which works efficiently with a distributed system of any size. It is also difficult to estimate whether a volume of data is large or not, so the middleware should also work with data of any volume. In other words, the purpose of the middleware is to provide facilities that adapt a distributed computing system to a given workload. In this paper we introduce such a middleware appliance. Tests show that this middleware is well-suited for typical HPC and big data workloads and its performance is comparable with well-known alternatives.
Development and Evaluation of Fault-Tolerant Flight Control Systems
NASA Technical Reports Server (NTRS)
Song, Yong D.; Gupta, Kajal (Technical Monitor)
2004-01-01
The research is concerned with developing a new approach to enhancing fault tolerance of flight control systems. The original motivation for fault-tolerant control comes from the need for safe operation of control elements (e.g. actuators) in the event of hardware failures in high reliability systems. One such example is a modern space vehicle subjected to actuator/sensor impairments. A major task in flight control is to revise the control policy to balance impairment detectability and to achieve sufficient robustness. This involves careful selection of types and parameters of the controllers and the impairment detecting filters used. It also involves a decision, upon the identification of some failures, on whether and how a control reconfiguration should take place in order to maintain a certain system performance level. In this project a new flight dynamics model under uncertain flight conditions is considered, in which the effects of both ramp and jump faults are reflected. Stabilization algorithms based on neural network and adaptive methods are derived. The control algorithms are shown to be effective in dealing with uncertain dynamics due to external disturbances and unpredictable faults. The overall strategy is easy to set up and the computation involved is much less as compared with other strategies. Computer simulation software is developed. A series of simulation studies has been conducted with varying flight conditions.
The implementation and use of Ada on distributed systems with high reliability requirements
NASA Technical Reports Server (NTRS)
Knight, J. C.
1984-01-01
The use and implementation of Ada in distributed environments in which reliability is the primary concern is investigated. Emphasis is placed on the possibility that a distributed system may be programmed entirely in ADA so that the individual tasks of the system are unconcerned with which processors they are executing on, and that failures may occur in the software or underlying hardware. The primary activities are: (1) Continued development and testing of our fault-tolerant Ada testbed; (2) consideration of desirable language changes to allow Ada to provide useful semantics for failure; (3) analysis of the inadequacies of existing software fault tolerance strategies.
Advanced information processing system: Local system services
NASA Technical Reports Server (NTRS)
Burkhardt, Laura; Alger, Linda; Whittredge, Roy; Stasiowski, Peter
1989-01-01
The Advanced Information Processing System (AIPS) is a multi-computer architecture composed of hardware and software building blocks that can be configured to meet a broad range of application requirements. The hardware building blocks are fault-tolerant, general-purpose computers, fault- and damage-tolerant networks (both computer and input/output), and interfaces between the networks and the computers. The software building blocks are the major software functions: local system services, input/output system services, inter-computer system services, and the system manager. The foundation of the local system services is an operating system with the functions required for a traditional real-time multi-tasking computer, such as task scheduling, inter-task communication, memory management, interrupt handling, and time maintenance. Resting on this foundation are the redundancy management functions necessary in a redundant computer and the status reporting functions required for an operator interface. The functional requirements, functional design and detailed specifications for all the local system services are documented.
NASA Technical Reports Server (NTRS)
Platt, M. E.; Lewis, E. E.; Boehm, F.
1991-01-01
A Monte Carlo Fortran computer program was developed that uses two variance reduction techniques for computing system reliability, applicable to solving very large, highly reliable fault-tolerant systems. The program is consistent with the hybrid automated reliability predictor (HARP) code, which employs behavioral decomposition and complex fault-error handling models. This new capability, called MC-HARP, efficiently solves reliability models with non-constant failure rates (Weibull). Common-mode failure modeling is also supported.
NASA Astrophysics Data System (ADS)
Shao, Xinxin; Naghdy, Fazel; Du, Haiping
2017-03-01
A fault-tolerant fuzzy H∞ control design approach for active suspension of in-wheel motor driven electric vehicles in the presence of sprung mass variation, actuator faults and control input constraints is proposed. The controller is designed based on the quarter-car active suspension model with a dynamic-damping-in-wheel-motor-driven-system, in which the suspended motor is operated as a dynamic absorber. The Takagi-Sugeno (T-S) fuzzy model is used to model this suspension with possible sprung mass variation. The parallel-distributed compensation (PDC) scheme is deployed to derive a fault-tolerant fuzzy controller for the T-S fuzzy suspension model. In order to reduce the motor wear caused by the dynamic force transmitted to the in-wheel motor, the dynamic force is taken as an additional controlled output besides the traditional optimization objectives such as sprung mass acceleration, suspension deflection and actuator saturation. The H∞ performance of the proposed controller is derived as linear matrix inequalities (LMIs) comprising three equality constraints which are solved efficiently by means of MATLAB LMI Toolbox. The proposed controller is applied to an electric vehicle suspension and its effectiveness is demonstrated through computer simulation.
Flight test results of the strapdown hexad inertial reference unit (SIRU). Volume 2: Test report
NASA Technical Reports Server (NTRS)
Hruby, R. J.; Bjorkman, W. S.
1977-01-01
Results of flight tests of the Strapdown Inertial Reference Unit (SIRU) navigation system are presented. The fault tolerant SIRU navigation system features a redundant inertial sensor unit and dual computers. System software provides for detection and isolation of inertial sensor failures and continued operation in the event of failures. Flight test results include assessments of the system's navigational performance and fault tolerance. Performance shortcomings are analyzed.
Imperfect construction of microclusters
NASA Astrophysics Data System (ADS)
Schneider, E.; Zhou, K.; Gilbert, G.; Weinstein, Y. S.
2014-01-01
Microclusters are the basic building blocks used to construct cluster states capable of supporting fault-tolerant quantum computation. In this paper, we explore the consequences of errors on microcluster construction using two error models. To quantify the effect of the errors we calculate the fidelity of the constructed microclusters and the fidelity with which two such microclusters can be fused together. Such simulations are vital for gauging the capability of an experimental system to achieve fault tolerance.
Design of the Protocol Processor for the ROBUS-2 Communication System
NASA Technical Reports Server (NTRS)
Torres-Pomales, Wilfredo; Malekpour, Mahyar R.; Miner, Paul S.
2005-01-01
The ROBUS-2 Protocol Processor (RPP) is a custom-designed hardware component implementing the functionality of the ROBUS-2 fault-tolerant communication system. The Reliable Optical Bus (ROBUS) is the core communication system of the Scalable Processor-Independent Design for Enhanced Reliability (SPIDER), a general-purpose fault-tolerant integrated modular architecture currently under development at NASA Langley Research Center. ROBUS is a time-division multiple access (TDMA) broadcast communication system with medium access control by means of a time-indexed communication schedule. ROBUS-2 is a developmental version of the ROBUS providing guaranteed fault-tolerant services to the attached processing elements (PEs), in the presence of a bounded number of faults. These services include message broadcast (Byzantine Agreement), dynamic communication schedule update, time reference (clock synchronization), and distributed diagnosis (group membership). ROBUS also features fault-tolerant startup and restart capabilities. ROBUS-2 tolerates internal as well as PE faults, and incorporates a dynamic self-reconfiguration capability driven by the internal diagnostic system. ROBUS consists of RPPs connected to each other by a lower-level physical communication network. The RPP has a pipelined architecture and the design is parameterized in the behavioral and structural domains. The design of the RPP enables the bus to achieve a PE-message throughput that approaches the available bandwidth at the physical layer.
Formal design specification of a Processor Interface Unit
NASA Technical Reports Server (NTRS)
Fura, David A.; Windley, Phillip J.; Cohen, Gerald C.
1992-01-01
This report describes work to formally specify the requirements and design of a processor interface unit (PIU), a single-chip subsystem providing memory-interface, bus-interface, and additional support services for a commercial microprocessor within a fault-tolerant computer system. This system, the Fault-Tolerant Embedded Processor (FTEP), is targeted towards applications in avionics and space requiring extremely high levels of mission reliability, extended maintenance-free operation, or both. The need for high-quality design assurance in such applications is an undisputed fact, given the disastrous consequences that even a single design flaw can produce. Thus, the further development and application of formal methods to fault-tolerant systems is of critical importance as these systems see increasing use in modern society.
Examples of Nonconservatism in the CARE 3 Program
NASA Technical Reports Server (NTRS)
Dotson, Kelly J.
1988-01-01
This paper presents parameter regions in the CARE 3 (Computer-Aided Reliability Estimation version 3) computer program where the program overestimates the reliability of a modeled system without warning the user. Five simple models of fault-tolerant computer systems are analyzed; and, the parameter regions where reliability is overestimated are given. The source of the error in the reliability estimates for models which incorporate transient fault occurrences was not readily apparent. However, the source of much of the error for models with permanent and intermittent faults can be attributed to the choice of values for the run-time parameters of the program.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fang, Aiman; Laguna, Ignacio; Sato, Kento
Future high-performance computing systems may face frequent failures with their rapid increase in scale and complexity. Resilience to faults has become a major challenge for large-scale applications running on supercomputers, which demands fault tolerance support for prevalent MPI applications. Among failure scenarios, process failures are one of the most severe issues as they usually lead to termination of applications. However, the widely used MPI implementations do not provide mechanisms for fault tolerance. We propose FTA-MPI (Fault Tolerance Assistant MPI), a programming model that provides support for failure detection, failure notification and recovery. Specifically, FTA-MPI exploits a try/catch model that enables failure localization and transparent recovery of process failures in MPI applications. We demonstrate FTA-MPI with synthetic applications and a molecular dynamics code CoMD, and show that FTA-MPI provides high programmability for users and enables convenient and flexible recovery of process failures.
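The try/catch recovery model can be sketched outside MPI with a hypothetical work-redistribution loop (the exception class, worker pool, and failure rate below are all invented for illustration; FTA-MPI's actual interface is not shown): a detected process failure raises an exception, the failed rank is excluded, and its task is resubmitted to the survivors.

import random

class ProcessFailure(Exception):
    # Stands in for the failure notification a fault-tolerant MPI
    # runtime would deliver (hypothetical, not the FTA-MPI API).
    pass

def run_task(worker, task, rng):
    if rng.random() < 0.1:              # simulated process failure
        raise ProcessFailure(worker)
    return task ** 2                    # the actual work

rng = random.Random(7)
workers = {0, 1, 2, 3}
pending = list(range(20))
results = {}
while pending:
    task = pending.pop()
    worker = rng.choice(sorted(workers))
    try:
        results[task] = run_task(worker, task, rng)
    except ProcessFailure as e:
        workers.discard(e.args[0])      # failure detection + notification
        pending.append(task)            # recovery: resubmit to survivors
        if not workers:
            raise RuntimeError("all workers failed")
print(len(results), "tasks completed; surviving workers:", sorted(workers))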
Redundant Asynchronous Microprocessor System
NASA Technical Reports Server (NTRS)
Meyer, G.; Johnston, J. O.; Dunn, W. R.
1985-01-01
A fault-tolerant computer structure called RAMPS (redundant asynchronous microprocessor system) has the simplicity of static redundancy but offers the intermittent-fault-handling ability of complex, dynamically redundant systems. The new structure is useful wherever several microprocessors are employed for control - in aircraft, industrial processes, robotics, and automatic machining, for example.
Fault-tolerance of a neural network solving the traveling salesman problem
NASA Technical Reports Server (NTRS)
Protzel, P.; Palumbo, D.; Arras, M.
1989-01-01
This study presents the results of a fault-injection experiment that simulates a neural network solving the Traveling Salesman Problem (TSP). The network is based on a modified version of Hopfield's and Tank's original method. We define a performance characteristic for the TSP that allows an overall assessment of the solution quality for different city distributions and problem sizes. Five different 10-, 20-, and 30-city cases are used for the injection of up to 13 simultaneous stuck-at-0 and stuck-at-1 faults. The results of more than 4000 simulation runs show the extreme fault tolerance of the network, especially with respect to stuck-at-0 faults. One possible explanation for the overall surprising result is the redundancy of the problem representation.
Bound states for magic state distillation in fault-tolerant quantum computation.
Campbell, Earl T; Browne, Dan E
2010-01-22
Magic state distillation is an important primitive in fault-tolerant quantum computation. The magic states are pure nonstabilizer states which can be distilled from certain mixed nonstabilizer states via Clifford group operations alone. Because of the Gottesman-Knill theorem, mixtures of Pauli eigenstates are not expected to be magic state distillable, but it has been an open question whether all mixed states outside this set may be distilled. In this Letter we show that, when resources are finitely limited, nondistillable states exist outside the stabilizer octahedron. In analogy with the bound entangled states, which arise in entanglement theory, we call such states bound states for magic state distillation.
A Fault Tolerant Self-Routing Computer Network Topology
1987-01-01
NASA Technical Reports Server (NTRS)
Smith, T. B., III; Lala, J. H.
1984-01-01
The FTMP architecture is a high-reliability computer concept modeled after a homogeneous multiprocessor architecture. Elements of the FTMP are operated in tight synchronism with one another, and hardware fault-detection and fault-masking are provided which are transparent to the software. Operating system design and user software design are thus greatly simplified. Performance of the FTMP is also comparable to that of a simplex equivalent due to the efficiency of the fault-handling hardware. The FTMP project constructed an engineering module of the FTMP, programmed the machine, and extensively tested the architecture through fault injection and other stress testing. This testing confirmed the soundness of the FTMP concepts.
Design and Experimental Validation for Direct-Drive Fault-Tolerant Permanent-Magnet Vernier Machines
Liu, Guohai; Yang, Junqin; Chen, Ming; Chen, Qian
2014-01-01
A fault-tolerant permanent-magnet vernier (FT-PMV) machine is designed for direct-drive applications, incorporating the merits of high torque density and high reliability. Based on the so-called magnetic gearing effect, PMV machines have the ability of high torque density by introducing the flux-modulation poles (FMPs). This paper investigates the fault-tolerant characteristic of PMV machines and provides a design method, which is able to not only meet the fault-tolerant requirements but also keep the ability of high torque density. The operation principle of the proposed machine has been analyzed. The design process and optimization are presented specifically, such as the combination of slots and poles, the winding distribution, and the dimensions of PMs and teeth. By using the time-stepping finite element method (TS-FEM), the machine performances are evaluated. Finally, the FT-PMV machine is manufactured, and the experimental results are presented to validate the theoretical analysis. PMID:25045729
Making classical ground-state spin computing fault-tolerant.
Crosson, I J; Bacon, D; Brown, K R
2010-09-01
We examine a model of classical deterministic computing in which the ground state of the classical system is a spatial history of the computation. This model is relevant to quantum dot cellular automata as well as to recent universal adiabatic quantum computing constructions. In its most primitive form, systems constructed in this model cannot compute in an error-free manner when working at nonzero temperature. However, by exploiting a mapping between the partition function for this model and probabilistic classical circuits we are able to show that it is possible to make this model effectively error-free. We achieve this by using techniques in fault-tolerant classical computing and the result is that the system can compute effectively error-free if the temperature is below a critical temperature. We further link this model to computational complexity and show that a certain problem concerning finite temperature classical spin systems is complete for the complexity class Merlin-Arthur. This provides an interesting connection between the physical behavior of certain many-body spin systems and computational complexity.
NASA Technical Reports Server (NTRS)
Torres-Pomales, Wilfredo; Malekpour, Mahyar R.; Miner, Paul S.; Koppen, Sandra V.
2008-01-01
This report describes the design of the test articles and monitoring systems developed to characterize the response of a fault-tolerant computer communication system when stressed beyond the theoretical limits for guaranteed correct performance. A high-intensity radiated electromagnetic field (HIRF) environment was selected as the means of injecting faults, as such environments are known to have the potential to cause arbitrary and coincident common-mode fault manifestations that can overwhelm redundancy management mechanisms. The monitors generate stimuli for the systems-under-test (SUTs) and collect data in real-time on the internal state and the response at the external interfaces. A real-time health assessment capability was developed to support the automation of the test. A detailed description of the nature and structure of the collected data is included. The goal of the report is to provide insight into the design and operation of these systems, and to serve as a reference document for use in post-test analyses.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Riesen, Rolf E.; Bridges, Patrick G.; Stearley, Jon R.
Next-generation exascale systems, those capable of performing a quintillion (10^18) operations per second, are expected to be delivered in the next 8-10 years. These systems, which will be 1,000 times faster than current systems, will be of unprecedented scale. As these systems continue to grow in size, faults will become increasingly common, even over the course of small calculations. Therefore, issues such as fault tolerance and reliability will limit application scalability. Current techniques to ensure progress across faults, like checkpoint/restart, the dominant fault tolerance mechanism for the last 25 years, are increasingly problematic at the scales of future systems due to their excessive overheads. In this work, we evaluate a number of techniques to decrease the overhead of checkpoint/restart and keep this method viable for future exascale systems. More specifically, this work evaluates state-machine replication to dramatically increase the checkpoint interval (the time between successive checkpoints) and hash-based, probabilistic incremental checkpointing using graphics processing units to decrease the checkpoint commit time (the time to save one checkpoint). Using a combination of empirical analysis, modeling, and simulation, we study the costs and benefits of these approaches on a wide range of parameters. These results, which cover a number of high-performance computing capability workloads, different failure distributions, hardware mean times to failure, and I/O bandwidths, show the potential benefits of these techniques for meeting the reliability demands of future exascale platforms.
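The hash-based incremental checkpointing idea can be sketched as follows: the state is divided into fixed-size blocks, and only blocks whose hash has changed since the last checkpoint are committed. The block size and in-memory block store below are illustrative assumptions, not the paper's GPU implementation.

```python
# Sketch of hash-based incremental checkpointing: only blocks whose
# hash changed since the last checkpoint are written. Block size and
# the in-memory "store" are illustrative choices, not the paper's.
import hashlib

BLOCK = 4096

def block_hashes(data: bytes):
    return [hashlib.sha1(data[i:i + BLOCK]).digest()
            for i in range(0, len(data), BLOCK)]

def incremental_checkpoint(data: bytes, prev_hashes, store):
    cur = block_hashes(data)
    for i, h in enumerate(cur):
        if i >= len(prev_hashes) or h != prev_hashes[i]:
            store[i] = data[i * BLOCK:(i + 1) * BLOCK]  # commit dirty block
    return cur

state = bytearray(64 * BLOCK)
store, hashes = {}, []
hashes = incremental_checkpoint(bytes(state), hashes, store)   # full save
state[10 * BLOCK] = 0xFF                                       # touch one block
hashes = incremental_checkpoint(bytes(state), hashes, store)   # saves 1 block
```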
Software/hardware distributed processing network supporting the Ada environment
NASA Astrophysics Data System (ADS)
Wood, Richard J.; Pryk, Zen
1993-09-01
A high-performance, fault-tolerant, distributed network has been developed, tested, and demonstrated. The network is based on the MIPS Computer Systems, Inc. R3000 RISC processor, VHSIC ASICs for high-speed, reliable, inter-node communications, and compatible commercial memory and I/O boards. The network is an evolution of the Advanced Onboard Signal Processor (AOSP) architecture. It supports Ada application software with an Ada-implemented operating system. A six-node implementation (capable of expansion up to 256 nodes) of the RISC multiprocessor architecture provides 120 MIPS of scalar throughput, 96 Mbytes of RAM, and 24 Mbytes of non-volatile memory. The network provides for all ground processing applications, has merit as a space-qualified RISC-based network, and interfaces to advanced Computer Aided Software Engineering (CASE) tools for application software development.
Fault Model Development for Fault Tolerant VLSI Design
1988-05-01
A bridging fault in a digital circuit connects two or more conducting paths of the circuit.
Loss Tolerance in One-Way Quantum Computation via Counterfactual Error Correction
NASA Astrophysics Data System (ADS)
Varnava, Michael; Browne, Daniel E.; Rudolph, Terry
2006-09-01
We introduce a scheme for fault tolerantly dealing with losses (or other “leakage” errors) in cluster state computation that can tolerate up to 50% qubit loss. This is achieved passively using an adaptive strategy of measurement—no coherent measurements or coherent correction is required. Since the scheme relies on inferring information about what would have been the outcome of a measurement had one been able to carry it out, we call this counterfactual error correction.
Sequential Test Strategies for Multiple Fault Isolation
NASA Technical Reports Server (NTRS)
Shakeri, M.; Pattipati, Krishna R.; Raghavan, V.; Patterson-Hine, Ann; Kell, T.
1997-01-01
In this paper, we consider the problem of constructing near optimal test sequencing algorithms for diagnosing multiple faults in redundant (fault-tolerant) systems. The computational complexity of solving the optimal multiple-fault isolation problem is super-exponential, that is, it is much more difficult than the single-fault isolation problem, which, by itself, is NP-hard. By employing concepts from information theory and Lagrangian relaxation, we present several static and dynamic (on-line or interactive) test sequencing algorithms for the multiple fault isolation problem that provide a trade-off between the degree of suboptimality and computational complexity. Furthermore, we present novel diagnostic strategies that generate a static diagnostic directed graph (digraph), instead of a static diagnostic tree, for multiple fault diagnosis. Using this approach, the storage complexity of the overall diagnostic strategy reduces substantially. Computational results based on real-world systems indicate that the size of a static multiple fault strategy is strictly related to the structure of the system, and that the use of an on-line multiple fault strategy can diagnose faults in systems with as many as 10,000 failure sources.
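A common building block behind such information-theoretic test sequencing is a greedy step that picks the test minimizing the expected posterior entropy over the remaining fault hypotheses. The sketch below, with a toy fault/test coverage table, illustrates that step; it is not the paper's algorithm.

```python
# Greedy step of an information-theoretic test sequencing heuristic:
# pick the test whose pass/fail outcome minimizes the expected entropy
# of the remaining fault hypotheses. Toy coverage table, not the paper's.
import math

def entropy(ps):
    return -sum(p * math.log2(p) for p in ps if p > 0)

def normalized(ps):
    s = sum(ps)
    return [p / s for p in ps] if s else ps

def expected_entropy(priors, detects):
    fail = [p for h, p in priors.items() if h in detects]    # test fails
    ok = [p for h, p in priors.items() if h not in detects]  # test passes
    return sum(fail) * entropy(normalized(fail)) + \
           sum(ok) * entropy(normalized(ok))

def best_test(priors, tests):
    return min(tests, key=lambda t: expected_entropy(priors, tests[t]))

priors = {"f1": 0.5, "f2": 0.3, "f3": 0.2}   # fault hypotheses and priors
tests = {"t1": {"f1"}, "t2": {"f1", "f2"}, "t3": {"f3"}}  # test -> faults caught
print("next test:", best_test(priors, tests))  # -> t1
```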
A Unified Fault-Tolerance Protocol
NASA Technical Reports Server (NTRS)
Miner, Paul; Geser, Alfons; Pike, Lee; Maddalon, Jeffrey
2004-01-01
Davies and Wakerly show that Byzantine fault tolerance can be achieved by a cascade of broadcasts and middle value select functions. We present an extension of the Davies and Wakerly protocol, the unified protocol, and its proof of correctness. We prove that it satisfies validity and agreement properties for communication of exact values. We then introduce bounded communication error into the model. Inexact communication is inherent for clock synchronization protocols. We prove that validity and agreement properties hold for inexact communication, and that exact communication is a special case. As a running example, we illustrate the unified protocol using the SPIDER family of fault-tolerant architectures. In particular we demonstrate that the SPIDER interactive consistency, distributed diagnosis, and clock synchronization protocols are instances of the unified protocol.
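The middle-value-select primitive at the heart of this protocol family can be sketched in a few lines: after a broadcast round, each receiver takes the median of the values relayed to it, which masks the contribution of a single faulty relayer. A toy, exact-value version:

```python
# Toy middle-value-select: the median of the values a receiver collects
# masks the contribution of a single faulty relayer (exact values only).
def middle_value_select(values):
    s = sorted(values)
    return s[len(s) // 2]   # median

# three relayed copies of a broadcast, one corrupted by a faulty node:
print(middle_value_select([42.0, 42.0, 9999.0]))   # -> 42.0
print(middle_value_select([42.0, -1e9, 42.0]))     # -> 42.0
```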
DOT National Transportation Integrated Search
1993-05-01
The Maglev control computer system should be designed to verifiably possess high reliability and safety as well as high availability to make Maglev a dependable and attractive transportation alternative to the public. A Maglev computer system has bee...
A survey of an introduction to fault diagnosis algorithms
NASA Technical Reports Server (NTRS)
Mathur, F. P.
1972-01-01
This report surveys the field of diagnosis and introduces some of the key algorithms and heuristics currently in use. Fault diagnosis is an important and a rapidly growing discipline. This is important in the design of self-repairable computers because the present diagnosis resolution of its fault-tolerant computer is limited to a functional unit or processor. Better resolution is necessary before failed units can become partially reuseable. The approach that holds the greatest promise is that of resident microdiagnostics; however, that presupposes a microprogrammable architecture for the computer being self-diagnosed. The presentation is tutorial and contains examples. An extensive bibliography of some 220 entries is included.
NASA Technical Reports Server (NTRS)
Trivedi, K. S. (Editor); Clary, J. B. (Editor)
1980-01-01
A computer aided reliability estimation procedure (CARE 3), developed to model the behavior of ultrareliable systems required by flight-critical avionics and control systems, is evaluated. The mathematical models, numerical method, and fault-tolerant architecture modeling requirements are examined, and the testing and characterization procedures are discussed. Recommendations aimed at enhancing CARE 3 are presented; in particular, the need for a better exposition of the method and the user interface is emphasized.
Implementation of a Fault Tolerant Control Unit within an FPGA for Space Applications
2006-12-01
Conference 2002, September 2002. [20] M. Alderighi, A. Candelori, F. Casini, S. D’Angelo, M. Mancini, A. Paccagnella, S. Pastore , G.R. Sechi, “Heavy...Luigi Carro and Ricardo Reis , “Designing and Testing Fault-Tolerant Techniques for SRAM-based FPGAs,” in Proc. 1st Conference on Computer Frontiers, pp...susceptibility,” in IEEE Proc. 12th IEEE Intl. Symposium on On-Line Testing, pp. 89-91, 2006. [45] Fernanda Lima, Luigi Carro and Ricardo Reis
OpenCluster: A Flexible Distributed Computing Framework for Astronomical Data Processing
NASA Astrophysics Data System (ADS)
Wei, Shoulin; Wang, Feng; Deng, Hui; Liu, Cuiyin; Dai, Wei; Liang, Bo; Mei, Ying; Shi, Congming; Liu, Yingbo; Wu, Jingping
2017-02-01
The volume of data generated by modern astronomical telescopes is extremely large and rapidly growing. However, current high-performance data processing architectures/frameworks are not well suited for astronomers because of their limitations and programming difficulties. In this paper, we therefore present OpenCluster, an open-source distributed computing framework to support rapidly developing high-performance processing pipelines of astronomical big data. We first detail the OpenCluster design principles and implementations and present the APIs facilitated by the framework. We then demonstrate a case in which OpenCluster is used to resolve complex data processing problems for developing a pipeline for the Mingantu Ultrawide Spectral Radioheliograph. Finally, we present our OpenCluster performance evaluation. Overall, OpenCluster provides not only high fault tolerance and simple programming interfaces, but also a flexible means of scaling up the number of interacting entities. OpenCluster thereby provides an easily integrated distributed computing framework for quickly developing a high-performance data processing system of astronomical telescopes and for significantly reducing software development expenses.
Pyshkin, P V; Luo, Da-Wei; Jing, Jun; You, J Q; Wu, Lian-Ao
2016-11-25
Holonomic quantum computation (HQC) may not show its full potential in quantum speedup due to the prerequisite of a long coherent runtime imposed by the adiabatic condition. Here we show that the conventional HQC can be dramatically accelerated by using external control fields, of which the effectiveness is exclusively determined by the integral of the control fields in the time domain. This control scheme can be realized with net zero energy cost and it is fault-tolerant against fluctuation and noise, significantly relaxing the experimental constraints. We demonstrate how to realize the scheme via decoherence-free subspaces. In this way we unify quantum robustness merits of this fault-tolerant control scheme, the conventional HQC and decoherence-free subspace, and propose an expedited holonomic quantum computation protocol.
Experimental magic state distillation for fault-tolerant quantum computing.
Souza, Alexandre M; Zhang, Jingfu; Ryan, Colm A; Laflamme, Raymond
2011-01-25
Any physical quantum device for quantum information processing (QIP) is subject to errors in implementation. In order to be reliable and efficient, quantum computers will need error-correcting or error-avoiding methods. Fault-tolerance achieved through quantum error correction will be an integral part of quantum computers. Of the many methods that have been discovered to implement it, a highly successful approach has been to use transversal gates and specific initial states. A critical element for its implementation is the availability of high-fidelity initial states, such as |0〉 and the 'magic state'. Here, we report an experiment, performed in a nuclear magnetic resonance (NMR) quantum processor, showing sufficient quantum control to improve the fidelity of imperfect initial magic states by distilling five of them into one with higher fidelity.
Lattice surgery on the Raussendorf lattice
NASA Astrophysics Data System (ADS)
Herr, Daniel; Paler, Alexandru; Devitt, Simon J.; Nori, Franco
2018-07-01
Lattice surgery is a method to perform quantum computation fault-tolerantly by using operations on boundary qubits between different patches of the planar code. This technique allows for universal planar code computation without eliminating the intrinsic two-dimensional nearest-neighbor properties of the surface code, which eases physical hardware implementations. Lattice surgery approaches to algorithmic compilation and optimization have been demonstrated to be more resource-efficient for resource-intensive components of a fault-tolerant algorithm, and consequently may be preferable to braid-based logic. Lattice surgery can be extended to the Raussendorf lattice, providing a measurement-based approach to the surface code. In this paper we describe how lattice surgery can be performed on the Raussendorf lattice and thereby give a viable alternative to computation using braiding in measurement-based implementations of topological codes.
Fault-tolerant quantum computation with nondeterministic entangling gates
NASA Astrophysics Data System (ADS)
Auger, James M.; Anwar, Hussain; Gimeno-Segovia, Mercedes; Stace, Thomas M.; Browne, Dan E.
2018-03-01
Performing entangling gates between physical qubits is necessary for building a large-scale universal quantum computer, but in some physical implementations—for example, those that are based on linear optics or networks of ion traps—entangling gates can only be implemented probabilistically. In this work, we study the fault-tolerant performance of a topological cluster state scheme with local nondeterministic entanglement generation, where failed entangling gates (which correspond to bonds on the lattice representation of the cluster state) lead to a defective three-dimensional lattice with missing bonds. We present two approaches for dealing with missing bonds; the first is a nonadaptive scheme that requires no additional quantum processing, and the second is an adaptive scheme in which qubits can be measured in an alternative basis to effectively remove them from the lattice, hence eliminating their damaging effect and leading to better threshold performance. We find that a fault-tolerance threshold can still be observed with a bond-loss rate of 6.5% for the nonadaptive scheme, and a bond-loss rate as high as 14.5% for the adaptive scheme.
Change Detection of Mobile LIDAR Data Using Cloud Computing
NASA Astrophysics Data System (ADS)
Liu, Kun; Boehm, Jan; Alis, Christian
2016-06-01
Change detection has long been a challenging problem, although much research has been conducted in different fields such as remote sensing and photogrammetry, computer vision, and robotics. In this paper, we blend the voxel grid and Apache Spark together to propose an efficient method that addresses the problem in the context of big data. A voxel grid is a regular geometric representation consisting of voxels of the same size, which suits parallel computation well. Apache Spark is a popular distributed parallel computing platform which provides fault tolerance and in-memory caching. These features significantly enhance the performance of Apache Spark and result in an efficient and robust implementation. In our experiments, both synthetic and real point cloud data are employed to demonstrate the quality of our method.
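A plain-Python sketch of the voxel-grid comparison (the Spark version would express the same keying as RDD transformations): points are binned to integer voxel keys, and voxels occupied in only one epoch are flagged as changes. The voxel size is an illustrative parameter.

```python
# Voxel-grid change detection sketch: points are binned into fixed-size
# voxels keyed by integer coordinates; voxels occupied in only one epoch
# are reported as changes. Plain-Python stand-in for the Spark pipeline.
def voxel_keys(points, size=0.5):
    return {(int(x // size), int(y // size), int(z // size))
            for x, y, z in points}

epoch_a = [(0.1, 0.2, 0.3), (5.0, 5.1, 5.2)]
epoch_b = [(0.1, 0.2, 0.3), (9.0, 9.0, 9.0)]

a, b = voxel_keys(epoch_a), voxel_keys(epoch_b)
print("removed:", a - b)   # voxels present only in epoch A
print("added:  ", b - a)   # voxels present only in epoch B
```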
Automatic maintenance payload on board of a Mexican LEO microsatellite
NASA Astrophysics Data System (ADS)
Vicente-Vivas, Esaú; García-Nocetti, Fabián; Mendieta-Jiménez, Francisco
2006-02-01
A few research institutions from Mexico are working together to finalize the integration of a technological demonstration microsatellite called Satex, aiming at the launch of the first fully domestically designed and manufactured space vehicle. The project is based on technical knowledge gained in previous space experiences, particularly in developing GASCAN automatic experiments for NASA's space shuttle, and on some support obtained from the local team which assembled the México-OSCAR-30 microsatellites. Satex includes three autonomous payloads and a power subsystem, each one with a local microcomputer to provide intelligent and dedicated control. It also contains a flight computer (FC) with a pair of full redundancies. This enables the remote maintenance of processing boards from the ground station. A fourth, communications payload depends on the flight computer for control purposes. A fifth payload was developed for the satellite. It adds value to the available on-board computers and extends the opportunity for a developing country to learn and to generate domestic space technology. Its aim is to provide automatic maintenance capabilities for the most critical on-board computer in order to achieve continuous satellite operations. This paper presents the virtual computer architecture specially developed to provide maintenance capabilities to the flight computer. The architecture is periodically implemented by software with a small number of physical processors (FC processors) and virtual redundancies (payload processors) to emulate a hybrid-redundancy computer. Communications among processors are accomplished over a fault-tolerant LAN. This allows a versatile operating behavior in terms of data communication as well as in terms of distributed fault tolerance. Obtained results, payload validation, and reliability results are also presented.
Self-Checking Pairs Of Microprocessors
NASA Technical Reports Server (NTRS)
Smith, Brian S.
1995-01-01
Method of imparting fault tolerance to computer system provides for immediate detection of faults at microprocessor level. Shadow microprocessor provides nominal duplicate outputs to verify functioning of main microprocessor. When output signal on any pin of one microprocessor differs from that on corresponding pin of other microprocessor, comparator puts out alarm signal.
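A software analogue of the self-checking pair, shown here as a hedged sketch: run the main and shadow computations and compare their outputs, raising an alarm on any mismatch rather than letting a wrong result propagate.

```python
# Software analogue of a self-checking pair: execute main and shadow
# computations and compare; a mismatch raises an alarm immediately.
def self_checking_pair(main, shadow, *args):
    a, b = main(*args), shadow(*args)
    if a != b:
        raise RuntimeError(f"lockstep mismatch: {a!r} != {b!r}")
    return a

square = lambda x: x * x
print(self_checking_pair(square, square, 12))   # -> 144
```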
Critical fault patterns determination in fault-tolerant computer systems
NASA Technical Reports Server (NTRS)
Mccluskey, E. J.; Losq, J.
1978-01-01
The method proposed tries to enumerate all the critical fault-patterns (successive occurrences of failures) without analyzing every single possible fault. The conditions for the system to be operating in a given mode can be expressed in terms of the static states. Thus, one can find all the system states that correspond to a given critical mode of operation. The next step consists in analyzing the fault-detection mechanisms, the diagnosis algorithm and the process of switch control. From them, one can find all the possible system configurations that can result from a failure occurrence. Thus, one can list all the characteristics, with respect to detection, diagnosis, and switch control, that failures must have to constitute critical fault-patterns. Such an enumeration of the critical fault-patterns can be directly used to evaluate the overall system tolerance to failures. Present research is focused on how to efficiently make use of these system-level characteristics to enumerate all the failures that verify these characteristics.
Fault Tolerant Frequent Pattern Mining
DOE Office of Scientific and Technical Information (OSTI.GOV)
Shohdy, Sameh; Vishnu, Abhinav; Agrawal, Gagan
FP-Growth is a Frequent Pattern Mining (FPM) algorithm that has been extensively used to study correlations and patterns in large-scale datasets. While several researchers have designed distributed-memory FP-Growth algorithms, it is pivotal to consider fault-tolerant FP-Growth, which can address the increasing fault rates in large-scale systems. In this work, we propose a novel parallel, algorithm-level fault-tolerant FP-Growth algorithm. We leverage algorithmic properties and advanced MPI features to guarantee an O(1) space complexity, achieved by using the dataset memory space itself for checkpointing. We also propose a recovery algorithm that can use in-memory and disk-based checkpointing, though in many cases the recovery can be completed without any disk access, and incurring no memory overhead for checkpointing. We evaluate our fault-tolerant algorithm on a large-scale InfiniBand cluster with several large datasets using up to 2K cores. Our evaluation demonstrates excellent efficiency for checkpointing and recovery in comparison to the disk-based approach. We have also observed a 20x average speed-up in comparison to Spark, establishing that a well-designed algorithm can easily outperform a solution based on a general fault-tolerant programming model.
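The in-dataset checkpointing idea can be illustrated with a toy sketch: as transactions are consumed, their slots in the dataset buffer are reused to hold periodic snapshots of the mining state, so recovery needs no extra checkpoint memory. The counting loop and checkpoint period below are illustrative assumptions, not the paper's FP-Growth implementation.

```python
# Toy sketch of algorithm-level checkpointing that reuses consumed
# dataset slots as checkpoint space, loosely following the O(1)-space
# idea in the abstract; the periodic copy below is illustrative only.
transactions = [["a", "b"], ["b", "c"], ["a", "c"], ["b"], ["a"]]
counts, CKPT_EVERY = {}, 2

for i, t in enumerate(transactions):
    for item in t:
        counts[item] = counts.get(item, 0) + 1
    if (i + 1) % CKPT_EVERY == 0:
        transactions[i] = ("CKPT", i, dict(counts))  # reuse consumed slot

def recover(transactions):
    """Restore the most recent checkpoint and the index to resume at."""
    ckpts = [c for c in transactions if isinstance(c, tuple)]
    if not ckpts:
        return 0, {}
    _, i, saved = max(ckpts, key=lambda c: c[1])
    return i + 1, dict(saved)

print(recover(transactions))   # -> (4, {'a': 2, 'b': 3, 'c': 2})
```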
Expert System Detects Power-Distribution Faults
NASA Technical Reports Server (NTRS)
Walters, Jerry L.; Quinn, Todd M.
1994-01-01
Autonomous Power Expert (APEX) computer program is prototype expert-system program detecting faults in electrical-power-distribution system. Assists human operators in diagnosing faults and deciding what adjustments or repairs needed for immediate recovery from faults or for maintenance to correct initially nonthreatening conditions that could develop into faults. Written in Lisp.
Fault-tolerant power distribution system
NASA Technical Reports Server (NTRS)
Volp, Jeffrey A. (Inventor)
1987-01-01
A fault-tolerant power distribution system which includes a plurality of power sources and a plurality of nodes responsive thereto for supplying power to one or more loads associated with each node. Each node includes a plurality of switching circuits, each of which preferably uses a power field effect transistor which provides a diode operation when power is first applied to the nodes and which thereafter provides bi-directional current flow through the switching circuit in a manner such that a low voltage drop is produced in each direction. Each switching circuit includes circuitry for disabling the power field effect transistor when the current in the switching circuit exceeds a preselected value.
Cost-effective solutions to maintaining smart grid reliability
NASA Astrophysics Data System (ADS)
Qin, Qiu
As aging power systems increasingly operate close to their capacity and thermal limits, maintaining sufficient reliability has been of great concern to government agencies, utility companies, and users. This dissertation focuses on improving the reliability of transmission and distribution systems. Based on wide-area measurements, multiple model algorithms are developed to diagnose transmission-line three-phase short-to-ground faults in the presence of protection misoperations. The multiple model algorithms utilize the electric network dynamics to provide prompt and reliable diagnosis outcomes. Computational complexity of the diagnosis algorithm is reduced by using a two-step heuristic. The multiple model algorithm is incorporated into a hybrid simulation framework, which consists of both continuous-state simulation and discrete-event simulation, to study the operation of transmission systems. With hybrid simulation, a line switching strategy for enhancing the tolerance to protection misoperations is studied based on the concept of a security index, which involves the faulted-mode probability and stability coverage. Local measurements are used to track the generator state, and faulty-mode probabilities are calculated in the multiple model algorithms. FACTS devices are considered as controllers for the transmission system. The placement of FACTS devices into power systems is investigated with a criterion of maintaining a prescribed level of control reconfigurability. Control reconfigurability measures the small-signal combined controllability and observability of a power system with an additional requirement on fault tolerance. For the distribution systems, a hierarchical framework, including a high-level recloser allocation scheme and a low-level recloser placement scheme, is presented. The impacts of recloser placement on the reliability indices are analyzed. Evaluation of reliability indices in the placement process is carried out via discrete-event simulation. The reliability requirements are described with probabilities and evaluated from the empirical distributions of reliability indices.
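The multiple-model diagnosis step can be sketched as a Bayesian update: each candidate fault mode predicts the measurement, and mode probabilities are reweighted by the likelihood of the observed residual. The Gaussian likelihood and numbers below are illustrative assumptions, not the dissertation's models.

```python
# Bayesian multiple-model diagnosis step: each fault mode predicts the
# measurement; mode probabilities are reweighted by residual likelihood.
# The Gaussian likelihood and numbers are illustrative assumptions.
import math

def update_mode_probs(probs, predictions, measurement, sigma=1.0):
    like = {m: math.exp(-(measurement - y) ** 2 / (2 * sigma ** 2))
            for m, y in predictions.items()}
    post = {m: probs[m] * like[m] for m in probs}
    z = sum(post.values())
    return {m: p / z for m, p in post.items()}  # normalized posterior

probs = {"nominal": 0.9, "line_fault": 0.1}        # prior mode probabilities
predictions = {"nominal": 1.0, "line_fault": 0.2}  # per-model outputs
print(update_mode_probs(probs, predictions, measurement=0.25))
```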
The implementation and use of Ada on distributed systems with high reliability requirements
NASA Technical Reports Server (NTRS)
Knight, J. C.; Gregory, S. T.; Urquhart, J. I. A.
1985-01-01
The use and implementation of Ada in distributed environments in which reliability is the primary concern were investigated. In particular, the concept that a distributed system may be programmed entirely in Ada, so that the individual tasks of the system are unconcerned with which processors they are executing on, and that failures may occur in the software or underlying hardware, was examined. Progress is discussed for the following areas: continued development and testing of the fault-tolerant Ada testbed; development of suggested changes to Ada so that it might more easily cope with the failures of interest; and design of new approaches to fault-tolerant software in real-time systems, and integration of these ideas into Ada.
NASA Technical Reports Server (NTRS)
Jansen, Mark; Montague, Gerald; Provenza, Andrew; Palazzolo, Alan
2004-01-01
Closed loop operation of a single, high temperature magnetic radial bearing to 30,000 RPM (2.25 million DN) and 540 C (1000 F) is discussed. Also, high temperature, fault tolerant operation for the three axis system is examined. A novel, hydrostatic backup bearing system was employed to attain high speed, high temperature, lubrication free support of the entire rotor system. The hydrostatic bearings were made of a high lubricity material and acted as journal-type backup bearings. New, high temperature displacement sensors were successfully employed to monitor shaft position throughout the entire temperature range and are described in this paper. Control of the system was accomplished through a stand alone, high speed computer controller and it was used to run both the fault-tolerant PID and active vibration control algorithms.
NASA Technical Reports Server (NTRS)
Bickford, Mark; Srivas, Mandayam
1991-01-01
Presented here is a formal specification and verification of a property of a quadruplicately redundant fault-tolerant microprocessor system design. A complete listing of the formal specification of the system and the correctness theorems that were proved is given. The system performs the task of obtaining interactive consistency among the processors using a special instruction on the processors. The design is based on an algorithm proposed by Pease, Shostak, and Lamport. The property verified ensures that an execution of the special instruction by the processors correctly accomplishes interactive consistency, provided certain preconditions hold. The verification was performed using a computer-aided design verification tool, Spectool, and the theorem prover, Clio. A major contribution of the work is the demonstration of a significant fault-tolerant hardware design that is mechanically verified by a theorem prover.
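For intuition, the interactive-consistency instruction implements the kind of exchange shown in this sketch of a single-fault Pease-Shostak-Lamport round: the source broadcasts, receivers relay what they got, and each receiver majority-votes over its view. The four-processor setup and the faulty relayer below are illustrative, not the verified design.

```python
# Toy single-fault interactive consistency (Pease-Shostak-Lamport style)
# with one source and three receivers, where receiver 2 may relay lies.
from collections import Counter

def om1(source_value, lies_from_node2=None):
    received = [source_value] * 3            # round 1: source broadcasts
    views = []
    for i in range(3):                       # round 2: receivers relay
        view = []
        for j in range(3):
            v = received[j]
            if j == 2 and lies_from_node2 is not None:
                v = lies_from_node2[i]       # faulty node lies per receiver
            view.append(v)
        views.append(Counter(view).most_common(1)[0][0])  # majority vote
    return views    # agreement among loyal nodes 0 and 1 is what matters

print(om1(1))                                # -> [1, 1, 1]
print(om1(1, lies_from_node2=[0, 1, 0]))     # loyal receivers still agree on 1
```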
Fault Tolerant Real-Time Networks
2007-05-30
Periodic Application of Concurrent Error Detection in Processor Array Architectures. Ph.D. Thesis
NASA Technical Reports Server (NTRS)
Chen, Paul Peichuan
1993-01-01
Processor arrays can provide an attractive architecture for some applications. Featuring modularity, regular interconnection and high parallelism, such arrays are well-suited for VLSI/WSI implementations, and applications with high computational requirements, such as real-time signal processing. Preserving the integrity of results can be of paramount importance for certain applications. In these cases, fault tolerance should be used to ensure reliable delivery of a system's service. One aspect of fault tolerance is the detection of errors caused by faults. Concurrent error detection (CED) techniques offer the advantage that transient and intermittent faults may be detected with greater probability than with off-line diagnostic tests. Applying time-redundant CED techniques can reduce hardware redundancy costs. However, most time-redundant CED techniques degrade a system's performance.
2nd Generation QUATARA Flight Computer Project
NASA Technical Reports Server (NTRS)
Falker, Jay; Keys, Andrew; Fraticelli, Jose Molina; Capo-Iugo, Pedro; Peeples, Steven
2015-01-01
Single-core flight computer boards have been designed, developed, and tested (DD&T) to be flown in small satellites for the last few years. In this project, a prototype flight computer will be designed as a distributed multi-core system containing four microprocessors running code in parallel. This flight computer will be capable of performing multiple computationally intensive tasks such as processing digital and/or analog data, controlling actuator systems, managing cameras, operating robotic manipulators, and transmitting/receiving from/to a ground station. In addition, this flight computer will be designed to be fault tolerant by creating both a robust physical hardware connection and by using a software voting scheme to determine the processors' performance. This voting scheme will leverage the work done for the Space Launch System (SLS) flight software. The prototype flight computer will be constructed with Commercial Off-The-Shelf (COTS) components which are estimated to survive for two years in a low-Earth orbit.
A resource management architecture based on complex network theory in cloud computing federation
NASA Astrophysics Data System (ADS)
Zhang, Zehua; Zhang, Xuejie
2011-10-01
Cloud Computing Federation is a main trend of Cloud Computing. Resource management has a significant effect on the design, realization, and efficiency of a Cloud Computing Federation. A Cloud Computing Federation has the typical characteristics of a complex system; therefore, we propose a resource management architecture based on complex network theory for Cloud Computing Federation (abbreviated as RMABC) in this paper, with a detailed design of the resource discovery and resource announcement mechanisms. Compared with existing resource management mechanisms in distributed computing systems, a Task Manager in RMABC can use historical information and current state data obtained from other Task Managers for the evolution of the complex network composed of Task Managers, and thus has advantages in resource discovery speed, fault tolerance, and adaptive ability. The results of the model experiment confirmed the advantages of RMABC in resource discovery performance.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wu, Qishi; Zhu, Mengxia; Rao, Nageswara S
We propose an intelligent decision support system based on sensor and computer networks that incorporates various component techniques for sensor deployment, data routing, distributed computing, and information fusion. The integrated system is deployed in a distributed environment composed of both wireless sensor networks for data collection and wired computer networks for data processing in support of homeland security defense. We present the system framework, formulate the analytical problems, and develop approximate or exact solutions for the subtasks: (i) a sensor deployment strategy based on a two-dimensional genetic algorithm to achieve maximum coverage under cost constraints; (ii) a data routing scheme to achieve maximum signal strength with minimum path loss, high energy efficiency, and effective fault tolerance; (iii) a network mapping method to assign computing modules to network nodes for high-performance distributed data processing; and (iv) a binary decision fusion rule that derives threshold bounds to improve the system hit rate and false alarm rate. These component solutions are implemented and evaluated through either experiments or simulations in various application scenarios. The extensive results demonstrate that these component solutions imbue the integrated system with the desirable and useful quality of intelligence in decision making.
cOSPREY: A Cloud-Based Distributed Algorithm for Large-Scale Computational Protein Design
Pan, Yuchao; Dong, Yuxi; Zhou, Jingtian; Hallen, Mark; Donald, Bruce R.; Xu, Wei
2016-01-01
Finding the global minimum energy conformation (GMEC) of a huge combinatorial search space is the key challenge in computational protein design (CPD) problems. Traditional algorithms lack a scalable and efficient distributed design scheme, preventing researchers from taking full advantage of current cloud infrastructures. We design cloud OSPREY (cOSPREY), an extension to a widely used protein design software OSPREY, to allow the original design framework to scale to the commercial cloud infrastructures. We propose several novel designs to integrate both algorithm and system optimizations, such as GMEC-specific pruning, state search partitioning, asynchronous algorithm state sharing, and fault tolerance. We evaluate cOSPREY on three different cloud platforms using different technologies and show that it can solve a number of large-scale protein design problems that have not been possible with previous approaches. PMID:27154509
Advanced information processing system for advanced launch system: Avionics architecture synthesis
NASA Technical Reports Server (NTRS)
Lala, Jaynarayan H.; Harper, Richard E.; Jaskowiak, Kenneth R.; Rosch, Gene; Alger, Linda S.; Schor, Andrei L.
1991-01-01
The Advanced Information Processing System (AIPS) is a fault-tolerant distributed computer system architecture that was developed to meet the real time computational needs of advanced aerospace vehicles. One such vehicle is the Advanced Launch System (ALS) being developed jointly by NASA and the Department of Defense to launch heavy payloads into low earth orbit at one tenth the cost (per pound of payload) of the current launch vehicles. An avionics architecture that utilizes the AIPS hardware and software building blocks was synthesized for ALS. The AIPS for ALS architecture synthesis process starting with the ALS mission requirements and ending with an analysis of the candidate ALS avionics architecture is described.
Comparative analysis of techniques for evaluating the effectiveness of aircraft computing systems
NASA Technical Reports Server (NTRS)
Hitt, E. F.; Bridgman, M. S.; Robinson, A. C.
1981-01-01
Performability analysis is a technique developed for evaluating the effectiveness of fault-tolerant computing systems in multiphase missions. Performability was evaluated for its accuracy, practical usefulness, and relative cost. The evaluation was performed by applying performability and the fault tree method to a set of sample problems ranging from simple to moderately complex. The problems involved as many as five outcomes, two to five mission phases, permanent faults, and some functional dependencies. Transient faults and software errors were not considered. A different analyst was responsible for each technique. Significantly more time and effort were required to learn performability analysis than the fault tree method. Performability is inherently as accurate as fault tree analysis. For the sample problems, fault trees were more practical and less time consuming to apply, while performability required less ingenuity and was more checkable. Performability offers some advantages for evaluating very complex problems.
Fault-tolerance in Two-dimensional Topological Systems
NASA Astrophysics Data System (ADS)
Anderson, Jonas T.
This thesis is a collection of ideas with the general goal of building, at least in the abstract, a local fault-tolerant quantum computer. The connection between quantum information and topology has proven to be an active area of research in several fields. The introduction of the toric code by Alexei Kitaev demonstrated the usefulness of topology for quantum memory and quantum computation. Many quantum codes used for quantum memory are modeled by spin systems on a lattice, with operators that extract syndrome information placed on vertices or faces of the lattice. It is natural to wonder whether the useful codes in such systems can be classified. This thesis presents work that leverages ideas from topology and graph theory to explore the space of such codes. Homological stabilizer codes are introduced and it is shown that, under a set of reasonable assumptions, any qubit homological stabilizer code is equivalent to either a toric code or a color code. Additionally, the toric code and the color code correspond to distinct classes of graphs. Many systems have been proposed as candidate quantum computers. It is very desirable to design quantum computing architectures with two-dimensional layouts and low complexity in parity-checking circuitry. Kitaev's surface codes provided the first example of codes satisfying this property. They provided a new route to fault tolerance with more modest overheads and thresholds approaching 1%. The recently discovered color codes share many properties with the surface codes, such as the ability to perform syndrome extraction locally in two dimensions. Some families of color codes admit a transversal implementation of the entire Clifford group. This work investigates color codes on the 4.8.8 lattice known as triangular codes. I develop a fault-tolerant error-correction strategy for these codes in which repeated syndrome measurements on this lattice generate a three-dimensional space-time combinatorial structure. I then develop an integer program that analyzes this structure and determines the most likely set of errors consistent with the observed syndrome values. I implement this integer program to find the threshold for depolarizing noise on small versions of these triangular codes. Because the threshold for magic-state distillation is likely to be higher than this value and because logical
A Log-Scaling Fault Tolerant Agreement Algorithm for a Fault Tolerant MPI
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hursey, Joshua J; Naughton, III, Thomas J; Vallee, Geoffroy R
The lack of fault tolerance is becoming a limiting factor for application scalability in HPC systems. The MPI standard does not provide standardized fault tolerance interfaces and semantics. The MPI Forum's Fault Tolerance Working Group is proposing a collective fault-tolerant agreement algorithm for the next MPI standard. Such algorithms play a central role in many fault-tolerant applications. This paper combines a log-scaling two-phase commit agreement algorithm with a reduction operation to provide the necessary functionality for the new collective without any additional messages. Error handling mechanisms are described that preserve the fault tolerance properties while maintaining overall scalability.
An Autonomous Distributed Fault-Tolerant Local Positioning System
NASA Technical Reports Server (NTRS)
Malekpour, Mahyar R.
2017-01-01
We describe a fault-tolerant, GPS-independent (Global Positioning System), distributed, autonomous positioning system for static/mobile objects and present solutions for providing highly accurate geo-location data for these objects in dynamic environments. The reliability and accuracy of a positioning system fundamentally depend on two factors: its timeliness in broadcasting signals and the knowledge of its geometry, i.e., the locations of and distances between the beacons. Existing distributed positioning systems either synchronize to a common external source like GPS or establish their own time synchrony using a master-slave-like scheme, designating a particular beacon as the master to which the other beacons synchronize, resulting in a single point of failure. Another drawback of existing positioning systems is their lack of attention to various fault manifestations, in particular communication link failures, which, as in wireless networks, increasingly dominate over process failures and are typically transient and mobile, in the sense that they typically affect different messages to/from different processes over time.
NASA Technical Reports Server (NTRS)
Lala, J. H.; Smith, T. B., III
1983-01-01
The experimental test and evaluation of the Fault-Tolerant Multiprocessor (FTMP) is described. Major objectives of this exercise include expanding the validation envelope, building confidence in the system, revealing any weaknesses in the architectural concepts and in their execution in hardware and software, and, in general, stressing the hardware and software. To this end, pin-level faults were injected into one LRU of the FTMP, and the FTMP response was measured in terms of fault detection, isolation, and recovery times. A total of 21,055 stuck-at-0, stuck-at-1, and invert-signal faults were injected in the CPU, memory, bus interface circuits, Bus Guardian Units, and voters and error latches. Of these, 17,418 were detected. At least 80 percent of the undetected faults are estimated to be on unused pins. The multiprocessor identified all detected faults correctly and recovered successfully in each case. Total recovery time for all faults averaged a little over one second. This can be reduced to half a second by including appropriate self-tests.
Experimental fault-tolerant universal quantum gates with solid-state spins under ambient conditions
Rong, Xing; Geng, Jianpei; Shi, Fazhan; Liu, Ying; Xu, Kebiao; Ma, Wenchao; Kong, Fei; Jiang, Zhen; Wu, Yang; Du, Jiangfeng
2015-01-01
Quantum computation provides great speedup over its classical counterpart for certain problems. One of the key challenges for quantum computation is to realize precise control of the quantum system in the presence of noise. Control of spin qubits in solids with the accuracy required by fault-tolerant quantum computation under ambient conditions remains elusive. Here, we quantitatively characterize the sources of noise during quantum gate operation and demonstrate strategies to suppress their effects. A universal set of logic gates in a nitrogen-vacancy centre in diamond is reported with an average single-qubit gate fidelity of 0.999952 and a two-qubit gate fidelity of 0.992. These high control fidelities have been achieved at room temperature in naturally abundant 13C diamond via composite pulses and an optimized control method. PMID:26602456
Software Implemented Fault-Tolerant (SIFT) user's guide
NASA Technical Reports Server (NTRS)
Green, D. F., Jr.; Palumbo, D. L.; Baltrus, D. W.
1984-01-01
Program development for a Software Implemented Fault Tolerant (SIFT) computer system is accomplished in the NASA LaRC AIRLAB facility using a DEC VAX-11 to interface with eight Bendix BDX 930 flight control processors. The interface software which provides this SIFT program development capability was developed by AIRLAB personnel. This technical memorandum describes the application and design of this software in detail, and is intended to assist both the user in performance of SIFT research and the systems programmer responsible for maintaining and/or upgrading the SIFT programming environment.
Fault-tolerant battery system employing intra-battery network architecture
Hagen, Ronald A.; Chen, Kenneth W.; Comte, Christophe; Knudson, Orlin B.; Rouillard, Jean
2000-01-01
A distributed energy storing system employing a communications network is disclosed. A distributed battery system includes a number of energy storing modules, each of which includes a processor and communications interface. In a network mode of operation, a battery computer communicates with each of the module processors over an intra-battery network and cooperates with individual module processors to coordinate module monitoring and control operations. The battery computer monitors a number of battery and module conditions, including the potential and current state of the battery and individual modules, and the conditions of the battery's thermal management system. An over-discharge protection system, equalization adjustment system, and communications system are also controlled by the battery computer. The battery computer logs and reports various status data on battery level conditions which may be reported to a separate system platform computer. A module transitions to a stand-alone mode of operation if the module detects an absence of communication connectivity with the battery computer. A module which operates in a stand-alone mode performs various monitoring and control functions locally within the module to ensure safe and continued operation.
Distributed Evaluation Functions for Fault Tolerant Multi-Rover Systems
NASA Technical Reports Server (NTRS)
Agogino, Adrian; Turner, Kagan
2005-01-01
The ability to evolve fault-tolerant control strategies for large collections of agents is critical to the successful application of evolutionary strategies to domains where failures are common. Furthermore, while evolutionary algorithms have been highly successful in discovering single-agent control strategies, extending such algorithms to multiagent domains has proven to be difficult. In this paper we present a method for shaping evaluation functions for agents that provides control strategies that both are tolerant to different types of failures and lead to coordinated behavior in a multi-agent setting. This method relies on neither a centralized strategy (susceptible to a single point of failure) nor a distributed strategy where each agent uses a system-wide evaluation function (severe credit assignment problem). In a multi-rover problem, we show that agents using our agent-specific evaluation perform up to 500% better than agents using the system evaluation. In addition, we show that agents are still able to maintain a high level of performance when up to 60% of the agents fail due to actuator, communication, or controller faults.
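The agent-specific evaluation described here is in the spirit of a difference evaluation: score each rover by the global utility minus the utility computed with that rover removed, which keeps credit assignment local. A toy sketch, with a hypothetical `G` that counts distinct points of interest observed:

```python
# Sketch of an agent-specific (difference) evaluation: each rover is
# scored by the global utility minus the utility with that rover removed.
# G() is a toy global utility, not the paper's rover-domain reward.
def G(observations):
    return len(set(observations))          # distinct POIs observed

def difference_eval(observations, i):
    without_i = observations[:i] + observations[i + 1:]
    return G(observations) - G(without_i)  # agent i's marginal contribution

obs = ["poi_a", "poi_a", "poi_b"]          # rovers 0 and 1 overlap
print([difference_eval(obs, i) for i in range(len(obs))])  # [0, 0, 1]
```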
Coherent Oscillations inside a Quantum Manifold Stabilized by Dissipation
NASA Astrophysics Data System (ADS)
Touzard, S.; Grimm, A.; Leghtas, Z.; Mundhada, S. O.; Reinhold, P.; Axline, C.; Reagor, M.; Chou, K.; Blumoff, J.; Sliwa, K. M.; Shankar, S.; Frunzio, L.; Schoelkopf, R. J.; Mirrahimi, M.; Devoret, M. H.
2018-04-01
Manipulating the state of a logical quantum bit (qubit) usually comes at the expense of exposing it to decoherence. Fault-tolerant quantum computing tackles this problem by manipulating quantum information within a stable manifold of a larger Hilbert space, whose symmetries restrict the number of independent errors. The remaining errors do not affect the quantum computation and are correctable after the fact. Here we implement the autonomous stabilization of an encoding manifold spanned by Schrödinger cat states in a superconducting cavity. We show Zeno-driven coherent oscillations between these states analogous to the Rabi rotation of a qubit protected against phase flips. Such gates are compatible with quantum error correction and hence are crucial for fault-tolerant logical qubits.
Use of non-adiabatic geometric phase for quantum computing by NMR.
Das, Ranabir; Kumar, S K Karthick; Kumar, Anil
2005-12-01
Geometric phases have stimulated researchers because of their potential applications in many areas of science. One of them is fault-tolerant quantum computation. A preliminary requisite of quantum computation is the implementation of controlled dynamics of qubits. In controlled dynamics, one qubit undergoes coherent evolution and acquires an appropriate phase, depending on the state of other qubits. If the evolution is geometric, then the phase acquired depends only on the geometry of the path executed, and is robust against certain types of error. This phenomenon leads to inherently fault-tolerant quantum computation. Here we suggest a technique of using non-adiabatic geometric phase for quantum computation, using selective excitation. In a two-qubit system, we selectively evolve a suitable subsystem, where the control qubit is in state |1〉, through a closed circuit. By this evolution, the target qubit gains a phase controlled by the state of the control qubit. Using the non-adiabatic geometric phase we demonstrate implementation of the Deutsch-Jozsa algorithm and Grover's search algorithm in a two-qubit system.
Adaptive and technology-independent architecture for fault-tolerant distributed AAL solutions.
Schmidt, Michael; Obermaisser, Roman
2018-04-01
Today's architectures for Ambient Assisted Living (AAL) must cope with a variety of challenges like flawless sensor integration and time synchronization (e.g. for sensor data fusion) while abstracting from the underlying technologies at the same time. Furthermore, an architecture for AAL must be capable of managing distributed application scenarios in order to support elderly people in all situations of their everyday life. This encompasses not just life at home but in particular the mobility of elderly people (e.g. when going for a walk or doing sports) as well. Within this paper we introduce a novel architecture for distributed AAL solutions whose design follows a modern microservices approach by providing small core services instead of a monolithic application framework. The architecture comprises core services for sensor integration and service discovery while supporting several communication models (periodic, sporadic, streaming). We extend the state of the art by introducing a fault-tolerance model for our architecture on the basis of a fault hypothesis describing the fault-containment regions (FCRs) with their respective failure modes and failure rates in order to support safety-critical AAL applications.
Córcoles, A.D.; Magesan, Easwar; Srinivasan, Srikanth J.; Cross, Andrew W.; Steffen, M.; Gambetta, Jay M.; Chow, Jerry M.
2015-01-01
The ability to detect and deal with errors when manipulating quantum systems is a fundamental requirement for fault-tolerant quantum computing. Unlike classical bits that are subject to only digital bit-flip errors, quantum bits are susceptible to a much larger spectrum of errors, for which any complete quantum error-correcting code must account. Whilst classical bit-flip detection can be realized via a linear array of qubits, a general fault-tolerant quantum error-correcting code requires extending into a higher-dimensional lattice. Here we present a quantum error detection protocol on a two-by-two planar lattice of superconducting qubits. The protocol detects an arbitrary quantum error on an encoded two-qubit entangled state via quantum non-demolition parity measurements on another pair of error syndrome qubits. This result represents a building block towards larger lattices amenable to fault-tolerant quantum error correction architectures such as the surface code. PMID:25923200
Fault tolerance with noisy and slow measurements and preparation.
Paz-Silva, Gerardo A; Brennen, Gavin K; Twamley, Jason
2010-09-03
It is not widely known that measurement-free quantum error correction protocols can be designed to achieve fault-tolerant quantum computing. Despite their potential advantages in terms of relaxed accuracy, speed, and addressing requirements, they have usually been overlooked because they are expected to yield a very poor threshold. We show that this is not the case. We design fault-tolerant circuits for the 9-qubit Bacon-Shor code and find an error threshold for unitary gates and preparation of p_thresh^(p,g) = 3.76×10^(-5) (30% of the best known result for the same code using measurement) while admitting up to 1/3 error rates for measurements and placing no constraints on measurement speed. We further show that demanding gate error rates sufficiently below the threshold pushes the preparation threshold up to p_thresh^(p) = 1/3.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Katti, Amogh; Di Fatta, Giuseppe; Naughton III, Thomas J
Future extreme-scale high-performance computing systems will be required to work under frequent component failures. The MPI Forum's User Level Failure Mitigation proposal has introduced an operation, MPI_Comm_shrink, to synchronize the alive processes on the list of failed processes, so that applications can continue to execute even in the presence of failures by adopting algorithm-based fault tolerance techniques. This MPI_Comm_shrink operation requires a fault tolerant failure detection and consensus algorithm. This paper presents and compares two novel failure detection and consensus algorithms. The proposed algorithms are based on Gossip protocols and are inherently fault-tolerant and scalable. The proposed algorithms were implemented and tested using the Extreme-scale Simulator. The results show that in both algorithms the number of Gossip cycles to achieve global consensus scales logarithmically with system size. The second algorithm also shows better scalability in terms of memory and network bandwidth usage and a perfect synchronization in achieving global consensus.
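The reported logarithmic scaling is easy to reproduce in a toy model. The following Python sketch is an illustrative simulation, not the authors' Extreme-scale Simulator code: alive ranks merge their suspicion sets by random pairwise gossip until all of them agree on the failed set, and the cycle count grows roughly logarithmically with system size.

```python
import random

def gossip_consensus(n, failed, seed=0):
    """Toy push-pull gossip: alive ranks merge suspicion sets pairwise each
    cycle until every alive rank agrees on the full set of failed ranks.
    Returns the number of gossip cycles used."""
    rng = random.Random(seed)
    alive = [r for r in range(n) if r not in failed]
    failures = sorted(failed)
    # Seed each alive rank with one directly-observed failure (round robin,
    # so that collectively the alive ranks have seen every failure).
    view = {r: {failures[i % len(failures)]} for i, r in enumerate(alive)}
    cycles = 0
    while any(view[r] != failed for r in alive):
        cycles += 1
        for r in alive:
            peer = rng.choice([p for p in alive if p != r])
            view[r] = view[peer] = view[r] | view[peer]  # exchange and merge
    return cycles

if __name__ == "__main__":
    for n in (16, 64, 256, 1024):
        print(n, gossip_consensus(n, failed={1, 2, 3}))
```

With this toy seeding, the printed cycle counts grow roughly with log(n), mirroring the scaling reported in the abstract.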
A data management system to enable urgent natural disaster computing
NASA Astrophysics Data System (ADS)
Leong, Siew Hoon; Kranzlmüller, Dieter; Frank, Anton
2014-05-01
Civil protection, in particular natural disaster management, is very important to most nations and civilians in the world. When disasters like flash floods, earthquakes and tsunamis are expected or have taken place, it is of utmost importance to make timely decisions for managing the affected areas and reduce casualties. Computer simulations can generate information and provide predictions to facilitate this decision making process. Getting the data to the required resources is a critical requirement to enable the timely computation of the predictions. An urgent data management system to support natural disaster computing is thus necessary to effectively carry out data activities within a stipulated deadline. Since the trigger of a natural disaster is usually unpredictable, it is not always possible to prepare required resources well in advance. As such, an urgent data management system for natural disaster computing has to be able to work with any type of resources. Additional requirements include the need to manage deadlines and huge volumes of data, fault tolerance, reliability, flexibility to changes, ease of usage, etc. The proposed data management platform includes a service manager to provide a uniform and extensible interface for the supported data protocols, a configuration manager to check and retrieve configurations of available resources, a scheduler manager to ensure that the deadlines can be met, a fault tolerance manager to increase the reliability of the platform and a data manager to initiate and perform the data activities. These managers enable the selection of the most appropriate resource, transfer protocol, etc., such that the hard deadline of an urgent computation can be met for a particular urgent activity, e.g. data staging or computation. We associate two types of deadlines [2] with an urgent computing system. Firm deadline: missing a firm deadline renders the computation less useful, resulting in a cost that can have severe consequences. Hard deadline: missing a hard deadline renders the computation useless and results in fully catastrophic consequences. A prototype of this system has a REST-based service manager. The REST-based implementation provides a uniform interface that is easy to use. New and upcoming file transfer protocols can easily be added and accessed via the service manager. The service manager interacts with the other four managers to coordinate the data activities so that the fundamental natural disaster urgent computing requirement, i.e. the deadline, can be fulfilled in a reliable manner. A data activity can include data staging, data archiving and data storing. Reliability is ensured by the choice of a network-of-managers organisation model [1], the configuration manager and the fault tolerance manager. With this proposed design, an easy-to-use, resource-independent data management system that can support and fulfill the computation of a natural disaster prediction within stipulated deadlines can thus be realised (a deadline-check sketch follows the references below). References [1] H. G. Hegering, S. Abeck, and B. Neumair, Integrated management of networked systems - concepts, architectures, and their operational application, Morgan Kaufmann Publishers, 340 Pine Street, Sixth Floor, San Francisco, CA 94104-3205, USA, 1999. [2] H. Kopetz, Real-time systems design principles for distributed embedded applications, second edition, Springer, LLC, 233 Spring Street, New York, NY 10013, USA, 2011. [3] S. H. Leong, A. Frank, and D. Kranzlmüller, Leveraging e-infrastructures for urgent computing, Procedia Computer Science 18 (2013), 2177-2186, 2013 International Conference on Computational Science. [4] N. Trebon, Enabling urgent computing within the existing distributed computing infrastructure, Ph.D. thesis, University of Chicago, August 2011, http://people.cs.uchicago.edu/~ntrebon/docs/dissertation.pdf.
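As a concrete illustration of the scheduler manager's deadline check, here is a minimal Python sketch; the class, names and figures are invented for illustration and are not taken from the prototype. An urgent transfer is assigned to the resource/protocol pair whose estimated completion time leaves the most slack before the hard deadline.

```python
from dataclasses import dataclass

@dataclass
class Option:
    resource: str
    protocol: str
    bandwidth_mb_s: float   # estimated sustained transfer bandwidth
    setup_s: float          # connection/protocol setup cost

def pick_option(options, data_mb, deadline_s):
    """Return the feasible option with the most slack against the hard
    deadline, or None if no option can meet it (escalate/abort)."""
    feasible = []
    for o in options:
        estimate = o.setup_s + data_mb / o.bandwidth_mb_s
        if estimate <= deadline_s:
            feasible.append((deadline_s - estimate, o))
    return max(feasible, key=lambda t: t[0])[1] if feasible else None

opts = [Option("cluster-A", "gridftp", 120.0, 5.0),
        Option("cluster-B", "https", 40.0, 1.0)]
print(pick_option(opts, data_mb=6000, deadline_s=90))
```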
Zhang, Zhe; Kong, Xiangping; Yin, Xianggen; Yang, Zengli; Wang, Lijun
2014-01-01
In order to solve the problems of existing wide-area backup protection (WABP) algorithms, the paper proposes a novel WABP algorithm based on the distribution characteristics of fault component current and improved Dempster/Shafer (D-S) evidence theory. When a fault occurs, slave substations transmit to the master station the amplitudes of the fault component currents of the transmission lines that are closest to the fault element. The master station then identifies suspicious faulty lines according to the distribution characteristics of fault component current. After that, the master station identifies the actual faulty line with improved D-S evidence theory, based on the action states of traditional protections and the direction components of these suspicious faulty lines. Simulation examples based on the IEEE 10-generator-39-bus system show that the proposed WABP algorithm has excellent performance. The algorithm has a low requirement for sampling synchronization, small wide-area communication flow, and high fault tolerance. PMID:25050399
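For readers unfamiliar with D-S evidence fusion, the sketch below shows plain Dempster combination over a two-hypothesis frame in Python. The masses are invented, and the paper's improved rule (which reweights conflicting evidence) is not reproduced here.

```python
def dempster_combine(m1, m2):
    """Combine two basic probability assignments over the same frame using
    Dempster's rule. Focal elements are frozensets; the empty set's mass
    (the conflict K) is renormalized away."""
    combined, conflict = {}, 0.0
    for a, wa in m1.items():
        for b, wb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + wa * wb
            else:
                conflict += wa * wb
    scale = 1.0 - conflict
    return {s: w / scale for s, w in combined.items()}

F, H = frozenset({"faulty"}), frozenset({"healthy"})
U = F | H  # uncertainty mass assigned to the whole frame
m_current = {F: 0.7, U: 0.3}           # evidence from fault-component currents
m_protect = {F: 0.6, H: 0.2, U: 0.2}   # evidence from protection action states
print(dempster_combine(m_current, m_protect))
```

Combining the two sources concentrates mass on the "faulty" hypothesis while the renormalization discards the 0.14 of conflicting mass.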
Software life cycle methodologies and environments
NASA Technical Reports Server (NTRS)
Fridge, Ernest
1991-01-01
Products of this project will significantly improve the quality and productivity of Space Station Freedom Program software processes by improving software reliability and safety and by broadening the range of problems that can be solved with computational solutions. The project brings in Computer Aided Software Engineering (CASE) technology for: environments such as the Engineering Script Language/Parts Composition System (ESL/PCS) application generator, an Intelligent User Interface for cost avoidance in setting up operational computer runs, the Framework programmable platform for defining process and software development work flow control, a process for bringing CASE technology into an organization's culture, and CLIPS/CLIPS Ada for developing expert systems; and methodologies such as a method for developing fault tolerant, distributed systems and a method for developing systems for common sense reasoning and for solving expert system problems when only approximate truths are known.
High-fidelity spin measurement on the nitrogen-vacancy center
NASA Astrophysics Data System (ADS)
Hanks, Michael; Trupke, Michael; Schmiedmayer, Jörg; Munro, William J.; Nemoto, Kae
2017-10-01
Nitrogen-vacancy (NV) centers in diamond are versatile candidates for many quantum information processing tasks, ranging from quantum imaging and sensing through to quantum communication and fault-tolerant quantum computers. Critical to almost every potential application is an efficient mechanism for the high fidelity readout of the state of the electronic and nuclear spins. Typically such readout has been achieved through an optically resonant fluorescence measurement, but the presence of decay through a meta-stable state will limit its efficiency to the order of 99%. While this is good enough for many applications, it is insufficient for large scale quantum networks and fault-tolerant computational tasks. Here we explore an alternative approach based on dipole induced transparency (state-dependent reflection) in an NV center cavity QED system, using the most recent knowledge of the NV center’s parameters to determine its feasibility, including the decay channels through the meta-stable subspace and photon ionization. We find that single-shot measurements above fault-tolerant thresholds should be available in the strong coupling regime for a wide range of cavity-center cooperativities, using a majority voting approach utilizing single photon detection. Furthermore, extremely high fidelity measurements are possible using weak optical pulses.
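The effect of the majority-vote readout is easy to quantify. Assuming, hypothetically, that each single-shot reflection measurement is independently correct with probability p, the Python sketch below computes the fidelity after an n-round majority vote.

```python
from math import comb

def majority_fidelity(p, n):
    """Probability that a majority vote over n independent measurements,
    each correct with probability p, returns the correct answer."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

# Illustrative numbers only: p = 0.99 per-shot correctness, odd vote sizes
for n in (1, 3, 5, 7):
    print(n, f"{majority_fidelity(0.99, n):.6f}")
```

Even a 3-round vote pushes a 99% per-shot fidelity to roughly 99.97%, which is the sense in which repetition lifts the readout above fault-tolerance thresholds.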
Efficient preparation of large-block-code ancilla states for fault-tolerant quantum computation
NASA Astrophysics Data System (ADS)
Zheng, Yi-Cong; Lai, Ching-Yi; Brun, Todd A.
2018-03-01
Fault-tolerant quantum computation (FTQC) schemes that use multiqubit large block codes can potentially reduce the resource overhead to a great extent. A major obstacle is the requirement for a large number of clean ancilla states of different types without correlated errors inside each block. These ancilla states are usually logical stabilizer states of the data-code blocks, which are generally difficult to prepare if the code size is large. Previously, we proposed an ancilla distillation protocol for Calderbank-Shor-Steane (CSS) codes using classical error-correcting codes. It was assumed that the quantum gates in the distillation circuit were perfect; in reality, however, noisy quantum gates may introduce correlated errors that are not treatable by the protocol. In this paper, we show that additional postselection by another classical error-detecting code can be applied to remove almost all correlated errors. Consequently, the revised protocol is fully fault tolerant and capable of preparing a large set of stabilizer states sufficient for FTQC using large block codes. At the same time, the yield rate can be boosted from O(t^(-2)) to O(1) in practice for an [[n,k,d=2t+1]] code.
Squid - a simple bioinformatics grid.
Carvalho, Paulo C; Glória, Rafael V; de Miranda, Antonio B; Degrave, Wim M
2005-08-03
BLAST is a widely used genetic research tool for analysis of similarity between nucleotide and protein sequences. This paper presents a software application entitled "Squid" that makes use of grid technology. The current version, as an example, is configured for BLAST applications, but adaptation for other computing-intensive repetitive tasks can be easily accomplished in the open source version. This enables the allocation of remote resources to perform distributed computing, making large BLAST queries viable without the need for high-end computers. Most distributed computing / grid solutions have complex installation procedures requiring a computer specialist, or have limitations regarding operating systems. Squid is a multi-platform, open-source program designed to "keep things simple" while offering high-end computing power for large scale applications. Squid also has an efficient fault tolerance and crash recovery system against data loss, being able to re-route jobs upon node failure and recover even if the master machine fails. Our results show that a Squid application, working with N nodes and proper network resources, can process BLAST queries almost N times faster than if working with only one computer. Squid offers high-end computing, even for the non-specialist, and is freely available at the project web site. Its open-source and binary Windows distributions contain detailed instructions and a "plug-n-play" installation containing a pre-configured example.
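Squid's re-routing behaviour can be pictured with a toy master loop. The sketch below is an illustration of the idea, not Squid's actual implementation: chunks of a large query are dispatched to worker nodes, and a chunk lost to a node crash is simply put back on the queue for a surviving node.

```python
import queue

def run_jobs(chunks, nodes, fails_on=None):
    """Toy master loop in the spirit of Squid's crash recovery: each chunk
    is handed to a free node; if the node fails mid-job, the chunk is put
    back on the queue and re-routed to a surviving node."""
    fails_on = fails_on or {}
    pending = queue.Queue()
    for c in chunks:
        pending.put(c)
    done, dead = [], set()
    while not pending.empty():
        chunk = pending.get()
        node = next(n for n in nodes if n not in dead)
        if fails_on.get(node) == chunk:   # simulated node crash on this job
            dead.add(node)
            pending.put(chunk)            # re-route the lost chunk
        else:
            done.append((chunk, node))
    return done, dead

done, dead = run_jobs(["q1", "q2", "q3"], ["n1", "n2"], fails_on={"n1": "q2"})
print(done, dead)   # q2 ends up re-processed on n2 after n1 dies
```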
NASA Technical Reports Server (NTRS)
Stoughton, John W.; Obando, Rodrigo A.
1993-01-01
The modeling and design of a fault-tolerant multiprocessor system is addressed. In particular, the behavior of the system during recovery and restoration after a fault has occurred is investigated. Given that a multicomputer system is designed using the Algorithm to Architecture to Mapping Model (ATAMM), and that a fault (death of a computing resource) occurs during its normal steady-state operation, a model is presented as a viable research tool for predicting the performance bounds of the system during its recovery and restoration phases. Furthermore, the bounds of the performance behavior of the system during this transient mode can be assessed. These bounds include: the time to recover from the fault (t(sub rec)), the time to restore the system (t(sub res)), and whether there is a permanent delay in the system's Time Between Input and Output (TBIO) after the system has reached a steady state. An implementation of an ATAMM based computer was developed with the Generic VHSIC Spaceborne Computer (GVSC) as the target system. A simulation of the GVSC was also written based on the code used in the ATAMM Multicomputer Operating System (AMOS). The simulation is in turn used to validate the new model's usefulness and accuracy in tracking the propagation of the delay through the system and in predicting the behavior in the transient state of recovery and restoration. The model is validated as an accurate method to predict the transient behavior of an ATAMM based multicomputer during recovery and restoration.
Integrated Data and Control Level Fault Tolerance Techniques for Signal Processing Computer Design
1990-09-01
TOLERANCE TECHNIQUES FOR SIGNAL PROCESSING COMPUTER DESIGN G. Robert Redinbo I. INTRODUCTION High-speed signal processing is an important application of...techniques and mathematical approaches will be expanded later to the situation where hardware errors and roundoff and quantization noise affect all...detect errors equal in number to the degree of g(X), the maximum permitted by the Singleton bound [13]. Real cyclic codes, primarily applicable to
Formal specification and mechanical verification of SIFT - A fault-tolerant flight control system
NASA Technical Reports Server (NTRS)
Melliar-Smith, P. M.; Schwartz, R. L.
1982-01-01
The paper describes the methodology being employed to demonstrate rigorously that the SIFT (software-implemented fault-tolerant) computer meets its requirements. The methodology uses a hierarchy of design specifications, expressed in the mathematical domain of multisorted first-order predicate calculus. The most abstract of these, from which almost all details of mechanization have been removed, represents the requirements on the system for reliability and intended functionality. Successive specifications in the hierarchy add design and implementation detail until the PASCAL programs implementing the SIFT executive are reached. A formal proof that a SIFT system in a 'safe' state operates correctly despite the presence of arbitrary faults has been completed all the way from the most abstract specifications to the PASCAL program.
Software-implemented fault insertion: An FTMP example
NASA Technical Reports Server (NTRS)
Czeck, Edward W.; Siewiorek, Daniel P.; Segall, Zary Z.
1987-01-01
This report presents a model for fault insertion through software; describes its implementation on a fault-tolerant computer, FTMP; presents a summary of fault detection, identification, and reconfiguration data collected with software-implemented fault insertion; and compares the results to hardware fault insertion data. Experimental results show detection time to be a function of time of insertion and system workload. For the fault detection time, there is no correlation between software-inserted faults and hardware-inserted faults; this is because hardware-inserted faults must manifest as errors before detection, whereas software-inserted faults immediately exercise the error detection mechanisms. In summary, the software-implemented fault insertion is able to be used as an evaluation technique for the fault-handling capabilities of a system in fault detection, identification and recovery. Although the software-inserted faults do not map directly to hardware-inserted faults, experiments show software-implemented fault insertion is capable of emulating hardware fault insertion, with greater ease and automation.
NASA Astrophysics Data System (ADS)
King, Nelson E.; Liu, Brent; Zhou, Zheng; Documet, Jorge; Huang, H. K.
2005-04-01
Grid Computing represents the latest and most exciting technology to evolve from the familiar realm of parallel, peer-to-peer and client-server models, and it can address the problem of fault-tolerant storage for backup and recovery of clinical images. We have researched and developed a novel Data Grid testbed involving several federated PAC systems based on grid architecture. By integrating a grid computing architecture into the DICOM environment, a failed PACS archive can recover its image data from others in the federation in a timely and seamless fashion. The design reflects the five-layer architecture of grid computing: Fabric, Resource, Connectivity, Collective, and Application Layers. The testbed Data Grid architecture representing three federated PAC systems, the Fault-Tolerant PACS archive server at the Image Processing and Informatics Laboratory, Marina del Rey, the clinical PACS at Saint John's Health Center, Santa Monica, and the clinical PACS at the Healthcare Consultation Center II, USC Health Science Campus, will be presented. The successful demonstration of the Data Grid in the testbed will provide an understanding of the Data Grid concept in clinical image data backup, establish performance benchmarks against which future grid technology improvements can be measured, and serve as a road map for expanded research into large enterprise- and federation-level data grids to guarantee 99.999% uptime.
Application of the actor model to large scale NDE data analysis
NASA Astrophysics Data System (ADS)
Coughlin, Chris
2018-03-01
The Actor model of concurrent computation discretizes a problem into a series of independent units or actors that interact only through the exchange of messages. Without direct coupling between individual components, an Actor-based system is inherently concurrent and fault-tolerant. These traits lend themselves to so-called "Big Data" applications in which the volume of data to analyze requires a distributed multi-system design. For a practical demonstration of the Actor computational model, a system was developed to assist with the automated analysis of Nondestructive Evaluation (NDE) datasets using the open source Myriad Data Reduction Framework. A machine learning model trained to detect damage in two-dimensional slices of C-Scan data was deployed in a streaming data processing pipeline. To demonstrate the flexibility of the Actor model, the pipeline was deployed on a local system and re-deployed as a distributed system without recompiling, reconfiguring, or restarting the running application.
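The essential mechanics are small enough to sketch. Below is a minimal thread-and-mailbox actor in Python, standing in for the Myriad framework's actors (whose actual API is not shown here): state is owned by a single thread and interaction happens only through messages, so components stay decoupled.

```python
import threading, queue

class Actor:
    """Minimal actor: a private mailbox drained by its own thread.
    State is touched only by the owning thread, so no locks are needed
    and a crashed actor cannot corrupt its peers."""
    def __init__(self, handler):
        self.mailbox = queue.Queue()
        self.handler = handler
        threading.Thread(target=self._run, daemon=True).start()

    def send(self, msg):
        self.mailbox.put(msg)      # asynchronous, non-blocking send

    def _run(self):
        while True:
            msg = self.mailbox.get()
            if msg is None:        # poison pill stops the actor
                break
            self.handler(msg)

# Toy "classifier" actor: flags a 1-D scan slice as damaged above a threshold
results = queue.Queue()
classify = Actor(lambda scan: results.put(
    ("damage" if max(scan) > 0.8 else "ok", scan)))
for scan in ([0.1, 0.9, 0.2], [0.3, 0.4, 0.1]):
    classify.send(scan)
print(results.get(), results.get())
```

Because senders never share the classifier's state, the same pipeline can be redistributed across processes or machines by replacing the mailbox transport, which is the portability property the abstract highlights.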
PAWS/STEM - PADE APPROXIMATION WITH SCALING AND SCALED TAYLOR EXPONENTIAL MATRIX (VAX VMS VERSION)
NASA Technical Reports Server (NTRS)
Butler, R. W.
1994-01-01
Traditional fault-tree techniques for analyzing the reliability of large, complex systems fail to model the dynamic reconfiguration capabilities of modern computer systems. Markov models, on the other hand, can describe fault-recovery (via system reconfiguration) as well as fault-occurrence. The Pade Approximation with Scaling (PAWS) and Scaled Taylor Exponential Matrix (STEM) programs provide a flexible, user-friendly, language-based interface for the creation and evaluation of Markov models describing the behavior of fault-tolerant reconfigurable computer systems. PAWS and STEM produce exact solutions for the probability of system failure and provide a conservative estimate of the number of significant digits in the solution. The calculation of the probability of entering a death state of a Markov model (representing system failure) requires the solution of a set of coupled differential equations. Because of the large disparity between the rates of fault arrivals and system recoveries, Markov models of fault-tolerant architectures inevitably lead to numerically stiff differential equations. Both PAWS and STEM have the capability to solve numerically stiff models. These complementary programs use separate methods to determine the matrix exponential in the solution of the model's system of differential equations. In general, PAWS is better suited to evaluate small and dense models. STEM operates at lower precision, but works faster than PAWS for larger models. The mathematical approach chosen to solve a reliability problem may vary with the size and nature of the problem. Although different solution techniques are utilized on different programs, it is possible to have a common input language. The Systems Validation Methods group at NASA Langley Research Center has created a set of programs that form the basis for a reliability analysis workstation. The set of programs are: SURE reliability analysis program (COSMIC program LAR-13789, LAR-14921); the ASSIST specification interface program (LAR-14193, LAR-14923), PAWS/STEM reliability analysis programs (LAR-14165, LAR-14920); and the FTC fault tree tool (LAR-14586, LAR-14922). FTC is used to calculate the top-event probability for a fault tree. PAWS/STEM and SURE are programs which interpret the same SURE language, but utilize different solution methods. ASSIST is a preprocessor that generates SURE language from a more abstract definition. SURE, ASSIST, and PAWS/STEM are also offered as a bundle. Please see the abstract for COS-10039/COS-10041, SARA - SURE/ASSIST Reliability Analysis Workstation, for pricing details. PAWS/STEM was originally developed for DEC VAX series computers running VMS and was later ported for use on Sun computers running SunOS. The package is written in PASCAL, ANSI compliant C-language, and FORTRAN 77. The standard distribution medium for the VMS version of PAWS/STEM (LAR-14165) is a 9-track 1600 BPI magnetic tape in VMSINSTAL format. It is also available on a TK50 tape cartridge in VMSINSTAL format. Executables are included. The standard distribution medium for the Sun version of PAWS/STEM (LAR-14920) is a .25 inch streaming magnetic tape cartridge in UNIX tar format. Both Sun3 and Sun4 executables are included. PAWS/STEM was developed in 1989 and last updated in 1991. DEC, VAX, VMS, and TK50 are trademarks of Digital Equipment Corporation. SunOS, Sun3, and Sun4 are trademarks of Sun Microsystems, Inc. UNIX is a registered trademark of AT&T Bell Laboratories.
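The core computation both programs perform, evaluating the matrix exponential of a stiff Markov generator to obtain the probability of reaching a death state, can be illustrated generically. The sketch below uses SciPy rather than the packages' actual PASCAL/FORTRAN internals, and the rates are invented; it shows why such models are stiff: recovery rates dwarf fault-arrival rates.

```python
import numpy as np
from scipy.linalg import expm

# Three-state reconfigurable system: 0 = operational, 1 = fault being
# handled by reconfiguration, 2 = death state (absorbing system failure).
lam = 1e-4   # fault-arrival rate (per hour)
mu = 3.6e3   # recovery rate (per hour): vastly faster than fault arrivals,
rho = 1e-4   # which is exactly the disparity that makes the ODEs stiff
Q = np.array([[-lam,         lam,  0.0],
              [  mu, -(mu + rho),  rho],
              [ 0.0,         0.0,  0.0]])   # rows sum to zero (generator)

t = 10.0                          # mission time in hours
P = expm(Q * t)                   # transition probabilities over [0, t]
print(f"P(system failure by t={t} h) = {P[0, 2]:.3e}")
```

PAWS would evaluate this exponential with a scaled Pade approximation and STEM with a scaled Taylor series; the sketch simply delegates the same job to SciPy's expm.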
PAWS/STEM - PADE APPROXIMATION WITH SCALING AND SCALED TAYLOR EXPONENTIAL MATRIX (SUN VERSION)
NASA Technical Reports Server (NTRS)
Butler, R. W.
1994-01-01
Traditional fault-tree techniques for analyzing the reliability of large, complex systems fail to model the dynamic reconfiguration capabilities of modern computer systems. Markov models, on the other hand, can describe fault-recovery (via system reconfiguration) as well as fault-occurrence. The Pade Approximation with Scaling (PAWS) and Scaled Taylor Exponential Matrix (STEM) programs provide a flexible, user-friendly, language-based interface for the creation and evaluation of Markov models describing the behavior of fault-tolerant reconfigurable computer systems. PAWS and STEM produce exact solutions for the probability of system failure and provide a conservative estimate of the number of significant digits in the solution. The calculation of the probability of entering a death state of a Markov model (representing system failure) requires the solution of a set of coupled differential equations. Because of the large disparity between the rates of fault arrivals and system recoveries, Markov models of fault-tolerant architectures inevitably lead to numerically stiff differential equations. Both PAWS and STEM have the capability to solve numerically stiff models. These complementary programs use separate methods to determine the matrix exponential in the solution of the model's system of differential equations. In general, PAWS is better suited to evaluate small and dense models. STEM operates at lower precision, but works faster than PAWS for larger models. The mathematical approach chosen to solve a reliability problem may vary with the size and nature of the problem. Although different solution techniques are utilized on different programs, it is possible to have a common input language. The Systems Validation Methods group at NASA Langley Research Center has created a set of programs that form the basis for a reliability analysis workstation. The set of programs are: SURE reliability analysis program (COSMIC program LAR-13789, LAR-14921); the ASSIST specification interface program (LAR-14193, LAR-14923), PAWS/STEM reliability analysis programs (LAR-14165, LAR-14920); and the FTC fault tree tool (LAR-14586, LAR-14922). FTC is used to calculate the top-event probability for a fault tree. PAWS/STEM and SURE are programs which interpret the same SURE language, but utilize different solution methods. ASSIST is a preprocessor that generates SURE language from a more abstract definition. SURE, ASSIST, and PAWS/STEM are also offered as a bundle. Please see the abstract for COS-10039/COS-10041, SARA - SURE/ASSIST Reliability Analysis Workstation, for pricing details. PAWS/STEM was originally developed for DEC VAX series computers running VMS and was later ported for use on Sun computers running SunOS. The package is written in PASCAL, ANSI compliant C-language, and FORTRAN 77. The standard distribution medium for the VMS version of PAWS/STEM (LAR-14165) is a 9-track 1600 BPI magnetic tape in VMSINSTAL format. It is also available on a TK50 tape cartridge in VMSINSTAL format. Executables are included. The standard distribution medium for the Sun version of PAWS/STEM (LAR-14920) is a .25 inch streaming magnetic tape cartridge in UNIX tar format. Both Sun3 and Sun4 executables are included. PAWS/STEM was developed in 1989 and last updated in 1991. DEC, VAX, VMS, and TK50 are trademarks of Digital Equipment Corporation. SunOS, Sun3, and Sun4 are trademarks of Sun Microsystems, Inc. UNIX is a registered trademark of AT&T Bell Laboratories.
A fault-tolerant information processing concept for space vehicles.
NASA Technical Reports Server (NTRS)
Hopkins, A. L., Jr.
1971-01-01
A distributed fault-tolerant information processing system is proposed, comprising a central multiprocessor, dedicated local processors, and multiplexed input-output buses connecting them together. The processors in the multiprocessor are duplicated for error detection, which is felt to be less expensive than using coded redundancy of comparable effectiveness. Error recovery is made possible by a triplicated scratchpad memory in each processor. The main multiprocessor memory uses replicated memory for error detection and correction. Local processors use any of three conventional redundancy techniques: voting, duplex pairs with backup, and duplex pairs in independent subsystems.
Fault Tolerant Microcontroller for the Configurable Fault Tolerant Processor
2008-09-01
many others come to mind I also wish to thank Jan Gray for providing an excellent System-on-a-Chip that formed a core component of this thesis...developed by Jan Gray as documented in his "Building a RISC CPU and System-on-a-Chip in an FPGA" series of articles that was published in Circuit Cellar...those detailed by Jan Gray in his "Getting Started with the XSOC Project v0.93" [16]. The XSOC distribution is available at <http://www.fpgacpu.org
FPGA-Based, Self-Checking, Fault-Tolerant Computers
NASA Technical Reports Server (NTRS)
Some, Raphael; Rennels, David
2004-01-01
A proposed computer architecture would exploit the capabilities of commercially available field-programmable gate arrays (FPGAs) to enable computers to detect and recover from bit errors. The main purpose of the proposed architecture is to enable fault-tolerant computing in the presence of single-event upsets (SEUs). [An SEU is a spurious bit flip (also called a soft error) caused by a single impact of ionizing radiation.] The architecture would also enable recovery from some soft errors caused by electrical transients and, to some extent, from intermittent and permanent (hard) errors caused by aging of electronic components. A typical FPGA of the current generation contains one or more complete processor cores, memories, and high-speed serial input/output (I/O) channels, making it possible to shrink a board-level processor node to a single integrated-circuit chip. Custom, highly efficient microcontrollers, general-purpose computers, custom I/O processors, and signal processors can be rapidly and efficiently implemented by use of FPGAs. Unfortunately, FPGAs are susceptible to SEUs. Prior efforts to mitigate the effects of SEUs have yielded solutions that degrade performance of the system and require support from external hardware and software. In comparison with other fault-tolerant computing architectures (e.g., triple modular redundancy), the proposed architecture could be implemented with less circuitry and lower power demand. Moreover, the fault-tolerant computing functions would require only minimal support from circuitry outside the central processing units (CPUs) of computers, would not require any software support, and would be largely transparent to software and to other computer hardware. There would be two types of modules: a self-checking processor module and a memory system (see figure). The self-checking processor module would be implemented on a single FPGA and would be capable of detecting its own internal errors. It would contain two CPUs executing identical programs in lock step, with comparison of their outputs to detect errors. It would also contain various cache and local memory circuits, communication circuits, and configurable special-purpose processors that would use self-checking checkers. (The basic principle of the self-checking checker method is to utilize logic circuitry that generates error signals whenever there is an error in either the checker or the circuit being checked.) The memory system would comprise a main memory and a hardware-controlled check-pointing system (CPS) based on a buffer memory denoted the recovery cache. The main memory would contain random-access memory (RAM) chips and FPGAs that would, in addition to everything else, implement double-error-detecting and single-error-correcting memory functions to enable recovery from single-bit errors.
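The self-checking-pair idea can be sketched in a few lines. The following Python toy illustrates lock-step compare-and-rollback only; it is not the proposed FPGA design. Two copies of a tiny accumulator machine execute in lock step, their outputs are compared after every instruction, and both roll back to the last checkpoint (the "recovery cache") when an injected upset makes them disagree.

```python
import copy

class ToyCPU:
    """Tiny accumulator machine standing in for one CPU of the pair."""
    def __init__(self):
        self.acc = 0
    def execute(self, op, arg):
        self.acc = self.acc + arg if op == "add" else self.acc * arg
        return self.acc

def run_self_checking(program, upset_at=None):
    """Run two copies in lock step; on output mismatch, restore both from
    the checkpoint and retry the instruction (recovery-cache rollback)."""
    a, b = ToyCPU(), ToyCPU()
    checkpoint = copy.deepcopy((a, b))
    i = 0
    while i < len(program):
        if i == upset_at:
            a.acc ^= 0b100                     # simulated SEU in copy A
            upset_at = None                    # transient: gone on retry
        op, arg = program[i]
        if a.execute(op, arg) != b.execute(op, arg):
            a, b = copy.deepcopy(checkpoint)   # detected: roll back both
        else:
            checkpoint = copy.deepcopy((a, b)) # commit a new checkpoint
            i += 1
    return a.acc

print(run_self_checking([("add", 3), ("mul", 4), ("add", 1)], upset_at=1))
# -> 13, identical to the fault-free run despite the injected upset
```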
Mini-Ckpts: Surviving OS Failures in Persistent Memory
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fiala, David; Mueller, Frank; Ferreira, Kurt Brian
Concern is growing in the high-performance computing (HPC) community on the reliability of future extreme-scale systems. Current efforts have focused on application fault-tolerance rather than the operating system (OS), despite the fact that recent studies have suggested that failures in OS memory are more likely. The OS is critical to a system's correct and efficient operation of the node and processes it governs -- and in HPC also for any other nodes a parallelized application runs on and communicates with: any single node failure generally forces all processes of this application to terminate due to tight communication in HPC. Therefore, the OS itself must be capable of tolerating failures. In this work, we introduce mini-ckpts, a framework which enables application survival despite the occurrence of a fatal OS failure or crash. Mini-ckpts achieves this tolerance by ensuring that the critical data describing a process is preserved in persistent memory prior to the failure. Following the failure, the OS is rejuvenated via a warm reboot and the application continues execution, effectively making the failure and restart transparent. The mini-ckpts rejuvenation and recovery process is measured to take between three and six seconds and has a failure-free overhead of 3-5% for a number of key HPC workloads. In contrast to current fault-tolerance methods, this work ensures that the operating and runtime system can continue in the presence of faults. This is a much finer-grained and dynamic method of fault-tolerance than the current, coarse-grained, application-centric methods. Handling faults at this level has the potential to greatly reduce overheads and enables mitigation of additional fault scenarios.
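A file-backed toy conveys the flow; persistent memory, kernel integration, and transparent warm reboot are what mini-ckpts actually provides, whereas the sketch below merely mimics the checkpoint/recover contract with a JSON file in a temporary directory.

```python
import json, os, tempfile

STATE_FILE = os.path.join(tempfile.gettempdir(), "mini_ckpt.json")

def checkpoint(state):
    """Persist the critical process state (mini-ckpts does this
    transparently into persistent memory before an OS crash can hit)."""
    with open(STATE_FILE, "w") as f:
        json.dump(state, f)

def recover():
    """After a simulated OS crash and warm reboot, resume from the
    preserved state instead of restarting the whole application."""
    with open(STATE_FILE) as f:
        return json.load(f)

state = {"iteration": 0, "partial_sum": 0.0}
for i in range(1, 6):
    state = {"iteration": i, "partial_sum": state["partial_sum"] + 1.0 / i}
    checkpoint(state)

print("simulated OS crash; resumed at:", recover())
os.remove(STATE_FILE)
```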
Test experience on an ultrareliable computer communication network
NASA Technical Reports Server (NTRS)
Abbott, L. W.
1984-01-01
The dispersed sensor processing mesh (DSPM) is an experimental, ultrareliable, fault-tolerant computer communications network that exhibits an organic-like ability to regenerate itself after suffering damage. The regeneration is accomplished by two routines - grow and repair. This paper discusses the DSPM concept for achieving fault tolerance and provides a brief description of the mechanization of both the experiment and the six-node experimental network. The main topic of this paper is the system performance of the growth algorithm contained in the grow routine. The characteristics imbued to DSPM by the growth algorithm are also discussed. Data from an experimental DSPM network and software simulation of larger DSPM-type networks are used to examine the inherent limitation on growth time by the growth algorithm and the relationship of growth time to network size and topology.
Application of a Resource Theory for Magic States to Fault-Tolerant Quantum Computing.
Howard, Mark; Campbell, Earl
2017-03-03
Motivated by their necessity for most fault-tolerant quantum computation schemes, we formulate a resource theory for magic states. First, we show that robustness of magic is a well-behaved magic monotone that operationally quantifies the classical simulation overhead for a Gottesman-Knill-type scheme using ancillary magic states. Our framework subsequently finds immediate application in the task of synthesizing non-Clifford gates using magic states. When magic states are interspersed with Clifford gates, Pauli measurements, and stabilizer ancillas (the most general synthesis scenario), the class of synthesizable unitaries is hard to characterize. Our techniques can place nontrivial lower bounds on the number of magic states required for implementing a given target unitary. Guided by these results, we have found new and optimal examples of such synthesis.
Local Alignment Tool Based on Hadoop Framework and GPU Architecture
Hung, Che-Lun; Hua, Guan-Jie
2014-01-01
With the rapid growth of next generation sequencing technologies, such as Slex, more and more data have been discovered and published. To analyze such huge data, computational performance is an important issue. Recently, many tools, such as SOAP, have been implemented on Hadoop and GPU parallel computing architectures. BLASTP is an important tool, implemented on GPU architectures, for biologists to compare protein sequences. To deal with big biology data, it is hard to rely on a single GPU. Therefore, we implement a distributed BLASTP by combining Hadoop and multiple GPUs. The experimental results show that the proposed method can improve on the performance of BLASTP on a single GPU, and that it also achieves high availability and fault tolerance. PMID:24955362
Fault tolerance in a supercomputer through dynamic repartitioning
Chen, Dong; Coteus, Paul W.; Gara, Alan G.; Takken, Todd E.
2007-02-27
A multiprocessor, parallel computer is made tolerant to hardware failures by providing extra groups of redundant standby processors and by designing the system so that these extra groups of processors can be swapped with any group which experiences a hardware failure. This swapping can be under software control, thereby permitting the entire computer to sustain a hardware failure but, after swapping in the standby processors, to still appear to software as a pristine, fully functioning system.
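A sketch of the swap, with invented names: the application addresses logical groups, and a software-controlled map re-binds a failed physical group to a standby one, so the machine still appears pristine and fully functioning after the failure.

```python
class Partition:
    """Toy partition map: logical group id -> physical processor group."""
    def __init__(self, active_groups, spare_groups):
        self.map = {i: g for i, g in enumerate(active_groups)}
        self.spares = list(spare_groups)

    def on_group_failure(self, logical_id):
        """Swap a failed group for a standby one under software control;
        the application keeps addressing the same logical ids."""
        if not self.spares:
            raise RuntimeError("no standby groups left")
        failed = self.map[logical_id]
        self.map[logical_id] = self.spares.pop(0)
        return failed

part = Partition(active_groups=["rack0", "rack1"], spare_groups=["spare0"])
part.on_group_failure(1)
print(part.map)   # {0: 'rack0', 1: 'spare0'}: logically unchanged
```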
Low cost management of replicated data in fault-tolerant distributed systems
NASA Technical Reports Server (NTRS)
Joseph, Thomas A.; Birman, Kenneth P.
1990-01-01
Many distributed systems replicate data for fault tolerance or availability. In such systems, a logical update on a data item results in a physical update on a number of copies. The synchronization and communication required to keep the copies of replicated data consistent introduce a delay when operations are performed. A technique is described that relaxes the usual degree of synchronization, permitting replicated data items to be updated concurrently with other operations, while at the same time ensuring that correctness is not violated. The additional concurrency thus obtained results in better response time when performing operations on replicated data. How this technique performs in conjunction with a roll-back and a roll-forward failure recovery mechanism is also discussed.
Continuous-Variable Instantaneous Quantum Computing is Hard to Sample.
Douce, T; Markham, D; Kashefi, E; Diamanti, E; Coudreau, T; Milman, P; van Loock, P; Ferrini, G
2017-02-17
Instantaneous quantum computing is a subuniversal quantum complexity class, whose circuits have proven to be hard to simulate classically in the discrete-variable realm. We extend this proof to the continuous-variable (CV) domain by using squeezed states and homodyne detection, and by exploring the properties of postselected circuits. In order to treat postselection in CVs, we consider finitely resolved homodyne detectors, corresponding to a realistic scheme based on discrete probability distributions of the measurement outcomes. The unavoidable errors stemming from the use of finitely squeezed states are suppressed through a qubit-into-oscillator Gottesman-Kitaev-Preskill encoding of quantum information, which was previously shown to enable fault-tolerant CV quantum computation. Finally, we show that, in order to render postselected computational classes in CVs meaningful, a logarithmic scaling of the squeezing parameter with the circuit size is necessary, translating into a polynomial scaling of the input energy.
Rule-based fault diagnosis of hall sensors and fault-tolerant control of PMSM
NASA Astrophysics Data System (ADS)
Song, Ziyou; Li, Jianqiu; Ouyang, Minggao; Gu, Jing; Feng, Xuning; Lu, Dongbin
2013-07-01
Hall sensors are widely used for estimating the rotor phase of permanent magnet synchronous motors (PMSM). Rotor position is an essential parameter of PMSM control algorithms, so Hall sensor faults can be very dangerous. Yet there is scarcely any research focusing on fault diagnosis and fault-tolerant control of the Hall sensors used in PMSM. From this standpoint, the Hall sensor faults that may occur during PMSM operation are theoretically analyzed. According to the analysis results, a fault diagnosis algorithm for the Hall sensors, based on three rules, is proposed to classify the fault phenomena accurately. Rotor phase estimation algorithms based on one or two Hall sensor(s) are then used to construct the fault-tolerant control algorithm. The fault diagnosis algorithm can detect all 60 Hall fault phenomena, and every detection completes within 1/138 of a rotor rotation period. The fault-tolerant control algorithm achieves smooth torque production, i.e., the same control effect as the normal control mode (with three Hall sensors). Finally, a PMSM bench test verifies the accuracy and rapidity of the fault diagnosis and fault-tolerant control strategies. The fault diagnosis algorithm can detect all Hall sensor faults promptly, and the fault-tolerant control algorithm allows the PMSM to ride through failures of one or two Hall sensor(s). In addition, the transitions between healthy-control and fault-tolerant control conditions are smooth, without any additional noise and harshness. The proposed algorithms can deal with the Hall sensor faults of PMSM in real applications and can be used to realize the fault diagnosis and fault-tolerant control of PMSM.
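Two of the classic Hall-signal validity rules are easy to state in code. The sketch below is illustrative only, and the paper's three rules and 60 fault phenomena are not reproduced; it flags the never-valid 000/111 codes and transitions that skip a step of the six-state sequence.

```python
# Valid 3-bit Hall codes in electrical-rotation order (one common convention)
SEQ = [0b001, 0b011, 0b010, 0b110, 0b100, 0b101]

def diagnose(prev, curr):
    """Two illustrative rules (assumes prev was a valid code):
    (1) 000/111 are never produced by healthy 120-degree-spaced sensors;
    (2) a healthy transition moves exactly one step along the sequence."""
    if curr in (0b000, 0b111):
        return "fault: invalid code (stuck sensors)"
    if prev == curr:
        return "ok: no transition yet"
    i, j = SEQ.index(prev), SEQ.index(curr)
    if j in ((i + 1) % 6, (i - 1) % 6):
        return "ok: legal transition"
    return "fault: skipped state (missed or spurious edge)"

print(diagnose(0b001, 0b011))  # ok: legal transition
print(diagnose(0b001, 0b111))  # fault: invalid code
print(diagnose(0b001, 0b010))  # fault: skipped state
```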
Fault tolerant and lifetime control architecture for autonomous vehicles
NASA Astrophysics Data System (ADS)
Bogdanov, Alexander; Chen, Yi-Liang; Sundareswaran, Venkataraman; Altshuler, Thomas
2008-04-01
Increased vehicle autonomy, survivability and utility can provide an unprecedented impact on mission success and are one of the most desirable improvements for modern autonomous vehicles. We propose a general architecture of intelligent resource allocation, reconfigurable control and system restructuring for autonomous vehicles. The architecture is based on fault-tolerant control and lifetime prediction principles, and it provides improved vehicle survivability, extended service intervals, greater operational autonomy through lower rate of time-critical mission failures and lesser dependence on supplies and maintenance. The architecture enables mission distribution, adaptation and execution constrained on vehicle and payload faults and desirable lifetime. The proposed architecture will allow managing missions more efficiently by weighing vehicle capabilities versus mission objectives and replacing the vehicle only when it is necessary.
Advanced information processing system: Authentication protocols for network communication
NASA Technical Reports Server (NTRS)
Harper, Richard E.; Adams, Stuart J.; Babikyan, Carol A.; Butler, Bryan P.; Clark, Anne L.; Lala, Jaynarayan H.
1994-01-01
In safety critical I/O and intercomputer communication networks, reliable message transmission is an important concern. Difficulties of communication and fault identification in networks arise primarily because the sender of a transmission cannot be identified with certainty, an intermediate node can corrupt a message without certainty of detection, and a babbling node cannot be identified and silenced without lengthy diagnosis and reconfiguration. Authentication protocols use digital signature techniques to verify the authenticity of messages with high probability. Such protocols appear to provide an efficient solution to many of these problems. The objective of this program is to develop, demonstrate, and evaluate intercomputer communication architectures which employ authentication. As a context for the evaluation, the authentication protocol-based communication concept was demonstrated under this program by hosting a real-time flight critical guidance, navigation and control algorithm on a distributed, heterogeneous, mixed redundancy system of workstations and embedded fault-tolerant computers.
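As a stand-in for the digital-signature protocols evaluated in the program (which this sketch does not reproduce), the Python example below uses a keyed MAC from the standard library to show the basic guarantee: a relaying node cannot corrupt a message, nor a babbling node forge a sender, without the check failing with overwhelming probability.

```python
import hmac, hashlib

KEY = b"per-link secret"   # illustrative shared key, not AIPS's actual scheme

def sign(sender, payload):
    """Attach an authentication tag binding sender identity to payload."""
    tag = hmac.new(KEY, sender + b"|" + payload, hashlib.sha256).digest()
    return sender, payload, tag

def verify(sender, payload, tag):
    """Recompute the tag; corruption or forgery makes the comparison fail."""
    expected = hmac.new(KEY, sender + b"|" + payload, hashlib.sha256).digest()
    return hmac.compare_digest(tag, expected)

msg = sign(b"node-3", b"pitch=2.5")
print(verify(*msg))                      # True: authentic message
tampered = (msg[0], b"pitch=9.9", msg[2])
print(verify(*tampered))                 # False: corruption detected
```

Real digital signatures additionally prevent the receiver itself from forging tags, which is what lets the protocols in the report assign blame to a specific faulty node.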
Sequential behavior and its inherent tolerance to memory faults.
NASA Technical Reports Server (NTRS)
Meyer, J. F.
1972-01-01
Representation of a memory fault of a sequential machine M by a function mu on the states of M and the result of the fault by an appropriately determined machine M(mu). Given some sequential behavior B, its inherent tolerance to memory faults can then be measured in terms of the minimum memory redundancy required to realize B with a state-assigned machine having fault tolerance type tau and fault tolerance level t. A behavior having maximum inherent tolerance is exhibited, and it is shown that behaviors of the same size can have different inherent tolerance.
Lognormal Approximations of Fault Tree Uncertainty Distributions.
El-Shanawany, Ashraf Ben; Ardron, Keith H; Walker, Simon P
2018-01-26
Fault trees are used in reliability modeling to create logical models of fault combinations that can lead to undesirable events. The output of a fault tree analysis (the top event probability) is expressed in terms of the failure probabilities of basic events that are input to the model. Typically, the basic event probabilities are not known exactly, but are modeled as probability distributions: therefore, the top event probability is also represented as an uncertainty distribution. Monte Carlo methods are generally used for evaluating the uncertainty distribution, but such calculations are computationally intensive and do not readily reveal the dominant contributors to the uncertainty. In this article, a closed-form approximation for the fault tree top event uncertainty distribution is developed, which is applicable when the uncertainties in the basic events of the model are lognormally distributed. The results of the approximate method are compared with results from two sampling-based methods: namely, the Monte Carlo method and the Wilks method based on order statistics. It is shown that the closed-form expression can provide a reasonable approximation to results obtained by Monte Carlo sampling, without incurring the computational expense. The Wilks method is found to be a useful means of providing an upper bound for the percentiles of the uncertainty distribution while being computationally inexpensive compared with full Monte Carlo sampling. The lognormal approximation method and Wilks's method appear attractive, practical alternatives for the evaluation of uncertainty in the output of fault trees and similar multilinear models. © 2018 Society for Risk Analysis.
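The simplest case makes the closed-form idea concrete: for an AND gate the top event is a product of basic-event probabilities, and a product of lognormals is exactly lognormal, so its percentiles follow from summed log-parameters. The Python sketch below, with invented medians and error factors, checks this against Monte Carlo sampling.

```python
import numpy as np

rng = np.random.default_rng(1)

def lognormal(median, ef, size):
    """Sample a lognormal given its median and error factor EF, where
    EF is the ratio of the 95th to the 50th percentile."""
    sigma = np.log(ef) / 1.645
    return rng.lognormal(np.log(median), sigma, size)

# Two basic events feeding an AND gate: top event = p1 * p2
p1 = lognormal(1e-3, 3.0, 200_000)
p2 = lognormal(5e-4, 10.0, 200_000)
top_mc = p1 * p2

# Closed form: the product of lognormals is exactly lognormal, with
# log-mean and log-variance equal to the sums of the inputs' parameters.
mu = np.log(1e-3) + np.log(5e-4)
sigma = np.hypot(np.log(3.0) / 1.645, np.log(10.0) / 1.645)
q95_closed = np.exp(mu + 1.645 * sigma)

print(f"MC 95th percentile:       {np.quantile(top_mc, 0.95):.3e}")
print(f"Closed-form 95th:         {q95_closed:.3e}")
```

For OR gates and general multilinear forms the lognormal result is only approximate, which is the regime the article's moment-matching approximation addresses.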
Performance and economy of a fault-tolerant multiprocessor
NASA Technical Reports Server (NTRS)
Lala, J. H.; Smith, C. J.
1979-01-01
The FTMP (Fault-Tolerant Multiprocessor) is one of two central aircraft fault-tolerant architectures now in the prototype phase under NASA sponsorship. The intended application of the computer includes such critical real-time tasks as 'fly-by-wire' active control and completely automatic Category III landings of commercial aircraft. The FTMP architecture is briefly described and it is shown that it is a viable solution to the multi-faceted problems of safety, speed, and cost. Three job dispatch strategies are described, and their results with respect to job-starting delay are presented. The first strategy is a simple First-Come-First-Serve (FCFS) job dispatch executive. The other two schedulers are an adaptive FCFS and an interrupt driven scheduler. Three failure modes are discussed, and the FTMP survival probability in the face of random hard failures is evaluated. It is noted that the hourly cost of operating two FTMPs in a transport aircraft can be as little as one-to-two percent of the total flight-hour cost of the aircraft.
A New On-Line Diagnosis Protocol for the SPIDER Family of Byzantine Fault Tolerant Architectures
NASA Technical Reports Server (NTRS)
Geser, Alfons; Miner, Paul S.
2004-01-01
This paper presents the formal verification of a new protocol for online distributed diagnosis for the SPIDER family of architectures. An instance of the Scalable Processor-Independent Design for Electromagnetic Resilience (SPIDER) architecture consists of a collection of processing elements communicating over a Reliable Optical Bus (ROBUS). The ROBUS is a specialized fault-tolerant device that guarantees Interactive Consistency, Distributed Diagnosis (Group Membership), and Synchronization in the presence of a bounded number of physical faults. Formal verification of the original SPIDER diagnosis protocol provided a detailed understanding that led to the discovery of a significantly more efficient protocol. The original protocol was adapted from the formally verified protocol used in the MAFT architecture. It required O(N) message exchanges per defendant to correctly diagnose failures in a system with N nodes. The new protocol achieves the same diagnostic fidelity, but only requires O(1) exchanges per defendant. This paper presents this new diagnosis protocol and a formal proof of its correctness using PVS.
NASA Technical Reports Server (NTRS)
Dasgupta, Partha; Leblanc, Richard J., Jr.; Appelbe, William F.
1988-01-01
Clouds is an operating system in a novel class of distributed operating systems providing the integration, reliability, and structure that makes a distributed system usable. Clouds is designed to run on a set of general purpose computers that are connected via a medium-to-high-speed local area network. The system structuring paradigm chosen for the Clouds operating system, after substantial research, is an object/thread model. All instances of services, programs and data in Clouds are encapsulated in objects. The concept of persistent objects does away with the need for file systems, and replaces it with a more powerful concept, namely the object system. The facilities in Clouds include integration of resources through location transparency; support for various types of atomic operations, including conventional transactions; advanced support for achieving fault tolerance; and provisions for dynamic reconfiguration.
The PAWS and STEM reliability analysis programs
NASA Technical Reports Server (NTRS)
Butler, Ricky W.; Stevenson, Philip H.
1988-01-01
The PAWS and STEM programs are new design/validation tools. These programs provide a flexible, user-friendly, language-based interface for the input of Markov models describing the behavior of fault-tolerant computer systems. These programs produce exact solutions of the probability of system failure and provide a conservative estimate of the number of significant digits in the solution. PAWS uses a Pade approximation as a solution technique; STEM uses a Taylor series. Both programs have the capability to solve numerically stiff models. PAWS and STEM possess complementary properties with regard to their input space, and an additional strength of these programs is that they accept input compatible with the SURE program. If used in conjunction with SURE, PAWS and STEM provide a powerful suite of programs to analyze the reliability of fault-tolerant computer systems.
NASA Technical Reports Server (NTRS)
Goldstein, David
1991-01-01
Extensions to an architecture for real-time, distributed (parallel) knowledge-based systems called the Parallel Real-time Artificial Intelligence System (PRAIS) are discussed. PRAIS strives for transparently parallelizing production (rule-based) systems, even under real-time constraints. PRAIS accomplished these goals (presented at the first annual C Language Integrated Production System (CLIPS) conference) by incorporating a dynamic task scheduler, operating system extensions for fact handling, and message-passing among multiple copies of CLIPS executing on a virtual blackboard. This distributed knowledge-based system tool uses the portability of CLIPS and common message-passing protocols to operate over a heterogeneous network of processors. Results using the original PRAIS architecture over a network of Sun 3's, Sun 4's and VAX's are presented. Mechanisms using the producer-consumer model to extend the architecture for fault-tolerance and distributed truth maintenance initiation are also discussed.
A study of the relationship between the performance and dependability of a fault-tolerant computer
NASA Technical Reports Server (NTRS)
Goswami, Kumar K.
1994-01-01
This thesis studies the relationship between performance and dependability by creating a tool (FTAPE) that integrates a high-stress workload generator with fault injection, and by using the tool to evaluate system performance under error conditions. The workloads are comprised of processes which are formed from atomic components that represent CPU, memory, and I/O activity. The fault injector is software-implemented and is capable of injecting faults into any memory-addressable location, including special registers and caches. This tool has been used to study a Tandem Integrity S2 Computer. Workloads with varying numbers of processes and varying compositions of CPU, memory, and I/O activity are first characterized in terms of performance. Then faults are injected into these workloads. The results show that as the number of concurrent processes increases, the mean fault latency initially increases due to increased contention for the CPU. However, for even higher numbers of processes (more than 3 processes), the mean latency decreases because long-latency faults are paged out before they can be activated.
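The latency-measurement idea behind the tool can be miniaturized. The sketch below is a toy, not FTAPE itself: it injects a fault into one cell of a simulated memory, drives a random read-modify-write workload, and reports the mean number of steps before the fault is activated as an error.

```python
import random

def run_workload(memory, steps, rng):
    """Toy memory workload: repeatedly read-modify-write random cells.
    Returns the step at which the corrupted cell is first touched."""
    for step in range(steps):
        addr = rng.randrange(len(memory))
        if memory[addr] is None:          # reading the injected fault
            return step                   # fault becomes an error: latency
        memory[addr] = (memory[addr] + 1) % 256
    return None                           # fault never activated (latent)

def inject_and_measure(mem_size=4096, steps=100_000, trials=200, seed=7):
    rng = random.Random(seed)
    latencies = []
    for _ in range(trials):
        memory = [0] * mem_size
        memory[rng.randrange(mem_size)] = None   # injected memory fault
        lat = run_workload(memory, steps, rng)
        if lat is not None:
            latencies.append(lat)
    return sum(latencies) / len(latencies), len(latencies) / trials

mean_lat, activation_rate = inject_and_measure()
print(f"mean latency {mean_lat:.0f} steps, activated {activation_rate:.0%}")
```

Varying the workload mix and the number of concurrent processes in such a harness is what lets a tool like FTAPE relate fault latency to system load.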
A technique for evaluating the application of the pin-level stuck-at fault model to VLSI circuits
NASA Technical Reports Server (NTRS)
Palumbo, Daniel L.; Finelli, George B.
1987-01-01
Accurate fault models are required to conduct the experiments defined in validation methodologies for highly reliable fault-tolerant computers (e.g., computers with a probability of failure of 10 to the -9 for a 10-hour mission). Described is a technique by which a researcher can evaluate the capability of the pin-level stuck-at fault model to simulate true error behavior symptoms in very large scale integrated (VLSI) digital circuits. The technique is based on a statistical comparison of the error behavior resulting from faults applied at the pin-level of and internal to a VLSI circuit. As an example of an application of the technique, the error behavior of a microprocessor simulation subjected to internal stuck-at faults is compared with the error behavior which results from pin-level stuck-at faults. The error behavior is characterized by the time between errors and the duration of errors. Based on this example data, the pin-level stuck-at fault model is found to deliver less than ideal performance. However, with respect to the class of faults which cause a system crash, the pin-level, stuck-at fault model is found to provide a good modeling capability.
A Test Generation Framework for Distributed Fault-Tolerant Algorithms
NASA Technical Reports Server (NTRS)
Goodloe, Alwyn; Bushnell, David; Miner, Paul; Pasareanu, Corina S.
2009-01-01
Heavyweight formal methods such as theorem proving have been successfully applied to the analysis of safety critical fault-tolerant systems. Typically, the models and proofs performed during such analysis do not inform the testing process of actual implementations. We propose a framework for generating test vectors from specifications written in the Prototype Verification System (PVS). The methodology uses a translator to produce a Java prototype from a PVS specification. Symbolic (Java) PathFinder is then employed to generate a collection of test cases. A small example is employed to illustrate how the framework can be used in practice.
Physical fault tolerance of nanoelectronics.
Szkopek, Thomas; Roychowdhury, Vwani P; Antoniadis, Dimitri A; Damoulakis, John N
2011-04-29
The error rate in complementary transistor circuits is suppressed exponentially in electron number, arising from an intrinsic physical implementation of fault-tolerant error correction. Contrariwise, explicit assembly of gates into the most efficient known fault-tolerant architecture is characterized by a subexponential suppression of error rate with electron number, and incurs significant overhead in wiring and complexity. We conclude that it is more efficient to prevent logical errors with physical fault tolerance than to correct logical errors with fault-tolerant architecture.
ALLIANCE: An architecture for fault tolerant, cooperative control of heterogeneous mobile robots
DOE Office of Scientific and Technical Information (OSTI.GOV)
Parker, L.E.
1995-02-01
This research addresses the problem of achieving fault tolerant cooperation within small- to medium-sized teams of heterogeneous mobile robots. The author describes a novel behavior-based, fully distributed architecture, called ALLIANCE, that utilizes adaptive action selection to achieve fault tolerant cooperative control in robot missions involving loosely coupled, largely independent tasks. The robots in this architecture possess a variety of high-level functions that they can perform during a mission, and must at all times select an appropriate action based on the requirements of the mission, the activities of other robots, the current environmental conditions, and their own internal states. Since such cooperative teams often work in dynamic and unpredictable environments, the software architecture allows the team members to respond robustly and reliably to unexpected environmental changes and modifications in the robot team that may occur due to mechanical failure, the learning of new skills, or the addition or removal of robots from the team by human intervention. After presenting ALLIANCE, the author describes in detail experimental results of an implementation of this architecture on a team of physical mobile robots performing a cooperative box pushing demonstration. These experiments illustrate the ability of ALLIANCE to achieve adaptive, fault-tolerant cooperative control amidst dynamic changes in the capabilities of the robot team.
The scientific data acquisition system of the GAMMA-400 space project
NASA Astrophysics Data System (ADS)
Bobkov, S. G.; Serdin, O. V.; Gorbunov, M. S.; Arkhangelskiy, A. I.; Topchiev, N. P.
2016-02-01
A description of the scientific data acquisition system (SDAS) designed by SRISA for the GAMMA-400 space project is presented. We consider the problem of unifying electronics at different levels: a set of reliable fault-tolerant integrated circuits fabricated in 0.25 µm silicon-on-insulator CMOS technology, and the high-speed interfaces and reliable modules used in the space instruments. The characteristics of the reliable fault-tolerant very large scale integration (VLSI) technology designed by SRISA for the development of computing systems for space applications are considered. The scalable network structure of the SDAS, based on the Serial RapidIO interface and the BAGET real-time operating system, is also described.
Leung, Vitus J [Albuquerque, NM; Phillips, Cynthia A [Albuquerque, NM; Bender, Michael A [East Northport, NY; Bunde, David P [Urbana, IL
2009-07-21
In a multiple processor computing apparatus, directional routing restrictions and a logical channel construct permit fault tolerant, deadlock-free routing. Processor allocation can be performed by creating a linear ordering of the processors based on routing rules used for routing communications between the processors. The linear ordering can assume a loop configuration, and bin-packing is applied to this loop configuration. The interconnection of the processors can be conceptualized as a generally rectangular 3-dimensional grid, and the MC allocation algorithm is applied with respect to the 3-dimensional grid.
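A toy rendering of the allocation idea, under assumptions of mine rather than the patent's: the routing-derived linear ordering is given, the ordering is treated as a loop, and jobs are packed first-fit into contiguous runs of free processors. The `allocate` function is an invented stand-in for the MC allocation algorithm.

```python
def allocate(free, k, order):
    """First-fit: find k free processors forming a contiguous run in the
    routing-based linear ordering, treated as a loop, and claim them."""
    n = len(order)
    for start in range(n):
        window = [order[(start + i) % n] for i in range(k)]
        if all(p in free for p in window):
            free.difference_update(window)
            return window
    return None  # no contiguous run of k free processors

# Stand-in for a 3-dimensional grid flattened into a routing-derived loop
order = list(range(16))
free = set(order)
print(allocate(free, 5, order))   # e.g. [0, 1, 2, 3, 4]
print(allocate(free, 4, order))   # e.g. [5, 6, 7, 8]
```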
Space Shuttle critical function audit
NASA Technical Reports Server (NTRS)
Sacks, Ivan J.; Dipol, John; Su, Paul
1990-01-01
A large fault-tolerance model of the main propulsion system of the US space shuttle has been developed. This model is being used to identify single components and pairs of components that will cause loss of shuttle critical functions. In addition, this model is the basis for risk quantification of the shuttle. The process used to develop and analyze the model is digraph matrix analysis (DMA). The DMA modeling and analysis process is accessed via a graphics-based computer user interface. This interface provides coupled display of the integrated system schematics, the digraph models, the component database, and the results of the fault tolerance and risk analyses.
The aircraft energy efficiency active controls technology program
NASA Technical Reports Server (NTRS)
Hood, R. V., Jr.
1977-01-01
Broad outlines of the NASA Aircraft Energy Efficiency Program for expediting the application of active controls technology to civil transport aircraft are presented. Advances in propulsion and airframe technology to cut down on fuel consumption and fuel costs, a program for an energy-efficient transport, and integrated analysis and design technology in aerodynamics, structures, and active controls are envisaged. Fault-tolerant computer systems and fault-tolerant flight control system architectures are under study. Contracts with leading manufacturers for research and development work on wing-tip extensions and winglets for the B-747, a wing load alleviation system, elastic mode suppression, maneuver-load control, and gust alleviation are mentioned.
Care 3 phase 2 report, maintenance manual
NASA Technical Reports Server (NTRS)
Bryant, L. A.; Stiffler, J. J.
1982-01-01
CARE 3 (Computer-Aided Reliability Estimation, version three) is a computer program designed to help estimate the reliability of complex, redundant systems. Although the program can model a wide variety of redundant structures, it was developed specifically for fault-tolerant avionics systems--systems distinguished by the need for extremely reliable performance since a system failure could well result in the loss of human life. It substantially generalizes the class of redundant configurations that could be accommodated, and includes a coverage model to determine the various coverage probabilities as a function of the applicable fault recovery mechanisms (detection delay, diagnostic scheduling interval, isolation and recovery delay, etc.). CARE 3 further generalizes the class of system structures that can be modeled and greatly expands the coverage model to take into account such effects as intermittent and transient faults, latent faults, error propagation, etc.
Fault-Tolerant and Elastic Streaming MapReduce with Decentralized Coordination
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kumbhare, Alok; Frincu, Marc; Simmhan, Yogesh
2015-06-29
The MapReduce programming model, due to its simplicity and scalability, has become an essential tool for processing large data volumes in distributed environments. Recent Stream Processing Systems (SPS) extend this model to provide low-latency analysis of high-velocity continuous data streams. However, integrating MapReduce with streaming poses challenges: first, runtime variations in data characteristics such as data rates and key distribution cause resource overload, which in turn leads to fluctuations in the Quality of Service (QoS); and second, stateful reducers, whose state depends on the complete tuple history, necessitate efficient fault-recovery mechanisms to maintain the desired QoS in the presence of resource failures. We propose an integrated streaming MapReduce architecture leveraging the concept of consistent hashing to support runtime elasticity, along with locality-aware data and state replication to provide efficient load balancing with low-overhead fault tolerance and parallel fault recovery from multiple simultaneous failures. Our evaluation on a private cloud shows up to 2.8x improvement in peak throughput compared to the Apache Storm SPS, and a low recovery latency of 700-1500 ms from multiple failures.
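A minimal consistent-hashing sketch of the mechanism the architecture leverages for elasticity: each reducer owns many points on a hash ring, a tuple's key maps to the next point clockwise, and adding or removing a reducer remaps only a small slice of the key space. The class, virtual-node count, and names below are invented for illustration and are not the paper's implementation.

```python
import bisect
import hashlib

def h(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Consistent-hash ring with virtual nodes for smoother balance."""
    def __init__(self, nodes, vnodes=64):
        self.ring = sorted((h(f"{n}#{v}"), n) for n in nodes for v in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def lookup(self, key: str) -> str:
        i = bisect.bisect(self.points, h(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["reducer-0", "reducer-1", "reducer-2"])
print(ring.lookup("user:42"))     # tuple key -> owning reducer
```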
Wei, Yu-Jia; He, Yu-Ming; Chen, Ming-Cheng; Hu, Yi-Nan; He, Yu; Wu, Dian; Schneider, Christian; Kamp, Martin; Höfling, Sven; Lu, Chao-Yang; Pan, Jian-Wei
2014-11-12
Single photons are attractive candidates of quantum bits (qubits) for quantum computation and are the best messengers in quantum networks. Future scalable, fault-tolerant photonic quantum technologies demand both stringently high levels of photon indistinguishability and generation efficiency. Here, we demonstrate deterministic and robust generation of pulsed resonance fluorescence single photons from a single semiconductor quantum dot using adiabatic rapid passage, a method robust against fluctuation of driving pulse area and dipole moments of solid-state emitters. The emitted photons are background-free, have a vanishing two-photon emission probability of 0.3% and a raw (corrected) two-photon Hong-Ou-Mandel interference visibility of 97.9% (99.5%), reaching a precision that places single photons at the threshold for fault-tolerant surface-code quantum computing. This single-photon source can be readily scaled up to multiphoton entanglement and used for quantum metrology, boson sampling, and linear optical quantum computing.
A methodology for testing fault-tolerant software
NASA Technical Reports Server (NTRS)
Andrews, D. M.; Mahmood, A.; Mccluskey, E. J.
1985-01-01
A methodology for testing fault-tolerant software is presented. Testing fault-tolerant software is problematic because many errors are masked or corrected by voters, limiters, or automatic channel synchronization. This methodology illustrates how the same strategies used for testing fault-tolerant hardware can be applied to testing fault-tolerant software. For example, one strategy used in testing fault-tolerant hardware is to disable the redundancy during testing. A similar testing strategy is proposed for software: namely, to move the major emphasis of testing earlier in the development cycle (before the redundancy is in place), thus reducing the possibility that undetected errors will be masked when limiters and voters are added.
QCCM Center for Quantum Algorithms
2008-10-17
Research topics included quantum algorithms (e.g., quantum walks and adiabatic computing), as well as theoretical advances relating algorithms to physical implementations (e.g., ...). Subject terms: quantum algorithms, quantum computing, fault-tolerant error correction. Representative publication: A. Ambainis, M. Beaudry, M. Golovkins, A. Kikusts, M. Mercer, D. Thérien, "Algebraic results on quantum automata," Theory of Computing Systems 39 (2006).
Mahdiani, Hamid Reza; Fakhraie, Sied Mehdi; Lucas, Caro
2012-08-01
Reliability should be identified as the most important challenge in future nano-scale very large scale integration (VLSI) implementation technologies for the development of complex integrated systems. Normally, fault tolerance (FT) in a conventional system is achieved by increasing its redundancy, which also implies higher implementation costs and lower performance, sometimes making it even infeasible. In contrast to custom approaches, a new class of applications is categorized in this paper which is inherently capable of absorbing some degree of vulnerability and providing FT based on its natural properties. Neural networks are good indicators of imprecision-tolerant applications. We have also proposed a new class of FT techniques, called relaxed fault-tolerant (RFT) techniques, which are developed for VLSI implementation of imprecision-tolerant applications. The main advantage of RFT techniques with respect to traditional FT solutions is that they exploit the inherent FT of different applications to reduce their implementation costs while improving their performance. To show the applicability as well as the efficiency of the RFT method, experimental results for the implementation of a computationally intensive face-recognition neural network and its corresponding RFT realization are presented. The results demonstrate the promising higher performance of artificial neural network VLSI solutions for complex applications in faulty nano-scale implementation environments.
Eigenstructure Assignment for Fault Tolerant Flight Control Design
NASA Technical Reports Server (NTRS)
Sobel, Kenneth; Joshi, Suresh (Technical Monitor)
2002-01-01
In recent years, fault tolerant flight control systems have gained an increased interest for high performance military aircraft as well as civil aircraft. Fault tolerant control systems can be described as either active or passive. An active fault tolerant control system has to either reconfigure or adapt the controller in response to a failure. One approach is to reconfigure the controller based upon detection and identification of the failure. Another approach is to use direct adaptive control to adjust the controller without explicitly identifying the failure. In contrast, a passive fault tolerant control system uses a fixed controller which achieves acceptable performance for a presumed set of failures. We have obtained a passive fault tolerant flight control law for the F/A-18 aircraft which achieves acceptable handling qualities for a class of control surface failures. The class of failures includes the symmetric failure of any one control surface being stuck at its trim value. A comparison was made of an eigenstructure assignment gain designed for the unfailed aircraft with a fault tolerant multiobjective optimization gain. We have shown that time responses for the unfailed aircraft using the eigenstructure assignment gain and the fault tolerant gain are identical. Furthermore, the fault tolerant gain achieves MIL-F-8785C specifications for all failure conditions.
Modeling and measurement of fault-tolerant multiprocessors
NASA Technical Reports Server (NTRS)
Shin, K. G.; Woodbury, M. H.; Lee, Y. H.
1985-01-01
The workload effects on computer performance are addressed first for a highly reliable unibus multiprocessor used in real-time control. As an approach to studying these effects, a modified Stochastic Petri Net (SPN) is used to describe the synchronous operation of the multiprocessor system. From this model the vital components affecting performance can be determined. However, because of the complexity in solving the modified SPN, a simpler model, i.e., a closed priority queuing network, is constructed that represents the same critical aspects. The use of this model for a specific application requires the partitioning of the workload into job classes. It is shown that the steady state solution of the queuing model directly produces useful results. The use of this model in evaluating an existing system, the Fault Tolerant Multiprocessor (FTMP) at the NASA AIRLAB, is outlined with some experimental results. Also addressed is the technique of measuring fault latency, an important microscopic system parameter. Most related works have assumed no or a negligible fault latency and then performed approximate analyses. To eliminate this deficiency, a new methodology for indirectly measuring fault latency is presented.
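For flavor, a generic exact mean-value-analysis (MVA) routine for a single-class closed queueing network; the paper's model additionally partitions the workload into job classes with priorities, which this single-class sketch omits, and the station demands below are invented.

```python
def mva(demands, n_jobs):
    """Exact MVA for a closed network of queueing stations:
    returns system throughput and per-station mean queue lengths."""
    q = [0.0] * len(demands)                # queue lengths at population 0
    for n in range(1, n_jobs + 1):
        # Arrival theorem: residence time seen by an arriving job
        r = [d * (1 + qi) for d, qi in zip(demands, q)]
        x = n / sum(r)                      # throughput at population n
        q = [x * ri for ri in r]            # Little's law per station
    return x, q

# Illustrative CPU, memory, and I/O service demands (seconds per job)
throughput, queues = mva([0.04, 0.02, 0.10], n_jobs=8)
print(f"throughput {throughput:.1f} jobs/s, "
      f"queue lengths {[round(v, 2) for v in queues]}")
```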
Towards fault tolerant adiabatic quantum computation.
Lidar, Daniel A
2008-04-25
I show how to protect adiabatic quantum computation (AQC) against decoherence and certain control errors, using a hybrid methodology involving dynamical decoupling, subsystem and stabilizer codes, and energy gaps. Corresponding error bounds are derived. As an example, I show how to perform decoherence-protected AQC against local noise using at most two-body interactions.
Lightweight causal and atomic group multicast
NASA Technical Reports Server (NTRS)
Birman, Kenneth P.; Schiper, Andre; Stephenson, Pat
1991-01-01
The ISIS toolkit is a distributed programming environment based on support for virtually synchronous process groups and group communication. A suite of protocols is presented to support this model. The approach revolves around a multicast primitive, called CBCAST, which implements fault-tolerant, causally ordered message delivery. This primitive can be used directly or extended into a totally ordered multicast primitive, called ABCAST. CBCAST normally delivers messages immediately upon reception, and imposes a space overhead proportional to the size of the groups to which the sender belongs, usually a small number. It is concluded that process groups and group communication can achieve performance and scaling comparable to that of a raw message transport layer. This finding contradicts the widespread concern that this style of distributed computing may be unacceptably costly.
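The causal-delivery rule behind CBCAST can be sketched with vector timestamps, which also explains the per-group space overhead: a message from sender j is delivered once it is the next message from j and every message it causally depends on has been delivered; otherwise it waits in a hold-back queue. The class below is a minimal illustration of that rule, not the ISIS implementation.

```python
class Process:
    """Causally ordered delivery via vector timestamps."""
    def __init__(self, pid, n):
        self.pid, self.vt, self.held = pid, [0] * n, []

    def send(self):
        self.vt[self.pid] += 1
        return (self.pid, list(self.vt))          # message carries a timestamp

    def _deliverable(self, sender, vt):
        return (vt[sender] == self.vt[sender] + 1 and
                all(vt[k] <= self.vt[k] for k in range(len(vt)) if k != sender))

    def receive(self, msg):
        self.held.append(msg)
        progress = True
        while progress:                           # drain the hold-back queue
            progress = False
            for sender, vt in list(self.held):
                if self._deliverable(sender, vt):
                    self.vt[sender] += 1
                    self.held.remove((sender, vt))
                    print(f"p{self.pid} delivers from p{sender}, vt={vt}")
                    progress = True

p = Process(2, 3)
p.receive((1, [1, 1, 0]))   # arrives first but depends on p0's message: held
p.receive((0, [1, 0, 0]))   # delivers p0's message, then the held one
```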
Adaptive distributed outlier detection for WSNs.
De Paola, Alessandra; Gaglio, Salvatore; Lo Re, Giuseppe; Milazzo, Fabrizio; Ortolani, Marco
2015-05-01
The paradigm of pervasive computing is gaining more and more attention nowadays, thanks to the possibility of obtaining precise and continuous monitoring. Ease of deployment and adaptivity are typically implemented by adopting autonomous and cooperative sensory devices; however, for such systems to be of any practical use, reliability and fault tolerance must be guaranteed, for instance by detecting corrupted readings amidst the huge amount of gathered sensory data. This paper proposes an adaptive distributed Bayesian approach for detecting outliers in data collected by a wireless sensor network; our algorithm aims at optimizing classification accuracy, time complexity and communication complexity, and also considering externally imposed constraints on such conflicting goals. The performed experimental evaluation showed that our approach is able to improve the considered metrics for latency and energy consumption, with limited impact on classification accuracy.
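As a loose illustration of the Bayesian flavor of such outlier detection (not the paper's algorithm), a node can weigh how many neighbor readings agree with its own against prior fault and agreement probabilities; every constant below is invented.

```python
def outlier_posterior(reading, neighbors, p_fault=0.1, tol=2.0,
                      p_agree_good=0.9, p_agree_bad=0.2):
    """Posterior probability that a reading is an outlier, given how many
    neighbor readings agree with it within a tolerance (naive Bayes)."""
    n = len(neighbors)
    agree = sum(abs(reading - v) <= tol for v in neighbors)
    like_good = p_agree_good ** agree * (1 - p_agree_good) ** (n - agree)
    like_bad = p_agree_bad ** agree * (1 - p_agree_bad) ** (n - agree)
    return like_bad * p_fault / (like_bad * p_fault + like_good * (1 - p_fault))

print(outlier_posterior(35.0, [20.1, 19.8, 20.5, 20.2]))   # ~0.998: likely faulty
print(outlier_posterior(20.3, [20.1, 19.8, 20.5, 20.2]))   # ~0.0003: likely good
```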
Proactive Fault Tolerance for HPC with Xen Virtualization
DOE Office of Scientific and Technical Information (OSTI.GOV)
Nagarajan, Arun Babu; Mueller, Frank; Engelmann, Christian
2007-01-01
Large-scale parallel computing is relying increasingly on clusters with thousands of processors. At such large counts of compute nodes, faults are becoming commonplace. Current techniques to tolerate faults focus on reactive schemes to recover from faults and generally rely on a checkpoint/restart mechanism. Yet, in today's systems, node failures can often be anticipated by detecting a deteriorating health status. Instead of a reactive scheme for fault tolerance (FT), we are promoting a proactive one where processes automatically migrate from unhealthy nodes to healthy ones. Our approach relies on operating system virtualization techniques exemplified by, but not limited to, Xen. This paper contributes an automatic and transparent mechanism for proactive FT for arbitrary MPI applications. It leverages virtualization techniques combined with health monitoring and load-based migration. We exploit Xen's live migration mechanism for a guest operating system (OS) to migrate an MPI task from a health-deteriorating node to a healthy one without stopping the MPI task during most of the migration. Our proactive FT daemon orchestrates the tasks of health monitoring, load determination and initiation of guest OS migration. Experimental results demonstrate that live migration hides migration costs and limits the overhead to only a few seconds, making it an attractive approach to realize FT in HPC systems. Overall, our enhancements make proactive FT a valuable asset for long-running MPI applications and are complementary to reactive FT using full checkpoint/restart schemes, since checkpoint frequencies can be reduced as fewer unanticipated failures are encountered. In the context of OS virtualization, we believe that this is the first comprehensive study of proactive fault tolerance where live migration is actually triggered by health monitoring.
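A skeletal version of such a proactive-FT daemon loop, heavily simplified and hypothetical: the SSH-based temperature probe, threshold, host and domain names, and spare-node choice are all placeholders (a real daemon would query IPMI-class health sensors and compute load), and the migration step assumes a classic Xen toolstack where `xm migrate --live` is available.

```python
import subprocess
import time

THRESHOLD_C = 70.0                    # illustrative trip point

def node_temp(host):
    """Placeholder health probe (real deployments would use IPMI or similar)."""
    out = subprocess.run(
        ["ssh", host, "cat", "/sys/class/thermal/thermal_zone0/temp"],
        capture_output=True, text=True, check=True)
    return int(out.stdout.strip()) / 1000.0        # millidegrees C -> degrees C

def migrate_guest(src_host, domain, dest_host):
    """Live-migrate the guest OS (and the MPI task inside it) off the node."""
    subprocess.run(["ssh", src_host, "xm", "migrate", "--live", domain, dest_host],
                   check=True)

GUESTS = {"node1": "mpi-guest1", "node2": "mpi-guest2"}   # host -> guest domain

while True:                           # daemon: monitor, decide, migrate
    for host, domain in GUESTS.items():
        if node_temp(host) > THRESHOLD_C:
            migrate_guest(host, domain, "spare-node")     # load check omitted
    time.sleep(30)
```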
Fault-free behavior of reliable multiprocessor systems: FTMP experiments in AIRLAB
NASA Technical Reports Server (NTRS)
Clune, E.; Segall, Z.; Siewiorek, D.
1985-01-01
This report describes a set of experiments which were implemented on the Fault Tolerant Multi-Processor (FTMP) at NASA Langley's AIRLAB facility. These experiments are part of an effort to formulate and evaluate validation methodologies for fault-tolerant computers. This report deals with the measurement of single parameters (baselines) of a fault-free system. The initial set of baseline experiments led to the following conclusions: (1) the system clock is constant and independent of workload in the tested cases; (2) the instruction execution times are constant; (3) the R4 frame size is 40 ms, with some variation; (4) the frame stretching mechanism has some flaws in its implementation that allow the possibility of infinite stretching of frame duration. Future experiments are planned. Some will broaden the results of these initial experiments; others will measure the system more dynamically. The implementation of a synthetic workload generation mechanism for FTMP is planned to enhance the experimental environment of the system.
A verified design of a fault-tolerant clock synchronization circuit: Preliminary investigations
NASA Technical Reports Server (NTRS)
Miner, Paul S.
1992-01-01
Schneider demonstrates that many fault-tolerant clock synchronization algorithms can be represented as refinements of a single proven-correct paradigm. Shankar provides a mechanical proof that Schneider's schema achieves Byzantine fault-tolerant clock synchronization provided that 11 constraints are satisfied. Some of the constraints are assumptions about physical properties of the system and cannot be established formally. Proofs are given that the fault-tolerant midpoint convergence function satisfies three of the constraints. A hardware design is presented, implementing the fault-tolerant midpoint function, which is shown to satisfy the remaining constraints. The synchronization circuit will recover completely from transient faults provided the maximum fault assumption is not violated. The initialization protocol for the circuit also provides a recovery mechanism from total system failure caused by correlated transient faults.
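For concreteness, the fault-tolerant midpoint convergence function itself is tiny: sort the clock readings, discard the t largest and t smallest, and take the midpoint of the surviving extremes. The sketch below follows that standard definition; the sample values are invented.

```python
def ft_midpoint(readings, t):
    """Fault-tolerant midpoint: with n >= 3t + 1 readings, up to t
    arbitrarily faulty clocks cannot drag the result outside the
    range spanned by the correct clocks."""
    s = sorted(readings)
    trimmed = s[t:len(s) - t]       # drop the t largest and t smallest
    return (trimmed[0] + trimmed[-1]) / 2

# Four clocks, one Byzantine (t = 1): the wild value 57.0 is discarded
print(ft_midpoint([100.2, 99.8, 100.1, 57.0], t=1))   # -> 99.95
```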
Low cost computer subsystem for the Solar Electric Propulsion Stage (SEPS)
NASA Technical Reports Server (NTRS)
1975-01-01
The Solar Electric Propulsion Stage (SEPS) subsystem, which consists of the computer, a custom input/output (I/O) unit, and a tape recorder for mass storage of telemetry data, was studied. Computer software and interface requirements were developed along with computer and I/O unit design parameters. Redundancy implementation was emphasized. Reliability analysis was performed for the complete command computer subsystem. A SEPS fault-tolerant memory breadboard was constructed and its operation demonstrated.
Fault tree models for fault tolerant hypercube multiprocessors
NASA Technical Reports Server (NTRS)
Boyd, Mark A.; Tuazon, Jezus O.
1991-01-01
Three candidate fault tolerant hypercube architectures are modeled, their reliability analyses are compared, and the resulting implications of these methods of incorporating fault tolerance into hypercube multiprocessors are discussed. In the course of performing the reliability analyses, the use of HARP and fault trees in modeling sequence dependent system behaviors is demonstrated.
Flight elements: Fault detection and fault management
NASA Technical Reports Server (NTRS)
Lum, H.; Patterson-Hine, A.; Edge, J. T.; Lawler, D.
1990-01-01
Fault management for an intelligent computational system must be developed using a top-down integrated engineering approach. The proposed approach integrates the overall environment involving sensors and their associated data; design knowledge capture; operations; fault detection, identification, and reconfiguration; testability; causal models, including digraph matrix analysis; and overall performance impacts on the hardware and software architecture. Implementation of the concept to achieve a real-time intelligent fault detection and management system will be accomplished through several objectives: development of fault-tolerant/FDIR requirements and specifications from a systems level, carried from conceptual design through implementation and mission operations; implementation of monitoring, diagnosis, and reconfiguration at all system levels, providing fault isolation and system integration; optimization of system operations to manage degraded system performance through system integration; and reduction of development and operations costs through the implementation of an intelligent real-time fault detection and fault management system and an information management system.
Blume-Kohout, Robin; Gamble, John King; Nielsen, Erik; ...
2017-02-15
Quantum information processors promise fast algorithms for problems inaccessible to classical computers. But since qubits are noisy and error-prone, they will depend on fault-tolerant quantum error correction (FTQEC) to compute reliably. Quantum error correction can protect against general noise if—and only if—the error in each physical qubit operation is smaller than a certain threshold. The threshold for general errors is quantified by their diamond norm. Until now, qubits have been assessed primarily by randomized benchmarking, which reports a different error rate that is not sensitive to all errors, and cannot be compared directly to diamond norm thresholds. Finally, we use gate set tomography to completely characterize operations on a trapped-Yb+-ion qubit and demonstrate with greater than 95% confidence that they satisfy a rigorous threshold for FTQEC (diamond norm ≤ 6.7 × 10^-4).
Blume-Kohout, Robin; Gamble, John King; Nielsen, Erik; Rudinger, Kenneth; Mizrahi, Jonathan; Fortier, Kevin; Maunz, Peter
2017-01-01
Quantum information processors promise fast algorithms for problems inaccessible to classical computers. But since qubits are noisy and error-prone, they will depend on fault-tolerant quantum error correction (FTQEC) to compute reliably. Quantum error correction can protect against general noise if—and only if—the error in each physical qubit operation is smaller than a certain threshold. The threshold for general errors is quantified by their diamond norm. Until now, qubits have been assessed primarily by randomized benchmarking, which reports a different error rate that is not sensitive to all errors, and cannot be compared directly to diamond norm thresholds. Here we use gate set tomography to completely characterize operations on a trapped-Yb+-ion qubit and demonstrate with greater than 95% confidence that they satisfy a rigorous threshold for FTQEC (diamond norm ≤ 6.7 × 10^-4). PMID:28198466
Experimental entanglement purification of arbitrary unknown states.
Pan, Jian-Wei; Gasparoni, Sara; Ursin, Rupert; Weihs, Gregor; Zeilinger, Anton
2003-05-22
Distribution of entangled states between distant locations is essential for quantum communication over large distances. But owing to unavoidable decoherence in the quantum communication channel, the quality of entangled states generally decreases exponentially with the channel length. Entanglement purification--a way to extract a subset of states of high entanglement and high purity from a large set of less entangled states--is thus needed to overcome decoherence. Besides its important application in quantum communication, entanglement purification also plays a crucial role in error correction for quantum computation, because it can significantly increase the quality of logic operations between different qubits. Here we demonstrate entanglement purification for general mixed states of polarization-entangled photons using only linear optics. Typically, one photon pair of fidelity 92% could be obtained from two pairs, each of fidelity 75%. In our experiments, decoherence is overcome to the extent that the technique would achieve tolerable error rates for quantum repeaters in long-distance quantum communication. Our results also imply that the requirement of high-accuracy logic operations in fault-tolerant quantum computation can be considerably relaxed.
NASA Technical Reports Server (NTRS)
Tomayko, James E.
1986-01-01
Twenty-five years of spacecraft onboard computer development have resulted in a better understanding of the requirements for effective, efficient, and fault-tolerant flight computer systems. Lessons from eight flight programs (Gemini, Apollo, Skylab, Shuttle, Mariner, Voyager, and Galileo) and three research programs (digital fly-by-wire, STAR, and the Unified Data System) are useful in projecting the computer hardware configuration of the Space Station and the ways in which the Ada programming language will enhance the development of the necessary software. The evolution of hardware technology, fault protection methods, and software architectures used in space flight is reviewed to provide insight into the pending development of such items for the Space Station.
Fault-Tolerant Multiprocessor and VLSI-Based Systems.
1987-03-15
Table 1: Statistics for the Benchmark Programs. Pages are distributed amongst the groups of the reconfigured memory in proportion to the ... distances are proportional to only the logarithm of the ... relevance to a system which consists of a large number of homogeneous elements ... communication overhead resulting from faults ... communicating with all of the other elements in the system ... the network to degrade proportionately.
Design for dependability: A simulation-based approach. Ph.D. Thesis, 1993
NASA Technical Reports Server (NTRS)
Goswami, Kumar K.
1994-01-01
This research addresses issues in simulation-based system-level dependability analysis of fault-tolerant computer systems. The issues and difficulties of providing a general simulation-based approach for system-level analysis are discussed, and a methodology that addresses them is presented. The proposed methodology is designed to permit the study of a wide variety of architectures under various fault conditions. It permits detailed functional modeling of architectural features such as sparing policies, repair schemes, routing algorithms, and other fault-tolerant mechanisms, and it allows the execution of actual application software. One key benefit of this approach is that the behavior of a system under faults does not have to be pre-defined, as is normally done. Instead, a system can be simulated in detail and injected with faults to determine its failure modes. The thesis describes how object-oriented design is used to incorporate this methodology into a general-purpose design and fault injection package called DEPEND. A software model is presented that uses abstractions of application programs to study the behavior and effect of software on hardware faults in the early design stage, when actual code is not available. Finally, an acceleration technique that combines hierarchical simulation, time-acceleration algorithms, and hybrid simulation to reduce simulation time is introduced.
Certification of computational results
NASA Technical Reports Server (NTRS)
Sullivan, Gregory F.; Wilson, Dwight S.; Masson, Gerald M.
1993-01-01
A conceptually novel and powerful technique to achieve fault detection and fault tolerance in hardware and software systems is described. When used for software fault detection, this new technique uses time and software redundancy and can be outlined as follows. In the initial phase, a program is run to solve a problem and store the result. In addition, this program leaves behind a trail of data called a certification trail. In the second phase, another program is run which solves the original problem again. This program, however, has access to the certification trail left by the first program. Because of the availability of the certification trail, the second phase can be performed by a less complex program and can execute more quickly. In the final phase, the two results are compared; if they agree, the results are accepted as correct, otherwise an error is indicated. An essential aspect of this approach is that the second program must always generate either an error indication or a correct output even when the certification trail it receives from the first program is incorrect. The certification trail approach to fault tolerance is formalized, and realizations of it are illustrated by considering algorithms for the following problems: convex hull, sorting, and shortest path. Cases in which the second phase can be run concurrently with the first and act as a monitor are discussed. The certification trail approach is also compared to other approaches to fault tolerance.
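The sorting example mentioned in the abstract makes a compact illustration: the first phase sorts and leaves the sorting permutation as the trail; the second phase merely applies the permutation and checks order in linear time, reporting an error whenever the trail is invalid. The code below is a schematic reconstruction, not the authors' programs.

```python
def first_phase(a):
    """Solve the problem (sort) and emit a certification trail:
    the permutation that sorts the input."""
    trail = sorted(range(len(a)), key=lambda i: a[i])
    return [a[i] for i in trail], trail

def second_phase(a, trail):
    """Cheaper re-solution using the trail; must return an error
    (None) or a correct result even if the trail is corrupted."""
    if sorted(trail) != list(range(len(a))):
        return None                                   # not a permutation
    result = [a[i] for i in trail]
    if any(x > y for x, y in zip(result, result[1:])):
        return None                                   # not actually sorted
    return result

a = [5, 1, 4, 2]
r1, trail = first_phase(a)
r2 = second_phase(a, trail)
print("accepted" if r2 is not None and r1 == r2 else "error detected")
```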
DEPEND: A simulation-based environment for system level dependability analysis
NASA Technical Reports Server (NTRS)
Goswami, Kumar; Iyer, Ravishankar K.
1992-01-01
The design and evaluation of highly reliable computer systems is a complex issue. Designers mostly develop such systems based on prior knowledge and experience, and occasionally from analytical evaluations of simplified designs. A simulation-based environment called DEPEND, which is especially geared to the design and evaluation of fault-tolerant architectures, is presented. DEPEND is unique in that it exploits the properties of object-oriented programming to provide a flexible framework with which a user can rapidly model and evaluate various fault-tolerant systems. The key features of the DEPEND environment are described, and its capabilities are illustrated with a detailed analysis of a real design. In particular, DEPEND is used to simulate the Unix-based Tandem Integrity system and evaluate how well it handles near-coincident errors caused by correlated and latent faults. Issues such as memory scrubbing, re-integration policies, and workload-dependent repair times, which affect how the system handles near-coincident errors, are also evaluated. The method used by DEPEND to simulate error latency and the time-acceleration technique that provides enormous simulation speed-up are discussed as well. Unlike any other simulation-based dependability studies, the use of these approaches and the accuracy of the simulation model are validated by comparing the results of the simulations with measurements obtained from fault injection experiments conducted on a production Tandem Integrity machine.
Monte Carlo simulation of photon migration in a cloud computing environment with MapReduce
Pratx, Guillem; Xing, Lei
2011-01-01
Monte Carlo simulation is considered the most reliable method for modeling photon migration in heterogeneous media. However, its widespread use is hindered by the high computational cost. The purpose of this work is to report on our implementation of a simple MapReduce method for performing fault-tolerant Monte Carlo computations in a massively-parallel cloud computing environment. We ported the MC321 Monte Carlo package to Hadoop, an open-source MapReduce framework. In this implementation, Map tasks compute photon histories in parallel while a Reduce task scores photon absorption. The distributed implementation was evaluated on a commercial compute cloud. The simulation time was found to be linearly dependent on the number of photons and inversely proportional to the number of nodes. For a cluster size of 240 nodes, the simulation of 100 billion photon histories took 22 min, a 1258 × speed-up compared to the single-threaded Monte Carlo program. The overall computational throughput was 85,178 photon histories per node per second, with a latency of 100 s. The distributed simulation produced the same output as the original implementation and was resilient to hardware failure: the correctness of the simulation was unaffected by the shutdown of 50% of the nodes. PMID:22191916
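The shape of the computation is easy to mimic: independent map tasks each simulate a batch of photon histories from their own random stream, and a reduce step aggregates the tallies. The sketch below substitutes Python's `multiprocessing` for Hadoop and reduces the MC321 physics to a trivial scatter-or-absorb walk; the coefficients and batch sizes are invented.

```python
import random
from multiprocessing import Pool

MU_A, MU_S = 1.0, 10.0            # absorption/scattering coefficients (1/cm)
ALBEDO = MU_S / (MU_A + MU_S)     # survival probability per interaction

def map_task(args):
    """Map task: simulate one batch of photon histories; emit the
    batch's total count of scattering events."""
    seed, n_photons = args
    rng = random.Random(seed)     # independent stream per task
    steps = 0
    for _ in range(n_photons):
        while rng.random() < ALBEDO:    # photon scatters and survives...
            steps += 1
    return steps                        # ...until it is absorbed

if __name__ == "__main__":
    batches = [(seed, 10_000) for seed in range(8)]
    with Pool() as pool:                        # "map" phase in parallel
        partials = pool.map(map_task, batches)
    total_photons = sum(n for _, n in batches)
    mean = sum(partials) / total_photons        # "reduce" phase: aggregate
    print(f"mean scatterings/photon: {mean:.2f} "
          f"(expected {ALBEDO / (1 - ALBEDO):.2f})")
```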
Metacomputing on Commodity Computers
1999-05-01
... on NOWs, and this has contributed to the popularity of systems such as PVM [59], MPI [67], Linda [33], and TreadMarks [2]. ... presents the performance of Calypso and Persistent Linda (PLinda) [77] programs and compares how they can tolerate failures. A biological pattern ... adds fault tolerance to Linda programs by using light-weight transactions, whereas Calypso uses the combination of eager scheduling and two-phase ...
Fault tolerant high-performance PACS network design and implementation
NASA Astrophysics Data System (ADS)
Chimiak, William J.; Boehme, Johannes M.
1998-07-01
The Wake Forest University School of Medicine and the Wake Forest University/Baptist Medical Center (WFUBMC) are implementing a second-generation PACS. The first-generation PACS provided helpful information about the functional and temporal requirements of the system. It highlighted the importance of image retrieval speed, system availability, RIS/HIS integration, the ability to rapidly view images on any PACS workstation, network bandwidth, equipment redundancy, and the ability for the system to evolve using standards-based components. This paper deals with the network design and implementation of the PACS. The physical layout of the hospital areas served by the PACS, the choice of network equipment, and installation issues encountered are addressed. Efforts to optimize fault tolerance are discussed. The PACS network is a gigabit, mixed-media network based on LAN emulation over ATM (LANE), with a rapid migration from LANE to Multiple Protocols Over ATM (MPOA) planned. Two fault-tolerant backbone ATM switches serve to distribute network accesses with two load-balancing 622 megabit per second (Mbps) OC-12 interconnections. The switches were sized to be upgradable to a 2.5 Gbps OC-48 interconnection with an OC-12 interconnection as a load-balancing backup. Modalities connect with legacy network interface cards to a switched-Ethernet device. This device has two 155 Mbps OC-3 load-balancing uplinks to each of the backbone ATM switches of the PACS. This provides a fault-tolerant logical connection to the modality servers, which pass verified DICOM images to the PACS servers and the proper PACS diagnostic workstations. Where fiber pulls were prohibitively expensive, edge ATM switches were installed with an OC-12 uplink to a backbone ATM switch. The PACS and database servers are fault-tolerant, hot-swappable Sun Enterprise servers with an OC-12 connection to a backbone ATM switch and a Fast Ethernet connection to a backup network. The workstations come with 10/100BASE-T autosense cards. A redundant switched-Ethernet network will be installed to provide yet another degree of network fault tolerance. The switched-Ethernet devices are connected to each of the backbone ATM switches with two load-balancing OC-3 connections to provide fault-tolerant connectivity in the event of a primary network failure.
Distributed controller clustering in software defined networks.
Abdelaziz, Ahmed; Fong, Ang Tan; Gani, Abdullah; Garba, Usman; Khan, Suleman; Akhunzada, Adnan; Talebian, Hamid; Choo, Kim-Kwang Raymond
2017-01-01
Software Defined Networking (SDN) is an emerging and promising paradigm for network management because of its centralized network intelligence. However, the centralized control architecture of software-defined networks (SDNs) brings novel challenges of reliability, scalability, fault tolerance and interoperability. In this paper, we propose a novel clustered distributed controller architecture in a real SDN setting. The distributed cluster implementation comprises multiple popular SDN controllers. The proposed mechanism is evaluated using a real-world network topology running on top of an emulated SDN environment. The results show that the proposed distributed controller clustering mechanism is able to significantly reduce the average latency from 8.1% to 1.6% and the packet loss from 5.22% to 4.15%, compared to a distributed controller without clustering running on HP Virtual Application Network (VAN) SDN and Open Network Operating System (ONOS) controllers, respectively. Moreover, the proposed method shows reasonable CPU utilization. Furthermore, the proposed mechanism makes it possible to handle unexpected load fluctuations while maintaining continuous network operation, even when there is a controller failure. The paper is a step towards addressing the issues of reliability, scalability, fault tolerance, and interoperability.
A fault-tolerant addressable spin qubit in a natural silicon quantum dot
Takeda, Kenta; Kamioka, Jun; Otsuka, Tomohiro; Yoneda, Jun; Nakajima, Takashi; Delbecq, Matthieu R.; Amaha, Shinichi; Allison, Giles; Kodera, Tetsuo; Oda, Shunri; Tarucha, Seigo
2016-01-01
Fault-tolerant quantum computing requires high-fidelity qubits. This has been achieved in various solid-state systems, including isotopically purified silicon, but is yet to be accomplished in industry-standard natural (unpurified) silicon, mainly as a result of the dephasing caused by residual nuclear spins. This high fidelity can be achieved by speeding up the qubit operation and/or prolonging the dephasing time, that is, increasing the Rabi oscillation quality factor Q (the Rabi oscillation decay time divided by the π rotation time). In isotopically purified silicon quantum dots, only the second approach has been used, leaving the qubit operation slow. We apply the first approach to demonstrate an addressable fault-tolerant qubit using a natural silicon double quantum dot with a micromagnet that is optimally designed for fast spin control. This optimized design allows access to Rabi frequencies up to 35 MHz, which is two orders of magnitude greater than that achieved in previous studies. We find the optimum Q = 140 in this high-frequency range at a Rabi frequency of 10 MHz. This leads to a qubit fidelity of 99.6% measured via randomized benchmarking, which is the highest reported for natural silicon qubits and comparable to that obtained in isotopically purified silicon quantum dot–based qubits. This result can inspire contributions to quantum computing from industrial communities. PMID:27536725
A fault-tolerant addressable spin qubit in a natural silicon quantum dot.
Takeda, Kenta; Kamioka, Jun; Otsuka, Tomohiro; Yoneda, Jun; Nakajima, Takashi; Delbecq, Matthieu R; Amaha, Shinichi; Allison, Giles; Kodera, Tetsuo; Oda, Shunri; Tarucha, Seigo
2016-08-01
Fault-tolerant quantum computing requires high-fidelity qubits. This has been achieved in various solid-state systems, including isotopically purified silicon, but is yet to be accomplished in industry-standard natural (unpurified) silicon, mainly as a result of the dephasing caused by residual nuclear spins. This high fidelity can be achieved by speeding up the qubit operation and/or prolonging the dephasing time, that is, increasing the Rabi oscillation quality factor Q (the Rabi oscillation decay time divided by the π rotation time). In isotopically purified silicon quantum dots, only the second approach has been used, leaving the qubit operation slow. We apply the first approach to demonstrate an addressable fault-tolerant qubit using a natural silicon double quantum dot with a micromagnet that is optimally designed for fast spin control. This optimized design allows access to Rabi frequencies up to 35 MHz, which is two orders of magnitude greater than that achieved in previous studies. We find the optimum Q = 140 in this high-frequency range at a Rabi frequency of 10 MHz. This leads to a qubit fidelity of 99.6% measured via randomized benchmarking, which is the highest reported for natural silicon qubits and comparable to that obtained in isotopically purified silicon quantum dot-based qubits. This result can inspire contributions to quantum computing from industrial communities.
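The quality-factor arithmetic from the abstract's definition, worked through: Q is the Rabi decay time divided by the π-rotation time, and a π rotation takes half a Rabi period, t_π = 1/(2 f_Rabi). The implied decay time below is derived here, not quoted from the paper.

```python
f_rabi = 10e6                 # Rabi frequency at the optimum point: 10 MHz
t_pi = 1 / (2 * f_rabi)       # pi-rotation time: half a Rabi period = 50 ns
Q = 140                       # reported optimum quality factor
t_decay = Q * t_pi            # implied Rabi oscillation decay time
print(f"t_pi = {t_pi * 1e9:.0f} ns, implied decay time = {t_decay * 1e6:.1f} us")
# -> t_pi = 50 ns, implied decay time = 7.0 us
```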
Fault-tolerant Remote Quantum Entanglement Establishment for Secure Quantum Communications
NASA Astrophysics Data System (ADS)
Tsai, Chia-Wei; Lin, Jason
2016-07-01
This work presents a strategy for constructing long-distance quantum communications among a number of remote users through a collective-noise channel. With the assistance of semi-honest quantum certificate authorities (QCAs), the remote users can share a secret key through fault-tolerant entanglement swapping. The proposed protocol is feasible for large-scale distributed quantum networks with numerous users. Each pair of communicating parties only needs to establish quantum channels and classical authenticated channels with its local QCA. Thus, any user can communicate freely without pre-establishing point-to-point communication channels, which is efficient and feasible for practical environments.
MULTIPROCESSOR AND DISTRIBUTED PROCESSING BIBLIOGRAPHIC DATA BASE SOFTWARE SYSTEM
NASA Technical Reports Server (NTRS)
Miya, E. N.
1994-01-01
Multiprocessors and distributed processing are undergoing increased scientific scrutiny for many reasons. It is more and more difficult to keep track of the existing research in these fields. This package consists of a large machine-readable bibliographic data base which, in addition to the usual keyword searches, can be used for producing citations, indexes, and cross-references. The data base is compiled from smaller existing multiprocessing bibliographies and tables of contents from journals and significant conferences. There are approximately 4,000 entries covering topics such as parallel and vector processing, networks, supercomputers, fault-tolerant computers, and cellular automata. Each entry is represented by 21 fields including keywords, author, referencing book or journal title, volume and page number, and date and city of publication. The data base contains UNIX 'refer' formatted ASCII data and can be implemented on any computer running under the UNIX operating system. The data base requires approximately one megabyte of secondary storage. The documentation for this program is included with the distribution tape. This bibliography was compiled in 1985 and updated in 1988.
Toward a Fault Tolerant Architecture for Vital Medical-Based Wearable Computing.
Abdali-Mohammadi, Fardin; Bajalan, Vahid; Fathi, Abdolhossein
2015-12-01
Advancements in computers and electronic technologies have led to the emergence of a new generation of efficient, small, intelligent systems. The products of such technologies include smartphones and wearable devices, which have attracted the attention of medical applications. These products are used less in critical medical applications because of their resource constraints and failure sensitivity: without safety considerations, small integrated hardware will endanger patients' lives. Therefore, some principles are required for constructing wearable systems in healthcare so that the existing concerns are dealt with. Accordingly, this paper proposes an architecture for constructing wearable systems in critical medical applications. The proposed architecture is a three-tier one, supporting data flow from body sensors to the cloud. The tiers of this architecture include wearable computers, mobile computing, and mobile cloud computing. One feature of this architecture is its high possible fault tolerance due to the nature of its components. Moreover, the required protocols are presented to coordinate the components of this architecture. Finally, the reliability of this architecture is assessed by simulating the architecture and its components, and other aspects of the proposed architecture are discussed.
CSP: A Multifaceted Hybrid Architecture for Space Computing
NASA Technical Reports Server (NTRS)
Rudolph, Dylan; Wilson, Christopher; Stewart, Jacob; Gauvin, Patrick; George, Alan; Lam, Herman; Crum, Gary Alex; Wirthlin, Mike; Wilson, Alex; Stoddard, Aaron
2014-01-01
Research on the CHREC Space Processor (CSP) takes a multifaceted hybrid approach to embedded space computing. Working closely with the NASA Goddard SpaceCube team, researchers at the National Science Foundation (NSF) Center for High-Performance Reconfigurable Computing (CHREC) at the University of Florida and Brigham Young University are developing hybrid space computers that feature an innovative combination of three technologies: commercial-off-the-shelf (COTS) devices, radiation-hardened (RadHard) devices, and fault-tolerant computing. Modern COTS processors provide the utmost in performance and energy-efficiency but are susceptible to ionizing radiation in space, whereas RadHard processors are virtually immune to this radiation but are more expensive, larger, less energy-efficient, and generations behind in speed and functionality. By featuring COTS devices to perform the critical data processing, supported by simpler RadHard devices that monitor and manage the COTS devices, and augmented with novel uses of fault-tolerant hardware, software, information, and networking within and between COTS devices, the resulting system can maximize performance and reliability while minimizing energy consumption and cost. NASA Goddard has adopted the CSP concept and technology with plans underway to feature flight-ready CSP boards on two upcoming space missions.
Closed-Loop HIRF Experiments Performed on a Fault Tolerant Flight Control Computer
NASA Technical Reports Server (NTRS)
Belcastro, Celeste M.
1997-01-01
Closed-loop HIRF experiments were performed on a fault-tolerant flight control computer (FCC) at the NASA Langley Research Center. The FCC used in the experiments was a quad-redundant flight control computer executing B737 autoland control laws. The FCC was placed in one of the mode-stirred reverberation chambers in the HIRF Laboratory and interfaced to a computer simulation of the B737 flight dynamics, engines, sensors, actuators, and atmosphere in the Closed-Loop Systems Laboratory. Disturbances to the aircraft associated with wind gusts and turbulence were simulated during tests. Electrical isolation between the FCC under test and the simulation computer was achieved via a fiber optic interface for the analog and discrete signals. Closed-loop operation of the FCC enabled flight dynamics and atmospheric disturbances affecting the aircraft to be represented during tests. Upset was induced in the FCC as a result of exposure to HIRF, and the effect of upset on the simulated flight of the aircraft was observed and recorded. This paper presents a description of these closed-loop HIRF experiments, upset data obtained from the FCC during these experiments, and closed-loop effects on the simulated flight of the aircraft.
The Design of a Fault-Tolerant COTS-Based Bus Architecture for Space Applications
NASA Technical Reports Server (NTRS)
Chau, Savio N.; Alkalai, Leon; Tai, Ann T.
2000-01-01
The high-performance, scalability and miniaturization requirements together with the power, mass and cost constraints mandate the use of commercial-off-the-shelf (COTS) components and standards in the X2000 avionics system architecture for deep-space missions. In this paper, we report our experiences and findings on the design of an IEEE 1394 compliant fault-tolerant COTS-based bus architecture. While the COTS standard IEEE 1394 adequately supports power management, high performance and scalability, its topological criteria impose restrictions on fault tolerance realization. To circumvent the difficulties, we derive a "stack-tree" topology that not only complies with the IEEE 1394 standard but also facilitates fault tolerance realization in a spaceborne system with limited dedicated resource redundancies. Moreover, by exploiting pertinent standard features of the 1394 interface which are not purposely designed for fault tolerance, we devise a comprehensive set of fault detection mechanisms to support the fault-tolerant bus architecture.
Fault-tolerant quantum error detection.
Linke, Norbert M; Gutierrez, Mauricio; Landsman, Kevin A; Figgatt, Caroline; Debnath, Shantanu; Brown, Kenneth R; Monroe, Christopher
2017-10-01
Quantum computers will eventually reach a size at which quantum error correction becomes imperative. Quantum information can be protected from qubit imperfections and flawed control operations by encoding a single logical qubit in multiple physical qubits. This redundancy allows the extraction of error syndromes and the subsequent detection or correction of errors without destroying the logical state itself through direct measurement. We show the encoding and syndrome measurement of a fault-tolerantly prepared logical qubit via an error detection protocol on four physical qubits, represented by trapped atomic ions. This demonstrates the robustness of a logical qubit to imperfections in the very operations used to encode it. The advantage persists in the face of large added error rates and experimental calibration errors.
A probabilistic dynamic energy model for ad-hoc wireless sensors network with varying topology
NASA Astrophysics Data System (ADS)
Al-Husseini, Amal
In this dissertation we investigate the behavior of Wireless Sensor Networks (WSNs) from the degree distribution and evolution perspective. Specifically, we focus on implementation of a scale-free degree distribution topology for energy-efficient WSNs. WSNs are an emerging technology that finds applications in areas such as environment monitoring, agricultural crop monitoring, forest fire monitoring, and hazardous chemical monitoring in war zones. This technology allows us to collect data without human presence or intervention. Energy conservation/efficiency is one of the major issues in prolonging the active life of WSNs. Recently, many energy-aware and fault-tolerant topology control algorithms have been presented, but there is a dearth of research focused on the energy conservation/efficiency of WSNs. Therefore, we study energy efficiency and fault tolerance in WSNs from the degree distribution and evolution perspective. Self-organization observed in natural and biological systems has been directly linked to their degree distribution. It is widely known that a scale-free distribution bestows robustness, fault tolerance, and access efficiency on a system. Motivated by these properties, we propose two complex-network-theoretic self-organizing models for adaptive WSNs. In particular, we focus on adapting the Barabasi and Albert scale-free model to fit the constraints and limitations of WSNs. We developed simulation models to conduct numerical experiments and network analysis. The main objective of studying these models is to find ways of reducing the energy usage of each node and balancing the overall network energy disrupted by faulty communication among nodes. The first model constructs the wireless sensor network relative to the degree (connectivity) and remaining energy of every individual node. We observed that it results in a scale-free network structure which has good fault tolerance properties in the face of random node failures. The second model considers additional constraints on the maximum degree of each node as well as the energy consumption relative to degree changes. This gives more realistic results from a dynamical network perspective and results in balanced network-wide energy consumption. The results show that networks constructed using the proposed approach have good properties for different centrality measures. The outcomes of the presented research are beneficial to building WSN control models with greater self-organization properties, which leads to optimal energy consumption.
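A toy version of the first model's growth rule, with the second model's degree cap bolted on: a joining node attaches preferentially by degree times remaining energy. The attachment count, energy-cost rule, and all constants below are invented for illustration, not taken from the dissertation.

```python
import random

def grow_wsn(n_nodes, m_links=2, max_degree=8, link_cost=0.01):
    """Grow a BA-style WSN topology where attachment probability is
    proportional to degree x remaining energy, under a degree cap."""
    adj = {0: {1}, 1: {0}}
    energy = {0: 1.0, 1: 1.0}
    for new in range(2, n_nodes):
        candidates = {v: len(adj[v]) * energy[v]
                      for v in adj if len(adj[v]) < max_degree}
        targets = set()
        while len(targets) < min(m_links, len(candidates)):
            v = random.choices(list(candidates),
                               weights=list(candidates.values()))[0]
            targets.add(v)
        adj[new], energy[new] = set(), 1.0
        for v in targets:
            adj[new].add(v)
            adj[v].add(new)
            energy[v] -= link_cost * len(adj[v])   # degree-dependent cost
    return adj

adj = grow_wsn(200)
degrees = sorted((len(ns) for ns in adj.values()), reverse=True)
print("five highest degrees:", degrees[:5])       # heavy tail, capped at 8
```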
NASA Technical Reports Server (NTRS)
Harper, Richard E.; Butler, Bryan P.
1990-01-01
The Draper fault-tolerant processor with fault-tolerant shared memory (FTP/FTSM), which is designed to allow application tasks to continue execution during the memory alignment process, is described. Processor performance is not affected by memory alignment. In addition, the FTP/FTSM incorporates a hardware scrubber device to perform the memory alignment quickly during unused memory access cycles. The FTP/FTSM architecture is described, followed by an estimate of the time required for channel reintegration.
Diagnosing a Failed Proof in Fault-Tolerance: A Disproving Challenge Problem
NASA Technical Reports Server (NTRS)
Pike, Lee; Miner, Paul; Torres-Pomales, Wilfredo
2006-01-01
This paper proposes a challenge problem in disproving. We describe a fault-tolerant distributed protocol designed at NASA for use in a fly-by-wire system for next-generation commercial aircraft. An early design of the protocol contains a subtle bug that is highly unlikely to be caught in fault injection testing. We describe a failed proof of the protocol's correctness in a mechanical theorem prover (PVS) with a complex unfinished proof conjecture. We use a model checking suite (SAL) to generate a concrete counterexample to the unproven conjecture to demonstrate the existence of a bug. However, we argue that the effort required in our approach is too high and propose what conditions a better solution would satisfy. We carefully describe the protocol and bug to provide a challenging but feasible case study for disproving research.
Evaluating Application Resilience with XRay
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chen, Sui; Bronevetsky, Greg; Li, Bin
2015-05-07
The rising count and shrinking feature size of transistors within modern computers is making them increasingly vulnerable to various types of soft faults. This problem is especially acute in high-performance computing (HPC) systems used for scientific computing, because these systems include many thousands of compute cores and nodes, all of which may be utilized in a single large-scale run. The increasing vulnerability of HPC applications to errors induced by soft faults is motivating extensive work on techniques to make these applications more resilient to such faults, ranging from generic techniques such as replication or checkpoint/restart to algorithm-specific error detection and tolerance techniques. Effective use of such techniques requires a detailed understanding of how a given application is affected by soft faults to ensure that (i) efforts to improve application resilience are spent in the code regions most vulnerable to faults and (ii) the appropriate resilience technique is applied to each code region. This paper presents XRay, a tool to view application vulnerability to soft errors, and illustrates how XRay can be used in the context of a representative application. In addition to providing actionable insights into application behavior, XRay automatically selects the number of fault injection experiments required to provide an informative view of application behavior, ensuring that the information is statistically well-grounded without performing unnecessary experiments.
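XRay's actual stopping rule is not given in the abstract; as a point of reference, the usual way to size such a campaign is to choose the smallest number of injections for which a binomial confidence interval on the observed failure fraction reaches a target half-width. The function name and defaults below are illustrative:

    import math

    def injections_needed(half_width, p_guess=0.5, z=1.96):
        # Smallest n so a normal-approximation 95% CI on an observed
        # failure fraction has half-width <= half_width; p_guess = 0.5
        # is the worst case when the true fraction is unknown.
        return math.ceil(z * z * p_guess * (1.0 - p_guess) / half_width ** 2)

    print(injections_needed(0.02))   # bound the fraction to +/-2%: 2401 runs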
Advanced Launch System Multi-Path Redundant Avionics Architecture Analysis and Characterization
NASA Technical Reports Server (NTRS)
Baker, Robert L.
1993-01-01
The objective of the Multi-Path Redundant Avionics Suite (MPRAS) program is the development of a set of avionic architectural modules which will be applicable to the family of launch vehicles required to support the Advanced Launch System (ALS). To enable ALS cost/performance requirements to be met, the MPRAS must support autonomy, maintenance, and testability capabilities which exceed those present in conventional launch vehicles. The multi-path redundant or fault tolerance characteristics of the MPRAS are necessary to offset a reduction in avionics reliability due to the increased complexity needed to support these new cost reduction and performance capabilities, and to meet avionics reliability requirements which will provide cost-effective reductions in overall ALS recurring costs. A complex, real-time distributed computing system is needed to meet the ALS avionics system requirements. General Dynamics, Boeing Aerospace, and C.S. Draper Laboratory have proposed system architectures as candidates for the ALS MPRAS. The purpose of this document is to report the results of independent performance and reliability characterization and assessment analyses of each proposed candidate architecture, and qualitative assessments of testability, maintainability, and fault tolerance mechanisms. These independent analyses were conducted as part of the MPRAS Part 2 program and were carried out under NASA Langley Research Contract NAS1-17964, Task Assignment 28.
Fault tolerant computer control for a Maglev transportation system
NASA Technical Reports Server (NTRS)
Lala, Jaynarayan H.; Nagle, Gail A.; Anagnostopoulos, George
1994-01-01
Magnetically levitated (Maglev) vehicles operating on dedicated guideways at speeds of 500 km/hr are an emerging transportation alternative to short-haul air and high-speed rail. They have the potential to offer a service significantly more dependable than air, with lower operating cost than both air and high-speed rail. Maglev transportation derives these benefits by using magnetic forces to suspend a vehicle 8 to 200 mm above the guideway. Magnetic forces are also used for propulsion and guidance. The combination of high speed, short headways, stringent ride quality requirements, and a distributed offboard propulsion system necessitates high levels of automation for Maglev control and operation. Very high levels of safety and availability will be required of the Maglev control system. This paper describes the mission scenario, functional requirements, and dependability and performance requirements of the Maglev command, control, and communications system. A distributed hierarchical architecture consisting of vehicle on-board computers, wayside zone computers, a central computer facility, and communication links between these entities was synthesized to meet the functional and dependability requirements of the Maglev. Two variations of the basic architecture are described: the Smart Vehicle Architecture (SVA) and the Zone Control Architecture (ZCA). Preliminary dependability modeling results are also presented.
Tien, Nguyen Xuan; Kim, Semog; Rhee, Jong Myung; Park, Sang Yoon
2017-07-25
Fault tolerance has long been a major concern for sensor communications in fault-tolerant cyber physical systems (CPSs). Network failure problems often occur in wireless sensor networks (WSNs) due to various factors such as the insufficient power of sensor nodes, the dislocation of sensor nodes, the unstable state of wireless links, and unpredictable environmental interference. Fault tolerance is thus one of the key requirements for data communications in WSN applications. This paper proposes a novel path redundancy-based algorithm, called dual separate paths (DSP), that provides fault-tolerant communication with the improvement of the network traffic performance for WSN applications, such as fault-tolerant CPSs. The proposed DSP algorithm establishes two separate paths between a source and a destination in a network based on the network topology information. These paths are node-disjoint paths and have optimal path distances. Unicast frames are delivered from the source to the destination in the network through the dual paths, providing fault-tolerant communication and reducing redundant unicast traffic for the network. The DSP algorithm can be applied to wired and wireless networks, such as WSNs, to provide seamless fault-tolerant communication for mission-critical and life-critical applications such as fault-tolerant CPSs. The analyzed and simulated results show that the DSP-based approach not only provides fault-tolerant communication, but also improves network traffic performance. For the case study in this paper, when the DSP algorithm was applied to high-availability seamless redundancy (HSR) networks, the proposed DSP-based approach reduced the network traffic by 80% to 88% compared with the standard HSR protocol, thus improving network traffic performance.
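The abstract does not spell out how the two node-disjoint paths are constructed; the sketch below is a greedy stand-in rather than the published DSP algorithm: it takes a BFS shortest path and then searches again with that path's interior nodes banned (a greedy second pass can miss disjoint pairs that a flow-based method such as Suurballe's algorithm would find):

    from collections import deque

    def bfs_path(adj, src, dst, banned=frozenset()):
        # Shortest hop-count path from src to dst avoiding banned nodes.
        prev, seen = {}, {src}
        q = deque([src])
        while q:
            u = q.popleft()
            if u == dst:
                path = [dst]
                while path[-1] != src:
                    path.append(prev[path[-1]])
                return path[::-1]
            for v in adj[u]:
                if v not in seen and v not in banned:
                    seen.add(v)
                    prev[v] = u
                    q.append(v)
        return None

    def dual_separate_paths(adj, src, dst):
        # Second BFS bans the first path's interior nodes, so the two
        # routes share only their endpoints.
        p1 = bfs_path(adj, src, dst)
        if p1 is None:
            return None, None
        return p1, bfs_path(adj, src, dst, frozenset(p1[1:-1]))

    # A six-node ring yields two disjoint routes between opposite nodes.
    ring = {i: [(i + 1) % 6, (i - 1) % 6] for i in range(6)}
    print(dual_separate_paths(ring, 0, 3))

Sending each unicast frame down both routes gives the seamless failover the paper describes: a single node or link failure leaves at least one copy of every frame intact.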
Cost and benefits optimization model for fault-tolerant aircraft electronic systems
NASA Technical Reports Server (NTRS)
1983-01-01
The factors involved in economic assessment of fault tolerant systems (FTS) and fault tolerant flight control systems (FTFCS) are discussed. Algorithms for optimization and economic analysis of FTFCS are documented.
BESIII Physical Analysis on Hadoop Platform
NASA Astrophysics Data System (ADS)
Huo, Jing; Zang, Dongsong; Lei, Xiaofeng; Li, Qiang; Sun, Gongxing
2014-06-01
In the past 20 years, computing clusters have been widely used for High Energy Physics data processing. Jobs running on a traditional cluster with a data-to-computing structure have to read large volumes of data via the network to the computing nodes for analysis, making I/O latency a bottleneck of the whole system. The new distributed computing technology based on the MapReduce programming model has many advantages, such as high concurrency, high scalability and high fault tolerance, and it can benefit us in dealing with Big Data. This paper brings the idea of using the MapReduce model to BESIII physical analysis, and presents a new data analysis system structure based on the Hadoop platform, which not only greatly improves the efficiency of data analysis, but also reduces the cost of system building. Moreover, this paper establishes an event pre-selection system based on the event-level metadata (TAGs) database to optimize the data analysis procedure.
The implementation and use of Ada on distributed systems with high reliability requirements
NASA Technical Reports Server (NTRS)
Knight, J. C.
1986-01-01
The general inadequacy of Ada for programming systems that must survive processor loss was shown. A solution to the problem was proposed in which there are no syntactic changes to Ada. The approach was evaluated using a full-scale, realistic application. The application used was the Advanced Transport Operating System (ATOPS), an experimental computer control system developed for a modified Boeing 737 aircraft. The ATOPS system is a full-authority, real-time avionics system providing a large variety of advanced features. Methods of building fault tolerance into concurrent systems were explored. A set of criteria by which the proposed method will be judged was examined. Extensive interaction with personnel from Computer Sciences Corporation and NASA Langley occurred to determine the requirements of the ATOPS software. Backward error recovery in concurrent systems was assessed.
Fault tolerant attitude sensing and force feedback control for unmanned aerial vehicles
NASA Astrophysics Data System (ADS)
Jagadish, Chirag
Two aspects of an unmanned aerial vehicle are studied in this work. One is fault-tolerant attitude determination and the other is providing force feedback to the joy-stick of the UAV so as to prevent faulty inputs from the pilot. Determination of attitude plays an important role in the control of aerial vehicles. One way of defining the attitude is through Euler angles. These angles can be determined from measurements of the projections of the gravity and earth magnetic fields on the three body axes of the vehicle. Attitude determination in unmanned aerial vehicles poses additional challenges due to limitations of space, payload, power and cost. These limitations leave almost no room for bulky sensors or extra backup sensor hardware, and hence no margin for sensor faults either. In the face of these limitations, this study proposes fault-tolerant computation of the Euler angles by utilizing multiple different computation methods, with each method utilizing a different subset of the available sensor measurement data. Twenty-five such methods are presented in this document. The capability of computing the Euler angles in multiple ways provides the diversified redundancy required for fault tolerance. The proposed approach can identify certain sets of sensor failures and even separate the reference fields from the disturbances. A bank-to-turn maneuver of the NASA GTM UAV is used to demonstrate the fault tolerance provided by the proposed method, as well as the determination of the correct Euler angles despite interference by inertial acceleration disturbances. Attitude computation is essential for stability, but as of today most UAVs are commanded remotely by human pilots. While basic stability control is entrusted to the machine, i.e., the on-board automatic controller, overall guidance is usually left to humans. It is therefore the pilot who sets the commands/references through a joy-stick. While this is a good compromise between complete automation and complete human control, it still poses some unique challenges. Pilots of manned aircraft are present inside the cockpit of the aircraft they fly and thus have a better feel for the flying environment and the limitations of the flight. The same might not be true for UAV pilots stationed on the ground. A major handicap is that visual feedback is the only feedback available to the UAV pilot. An additional cue such as force feedback on the remote-control joy-stick can help the UAV pilot physically feel the limits of the safe flight envelope. This can make the flying itself easier and safer. The method proposed here is to design a joy-stick assembly with an additional actuator. This actuator is controlled so as to generate a force feedback on the joy-stick. The control developed for this system is such that the actuator allows free movement for the pilot as long as the UAV is within the safe flight envelope. On the other hand, if it is outside this safe range, the actuator opposes the pilot's applied torque and prevents him/her from giving erroneous commands to the UAV.
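As one concrete instance of the redundant computation paths described above, the sketch below recovers roll and pitch from the gravity projection and a tilt-compensated yaw from the magnetic projection. This is a textbook formulation assuming an x-forward, y-right, z-down body frame (signs change under other axis conventions); it is not claimed to be one of the dissertation's twenty-five methods:

    import math

    def euler_from_fields(a, m):
        # Roll and pitch from the gravity vector a = (ax, ay, az).
        ax, ay, az = a
        roll = math.atan2(ay, az)
        pitch = math.atan2(-ax, math.hypot(ay, az))
        # Rotate the magnetic vector m back through roll and pitch,
        # then take the heading in the horizontal plane.
        mx, my, mz = m
        mxh = (mx * math.cos(pitch)
               + my * math.sin(pitch) * math.sin(roll)
               + mz * math.sin(pitch) * math.cos(roll))
        myh = my * math.cos(roll) - mz * math.sin(roll)
        yaw = math.atan2(-myh, mxh)
        return roll, pitch, yaw

Running several such formulas on different sensor subsets and comparing their outputs is what provides the diversified redundancy: a disagreement isolates the measurements, and hence the sensors, at fault.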
Computer-aided operations engineering with integrated models of systems and operations
NASA Technical Reports Server (NTRS)
Malin, Jane T.; Ryan, Dan; Fleming, Land
1994-01-01
CONFIG 3 is a prototype software tool that supports integrated conceptual design evaluation from early in the product life cycle, by supporting isolated or integrated modeling, simulation, and analysis of the function, structure, behavior, failures and operation of system designs. Integration and reuse of models is supported in an object-oriented environment providing capabilities for graph analysis and discrete event simulation. Integration is supported among diverse modeling approaches (component view, configuration or flow path view, and procedure view) and diverse simulation and analysis approaches. Support is provided for integrated engineering in diverse design domains, including mechanical and electro-mechanical systems, distributed computer systems, and chemical processing and transport systems. CONFIG supports abstracted qualitative and symbolic modeling, for early conceptual design. System models are component structure models with operating modes, with embedded time-related behavior models. CONFIG supports failure modeling and modeling of state or configuration changes that result in dynamic changes in dependencies among components. Operations and procedure models are activity structure models that interact with system models. CONFIG is designed to support evaluation of system operability, diagnosability and fault tolerance, and analysis of the development of system effects of problems over time, including faults, failures, and procedural or environmental difficulties.
Investigation, Development, and Evaluation of Performance Proving for Fault-tolerant Computers
NASA Technical Reports Server (NTRS)
Levitt, K. N.; Schwartz, R.; Hare, D.; Moore, J. S.; Melliar-Smith, P. M.; Shostak, R. E.; Boyer, R. S.; Green, M. W.; Elliott, W. D.
1983-01-01
A number of methodologies for verifying systems, and computer-based tools that assist users in verifying their systems, were developed. These tools were applied to partially verify the SIFT ultrareliable aircraft computer. Topics covered included: the STP theorem prover; design verification of SIFT; high-level language code verification; assembly-language-level verification; numerical algorithm verification; verification of flight control programs; and verification of hardware logic.
Sliding Mode Fault Tolerant Control with Adaptive Diagnosis for Aircraft Engines
NASA Astrophysics Data System (ADS)
Xiao, Lingfei; Du, Yanbin; Hu, Jixiang; Jiang, Bin
2018-03-01
In this paper, a novel sliding mode fault tolerant control method is presented for aircraft engine systems with uncertainties and disturbances, on the basis of an adaptive diagnostic observer. Taking both sensor faults and actuator faults into account, a general model of aircraft engine control systems subject to uncertainties and disturbances is considered. The corresponding augmented dynamic model is then established in order to facilitate the fault diagnosis and fault tolerant controller design. Next, a suitable detection observer is designed to detect the faults effectively. By creating an adaptive diagnostic observer and building on a sliding mode strategy, the sliding mode fault tolerant controller is constructed. Robust stabilization is discussed and the closed-loop system can be stabilized robustly. It is also proven that the adaptive diagnostic observer output errors and the fault estimates converge exponentially to a set, with a convergence rate greater than a value that can be adjusted by choosing the design parameters properly. Simulation on a twin-shaft aircraft engine verifies the applicability of the proposed fault tolerant control method.
Programming your way out of the past: ISIS and the META Project
NASA Technical Reports Server (NTRS)
Birman, Kenneth P.; Marzullo, Keith
1989-01-01
The ISIS distributed programming system and the META Project are described. The ISIS programming toolkit is an aid to low-level programming that makes it easy to build fault-tolerant distributed applications that exploit replication and concurrent execution. The META Project is reexamining high-level mechanisms such as the filesystem, shell language, and administration tools in distributed systems.
Fault Detection of Rotating Machinery using the Spectral Distribution Function
NASA Technical Reports Server (NTRS)
Davis, Sanford S.
1997-01-01
The spectral distribution function is introduced to characterize the process leading to faults in rotating machinery. It is shown to be a more robust indicator than conventional power spectral density estimates, but requires only slightly more computational effort. The method is illustrated with examples from seeded gearbox transmission faults and an analytical model of a defective bearing. Procedures are suggested for implementation in realistic environments.
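The abstract names the estimator without defining it; under a plain periodogram assumption, the spectral distribution function is the normalized running integral of the power spectrum, as in the sketch below (the paper's windowing and averaging choices are not reproduced):

    import numpy as np

    def spectral_distribution(x, fs):
        # Normalized running integral of a simple periodogram:
        # F(f) = (power at frequencies <= f) / (total power).
        x = np.asarray(x, dtype=float)
        X = np.fft.rfft(x - x.mean())
        power = np.abs(X) ** 2
        f = np.fft.rfftfreq(x.size, d=1.0 / fs)
        return f, np.cumsum(power) / np.sum(power)

Being a monotone, CDF-like curve, it shifts visibly when fault energy appears in new frequency bands, which is one reason such a cumulative statistic can be more robust than a raw PSD estimate.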
NASA Technical Reports Server (NTRS)
Rasmussen, Robert D. (Inventor); Manning, Robert M. (Inventor); Lewis, Blair F. (Inventor); Bolotin, Gary S. (Inventor); Ward, Richard S. (Inventor)
1990-01-01
This is a distributed computing system providing flexible fault tolerance; ease of software design and concurrency specification; and dynamic balancing of loads. The system comprises a plurality of computers, each having a first input/output interface and a second input/output interface for interfacing to communications networks, each second input/output interface including a bypass for bypassing the associated computer. A global communications network interconnects the first input/output interfaces, providing each computer the ability to broadcast messages simultaneously to the remainder of the computers. A meshwork communications network interconnects the second input/output interfaces, providing each computer with the ability to establish a communications link with another computer while bypassing the remainder of the computers. Each computer is controlled by a resident copy of a common operating system. Communication between respective computers is by means of split tokens, each having a moving first portion which is sent from computer to computer and a resident second portion which is disposed in the memory of at least one of the computers, wherein the location of the second portion is part of the first portion. The split tokens represent both functions to be executed by the computers and data to be employed in the execution of those functions. The first input/output interfaces each include logic for detecting a collision between messages and for terminating the broadcasting of a message, whereby collisions between messages are detected and avoided.
Mission Management Computer and Sequencing Hardware for RLV-TD HEX-01 Mission
NASA Astrophysics Data System (ADS)
Gupta, Sukrat; Raj, Remya; Mathew, Asha Mary; Koshy, Anna Priya; Paramasivam, R.; Mookiah, T.
2017-12-01
Reusable Launch Vehicle-Technology Demonstrator Hypersonic Experiment (RLV-TD HEX-01) mission posed some unique challenges in the design and development of avionics hardware. This work presents the details of mission-critical avionics hardware, mainly the Mission Management Computer (MMC) and sequencing hardware. The Navigation, Guidance and Control (NGC) chain for RLV-TD is dual redundant with cross-strapped Remote Terminals (RTs) interfaced through a MIL-STD-1553B bus. MMC is the Bus Controller on the 1553 bus, performing the functions of GPS-aided navigation, guidance, digital autopilot and sequencing for the RLV-TD launch vehicle at different periodicities (10, 20, 500 ms). Digital autopilot execution in the MMC with a periodicity of 10 ms (in the ascent phase) was introduced for the first time and successfully demonstrated in the flight. The MMC is built around the Intel i960 processor and has built-in fault tolerance features such as ECC for memories. Fault Detection and Isolation schemes are implemented to isolate a failed MMC. The sequencing hardware comprises the Stage Processing System (SPS) and the Command Execution Module (CEM). The SPS is an RT on the 1553 bus which receives the sequencing and control related commands from the MMCs and posts them to downstream modules, after proper error handling, for final execution. The SPS is designed as a high-reliability system by incorporating various fault tolerance and fault detection features. The CEM is a relay-based module for sequence command execution.
Injecting Artificial Memory Errors Into a Running Computer Program
NASA Technical Reports Server (NTRS)
Bornstein, Benjamin J.; Granat, Robert A.; Wagstaff, Kiri L.
2008-01-01
Single-event upsets (SEUs) or bitflips are computer memory errors caused by radiation. BITFLIPS (Basic Instrumentation Tool for Fault Localized Injection of Probabilistic SEUs) is a computer program that deliberately injects SEUs into another computer program, while the latter is running, for the purpose of evaluating the fault tolerance of that program. BITFLIPS was written as a plug-in extension of the open-source Valgrind debugging and profiling software. BITFLIPS can inject SEUs into any program that can be run on the Linux operating system, without needing to modify the program's source code. Further, if access to the original program source code is available, BITFLIPS offers fine-grained control over exactly when and which areas of memory (as specified via program variables) will be subjected to SEUs. The rate of injection of SEUs is controlled by specifying either a fault probability or a fault rate based on memory size and radiation exposure time, in units of SEUs per byte per second. BITFLIPS can also log each SEU that it injects and, if program source code is available, report the magnitude of effect of the SEU on a floating-point value or other program variable.
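The sketch below does not reproduce BITFLIPS's Valgrind plug-in interface; it only illustrates the bookkeeping the abstract describes, converting an SEU rate in upsets per byte per second into an injection probability (under a Poisson-arrival assumption) and flipping one randomly chosen bit:

    import math
    import random

    def seu_probability(rate_per_byte_s, n_bytes, exposure_s):
        # P(at least one upset), assuming upsets arrive as a Poisson
        # process at the given per-byte rate.
        return 1.0 - math.exp(-rate_per_byte_s * n_bytes * exposure_s)

    def inject_seu(mem, rng=random):
        # Flip one uniformly chosen bit of a bytearray; return where.
        i = rng.randrange(len(mem))
        bit = rng.randrange(8)
        mem[i] ^= 1 << bit
        return i, bit

    mem = bytearray(b"attitude state vector")
    print(seu_probability(1e-9, len(mem), 3600.0))  # illustrative rate
    print(inject_seu(mem))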
Probabilistic evaluation of on-line checks in fault-tolerant multiprocessor systems
NASA Technical Reports Server (NTRS)
Nair, V. S. S.; Hoskote, Yatin V.; Abraham, Jacob A.
1992-01-01
The analysis of fault-tolerant multiprocessor systems that use concurrent error detection (CED) schemes is much more difficult than the analysis of conventional fault-tolerant architectures. Various analytical techniques have been proposed to evaluate CED schemes deterministically. However, these approaches are based on worst-case assumptions related to the failure of system components. Often, the evaluation results do not reflect the actual fault tolerance capabilities of the system. A probabilistic approach to evaluate the fault detecting and locating capabilities of on-line checks in a system is developed. The various probabilities associated with the checking schemes are identified and used in the framework of the matrix-based model. Based on these probabilistic matrices, estimates for the fault tolerance capabilities of various systems are derived analytically.
Fault recovery characteristics of the fault tolerant multi-processor
NASA Technical Reports Server (NTRS)
Padilla, Peter A.
1990-01-01
The fault handling performance of the fault tolerant multiprocessor (FTMP) was investigated. Fault handling errors detected during fault injection experiments were characterized. In these fault injection experiments, the FTMP disabled a working unit instead of the faulted unit once every 500 faults, on the average. System design weaknesses allow active faults to exercise a part of the fault management software that handles byzantine or lying faults. It is pointed out that these weak areas in the FTMP's design increase the probability that, for any hardware fault, a good LRU (line replaceable unit) is mistakenly disabled by the fault management software. It is concluded that fault injection can help detect and analyze the behavior of a system in the ultra-reliable regime. Although fault injection testing cannot be exhaustive, it has been demonstrated that it provides a unique capability to unmask problems and to characterize the behavior of a fault-tolerant system.
NASA Technical Reports Server (NTRS)
Malekpour, Mahyar R.
2007-01-01
This report presents the mechanical verification of a simplified model of a rapid Byzantine-fault-tolerant self-stabilizing protocol for distributed clock synchronization systems. This protocol does not rely on any assumptions about the initial state of the system. This protocol tolerates bursts of transient failures, and deterministically converges within a time bound that is a linear function of the self-stabilization period. A simplified model of the protocol is verified using the Symbolic Model Verifier (SMV) [SMV]. The system under study consists of 4 nodes, where at most one of the nodes is assumed to be Byzantine faulty. The model checking effort is focused on verifying correctness of the simplified model of the protocol in the presence of a permanent Byzantine fault as well as confirmation of claims of determinism and linear convergence with respect to the self-stabilization period. Although model checking results of the simplified model of the protocol confirm the theoretical predictions, these results do not necessarily confirm that the protocol solves the general case of this problem. Modeling challenges of the protocol and the system are addressed. A number of abstractions are utilized in order to reduce the state space. Also, additional innovative state space reduction techniques are introduced that can be used in future verification efforts applied to this and other protocols.
SLURM: Simple Linux Utility for Resource Management
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jette, M; Dunlap, C; Garlick, J
2002-04-24
Simple Linux Utility for Resource Management (SLURM) is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for Linux clusters of thousands of nodes. Components include machine status, partition management, job management, and scheduling modules. The design also includes a scalable, general-purpose communication infrastructure. Development will take place in four phases: Phase I results in a solid infrastructure; Phase II produces a functional but limited interactive job initiation capability without use of the interconnect/switch; Phase III provides switch support and documentation; Phase IV provides job status, fault-tolerance, and job queuing and control through Livermore's Distributed Production Control System (DPCS), a meta-batch and resource management system.
Adaptive Control Allocation for Fault Tolerant Overactuated Autonomous Vehicles
Casavola, A.; Garone, E.
2007-11-01
[Report documentation page residue; recoverable fragment, RTO-MP-AVT-145:] Control allocation problem (CAP) - Given a virtual input v(t
A design approach for ultrareliable real-time systems
NASA Technical Reports Server (NTRS)
Lala, Jaynarayan H.; Harper, Richard E.; Alger, Linda S.
1991-01-01
A design approach developed over the past few years to formalize redundancy management and validation is described. Redundant elements are partitioned into individual fault-containment regions (FCRs). An FCR is a collection of components that operates correctly regardless of any arbitrary logical or electrical fault outside the region. Conversely, a fault in an FCR cannot cause hardware outside the region to fail. The outputs of all channels are required to agree bit-for-bit under no-fault conditions (exact bitwise consensus). Synchronization, input agreement, and input validity conditions are discussed. The Advanced Information Processing System (AIPS), which is a fault-tolerant distributed architecture based on this approach, is described. A brief overview of recent applications of these systems and current research is presented.
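Exact bitwise consensus makes the voting step itself trivial to implement. The sketch below is a generic illustration (not AIPS flight code) of a bit-for-bit 2-of-3 majority vote, with per-channel disagreement masks of the kind a redundancy manager would use to identify the deviant channel:

    def majority_vote(a, b, c, width=32):
        # Bit-for-bit 2-of-3 majority of three channel outputs, plus a
        # disagreement mask per channel (nonzero = deviant bits).
        mask = (1 << width) - 1
        voted = ((a & b) | (b & c) | (a & c)) & mask
        return voted, [(x ^ voted) & mask for x in (a, b, c)]

    word, flags = majority_vote(0xDEADBEEF, 0xDEADBEEF, 0xDEA5BEEF)
    # word == 0xDEADBEEF; flags[2] != 0 pinpoints channel C's bad bits

Under the exact-consensus requirement, any nonzero disagreement mask indicates a fault rather than benign divergence, which is what makes diagnosis at the FCR level decisive.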
Galileo spacecraft power distribution and autonomous fault recovery
NASA Technical Reports Server (NTRS)
Detwiler, R. C.
1982-01-01
There is a trend in current spacecraft design to achieve greater fault tolerance through the implementation of on-board software dedicated to detecting and isolating failures. A combination of hardware and software is utilized in the Galileo power system for autonomous fault recovery. Galileo is a dual-spun spacecraft designed to carry a number of scientific instruments into a series of orbits around the planet Jupiter. In addition to its self-contained scientific payload, it will also carry a probe system which will be separated from the spacecraft some 150 days prior to Jupiter encounter. The Galileo spacecraft is scheduled to be launched in 1985. Attention is given to the power system, the fault protection requirements, and the power fault recovery implementation.
Fault tolerance techniques to assure data integrity in high-volume PACS image archives
NASA Astrophysics Data System (ADS)
He, Yutao; Huang, Lu J.; Valentino, Daniel J.; Wingate, W. Keith; Avizienis, Algirdas
1995-05-01
Picture archiving and communication systems (PACS) perform the systematic acquisition, archiving, and presentation of large quantities of radiological image and text data. In the UCLA Radiology PACS, for example, the volume of image data archived currently exceeds 2500 gigabytes. Furthermore, the distributed heterogeneous PACS is expected to have near real-time response, be continuously available, and assure the integrity and privacy of patient data. The off-the-shelf subsystems that compose the current PACS cannot meet these expectations; therefore, fault tolerance techniques had to be incorporated into the system. This paper reports our first-step efforts toward this goal and is organized as follows: first, we discuss data integrity and identify fault classes under the PACS operational environment; then we describe the auditing and accounting schemes developed for error detection and analyze the operational data collected. Finally, we outline plans for future research.
NASA Technical Reports Server (NTRS)
Trivedi, K. S.; Geist, R. M.
1981-01-01
The CARE 3 reliability model for aircraft avionics and control systems is described by utilizing a number of examples which frequently use state-of-the-art mathematical modeling techniques as a basis for their exposition. Behavioral decomposition followed by aggregation was used in an attempt to deal with reliability models with a large number of states. A comprehensive set of models of the fault-handling processes in a typical fault-tolerant system was used. These models were semi-Markov in nature, thus removing the usual restrictions of exponential holding times within the coverage model. The aggregate model is a non-homogeneous Markov chain, thus allowing the times to failure to possess Weibull-like distributions. Because of the departures from traditional models, the solution method employed is that of Kolmogorov integral equations, which are evaluated numerically.
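To make the last two points concrete: in a non-homogeneous chain the transition intensity depends on time, and a power-law hazard yields exactly a Weibull time-to-failure. The sketch below integrates the simplest one-component case and checks it against the closed form; CARE 3 itself solves far richer semi-Markov coverage models:

    import math

    def weibull_unreliability(beta, eta, t_end, dt=1e-3):
        # Integrate dP_fail/dt = h(t) * P_ok(t) with the Weibull hazard
        # h(t) = (beta/eta) * (t/eta)**(beta - 1).
        p_ok, t = 1.0, 0.0
        while t < t_end:
            h = (beta / eta) * (t / eta) ** (beta - 1) if t > 0 else 0.0
            p_ok -= h * p_ok * dt
            t += dt
        return 1.0 - p_ok

    # Agrees with the closed form 1 - exp(-(t/eta)**beta).
    print(weibull_unreliability(1.5, 1000.0, 100.0))
    print(1 - math.exp(-(100.0 / 1000.0) ** 1.5))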
Fault-tolerant quantum error detection
Linke, Norbert M.; Gutierrez, Mauricio; Landsman, Kevin A.; Figgatt, Caroline; Debnath, Shantanu; Brown, Kenneth R.; Monroe, Christopher
2017-01-01
Quantum computers will eventually reach a size at which quantum error correction becomes imperative. Quantum information can be protected from qubit imperfections and flawed control operations by encoding a single logical qubit in multiple physical qubits. This redundancy allows the extraction of error syndromes and the subsequent detection or correction of errors without destroying the logical state itself through direct measurement. We show the encoding and syndrome measurement of a fault-tolerantly prepared logical qubit via an error detection protocol on four physical qubits, represented by trapped atomic ions. This demonstrates the robustness of a logical qubit to imperfections in the very operations used to encode it. The advantage persists in the face of large added error rates and experimental calibration errors. PMID:29062889
Disjointness of Stabilizer Codes and Limitations on Fault-Tolerant Logical Gates
NASA Astrophysics Data System (ADS)
Jochym-O'Connor, Tomas; Kubica, Aleksander; Yoder, Theodore J.
2018-04-01
Stabilizer codes are among the most successful quantum error-correcting codes, yet they have important limitations on their ability to fault tolerantly compute. Here, we introduce a new quantity, the disjointness of the stabilizer code, which, roughly speaking, is the number of mostly nonoverlapping representations of any given nontrivial logical Pauli operator. The notion of disjointness proves useful in limiting transversal gates on any error-detecting stabilizer code to a finite level of the Clifford hierarchy. For code families, we can similarly restrict logical operators implemented by constant-depth circuits. For instance, we show that it is impossible, with a constant-depth but possibly geometrically nonlocal circuit, to implement a logical non-Clifford gate on the standard two-dimensional surface code.
NASA Astrophysics Data System (ADS)
Ortega, R.; Gutierrez, E.; Carciumaru, D. D.; Huesca-Perez, E.
2017-12-01
We present a method to compute the conditional and unconditional probability density function (PDF) of the finite fault distance distribution (FFDD). Two cases are described: lines and areas. The case of lines has a simple analytical solution, while in the case of areas, the geometrical probability of a fault based on the strike, dip, and fault segment vertices is obtained using the projection of spheres onto a piecewise rectangular surface. The cumulative distribution is computed by measuring the projection of a sphere of radius r in an effective area, using an algorithm that estimates the area of a circle within a rectangle. In addition, we introduce the finite fault distance metric: the distance at which the maximum stress release occurs within the fault plane, generating the peak ground motion. The appropriate ground motion prediction equations (GMPE) for PSHA can then be applied. The conditional probability of distance given magnitude is also presented using different scaling laws. A simple model with a constant distribution of the centroid at the geometrical mean is discussed; in this model, hazard is reduced at the edges because the effective size is reduced. Nowadays there is a trend of using extended-source distances in PSHA; however, it is not possible to separate the fault geometry from the GMPE. With this new approach, it is possible to add fault rupture models separating geometrical and propagation effects.
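The authors' algorithm for the area of a circle within a rectangle is not given in the abstract; purely for illustration, the same quantity can be estimated by Monte Carlo sampling of the disc, as in this sketch:

    import math
    import random

    def circle_rect_area(cx, cy, r, x0, y0, x1, y1, n=200000):
        # Fraction of uniformly drawn disc points falling inside the
        # rectangle, times the disc area.
        hits = 0
        for _ in range(n):
            theta = 2.0 * math.pi * random.random()
            rho = r * math.sqrt(random.random())  # uniform over the disc
            x, y = cx + rho * math.cos(theta), cy + rho * math.sin(theta)
            if x0 <= x <= x1 and y0 <= y <= y1:
                hits += 1
        return math.pi * r * r * hits / n

The paper's projection-based construction presumably evaluates this geometry deterministically; the sampling version is only meant to pin down the quantity being computed.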
2016-10-13
Contact: enielse@sandia.gov and a.morello@unsw.edu.au. Keywords: quantum computing, silicon, tomography. [Recoverable abstract fragments:] State-of-the-art qubit systems are reaching the gate fidelities required for scalable quantum computation architectures. Further improvements in ... and addressed when the qubit is used within a fault-tolerant quantum computation scheme.
Noise thresholds for optical quantum computers.
Dawson, Christopher M; Haselgrove, Henry L; Nielsen, Michael A
2006-01-20
In this Letter we numerically investigate the fault-tolerant threshold for optical cluster-state quantum computing. We allow both photon loss noise and depolarizing noise (as a general proxy for all local noise), and obtain a threshold region of allowed pairs of values for the two types of noise. Roughly speaking, our results show that scalable optical quantum computing is possible for photon loss probabilities below 3×10^-3, and for depolarization probabilities below 10^-4.
NASA Technical Reports Server (NTRS)
Szatkowski, G. P.
1983-01-01
A computer simulation system has been developed for the Space Shuttle's advanced Centaur liquid fuel booster rocket, in order to conduct systems safety verification and flight operations training. This simulation utility is designed to analyze functional system behavior by integrating control avionics with mechanical and fluid elements, and is able to emulate any system operation, from simple relay logic to complex VLSI components, with wire-by-wire detail. A novel graphics data entry system offers a pseudo-wire wrap data base that can be easily updated. Visual subsystem operations can be selected and displayed in color on a six-monitor graphics processor. System timing and fault verification analyses are conducted by injecting component fault modes and min/max timing delays, and then observing system operation through a red line monitor.
Distributed controller clustering in software defined networks
Gani, Abdullah; Akhunzada, Adnan; Talebian, Hamid; Choo, Kim-Kwang Raymond
2017-01-01
Software Defined Networking (SDN) is an emerging promising paradigm for network management because of its centralized network intelligence. However, the centralized control architecture of software-defined networks (SDNs) brings novel challenges of reliability, scalability, fault tolerance and interoperability. In this paper, we propose a novel clustered distributed controller architecture in a real setting of SDNs. The distributed cluster implementation comprises multiple popular SDN controllers. The proposed mechanism is evaluated using a real-world network topology running on top of an emulated SDN environment. The results show that the proposed distributed controller clustering mechanism is able to significantly reduce the average latency from 8.1% to 1.6% and the packet loss from 5.22% to 4.15%, compared to a distributed controller without clustering running on HP Virtual Application Network (VAN) SDN and Open Network Operating System (ONOS) controllers, respectively. Moreover, the proposed method shows reasonable CPU utilization. Furthermore, the proposed mechanism makes it possible to handle unexpected load fluctuations while maintaining continuous network operation, even when there is a controller failure. The paper is a potential contribution stepping towards addressing the issues of reliability, scalability, fault tolerance, and interoperability. PMID:28384312
Thermodynamic method for generating random stress distributions on an earthquake fault
Barall, Michael; Harris, Ruth A.
2012-01-01
This report presents a new method for generating random stress distributions on an earthquake fault, suitable for use as initial conditions in a dynamic rupture simulation. The method employs concepts from thermodynamics and statistical mechanics. A pattern of fault slip is considered to be analogous to a micro-state of a thermodynamic system. The energy of the micro-state is taken to be the elastic energy stored in the surrounding medium. Then, the Boltzmann distribution gives the probability of a given pattern of fault slip and stress. We show how to decompose the system into independent degrees of freedom, which makes it computationally feasible to select a random state. However, due to the equipartition theorem, straightforward application of the Boltzmann distribution leads to a divergence which predicts infinite stress. To avoid equipartition, we show that the finite strength of the fault acts to restrict the possible states of the system. By analyzing a set of earthquake scaling relations, we derive a new formula for the expected power spectral density of the stress distribution, which allows us to construct a computer algorithm free of infinities. We then present a new technique for controlling the extent of the rupture by generating a random stress distribution thousands of times larger than the fault surface, and selecting a portion which, by chance, has a positive stress perturbation of the desired size. Finally, we present a new two-stage nucleation method that combines a small zone of forced rupture with a larger zone of reduced fracture energy.
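The report's PSD formula is derived from earthquake scaling relations and is not reproduced in the abstract. The sketch below shows only the generic spectral-synthesis step such a formula feeds into: filtering white Gaussian noise in the Fourier domain so the realization has a prescribed power spectrum (the k^-2 spectrum here is a placeholder, not the paper's result):

    import numpy as np

    def random_stress(n, dx, psd, seed=None):
        # Shape white Gaussian noise so its expected power spectrum
        # follows psd(k); returns a real 1-D stress perturbation.
        rng = np.random.default_rng(seed)
        k = np.fft.rfftfreq(n, d=dx)
        amp = np.sqrt(psd(k))
        amp[0] = 0.0  # zero-mean perturbation
        spec = amp * (rng.standard_normal(k.size)
                      + 1j * rng.standard_normal(k.size))
        return np.fft.irfft(spec, n=n)

    profile = random_stress(4096, 10.0,
                            lambda k: np.where(k > 0, k, np.inf) ** -2.0)

The report's rupture-extent technique then amounts to generating a field much larger than the fault and keeping a sub-window whose random perturbation happens to be positive over the desired area.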
NASA Technical Reports Server (NTRS)
Migneault, G. E.
1979-01-01
Emulation techniques are proposed as a solution to a difficulty arising in the analysis of the reliability of highly reliable computer systems for future commercial aircraft. The difficulty, viz., the lack of credible precision in reliability estimates obtained by analytical modeling techniques, is established. The difficulty is shown to be an unavoidable consequence of: (1) a high reliability requirement so demanding as to make system evaluation by use testing infeasible; (2) a complex system design technique, fault tolerance; (3) system reliability dominated by errors due to flaws in the system definition; and (4) elaborate analytical modeling techniques whose precise outputs are quite sensitive to errors of approximation in their input data. The technique of emulation is described, indicating how its input is a simple description of the logical structure of a system and its output is the consequent behavior. The use of emulation techniques for pseudo-testing systems to evaluate bounds on the parameter values needed for the analytical techniques is discussed.
Network-Physics(NP) Bec DIGITAL(#)-VULNERABILITY Versus Fault-Tolerant Analog
NASA Astrophysics Data System (ADS)
Alexander, G. K.; Hathaway, M.; Schmidt, H. E.; Siegel, E.
2011-03-01
Siegel[AMS Joint Mtg.(2002)-Abs.973-60-124] digits logarithmic-(Newcomb(1881)-Weyl(1914; 1916)-Benford(1938)-"NeWBe"/"OLDbe")-law algebraic-inversion to ONLY BEQS BEC:Quanta/Bosons= digits: Synthesis reveals EMP-like SEVERE VULNERABILITY of ONLY DIGITAL-networks(VS. FAULT-TOLERANT ANALOG INvulnerability) via Barabasi "Network-Physics" relative-``statics''(VS.dynamics-[Willinger-Alderson-Doyle(Not.AMS(5/09)]-]critique); (so called)"Quantum-computing is simple-arithmetic(sans division/ factorization); algorithmic-complexities: INtractibility/ UNdecidability/ INefficiency/NONcomputability / HARDNESS(so MIScalled) "noise"-induced-phase-transitions(NITS) ACCELERATION: Cook-Levin theorem Reducibility is Renormalization-(Semi)-Group fixed-points; number-Randomness DEFINITION via WHAT? Query(VS. Goldreich[Not.AMS(02)] How? mea culpa)can ONLY be MBCS "hot-plasma" versus digit-clumping NON-random BEC; Modular-arithmetic Congruences= Signal X Noise PRODUCTS = clock-model; NON-Shor[Physica A,341,586(04)] BEC logarithmic-law inversion factorization:Watkins number-thy. U stat.-phys.); P=/=NP TRIVIAL Proof: Euclid!!! [(So Miscalled) computational-complexity J-O obviation via geometry.
Quantum computing with Majorana fermion codes
NASA Astrophysics Data System (ADS)
Litinski, Daniel; von Oppen, Felix
2018-05-01
We establish a unified framework for Majorana-based fault-tolerant quantum computation with Majorana surface codes and Majorana color codes. All logical Clifford gates are implemented with zero-time overhead. This is done by introducing a protocol for Pauli product measurements with tetrons and hexons which only requires local 4-Majorana parity measurements. An analogous protocol is used in the fault-tolerant setting, where tetrons and hexons are replaced by Majorana surface code patches, and parity measurements are replaced by lattice surgery, still only requiring local few-Majorana parity measurements. To this end, we discuss twist defects in Majorana fermion surface codes and adapt the technique of twist-based lattice surgery to fermionic codes. Moreover, we propose a family of codes that we refer to as Majorana color codes, which are obtained by concatenating Majorana surface codes with small Majorana fermion codes. Majorana surface and color codes can be used to decrease the space overhead and stabilizer weight compared to their bosonic counterparts.
Hierarchical specification of the SIFT fault tolerant flight control system
NASA Technical Reports Server (NTRS)
Melliar-Smith, P. M.; Schwartz, R. L.
1981-01-01
The specification and mechanical verification of the Software Implemented Fault Tolerance (SIFT) flight control system is described. The methodology employed in the verification effort is discussed, and a description of the hierarchical models of the SIFT system is given. To meet NASA's objective for the reliability of safety-critical flight control systems, the SIFT computer must achieve a reliability well beyond the levels at which reliability can actually be measured. The methodology employed to demonstrate rigorously that the SIFT computer meets its reliability requirements is described. The hierarchy of design specifications, from very abstract descriptions of system function down to the actual implementation, is explained. The most abstract design specifications can be used to verify that the system functions correctly and with the desired reliability, since almost all details of the realization are abstracted out. A succession of lower-level models refines these specifications to the level of the actual implementation, and can be used to demonstrate that the implementation has the properties claimed of the abstract design specifications.
Software fault tolerance for real-time avionics systems
NASA Technical Reports Server (NTRS)
Anderson, T.; Knight, J. C.
1983-01-01
Avionics systems have very high reliability requirements and are therefore prime candidates for the inclusion of fault tolerance techniques. In order to provide tolerance to software faults, some form of state restoration is usually advocated as a means of recovery. State restoration can be very expensive for systems which utilize concurrent processes. The concurrency present in most avionics systems and the further difficulties introduced by timing constraints imply that providing tolerance for software faults may be inordinately expensive or complex. A straightforward pragmatic approach to software fault tolerance which is believed to be applicable to many real-time avionics systems is proposed. A classification system for software errors is presented together with approaches to recovery and continued service for each error type.
Switch failure diagnosis based on inductor current observation for boost converters
NASA Astrophysics Data System (ADS)
Jamshidpour, E.; Poure, P.; Saadate, S.
2016-09-01
With the growing number of applications using DC-DC power converters, the improvement of their reliability is the subject of an increasing number of studies. Especially in safety-critical applications, designing fault-tolerant converters is becoming mandatory. In this paper, a switch fault-tolerant DC-DC converter is studied. First, some of the fastest Fault Detection Algorithms (FDAs) are recalled. Then, a fast switch FDA is proposed which can detect both types of failure; an open-circuit fault as well as a short-circuit fault can be detected in less than one switching period. Second, a fault-tolerant converter which can be reconfigured under those types of fault is introduced. Hardware-In-the-Loop (HIL) results and experimental validations are given to verify the validity of the proposed switch fault-tolerant approach in the case of a single-switch DC-DC boost converter with one redundant switch.
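The paper's specific FDA is not reproduced in the abstract; the sketch below captures the underlying observation principle for a boost converter: with the switch on, the inductor current slope should be about +Vin/L, and with it off, about (Vin - Vout)/L < 0, so a sustained slope of the wrong sign flags a fault. Thresholds and confirmation counts are illustrative:

    def make_switch_monitor(confirm=3, eps=0.0):
        # Flag a fault when the inductor current slope disagrees with
        # the commanded switch state for 'confirm' consecutive samples.
        count = {"open": 0, "short": 0}

        def observe(cmd_on, di_dt):
            count["open"] = count["open"] + 1 if cmd_on and di_dt <= eps else 0
            count["short"] = count["short"] + 1 if (not cmd_on and di_dt >= -eps) else 0
            if count["open"] >= confirm:
                return "open-circuit fault"   # current falls while ON
            if count["short"] >= confirm:
                return "short-circuit fault"  # current rises while OFF
            return None

        return observe

Sampling several times within a single switching period is what lets such a rule meet the sub-period detection time claimed above, after which the redundant switch is engaged.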
Design and Analysis of Linear Fault-Tolerant Permanent-Magnet Vernier Machines
Xu, Liang; Liu, Guohai; Du, Yi; Liu, Hu
2014-01-01
This paper proposes a new linear fault-tolerant permanent-magnet (PM) vernier (LFTPMV) machine, which can offer high thrust by using the magnetic gear effect. Both PMs and windings of the proposed machine are on short mover, while the long stator is only manufactured from iron. Hence, the proposed machine is very suitable for long stroke system applications. The key of this machine is that the magnetizer splits the two movers with modular and complementary structures. Hence, the proposed machine offers improved symmetrical and sinusoidal back electromotive force waveform and reduced detent force. Furthermore, owing to the complementary structure, the proposed machine possesses favorable fault-tolerant capability, namely, independent phases. In particular, differing from the existing fault-tolerant machines, the proposed machine offers fault tolerance without sacrificing thrust density. This is because neither fault-tolerant teeth nor the flux-barriers are adopted. The electromagnetic characteristics of the proposed machine are analyzed using the time-stepping finite-element method, which verifies the effectiveness of the theoretical analysis. PMID:24982959
Design and analysis of linear fault-tolerant permanent-magnet vernier machines.
Xu, Liang; Ji, Jinghua; Liu, Guohai; Du, Yi; Liu, Hu
2014-01-01
This paper proposes a new linear fault-tolerant permanent-magnet (PM) vernier (LFTPMV) machine, which can offer high thrust by using the magnetic gear effect. Both PMs and windings of the proposed machine are on short mover, while the long stator is only manufactured from iron. Hence, the proposed machine is very suitable for long stroke system applications. The key of this machine is that the magnetizer splits the two movers with modular and complementary structures. Hence, the proposed machine offers improved symmetrical and sinusoidal back electromotive force waveform and reduced detent force. Furthermore, owing to the complementary structure, the proposed machine possesses favorable fault-tolerant capability, namely, independent phases. In particular, differing from the existing fault-tolerant machines, the proposed machine offers fault tolerance without sacrificing thrust density. This is because neither fault-tolerant teeth nor the flux-barriers are adopted. The electromagnetic characteristics of the proposed machine are analyzed using the time-stepping finite-element method, which verifies the effectiveness of the theoretical analysis.
NASA Astrophysics Data System (ADS)
Belapurkar, Rohit K.
Future aircraft engine control systems will be based on a distributed architecture, in which the sensors and actuators are connected to the Full Authority Digital Engine Control (FADEC) through an engine area network. A distributed engine control architecture will allow the implementation of advanced, active control techniques along with achieving weight reduction, improvement in performance and lower life cycle cost. The performance of a distributed engine control system is predominantly dependent on the performance of the communication network. Due to the serial data transmission policy, network-induced time delays and sampling jitter are introduced between the sensor/actuator nodes and the distributed FADEC. Communication network faults and transient node failures may result in data dropouts, which may not only degrade the control system performance but may even destabilize the engine control system. Three different architectures for a turbine engine control system based on a distributed framework are presented. A partially distributed control system for a turbo-shaft engine is designed based on the ARINC 825 communication protocol. Stability conditions and a control design methodology are developed for the proposed partially distributed turbo-shaft engine control system to guarantee the desired performance under network-induced time delay and random data loss due to transient sensor/actuator failures. A fault-tolerant control design methodology is proposed to benefit from the availability of additional system bandwidth and from the broadcast feature of the data network. It is shown that a reconfigurable fault-tolerant control design can help to reduce the performance degradation in the presence of node failures. A T-700 turbo-shaft engine model is used to validate the proposed control methodology based on both single-input and multiple-input multiple-output control design techniques.
Space station data system analysis/architecture study. Task 3: Trade studies, DR-5, volume 2
NASA Technical Reports Server (NTRS)
1985-01-01
Results of a Space Station Data System Analysis/Architecture Study for the Goddard Space Flight Center are presented. This study, which emphasized a system engineering design for a complete, end-to-end data system, was divided into six tasks: (1) Functional requirements definition; (2) Options development; (3) Trade studies; (4) System definitions; (5) Program plan; and (6) Study maintenance. The task interrelationships and documentation flow are described. Information in volume 2 is devoted to Task 3: Trade Studies. Trade studies have been carried out in the following areas: (1) software development, test and integration capability; (2) fault-tolerant computing; (3) space-qualified computers; (4) distributed data base management system; (5) system integration, test and verification; (6) crew workstations; (7) mass storage; (8) command and resource management; and (9) space communications. Results are presented for each task.
The Design of a Fault-Tolerant COTS-Based Bus Architecture
NASA Technical Reports Server (NTRS)
Chau, Savio N.; Alkalai, Leon; Burt, John B.; Tai, Ann T.
1999-01-01
In this paper, we report our experiences and findings on the design of a fault-tolerant bus architecture comprised of two COTS buses, the IEEE 1394 and the I2C. This fault-tolerant bus is the backbone system bus for the avionics architecture of the X2000 program at the Jet Propulsion Laboratory. COTS buses are attractive because of the availability of low-cost commercial products. However, they are not specifically designed for highly reliable applications such as long-life deep-space missions. The X2000 design team has devised a multi-level fault tolerance approach to compensate for this shortcoming of COTS buses. First, the approach enhances the fault tolerance capabilities of the IEEE 1394 and I2C buses by adding a layer of fault handling hardware and software. Second, algorithms are developed to enable the IEEE 1394 and I2C buses to assist each other in isolating and recovering from faults. Third, the set of IEEE 1394 and I2C buses is duplicated to further enhance system reliability. The X2000 design team has paid special attention to guaranteeing that the fault tolerance provisions will not cause the bus design to deviate from the commercial standard specifications; otherwise, the economic attractiveness of using COTS would be diminished. The hardware and software design of the X2000 fault-tolerant bus is being implemented and flight hardware will be delivered to the ST4 and Europa Orbiter missions.
Digital avionics design and reliability analyzer
NASA Technical Reports Server (NTRS)
1981-01-01
The description and specifications for a digital avionics design and reliability analyzer are given. Its basic function is to provide for the simulation and emulation of the various fault-tolerant digital avionic computer designs that are developed. It has been established that hardware emulation at the gate-level will be utilized. The primary benefit of emulation to reliability analysis is the fact that it provides the capability to model a system at a very detailed level. Emulation allows the direct insertion of faults into the system, rather than waiting for actual hardware failures to occur. This allows for controlled and accelerated testing of system reaction to hardware failures. There is a trade study which leads to the decision to specify a two-machine system, including an emulation computer connected to a general-purpose computer. There is also an evaluation of potential computers to serve as the emulation computer.
Modeling the Fault Tolerant Capability of a Flight Control System: An Exercise in SCR Specification
NASA Technical Reports Server (NTRS)
Alexander, Chris; Cortellessa, Vittorio; DelGobbo, Diego; Mili, Ali; Napolitano, Marcello
2000-01-01
In life-critical and mission-critical applications, it is important to make provisions for a wide range of contingencies, by providing means for fault tolerance. In this paper, we discuss the specification of a flight control system that is fault tolerant with respect to sensor faults. Redundancy is provided by analytical relations that hold between sensor readings; depending on the conditions, this redundancy can be used to detect, identify and accommodate sensor faults.
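A minimal sketch of the analytical-redundancy idea follows: a kinematic relation between two sensor readings generates a residual, and a persistent residual flags a sensor fault. The signals, noise levels, fault size, and threshold are all invented for illustration and are not taken from the paper's flight control system.

```python
# Sketch of analytical redundancy: a kinematic relation between altitude
# and vertical-speed readings yields a residual; a persistent residual
# flags a sensor fault. Signals, noise, fault size, and threshold are
# invented for illustration.

import numpy as np

rng = np.random.default_rng(0)
dt = 0.01
t = np.arange(0.0, 10.0, dt)
true_vs = 2.0 * np.ones_like(t)                 # steady 2 m/s climb
alt = np.cumsum(true_vs) * dt + rng.normal(0, 0.005, t.size)
vs = true_vs + rng.normal(0, 0.02, t.size)
vs[500:] += 1.5                                 # inject a bias fault at t = 5 s

# Residual: measured vertical speed minus differentiated altitude,
# low-pass filtered to suppress differentiation noise.
residual = vs[1:] - np.diff(alt) / dt
smoothed = np.convolve(residual, np.ones(100) / 100, mode="valid")

threshold = 0.5
alarms = np.nonzero(np.abs(smoothed) > threshold)[0]
if alarms.size:
    print(f"sensor fault declared at t ~ {t[alarms[0] + 100]:.2f} s")
```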
Advanced techniques in reliability model representation and solution
NASA Technical Reports Server (NTRS)
Palumbo, Daniel L.; Nicol, David M.
1992-01-01
The current tendency of flight control system designs is towards increased integration of applications and increased distribution of computational elements. The reliability analysis of such systems is difficult because subsystem interactions are increasingly interdependent. Researchers at NASA Langley Research Center have been working for several years to extend the capability of Markov modeling techniques to address these problems. This effort has been focused in the areas of increased model abstraction and increased computational capability. The reliability model generator (RMG) is a software tool that uses as input a graphical object-oriented block diagram of the system. RMG uses a failure-effects algorithm to produce the reliability model from the graphical description. The ASSURE software tool is a parallel processing program that uses the semi-Markov unreliability range evaluator (SURE) solution technique and the abstract semi-Markov specification interface to the SURE tool (ASSIST) modeling language. A failure modes-effects simulation is used by ASSURE. These tools were used to analyze a significant portion of a complex flight control system. The successful combination of the power of graphical representation, automated model generation, and parallel computation leads to the conclusion that distributed fault-tolerant system architectures can now be analyzed.
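For scale, the kind of Markov unreliability computation that SURE/ASSURE-class tools automate can be shown on a toy model. The sketch below solves a three-state Markov model of a triplex voting system with an assumed channel failure rate; it stands in for, and is far simpler than, the semi-Markov models those tools actually handle.

```python
# Toy version of the Markov unreliability computation that SURE/ASSURE-class
# tools automate: a triplex voting system that needs 2 of 3 channels.
# The failure rate and mission times are assumed values.

import numpy as np
from scipy.linalg import expm

lam = 1e-4                # per-hour channel failure rate (assumed)
# States: 0 = three channels good, 1 = two good, 2 = system failed (absorbing).
Q = np.array([
    [-3 * lam, 3 * lam, 0.0],
    [0.0, -2 * lam, 2 * lam],
    [0.0, 0.0, 0.0],
])

p0 = np.array([1.0, 0.0, 0.0])
for t in (1.0, 10.0, 100.0):
    p = p0 @ expm(Q * t)  # state distribution after t hours
    print(f"t = {t:6.1f} h   P(system failed) = {p[2]:.3e}")
```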
NASA Astrophysics Data System (ADS)
Caldwell, Douglas Wyche
Commercial microcontrollers--monolithic integrated circuits containing microprocessor, memory and various peripheral functions--such as those used in industrial, automotive and military applications, present spacecraft avionics system designers with an appealing mix of higher performance and lower power together with faster system-development time and lower unit costs. However, these parts are not radiation-hardened for application in the space environment, and Single-Event Effects (SEE) caused by high-energy, ionizing radiation present a significant challenge. Mitigating these effects with techniques which require minimal additional support logic, and thereby preserve the high functional density of these devices, can allow their benefits to be realized. This dissertation uses fault tolerance to mitigate the transient errors and occasional latchups that non-hardened microcontrollers can experience in the space radiation environment. Space systems requirements and the historical use of fault-tolerant computers in spacecraft provide context. Space radiation and its effects in semiconductors define the fault environment. A reference architecture is presented which uses two or three microcontrollers with a combination of hardware and software voting techniques to mitigate SEE. A prototypical spacecraft function (an inertial measurement unit) is used to illustrate the techniques and to explore how real application requirements impact the fault-tolerance approach. Low-cost approaches which leverage features of existing commercial microcontrollers are analyzed. A high-speed serial bus is used for voting among redundant devices, and a novel wire-OR output voting scheme exploits the bidirectional controls of I/O pins. A hardware testbed and prototype software were constructed to evaluate two- and three-processor configurations. Simulated Single-Event Upsets (SEUs) were injected at high rates and the response of the system monitored. The resulting statistics were used to evaluate technical effectiveness. Fault-recovery probabilities (coverages) higher than 99.99% were experimentally demonstrated. The greater than thousand-fold reduction in observed effects provides performance comparable with the SEE tolerance of tested, rad-hard devices. Technical results were combined with cost data to assess the cost-effectiveness of the techniques. It was found that a three-processor system was only marginally more effective than a two-device system at detecting and recovering from faults, but consumed substantially more resources, suggesting that simpler configurations are generally more cost-effective.
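A minimal sketch of the 2-of-3 voting idea behind such configurations follows: a bitwise majority vote over redundant outputs, exercised by injecting simulated SEUs in the spirit of the experiments described above. The word width, upset model, and trial count are assumptions for illustration.

```python
# Sketch of the 2-of-3 voting idea behind such configurations: a bitwise
# majority vote over redundant 8-bit outputs, exercised with randomly
# injected single-event upsets. Word width, upset model, and trial count
# are illustrative assumptions.

import random

def majority3(a, b, c):
    """Bitwise 2-of-3 majority, as a hardware voter would compute it."""
    return (a & b) | (a & c) | (b & c)

random.seed(1)
trials, recovered = 100_000, 0
for _ in range(trials):
    truth = random.getrandbits(8)
    copies = [truth] * 3
    victim = random.randrange(3)                  # SEU hits one channel
    copies[victim] ^= 1 << random.randrange(8)    # flip one bit
    if majority3(*copies) == truth:
        recovered += 1
print(f"single-upset recovery rate: {recovered / trials:.4%}")
```

As the sketch shows, any single-channel upset is simply outvoted; a duplex system can detect the same upset by comparison but needs a retry or rollback to recover, which is roughly the trade the dissertation quantifies.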
14 CFR Special Federal Aviation Regulation No. 88 - Fuel Tank System Fault Tolerance Evaluation Requirements
Code of Federal Regulations, 2010-2014 editions
Special Federal Aviation Regulation No. 88 (SFAR No. 88), under Title 14, Aeronautics and Space, sets out fuel tank system fault tolerance evaluation requirements; identical index entries appear in the 2010 through 2014 CFR editions.
A second generation experiment in fault-tolerant software
NASA Technical Reports Server (NTRS)
Knight, J. C.
1986-01-01
The primary goal was to determine whether the application of fault tolerance to software increases its reliability if the cost of production is the same as for an equivalent non-fault-tolerant version derived from the same requirements specification. Software development protocols are discussed. The feasibility of adapting the technique of N-fold Modular Redundancy with majority voting to software design fault tolerance was studied.
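The N-fold Modular Redundancy scheme studied here can be sketched directly. In the sketch below, the three version_* functions are trivial stand-ins for independently developed implementations (one carries a seeded design fault); the voter accepts the majority answer and reports when no majority exists.

```python
# Sketch of N-fold Modular Redundancy applied to software: run N versions
# and accept the majority answer. The three "versions" are trivial
# stand-ins for independently developed implementations; version_c
# carries a seeded design fault.

from collections import Counter

def version_a(x): return x * x
def version_b(x): return x ** 2
def version_c(x): return x * x if x != 3 else -1    # seeded fault

def nmr_vote(versions, x):
    results = Counter(v(x) for v in versions)
    answer, votes = results.most_common(1)[0]
    if votes <= len(versions) // 2:
        raise RuntimeError("no majority: correlated failures suspected")
    return answer

for x in range(5):
    print(x, "->", nmr_vote([version_a, version_b, version_c], x))
```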
Coordinated scheduling for dynamic real-time systems
NASA Technical Reports Server (NTRS)
Natarajan, Swaminathan; Zhao, Wei
1994-01-01
In this project, we addressed issues in coordinated scheduling for dynamic real-time systems. In particular, we concentrated on the design and implementation of a new distributed real-time system called R-Shell. The design objective of R-Shell is to provide computing support for space programs that have large, complex, fault-tolerant distributed real-time applications. In R-Shell, the approach is based on the concept of scheduling agents, which reside in the application run-time environment and are customized to provide just those resource management functions which are needed by the specific application. With this approach, we avoid the need for a sophisticated OS that provides a variety of generalized functionality, while still not burdening application programmers with heavy responsibility for resource management. In this report, we discuss the R-Shell approach, summarize the achievements of the project, and describe a preliminary prototype of the R-Shell system.
Analysis and design of algorithm-based fault-tolerant systems
NASA Technical Reports Server (NTRS)
Nair, V. S. Sukumaran
1990-01-01
An important consideration in the design of high performance multiprocessor systems is to ensure the correctness of the results computed in the presence of transient and intermittent failures. Concurrent error detection and correction have been applied to such systems in order to achieve reliability. Algorithm Based Fault Tolerance (ABFT) was suggested as a cost-effective concurrent error detection scheme. The research was motivated by the complexity involved in the analysis and design of ABFT systems. To that end, a matrix-based model was developed and, based on that, algorithms for both the design and analysis of ABFT systems are formulated. These algorithms are less complex than the existing ones. In order to reduce the complexity further, a hierarchical approach is developed for the analysis of large systems.
The Dangers of Failure Masking in Fault-Tolerant Software: Aspects of a Recent In-Flight Upset Event
NASA Technical Reports Server (NTRS)
Johnson, C. W.; Holloway, C. M.
2007-01-01
On 1 August 2005, a Boeing Company 777-200 aircraft, operating on an international passenger flight from Australia to Malaysia, was involved in a significant upset event while flying on autopilot. The Australian Transport Safety Bureau's investigation into the event discovered that an anomaly existed in the component software hierarchy that allowed inputs from a known faulty accelerometer to be processed by the air data inertial reference unit (ADIRU) and used by the primary flight computer, autopilot, and other aircraft systems. This anomaly had existed in the original ADIRU software, and had not been detected in the testing and certification process for the unit. This paper describes the software aspects of the incident in detail, and suggests possible implications concerning complex, safety-critical, fault-tolerant software.
A hierarchical approach to reliability modeling of fault-tolerant systems. M.S. Thesis
NASA Technical Reports Server (NTRS)
Gossman, W. E.
1986-01-01
A methodology for performing fault-tolerant system reliability analysis is presented. The method decomposes a system into its subsystems, evaluates event rates derived from each subsystem's conditional state probability vector, and incorporates those results into a hierarchical Markov model of the system. This is done in a manner that addresses the failure-sequence dependence associated with the system's redundancy management strategy. The method is derived for application to a specific system definition. Results are presented that compare the hierarchical model's unreliability prediction to that of a more complicated standard Markov model of the system. The results for the example given indicate that the hierarchical method predicts system unreliability to a desirable level of accuracy while achieving significant computational savings relative to a component-level Markov model of the system.
Distributed System Design Checklist
NASA Technical Reports Server (NTRS)
Hall, Brendan; Driscoll, Kevin
2014-01-01
This report describes a design checklist targeted to fault-tolerant distributed electronic systems. Many of the questions and discussions in this checklist may be generally applicable to the development of any safety-critical system. However, the primary focus of this report covers the issues relating to distributed electronic system design. The questions that comprise this design checklist were created with the intent to stimulate system designers' thought processes in a way that hopefully helps them to establish a broader perspective from which they can assess the system's dependability and fault-tolerance mechanisms. While best effort was expended to make this checklist as comprehensive as possible, it is not (and cannot be) complete. Instead, we expect that this list of questions and the associated rationale for the questions will continue to evolve as lessons are learned and further knowledge is established. In this regard, it is our intent to post the questions of this checklist on a suitable public web-forum, such as the NASA DASHLink AFCS repository. From there, we hope that it can be updated, extended, and maintained after our initial research has been completed.
Real-time inversions for finite fault slip models and rupture geometry based on high-rate GPS data
Minson, Sarah E.; Murray, Jessica R.; Langbein, John O.; Gomberg, Joan S.
2015-01-01
We present an inversion strategy capable of using real-time high-rate GPS data to simultaneously solve for a distributed slip model and fault geometry in real time as a rupture unfolds. We employ Bayesian inference to find the optimal fault geometry and the distribution of possible slip models for that geometry using a simple analytical solution. By adopting an analytical Bayesian approach, we can solve this complex inversion problem (including calculating the uncertainties on our results) in real time. Furthermore, since the joint inversion for distributed slip and fault geometry can be computed in real time, the time required to obtain a source model of the earthquake does not depend on the computational cost. Instead, the time required is controlled by the duration of the rupture and the time required for information to propagate from the source to the receivers. We apply our modeling approach, called Bayesian Evidence-based Fault Orientation and Real-time Earthquake Slip, to the 2011 Tohoku-oki earthquake, 2003 Tokachi-oki earthquake, and a simulated Hayward fault earthquake. In all three cases, the inversion recovers the magnitude, spatial distribution of slip, and fault geometry in real time. Since our inversion relies on static offsets estimated from real-time high-rate GPS data, we also present performance tests of various approaches to estimating quasi-static offsets in real time. We find that the raw high-rate time series are the best data to use for determining the moment magnitude of the event, but slightly smoothing the raw time series helps stabilize the inversion for fault geometry.
Fault-tolerant dynamic task graph scheduling
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kurt, Mehmet C.; Krishnamoorthy, Sriram; Agrawal, Kunal
2014-11-16
In this paper, we present an approach to fault tolerant execution of dynamic task graphs scheduled using work stealing. In particular, we focus on selective and localized recovery of tasks in the presence of soft faults. We elicit from the user the basic task graph structure in terms of successor and predecessor relationships. The work stealing-based algorithm to schedule such a task graph is augmented to enable recovery when the data and meta-data associated with a task get corrupted. We use this redundancy, and the knowledge of the task graph structure, to selectively recover from faults with low space and time overheads. We show that the fault tolerant design retains the essential properties of the underlying work stealing-based task scheduling algorithm, and that the fault tolerant execution is asymptotically optimal when task re-execution is taken into account. Experimental evaluation demonstrates the low cost of recovery under various fault scenarios.
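A toy version of the selective-recovery idea, assuming (as the paper does) that predecessor/successor structure is known: when one task's output is detected as corrupted, only that task is re-executed from its predecessors' retained outputs. The task bodies, corruption check, and schedule below are invented and much simpler than a work-stealing runtime.

```python
# Toy version of selective recovery in a task graph with known
# predecessor/successor structure: when a task's output is detected as
# corrupted, only that task is re-executed from its predecessors'
# retained outputs. Tasks, schedule, and corruption check are invented.

GRAPH = {"t1": [], "t2": [], "t3": ["t1", "t2"], "t4": ["t3"]}
WORK = {
    "t1": lambda ins: 1,
    "t2": lambda ins: 2,
    "t3": lambda ins: sum(ins),
    "t4": lambda ins: 10 * ins[0],
}

def run(corrupt=None):
    results = {}
    for task in ("t1", "t2", "t3", "t4"):     # a valid schedule order
        inputs = [results[p] for p in GRAPH[task]]
        results[task] = WORK[task](inputs)
        if task == corrupt:
            results[task] = None              # simulate a soft fault
        if results[task] is None:             # detected: recover locally
            print(f"re-executing {task} from predecessors {GRAPH[task]}")
            results[task] = WORK[task](inputs)
    return results

print(run(corrupt="t3"))                      # only t3 is re-executed
```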
Spacecraft fault tolerance: The Magellan experience
NASA Technical Reports Server (NTRS)
Kasuda, Rick; Packard, Donna Sexton
1993-01-01
Interplanetary and earth orbiting missions are now imposing unique fault tolerant requirements upon spacecraft design. Mission success is the prime motivator for building spacecraft with fault tolerant systems. The Magellan spacecraft had many such requirements imposed upon its design. Magellan met these requirements by building redundancy into all the major subsystem components and designing the onboard hardware and software with the capability to detect a fault, isolate it to a component, and issue commands to achieve a back-up configuration. This discussion is limited to fault protection, which is the autonomous capability to respond to a fault. The Magellan fault protection design is discussed, as well as the developmental and flight experiences and a summary of the lessons learned.
Control of large flexible space structures
NASA Technical Reports Server (NTRS)
Vandervelde, W. E.
1986-01-01
Progress in robust design of generalized parity relations, design of failure-sensitive observers using the geometric system theory of Wonham, computational techniques for evaluating the performance of control systems with fault tolerance and redundancy management features, and the design and evaluation of control systems for structures having nonlinear joints is described.
Performance and Fault-Tolerance of Neural Networks for Optimization
1991-06-01
An optimized implementation of a fault-tolerant clock synchronization circuit
NASA Technical Reports Server (NTRS)
Torres-Pomales, Wilfredo
1995-01-01
A fault-tolerant clock synchronization circuit was designed and tested. A comparison to a previous design and the procedure followed to achieve the current optimization are included. The report also includes a description of the system and the results of tests performed to study the synchronization and fault-tolerant characteristics of the implementation.
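The report concerns a hardware implementation; as a software illustration of the kind of convergence step such circuits realize, the sketch below applies the classic fault-tolerant midpoint rule: discard the f largest and f smallest readings and adjust to the midpoint of the rest, tolerating up to f arbitrarily faulty clocks when n >= 3f + 1. The clock values are made up.

```python
# Software illustration of the convergence step such circuits realize in
# hardware: the fault-tolerant midpoint rule discards the f largest and
# f smallest readings and adjusts to the midpoint of the rest, tolerating
# up to f arbitrarily faulty clocks when n >= 3f + 1. Values are made up.

def ft_midpoint(readings, f):
    """Fault-tolerant midpoint of clock readings with up to f faults."""
    assert len(readings) >= 3 * f + 1, "need n >= 3f + 1 clocks"
    trimmed = sorted(readings)[f:len(readings) - f]
    return (trimmed[0] + trimmed[-1]) / 2

# Four clocks, one wildly faulty; f = 1 is tolerated.
clocks = [100.2, 100.0, 99.9, 512.0]
print("corrected time:", ft_midpoint(clocks, f=1))   # -> 100.1
```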
An Integrated Fault Tolerant Robotic Controller System for High Reliability and Safety
NASA Technical Reports Server (NTRS)
Marzwell, Neville I.; Tso, Kam S.; Hecht, Myron
1994-01-01
This paper describes the concepts and features of a fault-tolerant intelligent robotic control system being developed for applications that require high dependability (reliability, availability, and safety). The system consists of two major elements: a fault-tolerant controller and an operator workstation. The fault-tolerant controller uses a strategy which allows for detection of, and recovery from, hardware, operating system, and application software failures. The fault-tolerant controller can be used by itself in a wide variety of applications in industry, process control, and communications. The controller in combination with the operator workstation can be applied to robotic applications such as spaceborne extravehicular activities, hazardous materials handling, inspection and maintenance of high-value items (e.g., space vehicles, reactor internals, or aircraft), medicine, and other tasks where a robot system failure poses a significant risk to life or property.
Reliability of Fault Tolerant Control Systems. Part 1
NASA Technical Reports Server (NTRS)
Wu, N. Eva
2001-01-01
This paper reports Part I of a two-part effort intended to delineate the relationship between reliability and fault-tolerant control in a quantitative manner. Reliability analysis of fault-tolerant control systems is performed using Markov models. Reliability properties peculiar to fault-tolerant control systems are emphasized; in particular, coverage of failures through redundancy management can be severely limited. It is shown that in the early life of a system composed of highly reliable subsystems, the reliability of the overall system is affine with respect to coverage, and inadequate coverage induces dominant single point failures. The utility of some existing software tools for assessing the reliability of fault-tolerant control systems is also discussed. Coverage modeling is attempted in Part II in a way that captures its dependence on the control performance and on the diagnostic resolution.
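The affine-in-coverage behavior is easy to reproduce on a toy duplex model: early-life unreliability is dominated by the uncovered single-point failure term 2*lam*(1-c)*t. The rate, mission time, and coverage values in the sketch below are assumptions, not the paper's case study.

```python
# Toy duplex model showing the affine-in-coverage behavior: early-life
# unreliability is dominated by the uncovered single-point failure term
# 2*lam*(1-c)*t. The rate, mission time, and coverage values are assumed.

import numpy as np
from scipy.linalg import expm

lam, t = 1e-4, 10.0       # per-hour channel rate and mission time (assumed)
for c in (1.0, 0.999, 0.99, 0.9):
    # States: both good / one good (covered failure) / system failed.
    Q = np.array([
        [-2 * lam, 2 * lam * c, 2 * lam * (1 - c)],
        [0.0, -lam, lam],
        [0.0, 0.0, 0.0],
    ])
    unrel = (np.array([1.0, 0.0, 0.0]) @ expm(Q * t))[2]
    print(f"coverage = {c:6.4f}   unreliability = {unrel:.3e}   "
          f"single-point term = {2 * lam * (1 - c) * t:.3e}")
```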
Advanced Development for Space Robotics With Emphasis on Fault Tolerance Technology
NASA Technical Reports Server (NTRS)
Tesar, Delbert
1997-01-01
This report describes work on developing fault-tolerant redundant robotic architectures and adaptive control strategies for robotic manipulator systems which can dynamically accommodate drastic robot manipulator mechanism, sensor, or control failures and maintain stable end-point trajectory control with minimum disturbance. Kinematic designs of redundant, modular, reconfigurable arms for fault tolerance were pursued at a fundamental level. The approach developed robotic testbeds to evaluate the disturbance responses of fault-tolerant concepts in robotic mechanisms and controllers. The development was implemented in various fault-tolerant mechanism testbeds, including duality in the joint servo motor modules, parallel and serial structural architectures, and dual arms. All include real-time adaptive controller technologies that react to mechanism or controller disturbances (failures) and perform real-time reconfiguration so that task operations can continue. The developments fall into three main areas: hardware, software, and theoretical.
Implementation of Virtualization Oriented Architecture: A Healthcare Industry Case Study
NASA Astrophysics Data System (ADS)
Rao, G. Subrahmanya Vrk; Parthasarathi, Jinka; Karthik, Sundararaman; Rao, Gvn Appa; Ganesan, Suresh
This paper presents a Virtualization Oriented Architecture (VOA) and an implementation of VOA for Hridaya, a telemedicine initiative. A Hadoop compute cloud was established at our labs, jobs that require massive computing capability, such as ECG signal analysis, were submitted, and the resulting study is presented in this paper. VOA takes advantage of inexpensive community PCs and provides added advantages such as fault tolerance, scalability, performance, and high availability.
Simulated fault injection - A methodology to evaluate fault tolerant microprocessor architectures
NASA Technical Reports Server (NTRS)
Choi, Gwan S.; Iyer, Ravishankar K.; Carreno, Victor A.
1990-01-01
A simulation-based fault-injection method for validating fault-tolerant microprocessor architectures is described. The approach uses mixed-mode simulation (electrical/logic analysis), and injects transient errors in run-time to assess the resulting fault impact. As an example, a fault-tolerant architecture which models the digital aspects of a dual-channel real-time jet-engine controller is used. The level of effectiveness of the dual configuration with respect to single and multiple transients is measured. The results indicate 100 percent coverage of single transients. Approximately 12 percent of the multiple transients affect both channels; none result in controller failure since two additional levels of redundancy exist.
A cascaded Schwarz converter for high frequency power distribution
NASA Technical Reports Server (NTRS)
Ray, Biswajit; Stuart, Thomas A.
1988-01-01
It is shown that two Schwarz converters in cascade provide a very reliable 20-kHz source that features zero current commutation, constant frequency, and fault-tolerant operation, meeting requirements for spacecraft applications. A steady-state analysis of the converter is presented, and equations for the steady-state performance are derived. Fault-current limiting is discussed. Experimental results are presented for a 900-W version, which has been successfully tested under no-load, full-load, and short-circuit conditions.
Technologies for unattended network operations
NASA Technical Reports Server (NTRS)
Jaworski, Allan; Odubiyi, Jide; Holdridge, Mark; Zuzek, John
1991-01-01
The necessary network management functions for a telecommunications, navigation and information management (TNIM) system in the framework of an extension of the ISO model for communications network management are described. Various technologies that could substantially reduce the need for TNIM network management, automate manpower intensive functions, and deal with synchronization and control at interplanetary distances are presented. Specific technologies addressed include the use of the ISO Common Management Interface Protocol, distributed artificial intelligence for network synchronization and fault management, and fault-tolerant systems engineering.
Study on fault-tolerant processors for advanced launch system
NASA Technical Reports Server (NTRS)
Shin, Kang G.; Liu, Jyh-Charn
1990-01-01
Issues related to the reliability of a redundant system with large main memory are addressed. The Fault-Tolerant Processor (FTP) for the Advanced Launch System (ALS) is used as a basis for the presentation. When the system is free of latent faults, the probability of system crash due to multiple channel faults is shown to be insignificant even when voting on the outputs of computing channels is infrequent. Using channel error maskers (CEMs) is shown to improve reliability more effectively than increasing redundancy or the number of channels for applications with long mission times. Even without using a voter, most memory errors can be immediately corrected by those CEMs implemented with conventional coding techniques. In addition to their ability to enhance system reliability, CEMs (with very low hardware overhead) can be used to dramatically reduce not only the need for memory realignment, but also the time required to realign channel memories in the rare case that such a need arises. Using CEMs, two different schemes were developed to solve the memory realignment problem. In both schemes, most errors are corrected by CEMs, and the remaining errors are masked by a voter.
Multi-version software reliability through fault-avoidance and fault-tolerance
NASA Technical Reports Server (NTRS)
Vouk, Mladen A.; Mcallister, David F.
1989-01-01
A number of experimental and theoretical issues associated with the practical use of multi-version software to provide run-time tolerance to software faults were investigated. A specialized tool was developed and evaluated for measuring testing coverage for a variety of metrics. The tool was used to collect information on the relationships between software faults and the coverage provided by the testing process as measured by different metrics (including data flow metrics). Considerable correlation was found between coverage provided by some higher metrics and the elimination of faults in the code. Back-to-back testing continued to serve as an efficient mechanism for removal of uncorrelated faults and common-cause faults of variable span. Work on software reliability estimation methods also continued, based on non-random sampling and on the relationship between software reliability and the code coverage provided through testing. New fault tolerance models were formulated. Simulation studies of the Acceptance Voting and Multi-stage Voting algorithms were finished, and it was found that these two schemes for software fault tolerance are superior in many respects to some commonly used schemes. Particularly encouraging are the safety properties of the Acceptance Voting scheme.
Computer program determines exact two-sided tolerance limits for normal distributions
NASA Technical Reports Server (NTRS)
Friedman, H. A.; Webb, S. R.
1968-01-01
Computer program determines, by numerical integration, exact statistical two-sided tolerance limits such that the proportion of the population between the limits is at least a specified value. The program is limited to situations in which the underlying probability distribution for the population sampled is the normal distribution with unknown mean and variance.
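The report's program computes exact limits by numerical integration; as a hedged modern stand-in, the sketch below uses the well-known Howe approximation to the two-sided tolerance factor k, so that xbar +/- k*s covers at least a proportion p of a normal population at the stated confidence. It approximates, rather than reproduces, the exact method of the report.

```python
# The program computes exact limits by numerical integration; this sketch
# instead uses the Howe approximation to the two-sided tolerance factor k,
# so that xbar +/- k*s covers at least proportion p of a normal population
# at the stated confidence. An approximation, not the report's exact method.

import numpy as np
from scipy import stats

def howe_k(n, p=0.90, conf=0.95):
    """Approximate two-sided normal tolerance factor (Howe, 1969)."""
    z = stats.norm.ppf((1 + p) / 2)
    nu = n - 1
    chi2 = stats.chi2.ppf(1 - conf, nu)     # lower-tail chi-square quantile
    return z * np.sqrt(nu * (1 + 1 / n) / chi2)

rng = np.random.default_rng(0)
sample = rng.normal(50.0, 2.0, size=30)
k = howe_k(len(sample))
m, s = sample.mean(), sample.std(ddof=1)
print(f"k = {k:.3f}; tolerance interval = ({m - k * s:.2f}, {m + k * s:.2f})")
```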
Modeling and Verification of Dependable Electronic Power System Architecture
NASA Astrophysics Data System (ADS)
Yuan, Ling; Fan, Ping; Zhang, Xiao-fang
The electronic power system can be viewed as a system composed of a set of concurrently interacting subsystems to generate, transmit, and distribute electric power. The complex interaction among subsystems makes the design of an electronic power system complicated. Furthermore, in order to guarantee safe generation and distribution of electric power, fault tolerant mechanisms are incorporated in the system design to satisfy high reliability requirements, which makes the design of such systems more complicated still. We propose a dependable electronic power system architecture, which provides a generic framework to guide the development of electronic power systems and ease development complexity. In order to provide common idioms and patterns to system designers, we formally model the electronic power system architecture using the PVS formal language. Based on the PVS model of this system architecture, we formally verify the fault tolerant properties of the system architecture using the PVS theorem prover, which can guarantee that the system architecture satisfies the high reliability requirements.
NASA Astrophysics Data System (ADS)
Namegaya, Y.; Satake, K.
2012-12-01
We re-examined the magnitude of the AD 869 Jogan earthquake by comparing the inland limit of tsunami deposit and computed inundation distance for various fault models. The 869 tsunami deposit is distributed 3-4 km inland from the estimated past shorelines in Ishinomaki and Sendai plains (Shishikura et al., 2007, Annual Report on Active Fault and Paleoearthquake Researches; Sawai et al., 2007 ibid). In the previous studies (Satake et al., 2008 and Namegaya et al. 2010, ibid), we assumed 14 fault models of the Jogan earthquake including outer-rise normal fault, tsunami earthquake, interplate earthquakes, and an active fault in Sendai bay. The computed inundation area from an interplate earthquake with Mw of 8.4 (length: 200 km, width: 100 km, slip 7 m) covers the distribution of tsunami deposits in Ishinomaki and Sendai plains. However, the previous studies yielded the minimum magnitude, because we assumed that the inland limit of tsunami deposits and the computed inundation limit were the same. A post-2011 field survey indicates that the 2011 tsunami inundation distance was about 1.6 times the inland limit of tsunami deposits (e.g. Goto et al., 2011, Marine Geology). In this study, we computed tsunami inundation areas from interplate earthquakes with different magnitude, fault length, and slip amount. The moment magnitude ranges from 8.0 to 8.7, the fault length ranges from 100 to 400 km, and the slip ranges from 3 to 9 m. The fault width is fixed at 100 km. The distance ratios of computed inundation to the inland limit of tsunami deposit (Inundation to Deposit Ratio or IDR) were calculated along 8 transects on Sendai and Ishinomaki plains. The results show that IDR increases with magnitude, up to Mw=8.4, when IDR becomes one, or the computed inundation is almost the same as the inland limit of tsunami deposit. IDR increases for a larger magnitude, but at a much smaller rate. This confirms that the magnitude of the 869 Jogan earthquake was at least 8.4, but it could be larger. When we compute the tsunami inundation from the 2011 Tohoku earthquake model (Satake et al., submitted to BSSA) using the 869 topography, IDR becomes 1.5. Considering that the observed ratio of 2011 inundation to the deposit was 1.6, the magnitude of the 869 earthquake could have been similar to that of the 2011 earthquake.
NASA Technical Reports Server (NTRS)
Brock, L. D.; Lala, J.
1986-01-01
The Advanced Information Processing System (AIPS) is designed to provide a fault tolerant and damage tolerant data processing architecture for a broad range of aerospace vehicles. The AIPS architecture also has attributes to enhance system effectiveness such as graceful degradation, growth and change tolerance, integrability, etc. Two key building blocks being developed by the AIPS program are a fault and damage tolerant processor and communication network. A proof-of-concept system is now being built and will be tested to demonstrate the validity and performance of the AIPS concepts.
Making Classical Ground State Spin Computing Fault-Tolerant
2010-06-24
Scalable Optical-Fiber Communication Networks
NASA Technical Reports Server (NTRS)
Chow, Edward T.; Peterson, John C.
1993-01-01
Scalable arbitrary fiber extension network (SAFEnet) is a conceptual fiber-optic communication network passing digital signals among a variety of computers and input/output devices at rates from 200 Mb/s to more than 100 Gb/s. It is intended for use with very-high-speed computers and other data-processing and communication systems in which message-passing delays must be kept short. Inherent flexibility makes it possible to match the performance of the network to the computers by optimizing the configuration of interconnections. In addition, interconnections are made redundant to provide tolerance to faults.
Care 3 model overview and user's guide, first revision
NASA Technical Reports Server (NTRS)
Bavuso, S. J.; Petersen, P. L.
1985-01-01
A manual was written to introduce the CARE III (Computer-Aided Reliability Estimation) capability to reliability and design engineers who are interested in predicting the reliability of highly reliable fault-tolerant systems. It was also structured to serve as a quick-look reference manual for more experienced users. The guide covers CARE III modeling and reliability predictions for execution on the CDC Cyber 170 series computers, the DEC VAX-11/700 series computers, and most machines that compile ANSI Standard FORTRAN 77.
A survey of NASA and military standards on fault tolerance and reliability applied to robotics
NASA Technical Reports Server (NTRS)
Cavallaro, Joseph R.; Walker, Ian D.
1994-01-01
There is currently increasing interest and activity in the area of reliability and fault tolerance for robotics. This paper discusses the application of Standards in robot reliability, and surveys the literature of relevant existing standards. A bibliography of relevant Military and NASA standards for reliability and fault tolerance is included.
NASA Technical Reports Server (NTRS)
Stiffler, J. J.; Bryant, L. A.; Guccione, L.
1979-01-01
A computer program to aid in assessing the reliability of fault-tolerant avionics systems was developed. A simple mathematical expression was used to evaluate the reliability of any redundant configuration over any interval during which the failure rates and coverage parameters remained unaffected by configuration changes. Provision was made for convolving such expressions in order to evaluate the reliability of a dual-mode system. A coverage model was also developed to determine the various relevant coverage coefficients as a function of the available hardware and software fault detector characteristics and the subsequent isolation and recovery delay statistics.
Adaptive robust fault-tolerant control for linear MIMO systems with unmatched uncertainties
NASA Astrophysics Data System (ADS)
Zhang, Kangkang; Jiang, Bin; Yan, Xing-Gang; Mao, Zehui
2017-10-01
In this paper, two novel fault-tolerant control design approaches are proposed for linear MIMO systems with actuator additive faults, multiplicative faults, and unmatched uncertainties. For time-varying multiplicative and additive faults, new adaptive laws and additive compensation functions are proposed. A set of conditions is developed such that the unmatched uncertainties are compensated by actuators in control. On the other hand, for unmatched uncertainties whose projection onto the unmatched space is nonzero, additive functions based on a (vector) relative degree condition are designed to compensate for the uncertainties from output channels in the presence of actuator faults. The developed fault-tolerant control schemes are applied to two aircraft systems to demonstrate the efficiency of the proposed approaches.
A fault-tolerant strategy based on SMC for current-controlled converters
NASA Astrophysics Data System (ADS)
Azer, Peter M.; Marei, Mostafa I.; Sattar, Ahmed A.
2018-05-01
The sliding mode control (SMC) is used to control variable-structure systems such as power electronics converters. This paper presents a fault-tolerant strategy based on SMC for current-controlled AC-DC converters. The proposed SMC is based on three sliding surfaces for the three legs of the AC-DC converter. Two sliding surfaces are assigned to control the phase currents, since the input three-phase currents are balanced. Hence, the third sliding surface is considered an extra degree of freedom, which is utilised to control the neutral voltage. This action is utilised to enhance the performance of the converter during open-switch faults. The proposed fault-tolerant strategy is based on allocating the sliding surface of the faulty leg to control the neutral voltage; consequently, the current waveform is improved. The behaviour of the current-controlled converter during different types of open-switch faults is analysed. Double-switch faults include three cases: two upper switches; an upper and a lower switch on different legs; and two switches of the same leg. The dynamic performance of the proposed system is evaluated during healthy and open-switch fault operations. Simulation results exhibit the various merits of the proposed SMC-based fault-tolerant strategy.
Characterizing quantum supremacy in near-term devices
NASA Astrophysics Data System (ADS)
Boixo, Sergio; Isakov, Sergei V.; Smelyanskiy, Vadim N.; Babbush, Ryan; Ding, Nan; Jiang, Zhang; Bremner, Michael J.; Martinis, John M.; Neven, Hartmut
2018-06-01
A critical question for quantum computing in the near future is whether quantum devices without error correction can perform a well-defined computational task beyond the capabilities of supercomputers. Such a demonstration of what is referred to as quantum supremacy requires a reliable evaluation of the resources required to solve tasks with classical approaches. Here, we propose the task of sampling from the output distribution of random quantum circuits as a demonstration of quantum supremacy. We extend previous results in computational complexity to argue that this sampling task must take exponential time in a classical computer. We introduce cross-entropy benchmarking to obtain the experimental fidelity of complex multiqubit dynamics. This can be estimated and extrapolated to give a success metric for a quantum supremacy demonstration. We study the computational cost of relevant classical algorithms and conclude that quantum supremacy can be achieved with circuits in a two-dimensional lattice of 7 × 7 qubits and around 40 clock cycles. This requires an error rate of around 0.5% for two-qubit gates (0.05% for one-qubit gates), and it would demonstrate the basic building blocks for a fault-tolerant quantum computer.
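The cross-entropy estimator itself is compact; the sketch below applies the linear cross-entropy fidelity formula, F ~ D * mean(p_ideal over sampled bitstrings) - 1, to a made-up Porter-Thomas-like ideal distribution and a simple depolarizing-style noisy sampler, which stand in for a real circuit and device.

```python
# Sketch of linear cross-entropy benchmarking: estimate fidelity as
# F ~ D * mean(p_ideal(sampled bitstrings)) - 1. A made-up Porter-Thomas-
# like ideal distribution and a noisy sampler stand in for a real circuit
# and device.

import numpy as np

rng = np.random.default_rng(42)
n_states = 2 ** 10
ideal_p = rng.exponential(1.0, n_states)     # Porter-Thomas-shaped weights
ideal_p /= ideal_p.sum()

def sample(fidelity, n):
    """With probability `fidelity` draw from the ideal distribution,
    otherwise from uniform noise (a simple depolarizing-style model)."""
    from_ideal = rng.random(n) < fidelity
    return np.where(from_ideal,
                    rng.choice(n_states, n, p=ideal_p),
                    rng.integers(0, n_states, n))

samples = sample(fidelity=0.7, n=200_000)
f_xeb = n_states * ideal_p[samples].mean() - 1
print(f"estimated fidelity: {f_xeb:.3f}")    # should land near 0.7
```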
ECFS: A decentralized, distributed and fault-tolerant FUSE filesystem for the LHCb online farm
NASA Astrophysics Data System (ADS)
Rybczynski, Tomasz; Bonaccorsi, Enrico; Neufeld, Niko
2014-06-01
The LHCb experiment records millions of proton collisions every second, but only a fraction of them are useful for LHCb physics. In order to filter out the "bad events", a large farm of x86 servers (~2000 nodes) has been put in place. These servers boot from and run from NFS, but they use their local disks to temporarily store data which cannot be processed in real time ("data-deferring"). These events are subsequently processed when there are no live data coming in, greatly increasing the effective CPU power. This gain in CPU power depends critically on the availability of the local disks. For cost and power reasons, mirroring (RAID-1) is not used, leading to considerable operational headache with failing disks, disk errors, and server failures induced by faulty disks. To mitigate these problems and increase the reliability of the LHCb farm, while at the same time keeping cost and power consumption low, an extensive study of existing highly available and distributed file systems has been carried out. While many distributed file systems provide reliability by file replication, none of the evaluated ones supports erasure algorithms. A decentralised, distributed and fault-tolerant "write once read many" file system has been designed and implemented as a proof of concept, with the main goals of providing fault tolerance without expensive (in terms of disk space) file replication techniques and of providing a single namespace. This paper describes the design and the implementation of the Erasure Codes File System (ECFS) and presents the specialised FUSE interface for Linux. Depending on the encoding algorithm, ECFS will use a certain number of target directories as a backend to store the segments that compose the encoded data. When the target directories are mounted via nfs/autofs, ECFS will act as a file system over network/block-level RAID over multiple servers.
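ECFS encodes each file into segments spread across target directories. As a much-simplified illustration of erasure coding, the sketch below uses a single XOR parity over k data segments, tolerating the loss of any one segment (real erasure codes such as the ones ECFS targets tolerate more); the file contents and value of k are arbitrary.

```python
# Much-simplified illustration of erasure coding: one XOR parity segment
# over k data segments, so the loss of any single segment is repairable.
# (ECFS's actual encodings tolerate more losses.) Data and k are arbitrary.

from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data: bytes, k: int):
    """Split data into k equal segments plus one XOR parity segment."""
    data = data.ljust(-(-len(data) // k) * k, b"\0")   # pad to a multiple of k
    seg = len(data) // k
    segments = [data[i * seg:(i + 1) * seg] for i in range(k)]
    return segments + [reduce(xor_bytes, segments)]    # append parity

def recover(segments, lost: int):
    """Rebuild the segment at index `lost` by XOR-ing the survivors."""
    survivors = [s for i, s in enumerate(segments) if i != lost]
    return reduce(xor_bytes, survivors)

chunks = encode(b"LHCb deferred event data", k=4)
assert recover(chunks, lost=2) == chunks[2]    # any single loss is repairable
print("segment 2 rebuilt:", recover(chunks, lost=2))
```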
Fault-tolerant locomotion of the hexapod robot.
Yang, J M; Kim, J H
1998-01-01
In this paper, we propose a scheme for fault detection and tolerance of hexapod robot locomotion on even terrain. The fault stability margin is defined to represent the potential stability a gait retains should a sudden fault event occur in one leg. Based on this, the fault-tolerant quadruped periodic gaits of a hexapod walking over perfectly even terrain are derived. It is demonstrated that the derived quadruped gait is the optimal gait the hexapod can have while keeping the fault stability margin nonnegative, and that a geometric condition must be satisfied for optimal locomotion. By this scheme, when one leg fails, the hexapod robot adopts a modified tripod gait to continue the optimal locomotion.
Parallel and fault-tolerant algorithms for hypercube multiprocessors
DOE Office of Scientific and Technical Information (OSTI.GOV)
Aykanat, C.
1988-01-01
Several techniques for increasing the performance of parallel algorithms on distributed-memory message-passing multiprocessor systems are investigated. These techniques are effectively implemented for the parallelization of the Scaled Conjugate Gradient (SCG) algorithm on a hypercube-connected message-passing multiprocessor, and significant performance improvement is achieved. The SCG algorithm is used for the solution phase of an FE modeling system. Almost linear speed-up is achieved, and it is shown that the hypercube topology is scalable for this FE class of problems. The SCG algorithm is also shown to be suitable for vectorization, and near-supercomputer performance is achieved on a vector hypercube multiprocessor by exploiting both parallelization and vectorization. Fault-tolerance issues for the parallel SCG algorithm and for the hypercube topology are also addressed.
Distributed Systems: Interconnection and Fault Tolerance Studies
1992-01-01
... real-time operating system, a number of new techniques have to be ... The problem is at the heart of a successful implementation of a real-time operating system in a distributed environment. Our studies of the issues ... (The surviving excerpt also cites: Ólafur Gudmundsson, Daniel Mossé, Ashok K. Agrawala, and Satish K. Tripathi, "MARUTI: A Hard Real-Time Operating System," University of Maryland, College Park, MD 20742, January 1991.)
Stability Analysis of Distributed Engine Control Systems Under Communication Packet Drop (Postprint)
2008-07-01
Currently, Full Authority Digital Engine Control (FADEC) based on a centralized architecture framework is widely used for gas turbine engine control. However, current FADEC is not able to meet the ... system (DEC). FADEC based on Distributed Control Systems (DCS) offers modularity, improved control system prognostics, and fault tolerance along with ...
Fault-Tolerant Signal Processing Architectures with Distributed Error Control.
1985-01-01
Zm, Revisited," Information and Control, Vol. 37, pp. 100-104, 1978. 13. J. Wakerly , Error Detecting Codes. SeIf-Checkino Circuits and Applications ...However, the newer results concerning applications of real codes are still in the publication process. Hence, two very detailed appendices are included to...significant entities to be protected. While the distributed finite field approach afforded adequate protection, its applicability was restricted and
NASA Technical Reports Server (NTRS)
Harper, Richard E.; Elks, Carl
1995-01-01
An Army Fault Tolerant Architecture (AFTA) has been developed to meet real-time fault tolerant processing requirements of future Army applications. AFTA is the enabling technology that will allow the Army to configure existing processors and other hardware to provide high throughput and ultrahigh reliability necessary for TF/TA/NOE flight control and other advanced Army applications. A comprehensive conceptual study of AFTA has been completed that addresses a wide range of issues including requirements, architecture, hardware, software, testability, producibility, analytical models, validation and verification, common mode faults, VHDL, and a fault tolerant data bus. A Brassboard AFTA for demonstration and validation has been fabricated, and two operating systems and a flight-critical Army application have been ported to it. Detailed performance measurements have been made of fault tolerance and operating system overheads while AFTA was executing the flight application in the presence of faults.
Fault-tolerant rotary actuator
Tesar, Delbert
2006-10-17
A fault-tolerant actuator module, in a single containment shell, containing two actuator subsystems that are either asymmetrically or symmetrically laid out is provided. Fault tolerance in the actuators of the present invention is achieved by the employment of dual sets of equal resources. Dual resources are integrated into single modules, with each having the external appearance and functionality of a single set of resources.
Fault-tolerant wait-free shared objects
NASA Technical Reports Server (NTRS)
Jayanti, Prasad; Chandra, Tushar D.; Toueg, Sam
1992-01-01
A concurrent system consists of processes communicating via shared objects, such as shared variables, queues, etc. The concept of wait-freedom was introduced to cope with process failures: each process that accesses a wait-free object is guaranteed to get a response even if all the other processes crash. However, if a wait-free object 'crashes,' all the processes that access that object are prevented from making progress. In this paper, we introduce the concept of fault-tolerant wait-free objects, and study the problem of implementing them. We give a universal method to construct fault-tolerant wait-free objects, for all types of 'responsive' failures (including one in which faulty objects may 'lie'). In sharp contrast, we prove that many common and interesting types (such as queues, sets, and test&set) have no fault-tolerant wait-free implementations even under the most benign of the 'non-responsive' types of failure. We also introduce several concepts and techniques that are central to the design of fault-tolerant concurrent systems: the concepts of self-implementation and graceful degradation, and techniques to automatically increase the fault-tolerance of implementations. We prove matching lower bounds on the resource complexity of most of our algorithms.
Towards the formal verification of the requirements and design of a processor interface unit
NASA Technical Reports Server (NTRS)
Fura, David A.; Windley, Phillip J.; Cohen, Gerald C.
1993-01-01
The formal verification of the design and partial requirements for a Processor Interface Unit (PIU) using the Higher Order Logic (HOL) theorem-proving system is described. The processor interface unit is a single-chip subsystem within a fault-tolerant embedded system under development within the Boeing Defense and Space Group. It provides the opportunity to investigate the specification and verification of a real-world subsystem within a commercially developed fault-tolerant computer. An overview of the PIU verification effort is given. The actual HOL listings from the verification effort are documented in a companion NASA contractor report entitled 'Towards the Formal Verification of the Requirements and Design of a Processor Interface Unit - HOL Listings', including the general-purpose HOL theories and definitions that support the PIU verification as well as tactics used in the proofs.
Fault tolerance in an inner-outer solver: A GVR-enabled case study
Zhang, Ziming; Chien, Andrew A.; Teranishi, Keita
2015-04-18
Resilience is a major challenge for large-scale systems. It is particularly important for iterative linear solvers, since they take much of the time of many scientific applications. We show that single bit-flip errors in the Flexible GMRES iterative linear solver can lead to high computational overhead or even failure to converge to the right answer. Informed by these results, we design and evaluate several strategies for fault tolerance in both inner and outer solvers, appropriate across a range of error rates. We implement them, extending Trilinos' solver library with the Global View Resilience (GVR) programming model, which provides multi-stream snapshots and multi-version data structures with portable and rich error checking/recovery. Lastly, experimental results validate correct execution with low performance overhead under varied error conditions.
Fault-Tolerant Local-Area Network
NASA Technical Reports Server (NTRS)
Morales, Sergio; Friedman, Gary L.
1988-01-01
Local-area network (LAN) for computers prevents single-point failure from interrupting communication between nodes of network. Includes two complete cables, LAN 1 and LAN 2. Microprocessor-based slave switches link the cables to such network-node devices as work stations, print servers, and file servers. Slave switches respond to commands from a master switch, connecting nodes to the two cable networks or disconnecting them so they are completely isolated. A system monitor and control computer (SMC) acts as a gateway, allowing nodes on either cable to communicate with each other and ensuring that LAN 1 and LAN 2 are fully used when functioning properly. The network monitors and controls itself, automatically routes traffic for efficient use of resources, and isolates and corrects its own faults, with a potential dramatic reduction in time out of service.
Salehifar, Mehdi; Moreno-Equilaz, Manuel
2016-01-01
Due to its fault tolerance, a multiphase brushless direct current (BLDC) motor can meet the high reliability demanded for application in electric vehicles. The voltage-source inverter (VSI) supplying the motor is subject to open-circuit faults; therefore, it is necessary to design a fault-tolerant (FT) control algorithm with an embedded fault diagnosis (FD) block. In this paper, finite control set-model predictive control (FCS-MPC) is developed to implement the fault-tolerant control algorithm of a five-phase BLDC motor. The developed control method is fast, simple, and flexible. An FD method based on information already available from the control block is proposed; this method is simple, robust to common transients in the motor, and able to localize multiple open-circuit faults. The proposed FD and FT control algorithms are embedded in a five-phase BLDC motor drive. In order to validate the theory presented, simulations and experiments are conducted on a five-phase two-level VSI supplying a five-phase BLDC motor.
Measurement-based quantum computation on two-body interacting qubits with adiabatic evolution.
Kyaw, Thi Ha; Li, Ying; Kwek, Leong-Chuan
2014-10-31
A cluster state cannot be a unique ground state of a two-body interacting Hamiltonian. Here, we propose the creation of a cluster state of logical qubits encoded in spin-1/2 particles by adiabatically weakening two-body interactions. The proposal is valid for any spatial dimensional cluster states. Errors induced by thermal fluctuations and adiabatic evolution within finite time can be eliminated ensuring fault-tolerant quantum computing schemes.
Investigation of Air Transportation Technology at Princeton University, 1989-1990
NASA Technical Reports Server (NTRS)
Stengel, Robert F.
1990-01-01
The Air Transportation Technology Program at Princeton University proceeded along six avenues during the past year: microburst hazards to aircraft; machine-intelligent, fault tolerant flight control; computer aided heuristics for piloted flight; stochastic robustness for flight control systems; neural networks for flight control; and computer aided control system design. These topics are briefly discussed, and an annotated bibliography of publications that appeared between January 1989 and June 1990 is given.
Reconfiguration Schemes for Fault-Tolerant Processor Arrays
1992-10-15
... the notion of a linear schedule is easily related to similar models and concepts used in [1]-[13] and several other works. Computations are indexed by an ordered subset of a multidimensional integer lattice (called the index set); the points of this lattice correspond to (i.e., are the indices of) computations, and data dependencies are represented as vectors that connect points of the lattice, while the ... of all computations of the algorithm is to be minimized. If a ...
Phased models for evaluating the performability of computing systems
NASA Technical Reports Server (NTRS)
Wu, L. T.; Meyer, J. F.
1979-01-01
A phase-by-phase modelling technique is introduced to evaluate a fault tolerant system's ability to execute different sets of computational tasks during different phases of the control process. Intraphase processes are allowed to differ from phase to phase. The probabilities of interphase state transitions are specified by interphase transition matrices. Based on constraints imposed on the intraphase and interphase transition probabilities, various iterative solution methods are developed for calculating system performability.
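A toy instance of the phased technique: each phase has its own intraphase transition matrix, and an interphase matrix maps the state distribution across each phase boundary (modeling, say, reconfiguration between phases). All matrices and phase lengths in the sketch below are invented.

```python
# Toy phased performability model: each phase has its own intraphase
# transition matrix; an interphase matrix maps the state distribution
# across the phase boundary. All matrices and phase lengths are invented.

import numpy as np

# States: 0 = fully up, 1 = degraded, 2 = failed (absorbing).
phases = [
    (np.array([[0.98, 0.015, 0.005],     # benign phase, per time step
               [0.00, 0.970, 0.030],
               [0.00, 0.000, 1.000]]), 50),
    (np.array([[0.90, 0.07, 0.03],       # stressful phase
               [0.00, 0.85, 0.15],
               [0.00, 0.00, 1.00]]), 10),
]
# Interphase transition: reconfiguration at the boundary recovers some
# degraded systems to fully up.
interphase = np.array([[1.0, 0.0, 0.0],
                       [0.6, 0.4, 0.0],
                       [0.0, 0.0, 1.0]])

p = np.array([1.0, 0.0, 0.0])
for i, (P, steps) in enumerate(phases):
    p = p @ np.linalg.matrix_power(P, steps)
    print(f"after phase {i + 1}: P(failed) = {p[2]:.4f}")
    if i < len(phases) - 1:
        p = p @ interphase
```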
Fault and Defect Tolerant Computer Architectures: Reliable Computing with Unreliable Devices
2006-08-31
... supply voltage, the delay of the inverter increases parabolically. ... A consequence of maintaining a higher Vdd than ... be explained by disproportionate scaling of QCRIT with respect to collector efficiency. Technology trends, then, indicate a moderate increase in ... using clustered defects, a compounding procedure is used: compounding considers λ as a random variable rather than a constant. Let l be this defect ...
Fault Tolerant Parallel Implementations of Iterative Algorithms for Optimal Control Problems
1988-01-21
... steps, but did not discuss any specific parallel implementation. Gajski [5] improved upon this result by performing the SIMD computation in ... For N = p², our approach reduces to that of [5], except that Gajski presents the coefficient computation and partial solution phases as a single ... The SIMD algorithm presented by Gajski [5] can be most efficiently mapped to a unidirectional ring network with broadcasting capability. Based ...
Federal Register 2010, 2011, 2012, 2013, 2014
2012-07-02
... Controls for Conventional Arms and Dual-Use Goods and Technologies is a group of 41 like-minded states... specified and packaged as medical products, are not subject to control. ECCN 1C008 (Non-Fluorinated... technology and computer system design have made control of fault tolerance neither warranted nor feasible...
Analysis of Multi-State Systems with Multi-State Components Using EVMDDs
2012-05-01
Using a Multicore Processor for Rover Autonomous Science
NASA Technical Reports Server (NTRS)
Bornstein, Benjamin; Estlin, Tara; Clement, Bradley; Springer, Paul
2011-01-01
Multicore processing promises to be a critical component of future spacecraft. It provides immense increases in onboard processing power and provides an environment for directly supporting fault-tolerant computing. This paper discusses using a state-of-the-art multicore processor to efficiently perform image analysis onboard a Mars rover in support of autonomous science activities.
AADL and Model-based Engineering
2014-10-20
Presentation slides (Feiler, Carnegie Mellon University, Oct 20, 2014); only fragments survive extraction: "We Rely on Software for Safe Aircraft Operation" ... "Why do system level failures still occur despite fault tolerance techniques being deployed in systems?"
Graphical workstation capability for reliability modeling
NASA Technical Reports Server (NTRS)
Bavuso, Salvatore J.; Koppen, Sandra V.; Haley, Pamela J.
1992-01-01
In addition to computational capabilities, software tools for estimating the reliability of fault-tolerant digital computer systems must also provide a means of interfacing with the user. Described here is the new graphical interface capability of the hybrid automated reliability predictor (HARP), a software package that implements advanced reliability modeling techniques. The graphics oriented (GO) module provides the user with a graphical language for modeling system failure modes through the selection of various fault-tree gates, including sequence-dependency gates, or by a Markov chain. By using this graphical input language, a fault tree becomes a convenient notation for describing a system. In accounting for any sequence dependencies, HARP converts the fault-tree notation to a complex stochastic process that is reduced to a Markov chain, which it can then solve for system reliability. The graphics capability is available for use on an IBM-compatible PC, a Sun, and a VAX workstation. The GO module is written in the C programming language and uses the graphical kernel system (GKS) standard for graphics implementation. The PC, VAX, and Sun versions of the HARP GO module are currently in beta-testing stages.