Sample records for tolerant computer system

  1. Fault-tolerant software - Experiment with the SIFT operating system [Software Implemented Fault Tolerance computer]

    NASA Technical Reports Server (NTRS)

    Brunelle, J. E.; Eckhardt, D. E., Jr.

    1985-01-01

    Results are presented of an experiment conducted in the NASA Avionics Integrated Research Laboratory (AIRLAB) to investigate the implementation of fault-tolerant software techniques on fault-tolerant computer architectures, in particular the Software Implemented Fault Tolerance (SIFT) computer. The N-version programming and recovery block techniques were implemented on a portion of the SIFT operating system. The results indicate that, to effectively implement fault-tolerant software design techniques, system requirements will be impacted and suggest that retrofitting fault-tolerant software on existing designs will be inefficient and may require system modification.
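
    The two techniques named in the abstract can be illustrated with a minimal sketch (hypothetical routines, not the SIFT operating-system code): N-version programming runs independently developed versions of a computation and votes on their results, while a recovery block falls back to alternate routines whenever an acceptance test rejects a result.

```python
from collections import Counter

def n_version(versions, x):
    """Run independently developed versions and majority-vote their results."""
    results = [v(x) for v in versions]
    value, votes = Counter(results).most_common(1)[0]
    if votes <= len(versions) // 2:
        raise RuntimeError("no majority among versions")
    return value

def recovery_block(primary, alternates, acceptance_test, x):
    """Try the primary routine; fall back to alternates until one passes the test."""
    for candidate in [primary, *alternates]:
        result = candidate(x)
        if acceptance_test(x, result):
            return result
    raise RuntimeError("primary and all alternates failed the acceptance test")

# Hypothetical example: three versions of an integer square root, one of them faulty.
versions = [
    lambda n: int(n ** 0.5),
    lambda n: next(i for i in range(n + 2) if (i + 1) * (i + 1) > n),
    lambda n: int(n ** 0.5) + 1,  # deliberately faulty version
]
is_isqrt = lambda n, r: r * r <= n < (r + 1) * (r + 1)
print(n_version(versions, 16))                                  # 4 (faulty version outvoted)
print(recovery_block(versions[2], versions[:2], is_isqrt, 16))  # 4 (alternate accepted)
```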

  2. Fault Tolerant Software Technology for Distributed Computer Systems

    DTIC Science & Technology

    1989-03-01

    Final technical report covering "Fault Tolerant Software Technology for Distributed Computing Systems," a two-year effort performed at Georgia Institute of Technology as part of the Clouds Project.

  3. Method and system for environmentally adaptive fault tolerant computing

    NASA Technical Reports Server (NTRS)

    Copenhaver, Jason L. (Inventor); Ramos, Jeremy (Inventor); Wolfe, Jeffrey M. (Inventor); Brenner, Dean (Inventor)

    2010-01-01

    A method and system for adapting fault tolerant computing. The method includes the steps of measuring an environmental condition representative of an environment. An on-board processing system's sensitivity to the measured environmental condition is measured. It is determined whether to reconfigure a fault tolerance of the on-board processing system based in part on the measured environmental condition. The fault tolerance of the on-board processing system may be reconfigured based in part on the measured environmental condition.

  4. Noise tolerant spatiotemporal chaos computing.

    PubMed

    Kia, Behnam; Kia, Sarvenaz; Lindner, John F; Sinha, Sudeshna; Ditto, William L

    2014-12-01

    We introduce and design a noise tolerant chaos computing system based on a coupled map lattice (CML) and the noise reduction capabilities inherent in coupled dynamical systems. The resulting spatiotemporal chaos computing system is more robust to noise than a single map chaos computing system. In this CML based approach to computing, under the coupled dynamics, the local noise from different nodes of the lattice diffuses across the lattice, and the noise contributions attenuate one another, resulting in a system with less noise content and a more robust chaos computing architecture.
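
    As an illustration of the coupled-map-lattice idea, the sketch below updates a ring of logistic maps with diffusive nearest-neighbour coupling and additive local noise; the map, coupling form, and parameter values are assumptions for illustration, not the paper's exact model.

```python
import numpy as np

def cml_step(x, eps=0.3, r=4.0, noise=0.01, rng=None):
    """One update of a diffusively coupled logistic-map lattice with local noise."""
    rng = rng or np.random.default_rng()
    f = r * x * (1.0 - x)                        # local chaotic map
    left, right = np.roll(f, 1), np.roll(f, -1)  # nearest neighbours on a ring
    x_next = (1.0 - eps) * f + 0.5 * eps * (left + right)
    return np.clip(x_next + rng.normal(0.0, noise, size=x.shape), 0.0, 1.0)

rng = np.random.default_rng(0)
x = rng.uniform(0.2, 0.8, size=64)   # 64-node lattice
for _ in range(100):
    x = cml_step(x, rng=rng)
print(x[:4])
```

    Under this coupling each node's next state mixes in its neighbours' values, which is the mechanism by which independent local noise contributions are averaged down across the lattice.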

  5. Noise tolerant spatiotemporal chaos computing

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kia, Behnam; Kia, Sarvenaz; Ditto, William L.

    We introduce and design a noise tolerant chaos computing system based on a coupled map lattice (CML) and the noise reduction capabilities inherent in coupled dynamical systems. The resulting spatiotemporal chaos computing system is more robust to noise than a single map chaos computing system. In this CML based approach to computing, under the coupled dynamics, the local noise from different nodes of the lattice diffuses across the lattice, and the noise contributions attenuate one another, resulting in a system with less noise content and a more robust chaos computing architecture.

  6. Error Mitigation of Point-to-Point Communication for Fault-Tolerant Computing

    NASA Technical Reports Server (NTRS)

    Akamine, Robert L.; Hodson, Robert F.; LaMeres, Brock J.; Ray, Robert E.

    2011-01-01

    Fault tolerant systems require the ability to detect and recover from physical damage caused by the hardware's environment, faulty connectors, and system degradation over time. This ability applies to military, space, and industrial computing applications. The integrity of Point-to-Point (P2P) communication, between two microcontrollers for example, is an essential part of fault tolerant computing systems. In this paper, different methods of fault detection and recovery are presented and analyzed.
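
    As one generic example of the kind of detection-and-recovery method compared in such studies (not necessarily one of the paper's specific schemes), a point-to-point link can append a CRC to each frame and have the receiver request retransmission when the check fails:

```python
import zlib

def frame(payload: bytes) -> bytes:
    """Append a CRC-32 so the receiver can detect corruption on the link."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def check(received: bytes):
    """Return the payload if the CRC matches, else None (caller requests retransmit)."""
    payload, crc = received[:-4], int.from_bytes(received[-4:], "big")
    return payload if zlib.crc32(payload) == crc else None

msg = frame(b"sensor reading 42")
corrupted = bytes([msg[0] ^ 0x01]) + msg[1:]   # single-bit error on the wire
print(check(msg))        # b'sensor reading 42'
print(check(corrupted))  # None -> receiver would NAK / request retransmission
```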

  7. Reliability model derivation of a fault-tolerant, dual, spare-switching, digital computer system

    NASA Technical Reports Server (NTRS)

    1974-01-01

    A computer based reliability projection aid, tailored specifically for application in the design of fault-tolerant computer systems, is described. Its more pronounced characteristics include the facility for modeling systems with two distinct operational modes, measuring the effect of both permanent and transient faults, and calculating conditional system coverage factors. The underlying conceptual principles, mathematical models, and computer program implementation are presented.
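
    A minimal numerical sketch of the kind of Markov reliability model such an aid evaluates, with assumed failure rate and coverage values (the report's actual model also distinguishes two operational modes and transient faults, which are omitted here for brevity):

```python
import numpy as np
from scipy.linalg import expm

# Illustrative continuous-time Markov model (not the report's exact model):
# state 0 = dual (both units good), 1 = simplex (spare switched in), 2 = failed.
lam = 1e-4   # assumed permanent failure rate per hour
c   = 0.99   # assumed coverage: failure detected and spare successfully switched in
Q = np.array([
    [-2 * lam,  2 * lam * c,  2 * lam * (1 - c)],
    [ 0.0,     -lam,          lam              ],
    [ 0.0,      0.0,          0.0              ],
])

t = 10.0                                  # mission time in hours
p = np.array([1.0, 0.0, 0.0]) @ expm(Q * t)
print(f"P(system failed by t={t} h) = {p[2]:.3e}")
```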

  8. Fault tolerant architectures for integrated aircraft electronics systems

    NASA Technical Reports Server (NTRS)

    Levitt, K. N.; Melliar-Smith, P. M.; Schwartz, R. L.

    1983-01-01

    Research into possible architectures for future flight control computer systems is described. Ada for Fault-Tolerant Systems, the NETS Network Error-Tolerant System architecture, and voting in asynchronous systems are covered.

  9. Computer-Aided Reliability Estimation

    NASA Technical Reports Server (NTRS)

    Bavuso, S. J.; Stiffler, J. J.; Bryant, L. A.; Petersen, P. L.

    1986-01-01

    CARE III (Computer-Aided Reliability Estimation, Third Generation) helps estimate reliability of complex, redundant, fault-tolerant systems. Program specifically designed for evaluation of fault-tolerant avionics systems. However, CARE III is general enough for use in evaluation of other systems as well.

  10. Radiation Tolerant, FPGA-Based SmallSat Computer System

    NASA Technical Reports Server (NTRS)

    LaMeres, Brock J.; Crum, Gary A.; Martinez, Andres; Petro, Andrew

    2015-01-01

    The Radiation Tolerant, FPGA-based SmallSat Computer System (RadSat) computing platform exploits a commercial off-the-shelf (COTS) Field Programmable Gate Array (FPGA) with real-time partial reconfiguration to provide increased performance, power efficiency and radiation tolerance at a fraction of the cost of existing radiation hardened computing solutions. This technology is ideal for small spacecraft that require state-of-the-art on-board processing in harsh radiation environments but where using radiation hardened processors is cost prohibitive.

  11. Soft-error tolerance and energy consumption evaluation of embedded computer with magnetic random access memory in practical systems using computer simulations

    NASA Astrophysics Data System (ADS)

    Nebashi, Ryusuke; Sakimura, Noboru; Sugibayashi, Tadahiko

    2017-08-01

    We evaluated the soft-error tolerance and energy consumption of an embedded computer with magnetic random access memory (MRAM) using two computer simulators. One is a central processing unit (CPU) simulator of a typical embedded computer system. We simulated the radiation-induced single-event-upset (SEU) probability in a spin-transfer-torque MRAM cell and also the failure rate of a typical embedded computer due to its main memory SEU error. The other is a delay tolerant network (DTN) system simulator. It simulates the power dissipation of wireless sensor network nodes of the system using a revised CPU simulator and a network simulator. We demonstrated that the SEU effect on the embedded computer with 1 Gbit MRAM-based working memory is less than 1 failure in time (FIT). We also demonstrated that the energy consumption of the DTN sensor node with MRAM-based working memory can be reduced to 1/11. These results indicate that MRAM-based working memory enhances the disaster tolerance of embedded computers.
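
    For readers unfamiliar with the FIT unit used above (1 FIT = 1 failure per 10^9 device-hours), a back-of-the-envelope conversion from a per-bit upset rate looks like the following; all numerical values are assumptions for illustration, not the paper's simulated figures.

```python
# Rough FIT estimate from a per-bit upset rate (assumed numbers only; the authors
# obtain their result from detailed circuit- and system-level simulators).
seu_per_bit_hour = 1e-16      # assumed radiation-induced upset rate per MRAM bit
bits             = 1 * 2**30  # 1 Gbit working memory
derating         = 0.005      # assumed fraction of upsets that actually cause a failure

failures_per_hour = seu_per_bit_hour * bits * derating
fit = failures_per_hour * 1e9  # FIT = failures per 10^9 device-hours
print(f"{fit:.2f} FIT")
```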

  12. Validation Methods for Fault-Tolerant Avionics and Control Systems, working group meeting 1

    NASA Technical Reports Server (NTRS)

    1979-01-01

    The proceedings of the first working group meeting on validation methods for fault tolerant computer design are presented. The state of the art in fault tolerant computer validation was examined in order to provide a framework for future discussions concerning research issues for the validation of fault tolerant avionics and flight control systems. The development of positions concerning critical aspects of the validation process are given.

  13. Ultrareliable fault-tolerant control systems

    NASA Technical Reports Server (NTRS)

    Webster, L. D.; Slykhouse, R. A.; Booth, L. A., Jr.; Carson, T. M.; Davis, G. J.; Howard, J. C.

    1984-01-01

    It is demonstrated that fault-tolerant computer systems based on redundant, independent operation, such as those on the Shuttle, are a viable alternative in fault-tolerant system design. The ultrareliable fault-tolerant control system (UFTCS) was developed and tested in laboratory simulations of a UH-1H helicopter. UFTCS includes asymptotically stable independent control elements in a parallel, cross-linked system environment; static redundancy provides the fault tolerance. Polling is performed among the computers, with the results allowing for time-delay channel variations within tight bounds. Based on laboratory and actual flight data for the helicopter, the probability of a fault during the first 10 hr of flight with quintuple computer redundancy was found to be 1 in 290 billion; two weeks of untended Space Station operation would experience a fault probability of 1 in 24 million. Techniques for avoiding channel divergence problems are identified.

  14. Airborne Advanced Reconfigurable Computer System (ARCS)

    NASA Technical Reports Server (NTRS)

    Bjurman, B. E.; Jenkins, G. M.; Masreliez, C. J.; Mcclellan, K. L.; Templeman, J. E.

    1976-01-01

    A digital computer subsystem fault-tolerant concept was defined, and the potential benefits and costs of such a subsystem were assessed when used as the central element of a new transport's flight control system. The derived advanced reconfigurable computer system (ARCS) is a triple-redundant computer subsystem that automatically reconfigures, under multiple fault conditions, from triplex to duplex to simplex operation, with redundancy recovery if the fault condition is transient. The study included criteria development covering factors at the aircraft's operation level that would influence the design of a fault-tolerant system for commercial airline use. A new reliability analysis tool was developed for evaluating redundant, fault-tolerant system availability and survivability; and a stringent digital system software design methodology was used to achieve design/implementation visibility.

  15. Development and analysis of the Software Implemented Fault-Tolerance (SIFT) computer

    NASA Technical Reports Server (NTRS)

    Goldberg, J.; Kautz, W. H.; Melliar-Smith, P. M.; Green, M. W.; Levitt, K. N.; Schwartz, R. L.; Weinstock, C. B.

    1984-01-01

    SIFT (Software Implemented Fault Tolerance) is an experimental, fault-tolerant computer system designed to meet the extreme reliability requirements for safety-critical functions in advanced aircraft. Errors are masked by performing a majority voting operation over the results of identical computations, and faulty processors are removed from service by reassigning computations to the nonfaulty processors. This scheme has been implemented in a special architecture using a set of standard Bendix BDX930 processors, augmented by a special asynchronous-broadcast communication interface that provides direct, processor to processor communication among all processors. Fault isolation is accomplished in hardware; all other fault-tolerance functions, together with scheduling and synchronization are implemented exclusively by executive system software. The system reliability is predicted by a Markov model. Mathematical consistency of the system software with respect to the reliability model has been partially verified, using recently developed tools for machine-aided proof of program correctness.
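
    The masking-and-reconfiguration scheme described above can be sketched as follows (an illustrative simplification; in SIFT the voting and reassignment are performed by the executive software over the broadcast interface):

```python
from collections import Counter

def vote_and_diagnose(results, active):
    """Mask errors by majority vote and drop processors whose result disagreed.

    Illustrative only: computations are then reassigned to the returned
    non-faulty set, mirroring the removal-from-service step in the abstract.
    """
    majority, _ = Counter(results[p] for p in active).most_common(1)[0]
    suspects = {p for p in active if results[p] != majority}
    return majority, active - suspects

active = {0, 1, 2, 3, 4}
results = {0: 42, 1: 42, 2: 7, 3: 42, 4: 42}   # processor 2 returns a bad value
value, active = vote_and_diagnose(results, active)
print(value, sorted(active))                    # 42 [0, 1, 3, 4]
```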

  16. Testing For EM Upsets In Aircraft Control Computers

    NASA Technical Reports Server (NTRS)

    Belcastro, Celeste M.

    1994-01-01

    Effects of transient electrical signals evaluated in laboratory tests. Method of evaluating nominally fault-tolerant, aircraft-type digital-computer-based control system devised. Provides for evaluation of susceptibility of system to upset and evaluation of integrity of control when system subjected to transient electrical signals like those induced by electromagnetic (EM) source, in this case lightning. Beyond aerospace applications, fault-tolerant control systems are becoming more widespread in industry, such as in automobiles. Method supports practical, systematic tests for evaluation of designs of fault-tolerant control systems.

  17. Noise Threshold and Resource Cost of Fault-Tolerant Quantum Computing with Majorana Fermions in Hybrid Systems.

    PubMed

    Li, Ying

    2016-09-16

    Fault-tolerant quantum computing in systems composed of both Majorana fermions and topologically unprotected quantum systems, e.g., superconducting circuits or quantum dots, is studied in this Letter. Errors caused by topologically unprotected quantum systems need to be corrected with error-correction schemes, for instance, the surface code. We find that the error-correction performance of such a hybrid topological quantum computer is not superior to a normal quantum computer unless the topological charge of Majorana fermions is insusceptible to noise. If errors changing the topological charge are rare, the fault-tolerance threshold is much higher than the threshold of a normal quantum computer and a surface-code logical qubit could be encoded in only tens of topological qubits instead of about 1,000 normal qubits.

  18. Fault tolerant architectures for integrated aircraft electronics systems, task 2

    NASA Technical Reports Server (NTRS)

    Levitt, K. N.; Melliar-Smith, P. M.; Schwartz, R. L.

    1984-01-01

    The architectural basis for an advanced fault tolerant on-board computer to succeed the current generation of fault tolerant computers is examined. The network error tolerant system architecture is studied with particular attention to intercluster configurations and communication protocols, and to refined reliability estimates. The diagnosis of faults, so that appropriate choices for reconfiguration can be made, is discussed. The analysis relates particularly to the recognition of transient faults in a system with tasks at many levels of priority. The demand driven data-flow architecture, which appears to have possible application in fault tolerant systems, is described, and work investigating the feasibility of automatic generation of aircraft flight control programs from abstract specifications is reported.

  19. Survivable algorithms and redundancy management in NASA's distributed computing systems

    NASA Technical Reports Server (NTRS)

    Malek, Miroslaw

    1992-01-01

    The design of survivable algorithms requires a solid foundation for executing them. While hardware techniques for fault-tolerant computing are relatively well understood, fault-tolerant operating systems, as well as fault-tolerant applications (survivable algorithms), are, by contrast, little understood, and much more work in this field is required. We outline some of our work that contributes to the foundation of ultrareliable operating systems and fault-tolerant algorithm design. We introduce our consensus-based framework for fault-tolerant system design. This is followed by a description of a hierarchical partitioning method for efficient consensus. A scheduler for redundancy management is introduced, and application-specific fault tolerance is described. We give an overview of our hybrid algorithm technique, which is an alternative to the formal approach given.

  20. Enhanced fault-tolerant quantum computing in d-level systems.

    PubMed

    Campbell, Earl T

    2014-12-05

    Error-correcting codes protect quantum information and form the basis of fault-tolerant quantum computing. Leading proposals for fault-tolerant quantum computation require codes with an exceedingly rare property, a transversal non-Clifford gate. Codes with the desired property are presented for d-level qudit systems with prime d. The codes use n=d-1 qudits and can detect up to ∼d/3 errors. We quantify the performance of these codes for one approach to quantum computation known as magic-state distillation. Unlike prior work, we find performance is always enhanced by increasing d.

  1. Definition and trade-off study of reconfigurable airborne digital computer system organizations

    NASA Technical Reports Server (NTRS)

    Conn, R. B.

    1974-01-01

    A highly-reliable, fault-tolerant reconfigurable computer system for aircraft applications was developed. The development and application of reliability and fault-tolerance assessment techniques are described. Particular emphasis is placed on the needs of an all-digital, fly-by-wire control system appropriate for a passenger-carrying airplane.

  2. An approximation formula for a class of fault-tolerant computers

    NASA Technical Reports Server (NTRS)

    White, A. L.

    1986-01-01

    An approximation formula is derived for the probability of failure for fault-tolerant process-control computers. These computers use redundancy and reconfiguration to achieve high reliability. Finite-state Markov models capture the dynamic behavior of component failure and system recovery, and the approximation formula permits an estimation of system reliability by an easy examination of the model.

  3. Design study of Software-Implemented Fault-Tolerance (SIFT) computer

    NASA Technical Reports Server (NTRS)

    Wensley, J. H.; Goldberg, J.; Green, M. W.; Kautz, W. H.; Levitt, K. N.; Mills, M. E.; Shostak, R. E.; Whiting-Okeefe, P. M.; Zeidler, H. M.

    1982-01-01

    Software-implemented fault tolerant (SIFT) computer design for commercial aviation is reported. A SIFT design concept is addressed. Alternate strategies for physical implementation are considered. Hardware and software design correctness is addressed. System modeling and effectiveness evaluation are considered from a fault-tolerant point of view.

  4. Provable Transient Recovery for Frame-Based, Fault-Tolerant Computing Systems

    NASA Technical Reports Server (NTRS)

    DiVito, Ben L.; Butler, Ricky W.

    1992-01-01

    We present a formal verification of the transient fault recovery aspects of the Reliable Computing Platform (RCP), a fault-tolerant computing system architecture for digital flight control applications. The RCP uses NMR-style redundancy to mask faults and internal majority voting to purge the effects of transient faults. The system design has been formally specified and verified using the EHDM verification system. Our formalization accommodates a wide variety of voting schemes for purging the effects of transients.

  5. Fault tolerant computing: A preamble for assuring viability of large computer systems

    NASA Technical Reports Server (NTRS)

    Lim, R. S.

    1977-01-01

    The need for fault-tolerant computing is addressed from the viewpoints of (1) why it is needed, (2) how to apply it in the current state of technology, and (3) what it means in the context of the Phoenix computer system and other related systems. To this end, the value of concurrent error detection and correction is described. User protection, program retry, and repair are among the factors considered. The technology of algebraic codes to protect memory systems and of arithmetic codes to protect arithmetic operations is discussed.

  6. Advanced information processing system: Local system services

    NASA Technical Reports Server (NTRS)

    Burkhardt, Laura; Alger, Linda; Whittredge, Roy; Stasiowski, Peter

    1989-01-01

    The Advanced Information Processing System (AIPS) is a multi-computer architecture composed of hardware and software building blocks that can be configured to meet a broad range of application requirements. The hardware building blocks are fault-tolerant, general-purpose computers, fault-and damage-tolerant networks (both computer and input/output), and interfaces between the networks and the computers. The software building blocks are the major software functions: local system services, input/output, system services, inter-computer system services, and the system manager. The foundation of the local system services is an operating system with the functions required for a traditional real-time multi-tasking computer, such as task scheduling, inter-task communication, memory management, interrupt handling, and time maintenance. Resting on this foundation are the redundancy management functions necessary in a redundant computer and the status reporting functions required for an operator interface. The functional requirements, functional design and detailed specifications for all the local system services are documented.

  7. Advanced reliability modeling of fault-tolerant computer-based systems

    NASA Technical Reports Server (NTRS)

    Bavuso, S. J.

    1982-01-01

    Two methodologies for the reliability assessment of fault tolerant digital computer based systems are discussed. Computer-Aided Reliability Estimation 3 (CARE 3) and Gate Logic Software Simulation (GLOSS) are assessment technologies that were developed to mitigate a serious weakness in the design and evaluation process of ultrareliable digital systems. The weak link is the unavailability of a sufficiently powerful modeling technique for comparing the stochastic attributes of one system against others. Some of the more interesting attributes are reliability, system survival, safety, and mission success.

  8. Reliability modeling of fault-tolerant computer based systems

    NASA Technical Reports Server (NTRS)

    Bavuso, Salvatore J.

    1987-01-01

    Digital fault-tolerant computer-based systems have become commonplace in military and commercial avionics. These systems hold the promise of increased availability, reliability, and maintainability over conventional analog-based systems through the application of replicated digital computers arranged in fault-tolerant configurations. Three tightly coupled factors of paramount importance, ultimately determining the viability of these systems, are reliability, safety, and profitability. Reliability, the major driver, affects virtually every aspect of design, packaging, and field operations, and eventually produces profit for commercial applications or increased national security. However, the utilization of digital computer systems makes the task of producing credible reliability assessments a formidable one for the reliability engineer. The root of the problem lies in the digital computer's unique adaptability to changing requirements, computational power, and ability to test itself efficiently. Addressed here are the nuances of modeling the reliability of systems with large state sizes, in the Markov sense, which result from systems based on replicated redundant hardware, as well as the modeling of factors which can reduce reliability without concomitant depletion of hardware. Advanced fault-handling models are described and methods of acquiring and measuring parameters for these models are delineated.

  9. Advanced information processing system

    NASA Technical Reports Server (NTRS)

    Lala, J. H.

    1984-01-01

    Design and performance details of the advanced information processing system (AIPS) for fault and damage tolerant data processing on aircraft and spacecraft are presented. AIPS comprises several computers distributed throughout the vehicle and linked by a damage tolerant data bus. Most I/O functions are available to all the computers, which run in a TDMA mode. Each computer performs separate specific tasks in normal operation and assumes other tasks in degraded modes. Redundant software assures that all fault monitoring, logging and reporting are automated, together with control functions. Redundant duplex links and damage-spread limitation provide the fault tolerance. Details of an advanced design of a laboratory-scale proof-of-concept system are described, including functional operations.

  10. Application of Fault-Tolerant Computing For Spacecraft Using Commercial-Off-The-Shelf Microprocessors

    DTIC Science & Technology

    2000-06-01

    real-time operating system and design of a human-computer interface (HCI) for a triple modular redundant (TMR) fault-tolerant microprocessor for use in space-based applications. One disadvantage of using COTS hardware components is their susceptibility to the radiation effects present in the space environment and, specifically, radiation-induced single-event upsets (SEUs). In the event of an SEU, a fault-tolerant system can mitigate the effects of the upset and continue to process from the last known correct system state. The TMR basic hardware

  11. Verifiable fault tolerance in measurement-based quantum computation

    NASA Astrophysics Data System (ADS)

    Fujii, Keisuke; Hayashi, Masahito

    2017-09-01

    Quantum systems, in general, cannot be simulated efficiently by a classical computer, and hence are useful for solving certain mathematical problems and simulating quantum many-body systems. This also implies, unfortunately, that verification of the output of the quantum systems is not so trivial, since predicting the output is exponentially hard. As another problem, the quantum system is very sensitive to noise and thus needs error correction. Here, we propose a framework for verification of the output of fault-tolerant quantum computation in a measurement-based model. In contrast to existing analyses on fault tolerance, we do not assume any noise model on the resource state; instead, an arbitrary resource state is tested by using only single-qubit measurements to verify whether or not the output of measurement-based quantum computation on it is correct. Verifiability is provided by a constant-time repetition of the original measurement-based quantum computation in appropriate measurement bases. Since full characterization of quantum noise is exponentially hard for large-scale quantum computing systems, our framework provides an efficient way to practically verify the experimental quantum error correction.

  12. Fault-tolerant clock synchronization validation methodology. [in computer systems

    NASA Technical Reports Server (NTRS)

    Butler, Ricky W.; Palumbo, Daniel L.; Johnson, Sally C.

    1987-01-01

    A validation method for the synchronization subsystem of a fault-tolerant computer system is presented. The high reliability requirement of flight-crucial systems precludes the use of most traditional validation methods. The method presented utilizes formal design proof to uncover design and coding errors and experimentation to validate the assumptions of the design proof. The experimental method is described and illustrated by validating the clock synchronization system of the Software Implemented Fault Tolerance computer. The design proof of the algorithm includes a theorem that defines the maximum skew between any two nonfaulty clocks in the system in terms of specific system parameters. Most of these parameters are deterministic. One crucial parameter is the upper bound on the clock read error, which is stochastic. The probability that this upper bound is exceeded is calculated from data obtained by the measurement of system parameters. This probability is then included in a detailed reliability analysis of the system.
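
    The last step of that method, estimating the probability that the assumed clock read error bound is exceeded, amounts to an empirical exceedance estimate over measured samples. A minimal sketch with made-up data and a made-up bound (the paper derives the bound from the synchronization theory and measures the actual error distribution on SIFT):

```python
import numpy as np

# Illustrative estimate of P(clock read error > assumed bound) from measured samples.
rng = np.random.default_rng(1)
read_errors_us = np.abs(rng.normal(0.0, 5.0, size=20_000))   # stand-in measurements (microseconds)
epsilon_us = 15.0                                             # assumed read-error bound

exceedances = int(np.sum(read_errors_us > epsilon_us))
p_hat = exceedances / read_errors_us.size
print(f"{exceedances} exceedances in {read_errors_us.size} reads, p ~ {p_hat:.2e}")
```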

  13. Tolerance analysis through computational imaging simulations

    NASA Astrophysics Data System (ADS)

    Birch, Gabriel C.; LaCasse, Charles F.; Stubbs, Jaclynn J.; Dagel, Amber L.; Bradley, Jon

    2017-11-01

    The modeling and simulation of non-traditional imaging systems require holistic consideration of the end-to-end system. We demonstrate this approach through a tolerance analysis of a random scattering lensless imaging system.

  14. Responsive systems - The challenge for the nineties

    NASA Technical Reports Server (NTRS)

    Malek, Miroslaw

    1990-01-01

    A concept of responsive computer systems will be introduced. The emerging responsive systems demand fault-tolerant and real-time performance in parallel and distributed computing environments. The design methodologies for fault-tolerant, real-time, and responsive systems will be presented. Novel techniques of introducing redundancy for improved performance and dependability will be illustrated. The methods of system responsiveness evaluation will be proposed. The issues of determinism, closed and open systems will also be discussed from the perspective of responsive systems design.

  15. Formal Techniques for Synchronized Fault-Tolerant Systems

    NASA Technical Reports Server (NTRS)

    DiVito, Ben L.; Butler, Ricky W.

    1992-01-01

    We present the formal verification of synchronizing aspects of the Reliable Computing Platform (RCP), a fault-tolerant computing system for digital flight control applications. The RCP uses NMR-style redundancy to mask faults and internal majority voting to purge the effects of transient faults. The system design has been formally specified and verified using the EHDM verification system. Our formalization is based on an extended state machine model incorporating snapshots of local processors clocks.

  16. Fault tolerant features and experiments of ANTS distributed real-time system

    NASA Astrophysics Data System (ADS)

    Dominic-Savio, Patrick; Lo, Jien-Chung; Tufts, Donald W.

    1995-01-01

    The ANTS project at the University of Rhode Island introduces the concept of Active Nodal Task Seeking (ANTS) as a way to efficiently design and implement dependable, high-performance, distributed computing. This paper presents the fault tolerant design features that have been incorporated in the ANTS experimental system implementation. The results of performance evaluations and fault injection experiments are reported. The fault-tolerant version of ANTS categorizes all computing nodes into three groups. They are: the up-and-running green group, the self-diagnosing yellow group and the failed red group. Each available computing node will be placed in the yellow group periodically for a routine diagnosis. In addition, for long-life missions, ANTS uses a monitoring scheme to identify faulty computing nodes. In this monitoring scheme, the communication pattern of each computing node is monitored by two other nodes.

  17. Modeling and Simulation Reliable Spacecraft On-Board Computing

    NASA Technical Reports Server (NTRS)

    Park, Nohpill

    1999-01-01

    The proposed project will investigate modeling and simulation-driven testing and fault tolerance schemes for Spacecraft On-Board Computing, thereby achieving reliable spacecraft telecommunication. A spacecraft communication system has inherent capabilities of providing multipoint and broadcast transmission, connectivity between any two distant nodes within a wide-area coverage, quick network configuration/reconfiguration, rapid allocation of space segment capacity, and distance-insensitive cost. To realize the capabilities mentioned above, both the size and cost of the ground-station terminals have to be reduced by using a reliable, high-throughput, fast, and cost-effective on-board computing system, which has been known to be a critical contributor to the overall performance of space mission deployment. Controlled vulnerability of mission data (measured in sensitivity), improved performance (measured in throughput and delay), and fault tolerance (measured in reliability) are some of the most important features of these systems. The system should be thoroughly tested and diagnosed before employing fault tolerance in the system. Testing and fault tolerance strategies should be driven by accurate performance models (i.e., throughput, delay, reliability, and sensitivity) to find an optimal solution in terms of reliability and cost. The modeling and simulation tools will be integrated with a system architecture module, a testing module, and a module for fault tolerance, all of which interact through a central graphical user interface.

  18. Impact of coverage on the reliability of a fault tolerant computer

    NASA Technical Reports Server (NTRS)

    Bavuso, S. J.

    1975-01-01

    A mathematical reliability model is established for a reconfigurable fault tolerant avionic computer system utilizing state-of-the-art computers. System reliability is studied in light of the coverage probabilities associated with the first and second independent hardware failures. Coverage models are presented as a function of detection, isolation, and recovery probabilities. Upper and lower bounds are established for the coverage probabilities, and the method for computing values for the coverage probabilities is investigated. Further, an architectural variation is proposed which is shown to enhance coverage.
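
    A common textbook form of such a coverage model (not necessarily the report's exact formulation) treats a fault as covered only if it is detected, isolated, and recovered from, and folds the resulting coverage factor into the redundancy reliability equation:

```python
# Illustrative coverage model: a fault is "covered" only if it is detected,
# isolated, and recovered from. All probability values are assumed.
p_detect, p_isolate, p_recover = 0.999, 0.995, 0.99
coverage = p_detect * p_isolate * p_recover

# Mission reliability of a dual-redundant unit with imperfect first-fault coverage:
# either both modules survive, or one fails, the fault is covered, and the survivor holds.
r_module = 0.995          # assumed single-module mission reliability
r_system = r_module**2 + 2 * coverage * r_module * (1 - r_module)
print(f"coverage = {coverage:.4f}, system reliability = {r_system:.6f}")
```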

  19. Fault-tolerant computer study. [logic designs for building block circuits

    NASA Technical Reports Server (NTRS)

    Rennels, D. A.; Avizienis, A. A.; Ercegovac, M. D.

    1981-01-01

    A set of building block circuits is described which can be used with commercially available microprocessors and memories to implement fault tolerant distributed computer systems. Each building block circuit is intended for VLSI implementation as a single chip. Several building blocks and associated processor and memory chips form a self checking computer module with self contained input output and interfaces to redundant communications buses. Fault tolerance is achieved by connecting self checking computer modules into a redundant network in which backup buses and computer modules are provided to circumvent failures. The requirements and design methodology which led to the definition of the building block circuits are discussed.

  20. Fault Tolerance for VLSI Multicomputers

    DTIC Science & Technology

    1985-08-01

    that consists of hundreds or thousands of VLSI computation nodes interconnected by dedicated links. Some important applications of high-end computers...technology, and intended applications. A proposed fault tolerance scheme combines hardware that performs error detection and system-level protocols for...order to recover from the error and resume correct operation, a valid system state must be restored. A low-overhead, application-transparent error

  1. Algorithm-Based Fault Tolerance Integrated with Replication

    NASA Technical Reports Server (NTRS)

    Some, Raphael; Rennels, David

    2008-01-01

    In a proposed approach to programming and utilization of commercial off-the-shelf computing equipment, a combination of algorithm-based fault tolerance (ABFT) and replication would be utilized to obtain high degrees of fault tolerance without incurring excessive costs. The basic idea of the proposed approach is to integrate ABFT with replication such that the algorithmic portions of computations would be protected by ABFT, and the logical portions by replication. ABFT is an extremely efficient, inexpensive, high-coverage technique for detecting and mitigating faults in computer systems used for algorithmic computations, but does not protect against errors in logical operations surrounding algorithms.
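
    The classic ABFT example is checksum-protected matrix multiplication: augment the operands with row and column checksums, and any single corrupted product element makes the result inconsistent with its own checksums. A minimal sketch (illustrative, not the proposed system's implementation):

```python
import numpy as np

def abft_matmul(A, B, tol=1e-8):
    """Checksum-protected matrix multiply (classic ABFT illustration).

    A gains a column-sum row and B a row-sum column; the product's own
    row/column sums must then match its checksum row/column, or a fault is flagged.
    """
    Ac = np.vstack([A, A.sum(axis=0)])                  # column-checksum matrix
    Br = np.hstack([B, B.sum(axis=1, keepdims=True)])   # row-checksum matrix
    C = Ac @ Br
    data = C[:-1, :-1]
    ok = (np.allclose(C[-1, :-1], data.sum(axis=0), atol=tol) and
          np.allclose(C[:-1, -1], data.sum(axis=1), atol=tol))
    return data, ok

rng = np.random.default_rng(0)
A, B = rng.random((4, 4)), rng.random((4, 4))
C, ok = abft_matmul(A, B)
print(ok)                         # True on a fault-free run

Cc = np.vstack([A, A.sum(axis=0)]) @ np.hstack([B, B.sum(axis=1, keepdims=True)])
Cc[1, 2] += 0.5                   # inject an error into one product element
data = Cc[:-1, :-1]
print(np.allclose(Cc[-1, :-1], data.sum(axis=0)))   # False: the checksum catches it
```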

  2. Parallel and distributed computation for fault-tolerant object recognition

    NASA Technical Reports Server (NTRS)

    Wechsler, Harry

    1988-01-01

    The distributed associative memory (DAM) model is suggested for distributed and fault-tolerant computation as it relates to object recognition tasks. The fault-tolerance is with respect to geometrical distortions (scale and rotation), noisy inputs, occlusion/overlap, and memory faults. An experimental system was developed for fault-tolerant structure recognition which shows the feasibility of such an approach. The approach is further extended to the problem of multisensory data integration and applied successfully to the recognition of colored polyhedral objects.

  3. Verification Methodology of Fault-tolerant, Fail-safe Computers Applied to MAGLEV Control Computer Systems

    DOT National Transportation Integrated Search

    1993-05-01

    The Maglev control computer system should be designed to verifiably possess high reliability and safety as well as high availability to make Maglev a dependable and attractive transportation alternative to the public. A Maglev computer system has bee...

  4. Neuromorphic Computing – From Materials Research to Systems Architecture Roundtable

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Schuller, Ivan K.; Stevens, Rick; Pino, Robinson

    2015-10-29

    Computation in its many forms is the engine that fuels our modern civilization. Modern computation—based on the von Neumann architecture—has allowed, until now, the development of continuous improvements, as predicted by Moore's law. However, computation using current architectures and materials will inevitably—within the next 10 years—reach a limit because of fundamental scientific reasons. DOE convened a roundtable of experts in neuromorphic computing systems, materials science, and computer science in Washington on October 29-30, 2015 to address the following basic questions: Can brain-like ("neuromorphic") computing devices based on new material concepts and systems be developed to dramatically outperform conventional CMOS-based technology? If so, what are the basic research challenges for materials science and computing? The overarching answer that emerged was: The development of novel functional materials and devices incorporated into unique architectures will allow a revolutionary technological leap toward the implementation of a fully "neuromorphic" computer. To address this challenge, the following issues were considered: the main differences between neuromorphic and conventional computing as related to signaling models, timing/clock, non-volatile memory, architecture, fault tolerance, integrated memory and compute, noise tolerance, analog vs. digital, and in situ learning; new neuromorphic architectures needed to produce lower energy consumption, potential novel nanostructured materials, and enhanced computation; device and materials properties needed to implement functions such as hysteresis, stability, and fault tolerance; and comparisons of different implementations (spin torque, memristors, resistive switching, phase change, and optical schemes) for enhanced breakthroughs in performance, cost, fault tolerance, and/or manufacturability.

  5. Making classical ground-state spin computing fault-tolerant.

    PubMed

    Crosson, I J; Bacon, D; Brown, K R

    2010-09-01

    We examine a model of classical deterministic computing in which the ground state of the classical system is a spatial history of the computation. This model is relevant to quantum dot cellular automata as well as to recent universal adiabatic quantum computing constructions. In its most primitive form, systems constructed in this model cannot compute in an error-free manner when working at nonzero temperature. However, by exploiting a mapping between the partition function for this model and probabilistic classical circuits we are able to show that it is possible to make this model effectively error-free. We achieve this by using techniques in fault-tolerant classical computing and the result is that the system can compute effectively error-free if the temperature is below a critical temperature. We further link this model to computational complexity and show that a certain problem concerning finite temperature classical spin systems is complete for the complexity class Merlin-Arthur. This provides an interesting connection between the physical behavior of certain many-body spin systems and computational complexity.

  6. Fault-Tolerant Computing: An Overview

    DTIC Science & Technology

    1991-06-01

    Addison Wesley, Reading, MA, 1984. [8] J. Wakerly, Error Detecting Codes, Self-Checking Circuits and Applications, Elsevier North Holland, Inc., New York... applicable to bit-sliced organizations of hardware. In the first time step, the normal computation is performed on the operands and the results...for error detection and fault tolerance in parallel processor systems while performing specific computation-intensive applications [11]. Contrary to

  7. Design of a modular digital computer system, CDRL no. D001, final design plan

    NASA Technical Reports Server (NTRS)

    Easton, R. A.

    1975-01-01

    The engineering breadboard implementation for the CDRL no. D001 modular digital computer system developed during design of the logic system was documented. This effort followed the architecture study completed and documented previously, and was intended to verify the concepts of a fault tolerant, automatically reconfigurable, modular version of the computer system conceived during the architecture study. The system has a microprogrammed 32 bit word length, general register architecture and an instruction set consisting of a subset of the IBM System 360 instruction set plus additional fault tolerance firmware. The following areas were covered: breadboard packaging, central control element, central processing element, memory, input/output processor, and maintenance/status panel and electronics.

  8. Analysis of typical fault-tolerant architectures using HARP

    NASA Technical Reports Server (NTRS)

    Bavuso, Salvatore J.; Bechta Dugan, Joanne; Trivedi, Kishor S.; Rothmann, Elizabeth M.; Smith, W. Earl

    1987-01-01

    Difficulties encountered in the modeling of fault-tolerant systems are discussed. The Hybrid Automated Reliability Predictor (HARP) approach to modeling fault-tolerant systems is described. The HARP is written in FORTRAN, consists of nearly 30,000 lines of code and comments, and is based on behavioral decomposition. Using the behavioral decomposition, the dependability model is divided into fault-occurrence/repair and fault/error-handling models; the characteristics and combining of these two models are examined. Examples in which the HARP is applied to the modeling of some typical fault-tolerant systems, including a local-area network, two fault-tolerant computer systems, and a flight control system, are presented.

  9. Redundancy management for efficient fault recovery in NASA's distributed computing system

    NASA Technical Reports Server (NTRS)

    Malek, Miroslaw; Pandya, Mihir; Yau, Kitty

    1991-01-01

    The management of redundancy in computer systems was studied and guidelines were provided for the development of NASA's fault-tolerant distributed systems. Fault recovery and reconfiguration mechanisms were examined. A theoretical foundation was laid for redundancy management by efficient reconfiguration methods and algorithmic diversity. Algorithms were developed to optimize the resources for embedding of computational graphs of tasks in the system architecture and reconfiguration of these tasks after a failure has occurred. The computational structure represented by a path and the complete binary tree was considered and the mesh and hypercube architectures were targeted for their embeddings. The innovative concept of Hybrid Algorithm Technique was introduced. This new technique provides a mechanism for obtaining fault tolerance while exhibiting improved performance.

  10. Advanced cloud fault tolerance system

    NASA Astrophysics Data System (ADS)

    Sumangali, K.; Benny, Niketa

    2017-11-01

    Cloud computing has become a prevalent on-demand service on the internet to store, manage and process data. A pitfall that accompanies cloud computing is the possibility of failures in the cloud. To overcome these failures, we require a fault tolerance mechanism to abstract faults from users. We have proposed a fault tolerant architecture, which is a combination of proactive and reactive fault tolerance. This architecture essentially increases the reliability and the availability of the cloud. In the future, we would like to compare evaluations of our proposed architecture with existing architectures and further improve it.

  11. The use of automatic programming techniques for fault tolerant computing systems

    NASA Technical Reports Server (NTRS)

    Wild, C.

    1985-01-01

    It is conjectured that the production of software for ultra-reliable computing systems such as required by Space Station, aircraft, nuclear power plants and the like will require a high degree of automation as well as fault tolerance. In this paper, the relationship between automatic programming techniques and fault tolerant computing systems is explored. Initial efforts in the automatic synthesis of code from assertions to be used for error detection as well as the automatic generation of assertions and test cases from abstract data type specifications is outlined. Speculation on the ability to generate truly diverse designs capable of recovery from errors by exploring alternate paths in the program synthesis tree is discussed. Some initial thoughts on the use of knowledge based systems for the global detection of abnormal behavior using expectations and the goal-directed reconfiguration of resources to meet critical mission objectives are given. One of the sources of information for these systems would be the knowledge captured during the automatic programming process.

  12. Verification of fault-tolerant clock synchronization systems. M.S. Thesis - College of William and Mary, 1992

    NASA Technical Reports Server (NTRS)

    Miner, Paul S.

    1993-01-01

    A critical function in a fault-tolerant computer architecture is the synchronization of the redundant computing elements. The synchronization algorithm must include safeguards to ensure that failed components do not corrupt the behavior of good clocks. Reasoning about fault-tolerant clock synchronization is difficult because of the possibility of subtle interactions involving failed components. Therefore, mechanical proof systems are used to ensure that the verification of the synchronization system is correct. In 1987, Schneider presented a general proof of correctness for several fault-tolerant clock synchronization algorithms. Subsequently, Shankar verified Schneider's proof by using the mechanical proof system EHDM. This proof ensures that any system satisfying its underlying assumptions will provide Byzantine fault-tolerant clock synchronization. The utility of Shankar's mechanization of Schneider's theory for the verification of clock synchronization systems is explored. Some limitations of Shankar's mechanically verified theory were encountered. With minor modifications to the theory, a mechanically checked proof is provided that removes these limitations. The revised theory also allows for proven recovery from transient faults. Use of the revised theory is illustrated with the verification of an abstract design of a clock synchronization system.

  13. Formal design and verification of a reliable computing platform for real-time control. Phase 1: Results

    NASA Technical Reports Server (NTRS)

    Divito, Ben L.; Butler, Ricky W.; Caldwell, James L.

    1990-01-01

    A high-level design is presented for a reliable computing platform for real-time control applications. Design tradeoffs and analyses related to the development of the fault-tolerant computing platform are discussed. The architecture is formalized and shown to satisfy a key correctness property. The reliable computing platform uses replicated processors and majority voting to achieve fault tolerance. Under the assumption of a majority of processors working in each frame, it is shown that the replicated system computes the same results as a single processor system not subject to failures. Sufficient conditions are obtained to establish that the replicated system recovers from transient faults within a bounded amount of time. Three different voting schemes are examined and proved to satisfy the bounded recovery time conditions.

  14. Fault-tolerant building-block computer study

    NASA Technical Reports Server (NTRS)

    Rennels, D. A.

    1978-01-01

    Ultra-reliable core computers are required for improving the reliability of complex military systems. Such computers can provide reliable fault diagnosis, failure circumvention, and, in some cases, serve as an automated repairman for their host systems. A small set of building-block circuits which can be implemented as single very large scale integration (VLSI) devices, and which can be used with off-the-shelf microprocessors and memories to build self-checking computer modules (SCCM), is described. Each SCCM is a microcomputer which is capable of detecting its own faults during normal operation and is designed to communicate with other identical modules over one or more MIL-STD-1553A buses. Several SCCMs can be connected into a network with backup spares to provide fault-tolerant operation, i.e., automated recovery from faults. Alternative fault-tolerant SCCM configurations are discussed along with the cost and reliability associated with their implementation.

  15. Advanced information processing system: Hosting of advanced guidance, navigation and control algorithms on AIPS using ASTER

    NASA Technical Reports Server (NTRS)

    Brenner, Richard; Lala, Jaynarayan H.; Nagle, Gail A.; Schor, Andrei; Turkovich, John

    1994-01-01

    This program demonstrated the integration of a number of technologies that can increase the availability and reliability of launch vehicles while lowering costs. Availability is increased with an advanced guidance algorithm that adapts trajectories in real-time. Reliability is increased with fault-tolerant computers and communication protocols. Costs are reduced by automatically generating code and documentation. This program was realized through the cooperative efforts of academia, industry, and government. The NASA-LaRC coordinated the effort, while Draper performed the integration. Georgia Institute of Technology supplied a weak Hamiltonian finite element method for optimal control problems. Martin Marietta used MATLAB to apply this method to a launch vehicle (FENOC). Draper supplied the fault-tolerant computing and software automation technology. The fault-tolerant technology includes sequential and parallel fault-tolerant processors (FTP & FTPP) and authentication protocols (AP) for communication. Fault-tolerant technology was incrementally incorporated. Development culminated with a heterogeneous network of workstations and fault-tolerant computers using AP. Draper's software automation system, ASTER, was used to specify a static guidance system based on FENOC, navigation, flight control (GN&C), models, and the interface to a user interface for mission control. ASTER generated Ada code for GN&C and C code for models. An algebraic transform engine (ATE) was developed to automatically translate MATLAB scripts into ASTER.

  16. Final Project Report. Scalable fault tolerance runtime technology for petascale computers

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Krishnamoorthy, Sriram; Sadayappan, P

    With the massive number of components comprising the forthcoming petascale computer systems, hardware failures will be routinely encountered during execution of large-scale applications. Due to the multidisciplinary, multiresolution, and multiscale nature of scientific problems that drive the demand for high end systems, applications place increasingly differing demands on the system resources: disk, network, memory, and CPU. In addition to MPI, future applications are expected to use advanced programming models such as those developed under the DARPA HPCS program as well as existing global address space programming models such as Global Arrays, UPC, and Co-Array Fortran. While there has been a considerable amount of work in fault tolerant MPI, with a number of strategies and extensions for fault tolerance proposed, virtually none of the advanced models proposed for emerging petascale systems is currently fault aware. To achieve fault tolerance, development of underlying runtime and OS technologies able to scale to the petascale level is needed. This project has evaluated a range of runtime techniques for fault tolerance for advanced programming models.

  17. Quantum Error Correction

    NASA Astrophysics Data System (ADS)

    Lidar, Daniel A.; Brun, Todd A.

    2013-09-01

    Prologue; Preface; Part I. Background: 1. Introduction to decoherence and noise in open quantum systems Daniel Lidar and Todd Brun; 2. Introduction to quantum error correction Dave Bacon; 3. Introduction to decoherence-free subspaces and noiseless subsystems Daniel Lidar; 4. Introduction to quantum dynamical decoupling Lorenza Viola; 5. Introduction to quantum fault tolerance Panos Aliferis; Part II. Generalized Approaches to Quantum Error Correction: 6. Operator quantum error correction David Kribs and David Poulin; 7. Entanglement-assisted quantum error-correcting codes Todd Brun and Min-Hsiu Hsieh; 8. Continuous-time quantum error correction Ognyan Oreshkov; Part III. Advanced Quantum Codes: 9. Quantum convolutional codes Mark Wilde; 10. Non-additive quantum codes Markus Grassl and Martin Rötteler; 11. Iterative quantum coding systems David Poulin; 12. Algebraic quantum coding theory Andreas Klappenecker; 13. Optimization-based quantum error correction Andrew Fletcher; Part IV. Advanced Dynamical Decoupling: 14. High order dynamical decoupling Zhen-Yu Wang and Ren-Bao Liu; 15. Combinatorial approaches to dynamical decoupling Martin Rötteler and Pawel Wocjan; Part V. Alternative Quantum Computation Approaches: 16. Holonomic quantum computation Paolo Zanardi; 17. Fault tolerance for holonomic quantum computation Ognyan Oreshkov, Todd Brun and Daniel Lidar; 18. Fault tolerant measurement-based quantum computing Debbie Leung; Part VI. Topological Methods: 19. Topological codes Héctor Bombín; 20. Fault tolerant topological cluster state quantum computing Austin Fowler and Kovid Goyal; Part VII. Applications and Implementations: 21. Experimental quantum error correction Dave Bacon; 22. Experimental dynamical decoupling Lorenza Viola; 23. Architectures Jacob Taylor; 24. Error correction in quantum communication Mark Wilde; Part VIII. Critical Evaluation of Fault Tolerance: 25. Hamiltonian methods in QEC and fault tolerance Eduardo Novais, Eduardo Mucciolo and Harold Baranger; 26. Critique of fault-tolerant quantum information processing Robert Alicki; References; Index.

  18. Fault tolerant software modules for SIFT

    NASA Technical Reports Server (NTRS)

    Hecht, M.; Hecht, H.

    1982-01-01

    The implementation of software fault tolerance is investigated for critical modules of the Software Implemented Fault Tolerance (SIFT) operating system to support the computational and reliability requirements of advanced fly-by-wire transport aircraft. Fault tolerant designs generated for the error reporter and global executive are examined. A description of the alternate routines, implementation requirements, and software validation are included.

  19. Care 3, phase 1, volume 2

    NASA Technical Reports Server (NTRS)

    Stiffler, J. J.; Bryant, L. A.; Guccione, L.

    1979-01-01

    A computer program was developed as a general purpose reliability tool for fault tolerant avionics systems. The computer program requirements, together with several appendices containing computer printouts are presented.

  20. Tutorial: Advanced fault tree applications using HARP

    NASA Technical Reports Server (NTRS)

    Dugan, Joanne Bechta; Bavuso, Salvatore J.; Boyd, Mark A.

    1993-01-01

    Reliability analysis of fault tolerant computer systems for critical applications is complicated by several factors. These modeling difficulties are discussed and dynamic fault tree modeling techniques for handling them are described and demonstrated. Several advanced fault tolerant computer systems are described, and fault tree models for their analysis are presented. HARP (Hybrid Automated Reliability Predictor) is a software package developed at Duke University and NASA Langley Research Center that is capable of solving the fault tree models presented.

  1. Fault tolerance in a supercomputer through dynamic repartitioning

    DOEpatents

    Chen, Dong; Coteus, Paul W.; Gara, Alan G.; Takken, Todd E.

    2007-02-27

    A multiprocessor, parallel computer is made tolerant to hardware failures by providing extra groups of redundant standby processors and by designing the system so that these extra groups of processors can be swapped with any group which experiences a hardware failure. This swapping can be under software control, thereby permitting the entire computer to sustain a hardware failure but, after swapping in the standby processors, to still appear to software as a pristine, fully functioning system.
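
    As a hedged illustration of the swapping idea (hypothetical names and structure, not the patented design), a partition map can route each logical processor group to a physical group and swap in a spare when a failure is reported, so the software above it continues to see a fully functioning machine:

      # Minimal sketch of the swapping idea (hypothetical names, not the patented
      # implementation): a partition map routes logical processor groups to physical
      # groups and swaps in a spare group when a failure is reported.
      class Partition:
          def __init__(self, active_groups, spare_groups):
              self.map = {i: g for i, g in enumerate(active_groups)}  # logical -> physical
              self.spares = list(spare_groups)

          def report_failure(self, logical_id):
              """Swap a spare physical group in place of a failed one."""
              if not self.spares:
                  raise RuntimeError("no spare groups left; system must degrade")
              failed = self.map[logical_id]
              self.map[logical_id] = self.spares.pop(0)
              return failed   # handed off for diagnosis or repair

      # Usage: software keeps addressing logical group 1; the remap is invisible to it.
      part = Partition(active_groups=["rack0", "rack1"], spare_groups=["spare0"])
      part.report_failure(1)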

  2. Design of a fault tolerant airborne digital computer. Volume 1: Architecture

    NASA Technical Reports Server (NTRS)

    Wensley, J. H.; Levitt, K. N.; Green, M. W.; Goldberg, J.; Neumann, P. G.

    1973-01-01

    This volume is concerned with the architecture of a fault tolerant digital computer for an advanced commercial aircraft. All of the computations of the aircraft, including those presently carried out by analogue techniques, are to be carried out in this digital computer. Among the important qualities of the computer are the following: (1) The capacity is to be matched to the aircraft environment. (2) The reliability is to be selectively matched to the criticality and deadline requirements of each of the computations. (3) The system is to be readily expandable and contractible. (4) The design is to be appropriate to post-1975 technology. Three candidate architectures are discussed and assessed in terms of the above qualities. Of the three candidates, a newly conceived architecture, Software Implemented Fault Tolerance (SIFT), provides the best match to the above qualities. In addition, SIFT is particularly simple and believable. The other candidates, Bus Checker System (BUCS), also newly conceived in this project, and the Hopkins multiprocessor, are potentially more efficient than SIFT in the use of redundancy, but otherwise are not as attractive.

  3. A forward view on reliable computers for flight control

    NASA Technical Reports Server (NTRS)

    Goldberg, J.; Wensley, J. H.

    1976-01-01

    The requirements for fault-tolerant computers for flight control of commercial aircraft are examined; it is concluded that the reliability requirements far exceed those typically quoted for space missions. Examination of circuit technology and alternative computer architectures indicates that the desired reliability can be achieved with several different computer structures, though there are obvious advantages to those that are more economic, more reliable, and, very importantly, more certifiable as to fault tolerance. Progress in this field is expected to bring about better computer systems that are more rigorously designed and analyzed even though computational requirements are expected to increase significantly.

  4. A minimum cost tolerance allocation method for rocket engines and robust rocket engine design

    NASA Technical Reports Server (NTRS)

    Gerth, Richard J.

    1993-01-01

    Rocket engine design follows three phases: systems design, parameter design, and tolerance design. Systems design and parameter design are most effectively conducted in a concurrent engineering (CE) environment that utilizes methods such as Quality Function Deployment and Taguchi methods. However, tolerance allocation remains an art driven by experience, handbooks, and rules of thumb. It was therefore desirable to develop an optimization approach to tolerancing. The case study engine was the STME gas generator cycle. The design of the major components had been completed and the functional relationship between the component tolerances and system performance had been computed using the Generic Power Balance model. The system performance nominals (thrust, MR, and Isp) and tolerances were already specified, as were an initial set of component tolerances. However, the question was whether there existed an optimal combination of tolerances that would result in the minimum cost without any degradation in system performance.

  5. Design of on-board Bluetooth wireless network system based on fault-tolerant technology

    NASA Astrophysics Data System (ADS)

    You, Zheng; Zhang, Xiangqi; Yu, Shijie; Tian, Hexiang

    2007-11-01

    In this paper, Bluetooth wireless data transmission technology is applied to an on-board computer system to realize wireless data transmission between peripherals of the micro-satellite's integrated electronic system. In view of the high reliability demanded of a micro-satellite, a design for an on-board Bluetooth wireless network based on fault-tolerant technology is introduced. The reliability of two fault-tolerant systems is first estimated using a Markov model; the structural design of this fault-tolerant system is then introduced. Several protocols are established to make the system operate correctly, and some related problems are listed and analyzed, with emphasis on the fault auto-diagnosis system, the active-standby switch design, and data-integrity processing.
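
    For orientation only, a textbook Markov result for the kind of active-standby pair described above (an illustration under assumed constant failure rate and perfect repairlessness, not figures from the paper) shows the reliability gain of the standby arrangement over a single unit:

      R_{\text{single}}(t) = e^{-\lambda t},
      \qquad
      R_{\text{standby}}(t) = e^{-\lambda t}\,(1 + c\,\lambda t),

    where \lambda is the per-unit failure rate and c is the probability that the failure is detected and the switchover succeeds; c = 1 recovers the ideal standby pair. Quantifying c is precisely where a Markov model of the switch and diagnosis logic is needed.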

  6. An improved ant colony optimization algorithm with fault tolerance for job scheduling in grid computing systems

    PubMed Central

    Idris, Hajara; Junaidu, Sahalu B.; Adewumi, Aderemi O.

    2017-01-01

    The Grid scheduler schedules user jobs on the best available resource in terms of resource characteristics by optimizing job execution time. Resource failure in Grid is no longer an exception but a regularly occurring event, as resources are increasingly being used by the scientific community to solve computationally intensive problems which typically run for days or even months. It is therefore absolutely essential that these long-running applications are able to tolerate failures and avoid re-computations from scratch after resource failure has occurred, to satisfy the user’s Quality of Service (QoS) requirement. Job Scheduling with Fault Tolerance in Grid Computing using Ant Colony Optimization is proposed to ensure that jobs are executed successfully even when resource failure has occurred. The technique employed in this paper is the use of resource failure rate, as well as a checkpoint-based roll back recovery strategy. Check-pointing aims at reducing the amount of work that is lost upon failure of the system by immediately saving the state of the system. A comparison of the proposed approach with an existing Ant Colony Optimization (ACO) algorithm is discussed. The experimental results of the implemented Fault Tolerance scheduling algorithm show that there is an improvement in the user’s QoS requirement over the existing ACO algorithm, which has no fault tolerance integrated in it. The performance evaluation of the two algorithms was measured in terms of the three main scheduling performance metrics: makespan, throughput and average turnaround time. PMID:28545075
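
    A minimal sketch of the two ingredients named in the abstract, written as an assumption about how they might fit together rather than as the authors' actual algorithm (the resource fields, weights and checkpoint interval below are illustrative): resource selection is biased by pheromone and by the observed failure rate, and checkpointing lets a failed job resume instead of restarting from scratch.

      import random

      def select_resource(resources, pheromone, alpha=1.0, beta=2.0):
          """resources: dict name -> {'speed': ops per s, 'fail_rate': failures per hour}."""
          weights = {}
          for name, r in resources.items():
              heuristic = r['speed'] * (1.0 - r['fail_rate'])     # prefer fast, reliable nodes
              weights[name] = (pheromone[name] ** alpha) * (heuristic ** beta)
          pick, acc = random.uniform(0, sum(weights.values())), 0.0
          for name, w in weights.items():                          # roulette-wheel selection
              acc += w
              if pick <= acc:
                  return name
          return name

      def run_with_checkpoints(job_steps, fail_prob):
          """Checkpoint after every step: a failure loses only the current step's work."""
          done = 0
          while done < job_steps:
              if random.random() < fail_prob:
                  continue          # roll back to the last checkpoint and retry
              done += 1
          return done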

  7. Cost and benefits design optimization model for fault tolerant flight control systems

    NASA Technical Reports Server (NTRS)

    Rose, J.

    1982-01-01

    Requirements and specifications for a method of optimizing the design of fault-tolerant flight control systems are provided. Algorithms that could be used for developing new and modifying existing computer programs are also provided, with recommendations for follow-on work.

  8. Agent Based Fault Tolerance for the Mobile Environment

    NASA Astrophysics Data System (ADS)

    Park, Taesoon

    This paper presents a fault-tolerance scheme based on mobile agents for reliable mobile computing systems. The mobility of the agent makes it suitable for tracing mobile hosts, and the intelligence of the agent makes it efficient in supporting fault-tolerance services. This paper presents two approaches to implementing the mobile-agent-based fault-tolerant service, and their performances are evaluated and compared with other fault-tolerant schemes.

  9. Assessment of spare reliability for multi-state computer networks within tolerable packet unreliability

    NASA Astrophysics Data System (ADS)

    Lin, Yi-Kuei; Huang, Cheng-Fu

    2015-04-01

    From a quality of service viewpoint, the transmission packet unreliability and transmission time are both critical performance indicators in a computer system when assessing the Internet quality for supervisors and customers. A computer system is usually modelled as a network topology where each branch denotes a transmission medium and each vertex represents a station of servers. Almost every branch has multiple capacities/states due to failure, partial failure, maintenance, etc. This type of network is known as a multi-state computer network (MSCN). This paper proposes an efficient algorithm that computes the system reliability, i.e., the probability that a specified amount of data can be sent through k (k ≥ 2) disjoint minimal paths within both the tolerable packet unreliability and time threshold. Furthermore, two routing schemes are established in advance to indicate the main and spare minimal paths to increase the system reliability (referred to as spare reliability). Thus, the spare reliability can be readily computed according to the routing scheme.
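
    The paper's algorithm is analytic, but the quantity it computes can be pictured with a brute-force Monte Carlo sketch (the branch-state data are hypothetical, and the time and packet-unreliability thresholds are folded into the demand figure for brevity): estimate the probability that a demand can be split across disjoint minimal paths whose branch capacities are multi-state random variables.

      import random

      def sample_capacity(states):
          """states: list of (capacity, probability) pairs for one branch."""
          r, acc = random.random(), 0.0
          for cap, p in states:
              acc += p
              if r <= acc:
                  return cap
          return states[-1][0]

      def path_set_reliability(paths, demand, trials=100_000):
          """paths: disjoint minimal paths, each a list of per-branch state distributions."""
          ok = 0
          for _ in range(trials):
              caps = [min(sample_capacity(b) for b in path) for path in paths]  # bottleneck branch
              if sum(caps) >= demand:
                  ok += 1
          return ok / trials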

  10. Experimental Demonstration of Fault-Tolerant State Preparation with Superconducting Qubits.

    PubMed

    Takita, Maika; Cross, Andrew W; Córcoles, A D; Chow, Jerry M; Gambetta, Jay M

    2017-11-03

    Robust quantum computation requires encoding delicate quantum information into degrees of freedom that are hard for the environment to change. Quantum encodings have been demonstrated in many physical systems by observing and correcting storage errors, but applications require not just storing information; we must accurately compute even with faulty operations. The theory of fault-tolerant quantum computing illuminates a way forward by providing a foundation and collection of techniques for limiting the spread of errors. Here we implement one of the smallest quantum codes in a five-qubit superconducting transmon device and demonstrate fault-tolerant state preparation. We characterize the resulting code words through quantum process tomography and study the free evolution of the logical observables. Our results are consistent with fault-tolerant state preparation in a protected qubit subspace.

  11. MAX - An advanced parallel computer for space applications

    NASA Technical Reports Server (NTRS)

    Lewis, Blair F.; Bunker, Robert L.

    1991-01-01

    MAX is a fault-tolerant multicomputer hardware and software architecture designed to meet the needs of NASA spacecraft systems. It consists of conventional computing modules (computers) connected via a dual network topology. One network is used to transfer data among the computers and between computers and I/O devices. This network's topology is arbitrary. The second network operates as a broadcast medium for operating system synchronization messages and supports the operating system's Byzantine resilience. A fully distributed operating system supports multitasking in an asynchronous event and data driven environment. A large grain dataflow paradigm is used to coordinate the multitasking and provide easy control of concurrency. It is the basis of the system's fault tolerance and allows both static and dynamical location of tasks. Redundant execution of tasks with software voting of results may be specified for critical tasks. The dataflow paradigm also supports simplified software design, test and maintenance. A unique feature is a method for reliably patching code in an executing dataflow application.

  12. Verification of the FtCayuga fault-tolerant microprocessor system. Volume 1: A case study in theorem prover-based verification

    NASA Technical Reports Server (NTRS)

    Srivas, Mandayam; Bickford, Mark

    1991-01-01

    The design and formal verification of a hardware system for a task that is an important component of a fault tolerant computer architecture for flight control systems is presented. The hardware system implements an algorithm for obtaining interactive consistency (Byzantine agreement) among four microprocessors as a special instruction on the processors. The property verified ensures that an execution of the special instruction by the processors correctly accomplishes interactive consistency, provided certain preconditions hold. An assumption is made that the processors execute synchronously. For verification, the authors used a computer-aided hardware design verification tool, Spectool, and the theorem prover, Clio. A major contribution of the work is the demonstration of a significant fault tolerant hardware design that is mechanically verified by a theorem prover.

  13. Guest Editor's Introduction: Special section on dependable distributed systems

    NASA Astrophysics Data System (ADS)

    Fetzer, Christof

    1999-09-01

    We rely more and more on computers. For example, the Internet reshapes the way we do business. A `computer outage' can cost a company a substantial amount of money. Not only with respect to the business lost during an outage, but also with respect to the negative publicity the company receives. This is especially true for Internet companies. After recent computer outages of Internet companies, we have seen a drastic fall of the shares of the affected companies. There are multiple causes for computer outages. Although computer hardware becomes more reliable, hardware related outages remain an important issue. For example, some of the recent computer outages of companies were caused by failed memory and system boards, and even by crashed disks - a failure type which can easily be masked using disk mirroring. Transient hardware failures might also look like software failures and, hence, might be incorrectly classified as such. However, many outages are software related. Faulty system software, middleware, and application software can crash a system. Dependable computing systems are systems we can rely on. Dependable systems are, by definition, reliable, available, safe and secure [3]. This special section focuses on issues related to dependable distributed systems. Distributed systems have the potential to be more dependable than a single computer because the probability that all computers in a distributed system fail is smaller than the probability that a single computer fails. However, if a distributed system is not built well, it is potentially less dependable than a single computer since the probability that at least one computer in a distributed system fails is higher than the probability that one computer fails. For example, if the crash of any computer in a distributed system can bring the complete system to a halt, the system is less dependable than a single-computer system. Building dependable distributed systems is an extremely difficult task. There is no silver bullet solution. Instead one has to apply a variety of engineering techniques [2]: fault-avoidance (minimize the occurrence of faults, e.g. by using a proper design process), fault-removal (remove faults before they occur, e.g. by testing), fault-evasion (predict faults by monitoring and reconfigure the system before failures occur), and fault-tolerance (mask and/or contain failures). Building a system from scratch is an expensive and time consuming effort. To reduce the cost of building dependable distributed systems, one would choose to use commercial off-the-shelf (COTS) components whenever possible. The usage of COTS components has several potential advantages beyond minimizing costs. For example, through the widespread usage of a COTS component, design failures might be detected and fixed before the component is used in a dependable system. Custom-designed components have to mature without the widespread in-field testing of COTS components. COTS components have various potential disadvantages when used in dependable systems. For example, minimizing the time to market might lead to the release of components with inherent design faults (e.g. use of `shortcuts' that only work most of the time). In addition, the components might be more complex than needed and, hence, potentially have more design faults than simpler components. 
However, given economic constraints and the ability to cope with some of the problems using fault-evasion and fault-tolerance, only for a small percentage of systems can one justify not using COTS components. Distributed systems built from current COTS components are asynchronous systems in the sense that there exists no a priori known bound on the transmission delay of messages or the execution time of processes. When designing a distributed algorithm, one would like to make sure (e.g. by testing or verification) that it is correct, i.e. satisfies its specification. Many distributed algorithms make use of consensus (eventually all non-crashed processes have to agree on a value), leader election (a crashed leader is eventually replaced by a new leader, but at any time there is at most one leader) or a group membership detection service (a crashed process is eventually suspected to have crashed but only crashed processes are suspected). From a theoretical point of view, the service specifications given for such services are not implementable in asynchronous systems. In particular, for each implementation one can derive a counter example in which the service violates its specification. From a practical point of view, the consensus, the leader election, and the membership detection problem are solvable in asynchronous distributed systems. In this special section, Raynal and Tronel show how to bridge this difference by showing how to implement the group membership detection problem with a negligible probability [1] to fail in an asynchronous system. The group membership detection problem is specified by a liveness condition (L) and a safety property (S): (L) if a process p crashes, then eventually every non-crashed process q has to suspect that p has crashed; and (S) if a process q suspects p, then p has indeed crashed. One can show that either (L) or (S) is implementable, but one cannot implement both (L) and (S) at the same time in an asynchronous system. In practice, one only needs to implement (L) and (S) such that the probability that (L) or (S) is violated becomes negligible. Raynal and Tronel propose and analyse a protocol that implements (L) with certainty and that can be tuned such that the probability that (S) is violated becomes negligible. Designing and implementing distributed fault-tolerant protocols for asynchronous systems is a difficult but not an impossible task. A fault-tolerant protocol has to detect and mask certain failure classes, e.g. crash failures and message omission failures. There is a trade-off between the performance of a fault-tolerant protocol and the failure classes the protocol can tolerate. One wants to tolerate as many failure classes as needed to satisfy the stochastic requirements of the protocol [1] while still maintaining a sufficient performance. Since clients of a protocol have different requirements with respect to the performance/fault-tolerance trade-off, one would like to be able to customize protocols such that one can select an appropriate performance/fault-tolerance trade-off. In this special section Hiltunen et al describe how one can compose protocols from micro-protocols in their Cactus system. They show how a group RPC system can be tailored to the needs of a client. In particular, they show how considering additional failure classes affects the performance of a group RPC system. 
References: [1] Cristian F 1991 Understanding fault-tolerant distributed systems Communications of the ACM 34 (2) 56-78. [2] Heimerdinger W L and Weinstock C B 1992 A conceptual framework for system fault tolerance Technical Report CMU/SEI-92-TR-33, Software Engineering Institute. [3] Laprie J C (ed) 1992 Dependability: Basic Concepts and Terminology (Vienna: Springer).
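
    The (L)/(S) trade-off above is the tunable knob in practice. As a hedged illustration only (a plain heartbeat-and-timeout detector, not Raynal and Tronel's protocol), liveness holds unconditionally because a crashed process stops sending heartbeats and eventually exceeds any finite timeout, while the probability of violating safety, i.e. suspecting a slow but live process, can be made as small as desired by enlarging the timeout:

      import time

      class HeartbeatDetector:
          def __init__(self, timeout_s):
              self.timeout = timeout_s          # larger timeout -> smaller chance of violating (S)
              self.last_seen = {}

          def heartbeat(self, pid):
              self.last_seen[pid] = time.monotonic()

          def suspects(self):
              now = time.monotonic()
              return [p for p, t in self.last_seen.items() if now - t > self.timeout]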

  14. Advanced information processing system: The Army Fault-Tolerant Architecture detailed design overview

    NASA Technical Reports Server (NTRS)

    Harper, Richard E.; Babikyan, Carol A.; Butler, Bryan P.; Clasen, Robert J.; Harris, Chris H.; Lala, Jaynarayan H.; Masotto, Thomas K.; Nagle, Gail A.; Prizant, Mark J.; Treadwell, Steven

    1994-01-01

    The Army Avionics Research and Development Activity (AVRADA) is pursuing programs that would enable effective and efficient management of the large amounts of situational data that occur during tactical rotorcraft missions. The Computer Aided Low Altitude Night Helicopter Flight Program has identified automated Terrain Following/Terrain Avoidance, Nap of the Earth (TF/TA, NOE) operation as a key enabling technology for advanced tactical rotorcraft to enhance mission survivability and mission effectiveness. The processing of critical information at low altitudes with short reaction times is life-critical and mission-critical, necessitating an ultra-reliable, high-throughput computing platform for dependable service for flight control, fusion of sensor data, route planning, near-field/far-field navigation, and obstacle avoidance operations. To address these needs, the Army Fault Tolerant Architecture (AFTA) is being designed and developed. This computer system is based upon the Fault Tolerant Parallel Processor (FTPP) developed by Charles Stark Draper Labs (CSDL). AFTA is a hard real-time, Byzantine fault-tolerant parallel processor programmed in the Ada language. This document describes the results of the Detailed Design (Phases 2 and 3 of a 3-year project) of the AFTA development. This document contains detailed descriptions of the program objectives, the TF/TA NOE application requirements, architecture, hardware design, operating systems design, systems performance measurements and analytical models.

  15. Fault Tolerant Real-Time Networks

    DTIC Science & Technology

    2007-05-30

    Alberto Sangiovanni-Vincentelli, editors, Hybrid Systems: Computation and Control, Fourth International Workshop (HSCC 2001), Rome, Italy, March 2001 ... average dwell time by solving optimization problems. In Ashish Tiwari and Joao P. Hespanha, editors, Hybrid Systems: Computation and Control (HSCC 06) ...

  16. Progressive Fracture and Damage Tolerance of Composite Pressure Vessels

    NASA Technical Reports Server (NTRS)

    Chamis, Christos C.; Gotsis, Pascal K.; Minnetyan, Levon

    1997-01-01

    Structural performance (integrity, durability and damage tolerance) of fiber reinforced composite pressure vessels, designed as pressurized shelters for planetary exploration, is investigated via computational simulation. An integrated computer code is utilized for the simulation of damage initiation, growth, and propagation under pressure. Aramid fibers are considered in a rubbery polymer matrix for the composite system. Effects of fiber orientation and of fabrication defects and accidental damage are investigated with regard to the safety and durability of the shelter. Results show the viability of fiber reinforced pressure vessels as damage tolerant shelters for planetary colonization.

  17. Advanced information processing system: The Army fault tolerant architecture conceptual study. Volume 1: Army fault tolerant architecture overview

    NASA Technical Reports Server (NTRS)

    Harper, R. E.; Alger, L. S.; Babikyan, C. A.; Butler, B. P.; Friend, S. A.; Ganska, R. J.; Lala, J. H.; Masotto, T. K.; Meyer, A. J.; Morton, D. P.

    1992-01-01

    Digital computing systems needed for Army programs such as the Computer-Aided Low Altitude Helicopter Flight Program and the Armored Systems Modernization (ASM) vehicles may be characterized by high computational throughput and input/output bandwidth, hard real-time response, high reliability and availability, and maintainability, testability, and producibility requirements. In addition, such a system should be affordable to produce, procure, maintain, and upgrade. To address these needs, the Army Fault Tolerant Architecture (AFTA) is being designed and constructed under a three-year program comprised of a conceptual study, detailed design and fabrication, and demonstration and validation phases. Described here are the results of the conceptual study phase of the AFTA development. Given here is an introduction to the AFTA program, its objectives, and key elements of its technical approach. A format is designed for representing mission requirements in a manner suitable for first order AFTA sizing and analysis, followed by a discussion of the current state of mission requirements acquisition for the targeted Army missions. An overview is given of AFTA's architectural theory of operation.

  18. Software fault tolerance in computer operating systems

    NASA Technical Reports Server (NTRS)

    Iyer, Ravishankar K.; Lee, Inhwan

    1994-01-01

    This chapter provides data and analysis of the dependability and fault tolerance for three operating systems: the Tandem/GUARDIAN fault-tolerant system, the VAX/VMS distributed system, and the IBM/MVS system. Based on measurements from these systems, basic software error characteristics are investigated. Fault tolerance in operating systems resulting from the use of process pairs and recovery routines is evaluated. Two levels of models are developed to analyze error and recovery processes inside an operating system and interactions among multiple instances of an operating system running in a distributed environment. The measurements show that the use of process pairs in Tandem systems, which was originally intended for tolerating hardware faults, allows the system to tolerate about 70% of defects in system software that result in processor failures. The loose coupling between processors which results in the backup execution (the processor state and the sequence of events occurring) being different from the original execution is a major reason for the measured software fault tolerance. The IBM/MVS system fault tolerance almost doubles when recovery routines are provided, in comparison to the case in which no recovery routines are available. However, even when recovery routines are provided, there is almost a 50% chance of system failure when critical system jobs are involved.
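
    The process-pair mechanism described above can be pictured with a small sketch (the structure is an assumption for illustration, not Tandem/GUARDIAN internals): the primary checkpoints its state to the backup after each step, and the backup resumes from the last checkpoint when the primary halts; because the backup's re-execution proceeds in a different context, many software faults that halted the primary do not recur.

      class ProcessPair:
          def __init__(self, work_fn, state=None):
              self.work_fn = work_fn                             # work_fn(state) -> new state
              self.primary_state = dict(state or {})
              self.backup_state = dict(self.primary_state)

          def step(self):
              try:
                  self.primary_state = self.work_fn(self.primary_state)
                  self.backup_state = dict(self.primary_state)   # checkpoint to the backup
              except Exception:
                  self.primary_state = dict(self.backup_state)   # backup takes over from checkpoint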

  19. Techniques for modeling the reliability of fault-tolerant systems with the Markov state-space approach

    NASA Technical Reports Server (NTRS)

    Butler, Ricky W.; Johnson, Sally C.

    1995-01-01

    This paper presents a step-by-step tutorial of the methods and the tools that were used for the reliability analysis of fault-tolerant systems. The approach used in this paper is the Markov (or semi-Markov) state-space method. The paper is intended for design engineers with a basic understanding of computer architecture and fault tolerance, but little knowledge of reliability modeling. The representation of architectural features in mathematical models is emphasized. This paper does not present details of the mathematical solution of complex reliability models. Instead, it describes the use of several recently developed computer programs (SURE, ASSIST, STEM, and PAWS) that automate the generation and the solution of these models.
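
    As a taste of the state-space method (a deliberately tiny model with assumed numbers, not an example from the paper), the sketch below builds the generator matrix for a duplex processor with per-unit failure rate lam and fault-detection coverage c, and evaluates the probability of the absorbing failure state at a few mission times:

      import numpy as np
      from scipy.linalg import expm

      # States: 0 = both units good, 1 = one good (covered failure), 2 = failed (absorbing).
      lam, c = 1e-4, 0.99                                  # failures/hour, coverage (assumed)
      Q = np.array([
          [-2 * lam,  2 * lam * c,  2 * lam * (1 - c)],    # an uncovered fault fails the system
          [     0.0,         -lam,               lam ],
          [     0.0,          0.0,               0.0 ],
      ])
      p0 = np.array([1.0, 0.0, 0.0])
      for t in (10.0, 100.0, 1000.0):                      # mission times in hours
          p_t = p0 @ expm(Q * t)                           # transient state probabilities
          print(f"t = {t:6.0f} h   P(system failed) = {p_t[2]:.3e}")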

  20. Exploring the use of I/O nodes for computation in a MIMD multiprocessor

    NASA Technical Reports Server (NTRS)

    Kotz, David; Cai, Ting

    1995-01-01

    As parallel systems move into the production scientific-computing world, the emphasis will be on cost-effective solutions that provide high throughput for a mix of applications. Cost effective solutions demand that a system make effective use of all of its resources. Many MIMD multiprocessors today, however, distinguish between 'compute' and 'I/O' nodes, the latter having attached disks and being dedicated to running the file-system server. This static division of responsibilities simplifies system management but does not necessarily lead to the best performance in workloads that need a different balance of computation and I/O. Of course, computational processes sharing a node with a file-system service may receive less CPU time, network bandwidth, and memory bandwidth than they would on a computation-only node. In this paper we begin to examine this issue experimentally. We found that high performance I/O does not necessarily require substantial CPU time, leaving plenty of time for application computation. There were some complex file-system requests, however, which left little CPU time available to the application. (The impact on network and memory bandwidth still needs to be determined.) For applications (or users) that cannot tolerate an occasional interruption, we recommend that they continue to use only compute nodes. For tolerant applications needing more cycles than those provided by the compute nodes, we recommend that they take full advantage of both compute and I/O nodes for computation, and that operating systems should make this possible.

  1. Holonomic surface codes for fault-tolerant quantum computation

    NASA Astrophysics Data System (ADS)

    Zhang, Jiang; Devitt, Simon J.; You, J. Q.; Nori, Franco

    2018-02-01

    Surface codes can protect quantum information stored in qubits from local errors as long as the per-operation error rate is below a certain threshold. Here we propose holonomic surface codes by harnessing the quantum holonomy of the system. In our scheme, the holonomic gates are built via auxiliary qubits rather than the auxiliary levels in multilevel systems used in conventional holonomic quantum computation. The key advantage of our approach is that the auxiliary qubits are in their ground state before and after each gate operation, so they are not involved in the operation cycles of surface codes. This provides an advantageous way to implement surface codes for fault-tolerant quantum computation.

  2. Flight test results of the Strapdown hexad Inertial Reference Unit (SIRU). Volume 1: Flight test summary

    NASA Technical Reports Server (NTRS)

    Hruby, R. J.; Bjorkman, W. S.

    1977-01-01

    Flight test results of the strapdown inertial reference unit (SIRU) navigation system are presented. The fault-tolerant SIRU navigation system features a redundant inertial sensor unit and dual computers. System software provides for detection and isolation of inertial sensor failures and continued operation in the event of failures. Flight test results include assessments of the system's navigational performance and fault tolerance.

  3. Partitioning in Avionics Architectures: Requirements, Mechanisms, and Assurance

    NASA Technical Reports Server (NTRS)

    Rushby, John

    1999-01-01

    Automated aircraft control has traditionally been divided into distinct "functions" that are implemented separately (e.g., autopilot, autothrottle, flight management); each function has its own fault-tolerant computer system, and dependencies among different functions are generally limited to the exchange of sensor and control data. A by-product of this "federated" architecture is that faults are strongly contained within the computer system of the function where they occur and cannot readily propagate to affect the operation of other functions. More modern avionics architectures contemplate supporting multiple functions on a single, shared, fault-tolerant computer system where natural fault containment boundaries are less sharply defined. Partitioning uses appropriate hardware and software mechanisms to restore strong fault containment to such integrated architectures. This report examines the requirements for partitioning, mechanisms for their realization, and issues in providing assurance for partitioning. Because partitioning shares some concerns with computer security, security models are reviewed and compared with the concerns of partitioning.

  4. Time Triggered Protocol (TTP) for Integrated Modular Avionics

    NASA Technical Reports Server (NTRS)

    Motzet, Guenter; Gwaltney, David A.; Bauer, Guenther; Jakovljevic, Mirko; Gagea, Leonard

    2006-01-01

    Traditional avionics computing systems are federated, with each system provided on a number of dedicated hardware units. Federated applications are physically separated from one another and analysis of the systems is undertaken individually. Integrated Modular Avionics (IMA) takes these federated functions and integrates them on a common computing platform in a tightly deterministic distributed real-time network of computing modules in which the different applications can run. IMA supports different levels of criticality in the same computing resource and provides a platform for implementation of fault tolerance through hardware and application redundancy. Modular implementation has distinct benefits in design, testing and system maintainability. This paper covers the requirements for fault tolerant bus systems used to provide reliable communication between IMA computing modules. An overview of the Time Triggered Protocol (TTP) specification and implementation as a reliable solution for IMA systems is presented. Application examples in aircraft avionics and a development system for future space application are covered. The commercially available TTP controller can also be implemented in an FPGA, and the results from implementation studies are covered. Finally, future directions for the application of TTP and related development activities are presented.

  5. [Research on the Application of Fuzzy Logic to Systems Analysis and Control

    NASA Technical Reports Server (NTRS)

    1998-01-01

    Research conducted with the support of NASA Grant NCC2-275 has been focused in the main on the development of fuzzy logic and soft computing methodologies and their applications to systems analysis and control, with emphasis on problem areas which are of relevance to NASA's missions. One of the principal results of our research has been the development of a new methodology called Computing with Words (CW). Basically, in CW words drawn from a natural language are employed in place of numbers for computing and reasoning. There are two major imperatives for computing with words. First, computing with words is a necessity when the available information is too imprecise to justify the use of numbers, and second, when there is a tolerance for imprecision which can be exploited to achieve tractability, robustness, low solution cost, and better rapport with reality. Exploitation of the tolerance for imprecision is an issue of central importance in CW.

  6. Flight test results of the strapdown hexad inertial reference unit (SIRU). Volume 2: Test report

    NASA Technical Reports Server (NTRS)

    Hruby, R. J.; Bjorkman, W. S.

    1977-01-01

    Results of flight tests of the Strapdown Inertial Reference Unit (SIRU) navigation system are presented. The fault tolerant SIRU navigation system features a redundant inertial sensor unit and dual computers. System software provides for detection and isolation of inertial sensor failures and continued operation in the event of failures. Flight test results include assessments of the system's navigational performance and fault tolerance. Performance shortcomings are analyzed.

  7. FTAPE: A fault injection tool to measure fault tolerance

    NASA Technical Reports Server (NTRS)

    Tsai, Timothy K.; Iyer, Ravishankar K.

    1995-01-01

    The paper introduces FTAPE (Fault Tolerance And Performance Evaluator), a tool that can be used to compare fault-tolerant computers. The tool combines system-wide fault injection with a controllable workload. A workload generator is used to create high stress conditions for the machine. Faults are injected based on this workload activity in order to ensure a high level of fault propagation. The errors/fault ratio and performance degradation are presented as measures of fault tolerance.

  8. Coordinated Fault Tolerance for High-Performance Computing

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Dongarra, Jack; Bosilca, George; et al.

    2013-04-08

    Our work to meet our goal of end-to-end fault tolerance has focused on two areas: (1) improving fault tolerance in various software currently available and widely used throughout the HEC domain and (2) using fault information exchange and coordination to achieve holistic, systemwide fault tolerance and understanding how to design and implement interfaces for integrating fault tolerance features for multiple layers of the software stack—from the application, math libraries, and programming language runtime to other common system software such as jobs schedulers, resource managers, and monitoring tools.

  9. Software reliability models for fault-tolerant avionics computers and related topics

    NASA Technical Reports Server (NTRS)

    Miller, Douglas R.

    1987-01-01

    Software reliability research is briefly described. General research topics are reliability growth models, quality of software reliability prediction, the complete monotonicity property of reliability growth, conceptual modelling of software failure behavior, assurance of ultrahigh reliability, and analysis techniques for fault-tolerant systems.

  10. Full-Authority Fault-Tolerant Electronic Engine Control System for Variable Cycle Engines.

    DTIC Science & Technology

    1982-04-01

    single internally self-checked VLSI microprocessor. The selected configuration is an externally checked pair of commercially available ... Electronic Engine Control; FPMH: Failures per Million Hours; FTMP: Fault Tolerant Multiprocessor; FTSC: Fault Tolerant Spaceborne Computer; GRAMP: Generalized ... Removal; MTBR: Mean Time Between Repair; MTTF: Mean Time to Failure; (List of Abbreviations, continued) NH: High Pressure Rotor Speed; O&S: Operating ...

  11. Computer Sciences and Data Systems, volume 1

    NASA Technical Reports Server (NTRS)

    1987-01-01

    Topics addressed include: software engineering; university grants; institutes; concurrent processing; sparse distributed memory; distributed operating systems; intelligent data management processes; expert system for image analysis; fault tolerant software; and architecture research.

  12. Analysis of a hardware and software fault tolerant processor for critical applications

    NASA Technical Reports Server (NTRS)

    Dugan, Joanne B.

    1993-01-01

    Computer systems for critical applications must be designed to tolerate software faults as well as hardware faults. A unified approach to tolerating hardware and software faults is characterized by classifying faults in terms of duration (transient or permanent) rather than source (hardware or software). Errors arising from transient faults can be handled through masking or voting, but errors arising from permanent faults require system reconfiguration to bypass the failed component. Most errors which are caused by software faults can be considered transient, in that they are input-dependent. Software faults are triggered by a particular set of inputs. Quantitative dependability analysis of systems which exhibit a unified approach to fault tolerance can be performed by a hierarchical combination of fault tree and Markov models. A methodology for analyzing hardware and software fault tolerant systems is applied to the analysis of a hypothetical system, loosely based on the Fault Tolerant Parallel Processor. The models consider both transient and permanent faults, hardware and software faults, independent and related software faults, automatic recovery, and reconfiguration.

  13. Fault Injection Campaign for a Fault Tolerant Duplex Framework

    NASA Technical Reports Server (NTRS)

    Sacco, Gian Franco; Ferraro, Robert D.; von llmen, Paul; Rennels, Dave A.

    2007-01-01

    Fault tolerance is an efficient approach adopted to avoid or reduce the damage of a system failure. In this work we present the results of a fault injection campaign we conducted on the Duplex Framework (DF). The DF is software developed by the UCLA group [1, 2] that takes a fault-tolerant approach, allowing two replicas of the same process to run on two different nodes of a commercial off-the-shelf (COTS) computer cluster. A third process, running on a different node, constantly monitors the results computed by the two replicas and restarts the two replica processes if an inconsistency in their computation is detected. This approach is very cost-efficient and can be adopted to control processes on spacecraft, where the fault rate produced by cosmic rays is not very high.
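
    A hedged sketch of the monitoring idea (not the Duplex Framework's actual interface): a third process compares the step-by-step results reported by the two replicas and requests a restart when they diverge.

      def monitor(results_a, results_b, restart):
          """results_a/results_b: iterables of (step, value) reported by the two replicas;
          restart: callback invoked when an inconsistency is detected."""
          for (step_a, val_a), (step_b, val_b) in zip(results_a, results_b):
              if step_a != step_b or val_a != val_b:
                  restart(step_a)       # roll both replicas back and re-run from this step
                  return False
          return True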

  14. Adaptive voting computer system

    NASA Technical Reports Server (NTRS)

    Koczela, L. J.; Wilgus, D. S. (Inventor)

    1974-01-01

    A computer system is reported that uses adaptive voting to tolerate failures and operates in a fail-operational, fail-safe manner. Each of four computers is individually connected to one of four external input/output (I/O) busses which interface with external subsystems. Each computer is connected to receive input data and commands from the other three computers and to furnish output data and commands to the other three computers. An adaptive control apparatus including a voter-comparator-switch (VCS) is provided for each computer to receive signals from each of the computers and permit adaptive voting among the computers, providing fail-operational, fail-safe operation.
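
    The voting behavior can be sketched as follows (an illustrative assumption about the logic, not the patented VCS circuitry): vote over the outputs of the computers still trusted, and adaptively drop a computer from the vote after repeated disagreement with the majority, which is how the system degrades from fail-operational toward fail-safe.

      from collections import Counter

      class AdaptiveVoter:
          def __init__(self, ids, max_disagreements=3):
              self.trusted = set(ids)
              self.errors = Counter()
              self.limit = max_disagreements

          def vote(self, outputs):
              """outputs: dict computer_id -> value for one voting cycle."""
              votes = {cid: v for cid, v in outputs.items() if cid in self.trusted}
              winner, _ = Counter(votes.values()).most_common(1)[0]
              for cid, v in votes.items():
                  if v != winner:
                      self.errors[cid] += 1
                      if self.errors[cid] >= self.limit:
                          self.trusted.discard(cid)   # exclude a persistently disagreeing computer
              return winner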

  15. JPRS Report Science & Technology Europe.

    DTIC Science & Technology

    1992-10-22

    Potatoes for More Sugar [Frankfurt/Main FRANKFURTER ALLGEMEINE, 12 Aug 92] ... COMPUTERS: French Devise Operating System for Parallel, Failure-Tolerant and Real-Time Systems [Munich COMPUTER WOCHE, 5 Jun 92] ... Germany Markets External Mass Memory for IBM-Compatible Parallel Interfaces ... Infrared Detection System [Thierry Lucas; Paris L’USINE NOUVELLE TECHNOLOGIES, 16 Jul 92] ... Streamlined ACE Fighter Airplane Approved [Paris AFP

  16. Evaluation of reliability modeling tools for advanced fault tolerant systems

    NASA Technical Reports Server (NTRS)

    Baker, Robert; Scheper, Charlotte

    1986-01-01

    The Computer Aided Reliability Estimation (CARE III) and Automated Reliability Interactive Estimation System (ARIES 82) reliability tools were evaluated for application to advanced fault-tolerant aerospace systems. To determine reliability modeling requirements, the evaluation focused on the Draper Laboratories' Advanced Information Processing System (AIPS) architecture as an example architecture for fault-tolerant aerospace systems. Advantages and limitations were identified for each reliability evaluation tool. The CARE III program was designed primarily for analyzing ultrareliable flight control systems. The ARIES 82 program's primary use was to support university research and teaching. Neither CARE III nor ARIES 82 was suited for determining the reliability of complex nodal networks of the type used to interconnect processing sites in the AIPS architecture. It was concluded that ARIES was not suitable for modeling advanced fault-tolerant systems. It was further concluded that, subject to some limitations (the difficulty in modeling systems with unpowered spare modules, systems where equipment maintenance must be considered, systems where failure depends on the sequence in which faults occurred, and systems where multiple faults beyond a double near-coincident fault must be considered), CARE III is best suited for evaluating the reliability of advanced fault-tolerant systems for air transport.

  17. Use of non-adiabatic geometric phase for quantum computing by NMR.

    PubMed

    Das, Ranabir; Kumar, S K Karthick; Kumar, Anil

    2005-12-01

    Geometric phases have stimulated researchers because of their potential applications in many areas of science. One of them is fault-tolerant quantum computation. A preliminary requisite of quantum computation is the implementation of controlled dynamics of qubits. In controlled dynamics, one qubit undergoes coherent evolution and acquires an appropriate phase, depending on the state of other qubits. If the evolution is geometric, then the phase acquired depends only on the geometry of the path executed, and is robust against certain types of error. This phenomenon leads to an inherently fault-tolerant quantum computation. Here we suggest a technique of using non-adiabatic geometric phase for quantum computation, using selective excitation. In a two-qubit system, we selectively evolve a suitable subsystem, where the control qubit is in state |1>, through a closed circuit. By this evolution, the target qubit gains a phase controlled by the state of the control qubit. Using the non-adiabatic geometric phase we demonstrate implementation of the Deutsch-Jozsa algorithm and Grover's search algorithm in a two-qubit system.

  18. The art of fault-tolerant system reliability modeling

    NASA Technical Reports Server (NTRS)

    Butler, Ricky W.; Johnson, Sally C.

    1990-01-01

    A step-by-step tutorial of the methods and tools used for the reliability analysis of fault-tolerant systems is presented. Emphasis is on the representation of architectural features in mathematical models. Details of the mathematical solution of complex reliability models are not presented. Instead the use of several recently developed computer programs--SURE, ASSIST, STEM, PAWS--which automate the generation and solution of these models is described.

  19. Optimization of Aerospace Structure Subject to Damage Tolerance Criteria

    NASA Technical Reports Server (NTRS)

    Akgun, Mehmet A.

    1999-01-01

    The objective of this cooperative agreement was to seek computationally efficient ways to optimize aerospace structures subject to damage tolerance criteria. Optimization was to involve sizing as well as topology optimization. The work was done in collaboration with Steve Scotti, Chauncey Wu and Joanne Walsh at the NASA Langley Research Center. Computation of constraint sensitivity is normally the most time-consuming step of an optimization procedure. The cooperative work first focused on this issue and implemented the adjoint method of sensitivity computation in an optimization code (runstream) written in Engineering Analysis Language (EAL). The method was implemented both for bar and plate elements, including buckling sensitivity for the latter. Lumping of constraints was investigated as a means to reduce the computational cost. Adjoint sensitivity computation was developed and implemented for lumped stress and buckling constraints. The cost of the direct method and the adjoint method was compared for various structures with and without lumping. The results were reported in two papers. It is desirable to optimize the topology of an aerospace structure subject to a large number of damage scenarios so that a damage tolerant structure is obtained. Including damage scenarios in the design procedure is critical in order to avoid large mass penalties at later stages. A common method for topology optimization is that of compliance minimization, which has not been used for damage tolerant design. In the present work, topology optimization is treated as a conventional problem aiming to minimize the weight subject to stress constraints. Multiple damage configurations (scenarios) are considered. Each configuration has its own structural stiffness matrix and, normally, requires factoring of the matrix and solution of the system of equations. Damage that is expected to be tolerated is local and represents a small change in the stiffness matrix compared to the baseline (undamaged) structure. The exact solution to a slightly modified set of equations can be obtained from the baseline solution economically without actually solving the modified system. Sherman-Morrison-Woodbury (SMW) formulas are matrix update formulas that allow this. SMW formulas were therefore used here to compute adjoint displacements for sensitivity computation and structural displacements in damaged configurations.
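
    The SMW reuse described above amounts to solving (K + U V^T) x = f using only solves against the already-factored baseline K. A minimal NumPy sketch, assuming the local damage can be written as a low-rank update U V^T (the names and numbers are illustrative, not the EAL runstream):

      import numpy as np

      def smw_solve(solve_K, U, V, f):
          """solve_K(b) returns K^{-1} b, e.g. from a stored factorization of the baseline K."""
          KiU = np.column_stack([solve_K(U[:, j]) for j in range(U.shape[1])])
          Kif = solve_K(f)
          small = np.eye(U.shape[1]) + V.T @ KiU        # r-by-r system, r = rank of the damage
          return Kif - KiU @ np.linalg.solve(small, V.T @ Kif)

      # Example: damage reduces the stiffness of one degree of freedom.
      K = np.diag([4.0, 3.0, 5.0])
      solve_K = lambda b: np.linalg.solve(K, b)
      U = np.array([[0.0], [1.0], [0.0]])
      V = -0.5 * U                                      # K_damaged = K + U V^T
      f = np.array([1.0, 1.0, 1.0])
      x = smw_solve(solve_K, U, V, f)                   # matches solving K_damaged directly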

  20. Investigation, Development, and Evaluation of Performance Proving for Fault-tolerant Computers

    NASA Technical Reports Server (NTRS)

    Levitt, K. N.; Schwartz, R.; Hare, D.; Moore, J. S.; Melliar-Smith, P. M.; Shostak, R. E.; Boyer, R. S.; Green, M. W.; Elliott, W. D.

    1983-01-01

    A number of methodologies for verifying systems and computer based tools that assist users in verifying their systems were developed. These tools were applied to verify in part the SIFT ultrareliable aircraft computer. Topics covered included: STP theorem prover; design verification of SIFT; high level language code verification; assembly language level verification; numerical algorithm verification; verification of flight control programs; and verification of hardware logic.

  1. An assessment of the real-time application capabilities of the SIFT computer system

    NASA Technical Reports Server (NTRS)

    Butler, R. W.

    1982-01-01

    The real-time capabilities of the SIFT computer system, a highly reliable multicomputer architecture developed to support the flight controls of a relaxed static stability aircraft, are discussed. The SIFT computer system was designed to meet extremely high reliability requirements and to facilitate a formal proof of its correctness. Although SIFT represents a significant achievement in fault-tolerant system research, it presents an unusual and restrictive interface to its users. The characteristics of the user interface and its impact on application system design are assessed.

  2. Characterization of the faulted behavior of digital computers and fault tolerant systems

    NASA Technical Reports Server (NTRS)

    Bavuso, Salvatore J.; Miner, Paul S.

    1989-01-01

    A development status evaluation is presented for efforts conducted at NASA-Langley since 1977, toward the characterization of the latent fault in digital fault-tolerant systems. Attention is given to the practical, high speed, generalized gate-level logic system simulator developed, as well as to the validation methodology used for the simulator, on the basis of faultable software and hardware simulations employing a prototype MIL-STD-1750A processor. After validation, latency tests will be performed.

  3. Validation Methods Research for Fault-Tolerant Avionics and Control Systems Sub-Working Group Meeting. CARE 3 peer review

    NASA Technical Reports Server (NTRS)

    Trivedi, K. S. (Editor); Clary, J. B. (Editor)

    1980-01-01

    A computer aided reliability estimation procedure (CARE 3), developed to model the behavior of ultrareliable systems required by flight-critical avionics and control systems, is evaluated. The mathematical models, numerical method, and fault-tolerant architecture modeling requirements are examined, and the testing and characterization procedures are discussed. Recommendations aimed at enhancing CARE 3 are presented; in particular, the need for a better exposition of the method and the user interface is emphasized.

  4. Room temperature high-fidelity holonomic single-qubit gate on a solid-state spin.

    PubMed

    Arroyo-Camejo, Silvia; Lazariev, Andrii; Hell, Stefan W; Balasubramanian, Gopalakrishnan

    2014-09-12

    At its most fundamental level, circuit-based quantum computation relies on the application of controlled phase shift operations on quantum registers. While these operations are generally compromised by noise and imperfections, quantum gates based on geometric phase shifts can provide intrinsically fault-tolerant quantum computing. Here we demonstrate the high-fidelity realization of a recently proposed fast (non-adiabatic) and universal (non-Abelian) holonomic single-qubit gate, using an individual solid-state spin qubit under ambient conditions. This fault-tolerant quantum gate provides an elegant means for achieving the fidelity threshold indispensable for implementing quantum error correction protocols. Since we employ a spin qubit associated with a nitrogen-vacancy colour centre in diamond, this system is based on integrable and scalable hardware exhibiting strong analogy to current silicon technology. This quantum gate realization is a promising step towards viable, fault-tolerant quantum computing under ambient conditions.

  5. Formal design specification of a Processor Interface Unit

    NASA Technical Reports Server (NTRS)

    Fura, David A.; Windley, Phillip J.; Cohen, Gerald C.

    1992-01-01

    This report describes work to formally specify the requirements and design of a processor interface unit (PIU), a single-chip subsystem providing memory-interface, bus-interface, and additional support services for a commercial microprocessor within a fault-tolerant computer system. This system, the Fault-Tolerant Embedded Processor (FTEP), is targeted towards applications in avionics and space requiring extremely high levels of mission reliability, extended maintenance-free operation, or both. The need for high-quality design assurance in such applications is an undisputed fact, given the disastrous consequences that even a single design flaw can produce. Thus, the further development and application of formal methods to fault-tolerant systems is of critical importance as these systems see increasing use in modern society.

  6. Using Performance Tools to Support Experiments in HPC Resilience

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Naughton, III, Thomas J; Boehm, Swen; Engelmann, Christian

    2014-01-01

    The high performance computing (HPC) community is working to address fault tolerance and resilience concerns for current and future large scale computing platforms. This is driving enhancements in the programming environments, specifically research on enhancing message passing libraries to support fault tolerant computing capabilities. The community has also recognized that tools for resilience experimentation are greatly lacking. However, we argue that there are several parallels between performance tools and resilience tools. As such, we believe the rich set of HPC performance-focused tools can be extended (repurposed) to benefit the resilience community. In this paper, we describe the initial motivation to leverage standard HPC performance analysis techniques to aid in developing diagnostic tools to assist fault tolerance experiments for HPC applications. These diagnosis procedures help to provide context for the system when the errors (failures) occurred. We describe our initial work in leveraging an MPI performance trace tool to assist in providing global context during fault injection experiments. Such tools will assist the HPC resilience community as they extend existing and new application codes to support fault tolerance.

  7. Formal design and verification of a reliable computing platform for real-time control. Phase 2: Results

    NASA Technical Reports Server (NTRS)

    Butler, Ricky W.; Divito, Ben L.

    1992-01-01

    The design and formal verification of the Reliable Computing Platform (RCP), a fault tolerant computing system for digital flight control applications, is presented. The RCP uses N-Modular Redundant (NMR) style redundancy to mask faults and internal majority voting to flush the effects of transient faults. The system is formally specified and verified using the Ehdm verification system. A major goal of this work is to provide the system with significant capability to withstand the effects of High Intensity Radiated Fields (HIRF).

  8. Evaluation of fault-tolerant parallel-processor architectures over long space missions

    NASA Technical Reports Server (NTRS)

    Johnson, Sally C.

    1989-01-01

    The impact of a five year space mission environment on fault-tolerant parallel processor architectures is examined. The target application is a Strategic Defense Initiative (SDI) satellite requiring 256 parallel processors to provide the computation throughput. The reliability requirements are that the system still be operational after five years with 0.99 probability and that the probability of system failure during one-half hour of full operation be less than 10^-7. The fault tolerance features an architecture must possess to meet these reliability requirements are presented, many potential architectures are briefly evaluated, and one candidate architecture, the Charles Stark Draper Laboratory's Fault-Tolerant Parallel Processor (FTPP), is evaluated in detail. A methodology for designing a preliminary system configuration to meet the reliability and performance requirements of the mission is then presented and demonstrated by designing an FTPP configuration.

  9. Plan for the Characterization of HIRF Effects on a Fault-Tolerant Computer Communication System

    NASA Technical Reports Server (NTRS)

    Torres-Pomales, Wilfredo; Malekpour, Mahyar R.; Miner, Paul S.; Koppen, Sandra V.

    2008-01-01

    This report presents the plan for the characterization of the effects of high intensity radiated fields on a prototype implementation of a fault-tolerant data communication system. Various configurations of the communication system will be tested. The prototype system is implemented using off-the-shelf devices. The system will be tested in a closed-loop configuration with extensive real-time monitoring. This test is intended to generate data suitable for the design of avionics health management systems, as well as redundancy management mechanisms and policies for robust distributed processing architectures.

  10. A Performance Prediction Model for a Fault-Tolerant Computer During Recovery and Restoration

    NASA Technical Reports Server (NTRS)

    Obando, Rodrigo A.; Stoughton, John W.

    1995-01-01

    The modeling and design of a fault-tolerant multiprocessor system is addressed. Of interest is the behavior of the system during recovery and restoration after a fault has occurred. The multiprocessor systems are based on the Algorithm to Architecture Mapping Model (ATAMM) and the fault considered is the death of a processor. The developed model is useful in determining performance bounds of the system during recovery and restoration. The performance bounds include time to recover from the fault, time to restore the system, and determination of any permanent delay in the input-to-output latency after the system has regained steady state. An implementation of an ATAMM-based computer was developed with the four-processor Generic VHSIC Spaceborne Computer (GVSC) as the target system. A simulation of the GVSC was also written, based on the code used in the ATAMM Multicomputer Operating System (AMOS). The simulation is used to verify the new model for tracking the propagation of the delay through the system and predicting the behavior of the transient state of recovery and restoration. The model is shown to accurately predict the transient behavior of an ATAMM-based multicomputer during recovery and restoration.

  11. FPGA-Based, Self-Checking, Fault-Tolerant Computers

    NASA Technical Reports Server (NTRS)

    Some, Raphael; Rennels, David

    2004-01-01

    A proposed computer architecture would exploit the capabilities of commercially available field-programmable gate arrays (FPGAs) to enable computers to detect and recover from bit errors. The main purpose of the proposed architecture is to enable fault-tolerant computing in the presence of single-event upsets (SEUs). [An SEU is a spurious bit flip (also called a soft error) caused by a single impact of ionizing radiation.] The architecture would also enable recovery from some soft errors caused by electrical transients and, to some extent, from intermittent and permanent (hard) errors caused by aging of electronic components. A typical FPGA of the current generation contains one or more complete processor cores, memories, and high-speed serial input/output (I/O) channels, making it possible to shrink a board-level processor node to a single integrated-circuit chip. Custom, highly efficient microcontrollers, general-purpose computers, custom I/O processors, and signal processors can be rapidly and efficiently implemented by use of FPGAs. Unfortunately, FPGAs are susceptible to SEUs. Prior efforts to mitigate the effects of SEUs have yielded solutions that degrade performance of the system and require support from external hardware and software. In comparison with other fault-tolerant computing architectures (e.g., triple modular redundancy), the proposed architecture could be implemented with less circuitry and lower power demand. Moreover, the fault-tolerant computing functions would require only minimal support from circuitry outside the central processing units (CPUs) of computers, would not require any software support, and would be largely transparent to software and to other computer hardware. There would be two types of modules: a self-checking processor module and a memory system (see figure). The self-checking processor module would be implemented on a single FPGA and would be capable of detecting its own internal errors. It would contain two CPUs executing identical programs in lock step, with comparison of their outputs to detect errors. It would also contain various cache and local memory circuits, communication circuits, and configurable special-purpose processors that would use self-checking checkers. (The basic principle of the self-checking checker method is to use logic circuitry that generates error signals whenever there is an error in either the checker or the circuit being checked.) The memory system would comprise a main memory and a hardware-controlled checkpointing system (CPS) based on a buffer memory denoted the recovery cache. The main memory would contain random-access memory (RAM) chips and FPGAs that would, in addition to everything else, implement double-error-detecting and single-error-correcting memory functions to enable recovery from single-bit errors.
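
    The self-checking pair concept (two CPUs in lock step with output comparison) can be mimicked in software. The sketch below is a hypothetical analogue of that idea, not the proposed FPGA design: a mismatch is treated as a detected error and the step is retried from the same known-good state.

        import random

        def self_checking_step(compute, state, inject_seu=False):
            """Run one step on two lock-stepped 'CPUs' and compare the outputs.
            A mismatch means an error was detected in one copy; the step is then
            retried (rollback-and-retry) instead of letting the error propagate."""
            out_a = compute(state)
            out_b = compute(state)
            if inject_seu:
                out_b ^= 1 << random.randrange(16)   # simulate a single-event upset
            if out_a != out_b:
                return self_checking_step(compute, state)   # retry from the same state
            return out_a

        step = lambda s: (s * 31 + 7) & 0xFFFF
        print(self_checking_step(step, 1234, inject_seu=True))  # equals the fault-free result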

  12. Toward a Fault Tolerant Architecture for Vital Medical-Based Wearable Computing.

    PubMed

    Abdali-Mohammadi, Fardin; Bajalan, Vahid; Fathi, Abdolhossein

    2015-12-01

    Advancements in computers and electronic technologies have led to the emergence of a new generation of efficient small intelligent systems. The products of such technologies might include smartphones and wearable devices, which have attracted the attention of medical applications. These products are used less in critical medical applications because of their resource constraints and failure sensitivity, since without safety considerations, small integrated hardware can endanger patients' lives. Therefore, some principles are required for constructing wearable systems in healthcare so that the existing concerns are dealt with. Accordingly, this paper proposes an architecture for constructing wearable systems in critical medical applications. The proposed architecture is a three-tier one, supporting data flow from body sensors to the cloud. The tiers of this architecture include wearable computers, mobile computing, and mobile cloud computing. One of the features of this architecture is its high achievable fault tolerance due to the nature of its components. Moreover, the required protocols are presented to coordinate the components of this architecture. Finally, the reliability of this architecture is assessed by simulating the architecture and its components, and other aspects of the proposed architecture are discussed.

  13. Investigation of air transportation technology at Princeton University, 1990-1991

    NASA Technical Reports Server (NTRS)

    Stengel, Robert F.

    1991-01-01

    The Air Transportation Technology Program at Princeton University is a program that emphasizes graduate and undergraduate student research. The program proceeded along six avenues during the past year: microburst hazards to aircraft, intelligent failure tolerant control, computer-aided heuristics for piloted flight, stochastic robustness of flight control systems, neural networks for flight control, and computer-aided control system design.

  14. 2009 fault tolerance for extreme-scale computing workshop, Albuquerque, NM - March 19-20, 2009.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Katz, D. S.; Daly, J.; DeBardeleben, N.

    2009-02-01

    This is a report on the third in a series of petascale workshops co-sponsored by Blue Waters and TeraGrid to address challenges and opportunities for making effective use of emerging extreme-scale computing. This workshop was held to discuss fault tolerance on large systems for running large, possibly long-running applications. The main point of the workshop was to have systems people, middleware people (including fault-tolerance experts), and applications people talk about the issues and figure out what needs to be done, mostly at the middleware and application levels, to run such applications on the emerging petascale systems without having faults cause large numbers of application failures. The workshop found that there is considerable interest in fault tolerance, resilience, and reliability of high-performance computing (HPC) systems in general, at all levels of HPC. The only way to recover from faults is through the use of some redundancy, either in space or in time. Redundancy in time, in the form of writing checkpoints to disk and restarting at the most recent checkpoint after a fault that causes an application to crash or halt, is the most common tool used in applications today, but there are questions about how long this can continue to be a good solution as systems and memories grow faster than I/O bandwidth to disk. There is interest both in modifications to this, such as checkpoints to memory, partial checkpoints, and message logging, and in alternative ideas, such as in-memory recovery using residues. We believe that systematic exploration of these ideas holds the most promise for the scientific applications community. Fault tolerance has been an issue of discussion in the HPC community for at least the past 10 years; but much like other issues, the community has managed to put off addressing it during this period. There is a growing recognition that as systems continue to grow to petascale and beyond, the field is approaching the point where we don't have any choice but to address this through R&D efforts.
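
    The checkpoint-to-disk-and-restart pattern discussed at the workshop can be sketched in a few lines. This is a toy illustration of redundancy in time (the file name and work loop are made up), not a production HPC checkpointing library.

        import os
        import pickle

        CKPT = "state.ckpt"                          # hypothetical checkpoint file

        def load_checkpoint():
            """Resume from the latest checkpoint, or start fresh."""
            if os.path.exists(CKPT):
                with open(CKPT, "rb") as f:
                    return pickle.load(f)
            return {"iteration": 0, "total": 0.0}

        def save_checkpoint(state):
            tmp = CKPT + ".tmp"
            with open(tmp, "wb") as f:               # write-then-rename keeps the previous
                pickle.dump(state, f)                # checkpoint valid if we crash mid-write
            os.replace(tmp, CKPT)

        state = load_checkpoint()
        for i in range(state["iteration"], 1000):
            state["total"] += i * 0.001              # the "application work"
            state["iteration"] = i + 1
            if state["iteration"] % 100 == 0:
                save_checkpoint(state)
        print(state["total"])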

  15. Tools and Techniques for Adding Fault Tolerance to Distributed and Parallel Programs

    DTIC Science & Technology

    1991-12-07

    The scale of parallel computing systems is rapidly approaching dimensions where fault tolerance can no longer be ignored. No matter how reliable the individual components may be, the ... those employed in the Tandem [7] and Stratus [35] systems, is clearly impractical. No matter how reliable the individual components are, the sheer ...

  16. ``DMS-R, the Brain of the ISS'', 10 Years of Continuous Successful Operation in Space

    NASA Astrophysics Data System (ADS)

    Wolff, Bernd; Scheffers, Peter

    2012-08-01

    Space industries on both sides of the Atlantic were faced with a new situation of collaboration at the beginning of the 1990s. In 1995, industrial cooperation between ASTRIUM ST, Bremen and RSC-E, Moscow began, aimed at outfitting the Russian Service Module ZVEZDA for the ISS with computers. The requested equipment had to provide not only redundancy but fault tolerance and high availability. The design and development of two fault-tolerant computers (FTCs), responsible for telemetry (Telemetry Computer: TC) and central control (CC), as well as the man-machine interface CPC, were contracted to ASTRIUM ST, Bremen. The computer system is responsible, for example, for the life support system and ISS re-boost control. In July 2000, the integration of the Russian Service Module ZVEZDA with the Russian ZARYA FGB and the American Node 1 bore witness to transatlantic and European cooperation. The Russian Service Module ZVEZDA provides several basic functions such as avionics control, Environmental Control and Life Support (ECLS) for the ISS, and control of the docked Automated Transfer Vehicle (ATV), which includes re-boost of the ISS. If these elementary functions fail or do not work reliably, the effects for the ISS will be catastrophic with respect to safety (manned spaceflight) and the ISS mission. For that reason the responsible computer system, Data Management System - Russia (DMS-R), is also called "the brain of the ISS". The Russian Service Module ZVEZDA, including DMS-R, was launched on 12 July 2000; DMS-R was operational during launch and docking. The talk provides information about the definition, design, and development of DMS-R, the integration of DMS-R in the Russian Service Module, and the maintenance of the system in space. Besides the technical aspects, the German-Russian cooperation is also an important subject of this talk. An outlook on further development activities and applications of fault-tolerant systems concludes the talk. The importance of the DMS-R equipment for the ISS with respect to availability and reliability is reported in paragraph 1.2, which describes a serious incident. The DMS-R architecture, consisting of two fault-tolerant computers, their interconnection via the MIL-STD-1553 bus, and the Control Post Computer (CPC) as man-machine interface, is given in figure 1. The main data transfer within the ISS, and therefore also within the Russian segment, is managed by the MIL-STD-1553 bus. The focus of this paper is neither the operational concept nor the fault-tolerant design according to the Byzantine theorem, but the architectural embedding. One fault-tolerant computer consists of up to four fault containment regions (FCRs), which compare input and output data and decide by majority voting whether a faulty FCR has to be isolated. For this purpose all data have to pass the so-called fault management element and are distributed to the other participants in the computer pool (FTC). Each fault containment region is connected to the avionics buses of the vehicle avionics system. In case of a faulty FCR (a wrong calculation result detected by the other FCRs or by built-in self-detection), the affected FCR will reset itself or will be reset by the others. The bus controller functions of the isolated FCR will be taken over by another FCR according to a specific deterministic scheme. The FTC data throughput will be maintained, and FTC operation will continue without interruption.
    Each FCR consists of an application CPU board (ALB), the fault management layer (FML), the avionics bus interface board (AVI), and a power supply (PSU), sharing a VME data bus. The FML is fully transparent, in terms of I/O accessibility, to the application software, and autonomously votes the data received from the avionics buses and transmitted from the application.
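
    The vote-and-isolate behavior of the FCR pool can be illustrated with a small sketch. This is a hypothetical simplification of the majority-voting idea, not the DMS-R implementation.

        from collections import Counter

        class FaultTolerantComputer:
            """Toy pool of fault containment regions (FCRs): each cycle the FCRs'
            outputs are voted, and any FCR disagreeing with the majority is isolated."""

            def __init__(self, n_fcrs=4):
                self.active = set(range(n_fcrs))

            def step(self, outputs):
                """outputs: dict mapping FCR id -> value computed this cycle."""
                votes = Counter(outputs[i] for i in self.active)
                majority_value, _ = votes.most_common(1)[0]
                for i in list(self.active):
                    if outputs[i] != majority_value:
                        self.active.discard(i)       # isolate the faulty FCR
                return majority_value

        ftc = FaultTolerantComputer()
        print(ftc.step({0: 42, 1: 42, 2: 42, 3: 41}))   # 42; FCR 3 is isolated
        print(ftc.active)                               # {0, 1, 2}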

  17. Fault tolerant programmable digital attitude control electronics study

    NASA Technical Reports Server (NTRS)

    Sorensen, A. A.

    1974-01-01

    The attitude control electronics mechanization study to develop a fault tolerant autonomous concept for a three axis system is reported. Programmable digital electronics are compared to general purpose digital computers. The requirements, constraints, and tradeoffs are discussed. It is concluded that: (1) general fault tolerance can be achieved relatively economically, (2) recovery times of less than one second can be obtained, (3) the number of faulty behavior patterns must be limited, and (4) adjoined processes are the best indicators of faulty operation.

  18. Development and evaluation of a fault-tolerant multiprocessor (FTMP) computer. Volume 1: FTMP principles of operation

    NASA Technical Reports Server (NTRS)

    Smith, T. B., Jr.; Lala, J. H.

    1983-01-01

    The basic organization of the Fault-Tolerant Multiprocessor (FTMP) is that of a general purpose homogeneous multiprocessor. Three processors operate on a shared system (memory and I/O) bus. Replication and tight synchronization of all elements and hardware voting are employed to detect and correct any single fault. Reconfiguration is then employed to repair a fault. Multiple faults may be tolerated as a sequence of single faults with repair between fault occurrences.

  19. Evaluation Applied to Reliability Analysis of Reconfigurable, Highly Reliable, Fault-Tolerant, Computing Systems for Avionics

    NASA Technical Reports Server (NTRS)

    Migneault, G. E.

    1979-01-01

    Emulation techniques are proposed as a solution to a difficulty arising in the analysis of the reliability of highly reliable computer systems for future commercial aircraft. The difficulty, namely the lack of credible precision in reliability estimates obtained by analytical modeling techniques, is established. The difficulty is shown to be an unavoidable consequence of: (1) a high reliability requirement so demanding as to make system evaluation by use testing infeasible, (2) a complex system design technique, fault tolerance, (3) system reliability dominated by errors due to flaws in the system definition, and (4) elaborate analytical modeling techniques whose precision outputs are quite sensitive to errors of approximation in their input data. The technique of emulation is described, indicating how its input is a simple description of the logical structure of a system and its output is the consequent behavior. The use of emulation techniques is discussed for pseudo-testing systems to evaluate bounds on the parameter values needed for the analytical techniques.

  20. Secure key storage and distribution

    DOEpatents

    Agrawal, Punit

    2015-06-02

    This disclosure describes a distributed, fault-tolerant security system that enables the secure storage and distribution of private keys. In one implementation, the security system includes a plurality of computing resources that independently store private keys provided by publishers and encrypted using a single security system public key. To protect against malicious activity, the security system private key necessary to decrypt the publication private keys is not stored at any of the computing resources. Rather, portions, or shares, of the security system private key are stored at each of the computing resources within the security system, and multiple security systems must communicate and share partial decryptions in order to decrypt the stored private keys.
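
    The idea of never storing the complete security-system private key at any single resource can be illustrated with a minimal XOR secret-splitting sketch. This is an n-of-n scheme chosen for brevity, not the patented design, which describes threshold-style partial decryptions.

        import secrets

        def split_secret(secret, n_shares):
            """n-of-n XOR splitting: any n-1 shares reveal nothing about the secret."""
            shares = [secrets.token_bytes(len(secret)) for _ in range(n_shares - 1)]
            last = bytearray(secret)
            for share in shares:
                for i, b in enumerate(share):
                    last[i] ^= b
            return shares + [bytes(last)]

        def combine_shares(shares):
            out = bytearray(len(shares[0]))
            for share in shares:
                for i, b in enumerate(share):
                    out[i] ^= b
            return bytes(out)

        key = b"hypothetical-private-key-material"
        shares = split_secret(key, 3)                 # store each share on a different resource
        assert all(share != key for share in shares)  # no single resource holds the key
        print(combine_shares(shares) == key)          # True: all shares together recover it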

  1. Study of fault-tolerant software technology

    NASA Technical Reports Server (NTRS)

    Slivinski, T.; Broglio, C.; Wild, C.; Goldberg, J.; Levitt, K.; Hitt, E.; Webb, J.

    1984-01-01

    Presented is an overview of the current state of the art of fault-tolerant software and an analysis of quantitative techniques and models developed to assess its impact. It examines research efforts as well as experience gained from commercial application of these techniques. The paper also addresses the computer architecture and design implications on hardware, operating systems and programming languages (including Ada) of using fault-tolerant software in real-time aerospace applications. It concludes that fault-tolerant software has progressed beyond the pure research state. The paper also finds that, although not perfectly matched, newer architectural and language capabilities provide many of the notations and functions needed to effectively and efficiently implement software fault-tolerance.

  2. A design fix to supervisory control for fault-tolerant scheduling of real-time multiprocessor systems with aperiodic tasks

    NASA Astrophysics Data System (ADS)

    Devaraj, Rajesh; Sarkar, Arnab; Biswas, Santosh

    2015-11-01

    In the article 'Supervisory control for fault-tolerant scheduling of real-time multiprocessor systems with aperiodic tasks', Park and Cho presented a systematic way of computing a largest fault-tolerant and schedulable language that provides information on whether the scheduler (i.e., supervisor) should accept or reject a newly arrived aperiodic task. The computation of such a language is mainly dependent on the task execution model presented in their paper. However, the task execution model is unable to capture the situation when the fault of a processor occurs even before the task has arrived. Consequently, a task execution model that does not capture this fact may possibly be assigned for execution on a faulty processor. This problem has been illustrated with an appropriate example. Then, the task execution model of Park and Cho has been modified to strengthen the requirement that none of the tasks are assigned for execution on a faulty processor.

  3. Advanced Information Processing System (AIPS)

    NASA Technical Reports Server (NTRS)

    Pitts, Felix L.

    1993-01-01

    Advanced Information Processing System (AIPS) is a computer systems philosophy, a set of validated hardware building blocks, and a set of validated services as embodied in system software. The goal of AIPS is to provide the knowledge base which will allow achievement of validated fault-tolerant distributed computer system architectures, suitable for a broad range of applications, having failure probability requirements of 10^-9 at 10 hours. A background and description is given, followed by program accomplishments, the current focus, applications, technology transfer, FY92 accomplishments, and funding.

  4. Markov chain algorithms: a template for building future robust low-power systems

    PubMed Central

    Deka, Biplab; Birklykke, Alex A.; Duwe, Henry; Mansinghka, Vikash K.; Kumar, Rakesh

    2014-01-01

    Although computational systems are looking towards post CMOS devices in the pursuit of lower power, the expected inherent unreliability of such devices makes it difficult to design robust systems without additional power overheads for guaranteeing robustness. As such, algorithmic structures with inherent ability to tolerate computational errors are of significant interest. We propose to cast applications as stochastic algorithms based on Markov chains (MCs) as such algorithms are both sufficiently general and tolerant to transition errors. We show with four example applications—Boolean satisfiability, sorting, low-density parity-check decoding and clustering—how applications can be cast as MC algorithms. Using algorithmic fault injection techniques, we demonstrate the robustness of these implementations to transition errors with high error rates. Based on these results, we make a case for using MCs as an algorithmic template for future robust low-power systems. PMID:24842030
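
    A concrete instance of such a Markov-chain algorithm is the classic random-walk approach to Boolean satisfiability, sketched below. This is a generic illustration, not the authors' implementation: the assignment performs a random walk whose transitions drift toward a satisfying assignment, so occasional erroneous transitions merely slow convergence rather than corrupting the result.

        import random

        def walk_sat(clauses, n_vars, max_flips=100000):
            """Random-walk SAT: clauses are lists of DIMACS-style literals
            (+v means variable v is true, -v means variable v is false)."""
            assign = {v: random.choice([True, False]) for v in range(1, n_vars + 1)}
            satisfied = lambda cl: any((lit > 0) == assign[abs(lit)] for lit in cl)
            for _ in range(max_flips):
                unsat = [cl for cl in clauses if not satisfied(cl)]
                if not unsat:
                    return assign                    # satisfying assignment found
                lit = random.choice(random.choice(unsat))
                assign[abs(lit)] = not assign[abs(lit)]   # one random transition
            return None

        # (x1 or x2) and (not x1 or x3) and (not x2 or not x3)
        print(walk_sat([[1, 2], [-1, 3], [-2, -3]], 3))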

  5. Redundant Asynchronous Microprocessor System

    NASA Technical Reports Server (NTRS)

    Meyer, G.; Johnston, J. O.; Dunn, W. R.

    1985-01-01

    Fault-tolerant computer structure called RAMPS (for redundant asynchronous microprocessor system) has simplicity of static redundancy but offers intermittent-fault handling ability of complex, dynamically redundant systems. New structure useful wherever several microprocessors are employed for control - in aircraft, industrial processes, robotics, and automatic machining, for example.

  6. High Speed, High Temperature, Fault Tolerant Operation of a Combination Magnetic-Hydrostatic Bearing Rotor Support System for Turbomachinery

    NASA Technical Reports Server (NTRS)

    Jansen, Mark; Montague, Gerald; Provenza, Andrew; Palazzolo, Alan

    2004-01-01

    Closed-loop operation of a single high-temperature magnetic radial bearing to 30,000 RPM (2.25 million DN) and 540 C (1000 F) is discussed. Also, high-temperature, fault-tolerant operation of the three-axis system is examined. A novel hydrostatic backup bearing system was employed to attain high-speed, high-temperature, lubrication-free support of the entire rotor system. The hydrostatic bearings were made of a high-lubricity material and acted as journal-type backup bearings. New high-temperature displacement sensors were successfully employed to monitor shaft position throughout the entire temperature range and are described in this paper. Control of the system was accomplished through a stand-alone, high-speed computer controller, which was used to run both the fault-tolerant PID and active vibration control algorithms.

  7. Phased models for evaluating the performability of computing systems

    NASA Technical Reports Server (NTRS)

    Wu, L. T.; Meyer, J. F.

    1979-01-01

    A phase-by-phase modelling technique is introduced to evaluate a fault tolerant system's ability to execute different sets of computational tasks during different phases of the control process. Intraphase processes are allowed to differ from phase to phase. The probabilities of interphase state transitions are specified by interphase transition matrices. Based on constraints imposed on the intraphase and interphase transition probabilities, various iterative solution methods are developed for calculating system performability.
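
    The phase-by-phase computation can be sketched as a product of intraphase and interphase transition matrices applied to a state-probability vector. The matrices below use made-up numbers purely for illustration.

        import numpy as np

        # States: 0 = fully operational, 1 = degraded, 2 = failed.
        p0 = np.array([1.0, 0.0, 0.0])               # start fully operational

        intraphase = [                                # one step of each mission phase
            np.array([[0.98, 0.015, 0.005],
                      [0.00, 0.97,  0.03 ],
                      [0.00, 0.00,  1.00 ]]),
            np.array([[0.95, 0.04,  0.01 ],
                      [0.00, 0.93,  0.07 ],
                      [0.00, 0.00,  1.00 ]]),
        ]
        interphase = np.array([[1.0, 0.0, 0.0],       # reconfiguration between phases:
                               [0.6, 0.4, 0.0],       # a degraded system may be recovered
                               [0.0, 0.0, 1.0]])

        p = p0 @ intraphase[0] @ interphase @ intraphase[1]
        print("P(operational), P(degraded), P(failed):", p)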

  8. The PAWS and STEM reliability analysis programs

    NASA Technical Reports Server (NTRS)

    Butler, Ricky W.; Stevenson, Philip H.

    1988-01-01

    The PAWS and STEM programs are new design/validation tools. These programs provide a flexible, user-friendly, language-based interface for the input of Markov models describing the behavior of fault-tolerant computer systems. These programs produce exact solutions of the probability of system failure and provide a conservative estimate of the number of significant digits in the solution. PAWS uses a Padé approximation as a solution technique; STEM uses a Taylor series. Both programs have the capability to solve numerically stiff models. PAWS and STEM possess complementary properties with regard to their input space, and an additional strength of these programs is that they accept input compatible with the SURE program. If used in conjunction with SURE, PAWS and STEM provide a powerful suite of programs to analyze the reliability of fault-tolerant computer systems.
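
    A Taylor-series style of solution like STEM's can be sketched with a truncated matrix-exponential series applied to a toy Markov model. The rates below are assumed values, and STEM's conservative error bounding is omitted.

        import numpy as np

        def expm_taylor(A, terms=60):
            """Truncated Taylor series for exp(A); STEM additionally bounds the
            truncation error, which this sketch does not."""
            result = np.eye(A.shape[0])
            term = np.eye(A.shape[0])
            for k in range(1, terms):
                term = term @ A / k
                result = result + term
            return result

        # Toy CTMC generator for a duplex system: states (2 good, 1 good, failed).
        lam, mu = 1.0e-4, 1.0e-2                     # assumed failure and repair rates per hour
        Q = np.array([[-2 * lam, 2 * lam, 0.0],
                      [mu, -(mu + lam), lam],
                      [0.0, 0.0, 0.0]])
        p0 = np.array([1.0, 0.0, 0.0])
        p_t = p0 @ expm_taylor(Q * 10.0)             # state probabilities after 10 hours
        print("P(system failed after 10 h) =", p_t[2])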

  9. SUMC fault tolerant computer system

    NASA Technical Reports Server (NTRS)

    1980-01-01

    The results of the trade studies are presented. These trades cover: establishing the basic configuration, establishing the CPU/memory configuration, establishing an approach to crosstrapping interfaces, defining the requirements of the redundancy management unit (RMU), establishing a spare-plane switching strategy for the fault-tolerant memory (FTM), and identifying the most cost-effective way of extending the memory addressing capability beyond the 64 K-bytes (K=1024) of SUMC-II B. The results of the design are compiled in the Contract End Item (CEI) Specification for the NASA Standard Spacecraft Computer II (NSSC-II), IBM 7934507. The implementation of the FTM and the memory address expansion are also described.

  10. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sadayappan, Ponnuswamy

    Exascale computing systems will provide a thousand-fold increase in parallelism and a proportional increase in failure rate relative to today's machines. Systems software for exascale machines must provide the infrastructure to support existing applications while simultaneously enabling efficient execution of new programming models that naturally express dynamic, adaptive, irregular computation; coupled simulations; and massive data analysis in a highly unreliable hardware environment with billions of threads of execution. We propose a new approach to the data and work distribution model provided by system software based on the unifying formalism of an abstract file system. The proposed hierarchical data model provides simple, familiar visibility and access to data structures through the file system hierarchy, while providing fault tolerance through selective redundancy. The hierarchical task model features work queues whose form and organization are represented as file system objects. Data and work are both first class entities. By exposing the relationships between data and work to the runtime system, information is available to optimize execution time and provide fault tolerance. The data distribution scheme provides replication (where desirable and possible) for fault tolerance and efficiency, and it is hierarchical to make it possible to take advantage of locality. The user, tools, and applications, including legacy applications, can interface with the data, work queues, and one another through the abstract file model. This runtime environment will provide multiple interfaces to support traditional Message Passing Interface applications, languages developed under DARPA's High Productivity Computing Systems program, as well as other, experimental programming models. We will validate our runtime system with pilot codes on existing platforms and will use simulation to validate for exascale-class platforms. In this final report, we summarize research results from the work done at the Ohio State University towards the larger goals of the project listed above.

  11. QCCM Center for Quantum Algorithms

    DTIC Science & Technology

    2008-10-17

    Quantum algorithms (e.g., quantum walks and adiabatic computing), as well as theoretical advances relating algorithms to physical implementations (e.g. ...). Subject terms: quantum algorithms, quantum computing, fault-tolerant error correction. Related publication: A. Ambainis, M. Beaudry, M. Golovkins, A. Kikusts, M. Mercer, D. Thérien, "Algebraic results on quantum automata," Theory of Computing Systems 39 (2006).

  12. Enhancing Security by System-Level Virtualization in Cloud Computing Environments

    NASA Astrophysics Data System (ADS)

    Sun, Dawei; Chang, Guiran; Tan, Chunguang; Wang, Xingwei

    Many trends are opening up the era of cloud computing, which will reshape the IT industry. Virtualization techniques have become an indispensable ingredient for almost all cloud computing systems. Through virtual environments, a cloud provider is able to run the varieties of operating systems needed by each cloud user. Virtualization can improve reliability, security, and availability of applications by using consolidation, isolation, and fault tolerance. In addition, it is possible to balance workloads by using live migration techniques. In this paper, the definition of cloud computing is given, and then the service and deployment models are introduced. An analysis of security issues and challenges in the implementation of cloud computing is presented. Moreover, a system-level virtualization case is established to enhance the security of cloud computing environments.

  13. A three-layer model of natural image statistics.

    PubMed

    Gutmann, Michael U; Hyvärinen, Aapo

    2013-11-01

    An important property of visual systems is to be simultaneously both selective to specific patterns found in the sensory input and invariant to possible variations. Selectivity and invariance (tolerance) are opposing requirements. It has been suggested that they could be joined by iterating a sequence of elementary selectivity and tolerance computations. It is, however, unknown what should be selected or tolerated at each level of the hierarchy. We approach this issue by learning the computations from natural images. We propose and estimate a probabilistic model of natural images that consists of three processing layers. Two natural image data sets are considered: image patches, and complete visual scenes downsampled to the size of small patches. For both data sets, we find that in the first two layers, simple and complex cell-like computations are performed. In the third layer, we mainly find selectivity to longer contours; for patch data, we further find some selectivity to texture, while for the downsampled complete scenes, some selectivity to curvature is observed.

  14. Design of Test Articles and Monitoring System for the Characterization of HIRF Effects on a Fault-Tolerant Computer Communication System

    NASA Technical Reports Server (NTRS)

    Torres-Pomales, Wilfredo; Malekpour, Mahyar R.; Miner, Paul S.; Koppen, Sandra V.

    2008-01-01

    This report describes the design of the test articles and monitoring systems developed to characterize the response of a fault-tolerant computer communication system when stressed beyond the theoretical limits for guaranteed correct performance. A high-intensity radiated electromagnetic field (HIRF) environment was selected as the means of injecting faults, as such environments are known to have the potential to cause arbitrary and coincident common-mode fault manifestations that can overwhelm redundancy management mechanisms. The monitors generate stimuli for the systems-under-test (SUTs) and collect data in real-time on the internal state and the response at the external interfaces. A real-time health assessment capability was developed to support the automation of the test. A detailed description of the nature and structure of the collected data is included. The goal of the report is to provide insight into the design and operation of these systems, and to serve as a reference document for use in post-test analyses.

  15. Hierarchical specification of the SIFT fault tolerant flight control system

    NASA Technical Reports Server (NTRS)

    Melliar-Smith, P. M.; Schwartz, R. L.

    1981-01-01

    The specification and mechanical verification of the Software Implemented Fault Tolerance (SIFT) flight control system is described. The methodology employed in the verification effort is discussed, and a description of the hierarchical models of the SIFT system is given. To meet NASA's objective for the reliability of safety-critical flight control systems, the SIFT computer must achieve a reliability well beyond the levels at which reliability can actually be measured. The methodology employed to demonstrate rigorously that the SIFT computer meets its reliability requirements is described. The hierarchy of design specifications, from very abstract descriptions of system function down to the actual implementation, is explained. The most abstract design specifications can be used to verify that the system functions correctly and with the desired reliability, since almost all details of the realization were abstracted out. A succession of lower level models refines these specifications to the level of the actual implementation and can be used to demonstrate that the implementation has the properties claimed of the abstract design specifications.

  16. FTMP - A highly reliable Fault-Tolerant Multiprocessor for aircraft

    NASA Technical Reports Server (NTRS)

    Hopkins, A. L., Jr.; Smith, T. B., III; Lala, J. H.

    1978-01-01

    The FTMP (Fault-Tolerant Multiprocessor) is a complex multiprocessor computer that employs a form of redundancy related to systems considered by Mathur (1971), in which each major module can substitute for any other module of the same type. Despite the conceptual simplicity of the redundancy form, the implementation has many intricacies owing partly to the low target failure rate, and partly to the difficulty of eliminating single-fault vulnerability. An extensive analysis of the computer through the use of such modeling techniques as Markov processes and combinatorial mathematics shows that for random hard faults the computer can meet its requirements. It is also shown that the maintenance scheduled at intervals of 200 hr or more can be adequate most of the time.

  17. Evaluation of Advanced Computing Techniques and Technologies: Reconfigurable Computing

    NASA Technical Reports Server (NTRS)

    Wells, B. Earl

    2003-01-01

    The focus of this project was to survey the technology of reconfigurable computing and determine its level of maturity and suitability for NASA applications; to better understand and assess the effectiveness of the reconfigurable design paradigm utilized within the HAL-15 reconfigurable computer system, which was made available to NASA MSFC for this purpose by Star Bridge Systems, Inc.; and to implement at least one application that would benefit from the performance levels that are possible with reconfigurable hardware. It was originally proposed that experiments in fault tolerance and dynamic reconfigurability would be performed, but time constraints mandated that these be pursued as future research.

  18. The embedded operating system project

    NASA Technical Reports Server (NTRS)

    Campbell, R. H.

    1985-01-01

    The design and construction of embedded operating systems for real-time advanced aerospace applications was investigated. The applications require reliable operating system support that must accommodate computer networks. Problems that arise in the construction of such operating systems, reconfiguration, consistency and recovery in a distributed system, and the issues of real-time processing are reported. A thesis that provides theoretical foundations for the use of atomic actions to support fault tolerance and data consistency in real-time object-based systems is included. The following items are addressed: (1) atomic actions and fault-tolerance issues; (2) operating system structure; (3) program development; (4) a reliable compiler for path Pascal; and (5) mediators, a mechanism for scheduling distributed system processes.

  19. Investigation of the applicability of a functional programming model to fault-tolerant parallel processing for knowledge-based systems

    NASA Technical Reports Server (NTRS)

    Harper, Richard

    1989-01-01

    In a fault-tolerant parallel computer, a functional programming model can facilitate distributed checkpointing, error recovery, load balancing, and graceful degradation. Such a model has been implemented on the Draper Fault-Tolerant Parallel Processor (FTPP). When used in conjunction with the FTPP's fault detection and masking capabilities, this implementation results in a graceful degradation of system performance after faults. Three graceful degradation algorithms have been implemented and are presented. A user interface has been implemented which requires minimal cognitive overhead by the application programmer, masking such complexities as the system's redundancy, distributed nature, variable complement of processing resources, load balancing, fault occurrence and recovery. This user interface is described and its use demonstrated. The applicability of the functional programming style to the Activation Framework, a paradigm for intelligent systems, is then briefly described.

  20. Verification of the FtCayuga fault-tolerant microprocessor system. Volume 2: Formal specification and correctness theorems

    NASA Technical Reports Server (NTRS)

    Bickford, Mark; Srivas, Mandayam

    1991-01-01

    Presented here is a formal specification and verification of a property of a quadruplicately redundant fault-tolerant microprocessor system design. A complete listing of the formal specification of the system and the correctness theorems that were proved is given. The system performs the task of obtaining interactive consistency among the processors using a special instruction on the processors. The design is based on an algorithm proposed by Pease, Shostak, and Lamport. The property verified ensures that an execution of the special instruction by the processors correctly accomplishes interactive consistency, provided certain preconditions hold. The verification was carried out using a computer-aided design verification tool, Spectool, and the theorem prover Clio. A major contribution of the work is the demonstration of a significant fault-tolerant hardware design that is mechanically verified by a theorem prover.
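
    The interactive-consistency exchange at the heart of the Pease-Shostak-Lamport algorithm can be sketched for four processors tolerating one arbitrary fault. The code below is a hypothetical illustration of the exchange-and-majority idea, not the verified FtCayuga design.

        import random
        from collections import Counter

        def majority(values):
            value, count = Counter(values).most_common(1)[0]
            return value if count > len(values) // 2 else None   # None = agreed default

        def interactive_consistency(private, faulty):
            """private: node id -> its own value; the `faulty` node sends arbitrary values.
            Returns the value vector computed by each good node; all good nodes agree on
            every entry, and the entries for good nodes are their true values."""
            nodes = list(private)
            lie = lambda: random.randint(0, 9)
            # Round 1: each transmitter t sends its value to every other node r.
            direct = {(t, r): (lie() if t == faulty else private[t])
                      for t in nodes for r in nodes if r != t}
            # Round 2: each receiver r relays what it heard from t to the other nodes q.
            relayed = {(t, r, q): (lie() if r == faulty else direct[(t, r)])
                       for t in nodes for r in nodes for q in nodes
                       if len({t, r, q}) == 3}
            vectors = {}
            for r in nodes:
                if r == faulty:
                    continue
                vec = {}
                for t in nodes:
                    if t == r:
                        vec[t] = private[r]
                    else:
                        others = [q for q in nodes if q not in (t, r)]
                        vec[t] = majority([direct[(t, r)]] +
                                          [relayed[(t, q, r)] for q in others])
                vectors[r] = vec
            return vectors

        print(interactive_consistency({0: 7, 1: 3, 2: 5, 3: 9}, faulty=3))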

  1. Architecture for an artificial immune system.

    PubMed

    Hofmeyr, S A; Forrest, S

    2000-01-01

    An artificial immune system (ARTIS) is described which incorporates many properties of natural immune systems, including diversity, distributed computation, error tolerance, dynamic learning and adaptation, and self-monitoring. ARTIS is a general framework for a distributed adaptive system and could, in principle, be applied to many domains. In this paper, ARTIS is applied to computer security in the form of a network intrusion detection system called LISYS. LISYS is described and shown to be effective at detecting intrusions, while maintaining low false positive rates. Finally, similarities and differences between ARTIS and Holland's classifier systems are discussed.

  2. Metacomputing on Commodity Computers

    DTIC Science & Technology

    1999-05-01

    on NOWs, and this has contributed to the popularity of systems such as PVM [59], MPI [67], Linda [33], and TreadMarks [2]. ... presents the performance of Calypso and Persistent Linda (PLinda) [77] programs and compares how they can tolerate failures. A biological pattern ... adds fault tolerance to Linda programs by using light-weight transactions, whereas Calypso uses the combination of eager scheduling and two-phase ...

  3. Computer aided reliability, availability, and safety modeling for fault-tolerant computer systems with commentary on the HARP program

    NASA Technical Reports Server (NTRS)

    Shooman, Martin L.

    1991-01-01

    Many of the most challenging reliability problems of our present decade involve complex distributed systems such as interconnected telephone switching computers, air traffic control centers, aircraft and space vehicles, and local area and wide area computer networks. In addition to the challenge of complexity, modern fault-tolerant computer systems require very high levels of reliability, e.g., avionic computers with MTTF goals of one billion hours. Most analysts find that it is too difficult to model such complex systems without computer aided design programs. In response to this need, NASA has developed a suite of computer aided reliability modeling programs beginning with CARE 3 and including a group of new programs such as: HARP, HARP-PC, Reliability Analysts Workbench (Combination of model solvers SURE, STEM, PAWS, and common front-end model ASSIST), and the Fault Tree Compiler. The HARP program is studied and how well the user can model systems using this program is investigated. One of the important objectives will be to study how user friendly this program is, e.g., how easy it is to model the system, provide the input information, and interpret the results. The experiences of the author and his graduate students who used HARP in two graduate courses are described. Some brief comparisons were made with the ARIES program which the students also used. Theoretical studies of the modeling techniques used in HARP are also included. Of course no answer can be any more accurate than the fidelity of the model, thus an Appendix is included which discusses modeling accuracy. A broad viewpoint is taken and all problems which occurred in the use of HARP are discussed. Such problems include: computer system problems, installation manual problems, user manual problems, program inconsistencies, program limitations, confusing notation, long run times, accuracy problems, etc.

  4. A novel N-input voting algorithm for X-by-wire fault-tolerant systems.

    PubMed

    Karimi, Abbas; Zarafshan, Faraneh; Al-Haddad, S A R; Ramli, Abdul Rahman

    2014-01-01

    Voting is an important operation in multichannel computation paradigm and realization of ultrareliable and real-time control systems that arbitrates among the results of N redundant variants. These systems include N-modular redundant (NMR) hardware systems and diversely designed software systems based on N-version programming (NVP). Depending on the characteristics of the application and the type of selected voter, the voting algorithms can be implemented for either hardware or software systems. In this paper, a novel voting algorithm is introduced for real-time fault-tolerant control systems, appropriate for applications in which N is large. Then, its behavior has been software implemented in different scenarios of error-injection on the system inputs. The results of analyzed evaluations through plots and statistical computations have demonstrated that this novel algorithm does not have the limitations of some popular voting algorithms such as median and weighted; moreover, it is able to significantly increase the reliability and availability of the system in the best case to 2489.7% and 626.74%, respectively, and in the worst case to 3.84% and 1.55%, respectively.
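
    For context, two widely used voters (exact median and inexact plurality), of the kind the paper compares against, can be sketched as follows. This is generic illustration code, not the paper's novel algorithm.

        import statistics

        def median_voter(inputs):
            """Median voter: robust to a minority of arbitrarily wrong channels."""
            return statistics.median(inputs)

        def plurality_voter(inputs, eps=1e-3):
            """Inexact plurality voter: group readings that agree within `eps`
            and return the centre of the largest agreeing group."""
            groups = []
            for x in sorted(inputs):
                if groups and x - groups[-1][-1] <= eps:
                    groups[-1].append(x)
                else:
                    groups.append([x])
            largest = max(groups, key=len)
            return sum(largest) / len(largest)

        readings = [1.001, 0.999, 1.000, 7.3, 1.002]   # one faulty channel out of five
        print(median_voter(readings), plurality_voter(readings))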

  5. Distributed Fault-Tolerant Control of Networked Uncertain Euler-Lagrange Systems Under Actuator Faults.

    PubMed

    Chen, Gang; Song, Yongduan; Lewis, Frank L

    2016-05-03

    This paper investigates the distributed fault-tolerant control problem of networked Euler-Lagrange systems with actuator and communication link faults. An adaptive fault-tolerant cooperative control scheme is proposed to achieve the coordinated tracking control of networked uncertain Lagrange systems on a general directed communication topology, which contains a spanning tree with the root node being the active target system. The proposed algorithm is capable of compensating for the actuator bias fault, the partial loss of effectiveness actuation fault, the communication link fault, the model uncertainty, and the external disturbance simultaneously. The control scheme does not use any fault detection and isolation mechanism to detect, separate, and identify the actuator faults online, which largely reduces the online computation and expedites the responsiveness of the controller. To validate the effectiveness of the proposed method, a test-bed of a multiple robot-arm cooperative control system was developed for real-time verification. Experiments on the networked robot arms were conducted, and the results confirm the benefits and the effectiveness of the proposed distributed fault-tolerant control algorithms.

  6. MO-F-CAMPUS-T-03: Data Driven Approaches for Determination of Treatment Table Tolerance Values for Record and Verification Systems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Gupta, N; DiCostanzo, D; Fullenkamp, M

    2015-06-15

    Purpose: To determine appropriate couch tolerance values for modern radiotherapy linac R&V systems with indexed patient setup. Methods: Treatment table tolerance values have been the most difficult to lower, due to many factors including variations in patient positioning and differences in table tops between machines. We recently installed nine linacs with similar tables and started indexing every patient in our clinic. In this study we queried our R&V database and analyzed the deviation of couch position values from the values acquired at verification simulation for all patients treated with indexed positioning. Means and standard deviations of daily setup deviations were computed in the longitudinal, lateral, and vertical directions for 343 patient plans. The mean, median, and standard error of the standard deviations across the whole patient population, and for some disease sites, were computed to determine tolerance values. Results: The plot of our couch deviation values showed a Gaussian distribution, with some small deviations corresponding to setup uncertainties on non-imaging days and to SRS/SRT/SBRT patients, as well as some large deviations which were spot-checked and found to correspond to indexing errors that were overridden. Setting our tolerance values based on the median + 1 standard error resulted in tolerance values of 1 cm lateral and longitudinal, and 0.5 cm vertical, for all non-SRS/SRT/SBRT cases. Re-analyzing the data, we found that about 92% of the treated fractions would be within these tolerance values (ignoring the mis-indexed patients). We also analyzed data for disease-site-based subpopulations and found no difference in the tolerance values that needed to be used. Conclusion: With the use of automation, auto-setup, and other workflow efficiency tools being introduced into the radiotherapy workflow, it is essential to set table tolerances that allow safe treatments but flag setup errors that need to be reassessed before treatment.
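
    The tolerance rule described in the results (median of the per-plan setup standard deviations plus one standard error) can be expressed directly. The numbers below are made-up illustration data, not the study's measurements.

        import statistics

        def table_tolerance(per_plan_std_devs):
            """Tolerance value = median of per-plan setup standard deviations
            plus one standard error of those standard deviations."""
            med = statistics.median(per_plan_std_devs)
            std_err = statistics.stdev(per_plan_std_devs) / len(per_plan_std_devs) ** 0.5
            return med + std_err

        lateral_sds = [0.42, 0.55, 0.38, 0.61, 0.47, 0.52, 0.44, 0.58]   # cm, hypothetical
        print(round(table_tolerance(lateral_sds), 2), "cm")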

  7. The scientific data acquisition system of the GAMMA-400 space project

    NASA Astrophysics Data System (ADS)

    Bobkov, S. G.; Serdin, O. V.; Gorbunov, M. S.; Arkhangelskiy, A. I.; Topchiev, N. P.

    2016-02-01

    A description of the scientific data acquisition system (SDAS) designed by SRISA for the GAMMA-400 space project is presented. We consider the problem of unifying electronics at different levels: a set of reliable, fault-tolerant integrated circuits fabricated in a 0.25 µm silicon-on-insulator CMOS technology, and the high-speed interfaces and reliable modules used in the space instruments. The characteristics of the reliable, fault-tolerant very large scale integration (VLSI) technology designed by SRISA for developing computation systems for space applications are considered. The scalable network structure of SDAS, based on the Serial RapidIO interface and including the real-time operating system BAGET, is also described.

  8. Control of large flexible space structures

    NASA Technical Reports Server (NTRS)

    Vandervelde, W. E.

    1986-01-01

    Progress in robust design of generalized parity relations, design of failure-sensitive observers using the geometric system theory of Wonham, computational techniques for evaluation of the performance of control systems with fault tolerance and redundancy management features, and the design and evaluation of control systems for structures having nonlinear joints are described.

  9. On the design of fault-tolerant robotic manipulator systems

    NASA Technical Reports Server (NTRS)

    Tesar, Delbert

    1993-01-01

    Robotic systems are finding increasing use in space applications. Many of these devices are going to be operational on board the Space Station Freedom. Fault tolerance has been deemed necessary because of the criticality of the tasks and the inaccessibility of the systems to maintenance and repair. Design for fault tolerance in manipulator systems is an area within robotics that is without precedence in the literature. In this paper, we will attempt to lay down the foundations for such a technology. Design for fault tolerance demands new and special approaches to design, often at considerable variance from established design practices. These design aspects, together with reliability evaluation and modeling tools, are presented. Mechanical architectures that employ protective redundancies at many levels and have a modular architecture are then studied in detail. Once a mechanical architecture for fault tolerance has been derived, the chronological stages of operational fault tolerance are investigated. Failure detection, isolation, and estimation methods are surveyed, and such methods for robot sensors and actuators are derived. Failure recovery methods are also presented for each of the protective layers of redundancy. Failure recovery tactics often span all of the layers of a control hierarchy. Thus, a unified framework for decision-making and control, which orchestrates both the nominal redundancy management tasks and the failure management tasks, has been derived. The well-developed field of fault-tolerant computers is studied next, and some design principles relevant to the design of fault-tolerant robot controllers are abstracted. Conclusions are drawn, and a road map for the design of fault-tolerant manipulator systems is laid out with recommendations for a 10 DOF arm with dual actuators at each joint.

  10. AADL and Model-based Engineering

    DTIC Science & Technology

    2014-10-20

    Slide excerpts: "We rely on software for safe aircraft operation." Topics include embedded software systems, the compute platform, runtime architecture, application software, and data stream characteristics such as latency. "Why do system-level failures still occur despite fault tolerance techniques being deployed in systems?"

  11. Emulation applied to reliability analysis of reconfigurable, highly reliable, fault-tolerant computing systems

    NASA Technical Reports Server (NTRS)

    Migneault, G. E.

    1979-01-01

    Emulation techniques applied to the analysis of the reliability of highly reliable computer systems for future commercial aircraft are described. The lack of credible precision in reliability estimates obtained by analytical modeling techniques is first established. The difficulty is shown to be an unavoidable consequence of: (1) a high reliability requirement so demanding as to make system evaluation by use testing infeasible; (2) a complex system design technique, fault tolerance; (3) system reliability dominated by errors due to flaws in the system definition; and (4) elaborate analytical modeling techniques whose precision outputs are quite sensitive to errors of approximation in their input data. Next, the technique of emulation is described, indicating how its input is a simple description of the logical structure of a system and its output is the consequent behavior. Use of emulation techniques is discussed for pseudo-testing systems to evaluate bounds on the parameter values needed for the analytical techniques. Finally an illustrative example is presented to demonstrate from actual use the promise of the proposed application of emulation.

  12. Sequoia: A fault-tolerant tightly coupled multiprocessor for transaction processing

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bernstein, P.A.

    1988-02-01

    The Sequoia computer is a tightly coupled multiprocessor, and thus attains the performance advantages of this style of architecture. It avoids most of the fault-tolerance disadvantages of tight coupling by using a new fault-tolerance design. The Sequoia architecture is similar to other multimicroprocessor architectures, such as those of Encore and Sequent, in that it gives dozens of microprocessors shared access to a large main memory. It resembles the Stratus architecture in its extensive use of hardware fault-detection techniques. It resembles Stratus and Auragen in its ability to quickly recover all processes after a single point failure, transparently to the user. However, Sequoia is unique in its combination of a large-scale tightly coupled architecture with a hardware approach to fault tolerance. This article gives an overview of how the hardware architecture and operating systems (OS) work together to provide a high degree of fault tolerance with good system performance.

  13. Development and Evaluation of Fault-Tolerant Flight Control Systems

    NASA Technical Reports Server (NTRS)

    Song, Yong D.; Gupta, Kajal (Technical Monitor)

    2004-01-01

    The research is concerned with developing a new approach to enhancing the fault tolerance of flight control systems. The original motivation for fault-tolerant control comes from the need for safe operation of control elements (e.g. actuators) in the event of hardware failures in high-reliability systems. One such example is a modern space vehicle subjected to actuator/sensor impairments. A major task in flight control is to revise the control policy to balance impairment detectability and to achieve sufficient robustness. This involves careful selection of the types and parameters of the controllers and the impairment-detecting filters used. It also involves a decision, upon the identification of some failures, on whether and how a control reconfiguration should take place in order to maintain a certain system performance level. In this project a new flight dynamic model under uncertain flight conditions is considered, in which the effects of both ramp and jump faults are reflected. Stabilization algorithms based on neural networks and adaptive methods are derived. The control algorithms are shown to be effective in dealing with uncertain dynamics due to external disturbances and unpredictable faults. The overall strategy is easy to set up and the computation involved is much less than with other strategies. Computer simulation software was developed, and a series of simulation studies has been conducted with varying flight conditions.

  14. Software Implemented Fault-Tolerant (SIFT) user's guide

    NASA Technical Reports Server (NTRS)

    Green, D. F., Jr.; Palumbo, D. L.; Baltrus, D. W.

    1984-01-01

    Program development for a Software Implemented Fault Tolerant (SIFT) computer system is accomplished in the NASA LaRC AIRLAB facility using a DEC VAX-11 to interface with eight Bendix BDX 930 flight control processors. The interface software which provides this SIFT program development capability was developed by AIRLAB personnel. This technical memorandum describes the application and design of this software in detail, and is intended to assist both the user in performance of SIFT research and the systems programmer responsible for maintaining and/or upgrading the SIFT programming environment.

  15. Damage tolerance of candidate thermoset composites for use on single stage to orbit vehicles

    NASA Technical Reports Server (NTRS)

    Nettles, A. T.; Lance, D.; Hodge, A.

    1994-01-01

    Four fiber/resin systems were compared for resistance to damage and damage tolerance. One toughened epoxy and three toughened bismaleimide (BMI) resins were used, all with IM7 carbon fiber reinforcement. A statistical design of experiments technique was used to evaluate the effects of impact energy, specimen thickness, and impactor diameter on the damage area, as computed by C-scans, and residual compression-after-impact (CAI) strength. Results showed that two of the BMI systems sustained relatively large damage zones yet had an excellent retention of CAI strength.

  16. A General theory of Signal Integration for Fault-Tolerant Dynamic Distributed Sensor Networks

    DTIC Science & Technology

    1993-10-01

    related to a) the architecture and fault-tolerance of the distributed sensor network, b) the proper synchronisation of sensor signals, c) the...Computational complexities of the problem of distributed detection. 5) Issues related to recording of events and synchronization in distributed sensor...Intervals for Synchronization in Real Time Distributed Systems", Submitted to Electronic Encyclopedia. 3. V. G. Hegde and S. S. Iyengar "Efficient

  17. A Performance Prediction Model for a Fault-Tolerant Computer During Recovery and Restoration. Ph.D. Thesis Report, 1 Jan. - 31 Dec. 1992

    NASA Technical Reports Server (NTRS)

    Stoughton, John W.; Obando, Rodrigo A.

    1993-01-01

    The modeling and design of a fault-tolerant multiprocessor system is addressed. In particular, the behavior of the system during recovery and restoration after a fault has occurred is investigated. Given that a multicomputer system is designed using the Algorithm to Architecture Mapping Model (ATAMM), and that a fault (death of a computing resource) occurs during its normal steady-state operation, a model is presented as a viable research tool for predicting the performance bounds of the system during its recovery and restoration phases. Furthermore, the bounds of the performance behavior of the system during this transient mode can be assessed. These bounds include: the time to recover from the fault (t(sub rec)), the time to restore the system (t(sub res)), and whether there is a permanent delay in the system's Time Between Input and Output (TBIO) after the system has reached a steady state. An implementation of an ATAMM-based computer was developed with the Generic VHSIC Spaceborne Computer (GVSC) as the target system. A simulation of the GVSC was also written based on the code used in the ATAMM Multicomputer Operating System (AMOS). The simulation is in turn used to validate the new model's usefulness and accuracy in tracking the propagation of the delay through the system and predicting its behavior in the transient state of recovery and restoration. The model is validated as an accurate method to predict the transient behavior of an ATAMM-based multicomputer during recovery and restoration.

  18. The embedded operating system project

    NASA Technical Reports Server (NTRS)

    Campbell, R. H.

    1984-01-01

    This progress report describes research towards the design and construction of embedded operating systems for real-time advanced aerospace applications. The applications concerned require reliable operating system support that must accommodate networks of computers. The report addresses the problems of constructing such operating systems, the communications media, reconfiguration, consistency and recovery in a distributed system, and the issues of realtime processing. A discussion is included on suitable theoretical foundations for the use of atomic actions to support fault tolerance and data consistency in real-time object-based systems. In particular, this report addresses: atomic actions, fault tolerance, operating system structure, program development, reliability and availability, and networking issues. This document reports the status of various experiments designed and conducted to investigate embedded operating system design issues.

  19. Fault-tolerance in Two-dimensional Topological Systems

    NASA Astrophysics Data System (ADS)

    Anderson, Jonas T.

    This thesis is a collection of ideas with the general goal of building, at least in the abstract, a local fault-tolerant quantum computer. The connection between quantum information and topology has proven to be an active area of research in several fields. The introduction of the toric code by Alexei Kitaev demonstrated the usefulness of topology for quantum memory and quantum computation. Many quantum codes used for quantum memory are modeled by spin systems on a lattice, with operators that extract syndrome information placed on vertices or faces of the lattice. It is natural to wonder whether the useful codes in such systems can be classified. This thesis presents work that leverages ideas from topology and graph theory to explore the space of such codes. Homological stabilizer codes are introduced and it is shown that, under a set of reasonable assumptions, any qubit homological stabilizer code is equivalent to either a toric code or a color code. Additionally, the toric code and the color code correspond to distinct classes of graphs. Many systems have been proposed as candidate quantum computers. It is very desirable to design quantum computing architectures with two-dimensional layouts and low complexity in parity-checking circuitry. Kitaev's surface codes provided the first example of codes satisfying this property. They provided a new route to fault tolerance with more modest overheads and thresholds approaching 1%. The recently discovered color codes share many properties with the surface codes, such as the ability to perform syndrome extraction locally in two dimensions. Some families of color codes admit a transversal implementation of the entire Clifford group. This work investigates color codes on the 4.8.8 lattice known as triangular codes. I develop a fault-tolerant error-correction strategy for these codes in which repeated syndrome measurements on this lattice generate a three-dimensional space-time combinatorial structure. I then develop an integer program that analyzes this structure and determines the most likely set of errors consistent with the observed syndrome values. I implement this integer program to find the threshold for depolarizing noise on small versions of these triangular codes. Because the threshold for magic-state distillation is likely to be higher than this value and because logical CNOT gates can be performed by code deformation in a single block instead of between pairs of blocks, the threshold for fault-tolerant quantum memory for these codes is also the threshold for fault-tolerant quantum computation with them. Since the advent of a threshold theorem for quantum computers much has been improved upon. Thresholds have increased, architectures have become more local, and gate sets have been simplified. The overhead for magic-state distillation has been studied, but not nearly to the extent of the aforementioned topics. A method for greatly reducing this overhead, known as reusable magic states, is studied here. While examples of reusable magic states exist for Clifford gates, I give strong reasons to believe they do not exist for non-Clifford gates.

  20. Organization of the secure distributed computing based on multi-agent system

    NASA Astrophysics Data System (ADS)

    Khovanskov, Sergey; Rumyantsev, Konstantin; Khovanskova, Vera

    2018-04-01

    Nowadays, developing methods for distributed computing receives much attention. One such method is the use of multi-agent systems. Distributed computing organized over conventional networked computers can be exposed to security threats arising from the computational processes themselves. The authors have developed a unified agent algorithm for controlling the operation of computing network nodes, with networked PCs used as the computing nodes. The proposed multi-agent control system for distributed computing makes it possible, in a short time, to harness the processing power of the computers of any existing network to solve large tasks. Agents deployed on a computer network can configure a distributed computing system, distribute the computational load among the computers they operate, and optimize the distributed computing system according to the computing power of the computers on the network. The number of computers connected to the network can be increased by adding computers to the system, which raises the overall processing power. Adding a central agent to the multi-agent system increases the security of the distributed computing. This organization of the distributed computing system reduces the problem-solving time and increases the fault tolerance (vitality) of the computing processes in a changing computing environment (dynamic change of the number of computers on the network). The developed multi-agent system detects cases of falsification of the results of the distributed system, which could otherwise lead to wrong decisions. In addition, the system checks and corrects wrong results.
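
    As a rough illustration of the load-distribution and result-checking ideas described above, the sketch below assigns work units to nodes in proportion to an assumed power metric and cross-checks duplicated results; the class and function names are illustrative and not taken from the authors' system.

```python
# Hypothetical sketch of an agent distributing work units across network
# nodes in proportion to their measured computing power, plus a simple
# cross-check against falsified results; names are illustrative only.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    power: float          # e.g. benchmarked operations per second
    assigned: int = 0

def distribute(nodes, total_tasks):
    """Split total_tasks among nodes proportionally to their power."""
    total_power = sum(n.power for n in nodes)
    remaining = total_tasks
    for n in nodes[:-1]:
        n.assigned = round(total_tasks * n.power / total_power)
        remaining -= n.assigned
    nodes[-1].assigned = remaining       # last node absorbs rounding error
    return nodes

def cross_check(duplicated_results):
    """Given (answer_from_node_a, answer_from_node_b) pairs for tasks that
    were computed twice, return the task indices where the answers disagree."""
    return [i for i, (a, b) in enumerate(duplicated_results) if a != b]

cluster = [Node("pc1", 2.0), Node("pc2", 1.0), Node("pc3", 1.0)]
for node in distribute(cluster, 100):
    print(node.name, node.assigned)
print(cross_check([(42, 42), (7, 9)]))   # -> [1]: a suspect, possibly falsified result
```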

  1. Investigation of Air Transportation Technology at Princeton University, 1989-1990

    NASA Technical Reports Server (NTRS)

    Stengel, Robert F.

    1990-01-01

    The Air Transportation Technology Program at Princeton University proceeded along six avenues during the past year: microburst hazards to aircraft; machine-intelligent, fault tolerant flight control; computer aided heuristics for piloted flight; stochastic robustness for flight control systems; neural networks for flight control; and computer aided control system design. These topics are briefly discussed, and an annotated bibliography of publications that appeared between January 1989 and June 1990 is given.

  2. Step-by-step magic state encoding for efficient fault-tolerant quantum computation

    PubMed Central

    Goto, Hayato

    2014-01-01

    Quantum error correction allows one to make quantum computers fault-tolerant against unavoidable errors due to decoherence and imperfect physical gate operations. However, the fault-tolerant quantum computation requires impractically large computational resources for useful applications. This is a current major obstacle to the realization of a quantum computer. In particular, magic state distillation, which is a standard approach to universality, consumes the most resources in fault-tolerant quantum computation. For the resource problem, here we propose step-by-step magic state encoding for concatenated quantum codes, where magic states are encoded step by step from the physical level to the logical one. To manage errors during the encoding, we carefully use error detection. Since the sizes of intermediate codes are small, it is expected that the resource overheads will become lower than previous approaches based on the distillation at the logical level. Our simulation results suggest that the resource requirements for a logical magic state will become comparable to those for a single logical controlled-NOT gate. Thus, the present method opens a new possibility for efficient fault-tolerant quantum computation. PMID:25511387

  3. Step-by-step magic state encoding for efficient fault-tolerant quantum computation.

    PubMed

    Goto, Hayato

    2014-12-16

    Quantum error correction allows one to make quantum computers fault-tolerant against unavoidable errors due to decoherence and imperfect physical gate operations. However, the fault-tolerant quantum computation requires impractically large computational resources for useful applications. This is a current major obstacle to the realization of a quantum computer. In particular, magic state distillation, which is a standard approach to universality, consumes the most resources in fault-tolerant quantum computation. For the resource problem, here we propose step-by-step magic state encoding for concatenated quantum codes, where magic states are encoded step by step from the physical level to the logical one. To manage errors during the encoding, we carefully use error detection. Since the sizes of intermediate codes are small, it is expected that the resource overheads will become lower than previous approaches based on the distillation at the logical level. Our simulation results suggest that the resource requirements for a logical magic state will become comparable to those for a single logical controlled-NOT gate. Thus, the present method opens a new possibility for efficient fault-tolerant quantum computation.

  4. Coordinated Fault-Tolerance for High-Performance Computing Final Project Report

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Panda, Dhabaleswar Kumar; Beckman, Pete

    2011-07-28

    With the Coordinated Infrastructure for Fault Tolerance Systems (CIFTS, as the original project came to be called) project, our aim has been to understand and tackle the following broad research questions, the answers to which will help the HEC community analyze and shape the direction of research in the field of fault tolerance and resiliency on future high-end leadership systems. Will availability of global fault information, obtained by fault information exchange between the different HEC software on a system, allow individual system software to better detect, diagnose, and adaptively respond to faults? If fault-awareness is raised throughout the system through fault information exchange, is it possible to get all system software working together to provide a more comprehensive end-to-end fault management on the system? What are the missing fault-tolerance features that widely used HEC system software lacks today that would inhibit such software from taking advantage of systemwide global fault information? What are the practical limitations of a systemwide approach for end-to-end fault management based on fault awareness and coordination? What mechanisms, tools, and technologies are needed to bring about fault awareness and coordination of responses on a leadership-class system? What standards, outreach, and community interaction are needed for adoption of the concept of fault awareness and coordination for fault management on future systems? Keeping our overall objectives in mind, the CIFTS team has taken a parallel fourfold approach. Our central goal was to design and implement a light-weight, scalable infrastructure with a simple, standardized interface to allow communication of fault-related information through the system and facilitate coordinated responses. This work led to the development of the Fault Tolerance Backplane (FTB) publish-subscribe API specification, together with a reference implementation and several experimental implementations on top of existing publish-subscribe tools. We enhanced the intrinsic fault tolerance capabilities of representative implementations of a variety of key HPC software subsystems and integrated them with the FTB. The targeted software subsystems included MPI communication libraries, checkpoint/restart libraries, resource managers and job schedulers, and system monitoring tools. Leveraging the aforementioned infrastructure, as well as developing and utilizing additional tools, we have examined issues associated with expanded, end-to-end fault response from both system and application viewpoints. From the standpoint of system operations, we have investigated log and root cause analysis, anomaly detection and fault prediction, and generalized notification mechanisms. Our applications work has included libraries for fault-tolerant linear algebra, application frameworks for coupled multiphysics applications, and external frameworks to support the monitoring and response for general applications. Our final goal was to engage the high-end computing community to increase awareness of tools and issues around coordinated end-to-end fault management.
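
    A minimal, in-process publish-subscribe sketch can convey the kind of fault information exchange the FTB enables across system software; the sketch below is illustrative only and does not reproduce the actual FTB API or its semantics.

```python
# Minimal in-process publish-subscribe sketch for fault events, assuming a
# single shared bus; illustrative only, not the actual FTB API described above.
from collections import defaultdict

class FaultEventBus:
    def __init__(self):
        self._subscribers = defaultdict(list)   # event_type -> list of callbacks

    def subscribe(self, event_type, callback):
        self._subscribers[event_type].append(callback)

    def publish(self, event_type, payload):
        # Deliver the fault event to every subsystem that registered interest.
        for callback in self._subscribers[event_type]:
            callback(payload)

bus = FaultEventBus()
# A checkpoint/restart library reacts to node failures reported by the resource manager.
bus.subscribe("node_failure", lambda e: print("checkpoint/restart: migrating work from", e["node"]))
# A job scheduler also reacts to the same event.
bus.subscribe("node_failure", lambda e: print("scheduler: draining", e["node"]))

bus.publish("node_failure", {"node": "n042", "cause": "ECC uncorrectable"})
```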

  5. The tracking performance of distributed recoverable flight control systems subject to high intensity radiated fields

    NASA Astrophysics Data System (ADS)

    Wang, Rui

    It is known that high intensity radiated fields (HIRF) can produce upsets in digital electronics, and thereby degrade the performance of digital flight control systems. Such upsets, either from natural or man-made sources, can change data values on digital buses and memory and affect CPU instruction execution. HIRF environments are also known to trigger common-mode faults, affecting nearly-simultaneously multiple fault containment regions, and hence reducing the benefits of n-modular redundancy and other fault-tolerant computing techniques. Thus, it is important to develop models which describe the integration of the embedded digital system, where the control law is implemented, as well as the dynamics of the closed-loop system. In this dissertation, theoretical tools are presented to analyze the relationship between the design choices for a class of distributed recoverable computing platforms and the tracking performance degradation of a digital flight control system implemented on such a platform while operating in a HIRF environment. Specifically, a tractable hybrid performance model is developed for a digital flight control system implemented on a computing platform inspired largely by the NASA family of fault-tolerant, reconfigurable computer architectures known as SPIDER (scalable processor-independent design for enhanced reliability). The focus will be on the SPIDER implementation, which uses the computer communication system known as ROBUS-2 (reliable optical bus). A physical HIRF experiment was conducted at the NASA Langley Research Center in order to validate the theoretical tracking performance degradation predictions for a distributed Boeing 747 flight control system subject to a HIRF environment. An extrapolation of these results for scenarios that could not be physically tested is also presented.

  6. Human factors aspects of control room design

    NASA Technical Reports Server (NTRS)

    Jenkins, J. P.

    1983-01-01

    A plan for the design and analysis of a multistation control room is reviewed. It is found that acceptance of the computer-based information system by the users in the control room is mandatory for mission and system success. Criteria to improve the computer/user interface include: match of system input/output with the user; reliability, compatibility and maintainability; easy to learn and little training needed; self-descriptive system; system under user control; transparent language, format and organization; corresponds to user expectations; adaptable to user experience level; fault tolerant; dialog capability; user communications needs reflected in flexibility, complexity, power and information load; integrated system; and documentation.

  7. Probability of Future Observations Exceeding One-Sided, Normal, Upper Tolerance Limits

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Edwards, Timothy S.

    Normal tolerance limits are frequently used in dynamic environments specifications of aerospace systems as a method to account for aleatory variability in the environments. Upper tolerance limits, when used in this way, are computed from records of the environment and used to enforce conservatism in the specification by describing upper extreme values the environment may take in the future. Components and systems are designed to withstand these extreme loads to ensure they do not fail under normal use conditions. The degree of conservatism in the upper tolerance limits is controlled by specifying the coverage and confidence level (usually written in “coverage/confidence” form). Moreover, in high-consequence systems it is common to specify tolerance limits at 95% or 99% coverage and confidence at the 50% or 90% level. Despite the ubiquity of upper tolerance limits in the aerospace community, analysts and decision-makers frequently misinterpret their meaning. The misinterpretation extends into the standards that govern much of the acceptance and qualification of commercial and government aerospace systems. As a result, the risk of a future observation of the environment exceeding the upper tolerance limit is sometimes significantly underestimated by decision makers. This note explains the meaning of upper tolerance limits and a related measure, the upper prediction limit. So, the objective of this work is to clarify the probability of exceeding these limits in flight so that decision-makers can better understand the risk associated with exceeding design and test levels during flight and balance the cost of design and development with that of mission failure.

  8. The evolvability of programmable hardware.

    PubMed

    Raman, Karthik; Wagner, Andreas

    2011-02-06

    In biological systems, individual phenotypes are typically adopted by multiple genotypes. Examples include protein structure phenotypes, where each structure can be adopted by a myriad individual amino acid sequence genotypes. These genotypes form vast connected 'neutral networks' in genotype space. The size of such neutral networks endows biological systems not only with robustness to genetic change, but also with the ability to evolve a vast number of novel phenotypes that occur near any one neutral network. Whether technological systems can be designed to have similar properties is poorly understood. Here we ask this question for a class of programmable electronic circuits that compute digital logic functions. The functional flexibility of such circuits is important in many applications, including applications of evolutionary principles to circuit design. The functions they compute are at the heart of all digital computation. We explore a vast space of 10^45 logic circuits ('genotypes') and 10^19 logic functions ('phenotypes'). We demonstrate that circuits that compute the same logic function are connected in large neutral networks that span circuit space. Their robustness or fault-tolerance varies very widely. The vicinity of each neutral network contains circuits with a broad range of novel functions. Two circuits computing different functions can usually be converted into one another via few changes in their architecture. These observations show that properties important for the evolvability of biological systems exist in a commercially important class of electronic circuitry. They also point to generic ways to generate fault-tolerant, adaptable and evolvable electronic circuitry.

  9. The evolvability of programmable hardware

    PubMed Central

    Raman, Karthik; Wagner, Andreas

    2011-01-01

    In biological systems, individual phenotypes are typically adopted by multiple genotypes. Examples include protein structure phenotypes, where each structure can be adopted by a myriad individual amino acid sequence genotypes. These genotypes form vast connected ‘neutral networks’ in genotype space. The size of such neutral networks endows biological systems not only with robustness to genetic change, but also with the ability to evolve a vast number of novel phenotypes that occur near any one neutral network. Whether technological systems can be designed to have similar properties is poorly understood. Here we ask this question for a class of programmable electronic circuits that compute digital logic functions. The functional flexibility of such circuits is important in many applications, including applications of evolutionary principles to circuit design. The functions they compute are at the heart of all digital computation. We explore a vast space of 10^45 logic circuits (‘genotypes’) and 10^19 logic functions (‘phenotypes’). We demonstrate that circuits that compute the same logic function are connected in large neutral networks that span circuit space. Their robustness or fault-tolerance varies very widely. The vicinity of each neutral network contains circuits with a broad range of novel functions. Two circuits computing different functions can usually be converted into one another via few changes in their architecture. These observations show that properties important for the evolvability of biological systems exist in a commercially important class of electronic circuitry. They also point to generic ways to generate fault-tolerant, adaptable and evolvable electronic circuitry. PMID:20534598
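
    The notion of a circuit genotype, its phenotype (the logic function it computes), and a neutral mutation can be made concrete with a toy model. The sketch below uses a fixed-length list of NAND gates rather than the authors' circuit representation, so it is only an illustration of the concepts.

```python
# Illustrative sketch (not the authors' model) of a tiny "circuit genotype":
# a fixed-length list of NAND gates, each reading two earlier signals.
# Its phenotype is the truth table of the last gate; a mutation is neutral
# if rewiring one gate input leaves the truth table unchanged.
from itertools import product

N_INPUTS = 3

def phenotype(genotype):
    """genotype: list of (i, j) index pairs; gate k computes NAND of signals i and j."""
    table = []
    for bits in product([0, 1], repeat=N_INPUTS):
        signals = list(bits)
        for i, j in genotype:
            signals.append(1 - (signals[i] & signals[j]))
        table.append(signals[-1])            # output of the last gate
    return tuple(table)

def is_neutral(genotype, gate, input_slot, new_source):
    """Check whether rewiring one gate input changes the computed function."""
    mutant = [list(g) for g in genotype]
    mutant[gate][input_slot] = new_source
    return phenotype([tuple(g) for g in mutant]) == phenotype(genotype)

circuit = [(0, 1), (2, 3), (3, 4)]           # three NAND gates on three inputs
print(phenotype(circuit))                    # the circuit's phenotype (truth table)
print(is_neutral(circuit, gate=0, input_slot=1, new_source=2))
```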

  10. Probability of Future Observations Exceeding One-Sided, Normal, Upper Tolerance Limits

    DOE PAGES

    Edwards, Timothy S.

    2014-10-29

    Normal tolerance limits are frequently used in dynamic environments specifications of aerospace systems as a method to account for aleatory variability in the environments. Upper tolerance limits, when used in this way, are computed from records of the environment and used to enforce conservatism in the specification by describing upper extreme values the environment may take in the future. Components and systems are designed to withstand these extreme loads to ensure they do not fail under normal use conditions. The degree of conservatism in the upper tolerance limits is controlled by specifying the coverage and confidence level (usually written in “coverage/confidence” form). Moreover, in high-consequence systems it is common to specify tolerance limits at 95% or 99% coverage and confidence at the 50% or 90% level. Despite the ubiquity of upper tolerance limits in the aerospace community, analysts and decision-makers frequently misinterpret their meaning. The misinterpretation extends into the standards that govern much of the acceptance and qualification of commercial and government aerospace systems. As a result, the risk of a future observation of the environment exceeding the upper tolerance limit is sometimes significantly underestimated by decision makers. This note explains the meaning of upper tolerance limits and a related measure, the upper prediction limit. So, the objective of this work is to clarify the probability of exceeding these limits in flight so that decision-makers can better understand the risk associated with exceeding design and test levels during flight and balance the cost of design and development with that of mission failure.
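
    The k-factor for a one-sided normal upper tolerance limit is commonly computed from a noncentral t quantile, and the exceedance risk the note discusses can be estimated by simulation. The sketch below assumes the usual normal-theory k-factor and a standard normal environment; the specific numbers are illustrative, not taken from the note.

```python
# Hedged sketch: the standard one-sided normal tolerance-limit k-factor
# (noncentral-t form) plus a Monte Carlo estimate of how often one future
# observation exceeds the computed limit. Illustrative only.
import numpy as np
from scipy.stats import nct, norm

def k_factor(n, coverage, confidence):
    """One-sided normal tolerance limit factor: limit = mean + k * std."""
    delta = norm.ppf(coverage) * np.sqrt(n)          # noncentrality parameter
    return nct.ppf(confidence, df=n - 1, nc=delta) / np.sqrt(n)

rng = np.random.default_rng(0)
n, coverage, confidence = 10, 0.95, 0.50             # a "95/50" limit from 10 records
k = k_factor(n, coverage, confidence)

exceed, trials = 0, 20000
for _ in range(trials):
    sample = rng.normal(size=n)                       # records of the environment
    limit = sample.mean() + k * sample.std(ddof=1)    # upper tolerance limit
    exceed += rng.normal() > limit                    # one future flight observation
print(f"k = {k:.3f}, estimated P(one future observation exceeds the 95/50 limit) = {exceed / trials:.3f}")
```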

  11. A research program in empirical computer science

    NASA Technical Reports Server (NTRS)

    Knight, J. C.

    1991-01-01

    During the grant reporting period our primary activities have been to begin preparation for the establishment of a research program in experimental computer science. The focus of research in this program will be safety-critical systems. Many questions that arise in the effort to improve software dependability can only be addressed empirically. For example, there is no way to predict the performance of the various proposed approaches to building fault-tolerant software. Performance models, though valuable, are parameterized and cannot be used to make quantitative predictions without experimental determination of underlying distributions. In the past, experimentation has been able to shed some light on the practical benefits and limitations of software fault tolerance. It is common, also, for experimentation to reveal new questions or new aspects of problems that were previously unknown. A good example is the Consistent Comparison Problem that was revealed by experimentation and subsequently studied in depth. The result was a clear understanding of a previously unknown problem with software fault tolerance. The purpose of a research program in empirical computer science is to perform controlled experiments in the area of real-time, embedded control systems. The goal of the various experiments will be to determine better approaches to the construction of the software for computing systems that have to be relied upon. As such it will validate research concepts from other sources, provide new research results, and facilitate the transition of research results from concepts to practical procedures that can be applied with low risk to NASA flight projects. The target of experimentation will be the production software development activities undertaken by any organization prepared to contribute to the research program. Experimental goals, procedures, data analysis and result reporting will be performed for the most part by the University of Virginia.

  12. A data management system to enable urgent natural disaster computing

    NASA Astrophysics Data System (ADS)

    Leong, Siew Hoon; Kranzlmüller, Dieter; Frank, Anton

    2014-05-01

    Civil protection, in particular natural disaster management, is very important to most nations and civilians in the world. When disasters like flash floods, earthquakes and tsunamis are expected or have taken place, it is of utmost importance to make timely decisions for managing the affected areas and reduce casualties. Computer simulations can generate information and provide predictions to facilitate this decision making process. Getting the data to the required resources is a critical requirement to enable the timely computation of the predictions. An urgent data management system to support natural disaster computing is thus necessary to effectively carry out data activities within a stipulated deadline. Since the trigger of a natural disaster is usually unpredictable, it is not always possible to prepare required resources well in advance. As such, an urgent data management system for natural disaster computing has to be able to work with any type of resources. Additional requirements include the need to manage deadlines and huge volumes of data, fault tolerance, reliability, flexibility to changes, ease of use, etc. The proposed data management platform includes a service manager to provide a uniform and extensible interface for the supported data protocols, a configuration manager to check and retrieve configurations of available resources, a scheduler manager to ensure that the deadlines can be met, a fault tolerance manager to increase the reliability of the platform, and a data manager to initiate and perform the data activities. These managers will enable the selection of the most appropriate resource, transfer protocol, etc. such that the hard deadline of an urgent computation can be met for a particular urgent activity, e.g. data staging or computation. We associate two types of deadlines [2] with an urgent computing system. Soft-firm deadline: missing a soft-firm deadline will render the computation less useful, resulting in a cost that can have severe consequences. Hard deadline: missing a hard deadline renders the computation useless and results in full catastrophic consequences. A prototype of this system has a REST-based service manager. The REST-based implementation provides a uniform interface that is easy to use. New and upcoming file transfer protocols can easily be added and accessed via the service manager. The service manager interacts with the other four managers to coordinate the data activities so that the fundamental natural disaster urgent computing requirement, i.e. the deadline, can be fulfilled in a reliable manner. A data activity can include data staging, data archiving and data storing. Reliability is ensured by the choice of a network-of-managers organisation model [1], the configuration manager and the fault tolerance manager. With this proposed design, an easy-to-use, resource-independent data management system that can support and fulfill the computation of a natural disaster prediction within stipulated deadlines can thus be realised. References [1] H. G. Hegering, S. Abeck, and B. Neumair, Integrated management of networked systems - concepts, architectures, and their operational application, Morgan Kaufmann Publishers, 340 Pine Street, Sixth Floor, San Francisco, CA 94104-3205, USA, 1999. [2] H. Kopetz, Real-time systems design principles for distributed embedded applications, second edition, Springer, LLC, 233 Spring Street, New York, NY 10013, USA, 2011. [3] S. H. Leong, A. Frank, and D. Kranzlmüller, Leveraging e-infrastructures for urgent computing, Procedia Computer Science 18 (2013), no. 0, 2177-2186, 2013 International Conference on Computational Science. [4] N. Trebon, Enabling urgent computing within the existing distributed computing infrastructure, Ph.D. thesis, University of Chicago, August 2011, http://people.cs.uchicago.edu/~ntrebon/docs/dissertation.pdf.
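
    The deadline-driven resource selection performed by the scheduler manager can be sketched as picking, among the available resources, one whose estimated staging-plus-compute time meets the hard deadline. The names and timing estimates below are hypothetical, not the platform's actual interfaces.

```python
# Illustrative sketch (names hypothetical) of a scheduler choosing a resource
# so that an urgent data activity finishes before its hard deadline.
from dataclasses import dataclass

@dataclass
class Resource:
    name: str
    transfer_rate_mb_s: float     # achievable data-staging rate
    compute_time_s: float         # estimated run time of the prediction

def estimated_finish(resource, data_mb):
    """Estimated staging time plus compute time, in seconds."""
    return data_mb / resource.transfer_rate_mb_s + resource.compute_time_s

def select_resource(resources, data_mb, hard_deadline_s):
    """Return the feasible resource with the earliest estimated finish time,
    or None if the hard deadline cannot be met on any resource."""
    feasible = [(estimated_finish(r, data_mb), r) for r in resources
                if estimated_finish(r, data_mb) <= hard_deadline_s]
    return min(feasible, key=lambda pair: pair[0])[1] if feasible else None

pool = [Resource("cluster_a", 50.0, 1200.0), Resource("cloud_b", 200.0, 2000.0)]
choice = select_resource(pool, data_mb=100000.0, hard_deadline_s=3600.0)
print(choice.name if choice else "hard deadline cannot be met -> escalate")
```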

  13. Achieving reliability - The evolution of redundancy in American manned spacecraft computers

    NASA Technical Reports Server (NTRS)

    Tomayko, J. E.

    1985-01-01

    The Shuttle is the first launch system deployed by NASA with full redundancy in the on-board computer systems. Fault-tolerance, i.e., restoring to a backup with lesser capabilities, was the method selected for Apollo. The Gemini capsule was the first to carry a computer, which also served as backup for Titan launch vehicle guidance. Failure of the Gemini computer resulted in manual control of the spacecraft. The Apollo system served vehicle flight control and navigation functions. The redundant computer on Skylab provided attitude control only in support of solar telescope pointing. The STS digital, fly-by-wire avionics system requires 100 percent reliability. The Orbiter carries five general purpose computers, four being fully redundant and the fifth being solely an ascent-descent tool. The computers are synchronized at input and output points at a rate of about six times a second. The system is projected to cause a loss of an Orbiter only four times in a billion flights.
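
    The output comparison among redundant channels can be illustrated with generic majority voting; the sketch below is not the Shuttle's actual redundancy-management logic, which compares synchronized outputs and votes out a failed general purpose computer, but it conveys the idea.

```python
# Minimal sketch of output voting among redundant channels at a synchronized
# output point; generic majority voting, not the Shuttle's actual logic.
from collections import Counter

def vote(channel_outputs):
    """Return (majority value, indices of channels that disagree with it)."""
    majority, _ = Counter(channel_outputs).most_common(1)[0]
    dissenters = [i for i, v in enumerate(channel_outputs) if v != majority]
    return majority, dissenters

# Four redundant computers produce an actuator command at a synchronized
# output point; one channel has been upset and disagrees.
command, suspects = vote([1024, 1024, 1023, 1024])
print(command, "suspect channels:", suspects)   # -> 1024 suspect channels: [2]
```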

  14. Local rollback for fault-tolerance in parallel computing systems

    DOEpatents

    Blumrich, Matthias A [Yorktown Heights, NY; Chen, Dong [Yorktown Heights, NY; Gara, Alan [Yorktown Heights, NY; Giampapa, Mark E [Yorktown Heights, NY; Heidelberger, Philip [Yorktown Heights, NY; Ohmacht, Martin [Yorktown Heights, NY; Steinmacher-Burow, Burkhard [Boeblingen, DE; Sugavanam, Krishnan [Yorktown Heights, NY

    2012-01-24

    A control logic device performs a local rollback in a parallel super computing system. The super computing system includes at least one cache memory device. The control logic device determines a local rollback interval. The control logic device runs at least one instruction in the local rollback interval. The control logic device evaluates whether an unrecoverable condition occurs while running the at least one instruction during the local rollback interval. The control logic device checks whether an error occurs during the local rollback. The control logic device restarts the local rollback interval if the error occurs and the unrecoverable condition does not occur during the local rollback interval.
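
    The control flow described in the abstract (run a rollback interval, restart it on a recoverable error, escalate on an unrecoverable condition) can be sketched as follows; the function names and error probabilities are illustrative, not the patented implementation.

```python
# Hedged sketch of the rollback control flow described above: execute an
# interval, re-run it from the saved state if a recoverable error was
# detected, and escalate if the error is unrecoverable. Illustrative only.
import random

def run_interval(state):
    """Run one rollback interval; return (new_state, error, unrecoverable)."""
    new_state = state + 1                              # stand-in for real computation
    error = random.random() < 0.3                      # e.g. a detected soft error
    unrecoverable = error and random.random() < 0.1    # e.g. state escaped the cache
    return new_state, error, unrecoverable

def local_rollback(state, max_retries=5):
    saved = state                                      # checkpoint held locally
    for _ in range(max_retries):
        new_state, error, unrecoverable = run_interval(saved)
        if unrecoverable:
            raise RuntimeError("escalate to global checkpoint/restart")
        if not error:
            return new_state                           # commit the interval
        # error but recoverable: discard the results and restart the interval
    raise RuntimeError("retry budget exhausted")

random.seed(1)
print(local_rollback(0))
```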

  15. The aircraft energy efficiency active controls technology program

    NASA Technical Reports Server (NTRS)

    Hood, R. V., Jr.

    1977-01-01

    Broad outlines of the NASA Aircraft Energy Efficiency Program for expediting the application of active controls technology to civil transport aircraft are presented. Advances in propulsion and airframe technology to cut down on fuel consumption and fuel costs, a program for an energy-efficient transport, and integrated analysis and design technology in aerodynamics, structures, and active controls are envisaged. Fault-tolerant computer systems and fault-tolerant flight control system architectures are under study. Contracts with leading manufacturers for research and development work on wing-tip extensions and winglets for the B-747, a wing load alleviation system, elastic mode suppression, maneuver-load control, and gust alleviation are mentioned.

  16. MAGMA: A Liquid Software Approach to Fault Tolerance, Computer Network Security, and Survivable Networking

    DTIC Science & Technology

    2001-12-01

    and Lieutenant Namik Kaplan, Turkish Navy. Maj Tiefert’s thesis, “Modeling Control Channel Dynamics of SAAM using NS Network Simulation”, helped lay...[DEC99] Deconinck, Dr. ir. Geert, Fault Tolerant Systems, ESAT / Division ACCA, Katholieke Universiteit Leuven, October 1999. [FRE00] Freed...Systems”, Addison-Wesley, 1989. [KAP99] Kaplan, Namik, “Prototyping of an Active and Lightweight Router,” March 1999 [KAT99] Kati, Effraim

  17. Imperfect construction of microclusters

    NASA Astrophysics Data System (ADS)

    Schneider, E.; Zhou, K.; Gilbert, G.; Weinstein, Y. S.

    2014-01-01

    Microclusters are the basic building blocks used to construct cluster states capable of supporting fault-tolerant quantum computation. In this paper, we explore the consequences of errors on microcluster construction using two error models. To quantify the effect of the errors we calculate the fidelity of the constructed microclusters and the fidelity with which two such microclusters can be fused together. Such simulations are vital for gauging the capability of an experimental system to achieve fault tolerance.

  18. A hierarchical approach to reliability modeling of fault-tolerant systems. M.S. Thesis

    NASA Technical Reports Server (NTRS)

    Gossman, W. E.

    1986-01-01

    A methodology for performing fault tolerant system reliability analysis is presented. The method decomposes a system into its subsystems, evaluates event rates derived from each subsystem's conditional state probability vector, and incorporates those results into a hierarchical Markov model of the system. This is done in a manner that addresses the failure sequence dependence associated with the system's redundancy management strategy. The method is derived for application to a specific system definition. Results are presented that compare the hierarchical model's unreliability prediction to that of a more complicated standard Markov model of the system. The results for the example given indicate that the hierarchical method predicts system unreliability to a desirable level of accuracy while achieving significant computational savings relative to a component-level Markov model of the system.
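
    For context, a component-level Markov reliability model of the kind the hierarchical method is compared against can be quite small; the sketch below solves a triplex-with-imperfect-coverage model by matrix exponential. The rates and coverage value are illustrative, and this is not the thesis's hierarchical model.

```python
# Hedged sketch of a small Markov reliability model (a triplex channel set
# with imperfect fault coverage), solved with a matrix exponential; it only
# illustrates the kind of model involved, not the thesis's method.
import numpy as np
from scipy.linalg import expm

lam = 1e-4      # assumed per-hour failure rate of one channel
c = 0.99        # assumed probability a channel failure is detected and isolated
T = 10.0        # mission time in hours

# States: 0 = three good channels, 1 = two good channels, 2 = system failed.
Q = np.array([
    [-3 * lam,  3 * lam * c,  3 * lam * (1 - c)],
    [0.0,       -2 * lam,     2 * lam          ],
    [0.0,        0.0,         0.0              ],
])

p0 = np.array([1.0, 0.0, 0.0])       # start with all three channels healthy
pT = p0 @ expm(Q * T)                # state probabilities at mission time T
print(f"unreliability at {T} h: {pT[2]:.3e}")
```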

  19. Integrated Environment for Development and Assurance

    DTIC Science & Technology

    2015-01-26

    Jan 26, 2015 © 2015 Carnegie Mellon University We Rely on Software for Safe Aircraft Operation Embedded software systems introduce a new class of...Developer Compute Platform Runtime Architecture Application Software Embedded SW System Engineer Data Stream Characteristics Latency jitter affects...Why do system level failures still occur despite fault tolerance techniques being deployed in systems? Embedded software system as major source of

  20. Optimization of a Solid-State Electron Spin Qubit Using Gate Set Tomography (Open Access, Publisher’s Version)

    DTIC Science & Technology

    2016-10-13

    enielse@sandia.gov and a.morello@unsw.edu.au Keywords: quantum computing, silicon, tomography Supplementary material for this article is available online...Abstract State of the art qubit systems are reaching the gate fidelities required for scalable quantum computation architectures. Further improvements in...and addressed when the qubit is used within a fault-tolerant quantum computation scheme. 1. Introduction One of the main challenges in the physical

  1. Experimental fault-tolerant universal quantum gates with solid-state spins under ambient conditions

    PubMed Central

    Rong, Xing; Geng, Jianpei; Shi, Fazhan; Liu, Ying; Xu, Kebiao; Ma, Wenchao; Kong, Fei; Jiang, Zhen; Wu, Yang; Du, Jiangfeng

    2015-01-01

    Quantum computation provides great speedup over its classical counterpart for certain problems. One of the key challenges for quantum computation is to realize precise control of the quantum system in the presence of noise. Control of the spin-qubits in solids with the accuracy required by fault-tolerant quantum computation under ambient conditions remains elusive. Here, we quantitatively characterize the source of noise during quantum gate operation and demonstrate strategies to suppress the effect of these. A universal set of logic gates in a nitrogen-vacancy centre in diamond are reported with an average single-qubit gate fidelity of 0.999952 and two-qubit gate fidelity of 0.992. These high control fidelities have been achieved at room temperature in naturally abundant 13C diamond via composite pulses and an optimized control method. PMID:26602456

  2. 2nd Generation QUATARA Flight Computer Project

    NASA Technical Reports Server (NTRS)

    Falker, Jay; Keys, Andrew; Fraticelli, Jose Molina; Capo-Iugo, Pedro; Peeples, Steven

    2015-01-01

    Single core flight computer boards have been designed, developed, and tested (DD&T) to be flown in small satellites for the last few years. In this project, a prototype flight computer will be designed as a distributed multi-core system containing four microprocessors running code in parallel. This flight computer will be capable of performing multiple computationally intensive tasks such as processing digital and/or analog data, controlling actuator systems, managing cameras, operating robotic manipulators and transmitting/receiving from/to a ground station. In addition, this flight computer will be designed to be fault tolerant both by creating a robust physical hardware connection and by using a software voting scheme to determine the processor's performance. This voting scheme will leverage the work done for the Space Launch System (SLS) flight software. The prototype flight computer will be constructed with Commercial Off-The-Shelf (COTS) components which are estimated to survive for two years in a low-Earth orbit.

  3. Lessons learned in creating spacecraft computer systems: Implications for using Ada (R) for the space station

    NASA Technical Reports Server (NTRS)

    Tomayko, James E.

    1986-01-01

    Twenty-five years of spacecraft onboard computer development have resulted in a better understanding of the requirements for effective, efficient, and fault tolerant flight computer systems. Lessons from eight flight programs (Gemini, Apollo, Skylab, Shuttle, Mariner, Voyager, and Galileo) and three research programs (digital fly-by-wire, STAR, and the Unified Data System) are useful in projecting the computer hardware configuration of the Space Station and the ways in which the Ada programming language will enhance the development of the necessary software. The evolution of hardware technology, fault protection methods, and software architectures used in space flight is reviewed in order to provide insight into the pending development of such items for the Space Station.

  4. Development of an ADC radiation tolerance characterization system for the upgrade of the ATLAS LAr calorimeter

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Liu, Hong-Bin; Chen, Hu-Cheng; Chen, Kai

    ATLAS LAr calorimeter will undergo its Phase-I upgrade during the long shutdown (LS2) in 2018, and a new LAr Trigger Digitizer Board (LTDB) will be designed and installed. Several commercial-off-the-shelf (COTS) multi-channel high-speed ADCs have been selected as possible backups of the radiation tolerant ADC ASICs for the LTDB. Here, to evaluate the radiation tolerance of these backup commercial ADCs, we developed an ADC radiation tolerance characterization system, which includes the ADC boards, data acquisition (DAQ) board, signal generator, external power supplies and a host computer. The ADC board is custom designed for different ADCs, with ADC drivers and clock distribution circuits integrated on board. The Xilinx ZC706 FPGA development board is used as a DAQ board. The data from the ADC are routed to the FPGA through the FMC (FPGA Mezzanine Card) connector, de-serialized and monitored by the FPGA, and then transmitted to the host computer through the Gigabit Ethernet. A software program has been developed with Python, and all the commands are sent to the DAQ board through Gigabit Ethernet by this program. Two ADC boards have been designed for the ADC, ADS52J90 from Texas Instruments and AD9249 from Analog Devices respectively. TID tests for both ADCs have been performed at BNL, and an SEE test for the ADS52J90 has been performed at Massachusetts General Hospital (MGH). Test results have been analyzed and presented. The test results demonstrate that this test system is very versatile, and works well for the radiation tolerance characterization of commercial multi-channel high-speed ADCs for the upgrade of the ATLAS LAr calorimeter. It is applicable to other collider physics experiments where radiation tolerance is required as well.

  5. Development of an ADC radiation tolerance characterization system for the upgrade of the ATLAS LAr calorimeter

    DOE PAGES

    Liu, Hong-Bin; Chen, Hu-Cheng; Chen, Kai; ...

    2017-02-01

    ATLAS LAr calorimeter will undergo its Phase-I upgrade during the long shutdown (LS2) in 2018, and a new LAr Trigger Digitizer Board (LTDB) will be designed and installed. Several commercial-off-the-shelf (COTS) multi-channel high-speed ADCs have been selected as possible backups of the radiation tolerant ADC ASICs for the LTDB. Here, to evaluate the radiation tolerance of these backup commercial ADCs, we developed an ADC radiation tolerance characterization system, which includes the ADC boards, data acquisition (DAQ) board, signal generator, external power supplies and a host computer. The ADC board is custom designed for different ADCs, with ADC drivers and clock distribution circuits integrated on board. The Xilinx ZC706 FPGA development board is used as a DAQ board. The data from the ADC are routed to the FPGA through the FMC (FPGA Mezzanine Card) connector, de-serialized and monitored by the FPGA, and then transmitted to the host computer through the Gigabit Ethernet. A software program has been developed with Python, and all the commands are sent to the DAQ board through Gigabit Ethernet by this program. Two ADC boards have been designed for the ADC, ADS52J90 from Texas Instruments and AD9249 from Analog Devices respectively. TID tests for both ADCs have been performed at BNL, and an SEE test for the ADS52J90 has been performed at Massachusetts General Hospital (MGH). Test results have been analyzed and presented. The test results demonstrate that this test system is very versatile, and works well for the radiation tolerance characterization of commercial multi-channel high-speed ADCs for the upgrade of the ATLAS LAr calorimeter. It is applicable to other collider physics experiments where radiation tolerance is required as well.

  6. Development of an ADC radiation tolerance characterization system for the upgrade of the ATLAS LAr calorimeter

    NASA Astrophysics Data System (ADS)

    Liu, Hong-Bin; Chen, Hu-Cheng; Chen, Kai; Kierstead, James; Lanni, Francesco; Takai, Helio; Jin, Ge

    2017-02-01

    ATLAS LAr calorimeter will undergo its Phase-I upgrade during the long shutdown (LS2) in 2018, and a new LAr Trigger Digitizer Board (LTDB) will be designed and installed. Several commercial-off-the-shelf (COTS) multi-channel high-speed ADCs have been selected as possible backups of the radiation tolerant ADC ASICs for the LTDB. To evaluate the radiation tolerance of these backup commercial ADCs, we developed an ADC radiation tolerance characterization system, which includes the ADC boards, data acquisition (DAQ) board, signal generator, external power supplies and a host computer. The ADC board is custom designed for different ADCs, with ADC drivers and clock distribution circuits integrated on board. The Xilinx ZC706 FPGA development board is used as a DAQ board. The data from the ADC are routed to the FPGA through the FMC (FPGA Mezzanine Card) connector, de-serialized and monitored by the FPGA, and then transmitted to the host computer through the Gigabit Ethernet. A software program has been developed with Python, and all the commands are sent to the DAQ board through Gigabit Ethernet by this program. Two ADC boards have been designed for the ADC, ADS52J90 from Texas Instruments and AD9249 from Analog Devices respectively. TID tests for both ADCs have been performed at BNL, and an SEE test for the ADS52J90 has been performed at Massachusetts General Hospital (MGH). Test results have been analyzed and presented. The test results demonstrate that this test system is very versatile, and works well for the radiation tolerance characterization of commercial multi-channel high-speed ADCs for the upgrade of the ATLAS LAr calorimeter. It is applicable to other collider physics experiments where radiation tolerance is required as well. Supported by the U. S. Department of Energy (DE-SC001270)
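
    Since the host-side software is described as a Python program sending commands to the DAQ board over Gigabit Ethernet, a minimal command sender might look like the sketch below; the IP address, port, and command strings are hypothetical and not the actual protocol of this system.

```python
# Hedged sketch of sending configuration commands to a DAQ board over
# Ethernet with a TCP socket; the address, port, and command strings are
# hypothetical, not the actual protocol of the system described above.
import socket

DAQ_ADDRESS = ("192.168.1.10", 5000)   # assumed board IP and command port

def send_command(command: str) -> str:
    """Send one newline-terminated command and return the board's reply."""
    with socket.create_connection(DAQ_ADDRESS, timeout=2.0) as sock:
        sock.sendall((command + "\n").encode("ascii"))
        return sock.recv(1024).decode("ascii").strip()

if __name__ == "__main__":
    print(send_command("SET_SAMPLE_RATE 40e6"))   # hypothetical command
    print(send_command("START_READOUT"))          # hypothetical command
```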

  7. Navigation Ground Data System Engineering for the Cassini/Huygens Mission

    NASA Technical Reports Server (NTRS)

    Beswick, R. M.; Antreasian, P. G.; Gillam, S. D.; Hahn, Y.; Roth, D. C.; Jones, J. B.

    2008-01-01

    The launch of the Cassini/Huygens mission on October 15, 1997, began a seven year journey across the solar system that culminated in the entry of the spacecraft into Saturnian orbit on June 30, 2004. Cassini/Huygens Spacecraft Navigation is the result of a complex interplay between several teams within the Cassini Project, performed on the Ground Data System. The work of Spacecraft Navigation involves rigorous requirements for accuracy and completeness carried out often under uncompromising critical time pressures. To support the Navigation function, a fault-tolerant, high-reliability/high-availability computational environment was necessary to support data processing. Configuration Management (CM) was integrated with fault tolerant design and security engineering, according to the cornerstone principles of Confidentiality, Integrity, and Availability. Integrated with this approach are security benchmarks and validation to meet strict confidence levels. In addition, similar approaches to CM were applied in consideration of the staffing and training of the system administration team supporting this effort. As a result, the current configuration of this computational environment incorporates a secure, modular system, that provides for almost no downtime during tour operations.

  8. The data storage grid: the next generation of fault-tolerant storage for backup and disaster recovery of clinical images

    NASA Astrophysics Data System (ADS)

    King, Nelson E.; Liu, Brent; Zhou, Zheng; Documet, Jorge; Huang, H. K.

    2005-04-01

    Grid Computing represents the latest and most exciting technology to evolve from the familiar realm of parallel, peer-to-peer and client-server models that can address the problem of fault-tolerant storage for backup and recovery of clinical images. We have researched and developed a novel Data Grid testbed involving several federated PAC systems based on grid architecture. By integrating a grid computing architecture into the DICOM environment, a failed PACS archive can recover its image data from others in the federation in a timely and seamless fashion. The design reflects the five-layer architecture of grid computing: Fabric, Resource, Connectivity, Collective, and Application Layers. The testbed Data Grid architecture representing three federated PAC systems, the Fault-Tolerant PACS archive server at the Image Processing and Informatics Laboratory, Marina del Rey, the clinical PACS at Saint John's Health Center, Santa Monica, and the clinical PACS at the Healthcare Consultation Center II, USC Health Science Campus, will be presented. The successful demonstration of the Data Grid in the testbed will provide an understanding of the Data Grid concept in clinical image data backup, establish benchmarks for performance against future grid technology improvements, and serve as a road map for expanded research into large enterprise and federation level data grids to guarantee 99.999% uptime.

  9. The process group approach to reliable distributed computing

    NASA Technical Reports Server (NTRS)

    Birman, Kenneth P.

    1991-01-01

    The difficulty of developing reliable distributed software is an impediment to applying distributed computing technology in many settings. Experience with the ISIS system suggests that a structured approach based on virtually synchronous process groups yields systems which are substantially easier to develop, fault-tolerant, and self-managing. Six years of research on ISIS are reviewed, describing the model, the types of applications to which ISIS was applied, and some of the reasoning that underlies a recent effort to redesign and reimplement ISIS as a much smaller, lightweight system.

  10. Hyperswitch Communication Network Computer

    NASA Technical Reports Server (NTRS)

    Peterson, John C.; Chow, Edward T.; Priel, Moshe; Upchurch, Edwin T.

    1993-01-01

    Hyperswitch Communications Network (HCN) computer is prototype multiple-processor computer being developed. Incorporates improved version of hyperswitch communication network described in "Hyperswitch Network For Hypercube Computer" (NPO-16905). Designed to support high-level software and expansion of itself. HCN computer is message-passing, multiple-instruction/multiple-data computer offering significant advantages over older single-processor and bus-based multiple-processor computers, with respect to price/performance ratio, reliability, availability, and manufacturing. Design of HCN operating-system software provides flexible computing environment accommodating both parallel and distributed processing. Also achieves balance among the following competing factors: performance in processing and communications, ease of use, and tolerance of (and recovery from) faults.

  11. Layered Architectures for Quantum Computers and Quantum Repeaters

    NASA Astrophysics Data System (ADS)

    Jones, Nathan C.

    This chapter examines how to organize quantum computers and repeaters using a systematic framework known as layered architecture, where machine control is organized in layers associated with specialized tasks. The framework is flexible and could be used for analysis and comparison of quantum information systems. To demonstrate the design principles in practice, we develop architectures for quantum computers and quantum repeaters based on optically controlled quantum dots, showing how a myriad of technologies must operate synchronously to achieve fault-tolerance. Optical control makes information processing in this system very fast, scalable to large problem sizes, and extendable to quantum communication.

  12. Examples of Nonconservatism in the CARE 3 Program

    NASA Technical Reports Server (NTRS)

    Dotson, Kelly J.

    1988-01-01

    This paper presents parameter regions in the CARE 3 (Computer-Aided Reliability Estimation version 3) computer program where the program overestimates the reliability of a modeled system without warning the user. Five simple models of fault-tolerant computer systems are analyzed, and the parameter regions where reliability is overestimated are given. The source of the error in the reliability estimates for models which incorporate transient fault occurrences was not readily apparent. However, the source of much of the error for models with permanent and intermittent faults can be attributed to the choice of values for the run-time parameters of the program.

  13. Adjoint Techniques for Topology Optimization of Structures Under Damage Conditions

    NASA Technical Reports Server (NTRS)

    Akgun, Mehmet A.; Haftka, Raphael T.

    2000-01-01

    The objective of this cooperative agreement was to seek computationally efficient ways to optimize aerospace structures subject to damage tolerance criteria. Optimization was to involve sizing as well as topology optimization. The work was done in collaboration with Steve Scotti, Chauncey Wu and Joanne Walsh at the NASA Langley Research Center. Computation of constraint sensitivity is normally the most time-consuming step of an optimization procedure. The cooperative work first focused on this issue and implemented the adjoint method of sensitivity computation (Haftka and Gurdal, 1992) in an optimization code (runstream) written in Engineering Analysis Language (EAL). The method was implemented both for bar and plate elements including buckling sensitivity for the latter. Lumping of constraints was investigated as a means to reduce the computational cost. Adjoint sensitivity computation was developed and implemented for lumped stress and buckling constraints. The cost of the direct method and the adjoint method was compared for various structures with and without lumping. The results were reported in two papers (Akgun et al., 1998a and 1999). It is desirable to optimize the topology of an aerospace structure subject to a large number of damage scenarios so that a damage tolerant structure is obtained. Including damage scenarios in the design procedure is critical in order to avoid large mass penalties at later stages (Haftka et al., 1983). A common method for topology optimization is that of compliance minimization (Bendsoe, 1995), which has not been used for damage tolerant design. In the present work, topology optimization is treated as a conventional problem aiming to minimize the weight subject to stress constraints. Multiple damage configurations (scenarios) are considered. Each configuration has its own structural stiffness matrix and, normally, requires factoring of the matrix and solution of the system of equations. Damage that is expected to be tolerated is local and represents a small change in the stiffness matrix compared to the baseline (undamaged) structure. The exact solution to a slightly modified set of equations can be obtained from the baseline solution economically without actually solving the modified system. Sherman-Morrison-Woodbury (SMW) formulas are matrix update formulas that allow this (Akgun et al., 1998b). SMW formulas were therefore used here to compute adjoint displacements for sensitivity computation and structural displacements in damaged configurations.
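
    The Sherman-Morrison-Woodbury update described above can be shown in a few lines: the solution of a locally damaged system (A + U Vᵀ) x = b is recovered from solves with the factored baseline matrix A only. The matrices below are toy stand-ins for stiffness matrices, assuming a rank-2 local damage.

```python
# Short sketch of a Sherman-Morrison-Woodbury solution update: the damaged
# system (A + U V^T) x = b is solved using only the factorization of the
# baseline matrix A. Toy matrices, illustrative of the technique only.
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n, k = 50, 2                                    # k = rank of the local stiffness change
A = rng.normal(size=(n, n)) + n * np.eye(n)     # baseline (undamaged) matrix
b = rng.normal(size=n)
U = np.zeros((n, k)); V = np.zeros((n, k))
U[:3, 0] = V[:3, 0] = 1.0                       # damage touches only a few local DOFs
U[5:8, 1] = V[5:8, 1] = -0.5

baseline = lu_factor(A)                         # factor once for the undamaged structure
x0 = lu_solve(baseline, b)                      # baseline solution
AinvU = lu_solve(baseline, U)                   # n x k solve, reuses the factorization
small = np.eye(k) + V.T @ AinvU                 # k x k "capacitance" matrix
x_damaged = x0 - AinvU @ np.linalg.solve(small, V.T @ x0)

# Check against a direct solve of the damaged system.
print(np.allclose(x_damaged, np.linalg.solve(A + U @ V.T, b)))
```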

  14. Paralex: An Environment for Parallel Programming in Distributed Systems

    DTIC Science & Technology

    1991-12-07

    distributed systems is comparable to assembly language programming for traditional sequential systems - the user must resort to low-level primitives ... to accomplish data encoding/decoding, communication, remote execution, synchronization, failure detection and recovery. It is our belief that ... synchronization. Finally, composing parallel programs by interconnecting sequential computations allows automatic support for heterogeneity and fault tolerance

  15. Fault-tolerant measurement-based quantum computing with continuous-variable cluster states.

    PubMed

    Menicucci, Nicolas C

    2014-03-28

    A long-standing open question about Gaussian continuous-variable cluster states is whether they enable fault-tolerant measurement-based quantum computation. The answer is yes. Initial squeezing in the cluster above a threshold value of 20.5 dB ensures that errors from finite squeezing acting on encoded qubits are below the fault-tolerance threshold of known qubit-based error-correcting codes. By concatenating with one of these codes and using ancilla-based error correction, fault-tolerant measurement-based quantum computation of theoretically indefinite length is possible with finitely squeezed cluster states.

  16. Digital avionics design and reliability analyzer

    NASA Technical Reports Server (NTRS)

    1981-01-01

    The description and specifications for a digital avionics design and reliability analyzer are given. Its basic function is to provide for the simulation and emulation of the various fault-tolerant digital avionic computer designs that are developed. It has been established that hardware emulation at the gate-level will be utilized. The primary benefit of emulation to reliability analysis is the fact that it provides the capability to model a system at a very detailed level. Emulation allows the direct insertion of faults into the system, rather than waiting for actual hardware failures to occur. This allows for controlled and accelerated testing of system reaction to hardware failures. There is a trade study which leads to the decision to specify a two-machine system, including an emulation computer connected to a general-purpose computer. There is also an evaluation of potential computers to serve as the emulation computer.

  17. Verification methodology for fault-tolerant, fail-safe computers applied to maglev control computer systems. Final report, July 1991-May 1993

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lala, J.H.; Nagle, G.A.; Harper, R.E.

    1993-05-01

    The Maglev control computer system should be designed to verifiably possess high reliability and safety as well as high availability to make Maglev a dependable and attractive transportation alternative to the public. A Maglev control computer system has been designed using a design-for-validation methodology developed earlier under NASA and SDIO sponsorship for real-time aerospace applications. The present study starts by defining the maglev mission scenario and ends with the definition of a maglev control computer architecture. Key intermediate steps included definitions of functional and dependability requirements, synthesis of two candidate architectures, development of qualitative and quantitative evaluation criteria, and analytical modeling of the dependability characteristics of the two architectures. Finally, the applicability of the design-for-validation methodology was also illustrated by applying it to the German Transrapid TR07 maglev control system.

  18. Blind topological measurement-based quantum computation.

    PubMed

    Morimae, Tomoyuki; Fujii, Keisuke

    2012-01-01

    Blind quantum computation is a novel secure quantum-computing protocol that enables Alice, who does not have sufficient quantum technology at her disposal, to delegate her quantum computation to Bob, who has a fully fledged quantum computer, in such a way that Bob cannot learn anything about Alice's input, output and algorithm. A recent proof-of-principle experiment demonstrating blind quantum computation in an optical system has raised new challenges regarding the scalability of blind quantum computation in realistic noisy conditions. Here we show that fault-tolerant blind quantum computation is possible in a topologically protected manner using the Raussendorf-Harrington-Goyal scheme. The error threshold of our scheme is 4.3 × 10⁻³, which is comparable to that (7.5 × 10⁻³) of non-blind topological quantum computation. As the error per gate of the order 10⁻³ was already achieved in some experimental systems, our result implies that secure cloud quantum computation is within reach.

  19. Blind topological measurement-based quantum computation

    NASA Astrophysics Data System (ADS)

    Morimae, Tomoyuki; Fujii, Keisuke

    2012-09-01

    Blind quantum computation is a novel secure quantum-computing protocol that enables Alice, who does not have sufficient quantum technology at her disposal, to delegate her quantum computation to Bob, who has a fully fledged quantum computer, in such a way that Bob cannot learn anything about Alice's input, output and algorithm. A recent proof-of-principle experiment demonstrating blind quantum computation in an optical system has raised new challenges regarding the scalability of blind quantum computation in realistic noisy conditions. Here we show that fault-tolerant blind quantum computation is possible in a topologically protected manner using the Raussendorf-Harrington-Goyal scheme. The error threshold of our scheme is 4.3×10⁻³, which is comparable to that (7.5×10⁻³) of non-blind topological quantum computation. As the error per gate of the order 10⁻³ was already achieved in some experimental systems, our result implies that secure cloud quantum computation is within reach.

  20. General Monte Carlo reliability simulation code including common mode failures and HARP fault/error-handling

    NASA Technical Reports Server (NTRS)

    Platt, M. E.; Lewis, E. E.; Boehm, F.

    1991-01-01

    A Monte Carlo Fortran computer program was developed that uses two variance reduction techniques for computing system reliability, applicable to solving very large, highly reliable fault-tolerant systems. The program is consistent with the hybrid automated reliability predictor (HARP) code, which employs behavioral decomposition and complex fault/error-handling models. This new capability, called MC-HARP, efficiently solves reliability models with non-constant failure rates (Weibull). Common-mode failure modeling is also supported.
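    MC-HARP itself is a Fortran code whose internals are not given in the abstract. The sketch below is only a hypothetical illustration of the underlying idea, sampling Weibull (non-constant failure rate) times to failure in a plain Monte Carlo reliability estimate for a two-out-of-three redundant unit; it omits the variance reduction techniques the program employs, and all parameter values are illustrative.

        import math
        import random

        def weibull_time_to_failure(scale, shape):
            """Draw a time to failure from a Weibull(scale, shape) distribution."""
            return scale * (-math.log(1.0 - random.random())) ** (1.0 / shape)

        def tmr_survives(mission_time, scale, shape):
            """A triple-modular-redundant unit survives if at least 2 of 3 modules survive."""
            alive = sum(weibull_time_to_failure(scale, shape) > mission_time for _ in range(3))
            return alive >= 2

        def estimate_reliability(trials=100_000, mission_time=10.0, scale=100.0, shape=1.5):
            """Crude Monte Carlo estimate of mission reliability (no variance reduction)."""
            survived = sum(tmr_survives(mission_time, scale, shape) for _ in range(trials))
            return survived / trials

        print(estimate_reliability())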

  1. Interconnection requirements in avionic systems

    NASA Astrophysics Data System (ADS)

    Vergnolle, Claude; Houssay, Bruno

    1991-04-01

    The future aircraft generation will have thousands of smart electromagnetic sensors distributed all over the aircraft. Each sensor is connected by fiber links to the mainframe computer in charge of real-time signal correlation. Such a computer must be compactly built and massively parallel: it requires 3-D optical free-space interconnects between neighbouring boards and reconfigurable interconnects via a holographic backplane. The optical interconnect facilities will also be used to build a fault-tolerant computer through large redundancy.

  2. Fault-tolerant battery system employing intra-battery network architecture

    DOEpatents

    Hagen, Ronald A.; Chen, Kenneth W.; Comte, Christophe; Knudson, Orlin B.; Rouillard, Jean

    2000-01-01

    A distributed energy storing system employing a communications network is disclosed. A distributed battery system includes a number of energy storing modules, each of which includes a processor and communications interface. In a network mode of operation, a battery computer communicates with each of the module processors over an intra-battery network and cooperates with individual module processors to coordinate module monitoring and control operations. The battery computer monitors a number of battery and module conditions, including the potential and current state of the battery and individual modules, and the conditions of the battery's thermal management system. An over-discharge protection system, equalization adjustment system, and communications system are also controlled by the battery computer. The battery computer logs and reports various status data on battery level conditions which may be reported to a separate system platform computer. A module transitions to a stand-alone mode of operation if the module detects an absence of communication connectivity with the battery computer. A module which operates in a stand-alone mode performs various monitoring and control functions locally within the module to ensure safe and continued operation.
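    The patent's firmware is not reproduced in the record; the following is only a minimal sketch, with an assumed heartbeat timeout, of the described fallback behaviour in which a module switches to stand-alone monitoring when it loses contact with the battery computer.

        import time

        HEARTBEAT_TIMEOUT_S = 5.0  # assumed threshold for illustration, not from the patent

        class EnergyModule:
            def __init__(self):
                self.last_heartbeat = time.monotonic()
                self.mode = "network"

            def on_battery_computer_message(self):
                """Any message from the battery computer counts as a heartbeat."""
                self.last_heartbeat = time.monotonic()
                self.mode = "network"

            def poll(self):
                """Called periodically; fall back to stand-alone mode if connectivity is lost."""
                if time.monotonic() - self.last_heartbeat > HEARTBEAT_TIMEOUT_S:
                    self.mode = "stand-alone"
                if self.mode == "stand-alone":
                    self.local_monitor_and_protect()

            def local_monitor_and_protect(self):
                """Local monitoring and control (voltage, current, temperature) would go here."""
                pass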

  3. Tolerance assignment in optical design

    NASA Astrophysics Data System (ADS)

    Youngworth, Richard Neil

    2002-09-01

    Tolerance assignment is necessary in any engineering endeavor because fabricated systems---due to the stochastic nature of manufacturing and assembly processes---necessarily deviate from the nominal design. This thesis addresses the problem of optical tolerancing. The work can logically be split into three different components that all play an essential role. The first part addresses the modeling of manufacturing errors in contemporary fabrication and assembly methods. The second component is derived from the design aspect---the development of a cost-based tolerancing procedure. The third part addresses the modeling of image quality in an efficient manner that is conducive to the tolerance assignment process. The purpose of the first component, modeling manufacturing errors, is twofold---to determine the most critical tolerancing parameters and to understand better the effects of fabrication errors. Specifically, mid-spatial-frequency errors, typically introduced in sub-aperture grinding and polishing fabrication processes, are modeled. The implication is that improving process control and understanding better the effects of the errors makes the task of tolerance assignment more manageable. Conventional tolerancing methods do not directly incorporate cost. Consequently, tolerancing approaches tend to focus more on image quality. The goal of the second part of the thesis is to develop cost-based tolerancing procedures that facilitate optimum system fabrication by generating the loosest acceptable tolerances. This work has the potential to impact a wide range of optical designs. The third element, efficient modeling of image quality, is directly related to the cost-based optical tolerancing method. Cost-based tolerancing requires efficient and accurate modeling of the effects of errors on the performance of optical systems. Thus it is important to be able to compute the gradient and the Hessian, with respect to the parameters that need to be toleranced, of the figure of merit that measures the image quality of a system. An algebraic method for computing the gradient and the Hessian is developed using perturbation theory.
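    In generic notation (the thesis' own symbols are not given in the abstract), the dependence of an image-quality merit function on small toleranced perturbations about the nominal parameters p is captured by the quadratic model

        \[
          \Phi(p + \delta) \;\approx\; \Phi(p) + g^{\mathsf T}\delta + \tfrac{1}{2}\,\delta^{\mathsf T} H \delta,
          \qquad
          g_i = \frac{\partial \Phi}{\partial p_i}, \quad
          H_{ij} = \frac{\partial^2 \Phi}{\partial p_i\,\partial p_j},
        \]

    which is why efficient evaluation of the gradient g and Hessian H with respect to the toleranced parameters is central to the cost-based assignment of the loosest acceptable tolerances.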

  4. Towards an Autonomic Cluster Management System (ACMS) with Reflex Autonomicity

    NASA Technical Reports Server (NTRS)

    Truszkowski, Walt; Hinchey, Mike; Sterritt, Roy

    2005-01-01

    Cluster computing, whereby a large number of simple processors or nodes are combined together to apparently function as a single powerful computer, has emerged as a research area in its own right. The approach offers a relatively inexpensive means of providing a fault-tolerant environment and achieving significant computational capabilities for high-performance computing applications. However, the task of manually managing and configuring a cluster quickly becomes daunting as the cluster grows in size. Autonomic computing, with its vision to provide self-management, can potentially solve many of the problems inherent in cluster management. We describe the development of a prototype Autonomic Cluster Management System (ACMS) that exploits autonomic properties in automating cluster management and its evolution to include reflex reactions via pulse monitoring.

  5. Fuzzy logic, neural networks, and soft computing

    NASA Technical Reports Server (NTRS)

    Zadeh, Lofti A.

    1994-01-01

    The past few years have witnessed a rapid growth of interest in a cluster of modes of modeling and computation which may be described collectively as soft computing. The distinguishing characteristic of soft computing is that its primary aims are to achieve tractability, robustness, low cost, and high MIQ (machine intelligence quotient) through an exploitation of the tolerance for imprecision and uncertainty. Thus, in soft computing what is usually sought is an approximate solution to a precisely formulated problem or, more typically, an approximate solution to an imprecisely formulated problem. A simple case in point is the problem of parking a car. Generally, humans can park a car rather easily because the final position of the car is not specified exactly. If it were specified to within, say, a few millimeters and a fraction of a degree, it would take hours or days of maneuvering and precise measurements of distance and angular position to solve the problem. What this simple example points to is the fact that, in general, high precision carries a high cost. The challenge, then, is to exploit the tolerance for imprecision by devising methods of computation which lead to an acceptable solution at low cost. By its nature, soft computing is much closer to human reasoning than the traditional modes of computation. At this juncture, the major components of soft computing are fuzzy logic (FL), neural network theory (NN), and probabilistic reasoning techniques (PR), including genetic algorithms, chaos theory, and part of learning theory. Increasingly, these techniques are used in combination to achieve significant improvement in performance and adaptability. Among the important application areas for soft computing are control systems, expert systems, data compression techniques, image processing, and decision support systems. It may be argued that it is soft computing, rather than the traditional hard computing, that should be viewed as the foundation for artificial intelligence. In the years ahead, this may well become a widely held position.

  6. Analysis and design of algorithm-based fault-tolerant systems

    NASA Technical Reports Server (NTRS)

    Nair, V. S. Sukumaran

    1990-01-01

    An important consideration in the design of high performance multiprocessor systems is to ensure the correctness of the results computed in the presence of transient and intermittent failures. Concurrent error detection and correction have been applied to such systems in order to achieve reliability. Algorithm Based Fault Tolerance (ABFT) was suggested as a cost-effective concurrent error detection scheme. The research was motivated by the complexity involved in the analysis and design of ABFT systems. To that end, a matrix-based model was developed and, based on that, algorithms for both the design and analysis of ABFT systems are formulated. These algorithms are less complex than the existing ones. In order to reduce the complexity further, a hierarchical approach is developed for the analysis of large systems.

  7. Program for computer aided reliability estimation

    NASA Technical Reports Server (NTRS)

    Mathur, F. P. (Inventor)

    1972-01-01

    A computer program for estimating the reliability of self-repair and fault-tolerant systems with respect to selected system and mission parameters is presented. The computer program is capable of operation in an interactive conversational mode as well as in a batch mode and is characterized by maintenance of several general equations representative of basic redundancy schemes in an equation repository. Selected reliability functions applicable to any mathematical model formulated with the general equations, used singly or in combination with each other, are separately stored. One or more system and/or mission parameters may be designated as a variable. Data in the form of values for selected reliability functions is generated in a tabular or graphic format for each formulated model.
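    The abstract does not list the contents of the equation repository; two textbook redundancy expressions of the kind such a repository would typically hold, written in terms of a single-module reliability R(t), are

        \[
          R_{\text{parallel}}(t) \;=\; 1 - \bigl(1 - R(t)\bigr)^{2},
          \qquad
          R_{\text{TMR}}(t) \;=\; 3R(t)^{2} - 2R(t)^{3},
        \]

    for a simple duplex (one-of-two) arrangement and a triple-modular-redundant (two-of-three) arrangement, respectively.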

  8. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Meneses, Esteban; Ni, Xiang; Jones, Terry R

    The unprecedented computational power of current supercomputers now makes possible the exploration of complex problems in many scientific fields, from genomic analysis to computational fluid dynamics. Modern machines are powerful because they are massive: they assemble millions of cores and a huge quantity of disks, cards, routers, and other components. But it is precisely the size of these machines that clouds the future of supercomputing. A system that comprises many components has a high chance to fail, and fail often. In order to make the next generation of supercomputers usable, it is imperative to use some type of fault tolerance platform to run applications on large machines. Most fault tolerance strategies can be optimized for the peculiarities of each system and boost efficacy by keeping the system productive. In this paper, we aim to understand how failure characterization can improve resilience in several layers of the software stack: applications, runtime systems, and job schedulers. We examine the Titan supercomputer, one of the fastest systems in the world. We analyze a full year of Titan in production and distill the failure patterns of the machine. By looking into Titan's log files and using the criteria of experts, we provide a detailed description of the types of failures. In addition, we inspect the job submission files and describe how the system is used. Using those two sources, we cross-correlate failures in the machine to executing jobs and provide a picture of how failures affect the user experience. We believe such characterization is fundamental in developing appropriate fault tolerance solutions for Cray systems similar to Titan.

  9. Solution of hydrogen in accident tolerant fuel candidate material: U3Si2

    NASA Astrophysics Data System (ADS)

    Middleburgh, S. C.; Claisse, A.; Andersson, D. A.; Grimes, R. W.; Olsson, P.; Mašková, S.

    2018-04-01

    Hydrogen uptake and accommodation into U3Si2, a candidate accident-tolerant fuel system, has been modelled on the atomic scale using density functional theory. The solution energy of multiple H atoms is computed, reaching a stoichiometry of U3Si2H2, which has been experimentally observed in previous work (reported as U3Si2H1.8). The absorption of hydrogen is found to be favourable up to U3Si2H2 and the associated volume change is computed, closely matching experimental data. Entropic effects are considered to assess the dissociation temperature of H2, estimated to be ∼800 K - again in good agreement with the experimentally observed transition temperature.
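    The working equations are not reproduced in the abstract; a conventional definition of the hydrogen solution energy used in DFT studies of this kind, referenced to gaseous H2, is

        \[
          E_{\text{sol}}[\mathrm{H}] \;=\;
          E[\mathrm{U_3Si_2{:}H}] - E[\mathrm{U_3Si_2}] - \tfrac{1}{2}\,E[\mathrm{H_2}],
        \]

    with a negative value indicating that absorption of hydrogen into the lattice is energetically favourable.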

  10. Modeling And Simulation Of Bar Code Scanners Using Computer Aided Design Software

    NASA Astrophysics Data System (ADS)

    Hellekson, Ron; Campbell, Scott

    1988-06-01

    Many optical systems have demanding requirements to package the system in a small 3 dimensional space. The use of computer graphic tools can be a tremendous aid to the designer in analyzing the optical problems created by smaller and less costly systems. The Spectra Physics grocery store bar code scanner employs an especially complex 3 dimensional scan pattern to read bar code labels. By using a specially written program which interfaces with a computer aided design system, we have simulated many of the functions of this complex optical system. In this paper we will illustrate how a recent version of the scanner has been designed. We will discuss the use of computer graphics in the design process including interactive tweaking of the scan pattern, analysis of collected light, analysis of the scan pattern density, and analysis of the manufacturing tolerances used to build the scanner.

  11. Software life cycle methodologies and environments

    NASA Technical Reports Server (NTRS)

    Fridge, Ernest

    1991-01-01

    Products of this project will significantly improve the quality and productivity of Space Station Freedom Program software processes by: improving software reliability and safety; and broadening the range of problems that can be solved with computational solutions. The project brings in Computer Aided Software Engineering (CASE) technology for: Environments such as Engineering Script Language/Parts Composition System (ESL/PCS) application generator, Intelligent User Interface for cost avoidance in setting up operational computer runs, Framework programmable platform for defining process and software development work flow control, Process for bringing CASE technology into an organization's culture, and CLIPS/CLIPS Ada language for developing expert systems; and methodologies such as Method for developing fault-tolerant, distributed systems and a method for developing systems for common sense reasoning and for solving expert systems problems when only approximate truths are known.

  12. Model Order Reduction Algorithm for Estimating the Absorption Spectrum

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Van Beeumen, Roel; Williams-Young, David B.; Kasper, Joseph M.

    The ab initio description of the spectral interior of the absorption spectrum poses both a theoretical and computational challenge for modern electronic structure theory. Due to the often spectrally dense character of this domain in the quantum propagator’s eigenspectrum for medium-to-large sized systems, traditional approaches based on the partial diagonalization of the propagator often encounter oscillatory and stagnating convergence. Electronic structure methods which solve the molecular response problem through the solution of spectrally shifted linear systems, such as the complex polarization propagator, offer an alternative approach which is agnostic to the underlying spectral density or domain location. This generality comes at a seemingly high computational cost associated with solving a large linear system for each spectral shift in some discretization of the spectral domain of interest. In this work, we present a novel, adaptive solution to this high computational overhead based on model order reduction techniques via interpolation. Model order reduction reduces the computational complexity of mathematical models and is ubiquitous in the simulation of dynamical systems and control theory. The efficiency and effectiveness of the proposed algorithm in the ab initio prediction of X-ray absorption spectra is demonstrated using a test set of challenging water clusters which are spectrally dense in the neighborhood of the oxygen K-edge. On the basis of a single, user-defined tolerance we automatically determine the order of the reduced models and approximate the absorption spectrum up to the given tolerance. We also illustrate that, for the systems studied, the automatically determined model order increases logarithmically with the problem dimension, compared to a linear increase of the number of eigenvalues within the energy window. Furthermore, we observed that the computational cost of the proposed algorithm only scales quadratically with respect to the problem dimension.
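    Schematically (the symbols below are generic, not the paper's notation), the response at each probe frequency in the spectral window of interest is obtained from a complex-shifted linear solve, and the reduced-order model replaces the full solves with a projection onto a small interpolation subspace:

        \[
          \bigl(A - (\omega_k + i\gamma)\, I\bigr)\, x_k = b, \qquad k = 1,\dots,m,
        \]
        \[
          x_k \;\approx\; V \hat{x}_k, \qquad
          \bigl(V^{\dagger} A V - (\omega_k + i\gamma)\, I\bigr)\, \hat{x}_k = V^{\dagger} b,
        \]

    where the basis V is enlarged adaptively until the interpolated spectrum changes by less than the user-defined tolerance.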

  13. Fault-Tolerant, Real-Time, Multi-Core Computer System

    NASA Technical Reports Server (NTRS)

    Gostelow, Kim P.

    2012-01-01

    A document discusses a fault-tolerant, self-aware, low-power, multi-core computer for space missions with thousands of simple cores, achieving speed through concurrency. The proposed machine decides how to achieve concurrency in real time, rather than depending on programmers. The driving features of the system are simple hardware that is modular in the extreme, with no shared memory, and software with significant runtime reorganizing capability. The document describes a mechanism for moving ongoing computations and data that is based on a functional model of execution. Because there is no shared memory, the processor connects to its neighbors through a high-speed data link. Messages are sent to a neighbor switch, which in turn forwards that message on to its neighbor until reaching the intended destination. Except for the neighbor connections, processors are isolated and independent of each other. The processors on the periphery also connect chip-to-chip, thus building up a large processor net. There is no particular topology to the larger net, as a function at each processor allows it to forward a message in the correct direction. Some chip-to-chip connections are not necessarily nearest neighbors, providing short cuts for some of the longer physical distances. The peripheral processors also provide the connections to sensors, actuators, radios, science instruments, and other devices with which the computer system interacts.
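    The document does not spell out the forwarding rule, so the following is only a schematic sketch of neighbour-to-neighbour message forwarding on an assumed 2-D grid of isolated processors with no shared memory; coordinates and the link model are illustrative assumptions.

        def next_hop(current, destination):
            """Forward a message one grid step toward its destination.

            Each processor knows only its own (x, y) coordinate and its neighbours;
            messages hop link by link until they arrive.
            """
            cx, cy = current
            dx, dy = destination
            if cx != dx:
                return (cx + (1 if dx > cx else -1), cy)
            if cy != dy:
                return (cx, cy + (1 if dy > cy else -1))
            return current  # already at the destination

        def route(source, destination):
            """Collect the full path a message would take from source to destination."""
            path = [source]
            while path[-1] != destination:
                path.append(next_hop(path[-1], destination))
            return path

        print(route((0, 0), (3, 2)))  # [(0, 0), (1, 0), (2, 0), (3, 0), (3, 1), (3, 2)]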

  14. Fault tolerance in computational grids: perspectives, challenges, and issues.

    PubMed

    Haider, Sajjad; Nazir, Babar

    2016-01-01

    Computational grids are established with the intention of providing shared access to hardware- and software-based resources with special reference to increased computational capabilities. Fault tolerance is one of the most important issues faced by computational grids. The main contribution of this survey is the creation of an extended classification of problems that occur in computational grid environments. The proposed classification will help researchers, developers, and maintainers of grids to understand the types of issues to be anticipated. Moreover, different types of problems, such as omission, interaction, and timing-related faults, have been identified that need to be handled on various layers of the computational grid. In this survey, an analysis and examination of fault tolerance and fault detection mechanisms is also performed. Our conclusion is that a dependable and reliable grid can only be established when more emphasis is placed on fault identification. Moreover, our survey reveals that adaptive and intelligent fault identification and tolerance techniques can improve the dependability of grid working environments.

  15. Scalable Optical-Fiber Communication Networks

    NASA Technical Reports Server (NTRS)

    Chow, Edward T.; Peterson, John C.

    1993-01-01

    Scalable arbitrary fiber extension network (SAFEnet) is a conceptual fiber-optic communication network passing digital signals among a variety of computers and input/output devices at rates from 200 Mb/s to more than 100 Gb/s. Intended for use with very-high-speed computers and other data-processing and communication systems in which message-passing delays must be kept short. Inherent flexibility makes it possible to match performance of network to computers by optimizing configuration of interconnections. In addition, interconnections are made redundant to provide tolerance to faults.

  16. Care 3 model overview and user's guide, first revision

    NASA Technical Reports Server (NTRS)

    Bavuso, S. J.; Petersen, P. L.

    1985-01-01

    A manual was written to introduce the CARE III (Computer-Aided Reliability Estimation) capability to reliability and design engineers who are interested in predicting the reliability of highly reliable fault-tolerant systems. It was also structured to serve as a quick-look reference manual for more experienced users. The guide covers CARE III modeling and reliability predictions for execution in the CDC Cyber 170 series computers, DEC VAX-11/700 series computers, and most machines that compile ANSI Standard FORTRAN 77.

  17. Fault-tolerant linear optical quantum computing with small-amplitude coherent states.

    PubMed

    Lund, A P; Ralph, T C; Haselgrove, H L

    2008-01-25

    Quantum computing using two coherent states as a qubit basis is a proposed alternative architecture with lower overheads but has been questioned as a practical way of performing quantum computing due to the fragility of diagonal states with large coherent amplitudes. We show that using error correction only small amplitudes (alpha>1.2) are required for fault-tolerant quantum computing. We study fault tolerance under the effects of small amplitudes and loss using a Monte Carlo simulation. The first encoding level resources are orders of magnitude lower than the best single photon scheme.

  18. High-Threshold Fault-Tolerant Quantum Computation with Analog Quantum Error Correction

    NASA Astrophysics Data System (ADS)

    Fukui, Kosuke; Tomita, Akihisa; Okamoto, Atsushi; Fujii, Keisuke

    2018-04-01

    To implement fault-tolerant quantum computation with continuous variables, the Gottesman-Kitaev-Preskill (GKP) qubit has been recognized as an important technological element. However, it is still challenging to experimentally generate the GKP qubit with the required squeezing level, 14.8 dB, of the existing fault-tolerant quantum computation. To reduce this requirement, we propose a high-threshold fault-tolerant quantum computation with GKP qubits using topologically protected measurement-based quantum computation with the surface code. By harnessing analog information contained in the GKP qubits, we apply analog quantum error correction to the surface code. Furthermore, we develop a method to prevent the squeezing level from decreasing during the construction of the large-scale cluster states for the topologically protected, measurement-based, quantum computation. We numerically show that the required squeezing level can be relaxed to less than 10 dB, which is within the reach of the current experimental technology. Hence, this work can considerably alleviate this experimental requirement and take a step closer to the realization of large-scale quantum computation.

  19. Symposium on the Interface: Computing Science and Statistics (20th). Theme: Computationally Intensive Methods in Statistics Held in Reston, Virginia on April 20-23, 1988

    DTIC Science & Technology

    1988-08-20

    34 William A. Link, Patuxent Wildlife Research Center. "Increasing Reliability of Multiversion Fault-Tolerant Software Design by Modularization," Junryo Miyashita, Department of Computer Science, California State University at San Bernardino. ... They shall be referred to as "multiversion fault-tolerant software design". One problem of developing multi-versions of a program is the high cost

  20. Space Shuttle critical function audit

    NASA Technical Reports Server (NTRS)

    Sacks, Ivan J.; Dipol, John; Su, Paul

    1990-01-01

    A large fault-tolerance model of the main propulsion system of the US space shuttle has been developed. This model is being used to identify single components and pairs of components that will cause loss of shuttle critical functions. In addition, this model is the basis for risk quantification of the shuttle. The process used to develop and analyze the model is digraph matrix analysis (DMA). The DMA modeling and analysis process is accessed via a graphics-based computer user interface. This interface provides coupled display of the integrated system schematics, the digraph models, the component database, and the results of the fault tolerance and risk analyses.

  1. A Fault-tolerant RISC Microprocessor for Spacecraft Applications

    NASA Technical Reports Server (NTRS)

    Timoc, Constantin; Benz, Harry

    1990-01-01

    Viewgraphs on a fault-tolerant RISC microprocessor for spacecraft applications are presented. Topics covered include: reduced instruction set computer; fault tolerant registers; fault tolerant ALU; and double rail CMOS logic.

  2. General linear codes for fault-tolerant matrix operations on processor arrays

    NASA Technical Reports Server (NTRS)

    Nair, V. S. S.; Abraham, J. A.

    1988-01-01

    Various checksum codes have been suggested for fault-tolerant matrix computations on processor arrays. Use of these codes is limited due to potential roundoff and overflow errors. Numerical errors may also be misconstrued as errors due to physical faults in the system. In this paper, a set of linear codes is identified which can be used for fault-tolerant matrix operations such as matrix addition, multiplication, transposition, and LU-decomposition, with minimum numerical error. Encoding schemes are given for some of the example codes which fall under the general set of codes. With the help of experiments, a rule of thumb for the selection of a particular code for a given application is derived.
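    As a concrete, simplified instance of the checksum idea (using the plain all-ones checksum rather than any of the generalized codes identified in the paper), a row/column-checksum encoding lets the product C = AB be verified after the computation:

        import numpy as np

        def encode_row_checksum(a):
            """Append a checksum row holding the column sums of A."""
            return np.vstack([a, a.sum(axis=0)])

        def encode_col_checksum(b):
            """Append a checksum column holding the row sums of B."""
            return np.hstack([b, b.sum(axis=1, keepdims=True)])

        def checked_matmul(a, b, tol=1e-9):
            """Multiply the encoded matrices and verify the checksum relations of the product."""
            full = encode_row_checksum(a) @ encode_col_checksum(b)
            c = full[:-1, :-1]
            row_ok = np.allclose(full[-1, :-1], c.sum(axis=0), atol=tol)
            col_ok = np.allclose(full[:-1, -1], c.sum(axis=1), atol=tol)
            return c, (row_ok and col_ok)

        a, b = np.random.rand(4, 4), np.random.rand(4, 4)
        c, consistent = checked_matmul(a, b)
        print(consistent)  # True unless a fault (or excessive roundoff) corrupted the product

    The tolerance parameter reflects the roundoff issue raised in the abstract: with naive checksum weights, numerical error can masquerade as a hardware fault, which motivates the better-conditioned codes the paper identifies.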

  3. Universal fault-tolerant quantum computation with only transversal gates and error correction.

    PubMed

    Paetznick, Adam; Reichardt, Ben W

    2013-08-30

    Transversal implementations of encoded unitary gates are highly desirable for fault-tolerant quantum computation. Though transversal gates alone cannot be computationally universal, they can be combined with specially distilled resource states in order to achieve universality. We show that "triorthogonal" stabilizer codes, introduced for state distillation by Bravyi and Haah [Phys. Rev. A 86, 052329 (2012)], admit transversal implementation of the controlled-controlled-Z gate. We then construct a universal set of fault-tolerant gates without state distillation by using only transversal controlled-controlled-Z, transversal Hadamard, and fault-tolerant error correction. We also adapt the distillation procedure of Bravyi and Haah to Toffoli gates, improving on existing Toffoli distillation schemes.

  4. A resource management architecture based on complex network theory in cloud computing federation

    NASA Astrophysics Data System (ADS)

    Zhang, Zehua; Zhang, Xuejie

    2011-10-01

    Cloud Computing Federation is a main trend of Cloud Computing. Resource Management has a significant effect on the design, realization, and efficiency of Cloud Computing Federation. Cloud Computing Federation has the typical characteristics of a complex system; therefore, we propose a resource management architecture based on complex network theory for Cloud Computing Federation (abbreviated as RMABC) in this paper, with the detailed design of the resource discovery and resource announcement mechanisms. Compared with existing resource management mechanisms in distributed computing systems, a Task Manager in RMABC can use the historical information and current state data obtained from other Task Managers for the evolution of the complex network composed of Task Managers, and thus has advantages in resource discovery speed, fault tolerance and adaptive ability. The result of the model experiment confirmed the advantage of RMABC in resource discovery performance.

  5. Development of a fault-tolerant microprocessor based computer system for space flight

    NASA Technical Reports Server (NTRS)

    Montgomery, V. T.

    1981-01-01

    A methodology for the design of a tightly coupled, highly reliable microprocessor-based computer system is described. The concept of triple modular redundancy with sparing is used. The notion of synchronizing by using a single crystal oscillator is examined. Decoders are also used in place of voters. The decoders not only isolate the failed module but also allow error identification to be accomplished. Each module is to have its own RAM memory. The necessary circuitry to select a correct memory and the corresponding DMA controller was designed.
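    The report's decoder hardware is of course not reproduced here; the sketch below only illustrates in software the two ideas named in the abstract, namely majority voting that also identifies the disagreeing module, and swapping in a spare once a module has been flagged.

        def vote_and_identify(outputs):
            """Majority-vote three module outputs and report which module (if any) disagreed."""
            a, b, c = outputs
            if a == b == c:
                return a, None
            if a == b:
                return a, 2              # module 2 disagrees
            if a == c:
                return a, 1              # module 1 disagrees
            if b == c:
                return b, 0              # module 0 disagrees
            return None, "unresolvable"  # all three differ: voting alone cannot decide

        class TMRWithSpare:
            def __init__(self, modules, spare):
                self.active = list(modules)  # three active compute functions
                self.spare = spare           # one cold spare

            def step(self, x):
                value, failed = vote_and_identify([m(x) for m in self.active])
                if isinstance(failed, int) and self.spare is not None:
                    self.active[failed] = self.spare  # replace the identified module
                    self.spare = None
                return value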

  6. High-Intensity Radiated Field Fault-Injection Experiment for a Fault-Tolerant Distributed Communication System

    NASA Technical Reports Server (NTRS)

    Yates, Amy M.; Torres-Pomales, Wilfredo; Malekpour, Mahyar R.; Gonzalez, Oscar R.; Gray, W. Steven

    2010-01-01

    Safety-critical distributed flight control systems require robustness in the presence of faults. In general, these systems consist of a number of input/output (I/O) and computation nodes interacting through a fault-tolerant data communication system. The communication system transfers sensor data and control commands and can handle most faults under typical operating conditions. However, the performance of the closed-loop system can be adversely affected as a result of operating in harsh environments. In particular, High-Intensity Radiated Field (HIRF) environments have the potential to cause random fault manifestations in individual avionic components and to generate simultaneous system-wide communication faults that overwhelm existing fault management mechanisms. This paper presents the design of an experiment conducted at the NASA Langley Research Center's HIRF Laboratory to statistically characterize the faults that a HIRF environment can trigger on a single node of a distributed flight control system.

  7. Towards scalable Byzantine fault-tolerant replication

    NASA Astrophysics Data System (ADS)

    Zbierski, Maciej

    2017-08-01

    Byzantine fault-tolerant (BFT) replication is a powerful technique, enabling distributed systems to remain available and correct even in the presence of arbitrary faults. Unfortunately, existing BFT replication protocols are mostly load-unscalable, i.e. they fail to respond with adequate performance increase whenever new computational resources are introduced into the system. This article proposes a universal architecture facilitating the creation of load-scalable distributed services based on BFT replication. The suggested approach exploits parallel request processing to fully utilize the available resources, and uses a load balancer module to dynamically adapt to the properties of the observed client workload. The article additionally provides a discussion on selected deployment scenarios, and explains how the proposed architecture could be used to increase the dependability of contemporary large-scale distributed systems.

  8. Formal specification and mechanical verification of SIFT - A fault-tolerant flight control system

    NASA Technical Reports Server (NTRS)

    Melliar-Smith, P. M.; Schwartz, R. L.

    1982-01-01

    The paper describes the methodology being employed to demonstrate rigorously that the SIFT (software-implemented fault-tolerant) computer meets its requirements. The methodology uses a hierarchy of design specifications, expressed in the mathematical domain of multisorted first-order predicate calculus. The most abstract of these, from which almost all details of mechanization have been removed, represents the requirements on the system for reliability and intended functionality. Successive specifications in the hierarchy add design and implementation detail until the PASCAL programs implementing the SIFT executive are reached. A formal proof that a SIFT system in a 'safe' state operates correctly despite the presence of arbitrary faults has been completed all the way from the most abstract specifications to the PASCAL program.

  9. Hyperswitch communication network

    NASA Technical Reports Server (NTRS)

    Peterson, J.; Pniel, M.; Upchurch, E.

    1991-01-01

    The Hyperswitch Communication Network (HCN) is a large scale parallel computer prototype being developed at JPL. Commercial versions of the HCN computer are planned. The HCN computer being designed is a message passing multiple instruction multiple data (MIMD) computer, and offers many advantages in price-performance ratio, reliability and availability, and manufacturing over traditional uniprocessors and bus based multiprocessors. The design of the HCN operating system is a uniquely flexible environment that combines both parallel processing and distributed processing. This programming paradigm can achieve a balance among the following competing factors: performance in processing and communications, user friendliness, and fault tolerance. The prototype is being designed to accommodate a maximum of 64 state of the art microprocessors. The HCN is classified as a distributed supercomputer. The HCN system is described, and the performance/cost analysis and other competing factors within the system design are reviewed.

  10. Award ER25750: Coordinated Infrastructure for Fault Tolerance Systems Indiana University Final Report

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lumsdaine, Andrew

    2013-03-08

    The main purpose of the Coordinated Infrastructure for Fault Tolerance in Systems initiative has been to conduct research with a goal of providing end-to-end fault tolerance on a systemwide basis for applications and other system software. While fault tolerance has been an integral part of most high-performance computing (HPC) system software developed over the past decade, it has been treated mostly as a collection of isolated stovepipes. Visibility and response to faults has typically been limited to the particular hardware and software subsystems in which they are initially observed. Little fault information is shared across subsystems, allowing little flexibility or control on a system-wide basis, making it practically impossible to provide cohesive end-to-end fault tolerance in support of scientific applications. As an example, consider faults such as communication link failures that can be seen by a network library but are not directly visible to the job scheduler, or consider faults related to node failures that can be detected by system monitoring software but are not inherently visible to the resource manager. If information about such faults could be shared by the network libraries or monitoring software, then other system software, such as a resource manager or job scheduler, could ensure that failed nodes or failed network links were excluded from further job allocations and that further diagnosis could be performed. As a founding member and one of the lead developers of the Open MPI project, our efforts over the course of this project have been focused on making Open MPI more robust to failures by supporting various fault tolerance techniques, and using fault information exchange and coordination between MPI and the HPC system software stack from the application, numeric libraries, and programming language runtime to other common system components such as job schedulers, resource managers, and monitoring tools.

  11. Test experience on an ultrareliable computer communication network

    NASA Technical Reports Server (NTRS)

    Abbott, L. W.

    1984-01-01

    The dispersed sensor processing mesh (DSPM) is an experimental, ultrareliable, fault-tolerant computer communications network that exhibits an organic-like ability to regenerate itself after suffering damage. The regeneration is accomplished by two routines - grow and repair. This paper discusses the DSPM concept for achieving fault tolerance and provides a brief description of the mechanization of both the experiment and the six-node experimental network. The main topic of this paper is the system performance of the growth algorithm contained in the grow routine. The characteristics imbued to DSPM by the growth algorithm are also discussed. Data from an experimental DSPM network and software simulation of larger DSPM-type networks are used to examine the inherent limitation on growth time by the growth algorithm and the relationship of growth time to network size and topology.

  12. Design and Construction of a Field Capable Snapshot Hyperspectral Imaging Spectrometer

    NASA Technical Reports Server (NTRS)

    Arik, Glenda H.

    2005-01-01

    The computed-tomography imaging spectrometer (CTIS) is a device which captures the spatial and spectral content of a rapidly evolving scene in a single image frame. The most recent CTIS design is optically all-reflective and uses as its dispersive device a state-of-the-art reflective computer-generated hologram (CGH). This project focuses on the instrument's transition from laboratory to field. This design will enable the CTIS to withstand a harsh desert environment. The system is modeled in optical design software using a tolerance analysis. The tolerances guide the design of the athermal mount and component parts. The parts are assembled into a working mount shell where the performance of the mounts is tested for thermal integrity. An interferometric analysis of the reflective CGH is also performed.

  13. Tolerance analysis of optical telescopes using coherent addition of wavefront errors

    NASA Technical Reports Server (NTRS)

    Davenport, J. W.

    1982-01-01

    A near diffraction-limited telescope requires that tolerance analysis be done on the basis of system wavefront error. One method of analyzing the wavefront error is to represent the wavefront error function in terms of its Zernike polynomial expansion. A Ramsey-Korsch ray trace package, a computer program that simulates the tracing of rays through an optical telescope system, was expanded to include the Zernike polynomial expansion up through the fifth-order spherical term. An option to determine a 3-dimensional plot of the wavefront error function was also included in the Ramsey-Korsch package. Several simulation runs were analyzed to determine the particular set of coefficients in the Zernike expansion that are affected by various errors such as tilt, decenter and despace. A 3-dimensional plot of each error up through the fifth-order spherical term was also included in the study. Tolerance analysis data are presented.
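    The representation used in the study expands the pupil wavefront error in Zernike polynomials; in generic notation,

        \[
          W(\rho, \theta) \;=\; \sum_{j} c_j\, Z_j(\rho, \theta),
        \]

    so that perturbations such as tilt, decenter, and despace appear as changes in a specific subset of the coefficients c_j, which is what the simulation runs were used to identify.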

  14. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Katti, Amogh; Di Fatta, Giuseppe; Naughton III, Thomas J

    Future extreme-scale high-performance computing systems will be required to work under frequent component failures. The MPI Forum's User Level Failure Mitigation proposal has introduced an operation, MPI_Comm_shrink, to synchronize the alive processes on the list of failed processes, so that applications can continue to execute even in the presence of failures by adopting algorithm-based fault tolerance techniques. This MPI_Comm_shrink operation requires a fault tolerant failure detection and consensus algorithm. This paper presents and compares two novel failure detection and consensus algorithms. The proposed algorithms are based on Gossip protocols and are inherently fault-tolerant and scalable. The proposed algorithms were implemented and tested using the Extreme-scale Simulator. The results show that in both algorithms the number of Gossip cycles to achieve global consensus scales logarithmically with system size. The second algorithm also shows better scalability in terms of memory and network bandwidth usage and a perfect synchronization in achieving global consensus.
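    The paper's two algorithms are not specified in the abstract, so the following is only a minimal sketch of the gossip mechanism they build on: each alive process repeatedly merges its set of suspected failures with that of one random peer, so agreement on the failed-process list spreads in a number of cycles that grows roughly logarithmically with system size. The topology, detector model, and parameters are illustrative assumptions.

        import random

        def gossip_consensus(n_processes, failed, max_cycles=64):
            """Spread knowledge of the failed-process set by pairwise gossip.

            One process initially detects the failures; in every cycle each alive
            process merges its view with that of a random alive peer.  Consensus is
            reached when all alive processes hold the same set.
            """
            alive = [p for p in range(n_processes) if p not in failed]
            views = {p: set() for p in alive}
            views[alive[0]] = set(failed)  # a single initial detector

            for cycle in range(1, max_cycles + 1):
                for p in alive:
                    peer = random.choice([q for q in alive if q != p])
                    merged = views[p] | views[peer]
                    views[p] = merged
                    views[peer] = set(merged)
                if all(views[p] == set(failed) for p in alive):
                    return cycle
            return None

        print(gossip_consensus(n_processes=64, failed={3, 17}))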

  15. Advanced information processing system for advanced launch system: Avionics architecture synthesis

    NASA Technical Reports Server (NTRS)

    Lala, Jaynarayan H.; Harper, Richard E.; Jaskowiak, Kenneth R.; Rosch, Gene; Alger, Linda S.; Schor, Andrei L.

    1991-01-01

    The Advanced Information Processing System (AIPS) is a fault-tolerant distributed computer system architecture that was developed to meet the real time computational needs of advanced aerospace vehicles. One such vehicle is the Advanced Launch System (ALS) being developed jointly by NASA and the Department of Defense to launch heavy payloads into low earth orbit at one tenth the cost (per pound of payload) of the current launch vehicles. An avionics architecture that utilizes the AIPS hardware and software building blocks was synthesized for ALS. The AIPS for ALS architecture synthesis process starting with the ALS mission requirements and ending with an analysis of the candidate ALS avionics architecture is described.

  16. Blind topological measurement-based quantum computation

    PubMed Central

    Morimae, Tomoyuki; Fujii, Keisuke

    2012-01-01

    Blind quantum computation is a novel secure quantum-computing protocol that enables Alice, who does not have sufficient quantum technology at her disposal, to delegate her quantum computation to Bob, who has a fully fledged quantum computer, in such a way that Bob cannot learn anything about Alice's input, output and algorithm. A recent proof-of-principle experiment demonstrating blind quantum computation in an optical system has raised new challenges regarding the scalability of blind quantum computation in realistic noisy conditions. Here we show that fault-tolerant blind quantum computation is possible in a topologically protected manner using the Raussendorf–Harrington–Goyal scheme. The error threshold of our scheme is 4.3×10−3, which is comparable to that (7.5×10−3) of non-blind topological quantum computation. As the error per gate of the order 10−3 was already achieved in some experimental systems, our result implies that secure cloud quantum computation is within reach. PMID:22948818

  17. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lee, Hsien-Hsin S

    The overall objective of this research project is to develop novel architectural techniques as well as system software to achieve a highly secure and intrusion-tolerant computing system. Such a system will be autonomous, self-adapting, introspective, with self-healing capability under the circumstances of improper operations, abnormal workloads, and malicious attacks. The scope of this research includes: (1) System-wide, unified introspection techniques for autonomic systems, (2) Secure information-flow microarchitecture, (3) Memory-centric security architecture, (4) Authentication control and its implication to security, (5) Digital rights management, (6) Microarchitectural denial-of-service attacks on shared resources. During the period of the project, we developed several architectural techniques and system software for achieving a robust, secure, and reliable computing system toward our goal.

  18. Aerospace Applications of Weibull and Monte Carlo Simulation with Importance Sampling

    NASA Technical Reports Server (NTRS)

    Bavuso, Salvatore J.

    1998-01-01

    Recent developments in reliability modeling and computer technology have made it practical to use the Weibull time to failure distribution to model the system reliability of complex fault-tolerant computer-based systems. These system models are becoming increasingly popular in space systems applications as a result of mounting data that support the decreasing Weibull failure distribution and the expectation of increased system reliability. This presentation introduces the new reliability modeling developments and demonstrates their application to a novel space system application. The application is a proposed guidance, navigation, and control (GN&C) system for use in a long duration manned spacecraft for a possible Mars mission. Comparisons to the constant failure rate model are presented and the ramifications of doing so are discussed.
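    The Weibull time-to-failure model referred to here has the standard reliability function

        \[
          R(t) \;=\; \exp\!\left[-\left(\frac{t}{\eta}\right)^{\beta}\right],
        \]

    where eta is the scale (characteristic life) and beta the shape parameter; beta < 1 gives the decreasing failure rate supported by the cited data, while beta = 1 recovers the constant-failure-rate exponential model used for comparison.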

  19. Quantum simulations with noisy quantum computers

    NASA Astrophysics Data System (ADS)

    Gambetta, Jay

    Quantum computing is a new computational paradigm that is expected to lie beyond the standard model of computation. This implies a quantum computer can solve problems that can't be solved by a conventional computer with tractable overhead. To fully harness this power we need a universal fault-tolerant quantum computer. However the overhead in building such a machine is high and a full solution appears to be many years away. Nevertheless, we believe that we can build machines in the near term that cannot be emulated by a conventional computer. It is then interesting to ask what these can be used for. In this talk we will present our advances in simulating complex quantum systems with noisy quantum computers. We will show experimental implementations of this on some small quantum computers.

  20. Supercomputing '91; Proceedings of the 4th Annual Conference on High Performance Computing, Albuquerque, NM, Nov. 18-22, 1991

    NASA Technical Reports Server (NTRS)

    1991-01-01

    Various papers on supercomputing are presented. The general topics addressed include: program analysis/data dependence, memory access, distributed memory code generation, numerical algorithms, supercomputer benchmarks, latency tolerance, parallel programming, applications, processor design, networks, performance tools, mapping and scheduling, characterization affecting performance, parallelism packaging, computing climate change, combinatorial algorithms, hardware and software performance issues, system issues. (No individual items are abstracted in this volume)

  1. Slime mould foraging behaviour as optically coupled logical operations

    NASA Astrophysics Data System (ADS)

    Mayne, R.; Adamatzky, A.

    2015-04-01

    Physarum polycephalum is a macroscopic plasmodial slime mould whose apparently 'intelligent' behaviour patterns may be interpreted as computation. We employ plasmodial phototactic responses to construct laboratory prototypes of NOT and NAND logical gates with electrical inputs/outputs and optical coupling in which the slime mould plays dual roles of computing device and electrical conductor. Slime mould logical gates are fault tolerant and resettable. The results presented here demonstrate the malleability and resilience of biological systems and highlight how the innate behaviour patterns of living substrates may be used to implement useful computation.

  2. A Byzantine-Fault Tolerant Self-Stabilizing Protocol for Distributed Clock Synchronization Systems

    NASA Technical Reports Server (NTRS)

    Malekpour, Mahyar R.

    2006-01-01

    Embedded distributed systems have become an integral part of safety-critical computing applications, necessitating system designs that incorporate fault tolerant clock synchronization in order to achieve ultra-reliable assurance levels. Many efficient clock synchronization protocols do not, however, address Byzantine failures, and most protocols that do tolerate Byzantine failures do not self-stabilize. Of the Byzantine self-stabilizing clock synchronization algorithms that exist in the literature, they are based on either unjustifiably strong assumptions about initial synchrony of the nodes or on the existence of a common pulse at the nodes. The Byzantine self-stabilizing clock synchronization protocol presented here does not rely on any assumptions about the initial state of the clocks. Furthermore, there is neither a central clock nor an externally generated pulse system. The proposed protocol converges deterministically, is scalable, and self-stabilizes in a short amount of time. The convergence time is linear with respect to the self-stabilization period. Proofs of the correctness of the protocol as well as the results of formal verification efforts are reported.

  3. Bioinspired Concepts: Unified Theory for Complex Biological and Engineering Systems

    DTIC Science & Technology

    2006-01-01

    i.e., data flows of finite size arrive at the system randomly. For such a system, we propose a modified dual scheduling algorithm that stabilizes ... demon. We compute the efficiency of the controller over finite and infinite time intervals, and since the controller is optimal, this yields hard limits ... and highly optimized tolerance. PNAS, 102, 2005. 51. G. N. Nair and R. J. Evans. Stabilizability of stochastic linear systems with finite feedback

  4. CSP: A Multifaceted Hybrid Architecture for Space Computing

    NASA Technical Reports Server (NTRS)

    Rudolph, Dylan; Wilson, Christopher; Stewart, Jacob; Gauvin, Patrick; George, Alan; Lam, Herman; Crum, Gary Alex; Wirthlin, Mike; Wilson, Alex; Stoddard, Aaron

    2014-01-01

    Research on the CHREC Space Processor (CSP) takes a multifaceted hybrid approach to embedded space computing. Working closely with the NASA Goddard SpaceCube team, researchers at the National Science Foundation (NSF) Center for High-Performance Reconfigurable Computing (CHREC) at the University of Florida and Brigham Young University are developing hybrid space computers that feature an innovative combination of three technologies: commercial-off-the-shelf (COTS) devices, radiation-hardened (RadHard) devices, and fault-tolerant computing. Modern COTS processors provide the utmost in performance and energy-efficiency but are susceptible to ionizing radiation in space, whereas RadHard processors are virtually immune to this radiation but are more expensive, larger, less energy-efficient, and generations behind in speed and functionality. By featuring COTS devices to perform the critical data processing, supported by simpler RadHard devices that monitor and manage the COTS devices, and augmented with novel uses of fault-tolerant hardware, software, information, and networking within and between COTS devices, the resulting system can maximize performance and reliability while minimizing energy consumption and cost. NASA Goddard has adopted the CSP concept and technology with plans underway to feature flight-ready CSP boards on two upcoming space missions.

  5. Closed-Loop HIRF Experiments Performed on a Fault Tolerant Flight Control Computer

    NASA Technical Reports Server (NTRS)

    Belcastro, Celeste M.

    1997-01-01

    Closed-loop HIRF experiments were performed on a fault tolerant flight control computer (FCC) at the NASA Langley Research Center. The FCC used in the experiments was a quad-redundant flight control computer executing B737 Autoland control laws. The FCC was placed in one of the mode-stirred reverberation chambers in the HIRF Laboratory and interfaced to a computer simulation of the B737 flight dynamics, engines, sensors, actuators, and atmosphere in the Closed-Loop Systems Laboratory. Disturbances to the aircraft associated with wind gusts and turbulence were simulated during tests. Electrical isolation between the FCC under test and the simulation computer was achieved via a fiber optic interface for the analog and discrete signals. Closed-loop operation of the FCC enabled flight dynamics and atmospheric disturbances affecting the aircraft to be represented during tests. Upset was induced in the FCC as a result of exposure to HIRF, and the effect of upset on the simulated flight of the aircraft was observed and recorded. This paper presents a description of these closed-loop HIRF experiments, upset data obtained from the FCC during these experiments, and closed-loop effects on the simulated flight of the aircraft.

  6. Cyclic Symmetry Finite Element Forced Response Analysis of a Distortion-Tolerant Fan with Boundary Layer Ingestion

    NASA Technical Reports Server (NTRS)

    Min, J. B.; Reddy, T. S. R.; Bakhle, M. A.; Coroneos, R. M.; Stefko, G. L.; Provenza, A. J.; Duffy, K. P.

    2018-01-01

    Accurate prediction of blade vibration stress is required to determine the overall durability of a fan blade design under Boundary Layer Ingestion (BLI) distorted flow environments. The traditional single-blade modeling technique cannot accurately represent the entire rotor blade system subject to the complex dynamic loading behaviors and vibrations that arise in distorted flow conditions. A particular objective of our work was to develop a high-fidelity full-rotor aeromechanics analysis capability for a system subjected to a distorted inlet flow by applying a cyclic symmetry finite element modeling methodology. This reduction modeling method allows computationally efficient analysis using a small periodic section of the full rotor blade system. Experimental testing in the 8-foot by 6-foot Supersonic Wind Tunnel test facility at NASA Glenn Research Center was also carried out for the system designated as the Boundary Layer Ingesting Inlet/Distortion-Tolerant Fan (BLI2DTF) technology development. The results obtained from the present numerical modeling technique were evaluated against those of the wind tunnel test, toward establishing a computationally efficient aeromechanics analysis tool for full rotor blade systems subjected to distorted inlet flow conditions. Fairly good correlations were achieved, demonstrating the computational modeling techniques. The analysis showed that the safety margin requirement set in the BLI2DTF fan blade design provided a sufficient margin with respect to the operating speed range.

  7. Avionic Data Bus Integration Technology

    DTIC Science & Technology

    1991-12-01

    address the hardware-software interaction between a digital data bus and an avionic system. Very Large Scale Integration (VLSI) ICs and multiversion ...the SCP. In 1984, the Sperry Corporation developed a fault tolerant system which employed multiversion programming, voting, and monitoring for error... MULTIVERSION PROGRAMMING. N-version programming. 226 N-VERSION PROGRAMMING. The independent coding of a number, N, of redundant computer programs that

  8. Tolerancing the alignment of large-core optical fibers, fiber bundles and light guides using a Fourier approach.

    PubMed

    Sawyer, Travis W; Petersburg, Ryan; Bohndiek, Sarah E

    2017-04-20

    Optical fiber technology is found in a wide variety of applications to flexibly relay light between two points, enabling information transfer across long distances and allowing access to hard-to-reach areas. Large-core optical fibers and light guides find frequent use in illumination and spectroscopic applications, for example, endoscopy and high-resolution astronomical spectroscopy. Proper alignment is critical for maximizing throughput in optical fiber coupling systems; however, there currently are no formal approaches to tolerancing the alignment of a light-guide coupling system. Here, we propose a Fourier alignment sensitivity (FAS) algorithm to determine the optimal tolerances on the alignment of a light guide by computing the alignment sensitivity. The algorithm shows excellent agreement with both simulated and experimentally measured values and improves on the computation time of equivalent ray-tracing simulations by two orders of magnitude. We then apply FAS to tolerance and fabricate a coupling system, which is shown to meet specifications, thus validating FAS as a tolerancing technique. These results indicate that FAS is a flexible and rapid means to quantify the alignment sensitivity of a light guide, widely informing the design and tolerancing of coupling systems.
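
    For intuition about why throughput versus misalignment can be evaluated in the Fourier domain, the sketch below (not the authors' FAS implementation; the spot size, core radius, and grid are assumed values) treats coupling into a large-core fiber as the overlap of the focal-spot intensity with the displaced core aperture, i.e., a cross-correlation computed for all lateral offsets at once with FFTs.

        import numpy as np

        n, pitch = 512, 1.0                                    # grid points and sample pitch in microns (assumed)
        x = (np.arange(n) - n // 2) * pitch
        X, Y = np.meshgrid(x, x)

        spot = np.exp(-(X**2 + Y**2) / (2 * 60.0**2))          # Gaussian focal-spot intensity, sigma = 60 um
        core = ((X**2 + Y**2) <= 100.0**2).astype(float)       # 100-um-radius fiber core aperture

        # Cross-correlation via the convolution theorem: throughput as a function of decenter.
        spectrum = np.fft.fft2(spot) * np.conj(np.fft.fft2(core))
        throughput = np.fft.fftshift(np.real(np.fft.ifft2(spectrum)))
        throughput /= throughput.max()                         # normalize to perfectly aligned coupling

        c = n // 2                                             # zero-offset (aligned) sample
        offset_um = 20
        loss = throughput[c, c] - throughput[c, c + offset_um]
        print(f"relative throughput lost at {offset_um} um decenter: {loss:.4f}")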

  9. Tolerancing the alignment of large-core optical fibers, fiber bundles and light guides using a Fourier approach

    PubMed Central

    Sawyer, Travis W.; Petersburg, Ryan; Bohndiek, Sarah E.

    2017-01-01

    Optical fiber technology is found in a wide variety of applications to flexibly relay light between two points, enabling information transfer across long distances and allowing access to hard-to-reach areas. Large-core optical fibers and light guides find frequent use in illumination and spectroscopic applications, for example, endoscopy and high-resolution astronomical spectroscopy. Proper alignment is critical for maximizing throughput in optical fiber coupling systems; however, there currently are no formal approaches to tolerancing the alignment of a light-guide coupling system. Here, we propose a Fourier Alignment Sensitivity (FAS) algorithm to determine the optimal tolerances on the alignment of a light guide by computing the alignment sensitivity. The algorithm shows excellent agreement with both simulated and experimentally measured values and improves on the computation time of equivalent ray-tracing simulations by two orders of magnitude. We then apply FAS to tolerance and fabricate a coupling system, which is shown to meet specifications, thus validating FAS as a tolerancing technique. These results indicate that FAS is a flexible and rapid means to quantify the alignment sensitivity of a light guide, widely informing the design and tolerancing of coupling systems. PMID:28430250

  10. SARA - SURE/ASSIST RELIABILITY ANALYSIS WORKSTATION (VAX VMS VERSION)

    NASA Technical Reports Server (NTRS)

    Butler, R. W.

    1994-01-01

    SARA, the SURE/ASSIST Reliability Analysis Workstation, is a bundle of programs used to solve reliability problems. The mathematical approach chosen to solve a reliability problem may vary with the size and nature of the problem. The Systems Validation Methods group at NASA Langley Research Center has created a set of four software packages that form the basis for a reliability analysis workstation, including three for use in analyzing reconfigurable, fault-tolerant systems and one for analyzing non-reconfigurable systems. The SARA bundle includes the three for reconfigurable, fault-tolerant systems: SURE reliability analysis program (COSMIC program LAR-13789, LAR-14921); the ASSIST specification interface program (LAR-14193, LAR-14923), and PAWS/STEM reliability analysis programs (LAR-14165, LAR-14920). As indicated by the program numbers in parentheses, each of these three packages is also available separately in two machine versions. The fourth package, which is only available separately, is FTC, the Fault Tree Compiler (LAR-14586, LAR-14922). FTC is used to calculate the top-event probability for a fault tree which describes a non-reconfigurable system. PAWS/STEM and SURE are analysis programs which utilize different solution methods, but have a common input language, the SURE language. ASSIST is a preprocessor that generates SURE language from a more abstract definition. ASSIST, SURE, and PAWS/STEM are described briefly in the following paragraphs. For additional details about the individual packages, including pricing, please refer to their respective abstracts. ASSIST, the Abstract Semi-Markov Specification Interface to the SURE Tool program, allows a reliability engineer to describe the failure behavior of a fault-tolerant computer system in an abstract, high-level language. The ASSIST program then automatically generates a corresponding semi-Markov model. A one-page ASSIST-language description may result in a semi-Markov model with thousands of states and transitions. The ASSIST program also includes model-reduction techniques to facilitate efficient modeling of large systems. The semi-Markov model generated by ASSIST is in the format needed for input to SURE and PAWS/STEM. The Semi-Markov Unreliability Range Evaluator, SURE, is an analysis tool for reconfigurable, fault-tolerant systems. SURE provides an efficient means for calculating accurate upper and lower bounds for the death state probabilities for a large class of semi-Markov models, not just those which can be reduced to critical-pair architectures. The calculated bounds are close enough (usually within 5 percent of each other) for use in reliability studies of ultra-reliable computer systems. The SURE bounding theorems have algebraic solutions and are consequently computationally efficient even for large and complex systems. SURE can optionally regard a specified parameter as a variable over a range of values, enabling an automatic sensitivity analysis. SURE output is tabular. The PAWS/STEM package includes two programs for the creation and evaluation of pure Markov models describing the behavior of fault-tolerant reconfigurable computer systems: the Pade Approximation with Scaling (PAWS) and Scaled Taylor Exponential Matrix (STEM) programs. PAWS and STEM produce exact solutions for the probability of system failure and provide a conservative estimate of the number of significant digits in the solution. Markov models of fault-tolerant architectures inevitably lead to numerically stiff differential equations. 
Both PAWS and STEM have the capability to solve numerically stiff models. These complementary programs use separate methods to determine the matrix exponential in the solution of the model's system of differential equations. In general, PAWS is better suited to evaluate small and dense models. STEM operates at lower precision, but works faster than PAWS for larger models. The programs that comprise the SARA package were originally developed for use on DEC VAX series computers running VMS and were later ported for use on Sun series computers running SunOS. They are written in C-language, Pascal, and FORTRAN 77. An ANSI compliant C compiler is required in order to compile the C portion of the Sun version source code. The Pascal and FORTRAN code can be compiled on Sun computers using Sun Pascal and Sun Fortran. For the VMS version, VAX C, VAX PASCAL, and VAX FORTRAN can be used to recompile the source code. The standard distribution medium for the VMS version of SARA (COS-10041) is a 9-track 1600 BPI magnetic tape in VMSINSTAL format. It is also available on a TK50 tape cartridge in VMSINSTAL format. Executables are included. The standard distribution medium for the Sun version of SARA (COS-10039) is a .25 inch streaming magnetic tape cartridge in UNIX tar format. Both Sun3 and Sun4 executables are included. Electronic copies of the ASSIST user's manual in TeX and PostScript formats are provided on the distribution medium. DEC, VAX, VMS, and TK50 are registered trademarks of Digital Equipment Corporation. Sun, Sun3, Sun4, and SunOS are trademarks of Sun Microsystems, Inc. TeX is a trademark of the American Mathematical Society. PostScript is a registered trademark of Adobe Systems Incorporated.
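
    To make the kind of model these tools evaluate concrete, the sketch below (illustrative only; the real SURE/PAWS/STEM handle semi-Markov models and report error bounds) solves a three-state Markov model of a triplex processor with per-unit failure rate lam and single-fault coverage c via a matrix exponential. The rates, coverage, and mission time are assumed values.

        import numpy as np
        from scipy.linalg import expm

        lam, c, t = 1e-4, 0.999, 10.0          # per-hour failure rate, fault coverage, mission hours (assumed)

        # States: 0 = three good units, 1 = two good units, 2 = system failed (absorbing).
        Q = np.array([
            [-3 * lam,  3 * lam * c,  3 * lam * (1 - c)],
            [0.0,      -2 * lam,      2 * lam          ],
            [0.0,       0.0,          0.0              ],
        ])

        p0 = np.array([1.0, 0.0, 0.0])         # start with all three units healthy
        p_t = p0 @ expm(Q * t)                 # transient state probabilities at time t
        print(f"P(system failure by {t} h) = {p_t[2]:.3e}")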

  11. SARA - SURE/ASSIST RELIABILITY ANALYSIS WORKSTATION (UNIX VERSION)

    NASA Technical Reports Server (NTRS)

    Butler, R. W.

    1994-01-01

    SARA, the SURE/ASSIST Reliability Analysis Workstation, is a bundle of programs used to solve reliability problems. The mathematical approach chosen to solve a reliability problem may vary with the size and nature of the problem. The Systems Validation Methods group at NASA Langley Research Center has created a set of four software packages that form the basis for a reliability analysis workstation, including three for use in analyzing reconfigurable, fault-tolerant systems and one for analyzing non-reconfigurable systems. The SARA bundle includes the three for reconfigurable, fault-tolerant systems: SURE reliability analysis program (COSMIC program LAR-13789, LAR-14921); the ASSIST specification interface program (LAR-14193, LAR-14923), and PAWS/STEM reliability analysis programs (LAR-14165, LAR-14920). As indicated by the program numbers in parentheses, each of these three packages is also available separately in two machine versions. The fourth package, which is only available separately, is FTC, the Fault Tree Compiler (LAR-14586, LAR-14922). FTC is used to calculate the top-event probability for a fault tree which describes a non-reconfigurable system. PAWS/STEM and SURE are analysis programs which utilize different solution methods, but have a common input language, the SURE language. ASSIST is a preprocessor that generates SURE language from a more abstract definition. ASSIST, SURE, and PAWS/STEM are described briefly in the following paragraphs. For additional details about the individual packages, including pricing, please refer to their respective abstracts. ASSIST, the Abstract Semi-Markov Specification Interface to the SURE Tool program, allows a reliability engineer to describe the failure behavior of a fault-tolerant computer system in an abstract, high-level language. The ASSIST program then automatically generates a corresponding semi-Markov model. A one-page ASSIST-language description may result in a semi-Markov model with thousands of states and transitions. The ASSIST program also includes model-reduction techniques to facilitate efficient modeling of large systems. The semi-Markov model generated by ASSIST is in the format needed for input to SURE and PAWS/STEM. The Semi-Markov Unreliability Range Evaluator, SURE, is an analysis tool for reconfigurable, fault-tolerant systems. SURE provides an efficient means for calculating accurate upper and lower bounds for the death state probabilities for a large class of semi-Markov models, not just those which can be reduced to critical-pair architectures. The calculated bounds are close enough (usually within 5 percent of each other) for use in reliability studies of ultra-reliable computer systems. The SURE bounding theorems have algebraic solutions and are consequently computationally efficient even for large and complex systems. SURE can optionally regard a specified parameter as a variable over a range of values, enabling an automatic sensitivity analysis. SURE output is tabular. The PAWS/STEM package includes two programs for the creation and evaluation of pure Markov models describing the behavior of fault-tolerant reconfigurable computer systems: the Pade Approximation with Scaling (PAWS) and Scaled Taylor Exponential Matrix (STEM) programs. PAWS and STEM produce exact solutions for the probability of system failure and provide a conservative estimate of the number of significant digits in the solution. Markov models of fault-tolerant architectures inevitably lead to numerically stiff differential equations. 
Both PAWS and STEM have the capability to solve numerically stiff models. These complementary programs use separate methods to determine the matrix exponential in the solution of the model's system of differential equations. In general, PAWS is better suited to evaluate small and dense models. STEM operates at lower precision, but works faster than PAWS for larger models. The programs that comprise the SARA package were originally developed for use on DEC VAX series computers running VMS and were later ported for use on Sun series computers running SunOS. They are written in C-language, Pascal, and FORTRAN 77. An ANSI compliant C compiler is required in order to compile the C portion of the Sun version source code. The Pascal and FORTRAN code can be compiled on Sun computers using Sun Pascal and Sun Fortran. For the VMS version, VAX C, VAX PASCAL, and VAX FORTRAN can be used to recompile the source code. The standard distribution medium for the VMS version of SARA (COS-10041) is a 9-track 1600 BPI magnetic tape in VMSINSTAL format. It is also available on a TK50 tape cartridge in VMSINSTAL format. Executables are included. The standard distribution medium for the Sun version of SARA (COS-10039) is a .25 inch streaming magnetic tape cartridge in UNIX tar format. Both Sun3 and Sun4 executables are included. Electronic copies of the ASSIST user's manual in TeX and PostScript formats are provided on the distribution medium. DEC, VAX, VMS, and TK50 are registered trademarks of Digital Equipment Corporation. Sun, Sun3, Sun4, and SunOS are trademarks of Sun Microsystems, Inc. TeX is a trademark of the American Mathematical Society. PostScript is a registered trademark of Adobe Systems Incorporated.

  12. Network Topologies and Dynamics Leading to Endotoxin Tolerance and Priming in Innate Immune Cells

    NASA Astrophysics Data System (ADS)

    Fu, Yan; Glaros, Trevor; Zhu, Meng; Wang, Ping; Wu, Zhanghan; Tyson, John; Li, Liwu; Xing, Jianhua

    2012-01-01

    The innate immune system, acting as the first line of host defense, senses and adapts to foreign challenges through complex intracellular and intercellular signaling networks. Endotoxin tolerance and priming elicited by macrophages are classic examples of the complex adaptation of innate immune cells. Upon repetitive exposures to different doses of bacterial endotoxin (lipopolysaccharide) or other stimulants, macrophages show either suppressed or augmented inflammatory responses compared to a single exposure to the stimulant. Endotoxin tolerance and priming are critically involved in both immune homeostasis and the pathogenesis of diverse inflammatory diseases. However, the underlying molecular mechanisms are not well understood. By means of a computational search through the parameter space of a coarse-grained three-node network with a two-stage Metropolis sampling approach, we enumerated all the network topologies that can generate priming or tolerance. We discovered three major mechanisms for priming (pathway synergy, suppressor deactivation, activator induction) and one for tolerance (inhibitor persistence). These results not only explain existing experimental observations, but also reveal intriguing test scenarios for future experimental studies to clarify mechanisms of endotoxin priming and tolerance.

  13. Message Efficient Checkpointing and Rollback Recovery in Heterogeneous Mobile Networks

    NASA Astrophysics Data System (ADS)

    Jaggi, Parmeet Kaur; Singh, Awadhesh Kumar

    2016-06-01

    Heterogeneous networks provide an appealing way of expanding the computing capability of mobile networks by combining infrastructure-less mobile ad-hoc networks with the infrastructure-based cellular mobile networks. The nodes in such a network range from low-power nodes to macro base stations and thus, vary greatly in their capabilities such as computation power and battery power. The nodes are susceptible to different types of transient and permanent failures and therefore, the algorithms designed for such networks need to be fault-tolerant. The article presents a checkpointing algorithm for the rollback recovery of mobile hosts in a heterogeneous mobile network. Checkpointing is a well established approach to provide fault tolerance in static and cellular mobile distributed systems. However, the use of checkpointing for fault tolerance in a heterogeneous environment remains to be explored. The proposed protocol is based on the results of zigzag paths and zigzag cycles by Netzer-Xu. Considering the heterogeneity prevalent in the network, an uncoordinated checkpointing technique is employed. Yet, useless checkpoints are avoided without causing a high message overhead.

  14. Effects of numerical tolerance levels on an atmospheric chemistry model for mercury

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ferris, D.C.; Burns, D.S.; Shuford, J.

    1996-12-31

    A Box Model was developed to investigate the atmospheric oxidation processes of mercury in the environment. Previous results indicated that the most important influences on the atmospheric concentration of HgO(g) are (i) the flux of HgO(g) volatilization, which is related to the surface medium, extent of contamination, and temperature, and (ii) the presence of Cl{sub 2} in the atmosphere. The numerical solver incorporated into the ORganic CHemistry Integrated Dispersion (ORCHID) model is the Livermore Solver for Ordinary Differential Equations (LSODE). In solving the ODEs, LSODE uses numerical tolerances. These tolerances affect computer run time, the relative accuracy of the calculated species concentrations, and whether or not LSODE converges to a solution for this system of equations. The effects of varying these tolerances on the solution of the box model and the ORCHID model will be discussed.
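
    As a generic illustration of the effect being studied (this is not the ORCHID chemistry), the sketch below integrates a stiff two-species toy system with SciPy's LSODA method, a descendant of the LSODE family, at loose and tight tolerances; both the number of time points and the computed concentrations change with rtol/atol.

        import numpy as np
        from scipy.integrate import solve_ivp

        def rhs(t, y):
            k_fast, k_slow = 1e4, 1.0                          # assumed rate constants
            return [-k_fast * y[0] + k_slow * y[1],
                     k_fast * y[0] - k_slow * y[1]]

        y0, t_span = [1.0, 0.0], (0.0, 10.0)
        for rtol, atol in [(1e-3, 1e-6), (1e-8, 1e-12)]:
            sol = solve_ivp(rhs, t_span, y0, method="LSODA", rtol=rtol, atol=atol)
            print(f"rtol={rtol:g} atol={atol:g}: time points={sol.t.size}, y(10)={sol.y[:, -1]}")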

  15. Enhancing radiation tolerance by controlling defect mobility and migration pathways in multicomponent single-phase alloys

    NASA Astrophysics Data System (ADS)

    Lu, Chenyang; Niu, Liangliang; Chen, Nanjun; Jin, Ke; Yang, Taini; Xiu, Pengyuan; Zhang, Yanwen; Gao, Fei; Bei, Hongbin; Shi, Shi; He, Mo-Rigen; Robertson, Ian M.; Weber, William J.; Wang, Lumin

    2016-12-01

    A grand challenge in material science is to understand the correlation between intrinsic properties and defect dynamics. Radiation tolerant materials are in great demand for safe operation and advancement of nuclear and aerospace systems. Unlike traditional approaches that rely on microstructural and nanoscale features to mitigate radiation damage, this study demonstrates enhancement of radiation tolerance with the suppression of void formation by two orders of magnitude at elevated temperatures in equiatomic single-phase concentrated solid solution alloys, and more importantly, reveals its controlling mechanism through a detailed analysis of the depth distribution of defect clusters and an atomistic computer simulation. The enhanced swelling resistance is attributed to the tailored interstitial defect cluster motion in the alloys from a long-range one-dimensional mode to a short-range three-dimensional mode, which leads to enhanced point defect recombination. The results suggest design criteria for next generation radiation tolerant structural alloys.

  16. Damage Tolerance of Large Shell Structures

    NASA Technical Reports Server (NTRS)

    Minnetyan, L.; Chamis, C. C.

    1999-01-01

    Progressive damage and fracture of large shell structures is investigated. A computer model is used for the assessment of structural response, progressive fracture resistance, and defect/damage tolerance characteristics. Critical locations of a stiffened conical shell segment are identified. Defective and defect-free computer models are simulated to evaluate structural damage/defect tolerance. Safe pressurization levels are assessed for the retention of structural integrity in the presence of damage/defects. Damage initiation, growth, accumulation, and propagation to fracture are included in the simulations. Damage propagation and burst pressures for defective and defect-free shells are compared to evaluate damage tolerance. Design implications with regard to defect and damage tolerance of a large steel pressure vessel are examined.

  17. Test experience on an ultrareliable computer communication network

    NASA Technical Reports Server (NTRS)

    Abbott, L. W.

    1984-01-01

    The dispersed sensor processing mesh (DSPM) is an experimental, ultra-reliable, fault-tolerant computer communications network that exhibits an organic-like ability to regenerate itself after suffering damage. The regeneration is accomplished by two routines - grow and repair. This paper discusses the DSPM concept for achieving fault tolerance and provides a brief description of the mechanization of both the experiment and the six-node experimental network. The main topic of this paper is the system performance of the growth algorithm contained in the grow routine. The characteristics imbued to DSPM by the growth algorithm are also discussed. Data from an experimental DSPM network and software simulation of larger DSPM-type networks are used to examine the inherent limitation on growth time by the growth algorithm and the relationship of growth time to network size and topology.

  18. Self-Checking Pairs Of Microprocessors

    NASA Technical Reports Server (NTRS)

    Smith, Brian S.

    1995-01-01

    Method of imparting fault tolerance to computer system provides for immediate detection of faults at microprocessor level. Shadow microprocessor provides nominal duplicate outputs to verify functioning of main microprocessor. When output signal on any pin of one microprocessor differs from that on corresponding pin of other microprocessor, comparator puts out alarm signal.
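
    A minimal sketch of the self-checking-pair idea: the same inputs drive a main and a shadow computation, and any disagreement raises an alarm immediately, before a faulty result can propagate. In hardware the comparison is pin-by-pin; here it is simply a comparison of the two outputs, with invented names.

        class MismatchAlarm(Exception):
            pass

        def self_checking_pair(main, shadow, inputs):
            out_main, out_shadow = main(*inputs), shadow(*inputs)
            if out_main != out_shadow:                     # any disagreement -> immediate fault detection
                raise MismatchAlarm(f"pair disagreement: {out_main!r} vs {out_shadow!r}")
            return out_main

        # The shadow here is simply a second copy of the same routine.
        control_law = lambda x, gain: gain * x
        print(self_checking_pair(control_law, control_law, (2.0, 1.5)))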

  19. Middleware for big data processing: test results

    NASA Astrophysics Data System (ADS)

    Gankevich, I.; Gaiduchok, V.; Korkhov, V.; Degtyarev, A.; Bogdanov, A.

    2017-12-01

    Dealing with large volumes of data is resource-consuming work which is more and more often delegated not only to a single computer but also to a whole distributed computing system at once. As the number of computers in a distributed system increases, the amount of effort put into effective management of the system grows. When the system reaches some critical size, much effort should be put into improving its fault tolerance. It is difficult to estimate when some particular distributed system needs such facilities for a given workload, so instead they should be implemented in a middleware which works efficiently with a distributed system of any size. It is also difficult to estimate whether a volume of data is large or not, so the middleware should also work with data of any volume. In other words, the purpose of the middleware is to provide facilities that adapt distributed computing system for a given workload. In this paper we introduce such middleware appliance. Tests show that this middleware is well-suited for typical HPC and big data workloads and its performance is comparable with well-known alternatives.

  20. Stochastic Stability of Sampled Data Systems with a Jump Linear Controller

    NASA Technical Reports Server (NTRS)

    Gonzalez, Oscar R.; Herencia-Zapana, Heber; Gray, W. Steven

    2004-01-01

    In this paper an equivalence between the stochastic stability of a sampled-data system and its associated discrete-time representation is established. The sampled-data system consists of a deterministic, linear, time-invariant, continuous-time plant and a stochastic, linear, time-invariant, discrete-time, jump linear controller. The jump linear controller models computer systems and communication networks that are subject to stochastic upsets or disruptions. This sampled-data model has been used in the analysis and design of fault-tolerant systems and computer-control systems with random communication delays without taking into account the inter-sample response. This paper shows that the known equivalence between the stability of a deterministic sampled-data system and the associated discrete-time representation holds even in a stochastic framework.

  1. Relaxed fault-tolerant hardware implementation of neural networks in the presence of multiple transient errors.

    PubMed

    Mahdiani, Hamid Reza; Fakhraie, Sied Mehdi; Lucas, Caro

    2012-08-01

    Reliability should be identified as the most important challenge in future nano-scale very large scale integration (VLSI) implementation technologies for the development of complex integrated systems. Normally, fault tolerance (FT) in a conventional system is achieved by increasing its redundancy, which also implies higher implementation costs and lower performance that sometimes makes it even infeasible. In contrast to custom approaches, a new class of applications is categorized in this paper, which is inherently capable of absorbing some degrees of vulnerability and providing FT based on their natural properties. Neural networks are good indicators of imprecision-tolerant applications. We have also proposed a new class of FT techniques called relaxed fault-tolerant (RFT) techniques which are developed for VLSI implementation of imprecision-tolerant applications. The main advantage of RFT techniques with respect to traditional FT solutions is that they exploit inherent FT of different applications to reduce their implementation costs while improving their performance. To show the applicability as well as the efficiency of the RFT method, the experimental results for implementation of a face-recognition computationally intensive neural network and its corresponding RFT realization are presented in this paper. The results demonstrate promising higher performance of artificial neural network VLSI solutions for complex applications in faulty nano-scale implementation environments.

  2. An Architectural Concept for Intrusion Tolerance in Air Traffic Networks

    NASA Technical Reports Server (NTRS)

    Maddalon, Jeffrey M.; Miner, Paul S.

    2003-01-01

    The goal of an intrusion tolerant network is to continue to provide predictable and reliable communication in the presence of a limited number of compromised network components. The behavior of a compromised network component ranges from a node that no longer responds to a node that is under the control of a malicious entity that is actively trying to cause other nodes to fail. Most current data communication networks do not include support for tolerating unconstrained misbehavior of components in the network. However, the fault tolerance community has developed protocols that provide both predictable and reliable communication in the presence of the worst possible behavior of a limited number of nodes in the system. One may view a malicious entity in a communication network as a node that has failed and is behaving in an arbitrary manner. NASA/Langley Research Center has developed one such fault-tolerant computing platform called SPIDER (Scalable Processor-Independent Design for Electromagnetic Resilience). The protocols and interconnection mechanisms of SPIDER may be adapted to large-scale, distributed communication networks such as would be required for future Air Traffic Management systems. The predictability and reliability guarantees provided by the SPIDER protocols have been formally verified. This analysis can be readily adapted to similar network structures.

  3. An abstract specification language for Markov reliability models

    NASA Technical Reports Server (NTRS)

    Butler, R. W.

    1985-01-01

    Markov models can be used to compute the reliability of virtually any fault tolerant system. However, the process of delineating all of the states and transitions in a model of a complex system can be devastatingly tedious and error-prone. An approach to this problem is presented utilizing an abstract model definition language. This high level language is described in a nonformal manner and illustrated by example.
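
    The sketch below gives a flavor of what such an abstract definition buys (it is a toy, not the language described in the report): a few high-level parameters expand automatically into the explicit states and transitions of a Markov reliability model, so the analyst never enumerates them by hand.

        def expand_nmr_model(n_units, lam, min_working):
            """Return {state: [(next_state, rate), ...]} for an N-modular-redundant system."""
            transitions = {}
            for working in range(n_units, min_working - 1, -1):
                if working == min_working:
                    transitions[working] = [("FAILED", working * lam)]    # the next fault is fatal
                else:
                    transitions[working] = [(working - 1, working * lam)]
            transitions["FAILED"] = []                                    # absorbing failure state
            return transitions

        # A one-line "specification" yields the whole chain, e.g. a quadruplex needing 2 good units.
        for state, outgoing in expand_nmr_model(4, 1e-4, 2).items():
            print(state, "->", outgoing)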

  4. An abstract language for specifying Markov reliability models

    NASA Technical Reports Server (NTRS)

    Butler, Ricky W.

    1986-01-01

    Markov models can be used to compute the reliability of virtually any fault tolerant system. However, the process of delineating all of the states and transitions in a model of a complex system can be devastatingly tedious and error-prone. An approach to this problem is presented utilizing an abstract model definition language. This high level language is described in a nonformal manner and illustrated by example.

  5. Periodically Self Restoring Redundant Systems for VLSI Based Highly Reliable Design,

    DTIC Science & Technology

    1984-01-01

    fault tolerance technique for realizing highly reliable computer systems for critical control applications . However, VL.SI technology has imposed a...operating correctly; failed critical real time control applications . n modules are discarded from the vote. the classical "static" voted redundancy...redundant modules are failure number of InterconnecttIon3. This results In f aree. However, for applications requiring higm modular complexity because

  6. Nonuniform code concatenation for universal fault-tolerant quantum computing

    NASA Astrophysics Data System (ADS)

    Nikahd, Eesa; Sedighi, Mehdi; Saheb Zamani, Morteza

    2017-09-01

    Using transversal gates is a straightforward and efficient technique for fault-tolerant quantum computing. Since transversal gates alone cannot be computationally universal, they must be combined with other approaches such as magic state distillation, code switching, or code concatenation to achieve universality. In this paper we propose an alternative approach for universal fault-tolerant quantum computing, mainly based on the code concatenation approach proposed in [T. Jochym-O'Connor and R. Laflamme, Phys. Rev. Lett. 112, 010505 (2014), 10.1103/PhysRevLett.112.010505], but in a nonuniform fashion. The proposed approach is described based on nonuniform concatenation of the 7-qubit Steane code with the 15-qubit Reed-Muller code, as well as the 5-qubit code with the 15-qubit Reed-Muller code, which lead to two 49-qubit and 47-qubit codes, respectively. These codes can correct any arbitrary single physical error with the ability to perform a universal set of fault-tolerant gates, without using magic state distillation.

  7. Autonomous safety and reliability features of the K-1 avionics system

    NASA Astrophysics Data System (ADS)

    Mueller, George E.; Kohrs, Dick; Bailey, Richard; Lai, Gary

    2004-03-01

    Kistler Aerospace Corporation is developing the K-1, a fully reusable, two-stage-to-orbit launch vehicle. Both stages return to the launch site using parachutes and airbags. Initial flight operations will occur from Woomera, Australia. K-1 guidance is performed autonomously. Each stage of the K-1 employs a triplex, fault tolerant avionics architecture, including three fault tolerant computers and three radiation hardened Embedded GPS/INS units with a hardware voter. The K-1 has an Integrated Vehicle Health Management (IVHM) system on each stage residing in the three vehicle computers based on similar systems in commercial aircraft. During first-stage ascent, the IVHM system performs an Instantaneous Impact Prediction (IIP) calculation 25 times per second, initiating an abort in the event the vehicle is outside a predetermined safety corridor for at least 3 consecutive calculations. In this event, commands are issued to terminate thrust, separate the stages, dump all propellant in the first-stage, and initiate a normal landing sequence. The second-stage flight computer calculates its ability to reach orbit along its state vector, initiating an abort sequence similar to the first stage if it cannot. On a nominal mission, following separation, the second-stage also performs calculations to assure its impact point is within a safety corridor. The K-1's guidance and control design is being tested through simulation with hardware-in-the-loop at Draper Laboratory. Kistler's verification strategy assures reliable and safe operation of the K-1.
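
    The abort-voting rule described above (an abort only after at least three consecutive out-of-corridor evaluations at the 25 Hz IIP rate) amounts to simple debouncing logic; a hedged sketch follows, with the corridor test itself left as a placeholder supplied by the caller.

        class AbortMonitor:
            def __init__(self, consecutive_required=3):
                self.required = consecutive_required
                self.violations = 0

            def update(self, impact_point_in_corridor: bool) -> bool:
                """Call once per 40 ms IIP cycle; returns True when an abort should be commanded."""
                self.violations = 0 if impact_point_in_corridor else self.violations + 1
                return self.violations >= self.required

        monitor = AbortMonitor()
        for in_corridor in [True, False, False, True, False, False, False]:
            if monitor.update(in_corridor):
                print("ABORT: terminate thrust, separate stages, dump first-stage propellant")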

  8. BESIII Physical Analysis on Hadoop Platform

    NASA Astrophysics Data System (ADS)

    Huo, Jing; Zang, Dongsong; Lei, Xiaofeng; Li, Qiang; Sun, Gongxing

    2014-06-01

    In the past 20 years, computing clusters have been widely used for High Energy Physics data processing. The jobs running on a traditional cluster with a Data-to-Computing structure have to read large volumes of data via the network to the computing nodes for analysis, making I/O latency a bottleneck of the whole system. The new distributed computing technology based on the MapReduce programming model has many advantages, such as high concurrency, high scalability and high fault tolerance, and it can benefit us in dealing with Big Data. This paper brings the idea of using the MapReduce model to BESIII physical analysis, and presents a new data analysis system structure based on the Hadoop platform, which not only greatly improves the efficiency of data analysis, but also reduces the cost of system building. Moreover, this paper establishes an event pre-selection system based on the event-level metadata (TAGs) database to optimize the data analyzing procedure.
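
    A toy MapReduce-style event selection (not the BESIII/Hadoop code; the cut and channel names are invented) shows the shape of the computation: map emits a key-value pair for each event passing a TAG-level pre-selection, and reduce sums the counts per channel. On Hadoop the same two functions would run distributed across the event files.

        from collections import defaultdict

        def map_phase(events):
            for event in events:
                if event["n_charged"] >= 2:                  # crude pre-selection cut (assumed)
                    yield (event["channel"], 1)

        def reduce_phase(pairs):
            counts = defaultdict(int)
            for channel, n in pairs:
                counts[channel] += n
            return dict(counts)

        events = [
            {"channel": "J/psi -> mu+ mu-", "n_charged": 2},
            {"channel": "J/psi -> gamma gamma", "n_charged": 0},
            {"channel": "J/psi -> mu+ mu-", "n_charged": 2},
        ]
        print(reduce_phase(map_phase(events)))               # {'J/psi -> mu+ mu-': 2}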

  9. RRAM-based parallel computing architecture using k-nearest neighbor classification for pattern recognition

    NASA Astrophysics Data System (ADS)

    Jiang, Yuning; Kang, Jinfeng; Wang, Xinan

    2017-03-01

    Resistive switching memory (RRAM) is considered as one of the most promising devices for parallel computing solutions that may overcome the von Neumann bottleneck of today’s electronic systems. However, the existing RRAM-based parallel computing architectures suffer from practical problems such as device variations and extra computing circuits. In this work, we propose a novel parallel computing architecture for pattern recognition by implementing k-nearest neighbor classification on metal-oxide RRAM crossbar arrays. Metal-oxide RRAM with gradual RESET behaviors is chosen as both the storage and computing components. The proposed architecture is tested by the MNIST database. High speed (~100 ns per example) and high recognition accuracy (97.05%) are obtained. The influence of several non-ideal device properties is also discussed, and it turns out that the proposed architecture shows great tolerance to device variations. This work paves a new way to achieve RRAM-based parallel computing hardware systems with high performance.
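
    For reference, the classification the crossbar accelerates is ordinary k-nearest-neighbor voting: distances from the query to every stored pattern (evaluated in parallel in the analog array) followed by a majority vote among the k closest. A plain NumPy sketch with made-up data follows.

        import numpy as np

        def knn_predict(train_x, train_y, query, k=5):
            distances = np.linalg.norm(train_x - query, axis=1)   # one distance per stored pattern
            nearest = np.argsort(distances)[:k]                   # indices of the k closest patterns
            labels, votes = np.unique(train_y[nearest], return_counts=True)
            return labels[np.argmax(votes)]                       # majority vote among the neighbors

        rng = np.random.default_rng(0)
        train_x = rng.normal(size=(100, 784))                     # stand-in for 28x28 MNIST digit vectors
        train_y = rng.integers(0, 10, size=100)
        print(knn_predict(train_x, train_y, train_x[3]))          # classify one of the stored patterns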

  10. Formal design and verification of a reliable computing platform for real-time control (phase 3 results)

    NASA Technical Reports Server (NTRS)

    Butler, Ricky W.; Divito, Ben L.; Holloway, C. Michael

    1994-01-01

    In this paper the design and formal verification of the lower levels of the Reliable Computing Platform (RCP), a fault-tolerant computing system for digital flight control applications, are presented. The RCP uses NMR-style redundancy to mask faults and internal majority voting to flush the effects of transient faults. Two new layers of the RCP hierarchy are introduced: the Minimal Voting refinement (DA_minv) of the Distributed Asynchronous (DA) model and the Local Executive (LE) Model. Both the DA_minv model and the LE model are specified formally and have been verified using the Ehdm verification system. All specifications and proofs are available electronically via the Internet using anonymous FTP or World Wide Web (WWW) access.

  11. Transparent Ada rendezvous in a fault tolerant distributed system

    NASA Technical Reports Server (NTRS)

    Racine, Roger

    1986-01-01

    There are many problems associated with distributing an Ada program over a loosely coupled communication network. Some of these problems involve the various aspects of the distributed rendezvous. The problems addressed involve supporting the delay statement in a selective call and supporting the else clause in a selective call. Most of these difficulties are compounded by the need for an efficient communication system. The difficulties are compounded even more by considering the possibility of hardware faults occurring while the program is running. With a hardware fault tolerant computer system, it is possible to design a distribution scheme and communication software which is efficient and allows Ada semantics to be preserved. An Ada design for the communications software of one such system will be presented, including a description of the services provided in the seven layers of an International Standards Organization (ISO) Open System Interconnect (OSI) model communications system. The system capabilities (hardware and software) that allow this communication system will also be described.

  12. Loss Tolerance in One-Way Quantum Computation via Counterfactual Error Correction

    NASA Astrophysics Data System (ADS)

    Varnava, Michael; Browne, Daniel E.; Rudolph, Terry

    2006-09-01

    We introduce a scheme for fault tolerantly dealing with losses (or other “leakage” errors) in cluster state computation that can tolerate up to 50% qubit loss. This is achieved passively using an adaptive strategy of measurement—no coherent measurements or coherent correction is required. Since the scheme relies on inferring information about what would have been the outcome of a measurement had one been able to carry it out, we call this counterfactual error correction.

  13. Data and results of a laboratory investigation of microprocessor upset caused by simulated lightning-induced analog transients

    NASA Technical Reports Server (NTRS)

    Belcastro, C. M.

    1984-01-01

    Advanced composite aircraft designs include fault-tolerant computer-based digital control systems with high reliability requirements for adverse as well as optimum operating environments. Since aircraft penetrate intense electromagnetic fields during thunderstorms, onboard computer systems may be subjected to field-induced transient voltages and currents resulting in functional error modes which are collectively referred to as digital system upset. A methodology was developed for assessing the upset susceptibility of a computer system onboard an aircraft flying through a lightning environment. Upset error modes in a general-purpose microprocessor were studied via tests which involved the random input of analog transients which model lightning-induced signals onto interface lines of an 8080-based microcomputer from which upset error data were recorded. The application of Markov modeling to upset susceptibility estimation is discussed and a stochastic model is developed.

  14. RAMP: A fault tolerant distributed microcomputer structure for aircraft navigation and control

    NASA Technical Reports Server (NTRS)

    Dunn, W. R.

    1980-01-01

    RAMP consists of distributed sets of parallel computers partitioned on the basis of software and packaging constraints. To minimize hardware and software complexity, the processors operate asynchronously. It was shown that, through the design of asymptotically stable control laws, data errors due to the asynchronism were minimized. It was further shown that by designing control laws with this property and making minor hardware modifications to the RAMP modules, the system became inherently tolerant to intermittent faults. A laboratory version of RAMP was constructed and is described in the paper along with the experimental results.

  15. Automation of reliability evaluation procedures through CARE - The computer-aided reliability estimation program.

    NASA Technical Reports Server (NTRS)

    Mathur, F. P.

    1972-01-01

    Description of an on-line interactive computer program called CARE (Computer-Aided Reliability Estimation) which can model self-repair and fault-tolerant organizations and perform certain other functions. Essentially CARE consists of a repository of mathematical equations defining the various basic redundancy schemes. These equations, under program control, are then interrelated to generate the desired mathematical model to fit the architecture of the system under evaluation. The mathematical model is then supplied with ground instances of its variables and is then evaluated to generate values for the reliability-theoretic functions applied to the model.
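
    As a taste of the "repository of mathematical equations defining the various basic redundancy schemes," the sketch below evaluates one of the simplest such equations, the reliability of a triple-modular-redundant (TMR) configuration versus a simplex unit with constant failure rate; the rate and mission time are assumed values.

        import math

        def unit_reliability(lam, t):
            return math.exp(-lam * t)                 # constant-failure-rate (exponential) unit

        def tmr_reliability(lam, t):
            r = unit_reliability(lam, t)
            return 3 * r**2 - 2 * r**3                # at least 2 of the 3 units must survive

        lam, t = 1e-4, 1000.0                         # failures per hour, mission hours (assumed)
        print(f"simplex: {unit_reliability(lam, t):.6f}   TMR: {tmr_reliability(lam, t):.6f}")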

  16. A distributed programming environment for Ada

    NASA Technical Reports Server (NTRS)

    Brennan, Peter; Mcdonnell, Tom; Mcfarland, Gregory; Timmins, Lawrence J.; Litke, John D.

    1986-01-01

    Despite considerable commercial exploitation of fault tolerance systems, significant and difficult research problems remain in such areas as fault detection and correction. A research project is described which constructs a distributed computing test bed for loosely coupled computers. The project is constructing a tool kit to support research into distributed control algorithms, including a distributed Ada compiler, distributed debugger, test harnesses, and environment monitors. The Ada compiler is being written in Ada and will implement distributed computing at the subsystem level. The design goal is to provide a variety of control mechanisms for distributed programming while retaining total transparency at the code level.

  17. Nonlinear optics quantum computing with circuit QED.

    PubMed

    Adhikari, Prabin; Hafezi, Mohammad; Taylor, J M

    2013-02-08

    One approach to quantum information processing is to use photons as quantum bits and rely on linear optical elements for most operations. However, some optical nonlinearity is necessary to enable universal quantum computing. Here, we suggest a circuit-QED approach to nonlinear optics quantum computing in the microwave regime, including a deterministic two-photon phase gate. Our specific example uses a hybrid quantum system comprising an LC resonator coupled to a superconducting flux qubit to implement a nonlinear coupling. Compared to the self-Kerr nonlinearity, we find that our approach has improved tolerance to noise in the qubit while maintaining fast operation.

  18. Three real-time architectures - A study using reward models

    NASA Technical Reports Server (NTRS)

    Sjogren, J. A.; Smith, R. M.

    1990-01-01

    Numerous applications in the area of computer system analysis can be effectively studied with Markov reward models. These models describe the evolutionary behavior of the computer system by a continuous-time Markov chain, and a reward rate is associated with each state. In reliability/availability models, up states have reward rate one and down states have reward rate zero associated with them. In a combined model of performance and reliability, the reward rate of a state may be the computational capacity, or a related performance measure. Steady-state expected reward rate and expected instantaneous reward rate are clearly useful measures which can be extracted from the Markov reward model. The diversity of areas where Markov reward models may be used is illustrated with a comparative study of three examples of interest to the fault tolerant computing community.

  19. Distributed computing system with dual independent communications paths between computers and employing split tokens

    NASA Technical Reports Server (NTRS)

    Rasmussen, Robert D. (Inventor); Manning, Robert M. (Inventor); Lewis, Blair F. (Inventor); Bolotin, Gary S. (Inventor); Ward, Richard S. (Inventor)

    1990-01-01

    This is a distributed computing system providing flexible fault tolerance; ease of software design and concurrency specification; and dynamic balance of the loads. The system comprises a plurality of computers, each having a first input/output interface and a second input/output interface for interfacing to communications networks, each second input/output interface including a bypass for bypassing the associated computer. A global communications network interconnects the first input/output interfaces, providing each computer the ability to broadcast messages simultaneously to the remainder of the computers. A meshwork communications network interconnects the second input/output interfaces, providing each computer with the ability to establish a communications link with another of the computers, bypassing the remainder of the computers. Each computer is controlled by a resident copy of a common operating system. Communications between respective ones of the computers is by means of split tokens, each having a moving first portion which is sent from computer to computer and a resident second portion which is disposed in the memory of at least one of the computers, wherein the location of the second portion is part of the first portion. The split tokens represent both functions to be executed by the computers and data to be employed in the execution of the functions. The first input/output interfaces each include logic for detecting a collision between messages and for terminating the broadcasting of a message, whereby collisions between messages are detected and avoided.

  20. Edge detection based on computational ghost imaging with structured illuminations

    NASA Astrophysics Data System (ADS)

    Yuan, Sheng; Xiang, Dong; Liu, Xuemei; Zhou, Xin; Bing, Pibin

    2018-03-01

    Edge detection is one of the most important tools to recognize the features of an object. In this paper, we propose an optical edge detection method based on computational ghost imaging (CGI) with structured illuminations which are generated by an interference system. The structured intensity patterns are designed to make the edge of an object be directly imaged from detected data in CGI. This edge detection method can extract the boundaries for both binary and grayscale objects in any direction at one time. We also numerically test the influence of distance deviations in the interference system on edge extraction, i.e., the tolerance of the optical edge detection system to distance deviation. Hopefully, it may provide a guideline for scholars to build an experimental system.

  1. Reliability and maintainability assessment factors for reliable fault-tolerant systems

    NASA Technical Reports Server (NTRS)

    Bavuso, S. J.

    1984-01-01

    A long term goal of the NASA Langley Research Center is the development of a reliability assessment methodology of sufficient power to enable the credible comparison of the stochastic attributes of one ultrareliable system design against others. This methodology, developed over a 10 year period, is a combined analytic and simulative technique. An analytic component is the Computer Aided Reliability Estimation capability, third generation, or simply CARE III. A simulative component is the Gate Logic Software Simulator capability, or GLOSS. The numerous factors that potentially have a degrading effect on system reliability, and the ways in which these factors peculiar to highly reliable fault tolerant systems are accounted for in credible reliability assessments, are presented. Also presented are the modeling difficulties that result from their inclusion and the ways in which CARE III and GLOSS mitigate the intractability of the heretofore unworkable mathematics.

  2. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Korneeva, Anna; Shaydurov, Vladimir

    In the paper, data from thin-film thermoresistors coated onto a radio-electronic printed circuit board are analyzed to determine possible zones of overheating. The mathematical model consists of an underdetermined system of linear algebraic equations with an infinite set of solutions. To compute a more realistic solution, two additional conditions are used: the smoothness of the solution and the positivity of the temperature increase during overheating. Computational experiments demonstrate that an overheating zone is determined exactly, with tolerable accuracy in its temperature.

  3. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ali, Amjad Majid; Albert, Don; Andersson, Par

    SLURM is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small computer clusters. As a cluster resource manager, SLURM has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates conflicting requests for resources by managing a queue of pending work.

  4. From experiment to design -- Fault characterization and detection in parallel computer systems using computational accelerators

    NASA Astrophysics Data System (ADS)

    Yim, Keun Soo

    This dissertation summarizes experimental validation and co-design studies conducted to optimize the fault detection capabilities and overheads in hybrid computer systems (e.g., using CPUs and Graphics Processing Units, or GPUs), and consequently to improve the scalability of parallel computer systems using computational accelerators. The experimental validation studies were conducted to help us understand the failure characteristics of CPU-GPU hybrid computer systems under various types of hardware faults. The main characterization targets were faults that are difficult to detect and/or recover from, e.g., faults that cause long latency failures (Ch. 3), faults in dynamically allocated resources (Ch. 4), faults in GPUs (Ch. 5), faults in MPI programs (Ch. 6), and microarchitecture-level faults with specific timing features (Ch. 7). The co-design studies were based on the characterization results. One of the co-designed systems has a set of source-to-source translators that customize and strategically place error detectors in the source code of target GPU programs (Ch. 5). Another co-designed system uses an extension card to learn the normal behavioral and semantic execution patterns of message-passing processes executing on CPUs, and to detect abnormal behaviors of those parallel processes (Ch. 6). The third co-designed system is a co-processor that has a set of new instructions in order to support software-implemented fault detection techniques (Ch. 7). The work described in this dissertation gains more importance because heterogeneous processors have become an essential component of state-of-the-art supercomputers. GPUs were used in three of the five fastest supercomputers that were operating in 2011. Our work included comprehensive fault characterization studies in CPU-GPU hybrid computers. In CPUs, we monitored the target systems for a long period of time after injecting faults (a temporally comprehensive experiment), and injected faults into various types of program states that included dynamically allocated memory (to be spatially comprehensive). In GPUs, we used fault injection studies to demonstrate the importance of detecting silent data corruption (SDC) errors that are mainly due to the lack of fine-grained protections and the massive use of fault-insensitive data. This dissertation also presents transparent fault tolerance frameworks and techniques that are directly applicable to hybrid computers built using only commercial off-the-shelf hardware components. This dissertation shows that by developing understanding of the failure characteristics and error propagation paths of target programs, we were able to create fault tolerance frameworks and techniques that can quickly detect and recover from hardware faults with low performance and hardware overheads.
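
    In the spirit of the fault-injection and detection studies summarized above (illustrative only; not the dissertation's frameworks), the sketch below flips one bit of a double-precision result to emulate a hardware fault and uses duplicate-and-compare execution to catch the resulting silent data corruption.

        import random
        import struct

        def flip_bit(x: float, bit: int) -> float:
            """Flip one bit of the IEEE-754 double representation of x."""
            (bits,) = struct.unpack("<Q", struct.pack("<d", x))
            return struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))[0]

        def faulty_sum(values, inject=False):
            total = sum(values)
            if inject:                                         # emulate an upset corrupting the result
                total = flip_bit(total, random.randrange(64))
            return total

        def duplicate_and_compare(values, inject):
            a = faulty_sum(values, inject=inject)
            b = faulty_sum(values, inject=False)               # redundant re-execution
            return ("SDC detected" if a != b else "outputs agree", a, b)

        print(duplicate_and_compare([1.0, 2.0, 3.0], inject=True))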

  5. Advantages of High Tolerance Measurements in Fusion Environments Applying Photogrammetry

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    T. Dodson, R. Ellis, C. Priniski, S. Raftopoulos, D. Stevens, M. Viola

    2009-02-04

    Photogrammetry, a state-of-the-art technique of metrology employing digital photographs as the vehicle for measurement, has been investigated in the fusion environment. Benefits of this high tolerance methodology include relatively easy deployment for multiple point measurements and deformation/distortion studies. Depending on the equipment used, photogrammetric systems can reach tolerances of 25 microns (0.001 in) to 100 microns (0.004 in) on a 3-meter object. During the fabrication and assembly of the National Compact Stellarator Experiment (NCSX), the primary measurement systems deployed were CAD coordinate-based computer metrology equipment and supporting algorithms, such as both interferometer-aided (IFM) and absolute distance measurement-based (ADM) laser trackers, as well as portable Coordinate Measurement Machine (CMM) arms. Photogrammetry was employed at NCSX as a quick and easy tool to monitor coil distortions incurred during welding operations of the machine assembly process and as a way to reduce assembly downtime for metrology processes.

  6. Reliable communication in the presence of failures

    NASA Technical Reports Server (NTRS)

    Birman, Kenneth P.; Joseph, Thomas A.

    1987-01-01

    The design and correctness of a communication facility for a distributed computer system are reported on. The facility provides support for fault-tolerant process groups in the form of a family of reliable multicast protocols that can be used in both local- and wide-area networks. These protocols attain high levels of concurrency, while respecting application-specific delivery ordering constraints, and have varying cost and performance that depend on the degree of ordering desired. In particular, a protocol that enforces causal delivery orderings is introduced and shown to be a valuable alternative to conventional asynchronous communication protocols. The facility also ensures that the processes belonging to a fault-tolerant process group will observe consistent orderings of events affecting the group as a whole, including process failures, recoveries, migration, and dynamic changes to group properties like member rankings. A review of several uses of the protocols in the ISIS system, which supports fault-tolerant resilient objects and bulletin boards, illustrates the significant simplification of higher-level algorithms made possible by our approach.

  7. Expert Systems for the Scheduling of Image Processing Tasks on a Parallel Processing System

    DTIC Science & Technology

    1986-12-01

    existed for over twenty years. Credit for designing and implementing the first computer vision system is usually given to L. G. Roberts [Robe65]. With...hardware differences between systems. 44 LIST OF REFERENCES [Adam82] G. B. Adams III and H. J. Siegel, "The Extra Stage Cube: a Fault-Tolerant...Academic Press, 1985 [Robe65] L. G. Roberts, "Machine Perception of Three-Dimensional Solids," in Optical and Electro-Optical Information Processing, ed. J

  8. Laboratory test methodology for evaluating the effects of electromagnetic disturbances on fault-tolerant control systems

    NASA Technical Reports Server (NTRS)

    Belcastro, Celeste M.

    1989-01-01

    Control systems for advanced aircraft, especially those with relaxed static stability, will be critical to flight and will, therefore, have very high reliability specifications which must be met for adverse as well as nominal operating conditions. Adverse conditions can result from electromagnetic disturbances caused by lightning, high energy radio frequency transmitters, and nuclear electromagnetic pulses. Tools and techniques must be developed to verify the integrity of the control system in adverse operating conditions. The most difficult and elusive perturbations to computer-based control systems caused by an electromagnetic environment (EME) are functional error modes that involve no component damage. These error modes are collectively known as upset, can occur simultaneously in all of the channels of a redundant control system, and are software dependent. A methodology is presented for performing upset tests on a multichannel control system and considerations are discussed for the design of upset tests to be conducted in the lab on fault-tolerant control systems operating in a closed loop with a simulated plant.

  9. The implementation and use of Ada on distributed systems with high reliability requirements

    NASA Technical Reports Server (NTRS)

    Knight, J. C.

    1986-01-01

    The general inadequacy of Ada for programming systems that must survive processor loss was shown. A solution to the problem was proposed in which there are no syntactic changes to Ada. The approach was evaluated using a full-scale, realistic application. The application used was the Advanced Transport Operating System (ATOPS), an experimental computer control system developed for a modified Boeing 737 aircraft. The ATOPS system is a full authority, real-time avionics system providing a large variety of advanced features. Methods of building fault tolerance into concurrent systems were explored. A set of criteria by which the proposed method will be judged was examined. Extensive interaction with personnel from Computer Sciences Corporation and NASA Langley occurred to determine the requirements of the ATOPS software. Backward error recovery in concurrent systems was assessed.

  10. Graphics enhanced computer emulation for improved timing-race and fault tolerance control system analysis. [of Centaur liquid-fuel booster

    NASA Technical Reports Server (NTRS)

    Szatkowski, G. P.

    1983-01-01

    A computer simulation system has been developed for the Space Shuttle's advanced Centaur liquid fuel booster rocket, in order to conduct systems safety verification and flight operations training. This simulation utility is designed to analyze functional system behavior by integrating control avionics with mechanical and fluid elements, and is able to emulate any system operation, from simple relay logic to complex VLSI components, with wire-by-wire detail. A novel graphics data entry system offers a pseudo-wire wrap data base that can be easily updated. Visual subsystem operations can be selected and displayed in color on a six-monitor graphics processor. System timing and fault verification analyses are conducted by injecting component fault modes and min/max timing delays, and then observing system operation through a red line monitor.

  11. Use of Field Programmable Gate Array Technology in Future Space Avionics

    NASA Technical Reports Server (NTRS)

    Ferguson, Roscoe C.; Tate, Robert

    2005-01-01

    Fulfilling NASA's new vision for space exploration requires the development of sustainable, flexible and fault tolerant spacecraft control systems. The traditional development paradigm consists of the purchase or fabrication of hardware boards with fixed processor and/or Digital Signal Processing (DSP) components interconnected via a standardized bus system. This is followed by the purchase and/or development of software. This paradigm has several disadvantages for the development of systems to support NASA's new vision. Building a system to be fault tolerant increases the complexity and decreases the performance of included software. Standard bus design and conventional implementation produces natural bottlenecks. Configuring hardware components in systems containing common processors and DSPs is difficult initially and expensive or impossible to change later. The existence of Hardware Description Languages (HDLs), the recent increase in performance, density and radiation tolerance of Field Programmable Gate Arrays (FPGAs), and Intellectual Property (IP) Cores provides the technology for reprogrammable Systems on a Chip (SOC). This technology supports a paradigm better suited for NASA's vision. Hardware and software production are melded for more effective development; they can both evolve together over time. Designers incorporating this technology into future avionics can benefit from its flexibility. Systems can be designed with improved fault isolation and tolerance using hardware instead of software. Also, these designs can be protected from obsolescence problems where maintenance is compromised via component and vendor availability. To investigate the flexibility of this technology, the core of the Central Processing Unit and Input/Output Processor of the Space Shuttle AP101S Computer were prototyped in Verilog HDL and synthesized into an Altera Stratix FPGA.

  12. 77 FR 39353 - Wassenaar Arrangement 2011 Plenary Agreements Implementation: Commerce Control List, Definitions...

    Federal Register 2010, 2011, 2012, 2013, 2014

    2012-07-02

    ... Controls for Conventional Arms and Dual-Use Goods and Technologies is a group of 41 like-minded states... specified and packaged as medical products, are not subject to control. ECCN 1C008 (Non-Fluorinated... technology and computer system design have made control of fault tolerance neither warranted nor feasible...

  13. Analysis of Multi-State Systems with Multi-State Components Using EVMDDs

    DTIC Science & Technology

    2012-05-01

    Fault-Tolerant Computing (FTCS), pp. 249–258, June 1995. [5] T. Kam, T. Villa, R. K. Brayton, and A. L. Sangiovanni-Vincentelli, “Multi-valued...Shmerko, and R. S. Stankovic, Decision Diagram Techniques for Micro- and Nanoelectronic Design, CRC Press, Taylor & Francis Group, 2006. [16] X. Zang, D

  14. The embedded software life cycle - An expanded view

    NASA Technical Reports Server (NTRS)

    Larman, Brian T.; Loesh, Robert E.

    1989-01-01

    Six common issues that are encountered in the development of software for embedded computer systems are discussed from the perspective of their interrelationships with the development process and/or the system itself. Particular attention is given to concurrent hardware/software development, prototyping, the inaccessibility of the operational system, fault tolerance, the long life cycle, and inheritance. It is noted that the life cycle for embedded software must include elements beyond simply the specification and implementation of the target software.

  15. Data Management Working Group report

    NASA Technical Reports Server (NTRS)

    Filardo, Edward J.; Smith, David B.

    1986-01-01

    The current flight qualification program lags technology insertion by 6 to 10 years. The objective is to develop an integrated software engineering and development environment assisted by an expert system technology. An operating system needs to be developed which is portable to the on-board computers of the year 2000. The use of Ada versus a high-order language; fault tolerance; fiber optics networks; communication protocols; and security are also examined and outlined.

  16. Reliability issues in active control of large flexible space structures

    NASA Technical Reports Server (NTRS)

    Vandervelde, W. E.

    1986-01-01

    Efforts in this reporting period were centered on four research tasks: design of failure detection filters for robust performance in the presence of modeling errors, design of generalized parity relations for robust performance in the presence of modeling errors, design of failure sensitive observers using the geometric system theory of Wonham, and computational techniques for evaluation of the performance of control systems with fault tolerance and redundancy management.

  17. A proposed protocol for acceptance and constancy control of computed tomography systems: a Nordic Association for Clinical Physics (NACP) work group report.

    PubMed

    Kuttner, Samuel; Bujila, Robert; Kortesniemi, Mika; Andersson, Henrik; Kull, Love; Østerås, Bjørn Helge; Thygesen, Jesper; Tarp, Ivanka Sojat

    2013-03-01

    Quality assurance (QA) of computed tomography (CT) systems is one of the routine tasks for medical physicists in the Nordic countries. However, standardized QA protocols do not yet exist and the QA methods, as well as the applied tolerance levels, vary in scope and extent at different hospitals. To propose a standardized protocol for acceptance and constancy testing of CT scanners in the Nordic Region. Following a Nordic Association for Clinical Physics (NACP) initiative, a group of medical physicists, with representatives from four Nordic countries, was formed. Based on international literature and practical experience within the group, a comprehensive standardized test protocol was developed. The proposed protocol includes tests related to the mechanical functionality, X-ray tube, detector, and image quality for CT scanners. For each test, recommendations regarding the purpose, equipment needed, an outline of the test method, the measured parameter, tolerance levels, and the testing frequency are stated. In addition, a number of optional tests are briefly discussed that may provide further information about the CT system. Based on international references and medical physicists' practical experiences, a comprehensive QA protocol for CT systems is proposed, including both acceptance and constancy tests. The protocol may serve as a reference for medical physicists in the Nordic countries.

  18. Mass Analyzers Facilitate Research on Addiction

    NASA Technical Reports Server (NTRS)

    2012-01-01

    The famous go/no go command for Space Shuttle launches comes from a place called the Firing Room. Located at Kennedy Space Center in the Launch Control Center (LCC), there are actually four Firing Rooms that take up most of the third floor of the LCC. These rooms comprise the nerve center for Space Shuttle launch and processing. Test engineers in the Firing Rooms operate the Launch Processing System (LPS), which is a highly automated, computer-controlled system for assembly, checkout, and launch of the Space Shuttle. LPS monitors thousands of measurements on the Space Shuttle and its ground support equipment, compares them to predefined tolerance levels, and then displays values that are out of tolerance. Firing Room operators view the data and send commands about everything from propellant levels inside the external tank to temperatures inside the crew compartment. In many cases, LPS will automatically react to abnormal conditions and perform related functions without test engineer intervention; however, firing room engineers continue to look at each and every happening to ensure a safe launch. Some of the systems monitored during launch operations include electrical, cooling, communications, and computers. One of the thousands of measurements derived from these systems is the amount of hydrogen and oxygen inside the shuttle during launch.

  19. Multiple Embedded Processors for Fault-Tolerant Computing

    NASA Technical Reports Server (NTRS)

    Bolotin, Gary; Watson, Robert; Katanyoutanant, Sunant; Burke, Gary; Wang, Mandy

    2005-01-01

    A fault-tolerant computer architecture has been conceived in an effort to reduce vulnerability to single-event upsets (spurious bit flips caused by impingement of energetic ionizing particles or photons). As in some prior fault-tolerant architectures, the redundancy needed for fault tolerance is obtained by use of multiple processors in one computer. Unlike prior architectures, the multiple processors are embedded in a single field-programmable gate array (FPGA). What makes this new approach practical is the recent commercial availability of FPGAs that are capable of having multiple embedded processors. A working prototype (see figure) consists of two embedded IBM PowerPC 405 processor cores and a comparator built on a Xilinx Virtex-II Pro FPGA. This relatively simple instantiation of the architecture implements an error-detection scheme. A planned future version, incorporating four processors and two comparators, would correct some errors in addition to detecting them.
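
    The following Python fragment is a minimal sketch of the duplicate-and-compare scheme that the prototype implements in hardware (two processor cores feeding a comparator); it is not the JPL design itself, and the step functions and states are placeholders. A mismatch signals a detected, though uncorrected, error; the planned four-processor version would add the redundancy needed to correct some errors as well.

        # Minimal sketch of lockstep duplicate-and-compare error detection.
        def lockstep_step(step_a, step_b, state_a, state_b, inputs):
            out_a = step_a(state_a, inputs)   # replica A executes the step
            out_b = step_b(state_b, inputs)   # replica B executes the same step
            if out_a != out_b:
                raise RuntimeError("comparator mismatch: error detected")
            return out_a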

  20. Combining dynamical decoupling with fault-tolerant quantum computation

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ng, Hui Khoon; Preskill, John; Lidar, Daniel A.

    2011-07-15

    We study how dynamical decoupling (DD) pulse sequences can improve the reliability of quantum computers. We prove upper bounds on the accuracy of DD-protected quantum gates and derive sufficient conditions for DD-protected gates to outperform unprotected gates. Under suitable conditions, fault-tolerant quantum circuits constructed from DD-protected gates can tolerate stronger noise and have a lower overhead cost than fault-tolerant circuits constructed from unprotected gates. Our accuracy estimates depend on the dynamics of the bath that couples to the quantum computer and can be expressed either in terms of the operator norm of the bath's Hamiltonian or in terms of the power spectrum of bath correlations; we explain in particular how the performance of recursively generated concatenated pulse sequences can be analyzed from either viewpoint. Our results apply to Hamiltonian noise models with limited spatial correlations.

  1. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Wu, Qishi; Zhu, Mengxia; Rao, Nageswara S

    We propose an intelligent decision support system based on sensor and computer networks that incorporates various component techniques for sensor deployment, data routing, distributed computing, and information fusion. The integrated system is deployed in a distributed environment composed of both wireless sensor networks for data collection and wired computer networks for data processing in support of homeland security defense. We present the system framework and formulate the analytical problems and develop approximate or exact solutions for the subtasks: (i) sensor deployment strategy based on a two-dimensional genetic algorithm to achieve maximum coverage with cost constraints; (ii) data routing scheme to achieve maximum signal strength with minimum path loss, high energy efficiency, and effective fault tolerance; (iii) network mapping method to assign computing modules to network nodes for high-performance distributed data processing; and (iv) binary decision fusion rule that derives threshold bounds to improve system hit rate and false alarm rate. These component solutions are implemented and evaluated through either experiments or simulations in various application scenarios. The extensive results demonstrate that these component solutions imbue the integrated system with the desirable and useful quality of intelligence in decision making.

  2. A nearly-linear computational-cost scheme for the forward dynamics of an N-body pendulum

    NASA Technical Reports Server (NTRS)

    Chou, Jack C. K.

    1989-01-01

    The dynamic equations of motion of an n-body pendulum with spherical joints are derived to be a mixed system of differential and algebraic equations (DAE's). The DAE's are kept in implicit form to save arithmetic and preserve the sparsity of the system and are solved by the robust implicit integration method. At each solution point, the predicted solution is corrected to its exact solution within given tolerance using Newton's iterative method. For each iteration, a linear system of the form JΔX = E has to be solved. The computational cost for solving this linear system directly by LU factorization is O(n^3), and it can be reduced significantly by exploring the structure of J. It is shown that by recognizing the recursive patterns and exploiting the sparsity of the system the multiplicative and additive computational costs for solving JΔX = E are O(n) and O(n^2), respectively. The formulation and solution method for an n-body pendulum is presented. The computational cost is shown to be nearly linearly proportional to the number of bodies.
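
    The per-step correction described above can be sketched as follows (an illustrative Python/NumPy outline under the stated conventions, not the paper's implementation): the predicted solution is refined by Newton iterations, each of which solves a linear system JΔX = E, until the correction meets the tolerance. The dense solve shown costs O(n^3); the paper's contribution is a structure-aware solve that reduces the multiplicative cost to roughly O(n).

        import numpy as np

        def newton_correct(rhs, jacobian, x, tol=1e-10, max_iter=20):
            """Correct a predicted solution x until the correction E is within tol.

            rhs(x) returns E, the right-hand side of the correction system J*dx = E
            (the negated residual in the usual Newton convention); jacobian(x) returns J.
            """
            for _ in range(max_iter):
                E = rhs(x)
                if np.linalg.norm(E) < tol:
                    return x
                J = jacobian(x)
                dx = np.linalg.solve(J, E)   # dense solve: O(n^3); structure-aware: ~O(n)
                x = x + dx
            raise RuntimeError("Newton iteration did not converge within max_iter")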

  3. Mission Management Computer Software for RLV-TD

    NASA Astrophysics Data System (ADS)

    Manju, C. R.; Joy, Josna Susan; Vidya, L.; Sheenarani, I.; Sruthy, C. N.; Viswanathan, P. C.; Dinesh, Sudin; Jayalekshmy, L.; Karuturi, Kesavabrahmaji; Sheema, E.; Syamala, S.; Unnikrishnan, S. Manju; Ali, S. Akbar; Paramasivam, R.; Sheela, D. S.; Shukkoor, A. Abdul; Lalithambika, V. R.; Mookiah, T.

    2017-12-01

    The Mission Management Computer (MMC) software is responsible for the autonomous navigation, sequencing, guidance and control of the Re-usable Launch Vehicle (RLV), through lift-off, ascent, coasting, re-entry, controlled descent and splashdown. A hard real-time system has been designed for handling the mission requirements in an integrated manner and for meeting the stringent timing constraints. Redundancy management and fault-tolerance techniques are also built into the system, in order to achieve a successful mission even in presence of component failures. This paper describes the functions and features of the components of the MMC software which has accomplished the successful RLV-Technology Demonstrator mission.

  4. Critical fault patterns determination in fault-tolerant computer systems

    NASA Technical Reports Server (NTRS)

    Mccluskey, E. J.; Losq, J.

    1978-01-01

    The method proposed tries to enumerate all the critical fault-patterns (successive occurrences of failures) without analyzing every single possible fault. The conditions for the system to be operating in a given mode can be expressed in terms of the static states. Thus, one can find all the system states that correspond to a given critical mode of operation. The next step consists in analyzing the fault-detection mechanisms, the diagnosis algorithm and the process of switch control. From them, one can find all the possible system configurations that can result from a failure occurrence. Thus, one can list all the characteristics, with respect to detection, diagnosis, and switch control, that failures must have to constitute critical fault-patterns. Such an enumeration of the critical fault-patterns can be directly used to evaluate the overall system tolerance to failures. Present research is focused on how to efficiently make use of these system-level characteristics to enumerate all the failures that verify these characteristics.

  5. Modelling nanoscale objects in order to conduct an empirical research into their properties as part of an engineering system designed

    NASA Astrophysics Data System (ADS)

    Makarov, M.; Shchanikov, S.; Trantina, N.

    2017-01-01

    We have investigated the properties of nanoscale objects that matter most for their future application, modelling these objects both as free-standing physical elements outside the structure of the engineering system designed to integrate them and as parts of a system operating under the influence of the external environment. For the empirical research proposed within the scope of this work, we chose a nanoscale electronic element intended for designing information processing systems with a parallel architecture: the memristor. The objective of the research was to maximize the fault-tolerance index of a memristor-based system when it is affected by internal destabilizing factors and the external environment. The results have enabled us to identify and classify the factors that determine the fault-tolerance index of a hardware implementation of a computing system based on this nanoscale electronic element base.

  6. Some Thoughts Regarding Practical Quantum Computing

    NASA Astrophysics Data System (ADS)

    Ghoshal, Debabrata; Gomez, Richard; Lanzagorta, Marco; Uhlmann, Jeffrey

    2006-03-01

    Quantum computing has become an important area of research in computer science because of its potential to provide more efficient algorithmic solutions to certain problems than are possible with classical computing. The ability to perform parallel operations over an exponentially large computational space has proved to be the main advantage of the quantum computing model. In this regard, we are particularly interested in the potential applications of quantum computers to enhance real software systems of interest to the defense, industrial, scientific and financial communities. However, while much has been written in popular and scientific literature about the benefits of the quantum computational model, several of the problems associated with the practical implementation of real-life complex software systems in quantum computers are often ignored. In this presentation we will argue that practical quantum computation is not as straightforward as commonly advertised, even if the technological problems associated with the manufacturing and engineering of large-scale quantum registers were solved overnight. We will discuss some of the frequently overlooked difficulties that plague quantum computing in the areas of memories, I/O, addressing schemes, compilers, oracles, approximate information copying, logical debugging, error correction and fault-tolerant computing protocols.

  7. Batching System for Superior Service

    NASA Technical Reports Server (NTRS)

    2001-01-01

    Veridian's Portable Batch System (PBS) was the recipient of the 1997 NASA Space Act Award for outstanding software. A batch system is a set of processes for managing queues and jobs. Without a batch system, it is difficult to manage the workload of a computer system. By bundling the enterprise's computing resources, the PBS technology offers users a single coherent interface, resulting in efficient management of the batch services. Users choose which information to package into "containers" for system-wide use. PBS also provides detailed system usage data, a procedure not easily executed without this software. PBS operates on networked, multi-platform UNIX environments. Veridian's new version, PBS Pro(TM), has additional features and enhancements, including support for additional operating systems. Veridian distributes the original version of PBS as Open Source software via the PBS website. Customers can register and download the software at no cost. PBS Pro is also available via the web and offers additional features such as increased stability, reliability, and fault tolerance. A company using PBS can expect a significant increase in the effective management of its computing resources. Tangible benefits include increased utilization of costly resources and enhanced understanding of computational requirements and user needs.

  8. Quantum Error Correction with Biased Noise

    NASA Astrophysics Data System (ADS)

    Brooks, Peter

    Quantum computing offers powerful new techniques for speeding up the calculation of many classically intractable problems. Quantum algorithms can allow for the efficient simulation of physical systems, with applications to basic research, chemical modeling, and drug discovery; other algorithms have important implications for cryptography and internet security. At the same time, building a quantum computer is a daunting task, requiring the coherent manipulation of systems with many quantum degrees of freedom while preventing environmental noise from interacting too strongly with the system. Fortunately, we know that, under reasonable assumptions, we can use the techniques of quantum error correction and fault tolerance to achieve an arbitrary reduction in the noise level. In this thesis, we look at how additional information about the structure of noise, or "noise bias," can improve or alter the performance of techniques in quantum error correction and fault tolerance. In Chapter 2, we explore the possibility of designing certain quantum gates to be extremely robust with respect to errors in their operation. This naturally leads to structured noise where certain gates can be implemented in a protected manner, allowing the user to focus their protection on the noisier unprotected operations. In Chapter 3, we examine how to tailor error-correcting codes and fault-tolerant quantum circuits in the presence of dephasing biased noise, where dephasing errors are far more common than bit-flip errors. By using an appropriately asymmetric code, we demonstrate the ability to improve the amount of error reduction and decrease the physical resources required for error correction. In Chapter 4, we analyze a variety of protocols for distilling magic states, which enable universal quantum computation, in the presence of faulty Clifford operations. Here again there is a hierarchy of noise levels, with a fixed error rate for faulty gates, and a second rate for errors in the distilled states which decreases as the states are distilled to better quality. The interplay of these different rates sets limits on the achievable distillation and how quickly states converge to that limit.

  9. Robot Position Sensor Fault Tolerance

    NASA Technical Reports Server (NTRS)

    Aldridge, Hal A.

    1997-01-01

    Robot systems in critical applications, such as those in space and nuclear environments, must be able to operate during component failure to complete important tasks. One failure mode that has received little attention is the failure of joint position sensors. Current fault tolerant designs require the addition of directly redundant position sensors which can affect joint design. A new method is proposed that utilizes analytical redundancy to allow for continued operation during joint position sensor failure. Joint torque sensors are used with a virtual passive torque controller to make the robot joint stable without position feedback and improve position tracking performance in the presence of unknown link dynamics and end-effector loading. Two Cartesian accelerometer based methods are proposed to determine the position of the joint. The joint specific position determination method utilizes two triaxial accelerometers attached to the link driven by the joint with the failed position sensor. The joint specific method is not computationally complex and the position error is bounded. The system wide position determination method utilizes accelerometers distributed on different robot links and the end-effector to determine the position of sets of multiple joints. The system wide method requires fewer accelerometers than the joint specific method to make all joint position sensors fault tolerant but is more computationally complex and has lower convergence properties. Experiments were conducted on a laboratory manipulator. Both position determination methods were shown to track the actual position satisfactorily. A controller using the position determination methods and the virtual passive torque controller was able to servo the joints to a desired position during position sensor failure.

  10. FPGA-Based High-Performance Embedded Systems for Adaptive Edge Computing in Cyber-Physical Systems: The ARTICo³ Framework.

    PubMed

    Rodríguez, Alfonso; Valverde, Juan; Portilla, Jorge; Otero, Andrés; Riesgo, Teresa; de la Torre, Eduardo

    2018-06-08

    Cyber-Physical Systems are experiencing a paradigm shift in which processing has been relocated to the distributed sensing layer and is no longer performed in a centralized manner. This approach, usually referred to as Edge Computing, demands the use of hardware platforms that are able to manage the steadily increasing requirements in computing performance, while keeping energy efficiency and the adaptability imposed by the interaction with the physical world. In this context, SRAM-based FPGAs and their inherent run-time reconfigurability, when coupled with smart power management strategies, are a suitable solution. However, they usually fail in user accessibility and ease of development. In this paper, an integrated framework to develop FPGA-based high-performance embedded systems for Edge Computing in Cyber-Physical Systems is presented. This framework provides a hardware-based processing architecture, an automated toolchain, and a runtime to transparently generate and manage reconfigurable systems from high-level system descriptions without additional user intervention. Moreover, it provides users with support for dynamically adapting the available computing resources to switch the working point of the architecture in a solution space defined by computing performance, energy consumption and fault tolerance. Results show that it is indeed possible to explore this solution space at run time and prove that the proposed framework is a competitive alternative to software-based edge computing platforms, being able to provide not only faster solutions, but also higher energy efficiency for computing-intensive algorithms with significant levels of data-level parallelism.

  11. Sequence Tolerance of a Highly Stable Single Domain Antibody: Comparison of Computational and Experimental Profiles

    DTIC Science & Technology

    2016-09-09

    evaluating 18 mutants using either the A or B conformer is only r ≈ 0.2. Given the poor performance of approximating the observed experimental ... unusually high thermal stability is explored by a combined computational and experimental study. Starting with the crystallographic structure

  12. High-fidelity spin measurement on the nitrogen-vacancy center

    NASA Astrophysics Data System (ADS)

    Hanks, Michael; Trupke, Michael; Schmiedmayer, Jörg; Munro, William J.; Nemoto, Kae

    2017-10-01

    Nitrogen-vacancy (NV) centers in diamond are versatile candidates for many quantum information processing tasks, ranging from quantum imaging and sensing through to quantum communication and fault-tolerant quantum computers. Critical to almost every potential application is an efficient mechanism for the high fidelity readout of the state of the electronic and nuclear spins. Typically such readout has been achieved through an optically resonant fluorescence measurement, but the presence of decay through a meta-stable state will limit its efficiency to the order of 99%. While this is good enough for many applications, it is insufficient for large scale quantum networks and fault-tolerant computational tasks. Here we explore an alternative approach based on dipole induced transparency (state-dependent reflection) in an NV center cavity QED system, using the most recent knowledge of the NV center’s parameters to determine its feasibility, including the decay channels through the meta-stable subspace and photon ionization. We find that single-shot measurements above fault-tolerant thresholds should be available in the strong coupling regime for a wide range of cavity-center cooperativities, using a majority voting approach utilizing single photon detection. Furthermore, extremely high fidelity measurements are possible using weak optical pulses.
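
    A back-of-the-envelope way to see why majority voting over repeated single-photon detections helps is sketched below (a simplified binomial model, not the paper's cavity-QED analysis): assuming each repetition is independent and does not disturb the spin, the fidelity of a majority vote over n shots with per-shot fidelity p is the upper tail of a binomial distribution.

        from math import comb

        def majority_vote_fidelity(p, n):
            """Probability that the majority of n independent shots is correct."""
            assert n % 2 == 1, "use an odd number of repetitions"
            return sum(comb(n, k) * p ** k * (1 - p) ** (n - k)
                       for k in range(n // 2 + 1, n + 1))

        # Example: a per-shot fidelity of 0.99 gives a 5-shot majority fidelity above 0.9999.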

  13. Computer program determines exact two-sided tolerance limits for normal distributions

    NASA Technical Reports Server (NTRS)

    Friedman, H. A.; Webb, S. R.

    1968-01-01

    Computer program determines by numerical integration the exact statistical two-sided tolerance limits, when the proportion between the limits is at least a specified number. The program is limited to situations in which the underlying probability distribution for the population sampled is the normal distribution with unknown mean and variance.
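
    The program in this record computes the exact two-sided tolerance factor by numerical integration; for orientation only, the sketch below uses the common closed-form (Howe-type) approximation to the same quantity, so that the interval sample mean ± k times the sample standard deviation covers at least a proportion P of a normal population with the stated confidence. The SciPy quantile functions are assumed available.

        from math import sqrt
        from scipy.stats import norm, chi2

        def two_sided_tolerance_factor(n, coverage=0.90, confidence=0.95):
            """Approximate two-sided normal tolerance factor k (Howe-type formula)."""
            nu = n - 1
            z = norm.ppf((1 + coverage) / 2)          # standard normal quantile
            chi2_q = chi2.ppf(1 - confidence, nu)     # lower-tail chi-square quantile
            return z * sqrt(nu * (1 + 1 / n) / chi2_q)

        # Example: n = 30, 90% coverage, 95% confidence gives k of roughly 2.14.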

  14. Computer architecture for efficient algorithmic executions in real-time systems: New technology for avionics systems and advanced space vehicles

    NASA Technical Reports Server (NTRS)

    Carroll, Chester C.; Youngblood, John N.; Saha, Aindam

    1987-01-01

    Improvements and advances in the development of computer architecture now provide innovative technology for the recasting of traditional sequential solutions into high-performance, low-cost, parallel systems to increase system performance. Research conducted in development of specialized computer architecture for the algorithmic execution of an avionics system, guidance and control problem in real time is described. A comprehensive treatment of both the hardware and software structures of a customized computer which performs real-time computation of guidance commands with updated estimates of target motion and time-to-go is presented. An optimal, real-time allocation algorithm was developed which maps the algorithmic tasks onto the processing elements. This allocation is based on the critical path analysis. The final stage is the design and development of the hardware structures suitable for the efficient execution of the allocated task graph. The processing element is designed for rapid execution of the allocated tasks. Fault tolerance is a key feature of the overall architecture. Parallel numerical integration techniques, task definitions, and allocation algorithms are discussed. The parallel implementation is analytically verified and the experimental results are presented. The design of the data-driven computer architecture, customized for the execution of the particular algorithm, is discussed.

  15. Computer architecture for efficient algorithmic executions in real-time systems: new technology for avionics systems and advanced space vehicles

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Carroll, C.C.; Youngblood, J.N.; Saha, A.

    1987-12-01

    Improvements and advances in the development of computer architecture now provide innovative technology for the recasting of traditional sequential solutions into high-performance, low-cost, parallel systems to increase system performance. Research conducted in development of specialized computer architecture for the algorithmic execution of an avionics system, guidance and control problem in real time is described. A comprehensive treatment of both the hardware and software structures of a customized computer which performs real-time computation of guidance commands with updated estimates of target motion and time-to-go is presented. An optimal, real-time allocation algorithm was developed which maps the algorithmic tasks onto the processing elements. This allocation is based on the critical path analysis. The final stage is the design and development of the hardware structures suitable for the efficient execution of the allocated task graph. The processing element is designed for rapid execution of the allocated tasks. Fault tolerance is a key feature of the overall architecture. Parallel numerical integration techniques, task definitions, and allocation algorithms are discussed. The parallel implementation is analytically verified and the experimental results are presented. The design of the data-driven computer architecture, customized for the execution of the particular algorithm, is discussed.

  16. Enhancing radiation tolerance by controlling defect mobility and migration pathways in multicomponent single-phase alloys

    DOE PAGES

    Lu, Chenyang; Niu, Liangliang; Chen, Nanjun; ...

    2016-12-15

    A grand challenge in material science is to understand the correlation between intrinsic properties and defect dynamics. Radiation tolerant materials are in great demand for safe operation and advancement of nuclear and aerospace systems. Unlike traditional approaches that rely on microstructural and nanoscale features to mitigate radiation damage, this study demonstrates enhancement of radiation tolerance with the suppression of void formation by two orders of magnitude at elevated temperatures in equiatomic single-phase concentrated solid solution alloys, and more importantly, reveals its controlling mechanism through a detailed analysis of the depth distribution of defect clusters and an atomistic computer simulation. The enhanced swelling resistance is attributed to the tailored interstitial defect cluster motion in the alloys from a long-range one-dimensional mode to a short-range three-dimensional mode, which leads to enhanced point defect recombination. Finally, the results suggest design criteria for next generation radiation tolerant structural alloys.

  17. Fault Mitigation Schemes for Future Spaceflight Multicore Processors

    NASA Technical Reports Server (NTRS)

    Alexander, James W.; Clement, Bradley J.; Gostelow, Kim P.; Lai, John Y.

    2012-01-01

    Future planetary exploration missions demand significant advances in on-board computing capabilities over current avionics architectures based on a single-core processing element. The state-of-the-art multi-core processor provides much promise in meeting such challenges while introducing new fault tolerance problems when applied to space missions. Software-based schemes are being presented in this paper that can achieve system-level fault mitigation beyond that provided by radiation-hard-by-design (RHBD). For mission and time critical applications such as the Terrain Relative Navigation (TRN) for planetary or small body navigation, and landing, a range of fault tolerance methods can be adapted by the application. The software methods being investigated include Error Correction Code (ECC) for data packet routing between cores, virtual network routing, Triple Modular Redundancy (TMR), and Algorithm-Based Fault Tolerance (ABFT). A robust fault tolerance framework that provides fail-operational behavior under hard real-time constraints and graceful degradation will be demonstrated using TRN executing on a commercial Tilera(R) processor with simulated fault injections.
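
    One of the software schemes listed above, triple modular redundancy, can be summarized in a few lines (a generic sketch, independent of the paper's Tilera-based demonstration): the same computation is executed three times and the results are majority-voted, masking a fault in any single replica.

        def tmr(compute, data):
            """Run the computation three times and return the majority result."""
            results = [compute(data) for _ in range(3)]
            for candidate in results:
                if sum(candidate == r for r in results) >= 2:
                    return candidate
            raise RuntimeError("no majority: more than one replica disagrees")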

  18. Using Rollback Avoidance to Mitigate Failures in Next-Generation Extreme-Scale Systems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Levy, Scott N.

    2016-05-01

    High-performance computing (HPC) systems enable scientists to numerically model complex phenomena in many important physical systems. The next major milestone in the development of HPC systems is the construction of the first supercomputer capable of executing more than an exaflop, 10^18 floating point operations per second. On systems of this scale, failures will occur much more frequently than on current systems. As a result, resilience is a key obstacle to building next-generation extreme-scale systems. Coordinated checkpointing is currently the most widely-used mechanism for handling failures on HPC systems. Although coordinated checkpointing remains effective on current systems, increasing the scale of today's systems to build next-generation systems will increase the cost of fault tolerance as more and more time is taken away from the application to protect against or recover from failure. Rollback avoidance techniques seek to mitigate the cost of checkpoint/restart by allowing an application to continue its execution rather than rolling back to an earlier checkpoint when failures occur. These techniques include failure prediction and preventive migration, replicated computation, fault-tolerant algorithms, and software-based memory fault correction. In this thesis, we examine how rollback avoidance techniques can be used to address failures on extreme-scale systems. Using a combination of analytic modeling and simulation, we evaluate the potential impact of rollback avoidance on these systems. We then present a novel rollback avoidance technique that exploits similarities in application memory. Finally, we examine the feasibility of using this technique to protect against memory faults in kernel memory.
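
    For context on why checkpoint/restart grows costly at scale (this is the standard Young/Daly first-order estimate, not a model from the thesis itself): the optimal interval between coordinated checkpoints shrinks with the system's mean time between failures, so more frequent failures force more time to be spent checkpointing and redoing lost work.

        from math import sqrt

        def optimal_checkpoint_interval(checkpoint_cost_s, mtbf_s):
            """Young's approximation: tau_opt ~ sqrt(2 * C * MTBF), in seconds."""
            return sqrt(2.0 * checkpoint_cost_s * mtbf_s)

        # Example: a 10-minute checkpoint and a 4-hour system MTBF give roughly a 69-minute interval.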

  19. Periodic Application of Concurrent Error Detection in Processor Array Architectures. PhD. Thesis -

    NASA Technical Reports Server (NTRS)

    Chen, Paul Peichuan

    1993-01-01

    Processor arrays can provide an attractive architecture for some applications. Featuring modularity, regular interconnection and high parallelism, such arrays are well-suited for VLSI/WSI implementations, and applications with high computational requirements, such as real-time signal processing. Preserving the integrity of results can be of paramount importance for certain applications. In these cases, fault tolerance should be used to ensure reliable delivery of a system's service. One aspect of fault tolerance is the detection of errors caused by faults. Concurrent error detection (CED) techniques offer the advantage that transient and intermittent faults may be detected with greater probability than with off-line diagnostic tests. Applying time-redundant CED techniques can reduce hardware redundancy costs. However, most time-redundant CED techniques degrade a system's performance.

  20. Dynamic Routing for Delay-Tolerant Networking in Space Flight Operations

    NASA Technical Reports Server (NTRS)

    Burleigh, Scott C.

    2008-01-01

    Contact Graph Routing (CGR) is a dynamic routing system that computes routes through a time-varying topology composed of scheduled, bounded communication contacts in a network built on the Delay-Tolerant Networking (DTN) architecture. It is designed to support operations in a space network based on DTN, but it also could be used in terrestrial applications where operation according to a predefined schedule is preferable to opportunistic communication, as in a low-power sensor network. This paper will describe the operation of the CGR system and explain how it can enable data delivery over scheduled transmission opportunities, fully utilizing the available transmission capacity, without knowing the current state of any bundle protocol node (other than the local node itself) and without exhausting processing resources at any bundle router.
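
    The core idea of routing over scheduled contacts can be illustrated with a greatly simplified earliest-arrival search (a sketch only; real CGR also accounts for contact data rates, residual volume, and one-way light times, none of which are modeled here). Contacts are assumed to be tuples of (sender, receiver, start time, end time).

        import heapq

        def earliest_arrival(contacts, source, dest, t0=0.0):
            """Earliest time a bundle leaving source at t0 can reach dest."""
            best = {source: t0}
            queue = [(t0, source)]
            while queue:
                t, node = heapq.heappop(queue)
                if node == dest:
                    return t
                if t > best.get(node, float("inf")):
                    continue                         # stale queue entry
                for snd, rcv, start, end in contacts:
                    if snd == node and t <= end:
                        arrival = max(t, start)      # wait for the contact to open
                        if arrival < best.get(rcv, float("inf")):
                            best[rcv] = arrival
                            heapq.heappush(queue, (arrival, rcv))
            return None                              # unreachable with this schedule

        # Example: [("A", "B", 10, 20), ("B", "C", 30, 40)] gives earliest arrival 30 at "C".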

  1. Buffered coscheduling for parallel programming and enhanced fault tolerance

    DOEpatents

    Petrini, Fabrizio [Los Alamos, NM; Feng, Wu-chun [Los Alamos, NM

    2006-01-31

    A computer-implemented method schedules processor jobs on a network of parallel machine processors or distributed system processors. Control information communications generated by each process performed by each processor during a defined time interval are accumulated in buffers, where adjacent time intervals are separated by strobe intervals for a global exchange of control information. A global exchange of the control information communications at the end of each defined time interval is performed during an intervening strobe interval so that each processor is informed by all of the other processors of the number of incoming jobs to be received by each processor in a subsequent time interval. The buffered coscheduling method of this invention also enhances the fault tolerance of a network of parallel machine processors or distributed system processors.
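
    A toy rendering of the buffering-and-strobe idea follows (illustrative only; the names and the reduction of "control information" to simple message counts are assumptions, not the patent's implementation): traffic generated during a time slice is buffered locally, and the global exchange at the strobe tells every processor how many incoming messages to expect in the next slice.

        class Processor:
            def __init__(self, rank):
                self.rank = rank
                self.outbox = {}                      # destination rank -> buffered message count

            def send(self, dest):
                self.outbox[dest] = self.outbox.get(dest, 0) + 1

        def strobe(processors):
            """Global exchange of control information at the end of a time slice."""
            expected = {p.rank: 0 for p in processors}
            for p in processors:
                for dest, count in p.outbox.items():
                    expected[dest] += count
                p.outbox.clear()
            return expected                           # incoming counts for the next slice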

  2. Towards agile large-scale predictive modelling in drug discovery with flow-based programming design principles.

    PubMed

    Lampa, Samuel; Alvarsson, Jonathan; Spjuth, Ola

    2016-01-01

    Predictive modelling in drug discovery is challenging to automate as it often contains multiple analysis steps and might involve cross-validation and parameter tuning that create complex dependencies between tasks. With large-scale data or when using computationally demanding modelling methods, e-infrastructures such as high-performance or cloud computing are required, adding to the existing challenges of fault-tolerant automation. Workflow management systems can aid in many of these challenges, but the currently available systems are lacking in the functionality needed to enable agile and flexible predictive modelling. We here present an approach inspired by elements of the flow-based programming paradigm, implemented as an extension of the Luigi system which we name SciLuigi. We also discuss the experiences from using the approach when modelling a large set of biochemical interactions using a shared computer cluster.

  3. The UCLA Design Diversity Experiment (DEDIX) system: A distributed testbed for multiple-version software

    NASA Technical Reports Server (NTRS)

    Avizienis, A.; Gunningberg, P.; Kelly, J. P. J.; Strigini, L.; Traverse, P. J.; Tso, K. S.; Voges, U.

    1986-01-01

    To establish a long-term research facility for experimental investigations of design diversity as a means of achieving fault-tolerant systems, a distributed testbed for multiple-version software was designed. It is part of a local network, which utilizes the Locus distributed operating system to operate a set of 20 VAX 11/750 computers. It is used in experiments to measure the efficacy of design diversity and to investigate reliability increases under large-scale, controlled experimental conditions.

  4. CARE 3 user-friendly interface user's guide

    NASA Technical Reports Server (NTRS)

    Martensen, A. L.

    1987-01-01

    CARE 3 predicts the unreliability of highly reliable reconfigurable fault-tolerant systems that include redundant computers or computer systems. CARE3MENU is a user-friendly interface used to create an input for the CARE 3 program. The CARE3MENU interface has been designed to minimize user input errors. Although a CARE3MENU session may be successfully completed and all parameters may be within specified limits or ranges, the CARE 3 program is not guaranteed to produce meaningful results if the user incorrectly interprets the CARE 3 stochastic model. The CARE3MENU User Guide provides complete information on how to create a CARE 3 model with the interface. The CARE3MENU interface runs under the VAX/VMS operating system.

  5. What does fault tolerant Deep Learning need from MPI?

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Amatya, Vinay C.; Vishnu, Abhinav; Siegel, Charles M.

    Deep Learning (DL) algorithms have become the de facto Machine Learning (ML) algorithm for large scale data analysis. DL algorithms are computationally expensive -- even distributed DL implementations which use MPI require days of training (model learning) time on commonly studied datasets. Long running DL applications become susceptible to faults -- requiring development of a fault tolerant system infrastructure, in addition to fault tolerant DL algorithms. This raises an important question: What is needed from MPI for designing fault tolerant DL implementations? In this paper, we address this problem for permanent faults. We motivate the need for a fault tolerant MPI specification by an in-depth consideration of recent innovations in DL algorithms and their properties, which drive the need for specific fault tolerance features. We present an in-depth discussion on the suitability of different parallelism types (model, data and hybrid); a need (or lack thereof) for check-pointing of any critical data structures; and most importantly, consideration for several fault tolerance proposals (user-level fault mitigation (ULFM), Reinit) in MPI and their applicability to fault tolerant DL implementations. We leverage a distributed memory implementation of Caffe, currently available under the Machine Learning Toolkit for Extreme Scale (MaTEx). We implement our approaches by extending MaTEx-Caffe to use a ULFM-based implementation. Our evaluation using the ImageNet dataset and AlexNet neural network topology demonstrates the effectiveness of the proposed fault tolerant DL implementation using OpenMPI based ULFM.

  6. Experiment and application of soft x-ray grazing incidence optical scattering phenomena

    NASA Astrophysics Data System (ADS)

    Chen, Shuyan; Li, Cheng; Zhang, Yang; Su, Liping; Geng, Tao; Li, Kun

    2017-08-01

    For short-wavelength imaging systems, surface scattering is one of the important factors degrading imaging performance. Studying the non-intuitive surface scatter effects that result from practical optical fabrication tolerances is necessary for evaluating the optical performance of high-resolution short-wavelength imaging systems. In this paper, the soft X-ray optical scattering distribution is measured with a soft X-ray reflectometer installed in our laboratory, for different sample mirrors, wavelengths, and grazing angles. Then, for the space solar telescope, these scattered-light distributions are combined with a numerical surface-scattering model of the grazing incidence imaging system to compute the PSF and encircled energy of the telescope's optical system. The analysis and computation show that surface scattering severely degrades the imaging performance of grazing incidence systems.

  7. Software Validation via Model Animation

    NASA Technical Reports Server (NTRS)

    Dutle, Aaron M.; Munoz, Cesar A.; Narkawicz, Anthony J.; Butler, Ricky W.

    2015-01-01

    This paper explores a new approach to validating software implementations that have been produced from formally-verified algorithms. Although visual inspection gives some confidence that the implementations faithfully reflect the formal models, it does not provide complete assurance that the software is correct. The proposed approach, which is based on animation of formal specifications, compares the outputs computed by the software implementations on a given suite of input values to the outputs computed by the formal models on the same inputs, and determines if they are equal up to a given tolerance. The approach is illustrated on a prototype air traffic management system that computes simple kinematic trajectories for aircraft. Proofs for the mathematical models of the system's algorithms are carried out in the Prototype Verification System (PVS). The animation tool PVSio is used to evaluate the formal models on a set of randomly generated test cases. Output values computed by PVSio are compared against output values computed by the actual software. This comparison improves the assurance that the translation from formal models to code is faithful and that, for example, floating point errors do not greatly affect correctness and safety properties.
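
    The comparison step can be pictured with the small sketch below (assumed interfaces, not the PVS/PVSio tooling itself): outputs of the software implementation are checked against outputs of the animated formal model on the same inputs, component-wise, up to a given tolerance.

        def outputs_agree(model_fn, impl_fn, test_inputs, tol=1e-9):
            """Return the test cases on which model and implementation disagree beyond tol."""
            failures = []
            for x in test_inputs:
                m, s = model_fn(x), impl_fn(x)
                if any(abs(a - b) > tol for a, b in zip(m, s)):
                    failures.append((x, m, s))
            return failures                           # empty list means agreement on all cases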

  8. Mini-Ckpts: Surviving OS Failures in Persistent Memory

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Fiala, David; Mueller, Frank; Ferreira, Kurt Brian

    Concern is growing in the high-performance computing (HPC) community on the reliability of future extreme-scale systems. Current efforts have focused on application fault-tolerance rather than the operating system (OS), despite the fact that recent studies have suggested that failures in OS memory are more likely. The OS is critical to a system's correct and efficient operation of the node and processes it governs -- and in HPC also for any other nodes a parallelized application runs on and communicates with: Any single node failure generally forces all processes of this application to terminate due to tight communication in HPC. Therefore, the OS itself must be capable of tolerating failures. In this work, we introduce mini-ckpts, a framework which enables application survival despite the occurrence of a fatal OS failure or crash. Mini-ckpts achieves this tolerance by ensuring that the critical data describing a process is preserved in persistent memory prior to the failure. Following the failure, the OS is rejuvenated via a warm reboot and the application continues execution effectively making the failure and restart transparent. The mini-ckpts rejuvenation and recovery process is measured to take between three to six seconds and has a failure-free overhead of between 3-5% for a number of key HPC workloads. In contrast to current fault-tolerance methods, this work ensures that the operating and runtime system can continue in the presence of faults. This is a much finer-grained and dynamic method of fault-tolerance than the current, coarse-grained, application-centric methods. Handling faults at this level has the potential to greatly reduce overheads and enables mitigation of additional fault scenarios.

  9. Design Trade-off Between Performance and Fault-Tolerance of Space Onboard Computers

    NASA Astrophysics Data System (ADS)

    Gorbunov, M. S.; Antonov, A. A.

    2017-01-01

    It is well known that there is a trade-off between performance and power consumption in onboard computers. Fault tolerance is another important factor affecting performance, chip area, and power consumption. Employing special SRAM cells and error-correcting codes is often too expensive relative to the performance needed. We discuss the possibility of finding optimal solutions for a modern onboard computer for scientific apparatus, focusing on multi-level cache memory design.

  10. The Design of Fault Tolerant Quantum Dot Cellular Automata Based Logic

    NASA Technical Reports Server (NTRS)

    Armstrong, C. Duane; Humphreys, William M.; Fijany, Amir

    2002-01-01

    As transistor geometries are reduced, quantum effects begin to dominate device performance. At some point, transistors cease to have the properties that make them useful computational components. New computing elements must be developed in order to keep pace with Moore's Law. Quantum dot cellular automata (QCA) represent an alternative paradigm to transistor-based logic. QCA architectures that are robust to manufacturing tolerances and defects must be developed. We are developing software that allows the exploration of fault tolerant QCA gate architectures by automating the specification, simulation, analysis and documentation processes.

  11. Fault-free behavior of reliable multiprocessor systems: FTMP experiments in AIRLAB

    NASA Technical Reports Server (NTRS)

    Clune, E.; Segall, Z.; Siewiorek, D.

    1985-01-01

    This report describes a set of experiments which were implemented on the Fault-Tolerant Multiprocessor (FTMP) at NASA/Langley's AIRLAB facility. These experiments are part of an effort to formulate and evaluate validation methodologies for fault-tolerant computers. This report deals with the measurement of single parameters (baselines) of a fault-free system. The initial set of baseline experiments led to the following conclusions: (1) The system clock is constant and independent of workload in the tested cases; (2) the instruction execution times are constant; (3) the R4 frame size is 40 ms with some variation; (4) the frame stretching mechanism has some flaws in its implementation that allow the possibility of an infinite stretching of frame duration. Future experiments are planned. Some will broaden the results of these initial experiments. Others will measure the system more dynamically. The implementation of a synthetic workload generation mechanism for FTMP is planned to enhance the experimental environment of the system.

  12. Robust design of an inkjet-printed capacitive sensor for position tracking of a MOEMS-mirror in a Michelson interferometer setup

    NASA Astrophysics Data System (ADS)

    Faller, Lisa-Marie; Zangl, Hubert

    2017-05-01

    To guarantee high performance of Micro Optical Electro Mechanical Systems (MOEMS), precise position feedback is crucial. To overcome drawbacks of the widely used optical feedback, we propose an inkjet-printed capacitive position sensor as a smart packaging solution. Printing processes suffer from tolerances in excess of those of standard processes. Thus, FEM simulations covering the assumed tolerances of the system are adopted. These simulations are structured following a Design of Computer Experiments (DOCE) approach and are then employed to determine an optimal sensor design. Based on the simulation results, statistical models are adopted for the dynamic system. These models are to be used together with specifically designed hardware to cope with the challenging requirements of approximately 50 nm position accuracy at 10 MS/s over a 1000 μm measurement range. Noise analysis is performed considering the influence of uncertainties to assess resolution and bandwidth capabilities.

  13. Large field distributed aperture laser semiactive angle measurement system design with imaging fiber bundles.

    PubMed

    Xu, Chunyun; Cheng, Haobo; Feng, Yunpeng; Jing, Xiaoli

    2016-09-01

    A type of laser semiactive angle measurement system is designed for target detecting and tracking. Only one detector is used to detect target location from four distributed aperture optical systems through a 4×1 imaging fiber bundle. A telecentric optical system in image space is designed to increase the efficiency of imaging fiber bundles. According to the working principle of a four-quadrant (4Q) detector, fiber diamond alignment is adopted between an optical system and a 4Q detector. The structure of the laser semiactive angle measurement system is, we believe, novel. Tolerance analysis is carried out to determine tolerance limits of manufacture and installation errors of the optical system. The performance of the proposed method is identified by computer simulations and experiments. It is demonstrated that the linear region of the system is ±12°, with measurement error of better than 0.2°. In general, this new system can be used with large field of view and high accuracy, providing an efficient, stable, and fast method for angle measurement in practical situations.

  14. Round-off errors in cutting plane algorithms based on the revised simplex procedure

    NASA Technical Reports Server (NTRS)

    Moore, J. E.

    1973-01-01

    This report statistically analyzes computational round-off errors associated with the cutting plane approach to solving linear integer programming problems. Cutting plane methods require that the inverse of a sequence of matrices be computed. The problem basically reduces to one of minimizing round-off errors in the sequence of inverses. Two procedures for mitigating this problem are presented, and their influence on error accumulation is statistically analyzed. One procedure employs a very small tolerance factor to round computed values to zero. The other procedure is a numerical analysis technique for reinverting or improving the approximate inverse of a matrix. The results indicated that round-off accumulation can be effectively minimized by employing a tolerance factor which reflects the number of significant digits carried for each calculation and by applying the reinversion procedure once to each computed inverse. If 18 significant digits plus an exponent are carried for each variable during computations, then a tolerance value of 0.1 × 10⁻¹² is reasonable.
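    The two procedures lend themselves to a short numerical illustration. The sketch below is a rough Python approximation, not the report's original implementation: the matrix, the perturbation, and the tolerance value are illustrative, with the tolerance chosen to reflect the number of significant digits carried.

```python
import numpy as np

TOL = 1e-13  # tolerance tied to the working precision (illustrative)

def round_to_zero(m, tol=TOL):
    """Round entries whose magnitude falls below the tolerance to exactly zero."""
    m = m.copy()
    m[np.abs(m) < tol] = 0.0
    return m

def reinvert(a, x):
    """One Newton-Schulz refinement step for an approximate inverse: X <- X (2I - A X)."""
    return x @ (2.0 * np.eye(a.shape[0]) - a @ x)

# Basis matrix such as might arise during a cutting-plane iteration (made up).
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

X = np.linalg.inv(A) + 1e-8 * np.random.randn(3, 3)  # approximate, slightly corrupted inverse
X = round_to_zero(reinvert(A, X))                    # refine once, then zero out tiny residue

print(np.max(np.abs(A @ X - np.eye(3))))             # residual shrinks after the refinement
```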

  15. Real-time closed-loop simulation and upset evaluation of control systems in harsh electromagnetic environments

    NASA Technical Reports Server (NTRS)

    Belcastro, Celeste M.

    1989-01-01

    Digital control systems for applications such as aircraft avionics and multibody systems must maintain adequate control integrity in adverse as well as nominal operating conditions. For example, control systems for advanced aircraft, and especially those with relaxed static stability, will be critical to flight and will, therefore, have very high reliability specifications which must be met regardless of operating conditions. In addition, multibody systems such as robotic manipulators performing critical functions must have control systems capable of robust performance in any operating environment in order to complete the assigned task reliably. Severe operating conditions for electronic control systems can result from electromagnetic disturbances caused by lightning, high energy radio frequency (HERF) transmitters, and nuclear electromagnetic pulses (NEMP). For this reason, techniques must be developed to evaluate the integrity of the control system in adverse operating environments. The most difficult and elusive perturbations to computer-based control systems that can be caused by an electromagnetic environment (EME) are functional error modes that involve no component damage. These error modes are collectively known as upset, can occur simultaneously in all of the channels of a redundant control system, and are software dependent. Upset studies performed to date have not addressed the assessment of fault tolerant systems and do not involve the evaluation of a control system operating in a closed loop with the plant. A methodology for performing a real-time simulation of the closed-loop dynamics of a fault tolerant control system with a simulated plant operating in an electromagnetically harsh environment is presented. In particular, considerations for performing upset tests on the controller are discussed. Some of these considerations are the generation and coupling of analog signals representative of electromagnetic disturbances to a control system under test, analog data acquisition, and digital data acquisition from fault tolerant systems. In addition, a case study of an upset test methodology for a fault tolerant electromagnetic aircraft engine control system is presented.

  16. Digitized adiabatic quantum computing with a superconducting circuit.

    PubMed

    Barends, R; Shabani, A; Lamata, L; Kelly, J; Mezzacapo, A; Las Heras, U; Babbush, R; Fowler, A G; Campbell, B; Chen, Yu; Chen, Z; Chiaro, B; Dunsworth, A; Jeffrey, E; Lucero, E; Megrant, A; Mutus, J Y; Neeley, M; Neill, C; O'Malley, P J J; Quintana, C; Roushan, P; Sank, D; Vainsencher, A; Wenner, J; White, T C; Solano, E; Neven, H; Martinis, John M

    2016-06-09

    Quantum mechanics can help to solve complex problems in physics and chemistry, provided they can be programmed in a physical device. In adiabatic quantum computing, a system is slowly evolved from the ground state of a simple initial Hamiltonian to a final Hamiltonian that encodes a computational problem. The appeal of this approach lies in the combination of simplicity and generality; in principle, any problem can be encoded. In practice, applications are restricted by limited connectivity, available interactions and noise. A complementary approach is digital quantum computing, which enables the construction of arbitrary interactions and is compatible with error correction, but uses quantum circuit algorithms that are problem-specific. Here we combine the advantages of both approaches by implementing digitized adiabatic quantum computing in a superconducting system. We tomographically probe the system during the digitized evolution and explore the scaling of errors with system size. We then let the full system find the solution to random instances of the one-dimensional Ising problem as well as problem Hamiltonians that involve more complex interactions. This digital quantum simulation of the adiabatic algorithm consists of up to nine qubits and up to 1,000 quantum logic gates. The demonstration of digitized adiabatic quantum computing in the solid state opens a path to synthesizing long-range correlations and solving complex computational problems. When combined with fault-tolerance, our approach becomes a general-purpose algorithm that is scalable.

  17. Investigation of progressive failure robustness and alternate load paths for damage tolerant structures

    NASA Astrophysics Data System (ADS)

    Marhadi, Kun Saptohartyadi

    Structural optimization for damage tolerance under various unforeseen damage scenarios is computationally challenging. It couples non-linear progressive failure analysis with sampling-based stochastic analysis of random damage. The goal of this research was to understand the relationship between alternate load paths available in a structure and its damage tolerance, and to use this information to develop computationally efficient methods for designing damage tolerant structures. Progressive failure of a redundant truss structure subjected to small random variability was investigated to identify features that correlate with robustness and predictability of the structure's progressive failure. The identified features were used to develop numerical surrogate measures that permit computationally efficient deterministic optimization to achieve robustness and predictability of progressive failure. Analysis of damage tolerance on designs with robust progressive failure indicated that robustness and predictability of progressive failure do not guarantee damage tolerance. Damage tolerance requires a structure to redistribute its load to alternate load paths. In order to investigate the load distribution characteristics that lead to damage tolerance in structures, designs with varying degrees of damage tolerance were generated using brute-force stochastic optimization. A method based on principal component analysis was used to describe load distributions (alternate load paths) in the structures. Results indicate that a structure that can develop alternate paths is not necessarily damage tolerant. The alternate load paths must have a required minimum load capability. Robustness analysis of damage tolerant optimum designs indicates that designs are tailored to the specified damage. A design optimized under one damage specification can be sensitive to other damages not considered. The effectiveness of existing load path definitions and characterizations was investigated for continuum structures. A load path definition using a relative compliance change measure (U* field) was demonstrated to be the most useful measure of load path. This measure provides quantitative information on load path trajectories and qualitative information on the effectiveness of the load path. The use of the U* description of load paths in optimizing structures for effective load paths was investigated.

  18. Care 3 phase 2 report, maintenance manual

    NASA Technical Reports Server (NTRS)

    Bryant, L. A.; Stiffler, J. J.

    1982-01-01

    CARE 3 (Computer-Aided Reliability Estimation, version three) is a computer program designed to help estimate the reliability of complex, redundant systems. Although the program can model a wide variety of redundant structures, it was developed specifically for fault-tolerant avionics systems--systems distinguished by the need for extremely reliable performance since a system failure could well result in the loss of human life. It substantially generalizes the class of redundant configurations that could be accommodated, and includes a coverage model to determine the various coverage probabilities as a function of the applicable fault recovery mechanisms (detection delay, diagnostic scheduling interval, isolation and recovery delay, etc.). CARE 3 further generalizes the class of system structures that can be modeled and greatly expands the coverage model to take into account such effects as intermittent and transient faults, latent faults, error propagation, etc.

  19. Design of a fault tolerant airborne digital computer. Volume 2: Computational requirements and technology

    NASA Technical Reports Server (NTRS)

    Ratner, R. S.; Shapiro, E. B.; Zeidler, H. M.; Wahlstrom, S. E.; Clark, C. B.; Goldberg, J.

    1973-01-01

    This final report summarizes the work on the design of a fault tolerant digital computer for aircraft. Volume 2 is composed of two parts. Part 1 is concerned with the computational requirements associated with an advanced commercial aircraft. Part 2 reviews the technology that will be available for the implementation of the computer in the 1975-1985 period. With regard to the computational task, 26 computations have been categorized according to computational load, memory requirements, criticality, permitted down-time, and the need to save data in order to effect a roll-back. The technology part stresses the impact of large scale integration (LSI) on the realization of logic and memory. Also considered were module interconnection possibilities for minimizing fault propagation.

  20. Space station data system analysis/architecture study. Task 3: Trade studies, DR-5, volume 2

    NASA Technical Reports Server (NTRS)

    1985-01-01

    Results of a Space Station Data System Analysis/Architecture Study for the Goddard Space Flight Center are presented. This study, which emphasized a system engineering design for a complete, end-to-end data system, was divided into six tasks: (1) Functional requirements definition; (2) Options development; (3) Trade studies; (4) System definitions; (5) Program plan; and (6) Study maintenance. The task inter-relationships and documentation flow are described. Information in volume 2 is devoted to Task 3: Trade Studies. Trade studies have been carried out in the following areas: (1) software development test and integration capability; (2) fault tolerant computing; (3) space qualified computers; (4) distributed data base management system; (5) system integration test and verification; (6) crew workstations; (7) mass storage; (8) command and resource management; and (9) space communications. Results are presented for each task.

  1. The application of emulation techniques in the analysis of highly reliable, guidance and control computer systems

    NASA Technical Reports Server (NTRS)

    Migneault, Gerard E.

    1987-01-01

    Emulation techniques can be a solution to a difficulty that arises in the analysis of the reliability of guidance and control computer systems for future commercial aircraft. Described here is the difficulty, the lack of credibility of reliability estimates obtained by analytical modeling techniques. The difficulty is an unavoidable consequence of the following: (1) a reliability requirement so demanding as to make system evaluation by use testing infeasible; (2) a complex system design technique, fault tolerance; (3) system reliability dominated by errors due to flaws in the system definition; and (4) elaborate analytical modeling techniques whose precision outputs are quite sensitive to errors of approximation in their input data. Use of emulation techniques for pseudo-testing systems to evaluate bounds on the parameter values needed for the analytical techniques is then discussed. Finally several examples of the application of emulation techniques are described.

  2. Simple Linux Utility for Resource Management

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Jette, M.

    2009-09-09

    SLURM is an open-source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small computer clusters. As a cluster resource manager, SLURM has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates conflicting requests for resources by managing a queue of pending work.

  3. On decentralized control of large-scale systems

    NASA Technical Reports Server (NTRS)

    Siljak, D. D.

    1978-01-01

    A scheme is presented for decentralized control of large-scale linear systems which are composed of a number of interconnected subsystems. By ignoring the interconnections, local feedback controls are chosen to optimize each decoupled subsystem. Conditions are provided to establish compatibility of the individual local controllers and achieve stability of the overall system. Besides computational simplifications, the scheme is attractive because of its structural features and the fact that it produces a robust decentralized regulator for large dynamic systems, which can tolerate a wide range of nonlinearities and perturbations among the subsystems.
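    As a toy illustration of the idea (not Siljak's formulation; the subsystem dynamics, coupling terms, and pole locations below are invented), the sketch designs local feedback for two decoupled scalar subsystems and then checks that the fully interconnected closed-loop system is still stable:

```python
import numpy as np

# Two interconnected scalar subsystems:
#   dx1/dt = a1*x1 + e12*x2 + b1*u1,   dx2/dt = a2*x2 + e21*x1 + b2*u2
a1, a2 = 1.0, 0.5        # unstable open-loop dynamics
b1, b2 = 1.0, 1.0
e12, e21 = 0.2, 0.3      # interconnections, ignored during local design

# Local design: choose u_i = -k_i * x_i so each decoupled closed-loop pole sits at -2.
k1 = (a1 + 2.0) / b1
k2 = (a2 + 2.0) / b2

# Closed-loop matrix of the full interconnected system under the local controllers.
A_cl = np.array([[a1 - b1 * k1, e12],
                 [e21,          a2 - b2 * k2]])

eigs = np.linalg.eigvals(A_cl)
print("closed-loop eigenvalues:", eigs)
print("overall system stable:", bool(np.all(eigs.real < 0)))
```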

  4. Avoiding and tolerating latency in large-scale next-generation shared-memory multiprocessors

    NASA Technical Reports Server (NTRS)

    Probst, David K.

    1993-01-01

    A scalable solution to the memory-latency problem is necessary to prevent the large latencies of synchronization and memory operations inherent in large-scale shared-memory multiprocessors from degrading performance. We distinguish latency avoidance and latency tolerance. Latency is avoided when data is brought to nearby locales for future reference. Latency is tolerated when references are overlapped with other computation. Latency-avoiding locales include: processor registers, data caches used temporally, and nearby memory modules. Tolerating communication latency requires parallelism, allowing the overlap of communication and computation. Latency-tolerating techniques include: vector pipelining, data caches used spatially, prefetching in various forms, and multithreading in various forms. Relaxing the consistency model permits increased use of avoidance and tolerance techniques. Each model is a mapping from the program text to sets of partial orders on program operations; it is a convention about which temporal precedences among program operations are necessary. Information about temporal locality and parallelism constrains the use of avoidance and tolerance techniques. Suitable architectural primitives and compiler technology are required to exploit the increased freedom to reorder and overlap operations in relaxed models.
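    To make the distinction concrete, the following sketch (plain Python, with a worker thread standing in for a prefetch engine or communication processor; the fetch and compute routines are placeholders) tolerates latency by overlapping the fetch of the next block with computation on the current one:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(block_id):
    """Stand-in for a high-latency remote memory or communication operation."""
    time.sleep(0.05)
    return [block_id] * 1000

def compute(data):
    """Stand-in for local computation on one block."""
    return sum(data)

def process_blocks(n_blocks):
    total = 0
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch, 0)              # issue the first fetch
        for i in range(n_blocks):
            data = pending.result()                  # blocks only if the fetch is not done yet
            if i + 1 < n_blocks:
                pending = pool.submit(fetch, i + 1)  # prefetch the next block ...
            total += compute(data)                   # ... while this block is processed
    return total

print(process_blocks(8))
```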

  5. Fault tolerant hypercube computer system architecture

    NASA Technical Reports Server (NTRS)

    Madan, Herb S. (Inventor); Chow, Edward (Inventor)

    1989-01-01

    A fault-tolerant multiprocessor computer system of the hypercube type comprising a hierarchy of computers of like kind which can be functionally substituted for one another as necessary is disclosed. Communication between the working nodes is via one communications network while communications between the working nodes and watch dog nodes and load balancing nodes higher in the structure is via another communications network separate from the first. A typical branch of the hierarchy reporting to a master node or host computer comprises, a plurality of first computing nodes; a first network of message conducting paths for interconnecting the first computing nodes as a hypercube. The first network provides a path for message transfer between the first computing nodes; a first watch dog node; and a second network of message connecting paths for connecting the first computing nodes to the first watch dog node independent from the first network, the second network provides an independent path for test message and reconfiguration affecting transfers between the first computing nodes and the first switch watch dog node. There is additionally, a plurality of second computing nodes; a third network of message conducting paths for interconnecting the second computing nodes as a hypercube. The third network provides a path for message transfer between the second computing nodes; a fourth network of message conducting paths for connecting the second computing nodes to the first watch dog node independent from the third network. The fourth network provides an independent path for test message and reconfiguration affecting transfers between the second computing nodes and the first watch dog node; and a first multiplexer disposed between the first watch dog node and the second and fourth networks for allowing the first watch dog node to selectively communicate with individual ones of the computing nodes through the second and fourth networks; as well as, a second watch dog node operably connected to the first multiplexer whereby the second watch dog node can selectively communicate with individual ones of the computing nodes through the second and fourth networks. The branch is completed by a first load balancing node; and a second multiplexer connected between the first load balancing node and the first and second watch dog nodes, allowing the first load balancing node to selectively communicate with the first and second watch dog nodes.
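    The working nodes above are interconnected as a hypercube. As a side note, in the standard hypercube addressing scheme (assumed here, independent of the patent's specific wiring) the neighbors of a node are exactly the nodes whose binary IDs differ in a single bit:

```python
def hypercube_neighbors(node, dim):
    """IDs of the nodes adjacent to `node` in a dim-dimensional hypercube."""
    return [node ^ (1 << bit) for bit in range(dim)]

# A 4-dimensional hypercube has 16 nodes; node 5 (binary 0101) has four neighbors.
print(hypercube_neighbors(5, 4))   # [4, 7, 1, 13]
```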

  6. Production tolerance of additive manufactured polymeric objects for clinical applications.

    PubMed

    Braian, Michael; Jimbo, Ryo; Wennerberg, Ann

    2016-07-01

    To determine the production tolerance of four commercially available additive manufacturing systems. By reverse engineering annexes A and B from ISO 12836:2012, two geometrical figures relevant to dentistry were obtained. Object A specifies the measurement of an inlay-shaped object and B a multi-unit specimen to simulate a four-unit bridge model. The objects were divided into x, y and z measurements; object A was divided into a total of 16 parameters and object B was tested for 12 parameters. The objects were designed digitally and manufactured by professionals in four different additive manufacturing systems; each system produced 10 samples of each object. For object A, three manufacturers presented an accuracy of <100μm and one system showed an accuracy of <20μm. For object B, all systems presented an accuracy of <100μm, and most parameters were <40μm. The standard deviation for most parameters was <40μm. The growing interest in and use of intra-oral digitizing systems stresses the use of computer-aided manufacturing of working models. The additive manufacturing techniques have the potential to help us in the digital workflow. Thus, it is important to have knowledge about production accuracy and tolerances. This study presents a method to test additive manufacturing units for accuracy and repeatability. Copyright © 2016 The Academy of Dental Materials. Published by Elsevier Ltd. All rights reserved.

  7. Crowd Computing as a Cooperation Problem: An Evolutionary Approach

    NASA Astrophysics Data System (ADS)

    Christoforou, Evgenia; Fernández Anta, Antonio; Georgiou, Chryssis; Mosteiro, Miguel A.; Sánchez, Angel

    2013-05-01

    Cooperation is one of the socio-economic issues that has received the most attention from the physics community. The problem has been mostly considered by studying games such as the Prisoner's Dilemma or the Public Goods Game. Here, we take a step forward by studying cooperation in the context of crowd computing. We introduce a model loosely based on principal-agent theory in which people (workers) contribute to the solution of a distributed problem by computing answers and reporting to the problem proposer (master). To go beyond classical approaches involving the concept of Nash equilibrium, we work on an evolutionary framework in which both the master and the workers update their behavior through reinforcement learning. Using a Markov chain approach, we show theoretically that under certain (not very restrictive) conditions, the master can ensure the reliability of the answer resulting from the process. Then, we study the model by numerical simulations, finding that convergence, meaning that the system reaches a point at which it always produces reliable answers, may in general be much faster than the upper bounds given by the theoretical calculation. We also discuss the effects of the master's level of tolerance to defectors, about which the theory does not provide information. The discussion shows that the system works even with very large tolerances. We conclude with a discussion of our results and possible directions to carry this research further.

  8. Challenges of Future High-End Computing

    NASA Technical Reports Server (NTRS)

    Bailey, David; Kutler, Paul (Technical Monitor)

    1998-01-01

    The next major milestone in high performance computing is a sustained rate of one Pflop/s (also written one petaflops, or 10^15 floating-point operations per second). In addition to prodigiously high computational performance, such systems must of necessity feature very large main memories, as well as comparably high I/O bandwidth and huge mass storage facilities. The current consensus of scientists who have studied these issues is that "affordable" petaflops systems may be feasible by the year 2010, assuming that certain key technologies continue to progress at current rates. One important question is whether applications can be structured to perform efficiently on such systems, which are expected to incorporate many thousands of processors and deeply hierarchical memory systems. To answer these questions, advanced performance modeling techniques, including simulation of future architectures and applications, may be required. It may also be necessary to formulate "latency tolerant algorithms" and other completely new algorithmic approaches for certain applications. This talk will give an overview of these challenges.

  9. A Fault Oblivious Extreme-Scale Execution Environment

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    McKie, Jim

    The FOX project, funded under the ASCR X-stack I program, developed systems software and runtime libraries for a new approach to the data and work distribution for massively parallel, fault oblivious application execution. Our work was motivated by the premise that exascale computing systems will provide a thousand-fold increase in parallelism and a proportional increase in failure rate relative to today's machines. To deliver the capability of exascale hardware, the systems software must provide the infrastructure to support existing applications while simultaneously enabling efficient execution of new programming models that naturally express dynamic, adaptive, irregular computation; coupled simulations; and massive data analysis in a highly unreliable hardware environment with billions of threads of execution. Our OS research has prototyped new methods to provide efficient resource sharing, synchronization, and protection in a many-core compute node. We have experimented with alternative task/dataflow programming models and shown scalability in some cases to hundreds of thousands of cores. Much of our software is in active development through open source projects. Concepts from FOX are being pursued in next generation exascale operating systems. Our OS work focused on adaptive, application-tailored OS services optimized for multi- to many-core processors. We developed a new operating system, NIX, which supports role-based allocation of cores to processes and was released as open source. We contributed to the IBM FusedOS project, which promoted the concept of latency-optimized and throughput-optimized cores. We built a task queue library based on a distributed, fault-tolerant key-value store and identified scaling issues. A second fault-tolerant task parallel library was developed, based on the Linda tuple space model, that used low-level interconnect primitives for optimized communication. We designed fault tolerance mechanisms for task parallel computations employing work stealing for load balancing that scaled to the largest existing supercomputers. Finally, we implemented the Elastic Building Blocks runtime, a library to manage object-oriented distributed software components. To support the research, we won two INCITE awards for time on Intrepid (BG/P) and Mira (BG/Q). Much of our work has had impact in the OS and runtime community through the ASCR Exascale OS/R workshop and report, leading to the research agenda of the Exascale OS/R program. Our project was, however, also affected by attrition of multiple PIs. While the PIs continued to participate and offer guidance as time permitted, losing these key individuals was unfortunate both for the project and for the DOE HPC community.

  10. Tolerant (parallel) Programming

    NASA Technical Reports Server (NTRS)

    DiNucci, David C.; Bailey, David H. (Technical Monitor)

    1997-01-01

    In order to be truly portable, a program must be tolerant of a wide range of development and execution environments, and a parallel program is just one which must be tolerant of a very wide range. This paper first defines the term "tolerant programming", then describes many layers of tools to accomplish it. The primary focus is on F-Nets, a formal model for expressing computation as a folded partial-ordering of operations, thereby providing an architecture-independent expression of tolerant parallel algorithms. For implementing F-Nets, Cooperative Data Sharing (CDS) is a subroutine package for implementing communication efficiently in a large number of environments (e.g. shared memory and message passing). Software Cabling (SC), a very-high-level graphical programming language for building large F-Nets, possesses many of the features normally expected from today's computer languages (e.g. data abstraction, array operations). Finally, L2³ is a CASE tool which facilitates the construction, compilation, execution, and debugging of SC programs.

  11. Computer-Delivered Social Norm Message Increases Pain Tolerance

    PubMed Central

    Pulvers, Kim; Schroeder, Jacquelyn; Limas, Eleuterio F.; Zhu, Shu-Hong

    2013-01-01

    Background Few experimental studies have been conducted on social determinants of pain tolerance. Purpose This study tests a brief, computer-delivered social norm message for increasing pain tolerance. Methods Healthy young adults (N=260; 44% Caucasian; 27% Hispanic) were randomly assigned into a 2 (social norm)×2 (challenge) cold pressor study, stratified by gender. They received standard instructions or standard instructions plus a message that contained artificially elevated information about the typical performance of others. Results Those receiving a social norm message displayed significantly higher pain tolerance, F(1, 255)=26.95, p<.001, ηp²=.10, and pain threshold, F(1, 244)=9.81, p=.002, ηp²=.04, but comparable pain intensity, p>.05. There were no interactions between condition and gender on any outcome variables, p>.05. Conclusions Social norms can significantly increase pain tolerance, even with a brief verbal message delivered by a video. PMID:24146086

  12. Electro-Optical Inspection For Tolerance Control As An Integral Part Of A Flexible Machining Cell

    NASA Astrophysics Data System (ADS)

    Renaud, Blaise

    1986-11-01

    Institut CERAC has been involved in optical metrology and 3-dimensional surface control for the last couple of years. Among the industrial applications considered is the on-line shape evaluation of machined parts within the manufacturing cell. The specific objective is to measure the machining errors and to compare them with the tolerances set by designers. An electro-optical sensing technique has been developed which relies on a projection Moire contouring optical method. A prototype inspection system has been designed, making use of video detection and computer image processing. Moire interferograms are interpreted, and the metrological information automatically retrieved. A structured database can be generated for subsequent data analysis and for real-time closed-loop corrective actions. A real-time kernel embedded into a synchronisation network (Petri-net) for the control of concurrent processes in the Electro-Optical Inspection (EOI) station was realised and implemented in a MODULA-2 program DIN01. The prototype system for on-line automatic tolerance control taking place within a flexible machining cell is described in this paper, together with the fast-prototype synchronisation program.

  13. The use of programmable logic controllers (PLC) for rocket engine component testing

    NASA Technical Reports Server (NTRS)

    Nail, William; Scheuermann, Patrick; Witcher, Kern

    1991-01-01

    Application of PLCs to the rocket engine component testing at a new Stennis Space Center Component Test Facility is suggested as an alternative to dedicated specialized computers. The PLC systems are characterized by rugged design, intuitive software, fault tolerance, flexibility, multiple end device options, networking capability, and built-in diagnostics. A distributed PLC-based system is projected to be used for testing LH2/LOx turbopumps required for the ALS/NLS rocket engines.

  14. Large-Scale Exploratory Analysis, Cleaning, and Modeling for Event Detection in Real-World Power Systems Data

    DTIC Science & Technology

    2013-11-01

    big data with R is relatively new. RHadoop is a mature product from Revolution Analytics that uses R with Hadoop Streaming [15] and provides...agnostic all-data summaries or computations, in which case we use MapReduce directly. 2.3 D&R Software Environment In this work, we use the Hadoop ...job scheduling and tracking, data distribution, system architecture, heterogeneity, and fault-tolerance. Hadoop also provides a distributed key-value

  15. Realizing universal Majorana fermionic quantum computation

    NASA Astrophysics Data System (ADS)

    Wu, Ya-Jie; He, Jing; Kou, Su-Peng

    2014-08-01

    Majorana fermionic quantum computation (MFQC) was proposed by S. B. Bravyi and A. Yu. Kitaev [Ann. Phys. (NY) 298, 210 (2002), 10.1006/aphy.2002.6254], who indicated that a (nontopological) fault-tolerant quantum computer built from Majorana fermions may be more efficient than that built from distinguishable two-state systems. However, until now scientists have not known how to realize a MFQC in a physical system. In this paper we propose a possible realization of MFQC. We find that the end of a line defect of a p-wave superconductor or superfluid in a honeycomb lattice traps a Majorana zero mode, which becomes the starting point of MFQC. Then we show how to manipulate Majorana fermions to perform universal MFQC, which possesses possibilities for high-level local controllability through individually addressing the quantum states of individual constituent elements by using timely cold-atom technology.

  16. Software-Implemented Fault Tolerance in Communications Systems

    NASA Technical Reports Server (NTRS)

    Gantenbein, Rex E.

    1994-01-01

    Software-implemented fault tolerance (SIFT) is used in many computer-based command, control, and communications (C³) systems to provide the nearly continuous availability that they require. In the communications subsystem of Space Station Alpha, SIFT algorithms are used to detect and recover from failures in the data and command link between the Station and its ground support. The paper presents a review of these algorithms and discusses how such techniques can be applied to similar systems found in applications such as manufacturing control, military communications, and programmable devices such as pacemakers. With support from the Tracking and Communication Division of NASA's Johnson Space Center, researchers at the University of Wyoming are developing a testbed for evaluating the effectiveness of these algorithms prior to their deployment. This testbed will be capable of simulating a variety of C³ system failures and recording the response of the Space Station SIFT algorithms to these failures. The design of this testbed and the applicability of the approach in other environments are described.

  17. Adding Fault Tolerance to NPB Benchmarks Using ULFM

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Parchman, Zachary W; Vallee, Geoffroy R; Naughton III, Thomas J

    2016-01-01

    In the world of high-performance computing, fault tolerance and application resilience are becoming some of the primary concerns because of increasing hardware failures and memory corruptions. While the research community has been investigating various options, from system-level solutions to application-level solutions, standards such as the Message Passing Interface (MPI) are also starting to include such capabilities. The current proposal for MPI fault tolerance is centered around the User-Level Failure Mitigation (ULFM) concept, which provides means for fault detection and recovery of the MPI layer. This approach does not address application-level recovery, which is currently left to application developers. In this work, we present a modification of some of the NAS Parallel Benchmark (NPB) codes to include support for the ULFM capabilities as well as application-level strategies and mechanisms for application-level failure recovery. As such, we present: (i) an application-level library to checkpoint and restore data, (ii) extensions of NPB benchmarks for fault tolerance based on different strategies, (iii) a fault injection tool, and (iv) some preliminary results that show the impact of such fault-tolerance strategies on the application execution.
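    Item (i), the application-level checkpoint/restore library, can be pictured with a minimal sketch like the one below. It is plain Python with pickle standing in for the paper's MPI-based library; the file name, function names, and checkpoint interval are invented for illustration.

```python
import os
import pickle

CKPT_FILE = "app_state.ckpt"   # hypothetical checkpoint file name

def save_checkpoint(state, path=CKPT_FILE):
    """Write the application state atomically: write a temp file, then rename."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path=CKPT_FILE):
    """Return the last saved state, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)

# Resume from the last checkpoint if a previous run failed part-way through.
state = load_checkpoint() or {"iteration": 0, "data": [0.0] * 10}
for i in range(state["iteration"], 100):
    state["data"] = [x + 1.0 for x in state["data"]]   # one step of "work"
    state["iteration"] = i + 1
    if state["iteration"] % 10 == 0:
        save_checkpoint(state)                         # periodic checkpoint
```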

  18. Noise-tolerant inverse analysis models for nondestructive evaluation of transportation infrastructure systems using neural networks

    NASA Astrophysics Data System (ADS)

    Ceylan, Halil; Gopalakrishnan, Kasthurirangan; Birkan Bayrak, Mustafa; Guclu, Alper

    2013-09-01

    The need to rapidly and cost-effectively evaluate the present condition of pavement infrastructure is a critical issue concerning the deterioration of ageing transportation infrastructure all around the world. Nondestructive testing (NDT) and evaluation methods are well-suited for characterising materials and determining structural integrity of pavement systems. The falling weight deflectometer (FWD) is a NDT equipment used to assess the structural condition of highway and airfield pavement systems and to determine the moduli of pavement layers. This involves static or dynamic inverse analysis (referred to as backcalculation) of FWD deflection profiles in the pavement surface under a simulated truck load. The main objective of this study was to employ biologically inspired computational systems to develop robust pavement layer moduli backcalculation algorithms that can tolerate noise or inaccuracies in the FWD deflection data collected in the field. Artificial neural systems, also known as artificial neural networks (ANNs), are valuable computational intelligence tools that are increasingly being used to solve resource-intensive complex engineering problems. Unlike the linear elastic layered theory commonly used in pavement layer backcalculation, non-linear unbound aggregate base and subgrade soil response models were used in an axisymmetric finite element structural analysis programme to generate synthetic database for training and testing the ANN models. In order to develop more robust networks that can tolerate the noisy or inaccurate pavement deflection patterns in the NDT data, several network architectures were trained with varying levels of noise in them. The trained ANN models were capable of rapidly predicting the pavement layer moduli and critical pavement responses (tensile strains at the bottom of the asphalt concrete layer, compressive strains on top of the subgrade layer and the deviator stresses on top of the subgrade layer), and also pavement surface deflections with very low average errors comparable with those obtained directly from the finite element analyses.
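    The core of the noise-tolerance strategy -- training networks on deliberately perturbed inputs so they remain accurate on noisy field data -- can be sketched briefly. The code below uses scikit-learn's MLPRegressor on synthetic data as a stand-in for the study's finite-element-generated database and network architectures, so the sizes, noise level, and target relationship are all illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in for the FE database: deflection basins -> a layer modulus.
n_samples, n_sensors = 2000, 7
X = rng.uniform(0.1, 1.0, size=(n_samples, n_sensors))       # "deflections"
y = 50.0 * X[:, 0] - 20.0 * X[:, 3] + 5.0 * X.sum(axis=1)    # "backcalculated modulus"

# Inject measurement noise into the training inputs so the network learns to tolerate it.
noise_level = 0.02
X_noisy = X * (1.0 + noise_level * rng.standard_normal(X.shape))

model = MLPRegressor(hidden_layer_sizes=(20, 20), max_iter=2000, random_state=0)
model.fit(X_noisy, y)

# Evaluate on independently perturbed inputs, mimicking noisy field FWD measurements.
X_test = X[:200] * (1.0 + noise_level * rng.standard_normal((200, n_sensors)))
print("mean absolute error:", np.mean(np.abs(model.predict(X_test) - y[:200])))
```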

  19. Demonstration of a quantum error detection code using a square lattice of four superconducting qubits

    PubMed Central

    Córcoles, A.D.; Magesan, Easwar; Srinivasan, Srikanth J.; Cross, Andrew W.; Steffen, M.; Gambetta, Jay M.; Chow, Jerry M.

    2015-01-01

    The ability to detect and deal with errors when manipulating quantum systems is a fundamental requirement for fault-tolerant quantum computing. Unlike classical bits that are subject to only digital bit-flip errors, quantum bits are susceptible to a much larger spectrum of errors, for which any complete quantum error-correcting code must account. Whilst classical bit-flip detection can be realized via a linear array of qubits, a general fault-tolerant quantum error-correcting code requires extending into a higher-dimensional lattice. Here we present a quantum error detection protocol on a two-by-two planar lattice of superconducting qubits. The protocol detects an arbitrary quantum error on an encoded two-qubit entangled state via quantum non-demolition parity measurements on another pair of error syndrome qubits. This result represents a building block towards larger lattices amenable to fault-tolerant quantum error correction architectures such as the surface code. PMID:25923200

  20. Demonstration of a quantum error detection code using a square lattice of four superconducting qubits.

    PubMed

    Córcoles, A D; Magesan, Easwar; Srinivasan, Srikanth J; Cross, Andrew W; Steffen, M; Gambetta, Jay M; Chow, Jerry M

    2015-04-29

    The ability to detect and deal with errors when manipulating quantum systems is a fundamental requirement for fault-tolerant quantum computing. Unlike classical bits that are subject to only digital bit-flip errors, quantum bits are susceptible to a much larger spectrum of errors, for which any complete quantum error-correcting code must account. Whilst classical bit-flip detection can be realized via a linear array of qubits, a general fault-tolerant quantum error-correcting code requires extending into a higher-dimensional lattice. Here we present a quantum error detection protocol on a two-by-two planar lattice of superconducting qubits. The protocol detects an arbitrary quantum error on an encoded two-qubit entangled state via quantum non-demolition parity measurements on another pair of error syndrome qubits. This result represents a building block towards larger lattices amenable to fault-tolerant quantum error correction architectures such as the surface code.

  1. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Fang, Aiman; Laguna, Ignacio; Sato, Kento

    Future high-performance computing systems may face frequent failures with their rapid increase in scale and complexity. Resilience to faults has become a major challenge for large-scale applications running on supercomputers, which demands fault tolerance support for prevalent MPI applications. Among failure scenarios, process failures are one of the most severe issues as they usually lead to termination of applications. However, the widely used MPI implementations do not provide mechanisms for fault tolerance. We propose FTA-MPI (Fault Tolerance Assistant MPI), a programming model that provides support for failure detection, failure notification and recovery. Specifically, FTA-MPI exploits a try/catch model that enables failure localization and transparent recovery of process failures in MPI applications. We demonstrate FTA-MPI with synthetic applications and a molecular dynamics code CoMD, and show that FTA-MPI provides high programmability for users and enables convenient and flexible recovery of process failures.
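    The try/catch flavor of the model can be illustrated generically. The sketch below is ordinary Python rather than the FTA-MPI interface (whose calls are not given in the abstract); a raised exception simulates a process failure, and the catch block localizes it and retries, standing in for respawn-and-recover.

```python
import random

class ProcessFailure(Exception):
    """Simulated fatal failure of one worker process."""

def unreliable_work(chunk):
    if random.random() < 0.3:                 # 30% chance this "process" dies
        raise ProcessFailure(f"worker handling chunk {chunk} failed")
    return sum(chunk)

def run_with_recovery(chunks, max_retries=5):
    results = []
    for chunk in chunks:
        for attempt in range(max_retries):
            try:                               # "try": normal execution path
                results.append(unreliable_work(chunk))
                break
            except ProcessFailure as err:      # "catch": localize the failure and recover
                print(f"detected: {err}; respawning and retrying (attempt {attempt + 1})")
        else:
            raise RuntimeError(f"chunk {chunk} could not be recovered")
    return results

random.seed(1)
print(run_with_recovery([[1, 2], [3, 4], [5, 6]]))
```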

  2. DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lu, Chenyang; Niu, Liangliang; Chen, Nanjun

    A grand challenge in material science is to understand the correlation between intrinsic properties and defect dynamics. Radiation tolerant materials are in great demand for safe operation and advancement of nuclear and aerospace systems. Unlike traditional approaches that rely on microstructural and nanoscale features to mitigate radiation damage, this study demonstrates enhancement of radiation tolerance with the suppression of void formation by two orders of magnitude at elevated temperatures in equiatomic single-phase concentrated solid solution alloys, and more importantly, reveals its controlling mechanism through a detailed analysis of the depth distribution of defect clusters and an atomistic computer simulation. The enhanced swelling resistance is attributed to the tailored interstitial defect cluster motion in the alloys from a long-range one-dimensional mode to a short-range three-dimensional mode, which leads to enhanced point defect recombination. Finally, the results suggest design criteria for next generation radiation tolerant structural alloys.

  3. Towards the formal verification of the requirements and design of a processor interface unit

    NASA Technical Reports Server (NTRS)

    Fura, David A.; Windley, Phillip J.; Cohen, Gerald C.

    1993-01-01

    The formal verification of the design and partial requirements for a Processor Interface Unit (PIU) using the Higher Order Logic (HOL) theorem-proving system is described. The processor interface unit is a single-chip subsystem within a fault-tolerant embedded system under development within the Boeing Defense and Space Group. It provides the opportunity to investigate the specification and verification of a real-world subsystem within a commercially-developed fault-tolerant computer. An overview of the PIU verification effort is given. The actual HOL listings from the verification effort are documented in a companion NASA contractor report entitled 'Towards the Formal Verification of the Requirements and Design of a Processor Interface Unit - HOL Listings', including the general-purpose HOL theories and definitions that support the PIU verification as well as tactics used in the proofs.

  4. Development and evaluation of a Fault-Tolerant Multiprocessor (FTMP) computer. Volume 2: FTMP software

    NASA Technical Reports Server (NTRS)

    Lala, J. H.; Smith, T. B., III

    1983-01-01

    The software developed for the Fault-Tolerant Multiprocessor (FTMP) is described. The FTMP executive is a timer-interrupt driven dispatcher that schedules iterative tasks which run at 3.125, 12.5, and 25 Hz. Major tasks which run under the executive include system configuration control, flight control, and display. The flight control task includes autopilot and autoland functions for a jet transport aircraft. System Displays include status displays of all hardware elements (processors, memories, I/O ports, buses), failure log displays showing transient and hard faults, and an autopilot display. All software is in a higher order language (AED, an ALGOL derivative). The executive is a fully distributed general purpose executive which automatically balances the load among available processor triads. Provisions for graceful performance degradation under processing overload are an integral part of the scheduling algorithms.

  5. Emission Measurements of Ultracell XX25 Reformed Methanol Fuel Cell System

    DTIC Science & Technology

    2008-06-01

    combination of electrochemical devices such as fuel cell and battery. Polymer electrolyte membrane fuel cells (PEMFC) using hydrogen or liquid...communications and computers, sensors and night vision capabilities. High temperature PEMFC offers some advantages such as enhanced electrode kinetics and better...tolerance of carbon monoxide that will poison the conventional PEMFC. Ultracell Corporation, Livermore, California has developed a first

  6. Operational Suitability Guide. Volume 2. Templates

    DTIC Science & Technology

    1990-05-01

    Intended mission, and the required technical and operational characteristics. The mission must be adequately defined and key hardware and software...operational availability. With the use of fault-tolerant computer hardware and software, the system R&M will significantly improve end-to-end...should include both hardware and software elements, as appropriate. Unique characteristics or unique support concepts should be identified if they result

  7. Phase Transition of a Dynamical System with a Bi-Directional, Instantaneous Coupling to a Virtual System

    NASA Astrophysics Data System (ADS)

    Gintautas, Vadas; Hubler, Alfred

    2006-03-01

    As worldwide computer resources increase in power and decrease in cost, real-time simulations of physical systems are becoming increasingly prevalent, from laboratory models to stock market projections and entire ``virtual worlds'' in computer games. Often, these systems are meticulously designed to match real-world systems as closely as possible. We study the limiting behavior of a virtual horizontally driven pendulum coupled to its real-world counterpart, where the interaction occurs on a time scale that is much shorter than the time scale of the dynamical system. We find that if the physical parameters of the virtual system match those of the real system within a certain tolerance, there is a qualitative change in the behavior of the two-pendulum system as the strength of the coupling is increased. Applications include a new method to measure the physical parameters of a real system and the use of resonance spectroscopy to refine a computer model. As virtual systems better approximate real ones, even very weak interactions may produce unexpected and dramatic behavior. The research is supported by the National Science Foundation Grant No. NSF PHY 01-40179, NSF DMS 03-25939 ITR, and NSF DGE 03-38215.

  8. Finding the way with a noisy brain.

    PubMed

    Cheung, Allen; Vickerstaff, Robert

    2010-11-11

    Successful navigation is fundamental to the survival of nearly every animal on earth, and achieved by nervous systems of vastly different sizes and characteristics. Yet surprisingly little is known of the detailed neural circuitry from any species which can accurately represent space for navigation. Path integration is one of the oldest and most ubiquitous navigation strategies in the animal kingdom. Despite a plethora of computational models, from equational to neural network form, there is currently no consensus, even in principle, of how this important phenomenon occurs neurally. Recently, all path integration models were examined according to a novel, unifying classification system. Here we combine this theoretical framework with recent insights from directed walk theory, and develop an intuitive yet mathematically rigorous proof that only one class of neural representation of space can tolerate noise during path integration. This result suggests many existing models of path integration are not biologically plausible due to their intolerance to noise. This surprising result imposes significant computational limitations on the neurobiological spatial representation of all successfully navigating animals, irrespective of species. Indeed, noise-tolerance may be an important functional constraint on the evolution of neuroarchitectural plans in the animal kingdom.

  9. Parallel Architectures for Planetary Exploration Requirements (PAPER)

    NASA Technical Reports Server (NTRS)

    Cezzar, Ruknet; Sen, Ranjan K.

    1989-01-01

    The Parallel Architectures for Planetary Exploration Requirements (PAPER) project is essentially research oriented towards technology insertion issues for NASA's unmanned planetary probes. It was initiated to complement and augment the long-term efforts for space exploration with particular reference to NASA/LaRC's (NASA Langley Research Center) research needs for planetary exploration missions of the mid and late 1990s. The requirements for space missions as given in the somewhat dated Advanced Information Processing Systems (AIPS) requirements document are contrasted with the new requirements from JPL/Caltech involving sensor data capture and scene analysis. It is shown that more stringent requirements have arisen as a result of technological advancements. Two possible architectures, the AIPS Proof of Concept (POC) configuration and the MAX fault-tolerant dataflow multiprocessor, were evaluated. The main observation was that the AIPS design is biased towards fault tolerance and may not be an ideal architecture for planetary and deep space probes due to high cost and complexity. The MAX concept appears to be a promising candidate, except that more detailed information is required. The feasibility of adding neural computation capability to this architecture needs to be studied. Key impact issues for architectural design of computing systems meant for planetary missions were also identified.

  10. Tolerance analysis program

    NASA Technical Reports Server (NTRS)

    Watson, H. K.

    1971-01-01

    Digital computer program determines tolerance values of an end-to-end signal chain or flow path, given a preselected probability value. Technique is useful in the synthesis and analysis phases of subsystem design processes.
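    A typical calculation of this kind combines the tolerances of the individual elements in a chain statistically rather than in the worst case. The sketch below is a generic root-sum-square stack-up with made-up per-stage numbers, not the program itself; the sigma multiplier is chosen for the desired probability under a normality assumption.

```python
import math

# Per-stage 1-sigma tolerances (e.g. percent gain error) along a signal chain (illustrative).
stage_sigmas = [0.5, 0.3, 0.8, 0.2]

# Sigma multiplier for the preselected probability of staying in tolerance
# (normal assumption): ~99.73% corresponds to 3 sigma, ~95.45% to 2 sigma.
k = 3.0

worst_case = k * sum(stage_sigmas)                            # simple worst-case sum
rss = k * math.sqrt(sum(s * s for s in stage_sigmas))         # statistical (RSS) stack-up

print(f"worst-case end-to-end tolerance: +/-{worst_case:.2f}%")
print(f"RSS end-to-end tolerance at 3 sigma: +/-{rss:.2f}%")
```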

  11. A Scheduling Algorithm for Replicated Real-Time Tasks

    NASA Technical Reports Server (NTRS)

    Yu, Albert C.; Lin, Kwei-Jay

    1991-01-01

    We present an algorithm for scheduling real-time periodic tasks on a multiprocessor system under a fault-tolerance requirement. Our approach incorporates both the redundancy and masking technique and the imprecise computation model. Since the tasks in hard real-time systems have stringent timing constraints, redundancy and masking techniques are more appropriate than rollback techniques, which usually require extra time for error recovery. The imprecise computation model provides flexible functionality by trading off the quality of the result produced by a task with the amount of processing time required to produce it. It therefore permits the performance of a real-time system to degrade gracefully. We evaluate the algorithm by stochastic analysis and Monte Carlo simulations. The results show that the algorithm is resilient under hardware failures.
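    The imprecise computation idea -- every task has a mandatory part that must complete and an optional part that refines the result as time allows -- can be sketched on a single processor with invented task parameters (this is not the paper's multiprocessor, replica-aware algorithm):

```python
# Each task: (name, mandatory_time, optional_time); budget is the time left in the frame.
tasks = [("attitude", 2.0, 3.0), ("navigation", 1.5, 2.0), ("telemetry", 1.0, 4.0)]
budget = 8.0

schedule = []
# First guarantee every mandatory part (the hard real-time requirement).
for name, mandatory, _ in tasks:
    schedule.append((name, "mandatory", mandatory))
    budget -= mandatory
assert budget >= 0, "mandatory parts alone exceed the frame"

# Then spend the remaining time on optional parts, so quality degrades gracefully.
for name, _, optional in tasks:
    granted = min(optional, max(budget, 0.0))
    schedule.append((name, "optional", granted))
    budget -= granted

for entry in schedule:
    print(entry)
```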

  12. Spaceborne Processor Array

    NASA Technical Reports Server (NTRS)

    Chow, Edward T.; Schatzel, Donald V.; Whitaker, William D.; Sterling, Thomas

    2008-01-01

    A Spaceborne Processor Array in Multifunctional Structure (SPAMS) can lower the total mass of the electronic and structural overhead of spacecraft, resulting in reduced launch costs, while increasing the science return through dynamic onboard computing. SPAMS integrates the multifunctional structure (MFS) and the Gilgamesh Memory, Intelligence, and Network Device (MIND) multi-core in-memory computer architecture into a single-system super-architecture. This transforms every inch of a spacecraft into a sharable, interconnected, smart computing element to increase computing performance while simultaneously reducing mass. The MIND in-memory architecture provides a foundation for high-performance, low-power, and fault-tolerant computing. The MIND chip has an internal structure that includes memory, processing, and communication functionality. The Gilgamesh is a scalable system comprising multiple MIND chips interconnected to operate as a single, tightly coupled, parallel computer. The array of MIND components shares a global, virtual name space for program variables and tasks that are allocated at run time to the distributed physical memory and processing resources. Individual processor- memory nodes can be activated or powered down at run time to provide active power management and to configure around faults. A SPAMS system is comprised of a distributed Gilgamesh array built into MFS, interfaces into instrument and communication subsystems, a mass storage interface, and a radiation-hardened flight computer.

  13. A brief overview of NASA Langley's research program in formal methods

    NASA Technical Reports Server (NTRS)

    1992-01-01

    An overview of NASA Langley's research program in formal methods is presented. The major goal of this work is to bring formal methods technology to a sufficiently mature level for use by the United States aerospace industry. Towards this goal, work is underway to design and formally verify a fault-tolerant computing platform suitable for advanced flight control applications. Also, several direct technology transfer efforts have been initiated that apply formal methods to critical subsystems of real aerospace computer systems. The research team consists of six NASA civil servants and contractors from Boeing Military Aircraft Company, Computational Logic Inc., Odyssey Research Associates, SRI International, University of California at Davis, and Vigyan Inc.

  14. DEPEND: A simulation-based environment for system level dependability analysis

    NASA Technical Reports Server (NTRS)

    Goswami, Kumar; Iyer, Ravishankar K.

    1992-01-01

    The design and evaluation of highly reliable computer systems is a complex issue. Designers mostly develop such systems based on prior knowledge and experience and occasionally from analytical evaluations of simplified designs. A simulation-based environment called DEPEND, which is especially geared toward the design and evaluation of fault-tolerant architectures, is presented. DEPEND is unique in that it exploits the properties of object-oriented programming to provide a flexible framework with which a user can rapidly model and evaluate various fault-tolerant systems. The key features of the DEPEND environment are described, and its capabilities are illustrated with a detailed analysis of a real design. In particular, DEPEND is used to simulate the Unix-based Tandem Integrity fault-tolerant system and to evaluate how well it handles near-coincident errors caused by correlated and latent faults. Issues such as memory scrubbing, re-integration policies, and workload-dependent repair times, which affect how the system handles near-coincident errors, are also evaluated. Issues such as the method used by DEPEND to simulate error latency and the time-acceleration technique that provides enormous simulation speed-up are also discussed. Unlike other simulation-based dependability studies, the use of these approaches and the accuracy of the simulation model are validated by comparing the results of the simulations with measurements obtained from fault injection experiments conducted on a production Tandem Integrity machine.

  15. Fault-tolerant arithmetic via time-shared TMR

    NASA Astrophysics Data System (ADS)

    Swartzlander, Earl E.

    1999-11-01

    Fault tolerance is increasingly important as society has come to depend on computers for more and more aspects of daily life. The current concern about the Y2K problem indicates just how much we depend on accurate computers. This paper describes work on time-shared TMR, a technique used to provide arithmetic operations that produce correct results in spite of circuit faults.
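
    The idea behind time-shared TMR can be sketched as follows: instead of three parallel arithmetic units, a single unit evaluates the same operation in three successive time slots and a bitwise 2-of-3 majority vote masks a fault in any one slot. This is an illustrative Python sketch under those assumptions, not the circuit-level design in the paper.

    ```python
    def majority_vote(a, b, c):
        """Bitwise 2-of-3 majority: a bit is set iff at least two inputs have it set."""
        return (a & b) | (a & c) | (b & c)

    def time_shared_tmr(op, x, y, fault_in_slot=None):
        """Run the same arithmetic operation in three time slots and vote.
        fault_in_slot optionally corrupts one slot to emulate a transient circuit fault."""
        results = []
        for slot in range(3):
            r = op(x, y)
            if slot == fault_in_slot:
                r ^= 0b100  # flip a bit to model a fault in this slot
            results.append(r)
        return majority_vote(*results)

    # A fault in any single slot is masked by the vote.
    assert time_shared_tmr(lambda a, b: a + b, 17, 25, fault_in_slot=1) == 42
    ```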

  16. Application of the actor model to large scale NDE data analysis

    NASA Astrophysics Data System (ADS)

    Coughlin, Chris

    2018-03-01

    The Actor model of concurrent computation discretizes a problem into a series of independent units or actors that interact only through the exchange of messages. Without direct coupling between individual components, an Actor-based system is inherently concurrent and fault-tolerant. These traits lend themselves to so-called "Big Data" applications in which the volume of data to analyze requires a distributed multi-system design. For a practical demonstration of the Actor computational model, a system was developed to assist with the automated analysis of Nondestructive Evaluation (NDE) datasets using the open source Myriad Data Reduction Framework. A machine learning model trained to detect damage in two-dimensional slices of C-Scan data was deployed in a streaming data processing pipeline. To demonstrate the flexibility of the Actor model, the pipeline was deployed on a local system and re-deployed as a distributed system without recompiling, reconfiguring, or restarting the running application.
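
    A minimal actor sketch in Python (threads and queues standing in for a distributed runtime; none of the names below come from the Myriad framework): each actor owns a private mailbox, processes one message at a time, and shares no state with other actors, which is what makes the pattern naturally concurrent and fault-isolating.

    ```python
    import threading, queue, time

    class Actor:
        """An actor that reacts to messages arriving in its private mailbox."""
        def __init__(self, handler):
            self.inbox = queue.Queue()
            self.handler = handler
            threading.Thread(target=self._run, daemon=True).start()

        def send(self, msg):
            self.inbox.put(msg)

        def _run(self):
            while True:
                msg = self.inbox.get()
                if msg is None:          # poison pill shuts the actor down
                    break
                self.handler(msg, self)

    # Hypothetical pipeline: a "scanner" actor forwards C-scan slices to a "classifier".
    def classify(msg, self_actor):
        slice_id, data = msg
        print(f"slice {slice_id}: {'damage' if max(data) > 0.8 else 'ok'}")

    classifier = Actor(classify)

    def scan(msg, self_actor):
        classifier.send(msg)             # actors interact only via messages

    scanner = Actor(scan)
    scanner.send((1, [0.1, 0.9, 0.2]))
    scanner.send((2, [0.1, 0.2, 0.3]))
    time.sleep(0.2)                      # let the daemon threads drain their mailboxes
    ```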

  17. ASSIST: User's manual

    NASA Technical Reports Server (NTRS)

    Johnson, S. C.

    1986-01-01

    Semi-Markov models can be used to compute the reliability of virtually any fault-tolerant system. However, the process of delineating all of the states and transitions in a model of a complex system can be devastatingly tedious and error-prone. The ASSIST program allows the user to describe the semi-Markov model in a high-level language. Instead of specifying the individual states of the model, the user specifies the rules governing the behavior of the system, and these are used by ASSIST to automatically generate the model. The ASSIST program is described and illustrated by examples.
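
    To make the modeling idea concrete, here is a small sketch (Python with NumPy/SciPy, not the ASSIST language itself, and with illustrative rates) of the kind of Markov reliability model such a tool generates: a duplex processor with failure rate lambda and imperfect coverage c, solved for the probability of system failure by mission time t.

    ```python
    import numpy as np
    from scipy.linalg import expm

    lam, c, t = 1e-4, 0.99, 10.0   # failure rate (/hr), coverage, mission time (hr)

    # States: 0 = both units up, 1 = one unit up, 2 = system failed.
    # Q[i, j] is the transition rate from state i to state j; rows sum to zero.
    Q = np.array([
        [-2 * lam, 2 * lam * c, 2 * lam * (1 - c)],
        [0.0,      -lam,        lam              ],
        [0.0,      0.0,         0.0              ],
    ])

    p0 = np.array([1.0, 0.0, 0.0])     # start with both units working
    p_t = p0 @ expm(Q * t)             # transient state probabilities at time t
    print(f"P(system failure by t={t} hr) = {p_t[2]:.3e}")
    ```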

  18. Control of linear uncertain systems utilizing mismatched state observers

    NASA Technical Reports Server (NTRS)

    Goldstein, B.

    1972-01-01

    The control of linear continuous dynamical systems is investigated as a problem of limited state feedback control. The equations which describe the structure of an observer are developed, constrained to time-invariant systems. The optimal control problem is formulated, accounting for the uncertainty in the design parameters. Expressions for bounds on closed-loop stability are also developed. The results indicate that very little uncertainty may be tolerated before divergence occurs in the recursive computation algorithms, and the derived stability bound yields extremely conservative estimates of regions of allowable parameter variations.

  19. Engineered Zircaloy Cladding Modifications for Improved Accident Tolerance of LWR Nuclear Fuel

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Heuser, Brent; Stubbins, James; Kozlowski, Tomasz

    The DOE NEUP-sponsored IRP on accident tolerant fuel (ATF) entitled Engineered Zircaloy Cladding Modifications for Improved Accident Tolerance of LWR Nuclear Fuel involved three academic institutions, Idaho National Laboratory (INL), and ATI Materials (ATI). Detailed descriptions of the work at the University of Illinois (UIUC, prime), the University of Florida (UF), the University of Michigan (UMich), and INL are included in this document as separate sections. This summary provides a synopsis of the work performed across the IRP team. Two ATF solution pathways were initially proposed: coatings on monolithic Zr-based LWR cladding material and self-healing modifications of Zr-based alloys. The coating pathway was extensively investigated, both experimentally and computationally. Experimental activities related to ATF coatings were centered at UIUC, UF, and UMich and involved coating development and testing, and ion irradiation. Neutronic and thermal hydraulic aspects of ATF coatings were the focus of computational work at UIUC and UMich, while materials science aspects were the focus of computational work at UF and INL. ATI provided monolithic Zircaloy 2 and 4 material and a binary Zr-Y alloy material. The self-healing pathway was investigated with advanced computations only. Beryllium was identified as a valid self-healing additive early in this work. However, all attempts to fabricate a Zr-Be alloy failed. Several avenues of fabrication were explored. ATI ultimately declined our fabrication request over health concerns associated with Be (we note that Be was not part of the original work scope and the ATI SOW). Likewise, Ames Laboratory declined our fabrication request, citing known litigation dating to the 1980s and 1990s involving the U.S. Federal government and U.S. National Laboratory employees and the use of Be. Materion (formerly Brush Wellman) also declined our fabrication request, citing the difficulty of working with highly reactive Zr and Be. International fabrication options were explored in Europe and Asia, but this proved to be impractical, if not impossible. Consequently, experimental investigation of the Zr-Be binary system was dropped and exploration of the binary Zr-Y system was initiated. The motivation behind the Zr-Y system is the known thermodynamic stability of yttria over zirconia.

  20. Image Analysis via Soft Computing: Prototype Applications at NASA KSC and Product Commercialization

    NASA Technical Reports Server (NTRS)

    Dominguez, Jesus A.; Klinko, Steve

    2011-01-01

    This slide presentation reviews the use of "soft computing," which differs from "hard computing" in that it is more tolerant of imprecision, partial truth, uncertainty, and approximation, and its use in image analysis. Soft computing provides flexible information processing to handle real-life ambiguous situations and achieve tractability, robustness, low solution cost, and a closer resemblance to human decision making. Several systems are or have been developed: Fuzzy Reasoning Edge Detection (FRED), Fuzzy Reasoning Adaptive Thresholding (FRAT), image enhancement techniques, and visual/pattern recognition. These systems are compared with examples that show the effectiveness of each. NASA applications that are reviewed are: Real-Time (RT) Anomaly Detection, Real-Time (RT) Moving Debris Detection, and the Columbia investigation. The RT anomaly detection reviewed the case of a damaged cable for the emergency egress system. The use of these techniques is further illustrated in the Columbia investigation with the location and detection of foam debris. There are several applications in commercial usage: image enhancement, human screening and privacy protection, visual inspection, 3D heart visualization, tumor detection, and X-ray image enhancement.

  1. A novel concept for smart trepanation.

    PubMed

    Follmann, Axel; Korff, Alexander; Fuertjes, Tobias; Kunze, Sandra C; Schmieder, Kirsten; Radermacher, Klaus

    2012-01-01

    Trepanation of the skull is a common procedure in craniofacial and neurosurgical interventions, allowing access to the innermost cranial structures. Despite a careful advancement, injury of the dura mater represents a frequent complication during these cranial openings. The technology of computer-assisted surgery offers different support systems such as navigated tools and surgical robots. This article presents a novel technical approach toward an image- and sensor-based synergistic control of the cutting depth of a manually guided soft-tissue-preserving saw. Feasibility studies in a laboratory setup modeling relevant skull tissue parameters demonstrate that errors due to computed tomography or magnetic resonance image segmentation and registration, optical tracking, and mechanical tolerances of up to 2.5 mm, inherent in many computer-assisted surgery systems, can be compensated for by the cutting tool characteristics without damaging the dura. In conclusion, the feasibility of a computer-controlled trepanation system providing a safer and more efficient trepanation has been demonstrated. Injuries of the dura mater can be avoided, and the bone cutting gap can be reduced to 0.5 mm with potential benefits for the reintegration of the bone flap.

  2. VLSI Implementation of Fault Tolerance Multiplier based on Reversible Logic Gate

    NASA Astrophysics Data System (ADS)

    Ahmad, Nabihah; Hakimi Mokhtar, Ahmad; Othman, Nurmiza binti; Fhong Soon, Chin; Rahman, Ab Al Hadi Ab

    2017-08-01

    A multiplier is one of the essential components in the digital world, used in digital signal processing, microprocessors, quantum computing, and widely within arithmetic units. Due to the complexity of the multiplier, the tendency for errors is very high. This paper aimed to design a 2×2-bit fault-tolerance multiplier based on reversible logic gates with low power consumption and high performance. The design has been implemented using 90 nm Complementary Metal Oxide Semiconductor (CMOS) technology in Synopsys Electronic Design Automation (EDA) tools. The multiplier architecture is implemented using reversible logic gates. The fault-tolerance multiplier uses a combination of three reversible logic gates, the Double Feynman gate (F2G), the New Fault Tolerance (NFT) gate, and the Islam Gate (IG), with an area of 160 μm × 420.3 μm (approximately 0.067 mm²). The design achieved a low power consumption of 122.85 μW and a propagation delay of 16.99 ns. The proposed fault-tolerance multiplier achieves low power consumption and high performance, making it suitable for modern computing applications thanks to its fault-tolerance capabilities.

  3. Arcmancer: Geodesics and polarized radiative transfer library

    NASA Astrophysics Data System (ADS)

    Pihajoki, Pauli; Mannerkoski, Matias; Nättilä, Joonas; Johansson, Peter H.

    2018-05-01

    Arcmancer computes geodesics and performs polarized radiative transfer in user-specified spacetimes. The library supports Riemannian and semi-Riemannian spaces of any dimension and metric; it also supports multiple simultaneous coordinate charts, embedded geometric shapes, local coordinate systems, and automatic parallel propagation. Arcmancer can be used to solve various problems in numerical geometry, such as solving the curve equation of motion with adaptive integration and configurable tolerances, and solving differential equations along precomputed curves. It also provides support for curves with an arbitrary acceleration term and generic tools for generating ray initial conditions and performing parallel computation over the image, among other tools.

  4. Exploiting opportunistic resources for ATLAS with ARC CE and the Event Service

    NASA Astrophysics Data System (ADS)

    Cameron, D.; Filipčič, A.; Guan, W.; Tsulaia, V.; Walker, R.; Wenaus, T.; ATLAS Collaboration

    2017-10-01

    With ever-greater computing needs and fixed budgets, big scientific experiments are turning to opportunistic resources as a means to add much-needed extra computing power. These resources can be very different in design from those that comprise the Grid computing of most experiments, therefore exploiting them requires a change in strategy for the experiment. They may be highly restrictive in what can be run or in connections to the outside world, or tolerate opportunistic usage only on condition that tasks may be terminated without warning. The Advanced Resource Connector Computing Element (ARC CE) with its nonintrusive architecture is designed to integrate resources such as High Performance Computing (HPC) systems into a computing Grid. The ATLAS experiment developed the ATLAS Event Service (AES) primarily to address the issue of jobs that can be terminated at any point when opportunistic computing capacity is needed by someone else. This paper describes the integration of these two systems in order to exploit opportunistic resources for ATLAS in a restrictive environment. In addition to the technical details, results from deployment of this solution in the SuperMUC HPC centre in Munich are shown.

  5. Highly fault-tolerant parallel computation

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Spielman, D.A.

    We re-introduce the coded model of fault-tolerant computation in which the input and output of a computational device are treated as words in an error-correcting code. A computational device correctly computes a function in the coded model if its input and output, once decoded, are a valid input and output of the function. In the coded model, it is reasonable to hope to simulate all computational devices by devices whose size is greater by a constant factor but which are exponentially reliable even if each of their components can fail with some constant probability. We consider fine-grained parallel computations in which each processor has a constant probability of producing the wrong output at each time step. We show that any parallel computation that runs for time t on w processors can be performed reliably on a faulty machine in the coded model using w log^{O(1)} w processors and time t log^{O(1)} w. The failure probability of the computation will be at most t · exp(-w^{1/4}). The codes used to communicate with our fault-tolerant machines are generalized Reed-Solomon codes and can thus be encoded and decoded in O(n log^{O(1)} n) sequential time and are independent of the machine they are used to communicate with. We also show how coded computation can be used to self-correct many linear functions in parallel with arbitrarily small overhead.

  6. Proactive Fault Tolerance Using Preemptive Migration

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Engelmann, Christian; Vallee, Geoffroy R; Naughton, III, Thomas J

    2009-01-01

    Proactive fault tolerance (FT) in high-performance computing is a concept that prevents compute node failures from impacting running parallel applications by preemptively migrating application parts away from nodes that are about to fail. This paper provides a foundation for proactive FT by defining its architecture and classifying implementation options. This paper further relates prior work to the presented architecture and classification, and discusses the challenges ahead for needed supporting technologies.

  7. Closed-form solution of decomposable stochastic models

    NASA Technical Reports Server (NTRS)

    Sjogren, Jon A.

    1990-01-01

    Markov and semi-Markov processes are increasingly being used in the modeling of complex reconfigurable systems (fault tolerant computers). The estimation of the reliability (or some measure of performance) of the system reduces to solving the process for its state probabilities. Such a model may exhibit numerous states and complicated transition distributions, contributing to an expensive and numerically delicate solution procedure. Thus, when a system exhibits a decomposition property, either structurally (autonomous subsystems), or behaviorally (component failure versus reconfiguration), it is desirable to exploit this decomposition in the reliability calculation. In interesting cases there can be failure states which arise from non-failure states of the subsystems. Equations are presented which allow the computation of failure probabilities of the total (combined) model without requiring a complete solution of the combined model. This material is presented within the context of closed-form functional representation of probabilities as utilized in the Symbolic Hierarchical Automated Reliability and Performance Evaluator (SHARPE) tool. The techniques adopted enable one to compute such probability functions for a much wider class of systems at a reduced computational cost. Several examples show how the method is used, especially in enhancing the versatility of the SHARPE tool.

  8. Fault tolerant computer control for a Maglev transportation system

    NASA Technical Reports Server (NTRS)

    Lala, Jaynarayan H.; Nagle, Gail A.; Anagnostopoulos, George

    1994-01-01

    Magnetically levitated (Maglev) vehicles operating on dedicated guideways at speeds of 500 km/hr are an emerging transportation alternative to short-haul air and high-speed rail. They have the potential to offer a service significantly more dependable than air and with less operating cost than both air and high-speed rail. Maglev transportation derives these benefits by using magnetic forces to suspend a vehicle 8 to 200 mm above the guideway. Magnetic forces are also used for propulsion and guidance. The combination of high speed, short headways, stringent ride quality requirements, and a distributed offboard propulsion system necessitates high levels of automation for the Maglev control and operation. Very high levels of safety and availability will be required for the Maglev control system. This paper describes the mission scenario, functional requirements, and dependability and performance requirements of the Maglev command, control, and communications system. A distributed hierarchical architecture consisting of vehicle on-board computers, wayside zone computers, a central computer facility, and communication links between these entities was synthesized to meet the functional and dependability requirements on the maglev. Two variations of the basic architecture are described: the Smart Vehicle Architecture (SVA) and the Zone Control Architecture (ZCA). Preliminary dependability modeling results are also presented.

  9. Understanding checkpointing overheads on massive-scale systems : analysis of the IBM Blue Gene/P system.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Gupta, R.; Naik, H.; Beckman, P.

    Providing fault tolerance in high-end petascale systems, consisting of millions of hardware components and complex software stacks, is becoming an increasingly challenging task. Checkpointing continues to be the most prevalent technique for providing fault tolerance in such high-end systems. Considerable research has focused on optimizing checkpointing; however, in practice, checkpointing still involves a high-cost overhead for users. In this paper, we study the checkpointing overhead seen by various applications running on leadership-class machines like the IBM Blue Gene/P at Argonne National Laboratory. In addition to studying popular applications, we design a methodology to help users understand and intelligently choose an optimal checkpointing frequency to reduce the overall checkpointing overhead incurred. In particular, we study the Grid-Based Projector-Augmented Wave application, the Carr-Parrinello Molecular Dynamics application, the Nek5000 computational fluid dynamics application, and the Parallel Ocean Program application, and analyze their memory usage and possible checkpointing trends on 65,536 processors of the Blue Gene/P system.
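
    The abstract does not state which model the authors use, but a common first-order way to pick a checkpoint interval is Young's approximation, sketched below in Python with made-up cost and MTBF numbers: the optimal interval balances the time spent writing checkpoints against the work lost when a failure forces a rollback.

    ```python
    import math

    def young_interval(checkpoint_cost_s, mtbf_s):
        """Young's first-order optimal checkpoint interval: sqrt(2 * C * MTBF)."""
        return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

    def expected_overhead_fraction(interval_s, checkpoint_cost_s, mtbf_s):
        """Rough overhead model: checkpoint time per interval plus expected rework
        (on average about half an interval of work is lost per failure)."""
        checkpoint_frac = checkpoint_cost_s / interval_s
        rework_frac = (interval_s / 2.0 + checkpoint_cost_s) / mtbf_s
        return checkpoint_frac + rework_frac

    C, MTBF = 600.0, 24 * 3600.0        # 10-minute checkpoint, 1-day system MTBF (illustrative)
    tau = young_interval(C, MTBF)
    print(f"interval ~ {tau/3600:.2f} h, overhead ~ {expected_overhead_fraction(tau, C, MTBF):.1%}")
    ```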

  10. Stochastic Stability of Nonlinear Sampled Data Systems with a Jump Linear Controller

    NASA Technical Reports Server (NTRS)

    Gonzalez, Oscar R.; Herencia-Zapana, Heber; Gray, W. Steven

    2004-01-01

    This paper analyzes the stability of a sampled-data system consisting of a deterministic, nonlinear, time-invariant, continuous-time plant and a stochastic, discrete-time, jump linear controller. The jump linear controller models, for example, computer systems and communication networks that are subject to stochastic upsets or disruptions. This sampled-data model has been used in the analysis and design of fault-tolerant systems and computer-control systems with random communication delays without taking into account the inter-sample response. To analyze stability, appropriate topologies are introduced for the signal spaces of the sampled-data system. With these topologies, the ideal sampling and zero-order-hold operators are shown to be measurable maps. This paper shows that the known equivalence between the stability of a deterministic, linear sampled-data system and its associated discrete-time representation as well as between a nonlinear sampled-data system and a linearized representation holds even in a stochastic framework.

  11. Development of a frameless stereotactic radiosurgery system based on real-time 6D position monitoring and adaptive head motion compensation

    NASA Astrophysics Data System (ADS)

    Wiersma, Rodney D.; Wen, Zhifei; Sadinski, Meredith; Farrey, Karl; Yenice, Kamil M.

    2010-01-01

    Stereotactic radiosurgery delivers radiation with great spatial accuracy. To achieve sub-millimeter accuracy for intracranial SRS, a head ring is rigidly fixated to the skull to create a fixed reference. For some patients, the invasiveness of the ring can be highly uncomfortable and not well tolerated. In addition, placing and removing the ring requires special expertise from a neurosurgeon, and patient setup time for SRS can often be long. To reduce the invasiveness, hardware limitations, and setup time, we are developing a system for performing accurate head positioning without the use of a head ring. The proposed method uses real-time 6D optical position feedback for turning the treatment beam on and off (gating) and guiding a motor-controlled 3D head motion compensation stage. The setup consists of a central control computer, an optical patient motion tracking system, and a 3D motion compensation stage attached to the front of the LINAC couch. A styrofoam head cast was custom-built for patient support and was mounted on the compensation stage. The motion feedback of the markers was processed by the control computer, and the resulting motion of the target was calculated using a rigid body model. If the target deviated beyond a preset position tolerance of 0.2 mm, an automatic position correction was performed with stepper motors to adjust the head position via the couch-mount motion platform. In the event the target deviated more than 1 mm, a safety relay switch was activated and the treatment beam was turned off. The feasibility of the concept was tested using five healthy volunteers. Head motion data were acquired with and without the use of motion compensation over treatment times of 15 min. On average, test subjects exceeded the 0.5 mm tolerance 86% of the time and the 1.0 mm tolerance 45% of the time without motion correction. With correction, these percentages were reduced to 5% and 2% for the 0.5 mm and 1.0 mm tolerances, respectively.
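
    The correction and gating thresholds quoted above translate directly into a simple control rule; a schematic Python sketch under those stated thresholds (the actual system runs on dedicated motion-control hardware) is:

    ```python
    import numpy as np

    POSITION_TOL_MM = 0.2   # beyond this, the couch-mounted stage corrects the head position
    GATING_TOL_MM = 1.0     # beyond this, a safety relay turns the treatment beam off

    def control_step(target_offset_mm):
        """Decide what to do for one tracking sample (translation components only)."""
        offset = np.asarray(target_offset_mm, dtype=float)
        deviation = np.linalg.norm(offset)
        if deviation > GATING_TOL_MM:
            return "beam off"                          # safety relay trips
        if deviation > POSITION_TOL_MM:
            return f"correct by {np.round(-offset, 2)}"  # stepper-motor correction
        return "hold"

    print(control_step([0.05, 0.02, 0.01]))   # hold
    print(control_step([0.30, 0.10, 0.05]))   # stepper-motor correction
    print(control_step([0.90, 0.60, 0.40]))   # beam gated off
    ```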

  12. High-Resiliency and Auto-Scaling of Large-Scale Cloud Computing for OCO-2 L2 Full Physics Processing

    NASA Astrophysics Data System (ADS)

    Hua, H.; Manipon, G.; Starch, M.; Dang, L. B.; Southam, P.; Wilson, B. D.; Avis, C.; Chang, A.; Cheng, C.; Smyth, M.; McDuffie, J. L.; Ramirez, P.

    2015-12-01

    Next-generation science data systems are needed to address the incoming flood of data from new missions such as SWOT and NISAR, whose data volumes and data throughput rates are orders of magnitude larger than those of present-day missions. Additionally, traditional means of procuring hardware on-premise are already limited due to facility capacity constraints for these new missions. Existing missions, such as OCO-2, may also require rapid turn-around for processing different science scenarios, where on-premise and even traditional HPC computing environments may not meet the high processing needs. We present our experiences in deploying a hybrid-cloud computing science data system (HySDS) for the OCO-2 Science Computing Facility to support large-scale processing of their Level-2 full physics data products. We explore optimization approaches to getting the best performance out of hybrid-cloud computing as well as common issues that arise when dealing with large-scale computing. Novel approaches were utilized to do processing on Amazon's spot market, which can potentially offer ~10X cost savings but with an unpredictable computing environment driven by market forces. We present how we enabled high-tolerance computing in order to achieve large-scale computing as well as operational cost savings.

  13. DAKOTA Design Analysis Kit for Optimization and Terascale

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Adams, Brian M.; Dalbey, Keith R.; Eldred, Michael S.

    2010-02-24

    The DAKOTA (Design Analysis Kit for Optimization and Terascale Applications) toolkit provides a flexible and extensible interface between simulation codes (computational models) and iterative analysis methods. By employing object-oriented design to implement abstractions of the key components required for iterative systems analyses, the DAKOTA toolkit provides a flexible and extensible problem-solving environment for design and analysis of computational models on high performance computers. A user provides a set of DAKOTA commands in an input file and launches DAKOTA. DAKOTA invokes instances of the computational models, collects their results, and performs systems analyses. DAKOTA contains algorithms for optimization with gradient and nongradient-based methods; uncertainty quantification with sampling, reliability, polynomial chaos, stochastic collocation, and epistemic methods; parameter estimation with nonlinear least squares methods; and sensitivity/variance analysis with design of experiments and parameter study methods. These capabilities may be used on their own or as components within advanced strategies such as hybrid optimization, surrogate-based optimization, mixed integer nonlinear programming, or optimization under uncertainty. Services for parallel computing, simulation interfacing, approximation modeling, fault tolerance, restart, and graphics are also included.

  14. Digital Plasma Control System for Alcator C-Mod

    NASA Astrophysics Data System (ADS)

    Ferrara, M.; Wolfe, S.; Stillerman, J.; Fredian, T.; Hutchinson, I.

    2004-11-01

    A digital plasma control system (DPCS) has been designed to replace the present C-Mod system, which is based on a hybrid analog-digital computer. The initial implementation of DPCS comprises two 64-channel, 16-bit, low-latency cPCI digitizers, each with 16 analog outputs, controlled by a rack-mounted single-processor Linux server, which also serves as the compute engine. A prototype system employing three older 32-channel digitizers was tested during the 2003-04 campaign. The hybrid's linear PID feedback system was emulated by IDL code executing a synchronous loop, using the same target waveforms and control parameters. Reliable real-time operation was accomplished under a standard Linux OS (RH9) by locking memory and disabling interrupts during the plasma pulse. The DPCS-computed outputs agreed to within a few percent with those produced by the hybrid system, except for discrepancies due to offsets and non-ideal behavior of the hybrid circuitry. The system operated reliably, with no sample loss, at more than twice the 10 kHz design specification, providing extra time for implementing more advanced control algorithms. The code is fault-tolerant and produces consistent output waveforms even with 10% sample loss.

  15. Thermal monitoring, measurement, and control system for a Volatile Condensable Materials (VCM) test apparatus

    NASA Technical Reports Server (NTRS)

    Ives, R. E.

    1982-01-01

    A thermal monitoring and control concept is described for a volatile condensable materials (VCM) test apparatus where electric resistance heaters are employed. The technique is computer-based, but requires only proportioning ON/OFF relay control signals supplied through a programmable scanner and simple quadrac power controllers. System uniqueness is derived from automatic temperature measurements and the averaging of these measurements in discrete overlapping temperature zones. Overall control tolerance proves to be better than ±0.5 °C from room ambient temperature to 150 °C. Using precisely calibrated thermocouples, the method provides excellent temperature control of a small copper VCM heating plate at 125 ± 0.2 °C over a 24 hr test period. For purposes of unattended operation, the programmable computer/controller provides a continual data printout of system operation. Real-time operator command is also provided for, as is automatic shutdown of the system and operator alarm in the event of malfunction.

  16. Combining Topological Hardware and Topological Software: Color-Code Quantum Computing with Topological Superconductor Networks

    NASA Astrophysics Data System (ADS)

    Litinski, Daniel; Kesselring, Markus S.; Eisert, Jens; von Oppen, Felix

    2017-07-01

    We present a scalable architecture for fault-tolerant topological quantum computation using networks of voltage-controlled Majorana Cooper pair boxes and topological color codes for error correction. Color codes have a set of transversal gates which coincides with the set of topologically protected gates in Majorana-based systems, namely, the Clifford gates. In this way, we establish color codes as providing a natural setting in which advantages offered by topological hardware can be combined with those arising from topological error-correcting software for full-fledged fault-tolerant quantum computing. We provide a complete description of our architecture, including the underlying physical ingredients. We start by showing that in topological superconductor networks, hexagonal cells can be employed to serve as physical qubits for universal quantum computation, and we present protocols for realizing topologically protected Clifford gates. These hexagonal-cell qubits allow for a direct implementation of open-boundary color codes with ancilla-free syndrome read-out and logical T gates via magic-state distillation. For concreteness, we describe how the necessary operations can be implemented using networks of Majorana Cooper pair boxes, and we give a feasibility estimate for error correction in this architecture. Our approach is motivated by nanowire-based networks of topological superconductors, but it could also be realized in alternative settings such as quantum-Hall-superconductor hybrids.

  17. Cooperative fault-tolerant distributed computing U.S. Department of Energy Grant DE-FG02-02ER25537 Final Report

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Sunderam, Vaidy S.

    2007-01-09

    The Harness project has developed novel software frameworks for the execution of high-end simulations in a fault-tolerant manner on distributed resources. The H2O subsystem comprises the kernel of the Harness framework and controls the key functions of resource management across multiple administrative domains, especially issues of access and allocation. It is based on a "pluggable" architecture that enables the aggregated use of distributed heterogeneous resources for high-performance computing. The major contributions of the Harness II project significantly enhance the overall computational productivity of high-end scientific applications by enabling robust, failure-resilient computations on cooperatively pooled resource collections.

  18. The Unification of Space Qualified Integrated Circuits by Example of International Space Project GAMMA-400

    NASA Astrophysics Data System (ADS)

    Bobkov, S. G.; Serdin, O. V.; Arkhangelskiy, A. I.; Arkhangelskaja, I. V.; Suchkov, S. I.; Topchiev, N. P.

    The problem of unifying electronic components at different levels (circuits, interfaces, hardware, and software) used in the space industry is considered. The development of computer systems for space applications is discussed using the scientific data acquisition system of the GAMMA-400 space project as an example. The basic characteristics of the highly reliable, fault-tolerant chips developed by SRISA RAS for space-qualified computational systems are given. To reduce power consumption and enhance data reliability, the embedded system interconnect is made hierarchical: the upper level is Serial RapidIO 1x or 4x with a transfer rate of 1.25 Gbaud; the next level is SpaceWire with transfer rates up to 400 Mbaud; and the lower level is MIL-STD-1553B and RS232/RS485. 10/100 Ethernet serves as a technology interface and also provides connections to previously released modules. This system interconnect allows different redundancy schemes to be created. Designers can develop heterogeneous systems that employ the peer-to-peer networking performance of Serial RapidIO using multiprocessor clusters interconnected by SpaceWire.

  19. Robust Routing Protocol For Digital Messages

    NASA Technical Reports Server (NTRS)

    Marvit, Maclen

    1994-01-01

    Refinement of the digital-message-routing protocol increases the fault tolerance of polled networks. AbNET-3 is the latest of the generic AbNET protocols for transmission of messages among computing nodes. The AbNET concept is described in "Multiple-Ring Digital Communication Network" (NPO-18133). AbNET-3 is specifically aimed at increasing the fault tolerance of a network in broadcast mode, in which one node broadcasts a message to and receives responses from all other nodes. Communication in a network of computers is maintained even when links fail.

  20. Integrated Data and Control Level Fault Tolerance Techniques for Signal Processing Computer Design

    DTIC Science & Technology

    1990-09-01

    TOLERANCE TECHNIQUES FOR SIGNAL PROCESSING COMPUTER DESIGN G. Robert Redinbo I. INTRODUCTION High-speed signal processing is an important application of...techniques and mathematical approaches will be expanded later to the situation where hardware errors and roundoff and quantization noise affect all...detect errors equal in number to the degree of g(X), the maximum permitted by the Singleton bound [13]. Real cyclic codes, primarily applicable to

  1. Sequential Test Strategies for Multiple Fault Isolation

    NASA Technical Reports Server (NTRS)

    Shakeri, M.; Pattipati, Krishna R.; Raghavan, V.; Patterson-Hine, Ann; Kell, T.

    1997-01-01

    In this paper, we consider the problem of constructing near optimal test sequencing algorithms for diagnosing multiple faults in redundant (fault-tolerant) systems. The computational complexity of solving the optimal multiple-fault isolation problem is super-exponential, that is, it is much more difficult than the single-fault isolation problem, which, by itself, is NP-hard. By employing concepts from information theory and Lagrangian relaxation, we present several static and dynamic (on-line or interactive) test sequencing algorithms for the multiple fault isolation problem that provide a trade-off between the degree of suboptimality and computational complexity. Furthermore, we present novel diagnostic strategies that generate a static diagnostic directed graph (digraph), instead of a static diagnostic tree, for multiple fault diagnosis. Using this approach, the storage complexity of the overall diagnostic strategy reduces substantially. Computational results based on real-world systems indicate that the size of a static multiple fault strategy is strictly related to the structure of the system, and that the use of an on-line multiple fault strategy can diagnose faults in systems with as many as 10,000 failure sources.
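
    A toy sketch of the information-theoretic idea behind such test-sequencing algorithms (Python; the fault set, tests, and probabilities are invented for illustration and are not from the paper): at each step, pick the test whose outcome is expected to reduce the entropy of the current fault-hypothesis distribution the most.

    ```python
    import math

    def entropy(probs):
        return -sum(p * math.log2(p) for p in probs.values() if p > 0)

    def expected_entropy_after(test, hypotheses):
        """test maps each fault hypothesis to the outcome (True/False) it would produce."""
        total = 0.0
        for outcome in (True, False):
            subset = {h: p for h, p in hypotheses.items() if test[h] == outcome}
            mass = sum(subset.values())
            if mass > 0:
                conditional = {h: p / mass for h, p in subset.items()}
                total += mass * entropy(conditional)
        return total

    # Prior over single-fault hypotheses and two candidate tests (pass/fail signatures).
    hypotheses = {"f1": 0.5, "f2": 0.3, "f3": 0.2}
    tests = {"t_bus": {"f1": True, "f2": True,  "f3": False},
             "t_cpu": {"f1": True, "f2": False, "f3": False}}

    best = min(tests, key=lambda name: expected_entropy_after(tests[name], hypotheses))
    print("run", best, "next")   # the test with the largest expected information gain
    ```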

  2. Delay/Disruption Tolerant Networking for the International Space Station (ISS)

    NASA Technical Reports Server (NTRS)

    Schlesinger, Adam; Willman, Brett M.; Pitts, Lee; Davidson, Suzanne R.; Pohlchuck, William A.

    2017-01-01

    Disruption Tolerant Networking (DTN) is an emerging data networking technology designed to abstract the hardware communication layer from the spacecraft/payload computing resources. DTN is specifically designed to operate in environments where link delays and disruptions are common (e.g., space-based networks). The National Aeronautics and Space Administration (NASA) has demonstrated DTN on several missions, such as the Deep Impact Networking (DINET) experiment, the Earth Observing Mission 1 (EO-1) and the Lunar Laser Communication Demonstration (LLCD). To further the maturation of DTN, NASA is implementing DTN protocols on the International Space Station (ISS). This paper explains the architecture of the ISS DTN network, the operational support for the system, the results from integrated ground testing, and the future work for DTN expansion.

  3. The X-38 Spacecraft Fault-Tolerant Avionics System

    NASA Technical Reports Server (NTRS)

    Kouba, Coy; Buscher, Deborah; Busa, Joseph

    2003-01-01

    In 1995 NASA began an experimental program to develop a reusable crew return vehicle (CRV) for the International Space Station. The purpose of the CRV was threefold: (i) to bring home an injured or ill crewmember; (ii) to bring home the entire crew if the Shuttle fleet was grounded; and (iii) to evacuate the crew in the case of an imminent Station threat (e.g., fire, decompression). Two approach-and-landing prototypes and one spacecraft demonstrator (called V201) were built at the Johnson Space Center. A series of increasingly complex ground subsystem tests were completed, and eight successful high-altitude drop tests were achieved to prove the design concept. In this program, an unprecedented amount of commercial off-the-shelf technology was utilized in the first crewed spacecraft NASA had built since the Shuttle program. Unfortunately, in 2002 the program was canceled due to changing Agency priorities. The vehicle was 80% complete, and the program was shut down in such a manner as to preserve design, development, test, and engineering data. This paper describes the X-38 V201 fault-tolerant avionics system. Based on Draper Laboratory's Byzantine-resilient fault-tolerant parallel processing system and their "network element" hardware, each flight computer exchanges information on a strict timescale to process input data, compare results, and issue voted vehicle output commands. Major accomplishments achieved in this development include: (i) a space-qualified two-fault-tolerant design using mostly COTS hardware and operating system; (ii) a single-event-upset-tolerant network element board; (iii) on-the-fly recovery of a failed processor; (iv) use of synched cache; (v) realignment of memory to bring back a failed channel; (vi) flight code automatically generated from the master measurement list; and (vii) construction in-house by a team of civil servants and support contractors. This paper presents an overview of the avionics system and the hardware implementation, as well as the system software and vehicle command and telemetry functions. Potential improvements and lessons learned on this program are also discussed.

  4. Statistical computation of tolerance limits

    NASA Technical Reports Server (NTRS)

    Wheeler, J. T.

    1993-01-01

    Based on a new theory, two computer codes were developed specifically to calculate the exact statistical tolerance limits for normal distributions with unknown means and variances for the one-sided and two-sided cases of the tolerance factor, k. The quantity k is defined equivalently in terms of the noncentral t-distribution by the probability equation. Two of the four mathematical methods employ the theory developed for the numerical simulation. Several algorithms for numerically integrating and iteratively root-solving the working equations are written to augment the program simulation. The program codes generate tables of k values associated with varying values of the proportion and sample size for each given probability, showing the accuracy obtained for small sample sizes.
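
    As a point of comparison (not the authors' code), the one-sided tolerance factor k defined through the noncentral t-distribution can be computed directly with SciPy: for coverage proportion p and confidence gamma on a sample of size n, k equals the gamma-quantile of the noncentral t-distribution with n-1 degrees of freedom and noncentrality z_p*sqrt(n), divided by sqrt(n).

    ```python
    import math
    from scipy.stats import norm, nct

    def one_sided_k(n, p, gamma):
        """One-sided normal tolerance factor via the noncentral t-distribution."""
        delta = norm.ppf(p) * math.sqrt(n)            # noncentrality parameter
        return nct.ppf(gamma, df=n - 1, nc=delta) / math.sqrt(n)

    # Example: 95% coverage with 95% confidence for a few sample sizes.
    for n in (5, 10, 30, 100):
        print(n, round(one_sided_k(n, 0.95, 0.95), 3))
    ```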

  5. Squid - a simple bioinformatics grid.

    PubMed

    Carvalho, Paulo C; Glória, Rafael V; de Miranda, Antonio B; Degrave, Wim M

    2005-08-03

    BLAST is a widely used genetic research tool for analysis of similarity between nucleotide and protein sequences. This paper presents a software application entitled "Squid" that makes use of grid technology. The current version, as an example, is configured for BLAST applications, but adaptation for other computing-intensive repetitive tasks can be easily accomplished in the open-source version. This enables the allocation of remote resources to perform distributed computing, making large BLAST queries viable without the need for high-end computers. Most distributed computing / grid solutions have complex installation procedures requiring a computer specialist, or have limitations regarding operating systems. Squid is a multi-platform, open-source program designed to "keep things simple" while offering high-end computing power for large-scale applications. Squid also has an efficient fault tolerance and crash recovery system against data loss, being able to re-route jobs upon node failure and recover even if the master machine fails. Our results show that a Squid application, working with N nodes and proper network resources, can process BLAST queries almost N times faster than if working with only one computer. Squid offers high-end computing, even for the non-specialist, and is freely available at the project web site. Its open-source and binary Windows distributions contain detailed instructions and a "plug-n-play" installation containing a pre-configured example.
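
    The fault-tolerance behavior described above (re-routing jobs when a node fails) amounts to a work queue with retries; an illustrative Python sketch, unrelated to Squid's actual implementation:

    ```python
    import queue, random

    def run_on_node(node, job):
        """Stand-in for dispatching a BLAST query to a worker node; may fail."""
        if random.random() < 0.2:                    # simulated node or network failure
            raise RuntimeError(f"{node} failed on {job}")
        return f"{job} done on {node}"

    def schedule(jobs, nodes, max_attempts=3):
        pending, results = queue.Queue(), []
        for j in jobs:
            pending.put((j, 0))
        while not pending.empty():
            job, attempts = pending.get()
            node = random.choice(nodes)
            try:
                results.append(run_on_node(node, job))
            except RuntimeError:
                if attempts + 1 >= max_attempts:
                    results.append(f"{job} abandoned")
                else:
                    pending.put((job, attempts + 1))   # re-route the job to another node
        return results

    print(schedule([f"query{i}" for i in range(5)], ["node-a", "node-b", "node-c"]))
    ```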

  6. Proactive Fault Tolerance for HPC with Xen Virtualization

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Nagarajan, Arun Babu; Mueller, Frank; Engelmann, Christian

    2007-01-01

    Large-scale high-performance computing systems now consist of thousands of processors. At such large counts of compute nodes, faults are becoming commonplace. Current techniques to tolerate faults focus on reactive schemes to recover from faults and generally rely on a checkpoint/restart mechanism. Yet, in today's systems, node failures can often be anticipated by detecting a deteriorating health status. Instead of a reactive scheme for fault tolerance (FT), we are promoting a proactive one where processes automatically migrate from "unhealthy" nodes to healthy ones. Our approach relies on operating system virtualization techniques exemplified by, but not limited to, Xen. This paper contributes an automatic and transparent mechanism for proactive FT for arbitrary MPI applications. It leverages virtualization techniques combined with health monitoring and load-based migration. We exploit Xen's live migration mechanism for a guest operating system (OS) to migrate an MPI task from a health-deteriorating node to a healthy one without stopping the MPI task during most of the migration. Our proactive FT daemon orchestrates the tasks of health monitoring, load determination, and initiation of guest OS migration. Experimental results demonstrate that live migration hides migration costs and limits the overhead to only a few seconds, making it an attractive approach to realizing FT in HPC systems. Overall, our enhancements make proactive FT a valuable asset for long-running MPI applications that is complementary to reactive FT using full checkpoint/restart schemes, since checkpoint frequencies can be reduced as fewer unanticipated failures are encountered. In the context of OS virtualization, we believe that this is the first comprehensive study of proactive fault tolerance where live migration is actually triggered by health monitoring.
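
    The decision logic such a daemon orchestrates can be summarized as a small policy: monitor node health indicators, and when a node crosses a threshold, pick the least-loaded healthy node as the migration target. The sketch below is schematic Python with invented thresholds and fields, not the Xen or monitoring tooling used in the paper.

    ```python
    TEMP_LIMIT_C = 75.0          # hypothetical health thresholds
    FAN_MIN_RPM = 1500

    def unhealthy(node):
        return node["temp_c"] > TEMP_LIMIT_C or node["fan_rpm"] < FAN_MIN_RPM

    def plan_migrations(nodes):
        """Return (source, destination) pairs: move guests off deteriorating nodes
        onto the least-loaded healthy node."""
        healthy = [n for n in nodes if not unhealthy(n)]
        plans = []
        for n in nodes:
            if unhealthy(n) and healthy:
                target = min(healthy, key=lambda h: h["load"])
                plans.append((n["name"], target["name"]))
                target["load"] += n["load"]          # account for the migrated guest
        return plans

    cluster = [
        {"name": "n01", "temp_c": 82.0, "fan_rpm": 2000, "load": 0.9},
        {"name": "n02", "temp_c": 55.0, "fan_rpm": 2100, "load": 0.4},
        {"name": "n03", "temp_c": 60.0, "fan_rpm": 2200, "load": 0.2},
    ]
    print(plan_migrations(cluster))    # [('n01', 'n03')]
    ```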

  7. Application of Intrusion Tolerance Technology to Joint Battlespace Infosphere (JBI)

    DTIC Science & Technology

    2003-02-01

    performance, scalability and Security Issues and Requirements for Internet-Scale Publish-Subscribe Systems Chenxi Wang, Antonio Carzaniga, David ...by the Defense Advanced Research Agency, under the agreement number F30602-96-1-0314. The work of David Evans was supported by in part by the...Future Generations of Computer Science. October 1998. [10]. D. Chaum , C. Crepeau, and I. Damgard. “Multiparty Unconditionally Secure Protocols,” In

  8. Computer Aided Design Parameters for Forward Basing

    DTIC Science & Technology

    1988-12-01

    21 meters. Systematic errors within limits stated for absolute accuracy are tolerated at this level. DEM data acquired photogrammetrically using manual ...This is a professional drawing package, 19 capable of the manipulation required for this project. With the AutoLISP programming language (a variation on...Table 2). 0 25 Data Conversion Package II GWN System’s Digital Terrain Modeling (DTM) package was used. This AutoLISP -based third party software is

  9. spammpack, Version 2013-06-18

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    2014-01-17

    This library is an implementation of the Sparse Approximate Matrix Multiplication (SpAMM) algorithm. It provides a matrix data type and an approximate matrix product that exhibits linear-scaling computational complexity for matrices with decay. The product error and the performance of the multiply can be tuned by choosing an appropriate tolerance. The library can be compiled for serial execution or parallel execution on shared-memory systems with an OpenMP-capable compiler.
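
    The tolerance-controlled product can be illustrated with a small recursive sketch in Python/NumPy (not the library's actual interface): a block product is skipped whenever the product of the sub-block norms falls below the tolerance, which is what yields reduced complexity for matrices with decay.

    ```python
    import numpy as np

    def spamm(A, B, tol, leaf=2):
        """Approximate A @ B, skipping sub-block products with ||A_sub||*||B_sub|| < tol."""
        n = A.shape[0]
        if np.linalg.norm(A) * np.linalg.norm(B) < tol:
            return np.zeros((n, B.shape[1]))          # block product deemed negligible
        if n <= leaf:
            return A @ B                               # dense product at the leaf level
        h = n // 2
        C = np.zeros((n, B.shape[1]))
        for i in (0, 1):
            for j in (0, 1):
                for k in (0, 1):
                    C[i*h:(i+1)*h, j*h:(j+1)*h] += spamm(
                        A[i*h:(i+1)*h, k*h:(k+1)*h],
                        B[k*h:(k+1)*h, j*h:(j+1)*h], tol, leaf)
        return C

    rng = np.random.default_rng(0)
    A = np.triu(rng.random((8, 8)) * np.exp(-np.arange(8)))   # toy matrix with decay
    B = A.copy()
    print(np.linalg.norm(spamm(A, B, tol=1e-3) - A @ B))      # approximation error stays small
    ```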

  10. Checkpointing Shared Memory Programs at the Application-level

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bronevetsky, G; Schulz, M; Szwed, P

    2004-09-08

    Trends in high-performance computing are making it necessary for long-running applications to tolerate hardware faults. The most commonly used approach is checkpoint and restart (CPR): the state of the computation is saved periodically on disk, and when a failure occurs, the computation is restarted from the last saved state. At present, it is the responsibility of the programmer to instrument applications for CPR. Our group is investigating the use of compiler technology to instrument codes to make them self-checkpointing and self-restarting, thereby providing an automatic solution to the problem of making long-running scientific applications resilient to hardware faults. Our previous work focused on message-passing programs. In this paper, we describe such a system for shared-memory programs running on symmetric multiprocessors. The system has two components: (i) a pre-compiler for source-to-source modification of applications, and (ii) a runtime system that implements a protocol for coordinating CPR among the threads of the parallel application. For the sake of concreteness, we focus on a non-trivial subset of OpenMP that includes barriers and locks. One of the advantages of this approach is that the ability to tolerate faults becomes embedded within the application itself, so applications become self-checkpointing and self-restarting on any platform. We demonstrate this by showing that our transformed benchmarks can checkpoint and restart on three different platforms (Windows/x86, Linux/x86, and Tru64/Alpha). Our experiments show that the overhead introduced by this approach is usually quite small; they also suggest ways in which the current implementation can be tuned to reduce overheads further.
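
    At its simplest, application-level checkpoint/restart means the program itself periodically saves the variables needed to resume. A schematic single-threaded Python sketch follows (the paper's system instruments C/OpenMP programs automatically, which this does not attempt to reproduce; the file name and state layout are invented):

    ```python
    import os, pickle

    CKPT = "state.ckpt"

    def load_checkpoint():
        """Resume from the last saved state, or start fresh if none exists."""
        if os.path.exists(CKPT):
            with open(CKPT, "rb") as f:
                return pickle.load(f)
        return {"step": 0, "acc": 0.0}

    def save_checkpoint(state):
        tmp = CKPT + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp, CKPT)               # atomic rename avoids torn checkpoints

    state = load_checkpoint()
    for step in range(state["step"], 1000):
        state["acc"] += step * 1e-3         # stand-in for the long-running computation
        state["step"] = step + 1
        if step % 100 == 0:
            save_checkpoint(state)          # periodic checkpoint to disk
    print(state["acc"])
    ```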

  11. Advanced information processing system: Fault injection study and results

    NASA Technical Reports Server (NTRS)

    Burkhardt, Laura F.; Masotto, Thomas K.; Lala, Jaynarayan H.

    1992-01-01

    The objective of the AIPS program is to achieve a validated fault tolerant distributed computer system. The goals of the AIPS fault injection study were: (1) to present the fault injection study components addressing the AIPS validation objective; (2) to obtain feedback for fault removal from the design implementation; (3) to obtain statistical data regarding fault detection, isolation, and reconfiguration responses; and (4) to obtain data regarding the effects of faults on system performance. The parameters are described that must be varied to create a comprehensive set of fault injection tests, the subset of test cases selected, the test case measurements, and the test case execution. Both pin level hardware faults using a hardware fault injector and software injected memory mutations were used to test the system. An overview is provided of the hardware fault injector and the associated software used to carry out the experiments. Detailed specifications are given of fault and test results for the I/O Network and the AIPS Fault Tolerant Processor, respectively. The results are summarized and conclusions are given.

  12. Fault-Tolerant Software-Defined Radio on Manycore

    NASA Technical Reports Server (NTRS)

    Ricketts, Scott

    2015-01-01

    Software-defined radio (SDR) platforms generally rely on field-programmable gate arrays (FPGAs) and digital signal processors (DSPs), but such architectures require significant software development. In addition, application demands for radiation mitigation and fault tolerance exacerbate programming challenges. MaXentric Technologies, LLC, has developed a manycore-based SDR technology that provides 100 times the throughput of conventional radiation-hardened general-purpose processors. Manycore systems (30-100 cores and beyond) have the potential to provide high processing performance at error rates that are equivalent to current space-deployed uniprocessor systems. MaXentric's innovation is a highly flexible radio, providing over-the-air reconfiguration; adaptability; and uninterrupted, real-time, multimode operation. The technology is also compliant with NASA's Space Telecommunications Radio System (STRS) architecture. In addition to its many uses within NASA communications, the SDR can also serve as a highly programmable research-stage prototyping device for new waveforms and other communications technologies. It can also support noncommunication codes on its multicore processor, collocated with the communications workload, reducing the size, weight, and power of the overall system by aggregating processing jobs onto a single board computer.

  13. Augmented design and analysis of computer experiments: a novel tolerance embedded global optimization approach applied to SWIR hyperspectral illumination design.

    PubMed

    Keresztes, Janos C; John Koshel, R; D'huys, Karlien; De Ketelaere, Bart; Audenaert, Jan; Goos, Peter; Saeys, Wouter

    2016-12-26

    A novel meta-heuristic approach for minimizing nonlinear constrained problems is proposed, which offers tolerance information during the search for the global optimum. The method is based on the concept of design and analysis of computer experiments combined with a novel two-phase design augmentation (DACEDA), which models the entire merit space using a Gaussian process, with iteratively increased resolution around the optimum. The algorithm is introduced through a series of case studies of increasing complexity for optimizing the uniformity of a short-wave infrared (SWIR) hyperspectral imaging (HSI) illumination system (IS). The method is first demonstrated for a two-dimensional problem consisting of the positioning of analytical isotropic point sources. The method is further applied to two-dimensional (2D) and five-dimensional (5D) SWIR HSI IS versions using close- and far-field measured source models applied within the non-sequential ray-tracing software FRED, including inherent stochastic noise. The proposed method is compared to other heuristic approaches such as simplex and simulated annealing (SA). It is shown that DACEDA converges towards a minimum with 1% improvement compared to simplex and SA, and more importantly requires only half the number of simulations. Finally, a concurrent tolerance analysis is done within DACEDA for the five-dimensional case such that further simulations are not required.
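
    The core loop of such a design-augmentation scheme, in heavily reduced form (Python with scikit-learn and a toy one-dimensional merit function; nothing here reproduces the authors' DACEDA code): fit a Gaussian process to the evaluated designs, then add new evaluations clustered around the current predicted optimum to increase resolution there.

    ```python
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    def merit(x):                                 # stand-in for an expensive ray-trace
        return (x - 0.3) ** 2 + 0.05 * np.sin(20 * x)

    rng = np.random.default_rng(1)
    X = rng.uniform(0, 1, 8).reshape(-1, 1)       # phase 1: space-filling design
    y = merit(X).ravel()

    for _ in range(5):                            # phase 2: augment around the optimum
        gp = GaussianProcessRegressor(kernel=RBF(0.1), normalize_y=True).fit(X, y)
        grid = np.linspace(0, 1, 201).reshape(-1, 1)
        x_best = grid[np.argmin(gp.predict(grid))]
        new = np.clip(x_best + rng.normal(0, 0.05, (3, 1)), 0, 1)   # refine locally
        X = np.vstack([X, new])
        y = np.append(y, merit(new).ravel())

    print("estimated optimum near x =", round(float(x_best[0]), 3))
    ```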

  14. A fault-tolerant addressable spin qubit in a natural silicon quantum dot

    PubMed Central

    Takeda, Kenta; Kamioka, Jun; Otsuka, Tomohiro; Yoneda, Jun; Nakajima, Takashi; Delbecq, Matthieu R.; Amaha, Shinichi; Allison, Giles; Kodera, Tetsuo; Oda, Shunri; Tarucha, Seigo

    2016-01-01

    Fault-tolerant quantum computing requires high-fidelity qubits. This has been achieved in various solid-state systems, including isotopically purified silicon, but is yet to be accomplished in industry-standard natural (unpurified) silicon, mainly as a result of the dephasing caused by residual nuclear spins. This high fidelity can be achieved by speeding up the qubit operation and/or prolonging the dephasing time, that is, increasing the Rabi oscillation quality factor Q (the Rabi oscillation decay time divided by the π rotation time). In isotopically purified silicon quantum dots, only the second approach has been used, leaving the qubit operation slow. We apply the first approach to demonstrate an addressable fault-tolerant qubit using a natural silicon double quantum dot with a micromagnet that is optimally designed for fast spin control. This optimized design allows access to Rabi frequencies up to 35 MHz, which is two orders of magnitude greater than that achieved in previous studies. We find the optimum Q = 140 in such high-frequency range at a Rabi frequency of 10 MHz. This leads to a qubit fidelity of 99.6% measured via randomized benchmarking, which is the highest reported for natural silicon qubits and comparable to that obtained in isotopically purified silicon quantum dot–based qubits. This result can inspire contributions to quantum computing from industrial communities. PMID:27536725

  15. A fault-tolerant addressable spin qubit in a natural silicon quantum dot.

    PubMed

    Takeda, Kenta; Kamioka, Jun; Otsuka, Tomohiro; Yoneda, Jun; Nakajima, Takashi; Delbecq, Matthieu R; Amaha, Shinichi; Allison, Giles; Kodera, Tetsuo; Oda, Shunri; Tarucha, Seigo

    2016-08-01

    Fault-tolerant quantum computing requires high-fidelity qubits. This has been achieved in various solid-state systems, including isotopically purified silicon, but is yet to be accomplished in industry-standard natural (unpurified) silicon, mainly as a result of the dephasing caused by residual nuclear spins. This high fidelity can be achieved by speeding up the qubit operation and/or prolonging the dephasing time, that is, increasing the Rabi oscillation quality factor Q (the Rabi oscillation decay time divided by the π rotation time). In isotopically purified silicon quantum dots, only the second approach has been used, leaving the qubit operation slow. We apply the first approach to demonstrate an addressable fault-tolerant qubit using a natural silicon double quantum dot with a micromagnet that is optimally designed for fast spin control. This optimized design allows access to Rabi frequencies up to 35 MHz, which is two orders of magnitude greater than that achieved in previous studies. We find the optimum Q = 140 in such high-frequency range at a Rabi frequency of 10 MHz. This leads to a qubit fidelity of 99.6% measured via randomized benchmarking, which is the highest reported for natural silicon qubits and comparable to that obtained in isotopically purified silicon quantum dot-based qubits. This result can inspire contributions to quantum computing from industrial communities.

  16. Adjoint-Based, Three-Dimensional Error Prediction and Grid Adaptation

    NASA Technical Reports Server (NTRS)

    Park, Michael A.

    2002-01-01

    Engineering computational fluid dynamics (CFD) analysis and design applications focus on output functions (e.g., lift, drag). Errors in these output functions are generally unknown and conservatively accurate solutions may be computed. Computable error estimates can offer the possibility to minimize computational work for a prescribed error tolerance. Such an estimate can be computed by solving the flow equations and the linear adjoint problem for the functional of interest. The computational mesh can be modified to minimize the uncertainty of a computed error estimate. This robust mesh-adaptation procedure automatically terminates when the simulation is within a user specified error tolerance. This procedure for estimating and adapting to error in a functional is demonstrated for three-dimensional Euler problems. An adaptive mesh procedure that links to a Computer Aided Design (CAD) surface representation is demonstrated for wing, wing-body, and extruded high lift airfoil configurations. The error estimation and adaptation procedure yielded corrected functions that are as accurate as functions calculated on uniformly refined grids with ten times as many grid points.

  17. Markov reward processes

    NASA Technical Reports Server (NTRS)

    Smith, R. M.

    1991-01-01

    Numerous applications in the area of computer system analysis can be effectively studied with Markov reward models. These models describe the behavior of the system with a continuous-time Markov chain, where a reward rate is associated with each state. In a reliability/availability model, up states may have reward rate 1 and down states reward rate 0. In a queueing model, the number of jobs of a certain type in a given state may be the reward rate attached to that state. In a combined model of performance and reliability, the reward rate of a state may be the computational capacity, or a related performance measure. Expected steady-state reward rate and expected instantaneous reward rate are clearly useful measures of the Markov reward model. More generally, the distribution of accumulated reward or time-averaged reward over a finite time interval may be determined from the solution of the Markov reward model. This information is of great practical significance in situations where the workload can be well characterized (deterministically, or by continuous functions, e.g., distributions). The design process in the development of a computer system is an expensive and long-term endeavor. For aerospace applications the reliability of the computer system is essential, as is the ability to complete critical workloads in a well-defined real-time interval. Consequently, effective modeling of such systems must take into account both performance and reliability. This fact motivates our use of Markov reward models to aid in the development and evaluation of fault tolerant computer systems.
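
    A minimal sketch of the kind of Markov reward model described above (a toy two-state availability model, not one from the paper): the steady-state probabilities of a continuous-time Markov chain are combined with per-state reward rates; the failure and repair rates below are illustrative.

```python
# Toy Markov reward model: two-state availability model with failure rate lam,
# repair rate mu, reward 1 in the up state and 0 in the down state.
import numpy as np

lam, mu = 1e-4, 1e-1                 # hypothetical failure / repair rates per hour
Q = np.array([[-lam,  lam],          # generator of the continuous-time Markov chain
              [  mu,  -mu]])
reward = np.array([1.0, 0.0])        # reward rate attached to each state

# Steady-state probabilities: pi Q = 0 with sum(pi) = 1.
A = np.vstack([Q.T, np.ones(2)])
b = np.array([0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

print("steady-state expected reward rate (availability):", pi @ reward)
print("closed form mu / (lam + mu):", mu / (lam + mu))
```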

  18. Study on fault-tolerant processors for advanced launch system

    NASA Technical Reports Server (NTRS)

    Shin, Kang G.; Liu, Jyh-Charn

    1990-01-01

    Issues related to the reliability of a redundant system with large main memory are addressed. The Fault-Tolerant Processor (FTP) for the Advanced Launch System (ALS) is used as a basis for the presentation. When the system is free of latent faults, the probability of system crash due to multiple channel faults is shown to be insignificant even when voting on the outputs of computing channels is infrequent. Using channel error maskers (CEMs) is shown to improve reliability more effectively than increasing redundancy or the number of channels for applications with long mission times. Even without using a voter, most memory errors can be immediately corrected by those CEMs implemented with conventional coding techniques. In addition to their ability to enhance system reliability, CEMs (with a very low hardware overhead) can be used to dramatically reduce not only the need for memory realignment, but also the time required to realign channel memories in the rare case that such a need arises. Using CEMs, two different schemes were developed to solve the memory realignment problem. In both schemes, most errors are corrected by CEMs, and the remaining errors are masked by a voter.
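
    The masking idea can be illustrated with a bitwise majority voter over three redundant channel words. This toy sketch is not the FTP/ALS channel error masker design, which is coding-based, but it shows how an error confined to a single channel is masked.

```python
# Illustrative sketch: bitwise 2-of-3 majority voting across redundant channel memory words
# masks any error confined to a single channel.
def majority3(a: int, b: int, c: int) -> int:
    # Each output bit is 1 iff at least two of the three input bits are 1.
    return (a & b) | (a & c) | (b & c)

word = 0b1011_0110
ch_a, ch_b, ch_c = word, word ^ 0b0000_0100, word    # channel B suffers a single-bit upset

assert majority3(ch_a, ch_b, ch_c) == word           # the fault is masked
print(f"voted word: {majority3(ch_a, ch_b, ch_c):#010b} (error in channel B masked)")
```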

  19. Fault tolerant testbed evaluation, phase 1

    NASA Technical Reports Server (NTRS)

    Caluori, V., Jr.; Newberry, T.

    1993-01-01

    In recent years, avionics systems development costs have become the driving factor in the development of space systems, military aircraft, and commercial aircraft. A method of reducing avionics development costs is to utilize state-of-the-art software application generator (autocode) tools and methods. The recent maturity of application generator technology has the potential to dramatically reduce development costs by eliminating software development steps that have historically introduced errors and the need for re-work. Application generator tools have been demonstrated to be an effective method for autocoding non-redundant, relatively low-rate input/output (I/O) applications on the Space Station Freedom (SSF) program; however, they have not been demonstrated for fault tolerant, high-rate I/O, flight critical environments. This contract will evaluate the use of application generators in these harsh environments. Using Boeing's quad-redundant avionics system controller as the target system, Space Shuttle Guidance, Navigation, and Control (GN&C) software will be autocoded, tested, and evaluated in the Johnson (Space Center) Avionics Engineering Laboratory (JAEL). The response of the autocoded system will be shown to match the response of the existing Shuttle General Purpose Computers (GPC's), thereby demonstrating the viability of using autocode techniques in the development of future avionics systems.

  20. Software/hardware distributed processing network supporting the Ada environment

    NASA Astrophysics Data System (ADS)

    Wood, Richard J.; Pryk, Zen

    1993-09-01

    A high-performance, fault-tolerant, distributed network has been developed, tested, and demonstrated. The network is based on the MIPS Computer Systems, Inc. R3000 RISC processor, VHSIC ASICs for high-speed, reliable inter-node communications, and compatible commercial memory and I/O boards. The network is an evolution of the Advanced Onboard Signal Processor (AOSP) architecture. It supports Ada application software with an Ada-implemented operating system. A six-node implementation (capable of expansion up to 256 nodes) of the RISC multiprocessor architecture provides 120 MIPS of scalar throughput, 96 Mbytes of RAM and 24 Mbytes of non-volatile memory. The network provides for all ground processing applications, has merit as a space-qualified RISC-based network, and interfaces to advanced Computer Aided Software Engineering (CASE) tools for application software development.

  1. Space Tug Avionics Definition Study. Volume 5: Cost and Programmatics

    NASA Technical Reports Server (NTRS)

    1975-01-01

    The baseline avionics system features a central digital computer that integrates the functions of all the space tug subsystems by means of a redundant digital data bus. The central computer consists of dual central processor units, dual input/output processors, and a fault tolerant memory, utilizing internal redundancy and error checking. Three electronically steerable phased arrays provide downlink transmission from any tug attitude directly to ground or via TDRS. Six laser gyros and six accelerometers in a dodecahedron configuration make up the inertial measurement unit. Both a scanning laser radar and a TV system, employing strobe lamps, are required as acquisition and docking sensors. Primary dc power at a nominal 28 volts is supplied from dual lightweight, thermally integrated fuel cells which operate from propellant grade reactants out of the main tanks.

  2. Scalable Energy Efficiency with Resilience for High Performance Computing Systems: A Quantitative Methodology

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Tan, Li; Chen, Zizhong; Song, Shuaiwen

    2016-01-18

    Energy efficiency and resilience are two crucial challenges for HPC systems to reach exascale. While energy efficiency and resilience issues have been extensively studied individually, little has been done to understand the interplay between energy efficiency and resilience for HPC systems. Decreasing the supply voltage associated with a given operating frequency for processors and other CMOS-based components can significantly reduce power consumption. However, this often raises system failure rates and consequently increases application execution time. In this work, we present an energy saving undervolting approach that leverages the mainstream resilience techniques to tolerate the increased failures caused by undervolting.
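
    A back-of-the-envelope sketch of the trade-off described above: lowering the supply voltage reduces dynamic power (roughly as V²·f) but raises the failure rate, which inflates run time. All numbers and the simple recovery-cost model below are illustrative assumptions, not results from the paper.

```python
# Illustrative undervolting trade-off: dynamic power ~ V^2 * f, while a lower voltage
# raises the failure rate and each failure adds recovery/rework time.
def expected_energy(volt, base_time_h, fail_per_h, freq_ghz=2.0, recover_h=0.5):
    power = volt ** 2 * freq_ghz                    # relative dynamic power (arbitrary units)
    failures = fail_per_h * base_time_h             # expected number of failures during the run
    time_h = base_time_h + failures * recover_h     # crude expected runtime including recovery
    return power * time_h, time_h

for name, volt, fail_rate in [("nominal", 1.00, 0.01), ("undervolted", 0.85, 0.05)]:
    energy, time_h = expected_energy(volt, base_time_h=10.0, fail_per_h=fail_rate)
    print(f"{name:12s} runtime {time_h:5.2f} h   energy {energy:6.2f} (arb. units)")
```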

  3. Scalable Energy Efficiency with Resilience for High Performance Computing Systems: A Quantitative Methodology

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Tan, Li; Chen, Zizhong; Song, Shuaiwen Leon

    2015-11-16

    Energy efficiency and resilience are two crucial challenges for HPC systems to reach exascale. While energy efficiency and resilience issues have been extensively studied individually, little has been done to understand the interplay between energy efficiency and resilience for HPC systems. Decreasing the supply voltage associated with a given operating frequency for processors and other CMOS-based components can significantly reduce power consumption. However, this often raises system failure rates and consequently increases application execution time. In this work, we present an energy saving undervolting approach that leverages the mainstream resilience techniques to tolerate the increased failures caused by undervolting.

  4. Computers Take Flight: A History of NASA's Pioneering Digital Fly-By-Wire Project

    NASA Technical Reports Server (NTRS)

    Tomayko, James E.

    2000-01-01

    An overview of the NASA F-8 Fly-By-Wire project is presented. The project made two significant contributions to the new technology: (1) a solid design base of techniques that work and those that do not, and (2) credible evidence of good flying qualities and the ability of such a system to tolerate real faults and to continue operation without degradation. In 1972 the F-8C aircraft used in the program became the first digital fly-by-wire aircraft to operate without a mechanical backup system.

  5. Care 3, Phase 1, volume 1

    NASA Technical Reports Server (NTRS)

    Stiffler, J. J.; Bryant, L. A.; Guccione, L.

    1979-01-01

    A computer program to aid in assessing the reliability of fault-tolerant avionics systems was developed. A simple mathematical expression was used to evaluate the reliability of any redundant configuration over any interval during which the failure rates and coverage parameters remained unaffected by configuration changes. Provision was made for convolving such expressions in order to evaluate the reliability of a dual-mode system. A coverage model was also developed to determine the various relevant coverage coefficients as a function of the available hardware and software fault detector characteristics, and subsequent isolation and recovery delay statistics.
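
    In the spirit of the "simple mathematical expression" mentioned above, the sketch below evaluates a textbook reliability formula for a triplex (2-of-3) configuration with a constant per-channel failure rate. The rates and mission time are made-up values, and CARE 3's coverage model is not reproduced.

```python
# Toy reliability expression: a triplex (2-of-3) configuration with constant
# per-channel failure rate over a mission of length t (my example, not CARE 3's model).
import math

def r_channel(lam: float, t: float) -> float:
    return math.exp(-lam * t)                 # single-channel reliability over [0, t]

def r_tmr(lam: float, t: float) -> float:
    r = r_channel(lam, t)
    return 3 * r**2 - 2 * r**3                # survives if at least 2 of 3 channels survive

lam = 1e-4     # hypothetical failures per hour
t = 10.0       # mission time in hours
print(f"simplex: {r_channel(lam, t):.6f}   TMR: {r_tmr(lam, t):.6f}")
```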

  6. SMART-FDIR: Use of Artificial Intelligence in the Implementation of a Satellite FDIR

    NASA Astrophysics Data System (ADS)

    Guiotto, A.; Martelli, A.; Paccagnini, C.

    Space activities nowadays are characterized by increased constraints on on-board computing power and functional complexity, combined with reductions in cost and schedule. This scenario necessarily impacts the on-board software, with particular emphasis on the interfaces between the on-board software and system/mission-level requirements. The questions are: How can the effectiveness of space system software design be improved? How can sophistication in the areas of autonomy and failure tolerance be increased while maintaining the necessary quality with acceptable risk?

  7. Applications of an architecture design and assessment system (ADAS)

    NASA Technical Reports Server (NTRS)

    Gray, F. Gail; Debrunner, Linda S.; White, Tennis S.

    1988-01-01

    A new Architecture Design and Assessment System (ADAS) tool package is introduced, and a range of possible applications is illustrated. ADAS was used to evaluate the performance of an advanced fault-tolerant computer architecture in a modern flight control application. Bottlenecks were identified and possible solutions suggested. The tool was also used to inject faults into the architecture and evaluate the synchronization algorithm, and improvements are suggested. Finally, ADAS was used as a front end research tool to aid in the design of reconfiguration algorithms in a distributed array architecture.

  8. Pulse sequences for suppressing leakage in single-qubit gate operations

    NASA Astrophysics Data System (ADS)

    Ghosh, Joydip; Coppersmith, S. N.; Friesen, Mark

    2017-06-01

    Many realizations of solid-state qubits involve couplings to leakage states lying outside the computational subspace, posing a threat to high-fidelity quantum gate operations. Mitigating leakage errors is especially challenging when the coupling strength is unknown, e.g., when it is caused by noise. Here we show that simple pulse sequences can be used to strongly suppress leakage errors for a qubit embedded in a three-level system. As an example, we apply our scheme to the recently proposed charge quadrupole (CQ) qubit for quantum dots. These results provide a solution to a key challenge for fault-tolerant quantum computing with solid-state elements.

  9. Multicast Delayed Authentication For Streaming Synchrophasor Data in the Smart Grid

    PubMed Central

    Câmara, Sérgio; Anand, Dhananjay; Pillitteri, Victoria; Carmo, Luiz

    2017-01-01

    Multicast authentication of synchrophasor data is challenging due to the design requirements of Smart Grid monitoring systems such as low security overhead, tolerance of lossy networks, time-criticality and high data rates. In this work, we propose inf-TESLA, Infinite Timed Efficient Stream Loss-tolerant Authentication, a multicast delayed authentication protocol for communication links used to stream synchrophasor data for wide area control of electric power networks. Our approach is based on the authentication protocol TESLA but is augmented to accommodate high frequency transmissions of unbounded length. The inf-TESLA protocol utilizes the Dual Offset Key Chains mechanism to reduce the authentication delay and computational cost associated with key chain commitment. We provide a description of the mechanism using two different modes for disclosing keys and demonstrate its security against a man-in-the-middle attack attempt. We compare our approach against the TESLA protocol in a 2-day simulation scenario, showing a reduction of 15.82% and 47.29% in computational cost for the sender and receiver, respectively, and a cumulative reduction in the communication overhead. PMID:28736582
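
    As background, the sketch below shows the basic TESLA-style one-way key chain that such protocols build on: keys are generated by repeated hashing and disclosed in reverse order, and a receiver verifies a disclosed key by hashing it back to the authenticated commitment. The Dual Offset Key Chains mechanism of inf-TESLA is not reproduced here.

```python
# Minimal TESLA-style one-way key chain (illustrative only, not inf-TESLA itself).
import hashlib
import os

def h(x: bytes) -> bytes:
    return hashlib.sha256(x).digest()

N = 1000
seed = os.urandom(32)
chain = [seed]
for _ in range(N):                      # K_i = H(K_{i+1}); keys are disclosed in reverse order
    chain.append(h(chain[-1]))
commitment = chain[-1]                  # K_0, distributed authentically at session start

def verify(disclosed_key: bytes, interval: int) -> bool:
    # A key disclosed for a given interval must hash back to the commitment.
    k = disclosed_key
    for _ in range(interval):
        k = h(k)
    return k == commitment

assert verify(chain[N - 3], 3)          # the key 3 hashes away from K_0 verifies correctly
```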

  10. Numerical Simulations of Reacting Flows Using Asynchrony-Tolerant Schemes for Exascale Computing

    NASA Astrophysics Data System (ADS)

    Cleary, Emmet; Konduri, Aditya; Chen, Jacqueline

    2017-11-01

    Communication and data synchronization between processing elements (PEs) are likely to pose a major challenge in scalability of solvers at the exascale. Recently developed asynchrony-tolerant (AT) finite difference schemes address this issue by relaxing communication and synchronization between PEs at a mathematical level while preserving accuracy, resulting in improved scalability. The performance of these schemes has been validated for simple linear and nonlinear homogeneous PDEs. However, many problems of practical interest are governed by highly nonlinear PDEs with source terms, whose solution may be sensitive to perturbations caused by communication asynchrony. The current work applies the AT schemes to combustion problems with chemical source terms, yielding a stiff system of PDEs with nonlinear source terms highly sensitive to temperature. Examples shown will use single-step and multi-step CH4 mechanisms for 1D premixed and nonpremixed flames. Error analysis will be discussed both in physical and spectral space. Results show that additional errors introduced by the AT schemes are negligible and the schemes preserve their accuracy. We acknowledge funding from the DOE Computational Science Graduate Fellowship administered by the Krell Institute.
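
    A toy illustration of the underlying idea of computing with delayed neighbour data (not the asynchrony-tolerant schemes themselves, which modify the stencil weights to retain accuracy): an explicit 1-D heat-equation update in which one halo value from a neighbouring PE is refreshed only every other step.

```python
# Toy example of relaxed halo synchronization on a periodic 1-D heat equation.
import numpy as np

nx, alpha, steps = 64, 1.0, 200
dx, dt = 1.0 / 64, 1e-4
r = alpha * dt / dx**2                      # r ~ 0.41 < 0.5, explicit scheme is stable

u_sync = np.sin(2 * np.pi * np.linspace(0.0, 1.0, nx, endpoint=False))
u_async = u_sync.copy()
stale_left = u_async[-1]                    # halo value "received" from the neighbouring PE

for n in range(steps):
    u_sync = u_sync + r * (np.roll(u_sync, -1) - 2 * u_sync + np.roll(u_sync, 1))

    if n % 2 == 0:                          # halo exchanged only every other step
        stale_left = u_async[-1]
    left = np.roll(u_async, 1)
    left[0] = stale_left                    # possibly one step stale
    right = np.roll(u_async, -1)
    u_async = u_async + r * (right - 2 * u_async + left)

print("max deviation caused by the delayed halo:", np.abs(u_sync - u_async).max())
```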

  11. Multicast Delayed Authentication For Streaming Synchrophasor Data in the Smart Grid.

    PubMed

    Câmara, Sérgio; Anand, Dhananjay; Pillitteri, Victoria; Carmo, Luiz

    2016-01-01

    Multicast authentication of synchrophasor data is challenging due to the design requirements of Smart Grid monitoring systems such as low security overhead, tolerance of lossy networks, time-criticality and high data rates. In this work, we propose inf -TESLA, Infinite Timed Efficient Stream Loss-tolerant Authentication, a multicast delayed authentication protocol for communication links used to stream synchrophasor data for wide area control of electric power networks. Our approach is based on the authentication protocol TESLA but is augmented to accommodate high frequency transmissions of unbounded length. inf TESLA protocol utilizes the Dual Offset Key Chains mechanism to reduce authentication delay and computational cost associated with key chain commitment. We provide a description of the mechanism using two different modes for disclosing keys and demonstrate its security against a man-in-the-middle attack attempt. We compare our approach against the TESLA protocol in a 2-day simulation scenario, showing a reduction of 15.82% and 47.29% in computational cost, sender and receiver respectively, and a cumulative reduction in the communication overhead.

  12. The application of artificial intelligence in the optimal design of mechanical systems

    NASA Astrophysics Data System (ADS)

    Poteralski, A.; Szczepanik, M.

    2016-11-01

    The paper is devoted to new computational techniques in mechanical optimization, where one tries to study, model, analyze and optimize very complex phenomena for which more precise scientific tools of the past were incapable of giving a low-cost and complete solution. Soft computing methods differ from conventional (hard) computing in that, unlike hard computing, they are tolerant of imprecision, uncertainty, partial truth and approximation. The paper deals with an application of bio-inspired methods, such as evolutionary algorithms (EA), artificial immune systems (AIS) and particle swarm optimizers (PSO), to optimization problems. Structures considered in this work are analyzed by the finite element method (FEM), the boundary element method (BEM) and the method of fundamental solutions (MFS). The bio-inspired methods are applied to optimize the shape, topology and material properties of 2D, 3D and coupled 2D/3D structures, to optimize thermomechanical structures, to optimize parameters of composite structures modeled by the FEM, to optimize elastic vibrating systems, to identify the material constants for piezoelectric materials modeled by the BEM, and to identify parameters in an acoustics problem modeled by the MFS.

  13. Application of Particle Swarm Optimization in Computer Aided Setup Planning

    NASA Astrophysics Data System (ADS)

    Kafashi, Sajad; Shakeri, Mohsen; Abedini, Vahid

    2011-01-01

    Recent research aims to integrate computer-aided design (CAD) and computer-aided manufacturing (CAM) environments. The role of process planning is to convert the design specification into manufacturing instructions. Setup planning has a basic role in computer-aided process planning (CAPP) and significantly affects the overall cost and quality of the machined part. This research focuses on the automatic generation of setups and on finding the best feasible setup plan. In order to computerize the setup planning process, three major steps are performed in the proposed system: (a) extraction of the machining data of the part, (b) analysis and generation of all possible setups, and (c) optimization to reach the best setup plan based on cost functions. Considering workshop resources such as machine tool, cutter and fixture, all feasible setups can be generated. The problem is then constrained by technological considerations such as TAD (tool approach direction), tolerance relationships and feature precedence relationships to give a completely realistic and practical approach. The optimal setup plan is the result of applying the PSO (particle swarm optimization) algorithm to the system using cost functions. A real sample part is used to illustrate the performance and productivity of the system.
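
    A minimal, generic particle swarm optimization sketch is shown below; the setup-planning cost functions and constraints of the paper are not modelled, and the quadratic cost is only a stand-in.

```python
# Generic PSO sketch: each particle tracks its personal best and is pulled toward the swarm best.
import numpy as np

def cost(x):                              # hypothetical stand-in for a setup-plan cost function
    return np.sum((x - 0.7) ** 2, axis=-1)

rng = np.random.default_rng(1)
n, dim, w, c1, c2 = 30, 5, 0.72, 1.49, 1.49
x = rng.uniform(0.0, 1.0, (n, dim))
v = np.zeros((n, dim))
pbest, pbest_val = x.copy(), cost(x)
gbest = pbest[np.argmin(pbest_val)]

for _ in range(200):
    r1, r2 = rng.uniform(size=(2, n, dim))
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    x = x + v
    val = cost(x)
    better = val < pbest_val
    pbest[better], pbest_val[better] = x[better], val[better]
    gbest = pbest[np.argmin(pbest_val)]

print("best cost found:", pbest_val.min())
```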

  14. Silicon CMOS architecture for a spin-based quantum computer.

    PubMed

    Veldhorst, M; Eenink, H G J; Yang, C H; Dzurak, A S

    2017-12-15

    Recent advances in quantum error correction codes for fault-tolerant quantum computing and physical realizations of high-fidelity qubits in multiple platforms give promise for the construction of a quantum computer based on millions of interacting qubits. However, the classical-quantum interface remains a nascent field of exploration. Here, we propose an architecture for a silicon-based quantum computer processor based on complementary metal-oxide-semiconductor (CMOS) technology. We show how a transistor-based control circuit together with charge-storage electrodes can be used to operate a dense and scalable two-dimensional qubit system. The qubits are defined by the spin state of a single electron confined in quantum dots, coupled via exchange interactions, controlled using a microwave cavity, and measured via gate-based dispersive readout. We implement a spin qubit surface code, showing the prospects for universal quantum computation. We discuss the challenges and focus areas that need to be addressed, providing a path for large-scale quantum computing.

  15. Roads towards fault-tolerant universal quantum computation

    NASA Astrophysics Data System (ADS)

    Campbell, Earl T.; Terhal, Barbara M.; Vuillot, Christophe

    2017-09-01

    A practical quantum computer must not merely store information, but also process it. To prevent errors introduced by noise from multiplying and spreading, a fault-tolerant computational architecture is required. Current experiments are taking the first steps toward noise-resilient logical qubits. But to convert these quantum devices from memories to processors, it is necessary to specify how a universal set of gates is performed on them. The leading proposals for doing so, such as magic-state distillation and colour-code techniques, have high resource demands. Alternative schemes, such as those that use high-dimensional quantum codes in a modular architecture, have potential benefits, but need to be explored further.

  16. Roads towards fault-tolerant universal quantum computation.

    PubMed

    Campbell, Earl T; Terhal, Barbara M; Vuillot, Christophe

    2017-09-13

    A practical quantum computer must not merely store information, but also process it. To prevent errors introduced by noise from multiplying and spreading, a fault-tolerant computational architecture is required. Current experiments are taking the first steps toward noise-resilient logical qubits. But to convert these quantum devices from memories to processors, it is necessary to specify how a universal set of gates is performed on them. The leading proposals for doing so, such as magic-state distillation and colour-code techniques, have high resource demands. Alternative schemes, such as those that use high-dimensional quantum codes in a modular architecture, have potential benefits, but need to be explored further.

  17. Advanced techniques in reliability model representation and solution

    NASA Technical Reports Server (NTRS)

    Palumbo, Daniel L.; Nicol, David M.

    1992-01-01

    The current tendency of flight control system designs is towards increased integration of applications and increased distribution of computational elements. The reliability analysis of such systems is difficult because subsystem interactions are increasingly interdependent. Researchers at NASA Langley Research Center have been working for several years to extend the capability of Markov modeling techniques to address these problems. This effort has been focused in the areas of increased model abstraction and increased computational capability. The reliability model generator (RMG) is a software tool that uses as input a graphical object-oriented block diagram of the system. RMG uses a failure-effects algorithm to produce the reliability model from the graphical description. The ASSURE software tool is a parallel processing program that uses the semi-Markov unreliability range evaluator (SURE) solution technique and the abstract semi-Markov specification interface to the SURE tool (ASSIST) modeling language. A failure modes-effects simulation is used by ASSURE. These tools were used to analyze a significant portion of a complex flight control system. The successful combination of the power of graphical representation, automated model generation, and parallel computation leads to the conclusion that distributed fault-tolerant system architectures can now be analyzed.

  18. Scientific Application Requirements for Leadership Computing at the Exascale

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ahern, Sean; Alam, Sadaf R; Fahey, Mark R

    2007-12-01

    The Department of Energy's Leadership Computing Facility, located at Oak Ridge National Laboratory's National Center for Computational Sciences, recently polled scientific teams that had large allocations at the center in 2007, asking them to identify computational science requirements for future exascale systems (capable of an exaflop, or 10^18 floating point operations per second). These requirements are necessarily speculative, since an exascale system will not be realized until the 2015-2020 timeframe, and are expressed where possible relative to a recent petascale requirements analysis of similar science applications [1]. Our initial findings, which beg further data collection, validation, and analysis, did in fact align with many of our expectations and existing petascale requirements, yet they also contained some surprises, complete with new challenges and opportunities. First and foremost, the breadth and depth of science prospects and benefits on an exascale computing system are striking. Without a doubt, they justify a large investment, even with its inherent risks. The possibilities for return on investment (by any measure) are too large to let us ignore this opportunity. The software opportunities and challenges are enormous. In fact, as one notable computational scientist put it, the scale of questions being asked at the exascale is tremendous and the hardware has gotten way ahead of the software. We are in grave danger of failing because of a software crisis unless concerted investments and coordinating activities are undertaken to reduce and close this hardware-software gap over the next decade. Key to success will be a rigorous requirement for natural mapping of algorithms to hardware in a way that complements (rather than competes with) compilers and runtime systems. The level of abstraction must be raised, and more attention must be paid to functionalities and capabilities that incorporate intent into data structures, are aware of memory hierarchy, possess fault tolerance, exploit asynchronism, and are power-consumption aware. On the other hand, we must also provide application scientists with the ability to develop software without having to become experts in the computer science components. Numerical algorithms are scattered broadly across science domains, with no one particular algorithm being ubiquitous and no one algorithm going unused. Structured grids and dense linear algebra continue to dominate, but other algorithm categories will become more common. A significant increase is projected for Monte Carlo algorithms, unstructured grids, sparse linear algebra, and particle methods, and a relative decrease is foreseen in fast Fourier transforms. These projections reflect the expectation of much higher architecture concurrency and the resulting need for very high scalability. The new algorithm categories that application scientists expect to be increasingly important in the next decade include adaptive mesh refinement, implicit nonlinear systems, data assimilation, agent-based methods, parameter continuation, and optimization. The attributes of leadership computing systems expected to increase most in priority over the next decade are (in order of importance) interconnect bandwidth, memory bandwidth, mean time to interrupt, memory latency, and interconnect latency. The attributes expected to decrease most in relative priority are disk latency, archival storage capacity, disk bandwidth, wide area network bandwidth, and local storage capacity. These choices by application developers reflect the expected needs of applications or the expected reality of available hardware. One interpretation is that the increasing priorities reflect the desire to increase computational efficiency to take advantage of increasing peak flops [floating point operations per second], while the decreasing priorities reflect the expectation that computational efficiency will not increase. Per-core requirements appear to be relatively static, while aggregate requirements will grow with the system. This projection is consistent with a relatively small increase in performance per core with a dramatic increase in the number of cores. Leadership system software must face and overcome issues that will undoubtedly be exacerbated at the exascale. The operating system (OS) must be as unobtrusive as possible and possess more stability, reliability, and fault tolerance during application execution. As applications will be more likely at the exascale to experience loss of resources during an execution, the OS must mitigate such a loss with a range of responses. New fault tolerance paradigms must be developed and integrated into applications. Just as application input and output must not be an afterthought in hardware design, job management, too, must not be an afterthought in system software design. Efficient scheduling of those resources will be a major obstacle faced by leadership computing centers at the exascale...

  19. Adiabatic Quantum Transistors (Open Access, Publisher’s Version)

    DTIC Science & Technology

    2013-06-14

    states are the entangled states originally used to perform measurement-based quantum computation [9,19]. To define the Hamiltonian of our system, we need...carries over to our model. Note that fault-tolerant QC requires expunging entropy (usually via measurement), but this can always be placed at the end... entropy of quantum errors, and the latter is important for building architectures that are modular and synchronous. A. Adiabatic measurement amplifier

  20. Water-Cooled Data Center Packs More Power Per Rack | Poster

    Cancer.gov

    By Frank Blanchard and Ken Michaels, Staff Writers Behind each tall, black computer rack in the data center at the Advanced Technology Research Facility (ATRF) is something both strangely familiar and oddly out of place: It looks like a radiator. The back door of each cabinet is gridded with the coils of the Liebert cooling system, which circulates chilled water to remove heat generated by the high-speed, high-capacity, fault-tolerant equipment.

  1. Fault Model Development for Fault Tolerant VLSI Design

    DTIC Science & Technology

    1988-05-01

    BRIDGING FAULTS: A bridging fault in a digital circuit connects two or more conducting paths of the circuit. The resistance...Melvin Breuer and Arthur Friedman, "Diagnosis and Reliable Design of Digital Systems", Computer Science Press, Inc., 1976. 4. [Chandramouli, 1983] R...

  2. ISIS: A System for Fault-Tolerant Distributed Computing

    DTIC Science & Technology

    1986-04-01

  3. Digital flight control systems

    NASA Technical Reports Server (NTRS)

    Caglayan, A. K.; Vanlandingham, H. F.

    1977-01-01

    The design of stable feedback control laws for sampled-data systems with variable rate sampling was investigated. These types of sampled-data systems arise naturally in digital flight control systems which use digital actuators where it is desirable to decrease the number of control computer output commands in order to save wear and tear of the associated equipment. The design of aircraft control systems which are optimally tolerant of sensor and actuator failures was also studied. Detection of the failed sensor or actuator must be resolved and if the estimate of the state is used in the control law, then it is also desirable to have an estimator which will give the optimal state estimate even under the failed conditions.

  4. The Dangers of Failure Masking in Fault-Tolerant Software: Aspects of a Recent In-Flight Upset Event

    NASA Technical Reports Server (NTRS)

    Johnson, C. W.; Holloway, C. M.

    2007-01-01

    On 1 August 2005, a Boeing Company 777-200 aircraft, operating on an international passenger flight from Australia to Malaysia, was involved in a significant upset event while flying on autopilot. The Australian Transport Safety Bureau's investigation into the event discovered that an anomaly existed in the component software hierarchy that allowed inputs from a known faulty accelerometer to be processed by the air data inertial reference unit (ADIRU) and used by the primary flight computer, autopilot and other aircraft systems. This anomaly had existed in original ADIRU software, and had not been detected in the testing and certification process for the unit. This paper describes the software aspects of the incident in detail, and suggests possible implications concerning complex, safety-critical, fault-tolerant software.

  5. Closed-Loop Evaluation of an Integrated Failure Identification and Fault Tolerant Control System for a Transport Aircraft

    NASA Technical Reports Server (NTRS)

    Shin, Jong-Yeob; Belcastro, Christine; Khong, thuan

    2006-01-01

    Formal robustness analysis of aircraft control upset prevention and recovery systems could play an important role in their validation and ultimate certification. Such systems developed for failure detection, identification, and reconfiguration, as well as upset recovery, need to be evaluated over broad regions of the flight envelope or under extreme flight conditions, and should include various sources of uncertainty. To apply formal robustness analysis, formulation of linear fractional transformation (LFT) models of complex parameter-dependent systems is required, which represent system uncertainty due to parameter uncertainty and actuator faults. This paper describes a detailed LFT model formulation procedure from the nonlinear model of a transport aircraft by using a preliminary LFT modeling software tool developed at the NASA Langley Research Center, which utilizes a matrix-based computational approach. The closed-loop system is evaluated over the entire flight envelope based on the generated LFT model which can cover nonlinear dynamics. The robustness analysis results of the closed-loop fault tolerant control system of a transport aircraft are presented. A reliable flight envelope (safe flight regime) is also calculated from the robust performance analysis results, over which the closed-loop system can achieve the desired performance of command tracking and failure detection.

  6. Computing the Feasible Spaces of Optimal Power Flow Problems

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Molzahn, Daniel K.

    The solution to an optimal power flow (OPF) problem provides a minimum cost operating point for an electric power system. The performance of OPF solution techniques strongly depends on the problem’s feasible space. This paper presents an algorithm that is guaranteed to compute the entire feasible spaces of small OPF problems to within a specified discretization tolerance. Specifically, the feasible space is computed by discretizing certain of the OPF problem’s inequality constraints to obtain a set of power flow equations. All solutions to the power flow equations at each discretization point are obtained using the Numerical Polynomial Homotopy Continuation (NPHC) algorithm. To improve computational tractability, “bound tightening” and “grid pruning” algorithms use convex relaxations to preclude consideration of many discretization points that are infeasible for the OPF problem. Here, the proposed algorithm is used to generate the feasible spaces of two small test cases.

  7. Computing the Feasible Spaces of Optimal Power Flow Problems

    DOE PAGES

    Molzahn, Daniel K.

    2017-03-15

    The solution to an optimal power flow (OPF) problem provides a minimum cost operating point for an electric power system. The performance of OPF solution techniques strongly depends on the problem’s feasible space. This paper presents an algorithm that is guaranteed to compute the entire feasible spaces of small OPF problems to within a specified discretization tolerance. Specifically, the feasible space is computed by discretizing certain of the OPF problem’s inequality constraints to obtain a set of power flow equations. All solutions to the power flow equations at each discretization point are obtained using the Numerical Polynomial Homotopy Continuation (NPHC) algorithm. To improve computational tractability, “bound tightening” and “grid pruning” algorithms use convex relaxations to preclude consideration of many discretization points that are infeasible for the OPF problem. Here, the proposed algorithm is used to generate the feasible spaces of two small test cases.

  8. Intelligent failure-tolerant control

    NASA Technical Reports Server (NTRS)

    Stengel, Robert F.

    1991-01-01

    An overview of failure-tolerant control is presented, beginning with robust control, progressing through parallel and analytical redundancy, and ending with rule-based systems and artificial neural networks. By design or implementation, failure-tolerant control systems are 'intelligent' systems. All failure-tolerant systems require some degrees of robustness to protect against catastrophic failure; failure tolerance often can be improved by adaptivity in decision-making and control, as well as by redundancy in measurement and actuation. Reliability, maintainability, and survivability can be enhanced by failure tolerance, although each objective poses different goals for control system design. Artificial intelligence concepts are helpful for integrating and codifying failure-tolerant control systems, not as alternatives but as adjuncts to conventional design methods.

  9. Vibration computer programs E13101, E13102, E13104, and E13112 and application to the NERVA program. Project 187: Methodology documentation

    NASA Technical Reports Server (NTRS)

    Mironenko, G.

    1972-01-01

    Programs for the analyses of the free or forced, undamped vibrations of one or two elastically coupled lumped-parameter beams are presented. Bearing nonlinearities, casing and rotor distributed mass and elasticity, rotor imbalance, forcing functions, gyroscopic moments, rotary inertia, and shear and flexural deformations are all included in the system dynamics analysis. All bearings have nonlinear load-displacement characteristics; the solution is achieved by iteration. Rotor imbalances allowed by such considerations as pilot tolerances and runouts, as well as bearing clearances (allowing conical or cylindrical whirl), determine the forcing function magnitudes. The computer programs first obtain a solution wherein the bearings are treated as linear springs of given spring rates. Then, based upon the computed bearing reactions, new spring rates are predicted and another solution of the modified system is made. The iteration is continued until the changes to the bearing spring rates and bearing reactions become negligibly small.

  10. Fault-tolerant cooperative output regulation for multi-vehicle systems with sensor faults

    NASA Astrophysics Data System (ADS)

    Qin, Liguo; He, Xiao; Zhou, D. H.

    2017-10-01

    This paper presents a unified framework of fault diagnosis and fault-tolerant cooperative output regulation (FTCOR) for a linear discrete-time multi-vehicle system with sensor faults. The FTCOR control law is designed through three steps. A cooperative output regulation (COR) controller is designed based on the internal model principle when there are no sensor faults. A sufficient condition on the existence of the COR controller is given based on the discrete-time algebraic Riccati equation (DARE). Then, a decentralised fault diagnosis scheme is designed to cope with sensor faults occurring in followers. A residual generator is developed to detect sensor faults of each follower, and a bank of fault-matching estimators is proposed to isolate and estimate sensor faults of each follower. Unlike current distributed fault diagnosis for multi-vehicle systems, the presented decentralised fault diagnosis scheme in each vehicle reduces the communication and computation load by only using the information of that vehicle. By combining the sensor fault estimation and the COR control law, an FTCOR controller is proposed. Finally, the simulation results demonstrate the effectiveness of the FTCOR controller.
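
    As an illustration of the kind of discrete-time algebraic Riccati equation (DARE) the existence condition above is stated in terms of, the sketch below solves a DARE for a toy discrete-time plant and forms the corresponding stabilizing feedback gain; the plant matrices are assumptions, not the paper's vehicle model.

```python
# Solving a discrete-time algebraic Riccati equation for a toy plant (illustrative only).
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[1.0, 0.1],
              [0.0, 1.0]])            # hypothetical discrete-time plant
B = np.array([[0.0],
              [0.1]])
Q = np.eye(2)                         # state weighting
R = np.array([[1.0]])                 # input weighting

P = solve_discrete_are(A, B, Q, R)
K = np.linalg.solve(B.T @ P @ B + R, B.T @ P @ A)     # stabilizing feedback gain u = -K x

print("closed-loop eigenvalues:", np.linalg.eigvals(A - B @ K))
```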

  11. Modeling and measurement of fault-tolerant multiprocessors

    NASA Technical Reports Server (NTRS)

    Shin, K. G.; Woodbury, M. H.; Lee, Y. H.

    1985-01-01

    The workload effects on computer performance are addressed first for a highly reliable unibus multiprocessor used in real-time control. As an approach to studying these effects, a modified Stochastic Petri Net (SPN) is used to describe the synchronous operation of the multiprocessor system. From this model the vital components affecting performance can be determined. However, because of the complexity in solving the modified SPN, a simpler model, i.e., a closed priority queuing network, is constructed that represents the same critical aspects. The use of this model for a specific application requires the partitioning of the workload into job classes. It is shown that the steady-state solution of the queuing model directly produces useful results. The use of this model in evaluating an existing system, the Fault Tolerant Multiprocessor (FTMP) at the NASA AIRLAB, is outlined with some experimental results. Also addressed is the technique of measuring fault latency, an important microscopic system parameter. Most related works have assumed no or a negligible fault latency and then performed approximate analyses. To eliminate this deficiency, a new methodology for indirectly measuring fault latency is presented.

  12. Integration of tools for the Design and Assessment of High-Performance, Highly Reliable Computing Systems (DAHPHRS), phase 1

    NASA Technical Reports Server (NTRS)

    Scheper, C.; Baker, R.; Frank, G.; Yalamanchili, S.; Gray, G.

    1992-01-01

    Systems for Space Defense Initiative (SDI) space applications typically require both high performance and very high reliability. These requirements present the systems engineer evaluating such systems with the extremely difficult problem of conducting performance and reliability trade-offs over large design spaces. A controlled development process supported by appropriate automated tools must be used to assure that the system will meet design objectives. This report describes an investigation of methods, tools, and techniques necessary to support performance and reliability modeling for SDI systems development. Models of the JPL Hypercubes, the Encore Multimax, and the C.S. Draper Lab Fault-Tolerant Parallel Processor (FTPP) parallel-computing architectures using candidate SDI weapons-to-target assignment algorithms as workloads were built and analyzed as a means of identifying the necessary system models, how the models interact, and what experiments and analyses should be performed. As a result of this effort, weaknesses in the existing methods and tools were revealed and capabilities that will be required for both individual tools and an integrated toolset were identified.

  13. Proteus: a reconfigurable computational network for computer vision

    NASA Astrophysics Data System (ADS)

    Haralick, Robert M.; Somani, Arun K.; Wittenbrink, Craig M.; Johnson, Robert; Cooper, Kenneth; Shapiro, Linda G.; Phillips, Ihsin T.; Hwang, Jenq N.; Cheung, William; Yao, Yung H.; Chen, Chung-Ho; Yang, Larry; Daugherty, Brian; Lorbeski, Bob; Loving, Kent; Miller, Tom; Parkins, Larye; Soos, Steven L.

    1992-04-01

    The Proteus architecture is a highly parallel MIMD (multiple-instruction, multiple-data) machine, optimized for large-granularity tasks such as machine vision and image processing. The system can achieve 20 Giga-flops (80 Giga-flops peak). It accepts data via multiple serial links at a rate of up to 640 megabytes/second. The system employs a hierarchical reconfigurable interconnection network, with the highest level being a circuit-switched Enhanced Hypercube serial interconnection network for internal data transfers. The system is designed to use 256 to 1,024 RISC processors. The processors use one-megabyte external Read/Write Allocating Caches for reduced multiprocessor contention. The system detects, locates, and replaces faulty subsystems using redundant hardware to facilitate fault tolerance. The parallelism is directly controllable through an advanced software system for partitioning, scheduling, and development. System software includes a translator for the INSIGHT language, a parallel debugger, low- and high-level simulators, and a message passing system for all control needs. Image processing application software includes a variety of point operators, neighborhood operators, convolution, and the mathematical morphology operations of binary and gray-scale dilation, erosion, opening, and closing.

  14. Fault Injection and Monitoring Capability for a Fault-Tolerant Distributed Computation System

    NASA Technical Reports Server (NTRS)

    Torres-Pomales, Wilfredo; Yates, Amy M.; Malekpour, Mahyar R.

    2010-01-01

    The Configurable Fault-Injection and Monitoring System (CFIMS) is intended for the experimental characterization of effects caused by a variety of adverse conditions on a distributed computation system running flight control applications. A product of research collaboration between NASA Langley Research Center and Old Dominion University, the CFIMS is the main research tool for generating actual fault response data with which to develop and validate analytical performance models and design methodologies for the mitigation of fault effects in distributed flight control systems. Rather than a fixed design solution, the CFIMS is a flexible system that enables the systematic exploration of the problem space and can be adapted to meet the evolving needs of the research. The CFIMS has the capabilities of system-under-test (SUT) functional stimulus generation, fault injection and state monitoring, all of which are supported by a configuration capability for setting up the system as desired for a particular experiment. This report summarizes the work accomplished so far in the development of the CFIMS concept and documents the first design realization.

  15. Error rates and resource overheads of encoded three-qubit gates

    NASA Astrophysics Data System (ADS)

    Takagi, Ryuji; Yoder, Theodore J.; Chuang, Isaac L.

    2017-10-01

    A non-Clifford gate is required for universal quantum computation, and, typically, this is the most error-prone and resource-intensive logical operation on an error-correcting code. Small, single-qubit rotations are popular choices for this non-Clifford gate, but certain three-qubit gates, such as Toffoli or controlled-controlled-Z (ccz), are equivalent options that are also more suited for implementing some quantum algorithms, for instance, those with coherent classical subroutines. Here, we calculate error rates and resource overheads for implementing logical ccz with pieceable fault tolerance, a nontransversal method for implementing logical gates. We provide a comparison with a nonlocal magic-state scheme on a concatenated code and a local magic-state scheme on the surface code. We find the pieceable fault-tolerance scheme particularly advantaged over magic states on concatenated codes and in certain regimes over magic states on the surface code. Our results suggest that pieceable fault tolerance is a promising candidate for fault tolerance in a near-future quantum computer.

  16. Investigating the Interplay between Energy Efficiency and Resilience in High Performance Computing

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Tan, Li; Song, Shuaiwen; Wu, Panruo

    2015-05-29

    Energy efficiency and resilience are two crucial challenges for HPC systems to reach exascale. While energy efficiency and resilience issues have been extensively studied individually, little has been done to understand the interplay between energy efficiency and resilience for HPC systems. Decreasing the supply voltage associated with a given operating frequency for processors and other CMOS-based components can significantly reduce power consumption. However, this often raises system failure rates and consequently increases application execution time. In this work, we present an energy saving undervolting approach that leverages the mainstream resilience techniques to tolerate the increased failures caused by undervolting.

  17. Towards the formal verification of the requirements and design of a processor interface unit: HOL listings

    NASA Technical Reports Server (NTRS)

    Fura, David A.; Windley, Phillip J.; Cohen, Gerald C.

    1993-01-01

    This technical report contains the Higher-Order Logic (HOL) listings of the partial verification of the requirements and design for a commercially developed processor interface unit (PIU). The PIU is an interface chip performing memory interface, bus interface, and additional support services for a commercial microprocessor within a fault-tolerant computer system. This system, the Fault Tolerant Embedded Processor (FTEP), is targeted towards applications in avionics and space requiring extremely high levels of mission reliability, extended maintenance-free operation, or both. This report contains the actual HOL listings of the PIU verification as it currently exists. Section 2 of this report contains general-purpose HOL theories and definitions that support the PIU verification. These include arithmetic theories dealing with inequalities and associativity, and a collection of tactics used in the PIU proofs. Section 3 contains the HOL listings for the completed PIU design verification. Section 4 contains the HOL listings for the partial requirements verification of the P-Port.

  18. McrEngine: A Scalable Checkpointing System Using Data-Aware Aggregation and Compression

    DOE PAGES

    Islam, Tanzima Zerin; Mohror, Kathryn; Bagchi, Saurabh; ...

    2013-01-01

    High performance computing (HPC) systems use checkpoint-restart to tolerate failures. Typically, applications store their states in checkpoints on a parallel file system (PFS). As applications scale up, checkpoint-restart incurs high overheads due to contention for PFS resources. The high overheads force large-scale applications to reduce checkpoint frequency, which means more compute time is lost in the event of failure. We alleviate this problem through a scalable checkpoint-restart system, mcrEngine. McrEngine aggregates checkpoints from multiple application processes with knowledge of the data semantics available through widely-used I/O libraries, e.g., HDF5 and netCDF, and compresses them. Our novel scheme improves compressibility of checkpoints by up to 115% over simple concatenation and compression. Our evaluation with large-scale application checkpoints shows that mcrEngine reduces checkpointing overhead by up to 87% and restart overhead by up to 62% over a baseline with no aggregation or compression.
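
    A toy sketch of the data-aware aggregation idea (not mcrEngine itself): grouping the same-named variable from every process before compression places similar data next to each other. The fields below are made identical across processes so the effect is easy to see with zlib.

```python
# Contrived example: compare compressing per-process concatenation vs. variable-wise grouping.
import zlib
import numpy as np

n_procs = 8
temperature = np.linspace(300.0, 400.0, 3000)          # hypothetical per-process checkpoint data
pressure = np.sin(np.linspace(0.0, 10.0, 3000))
ckpts = [{"temperature": temperature, "pressure": pressure} for _ in range(n_procs)]

naive = b"".join(v.tobytes() for c in ckpts for v in c.values())                  # process by process
aware = b"".join(c[name].tobytes() for name in ("temperature", "pressure") for c in ckpts)

print("plain concatenation :", len(zlib.compress(naive)), "bytes")
print("data-aware grouping :", len(zlib.compress(aware)), "bytes")
```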

  19. Serial Back-Plane Technologies in Advanced Avionics Architectures

    NASA Technical Reports Server (NTRS)

    Varnavas, Kosta

    2005-01-01

    Current back-plane technologies such as VME, and current personal computer back planes such as PCI, are shared-bus systems that can exhibit nondeterministic latencies. This means a card can take control of the bus and use resources indefinitely, affecting the ability of other cards in the back plane to acquire the bus. This takes a real toll on the reliability of the system. Additionally, these parallel busses only have bandwidths in the 100s of megahertz range, and EMI and noise effects get worse the higher the bandwidth goes. To provide scalable, fault-tolerant, advanced computing systems, more applicable to today's connected computing environment and better suited to future requirements for advanced space instruments and vehicles, serial back-plane technologies should be implemented in advanced avionics architectures. Serial back-plane technologies eliminate the problem of one card getting the bus and never relinquishing it, or one minor problem on the backplane bringing the whole system down. Being serial instead of parallel reduces many of the signal integrity issues associated with parallel back planes and thus significantly improves reliability. The increased speeds associated with a serial back plane are an added bonus.

  20. Using concatenated quantum codes for universal fault-tolerant quantum gates.

    PubMed

    Jochym-O'Connor, Tomas; Laflamme, Raymond

    2014-01-10

    We propose a method for universal fault-tolerant quantum computation using concatenated quantum error correcting codes. The concatenation scheme exploits the transversal properties of two different codes, combining them to provide a means to protect against low-weight arbitrary errors. We give the required properties of the error correcting codes to ensure universal fault tolerance and discuss a particular example using the 7-qubit Steane and 15-qubit Reed-Muller codes. Namely, other than computational basis state preparation as required by the DiVincenzo criteria, our scheme requires no special ancillary state preparation to achieve universality, as opposed to schemes such as magic state distillation. We believe that optimizing the codes used in such a scheme could provide a useful alternative to state distillation schemes that exhibit high overhead costs.

  1. Physician Risk Tolerance and Head Computed Tomography Use for Patients with Isolated Headaches.

    PubMed

    Huang, Yi-Syun; Syue, Yuan-Jhen; Yen, Yung-Lin; Wu, Chien-Hung; Ho, Yu-Ni; Cheng, Fu-Jen

    2016-11-01

    Headaches are one of the most common afflictions in adults and reasons for emergency department (ED) visits. We sought to determine the association between physician risk tolerance and head computed tomography (CT) use in patients with headaches in the ED. We performed a retrospective study of patients with nontraumatic isolated headaches in the ED and then administered two instruments (Risk-Taking subscale [RTS] of the Jackson Personality Index and a Malpractice Fear Scale [MFS]) to attending physicians who had evaluated these patients and made decisions regarding head CT scans. Outcomes were head CT use during ED evaluation and hospital admission. A hierarchical logistic regression was used to determine the effect of risk scales on head CT use. Of the 1328 patients with headaches, 521 (39.2%) received brain CTs and 83 (6.9%) were admitted; 33 (2.5%) patients received a final diagnosis that the central nervous system was the origin of the disease. Among the 17 emergency physicians (EPs), the median of the MFS and RTS was 23 (interquartile range [IQR] 19-25) and 21 (IQR 20-23), respectively. EPs who were relatively risk-averse and those who possessed a higher level of malpractice fear were not more likely to order brain CTs for patients with isolated headaches. Individual EP risk tolerance, as measured by RTS, and malpractice concerns, measured by MFS, were not predictive of CT use in patients with isolated headaches.

  2. System on a Chip (SoC) Overview

    NASA Technical Reports Server (NTRS)

    LaBel, Kenneth A.

    2010-01-01

    System-on-a-chip or system on chip (SoC or SOC) refers to integrating all components of a computer or other electronic system into a single integrated circuit (chip). It may contain digital, analog, mixed-signal, and often radio-frequency functions all on a single chip substrate. Complexity drives it all: radiation tolerance and testability are challenges for fault isolation, propagation, and validation. The single silicon die is bigger than any flown before, and the technology is scaling below 90 nm (requiring new qualification methods). Packages have changed: they are bigger and more difficult to inspect, test, and understand, and embedded passives add further complexity. Material interfaces are more complex (underfills, processing), board layouts follow new rules, and mechanical and thermal designs must be reconsidered as well.

  3. The SPINDLE Disruption-Tolerant Networking System

    DTIC Science & Technology

    2007-11-01

    average availability (AA). The AA metric attempts to measure the average fraction of time in the near future that the link will be available for use... Each link's AA is epidemically disseminated to all nodes. Path costs are computed using the topology learned through this dissemination, with the cost of a... link l set to (1 − AA(l)) + c (a small constant factor that makes routing favor fewer hops when all links have an AA of 1). Additional details
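
    The snippet above gives the SPINDLE link cost as (1 − AA(l)) + c. As a minimal illustration (not SPINDLE code), the sketch below applies that cost in a shortest-path computation over a small hypothetical topology; the adjacency map, the AA values, and the constant C are invented for the example.

        # Minimal sketch: least-cost routing with the availability-based link
        # cost from the snippet, cost(l) = (1 - AA(l)) + c.  Topology and AA
        # values are hypothetical.
        import heapq

        C = 0.01  # small constant so routing favors fewer hops when every AA(l) == 1

        def link_cost(aa):
            return (1.0 - aa) + C

        def shortest_path(adj, src, dst):
            """Dijkstra over a dict {node: {neighbor: AA_of_link}}."""
            dist, prev = {src: 0.0}, {}
            pq = [(0.0, src)]
            while pq:
                d, u = heapq.heappop(pq)
                if u == dst:
                    break
                if d > dist.get(u, float("inf")):
                    continue
                for v, aa in adj[u].items():
                    nd = d + link_cost(aa)
                    if nd < dist.get(v, float("inf")):
                        dist[v], prev[v] = nd, u
                        heapq.heappush(pq, (nd, v))
            path = [dst]
            while path[-1] != src:
                path.append(prev[path[-1]])
            return list(reversed(path)), dist[dst]

        # AA values would be learned through epidemic dissemination in SPINDLE.
        adj = {"A": {"B": 1.0, "C": 0.6}, "B": {"A": 1.0, "D": 0.9},
               "C": {"A": 0.6, "D": 1.0}, "D": {"B": 0.9, "C": 1.0}}
        print(shortest_path(adj, "A", "D"))  # prefers the more-available A-B-D path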

  4. Mission Management Computer and Sequencing Hardware for RLV-TD HEX-01 Mission

    NASA Astrophysics Data System (ADS)

    Gupta, Sukrat; Raj, Remya; Mathew, Asha Mary; Koshy, Anna Priya; Paramasivam, R.; Mookiah, T.

    2017-12-01

    The Reusable Launch Vehicle-Technology Demonstrator Hypersonic Experiment (RLV-TD HEX-01) mission posed some unique challenges in the design and development of avionics hardware. This work presents the details of the mission-critical avionics hardware, mainly the Mission Management Computer (MMC) and the sequencing hardware. The Navigation, Guidance and Control (NGC) chain for RLV-TD is dual redundant, with cross-strapped Remote Terminals (RTs) interfaced through a MIL-STD-1553B bus. The MMC is the Bus Controller on the 1553 bus and performs GPS-aided navigation, guidance, digital autopilot, and sequencing for the RLV-TD launch vehicle at different periodicities (10, 20, 500 ms). Digital autopilot execution in the MMC with a periodicity of 10 ms (in the ascent phase) is introduced for the first time and was successfully demonstrated in flight. The MMC is built around the Intel i960 processor and has built-in fault tolerance features such as ECC for memories. Fault Detection and Isolation schemes are implemented to isolate a failed MMC. The sequencing hardware comprises the Stage Processing System (SPS) and the Command Execution Module (CEM). The SPS is an RT on the 1553 bus that receives the sequencing and control related commands from the MMCs and, after proper error handling, posts them to downstream modules for final execution. The SPS is designed as a high-reliability system by incorporating various fault tolerance and fault detection features. The CEM is a relay-based module for sequence command execution.
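
    As a small illustration of the multi-rate execution described above, the sketch below shows a generic minor-cycle rate-group dispatcher built around the stated 10, 20, and 500 ms periodicities. The assignment of specific functions to the 20 ms and 500 ms groups, and the function names themselves, are assumptions made only for the example; this is not the MMC flight software.

        # Generic rate-group dispatcher sketch (illustrative only).  Only the
        # 10/20/500 ms periods come from the abstract; the task-to-rate
        # assignments below are hypothetical.
        MINOR_CYCLE_MS = 10  # base tick equals the fastest (autopilot) rate

        def autopilot():   pass   # 10 ms rate group (ascent-phase digital autopilot)
        def guidance():    pass   # 20 ms rate group (hypothetical assignment)
        def sequencing():  pass   # 500 ms rate group (hypothetical assignment)

        RATE_GROUPS = [
            (10,  autopilot),
            (20,  guidance),
            (500, sequencing),
        ]

        def run(n_ticks=100):
            """Dispatch each rate group whenever its period divides the elapsed time."""
            for tick in range(n_ticks):
                t_ms = tick * MINOR_CYCLE_MS
                for period_ms, task in RATE_GROUPS:
                    if t_ms % period_ms == 0:
                        task()

        run()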

  5. Comparison of microrings and microdisks for high-speed optical modulation in silicon photonics

    NASA Astrophysics Data System (ADS)

    Ying, Zhoufeng; Wang, Zheng; Zhao, Zheng; Dhar, Shounak; Pan, David Z.; Soref, Richard; Chen, Ray T.

    2018-03-01

    The past several decades have witnessed the gradual transition from electrical to optical interconnects, ranging from long-haul telecommunication to chip-to-chip interconnects. As a key component of integrated optical interconnects and high-performance computing, optical modulators have advanced considerably in recent years, including ultrahigh-speed microring and microdisk modulators. In this paper, a comparison between microring and microdisk modulators is presented in terms of dimensions, static and dynamic power consumption, and fabrication tolerance. The results show that microdisks have advantages over microrings in these aspects, offering guidance for the chip design of high-density integrated systems for optical interconnects and optical computing.

  6. Development and evaluation of a fault-tolerant multiprocessor (FTMP) computer. Volume 4: FTMP executive summary

    NASA Technical Reports Server (NTRS)

    Smith, T. B., III; Lala, J. H.

    1984-01-01

    The FTMP architecture is a high-reliability computer concept modeled after a homogeneous multiprocessor architecture. Elements of the FTMP are operated in tight synchronism with one another, and hardware fault-detection and fault-masking are provided transparently to the software. Operating system design and user software design are thus greatly simplified. Performance of the FTMP is also comparable to that of a simplex equivalent due to the efficiency of the fault-handling hardware. The FTMP project constructed an engineering module of the FTMP, programmed the machine, and extensively tested the architecture through fault injection and other stress testing. This testing confirmed the soundness of the FTMP concepts.

  7. Minimalist fault-tolerance techniques for mitigating single-event effects in non-radiation-hardened microcontrollers

    NASA Astrophysics Data System (ADS)

    Caldwell, Douglas Wyche

    Commercial microcontrollers--monolithic integrated circuits containing microprocessor, memory and various peripheral functions--such as are used in industrial, automotive and military applications, present spacecraft avionics system designers an appealing mix of higher performance and lower power together with faster system-development time and lower unit costs. However, these parts are not radiation-hardened for application in the space environment and Single-Event Effects (SEE) caused by high-energy, ionizing radiation present a significant challenge. Mitigating these effects with techniques which require minimal additional support logic, and thereby preserve the high functional density of these devices, can allow their benefits to be realized. This dissertation uses fault-tolerance to mitigate the transient errors and occasional latchups that non-hardened microcontrollers can experience in the space radiation environment. Space systems requirements and the historical use of fault-tolerant computers in spacecraft provide context. Space radiation and its effects in semiconductors define the fault environment. A reference architecture is presented which uses two or three microcontrollers with a combination of hardware and software voting techniques to mitigate SEE. A prototypical spacecraft function (an inertial measurement unit) is used to illustrate the techniques and to explore how real application requirements impact the fault-tolerance approach. Low-cost approaches which leverage features of existing commercial microcontrollers are analyzed. A high-speed serial bus is used for voting among redundant devices and a novel wire-OR output voting scheme exploits the bidirectional controls of I/O pins. A hardware testbed and prototype software were constructed to evaluate two- and three-processor configurations. Simulated Single-Event Upsets (SEUs) were injected at high rates and the response of the system monitored. The resulting statistics were used to evaluate technical effectiveness. Fault-recovery probabilities (coverages) higher than 99.99% were experimentally demonstrated. The greater than thousand-fold reduction in observed effects provides performance comparable with SEE tolerance of tested, rad-hard devices. Technical results were combined with cost data to assess the cost-effectiveness of the techniques. It was found that a three-processor system was only marginally more effective than a two-device system at detecting and recovering from faults, but consumed substantially more resources, suggesting that simpler configurations are generally more cost-effective.
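
    The dissertation above combines hardware and software voting among two or three microcontrollers. The sketch below is a minimal software illustration of that idea: bitwise 2-of-3 majority voting masks a single upset, while a duplex comparison can only detect disagreement. The function names and the injected fault are invented for the example and do not reproduce the dissertation's mechanisms.

        # Illustrative software voter for redundant microcontroller outputs
        # (not the dissertation's implementation).
        def vote3(a: int, b: int, c: int) -> int:
            """Bitwise 2-of-3 majority of three redundant word-sized outputs."""
            return (a & b) | (a & c) | (b & c)

        def compare2(a: int, b: int):
            """Duplex comparison: returns (value, fault_detected)."""
            return (a, False) if a == b else (a, True)

        good = 0b1010_0110
        upset = good ^ (1 << 5)                  # simulated single-event upset in bit 5
        assert vote3(good, upset, good) == good  # three units: error is masked
        value, fault = compare2(good, upset)     # two units: error is only detected
        print(fault)  # True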

  8. SIRU development. Volume 3: Software description and program documentation

    NASA Technical Reports Server (NTRS)

    Oehrle, J.

    1973-01-01

    The development and initial evaluation of a strapdown inertial reference unit (SIRU) system are discussed. The SIRU configuration is a modular inertial subsystem with hardware and software features that achieve fault-tolerant operational capabilities. The SIRU redundant hardware design is formulated around a six-gyro and six-accelerometer instrument module package. The six-axis array provides redundant independent sensing, and the symmetry enables the formulation of an optimal software redundant data processing structure with self-contained fault detection and isolation (FDI) capabilities. The basic SIRU software coding system used in the DDP-516 computer is documented.
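
    As a rough illustration of how a six-axis redundant array supports fault detection and isolation, the sketch below estimates the three-axis quantity by least squares and isolates a failed axis by exclusion. The 6x3 geometry matrix H, the injected fault, and the isolation logic are hypothetical stand-ins; the actual SIRU instrument geometry and FDI equations are not reproduced here.

        # Minimal FDI sketch for a six-axis redundant inertial sensor array.
        # The geometry matrix H below is a hypothetical skewed-axis layout,
        # not the SIRU configuration.
        import numpy as np

        rng = np.random.default_rng(1)
        H = rng.normal(size=(6, 3))              # rows = sense-axis directions

        def isolate_fault(m):
            """Exclude each axis in turn; the exclusion that makes the remaining
            five measurements self-consistent identifies the failed axis."""
            residuals = []
            for k in range(6):
                keep = [i for i in range(6) if i != k]
                x, *_ = np.linalg.lstsq(H[keep], m[keep], rcond=None)
                residuals.append(np.linalg.norm(m[keep] - H[keep] @ x))
            return int(np.argmin(residuals)), residuals

        omega = np.array([0.01, -0.02, 0.03])    # true body rate
        m = H @ omega
        m[4] += 0.5                              # hard failure injected on axis 4
        axis, _ = isolate_fault(m)
        print(axis)                              # 4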

  9. Introduction

    NASA Astrophysics Data System (ADS)

    de Laat, Cees; Develder, Chris; Jukan, Admela; Mambretti, Joe

    This topic is devoted to communication issues in scalable compute and storage systems, such as parallel computers, networks of workstations, and clusters. All aspects of communication in modern systems were solicited, including advances in the design, implementation, and evaluation of interconnection networks, network interfaces, system and storage area networks, on-chip interconnects, communication protocols, routing and communication algorithms, and communication aspects of parallel and distributed algorithms. In total 15 papers were submitted to this topic, of which we selected the 7 strongest. We grouped the papers in two sessions of 3 papers each, and one paper was selected for the best paper session.
    We noted a number of papers dealing with changing topologies, stability, and forwarding convergence in source-routing-based cluster interconnect network architectures; we grouped these for the first session. The authors of the paper titled “Implementing a Change Assimilation Mechanism for Source Routing Interconnects” propose a mechanism that can obtain the new topology, and compute and distribute a new set of fabric paths to the source-routed network end points to minimize the impact on the forwarding service. The article entitled “Dependability Analysis of a Fault-tolerant Network Reconfiguration Strategy” reports on a case study analyzing the effects of network size, mean time to node failure, mean time to node repair, mean time to network repair, and coverage of the failure when using a 2D mesh network with a fault-tolerant mechanism (similar to the one used in the BlueGene/L system) that is able to remove rows and/or columns in the presence of failures. The last paper in this session, “RecTOR: A New and Efficient Method for Dynamic Network Reconfiguration”, presents a new dynamic reconfiguration method that ensures deadlock-freedom during the reconfiguration without causing performance degradation such as increased latency or decreased throughput.
    The second session groups 3 papers presenting methods, protocols, and architectures that enhance capacities in the networks. The paper titled “NIC-assisted Cache-Efficient Receive Stack for Message Passing over Ethernet” presents the addition of multiqueue support in the Open-MX receive stack so that all incoming packets for the same process are treated on the same core. It then introduces the idea of binding the target end process near its dedicated receive queue. In general this multiqueue receive stack performs better than the original single-queue stack, especially on large communication patterns where multiple processes are involved and manual binding is difficult. The authors of “A Multipath Fault-Tolerant Routing Method for High-Speed Interconnection Networks” focus on the problem of fault tolerance for high-speed interconnection networks by designing a fault-tolerant routing method. The goal was to tolerate a certain number of link and node failures, considering their impact and occurrence probability. Their experiments show that their method allows applications to successfully finalize their execution in the presence of several faults, with an average performance value of 97% with respect to the fault-free scenarios. The paper “Hardware implementation study of the Self-Clocked Fair Queuing Credit Aware (SCFQ-CA) and Deficit Round Robin Credit Aware (DRR-CA) scheduling algorithms” proposes specific implementations of the two schedulers taking into account the characteristics of current high-performance networks. A comparison is presented on the complexity of these two algorithms in terms of silicon area and computation delay.
    Finally we selected one paper for the special paper session: “A Case Study of Communication Optimizations on 3D Mesh Interconnects”. In this paper the authors present topology-aware mapping as a technique to optimize communication on 3-dimensional mesh interconnects and hence improve performance. Results are presented for OpenAtom on up to 16,384 processors of Blue Gene/L, 8,192 processors of Blue Gene/P, and 2,048 processors of Cray XT3.

  10. BUDS Candidate Success Through RTC: First Watch Results

    DTIC Science & Technology

    2007-01-01

    NCAPS ...motivation. NCAPS (Navy Computer Adaptive Personality Scales): Achievement, Stress Tolerance, Self Reliance, Leadership... military population. The Navy Computer Adaptive Personality Scales (NCAPS), however, was developed specifically to predict success across all Navy

  11. High Performance, Dependable Multiprocessor

    NASA Technical Reports Server (NTRS)

    Ramos, Jeremy; Samson, John R.; Troxel, Ian; Subramaniyan, Rajagopal; Jacobs, Adam; Greco, James; Cieslewski, Grzegorz; Curreri, John; Fischer, Michael; Grobelny, Eric

    2006-01-01

    With the ever-increasing demand for higher bandwidth and processing capacity of today's space exploration, space science, and defense missions, the ability to efficiently apply commercial-off-the-shelf (COTS) processors for on-board computing is now a critical need. In response to this need, NASA's New Millennium Program office has commissioned the development of Dependable Multiprocessor (DM) technology for use in payload and robotic missions. The Dependable Multiprocessor technology is a COTS-based, power-efficient, high-performance, highly dependable, fault-tolerant cluster computer. To date, Honeywell has successfully demonstrated a TRL4 prototype of the Dependable Multiprocessor [1], and is now working on the development of a TRL5 prototype. For the present effort Honeywell has teamed up with the University of Florida's High-performance Computing and Simulation (HCS) Lab, and together the team has demonstrated major elements of the Dependable Multiprocessor TRL5 system.

  12. Eigenmode computation of cavities with perturbed geometry using matrix perturbation methods applied on generalized eigenvalue problems

    NASA Astrophysics Data System (ADS)

    Gorgizadeh, Shahnam; Flisgen, Thomas; van Rienen, Ursula

    2018-07-01

    Generalized eigenvalue problems are standard problems in the computational sciences. In electromagnetic field analysis they may arise from the discretization of the Helmholtz equation by, for example, the finite element method (FEM). Geometrical perturbations of the structure under consideration lead to new generalized eigenvalue problems with different system matrices. Such perturbations may arise from manufacturing tolerances, harsh operating conditions, or shape optimization. Directly solving the eigenvalue problem for each perturbation is computationally costly. The perturbed eigenpairs can instead be approximated using eigenpair derivatives. Two common approaches for the calculation of eigenpair derivatives, namely the modal superposition method and direct algebraic methods, are discussed in this paper. Based on the direct algebraic methods, an iterative algorithm is developed for efficiently calculating the eigenvalues and eigenvectors of the perturbed geometry from the eigenvalues and eigenvectors of the unperturbed geometry.
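
    For the symmetric case, the first-order eigenvalue estimate that underlies such eigenpair-derivative approaches can be written down directly. The sketch below checks it numerically for a randomly generated symmetric-definite pencil; it illustrates only the standard formula, not the iterative algorithm developed in the paper.

        # First-order eigenvalue estimate for a perturbed symmetric generalized
        # eigenproblem A x = lambda B x (standard formula; random test matrices).
        import numpy as np
        from scipy.linalg import eigh

        rng = np.random.default_rng(2)
        n = 8
        A = rng.normal(size=(n, n)); A = (A + A.T) / 2
        B = rng.normal(size=(n, n)); B = B @ B.T + n * np.eye(n)     # SPD "mass" matrix

        dA = 1e-3 * rng.normal(size=(n, n)); dA = (dA + dA.T) / 2    # small perturbations
        dB = 1e-3 * rng.normal(size=(n, n)); dB = (dB + dB.T) / 2

        lam, X = eigh(A, B)        # eigenvectors are B-orthonormal: X.T @ B @ X = I

        # First-order estimate: lam_i' ~= lam_i + x_i^T (dA - lam_i dB) x_i
        lam_est = np.array([lam[i] + X[:, i] @ (dA - lam[i] * dB) @ X[:, i]
                            for i in range(n)])

        lam_exact, _ = eigh(A + dA, B + dB)
        print(np.max(np.abs(lam_est - lam_exact)))   # error is second order in the perturbation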

  13. EOS: A project to investigate the design and construction of real-time distributed Embedded Operating Systems

    NASA Technical Reports Server (NTRS)

    Campbell, R. H.; Essick, Ray B.; Johnston, Gary; Kenny, Kevin; Russo, Vince

    1987-01-01

    Project EOS is studying the problems of building adaptable real-time embedded operating systems for the scientific missions of NASA. Choices (A Class Hierarchical Open Interface for Custom Embedded Systems) is an operating system designed and built by Project EOS to address the following specific issues: the software architecture for adaptable embedded parallel operating systems, the achievement of high-performance and real-time operation, the simplification of interprocess communications, the isolation of operating system mechanisms from one another, and the separation of mechanisms from policy decisions. Choices is written in C++ and runs on a ten processor Encore Multimax. The system is intended for use in constructing specialized computer applications and research on advanced operating system features including fault tolerance and parallelism.

  14. Reactive system verification case study: Fault-tolerant transputer communication

    NASA Technical Reports Server (NTRS)

    Crane, D. Francis; Hamory, Philip J.

    1993-01-01

    A reactive program is one which engages in an ongoing interaction with its environment. A system which is controlled by an embedded reactive program is called a reactive system. Examples of reactive systems are aircraft flight management systems, bank automatic teller machine (ATM) networks, airline reservation systems, and computer operating systems. Reactive systems are often naturally modeled (for logical design purposes) as a composition of autonomous processes which progress concurrently and which communicate to share information and/or to coordinate activities. Formal (i.e., mathematical) frameworks for system verification are tools used to increase the users' confidence that a system design satisfies its specification. A framework for reactive system verification includes formal languages for system modeling and for behavior specification and decision procedures and/or proof-systems for verifying that the system model satisfies the system specifications. Using the Ostroff framework for reactive system verification, an approach to achieving fault-tolerant communication between transputers was shown to be effective. The key components of the design, the decoupler processes, may be viewed as discrete-event-controllers introduced to constrain system behavior such that system specifications are satisfied. The Ostroff framework was also effective. The expressiveness of the modeling language permitted construction of a faithful model of the transputer network. The relevant specifications were readily expressed in the specification language. The set of decision procedures provided was adequate to verify the specifications of interest. The need for improved support for system behavior visualization is emphasized.

  15. Annual Rainfall Forecasting by Using Mamdani Fuzzy Inference System

    NASA Astrophysics Data System (ADS)

    Fallah-Ghalhary, G.-A.; Habibi Nokhandan, M.; Mousavi Baygi, M.

    2009-04-01

    Long-term rainfall prediction is very important to countries with agriculture-based economies. In general, climate and rainfall are highly non-linear natural phenomena, giving rise to what is known as the "butterfly effect". The parameters required to predict rainfall are enormous, even for a short period. Soft computing is an innovative approach to constructing computationally intelligent systems that are supposed to possess humanlike expertise within a specific domain, adapt themselves and learn to do better in changing environments, and explain how they make decisions. Unlike conventional artificial intelligence techniques, the guiding principle of soft computing is to exploit tolerance for imprecision, uncertainty, and partial truth to achieve tractability, robustness, and better rapport with reality. In this paper, 33 years of rainfall data from Khorasan province, the northeastern part of Iran situated at latitude-longitude pairs (31°-38°N, 74°-80°E), were analyzed; this research attempted to train Fuzzy Inference System (FIS) based prediction models with those 33 years of rainfall data. For performance evaluation, the model-predicted outputs were compared with the actual rainfall data. Simulation results reveal that soft computing techniques are promising and efficient. The test results of the FIS model showed an RMSE of 52 millimeters.
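
    To make the Mamdani inference step concrete, the sketch below runs a hand-rolled Mamdani system end to end: triangular memberships, min implication, max aggregation, and centroid defuzzification. The input variables, membership functions, and rules are toy placeholders; they are not the predictors or the trained rule base used in the paper.

        # Minimal Mamdani fuzzy inference sketch (toy memberships and rules).
        import numpy as np

        def tri(x, a, b, c):
            """Triangular membership function with feet a, c and peak b."""
            return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

        rain = np.linspace(0, 400, 801)          # output universe: annual rainfall (mm)
        out_low  = tri(rain, 0, 100, 200)
        out_high = tri(rain, 150, 300, 400)

        def predict(humidity, temperature):
            # Fuzzify crisp inputs (toy input memberships).
            hum_high  = tri(np.array([humidity]), 0.4, 1.0, 1.6)[0]
            temp_high = tri(np.array([temperature]), 15, 40, 65)[0]
            # Rule 1: IF humidity is high    THEN rainfall is high  (min implication)
            # Rule 2: IF temperature is high THEN rainfall is low
            r1 = np.minimum(hum_high, out_high)
            r2 = np.minimum(temp_high, out_low)
            agg = np.maximum(r1, r2)             # max aggregation
            if agg.sum() == 0:
                return float("nan")
            return float((rain * agg).sum() / agg.sum())   # centroid defuzzification

        print(round(predict(humidity=0.8, temperature=25.0), 1))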

  16. Modeling the data management system of Space Station Freedom with DEPEND

    NASA Technical Reports Server (NTRS)

    Olson, Daniel P.; Iyer, Ravishankar K.; Boyd, Mark A.

    1993-01-01

    Some of the features and capabilities of the DEPEND simulation-based modeling tool are described. A study of a 1553B local bus subsystem of the Space Station Freedom Data Management System (SSF DMS) is used to illustrate some types of system behavior that can be important to reliability and performance evaluations of this type of spacecraft. A DEPEND model of the subsystem is used to illustrate how these types of system behavior can be modeled, and shows what kinds of engineering and design questions can be answered through the use of these modeling techniques. DEPEND's process-based simulation environment is shown to provide a flexible method for modeling complex interactions between hardware and software elements of a fault-tolerant computing system.

  17. Performance enhancement of a web-based picture archiving and communication system using commercial off-the-shelf server clusters.

    PubMed

    Liu, Yan-Lin; Shih, Cheng-Ting; Chang, Yuan-Jen; Chang, Shu-Jun; Wu, Jay

    2014-01-01

    The rapid development of picture archiving and communication systems (PACSs) thoroughly changes the way of medical informatics communication and management. However, as the scale of a hospital's operations increases, the large amount of digital images transferred in the network inevitably decreases system efficiency. In this study, a server cluster consisting of two server nodes was constructed. Network load balancing (NLB), distributed file system (DFS), and structured query language (SQL) duplication services were installed. A total of 1 to 16 workstations were used to transfer computed radiography (CR), computed tomography (CT), and magnetic resonance (MR) images simultaneously to simulate the clinical situation. The average transmission rate (ATR) was analyzed between the cluster and noncluster servers. In the download scenario, the ATRs of CR, CT, and MR images increased by 44.3%, 56.6%, and 100.9%, respectively, when using the server cluster, whereas the ATRs increased by 23.0%, 39.2%, and 24.9% in the upload scenario. In the mix scenario, the transmission performance increased by 45.2% when using eight computer units. The fault tolerance mechanisms of the server cluster maintained the system availability and image integrity. The server cluster can improve the transmission efficiency while maintaining high reliability and continuous availability in a healthcare environment.

  18. Performance Enhancement of a Web-Based Picture Archiving and Communication System Using Commercial Off-the-Shelf Server Clusters

    PubMed Central

    Chang, Shu-Jun; Wu, Jay

    2014-01-01

    The rapid development of picture archiving and communication systems (PACSs) thoroughly changes the way of medical informatics communication and management. However, as the scale of a hospital's operations increases, the large amount of digital images transferred in the network inevitably decreases system efficiency. In this study, a server cluster consisting of two server nodes was constructed. Network load balancing (NLB), distributed file system (DFS), and structured query language (SQL) duplication services were installed. A total of 1 to 16 workstations were used to transfer computed radiography (CR), computed tomography (CT), and magnetic resonance (MR) images simultaneously to simulate the clinical situation. The average transmission rate (ATR) was analyzed between the cluster and noncluster servers. In the download scenario, the ATRs of CR, CT, and MR images increased by 44.3%, 56.6%, and 100.9%, respectively, when using the server cluster, whereas the ATRs increased by 23.0%, 39.2%, and 24.9% in the upload scenario. In the mix scenario, the transmission performance increased by 45.2% when using eight computer units. The fault tolerance mechanisms of the server cluster maintained the system availability and image integrity. The server cluster can improve the transmission efficiency while maintaining high reliability and continuous availability in a healthcare environment. PMID:24701580

  19. Fault tolerance in an inner-outer solver: A GVR-enabled case study

    DOE PAGES

    Zhang, Ziming; Chien, Andrew A.; Teranishi, Keita

    2015-04-18

    Resilience is a major challenge for large-scale systems. It is particularly important for iterative linear solvers, since they take much of the time of many scientific applications. We show that single bit flip errors in the Flexible GMRES iterative linear solver can lead to high computational overhead or even failure to converge to the right answer. Informed by these results, we design and evaluate several strategies for fault tolerance in both inner and outer solvers appropriate across a range of error rates. We implement them, extending Trilinos’ solver library with the Global View Resilience (GVR) programming model, which provides multi-stream snapshots, multi-version data structures with portable and rich error checking/recovery. Lastly, experimental results validate correct execution with low performance overhead under varied error conditions.
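
    One generic flavor of the inner-solver protection discussed above is to sanity-check each inner result against its residual and redo the solve when the check fails. The sketch below illustrates that idea on a toy inner-outer (iterative refinement) loop with an injected corruption; it is only an illustration of the concept, not the GVR/Trilinos implementation evaluated in the paper.

        # Toy inner-outer loop with residual checking of the inner solve
        # (illustration of the idea only; matrices and tolerances are arbitrary).
        import numpy as np

        rng = np.random.default_rng(3)
        n = 50
        A = rng.normal(size=(n, n)) + n * np.eye(n)   # well-conditioned test matrix
        b = rng.normal(size=n)

        def inner_solve(r, inject_fault=False):
            z = np.linalg.solve(A, r)                 # stand-in for an inner iterative solver
            if inject_fault:
                z[7] *= 2 ** 20                       # simulate a silent bit-flip-like error
            return z

        def outer_solve(tol=1e-10, check_tol=1e-6):
            x = np.zeros(n)
            fault_pending = True                      # corrupt the first inner solve only
            for _ in range(20):
                r = b - A @ x
                if np.linalg.norm(r) < tol:
                    break
                z = inner_solve(r, inject_fault=fault_pending)
                fault_pending = False
                # Cheap sanity check on the inner result; redo the solve if it fails.
                if np.linalg.norm(A @ z - r) > check_tol * np.linalg.norm(r):
                    z = inner_solve(r)
                x = x + z
            return x

        x = outer_solve()
        print(np.linalg.norm(b - A @ x))              # small residual despite the injected error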

  20. Noise tolerant dendritic lattice associative memories

    NASA Astrophysics Data System (ADS)

    Ritter, Gerhard X.; Schmalz, Mark S.; Hayden, Eric; Tucker, Marc

    2011-09-01

    Linear classifiers based on computation over the real numbers R (e.g., with operations of addition and multiplication), denoted by (R, +, x), have been studied extensively in the pattern recognition literature. However, a different approach to pattern classification involves the use of addition, maximum, and minimum operations over the reals in the algebra (R, +, maximum, minimum). These pattern classifiers, based on lattice algebra, have been shown to exhibit superior information storage capacity, fast training and short convergence times, high pattern classification accuracy, and low computational cost. Such attributes are not always found, for example, in classical neural nets based on the linear inner product. In a special type of lattice associative memory (LAM), called a dendritic LAM or DLAM, it is possible to achieve noise-tolerant pattern classification by varying the design of noise or error acceptance bounds. This paper presents theory and algorithmic approaches for the computation of noise-tolerant lattice associative memories (LAMs) under a variety of input constraints. Of particular interest is the classification of nonergodic data in noise regimes with time-varying statistics. DLAMs, which are a specialization of LAMs derived from concepts of biological neural networks, have successfully been applied to pattern classification from hyperspectral remote sensing data, as well as spatial object recognition from digital imagery. The authors' recent research in the development of DLAMs is overviewed, with experimental results that show utility for a wide variety of pattern classification applications. Performance results are presented in terms of measured computational cost, noise tolerance, classification accuracy, and throughput for a variety of input data and noise levels.
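
    The sketch below shows the basic lattice auto-associative memory that underlies DLAMs, using only the (R, +, max, min) operations described above: the memory is built with a min of pairwise differences and recalled with a max-plus product, and every uncorrupted stored pattern is recalled exactly. The dendritic structures and the tuned noise/error acceptance bounds discussed in the paper are not included in this minimal version.

        # Minimal lattice auto-associative memory in the (R, +, max, min) algebra.
        # (The dendritic extensions and noise bounds of DLAMs are not shown.)
        import numpy as np

        def build_memory(X):
            """W[i, j] = min over stored patterns of (x_i - x_j)."""
            diffs = X[:, :, None] - X[:, None, :]     # shape (patterns, i, j)
            return diffs.min(axis=0)

        def recall(W, x):
            """Max-plus product: y_i = max_j (W[i, j] + x_j)."""
            return (W + x[None, :]).max(axis=1)

        rng = np.random.default_rng(4)
        X = rng.normal(size=(5, 8))                   # five stored 8-dimensional patterns
        W = build_memory(X)

        for x in X:                                   # perfect recall of stored patterns
            assert np.allclose(recall(W, x), x)
        print("all stored patterns recalled exactly")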

  1. Variation tolerant SoC design

    NASA Astrophysics Data System (ADS)

    Kozhikkottu, Vivek J.

    The scaling of integrated circuits into the nanometer regime has led to variations emerging as a primary concern for designers of integrated circuits. Variations are an inevitable consequence of the semiconductor manufacturing process, and also arise due to the side-effects of operation of integrated circuits (voltage, temperature, and aging). Conventional design approaches, which are based on design corners or worst-case scenarios, leave designers with an undesirable choice between the considerable overheads associated with over-design and significantly reduced manufacturing yield. Techniques for variation-tolerant design at the logic, circuit and layout levels of the design process have been developed and are in commercial use. However, with the incessant increase in variations due to technology scaling and design trends such as near-threshold computing, these techniques are no longer sufficient to contain the effects of variations, and there is a need to address variations at all stages of design. This thesis addresses the problem of variation-tolerant design at the earliest stages of the design process, where the system-level design decisions that are made can have a very significant impact. There are two key aspects to making system-level design variation-aware. First, analysis techniques must be developed to project the impact of variations on system-level metrics such as application performance and energy. Second, variation-tolerant design techniques need to be developed to absorb the residual impact of variations (that cannot be contained through lower-level techniques). In this thesis, we address both these facets by developing robust and scalable variation-aware analysis and variation mitigation techniques at the system level. The first contribution of this thesis is a variation-aware system-level performance analysis framework. We address the key challenge of translating the per-component clock frequency distributions into a system-level application performance distribution. This task is particularly complex and challenging due to the inter-dependencies between components' execution, indirect effects of shared resources, and interactions between multiple system-level "execution paths". We argue that accurate variation-aware performance analysis requires Monte-Carlo based repeated system execution. Our proposed analysis framework leverages emulation to significantly speedup performance analysis without sacrificing the generality and accuracy achieved by Monte-Carlo based simulations. Our experiments show performance improvements of around 60x compared to state-of-the-art hardware-software co-simulation tools and also underscore the framework's potential to enable variation-aware design and exploration at the system level. Our second contribution addresses the problem of designing variation-tolerant SoCs using recovery based design, a popular circuit design paradigm that addresses variations by eliminating guard-bands and operating circuits at close to "zero margins" while detecting and recovering from timing errors. While previous efforts have demonstrated the potential benefits of recovery based design, we identify several challenges that need to be addressed in order to apply this technique to SoCs. We present a systematic design framework to apply recovery based design at the system level. We propose to partition SoCs into "recovery islands", wherein each recovery island consists of one or more SoC components that can recover independent of the rest of the SoC. 
We present a variation-aware design methodology that partitions a given SoC into recovery islands and computes the optimal operating points for each island, taking into account the various trade-offs involved. Our experiments demonstrate that the proposed design framework achieves an average of 32% energy savings over conventional worst-case designs, with negligible losses in performance. The third contribution of this thesis introduces disproportionate allocation of shared system resources as a means to combat the adverse impact of within-die variations on multi-core platforms. For multi-threaded programs executing on variation-impacted multi-core platforms, we make the key observation that thread performance is not only a function of the frequency of the core on which it is executing, but also depends upon the amount of shared system resources allocated to it. We utilize this insight to design a variation-aware runtime scheme which allocates the ways of a last-level shared L2 cache amongst the different cores/threads of a multi-core platform, taking into account both application characteristics and chip-specific variation profiles. Our experiments on 100 quad-core chips, each with a distinct variation profile, show an average 15% performance improvement for a suite of multi-threaded benchmarks. Our final contribution investigates the variation-tolerant design of domain-specific accelerators and demonstrates how the unique architectural properties of these accelerators can be leveraged to create highly effective variation tolerance mechanisms. We explore this concept through the variation-tolerant design of a vector processor that efficiently executes applications from the domains of recognition, mining and synthesis (RMS). We develop a novel design approach for variation tolerance, which leverages the unique nature of the vector reduction operations performed by this processor to effectively predict and preempt the occurrence of timing errors under variations and subsequently restore the correct output at the end of each vector reduction operation. We implement the above predict, preempt and restore operations by suitably enhancing the processor hardware and the application software and demonstrate considerable energy benefits (32% on average) across six applications from the domains of RMS. In conclusion, our work provides system designers with powerful tools and mechanisms in their efforts to combat variations, resulting in improved designer productivity and variation-tolerant systems.

  2. Recognition of partially occluded threat objects using the annealed Hopfield network

    NASA Technical Reports Server (NTRS)

    Kim, Jung H.; Yoon, Sung H.; Park, Eui H.; Ntuen, Celestine A.

    1992-01-01

    Recognition of partially occluded objects has been an important issue for airport security because occlusion causes significant problems in identifying and locating objects during baggage inspection. The neural network approach is suitable for these problems in the sense that the inherent parallelism of neural networks pursues many hypotheses in parallel, resulting in high computation rates. Moreover, they provide a greater degree of robustness or fault tolerance than conventional computers. The annealed Hopfield network, which is derived from mean field annealing (MFA), has been developed to find global solutions of a nonlinear system. In the study, it has been proven that the system temperature of MFA is equivalent to the gain of the sigmoid function of a Hopfield network. In our early work, we developed the hybrid Hopfield network (HHN) for fast and reliable matching. However, HHN doesn't guarantee global solutions and yields false matching under heavily occluded conditions because HHN is dependent on initial states by its nature. In this paper, we present the annealed Hopfield network (AHN) for occluded object matching problems. In AHN, the mean field theory is applied to the hybrid Hopfield network in order to improve the computational complexity of the annealed Hopfield network and provide reliable matching under heavily occluded conditions. AHN is slower than HHN. However, AHN provides near-global solutions without initial restrictions and yields fewer false matches than HHN. In conclusion, a new algorithm based upon a neural network approach was developed to demonstrate the feasibility of the automated inspection of threat objects from x-ray images. The robustness of the algorithm is proved by identifying occluded target objects with large tolerance of their features.
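
    The mean-field annealing idea referred to above (the system temperature acting as the inverse gain of the Hopfield sigmoid) can be written in a few lines. The sketch below anneals a toy Hopfield-style quadratic energy; the weights, schedule, and iteration counts are arbitrary, and this is not the occluded-object matching network of the paper.

        # Mean-field annealing on a toy Hopfield-style energy: the neuron update
        # uses a sigmoid whose gain is 1/T, and T is lowered on a schedule.
        import numpy as np

        rng = np.random.default_rng(5)
        n = 12
        W = rng.normal(size=(n, n)); W = (W + W.T) / 2
        np.fill_diagonal(W, 0.0)                      # symmetric weights, no self-coupling
        b = rng.normal(size=n)

        def energy(v):
            return -0.5 * v @ W @ v - b @ v

        v = np.zeros(n)                               # mean-field activations in [-1, 1]
        T = 5.0
        while T > 0.01:
            for _ in range(50):                       # relax at this temperature
                v = np.tanh((W @ v + b) / T)          # sigmoid gain is 1/T
            T *= 0.9                                  # annealing schedule
        print(np.sign(v), energy(np.sign(v)))         # near-binary low-energy state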

  3. Advanced information processing system - Status report. [for fault tolerant and damage tolerant data processing for aerospace vehicles

    NASA Technical Reports Server (NTRS)

    Brock, L. D.; Lala, J.

    1986-01-01

    The Advanced Information Processing System (AIPS) is designed to provide a fault tolerant and damage tolerant data processing architecture for a broad range of aerospace vehicles. The AIPS architecture also has attributes to enhance system effectiveness such as graceful degradation, growth and change tolerance, integrability, etc. Two key building blocks being developed by the AIPS program are a fault and damage tolerant processor and communication network. A proof-of-concept system is now being built and will be tested to demonstrate the validity and performance of the AIPS concepts.

  4. Assessing the Progress of Trapped-Ion Processors Towards Fault-Tolerant Quantum Computation

    NASA Astrophysics Data System (ADS)

    Bermudez, A.; Xu, X.; Nigmatullin, R.; O'Gorman, J.; Negnevitsky, V.; Schindler, P.; Monz, T.; Poschinger, U. G.; Hempel, C.; Home, J.; Schmidt-Kaler, F.; Biercuk, M.; Blatt, R.; Benjamin, S.; Müller, M.

    2017-10-01

    A quantitative assessment of the progress of small prototype quantum processors towards fault-tolerant quantum computation is a problem of current interest in experimental and theoretical quantum information science. We introduce a necessary and fair criterion for quantum error correction (QEC), which must be achieved in the development of these quantum processors before their sizes are sufficiently big to consider the well-known QEC threshold. We apply this criterion to benchmark the ongoing effort in implementing QEC with topological color codes using trapped-ion quantum processors and, more importantly, to guide the future hardware developments that will be required in order to demonstrate beneficial QEC with small topological quantum codes. In doing so, we present a thorough description of a realistic trapped-ion toolbox for QEC and a physically motivated error model that goes beyond standard simplifications in the QEC literature. We focus on laser-based quantum gates realized in two-species trapped-ion crystals in high-optical aperture segmented traps. Our large-scale numerical analysis shows that, with the foreseen technological improvements described here, this platform is a very promising candidate for fault-tolerant quantum computation.

  5. Computer-aided operations engineering with integrated models of systems and operations

    NASA Technical Reports Server (NTRS)

    Malin, Jane T.; Ryan, Dan; Fleming, Land

    1994-01-01

    CONFIG 3 is a prototype software tool that supports integrated conceptual design evaluation from early in the product life cycle, by supporting isolated or integrated modeling, simulation, and analysis of the function, structure, behavior, failures and operation of system designs. Integration and reuse of models is supported in an object-oriented environment providing capabilities for graph analysis and discrete event simulation. Integration is supported among diverse modeling approaches (component view, configuration or flow path view, and procedure view) and diverse simulation and analysis approaches. Support is provided for integrated engineering in diverse design domains, including mechanical and electro-mechanical systems, distributed computer systems, and chemical processing and transport systems. CONFIG supports abstracted qualitative and symbolic modeling, for early conceptual design. System models are component structure models with operating modes, with embedded time-related behavior models. CONFIG supports failure modeling and modeling of state or configuration changes that result in dynamic changes in dependencies among components. Operations and procedure models are activity structure models that interact with system models. CONFIG is designed to support evaluation of system operability, diagnosability and fault tolerance, and analysis of the development of system effects of problems over time, including faults, failures, and procedural or environmental difficulties.

  6. Joint nonlinearity effects in the design of a flexible truss structure control system

    NASA Technical Reports Server (NTRS)

    Mercadal, Mathieu

    1986-01-01

    Nonlinear effects are introduced in the dynamics of large space truss structures by the connecting joints which are designed with rather important tolerances to facilitate the assembly of the structures in space. The purpose was to develop means to investigate the nonlinear dynamics of the structures, particularly the limit cycles that might occur when active control is applied to the structures. An analytical method was sought and derived to predict the occurrence of limit cycles and to determine their stability. This method is mainly based on the quasi-linearization of every joint using describing functions. This approach was proven successful when simple dynamical systems were tested. Its applicability to larger systems depends on the amount of computations it requires, and estimates of the computational task tend to indicate that the number of individual sources of nonlinearity should be limited. Alternate analytical approaches, which do not account for every single nonlinearity, or the simulation of a simplified model of the dynamical system should, therefore, be investigated to determine a more effective way to predict limit cycles in large dynamical systems with an important number of distributed nonlinearities.

  7. Graphical workstation capability for reliability modeling

    NASA Technical Reports Server (NTRS)

    Bavuso, Salvatore J.; Koppen, Sandra V.; Haley, Pamela J.

    1992-01-01

    In addition to computational capabilities, software tools for estimating the reliability of fault-tolerant digital computer systems must also provide a means of interfacing with the user. Described here is the new graphical interface capability of the hybrid automated reliability predictor (HARP), a software package that implements advanced reliability modeling techniques. The graphics-oriented (GO) module provides the user with a graphical language for modeling system failure modes through the selection of various fault-tree gates, including sequence-dependency gates, or by a Markov chain. By using this graphical input language, a fault tree becomes a convenient notation for describing a system. In accounting for any sequence dependencies, HARP converts the fault-tree notation to a complex stochastic process that is reduced to a Markov chain, which it can then solve for system reliability. The graphics capability is available for use on an IBM-compatible PC, a Sun, and a VAX workstation. The GO module is written in the C programming language and uses the Graphical Kernel System (GKS) standard for graphics implementation. The PC, VAX, and Sun versions of the HARP GO module are currently in beta-testing stages.

  8. Management and design of long-life systems; Proceedings of the Symposium, Denver, Colo., April 24-26, 1973

    NASA Technical Reports Server (NTRS)

    Schurmeier, H. M.

    1974-01-01

    The long life of Pioneer interplanetary spacecraft is considered along with a general accelerated methodology for long-life mechanical components, dependable long-lived household appliances, and the design and development philosophy to achieve reliability and long life in large turbine generators. Other topics discussed include an integrated management approach to long life in space, artificial heart reliability factors, and architectural concepts and redundancy techniques in fault-tolerant computers. Individual items are announced in this issue.

  9. Mechanical verification of a schematic Byzantine clock synchronization algorithm

    NASA Technical Reports Server (NTRS)

    Shankar, Natarajan

    1991-01-01

    Schneider generalizes a number of protocols for Byzantine fault tolerant clock synchronization and presents a uniform proof for their correctness. The authors present a machine checked proof of this schematic protocol that revises some of the details in Schneider's original analysis. The verification was carried out with the EHDM system developed at the SRI Computer Science Laboratory. The mechanically checked proofs include the verification that the egocentric mean function used in Lamport and Melliar-Smith's Interactive Convergence Algorithm satisfies the requirements of Schneider's protocol.
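
    For concreteness, the egocentric mean mentioned above can be sketched as follows: each clock replaces any reading that differs from its own by more than a bound Delta with its own value, then averages. The bound and the clock readings below are invented for the example; this is only an illustration of the convergence function, not the verified EHDM development.

        # Sketch of the egocentric mean from the Interactive Convergence
        # Algorithm (toy numbers; DELTA is an assumed skew bound).
        DELTA = 10.0   # maximum tolerated skew between correct clocks

        def egocentric_mean(own, readings):
            adjusted = [r if abs(r - own) <= DELTA else own for r in readings]
            return sum(adjusted) / len(adjusted)

        # Four clocks; clock 3 is faulty and reports an absurd value.
        readings = [100.0, 103.0, 98.0, 100000.0]
        corrections = [egocentric_mean(readings[i], readings) for i in range(3)]
        print(corrections)   # the correct clocks move closer together despite the fault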

  10. Scalable Failure Masking for Stencil Computations using Ghost Region Expansion and Cell to Rank Remapping

    DOE PAGES

    Gamell, Marc; Teranishi, Keita; Kolla, Hemanth; ...

    2017-10-26

    In order to achieve exascale systems, application resilience needs to be addressed. Some programming models, such as task-DAG (directed acyclic graphs) architectures, currently embed resilience features whereas traditional SPMD (single program, multiple data) and message-passing models do not. Since a large part of the community's code base follows the latter models, it is still required to take advantage of application characteristics to minimize the overheads of fault tolerance. To that end, this paper explores how recovering from hard process/node failures in a local manner is a natural approach for certain applications to obtain resilience at lower costs in faulty environments. In particular, this paper targets enabling online, semitransparent local recovery for stencil computations on current leadership-class systems as well as presents programming support and scalable runtime mechanisms. Also described and demonstrated in this paper is the effect of failure masking, which allows the effective reduction of impact on total time to solution due to multiple failures. Furthermore, we discuss, implement, and evaluate ghost region expansion and cell-to-rank remapping to increase the probability of failure masking. To conclude, this paper shows the integration of all aforementioned mechanisms with the S3D combustion simulation through an experimental demonstration (using the Titan system) of the ability to tolerate high failure rates (i.e., node failures every five seconds) with low overhead while sustaining performance at large scales. In addition, this demonstration also displays the failure masking probability increase resulting from the combination of both ghost region expansion and cell-to-rank remapping.
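
    The role that an expanded ghost (halo) region plays in local recovery can be seen in a toy 1D stencil: a halo of width h holds enough neighbor data to advance h time steps without communication, so a recovering rank can catch up while its neighbors keep computing. The decomposition, stencil, and sizes below are invented for illustration; this is not the S3D integration described in the paper.

        # Toy 1D heat stencil showing that a halo of width h supports h local steps.
        import numpy as np

        def advance(u, steps, alpha=0.25):
            """Advance 'steps' times; only interior cells are updated each step."""
            u = u.copy()
            for _ in range(steps):
                u[1:-1] = u[1:-1] + alpha * (u[:-2] - 2 * u[1:-1] + u[2:])
            return u

        n_local, halo = 16, 4                        # halo width 4 => 4 steps between exchanges
        rng = np.random.default_rng(6)
        global_u = rng.normal(size=3 * n_local)      # three "ranks" worth of data

        # The middle rank keeps its interior plus an expanded halo copied from neighbors.
        lo, hi = n_local, 2 * n_local
        with_halo = global_u[lo - halo:hi + halo].copy()

        # After 'halo' purely local steps, its interior still matches the global solution.
        local = advance(with_halo, steps=halo)[halo:-halo]
        reference = advance(global_u, steps=halo)[lo:hi]
        print(np.allclose(local, reference))         # True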

  11. Scalable Failure Masking for Stencil Computations using Ghost Region Expansion and Cell to Rank Remapping

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Gamell, Marc; Teranishi, Keita; Kolla, Hemanth

    In order to achieve exascale systems, application resilience needs to be addressed. Some programming models, such as task-DAG (directed acyclic graphs) architectures, currently embed resilience features whereas traditional SPMD (single program, multiple data) and message-passing models do not. Since a large part of the community's code base follows the latter models, it is still required to take advantage of application characteristics to minimize the overheads of fault tolerance. To that end, this paper explores how recovering from hard process/node failures in a local manner is a natural approach for certain applications to obtain resilience at lower costs in faulty environments. In particular, this paper targets enabling online, semitransparent local recovery for stencil computations on current leadership-class systems as well as presents programming support and scalable runtime mechanisms. Also described and demonstrated in this paper is the effect of failure masking, which allows the effective reduction of impact on total time to solution due to multiple failures. Furthermore, we discuss, implement, and evaluate ghost region expansion and cell-to-rank remapping to increase the probability of failure masking. To conclude, this paper shows the integration of all aforementioned mechanisms with the S3D combustion simulation through an experimental demonstration (using the Titan system) of the ability to tolerate high failure rates (i.e., node failures every five seconds) with low overhead while sustaining performance at large scales. In addition, this demonstration also displays the failure masking probability increase resulting from the combination of both ghost region expansion and cell-to-rank remapping.

  12. Space station data system analysis/architecture study. Task 2: Options development DR-5. Volume 1: Technology options

    NASA Technical Reports Server (NTRS)

    1985-01-01

    The second task in the Space Station Data System (SSDS) Analysis/Architecture Study is the development of an information base that will support the conduct of trade studies and provide sufficient data to make key design/programmatic decisions. This volume identifies the preferred options in the technology category and characterizes these options with respect to performance attributes, constraints, cost, and risk. The technology category includes advanced materials, processes, and techniques that can be used to enhance the implementation of SSDS design structures. The specific areas discussed are mass storage, including space and ground on-line storage and off-line storage; man/machine interface; data processing hardware, including flight computers and advanced/fault tolerant computer architectures; and software, including data compression algorithms, on-board high level languages, and software tools. Also discussed are artificial intelligence applications and hard-wire communications.

  13. Comparative analysis of techniques for evaluating the effectiveness of aircraft computing systems

    NASA Technical Reports Server (NTRS)

    Hitt, E. F.; Bridgman, M. S.; Robinson, A. C.

    1981-01-01

    Performability analysis is a technique developed for evaluating the effectiveness of fault-tolerant computing systems in multiphase missions. Performability was evaluated for its accuracy, practical usefulness, and relative cost. The evaluation was performed by applying performability and the fault tree method to a set of sample problems ranging from simple to moderately complex. The problems involved as many as five outcomes, two to five mission phases, permanent faults, and some functional dependencies. Transient faults and software errors were not considered. A different analyst was responsible for each technique. Significantly more time and effort were required to learn performability analysis than the fault tree method. Performability is inherently as accurate as fault tree analysis. For the sample problems, fault trees were more practical and less time consuming to apply, while performability required less ingenuity and was more checkable. Performability offers some advantages for evaluating very complex problems.

  14. Semiconductor-inspired design principles for superconducting quantum computing.

    PubMed

    Shim, Yun-Pil; Tahan, Charles

    2016-03-17

    Superconducting circuits offer tremendous design flexibility in the quantum regime culminating most recently in the demonstration of few qubit systems supposedly approaching the threshold for fault-tolerant quantum information processing. Competition in the solid-state comes from semiconductor qubits, where nature has bestowed some very useful properties which can be utilized for spin qubit-based quantum computing. Here we begin to explore how selective design principles deduced from spin-based systems could be used to advance superconducting qubit science. We take an initial step along this path proposing an encoded qubit approach realizable with state-of-the-art tunable Josephson junction qubits. Our results show that this design philosophy holds promise, enables microwave-free control, and offers a pathway to future qubit designs with new capabilities such as with higher fidelity or, perhaps, operation at higher temperature. The approach is also especially suited to qubits on the basis of variable super-semi junctions.

  15. Advanced Launch System Multi-Path Redundant Avionics Architecture Analysis and Characterization

    NASA Technical Reports Server (NTRS)

    Baker, Robert L.

    1993-01-01

    The objective of the Multi-Path Redundant Avionics Suite (MPRAS) program is the development of a set of avionic architectural modules which will be applicable to the family of launch vehicles required to support the Advanced Launch System (ALS). To enable ALS cost/performance requirements to be met, the MPRAS must support autonomy, maintenance, and testability capabilities which exceed those present in conventional launch vehicles. The multi-path redundant or fault tolerance characteristics of the MPRAS are necessary to offset a reduction in avionics reliability due to the increased complexity needed to support these new cost reduction and performance capabilities and to meet avionics reliability requirements which will provide cost-effective reductions in overall ALS recurring costs. A complex, real-time distributed computing system is needed to meet the ALS avionics system requirements. General Dynamics, Boeing Aerospace, and C.S. Draper Laboratory have proposed system architectures as candidates for the ALS MPRAS. The purpose of this document is to report the results of independent performance and reliability characterization and assessment analyses of each proposed candidate architecture and qualitative assessments of testability, maintainability, and fault tolerance mechanisms. These independent analyses were conducted as part of the MPRAS Part 2 program and were carried out under NASA Langley Research Contract NAS1-17964, Task Assignment 28.

  16. Design for dependability: A simulation-based approach. Ph.D. Thesis, 1993

    NASA Technical Reports Server (NTRS)

    Goswami, Kumar K.

    1994-01-01

    This research addresses issues in simulation-based system level dependability analysis of fault-tolerant computer systems. The issues and difficulties of providing a general simulation-based approach for system level analysis are discussed and a methodology that address and tackle these issues is presented. The proposed methodology is designed to permit the study of a wide variety of architectures under various fault conditions. It permits detailed functional modeling of architectural features such as sparing policies, repair schemes, routing algorithms as well as other fault-tolerant mechanisms, and it allows the execution of actual application software. One key benefit of this approach is that the behavior of a system under faults does not have to be pre-defined as it is normally done. Instead, a system can be simulated in detail and injected with faults to determine its failure modes. The thesis describes how object-oriented design is used to incorporate this methodology into a general purpose design and fault injection package called DEPEND. A software model is presented that uses abstractions of application programs to study the behavior and effect of software on hardware faults in the early design stage when actual code is not available. Finally, an acceleration technique that combines hierarchical simulation, time acceleration algorithms and hybrid simulation to reduce simulation time is introduced.

  17. Extending Moore's Law via Computationally Error Tolerant Computing.

    DOE PAGES

    Deng, Bobin; Srikanth, Sriseshan; Hein, Eric R.; ...

    2018-03-01

    Dennard scaling has ended. Lowering the voltage supply (V dd) to sub-volt levels causes intermittent losses in signal integrity, rendering further scaling (down) no longer acceptable as a means to lower the power required by a processor core. However, it is possible to correct the occasional errors caused due to lower V dd in an efficient manner and effectively lower power. By deploying the right amount and kind of redundancy, we can strike a balance between overhead incurred in achieving reliability and energy savings realized by permitting lower V dd. One promising approach is the Redundant Residue Number System (RRNS) representation. Unlike other error correcting codes, RRNS has the important property of being closed under addition, subtraction and multiplication, thus enabling computational error correction at a fraction of an overhead compared to conventional approaches. We use the RRNS scheme to design a Computationally-Redundant, Energy-Efficient core, including the microarchitecture, Instruction Set Architecture (ISA) and RRNS centered algorithms. Finally, from the simulation results, this RRNS system can reduce the energy-delay-product by about 3× for multiplication intensive workloads and by about 2× in general, when compared to a non-error-correcting binary core.
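
    A toy version of the RRNS representation makes the closure and error-correction properties concrete. In the sketch below the moduli, the choice of two redundant residues, and the exclusion-based correction are illustrative assumptions; the paper's actual core, ISA, and algorithms are not reproduced.

        # Minimal RRNS sketch: hypothetical moduli with two redundant residues,
        # residue-wise arithmetic, CRT decoding, and single-residue correction.
        from math import prod

        MODULI = [3, 5, 7, 11, 13]       # first three are the "information" moduli
        K = 3                            # legitimate range is prod(MODULI[:K]) = 105

        def encode(x):
            return [x % m for m in MODULI]

        def crt(residues, moduli):
            """Chinese-remainder reconstruction."""
            M, x = prod(moduli), 0
            for r, m in zip(residues, moduli):
                Mi = M // m
                x += r * Mi * pow(Mi, -1, m)
            return x % M

        def add(a, b):
            """Addition carried out independently per residue (closure under +)."""
            return [(ra + rb) % m for ra, rb, m in zip(a, b, MODULI)]

        def decode_correct(residues):
            """Drop each residue in turn; a reconstruction inside the legitimate
            range that matches the remaining residues is accepted."""
            for skip in range(len(MODULI)):
                keep = [i for i in range(len(MODULI)) if i != skip]
                x = crt([residues[i] for i in keep], [MODULI[i] for i in keep])
                if x < prod(MODULI[:K]) and all(x % MODULI[i] == residues[i] for i in keep):
                    return x
            raise ValueError("uncorrectable error")

        a, b = encode(42), encode(17)
        s = add(a, b)                    # residue-wise sum of 42 + 17
        s[2] = (s[2] + 3) % MODULI[2]    # inject a fault into one residue channel
        print(decode_correct(s))         # 59, recovered despite the faulty residue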

  18. Extending Moore's Law via Computationally Error Tolerant Computing.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Deng, Bobin; Srikanth, Sriseshan; Hein, Eric R.

    Dennard scaling has ended. Lowering the supply voltage (Vdd) to sub-volt levels causes intermittent losses in signal integrity, so further voltage scaling is no longer an acceptable means of lowering the power required by a processor core. However, it is possible to correct the occasional errors caused by a lower Vdd in an efficient manner and thereby effectively lower power. By deploying the right amount and kind of redundancy, we can strike a balance between the overhead incurred in achieving reliability and the energy savings realized by permitting a lower Vdd. One promising approach is the Redundant Residue Number System (RRNS) representation. Unlike other error-correcting codes, RRNS has the important property of being closed under addition, subtraction, and multiplication, enabling computational error correction at a fraction of the overhead of conventional approaches. We use the RRNS scheme to design a Computationally-Redundant, Energy-Efficient core, including the microarchitecture, Instruction Set Architecture (ISA), and RRNS-centered algorithms. Simulation results show that this RRNS system can reduce the energy-delay product by about 3× for multiplication-intensive workloads and by about 2× in general, compared to a non-error-correcting binary core.

  19. Building logical qubits in a superconducting quantum computing system

    NASA Astrophysics Data System (ADS)

    Gambetta, Jay M.; Chow, Jerry M.; Steffen, Matthias

    2017-01-01

    The technological world is in the midst of a quantum computing and quantum information revolution. Since Richard Feynman's famous `plenty of room at the bottom' lecture (Feynman, Engineering and Science 23, 22 (1960)), hinting at the notion of novel devices employing quantum mechanics, the quantum information community has taken gigantic strides in understanding the potential applications of a quantum computer and has laid the foundational requirements for building one. We believe that the next significant step will be to demonstrate a quantum memory, in which a system of interacting qubits stores an encoded logical qubit state longer than any of its constituent parts. Here, we describe the important route towards a logical memory with superconducting qubits, employing a rotated version of the surface code. The current status of technology with regard to interconnected superconducting-qubit networks is described, and near-term areas of focus to improve devices are identified. Overall, the progress in this exciting field has been astounding, but we are at an important turning point, where it will be critical to incorporate engineering solutions with quantum architectural considerations, laying the foundation for scalable fault-tolerant quantum computers in the near future.
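
    The notion of an encoded memory that outlives its parts can be illustrated with a deliberately simplified model. The sketch below uses a distance-3 bit-flip repetition code (it ignores phase errors entirely, unlike the rotated surface code the authors actually propose) purely to show why, below a threshold error rate, the logical failure rate drops below the physical one.

      # Simplified stand-in for a quantum-memory experiment: per round, a bare
      # qubit fails with probability p, while a distance-3 repetition code with
      # majority-vote correction fails only when 2 or more of its 3 qubits flip
      # (probability ~ 3*p**2 for small p).
      import random

      def physical_round(p, rng):
          return rng.random() < p

      def logical_round(p, rng):
          flips = sum(rng.random() < p for _ in range(3))
          return flips >= 2

      def failure_rate(round_fn, p, rounds=200000, seed=1):
          rng = random.Random(seed)
          return sum(round_fn(p, rng) for _ in range(rounds)) / rounds

      for p in (0.05, 0.01, 0.001):
          print(f"p={p}: bare {failure_rate(physical_round, p):.4f}  "
                f"encoded {failure_rate(logical_round, p):.6f}")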

  20. Physical Realization of a Supervised Learning System Built with Organic Memristive Synapses

    NASA Astrophysics Data System (ADS)

    Lin, Yu-Pu; Bennett, Christopher H.; Cabaret, Théo; Vodenicarevic, Damir; Chabi, Djaafar; Querlioz, Damien; Jousselme, Bruno; Derycke, Vincent; Klein, Jacques-Olivier

    2016-09-01

    Multiple modern applications of electronics call for inexpensive chips that can perform complex operations on natural data with limited energy. One vision for accomplishing this is to implement hardware neural networks, which fuse computation and memory, with low-cost organic electronics. A challenge, however, is the implementation of synapses (analog memories) composed of such materials. In this work, we introduce robust, rapidly programmable, nonvolatile organic memristive nanodevices based on electrografted redox complexes that implement synapses thanks to a wide range of accessible intermediate conductivity states. We demonstrate experimentally an elementary neural network, capable of learning functions, which combines four pairs of organic memristors as synapses and conventional electronics as neurons. Our architecture is highly resilient to issues caused by imperfect devices: it tolerates inter-device variability, and an adaptable learning rule offers immunity to asymmetries in device switching. Highly compliant with conventional fabrication processes, the system can be extended to larger computing systems capable of complex cognitive tasks, as demonstrated in complementary simulations.
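
    The learning scheme can be caricatured in a few lines of Python (an illustrative sketch, not the paper's circuit or exact rule): synapses are modeled as devices with bounded conductance, per-device variability, and asymmetric potentiation/depression steps, arranged in differential pairs and trained with a simple sign-based update; the network still learns a small logic function.

      # Illustrative sketch (not the paper's exact circuit or rule): perceptron-
      # style supervised learning with simulated memristive synapses.  Each
      # device has bounded conductance, its own random variability, and
      # asymmetric potentiation/depression steps; weights are differential
      # pairs, and a sign-based update still learns a small logic function.
      import random

      class Synapse:
          def __init__(self, rng):
              self.g = rng.uniform(0.3, 0.7)             # conductance (arbitrary units)
              self.up = 0.05 * rng.uniform(0.5, 1.5)     # device-to-device variability
              self.down = 0.08 * rng.uniform(0.5, 1.5)   # asymmetric switching steps

          def potentiate(self):
              self.g = min(1.0, self.g + self.up)

          def depress(self):
              self.g = max(0.0, self.g - self.down)

      def output(pairs, x):
          # Each weight is g_plus - g_minus, a common differential-pair arrangement.
          return 1 if sum((p.g - n.g) * xi for (p, n), xi in zip(pairs, x)) > 0 else 0

      def train(target, epochs=50, seed=2):
          rng = random.Random(seed)
          pairs = [(Synapse(rng), Synapse(rng)) for _ in range(3)]   # 2 inputs + bias
          data = [((a, b, 1), target(a, b)) for a in (0, 1) for b in (0, 1)]
          for _ in range(epochs):
              for x, y in data:
                  err = y - output(pairs, x)
                  if err == 0:
                      continue
                  for (p, n), xi in zip(pairs, x):
                      if xi == 0:
                          continue
                      if err > 0:
                          p.potentiate(); n.depress()
                      else:
                          p.depress(); n.potentiate()
          return [(x[:2], output(pairs, x)) for x, _ in data]

      print(train(lambda a, b: int(a or b)))   # learns OR despite imperfect devices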
