Sample records for software implemented fault

  1. Software-implemented fault insertion: An FTMP example

    NASA Technical Reports Server (NTRS)

    Czeck, Edward W.; Siewiorek, Daniel P.; Segall, Zary Z.

    1987-01-01

    This report presents a model for fault insertion through software; describes its implementation on a fault-tolerant computer, FTMP; presents a summary of fault detection, identification, and reconfiguration data collected with software-implemented fault insertion; and compares the results to hardware fault insertion data. Experimental results show detection time to be a function of time of insertion and system workload. For fault detection time, there is no correlation between software-inserted faults and hardware-inserted faults; this is because hardware-inserted faults must manifest as errors before detection, whereas software-inserted faults immediately exercise the error detection mechanisms. In summary, software-implemented fault insertion can be used as an evaluation technique for the fault-handling capabilities of a system in fault detection, identification, and recovery. Although software-inserted faults do not map directly to hardware-inserted faults, experiments show software-implemented fault insertion is capable of emulating hardware fault insertion, with greater ease and automation.
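
    As a concrete illustration of the idea (a minimal sketch, not the FTMP implementation; the memory image, single-bit-flip fault model, and checksum detector are all assumptions), the following Python fragment inserts a fault through software by flipping one bit of a simulated task state and measures the latency of a polling error-detection check:

      import random, time

      def checksum(words):
          # simple error detection mechanism over the simulated task state
          return sum(words) & 0xFFFFFFFF

      memory = [random.getrandbits(32) for _ in range(1024)]   # simulated memory image
      reference = checksum(memory)

      def insert_fault(mem):
          # software fault insertion: XOR one bit, as a hardware upset would
          mem[random.randrange(len(mem))] ^= 1 << random.randrange(32)

      t0 = time.perf_counter()
      insert_fault(memory)
      while checksum(memory) == reference:   # poll until the error is detected
          pass
      print(f"detected after {(time.perf_counter() - t0) * 1e6:.1f} microseconds")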

  2. Fault-tolerant software - Experiment with the SIFT operating system [Software Implemented Fault Tolerance computer]

    NASA Technical Reports Server (NTRS)

    Brunelle, J. E.; Eckhardt, D. E., Jr.

    1985-01-01

    Results are presented of an experiment conducted in the NASA Avionics Integrated Research Laboratory (AIRLAB) to investigate the implementation of fault-tolerant software techniques on fault-tolerant computer architectures, in particular the Software Implemented Fault Tolerance (SIFT) computer. The N-version programming and recovery block techniques were implemented on a portion of the SIFT operating system. The results indicate that, to effectively implement fault-tolerant software design techniques, system requirements will be impacted and suggest that retrofitting fault-tolerant software on existing designs will be inefficient and may require system modification.

  3. Fault tolerant software modules for SIFT

    NASA Technical Reports Server (NTRS)

    Hecht, M.; Hecht, H.

    1982-01-01

    The implementation of software fault tolerance is investigated for critical modules of the Software Implemented Fault Tolerance (SIFT) operating system to support the computational and reliability requirements of advanced fly-by-wire transport aircraft. Fault-tolerant designs generated for the error reporter and the global executive are examined. A description of the alternate routines, implementation requirements, and software validation is included.

  4. Design study of Software-Implemented Fault-Tolerance (SIFT) computer

    NASA Technical Reports Server (NTRS)

    Wensley, J. H.; Goldberg, J.; Green, M. W.; Kautz, W. H.; Levitt, K. N.; Mills, M. E.; Shostak, R. E.; Whiting-O'Keefe, P. M.; Zeidler, H. M.

    1982-01-01

    Software-implemented fault tolerant (SIFT) computer design for commercial aviation is reported. A SIFT design concept is addressed. Alternate strategies for physical implementation are considered. Hardware and software design correctness is addressed. System modeling and effectiveness evaluation are considered from a fault-tolerant point of view.

  5. Development and analysis of the Software Implemented Fault-Tolerance (SIFT) computer

    NASA Technical Reports Server (NTRS)

    Goldberg, J.; Kautz, W. H.; Melliar-Smith, P. M.; Green, M. W.; Levitt, K. N.; Schwartz, R. L.; Weinstock, C. B.

    1984-01-01

    SIFT (Software Implemented Fault Tolerance) is an experimental, fault-tolerant computer system designed to meet the extreme reliability requirements for safety-critical functions in advanced aircraft. Errors are masked by performing a majority voting operation over the results of identical computations, and faulty processors are removed from service by reassigning computations to the nonfaulty processors. This scheme has been implemented in a special architecture using a set of standard Bendix BDX930 processors, augmented by a special asynchronous-broadcast communication interface that provides direct, processor-to-processor communication among all processors. Fault isolation is accomplished in hardware; all other fault-tolerance functions, together with scheduling and synchronization, are implemented exclusively by executive system software. The system reliability is predicted by a Markov model. Mathematical consistency of the system software with respect to the reliability model has been partially verified, using recently developed tools for machine-aided proof of program correctness.
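
    The majority-voting and reconfiguration scheme summarized above can be sketched in a few lines of Python (an illustration only; SIFT's executive performs this in replicated system software over broadcast results, and the processor names and values here are invented):

      from collections import Counter

      def vote(results):
          """Return the majority value and the set of disagreeing processors."""
          value, _ = Counter(results.values()).most_common(1)[0]
          return value, {p for p, r in results.items() if r != value}

      active = {"P1", "P2", "P3"}
      results = {"P1": 42, "P2": 42, "P3": 41}   # P3 returns a faulty result
      value, faulty = vote(results)
      active -= faulty                           # reassign computations away from P3
      print(value, sorted(active))               # 42 ['P1', 'P2']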

  6. Coordinated Fault-Tolerance for High-Performance Computing Final Project Report

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Panda, Dhabaleswar Kumar; Beckman, Pete

    2011-07-28

    With the Coordinated Infrastructure for Fault Tolerance Systems (CIFTS, as the original project came to be called) project, our aim has been to understand and tackle the following broad research questions, the answers to which will help the HEC community analyze and shape the direction of research in the field of fault tolerance and resiliency on future high-end leadership systems. Will availability of global fault information, obtained by fault information exchange between the different HEC software on a system, allow individual system software to better detect, diagnose, and adaptively respond to faults? If fault-awareness is raised throughout the system through fault information exchange, is it possible to get all system software working together to provide a more comprehensive end-to-end fault management on the system? What are the missing fault-tolerance features that widely used HEC system software lacks today that would inhibit such software from taking advantage of systemwide global fault information? What are the practical limitations of a systemwide approach for end-to-end fault management based on fault awareness and coordination? What mechanisms, tools, and technologies are needed to bring about fault awareness and coordination of responses on a leadership-class system? What standards, outreach, and community interaction are needed for adoption of the concept of fault awareness and coordination for fault management on future systems? Keeping our overall objectives in mind, the CIFTS team has taken a parallel fourfold approach. Our central goal was to design and implement a lightweight, scalable infrastructure with a simple, standardized interface to allow communication of fault-related information through the system and facilitate coordinated responses. This work led to the development of the Fault Tolerance Backplane (FTB) publish-subscribe API specification, together with a reference implementation and several experimental implementations on top of existing publish-subscribe tools. We enhanced the intrinsic fault tolerance capabilities of representative implementations of a variety of key HPC software subsystems and integrated them with the FTB. The targeted software subsystems included MPI communication libraries, checkpoint/restart libraries, resource managers and job schedulers, and system monitoring tools. Leveraging the aforementioned infrastructure, as well as developing and utilizing additional tools, we have examined issues associated with expanded, end-to-end fault response from both system and application viewpoints. From the standpoint of system operations, we have investigated log and root cause analysis, anomaly detection and fault prediction, and generalized notification mechanisms. Our applications work has included libraries for fault-tolerant linear algebra, application frameworks for coupled multiphysics applications, and external frameworks to support the monitoring and response for general applications. Our final goal was to engage the high-end computing community to increase awareness of tools and issues around coordinated end-to-end fault management.
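
    The following toy publish-subscribe backplane suggests the flavor of such fault information exchange (this is not the real FTB API; the topic name, event fields, and subscriber responses are invented for illustration):

      from collections import defaultdict

      class Backplane:
          """Minimal publish-subscribe bus for fault-related events."""
          def __init__(self):
              self.subscribers = defaultdict(list)      # topic -> callbacks

          def subscribe(self, topic, callback):
              self.subscribers[topic].append(callback)

          def publish(self, topic, event):
              for callback in self.subscribers[topic]:  # fan out to every subscriber
                  callback(event)

      bus = Backplane()
      bus.subscribe("node.failure", lambda e: print("scheduler: drain node", e["node"]))
      bus.subscribe("node.failure", lambda e: print("MPI library: reroute around", e["node"]))
      bus.publish("node.failure", {"node": "n042", "cause": "ECC error"})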

  7. Analyzing and Predicting Effort Associated with Finding and Fixing Software Faults

    NASA Technical Reports Server (NTRS)

    Hamill, Maggie; Goseva-Popstojanova, Katerina

    2016-01-01

    Context: Software developers spend a significant amount of time fixing faults. However, not many papers have addressed the actual effort needed to fix software faults. Objective: The objective of this paper is twofold: (1) analysis of the effort needed to fix software faults and how it was affected by several factors and (2) prediction of the level of fix implementation effort based on the information provided in software change requests. Method: The work is based on data related to 1200 failures, extracted from the change tracking system of a large NASA mission. The analysis includes descriptive and inferential statistics. Predictions are made using three supervised machine learning algorithms and three sampling techniques aimed at addressing the imbalanced data problem. Results: Our results show that (1) 83% of the total fix implementation effort was associated with only 20% of failures. (2) Both safety critical failures and post-release failures required three times more effort to fix compared to non-critical and pre-release counterparts, respectively. (3) Failures with fixes spread across multiple components or across multiple types of software artifacts required more effort. The spread across artifacts was more costly than spread across components. (4) Surprisingly, some types of faults associated with later life-cycle activities did not require significant effort. (5) The level of fix implementation effort was predicted with 73% overall accuracy using the original, imbalanced data. Using oversampling techniques improved the overall accuracy up to 77%. More importantly, oversampling significantly improved the prediction of the high level effort, from 31% to around 85%. Conclusions: This paper shows the importance of tying software failures to changes made to fix all associated faults, in one or more software components and/or in one or more software artifacts, and the benefit of studying how the spread of faults and other factors affect the fix implementation effort.
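
    A generic sketch of the prediction step, under stated assumptions (synthetic features, a single decision tree, and naive random oversampling; the paper itself used three classifiers, three sampling techniques, and real change-request data):

      import random
      from sklearn.tree import DecisionTreeClassifier

      random.seed(0)
      # synthetic change requests: (components touched, artifact types, safety critical)
      X = [[random.randint(1, 5), random.randint(1, 3), random.randint(0, 1)]
           for _ in range(200)]
      y = ["high" if f[0] + f[1] + 2 * f[2] > 8 else "low" for f in X]  # imbalanced labels

      def oversample(X, y, minority="high"):
          """Duplicate minority-class samples until the classes are balanced."""
          pool = [(f, l) for f, l in zip(X, y) if l == minority]
          if not pool:
              return X, y
          extra = [random.choice(pool) for _ in range(y.count("low") - len(pool))]
          return X + [f for f, _ in extra], y + [l for _, l in extra]

      Xb, yb = oversample(X, y)
      model = DecisionTreeClassifier().fit(Xb, yb)
      print(model.predict([[5, 3, 1], [1, 1, 0]]))    # expected: ['high' 'low']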

  8. Coordinated Fault Tolerance for High-Performance Computing

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Dongarra, Jack; Bosilca, George; et al.

    2013-04-08

    Our work to meet our goal of end-to-end fault tolerance has focused on two areas: (1) improving fault tolerance in various software currently available and widely used throughout the HEC domain and (2) using fault information exchange and coordination to achieve holistic, systemwide fault tolerance and understanding how to design and implement interfaces for integrating fault tolerance features for multiple layers of the software stack—from the application, math libraries, and programming language runtime to other common system software such as jobs schedulers, resource managers, and monitoring tools.

  9. Transient Faults in Computer Systems

    NASA Technical Reports Server (NTRS)

    Masson, Gerald M.

    1993-01-01

    A powerful technique particularly appropriate for the detection of errors caused by transient faults in computer systems was developed. The technique can be implemented in either software or hardware; the research conducted thus far primarily considered software implementations. The error detection technique developed has the distinct advantage of having provably complete coverage of all errors caused by transient faults that affect the output produced by the execution of a program. In other words, the technique does not have to be tuned to a particular error model to enhance error coverage. Also, the correctness of the technique can be formally verified. The technique uses time and software redundancy. The foundation for an effective, low-overhead, software-based certification trail approach to real-time error detection resulting from transient fault phenomena was developed.
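
    The certification-trail idea referenced above can be sketched as follows (a simplified sorting example under assumed details; the published technique covers broader computations): the first execution emits extra data, the trail, that lets a second, cheaper execution check the output, so an error caused by a transient fault in either execution produces a mismatch:

      def sort_with_trail(xs):
          order = sorted(range(len(xs)), key=lambda i: xs[i])   # trail: a permutation
          return [xs[i] for i in order], order

      def certify(xs, result, trail):
          """Independent check of the first execution using its trail."""
          if sorted(trail) != list(range(len(xs))):   # trail must be a permutation
              return False
          if [xs[i] for i in trail] != result:        # result must match the trail
              return False
          return all(a <= b for a, b in zip(result, result[1:]))  # and be ordered

      data = [5, 2, 9, 1]
      result, trail = sort_with_trail(data)
      assert certify(data, result, trail)
      result[0] = 7                      # simulate corruption by a transient fault
      assert not certify(data, result, trail)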

  10. Study of fault-tolerant software technology

    NASA Technical Reports Server (NTRS)

    Slivinski, T.; Broglio, C.; Wild, C.; Goldberg, J.; Levitt, K.; Hitt, E.; Webb, J.

    1984-01-01

    Presented is an overview of the current state of the art of fault-tolerant software and an analysis of quantitative techniques and models developed to assess its impact. It examines research efforts as well as experience gained from commercial application of these techniques. The paper also addresses the computer architecture and design implications on hardware, operating systems and programming languages (including Ada) of using fault-tolerant software in real-time aerospace applications. It concludes that fault-tolerant software has progressed beyond the pure research state. The paper also finds that, although not perfectly matched, newer architectural and language capabilities provide many of the notations and functions needed to effectively and efficiently implement software fault-tolerance.

  11. Practical Issues in Implementing Software Reliability Measurement

    NASA Technical Reports Server (NTRS)

    Nikora, Allen P.; Schneidewind, Norman F.; Everett, William W.; Munson, John C.; Vouk, Mladen A.; Musa, John D.

    1999-01-01

    Many ways of estimating software systems' reliability, or reliability-related quantities, have been developed over the past several years. Of particular interest are methods that can be used to estimate a software system's fault content prior to test, or to discriminate between components that are fault-prone and those that are not. The results of these methods can be used to: 1) More accurately focus scarce fault identification resources on those portions of a software system most in need of it. 2) Estimate and forecast the risk of exposure to residual faults in a software system during operation, and develop risk and safety criteria to guide the release of a software system to fielded use. 3) Estimate the efficiency of test suites in detecting residual faults. 4) Estimate the stability of the software maintenance process.

  12. Software Implemented Fault-Tolerant (SIFT) user's guide

    NASA Technical Reports Server (NTRS)

    Green, D. F., Jr.; Palumbo, D. L.; Baltrus, D. W.

    1984-01-01

    Program development for a Software Implemented Fault Tolerant (SIFT) computer system is accomplished in the NASA LaRC AIRLAB facility using a DEC VAX-11 to interface with eight Bendix BDX 930 flight control processors. The interface software which provides this SIFT program development capability was developed by AIRLAB personnel. This technical memorandum describes the application and design of this software in detail, and is intended to assist both the user in performance of SIFT research and the systems programmer responsible for maintaining and/or upgrading the SIFT programming environment.

  13. An implementation and performance measurement of the progressive retry technique

    NASA Technical Reports Server (NTRS)

    Suri, Gaurav; Huang, Yennun; Wang, Yi-Min; Fuchs, W. Kent; Kintala, Chandra

    1995-01-01

    This paper describes a recovery technique called progressive retry for bypassing software faults in message-passing applications. The technique is implemented as reusable modules to provide application-level software fault tolerance. The paper describes the implementation of the technique and presents results from the application of progressive retry to two telecommunications systems. The results presented show that the technique is helpful in reducing the total recovery time for message-passing applications.
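
    In spirit, progressive retry escalates through recovery steps of increasing scope until one clears the fault; the sketch below (invented scopes and a toy transient fault, not the authors' reusable modules) shows the control flow:

      def progressive_retry(operation, steps):
          """Try recovery steps of increasing scope, cheapest first."""
          for scope, recover in steps:
              recover()
              try:
                  return operation()
              except Exception:
                  print(f"retry after '{scope}' failed; escalating")
          raise RuntimeError("all retry scopes exhausted")

      attempts = []
      def deliver_message():
          attempts.append(1)
          if len(attempts) < 2:                  # fails once, then succeeds
              raise IOError("transient message fault")
          return "delivered"

      steps = [("replay local message log", lambda: None),
               ("reorder in-transit messages", lambda: None),
               ("roll back dependent processes", lambda: None)]
      print(progressive_retry(deliver_message, steps))   # succeeds at the second scope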

  14. Study of fault tolerant software technology for dynamic systems

    NASA Technical Reports Server (NTRS)

    Caglayan, A. K.; Zacharias, G. L.

    1985-01-01

    The major aim of this study is to investigate the feasibility of using systems-based failure detection, isolation, and compensation (FDIC) techniques in building fault-tolerant software and extending them, whenever possible, to the domain of software fault tolerance. First, it is shown that systems-based FDIC methods can be extended to develop software error detection techniques by using system models for software modules. In particular, it is demonstrated that systems-based FDIC techniques can yield consistency checks that are easier to implement than acceptance tests based on software specifications. Next, it is shown that systems-based failure compensation techniques can be generalized to the domain of software fault tolerance in developing software error recovery procedures. Third, the feasibility of using fault-tolerant software in flight software is investigated. In particular, possible system and version instabilities and functional performance degradation that may occur in N-version programming applications to flight software are illustrated. Finally, a comparative analysis of N-version and recovery block techniques in the context of generic blocks in flight software is presented.
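
    A minimal sketch of the contrast drawn above (invented numbers and routines): a recovery block whose acceptance test is replaced by a systems-based consistency check derived from a simple model of the module's dynamics:

      def primary(v, dt):
          return v + 2.0 * dt          # 'optimized' integrator with a seeded bug

      def alternate(v, dt):
          return v + dt                # simpler, trusted routine

      def consistent(prev, new, dt, tol=1e-9):
          """System-model consistency check: the state must advance by dt."""
          return abs((new - prev) - dt) < tol

      def recovery_block(v, dt):
          out = primary(v, dt)
          if consistent(v, out, dt):
              return out
          return alternate(v, dt)      # primary failed the check; run the alternate

      print(recovery_block(1.0, 0.1))  # 1.1, via the alternate routine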

  15. Flight elements: Fault detection and fault management

    NASA Technical Reports Server (NTRS)

    Lum, H.; Patterson-Hine, A.; Edge, J. T.; Lawler, D.

    1990-01-01

    Fault management for an intelligent computational system must be developed using a top-down integrated engineering approach. The proposed approach includes integrating the overall environment involving sensors and their associated data; design knowledge capture; operations; fault detection, identification, and reconfiguration; testability; causal models including digraph matrix analysis; and overall performance impacts on the hardware and software architecture. Implementation of the concept to achieve a real-time intelligent fault detection and management system will be accomplished via several objectives: development of fault-tolerant/FDIR requirements and specifications at the systems level that carry through from conceptual design to implementation and mission operations; implementation of monitoring, diagnosis, and reconfiguration at all system levels, providing fault isolation and system integration; optimization of system operations to manage degraded system performance through system integration; and lowering of development and operations costs through the implementation of an intelligent real-time fault detection and fault management system and an information management system.

  16. User's guide to programming fault injection and data acquisition in the SIFT environment

    NASA Technical Reports Server (NTRS)

    Elks, Carl R.; Green, David F.; Palumbo, Daniel L.

    1987-01-01

    Described are the features, command language, and functional design of the SIFT (Software Implemented Fault Tolerance) fault injection and data acquisition interface software. The document is also intended to assist and guide the SIFT user in defining, developing, and executing SIFT fault injection experiments and the subsequent collection and reduction of that fault injection data. It is also intended to be used in conjunction with the SIFT User's Guide (NASA Technical Memorandum 86289) for reference to SIFT system commands, procedures and functions, and overall guidance in SIFT system programming.

  17. Copilot: Monitoring Embedded Systems

    NASA Technical Reports Server (NTRS)

    Pike, Lee; Wegmann, Nis; Niller, Sebastian; Goodloe, Alwyn

    2012-01-01

    Runtime verification (RV) is a natural fit for ultra-critical systems, where correctness is imperative. In ultra-critical systems, even if the software is fault-free, because of the inherent unreliability of commodity hardware and the adversity of operational environments, processing units (and their hosted software) are replicated, and fault-tolerant algorithms are used to compare the outputs. We investigate both software monitoring in distributed fault-tolerant systems and the implementation of fault-tolerance mechanisms using RV techniques. We describe the Copilot language and compiler, specifically designed for generating monitors for distributed, hard real-time systems. We also describe two case studies in which we generated Copilot monitors for avionics systems.

  18. An experiment in software reliability

    NASA Technical Reports Server (NTRS)

    Dunham, J. R.; Pierce, J. L.

    1986-01-01

    The results of a software reliability experiment conducted in a controlled laboratory setting are reported. The experiment was undertaken to gather data on software failures and is one in a series of experiments being pursued by the Fault Tolerant Systems Branch of NASA Langley Research Center to find a means of credibly performing reliability evaluations of flight control software. The experiment tests a small sample of implementations of radar tracking software having ultra-reliability requirements and uses n-version programming for error detection, and repetitive run modeling for failure and fault rate estimation. The experiment results agree with those of Nagel and Skrivan in that the program error rates suggest an approximate log-linear pattern and the individual faults occurred with significantly different error rates. Additional analysis of the experimental data raises new questions concerning the phenomenon of interacting faults. This phenomenon may provide one explanation for software reliability decay.

  19. Distributed asynchronous microprocessor architectures in fault tolerant integrated flight systems

    NASA Technical Reports Server (NTRS)

    Dunn, W. R.

    1983-01-01

    The paper discusses the implementation of fault-tolerant digital flight control and navigation systems for rotorcraft applications. It is shown that in implementing fault tolerance at the systems level using advanced LSI/VLSI technology, aircraft physical layout and flight systems requirements tend to define a system architecture of distributed, asynchronous microprocessors in which fault tolerance can be achieved locally through hardware redundancy and/or globally through application of analytical redundancy. The effects of asynchronism on the execution of dynamic flight software are discussed. It is shown that if the asynchronous microprocessors have knowledge of time, these errors can be significantly reduced through appropriate modifications of the flight software. Finally, the paper extends previous work to show that through the combined use of time referencing and stable flight algorithms, individual microprocessors can be configured to autonomously tolerate intermittent faults.

  20. An empirical study of flight control software reliability

    NASA Technical Reports Server (NTRS)

    Dunham, J. R.; Pierce, J. L.

    1986-01-01

    The results of a laboratory experiment in flight control software reliability are reported. The experiment tests a small sample of implementations of a pitch axis control law for a PA28 aircraft with over 14 million pitch commands with varying levels of additive input and feedback noise. The testing, which uses the method of n-version programming for error detection, surfaced four software faults in one implementation of the control law. The small number of detected faults precluded the conduct of the error burst analyses. The pitch axis problem provides data for use in constructing a model for predicting the reliability of software in systems with feedback. The study was undertaken to find means to perform reliability evaluations of flight control software.

  1. Measurement and analysis of operating system fault tolerance

    NASA Technical Reports Server (NTRS)

    Lee, I.; Tang, D.; Iyer, R. K.

    1992-01-01

    This paper demonstrates a methodology to model and evaluate the fault tolerance characteristics of operational software. The methodology is illustrated through case studies on three different operating systems: the Tandem GUARDIAN fault-tolerant system, the VAX/VMS distributed system, and the IBM/MVS system. Measurements are made on these systems for substantial periods to collect software error and recovery data. In addition to investigating basic dependability characteristics such as major software problems and error distributions, we develop two levels of models to describe error and recovery processes inside an operating system and on multiple instances of an operating system running in a distributed environment. Based on the models, reward analysis is conducted to evaluate the loss of service due to software errors and the effect of the fault-tolerance techniques implemented in the systems. Software error correlation in multicomputer systems is also investigated.
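
    The reward analysis mentioned above can be illustrated with a two-state continuous-time Markov model (all rates and reward values here are invented): solve for the steady-state distribution and weight each state by its service reward:

      import numpy as np

      # generator matrix: state 0 = normal service, state 1 = error recovery
      Q = np.array([[-0.01,  0.01],    # errors occur at 0.01 per hour
                    [ 2.00, -2.00]])   # recovery completes at 2 per hour

      # steady state: pi @ Q = 0 with pi summing to 1
      A = np.vstack([Q.T, np.ones(2)])
      b = np.array([0.0, 0.0, 1.0])
      pi, *_ = np.linalg.lstsq(A, b, rcond=None)

      reward = pi @ np.array([1.0, 0.5])   # degraded service earns half reward
      print(pi, reward)                    # ~[0.995 0.005], expected reward ~0.9975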

  2. Study of a unified hardware and software fault-tolerant architecture

    NASA Technical Reports Server (NTRS)

    Lala, Jaynarayan; Alger, Linda; Friend, Steven; Greeley, Gregory; Sacco, Stephen; Adams, Stuart

    1989-01-01

    A unified architectural concept, called the Fault Tolerant Processor Attached Processor (FTP-AP), that can tolerate hardware as well as software faults is proposed for applications requiring ultrareliable computation capability. An emulation of the FTP-AP architecture, consisting of a breadboard Motorola 68010-based quadruply redundant Fault Tolerant Processor, four VAX 750s as attached processors, and four versions of a transport aircraft yaw damper control law, is used as a testbed in the AIRLAB to examine a number of critical issues. Solutions of several basic problems associated with N-Version software are proposed and implemented on the testbed. This includes a confidence voter to resolve coincident errors in N-Version software. A reliability model of N-Version software that is based upon the recent understanding of software failure mechanisms is also developed. The basic FTP-AP architectural concept appears suitable for hosting N-Version application software while at the same time tolerating hardware failures. Architectural enhancements for greater efficiency, software reliability modeling, and N-Version issues that merit further research are identified.

  3. Fault Management Architectures and the Challenges of Providing Software Assurance

    NASA Technical Reports Server (NTRS)

    Savarino, Shirley; Fitz, Rhonda; Fesq, Lorraine; Whitman, Gerek

    2015-01-01

    Fault Management (FM) is focused on safety, the preservation of assets, and maintaining the desired functionality of the system. How FM is implemented varies among missions. Common to most missions is system complexity due to a need to establish a multi-dimensional structure across hardware, software and spacecraft operations. FM is necessary to identify and respond to system faults, mitigate technical risks and ensure operational continuity. Generally, FM architecture, implementation, and software assurance efforts increase with mission complexity. Because FM is a systems engineering discipline with a distributed implementation, providing efficient and effective verification and validation (V&V) is challenging. A breakout session at the 2012 NASA Independent Verification & Validation (IV&V) Annual Workshop titled "V&V of Fault Management: Challenges and Successes" exposed this issue in terms of V&V for a representative set of architectures. NASA's Software Assurance Research Program (SARP) has provided funds to NASA IV&V to extend the work performed at the Workshop session in partnership with NASA's Jet Propulsion Laboratory (JPL). NASA IV&V will extract FM architectures across the IV&V portfolio and evaluate the data set, assess visibility for validation and test, and define software assurance methods that could be applied to the various architectures and designs. This SARP initiative focuses efforts on FM architectures from critical and complex projects within NASA. The identification of particular FM architectures and associated V&V/IV&V techniques provides a data set that can enable improved assurance that a system will adequately detect and respond to adverse conditions. Ultimately, results from this activity will be incorporated into the NASA Fault Management Handbook providing dissemination across NASA, other agencies and the space community. This paper discusses the approach taken to perform the evaluations and preliminary findings from the research.

  4. Fault Management Architectures and the Challenges of Providing Software Assurance

    NASA Technical Reports Server (NTRS)

    Savarino, Shirley; Fitz, Rhonda; Fesq, Lorraine; Whitman, Gerek

    2015-01-01

    Fault Management (FM) for satellite systems is focused on safety, the preservation of assets, and maintaining the desired functionality of the system. How FM is implemented varies among missions. Common to most is system complexity due to a need to establish a multi-dimensional structure across hardware, software and operations. This structure is necessary to identify and respond to system faults, mitigate technical risks and ensure operational continuity. These architecture, implementation and software assurance efforts increase with mission complexity. Because FM is a systems engineering discipline with a distributed implementation, providing efficient and effective verification and validation (V&V) is challenging. A breakout session at the 2012 NASA Independent Verification & Validation (IV&V) Annual Workshop titled "V&V of Fault Management: Challenges and Successes" exposed these issues in terms of V&V for a representative set of architectures. NASA's IV&V is funded by NASA's Software Assurance Research Program (SARP) in partnership with NASA's Jet Propulsion Laboratory (JPL) to extend the work performed at the Workshop session. NASA IV&V will extract FM architectures across the IV&V portfolio and evaluate the data set for robustness, assess visibility for validation and test, and define software assurance methods that could be applied to the various architectures and designs. This work focuses efforts on FM architectures from critical and complex projects within NASA. The identification of particular FM architectures, visibility, and associated V&V/IV&V techniques provides a data set that can enable higher assurance that a satellite system will adequately detect and respond to adverse conditions. Ultimately, results from this activity will be incorporated into the NASA Fault Management Handbook, providing dissemination across NASA, other agencies and the satellite community. This paper discusses the approach taken to perform the evaluations and preliminary findings from the research, including identification of FM architectures, visibility observations, and methods utilized for V&V/IV&V.

  5. A Voyager attitude control perspective on fault tolerant systems

    NASA Technical Reports Server (NTRS)

    Rasmussen, R. D.; Litty, E. C.

    1981-01-01

    In current spacecraft design, a trend can be observed toward achieving greater fault tolerance through the application of on-board software dedicated to detecting and isolating failures. Whether fault tolerance through software can meet the desired objectives depends on very careful consideration and control of the system in which the software is embedded. The investigation's objective is to provide some of the insight needed for the required analysis of the system. A description is given of the techniques developed in this connection during the development of the Voyager spacecraft. The Voyager Attitude and Articulation Control Subsystem (AACS) fault-tolerant design is discussed to emphasize basic lessons learned from this experience. The central driver of hardware redundancy implementation on Voyager was known as the 'single point failure criterion'.

  6. Implementation of an experimental fault-tolerant memory system

    NASA Technical Reports Server (NTRS)

    Carter, W. C.; Mccarthy, C. E.

    1976-01-01

    The experimental fault-tolerant memory system described in this paper has been designed to enable the modular addition of spares, to validate the theoretical fault-secure and self-testing properties of the translator/corrector, to provide a basis for experiments using the new testing and correction processes for recovery, and to determine the practicality of such systems. The hardware design and implementation are described, together with methods of fault insertion. The hardware/software interface, including a restricted single error correction/double error detection (SEC/DED) code, is specified. Procedures are carefully described which, (1) test for specified physical faults, (2) ensure that single error corrections are not miscorrections due to triple faults, and (3) enable recovery from double errors.

  7. An SSME High Pressure Oxidizer Turbopump diagnostic system using G2 real-time expert system

    NASA Technical Reports Server (NTRS)

    Guo, Ten-Huei

    1991-01-01

    An expert system which diagnoses various seal leakage faults in the High Pressure Oxidizer Turbopump of the SSME was developed using G2 real-time expert system. Three major functions of the software were implemented: model-based data generation, real-time expert system reasoning, and real-time input/output communication. This system is proposed as one module of a complete diagnostic system for the SSME. Diagnosis of a fault is defined as the determination of its type, severity, and likelihood. Since fault diagnosis is often accomplished through the use of heuristic human knowledge, an expert system based approach has been adopted as a paradigm to develop this diagnostic system. To implement this approach, a software shell which can be easily programmed to emulate the human decision process, the G2 Real-Time Expert System, was selected. Lessons learned from this implementation are discussed.

  8. An SSME high pressure oxidizer turbopump diagnostic system using G2(TM) real-time expert system

    NASA Technical Reports Server (NTRS)

    Guo, Ten-Huei

    1991-01-01

    An expert system which diagnoses various seal leakage faults in the High Pressure Oxidizer Turbopump of the SSME was developed using G2(TM) real-time expert system. Three major functions of the software were implemented: model-based data generation, real-time expert system reasoning, and real-time input/output communication. This system is proposed as one module of a complete diagnostic system for Space Shuttle Main Engine. Diagnosis of a fault is defined as the determination of its type, severity, and likelihood. Since fault diagnosis is often accomplished through the use of heuristic human knowledge, an expert system based approach was adopted as a paradigm to develop this diagnostic system. To implement this approach, a software shell which can be easily programmed to emulate the human decision process, the G2 Real-Time Expert System, was selected. Lessons learned from this implementation are discussed.

  9. Algorithm-Based Fault Tolerance for Numerical Subroutines

    NASA Technical Reports Server (NTRS)

    Turmon, Michael; Granat, Robert; Lou, John

    2007-01-01

    A software library implements a new methodology of detecting faults in numerical subroutines, thus enabling application programs that contain the subroutines to recover transparently from single-event upsets. The software library in question is fault-detecting middleware that is wrapped around the numerical subroutines. Conventional serial versions (based on LAPACK and FFTW) and a parallel version (based on ScaLAPACK) exist. The source code of the application program that contains the numerical subroutines is not modified, and the middleware is transparent to the user. The methodology used is a type of algorithm-based fault tolerance (ABFT). In ABFT, a checksum is computed before a computation and compared with the checksum of the computational result; an error is declared if the difference between the checksums exceeds some threshold. Novel normalization methods are used in the checksum comparison to ensure correct fault detections independent of algorithm inputs. In tests reported in the peer-reviewed literature, this library was shown to enable detection of 99.9 percent of significant faults while generating no false alarms.
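
    A generic ABFT sketch for a matrix-vector product (not the library's middleware; the normalization shown is one simple choice, standing in for the novel methods the abstract mentions):

      import numpy as np

      def abft_matvec(A, x, tol=1e-9):
          """Matrix-vector product guarded by a column-checksum row."""
          Ac = np.vstack([A, A.sum(axis=0)])        # append checksum row to A
          y = Ac @ x
          scale = max(1.0, np.abs(y[:-1]).sum())    # input-independent comparison
          if abs(y[-1] - y[:-1].sum()) / scale > tol:
              raise RuntimeError("fault detected in matrix-vector product")
          return y[:-1]

      A = np.arange(9.0).reshape(3, 3)
      x = np.ones(3)
      print(abft_matvec(A, x))                      # [ 3. 12. 21.]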

  10. Implementation of a research prototype onboard fault monitoring and diagnosis system

    NASA Technical Reports Server (NTRS)

    Palmer, Michael T.; Abbott, Kathy H.; Schutte, Paul C.; Ricks, Wendell R.

    1987-01-01

    Due to the dynamic and complex nature of in-flight fault monitoring and diagnosis, a research effort was undertaken at NASA Langley Research Center to investigate the application of artificial intelligence techniques for improved situational awareness. Under this research effort, concepts were developed and a software architecture was designed to address the complexities of onboard monitoring and diagnosis. This paper describes the implementation of these concepts in a computer program called FaultFinder. The implementation of the monitoring, diagnosis, and interface functions as separate modules is discussed, as well as the blackboard designed for the communication of these modules. Some related issues concerning the future installation of FaultFinder in an aircraft are also discussed.

  11. Galileo spacecraft power distribution and autonomous fault recovery

    NASA Technical Reports Server (NTRS)

    Detwiler, R. C.

    1982-01-01

    There is a trend in current spacecraft design to achieve greater fault tolerance through the implementation of on-board software dedicated to detecting and isolating failures. A combination of hardware and software is utilized in the Galileo power system for autonomous fault recovery. Galileo is a dual-spun spacecraft designed to carry a number of scientific instruments into a series of orbits around the planet Jupiter. In addition to its self-contained scientific payload, it will also carry a probe system which will be separated from the spacecraft some 150 days prior to Jupiter encounter. The Galileo spacecraft is scheduled to be launched in 1985. Attention is given to the power system, the fault protection requirements, and the power fault recovery implementation.

  12. Expert systems applied to fault isolation and energy storage management, phase 2

    NASA Technical Reports Server (NTRS)

    1987-01-01

    A user's guide for the Fault Isolation and Energy Storage (FIES) II system is provided. Included are a brief discussion of the background and scope of this project, a discussion of basic and advanced operating installation and problem determination procedures for the FIES II system and information on hardware and software design and implementation. A number of appendices are provided including a detailed specification for the microprocessor software, a detailed description of the expert system rule base and a description and listings of the LISP interface software.

  13. Software fault tolerance using data diversity

    NASA Technical Reports Server (NTRS)

    Knight, John C.

    1991-01-01

    Research on data diversity is discussed. Data diversity relies on a different form of redundancy from existing approaches to software fault tolerance and is substantially less expensive to implement. Data diversity can also be applied to software testing and greatly facilitates the automation of testing. Up to now it has been explored both theoretically and in a pilot study, and has been shown to be a promising technique. The effectiveness of data diversity as an error detection mechanism and the application of data diversity to differential equation solvers are discussed.
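
    A tiny sketch of the data-diversity idea (a "retry block" with an assumed re-expression; the routine and inputs are invented): when the routine fails on an input, the same routine is rerun on an equivalent, re-expressed input rather than on a diverse implementation:

      import math

      def triangle_area(a, b, c):
          s = (a + b + c) / 2.0                    # Heron's formula; the sqrt can
          return math.sqrt(s * (s - a) * (s - b) * (s - c))   # fail on degenerate rounding

      def area_with_data_diversity(a, b, c, eps=1e-12):
          try:
              return triangle_area(a, b, c)
          except ValueError:
              # re-expression: a logically equivalent input (reordered, nudged sides)
              return triangle_area(c, b * (1 + eps), a)

      print(area_with_data_diversity(3.0, 4.0, 5.0))   # 6.0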

  14. Advanced information processing system: Fault injection study and results

    NASA Technical Reports Server (NTRS)

    Burkhardt, Laura F.; Masotto, Thomas K.; Lala, Jaynarayan H.

    1992-01-01

    The objective of the AIPS program is to achieve a validated fault-tolerant distributed computer system. The goals of the AIPS fault injection study were: (1) to present the fault injection study components addressing the AIPS validation objective; (2) to obtain feedback for fault removal from the design implementation; (3) to obtain statistical data regarding fault detection, isolation, and reconfiguration responses; and (4) to obtain data regarding the effects of faults on system performance. The parameters that must be varied to create a comprehensive set of fault injection tests are described, along with the subset of test cases selected, the test case measurements, and the test case execution. Both pin-level hardware faults using a hardware fault injector and software-injected memory mutations were used to test the system. An overview is provided of the hardware fault injector and the associated software used to carry out the experiments. Detailed specifications of faults and test results are given for the I/O Network and the AIPS Fault Tolerant Processor. The results are summarized and conclusions are given.

  15. Automated Fault Interpretation and Extraction using Improved Supplementary Seismic Datasets

    NASA Astrophysics Data System (ADS)

    Bollmann, T. A.; Shank, R.

    2017-12-01

    During the interpretation of seismic volumes, it is necessary to interpret faults along with horizons of interest. With the improvement of technology, the interpretation of faults can be expedited with the aid of different algorithms that create supplementary seismic attributes, such as semblance and coherency. These products highlight discontinuities, but still need a large amount of human interaction to interpret faults and are plagued by noise and stratigraphic discontinuities. Hale (2013) presents a method to improve on these datasets by creating what is referred to as a Fault Likelihood volume. In general, these volumes contain less noise and do not emphasize stratigraphic features. Instead, planar features within a specified strike and dip range are highlighted. Once a satisfactory Fault Likelihood Volume is created, extraction of fault surfaces is much easier. The extracted fault surfaces are then exported to interpretation software for QC. Numerous software packages have implemented this methodology with varying results. After investigating these platforms, we developed a preferred Automated Fault Interpretation workflow.

  16. Predeployment validation of fault-tolerant systems through software-implemented fault insertion

    NASA Technical Reports Server (NTRS)

    Czeck, Edward W.; Siewiorek, Daniel P.; Segall, Zary Z.

    1989-01-01

    The fault-injection-based automated testing (FIAT) environment, which can be used to experimentally characterize and evaluate distributed real-time systems under fault-free and faulted conditions, is described. A survey is presented of validation methodologies. The need for fault insertion based on validation methodologies is demonstrated. The origins and models of faults, and the motivation for the FIAT concept, are reviewed. FIAT employs a validation methodology which builds confidence in the system by first providing a baseline of fault-free performance data and then characterizing the behavior of the system with faults present. Fault insertion is accomplished through software and allows faults or the manifestation of faults to be inserted by either seeding faults into memory or triggering error detection mechanisms. FIAT is capable of emulating a variety of fault-tolerant strategies and architectures, can monitor system activity, and can automatically orchestrate experiments involving insertion of faults. A common system interface eases use and decreases experiment development and run time. Fault models chosen for experiments on FIAT have generated system responses which parallel those observed in real systems under faulty conditions. These capabilities are shown by two example experiments, each using a different fault-tolerance strategy.

  17. The implementation and use of Ada on distributed systems with high reliability requirements

    NASA Technical Reports Server (NTRS)

    Knight, J. C.

    1984-01-01

    The use and implementation of Ada in distributed environments in which reliability is the primary concern is investigated. Emphasis is placed on the possibility that a distributed system may be programmed entirely in Ada so that the individual tasks of the system are unconcerned with which processors they are executing on, and that failures may occur in the software or underlying hardware. The primary activities are: (1) continued development and testing of our fault-tolerant Ada testbed; (2) consideration of desirable language changes to allow Ada to provide useful semantics for failure; and (3) analysis of the inadequacies of existing software fault tolerance strategies.

  18. Lessons Learned on Implementing Fault Detection, Isolation, and Recovery (FDIR) in a Ground Launch Environment

    NASA Technical Reports Server (NTRS)

    Ferrell, Bob A.; Lewis, Mark E.; Perotti, Jose M.; Brown, Barbara L.; Oostdyk, Rebecca L.; Goetz, Jesse W.

    2010-01-01

    This paper's main purpose is to detail issues and lessons learned regarding designing, integrating, and implementing Fault Detection, Isolation, and Recovery (FDIR) for Constellation Exploration Program (CxP) Ground Operations at Kennedy Space Center (KSC). As part of the overall implementation of the National Aeronautics and Space Administration's (NASA's) CxP, FDIR is being implemented in three main components of the program (Ares, Orion, and Ground Operations/Processing). While FDIR was not initially part of the design baseline for CxP Ground Operations, NASA felt it was important enough to develop that NASA's Exploration Systems Mission Directorate's (ESMD's) Exploration Technology Development Program (ETDP) initiated a task for it under their Integrated System Health Management (ISHM) research area. This task, referred to as the FDIR project, is a multi-year, multi-center effort. The primary purpose of the FDIR project is to develop a prototype and pathway upon which Fault Detection and Isolation (FDI) may be transitioned into the Ground Operations baseline. Currently, Qualtech Systems Inc. (QSI) Commercial Off The Shelf (COTS) software products, Testability Engineering and Maintenance System (TEAMS) Designer and TEAMS RDS/RT, are being utilized in the implementation of FDI within the FDIR project. The TEAMS Designer COTS software product is being utilized to model the system with Functional Fault Models (FFMs). A limited set of systems in Ground Operations is being modeled by the FDIR project, and the entire Ares launch vehicle is being modeled under the Functional Fault Analysis (FFA) project at Marshall Space Flight Center (MSFC). Integration of the Ares FFMs and the Ground Processing FFMs is also being done under the FDIR project utilizing the TEAMS Designer COTS software product. One of the most significant challenges related to integration is to ensure that FFMs developed by different organizations can be integrated easily and without errors. Software Interface Control Documents (ICDs) for the FFMs and their usage will be addressed as the solution to this issue. In particular, the advantages and disadvantages of these ICDs across physically separate development groups will be delineated.

  19. Fault-Tree Compiler Program

    NASA Technical Reports Server (NTRS)

    Butler, Ricky W.; Martensen, Anna L.

    1992-01-01

    FTC, Fault-Tree Compiler program, is reliability-analysis software tool used to calculate probability of top event of fault tree. Five different types of gates allowed in fault tree: AND, OR, EXCLUSIVE OR, INVERT, and M OF N. High-level input language of FTC easy to understand and use. Program supports hierarchical fault-tree-definition feature simplifying process of description of tree and reduces execution time. Solution technique implemented in FORTRAN, and user interface in Pascal. Written to run on DEC VAX computer operating under VMS operating system.
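
    What a fault-tree solver of this kind computes can be shown in miniature (assuming independent basic events and invented probabilities; FTC itself is implemented in FORTRAN with a Pascal user interface):

      from itertools import combinations
      from math import prod

      def AND(*ps):  return prod(ps)
      def OR(*ps):   return 1.0 - prod(1.0 - p for p in ps)
      def INVERT(p): return 1.0 - p
      def XOR(p, q): return p * (1.0 - q) + q * (1.0 - p)

      def M_OF_N(m, ps):
          """Probability that at least m of the n independent events occur."""
          n = len(ps)
          return sum(prod(ps[i] if i in chosen else 1.0 - ps[i] for i in range(n))
                     for k in range(m, n + 1)
                     for chosen in combinations(range(n), k))

      # top event = (pump fails AND valve fails) OR (2-of-3 sensors fail)
      print(OR(AND(1e-3, 2e-3), M_OF_N(2, [1e-2, 1e-2, 1e-2])))   # ~3.0e-4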

  20. A survey of fault diagnosis technology

    NASA Technical Reports Server (NTRS)

    Riedesel, Joel

    1989-01-01

    Existing techniques and methodologies for fault diagnosis are surveyed. The techniques run the gamut from theoretical artificial intelligence work to conventional software engineering applications. They are shown to define a spectrum of implementation alternatives where tradeoffs determine their position on the spectrum. Various tradeoffs include execution time limitations and memory requirements of the algorithms as well as their effectiveness in addressing the fault diagnosis problem.

  1. A fault-tolerant intelligent robotic control system

    NASA Technical Reports Server (NTRS)

    Marzwell, Neville I.; Tso, Kam Sing

    1993-01-01

    This paper describes the concept, design, and features of a fault-tolerant intelligent robotic control system being developed for space and commercial applications that require high dependability. The comprehensive strategy integrates system level hardware/software fault tolerance with task level handling of uncertainties and unexpected events for robotic control. The underlying architecture for system level fault tolerance is the distributed recovery block which protects against application software, system software, hardware, and network failures. Task level fault tolerance provisions are implemented in a knowledge-based system which utilizes advanced automation techniques such as rule-based and model-based reasoning to monitor, diagnose, and recover from unexpected events. The two level design provides tolerance of two or more faults occurring serially at any level of command, control, sensing, or actuation. The potential benefits of such a fault tolerant robotic control system include: (1) a minimized potential for damage to humans, the work site, and the robot itself; (2) continuous operation with a minimum of uncommanded motion in the presence of failures; and (3) more reliable autonomous operation providing increased efficiency in the execution of robotic tasks and decreased demand on human operators for controlling and monitoring the robotic servicing routines.

  2. Virtual Platform for SEE Robustness Verification of Bootloader Embedded Software on Board Solar Orbiter's Energetic Particle Detector

    NASA Astrophysics Data System (ADS)

    Da Silva, A.; Sánchez Prieto, S.; Polo, O.; Parra Espada, P.

    2013-05-01

    Because of the tough robustness requirements in space software development, it is imperative to carry out verification tasks at a very early development stage to ensure that the implemented exception mechanisms work properly. All this should be done long before the real hardware is available. But even if real hardware is available, the verification of software fault tolerance mechanisms can be difficult, since real faulty situations must be systematically and artificially brought about, which can be impossible on real hardware. To solve this problem, the Alcala Space Research Group (SRG) has developed a LEON2 virtual platform (Leon2ViP) with fault injection capabilities. This way it is possible to run the exact same target binary software as runs on the physical system in a more controlled and deterministic environment, allowing stricter requirements verification. Leon2ViP enables unmanned and tightly focused fault injection campaigns, not possible otherwise, in order to expose and diagnose flaws in the software implementation early. Furthermore, the use of a virtual hardware-in-the-loop approach makes it possible to carry out preliminary integration tests with the spacecraft emulator or the sensors. The use of Leon2ViP has meant a significant improvement, in both time and cost, in the development and verification processes of the Instrument Control Unit boot software on board Solar Orbiter's Energetic Particle Detector.

  3. Autonomous Cryogenics Loading Operations Simulation Software: Knowledgebase Autonomous Test Engineer

    NASA Technical Reports Server (NTRS)

    Wehner, Walter S.

    2012-01-01

    The simulation software KATE (Knowledgebase Autonomous Test Engineer) is used to demonstrate the automatic identification of faults in a system. The ACLO (Autonomous Cryogenics Loading Operation) project uses KATE to monitor and find faults in the loading of cryogenics into a vehicle fuel tank. The KATE software interfaces with the IHM (Integrated Health Management) systems bus to communicate with other systems that are part of ACLO. One system that KATE uses the IHM bus to communicate with is AIS (Advanced Inspection System). KATE will send messages to AIS when there is a detected anomaly. These messages include visual inspection of specific valves and pressure gauges, and control messages to have AIS open or close manual valves. My goals include implementing the connection to the IHM bus within KATE and for the AIS project. I will also be working on implementing changes to KATE's UI and implementing the physics objects in KATE that will model portions of the cryogenics loading operation.

  4. Concurrent development of fault management hardware and software in the SSM/PMAD [Space Station Module/Power Management And Distribution]

    NASA Technical Reports Server (NTRS)

    Freeman, Kenneth A.; Walsh, Rick; Weeks, David J.

    1988-01-01

    Space Station issues in fault management are discussed. The system background is described with attention given to design guidelines and power hardware. A contractually developed fault management system, FRAMES, is integrated with the energy management functions, the control switchgear, and the scheduling and operations management functions. The constraints that shaped the FRAMES system and its implementation are considered.

  5. Graphical workstation capability for reliability modeling

    NASA Technical Reports Server (NTRS)

    Bavuso, Salvatore J.; Koppen, Sandra V.; Haley, Pamela J.

    1992-01-01

    In addition to computational capabilities, software tools for estimating the reliability of fault-tolerant digital computer systems must also provide a means of interfacing with the user. Described here is the new graphical interface capability of the hybrid automated reliability predictor (HARP), a software package that implements advanced reliability modeling techniques. The graphics oriented (GO) module provides the user with a graphical language for modeling system failure modes through the selection of various fault-tree gates, including sequence-dependency gates, or by a Markov chain. By using this graphical input language, a fault tree becomes a convenient notation for describing a system. In accounting for any sequence dependencies, HARP converts the fault-tree notation to a complex stochastic process that is reduced to a Markov chain, which it can then solve for system reliability. The graphics capability is available for use on an IBM-compatible PC, a Sun, and a VAX workstation. The GO module is written in the C programming language and uses the Graphical Kernel System (GKS) standard for graphics implementation. The PC, VAX, and Sun versions of the HARP GO module are currently in beta-testing stages.

  6. Formal specification and mechanical verification of SIFT - A fault-tolerant flight control system

    NASA Technical Reports Server (NTRS)

    Melliar-Smith, P. M.; Schwartz, R. L.

    1982-01-01

    The paper describes the methodology being employed to demonstrate rigorously that the SIFT (software-implemented fault-tolerant) computer meets its requirements. The methodology uses a hierarchy of design specifications, expressed in the mathematical domain of multisorted first-order predicate calculus. The most abstract of these, from which almost all details of mechanization have been removed, represents the requirements on the system for reliability and intended functionality. Successive specifications in the hierarchy add design and implementation detail until the PASCAL programs implementing the SIFT executive are reached. A formal proof that a SIFT system in a 'safe' state operates correctly despite the presence of arbitrary faults has been completed all the way from the most abstract specifications to the PASCAL program.

  7. The implementation and use of Ada on distributed systems with high reliability requirements

    NASA Technical Reports Server (NTRS)

    Knight, J. C.; Gregory, S. T.; Urquhart, J. I. A.

    1985-01-01

    The use and implementation of Ada in distributed environments in which reliability is the primary concern were investigated. In particular, the concept that a distributed system may be programmed entirely in Ada so that the individual tasks of the system are unconcerned with which processors they are executing on, and that failures may occur in the software or underlying hardware was examined. Progress is discussed for the following areas: continued development and testing of the fault-tolerant Ada testbed; development of suggested changes to Ada so that it might more easily cope with the failure of interest; and design of new approaches to fault-tolerant software in real-time systems, and integration of these ideas into Ada.

  8. Method and system for diagnostics of apparatus

    NASA Technical Reports Server (NTRS)

    Gorinevsky, Dimitry (Inventor)

    2012-01-01

    Proposed is a method, implemented in software, for estimating fault state of an apparatus outfitted with sensors. At each execution period the method processes sensor data from the apparatus to obtain a set of parity parameters, which are further used for estimating fault state. The estimation method formulates a convex optimization problem for each fault hypothesis and employs a convex solver to compute fault parameter estimates and fault likelihoods for each fault hypothesis. The highest likelihoods and corresponding parameter estimates are transmitted to a display device or an automated decision and control system. The obtained accurate estimate of fault state can be used to improve safety, performance, or maintenance processes for the apparatus.
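
    The following sketch illustrates the per-hypothesis convex estimation idea in Python with the cvxpy solver; the parity model, the signature vectors, and the Gaussian likelihood scoring are our assumptions for the example, not the patented method's actual formulation.

    ```python
    # Hedged sketch: fit a fault magnitude per hypothesis by convex
    # optimization, then rank hypotheses by likelihood of the residual.
    import numpy as np
    import cvxpy as cp

    def rank_hypotheses(parity, signatures, sigma=0.1):
        """Fit a nonnegative fault magnitude for each hypothesis and score
        each fit with a Gaussian likelihood of the remaining residual."""
        scored = []
        for name, sig in signatures.items():
            m = cp.Variable(nonneg=True)                 # fault magnitude
            residual = cp.sum_squares(parity - m * sig)  # convex in m
            cp.Problem(cp.Minimize(residual)).solve()
            scored.append((name, float(m.value),
                           float(np.exp(-residual.value / (2 * sigma ** 2)))))
        return sorted(scored, key=lambda s: -s[2])       # most likely first

    parity = np.array([0.9, 0.1, 1.1])                   # parity parameters
    signatures = {"sensor_1_bias": np.array([1.0, 0.0, 1.0]),
                  "sensor_2_bias": np.array([0.0, 1.0, 0.0])}
    for name, magnitude, likelihood in rank_hypotheses(parity, signatures):
        print(f"{name}: magnitude={magnitude:.2f}, likelihood={likelihood:.3g}")
    ```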

  9. Methodology for Designing Fault-Protection Software

    NASA Technical Reports Server (NTRS)

    Barltrop, Kevin; Levison, Jeffrey; Kan, Edwin

    2006-01-01

    A document describes a methodology for designing fault-protection (FP) software for autonomous spacecraft. The methodology embodies and extends established engineering practices in the technical discipline of Fault Detection, Diagnosis, Mitigation, and Recovery, and has been successfully implemented in the Deep Impact spacecraft, a NASA Discovery mission. Based on established concepts of Fault Monitors and Responses, this FP methodology extends the notions of Opinion, Symptom, Alarm (aka Fault), and Response with numerous new notions, sub-notions, software constructs, and logic and timing gates. For example, a Monitor generates a RawOpinion, which graduates into an Opinion, categorized as no-opinion, acceptable, or unacceptable. RaiseSymptom, ForceSymptom, and ClearSymptom govern the establishment of a Symptom and its mapping to an Alarm (aka Fault). Local Response is distinguished from FP System Response. A 1-to-n and n-to-1 mapping is established among Monitors, Symptoms, and Responses. Responses are categorized by device versus by function. Responses operate in tiers, where the early tiers attempt to resolve the Fault in a localized, step-by-step fashion, relegating more system-level responses to later tiers. Recovery actions are gated by epoch recovery timing, enabling strategy, urgency, a MaxRetry gate, hardware availability, hazardous versus ordinary fault, and many other priority gates. This methodology is systematic and logical, and uses multiple linked tables, parameter files, and recovery command sequences. The credibility of the FP design is proven via a "top-down" fault-tree analysis and a "bottom-up" functional failure-modes-and-effects analysis. Via this process, the mitigation and recovery strategies per Fault Containment Region scope (width versus depth) the FP architecture.
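
    The toy sketch below mirrors the Monitor -> Opinion -> Alarm -> tiered-Response chain named above; all class and field names are hypothetical illustrations of those notions, not flight code.

    ```python
    # Toy sketch of the Monitor -> Symptom/Alarm -> tiered-Response chain;
    # names, thresholds, and actions are invented for illustration.
    from dataclasses import dataclass, field

    @dataclass
    class Monitor:
        name: str
        threshold: float
        def opinion(self, raw: float) -> str:
            # A RawOpinion graduates into an Opinion category.
            return "unacceptable" if raw > self.threshold else "acceptable"

    @dataclass
    class Alarm:
        name: str
        # Tiered responses: early tiers are local, later tiers system-level.
        response_tiers: list = field(default_factory=list)
        def respond(self, max_retry: int = 2):
            for tier, action in enumerate(self.response_tiers):
                for _attempt in range(max_retry):      # MaxRetry-style gate
                    if action():                       # True: fault cleared
                        return tier
            return None                                # all tiers exhausted

    mon = Monitor("gyro_rate", threshold=5.0)
    if mon.opinion(raw=7.3) == "unacceptable":
        alarm = Alarm("gyro_fault", [lambda: False,    # tier 0: reset device
                                     lambda: True])    # tier 1: swap to backup
        print("resolved at tier", alarm.respond())
    ```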

  10. Software For Fault-Tree Diagnosis Of A System

    NASA Technical Reports Server (NTRS)

    Iverson, Dave; Patterson-Hine, Ann; Liao, Jack

    1993-01-01

    Fault Tree Diagnosis System (FTDS) computer program is automated-diagnostic-system program identifying likely causes of specified failure on basis of information represented in system-reliability mathematical models known as fault trees. Is modified implementation of failure-cause-identification phase of Narayanan's and Viswanadham's methodology for acquisition of knowledge and reasoning in analyzing failures of systems. Knowledge base of if/then rules replaced with object-oriented fault-tree representation. Enhancement yields more-efficient identification of causes of failures and enables dynamic updating of knowledge base. Written in C language, C++, and Common LISP.

  11. Analysis of a hardware and software fault tolerant processor for critical applications

    NASA Technical Reports Server (NTRS)

    Dugan, Joanne B.

    1993-01-01

    Computer systems for critical applications must be designed to tolerate software faults as well as hardware faults. A unified approach to tolerating hardware and software faults is characterized by classifying faults in terms of duration (transient or permanent) rather than source (hardware or software). Errors arising from transient faults can be handled through masking or voting, but errors arising from permanent faults require system reconfiguration to bypass the failed component. Most errors which are caused by software faults can be considered transient, in that they are input-dependent. Software faults are triggered by a particular set of inputs. Quantitative dependability analysis of systems which exhibit a unified approach to fault tolerance can be performed by a hierarchical combination of fault tree and Markov models. A methodology for analyzing hardware and software fault tolerant systems is applied to the analysis of a hypothetical system, loosely based on the Fault Tolerant Parallel Processor. The models consider both transient and permanent faults, hardware and software faults, independent and related software faults, automatic recovery, and reconfiguration.
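
    A hedged sketch of the Markov half of such a hierarchical model: a three-state chain for a triplex system that degrades to duplex and then fails under permanent faults, solved by matrix exponential. The states and rates are invented for illustration and are not the paper's FTPP model.

    ```python
    # Minimal Markov reliability sketch; rates are assumed values.
    import numpy as np
    from scipy.linalg import expm

    # States: 0 = triplex OK, 1 = duplex (one permanent fault), 2 = failed.
    lam = 1e-4                           # permanent-fault rate per unit (assumed)
    Q = np.array([[-3 * lam,  3 * lam,     0.0],
                  [     0.0, -2 * lam, 2 * lam],
                  [     0.0,      0.0,     0.0]])   # generator: rows sum to 0

    p0 = np.array([1.0, 0.0, 0.0])       # start fully operational
    t = 10.0                             # mission time (hours)
    p_t = p0 @ expm(Q * t)               # transient state probabilities
    print(f"unreliability at t={t}: {p_t[2]:.3e}")
    ```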

  12. Protecting Against Faults in JPL Spacecraft

    NASA Technical Reports Server (NTRS)

    Morgan, Paula

    2007-01-01

    A paper discusses techniques for protecting against faults in spacecraft designed and operated by NASA's Jet Propulsion Laboratory (JPL). The paper addresses, more specifically, fault-protection requirements and techniques common to most JPL spacecraft (in contradistinction to unique, mission-specific techniques), standard practices in the implementation of these techniques, and fault-protection software architectures. Common requirements include those to protect onboard command, data-processing, and control computers; protect against loss of Earth/spacecraft radio communication; maintain safe temperatures; and recover from power overloads. The paper describes fault-protection techniques as part of a fault-management strategy that also includes functional redundancy, redundant hardware, and autonomous monitoring of (1) the operational and health statuses of spacecraft components, (2) temperatures inside and outside the spacecraft, and (3) allocation of power. The strategy also provides for preprogrammed automated responses to anomalous conditions. In addition, the software running in almost every JPL spacecraft incorporates a general-purpose "Safe Mode" response algorithm that configures the spacecraft in a lower-power state that is safe and predictable, thereby facilitating diagnosis of more complex faults by a team of human experts on Earth.

  13. The embedded software life cycle - An expanded view

    NASA Technical Reports Server (NTRS)

    Larman, Brian T.; Loesh, Robert E.

    1989-01-01

    Six common issues that are encountered in the development of software for embedded computer systems are discussed from the perspective of their interrelationships with the development process and/or the system itself. Particular attention is given to concurrent hardware/software development, prototyping, the inaccessibility of the operational system, fault tolerance, the long life cycle, and inheritance. It is noted that the life cycle for embedded software must include elements beyond simply the specification and implementation of the target software.

  14. An autonomous fault detection, isolation, and recovery system for a 20-kHz electric power distribution test bed

    NASA Technical Reports Server (NTRS)

    Quinn, Todd M.; Walters, Jerry L.

    1991-01-01

    Future space explorations will require long-term human presence in space. Space environments that provide working and living quarters for manned missions are becoming larger and more sophisticated. Monitoring and control of the space environment subsystems by expert system software, which emulates human reasoning processes, could maintain the health of the subsystems and help reduce the human workload. The autonomous power expert (APEX) system was developed to emulate a human expert's reasoning processes for diagnosing fault conditions in the domain of space power distribution. APEX is a fault detection, isolation, and recovery (FDIR) system capable of autonomous monitoring and control of the power distribution system. APEX consists of a knowledge base, a data base, an inference engine, and various support and interface software. APEX provides the user with an easy-to-use interactive interface. When a fault is detected, APEX informs the user of the detection. The user can direct APEX to isolate the probable cause of the fault. Once a fault has been isolated, the user can ask APEX to justify its fault isolation and to recommend actions to correct the fault. APEX implementation and capabilities are discussed.
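
    The sketch below mimics, in a few lines of Python, the detect/isolate/justify/recommend loop the abstract describes; the rules, telemetry names, and recommendations are fabricated and far simpler than APEX's knowledge base.

    ```python
    # Tiny forward-chaining sketch of fault isolation with justification.
    rules = [
        # (fault hypothesis, symptoms that implicate it, recommended action)
        ("breaker_3_stuck_open", {"bus_b_undervolt", "load_3_no_current"},
         "command breaker 3 closed; if no change, switch load 3 to bus A"),
        ("bus_b_short",          {"bus_b_undervolt", "bus_b_overcurrent"},
         "open bus B feed breaker and shed bus B loads"),
    ]

    def isolate(symptoms):
        """Yield hypotheses whose symptom sets are fully observed,
        together with the justification and recommended action."""
        for fault, needed, action in rules:
            if needed <= symptoms:
                yield fault, sorted(needed), action

    observed = {"bus_b_undervolt", "load_3_no_current"}
    for fault, why, action in isolate(observed):
        print(f"fault: {fault}\n  justified by: {why}\n  recommend: {action}")
    ```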

  15. Fault Injection Validation of a Safety-Critical TMR System

    NASA Astrophysics Data System (ADS)

    Irrera, Ivano; Madeira, Henrique; Zentai, Andras; Hergovics, Beata

    2016-08-01

    Digital systems and their software are the core technology for controlling and monitoring industrial systems in practically all activity domains. Functional safety standards such as the European standard EN 50128 for railway applications define the procedures and technical requirements for the development of software for railway control and protection systems. The validation of such systems is a highly demanding task. In this paper we discuss the use of fault injection techniques, which have been used extensively in several domains, particularly the space domain, to complement the traditional procedures for validating a SIL (Safety Integrity Level) 4 system for railway signalling that implements a TMR (Triple Modular Redundancy) architecture. The fault injection tool is based on JTAG technology. The results of our injection campaign showed a high degree of tolerance to most of the injected faults, but several cases of unexpected behaviour were also observed, helping to understand worst-case scenarios.
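
    For concreteness, this is the 2-of-3 bitwise majority vote at the heart of any TMR channel, with one assert showing a masked single fault and one showing the double-fault worst case such injection campaigns look for. It is a textbook sketch, not code from the validated system.

    ```python
    # Bitwise 2-of-3 majority voter: the core masking mechanism of TMR.
    def tmr_vote(a: int, b: int, c: int) -> int:
        """Each output bit takes the majority value of the three inputs."""
        return (a & b) | (a & c) | (b & c)

    # A single corrupted replica is masked:
    assert tmr_vote(0b1010, 0b1010, 0b0011) == 0b1010
    # Faults injected into two replicas defeat the voter (worst case):
    assert tmr_vote(0b1010, 0b0011, 0b0011) == 0b0011
    print("voter behaves as expected")
    ```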

  16. Mission Services Evolution Center Message Bus

    NASA Technical Reports Server (NTRS)

    Mayorga, Arturo; Bristow, John O.; Butschky, Mike

    2011-01-01

    The Goddard Mission Services Evolution Center (GMSEC) Message Bus is a robust, lightweight, fault-tolerant middleware implementation that supports all messaging capabilities of the GMSEC API. This architecture is a distributed software system that routes messages based on message subject names and knowledge of the locations in the network of the interested software components.
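
    A minimal sketch of subject-based routing, the core idea described above; TinyBus, the wildcard matching via fnmatch, and the subject names are illustrative assumptions, not the GMSEC API.

    ```python
    # Subscribers register subject patterns; the bus fans messages out
    # to every callback whose pattern matches the published subject.
    from collections import defaultdict
    from fnmatch import fnmatch

    class TinyBus:
        def __init__(self):
            self.subs = defaultdict(list)            # pattern -> callbacks

        def subscribe(self, pattern, callback):
            self.subs[pattern].append(callback)

        def publish(self, subject, payload):
            for pattern, callbacks in self.subs.items():
                if fnmatch(subject, pattern):        # e.g. "MISSION.SAT1.*"
                    for cb in callbacks:
                        cb(subject, payload)

    bus = TinyBus()
    bus.subscribe("MISSION.SAT1.*", lambda s, p: print("SAT1 handler:", s, p))
    bus.publish("MISSION.SAT1.TLM", {"volts": 28.1})
    ```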

  17. Fault Management Techniques in Human Spaceflight Operations

    NASA Technical Reports Server (NTRS)

    O'Hagan, Brian; Crocker, Alan

    2006-01-01

    This paper discusses human spaceflight fault management operations. Fault detection and response capabilities available in current US human spaceflight programs Space Shuttle and International Space Station are described while emphasizing system design impacts on operational techniques and constraints. Preflight and inflight processes along with products used to anticipate, mitigate and respond to failures are introduced. Examples of operational products used to support failure responses are presented. Possible improvements in the state of the art, as well as prioritization and success criteria for their implementation are proposed. This paper describes how the architecture of a command and control system impacts operations in areas such as the required fault response times, automated vs. manual fault responses, use of workarounds, etc. The architecture includes the use of redundancy at the system and software function level, software capabilities, use of intelligent or autonomous systems, number and severity of software defects, etc. This in turn drives which Caution and Warning (C&W) events should be annunciated, C&W event classification, operator display designs, crew training, flight control team training, and procedure development. Other factors impacting operations are the complexity of a system, skills needed to understand and operate a system, and the use of commonality vs. optimized solutions for software and responses. Fault detection, annunciation, safing responses, and recovery capabilities are explored using real examples to uncover underlying philosophies and constraints. These factors directly impact operations in that the crew and flight control team need to understand what happened, why it happened, what the system is doing, and what, if any, corrective actions they need to perform. If a fault results in multiple C&W events, or if several faults occur simultaneously, the root cause(s) of the fault(s), as well as their vehicle-wide impacts, must be determined in order to maintain situational awareness. This allows both automated and manual recovery operations to focus on the real cause of the fault(s). An appropriate balance must be struck between correcting the root cause failure and addressing the impacts of that fault on other vehicle components. Lastly, this paper presents a strategy for using lessons learned to improve the software, displays, and procedures in addition to determining what is a candidate for automation. Enabling technologies and techniques are identified to promote system evolution from one that requires manual fault responses to one that uses automation and autonomy where they are most effective. These considerations include the value in correcting software defects in a timely manner, automation of repetitive tasks, making time critical responses autonomous, etc. The paper recommends the appropriate use of intelligent systems to determine the root causes of faults and correctly identify separate unrelated faults.

  18. Online Monitoring Technical Basis and Analysis Framework for Large Power Transformers; Interim Report for FY 2012

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Nancy J. Lybeck; Vivek Agarwal; Binh T. Pham

    The Light Water Reactor Sustainability program at Idaho National Laboratory (INL) is actively conducting research to develop and demonstrate online monitoring (OLM) capabilities for active components in existing Nuclear Power Plants. A pilot project is currently underway to apply OLM to Generator Step-Up Transformers (GSUs) and Emergency Diesel Generators (EDGs). INL and the Electric Power Research Institute (EPRI) are working jointly to implement the pilot project. The EPRI Fleet-Wide Prognostic and Health Management (FW-PHM) Software Suite will be used to implement monitoring in conjunction with utility partners: the Shearon Harris Nuclear Generating Station (owned by Duke Energy) for GSUs, and Braidwood Generating Station (owned by Exelon Corporation) for EDGs. This report presents monitoring techniques, fault signatures, and diagnostic and prognostic models for GSUs. GSUs are main transformers that are directly connected to generators, stepping up the voltage from the generator output voltage to the highest transmission voltages for supplying electricity to the transmission grid. Technical experts from Shearon Harris are assisting INL and EPRI in identifying critical faults and defining fault signatures associated with each fault. The resulting diagnostic models will be implemented in the FW-PHM Software Suite and tested using data from Shearon Harris. Parallel research on EDGs is being conducted, and will be reported in an interim report during the first quarter of fiscal year 2013.

  19. SIRU utilization. Volume 2: Software description and program documentation

    NASA Technical Reports Server (NTRS)

    Oehrle, J.; Whittredge, R.

    1973-01-01

    A complete description of the additional analysis, development, and evaluation provided for the SIRU system, as identified in the requirements for the SIRU utilization program, is presented. The SIRU configuration is a modular inertial subsystem with hardware and software features that achieve fault-tolerant operational capabilities. The SIRU redundant hardware design is formulated about a six-gyro and six-accelerometer instrument module package. The modules are mounted in this package so that their measurement input axes form a unique symmetrical pattern that corresponds to the array of perpendiculars to the faces of a regular dodecahedron. This six-axis array provides redundant independent sensing, and the symmetry enables the formulation of an optimal software redundant data processing structure with self-contained fault detection and isolation (FDI) capabilities. Documentation of the additional software and software modifications required to implement the utilization capabilities includes assembly listings and flow charts.
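
    The sketch below illustrates the underlying redundancy idea: with more measurement axes than unknowns, a least-squares fit recovers the state and the residual (parity) vector typically exposes a failed instrument. The geometry and numbers are invented generic axes, not the SIRU dodecahedron directions or its FDI algorithm.

    ```python
    # Redundant-axis sensing sketch: 6 measurement axes, 3 unknown rates.
    import numpy as np

    rng = np.random.default_rng(0)
    H = rng.normal(size=(6, 3))
    H /= np.linalg.norm(H, axis=1, keepdims=True)   # unit measurement axes

    omega_true = np.array([0.1, -0.2, 0.05])        # true angular rate
    z = H @ omega_true                              # ideal measurements
    z[3] += 0.5                                     # bias fault on instrument 3

    omega_hat, *_ = np.linalg.lstsq(H, z, rcond=None)
    parity = z - H @ omega_hat                      # residuals expose the fault
    # The largest residual usually points at the failed channel.
    print("suspect instrument:", int(np.argmax(np.abs(parity))))
    ```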

  20. Learning from examples - Generation and evaluation of decision trees for software resource analysis

    NASA Technical Reports Server (NTRS)

    Selby, Richard W.; Porter, Adam A.

    1988-01-01

    A general solution method for the automatic generation of decision (or classification) trees is investigated. The approach is to provide insights through in-depth empirical characterization and evaluation of decision trees for software resource data analysis. The trees identify classes of objects (software modules) that had high development effort. Sixteen software systems ranging from 3,000 to 112,000 source lines were selected for analysis from a NASA production environment. The collection and analysis of 74 attributes (or metrics), for over 4,700 objects, captured information about the development effort, faults, changes, design style, and implementation style. A total of 9,600 decision trees were automatically generated and evaluated. The trees correctly identified 79.3 percent of the software modules that had high development effort or faults, and the trees generated from the best parameter combinations correctly identified 88.4 percent of the modules on the average.
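
    As a modern stand-in for the automatic tree generation the paper evaluates, the sketch below fits and prints one tiny tree with scikit-learn; the two metrics and all data are fabricated for the example.

    ```python
    # Classify "high development effort" modules with a small decision tree.
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Each row: [source lines, number of changes]; label 1 = high effort.
    X = [[200, 1], [5000, 40], [350, 3], [8000, 55], [120, 0], [6000, 40]]
    y = [0, 1, 0, 1, 0, 1]

    tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
    print(export_text(tree, feature_names=["loc", "changes"]))
    print("prediction for a 4000-line, 30-change module:",
          tree.predict([[4000, 30]])[0])
    ```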

  1. Maintaining the Health of Software Monitors

    NASA Technical Reports Server (NTRS)

    Person, Suzette; Rungta, Neha

    2013-01-01

    Software health management (SWHM) techniques complement the rigorous verification and validation processes that are applied to safety-critical systems prior to their deployment. These techniques are used to monitor deployed software in its execution environment, serving as the last line of defense against the effects of a critical fault. SWHM monitors use information from the specification and implementation of the monitored software to detect violations, predict possible failures, and help the system recover from faults. Changes to the monitored software, such as adding new functionality or fixing defects, therefore, have the potential to impact the correctness of both the monitored software and the SWHM monitor. In this work, we describe how the results of a software change impact analysis technique, Directed Incremental Symbolic Execution (DiSE), can be applied to monitored software to identify the potential impact of the changes on the SWHM monitor software. The results of DiSE can then be used by other analysis techniques, e.g., testing, debugging, to help preserve and improve the integrity of the SWHM monitor as the monitored software evolves.

  2. Designing application software in wide area network settings

    NASA Technical Reports Server (NTRS)

    Makpangou, Mesaac; Birman, Ken

    1990-01-01

    Progress in methodologies for developing robust local area network software has not been matched by similar results for wide area settings. The design of application software spanning multiple local area environments is examined. For important classes of applications, simple design techniques are presented that yield fault tolerant wide area programs. An implementation of these techniques as a set of tools for use within the ISIS system is described.

  3. Factors That Affect Software Testability

    NASA Technical Reports Server (NTRS)

    Voas, Jeffrey M.

    1991-01-01

    Software faults that infrequently affect software's output are dangerous. When a software fault causes frequent software failures, testing is likely to reveal the fault before the software is released; when the fault remains undetected during testing, it can cause disaster after the software is installed. A technique for predicting whether a particular piece of software is likely to reveal faults within itself during testing is found in [Voas91b]. A piece of software that is likely to reveal faults within itself during testing is said to have high testability. A piece of software that is not likely to reveal faults within itself during testing is said to have low testability. It is preferable to design software with higher testability from the outset, i.e., to create software with as high a degree of testability as possible, to avoid the problems of undetected faults that are associated with low testability. Information loss is a phenomenon that occurs during program execution that increases the likelihood that a fault will remain undetected. In this paper, we identify two broad classes of information loss, define them, and suggest ways of predicting the potential for information loss to occur. We do this in order to decrease the likelihood that faults will remain undetected during testing.

  4. An experiment in software reliability: Additional analyses using data from automated replications

    NASA Technical Reports Server (NTRS)

    Dunham, Janet R.; Lauterbach, Linda A.

    1988-01-01

    A study undertaken to collect software error data of laboratory quality for use in the development of credible methods for predicting the reliability of software used in life-critical applications is summarized. The software error data reported were acquired through automated repetitive-run testing of three independent implementations of a launch interceptor condition module of a radar tracking problem. The results are based on 100 test applications to accumulate a sufficient sample size for error rate estimation. The data collected is used to confirm the results of two Boeing studies, reported in NASA-CR-165836, Software Reliability: Repetitive Run Experimentation and Modeling, and NASA-CR-172378, Software Reliability: Additional Investigations into Modeling With Replicated Experiments, respectively. That is, the results confirm the log-linear pattern of software error rates and reject the hypothesis of equal error rates per individual fault. This rejection casts doubt on the assumption that a program's failure rate is a constant multiple of the number of residual bugs, an assumption which underlies some of the current models of software reliability. The data also raises new questions concerning the phenomenon of interacting faults.

  5. Advanced microprocessor based power protection system using artificial neural network techniques

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Chen, Z.; Kalam, A.; Zayegh, A.

    This paper describes an intelligent embedded microprocessor-based system for fault classification in power system protection using advanced 32-bit microprocessor technology. The paper demonstrates the development of a protective relay to provide overcurrent protection schemes for fault detection. It also describes a method for power fault classification in a three-phase system based on the use of neural network technology. The proposed design is implemented and tested on a single-line three-phase power system in a power laboratory. Both the hardware and software development are described in detail.
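
    A hedged sketch of the classification idea, using scikit-learn's MLP as a stand-in for the paper's embedded neural network; the per-phase current features, the labels, and all data are fabricated.

    ```python
    # Toy three-phase fault-type classifier on per-phase RMS currents.
    from sklearn.neural_network import MLPClassifier

    X = [[5.1, 1.0, 1.0],   # phase A overcurrent
         [1.0, 4.9, 1.0],   # phase B overcurrent
         [1.0, 1.1, 5.3],   # phase C overcurrent
         [4.8, 5.0, 1.0],   # A-B fault
         [1.0, 0.9, 1.1]]   # healthy
    y = ["A", "B", "C", "AB", "none"]

    clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=5000,
                        random_state=1).fit(X, y)
    print(clf.predict([[5.0, 1.0, 1.1]]))   # expect ["A"]
    ```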

  6. Model-Based Fault Diagnosis: Performing Root Cause and Impact Analyses in Real Time

    NASA Technical Reports Server (NTRS)

    Figueroa, Jorge F.; Walker, Mark G.; Kapadia, Ravi; Morris, Jonathan

    2012-01-01

    Generic, object-oriented fault models, built according to causal-directed graph theory, have been integrated into an overall software architecture dedicated to monitoring and predicting the health of mission- critical systems. Processing over the generic fault models is triggered by event detection logic that is defined according to the specific functional requirements of the system and its components. Once triggered, the fault models provide an automated way for performing both upstream root cause analysis (RCA), and for predicting downstream effects or impact analysis. The methodology has been applied to integrated system health management (ISHM) implementations at NASA SSC's Rocket Engine Test Stands (RETS).
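
    The sketch below reduces the idea to its core: root cause analysis is an upstream walk and impact analysis a downstream walk over a causal directed graph. The node names and edges are invented; the real fault models add event-detection triggering and richer semantics.

    ```python
    # Causal-graph sketch: RCA walks upstream, impact analysis downstream.
    cause_of = {                        # effect -> possible upstream causes
        "low_thrust":        ["low_fuel_pressure"],
        "low_fuel_pressure": ["valve_stuck", "pump_degraded"],
    }
    effects_of = {                      # cause -> downstream effects
        "valve_stuck":       ["low_fuel_pressure"],
        "pump_degraded":     ["low_fuel_pressure"],
        "low_fuel_pressure": ["low_thrust"],
    }

    def walk(node, graph):
        """Depth-first closure over a causal adjacency map."""
        seen, stack = set(), [node]
        while stack:
            for nxt in graph.get(stack.pop(), []):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    print("root causes of low_thrust:", walk("low_thrust", cause_of))
    print("impact of valve_stuck:   ", walk("valve_stuck", effects_of))
    ```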

  7. Online Monitoring Technical Basis and Analysis Framework for Emergency Diesel Generators - Interim Report for FY 2013

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Binh T. Pham; Nancy J. Lybeck; Vivek Agarwal

    The Light Water Reactor Sustainability program at Idaho National Laboratory is actively conducting research to develop and demonstrate online monitoring capabilities for active components in existing nuclear power plants. Idaho National Laboratory and the Electric Power Research Institute are working jointly to implement a pilot project to apply these capabilities to emergency diesel generators and generator step-up transformers. The Electric Power Research Institute Fleet-Wide Prognostic and Health Management Software Suite will be used to implement monitoring in conjunction with utility partners: Braidwood Generating Station (owned by Exelon Corporation) for emergency diesel generators, and Shearon Harris Nuclear Generating Station (owned by Duke Energy Progress) for generator step-up transformers. This report presents monitoring techniques, fault signatures, and diagnostic and prognostic models for emergency diesel generators. Emergency diesel generators provide backup power to the nuclear power plant, allowing operation of essential equipment such as pumps in the emergency core coolant system during catastrophic events, including loss of offsite power. Technical experts from Braidwood are assisting Idaho National Laboratory and Electric Power Research Institute in identifying critical faults and defining fault signatures associated with each fault. The resulting diagnostic models will be implemented in the Fleet-Wide Prognostic and Health Management Software Suite and tested using data from Braidwood. Parallel research on generator step-up transformers was summarized in an interim report during the fourth quarter of fiscal year 2012.

  8. Model prototype utilization in the analysis of fault tolerant control and data processing systems

    NASA Astrophysics Data System (ADS)

    Kovalev, I. V.; Tsarev, R. Yu; Gruzenkin, D. V.; Prokopenko, A. V.; Knyazkov, A. N.; Laptenok, V. D.

    2016-04-01

    The paper presents a procedure for assessing the profit of implementing a control and data processing system. The reasonability of creating and analyzing a model prototype follows from an approach that provides fault tolerance through the inclusion of structural and software redundancy. The developed procedure allows finding the best ratio between the cost of developing and analyzing the model prototype and the earnings from utilizing its results and the information produced. The suggested approach is illustrated by a model example of profit assessment and analysis for a control and data processing system.

  9. The implementation and use of Ada on distributed systems with high reliability requirements

    NASA Technical Reports Server (NTRS)

    Knight, J. C.

    1986-01-01

    The use and implementation of Ada in distributed environments in which reliability is the primary concern were investigated. A distributed system, programmed entirely in Ada, was studied to assess the use of individual tasks without concern for the processor used. Continued development and testing of the fault tolerant Ada testbed; development of suggested changes to Ada to cope with the failures of interest; design of approaches to fault tolerant software in real time systems, and the integration of these ideas into Ada; and the preparation of various papers and presentations were discussed.

  10. Multicore Considerations for Legacy Flight Software Migration

    NASA Technical Reports Server (NTRS)

    Vines, Kenneth; Day, Len

    2013-01-01

    In this paper we will discuss potential benefits and pitfalls when considering a migration from an existing single core code base to a multicore processor implementation. The results of this study present options that should be considered before migrating fault managers, device handlers and tasks with time-constrained requirements to a multicore flight software environment. Possible future multicore test bed demonstrations are also discussed.

  11. A Fuzzy Expert System for Fault Management of Water Supply Recovery in the ALSS Project

    NASA Technical Reports Server (NTRS)

    Tohala, Vapsi J.

    1998-01-01

    Modeling with new software is a challenge. CONFIG is designed to work with many types of systems in which discrete and continuous processes occur. The CONFIG software was used to model two subsystems of the Water Recovery System: ICB and TFB. The model currently works manually, and only for water flows, with further implementation to be done in the future. Activities in the models still need to be implemented based on testing of the hardware for phase III. More improvements to CONFIG are in progress to make it more user-friendly software.

  12. Airborne Advanced Reconfigurable Computer System (ARCS)

    NASA Technical Reports Server (NTRS)

    Bjurman, B. E.; Jenkins, G. M.; Masreliez, C. J.; Mcclellan, K. L.; Templeman, J. E.

    1976-01-01

    A digital computer subsystem fault-tolerant concept was defined, and the potential benefits and costs of such a subsystem were assessed when used as the central element of a new transport's flight control system. The derived advanced reconfigurable computer system (ARCS) is a triple-redundant computer subsystem that automatically reconfigures, under multiple fault conditions, from triplex to duplex to simplex operation, with redundancy recovery if the fault condition is transient. The study included criteria development covering factors at the aircraft's operation level that would influence the design of a fault-tolerant system for commercial airline use. A new reliability analysis tool was developed for evaluating redundant, fault-tolerant system availability and survivability; and a stringent digital system software design methodology was used to achieve design/implementation visibility.

  13. Practical Methods for Estimating Software Systems Fault Content and Location

    NASA Technical Reports Server (NTRS)

    Nikora, A.; Schneidewind, N.; Munson, J.

    1999-01-01

    Over the past several years, we have developed techniques to discriminate between fault-prone software modules and those that are not, to estimate a software system's residual fault content, to identify those portions of a software system having the highest estimated number of faults, and to estimate the effects of requirements changes on software quality.

  14. A novel N-input voting algorithm for X-by-wire fault-tolerant systems.

    PubMed

    Karimi, Abbas; Zarafshan, Faraneh; Al-Haddad, S A R; Ramli, Abdul Rahman

    2014-01-01

    Voting is an important operation in the multichannel computation paradigm and in the realization of ultrareliable and real-time control systems that arbitrate among the results of N redundant variants. These systems include N-modular redundant (NMR) hardware systems and diversely designed software systems based on N-version programming (NVP). Depending on the characteristics of the application and the type of selected voter, the voting algorithms can be implemented for either hardware or software systems. In this paper, a novel voting algorithm is introduced for real-time fault-tolerant control systems, appropriate for applications in which N is large. Its behavior was then implemented in software under different scenarios of error injection on the system inputs. The results of the evaluations, analyzed through plots and statistical computations, demonstrate that this novel algorithm does not have the limitations of some popular voting algorithms such as median and weighted voting; moreover, it is able to significantly increase the reliability and availability of the system, by up to 2489.7% and 626.74%, respectively, in the best case, and by 3.84% and 1.55%, respectively, in the worst case.
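
    For contrast with the median and weighted voters mentioned above, here is a generic N-input inexact voter in Python that groups approximately agreeing inputs and returns the largest group's mean; this is a textbook plurality scheme, not the paper's novel algorithm.

    ```python
    # N-input inexact plurality voter with a majority requirement.
    def inexact_plurality(values, eps=0.01):
        best = []
        for v in values:
            group = [u for u in values if abs(u - v) <= eps]
            if len(group) > len(best):
                best = group
        # Without a majority-sized agreeing group, no output is produced.
        if len(best) <= len(values) // 2:
            return None
        return sum(best) / len(best)

    print(inexact_plurality([1.000, 1.004, 0.998, 5.2, 1.001]))  # ~1.001
    print(inexact_plurality([1.0, 2.0, 3.0, 4.0]))               # None
    ```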

  15. Software dependability in the Tandem GUARDIAN system

    NASA Technical Reports Server (NTRS)

    Lee, Inhwan; Iyer, Ravishankar K.

    1995-01-01

    Based on extensive field failure data for Tandem's GUARDIAN operating system this paper discusses evaluation of the dependability of operational software. Software faults considered are major defects that result in processor failures and invoke backup processes to take over. The paper categorizes the underlying causes of software failures and evaluates the effectiveness of the process pair technique in tolerating software faults. A model to describe the impact of software faults on the reliability of an overall system is proposed. The model is used to evaluate the significance of key factors that determine software dependability and to identify areas for improvement. An analysis of the data shows that about 77% of processor failures that are initially considered due to software are confirmed as software problems. The analysis shows that the use of process pairs to provide checkpointing and restart (originally intended for tolerating hardware faults) allows the system to tolerate about 75% of reported software faults that result in processor failures. The loose coupling between processors, which results in the backup execution (the processor state and the sequence of events) being different from the original execution, is a major reason for the measured software fault tolerance. Over two-thirds (72%) of measured software failures are recurrences of previously reported faults. Modeling, based on the data, shows that, in addition to reducing the number of software faults, software dependability can be enhanced by reducing the recurrence rate.

  16. Multi-version software reliability through fault-avoidance and fault-tolerance

    NASA Technical Reports Server (NTRS)

    Vouk, Mladen A.; Mcallister, David F.

    1989-01-01

    A number of experimental and theoretical issues associated with the practical use of multi-version software to provide run-time tolerance to software faults were investigated. A specialized tool was developed and evaluated for measuring testing coverage for a variety of metrics. The tool was used to collect information on the relationships between software faults and the coverage provided by the testing process, as measured by different metrics (including data flow metrics). Considerable correlation was found between the coverage provided by some higher metrics and the elimination of faults in the code. Back-to-back testing was continued as an efficient mechanism for the removal of uncorrelated faults and common-cause faults of variable span. Work on software reliability estimation methods also continued, based on non-random sampling and on the relationship between software reliability and the code coverage provided through testing. New fault tolerance models were formulated. Simulation studies of the Acceptance Voting and Multi-stage Voting algorithms were finished, and it was found that these two schemes for software fault tolerance are superior in many respects to some commonly used schemes. Particularly encouraging are the safety properties of the Acceptance testing scheme.

  17. Mechatronics technology in predictive maintenance method

    NASA Astrophysics Data System (ADS)

    Majid, Nurul Afiqah A.; Muthalif, Asan G. A.

    2017-11-01

    This paper presents recent mechatronics technology that can help implement predictive maintenance by combining intelligent instrumentation with predictive maintenance methods. The Vibration Fault Simulation System (VFSS) is an example of such a mechatronics system. The focus of this study is the use of vibration measurements on critical machines for fault prediction, vibration measurement often being the key indicator of the state of a machine. The paper shows how to choose an appropriate strategy in the vibration-diagnostic process for mechanical systems, especially rotating machines, so as to recognize failure during operation. Vibration signature analysis is implemented to detect faults in rotating machinery, including imbalance, mechanical looseness, bent shaft, misalignment, missing blade, bearing fault, balancing mass, and critical speed. To perform vibration signature analysis for rotating machinery faults, studies have been made of how mechatronics technology can be used for predictive maintenance. The Vibration Faults Simulation Rig (VFSR) is designed to simulate and understand fault signatures. These techniques are based on processing vibration data in the frequency domain. A LabVIEW-based spectrum analyzer software was developed to acquire and extract the frequency content of fault signals. The system was successfully tested against the characteristic vibration fault signatures that occur in rotating machinery.
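
    A small Python sketch of the frequency-domain signature idea (standing in for the LabVIEW-based analyzer): FFT a simulated vibration signal and report its dominant spectral lines, e.g. a 1x shaft-speed line for imbalance. All signal parameters here are invented.

    ```python
    # Find the dominant spectral lines in a simulated vibration signal.
    import numpy as np

    fs = 1000.0                                   # sample rate (Hz)
    t = np.arange(0, 1.0, 1.0 / fs)
    # 25 Hz shaft-speed line (imbalance) plus a weaker 2x harmonic + noise.
    x = (1.0 * np.sin(2 * np.pi * 25 * t)
         + 0.4 * np.sin(2 * np.pi * 50 * t)
         + 0.05 * np.random.randn(t.size))

    spectrum = np.abs(np.fft.rfft(x)) / t.size
    freqs = np.fft.rfftfreq(t.size, 1.0 / fs)
    peaks = freqs[np.argsort(spectrum)[-2:]]      # two strongest lines
    print("dominant frequencies (Hz):", sorted(peaks))
    ```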

  18. The Design of a Fault-Tolerant COTS-Based Bus Architecture

    NASA Technical Reports Server (NTRS)

    Chau, Savio N.; Alkalai, Leon; Burt, John B.; Tai, Ann T.

    1999-01-01

    In this paper, we report our experiences and findings on the design of a fault-tolerant bus architecture comprised of two COTS buses, the IEEE 1394 and the I2C. This fault-tolerant bus is the backbone system bus for the avionics architecture of the X2000 program at the Jet Propulsion Laboratory. COTS buses are attractive because of the availability of low-cost commercial products. However, they are not specifically designed for highly reliable applications such as long-life deep-space missions. The X2000 design team has devised a multi-level fault tolerance approach to compensate for this shortcoming of COTS buses. First, the approach enhances the fault tolerance capabilities of the IEEE 1394 and I2C buses by adding a layer of fault handling hardware and software. Second, algorithms are developed to enable the IEEE 1394 and I2C buses to assist each other in isolating and recovering from faults. Third, the set of IEEE 1394 and I2C buses is duplicated to further enhance system reliability. The X2000 design team has paid special attention to guarantee that all fault tolerance provisions will not cause the bus design to deviate from the commercial standard specifications. Otherwise, the economic attractiveness of using COTS will be diminished. The hardware and software design of the X2000 fault-tolerant bus is being implemented, and flight hardware will be delivered to the ST4 and Europa Orbiter missions.

  19. Software Fault Tolerance: A Tutorial

    NASA Technical Reports Server (NTRS)

    Torres-Pomales, Wilfredo

    2000-01-01

    Because of our present inability to produce error-free software, software fault tolerance is and will continue to be an important consideration in software systems. The root cause of software design errors is the complexity of the systems. Compounding the problems in building correct software is the difficulty in assessing the correctness of software for highly complex systems. After a brief overview of the software development processes, we note how hard-to-detect design faults are likely to be introduced during development and how software faults tend to be state-dependent and activated by particular input sequences. Although component reliability is an important quality measure for system level analysis, software reliability is hard to characterize and the use of post-verification reliability estimates remains a controversial issue. For some applications software safety is more important than reliability, and fault tolerance techniques used in those applications are aimed at preventing catastrophes. Single version software fault tolerance techniques discussed include system structuring and closure, atomic actions, inline fault detection, exception handling, and others. Multiversion techniques are based on the assumption that software built differently should fail differently and thus, if one of the redundant versions fails, it is expected that at least one of the other versions will provide an acceptable output. Recovery blocks, N-version programming, and other multiversion techniques are reviewed.
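
    As a concrete companion to the taxonomy above, here is a minimal recovery-block sketch in Python (primary variant, acceptance test, alternate on failure); the square-root variants and the tolerance are our own toy example, not code from the tutorial.

    ```python
    # Classic recovery-block structure: try variants in order until one
    # passes the acceptance test; a crash counts as a failed variant.
    def recovery_block(inputs, variants, acceptance_test):
        for variant in variants:             # primary first, then alternates
            try:
                result = variant(inputs)
                if acceptance_test(inputs, result):
                    return result            # state checkpoint discarded here
            except Exception:
                pass
        raise RuntimeError("all variants rejected by the acceptance test")

    def primary(x):                          # fast path
        return x ** 0.5

    def alternate(x):                        # slower, independently designed
        lo, hi = 0.0, max(1.0, x)
        for _ in range(60):                  # bisection to ~2**-60 precision
            mid = (lo + hi) / 2
            lo, hi = (mid, hi) if mid * mid < x else (lo, mid)
        return lo

    ok = lambda x, r: abs(r * r - x) < 1e-6  # acceptance test
    print(recovery_block(2.0, [primary, alternate], ok))
    ```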

  20. Research, Development and Testing of a Fault-Tolerant FPGA-Based Sequencer for CubeSat Launching Applications

    DTIC Science & Technology

    2013-03-01

    amounts of time and effort to implement. Future testing with commercial, fault-tolerant synthesis software, under a radiation environment, will yield ...initial viewpoint of the author is to take the flash-based FPGA route. This will yield a simple, reconfigurable circuit while providing the added...structure seen in Figure 30. Each of these full adder blocks were replaced in subsequent iterations to yield proper comparison with this baseline

  1. Software/hardware distributed processing network supporting the Ada environment

    NASA Astrophysics Data System (ADS)

    Wood, Richard J.; Pryk, Zen

    1993-09-01

    A high-performance, fault-tolerant, distributed network has been developed, tested, and demonstrated. The network is based on the MIPS Computer Systems, Inc. R3000 Risc for processing, VHSIC ASICs for high speed, reliable, inter-node communications and compatible commercial memory and I/O boards. The network is an evolution of the Advanced Onboard Signal Processor (AOSP) architecture. It supports Ada application software with an Ada- implemented operating system. A six-node implementation (capable of expansion up to 256 nodes) of the RISC multiprocessor architecture provides 120 MIPS of scalar throughput, 96 Mbytes of RAM and 24 Mbytes of non-volatile memory. The network provides for all ground processing applications, has merit for space-qualified RISC-based network, and interfaces to advanced Computer Aided Software Engineering (CASE) tools for application software development.

  2. Fault Tolerant Considerations and Methods for Guidance and Control Systems

    DTIC Science & Technology

    1987-07-01

    multifunction devices such as microprocessors with software. In striving toward the economic goal, however, a cost is incurred in a different coin, i.e...therefore been developed which reduces the software risk to acceptable proportions. Several of the techniques thus developed incur no significant cost ...complex that their design and implementation need computerized tools in order to be cost -effective (in a broad sense, including the capability of

  3. A testing-coverage software reliability model considering fault removal efficiency and error generation.

    PubMed

    Li, Qiuying; Pham, Hoang

    2017-01-01

    In this paper, we propose a software reliability model that considers not only error generation but also fault removal efficiency, combined with testing coverage information, based on a nonhomogeneous Poisson process (NHPP). During the past four decades, many software reliability growth models (SRGMs) based on NHPP have been proposed to estimate software reliability measures, most of which share the following assumptions: 1) during the testing phase, the fault detection rate always changes; 2) as a result of imperfect debugging, fault removal is accompanied by a fault re-introduction rate. However, few SRGMs in the literature differentiate between fault detection and fault removal, i.e., they seldom consider imperfect fault removal efficiency. In the practical software development process, fault removal efficiency cannot always be perfect: the failures detected might not be removed completely, the original faults might still exist, and new faults might be introduced meanwhile, which is referred to as the imperfect debugging phenomenon. In this study, a model incorporating the fault introduction rate, fault removal efficiency, and testing coverage into software reliability evaluation is developed, using testing coverage to express the fault detection rate and fault removal efficiency to model fault repair. We compare the performance of the proposed model with several existing NHPP SRGMs on three sets of real failure data using five criteria. The results show that the model gives better fitting and predictive performance.
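
    To make the NHPP baseline concrete, the sketch below fits the classic Goel-Okumoto mean-value function m(t) = a(1 - e^(-bt)) to synthetic cumulative failure counts; the proposed model layers testing coverage and imperfect removal on top of this kind of form. The data and rates here are fabricated.

    ```python
    # Fit a basic NHPP mean-value function to cumulative failure counts.
    import numpy as np
    from scipy.optimize import curve_fit

    def m(t, a, b):
        """Goel-Okumoto mean value function: expected failures by time t."""
        return a * (1.0 - np.exp(-b * t))

    t = np.array([1, 2, 4, 8, 12, 16, 20], dtype=float)        # test weeks
    failures = np.array([12, 21, 34, 48, 55, 58, 60], dtype=float)

    (a, b), _ = curve_fit(m, t, failures, p0=(70.0, 0.1))
    print(f"estimated total faults a={a:.1f}, detection rate b={b:.3f}")
    print(f"expected residual faults after week 20: {a - m(20, a, b):.1f}")
    ```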

  4. Testing Scientific Software: A Systematic Literature Review.

    PubMed

    Kanewala, Upulee; Bieman, James M

    2014-10-01

    Scientific software plays an important role in critical decision making, for example making weather predictions based on climate models, and computation of evidence for research publications. Recently, scientists have had to retract publications due to errors caused by software faults. Systematic testing can identify such faults in code. This study aims to identify specific challenges, proposed solutions, and unsolved problems faced when testing scientific software. We conducted a systematic literature survey to identify and analyze relevant literature. We identified 62 studies that provided relevant information about testing scientific software. We found that challenges faced when testing scientific software fall into two main categories: (1) testing challenges that occur due to characteristics of scientific software such as oracle problems and (2) testing challenges that occur due to cultural differences between scientists and the software engineering community such as viewing the code and the model that it implements as inseparable entities. In addition, we identified methods to potentially overcome these challenges and their limitations. Finally we describe unsolved challenges and how software engineering researchers and practitioners can help to overcome them. Scientific software presents special challenges for testing. Specifically, cultural differences between scientist developers and software engineers, along with the characteristics of the scientific software make testing more difficult. Existing techniques such as code clone detection can help to improve the testing process. Software engineers should consider special challenges posed by scientific software such as oracle problems when developing testing techniques.

  5. Secure Proactive Recovery a Hardware Based Mission Assurance Scheme

    DTIC Science & Technology

    2011-08-01

    Kalbarczyk, Z., Iyer, R.K., Bagchi, S., and Whisnant, K. (1999) "Chameleon: a software infrastructure for adaptive fault tolerance." Components of this evaluation include a JAVA implementation based on Chameleon ARMORs (Kalbarczyk et al. 1999) and an ARENA simulation (http

  6. An experimental evaluation of the REE SIFT environment for spaceborne applications

    NASA Technical Reports Server (NTRS)

    Whistnant, K.; Iyer, R. K.; Jones, P.; Some, R.; Rennels, D.

    2002-01-01

    This paper presents an experimental evaluation of a software-implemented fault tolerance environment built around a set of self-checking ARMOR processes running on different machines that provide error detection and recovery services to themselves and to spaceborne scientific applications.

  7. A methodology for testing fault-tolerant software

    NASA Technical Reports Server (NTRS)

    Andrews, D. M.; Mahmood, A.; Mccluskey, E. J.

    1985-01-01

    A methodology for testing fault tolerant software is presented. There are problems associated with testing fault tolerant software because many errors are masked or corrected by voters, limiters, or automatic channel synchronization. This methodology illustrates how the same strategies used for testing fault tolerant hardware can be applied to testing fault tolerant software. For example, one strategy used in testing fault tolerant hardware is to disable the redundancy during testing. A similar testing strategy is proposed for software, namely, to move the major emphasis on testing earlier in the development cycle (before the redundancy is in place), thus reducing the possibility that undetected errors will be masked when limiters and voters are added.

  8. (Quickly) Testing the Tester via Path Coverage

    NASA Technical Reports Server (NTRS)

    Groce, Alex

    2009-01-01

    The configuration complexity and code size of an automated testing framework may grow to a point that the tester itself becomes a significant software artifact, prone to poor configuration and implementation errors. Unfortunately, testing the tester by using old versions of the software under test (SUT) may be impractical or impossible: test framework changes may have been motivated by interface changes in the tested system, or fault detection may become too expensive in terms of computing time to justify running until errors are detected on older versions of the software. We propose the use of path coverage measures as a "quick and dirty" method for detecting many faults in complex test frameworks. We also note the possibility of using techniques developed to diversify state-space searches in model checking to diversify test focus, and an associated classification of tester changes into focus-changing and non-focus-changing modifications.

  9. On the nature of bias and defects in the software specification process

    NASA Technical Reports Server (NTRS)

    Straub, Pablo A.; Zelkowitz, Marvin V.

    1992-01-01

    Implementation bias in a specification is an arbitrary constraint in the solution space. This paper describes the problem of bias. Additionally, this paper presents a model of the specification and design processes describing individual subprocesses in terms of precision/detail diagrams and a model of bias in multi-attribute software specifications. While studying how bias is introduced into a specification we realized that software defects and bias are dual problems of a single phenomenon. This was used to explain the large proportion of faults found during the coding phase at the Software Engineering Laboratory at NASA/GSFC.

  10. A testing-coverage software reliability model considering fault removal efficiency and error generation

    PubMed Central

    Li, Qiuying; Pham, Hoang

    2017-01-01

    In this paper, we propose a software reliability model that considers not only error generation but also fault removal efficiency, combined with testing coverage information, based on a nonhomogeneous Poisson process (NHPP). During the past four decades, many software reliability growth models (SRGMs) based on NHPP have been proposed to estimate software reliability measures, most of which share the following assumptions: 1) during the testing phase, the fault detection rate always changes; 2) as a result of imperfect debugging, fault removal is accompanied by a fault re-introduction rate. However, few SRGMs in the literature differentiate between fault detection and fault removal, i.e., they seldom consider imperfect fault removal efficiency. In the practical software development process, fault removal efficiency cannot always be perfect: the failures detected might not be removed completely, the original faults might still exist, and new faults might be introduced meanwhile, which is referred to as the imperfect debugging phenomenon. In this study, a model incorporating the fault introduction rate, fault removal efficiency, and testing coverage into software reliability evaluation is developed, using testing coverage to express the fault detection rate and fault removal efficiency to model fault repair. We compare the performance of the proposed model with several existing NHPP SRGMs on three sets of real failure data using five criteria. The results show that the model gives better fitting and predictive performance. PMID:28750091

  11. Software-Implemented Fault Tolerance in Communications Systems

    NASA Technical Reports Server (NTRS)

    Gantenbein, Rex E.

    1994-01-01

    Software-implemented fault tolerance (SIFT) is used in many computer-based command, control, and communications (C(3)) systems to provide the nearly continuous availability that they require. In the communications subsystem of Space Station Alpha, SIFT algorithms are used to detect and recover from failures in the data and command link between the Station and its ground support. The paper presents a review of these algorithms and discusses how such techniques can be applied to similar systems found in applications such as manufacturing control, military communications, and programmable devices such as pacemakers. With support from the Tracking and Communication Division of NASA's Johnson Space Center, researchers at the University of Wyoming are developing a testbed for evaluating the effectiveness of these algorithms prior to their deployment. This testbed will be capable of simulating a variety of C(3) system failures and recording the response of the Space Station SIFT algorithms to these failures. The design of this testbed and the applicability of the approach in other environments is described.
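
    A toy Python sketch of one such SIFT mechanism, a heartbeat deadline that declares the link failed and triggers recovery; the timings and names are invented and unrelated to the actual Space Station algorithms.

    ```python
    # Heartbeat-deadline link monitor: silence past the timeout means the
    # link is considered down and a recovery path should be selected.
    import time

    class LinkMonitor:
        def __init__(self, timeout_s=2.0):
            self.timeout_s = timeout_s
            self.last_heartbeat = time.monotonic()

        def heartbeat(self):
            self.last_heartbeat = time.monotonic()

        def check(self):
            """Return True while the link is considered up."""
            return (time.monotonic() - self.last_heartbeat) < self.timeout_s

    mon = LinkMonitor(timeout_s=0.1)
    mon.heartbeat()
    print("link up:", mon.check())
    time.sleep(0.2)                      # simulated loss of ground contact
    print("link up after silence:", mon.check(), "-> switch to backup path")
    ```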

  12. Static and Dynamic Verification of Critical Software for Space Applications

    NASA Astrophysics Data System (ADS)

    Moreira, F.; Maia, R.; Costa, D.; Duro, N.; Rodríguez-Dapena, P.; Hjortnaes, K.

    Space technology is no longer used only for highly specialised research activities or for sophisticated manned space missions. Modern society relies more and more on space technology and applications for everyday activities. Worldwide telecommunications, Earth observation, navigation, and remote sensing are only a few examples of space applications on which we rely daily. The European-driven global navigation system Galileo and its associated applications, e.g. air traffic management and vessel and car navigation, will significantly expand the already stringent safety requirements for space-based applications. Apart from their usefulness and practical applications, every single piece of onboard software deployed into space represents an enormous investment. With a long operational lifetime, and being extremely difficult to maintain and upgrade, at least compared with "mainstream" software development, the importance of ensuring their correctness before deployment is immense. Verification & Validation techniques and technologies have a key role in ensuring that the onboard software is correct and error free, or at least free from errors that can potentially lead to catastrophic failures. Many RAMS techniques, including both static criticality analysis and dynamic verification techniques, have been used as a means to verify and validate critical software and to ensure its correctness. But, traditionally, these have been applied in isolation. One of the main reasons is the immaturity of this field as concerns its application to the growing software content of space systems. This paper presents an innovative way of combining static and dynamic techniques, exploiting their synergy and complementarity for software fault removal. The proposed methodology is based on the combination of Software FMEA and FTA with fault-injection techniques. The case study described herein is implemented with support from two tools: the SoftCare tool for the SFMEA and SFTA, and the Xception tool for fault injection. Keywords: Verification & Validation, RAMS, Onboard software, SFMEA, SFTA, Fault-injection. [This work is being performed under the project STADY, Applied Static And Dynamic Verification Of Critical Software, ESA/ESTEC Contract Nr. 15751/02/NL/LvH.]
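
    To ground the fault-injection half of the methodology, here is a toy single-bit-flip injector in Python used to estimate the coverage of an error-detection check; real tools such as Xception inject at the processor/exception level, and everything here (names, checker, workload) is an illustrative assumption.

    ```python
    # Estimate detection coverage by flipping random bits in a result
    # and counting how often a checker flags the corruption.
    import random

    def flip_bit(value: int, bit: int, width: int = 32) -> int:
        """Return value with one bit inverted (masked to the given width)."""
        return (value ^ (1 << bit)) & ((1 << width) - 1)

    def run_with_injection(compute, checker, trials=1000):
        detected = 0
        for _ in range(trials):
            good = compute()
            faulty = flip_bit(good, random.randrange(32))
            if checker(faulty):          # checker flags corrupted results
                detected += 1
        return detected / trials

    # Example: a simple range check as the error-detection mechanism.
    coverage = run_with_injection(compute=lambda: 42,
                                  checker=lambda v: not (0 <= v <= 100))
    print(f"estimated detection coverage: {coverage:.1%}")
    ```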

  13. Advanced Development for Space Robotics With Emphasis on Fault Tolerance Technology

    NASA Technical Reports Server (NTRS)

    Tesar, Delbert

    1997-01-01

    This report describes work developing fault tolerant redundant robotic architectures and adaptive control strategies for robotic manipulator systems which can dynamically accommodate drastic robot manipulator mechanism, sensor or control failures and maintain stable end-point trajectory control with minimum disturbance. Kinematic designs of redundant, modular, reconfigurable arms for fault tolerance were pursued at a fundamental level. The approach developed robotic testbeds to evaluate disturbance responses of fault tolerant concepts in robotic mechanisms and controllers. The development was implemented in various fault tolerant mechanism testbeds including duality in the joint servo motor modules, parallel and serial structural architectures, and dual arms. All have real-time adaptive controller technologies to react to mechanism or controller disturbances (failures) to perform real-time reconfiguration to continue the task operations. The developments fall into three main areas: hardware, software, and theoretical.

  14. Onboard Sensor Data Qualification in Human-Rated Launch Vehicles

    NASA Technical Reports Server (NTRS)

    Wong, Edmond; Melcher, Kevin J.; Maul, William A.; Chicatelli, Amy K.; Sowers, Thomas S.; Fulton, Christopher; Bickford, Randall

    2012-01-01

    The avionics system software for human-rated launch vehicles requires an implementation approach that is robust to failures, especially the failure of sensors used to monitor vehicle conditions that might result in an abort determination. Sensor measurements provide the basis for operational decisions on human-rated launch vehicles. These data are often used to assess the health of system or subsystem components, to identify failures, and to take corrective action. An incorrect conclusion and/or response may result if the sensor itself provides faulty data, or if the data provided by the sensor have been corrupted. Operational decisions based on faulty sensor data have the potential to be catastrophic, resulting in loss of mission or loss of crew. To prevent these latter situations from occurring, a Modular Architecture and Generalized Methodology for Sensor Data Qualification in Human-rated Launch Vehicles has been developed. Sensor Data Qualification (SDQ) is a set of algorithms that can be implemented in onboard flight software and used to qualify data obtained from flight-critical sensors before the data are used by other flight software algorithms. Qualified data have been analyzed by SDQ and determined to be a true representation of the sensed system state; that is, the sensor data are determined not to be corrupted by sensor faults or signal transmission faults. Sensor data can become corrupted by faults at any point in the signal path between the sensor and the flight computer. Qualifying the sensor data has the benefit of ensuring that erroneous data are identified and flagged before otherwise being used for operational decisions, thus increasing confidence in the response of the other flight software processes using the qualified data, and decreasing the probability of false alarms or missed detections.
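
    The gating idea can be pictured with a small sketch: each sample passes hypothetical range and rate-of-change checks before redundant sensors are fused, loosely in the spirit of SDQ. The check set, names, and thresholds below are illustrative assumptions, not the NASA algorithms.

    ```python
    # Illustrative sensor-data-qualification sketch; thresholds and checks
    # are hypothetical, not the SDQ flight algorithms.

    def qualify(sample, prev_sample, lo, hi, max_rate, dt):
        """Return True if a sample passes simple range and rate checks."""
        if not (lo <= sample <= hi):          # range check: physically plausible?
            return False
        if prev_sample is not None:
            if abs(sample - prev_sample) / dt > max_rate:  # rate-of-change check
                return False
        return True

    def qualified_mean(samples, lo, hi, max_rate, dt):
        """Fuse redundant sensors, using only samples that pass qualification."""
        good = [s for s in samples if qualify(s, None, lo, hi, max_rate, dt)]
        if not good:
            raise ValueError("no qualified samples: flag data as suspect")
        return sum(good) / len(good)

    # Example: three redundant pressure sensors, one stuck at an implausible value.
    print(qualified_mean([101.3, 101.5, -999.0], lo=0.0, hi=200.0,
                         max_rate=50.0, dt=0.02))  # -> mean of the two good reads
    ```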

  15. Assessing Survivability Using Software Fault Injection

    DTIC Science & Technology

    2001-04-01

    UNCLASSIFIED Defense Technical Information Center Compilation Part Notice ADP010875. TITLE: Assessing Survivability Using Software Fault Injection. Jeffrey Voas, Reliable Software Technologies, 21351 Ridgetop Circle, #400, Dulles, VA 20166 (jmvoas@rstcorp.com).

  16. Off-fault plasticity in three-dimensional dynamic rupture simulations using a modal Discontinuous Galerkin method on unstructured meshes: Implementation, verification, and application

    NASA Astrophysics Data System (ADS)

    Wollherr, Stephanie; Gabriel, Alice-Agnes; Uphoff, Carsten

    2018-05-01

    The dynamics and potential size of earthquakes depend crucially on rupture transfers between adjacent fault segments. To accurately describe earthquake source dynamics, numerical models can account for realistic fault geometries and rheologies such as nonlinear inelastic processes off the slip interface. We present implementation, verification, and application of off-fault Drucker-Prager plasticity in the open-source software SeisSol (www.seissol.org). SeisSol is based on an arbitrary high-order derivative modal Discontinuous Galerkin (ADER-DG) method using unstructured, tetrahedral meshes specifically suited for complex geometries. Two implementation approaches are detailed, modelling plastic failure either by employing sub-elemental quadrature points or by switching to nodal basis coefficients. At fine fault discretizations the nodal basis approach is up to 6 times more efficient in terms of computational costs while yielding comparable accuracy. Both methods are verified in community benchmark problems and by three-dimensional numerical h- and p-refinement studies with heterogeneous initial stresses. We observe no spectral convergence for on-fault quantities with respect to a given reference solution, but rather discuss a limitation to low-order convergence for heterogeneous 3D dynamic rupture problems. For simulations including plasticity, a high fault resolution may be less crucial than commonly assumed, due to the regularization of peak slip rate and an increase of the minimum cohesive zone width. In large-scale dynamic rupture simulations based on the 1992 Landers earthquake, we observe high rupture complexity including reverse slip, direct branching, and dynamic triggering. The spatio-temporal distribution of rupture transfers is altered distinctly by plastic energy absorption, correlated with locations of geometrical fault complexity. Computational cost increases by 7% when accounting for off-fault plasticity in this demonstration application. Our results imply that the combination of fully 3D dynamic modelling, complex fault geometries, and off-fault plastic yielding is important to realistically capture dynamic rupture transfers in natural fault systems.

  17. Automated Monitoring with a BSP Fault-Detection Test

    NASA Technical Reports Server (NTRS)

    Bickford, Randall L.; Herzog, James P.

    2003-01-01

    The figure schematically illustrates a method and procedure for automated monitoring of an asset, as well as a hardware- and-software system that implements the method and procedure. As used here, asset could signify an industrial process, power plant, medical instrument, aircraft, or any of a variety of other systems that generate electronic signals (e.g., sensor outputs). In automated monitoring, the signals are digitized and then processed in order to detect faults and otherwise monitor operational status and integrity of the monitored asset. The major distinguishing feature of the present method is that the fault-detection function is implemented by use of a Bayesian sequential probability (BSP) technique. This technique is superior to other techniques for automated monitoring because it affords sensitivity, not only to disturbances in the mean values, but also to very subtle changes in the statistical characteristics (variance, skewness, and bias) of the monitored signals.
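
    The record does not give the BSP equations, but the family it belongs to can be sketched with a classical sequential probability ratio test (SPRT) that decides between a nominal and a degraded mean as samples arrive; the SPRT stand-in, distributions, and parameters below are assumptions for illustration, not the patented BSP method.

    ```python
    import math

    # Minimal SPRT sketch for sequential fault detection on a signal; the BSP
    # technique is related but not reproduced here. All parameters illustrative.

    def sprt(samples, mu0=0.0, mu1=1.0, sigma=1.0, alpha=0.05, beta=0.05):
        """Decide 'nominal' (H0: mean mu0) vs 'fault' (H1: mean mu1) sequentially."""
        upper = math.log((1 - beta) / alpha)   # declare fault above this bound
        lower = math.log(beta / (1 - alpha))   # declare nominal below this bound
        llr = 0.0
        for n, x in enumerate(samples, 1):
            # Log-likelihood ratio increment for Gaussian observations.
            llr += (mu1 - mu0) * (x - (mu0 + mu1) / 2.0) / sigma ** 2
            if llr >= upper:
                return "fault", n
            if llr <= lower:
                return "nominal", n
        return "undecided", len(samples)

    # A signal that starts nominal and drifts toward the degraded mean.
    drifting = [0.1, -0.2, 0.05, 1.2, 0.9, 1.1, 1.3, 0.8, 1.2, 1.0, 1.4, 0.9]
    print(sprt(drifting))  # -> ('fault', 11)
    ```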

  18. MIROS: A Hybrid Real-Time Energy-Efficient Operating System for the Resource-Constrained Wireless Sensor Nodes

    PubMed Central

    Liu, Xing; Hou, Kun Mean; de Vaulx, Christophe; Shi, Hongling; Gholami, Khalid El

    2014-01-01

    Operating system (OS) technology is significant for the proliferation of the wireless sensor network (WSN). With an outstanding OS, the constrained WSN resources (processor, memory and energy) can be utilized efficiently. Moreover, the user application development can be served soundly. In this article, a new hybrid, real-time, memory-efficient, energy-efficient, user-friendly and fault-tolerant WSN OS, MIROS, is designed and implemented. MIROS implements a hybrid scheduler and a dynamic memory allocator. Real-time scheduling can thus be achieved with low memory consumption. In addition, it implements a mid-layer software EMIDE (Efficient Mid-layer Software for User-Friendly Application Development Environment) to decouple the WSN application from the low-level system. The application programming process can consequently be simplified and the application reprogramming performance improved. Moreover, it combines both software and multi-core hardware techniques to conserve energy resources, improve node reliability, and achieve a new debugging method. To evaluate the performance of MIROS, it is compared with other WSN OSes (TinyOS, Contiki, SOS, openWSN and mantisOS) from different OS concerns. The final evaluation results prove that MIROS is suitable for use even on tightly resource-constrained WSN nodes. It can support real-time WSN applications. Furthermore, it is energy efficient, user friendly and fault tolerant. PMID:25248069

  19. MIROS: a hybrid real-time energy-efficient operating system for the resource-constrained wireless sensor nodes.

    PubMed

    Liu, Xing; Hou, Kun Mean; de Vaulx, Christophe; Shi, Hongling; El Gholami, Khalid

    2014-09-22

    Operating system (OS) technology is significant for the proliferation of the wireless sensor network (WSN). With an outstanding OS, the constrained WSN resources (processor, memory and energy) can be utilized efficiently. Moreover, the user application development can be served soundly. In this article, a new hybrid, real-time, memory-efficient, energy-efficient, user-friendly and fault-tolerant WSN OS, MIROS, is designed and implemented. MIROS implements a hybrid scheduler and a dynamic memory allocator. Real-time scheduling can thus be achieved with low memory consumption. In addition, it implements a mid-layer software EMIDE (Efficient Mid-layer Software for User-Friendly Application Development Environment) to decouple the WSN application from the low-level system. The application programming process can consequently be simplified and the application reprogramming performance improved. Moreover, it combines both software and multi-core hardware techniques to conserve energy resources, improve node reliability, and achieve a new debugging method. To evaluate the performance of MIROS, it is compared with other WSN OSes (TinyOS, Contiki, SOS, openWSN and mantisOS) from different OS concerns. The final evaluation results prove that MIROS is suitable for use even on tightly resource-constrained WSN nodes. It can support real-time WSN applications. Furthermore, it is energy efficient, user friendly and fault tolerant.

  20. Numerical simulations of earthquakes and the dynamics of fault systems using the Finite Element method.

    NASA Astrophysics Data System (ADS)

    Kettle, L. M.; Mora, P.; Weatherley, D.; Gross, L.; Xing, H.

    2006-12-01

    Simulations using the Finite Element method are widely used in many engineering applications and for the solution of partial differential equations (PDEs). Computational models based on the solution of PDEs play a key role in earth systems simulations. We present numerical modelling of crustal fault systems where the dynamic elastic wave equation is solved using the Finite Element method. This is achieved using a high-level computational modelling language, escript, available as open-source software from ACcESS (Australian Computational Earth Systems Simulator) at the University of Queensland. Escript is an advanced geophysical simulation software package developed at ACcESS which includes parallel equation solvers, data visualisation and data analysis software. The escript library was used to develop a flexible Finite Element model which reliably simulates the mechanism of faulting and the physics of earthquakes. Both 2D and 3D elastodynamic models are being developed to study the dynamics of crustal fault systems. Our final goal is to build a flexible model which can be applied to any fault system with user-defined geometry and input parameters. To study the physics of earthquake processes, two different time scales must be modelled: firstly, the quasi-static loading phase which gradually increases stress in the system (~100 years), and secondly, the dynamic rupture process which rapidly redistributes stress in the system (~100 seconds). We will discuss the solution of the time-dependent elastic wave equation for an arbitrary fault system using escript. This involves prescribing the correct initial stress distribution in the system to simulate the quasi-static loading of faults to failure; determining a suitable frictional constitutive law which accurately reproduces the dynamics of the stick/slip instability at the faults; and using a robust time integration scheme. These dynamic models generate data and information that can be used for earthquake forecasting.
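
    The explicit time stepping of the fast dynamic phase can be illustrated in miniature. The sketch below is a toy 1D wave solver using finite differences, deliberately not escript's PDE interface (which is not reproduced here); grid, pulse, and material parameters are assumptions for illustration only.

    ```python
    import numpy as np

    # Toy 1D elastic wave propagation: explicit second-order time stepping,
    # standing in for the dynamic rupture phase. Not the escript API.

    nx, nt = 200, 500
    dx, c = 1.0, 1.0                 # grid spacing and wave speed (illustrative)
    dt = 0.5 * dx / c                # CFL-stable time step
    u = np.zeros(nx)
    u_prev = np.zeros(nx)
    u[nx // 2] = 1.0                 # initial displacement pulse ("stress drop")

    for _ in range(nt):
        lap = np.zeros(nx)
        lap[1:-1] = u[2:] - 2 * u[1:-1] + u[:-2]   # discrete Laplacian
        u_next = 2 * u - u_prev + (c * dt / dx) ** 2 * lap
        u_prev, u = u, u_next        # advance one explicit step

    print("max |u| after propagation:", float(np.abs(u).max()))
    ```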

  1. Fault Tree Analysis Application for Safety and Reliability

    NASA Technical Reports Server (NTRS)

    Wallace, Dolores R.

    2003-01-01

    Many commercial software tools exist for fault tree analysis (FTA), an accepted method for mitigating risk in systems. The method embedded in these tools identifies root causes in system components, but when software is identified as a root cause, the tools do not build trees into the software component. No commercial software tools have been built specifically for the development and analysis of software fault trees. Research indicates that the methods of FTA could be applied to software, but the method is not practical without automated tool support. With appropriate automated tool support, software fault tree analysis (SFTA) may be a practical technique for identifying the underlying causes of software faults that may lead to critical system failures. We strive to demonstrate that existing commercial tools for FTA can be adapted for use with SFTA, and that, applied to a safety-critical system, SFTA can be used to identify serious potential problems long before integration and system testing.
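
    The core computation a fault-tree tool automates can be shown with a minimal sketch: AND/OR gates combined over basic-event probabilities, assuming independent events. The gate layout and probabilities below are hypothetical, not from the record.

    ```python
    # Minimal fault-tree evaluation sketch: AND/OR gates over basic-event
    # probabilities, assuming independence. Tree and numbers are hypothetical.

    def evaluate(node, p):
        """node is ('basic', name) or (gate, [children]) with gate in {'AND','OR'}."""
        kind = node[0]
        if kind == "basic":
            return p[node[1]]
        probs = [evaluate(child, p) for child in node[1]]
        if kind == "AND":                      # all children must fail
            out = 1.0
            for q in probs:
                out *= q
            return out
        out = 1.0                              # OR: at least one child fails
        for q in probs:
            out *= (1.0 - q)
        return 1.0 - out

    # Top event: crash if (bad input AND missing validation) OR memory leak.
    tree = ("OR", [("AND", [("basic", "bad_input"), ("basic", "no_validation")]),
                   ("basic", "memory_leak")])
    print(evaluate(tree, {"bad_input": 0.1, "no_validation": 0.5,
                          "memory_leak": 0.01}))   # -> ~0.0595
    ```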

  2. The cost of software fault tolerance

    NASA Technical Reports Server (NTRS)

    Migneault, G. E.

    1982-01-01

    The proposed use of software fault tolerance techniques as a means of reducing software costs in avionics, and as a means of addressing the issue of system unreliability due to faults in software, is examined. A model is developed to provide a view of the relationships among cost, redundancy, and reliability, which suggests unconventional strategies for software development and maintenance.

  3. Technical Basis for Evaluating Software-Related Common-Cause Failures

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Muhlheim, Michael David; Wood, Richard

    2016-04-01

    The instrumentation and control (I&C) system architecture at a nuclear power plant (NPP) incorporates protections against common-cause failures (CCFs) through the use of diversity and defense-in-depth. Even for well-established analog-based I&C system designs, the potential for CCFs of multiple systems (or redundancies within a system) constitutes a credible threat to defeating the defense-in-depth provisions within the I&C system architectures. The integration of digital technologies into the I&C systems provides many advantages compared to the aging analog systems with respect to reliability, maintenance, operability, and cost effectiveness. However, maintaining the diversity and defense-in-depth for both the hardware and software within the digital system is challenging. In fact, the introduction of digital technologies may actually increase the potential for CCF vulnerabilities because of the introduction of undetected systematic faults. These systematic faults are defined as a "design fault located in a software component" and, at a high level, are predominately the result of (1) errors in the requirement specification, (2) inadequate provisions to account for design limits (e.g., environmental stress), or (3) technical faults incorporated in the internal system (or architectural) design or implementation. Other technology-neutral CCF concerns include hardware design errors, equipment qualification deficiencies, installation or maintenance errors, and instrument loop scaling and setpoint mistakes.

  4. A Generic Modeling Process to Support Functional Fault Model Development

    NASA Technical Reports Server (NTRS)

    Maul, William A.; Hemminger, Joseph A.; Oostdyk, Rebecca; Bis, Rachael A.

    2016-01-01

    Functional fault models (FFMs) are qualitative representations of a system's failure space that are used to provide a diagnostic of the modeled system. An FFM simulates the failure effect propagation paths within a system between failure modes and observation points. These models contain a significant amount of information about the system including the design, operation and off nominal behavior. The development and verification of the models can be costly in both time and resources. In addition, models depicting similar components can be distinct, both in appearance and function, when created individually, because there are numerous ways of representing the failure space within each component. Generic application of FFMs has the advantages of software code reuse: reduction of time and resources in both development and verification, and a standard set of component models from which future system models can be generated with common appearance and diagnostic performance. This paper outlines the motivation to develop a generic modeling process for FFMs at the component level and the effort to implement that process through modeling conventions and a software tool. The implementation of this generic modeling process within a fault isolation demonstration for NASA's Advanced Ground System Maintenance (AGSM) Integrated Health Management (IHM) project is presented and the impact discussed.
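
    The propagation step an FFM performs can be sketched as reachability in a directed graph from failure modes to observation points; the component graph and names below are hypothetical, not from the AGSM IHM models.

    ```python
    from collections import defaultdict, deque

    # FFM-style failure-effect propagation sketch: effects flow along directed
    # edges from failure modes to observation points. Graph is hypothetical.

    edges = defaultdict(list)
    for src, dst in [("pump_fail", "low_flow"), ("valve_stuck", "low_flow"),
                     ("low_flow", "low_pressure"), ("low_pressure", "sensor_P1")]:
        edges[src].append(dst)

    def reachable_observations(failure_mode, observation_points):
        """Breadth-first propagation of a failure effect to observation points."""
        seen, queue, hits = {failure_mode}, deque([failure_mode]), []
        while queue:
            node = queue.popleft()
            if node in observation_points:
                hits.append(node)
            for nxt in edges[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return hits

    print(reachable_observations("pump_fail", {"sensor_P1"}))  # -> ['sensor_P1']
    ```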

  5. Testing Scientific Software: A Systematic Literature Review

    PubMed Central

    Kanewala, Upulee; Bieman, James M.

    2014-01-01

    Context Scientific software plays an important role in critical decision making, for example making weather predictions based on climate models, and computation of evidence for research publications. Recently, scientists have had to retract publications due to errors caused by software faults. Systematic testing can identify such faults in code. Objective This study aims to identify specific challenges, proposed solutions, and unsolved problems faced when testing scientific software. Method We conducted a systematic literature survey to identify and analyze relevant literature. We identified 62 studies that provided relevant information about testing scientific software. Results We found that challenges faced when testing scientific software fall into two main categories: (1) testing challenges that occur due to characteristics of scientific software such as oracle problems and (2) testing challenges that occur due to cultural differences between scientists and the software engineering community such as viewing the code and the model that it implements as inseparable entities. In addition, we identified methods to potentially overcome these challenges and their limitations. Finally we describe unsolved challenges and how software engineering researchers and practitioners can help to overcome them. Conclusions Scientific software presents special challenges for testing. Specifically, cultural differences between scientist developers and software engineers, along with the characteristics of the scientific software make testing more difficult. Existing techniques such as code clone detection can help to improve the testing process. Software engineers should consider special challenges posed by scientific software such as oracle problems when developing testing techniques. PMID:25125798

  6. Abnormal fault-recovery characteristics of the fault-tolerant multiprocessor uncovered using a new fault-injection methodology

    NASA Technical Reports Server (NTRS)

    Padilla, Peter A.

    1991-01-01

    An investigation was made in AIRLAB of the fault-handling performance of the Fault Tolerant MultiProcessor (FTMP). Fault-handling errors detected during fault injection experiments were characterized. In these fault injection experiments, the FTMP disabled a working unit instead of the faulted unit once in every 500 faults, on average. System design weaknesses allow active faults to exercise a part of the fault management software that handles Byzantine or lying faults. Byzantine faults behave such that the faulted unit points to a working unit as the source of errors. The design's problems involve (1) the design and interface between the simplex error detection hardware and the error processing software, (2) the functional capabilities of the FTMP system bus, and (3) the communication requirements of a multiprocessor architecture. These weak areas in the FTMP's design increase the probability that, for any hardware fault, a good line replacement unit (LRU) is mistakenly disabled by the fault management software.

  7. Detection of faults and software reliability analysis

    NASA Technical Reports Server (NTRS)

    Knight, J. C.

    1986-01-01

    Multiversion or N-version programming was proposed as a method of providing fault tolerance in software. The approach requires the separate, independent preparation of multiple versions of a piece of software for some application. Specific topics addressed are: failure probabilities in N-version systems, consistent comparison in N-version systems, descriptions of the faults found in the Knight and Leveson experiment, analytic models of comparison testing, characteristics of the input regions that trigger faults, fault tolerance through data diversity, and the relationship between failures caused by automatically seeded faults.
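
    The voting step at the heart of N-version execution can be sketched in a few lines: run all versions, cross-check at a comparison point, and mask the minority result. The three toy versions below (one with a seeded fault) are illustrative stand-ins for independently developed programs.

    ```python
    from collections import Counter

    # Minimal N-version majority-voter sketch; the "versions" are toy stand-ins
    # for independently developed implementations of one specification.

    def version_a(x): return x * x
    def version_b(x): return x ** 2
    def version_c(x): return x * x if x != 3 else -1   # seeded fault at x == 3

    def vote(x, versions):
        """Run all versions, cross-check results, and mask the minority."""
        results = Counter(v(x) for v in versions)
        value, count = results.most_common(1)[0]
        if count <= len(versions) // 2:
            raise RuntimeError("no majority: correlated failures suspected")
        return value

    print(vote(3, [version_a, version_b, version_c]))  # majority masks fault -> 9
    ```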

  8. Model Transformation for a System of Systems Dependability Safety Case

    NASA Technical Reports Server (NTRS)

    Murphy, Judy; Driskell, Stephen B.

    2010-01-01

    Software plays an increasingly large role in all aspects of NASA's science missions. This extends to the identification, management, and control of faults that affect safety-critical functions and, by extension, the overall success of the mission. Traditionally, the analysis of fault identification, management, and control has been hardware based. Due to the increasing complexity of systems, there has been a corresponding increase in the complexity of fault management software. The NASA Independent Verification & Validation (IV&V) program is creating processes and procedures to identify and incorporate safety-critical software requirements along with corresponding software faults so that potential hazards may be mitigated. This Specific to Generic ... A Case for Reuse paper describes the phases of a dependability and safety study which identifies a new process to create a foundation for reusable assets. These assets support the identification and management of specific software faults and their transformation from specific to generic software faults. This approach also has applications to other systems outside of the NASA environment. This paper addresses how a mission-specific dependability and safety case is being transformed to a generic dependability and safety case which can be reused for any type of space mission, with an emphasis on software fault conditions.

  9. Preliminary design of the redundant software experiment

    NASA Technical Reports Server (NTRS)

    Campbell, Roy; Deimel, Lionel; Eckhardt, Dave, Jr.; Kelly, John; Knight, John; Lauterbach, Linda; Lee, Larry; Mcallister, Dave; Mchugh, John

    1985-01-01

    The goal of the present experiment is to characterize the fault distributions of highly reliable software replicates, constructed using techniques and environments similar to those used in contemporary industrial software facilities. The fault distributions and their effect on the reliability of fault-tolerant configurations of the software will be determined through extensive life testing of the replicates against carefully constructed, randomly generated test data. Each detected error will be carefully analyzed to provide insight into its nature and cause. A direct objective is to develop techniques for reducing the intensity of coincident errors, thus increasing the reliability gain which can be achieved with fault tolerance. Data on the reliability gains realized, and on the cost of the fault-tolerant configurations, can be used to design a companion experiment to determine the cost effectiveness of the fault-tolerant strategy. Finally, the data and analysis produced by this experiment will be valuable to the software engineering community as a whole because they will provide useful insight into the nature and cause of hard-to-find, subtle faults which escape standard software engineering validation techniques and thus persist far into the software life cycle.

  10. Partitioning in Avionics Architectures: Requirements, Mechanisms, and Assurance

    NASA Technical Reports Server (NTRS)

    Rushby, John

    1999-01-01

    Automated aircraft control has traditionally been divided into distinct "functions" that are implemented separately (e.g., autopilot, autothrottle, flight management); each function has its own fault-tolerant computer system, and dependencies among different functions are generally limited to the exchange of sensor and control data. A by-product of this "federated" architecture is that faults are strongly contained within the computer system of the function where they occur and cannot readily propagate to affect the operation of other functions. More modern avionics architectures contemplate supporting multiple functions on a single, shared, fault-tolerant computer system where natural fault containment boundaries are less sharply defined. Partitioning uses appropriate hardware and software mechanisms to restore strong fault containment to such integrated architectures. This report examines the requirements for partitioning, mechanisms for their realization, and issues in providing assurance for partitioning. Because partitioning shares some concerns with computer security, security models are reviewed and compared with the concerns of partitioning.

  11. Development of a space-systems network testbed

    NASA Technical Reports Server (NTRS)

    Lala, Jaynarayan; Alger, Linda; Adams, Stuart; Burkhardt, Laura; Nagle, Gail; Murray, Nicholas

    1988-01-01

    This paper describes a communications network testbed which has been designed to allow the development of architectures and algorithms that meet the functional requirements of future NASA communication systems. The central hardware components of the Network Testbed are programmable circuit-switching communication nodes which can be adapted by software or firmware changes to customize the testbed to particular architectures and algorithms. Fault detection, isolation, and reconfiguration have been implemented in the Network with a hybrid approach which utilizes features of both centralized and distributed techniques to provide efficient handling of faults within the Network.

  12. Software fault tolerance for real-time avionics systems

    NASA Technical Reports Server (NTRS)

    Anderson, T.; Knight, J. C.

    1983-01-01

    Avionics systems have very high reliability requirements and are therefore prime candidates for the inclusion of fault tolerance techniques. In order to provide tolerance to software faults, some form of state restoration is usually advocated as a means of recovery. State restoration can be very expensive for systems which utilize concurrent processes. The concurrency present in most avionics systems and the further difficulties introduced by timing constraints imply that providing tolerance for software faults may be inordinately expensive or complex. A straightforward pragmatic approach to software fault tolerance which is believed to be applicable to many real-time avionics systems is proposed. A classification system for software errors is presented together with approaches to recovery and continued service for each error type.
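
    The state-restoration recovery the entry refers to is classically packaged as a recovery block: run the primary routine, apply an acceptance test, and on failure restore the checkpointed state and try an alternate. The sketch below is a minimal illustration with hypothetical routines, not the authors' scheme.

    ```python
    # Recovery-block sketch: primary routine, acceptance test, state restoration,
    # and fallback to an alternate. Routines are illustrative stand-ins.

    def primary(xs):   return sum(xs) / len(xs)          # may divide by zero
    def alternate(xs): return sum(xs) / max(len(xs), 1)  # simpler, safer variant

    def acceptable(result):
        return isinstance(result, float) and result == result   # reject NaN

    def recovery_block(xs):
        checkpoint = list(xs)                  # save state before any attempt
        for routine in (primary, alternate):
            try:
                result = routine(list(checkpoint))   # restart from the checkpoint
                if acceptable(result):
                    return result
            except Exception:
                continue                        # restore state, try next alternate
        raise RuntimeError("all alternates failed the acceptance test")

    print(recovery_block([1.0, 2.0, 3.0]))  # -> 2.0 from the primary
    print(recovery_block([]))               # primary raises; alternate -> 0.0
    ```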

  13. The environmental control and life support system advanced automation project. Phase 1: Application evaluation

    NASA Technical Reports Server (NTRS)

    Dewberry, Brandon S.

    1990-01-01

    The Environmental Control and Life Support System (ECLSS) is a Freedom Station distributed system with inherent applicability to advanced automation primarily due to the comparatively large reaction times of its subsystem processes. This allows longer contemplation times in which to form a more intelligent control strategy and to detect or prevent faults. The objective of the ECLSS Advanced Automation Project is to reduce the flight and ground manpower needed to support the initial and evolutionary ECLS system. The approach is to search out and make apparent those processes in the baseline system which are in need of more automatic control and fault detection strategies, to influence the ECLSS design by suggesting software hooks and hardware scars which will allow easy adaptation to advanced algorithms, and to develop complex software prototypes which fit into the ECLSS software architecture and will be shown in an ECLSS hardware testbed to increase the autonomy of the system. Covered here are the preliminary investigation and evaluation process, aimed at searching the ECLSS for candidate functions for automation and providing a software hooks and hardware scars analysis. This analysis shows changes needed in the baselined system for easy accommodation of knowledge-based or other complex implementations which, when integrated in flight or ground sustaining engineering architectures, will produce a more autonomous and fault tolerant Environmental Control and Life Support System.

  14. Verifying Diagnostic Software

    NASA Technical Reports Server (NTRS)

    Lindsey, Tony; Pecheur, Charles

    2004-01-01

    Livingstone PathFinder (LPF) is a simulation-based computer program for verifying autonomous diagnostic software. LPF is designed especially to be applied to NASA s Livingstone computer program, which implements a qualitative-model-based algorithm that diagnoses faults in a complex automated system (e.g., an exploratory robot, spacecraft, or aircraft). LPF forms a software test bed containing a Livingstone diagnosis engine, embedded in a simulated operating environment consisting of a simulator of the system to be diagnosed by Livingstone and a driver program that issues commands and faults according to a nondeterministic scenario provided by the user. LPF runs the test bed through all executions allowed by the scenario, checking for various selectable error conditions after each step. All components of the test bed are instrumented, so that execution can be single-stepped both backward and forward. The architecture of LPF is modular and includes generic interfaces to facilitate substitution of alternative versions of its different parts. Altogether, LPF provides a flexible, extensible framework for simulation-based analysis of diagnostic software; these characteristics also render it amenable to application to diagnostic programs other than Livingstone.

  15. Sensor fault detection and isolation via high-gain observers: application to a double-pipe heat exchanger.

    PubMed

    Escobar, R F; Astorga-Zaragoza, C M; Téllez-Anguiano, A C; Juárez-Romero, D; Hernández, J A; Guerrero-Ramírez, G V

    2011-07-01

    This paper deals with fault detection and isolation (FDI) in sensors applied to a concentric-pipe counter-flow heat exchanger. The proposed FDI is based on analytical redundancy, implementing nonlinear high-gain observers that are used to generate residuals when a sensor fault occurs (the observers act as software sensors). By evaluating the generated residual, it is possible to switch between the sensor and the observer when a failure is detected. Experiments in a heat exchanger pilot plant validate the effectiveness of the approach. The FDI technique is easy to implement, giving industry an excellent alternative tool to keep heat transfer processes under supervision. The main contribution of this work is a dynamic model with heat transfer coefficients that depend on temperature and flow, used to estimate the output temperatures of a heat exchanger. This model provides a satisfactory approximation of the states of the heat exchanger, allowing its implementation in an FDI system used to perform supervision tasks. Copyright © 2011 ISA. Published by Elsevier Ltd. All rights reserved.
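
    The residual-and-switch logic can be sketched for a scalar first-order system: an observer tracks the measurement, and when the residual exceeds a threshold the estimate replaces the suspect sensor. The plant model, gains, and threshold below are illustrative assumptions, not the paper's heat-exchanger model.

    ```python
    # Observer-based sensor FDI sketch: residual generation plus sensor/observer
    # switching. First-order model and all parameters are illustrative.

    def simulate(measurements, u, a=-0.5, b=1.0, L=5.0, dt=0.1, threshold=0.5):
        x_hat = measurements[0]           # initialize the observer at the first read
        outputs = []
        for y in measurements:
            residual = y - x_hat          # residual: measurement minus estimate
            fault = abs(residual) > threshold
            # Observer update: model dynamics plus output-injection correction;
            # when the sensor is declared faulty, drop the correction term.
            x_hat += dt * (a * x_hat + b * u + (0.0 if fault else L * residual))
            outputs.append(x_hat if fault else y)   # switch sensor <-> observer
        return outputs

    # Sensor reads ~2.0, then fails to a stuck value of 9.9; the observer
    # estimate (the "software sensor") takes over after the fault.
    print(simulate([2.0, 2.0, 2.0, 9.9, 9.9], u=1.0))
    ```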

  16. Dynamic rupture simulations on complex fault zone structures with off-fault plasticity using the ADER-DG method

    NASA Astrophysics Data System (ADS)

    Wollherr, Stephanie; Gabriel, Alice-Agnes; Igel, Heiner

    2015-04-01

    In dynamic rupture models, high stress concentrations at rupture fronts have to be accommodated by off-fault inelastic processes such as plastic deformation. As presented in (Roten et al., 2014), incorporating plastic yielding can significantly reduce earlier predictions of ground motions in the Los Angeles Basin. Further, an inelastic response of materials surrounding a fault potentially has a strong impact on surface displacement and is therefore a key aspect in understanding the triggering of tsunamis through sea-floor uplift. We present an implementation of off-fault plasticity and its verification for the software package SeisSol, an arbitrary high-order derivative discontinuous Galerkin (ADER-DG) method. The software recently reached multi-petaflop/s performance on some of the largest supercomputers worldwide and was a Gordon Bell prize finalist application in 2014 (Heinecke et al., 2014). For the inelastic calculations we impose a Drucker-Prager yield criterion in shear stress with a viscous regularization following (Andrews, 2005). This permits the smooth relaxation of high stress concentrations induced in the dynamic rupture process. We verify the implementation by comparison to the SCEC/USGS Spontaneous Rupture Code Verification Benchmarks. The results of test problem TPV13 with a 60-degree dipping normal fault show that SeisSol is in good accordance with other codes. Additionally, we aim to explore the numerical characteristics of the off-fault plasticity implementation by performing convergence tests for the 2D code. The ADER-DG method is especially suited for complex geometries by using unstructured tetrahedral meshes. Local adaptation of the mesh resolution enables a fine sampling of the cohesive zone on the fault while simultaneously satisfying the dispersion requirements of wave propagation away from the fault. In this context we will investigate the influence of off-fault plasticity on geometrically complex fault zone structures like subduction zones or branched faults. Studying the interplay of stress conditions and the angle dependence of neighbouring branches, including inelastic material behaviour and its effects on rupture jumps and seismic activation, helps to advance our understanding of earthquake source processes. An application is the simulation of a real large-scale subduction zone scenario including plasticity to validate the coupling of our dynamic rupture calculations to a tsunami model in the framework of the ASCETE project (http://www.ascete.de/). Andrews, D. J. (2005): Rupture dynamics with energy loss outside the slip zone, J. Geophys. Res., 110, B01307. Heinecke, A., A. Breuer, S. Rettenberger, M. Bader, A.-A. Gabriel, C. Pelties, A. Bode, W. Barth, K. Vaidyanathan, M. Smelyanskiy and P. Dubey (2014): Petascale High Order Dynamic Rupture Earthquake Simulations on Heterogeneous Supercomputers. In Supercomputing 2014, The International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, New Orleans, LA, USA, November 2014. Roten, D., K. B. Olsen, S. M. Day, Y. Cui, and D. Fäh (2014): Expected seismic shaking in Los Angeles reduced by San Andreas fault zone plasticity, Geophys. Res. Lett., 41, 2769-2777.

  17. A study of the relationship between the performance and dependability of a fault-tolerant computer

    NASA Technical Reports Server (NTRS)

    Goswami, Kumar K.

    1994-01-01

    This thesis studies the relationship between performance and dependability by creating a tool (FTAPE) that integrates a high-stress workload generator with fault injection, and by using the tool to evaluate system performance under error conditions. The workloads are comprised of processes formed from atomic components that represent CPU, memory, and I/O activity. The fault injector is software-implemented and is capable of injecting faults into any memory-addressable location, including special registers and caches. This tool has been used to study a Tandem Integrity S2 computer. Workloads with varying numbers of processes and varying compositions of CPU, memory, and I/O activity are first characterized in terms of performance. Then faults are injected into these workloads. The results show that as the number of concurrent processes increases, the mean fault latency initially increases due to increased contention for the CPU. However, for even higher numbers of processes (more than 3 processes), the mean latency decreases because long-latency faults are paged out before they can be activated.

  18. A second generation experiment in fault-tolerant software

    NASA Technical Reports Server (NTRS)

    Knight, J. C.

    1986-01-01

    The primary goal was to determine whether the application of fault tolerance to software increases its reliability if the cost of production is the same as for an equivalent non-fault-tolerant version derived from the same requirements specification. Software development protocols are discussed. The feasibility of adapting the technique of N-fold Modular Redundancy with majority voting to software design fault tolerance was studied.

  19. Award ER25750: Coordinated Infrastructure for Fault Tolerance Systems Indiana University Final Report

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lumsdaine, Andrew

    2013-03-08

    The main purpose of the Coordinated Infrastructure for Fault Tolerance in Systems initiative has been to conduct research with a goal of providing end-to-end fault tolerance on a system-wide basis for applications and other system software. While fault tolerance has been an integral part of most high-performance computing (HPC) system software developed over the past decade, it has been treated mostly as a collection of isolated stovepipes. Visibility and response to faults has typically been limited to the particular hardware and software subsystems in which they are initially observed. Little fault information is shared across subsystems, allowing little flexibility or control on a system-wide basis and making it practically impossible to provide cohesive end-to-end fault tolerance in support of scientific applications. As an example, consider faults such as communication link failures that can be seen by a network library but are not directly visible to the job scheduler, or faults related to node failures that can be detected by system monitoring software but are not inherently visible to the resource manager. If information about such faults could be shared by the network libraries or monitoring software, then other system software, such as a resource manager or job scheduler, could ensure that failed nodes or failed network links were excluded from further job allocations and that further diagnosis could be performed. As a founding member and one of the lead developers of the Open MPI project, our efforts over the course of this project have been focused on making Open MPI more robust to failures by supporting various fault tolerance techniques, and on using fault information exchange and coordination between MPI and the HPC system software stack, from the application, numeric libraries, and programming language runtime to other common system components such as job schedulers, resource managers, and monitoring tools.

  20. Integrated Application of Active Controls (IAAC) technology to an advanced subsonic transport project: Test act system validation

    NASA Technical Reports Server (NTRS)

    1985-01-01

    The primary objective of the Test Active Control Technology (ACT) System laboratory tests was to verify and validate the system concept, hardware, and software. The initial lab tests were open loop hardware tests of the Test ACT System as designed and built. During the course of the testing, minor problems were uncovered and corrected. Major software tests were run. The initial software testing was also open loop. These tests examined pitch control laws, wing load alleviation, signal selection/fault detection (SSFD), and output management. The Test ACT System was modified to interface with the direct drive valve (DDV) modules. The initial testing identified problem areas with DDV nonlinearities, valve friction induced limit cycling, DDV control loop instability, and channel command mismatch. The other DDV issue investigated was the ability to detect and isolate failures. Some simple schemes for failure detection were tested but were not completely satisfactory. The Test ACT System architecture continues to appear promising for ACT/FBW applications in systems that must be immune to worst case generic digital faults, and be able to tolerate two sequential nongeneric faults with no reduction in performance. The challenge in such an implementation would be to keep the analog element sufficiently simple to achieve the necessary reliability.

  1. Software fault-tolerance by design diversity DEDIX: A tool for experiments

    NASA Technical Reports Server (NTRS)

    Avizienis, A.; Gunningberg, P.; Kelly, J. P. J.; Lyu, R. T.; Strigini, L.; Traverse, P. J.; Tso, K. S.; Voges, U.

    1986-01-01

    The use of multiple versions of a computer program, independently designed from a common specification, to reduce the effects of an error is discussed. If these versions are designed by independent programming teams, it is expected that a fault in one version will not have the same behavior as any fault in the other versions. Since the errors in the output of the versions are different and uncorrelated, it is possible to run the versions concurrently, cross-check their results at prespecified points, and mask errors. A DEsign DIversity eXperiments (DEDIX) testbed was implemented to study the influence of common mode errors which can result in a failure of the entire system. The layered design of DEDIX and its decision algorithm are described.

  2. SFT: Scalable Fault Tolerance

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Petrini, Fabrizio; Nieplocha, Jarek; Tipparaju, Vinod

    2006-04-15

    In this paper we will present a new technology that we are currently developing within the SFT: Scalable Fault Tolerance FastOS project, which seeks to implement fault tolerance at the operating system level. Major design goals include dynamic reallocation of resources to allow continuing execution in the presence of hardware failures, very high scalability, high efficiency (low overhead), and transparency (requiring no changes to user applications). Our technology is based on a global coordination mechanism, which enforces transparent recovery lines in the system, and TICK, a lightweight, incremental checkpointing software architecture implemented as a Linux kernel module. TICK is completely user-transparent and does not require any changes to user code or system libraries; it is highly responsive (an interrupt, such as a timer interrupt, can trigger a checkpoint in as little as 2.5 μs); and it supports incremental and full checkpoints with minimal overhead, less than 6% with full checkpointing to disk performed as frequently as once per minute.
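
    The incremental idea behind TICK (which is a kernel module tracking memory at page granularity) can be illustrated with a toy user-level analogy: hash each "page" of state and write out only the pages whose digests changed since the last checkpoint. Everything below, including the page size, is an illustrative assumption, not the TICK implementation.

    ```python
    import hashlib

    # Toy incremental-checkpointing sketch: only dirty "pages" are saved.
    # A user-level analogy to TICK's page-level tracking; not the real design.

    PAGE = 4   # toy page size (items per page)

    def pages(state):
        return [tuple(state[i:i + PAGE]) for i in range(0, len(state), PAGE)]

    def checkpoint(state, last_digests, store):
        """Save only pages whose content changed; return updated digests."""
        digests = []
        for idx, page in enumerate(pages(state)):
            d = hashlib.sha256(repr(page).encode()).hexdigest()
            digests.append(d)
            if idx >= len(last_digests) or last_digests[idx] != d:
                store[idx] = page            # incremental write: dirty page only
        return digests

    store, state = {}, list(range(12))
    digests = checkpoint(state, [], store)        # full checkpoint: 3 pages saved
    state[5] = 99
    digests = checkpoint(state, digests, store)   # incremental: 1 dirty page saved
    print(sorted(store.keys()), store[1])         # -> [0, 1, 2] (4, 99, 6, 7)
    ```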

  3. Optimizing Automatic Deployment Using Non-functional Requirement Annotations

    NASA Astrophysics Data System (ADS)

    Kugele, Stefan; Haberl, Wolfgang; Tautschnig, Michael; Wechs, Martin

    Model-driven development has become common practice in design of safety-critical real-time systems. High-level modeling constructs help to reduce the overall system complexity apparent to developers. This abstraction caters for fewer implementation errors in the resulting systems. In order to retain correctness of the model down to the software executed on a concrete platform, human faults during implementation must be avoided. This calls for an automatic, unattended deployment process including allocation, scheduling, and platform configuration.

  4. Experiments in fault tolerant software reliability

    NASA Technical Reports Server (NTRS)

    Mcallister, David F.; Tai, K. C.; Vouk, Mladen A.

    1987-01-01

    The reliability of voting was evaluated in a fault-tolerant software system for small output spaces. The effectiveness of the back-to-back testing process was investigated. Version 3.0 of the RSDIMU-ATS, a semi-automated test bed for certification testing of RSDIMU software, was prepared and distributed. Software reliability estimation methods based on non-random sampling are being studied. The investigation of existing fault-tolerance models was continued and formulation of new models was initiated.

  5. The Curiosity Mars Rover's Fault Protection Engine

    NASA Technical Reports Server (NTRS)

    Benowitz, Ed

    2014-01-01

    The Curiosity Rover, currently operating on Mars, contains flight software onboard to autonomously handle aspects of system fault protection. Over 1000 monitors and 39 responses are present in the flight software. Orchestrating these behaviors is the flight software's fault protection engine. In this paper, we discuss the engine's design, responsibilities, and present some lessons learned for future missions.
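
    The record describes monitors mapped to responses with an engine orchestrating them. A minimal sketch of that dispatch pattern follows; the monitor names, response names, and priority scheme are hypothetical, not the Curiosity flight software design.

    ```python
    # Minimal fault-protection-engine sketch: tripped monitors map to responses,
    # which the engine deduplicates and orders by priority. Names hypothetical.

    MONITOR_TO_RESPONSE = {
        "battery_undervolt": ("shed_loads", 0),        # (response, priority)
        "imu_timeout":       ("switch_to_backup_imu", 1),
        "cmd_loss_timer":    ("safe_mode", 0),
    }

    def fp_engine(tripped_monitors):
        """Collect responses for tripped monitors; lower priority value runs first."""
        pending = {}
        for mon in tripped_monitors:
            resp, prio = MONITOR_TO_RESPONSE[mon]
            pending[resp] = min(prio, pending.get(resp, prio))
        return [r for r, _ in sorted(pending.items(), key=lambda kv: kv[1])]

    print(fp_engine(["imu_timeout", "battery_undervolt"]))
    # -> ['shed_loads', 'switch_to_backup_imu']
    ```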

  6. Hierarchical specification of the SIFT fault tolerant flight control system

    NASA Technical Reports Server (NTRS)

    Melliar-Smith, P. M.; Schwartz, R. L.

    1981-01-01

    The specification and mechanical verification of the Software Implemented Fault Tolerance (SIFT) flight control system is described. The methodology employed in the verification effort is discussed, and a description of the hierarchical models of the SIFT system is given. To meet NASA's objective for the reliability of safety-critical flight control systems, the SIFT computer must achieve a reliability well beyond the levels at which reliability can actually be measured. The methodology employed to demonstrate rigorously that the SIFT computer meets its reliability requirements is described. The hierarchy of design specifications, from very abstract descriptions of system function down to the actual implementation, is explained. The most abstract design specifications can be used to verify that the system functions correctly and with the desired reliability, since almost all details of the realization have been abstracted out. A succession of lower-level models refines these specifications to the level of the actual implementation, and can be used to demonstrate that the implementation has the properties claimed of the abstract design specifications.

  7. Autonomous power system brassboard

    NASA Technical Reports Server (NTRS)

    Merolla, Anthony

    1992-01-01

    The Autonomous Power System (APS) brassboard is a 20 kHz power distribution system which has been developed at NASA Lewis Research Center, Cleveland, Ohio. The brassboard exists to provide a realistic hardware platform capable of testing artificially intelligent (AI) software. The brassboard's power circuit topology is based upon a Power Distribution Control Unit (PDCU), which is a subset of an advanced-development 20 kHz electrical power system (EPS) testbed originally designed for Space Station Freedom (SSF). The APS program is designed to demonstrate the application of intelligent software as a fault detection, isolation, and recovery methodology for space power systems. This report discusses both the hardware and software elements used to construct the present configuration of the brassboard. The brassboard power components are described. These include the solid-state switches (herein referred to as switchgear), transformers, sources, and loads. Closely linked to this power portion of the brassboard is the first level of embedded control. Hardware used to implement this control and its associated software is discussed. An Ada software program, developed by Lewis Research Center's Space Station Freedom Directorate for their 20 kHz testbed, is used to control the brassboard's switchgear, as well as to monitor key brassboard parameters through sensors located within these switches. The Ada code is downloaded from a PC/AT and is resident within the 8086 microprocessor-based embedded controllers. The PC/AT is also used for smart terminal emulation, capable of controlling the switchgear as well as displaying data from them. Intelligent control is provided through use of a TI Explorer and the Autonomous Power Expert (APEX) LISP software. Real-time load scheduling is implemented through use of a 'C' program-based scheduling engine. The methods of communication between these computers and the brassboard are explored. In order to evaluate the features of both the brassboard hardware and the intelligent controlling software, fault circuits have been developed and integrated as part of the brassboard. A description of these fault circuits and their function is included. The brassboard has become an extremely useful test facility, promoting AI applications for power distribution systems. However, there are elements of the brassboard which could be enhanced, thus improving system performance. Modifications and enhancements to improve the brassboard's operation are discussed.

  8. Radiation Hardening by Software Techniques on FPGAs: Flight Experiment Evaluation and Results

    NASA Technical Reports Server (NTRS)

    Schmidt, Andrew G.; Flatley, Thomas

    2017-01-01

    We present our work on implementing Radiation Hardening by Software (RHBSW) techniques on the Xilinx Virtex5 FPGAs PowerPC 440 processors on the SpaceCube 2.0 platform. The techniques have been matured and tested through simulation modeling, fault emulation, laser fault injection and now in a flight experiment, as part of the Space Test Program- Houston 4-ISS SpaceCube Experiment 2.0 (STP-H4-ISE 2.0). This work leverages concepts such as heartbeat monitoring, control flow assertions, and checkpointing, commonly used in the High Performance Computing industry, and adapts them for use in remote sensing embedded systems. These techniques are extremely low overhead (typically <1.3%), enabling a 3.3x gain in processing performance as compared to the equivalent traditionally radiation hardened processor. The recently concluded STP-H4 flight experiment was an opportunity to upgrade the RHBSW techniques for the Virtex5 FPGA and demonstrate them on-board the ISS to achieve TRL 7. This work details the implementation of the RHBSW techniques, that were previously developed for the Virtex4-based SpaceCube 1.0 platform, on the Virtex5-based SpaceCube 2.0 flight platform. The evaluation spans the development and integration with flight software, remotely uploading the new experiment to the ISS SpaceCube 2.0 platform, and conducting the experiment continuously for 16 days before the platform was decommissioned. The experiment was conducted on two PowerPCs embedded within the Virtex5 FPGA devices and the experiment collected 19,400 checkpoints, processed 253,482 status messages, and incurred 0 faults. These results are highly encouraging and future work is looking into longer duration testing as part of the STP-H5 flight experiment.
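
    Of the techniques named above, heartbeat monitoring is the simplest to sketch: the monitored task periodically refreshes a timestamp, and a supervisor declares a hang when the heartbeat goes stale, at which point recovery (for example, rollback to a checkpoint) would be triggered. The timeout and class below are illustrative, not the flight experiment's code.

    ```python
    import time

    # Heartbeat-monitoring sketch in the spirit of RHBSW; timeout illustrative.

    TIMEOUT = 0.05   # seconds without a heartbeat before declaring a hang

    class Heartbeat:
        def __init__(self):
            self.last = time.monotonic()
        def beat(self):              # called by the monitored task each iteration
            self.last = time.monotonic()
        def stale(self):             # polled by the supervisor
            return time.monotonic() - self.last > TIMEOUT

    hb = Heartbeat()
    for _ in range(3):
        hb.beat()                    # healthy iterations refresh the heartbeat
        time.sleep(0.01)
    print("hung?", hb.stale())       # -> False
    time.sleep(0.1)                  # simulate a radiation-induced hang
    print("hung?", hb.stale())       # -> True: supervisor would trigger recovery
    ```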

  9. Hierarchical Simulation to Assess Hardware and Software Dependability

    NASA Technical Reports Server (NTRS)

    Ries, Gregory Lawrence

    1997-01-01

    This thesis presents a method for conducting hierarchical simulations to assess system hardware and software dependability. The method is intended to model embedded microprocessor systems. A key contribution of the thesis is the idea of using fault dictionaries to propagate fault effects upward from the level of abstraction where a fault model is assumed to the system level where the ultimate impact of the fault is observed. A second important contribution is the analysis of the software behavior under faults as well as the hardware behavior. The simulation method is demonstrated and validated in four case studies analyzing Myrinet, a commercial, high-speed networking system. One key result from the case studies shows that the simulation method predicts the same fault impact 87.5% of the time as is obtained by similar fault injections into a real Myrinet system. Reasons for the remaining discrepancy are examined in the thesis. A second key result shows the reduction in the number of simulations needed due to the fault dictionary method. In one case study, 500 faults were injected at the chip level, but only 255 propagated to the system level. Of these 255 faults, 110 shared identical fault dictionary entries at the system level and so did not need to be resimulated. The necessary number of system-level simulations was therefore reduced from 500 to 145. Finally, the case studies show how the simulation method can be used to improve the dependability of the target system. The simulation analysis was used to add recovery to the target software for the most common fault propagation mechanisms that would cause the software to hang. After the modification, the number of hangs was reduced by 60% for fault injections into the real system.

  10. Software reliability through fault-avoidance and fault-tolerance

    NASA Technical Reports Server (NTRS)

    Vouk, Mladen A.; Mcallister, David F.

    1993-01-01

    Strategies and tools for the testing, risk assessment and risk control of dependable software-based systems were developed. Part of this project consists of studies to enable the transfer of technology to industry, for example the risk management techniques for safety-conscious systems. Theoretical investigations of the Boolean and Relational Operator (BRO) testing strategy were conducted for condition-based testing. The Basic Graph Generation and Analysis tool (BGG) was extended to fully incorporate several variants of the BRO metric. Single- and multi-phase risk, coverage and time-based models are being developed to provide additional theoretical and empirical basis for estimation of the reliability and availability of large, highly dependable software. A model for software process and risk management was developed. The use of cause-effect graphing for software specification and validation was investigated. Lastly, advanced software fault-tolerance models were studied to provide alternatives and improvements in situations where simple software fault-tolerance strategies break down.

  11. Fleet-Wide Prognostic and Health Management Suite: Asset Fault Signature Database

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Vivek Agarwal; Nancy J. Lybeck; Randall Bickford

    Proactive online monitoring in the nuclear industry is being explored using the Electric Power Research Institute's Fleet-Wide Prognostic and Health Management (FW-PHM) Suite software. The FW-PHM Suite is a set of web-based diagnostic and prognostic tools and databases that serves as an integrated health monitoring architecture. The FW-PHM Suite has four main modules: (1) Diagnostic Advisor, (2) Asset Fault Signature (AFS) Database, (3) Remaining Useful Life Advisor, and (4) Remaining Useful Life Database. The paper focuses on the AFS Database of the FW-PHM Suite, which is used to catalog asset fault signatures. A fault signature is a structured representation of the information that an expert would use to first detect and then verify the occurrence of a specific type of fault. The fault signatures developed to assess the health status of generator step-up transformers are described in the paper. The developed fault signatures capture this knowledge and implement it in a standardized approach, thereby streamlining the diagnostic and prognostic process. This will support the automation of proactive online monitoring techniques in nuclear power plants to diagnose incipient faults, perform proactive maintenance, and estimate the remaining useful life of assets.
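
    A structured fault signature of the kind the AFS Database catalogs can be sketched as a small record plus a symptom-matching score; the field names, symptoms, and scoring below are hypothetical, not the FW-PHM schema.

    ```python
    from dataclasses import dataclass, field

    # Sketch of a structured asset fault signature record and a naive symptom
    # match score. Field names and data are hypothetical, not FW-PHM's schema.

    @dataclass
    class FaultSignature:
        asset_type: str                     # e.g., a generator step-up transformer
        fault: str                          # the fault mode being described
        symptoms: dict = field(default_factory=dict)  # observable -> expected trend

    def match_score(signature, observations):
        """Fraction of expected symptoms present in the observed trends."""
        hits = sum(1 for k, v in signature.symptoms.items()
                   if observations.get(k) == v)
        return hits / max(len(signature.symptoms), 1)

    sig = FaultSignature("GSU transformer", "winding insulation degradation",
                         {"dissolved_gas_H2": "rising", "top_oil_temp": "rising"})
    print(match_score(sig, {"dissolved_gas_H2": "rising",
                            "top_oil_temp": "steady"}))   # -> 0.5
    ```

    In a full system, such scores would feed a Diagnostic Advisor-style ranking over all cataloged signatures for the asset type.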

  12. Software Construction and Analysis Tools for Future Space Missions

    NASA Technical Reports Server (NTRS)

    Lowry, Michael R.; Clancy, Daniel (Technical Monitor)

    2002-01-01

    NASA and its international partners will increasingly depend on software-based systems to implement advanced functions for future space missions, such as Martian rovers that autonomously navigate long distances exploring geographic features formed by surface water early in the planet's history. The software-based functions for these missions will need to be robust and highly reliable, raising significant challenges in the context of recent Mars mission failures attributed to software faults. After reviewing these challenges, this paper describes tools that have been developed at NASA Ames that could contribute to meeting them: (1) program synthesis tools based on automated inference that generate documentation for manual review and annotations for automated certification; (2) model-checking tools for concurrent object-oriented software that achieve scalability through synergy with program abstraction and static analysis tools.

  13. Synchronization and fault-masking in redundant real-time systems

    NASA Technical Reports Server (NTRS)

    Krishna, C. M.; Shin, K. G.; Butler, R. W.

    1983-01-01

    A real-time computer may fail because of massive component failures or because it does not respond quickly enough to satisfy real-time requirements. An increase in redundancy - a conventional means of improving reliability - can improve the former but can, in some cases, considerably degrade the latter due to the overhead associated with redundancy management, namely the time delay resulting from synchronization and voting/interactive consistency techniques. The implications for reliability of synchronization and voting/interactive consistency algorithms in N-modular clusters are considered. All these studies were carried out in the context of real-time applications. As a demonstrative example, we have analyzed results from experiments conducted at the NASA AIRLAB on the Software Implemented Fault Tolerance (SIFT) computer. This analysis has indicated that in most real-time applications it is better to employ hardware synchronization instead of software synchronization and not to allow reconfiguration.

  14. A general software reliability process simulation technique

    NASA Technical Reports Server (NTRS)

    Tausworthe, Robert C.

    1991-01-01

    The structure and rationale of the generalized software reliability process, together with the design and implementation of a computer program that simulates this process are described. Given assumed parameters of a particular project, the users of this program are able to generate simulated status timelines of work products, numbers of injected anomalies, and the progress of testing, fault isolation, repair, validation, and retest. Such timelines are useful in comparison with actual timeline data, for validating the project input parameters, and for providing data for researchers in reliability prediction modeling.
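
    To make the flavor of such a simulator concrete, the following is a toy Monte Carlo in its spirit (the rates, phases, and structure below are invented for illustration, not the program's actual parameters): anomalies are injected during development and detected and repaired during testing, yielding a status timeline.

```python
# A toy Monte Carlo in the spirit of the reliability process simulator
# described above. All rates and the two-phase structure are invented.
import random

def simulate(weeks=52, inject_rate=3.0, detect_prob=0.25, test_start=30, seed=1):
    rng = random.Random(seed)
    injected = remaining = 0
    timeline = []
    for week in range(weeks):
        if week < test_start:                       # development: inject anomalies
            new = sum(1 for _ in range(100) if rng.random() < inject_rate / 100)
            injected += new
            remaining += new
        else:                                       # testing: detect and repair
            found = sum(1 for _ in range(remaining) if rng.random() < detect_prob)
            remaining -= found
        timeline.append((week, injected, remaining))
    return timeline

for week, inj, rem in simulate()[::13]:
    print(f"week {week:2d}: injected={inj:3d} remaining={rem:3d}")
```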

  15. Software reliability models for fault-tolerant avionics computers and related topics

    NASA Technical Reports Server (NTRS)

    Miller, Douglas R.

    1987-01-01

    Software reliability research is briefly described. General research topics are reliability growth models, quality of software reliability prediction, the complete monotonicity property of reliability growth, conceptual modelling of software failure behavior, assurance of ultrahigh reliability, and analysis techniques for fault-tolerant systems.

  16. Guest Editor's Introduction: Special section on dependable distributed systems

    NASA Astrophysics Data System (ADS)

    Fetzer, Christof

    1999-09-01

    We rely more and more on computers. For example, the Internet reshapes the way we do business. A `computer outage' can cost a company a substantial amount of money, not only with respect to the business lost during the outage but also with respect to the negative publicity the company receives. This is especially true for Internet companies. After recent computer outages of Internet companies, we have seen drastic falls in the shares of the affected companies. There are multiple causes for computer outages. Although computer hardware is becoming more reliable, hardware-related outages remain an important issue. For example, some of the recent computer outages of companies were caused by failed memory and system boards, and even by crashed disks - a failure type which can easily be masked using disk mirroring. Transient hardware failures might also look like software failures and, hence, might be incorrectly classified as such. However, many outages are software related. Faulty system software, middleware, and application software can crash a system. Dependable computing systems are systems we can rely on. Dependable systems are, by definition, reliable, available, safe and secure [3]. This special section focuses on issues related to dependable distributed systems. Distributed systems have the potential to be more dependable than a single computer because the probability that all computers in a distributed system fail is smaller than the probability that a single computer fails. However, if a distributed system is not built well, it is potentially less dependable than a single computer, since the probability that at least one computer in a distributed system fails is higher than the probability that one computer fails. For example, if the crash of any computer in a distributed system can bring the complete system to a halt, the system is less dependable than a single-computer system. Building dependable distributed systems is an extremely difficult task. There is no silver-bullet solution. Instead, one has to apply a variety of engineering techniques [2]: fault-avoidance (minimize the occurrence of faults, e.g. by using a proper design process), fault-removal (remove faults before they cause failures, e.g. by testing), fault-evasion (predict faults by monitoring and reconfigure the system before failures occur), and fault-tolerance (mask and/or contain failures). Building a system from scratch is an expensive and time-consuming effort. To reduce the cost of building dependable distributed systems, one would choose to use commercial off-the-shelf (COTS) components whenever possible. The use of COTS components has several potential advantages beyond minimizing costs. For example, through the widespread use of a COTS component, design failures might be detected and fixed before the component is used in a dependable system. Custom-designed components have to mature without the widespread in-field testing of COTS components. COTS components also have various potential disadvantages when used in dependable systems. For example, minimizing the time to market might lead to the release of components with inherent design faults (e.g. use of `shortcuts' that only work most of the time). In addition, the components might be more complex than needed and, hence, potentially have more design faults than simpler components.
However, given economic constraints and the ability to cope with some of the problems using fault-evasion and fault-tolerance, only for a small percentage of systems can one justify not using COTS components. Distributed systems built from current COTS components are asynchronous systems in the sense that there exists no a priori known bound on the transmission delay of messages or the execution time of processes. When designing a distributed algorithm, one would like to make sure (e.g. by testing or verification) that it is correct, i.e. satisfies its specification. Many distributed algorithms make use of consensus (eventually all non-crashed processes have to agree on a value), leader election (a crashed leader is eventually replaced by a new leader, but at any time there is at most one leader) or a group membership detection service (a crashed process is eventually suspected to have crashed but only crashed processes are suspected). From a theoretical point of view, the service specifications given for such services are not implementable in asynchronous systems. In particular, for each implementation one can derive a counterexample in which the service violates its specification. From a practical point of view, the consensus, the leader election, and the membership detection problem are solvable in asynchronous distributed systems. In this special section, Raynal and Tronel bridge this difference by showing how to implement the group membership detection problem with a negligible probability of failure [1] in an asynchronous system. The group membership detection problem is specified by a liveness condition (L) and a safety property (S): (L) if a process p crashes, then eventually every non-crashed process q has to suspect that p has crashed; and (S) if a process q suspects p, then p has indeed crashed. One can show that either (L) or (S) is implementable, but one cannot implement both (L) and (S) at the same time in an asynchronous system. In practice, one only needs to implement (L) and (S) such that the probability that (L) or (S) is violated becomes negligible. Raynal and Tronel propose and analyse a protocol that implements (L) with certainty and that can be tuned such that the probability that (S) is violated becomes negligible. Designing and implementing distributed fault-tolerant protocols for asynchronous systems is a difficult but not an impossible task. A fault-tolerant protocol has to detect and mask certain failure classes, e.g. crash failures and message omission failures. There is a trade-off between the performance of a fault-tolerant protocol and the failure classes the protocol can tolerate. One wants to tolerate as many failure classes as needed to satisfy the stochastic requirements of the protocol [1] while still maintaining sufficient performance. Since clients of a protocol have different requirements with respect to the performance/fault-tolerance trade-off, one would like to be able to customize protocols such that one can select an appropriate performance/fault-tolerance trade-off. In this special section, Hiltunen et al describe how one can compose protocols from micro-protocols in their Cactus system. They show how a group RPC system can be tailored to the needs of a client. In particular, they show how considering additional failure classes affects the performance of a group RPC system.
References
[1] Cristian F 1991 Understanding fault-tolerant distributed systems Communications of the ACM 34 (2) 56-78
[2] Heimerdinger W L and Weinstock C B 1992 A conceptual framework for system fault tolerance Technical Report CMU/SEI-92-TR-33
[3] Laprie J C (ed) 1992 Dependability: Basic Concepts and Terminology (Vienna: Springer)

  17. FPGA-Based, Self-Checking, Fault-Tolerant Computers

    NASA Technical Reports Server (NTRS)

    Some, Raphael; Rennels, David

    2004-01-01

    A proposed computer architecture would exploit the capabilities of commercially available field-programmable gate arrays (FPGAs) to enable computers to detect and recover from bit errors. The main purpose of the proposed architecture is to enable fault-tolerant computing in the presence of single-event upsets (SEUs). [An SEU is a spurious bit flip (also called a soft error) caused by a single impact of ionizing radiation.] The architecture would also enable recovery from some soft errors caused by electrical transients and, to some extent, from intermittent and permanent (hard) errors caused by aging of electronic components. A typical FPGA of the current generation contains one or more complete processor cores, memories, and high-speed serial input/output (I/O) channels, making it possible to shrink a board-level processor node to a single integrated-circuit chip. Custom, highly efficient microcontrollers, general-purpose computers, custom I/O processors, and signal processors can be rapidly and efficiently implemented by use of FPGAs. Unfortunately, FPGAs are susceptible to SEUs. Prior efforts to mitigate the effects of SEUs have yielded solutions that degrade performance of the system and require support from external hardware and software. In comparison with other fault-tolerant-computing architectures (e.g., triple modular redundancy), the proposed architecture could be implemented with less circuitry and lower power demand. Moreover, the fault-tolerant computing functions would require only minimal support from circuitry outside the central processing units (CPUs) of computers, would not require any software support, and would be largely transparent to software and to other computer hardware. There would be two types of modules: a self-checking processor module and a memory system. The self-checking processor module would be implemented on a single FPGA and would be capable of detecting its own internal errors. It would contain two CPUs executing identical programs in lock step, with comparison of their outputs to detect errors. It would also contain various cache and local memory circuits, communication circuits, and configurable special-purpose processors that would use self-checking checkers. (The basic principle of the self-checking checker method is to utilize logic circuitry that generates error signals whenever there is an error in either the checker or the circuit being checked.) The memory system would comprise a main memory and a hardware-controlled checkpointing system based on a buffer memory denoted the recovery cache. The main memory would contain random-access memory (RAM) chips and FPGAs that would, among other functions, implement double-error-detecting and single-error-correcting memory functions to enable recovery from single-bit errors.
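
    To make the detection-and-rollback flow concrete, here is a purely behavioral sketch (all names and state invented for illustration) of lockstep comparison with rollback to a recovery-cache checkpoint; it illustrates the principle only, not the FPGA design itself.

```python
# Hypothetical behavioral sketch of a self-checking processor pair with
# checkpoint/rollback. Names (LockstepPair, step, etc.) are illustrative.
import copy
import random

class LockstepPair:
    def __init__(self, initial_state):
        self.state_a = copy.deepcopy(initial_state)         # CPU A state
        self.state_b = copy.deepcopy(initial_state)         # CPU B (lockstep copy)
        self.recovery_cache = copy.deepcopy(initial_state)  # last good checkpoint

    def step(self, instruction, fault_in_a=False):
        """Execute one instruction on both CPUs and compare their outputs."""
        out_a = self._execute(self.state_a, instruction)
        if fault_in_a:                       # simulate an SEU flipping a bit in A
            out_a ^= 1 << random.randrange(8)
        out_b = self._execute(self.state_b, instruction)
        if out_a != out_b:                   # mismatch -> error signal, roll back
            self.state_a = copy.deepcopy(self.recovery_cache)
            self.state_b = copy.deepcopy(self.recovery_cache)
            return "error: rolled back to checkpoint"
        self.recovery_cache = copy.deepcopy(self.state_a)   # commit new checkpoint
        return out_a

    @staticmethod
    def _execute(state, instruction):
        state["acc"] = (state["acc"] + instruction) & 0xFF
        return state["acc"]

pair = LockstepPair({"acc": 0})
print(pair.step(5))                    # 5
print(pair.step(3, fault_in_a=True))   # mismatch detected, state rolled back
print(pair.step(3))                    # 8, re-executed cleanly after rollback
```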

  18. Analyzing Software Errors in Safety-Critical Embedded Systems

    NASA Technical Reports Server (NTRS)

    Lutz, Robyn R.

    1994-01-01

    This paper analyzes the root causes of safety-related software faults in safety-critical embedded systems. Faults identified as potentially hazardous to the system are found to be distributed somewhat differently over the set of possible error causes than non-safety-related software faults.

  19. Development and validation of techniques for improving software dependability

    NASA Technical Reports Server (NTRS)

    Knight, John C.

    1992-01-01

    A collection of document abstracts is presented on the topic of improving software dependability, covering work performed under NASA grant NAG-1-1123. Specific topics include: modeling of error detection; software inspection; test cases; Magnetic Stereotaxis System safety specifications and fault trees; and injection of synthetic faults into software.

  20. Model-Based Verification and Validation of Spacecraft Avionics

    NASA Technical Reports Server (NTRS)

    Khan, M. Omair; Sievers, Michael; Standley, Shaun

    2012-01-01

    Verification and Validation (V&V) at JPL is traditionally performed on flight or flight-like hardware running flight software. For some time, the complexity of avionics has increased exponentially while the time allocated for system integration and associated V&V testing has remained fixed. There is an increasing need to perform comprehensive system-level V&V using modeling and simulation, and to use scarce hardware testing time to validate the models, as has been the norm for thermal and structural V&V for some time. Our approach extends model-based V&V to electronics and software through functional and structural models implemented in SysML. We develop component models of electronics and software that are validated by comparison with test results from actual equipment. The models are then simulated, enabling a more complete set of test cases than is possible on flight hardware. SysML simulations provide access to and control of internal nodes that may not be available in physical systems. This is particularly helpful in testing fault protection behaviors when injecting faults is either not possible or potentially damaging to the hardware. We can also model both hardware and software behaviors in SysML, which allows us to simulate hardware and software interactions. With an integrated model and simulation capability we can evaluate the hardware and software interactions and identify problems sooner. The primary missing piece is validating SysML model correctness against hardware; this experiment demonstrated that such an approach is possible.

  1. Distributed controller clustering in software defined networks.

    PubMed

    Abdelaziz, Ahmed; Fong, Ang Tan; Gani, Abdullah; Garba, Usman; Khan, Suleman; Akhunzada, Adnan; Talebian, Hamid; Choo, Kim-Kwang Raymond

    2017-01-01

    Software Defined Networking (SDN) is an emerging and promising paradigm for network management because of its centralized network intelligence. However, the centralized control architecture of software-defined networks (SDNs) brings novel challenges of reliability, scalability, fault tolerance and interoperability. In this paper, we propose a novel clustered distributed controller architecture in a realistic SDN setting. The distributed cluster implementation comprises multiple popular SDN controllers. The proposed mechanism is evaluated using a real-world network topology running on top of an emulated SDN environment. The results show that the proposed distributed controller clustering mechanism is able to significantly reduce the average latency from 8.1% to 1.6% and the packet loss from 5.22% to 4.15%, compared to a distributed controller without clustering running on HP Virtual Application Network (VAN) SDN and Open Network Operating System (ONOS) controllers, respectively. Moreover, the proposed method also shows reasonable CPU utilization. Furthermore, the proposed mechanism makes it possible to handle unexpected load fluctuations while maintaining continuous network operation, even when there is a controller failure. The paper is a potential contribution stepping towards addressing the issues of reliability, scalability, fault tolerance, and interoperability.

  2. A coverage and slicing dependencies analysis for seeking software security defects.

    PubMed

    He, Hui; Zhang, Dongyan; Liu, Min; Zhang, Weizhe; Gao, Dongmin

    2014-01-01

    Software security defects have a serious impact on software quality and reliability. Security flaws in a software system are a major hidden danger to its operation, and as the scale of software increases, its vulnerabilities become much more difficult to find. Once these vulnerabilities are exploited, great losses may result. In this situation, the concept of Software Assurance has been put forward by experts, and automated fault localization is one part of Software Assurance research. Current automated fault localization methods include coverage-based fault localization (CBFL) and program slicing. Each method has its own advantages and shortcomings for localization. In this paper, we put forward a new method, named the Reverse Data Dependence Analysis Model, which integrates the two methods by analyzing the program structure. On this basis, we propose a new automated fault localization method. This method not only remains fully automated but also narrows the basic localization unit to a single statement, which makes localization more accurate. Through several experiments, we show that our method is more effective, and we analyze its effectiveness relative to existing methods and across different faults.
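
    For context, here is a minimal sketch of the CBFL component such work builds on, using the well-known Tarantula suspiciousness ranking; this is an illustrative baseline, not the paper's Reverse Data Dependence Analysis Model.

```python
# Minimal coverage-based fault localization (CBFL) sketch using the classic
# Tarantula suspiciousness formula over per-test statement coverage.
def tarantula(coverage, outcomes):
    """coverage: {test: set(statement ids)}; outcomes: {test: 'pass'|'fail'}."""
    total_fail = sum(1 for t in outcomes if outcomes[t] == "fail")
    total_pass = len(outcomes) - total_fail
    stmts = set().union(*coverage.values())
    scores = {}
    for s in stmts:
        failed = sum(1 for t, cov in coverage.items()
                     if s in cov and outcomes[t] == "fail")
        passed = sum(1 for t, cov in coverage.items()
                     if s in cov and outcomes[t] == "pass")
        f = failed / total_fail if total_fail else 0.0
        p = passed / total_pass if total_pass else 0.0
        scores[s] = f / (f + p) if (f + p) else 0.0
    return sorted(scores.items(), key=lambda kv: -kv[1])

cov = {"t1": {1, 2, 3}, "t2": {1, 3, 4}, "t3": {1, 2, 4}}
res = {"t1": "pass", "t2": "fail", "t3": "pass"}
print(tarantula(cov, res))   # statements 3 and 4 tie as most suspicious
```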

  3. High-throughput state-machine replication using software transactional memory.

    PubMed

    Zhao, Wenbing; Yang, William; Zhang, Honglei; Yang, Jack; Luo, Xiong; Zhu, Yueqin; Yang, Mary; Luo, Chaomin

    2016-11-01

    State-machine replication is a common way of constructing general-purpose fault-tolerance systems. To ensure replica consistency, requests must be executed sequentially according to some total order at all non-faulty replicas. Unfortunately, this could severely limit the system throughput. This issue has been partially addressed by identifying non-conflicting requests based on application semantics and executing these requests concurrently. However, identifying and tracking non-conflicting requests require intimate knowledge of application design and implementation, and a custom fault tolerance solution developed for one application cannot be easily adopted by other applications. Software transactional memory offers a new way of constructing concurrent programs. In this article, we present the mechanisms needed to retrofit existing concurrency control algorithms designed for software transactional memory for state-machine replication. The main benefit of using software transactional memory in state-machine replication is that general-purpose concurrency control mechanisms can be designed without deep knowledge of application semantics. As such, new fault tolerance systems based on state-machine replication with excellent throughput can be easily designed and maintained. In this article, we introduce three different concurrency control mechanisms for state-machine replication using software transactional memory, namely, ordered strong strict two-phase locking, conventional timestamp-based multiversion concurrency control, and speculative timestamp-based multiversion concurrency control. Our experiments show that the speculative timestamp-based multiversion concurrency control mechanism has the best performance in all types of workload, while the conventional timestamp-based multiversion concurrency control offers the worst performance due to its high abort rate in the presence of even moderate contention between transactions. The ordered strong strict two-phase locking mechanism offers the simplest solution, with excellent performance in low-contention workloads and fairly good performance in high-contention workloads.
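
    As an illustration of the first mechanism, the sketch below simulates ordered lock scheduling: requests conflict when their read/write sets intersect, conflicting requests execute in total-order (sequence-number) order, and non-conflicting requests may proceed independently. The data structures and API are assumptions for illustration, not the authors' implementation.

```python
# Single-process sketch of "ordered" locking for state-machine replication:
# per-object FIFO queues are built in total order, and a transaction runs
# only when it heads every queue it participates in, i.e. no earlier
# conflicting transaction is still outstanding.
from collections import deque

def schedule(transactions):
    """transactions: list of (seq, rw_set, action) given in total order."""
    queues = {}                                   # per-object FIFO of seq numbers
    for seq, rw_set, _ in transactions:
        for obj in rw_set:
            queues.setdefault(obj, deque()).append(seq)
    pending = list(transactions)
    order = []
    while pending:
        for txn in list(pending):
            seq, rw_set, action = txn
            if all(queues[obj][0] == seq for obj in rw_set):
                action()                          # safe: all conflicts resolved
                for obj in rw_set:
                    queues[obj].popleft()
                order.append(seq)
                pending.remove(txn)
    return order

state = {"x": 0, "y": 0}
txns = [
    (1, {"x"}, lambda: state.update(x=state["x"] + 1)),
    (2, {"y"}, lambda: state.update(y=state["y"] + 10)),   # no conflict with 1
    (3, {"x", "y"}, lambda: state.update(x=state["x"] * 2, y=state["y"] * 2)),
]
print(schedule(txns), state)   # conflicting txn 3 runs only after 1 and 2
```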

  4. High-throughput state-machine replication using software transactional memory

    PubMed Central

    Yang, William; Zhang, Honglei; Yang, Jack; Luo, Xiong; Zhu, Yueqin; Yang, Mary; Luo, Chaomin

    2017-01-01

    State-machine replication is a common way of constructing general-purpose fault-tolerance systems. To ensure replica consistency, requests must be executed sequentially according to some total order at all non-faulty replicas. Unfortunately, this could severely limit the system throughput. This issue has been partially addressed by identifying non-conflicting requests based on application semantics and executing these requests concurrently. However, identifying and tracking non-conflicting requests require intimate knowledge of application design and implementation, and a custom fault tolerance solution developed for one application cannot be easily adopted by other applications. Software transactional memory offers a new way of constructing concurrent programs. In this article, we present the mechanisms needed to retrofit existing concurrency control algorithms designed for software transactional memory for state-machine replication. The main benefit of using software transactional memory in state-machine replication is that general-purpose concurrency control mechanisms can be designed without deep knowledge of application semantics. As such, new fault tolerance systems based on state-machine replication with excellent throughput can be easily designed and maintained. In this article, we introduce three different concurrency control mechanisms for state-machine replication using software transactional memory, namely, ordered strong strict two-phase locking, conventional timestamp-based multiversion concurrency control, and speculative timestamp-based multiversion concurrency control. Our experiments show that the speculative timestamp-based multiversion concurrency control mechanism has the best performance in all types of workload, while the conventional timestamp-based multiversion concurrency control offers the worst performance due to its high abort rate in the presence of even moderate contention between transactions. The ordered strong strict two-phase locking mechanism offers the simplest solution, with excellent performance in low-contention workloads and fairly good performance in high-contention workloads. PMID:29075049

  5. Fault recovery characteristics of the fault tolerant multi-processor

    NASA Technical Reports Server (NTRS)

    Padilla, Peter A.

    1990-01-01

    The fault handling performance of the fault tolerant multiprocessor (FTMP) was investigated. Fault handling errors detected during fault injection experiments were characterized. In these fault injection experiments, the FTMP disabled a working unit instead of the faulted unit once every 500 faults, on the average. System design weaknesses allow active faults to exercise a part of the fault management software that handles byzantine or lying faults. It is pointed out that these weak areas in the FTMP's design increase the probability that, for any hardware fault, a good LRU (line replaceable unit) is mistakenly disabled by the fault management software. It is concluded that fault injection can help detect and analyze the behavior of a system in the ultra-reliable regime. Although fault injection testing cannot be exhaustive, it has been demonstrated that it provides a unique capability to unmask problems and to characterize the behavior of a fault-tolerant system.

  6. NHPP-Based Software Reliability Models Using Equilibrium Distribution

    NASA Astrophysics Data System (ADS)

    Xiao, Xiao; Okamura, Hiroyuki; Dohi, Tadashi

    Non-homogeneous Poisson processes (NHPPs) have gained much popularity in actual software testing phases to estimate the software reliability, the number of remaining faults in software and the software release timing. In this paper, we propose a new modeling approach for the NHPP-based software reliability models (SRMs) to describe the stochastic behavior of software fault-detection processes. The fundamental idea is to apply the equilibrium distribution to the fault-detection time distribution in NHPP-based modeling. We also develop efficient parameter estimation procedures for the proposed NHPP-based SRMs. Through numerical experiments, it can be concluded that the proposed NHPP-based SRMs outperform the existing ones in many data sets from the perspective of goodness-of-fit and prediction performance.
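
    The modeling idea can be sketched as follows (the notation here is assumed for illustration, not quoted from the paper): given a fault-detection time distribution, form its equilibrium distribution and use it in the NHPP mean value function.

```latex
% Sketch of one natural construction: let F(t) be a fault-detection time
% c.d.f. with finite mean \mu. Its equilibrium distribution is
F_e(t) = \frac{1}{\mu} \int_0^t \bigl(1 - F(s)\bigr)\, ds,
\qquad
\mu = \int_0^\infty \bigl(1 - F(s)\bigr)\, ds .
% An NHPP-based SRM then uses the mean value function
m(t) = \omega\, F_e(t),
\qquad
\Pr\{N(t) = n\} = \frac{m(t)^n}{n!}\, e^{-m(t)},
% where \omega is the expected total number of detectable faults and N(t)
% counts faults detected by time t.
```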

  7. A study of fault prediction and reliability assessment in the SEL environment

    NASA Technical Reports Server (NTRS)

    Basili, Victor R.; Patnaik, Debabrata

    1986-01-01

    An empirical study on estimation and prediction of faults, prediction of fault detection and correction effort, and reliability assessment in the Software Engineering Laboratory environment (SEL) is presented. Fault estimation using empirical relationships and fault prediction using curve fitting method are investigated. Relationships between debugging efforts (fault detection and correction effort) in different test phases are provided, in order to make an early estimate of future debugging effort. This study concludes with the fault analysis, application of a reliability model, and analysis of a normalized metric for reliability assessment and reliability monitoring during development of software.

  8. Integrated software health management for aerospace guidance, navigation, and control systems: A probabilistic reasoning approach

    NASA Astrophysics Data System (ADS)

    Mbaya, Timmy

    Embedded aerospace systems have to perform safety- and mission-critical operations in a real-time environment where timing and functional correctness are extremely important. Guidance, Navigation, and Control (GN&C) systems substantially rely on complex software interfacing with hardware in real time; any faults in software or hardware, or in their interaction, could result in fatal consequences. Integrated Software Health Management (ISWHM) provides an approach for detection and diagnosis of software failures while the software is in operation. The ISWHM approach is based on probabilistic modeling of software and hardware sensors using a Bayesian network. To meet the memory and timing constraints of real-time embedded execution, the Bayesian network is compiled into an arithmetic circuit, which is used for online monitoring. This type of system monitoring, using an ISWHM, provides automated reasoning capabilities that compute diagnoses in a timely manner when failures occur. This reasoning capability enables time-critical mitigating decisions and relieves the human agent from the time-consuming and arduous task of foraging through a multitude of isolated, and often contradictory, diagnosis data. For the purpose of demonstrating the relevance of ISWHM, modeling and reasoning are performed on a simple simulated aerospace system running on a real-time operating system emulator, the OSEK/Trampoline platform. Models for the GN&C systems of a small satellite and an F-16 fighter jet have been implemented. Analysis of the ISWHM is then performed by injecting faults and analyzing the ISWHM's diagnoses.
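
    A toy sketch of the probabilistic reasoning step (structure and numbers invented for illustration): a two-node Bayesian network relating a health node to an observable monitor, queried by direct enumeration. The ISWHM approach compiles such networks into arithmetic circuits precisely so that this kind of computation meets real-time constraints.

```python
# Toy two-node Bayesian network query: P(fault | alarm) via Bayes' rule.
# All probabilities below are made-up illustrative values.
def posterior_fault(p_fault, p_alarm_given_fault, p_alarm_given_ok, alarm):
    """Return P(fault | alarm observation)."""
    p_ok = 1.0 - p_fault
    like_fault = p_alarm_given_fault if alarm else 1 - p_alarm_given_fault
    like_ok = p_alarm_given_ok if alarm else 1 - p_alarm_given_ok
    evidence = p_fault * like_fault + p_ok * like_ok   # P(observation)
    return p_fault * like_fault / evidence

# Prior fault probability 1%; the monitor fires 95% of the time under a
# fault and 2% of the time spuriously:
print(posterior_fault(0.01, 0.95, 0.02, alarm=True))   # ~0.324
```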

  9. Use of Field Programmable Gate Array Technology in Future Space Avionics

    NASA Technical Reports Server (NTRS)

    Ferguson, Roscoe C.; Tate, Robert

    2005-01-01

    Fulfilling NASA's new vision for space exploration requires the development of sustainable, flexible and fault tolerant spacecraft control systems. The traditional development paradigm consists of the purchase or fabrication of hardware boards with fixed processor and/or Digital Signal Processing (DSP) components interconnected via a standardized bus system, followed by the purchase and/or development of software. This paradigm has several disadvantages for the development of systems to support NASA's new vision. Building a system to be fault tolerant increases the complexity and decreases the performance of the included software. Standard bus design and conventional implementation produce natural bottlenecks. Configuring hardware components in systems containing common processors and DSPs is difficult initially and expensive or impossible to change later. The existence of Hardware Description Languages (HDLs), the recent increase in performance, density and radiation tolerance of Field Programmable Gate Arrays (FPGAs), and Intellectual Property (IP) Cores provides the technology for reprogrammable Systems on a Chip (SOC). This technology supports a paradigm better suited to NASA's vision. Hardware and software production are melded for more effective development, and they can evolve together over time. Designers incorporating this technology into future avionics can benefit from its flexibility. Systems can be designed with improved fault isolation and tolerance using hardware instead of software. These designs can also be protected from obsolescence problems, where maintenance is compromised by component and vendor availability. To investigate the flexibility of this technology, the cores of the Central Processing Unit and Input/Output Processor of the Space Shuttle AP101S Computer were prototyped in Verilog HDL and synthesized into an Altera Stratix FPGA.

  10. Testing and Validating Machine Learning Classifiers by Metamorphic Testing

    PubMed Central

    Xie, Xiaoyuan; Ho, Joshua W. K.; Murphy, Christian; Kaiser, Gail; Xu, Baowen; Chen, Tsong Yueh

    2011-01-01

    Machine learning algorithms have provided core functionality to many application domains, such as bioinformatics and computational linguistics. However, it is difficult to detect faults in such applications because often there is no “test oracle” to verify the correctness of the computed outputs. To help address software quality, in this paper we present a technique for testing the implementations of machine learning classification algorithms which support such applications. Our approach is based on the technique of “metamorphic testing”, which has been shown to be effective in alleviating the oracle problem. Also presented are a case study on a real-world machine learning application framework and a discussion of how programmers implementing machine learning algorithms can avoid the common pitfalls discovered in our study. We also conduct mutation analysis and cross-validation, which reveal that our method is highly effective at killing mutants, and that observing the expected cross-validation result alone is not sufficiently effective to detect faults in a supervised classification program. The effectiveness of metamorphic testing is further confirmed by the detection of real faults in a popular open-source classification program. PMID:21532969
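
    A minimal illustration of the metamorphic-testing idea applied to a classifier (the relation and classifier here are illustrative, not the paper's case study): permuting the training data must not change a k-NN prediction, so a violation reveals a fault without needing a test oracle.

```python
# Metamorphic relation for a k-NN classifier: shuffling the training set
# must leave the prediction unchanged. A violation indicates a fault.
import random

def knn_predict(train, query, k=3):
    """train: list of (features, label); majority vote over the k nearest."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(train, key=lambda s: dist(s[0], query))[:k]
    labels = [lbl for _, lbl in nearest]
    return max(set(labels), key=labels.count)

train = [((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b")]
query = (4, 4)
expected = knn_predict(train, query)
for _ in range(100):                    # metamorphic test: shuffle invariance
    shuffled = train[:]
    random.shuffle(shuffled)
    assert knn_predict(shuffled, query) == expected, "metamorphic relation violated"
print("100 permutations, prediction stable:", expected)
```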

  11. Aircraft Fault Detection and Classification Using Multi-Level Immune Learning Detection

    NASA Technical Reports Server (NTRS)

    Wong, Derek; Poll, Scott; KrishnaKumar, Kalmanje

    2005-01-01

    This work is an extension of a recently developed software tool called MILD (Multi-level Immune Learning Detection), which implements a negative selection algorithm for anomaly and fault detection that is inspired by the human immune system. The immunity-based approach can detect a broad spectrum of known and unforeseen faults. We extend MILD by applying a neural network classifier to identify the pattern of fault detectors that are activated during fault detection. Consequently, MILD now performs fault detection and identification of the system under investigation. This paper describes the application of MILD to detect and classify faults of a generic transport aircraft augmented with an intelligent flight controller. The intelligent control architecture is designed to accommodate faults without the need to explicitly identify them. Adding knowledge about the existence and type of a fault will improve the handling qualities of a degraded aircraft and impact tactical and strategic maneuvering decisions. In addition, providing fault information to the pilot is important for maintaining situational awareness so that he can avoid performing an action that might lead to unexpected behavior - e.g., an action that exceeds the remaining control authority of the damaged aircraft. We discuss the detection and classification results of simulated failures of the aircraft's control system and show that MILD is effective at determining the problem with low false alarm and misclassification rates.
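
    The negative selection idea at the core of such tools can be sketched briefly (the real-valued matching rule, radii, and data below are assumptions for illustration, not MILD's actual detector representation): detectors are generated at random and kept only if they fail to match any nominal ("self") sample; anything a retained detector later matches is flagged as anomalous.

```python
# Sketch of a negative-selection anomaly detector with a Euclidean
# matching rule. All thresholds and data are invented for illustration.
import random

def train_detectors(self_samples, n_detectors, radius, dim=2, seed=0):
    """Keep random detectors that do NOT match any 'self' (nominal) sample."""
    rng = random.Random(seed)
    detectors = []
    while len(detectors) < n_detectors:
        d = tuple(rng.uniform(0, 1) for _ in range(dim))
        if all(sum((a - b) ** 2 for a, b in zip(d, s)) ** 0.5 > radius
               for s in self_samples):
            detectors.append(d)
    return detectors

def is_anomalous(x, detectors, radius):
    """Flag x if any detector matches it (lies within the match radius)."""
    return any(sum((a - b) ** 2 for a, b in zip(d, x)) ** 0.5 <= radius
               for d in detectors)

# Nominal behavior clusters near (0.5, 0.5) on the diagonal:
nominal = [(0.5 + random.Random(i).uniform(-0.05, 0.05),) * 2 for i in range(50)]
dets = train_detectors(nominal, n_detectors=200, radius=0.1)
print(is_anomalous(nominal[0], dets, 0.1))    # False by construction: self
print(is_anomalous((0.95, 0.05), dets, 0.1))  # likely True: far from self
```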

  12. Development of Asset Fault Signatures for Prognostic and Health Management in the Nuclear Industry

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Vivek Agarwal; Nancy J. Lybeck; Randall Bickford

    2014-06-01

    Proactive online monitoring in the nuclear industry is being explored using the Electric Power Research Institute’s Fleet-Wide Prognostic and Health Management (FW-PHM) Suite software. The FW-PHM Suite is a set of web-based diagnostic and prognostic tools and databases that serves as an integrated health monitoring architecture. The FW-PHM Suite has four main modules: Diagnostic Advisor, Asset Fault Signature (AFS) Database, Remaining Useful Life Advisor, and Remaining Useful Life Database. This paper focuses on the development of asset fault signatures to assess the health status of generator step-up transformers and emergency diesel generators in nuclear power plants. Asset fault signatures describe the distinctive features, based on technical examinations, that can be used to detect a specific fault type. At the most basic level, fault signatures are comprised of an asset type, a fault type, and a set of one or more fault features (symptoms) that are indicative of the specified fault. The AFS Database is populated with asset fault signatures via a content development exercise that is based on the results of intensive technical research and on the knowledge and experience of technical experts. The developed fault signatures capture this knowledge and implement it in a standardized approach, thereby streamlining the diagnostic and prognostic process. This will support the automation of proactive online monitoring techniques in nuclear power plants to diagnose incipient faults, perform proactive maintenance, and estimate the remaining useful life of assets.
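
    A hypothetical rendering of that basic structure as code (field names and the matching rule are illustrative assumptions, not the AFS Database schema):

```python
# Hypothetical fault-signature structure: an asset type, a fault type, and
# a set of fault features (symptoms), with a naive feature-overlap matcher.
from dataclasses import dataclass

@dataclass(frozen=True)
class FaultSignature:
    asset_type: str
    fault_type: str
    features: frozenset  # symptoms indicative of the specified fault

    def match(self, observed: set) -> float:
        """Fraction of this signature's features seen in the observations."""
        return len(self.features & observed) / len(self.features)

catalog = [
    FaultSignature("GSU transformer", "winding insulation degradation",
                   frozenset({"elevated 2-FAL", "low degree of polymerization",
                              "rising hot-spot temperature"})),
    FaultSignature("GSU transformer", "bushing failure",
                   frozenset({"high power factor", "oil leak"})),
]

observed = {"elevated 2-FAL", "rising hot-spot temperature"}
best = max(catalog, key=lambda sig: sig.match(observed))
print(best.fault_type, f"{best.match(observed):.0%}")   # ranked diagnosis
```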

  13. MER Surface Phase; Blurring the Line Between Fault Protection and What is Supposed to Happen

    NASA Technical Reports Server (NTRS)

    Reeves, Glenn E.

    2008-01-01

    An assessment is presented of the limitations of communication with the MER rovers and of how such constraints drove the system design, flight software, and fault protection architecture, blurring the line between traditional fault protection and expected nominal behavior and requiring the most novel autonomous and semi-autonomous elements of the vehicle software, including communication, surface mobility, attitude knowledge acquisition, fault protection, and the activity arbitration service.

  14. Quantum Computing Architectural Design

    NASA Astrophysics Data System (ADS)

    West, Jacob; Simms, Geoffrey; Gyure, Mark

    2006-03-01

    Large scale quantum computers will invariably require scalable architectures in addition to high fidelity gate operations. Quantum computing architectural design (QCAD) addresses the problems of actually implementing fault-tolerant algorithms given physical and architectural constraints beyond those of basic gate-level fidelity. Here we introduce a unified framework for QCAD that enables the scientist to study the impact of varying error correction schemes, architectural parameters including layout and scheduling, and physical operations native to a given architecture. Our software package, aptly named QCAD, provides compilation, manipulation/transformation, multi-paradigm simulation, and visualization tools. We demonstrate various features of the QCAD software package through several examples.

  15. Heat exchanger expert system logic

    NASA Technical Reports Server (NTRS)

    Cormier, R.

    1988-01-01

    This paper describes how the operation and fault diagnostics of a Deep Space Network heat exchanger were reduced to a rule base by applying propositional calculus to a set of logic statements. The value of this approach lies in the ease of converting the logic and subsequently implementing it on a computer as an expert system. The rule base was written in Process Intelligent Control software.
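
    A minimal forward-chaining sketch of the rule-base idea (the rules shown are invented placeholders, not the actual heat exchanger rules):

```python
# Forward-chaining over propositional rules: fire rules until no new fact
# can be derived (a fixpoint), then return the complete fact set.
def forward_chain(facts, rules):
    """rules: list of (antecedents frozenset, consequent)."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for antecedents, consequent in rules:
            if antecedents <= facts and consequent not in facts:
                facts.add(consequent)
                changed = True
    return facts

rules = [
    (frozenset({"pump_on", "no_flow"}), "blockage_suspected"),
    (frozenset({"blockage_suspected", "pressure_high"}), "shutdown_pump"),
]
print(forward_chain({"pump_on", "no_flow", "pressure_high"}, rules))
# -> both 'blockage_suspected' and 'shutdown_pump' are derived
```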

  16. V&V of Fault Management: Challenges and Successes

    NASA Technical Reports Server (NTRS)

    Fesq, Lorraine M.; Costello, Ken; Ohi, Don; Lu, Tiffany; Newhouse, Marilyn

    2013-01-01

    This paper describes the results of a special breakout session of the NASA Independent Verification and Validation (IV&V) Workshop held in the fall of 2012 entitled "V&V of Fault Management: Challenges and Successes." The NASA IV&V Program is in a unique position to interact with projects across all of the NASA development domains. Using this unique opportunity, the IV&V program convened a breakout session to enable IV&V teams to share their challenges and successes with respect to the V&V of Fault Management (FM) architectures and software. The presentations and discussions provided practical examples of pitfalls encountered while performing V&V of FM, including the lack of consistent designs for implementing fault monitors and the fact that FM information is not centralized but scattered among many diverse project artifacts. The discussions also solidified the need for an early commitment to developing FM in parallel with the spacecraft systems, as well as for clearly defining FM terminology within a project.

  17. Symposium on the Interface: Computing Science and Statistics (20th). Theme: Computationally Intensive Methods in Statistics Held in Reston, Virginia on April 20-23, 1988

    DTIC Science & Technology

    1988-08-20

    William A. Link, Patuxent Wildlife Research Center. "Increasing Reliability of Multiversion Fault-Tolerant Software Design by Modularization," Junryo Miyashita, Department of Computer Science, California State University at San Bernardino. ... They shall be referred to as "multiversion fault-tolerant software design". One problem of developing multiple versions of a program is the high cost

  18. Extreme scale multi-physics simulations of the tsunamigenic 2004 Sumatra megathrust earthquake

    NASA Astrophysics Data System (ADS)

    Ulrich, T.; Gabriel, A. A.; Madden, E. H.; Wollherr, S.; Uphoff, C.; Rettenberger, S.; Bader, M.

    2017-12-01

    SeisSol (www.seissol.org) is an open-source software package based on an arbitrary high-order derivative Discontinuous Galerkin method (ADER-DG). It solves spontaneous dynamic rupture propagation on pre-existing fault interfaces according to non-linear friction laws, coupled to seismic wave propagation with high-order accuracy in space and time (minimal dispersion errors). SeisSol exploits unstructured meshes to account for complex geometries, e.g. high resolution topography and bathymetry, 3D subsurface structure, and fault networks. We present the largest (1500 km of faults) and longest (500 s) dynamic rupture simulation to date, modeling the 2004 Sumatra-Andaman earthquake. We demonstrate the need for end-to-end optimization and petascale performance of scientific software to realize realistic simulations on the extreme scales of subduction zone earthquakes: considering the full complexity of subduction zone geometries leads inevitably to huge differences in element sizes. The main code improvements include a cache-aware wave propagation scheme and optimizations of the dynamic rupture kernels using code generation. In addition, a novel clustered local-time-stepping scheme for dynamic rupture has been established. Finally, asynchronous output has been implemented to overlap I/O and compute time. We resolve the frictional sliding process on the curved mega-thrust and a system of splay faults, as well as the seismic wave field and seafloor displacement with frequency content up to 2.2 Hz. We validate the scenario against geodetic, seismological and tsunami observations. The resulting rupture dynamics shed new light on the activation and importance of splay faults.

  19. A grid-doubling finite-element technique for calculating dynamic three-dimensional spontaneous rupture on an earthquake fault

    USGS Publications Warehouse

    Barall, Michael

    2009-01-01

    We present a new finite-element technique for calculating dynamic 3-D spontaneous rupture on an earthquake fault, which can reduce the required computational resources by a factor of six or more, without loss of accuracy. The grid-doubling technique employs small cells in a thin layer surrounding the fault. The remainder of the modelling volume is filled with larger cells, typically two or four times as large as the small cells. In the resulting non-conforming mesh, an interpolation method is used to join the thin layer of smaller cells to the volume of larger cells. Grid-doubling is effective because spontaneous rupture calculations typically require higher spatial resolution on and near the fault than elsewhere in the model volume. The technique can be applied to non-planar faults by morphing, or smoothly distorting, the entire mesh to produce the desired 3-D fault geometry. Using our FaultMod finite-element software, we have tested grid-doubling with both slip-weakening and rate-and-state friction laws, by running the SCEC/USGS 3-D dynamic rupture benchmark problems. We have also applied it to a model of the Hayward fault, Northern California, which uses realistic fault geometry and rock properties. FaultMod implements fault slip using common nodes, which represent motion common to both sides of the fault, and differential nodes, which represent motion of one side of the fault relative to the other side. We describe how to modify the traction-at-split-nodes method to work with common and differential nodes, using an implicit time stepping algorithm.

  20. Data collection and analysis software development for rotor dynamics testing in spin laboratory

    NASA Astrophysics Data System (ADS)

    Abdul-Aziz, Ali; Arble, Daniel; Woike, Mark

    2017-04-01

    Gas turbine engine components undergo high rotational loading and other complex environmental conditions. Such operating environments can lead these components to develop damage and cracks that can cause catastrophic failure during flight. Traditional crack detection and health monitoring methodologies rely on periodic routine maintenance and nondestructive inspections that oftentimes involve engine and component disassembly. These methods also do not offer adequate information about the faults, especially if the faults are subsurface or not clearly evident. At NASA Glenn Research Center, the rotor dynamics laboratory is presently developing newer techniques that rely heavily on sensor technology to enable health monitoring and prediction of damage and cracks in rotor disks. These approaches are noninvasive and relatively economical. Spin tests are performed using a subscale test article mimicking a turbine rotor disk undergoing rotational load. Non-contact instruments such as capacitive and microwave sensors are used to measure the blade tip gap displacement and blade vibration characteristics in an attempt to develop a physics-based model to assess and predict faults in the rotor disk. Data collection is a major component of this experimental-analytical procedure, and as a result, an upgrade to an older version of the LabVIEW-based data acquisition software has been implemented to support efficient test runs and analysis of the results. Outcomes obtained from the test data and related experimental and analytical rotor dynamics modeling, including key features of the updated software, are presented and discussed.

  1. Rover Attitude and Pointing System Simulation Testbed

    NASA Technical Reports Server (NTRS)

    Vanelli, Charles A.; Grinblat, Jonathan F.; Sirlin, Samuel W.; Pfister, Sam

    2009-01-01

    The MER (Mars Exploration Rover) Attitude and Pointing System Simulation Testbed Environment (RAPSSTER) provides a simulation platform used for the development and test of GNC (guidance, navigation, and control) flight algorithm designs for the Mars rovers. It was specifically tailored to the MERs, but has since been used in the development of rover algorithms for the Mars Science Laboratory (MSL) as well. The software provides an integrated simulation and software testbed environment for the development of Mars rover attitude and pointing flight software. It provides an environment that is able to run the MER GNC flight software directly (as opposed to running an algorithmic model of the MER GNC flight code). This improves simulation fidelity and confidence in the results. Furthermore, the simulation environment allows the user to single-step through its execution, pausing and restarting at will. The system also provides for the introduction of simulated faults specific to Mars rover environments that cannot be replicated in other testbed platforms, to stress test the GNC flight algorithms under examination. The software provides facilities to do these stress tests in ways that cannot be done in the real-time flight system testbeds, such as time-jumping (both forwards and backwards) and the introduction of simulated actuator faults that would be difficult, expensive, and/or destructive to implement in the real-time testbeds. Actual flight-quality codes can be incorporated back into the development-test suite of GNC developers, closing the loop between the GNC developers and the flight software developers. The software provides fully automated scripting, allowing multiple tests to be run with varying parameters, without human supervision.

  2. The Infeasibility of Experimental Quantification of Life-Critical Software Reliability

    NASA Technical Reports Server (NTRS)

    Butler, Ricky W.; Finelli, George B.

    1991-01-01

    This paper affirms that quantification of life-critical software reliability is infeasible using statistical methods, whether applied to standard software or fault-tolerant software. The key assumption of software fault tolerance, that separately programmed versions fail independently, is shown to be problematic. This assumption cannot be justified by experimentation in the ultra-reliability region, and subjective arguments in its favor are not sufficiently strong to justify it as an axiom. Also, the implications of the recent multi-version software experiments support this affirmation.
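
    The scale of the infeasibility argument can be made concrete with standard life-testing arithmetic (this is the usual reasoning, not a formula quoted from the paper):

```latex
% Model failures in life testing as a Poisson process with rate \lambda.
% Observing zero failures over T hours gives the one-sided upper bound
\lambda \le \frac{-\ln \alpha}{T} \quad \text{at confidence } 1 - \alpha .
% Demonstrating \lambda \le 10^{-9} \text{ per hour} at 99\% confidence
% therefore requires
T \ge \frac{-\ln(0.01)}{10^{-9}} \approx 4.6 \times 10^{9} \text{ hours},
% i.e. hundreds of thousands of years of failure-free testing.
```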

  3. Hidden Markov models and neural networks for fault detection in dynamic systems

    NASA Technical Reports Server (NTRS)

    Smyth, Padhraic

    1994-01-01

    Neural networks plus hidden Markov models (HMM) can provide excellent detection and false-alarm-rate performance in fault detection applications, as shown in this viewgraph presentation. Modified models allow for novelty detection. Key contributions of neural network models are: (1) excellent nonparametric discrimination capability; (2) good estimation of posterior state probabilities, even in high dimensions, which allows the network to be embedded within an overall probabilistic model (HMM); and (3) simplicity of implementation compared to other nonparametric models. The neural network/HMM monitoring model is currently being integrated with the new Deep Space Network (DSN) antenna controller software and will be monitoring a new DSN 34-m antenna (DSS-24) on-line by July 1994.
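
    A compact sketch of the combination (transition and likelihood numbers invented; in the real system a neural network would supply the per-state observation likelihoods): an HMM forward recursion turns noisy per-step likelihoods into smoothed fault-state probabilities.

```python
# HMM forward filtering over two health states. The per-step observation
# likelihoods stand in for neural network outputs; all numbers are invented.
def hmm_forward(prior, trans, obs_likelihoods):
    """prior: P(state); trans[i][j]: P(j|i); obs_likelihoods: [P(obs|state)]."""
    belief = list(prior)
    history = []
    for like in obs_likelihoods:
        pred = [sum(belief[i] * trans[i][j] for i in range(len(belief)))
                for j in range(len(belief))]                  # predict step
        unnorm = [pred[j] * like[j] for j in range(len(pred))]  # update step
        z = sum(unnorm)
        belief = [u / z for u in unnorm]
        history.append(belief)
    return history

# States: 0 = nominal, 1 = faulty. Faults persist; spontaneous faults rare.
prior = [0.99, 0.01]
trans = [[0.995, 0.005], [0.0, 1.0]]
# "NN outputs": likelihood of each observation under nominal vs. faulty.
nn_likes = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9], [0.1, 0.9]]
for t, b in enumerate(hmm_forward(prior, trans, nn_likes)):
    print(t, [round(p, 3) for p in b])   # P(faulty) rises as evidence accrues
```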

  4. Experimental analysis of computer system dependability

    NASA Technical Reports Server (NTRS)

    Iyer, Ravishankar, K.; Tang, Dong

    1993-01-01

    This paper reviews an area which has evolved over the past 15 years: experimental analysis of computer system dependability. Methodologies and advances are discussed for three basic approaches used in the area: simulated fault injection, physical fault injection, and measurement-based analysis. The three approaches are suited, respectively, to dependability evaluation in the three phases of a system's life: design phase, prototype phase, and operational phase. Before the discussion of these phases, several statistical techniques used in the area are introduced. For each phase, a classification of research methods or study topics is outlined, followed by discussion of these methods or topics as well as representative studies. The statistical techniques introduced include the estimation of parameters and confidence intervals, probability distribution characterization, and several multivariate analysis methods. Importance sampling, a statistical technique used to accelerate Monte Carlo simulation, is also introduced. The discussion of simulated fault injection covers electrical-level, logic-level, and function-level fault injection methods as well as representative simulation environments such as FOCUS and DEPEND. The discussion of physical fault injection covers hardware, software, and radiation fault injection methods as well as several software and hybrid tools, including FIAT, FERRARI, HYBRID, and FINE. The discussion of measurement-based analysis covers measurement and data processing techniques, basic error characterization, dependency analysis, Markov reward modeling, software dependability, and fault diagnosis. The discussion involves several important issues studied in the area, including fault models, fast simulation techniques, workload/failure dependency, correlated failures, and software fault tolerance.

  5. Distributed controller clustering in software defined networks

    PubMed Central

    Gani, Abdullah; Akhunzada, Adnan; Talebian, Hamid; Choo, Kim-Kwang Raymond

    2017-01-01

    Software Defined Networking (SDN) is an emerging and promising paradigm for network management because of its centralized network intelligence. However, the centralized control architecture of software-defined networks (SDNs) brings novel challenges of reliability, scalability, fault tolerance and interoperability. In this paper, we propose a novel clustered distributed controller architecture in a realistic SDN setting. The distributed cluster implementation comprises multiple popular SDN controllers. The proposed mechanism is evaluated using a real-world network topology running on top of an emulated SDN environment. The results show that the proposed distributed controller clustering mechanism is able to significantly reduce the average latency from 8.1% to 1.6% and the packet loss from 5.22% to 4.15%, compared to a distributed controller without clustering running on HP Virtual Application Network (VAN) SDN and Open Network Operating System (ONOS) controllers, respectively. Moreover, the proposed method also shows reasonable CPU utilization. Furthermore, the proposed mechanism makes it possible to handle unexpected load fluctuations while maintaining continuous network operation, even when there is a controller failure. The paper is a potential contribution stepping towards addressing the issues of reliability, scalability, fault tolerance, and interoperability. PMID:28384312

  6. Software fault tolerance in computer operating systems

    NASA Technical Reports Server (NTRS)

    Iyer, Ravishankar K.; Lee, Inhwan

    1994-01-01

    This chapter provides data on and analysis of the dependability and fault tolerance of three operating systems: the Tandem/GUARDIAN fault-tolerant system, the VAX/VMS distributed system, and the IBM/MVS system. Based on measurements from these systems, basic software error characteristics are investigated. Fault tolerance in operating systems resulting from the use of process pairs and recovery routines is evaluated. Two levels of models are developed to analyze error and recovery processes inside an operating system and interactions among multiple instances of an operating system running in a distributed environment. The measurements show that the use of process pairs in Tandem systems, which was originally intended for tolerating hardware faults, allows the system to tolerate about 70% of defects in system software that result in processor failures. The loose coupling between processors, which results in the backup execution (the processor state and the sequence of events occurring) differing from the original execution, is a major reason for the measured software fault tolerance. The IBM/MVS system's fault tolerance almost doubles when recovery routines are provided, in comparison to the case in which no recovery routines are available. However, even when recovery routines are provided, there is almost a 50% chance of system failure when critical system jobs are involved.

  7. Space Station Module Power Management and Distribution System (SSM/PMAD)

    NASA Technical Reports Server (NTRS)

    Miller, William (Compiler); Britt, Daniel (Compiler); Elges, Michael (Compiler); Myers, Chris (Compiler)

    1994-01-01

    This report provides an overview of the Space Station Module Power Management and Distribution (SSM/PMAD) testbed system and describes recent enhancements to that system. Four tasks made up the original contract: (1) common module power management and distribution system automation plan definition; (2) definition of hardware and software elements of automation; (3) design, implementation and delivery of the hardware and software making up the SSM/PMAD system; and (4) definition and development of the host breadboard computer environment. Additions and/or enhancements to the SSM/PMAD testbed that have occurred since July 1990 are reported. These include: (1) rehosting of the MAESTRO scheduler; (2) reorganization of the automation software internals; (3) a more robust communications package; (4) addition of the activity editor to the MAESTRO scheduler; (5) rehosting of the LPLMS to execute under KNOMAD; (6) completion of the KNOMAD knowledge management facility; (7) significant improvement of the user interface; (8) soft and incipient fault handling design; (9) implementation of intermediate levels of autonomy; and (10) switch maintenance.

  8. Diagnostic and Prognostic Models for Generator Step-Up Transformers

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Vivek Agarwal; Nancy J. Lybeck; Binh T. Pham

    In 2014, the online monitoring (OLM) of active components project under the Light Water Reactor Sustainability program at Idaho National Laboratory (INL) focused on diagnostic and prognostic capabilities for generator step-up (GSU) transformers. INL worked with subject matter experts from the Electric Power Research Institute (EPRI) to augment and revise the GSU fault signatures previously implemented in EPRI's Fleet-Wide Prognostic and Health Management (FW-PHM) Suite software. Two prognostic models were identified and implemented for GSUs in the FW-PHM Suite software, and INL and EPRI demonstrated the use of prognostic capabilities for GSUs. The complete set of fault signatures developed for GSUs in the Asset Fault Signature Database of the FW-PHM Suite is presented in this report. Two prognostic models are described for paper insulation: the Chendong model for degree of polymerization, and an IEEE model that uses a loading profile to calculate life consumption based on hot-spot winding temperatures. Both models are life consumption models, which are examples of type II prognostic models. Use of the models in the FW-PHM Suite was successfully demonstrated at the August 2014 Utility Working Group Meeting in Idaho Falls, Idaho, to representatives from utilities, EPRI, and the Halden Research Project.
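
    For illustration, commonly cited forms of the two ingredients named above are sketched below; the constants are those usually quoted in the literature and should be checked against the original references before any real use.

```python
# Hedged sketch of two transformer-paper prognostic ingredients:
# - Chendong correlation (commonly cited form):
#     log10(2-FAL [ppm]) = 1.51 - 0.0035 * DP
# - IEEE C57.91 aging acceleration factor for 65C-rise insulation,
#     referenced to a 110 degC hot-spot temperature.
import math

def dp_from_furfural(fal_ppm):
    """Estimate paper degree of polymerization from 2-furfural concentration."""
    return (1.51 - math.log10(fal_ppm)) / 0.0035

def aging_acceleration_factor(hot_spot_c):
    """F_AA = exp(15000/383 - 15000/(theta_H + 273))."""
    return math.exp(15000.0 / 383.0 - 15000.0 / (hot_spot_c + 273.0))

print(round(dp_from_furfural(0.5)))             # ~517: moderately aged paper
print(round(aging_acceleration_factor(110), 2)) # 1.0 at the reference temperature
print(round(aging_acceleration_factor(120), 2)) # ~2.71: faster life consumption
```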

  9. Certification trails for data structures

    NASA Technical Reports Server (NTRS)

    Sullivan, Gregory F.; Masson, Gerald M.

    1993-01-01

    Certification trails are a recently introduced and promising approach to fault detection and fault tolerance. The applicability of the certification trail technique is significantly generalized here. Previously, certification trails had to be customized to each algorithm application; in this work, trails appropriate to wide classes of algorithms were developed. These certification trails are based on common data-structure operations, such as those carried out using balanced binary trees and heaps. Any algorithm using these sets of operations can therefore employ the certification trail method to achieve software fault tolerance. To exemplify the scope of this generalization, constructions of trails are given for abstract data types such as priority queues and union-find structures. These trails are applicable to any data-structure implementation of the abstract data type. It is also shown that these ideas lead naturally to monitors for data-structure operations.
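
    To make the idea concrete, the following minimal sketch (names illustrative, not taken from the paper) treats the sorting permutation as a certification trail: the first execution emits the answer together with the trail, and an independent checker uses the trail to validate the answer in linear time rather than re-sorting.

    ```python
    # A minimal sketch of a certification trail for sorting, assuming the
    # trail is the permutation that sorts the input.

    def sort_with_trail(xs):
        """First execution: compute the answer and emit a certification trail."""
        trail = sorted(range(len(xs)), key=lambda i: xs[i])  # sorting permutation
        return [xs[i] for i in trail], trail

    def check_with_trail(xs, answer, trail):
        """Second execution: use the trail to validate the answer in O(n)."""
        n = len(xs)
        seen = [False] * n
        for i in trail:                       # trail must be a permutation of 0..n-1
            if not 0 <= i < n or seen[i]:
                return False
            seen[i] = True
        if [xs[i] for i in trail] != answer:  # trail must reproduce the answer
            return False
        return all(answer[k] <= answer[k + 1] for k in range(n - 1))  # sortedness

    data = [5, 2, 9, 1]
    answer, trail = sort_with_trail(data)
    assert check_with_trail(data, answer, trail)
    ```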

  10. Fault-tolerant clock synchronization validation methodology [in computer systems]

    NASA Technical Reports Server (NTRS)

    Butler, Ricky W.; Palumbo, Daniel L.; Johnson, Sally C.

    1987-01-01

    A validation method for the synchronization subsystem of a fault-tolerant computer system is presented. The high reliability requirement of flight-crucial systems precludes the use of most traditional validation methods. The method presented utilizes formal design proof to uncover design and coding errors and experimentation to validate the assumptions of the design proof. The experimental method is described and illustrated by validating the clock synchronization system of the Software Implemented Fault Tolerance computer. The design proof of the algorithm includes a theorem that defines the maximum skew between any two nonfaulty clocks in the system in terms of specific system parameters. Most of these parameters are deterministic. One crucial parameter is the upper bound on the clock read error, which is stochastic. The probability that this upper bound is exceeded is calculated from data obtained by the measurement of system parameters. This probability is then included in a detailed reliability analysis of the system.

  11. Continuous Fine-Fault Estimation with Real-Time GNSS

    NASA Astrophysics Data System (ADS)

    Norford, B. B.; Melbourne, T. I.; Szeliga, W. M.; Santillan, V. M.; Scrivner, C.; Senko, J.; Larsen, D.

    2017-12-01

    Thousands of real-time telemetered GNSS stations operate throughout the circum-Pacific that may be used for rapid earthquake characterization and estimation of local tsunami excitation. We report on the development of a GNSS-based finite-fault inversion system that continuously estimates slip using real-time GNSS position streams from the Cascadia subduction zone and which is being expanded throughout the circum-Pacific. The system uses 1 Hz precise point position streams computed in the ITRF14 reference frame using clock and satellite orbit corrections from the IGS. The software is implemented as seven independent modules that filter time series using Kalman filters, trigger and estimate coseismic offsets, invert for slip using a non-negative least squares method developed by Lawson and Hanson (1974) and elastic half-space Green's Functions developed by Okada (1985), smooth the results temporally and spatially, and write the resulting streams of time-dependent slip to a RabbitMQ messaging server for use by downstream modules such as tsunami excitation modules. Additional fault models can be easily added to the system for other circum-Pacific subduction zones as additional real-time GNSS data become available. The system is currently being tested using data from well-recorded earthquakes including the 2011 Tohoku earthquake, the 2010 Maule earthquake, the 2015 Illapel earthquake, the 2003 Tokachi-oki earthquake, the 2014 Iquique earthquake, the 2010 Mentawai earthquake, the 2016 Kaikoura earthquake, the 2016 Ecuador earthquake, the 2015 Gorkha earthquake, and others. Test data will be fed to the system and the resultant earthquake characterizations will be compared with published earthquake parameters. Seismic events will be assumed to occur on major faults, so, for example, only the San Andreas fault will be considered in Southern California, while the hundreds of other faults in the region will be ignored. Rake will be constrained along each subfault to be consistent with NUVEL-1 plate convergence directions. This software provides a basis for a GNSS-based rapid earthquake finite fault estimation system with global scope.
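
    The core inversion step can be sketched in a few lines. The fragment below is a schematic stand-in, not the project's code: it uses a random placeholder Green's function matrix (a real system would compute it from Okada (1985) half-space solutions) and recovers non-negative subfault slip from synthetic GNSS offsets using the Lawson-Hanson NNLS solver exposed by SciPy.

    ```python
    # Schematic non-negative least squares slip inversion; G and the data
    # are placeholders for illustration only.
    import numpy as np
    from scipy.optimize import nnls

    rng = np.random.default_rng(0)
    n_obs, n_subfaults = 30, 10           # GNSS offset components, fault patches
    G = rng.random((n_obs, n_subfaults))  # placeholder Green's functions
    true_slip = np.zeros(n_subfaults)
    true_slip[3:6] = [0.5, 1.2, 0.8]      # metres of slip on a few patches
    d = G @ true_slip + 0.01 * rng.standard_normal(n_obs)  # noisy offsets

    slip, residual_norm = nnls(G, d)      # Lawson-Hanson NNLS
    print(np.round(slip, 2), residual_norm)
    ```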

  12. Diagnostics Tools Identify Faults Prior to Failure

    NASA Technical Reports Server (NTRS)

    2013-01-01

    Through the SBIR program, Rochester, New York-based Impact Technologies LLC collaborated with Ames Research Center to commercialize the Center's Hybrid Diagnostic Engine, or HyDE, software. The fault-detecting program is now incorporated into a software suite that identifies potential faults early in the design phase of systems ranging from printers to vehicles and robots, saving time and money.

  13. Hardware Fault Simulator for Microprocessors

    NASA Technical Reports Server (NTRS)

    Hess, L. M.; Timoc, C. C.

    1983-01-01

    A breadboarded circuit is faster and more thorough than a software simulator. An elementary fault simulator for an AND gate uses three gates and a shift register to simulate stuck-at-one or stuck-at-zero conditions at the inputs and output. Experimental results showed that the hardware fault simulator for a microprocessor gave results faster than a software simulator by two orders of magnitude, with one test being applied every 4 microseconds.
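
    The stuck-at fault model the simulator implements is easy to mimic in software. The toy sketch below (illustrative only; the actual simulator was a breadboarded circuit) injects each stuck-at-zero/stuck-at-one fault into a single AND gate and lists the input vectors that detect it.

    ```python
    # A software mimic of the stuck-at fault model: a fault forces one gate
    # node ('a', 'b', or 'out') to a constant 0 or 1.
    from itertools import product

    def and_gate(a, b, fault=None):
        """fault is None or a (node, value) pair."""
        if fault and fault[0] == 'a':
            a = fault[1]
        if fault and fault[0] == 'b':
            b = fault[1]
        out = a & b
        if fault and fault[0] == 'out':
            out = fault[1]
        return out

    for fault in [(n, v) for n in ('a', 'b', 'out') for v in (0, 1)]:
        # A test vector detects the fault if the faulty output differs from
        # the fault-free output for that vector.
        detecting = [(a, b) for a, b in product((0, 1), repeat=2)
                     if and_gate(a, b) != and_gate(a, b, fault)]
        print(f"stuck-at-{fault[1]} on '{fault[0]}': detected by {detecting}")
    ```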

  14. Fault Tolerance Middleware for a Multi-Core System

    NASA Technical Reports Server (NTRS)

    Some, Raphael R.; Springer, Paul L.; Zima, Hans P.; James, Mark; Wagner, David A.

    2012-01-01

    Fault Tolerance Middleware (FTM) provides a framework to run on a dedicated core of a multi-core system and handles detection of single-event upsets (SEUs), and the responses to those SEUs, occurring in an application running on multiple cores of the processor. This software was written expressly for a multi-core system and can support different kinds of fault strategies, such as introspection, algorithm-based fault tolerance (ABFT), and triple modular redundancy (TMR). It focuses on providing fault tolerance for the application code, and represents the first step in a plan to eventually include fault tolerance in message passing and the FTM itself. In the multi-core system, the FTM resides on a single, dedicated core, separate from the cores used by the application. This is done in order to isolate the FTM from application faults and to allow it to swap out any application core for a substitute. The structure of the FTM consists of an interface to a fault tolerant strategy module, a responder module, a fault manager module, an error factory, and an error mapper that determines the severity of the error. In the present reference implementation, the only fault tolerant strategy implemented is introspection. The introspection code waits for an application node to send an error notification to it. It then uses the error factory to create an error object, and at this time, a severity level is assigned to the error. The introspection code uses its built-in knowledge base to generate a recommended response to the error. Responses might include ignoring the error, logging it, rolling back the application to a previously saved checkpoint, swapping in a new node to replace a bad one, or restarting the application. The original error and recommended response are passed to the top-level fault manager module, which invokes the response. The responder module also notifies the introspection module of the generated response. This provides additional information to the introspection module that it can use in generating its next response. For example, if the responder triggers an application rollback and errors are still occurring, the introspection module may decide to recommend an application restart.
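
    The flow described above can be illustrated with a highly simplified sketch; all class and function names here are invented for illustration and are not the FTM's actual interfaces. An error factory assigns severity, an introspection strategy consults its history to recommend a response, and the responder's feedback is recorded for the next decision.

    ```python
    # A toy introspection-style fault manager loop; names and severities
    # are invented for illustration.
    from dataclasses import dataclass

    @dataclass
    class Error:
        source_core: int
        kind: str
        severity: int                      # assigned by the "error factory"

    SEVERITY = {"parity": 1, "checksum": 2, "hang": 3}

    def error_factory(source_core, kind):
        return Error(source_core, kind, SEVERITY.get(kind, 2))

    def introspection_strategy(err, history):
        """Knowledge-base stand-in: escalate when the same error recurs."""
        if history.count((err.source_core, err.kind)) >= 2:
            return "restart_application"   # rollbacks did not help
        if err.severity >= 3:
            return "swap_in_spare_core"
        return "rollback_to_checkpoint"

    history = []                           # responder feedback to introspection
    for core, kind in [(2, "parity"), (2, "parity"), (2, "parity")]:
        err = error_factory(core, kind)
        print(f"core {core}: {kind} -> {introspection_strategy(err, history)}")
        history.append((err.source_core, err.kind))
    ```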

  15. Toward a Model-Based Approach to Flight System Fault Protection

    NASA Technical Reports Server (NTRS)

    Day, John; Murray, Alex; Meakin, Peter

    2012-01-01

    Fault Protection (FP) is a distinct and separate systems engineering sub-discipline that is concerned with the off-nominal behavior of a system. Flight system fault protection is an important part of the overall flight system systems engineering effort, with its own products and processes. As with other aspects of systems engineering, the FP domain is highly amenable to expression and management in models. However, while there are standards and guidelines for performing FP-related analyses, there are no standards or guidelines for formally relating the FP analyses to each other or to the system hardware and software design. As a result, the material generated for these analyses effectively constitutes separate models that are only loosely related to the system being designed. Developing approaches that enable modeling of FP concerns in the same model as the system hardware and software design allows formal relationships to be established, with great potential for improving the efficiency, correctness, and verification of the implementation of flight system FP. This paper begins with an overview of the FP domain, then presents a SysML/UML model of the FP domain and the particular analyses it contains, by way of showing a potential model-based approach to flight system fault protection, together with an exposition of the use of the FP models in FSW engineering. The analyses are small examples, inspired by current real-project examples of FP analyses.

  16. Ensuring critical event sequences in high consequence computer based systems as inspired by path expressions

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Kidd, M.E.C.

    1997-02-01

    The goal of our work is to provide a high level of confidence that critical software-driven event sequences are maintained in the face of hardware failures, malevolent attacks, and harsh or unstable operating environments. This will be accomplished by providing dynamic fault management measures directly to the software developer and to their varied development environments. The methodology employed here is inspired by previous work on path expressions. This paper discusses the perceived problems, a brief overview of path expressions, the proposed methods, and the differences between the proposed methods and traditional path expression usage and implementation.
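
    As a rough illustration of the kind of run-time guarantee intended (a sketch under assumed event names, not the paper's method), a monitor can accept only the legal next event of a critical sequence and trap anything out of order:

    ```python
    # A minimal run-time sequence monitor for a hypothetical critical
    # sequence arm -> enable -> fire.
    class SequenceMonitor:
        def __init__(self, required):
            self.required = required       # the legal event order
            self.index = 0

        def observe(self, event):
            if self.index < len(self.required) and event == self.required[self.index]:
                self.index += 1            # legal next step
                return
            raise RuntimeError(f"illegal event {event!r}; expected "
                               f"{self.required[self.index:self.index + 1]}")

    mon = SequenceMonitor(["arm", "enable", "fire"])
    for ev in ["arm", "enable", "fire"]:
        mon.observe(ev)                    # accepted in order
    try:
        SequenceMonitor(["arm", "enable", "fire"]).observe("fire")
    except RuntimeError as exc:
        print(exc)                         # skipping straight to 'fire' is caught
    ```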

  17. Motion-Based System Identification and Fault Detection and Isolation Technologies for Thruster Controlled Spacecraft

    NASA Technical Reports Server (NTRS)

    Wilson, Edward; Sutter, David W.; Berkovitz, Dustin; Betts, Bradley J.; Kong, Edmund; delMundo, Rommel; Lages, Christopher R.; Mah, Robert W.; Papasin, Richard

    2003-01-01

    By analyzing the motions of a thruster-controlled spacecraft, it is possible to provide on-line (1) thruster fault detection and isolation (FDI), and (2) vehicle mass- and thruster-property identification (ID). Technologies developed recently at NASA Ames have significantly improved the speed and accuracy of these ID and FDI capabilities, making them feasible for application to a broad class of spacecraft. Since these technologies use existing sensors, the improved system robustness and performance that come with thruster fault tolerance and system ID can be achieved through a software-only implementation. This contrasts with the added cost, mass, and hardware complexity commonly required by FDI. Originally developed in partnership with NASA Johnson Space Center to provide thruster FDI capability for the X-38 during re-entry, these technologies are most recently being applied to the MIT SPHERES experimental spacecraft to fly on the International Space Station in 2004. The model-based FDI uses a maximum-likelihood calculation at its core, while the ID is based upon recursive least squares estimation. Flight test results from the SPHERES implementation, as flown aboard the NASA KC-135A 0-g simulator aircraft in November 2003, are presented.
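
    Recursive least squares, the technique named above for the mass- and thruster-property ID, can be sketched compactly. The example below is illustrative only (two synthetic parameters and made-up noise, not the flight code): each measurement updates the parameter estimate and its covariance.

    ```python
    # A bare-bones recursive least squares estimator on synthetic data.
    import numpy as np

    def rls_update(theta, P, x, y, lam=1.0):
        """One RLS step: theta is the estimate, P its covariance, y = x@theta."""
        x = x.reshape(-1, 1)
        K = P @ x / (lam + x.T @ P @ x)                  # gain
        theta = theta + K.flatten() * (y - x.flatten() @ theta)
        P = (P - K @ x.T @ P) / lam
        return theta, P

    rng = np.random.default_rng(1)
    true_params = np.array([2.0, -0.5])                  # e.g. inverse mass, bias
    theta, P = np.zeros(2), np.eye(2) * 100.0
    for _ in range(200):
        x = rng.standard_normal(2)                       # regressor (commanded forces)
        y = x @ true_params + 0.05 * rng.standard_normal()  # measured acceleration
        theta, P = rls_update(theta, P, x, y)
    print(np.round(theta, 3))                            # approaches [2.0, -0.5]
    ```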

  18. Production of Reliable Flight Crucial Software: Validation Methods Research for Fault Tolerant Avionics and Control Systems Sub-Working Group Meeting

    NASA Technical Reports Server (NTRS)

    Dunham, J. R. (Editor); Knight, J. C. (Editor)

    1982-01-01

    The state of the art in the production of crucial software for flight control applications was addressed. The association between reliability metrics and software is considered. Thirteen software development projects are discussed. A short-term need for research in the areas of tool development and software fault tolerance was indicated. For the long term, research in formal verification or proof methods was recommended. Formal specification and software reliability modeling were recommended as topics for both short- and long-term research.

  19. Applications of Logic Coverage Criteria and Logic Mutation to Software Testing

    ERIC Educational Resources Information Center

    Kaminski, Garrett K.

    2011-01-01

    Logic is an important component of software. Thus, software logic testing has enjoyed significant research over a period of decades, with renewed interest in the last several years. One approach to detecting logic faults is to create and execute tests that satisfy logic coverage criteria. Another approach to detecting faults is to perform mutation…

  20. Runtime Verification in Context: Can Optimizing Error Detection Improve Fault Diagnosis?

    NASA Technical Reports Server (NTRS)

    Dwyer, Matthew B.; Purandare, Rahul; Person, Suzette

    2010-01-01

    Runtime verification has primarily been developed and evaluated as a means of enriching the software testing process. While many researchers have pointed to its potential applicability in online approaches to software fault tolerance, there has been a dearth of work exploring the details of how that might be accomplished. In this paper, we describe how a component-oriented approach to software health management exposes the connections between program execution, error detection, fault diagnosis, and recovery. We identify both research challenges and opportunities in exploiting those connections. Specifically, we describe how recent approaches to reducing the overhead of runtime monitoring aimed at error detection might be adapted to reduce the overhead and improve the effectiveness of fault diagnosis.
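
    A toy example of the monitoring-for-diagnosis connection (the typestate property and all names are invented for illustration): a monitor that checks a simple property, no read after close, and carries the observed event history so that a detection also supplies diagnostic context.

    ```python
    # A toy runtime monitor whose detection event carries the trace
    # needed for fault diagnosis.
    class MonitoredFile:
        def __init__(self, data):
            self.data, self.closed, self.trace = data, False, []

        def read(self):
            self.trace.append("read")      # event history kept for diagnosis
            if self.closed:
                raise RuntimeError(f"read after close; event trace={self.trace}")
            return self.data

        def close(self):
            self.trace.append("close")
            self.closed = True

    f = MonitoredFile("payload")
    f.read()
    f.close()
    try:
        f.read()
    except RuntimeError as exc:
        print(exc)                         # detection plus the event history
    ```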

  1. Resilience Design Patterns - A Structured Approach to Resilience at Extreme Scale (version 1.1)

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hukerikar, Saurabh; Engelmann, Christian

    Reliability is a serious concern for future extreme-scale high-performance computing (HPC) systems. Projections based on the current generation of HPC systems and technology roadmaps suggest the prevalence of very high fault rates in future systems. The errors resulting from these faults will propagate and generate various kinds of failures, which may result in outcomes ranging from result corruptions to catastrophic application crashes. Therefore the resilience challenge for extreme-scale HPC systems requires management of various hardware and software technologies that are capable of handling a broad set of fault models at accelerated fault rates. Also, due to practical limits on power consumption in HPC systems, future systems are likely to embrace innovative architectures, increasing the levels of hardware and software complexities. As a result the techniques that seek to improve resilience must navigate the complex trade-off space between resilience and the overheads to power consumption and performance. While the HPC community has developed various resilience solutions, application-level techniques as well as system-based solutions, the solution space of HPC resilience techniques remains fragmented. There are no formal methods and metrics to investigate and evaluate resilience holistically in HPC systems that consider impact scope, handling coverage, and performance & power efficiency across the system stack. Additionally, few of the current approaches are portable to newer architectures and software environments that will be deployed on future systems. In this document, we develop a structured approach to the management of HPC resilience using the concept of resilience-based design patterns. A design pattern is a general repeatable solution to a commonly occurring problem. We identify the commonly occurring problems and solutions used to deal with faults, errors and failures in HPC systems. Each established solution is described in the form of a pattern that addresses concrete problems in the design of resilient systems. The complete catalog of resilience design patterns provides designers with reusable design elements. We also define a framework that enhances a designer's understanding of the important constraints and opportunities for the design patterns to be implemented and deployed at various layers of the system stack. This design framework may be used to establish mechanisms and interfaces to coordinate flexible fault management across hardware and software components. The framework also supports optimization of the cost-benefit trade-offs among performance, resilience, and power consumption. The overall goal of this work is to enable a systematic methodology for the design and evaluation of resilience technologies in extreme-scale HPC systems that keep scientific applications running to a correct solution in a timely and cost-efficient manner in spite of frequent faults, errors, and failures of various types.

  2. Low cost computer subsystem for the Solar Electric Propulsion Stage (SEPS)

    NASA Technical Reports Server (NTRS)

    1975-01-01

    The Solar Electric Propulsion Stage (SEPS) subsystem, which consists of the computer, custom input/output (I/O) unit, and tape recorder for mass storage of telemetry data, was studied. Computer software and interface requirements were developed along with computer and I/O unit design parameters. Redundancy implementation was emphasized. Reliability analysis was performed for the complete command computer subsystem. A SEPS fault-tolerant memory breadboard was constructed and its operation demonstrated.

  3. Architecture for Survivable System Processing (ASSP)

    NASA Astrophysics Data System (ADS)

    Wood, Richard J.

    1991-11-01

    The Architecture for Survivable System Processing (ASSP) Program is a multi-phase effort to implement Department of Defense (DOD) and commercially developed high-tech hardware, software, and architectures for reliable space avionics and ground-based systems. System configuration options provide processing capabilities to address Time Dependent Processing (TDP), Object Dependent Processing (ODP), and Mission Dependent Processing (MDP) requirements through Open System Architecture (OSA) alternatives that allow for the enhancement, incorporation, and capitalization of a broad range of development assets. High technology developments in hardware, software, and networking models address technology challenges of long processor lifetimes, fault tolerance, reliability, throughput, memories, radiation hardening, size, weight, power (SWAP), and security. Hardware and software design, development, and implementation focus on the interconnectivity and interoperability of an open system architecture, which is being developed to apply new technology to practical OSA components. To ensure a widely acceptable architecture capable of interfacing with various commercial and military components, the program provides for regular interactions with standardization working groups, e.g., the International Standards Organization (ISO), the American National Standards Institute (ANSI), the Society of Automotive Engineers (SAE), and the Institute of Electrical and Electronics Engineers (IEEE). Selection of a viable open architecture is based on the widely accepted standards that implement the ISO/OSI Reference Model.

  4. Architecture for Survivable System Processing (ASSP)

    NASA Technical Reports Server (NTRS)

    Wood, Richard J.

    1991-01-01

    The Architecture for Survivable System Processing (ASSP) Program is a multi-phase effort to implement Department of Defense (DOD) and commercially developed high-tech hardware, software, and architectures for reliable space avionics and ground-based systems. System configuration options provide processing capabilities to address Time Dependent Processing (TDP), Object Dependent Processing (ODP), and Mission Dependent Processing (MDP) requirements through Open System Architecture (OSA) alternatives that allow for the enhancement, incorporation, and capitalization of a broad range of development assets. High technology developments in hardware, software, and networking models address technology challenges of long processor lifetimes, fault tolerance, reliability, throughput, memories, radiation hardening, size, weight, power (SWAP), and security. Hardware and software design, development, and implementation focus on the interconnectivity and interoperability of an open system architecture, which is being developed to apply new technology to practical OSA components. To ensure a widely acceptable architecture capable of interfacing with various commercial and military components, the program provides for regular interactions with standardization working groups, e.g., the International Standards Organization (ISO), the American National Standards Institute (ANSI), the Society of Automotive Engineers (SAE), and the Institute of Electrical and Electronics Engineers (IEEE). Selection of a viable open architecture is based on the widely accepted standards that implement the ISO/OSI Reference Model.

  5. Health management and controls for Earth-to-orbit propulsion systems

    NASA Astrophysics Data System (ADS)

    Bickford, R. L.

    1995-03-01

    Avionics and health management technologies increase the safety and reliability while decreasing the overall cost for Earth-to-orbit (ETO) propulsion systems. New ETO propulsion systems will depend on highly reliable fault tolerant flight avionics, advanced sensing systems and artificial intelligence aided software to ensure critical control, safety and maintenance requirements are met in a cost effective manner. Propulsion avionics consist of the engine controller, actuators, sensors, software and ground support elements. In addition to control and safety functions, these elements perform system monitoring for health management. Health management is enhanced by advanced sensing systems and algorithms which provide automated fault detection and enable adaptive control and/or maintenance approaches. Aerojet is developing advanced fault tolerant rocket engine controllers which provide very high levels of reliability. Smart sensors and software systems which significantly enhance fault coverage and enable automated operations are also under development. Smart sensing systems, such as flight capable plume spectrometers, have reached maturity in ground-based applications and are suitable for bridging to flight. Software to detect failed sensors has reached similar maturity. This paper will discuss fault detection and isolation for advanced rocket engine controllers as well as examples of advanced sensing systems and software which significantly improve component failure detection for engine system safety and health management.

  6. Online Monitoring of Induction Motors

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    McJunkin, Timothy R.; Agarwal, Vivek; Lybeck, Nancy Jean

    2016-01-01

    The online monitoring of active components project, under the Advanced Instrumentation, Information, and Control Technologies Pathway of the Light Water Reactor Sustainability Program, researched diagnostic and prognostic models for alternating current induction motors (IM). Idaho National Laboratory (INL) worked with the Electric Power Research Institute (EPRI) to augment and revise the fault signatures previously implemented in the Asset Fault Signature Database of EPRI's Fleet-Wide Prognostic and Health Management (FW-PHM) Suite software. Induction motor diagnostic models were researched using the experimental data collected by Idaho State University. Prognostic models were explored in the literature and through a limited experiment with a 40 hp motor, to seed the Remaining Useful Life Database of the FW-PHM Suite.

  7. Diagnostic Analyzer for Gearboxes (DAG): User's Guide. Version 3.1 for Microsoft Windows 3.1

    NASA Technical Reports Server (NTRS)

    Jammu, Vinay B.; Kourosh, Danai

    1997-01-01

    This documentation describes the Diagnostic Analyzer for Gearboxes (DAG) software for performing fault diagnosis of gearboxes. First, the user would construct a graphical representation of the gearbox using the gear, bearing, shaft, and sensor tools contained in the DAG software. Next, a set of vibration features obtained by processing the vibration signals recorded from the gearbox using a signal analyzer is required. Given this information, the DAG software uses an unsupervised neural network referred to as the Fault Detection Network (FDN) to identify the occurrence of faults, and a pattern classifier called Single Category-Based Classifier (SCBC) for abnormality scaling of individual vibration features. The abnormality-scaled vibration features are then used as inputs to a Structure-Based Connectionist Network (SBCN) for identifying faults in gearbox subsystems and components. The weights of the SBCN represent its diagnostic knowledge and are derived from the structure of the gearbox graphically presented in DAG. The outputs of SBCN are fault possibility values between 0 and 1 for individual subsystems and components in the gearbox with a 1 representing a definite fault and a 0 representing normality. This manual describes the steps involved in creating the diagnostic gearbox model, along with the options and analysis tools of the DAG software.

  8. Stress state reconstruction and tectonic evolution of the northern slope of the Baikit anteclise, Siberian Craton, based on 3D seismic data

    NASA Astrophysics Data System (ADS)

    Moskalenko, A. N.; Khudoley, A. K.; Khusnitdinov, R. R.

    2017-05-01

    In this work, we apply an original method for determining indicators of tectonic stress fields in the northern Baikit anteclise from 3D seismic data, for subsequent reconstruction of the stress state parameters from structural maps of seismic horizons and the corresponding faults. The stress state parameters are determined by the orientations of the main stress axes and the shape of the stress ellipsoid. To calculate the stress state parameters from data on the spatial orientations of faults and slip vectors, we used algorithms from quasiprimary stress computation methods and cataclastic analysis, implemented in the software products FaultKinWin and StressGeol, respectively. The results show that the kinematic characteristics of faults change regularly toward the top of the succession and that the stress state parameters are characterized by different values of the Lode-Nadai coefficient. The faults are strike-slip faults with a normal or reverse component of displacement. Three stages of fault formation are revealed: (1) partial inversion of ancient normal faults; (2) the most intense stage, dominated by thrust and strike-slip faulting with a north-northeast orientation of the main compression axis; and (3) strike-slip faulting with a west-northwest orientation of the main compression axis. The second and third stages are pre-Vendian in age and correlate with tectonic events that took place during the evolution of the active southwestern margin of the Siberian Craton.

  9. NASA ground terminal communication equipment automated fault isolation expert systems

    NASA Technical Reports Server (NTRS)

    Tang, Y. K.; Wetzel, C. R.

    1990-01-01

    Prototype expert systems are described that diagnose the Distribution and Switching System I and II (DSS1 and DSS2), Statistical Multiplexers (SM), and Multiplexer and Demultiplexer (MDM) systems at the NASA Ground Terminal (NGT). A system-level fault isolation expert system monitors the activities of a selected data stream, verifies that the fault exists in the NGT, and identifies the faulty equipment. Equipment-level fault isolation expert systems are then invoked to isolate the fault to the Line Replaceable Unit (LRU) level. Input, and sometimes output, data stream activities for the equipment are available. The system-level fault isolation expert system compares the equipment input and output status for a data stream and performs loopback tests (if necessary) to isolate the faulty equipment. The equipment-level fault isolation system utilizes the process of elimination and/or the maintenance personnel's fault isolation experience stored in its knowledge base. The DSS1, DSS2, and SM fault isolation systems, using knowledge of the current equipment configuration and the equipment circuitry, issue a set of test connections according to predefined rules. The faulty component or board can be identified by the expert system by analyzing the test results. The MDM fault isolation system correlates failure symptoms with the faulty component based on maintenance personnel experience, so the faulty component can be determined from the failure symptoms. The DSS1, DSS2, SM, and MDM equipment simulators are implemented in PASCAL. The DSS1 fault isolation expert system was converted from VP-Expert to C and integrated into the NGT automation software for offline switch diagnoses. Potentially, the NGT fault isolation algorithms can be used for the DSS1, SM, and MDM located at Goddard Space Flight Center (GSFC).
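
    The system-level isolation step lends itself to a compact illustration. The sketch below is a drastic simplification with made-up statuses, not the deployed expert system: it walks the units along a data stream and flags the first unit whose input is good but whose output is bad, which the loopback tests would then confirm.

    ```python
    # A toy version of compare-input/output fault isolation along one data
    # stream; equipment names follow the abstract, statuses are invented.
    stream_path = ["DSS1", "SM", "MDM", "DSS2"]     # units along the stream
    status = {                                      # (input_ok, output_ok)
        "DSS1": (True, True),
        "SM":   (True, True),
        "MDM":  (True, False),                      # fault injected here
        "DSS2": (False, False),                     # sees bad input downstream
    }

    def isolate(path, status):
        for unit in path:
            ok_in, ok_out = status[unit]
            if ok_in and not ok_out:
                return unit                         # loopback tests would confirm
        return None

    print(isolate(stream_path, status))             # -> MDM
    ```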

  10. Using EMIS to Identify Top Opportunities for Commercial Building Efficiency

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Lin, Guanjing; Singla, Rupam; Granderson, Jessica

    Energy Management and Information Systems (EMIS) comprise a broad family of tools and services to manage commercial building energy use. These technologies offer a mix of capabilities to store, display, and analyze energy use and system data, and in some cases, provide control. EMIS technologies enable 10–20 percent site energy savings in best practice implementations. Energy Information System (EIS) and Fault Detection and Diagnosis (FDD) systems are two key technologies in the EMIS family. Energy Information Systems are broadly defined as the web-based software, data acquisition hardware, and communication systems used to analyze and display building energy performance. At a minimum, an EIS provides daily, hourly, or sub-hourly interval meter data at the whole-building level, with graphical and analytical capability. Fault Detection and Diagnosis systems automatically identify heating, ventilation, and air-conditioning (HVAC) system or equipment-level performance issues, and in some cases are able to isolate the root causes of the problem. They use computer algorithms to continuously analyze system-level operational data to detect faults and diagnose their causes. Many FDD tools integrate trend log data from a Building Automation System (BAS) but otherwise are stand-alone software packages; other types of FDD tools are implemented as "on-board" equipment-embedded diagnostics. (This document focuses on the former.) Analysis approaches adopted in FDD technologies span a variety of techniques, from rule-based methods to process-history-based approaches. FDD tools automate investigations that would otherwise require manual data inspection by someone with expert knowledge, thereby expanding accessibility and breadth of analysis opportunity, and also reducing complexity.

  11. Processing LiDAR Data to Predict Natural Hazards

    NASA Technical Reports Server (NTRS)

    Fairweather, Ian; Crabtree, Robert; Hager, Stacey

    2008-01-01

    ELF-Base and ELF-Hazards (wherein 'ELF' signifies 'Extract LiDAR Features' and 'LiDAR' signifies 'light detection and ranging') are developmental software modules for processing remote-sensing LiDAR data to identify past natural hazards (principally, landslides) and predict future ones. ELF-Base processes raw LiDAR data, including LiDAR intensity data that are often ignored in other software, to create digital terrain models (DTMs) and digital feature models (DFMs) with sub-meter accuracy. ELF-Hazards fuses raw LiDAR data, data from multispectral and hyperspectral optical images, and DTMs and DFMs generated by ELF-Base to generate hazard risk maps. Advanced algorithms in these software modules include line-enhancement and edge-detection algorithms, surface-characterization algorithms, and algorithms that implement innovative data-fusion techniques. The line-extraction and edge-detection algorithms enable users to locate such features as faults and landslide headwall scarps. Also implemented in this software are improved methodologies for identification and mapping of past landslide events by use of (1) accurate, ELF-derived surface characterizations and (2) three LiDAR/optical-data-fusion techniques: post-classification data fusion, maximum-likelihood estimation modeling, and hierarchical within-class discrimination. This software is expected to enable faster, more accurate forecasting of natural hazards than has previously been possible.

  12. Ground Software Maintenance Facility (GSMF) system manual

    NASA Technical Reports Server (NTRS)

    Derrig, D.; Griffith, G.

    1986-01-01

    The Ground Software Maintenance Facility (GSMF) is designed to support development and maintenance of Spacelab ground support software. The GSMF consists of a Perkin Elmer 3250 (host computer) and a MITRA 125s (ATE computer), with appropriate interface devices and software to simulate the Electrical Ground Support Equipment (EGSE). This document is presented in three sections: (1) GSMF overview; (2) software structure; and (3) fault isolation capability. The overview contains information on hardware and software organization along with the corresponding block diagrams. The software structure section describes the software structure, including source files, link information, and database files. The fault isolation section describes the capabilities of the Ground Computer Interface Device, the Perkin Elmer host, and the MITRA ATE.

  13. Detection of faults and software reliability analysis

    NASA Technical Reports Server (NTRS)

    Knight, John C.

    1987-01-01

    Multi-version or N-version programming is proposed as a method of providing fault tolerance in software. The approach requires the separate, independent preparation of multiple versions of a piece of software for some application. These versions are executed in parallel in the application environment; each receives identical inputs and each produces its version of the required outputs. The outputs are collected by a voter and, in principle, they should all be the same. In practice there may be some disagreement. If this occurs, the results of the majority are taken to be the correct output, and that is the output used by the system. A total of 27 programs were produced. Each of these programs was then subjected to one million randomly-generated test cases. The experiment yielded a number of programs containing faults that are useful for general studies of software reliability as well as studies of N-version programming. Fault tolerance through data diversity and analytic models of comparison testing are discussed.
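
    A minimal majority voter of the kind described can be sketched as follows; the three "versions" are stand-ins, one of which carries a seeded fault, and the voter masks the disagreement as long as a majority agrees.

    ```python
    # A minimal N-version voter over three stand-in versions of the same
    # function; v3 carries a seeded fault at x == 3.
    from collections import Counter

    def v1(x): return x * x
    def v2(x): return x ** 2
    def v3(x): return x * x if x != 3 else 8       # seeded fault

    def nvp_vote(versions, x):
        outputs = [v(x) for v in versions]
        value, count = Counter(outputs).most_common(1)[0]
        if count <= len(versions) // 2:
            raise RuntimeError(f"no majority among {outputs}")
        return value

    for x in range(5):
        print(x, nvp_vote([v1, v2, v3], x))        # majority masks v3's fault
    ```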

  14. Measurement of fault latency in a digital avionic miniprocessor

    NASA Technical Reports Server (NTRS)

    Mcgough, J. G.; Swern, F. L.

    1981-01-01

    The results of fault injection experiments utilizing a gate-level emulation of the central processor unit of the Bendix BDX-930 digital computer are presented. The failure detection coverage of comparison-monitoring and a typical avionics CPU self-test program was determined. The specific tasks and experiments included: (1) inject randomly selected gate-level and pin-level faults and emulate six software programs using comparison-monitoring to detect the faults; (2) based upon the derived empirical data develop and validate a model of fault latency that will forecast a software program's detecting ability; (3) given a typical avionics self-test program, inject randomly selected faults at both the gate-level and pin-level and determine the proportion of faults detected; (4) determine why faults were undetected; (5) recommend how the emulation can be extended to multiprocessor systems such as SIFT; and (6) determine the proportion of faults detected by a uniprocessor BIT (built-in-test) irrespective of self-test.

  15. Criteria for software modularization

    NASA Technical Reports Server (NTRS)

    Card, David N.; Page, Gerald T.; Mcgarry, Frank E.

    1985-01-01

    A central issue in programming practice involves determining the appropriate size and information content of a software module. This study attempted to determine the effectiveness of two widely used criteria for software modularization, strength and size, in reducing fault rate and development cost. Data from 453 FORTRAN modules developed by professional programmers were analyzed. The results indicated that module strength is a good criterion with respect to fault rate, whereas arbitrary module size limitations inhibit programmer productivity. This analysis is a first step toward defining empirically based standards for software modularization.

  16. Development and evaluation of a fault-tolerant multiprocessor (FTMP) computer. Volume 4: FTMP executive summary

    NASA Technical Reports Server (NTRS)

    Smith, T. B., III; Lala, J. H.

    1984-01-01

    The FTMP architecture is a high-reliability computer concept modeled after a homogeneous multiprocessor architecture. Elements of the FTMP are operated in tight synchronism with one another, and hardware fault detection and fault masking are provided which are transparent to the software. Operating system design and user software design are thus greatly simplified. Performance of the FTMP is also comparable to that of a simplex equivalent, due to the efficiency of the fault-handling hardware. The FTMP project constructed an engineering module of the FTMP, programmed the machine, and extensively tested the architecture through fault injection and other stress testing. This testing confirmed the soundness of the FTMP concepts.

  17. Research in computer science

    NASA Technical Reports Server (NTRS)

    Ortega, J. M.

    1985-01-01

    Synopses are given for NASA supported work in computer science at the University of Virginia. Some areas of research include: error seeding as a testing method; knowledge representation for engineering design; analysis of faults in a multi-version software experiment; implementation of a parallel programming environment; two computer graphics systems for visualization of pressure distribution and convective density particles; task decomposition for multiple robot arms; vectorized incomplete conjugate gradient; and iterative methods for solving linear equations on the Flex/32.

  18. Combining Topological Hardware and Topological Software: Color-Code Quantum Computing with Topological Superconductor Networks

    NASA Astrophysics Data System (ADS)

    Litinski, Daniel; Kesselring, Markus S.; Eisert, Jens; von Oppen, Felix

    2017-07-01

    We present a scalable architecture for fault-tolerant topological quantum computation using networks of voltage-controlled Majorana Cooper pair boxes and topological color codes for error correction. Color codes have a set of transversal gates which coincides with the set of topologically protected gates in Majorana-based systems, namely, the Clifford gates. In this way, we establish color codes as providing a natural setting in which advantages offered by topological hardware can be combined with those arising from topological error-correcting software for full-fledged fault-tolerant quantum computing. We provide a complete description of our architecture, including the underlying physical ingredients. We start by showing that in topological superconductor networks, hexagonal cells can be employed to serve as physical qubits for universal quantum computation, and we present protocols for realizing topologically protected Clifford gates. These hexagonal-cell qubits allow for a direct implementation of open-boundary color codes with ancilla-free syndrome read-out and logical T gates via magic-state distillation. For concreteness, we describe how the necessary operations can be implemented using networks of Majorana Cooper pair boxes, and we give a feasibility estimate for error correction in this architecture. Our approach is motivated by nanowire-based networks of topological superconductors, but it could also be realized in alternative settings such as quantum-Hall-superconductor hybrids.

  19. Second generation experiments in fault tolerant software

    NASA Technical Reports Server (NTRS)

    Knight, J. C.

    1987-01-01

    The purpose of the Multi-Version Software (MVS) experiment is to obtain empirical measurements of the performance of multi-version systems. Twenty versions of a program were prepared under reasonably realistic development conditions from the same specification. The overall structure of the testing environment for the MVS experiment and its status are described. A preliminary version of the control system implemented for the MVS experiment is described; it allows the experimenter to control the details of the testing. The results of an empirical study of error detection using self-checks are also presented. The analysis of the checks revealed great differences in the ability of individual programmers to design effective checks.

  20. Combinatorial Optimization Algorithms for Dynamic Multiple Fault Diagnosis in Automotive and Aerospace Applications

    NASA Astrophysics Data System (ADS)

    Kodali, Anuradha

    In this thesis, we develop dynamic multiple fault diagnosis (DMFD) algorithms to diagnose faults that are sporadic and coupled. Firstly, we formulate a coupled factorial hidden Markov model-based (CFHMM) framework to diagnose dependent faults occurring over time (the dynamic case). Here, we implement a mixed-memory Markov coupling model to determine the most likely sequence of (dependent) fault states, the one that best explains the observed test outcomes over time. An iterative Gauss-Seidel coordinate ascent optimization method is proposed for solving the problem. A soft Viterbi algorithm is also implemented within the framework for decoding dependent fault states over time. We demonstrate the algorithm on simulated and real-world systems with coupled faults; the results show that this approach improves the correct isolation rate as compared to the formulation where independent fault states are assumed. Secondly, we formulate a generalization of set-covering, termed dynamic set-covering (DSC), which involves a series of coupled set-covering problems over time. The objective of the DSC problem is to infer the most probable time sequence of a parsimonious set of failure sources that explains the observed test outcomes over time. The DSC problem is NP-hard and intractable due to the fault-test dependency matrix that couples the failed tests and faults via the constraint matrix, and due to the temporal dependence of failure sources over time. Here, the DSC problem is motivated from the viewpoint of a dynamic multiple fault diagnosis problem, but it has wide applications in operations research, e.g., the facility location problem. Thus, we also formulate the DSC problem in the context of a dynamically evolving facility location problem. Here, a facility can be opened, closed, or temporarily unavailable at any time for a given requirement of demand points. These activities are associated with costs or penalties, viz., phase-in or phase-out for the opening or closing of a facility, respectively. The set-covering matrix encapsulates the relationship among the rows (tests or demand points) and columns (faults or locations) of the system at each time. By relaxing the coupling constraints using Lagrange multipliers, the DSC problem can be decoupled into independent subproblems, one for each column. Each subproblem is solved using the Viterbi decoding algorithm, and a primal feasible solution is constructed by modifying the Viterbi solutions via a heuristic. The proposed Viterbi-Lagrangian relaxation algorithm (VLRA) provides a measure of suboptimality via an approximate duality gap. As a major practical extension of the above problem, we also consider the problem of diagnosing faults with delayed test outcomes, termed delay-dynamic set-covering (DDSC), and experiment with real-world problems that exhibit masking faults. Also, we present simulation results on OR-Library datasets (set-covering formulations are predominantly validated on these matrices in the literature), posed as facility location problems. Finally, we implement these algorithms to solve problems in aerospace and automotive applications. Firstly, we address the diagnostic ambiguity problem in aerospace and automotive applications by developing a dynamic fusion framework that includes dynamic multiple fault diagnosis algorithms.
This improves the correct fault isolation rate, while minimizing false alarm rates, by considering multiple faults instead of the traditional data-driven techniques based on a single-fault (class), single-epoch (static) assumption. The dynamic fusion problem is formulated as a maximum a posteriori decision problem of inferring the fault sequence based on uncertain outcomes of multiple binary classifiers over time. The fusion process involves three steps: the first step transforms the multi-class problem into dichotomies using error correcting output codes (ECOC), thereby solving the concomitant binary classification problems; the second step fuses the outcomes of multiple binary classifiers over time using a sliding window or block dynamic fusion method that exploits temporal data correlations. We solve this NP-hard optimization problem via a Lagrangian relaxation (variational) technique. The third step optimizes the classifier parameters, viz., probabilities of detection and false alarm, using a genetic algorithm. The proposed algorithm is demonstrated by computing diagnostic performance metrics on a twin-spool commercial jet engine, an automotive engine, and UCI datasets (problems with high classification error are specifically chosen for experimentation). We show that the primal-dual optimization framework performs consistently better than any traditional fusion technique, even when it is forced to give a single fault decision, across a range of classification problems. Secondly, we implement the inference algorithms to diagnose faults in vehicle systems that are controlled by a network of electronic control units (ECUs). The faults, originating from various interactions, especially between hardware and software, are particularly challenging to address. Our basic strategy is to divide the fault universe of such cyber-physical systems in a hierarchical manner and to monitor the critical variables/signals that have impact at different levels of interaction. The proposed diagnostic strategy is validated on an electrical power generation and storage system (EPGS) controlled by two ECUs in a CANoe/MATLAB co-simulation environment. Eleven faults are injected, with the failures originating in actuator hardware, sensor, controller hardware, and software components. A diagnostic matrix is established to represent the relationship between the faults and the test outcomes (also known as fault signatures) via simulations. The results show that the proposed diagnostic strategy is effective in addressing interaction-caused faults.
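
    The Viterbi decoding step that both the CFHMM and DSC formulations rely on can be sketched for a single binary fault state; the transition and emission probabilities below are illustrative, not values from the thesis.

    ```python
    # A compact Viterbi decoder for one binary fault state over time,
    # the building block used inside the Lagrangian-relaxation loop.
    import numpy as np

    def viterbi(obs, trans, emit, prior):
        """obs: observed test outcomes; returns the most likely state sequence."""
        n_states, T = trans.shape[0], len(obs)
        logp = np.log(prior * emit[:, obs[0]])
        back = np.zeros((T, n_states), dtype=int)
        for t in range(1, T):
            cand = logp[:, None] + np.log(trans) + np.log(emit[:, obs[t]])[None, :]
            back[t] = cand.argmax(axis=0)
            logp = cand.max(axis=0)
        states = [int(logp.argmax())]
        for t in range(T - 1, 0, -1):
            states.append(int(back[t][states[-1]]))
        return states[::-1]

    trans = np.array([[0.95, 0.05],    # healthy -> healthy/faulty
                      [0.10, 0.90]])   # faults tend to persist
    emit = np.array([[0.9, 0.1],       # P(pass/fail | healthy)
                     [0.3, 0.7]])      # P(pass/fail | faulty)
    obs = [0, 0, 1, 1, 1, 0, 1]        # 0 = test passes, 1 = test fails
    print(viterbi(obs, trans, emit, np.array([0.99, 0.01])))
    ```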

  1. Path Searching Based Fault Automated Recovery Scheme for Distribution Grid with DG

    NASA Astrophysics Data System (ADS)

    Xia, Lin; Qun, Wang; Hui, Xue; Simeng, Zhu

    2016-12-01

    Path searching based on distribution network topology has proved effective in setting software, and a path searching method that includes DG power sources is equally applicable to the automatic generation and division of planned islands after a fault. This paper applies the path searching algorithm to the automatic division of planned islands after faults: starting from the fault-isolation switch and ending at each power source, and, according to the line load traversed by the search path and the important load it integrates, forming an optimized division scheme of planned islands that uses each DG as a power source and is balanced against the local important load. Finally, the COBASE software and the distribution network automation software in use are employed to illustrate the effectiveness of such an automatic restoration scheme.
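
    A toy version of the search (invented topology and loads, not the paper's test system): depth-first search finds a path from the fault-isolation switch to a DG, and the load along the path is checked against the DG capacity before the island is declared feasible.

    ```python
    # A toy post-fault island search over a made-up distribution network
    # modeled as an adjacency list.
    edges = {"sw": ["n1"], "n1": ["n2", "n3"], "n2": ["dg1"], "n3": ["dg2"],
             "dg1": [], "dg2": []}
    load = {"n1": 40, "n2": 30, "n3": 60}          # kW served at each node
    dg_capacity = {"dg1": 80, "dg2": 90}           # kW each DG can supply

    def path_to_source(start, goal, edges):
        """Depth-first search from the fault-isolation switch to one DG."""
        stack = [(start, [start])]
        while stack:
            node, path = stack.pop()
            if node == goal:
                return path
            for nxt in edges[node]:
                if nxt not in path:
                    stack.append((nxt, path + [nxt]))
        return None

    path = path_to_source("sw", "dg1", edges)
    served = sum(load.get(n, 0) for n in path)          # load the path traverses
    print(path, served, served <= dg_capacity["dg1"])   # feasible planned island?
    ```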

  2. Building Diagnostic Market Deployment - Final Report

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Katipamula, S.; Gayeski, N.

    2012-04-30

    Operational faults are pervasive across the commercial buildings sector, wasting energy and increasing energy costs by up to about 30% (Mills 2009, Liu et al. 2003, Claridge et al. 2000, Katipamula and Brambley 2008, and Brambley and Katipamula 2009). Automated fault detection and diagnostic (AFDD) tools provide capabilities essential for detecting and correcting these problems and eliminating the associated energy waste and costs. The U.S. Department of Energy's (DOE) Building Technology Program (BTP) has previously invested in developing and testing such diagnostic tools for whole-building (and major system) energy use, air handlers, chillers, cooling towers, chilled-water distribution systems, and boilers. These diagnostic processes can be used to make commercial buildings more energy efficient. The work described in this report was done as part of a Cooperative Research and Development Agreement (CRADA) between the U.S. Department of Energy's Pacific Northwest National Laboratory (PNNL) and KGS Building LLC (KGS). PNNL and KGS both believe that the widespread adoption of AFDD tools will result in significant reduction to energy and peak energy consumption. The report provides an introduction and summary of the various tasks performed under the CRADA. The CRADA project had three major focus areas: (1) Technical Assistance for Whole Building Energy Diagnostician (WBE) Commercialization, (2) Market Transfer of the Outdoor Air/Economizer Diagnostician (OAE), and (3) Development and Deployment of Automated Diagnostics to Improve Large Commercial Building Operations. PNNL has previously developed two diagnostic tools: (1) the whole building energy (WBE) diagnostician and (2) the outdoor air/economizer (OAE) diagnostician. The WBE diagnostician is currently licensed non-exclusively to one company. As part of this CRADA, PNNL developed implementation documentation and provided technical support to KGS to implement the tool into their software suite, Clockworks. PNNL also provided validation data sets and the WBE software tool to validate the KGS implementation. The OAE diagnostician automatically detects and diagnoses problems with outdoor air ventilation and economizer operation for air handling units (AHUs) in commercial buildings using data available from building automation systems (BASs). As part of this CRADA, PNNL developed implementation documentation and provided technical support to KGS to implement the tool into their software suite. PNNL also provided validation data sets and the OAE software tool to validate the KGS implementation. Finally, as part of this CRADA project, PNNL developed new processes to automate parts of the re-tuning process and transferred those processes to KGS for integration into their software product. The transfer of DOE-funded technologies will transform the commercial buildings sector by making buildings more energy efficient and also reducing the carbon footprint from buildings. As part of the CRADA with PNNL, KGS implemented the whole building energy diagnostician, a portion of the outdoor air economizer diagnostician, and a number of measures that automate the identification of re-tuning measures.

  3. A Fault Tolerant System for an Integrated Avionics Sensor Configuration

    NASA Technical Reports Server (NTRS)

    Caglayan, A. K.; Lancraft, R. E.

    1984-01-01

    An aircraft sensor fault tolerant system methodology for the Transport Systems Research Vehicle in a Microwave Landing System (MLS) environment is described. The fault tolerant system provides reliable estimates in the presence of possible failures both in ground-based navigation aids and in on-board flight control and inertial sensors. Sensor failures are identified by utilizing the analytic relationships between the various sensors arising from the aircraft point-mass equations of motion. The estimation and failure detection performance of the software implementation (called FINDS) of the developed system was analyzed on a nonlinear digital simulation of the research aircraft. Simulation results showing the detection performance of FINDS, using a dual-redundant sensor complement, are presented for bias, hardover, null, ramp, increased-noise, and scale-factor failures. In general, the results show that FINDS can distinguish between normal operating sensor errors and failures while providing excellent detection speed for bias failures in the MLS, indicated airspeed, attitude, and radar altimeter sensors.
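
    The bias-failure case lends itself to a small illustration. The sketch below is not FINDS itself, just the residual idea it rests on: compare a measured signal with an analytically redundant estimate and declare a bias failure when the windowed residual mean drifts past a threshold (all numbers illustrative).

    ```python
    # A toy residual-based bias-failure detector on synthetic sensor data.
    import numpy as np

    rng = np.random.default_rng(2)
    estimate = np.zeros(200)                   # analytically redundant estimate
    meas = estimate + 0.1 * rng.standard_normal(200)
    meas[120:] += 0.8                          # bias failure injected at sample 120

    window, threshold = 20, 0.5
    for t in range(window, len(meas)):
        residual = meas[t - window:t] - estimate[t - window:t]
        if abs(residual.mean()) > threshold:   # windowed residual mean drifted
            print(f"bias failure declared at sample {t}")
            break
    ```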

  4. Software reliability through fault-avoidance and fault-tolerance

    NASA Technical Reports Server (NTRS)

    Vouk, Mladen A.; Mcallister, David F.

    1992-01-01

    Accomplishments in the following research areas are summarized: structure based testing, reliability growth, and design testability with risk evaluation; reliability growth models and software risk management; and evaluation of consensus voting, consensus recovery block, and acceptance voting. Four papers generated during the reporting period are included as appendices.

  5. Fault tolerant data management system

    NASA Technical Reports Server (NTRS)

    Gustin, W. M.; Smither, M. A.

    1972-01-01

    Described in detail are: (1) results obtained in modifying the onboard data management system software to a multiprocessor fault tolerant system; (2) a functional description of the prototype buffer I/O units; (3) description of modification to the ACADC and stimuli generating unit of the DTS; and (4) summaries and conclusions on techniques implemented in the rack and prototype buffers. Also documented is the work done in investigating techniques of high speed (5 Mbps) digital data transmission in the data bus environment. The application considered is a multiport data bus operating with the following constraints: no preferred stations; random bus access by all stations; all stations equally likely to source or sink data; no limit to the number of stations along the bus; no branching of the bus; and no restriction on station placement along the bus.

  6. Detection of faults and software reliability analysis

    NASA Technical Reports Server (NTRS)

    Knight, J. C.

    1987-01-01

    Specific topics briefly addressed include: the consistent comparison problem in N-version system; analytic models of comparison testing; fault tolerance through data diversity; and the relationship between failures caused by automatically seeded faults.

  7. Design for dependability: A simulation-based approach. Ph.D. Thesis, 1993

    NASA Technical Reports Server (NTRS)

    Goswami, Kumar K.

    1994-01-01

    This research addresses issues in simulation-based system-level dependability analysis of fault-tolerant computer systems. The issues and difficulties of providing a general simulation-based approach for system-level analysis are discussed, and a methodology that addresses these issues is presented. The proposed methodology is designed to permit the study of a wide variety of architectures under various fault conditions. It permits detailed functional modeling of architectural features such as sparing policies, repair schemes, routing algorithms, and other fault-tolerant mechanisms, and it allows the execution of actual application software. One key benefit of this approach is that the behavior of a system under faults does not have to be pre-defined, as is normally done. Instead, a system can be simulated in detail and injected with faults to determine its failure modes. The thesis describes how object-oriented design is used to incorporate this methodology into a general-purpose design and fault injection package called DEPEND. A software model is presented that uses abstractions of application programs to study the behavior and effect of software on hardware faults in the early design stage, when actual code is not available. Finally, an acceleration technique that combines hierarchical simulation, time acceleration algorithms, and hybrid simulation to reduce simulation time is introduced.

  8. Debugging and Logging Services for Defence Service Oriented Architectures

    DTIC Science & Technology

    2012-02-01

    Service: A software component and callable end point that provides a logically related set of operations, each of which performs a logical step in a ... important to note that in some cases, when the fault is identified to lie in uneditable code such as program libraries or outsourced software services, ... debugging is limited to characterisation of the fault, reporting it to the software or service provider, and development of work-arounds and management ...

  9. What does fault tolerant Deep Learning need from MPI?

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Amatya, Vinay C.; Vishnu, Abhinav; Siegel, Charles M.

    Deep Learning (DL) algorithms have become the de facto Machine Learning (ML) algorithm for large scale data analysis. DL algorithms are computationally expensive -- even distributed DL implementations which use MPI require days of training (model learning) time on commonly studied datasets. Long running DL applications become susceptible to faults -- requiring development of a fault tolerant system infrastructure, in addition to fault tolerant DL algorithms. This raises an important question: What is needed from MPI for designing fault tolerant DL implementations? In this paper, we address this problem for permanent faults. We motivate the need for a fault tolerant MPI specification by an in-depth consideration of recent innovations in DL algorithms and their properties, which drive the need for specific fault tolerance features. We present an in-depth discussion on the suitability of different parallelism types (model, data and hybrid); a need (or lack thereof) for check-pointing of any critical data structures; and most importantly, consideration for several fault tolerance proposals (user-level fault mitigation (ULFM), Reinit) in MPI and their applicability to fault tolerant DL implementations. We leverage a distributed memory implementation of Caffe, currently available under the Machine Learning Toolkit for Extreme Scale (MaTEx). We implement our approaches by extending MaTEx-Caffe to use a ULFM-based implementation. Our evaluation using the ImageNet dataset and the AlexNet neural network topology demonstrates the effectiveness of the proposed fault tolerant DL implementation using OpenMPI-based ULFM.

  10. Experimental evaluation of the certification-trail method

    NASA Technical Reports Server (NTRS)

    Sullivan, Gregory F.; Wilson, Dwight S.; Masson, Gerald M.; Itoh, Mamoru; Smith, Warren W.; Kay, Jonathan S.

    1993-01-01

    Certification trails are a recently introduced and promising approach to fault detection and fault tolerance. A comprehensive attempt to assess experimentally the performance and overall value of the method is reported. The method is applied to algorithms for the following problems: Huffman tree, shortest path, minimum spanning tree, sorting, and convex hull. Our results reveal many cases in which an approach using certification trails allows for significantly faster overall program execution time than a basic time-redundancy approach. Algorithms for the answer-validation problem for abstract data types were also examined. This kind of problem provides a basis for applying the certification-trail method to wide classes of algorithms. Answer-validation solutions for two types of priority queues were implemented and analyzed. In both cases, the algorithm which performs answer-validation is substantially faster than the original algorithm for computing the answer. Next, a probabilistic model and analysis which enable comparison between the certification-trail method and the time-redundancy approach were presented. The analysis reveals some substantial and sometimes surprising advantages for the certification-trail method. Finally, the work our group performed on the design and implementation of fault injection testbeds for experimental analysis of the certification trail technique is discussed. This work employs two distinct methodologies: software fault injection (modification of instruction, data, and stack segments of programs on a Sun Sparcstation ELC and on an IBM 386 PC) and hardware fault injection (control, address, and data lines of a Motorola MC68000-based target system pulsed at logical zero/one values). Our results indicate the viability of the certification trail technique. It is also believed that the tools developed provide a solid base for additional exploration.
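
    The flavor of the method is easy to see on the sorting example (a textbook sketch, not the paper's implementations): the first execution emits a trail (the sorting permutation) that lets a second execution validate the answer much more cheaply than recomputing it from scratch:

      def sort_with_trail(xs):
          """First execution: sort and also emit a trail (the permutation used)."""
          trail = sorted(range(len(xs)), key=lambda i: xs[i])
          return [xs[i] for i in trail], trail

      def check_with_trail(xs, result, trail):
          """Second execution: check the answer using the trail instead of re-sorting."""
          in_order = all(result[i] <= result[i + 1] for i in range(len(result) - 1))
          is_perm = sorted(trail) == list(range(len(xs)))
          matches = all(result[k] == xs[i] for k, i in enumerate(trail))
          return in_order and is_perm and matches

      data = [5, 3, 8, 1]
      result, trail = sort_with_trail(data)
      assert check_with_trail(data, result, trail)
      print(result, trail)   # [1, 3, 5, 8] with trail [3, 1, 0, 2]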

  11. Software Testbed for Developing and Evaluating Integrated Autonomous Subsystems

    NASA Technical Reports Server (NTRS)

    Ong, James; Remolina, Emilio; Prompt, Axel; Robinson, Peter; Sweet, Adam; Nishikawa, David

    2015-01-01

    To implement fault tolerant autonomy in future space systems, it will be necessary to integrate planning, adaptive control, and state estimation subsystems. However, integrating these subsystems is difficult, time-consuming, and error-prone. This paper describes Intelliface/ADAPT, a software testbed that helps researchers develop and test alternative strategies for integrating planning, execution, and diagnosis subsystems more quickly and easily. The testbed's architecture, graphical data displays, and implementations of the integrated subsystems support easy plug and play of alternate components to support research and development in fault-tolerant control of autonomous vehicles and operations support systems. Intelliface/ADAPT controls NASA's Advanced Diagnostics and Prognostics Testbed (ADAPT), which comprises batteries, electrical loads (fans, pumps, and lights), relays, circuit breakers, inverters, and sensors. During plan execution, an experimenter can inject faults into the ADAPT testbed by tripping circuit breakers, changing fan speed settings, and closing valves to restrict fluid flow. The diagnostic subsystem, based on NASA's Hybrid Diagnosis Engine (HyDE), detects and isolates these faults to determine the new state of the plant, ADAPT. Intelliface/ADAPT then updates its model of the ADAPT system's resources and determines whether the current plan can be executed using the reduced resources. If not, the planning subsystem generates a new plan that reschedules tasks, reconfigures ADAPT, and reassigns the use of ADAPT resources as needed to work around the fault. The resource model, planning domain model, and planning goals are expressed using NASA's Action Notation Modeling Language (ANML). Parts of the ANML model are generated automatically, and other parts are constructed by hand using the Planning Model Integrated Development Environment, a visual Eclipse-based IDE that accelerates ANML model development. Because native ANML planners are currently under development and not yet sufficiently capable, the ANML model is translated into the New Domain Definition Language (NDDL) and sent to NASA's EUROPA planning system for plan generation. The adaptive controller executes the new plan, using augmented, hierarchical finite state machines to select and sequence actions based on the state of the ADAPT system. Real-time sensor data, commands, and plans are displayed in information-dense arrays of timelines and graphs that zoom and scroll in unison. A dynamic schematic display uses color to show the real-time fault state and utilization of the system components and resources. An execution manager coordinates the activities of the other subsystems. The subsystems are integrated using the Internet Communications Engine (ICE), an object-oriented toolkit for building distributed applications.

  12. The Infeasibility of Quantifying the Reliability of Life-Critical Real-Time Software

    NASA Technical Reports Server (NTRS)

    Butler, Ricky W.; Finelli, George B.

    1991-01-01

    This paper affirms that the quantification of life-critical software reliability is infeasible using statistical methods, whether applied to standard software or fault-tolerant software. The classical methods of estimating reliability are shown to lead to exorbitant amounts of testing when applied to life-critical software. Reliability growth models are examined and also shown to be incapable of overcoming the need for excessive amounts of testing. The key assumption of software fault tolerance, that separately programmed versions fail independently, is shown to be problematic. This assumption cannot be justified by experimentation in the ultrareliability region, and subjective arguments in its favor are not sufficiently strong to justify it as an axiom. Also, the implications of the recent multiversion software experiments support this affirmation.
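
    The core quantitative point can be reproduced with a back-of-the-envelope calculation (illustrative numbers in the spirit of the paper's ultrareliability region, not figures taken from it):

      import math

      target_rate = 1e-9   # failures per hour, the ultrareliability regime
      confidence = 0.99

      # With zero failures observed over n test hours, P(no failure) =
      # exp(-rate * n); requiring this to drop below 1 - confidence gives:
      hours_needed = -math.log(1 - confidence) / target_rate
      print(f"{hours_needed:.3g} test hours = {hours_needed / 8766:.3g} years")
      # -> roughly 4.6e9 hours, about half a million years of testing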

  13. A Fault Oblivious Extreme-Scale Execution Environment

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    McKie, Jim

    The FOX project, funded under the ASCR X-stack I program, developed systems software and runtime libraries for a new approach to the data and work distribution for massively parallel, fault oblivious application execution. Our work was motivated by the premise that exascale computing systems will provide a thousand-fold increase in parallelism and a proportional increase in failure rate relative to today's machines. To deliver the capability of exascale hardware, the systems software must provide the infrastructure to support existing applications while simultaneously enabling efficient execution of new programming models that naturally express dynamic, adaptive, irregular computation; coupled simulations; and massive data analysis in a highly unreliable hardware environment with billions of threads of execution. Our OS research has prototyped new methods to provide efficient resource sharing, synchronization, and protection in a many-core compute node. We have experimented with alternative task/dataflow programming models and shown scalability in some cases to hundreds of thousands of cores. Much of our software is in active development through open source projects. Concepts from FOX are being pursued in next generation exascale operating systems. Our OS work focused on adaptive, application-tailored OS services optimized for multi- to many-core processors. We developed a new operating system, NIX, that supports role-based allocation of cores to processes and was released to open source. We contributed to the IBM FusedOS project, which promoted the concept of latency-optimized and throughput-optimized cores. We built a task queue library based on a distributed, fault tolerant key-value store and identified scaling issues. A second fault tolerant task parallel library was developed, based on the Linda tuple space model, that used low level interconnect primitives for optimized communication. We designed fault tolerance mechanisms for task parallel computations employing work stealing for load balancing that scaled to the largest existing supercomputers. Finally, we implemented the Elastic Building Blocks runtime, a library to manage object-oriented distributed software components. To support the research, we won two INCITE awards for time on Intrepid (BG/P) and Mira (BG/Q). Much of our work has had impact in the OS and runtime community through the ASCR Exascale OS/R workshop and report, leading to the research agenda of the Exascale OS/R program. Our project was, however, also affected by attrition of multiple PIs. While the PIs continued to participate and offer guidance as time permitted, losing these key individuals was unfortunate both for the project and for the DOE HPC community.

  14. Simulation-Based Probabilistic Seismic Hazard Assessment Using System-Level, Physics-Based Models: Assembling Virtual California

    NASA Astrophysics Data System (ADS)

    Rundle, P. B.; Rundle, J. B.; Morein, G.; Donnellan, A.; Turcotte, D.; Klein, W.

    2004-12-01

    The research community is rapidly moving towards the development of an earthquake forecast technology based on the use of complex, system-level earthquake fault system simulations. Using these topologically and dynamically realistic simulations, it is possible to develop ensemble forecasting methods similar to those used in weather and climate research. To effectively carry out such a program, one needs 1) a topologically realistic model to simulate the fault system; 2) data sets to constrain the model parameters through a systematic program of data assimilation; 3) a computational technology making use of modern paradigms of high performance and parallel computing systems; and 4) software to visualize and analyze the results. In particular, we focus attention on a new version of our code Virtual California (version 2001) in which we model all of the major strike slip faults in California, from the Mexico-California border to the Mendocino Triple Junction. Virtual California is a "backslip model", meaning that the long term rate of slip on each fault segment in the model is matched to the observed rate. We use the historic data set of earthquakes larger than magnitude M > 6 to define the frictional properties of 650 fault segments (degrees of freedom) in the model. To compute the dynamics and the associated surface deformation, we use message passing as implemented in the MPICH standard distribution on a Beowulf cluster consisting of more than 10 CPUs. We will also report results from implementing the code on significantly larger machines so that we can begin to examine much finer spatial scales of resolution and to assess scaling properties of the code. We present results of simulations both as static images and as mpeg movies, so that the dynamical aspects of the computation can be assessed by the viewer. We compute a variety of statistics from the simulations, including magnitude-frequency relations, and compare these with data from real fault systems. We report recent results on the use of Virtual California for probabilistic earthquake forecasting for several sub-groups of major faults in California. These methods have the advantage that system-level fault interactions are explicitly included, as well as laboratory-based friction laws.
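
    One of the statistics mentioned, the magnitude-frequency relation, can be sketched with a synthetic catalog and the standard Aki maximum-likelihood b-value estimate (the catalog below is simulated for illustration, not Virtual California output):

      import math, random

      random.seed(0)
      m_min = 6.0
      # Gutenberg-Richter magnitudes above m_min follow an exponential law;
      # a rate of ln(10) corresponds to b = 1.
      catalog = [m_min + random.expovariate(math.log(10)) for _ in range(5000)]

      mean_m = sum(catalog) / len(catalog)
      b_value = math.log10(math.e) / (mean_m - m_min)   # Aki (1965) estimator
      print(f"estimated b-value: {b_value:.2f}")        # ~1.0 for this catalog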

  15. Effectiveness of back-to-back testing

    NASA Technical Reports Server (NTRS)

    Vouk, Mladen A.; Mcallister, David F.; Eckhardt, David E.; Caglayan, Alper; Kelly, John P. J.

    1987-01-01

    Three models of back-to-back testing processes are described. Two models treat the case where there is no intercomponent failure dependence. The third model describes the more realistic case where there is correlation among the failure probabilities of the functionally equivalent components. The theory indicates that back-to-back testing can, under the right conditions, provide a considerable gain in software reliability. The models are used to analyze the data obtained in a fault-tolerant software experiment. It is shown that the expected gain is indeed achieved, and exceeded, provided the intercomponent failure dependence is sufficiently small. However, even with relatively high correlation, the use of several functionally equivalent components coupled with back-to-back testing may provide a considerable reliability gain. The implication of this finding is that multiversion software development is a feasible and cost effective approach to providing highly reliable software components intended for fault-tolerant software systems, on condition that special attention is directed at early detection and elimination of correlated faults.
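
    A minimal sketch of the underlying testing process (the two "versions" are stand-ins, one deliberately faulty, and exact output comparison is an assumption):

      import random

      def version_a(xs):
          return sorted(xs)

      def version_b(xs):                   # deliberately faulty stand-in version
          return sorted(xs)[:-1] if len(xs) > 5 else sorted(xs)

      def back_to_back(n_cases=1000, seed=0):
          """Drive both versions with random inputs and count disagreements."""
          rng = random.Random(seed)
          discrepancies = 0
          for _ in range(n_cases):
              xs = [rng.randint(0, 99) for _ in range(rng.randint(0, 10))]
              if version_a(xs) != version_b(xs):
                  discrepancies += 1       # a disagreement flags a fault in some version
          return discrepancies

      print(back_to_back(), "discrepancies in 1000 cases")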

  16. Providing an empirical basis for optimizing the verification and testing phases of software development

    NASA Technical Reports Server (NTRS)

    Briand, Lionel C.; Basili, Victor R.; Hetmanski, Christopher J.

    1992-01-01

    Applying equal testing and verification effort to all parts of a software system is not very efficient, especially when resources are limited and scheduling is tight. Therefore, one needs to be able to differentiate low/high fault density components so that the testing/verification effort can be concentrated where needed. Such a strategy is expected to detect more faults and thus improve the resulting reliability of the overall system. This paper presents an alternative approach for constructing such models, intended to fulfill specific software engineering needs (i.e., dealing with partial/incomplete information and creating models that are easy to interpret). Our approach to classification is as follows: (1) measure the software system to be considered; and (2) build multivariate stochastic models for prediction. We present experimental results obtained by classifying FORTRAN components developed at NASA/GSFC into two fault density classes: low and high. We also evaluate the accuracy of the model and the insights it provides into the software process.

  17. Fault Mitigation Schemes for Future Spaceflight Multicore Processors

    NASA Technical Reports Server (NTRS)

    Alexander, James W.; Clement, Bradley J.; Gostelow, Kim P.; Lai, John Y.

    2012-01-01

    Future planetary exploration missions demand significant advances in on-board computing capabilities over current avionics architectures based on a single-core processing element. State-of-the-art multi-core processors hold much promise in meeting such challenges, while introducing new fault tolerance problems when applied to space missions. Software-based schemes are presented in this paper that can achieve system-level fault mitigation beyond that provided by radiation-hard-by-design (RHBD). For mission- and time-critical applications such as Terrain Relative Navigation (TRN) for planetary or small body navigation and landing, a range of fault tolerance methods can be adapted by the application. The software methods being investigated include Error Correction Code (ECC) for data packet routing between cores, virtual network routing, Triple Modular Redundancy (TMR), and Algorithm-Based Fault Tolerance (ABFT). A robust fault tolerance framework that provides fail-operational behavior under hard real-time constraints and graceful degradation will be demonstrated using TRN executing on a commercial Tilera(R) processor with simulated fault injections.
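
    A self-contained sketch of the ABFT idea in the Huang-Abraham checksum style (an illustrative pure-Python toy; the flight implementation is of course very different): append a column-checksum row to A and a row-checksum column to B, multiply, and the checksums of the product detect and locate a corrupted element.

      def matmul(a, b):
          return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
                   for j in range(len(b[0]))] for i in range(len(a))]

      def locate_fault(c_full):
          """c_full is the (n+1) x (p+1) full-checksum product; return (row, col)
          of a single corrupted interior element, or None if checksums agree."""
          n, p = len(c_full) - 1, len(c_full[0]) - 1
          bad_rows = [i for i in range(n)
                      if abs(sum(c_full[i][:p]) - c_full[i][p]) > 1e-9]
          bad_cols = [j for j in range(p)
                      if abs(sum(c_full[i][j] for i in range(n)) - c_full[n][j]) > 1e-9]
          return (bad_rows[0], bad_cols[0]) if bad_rows and bad_cols else None

      a = [[1.0, 2.0], [3.0, 4.0]]
      b = [[5.0, 6.0], [7.0, 8.0]]
      a_cs = a + [[sum(col) for col in zip(*a)]]      # column-checksum row
      b_rs = [row + [sum(row)] for row in b]          # row-checksum column
      c_full = matmul(a_cs, b_rs)

      c_full[0][1] += 0.5                             # inject a fault
      print(locate_fault(c_full))                     # -> (0, 1)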

  18. Enhancements to the KATE model-based reasoning system

    NASA Technical Reports Server (NTRS)

    Thomas, Stan J.

    1994-01-01

    KATE (Knowledge-based Autonomous Test Engineer) is a model-based software system developed in the Artificial Intelligence Laboratory at the Kennedy Space Center for monitoring, fault detection, and control of launch vehicles and ground support systems. This report describes two software efforts which enhance the functionality and usability of KATE. The first addition, a flow solver, adds to KATE a tool for modeling the flow of liquid in a pipe system. The second addition adds support for editing KATE knowledge base files to the Emacs editor. The body of this report discusses design and implementation issues having to do with these two tools. It will be useful to anyone maintaining or extending either the flow solver or the editor enhancements.

  19. Numerical aerodynamic simulation facility feasibility study

    NASA Technical Reports Server (NTRS)

    1979-01-01

    There were three major issues examined in the feasibility study. First, the ability of the proposed system architecture to support the anticipated workload was evaluated. Second, the throughput of the computational engine (the flow model processor) was studied using real application programs. Third, the availability, reliability, and maintainability of the system were modeled. The evaluations were based on the baseline systems. The results show that the implementation of the Numerical Aerodynamic Simulation Facility, in the form considered, would indeed be a feasible project with an acceptable level of risk. The technology required (both hardware and software) either already exists or, in the case of a few parts, is expected to be announced this year. Facets of the work described include the hardware configuration, software, user language, and fault tolerance.

  20. Simplex GPS and InSAR Inversion Software

    NASA Technical Reports Server (NTRS)

    Donnellan, Andrea; Parker, Jay W.; Lyzenga, Gregory A.; Pierce, Marlon E.

    2012-01-01

    Changes in the shape of the Earth's surface can be routinely measured with precisions better than centimeters. Processes below the surface often drive these changes, and as a result investigators require models with inversion methods to characterize the sources. Simplex inverts any combination of GPS (global positioning system), UAVSAR (uninhabited aerial vehicle synthetic aperture radar), and InSAR (interferometric synthetic aperture radar) data simultaneously for elastic response from fault and fluid motions. It can be used to solve for multiple faults and parameters, all of which can be specified or allowed to vary. The software can be used to study long-term tectonic motions and the faults responsible for those motions, or to invert for co-seismic slip from earthquakes. Solutions involving estimation of fault motion and changes in fluid reservoirs such as magma or water are possible. Any arbitrary number of faults or parameters can be considered. Simplex specifically solves for any of location, geometry, fault slip, and expansion/contraction of a single fault or multiple faults. It inverts GPS and InSAR data for elastic dislocations in a half-space. Slip parameters include strike slip, dip slip, and tensile dislocations. It includes a map interface for both setting up the models and viewing the results. Results, including faults and observed, computed, and residual displacements, are output in text format and to a map interface, and can be exported to KML. The software interfaces with the QuakeTables database, allowing a user to select existing fault parameters or data. Simplex can be accessed through the QuakeSim portal graphical user interface or run from a UNIX command line.
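
    The inversion style can be illustrated with a hedged sketch: fit source parameters to surface displacements by minimizing a misfit with a downhill-simplex (Nelder-Mead) search. The forward model here is a made-up stand-in, not the elastic half-space dislocation solutions Simplex actually uses:

      import numpy as np
      from scipy.optimize import minimize

      def toy_forward(slip, depth, xs):
          """Stand-in 'surface displacement' from a buried source (not Okada)."""
          return slip * depth / (xs**2 + depth**2)

      xs = np.linspace(-50.0, 50.0, 101)   # hypothetical station positions (km)
      observed = toy_forward(2.0, 10.0, xs) \
          + np.random.default_rng(0).normal(0, 0.002, xs.size)

      def misfit(params):
          slip, depth = params
          return np.sum((toy_forward(slip, depth, xs) - observed) ** 2)

      fit = minimize(misfit, x0=[1.0, 5.0], method="Nelder-Mead")
      print(fit.x)                          # recovers roughly [2.0, 10.0]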

  1. Modeling and Performance Considerations for Automated Fault Isolation in Complex Systems

    NASA Technical Reports Server (NTRS)

    Ferrell, Bob; Oostdyk, Rebecca

    2010-01-01

    The purpose of this paper is to document the modeling considerations and performance metrics that were examined in the development of a large-scale Fault Detection, Isolation and Recovery (FDIR) system. The FDIR system is envisioned to perform health management functions for both a launch vehicle and the ground systems that support the vehicle during checkout and launch countdown by using a suite of complementary software tools that alert operators to anomalies and failures in real time. The FDIR team members developed a set of operational requirements for the models that would be used for fault isolation and worked closely with the vendor of the software tools selected for fault isolation to ensure that the software was able to meet the requirements. Once the requirements were established, example models of sufficient complexity were used to test the performance of the software. The results of the performance testing demonstrated the need for enhancements to the software in order to meet the demands of the full-scale ground and vehicle FDIR system. The paper highlights the importance of the development of operational requirements and preliminary performance testing as a strategy for identifying deficiencies in highly scalable systems and rectifying those deficiencies before they imperil the success of the project.

  2. Developing interpretable models with optimized set reduction for identifying high risk software components

    NASA Technical Reports Server (NTRS)

    Briand, Lionel C.; Basili, Victor R.; Hetmanski, Christopher J.

    1993-01-01

    Applying equal testing and verification effort to all parts of a software system is not very efficient, especially when resources are limited and scheduling is tight. Therefore, one needs to be able to differentiate low/high fault frequency components so that testing/verification effort can be concentrated where needed. Such a strategy is expected to detect more faults and thus improve the resulting reliability of the overall system. This paper presents the Optimized Set Reduction approach for constructing such models, intended to fulfill specific software engineering needs. Our approach to classification is to measure the software system and build multivariate stochastic models for predicting high risk system components. We present experimental results obtained by classifying Ada components into two classes: likely or not likely to generate faults during system and acceptance test. We also evaluate the accuracy of the model and the insights it provides into the error-making process.

  3. Secure it now or secure it later: the benefits of addressing cyber-security from the outset

    NASA Astrophysics Data System (ADS)

    Olama, Mohammed M.; Nutaro, James

    2013-05-01

    The majority of funding for research and development (R&D) in cyber-security is focused on the end of the software lifecycle, where systems have been deployed or are nearing deployment. Recruiting of cyber-security personnel is similarly focused on end-of-life expertise. By emphasizing cyber-security at these late stages, security problems are found and corrected when it is most expensive to do so, thus increasing the cost of owning and operating complex software systems. Worse, expenditures on expensive security measures often mean less money for innovative developments. These unwanted increases in cost and potential slowing of innovation are unavoidable consequences of an approach to security that finds and remediates faults after software has been implemented. We argue that software security can be improved and the total cost of a software system can be substantially reduced by an appropriate allocation of resources to the early stages of a software project. By adopting a similar allocation of R&D funds to the early stages of the software lifecycle, we propose that the costs of cyber-security can be better controlled and, consequently, the positive effects of this R&D on industry will be much more pronounced.

  4. Use of Soft Computing Technologies For Rocket Engine Control

    NASA Technical Reports Server (NTRS)

    Trevino, Luis C.; Olcmen, Semih; Polites, Michael

    2003-01-01

    The problem addressed in this paper is how Soft Computing Technologies (SCT) could be employed to further improve overall engine system reliability and performance. Specifically, this is presented by enhancing rocket engine control and engine health management (EHM) using SCT coupled with conventional control technologies, and sound software engineering practices used in Marshall's Flight Software Group. The principal goals are to improve software management, software development time and maintenance, processor execution, fault tolerance and mitigation, and nonlinear control in power level transitions. The intent is not to discuss any shortcomings of existing engine control and EHM methodologies, but to provide alternative design choices for control, EHM, implementation, performance, and sustaining engineering. The approaches outlined in this paper will require knowledge in the fields of rocket engine propulsion, software engineering for embedded systems, and soft computing technologies (i.e., neural networks, fuzzy logic, and Bayesian belief networks), much of which is presented in this paper. The first targeted demonstration rocket engine platform is the MC-1 (formerly FASTRAC Engine), which is simulated with hardware and software in the Marshall Avionics & Software Testbed laboratory that

  5. PDSS/IMC requirements and functional specifications

    NASA Technical Reports Server (NTRS)

    1983-01-01

    The system (software and hardware) requirements for the Payload Development Support System (PDSS)/Image Motion Compensator (IMC) are provided. The PDSS/IMC system provides the capability for performing Image Motion Compensator Electronics (IMCE) flight software test, checkout, and verification and provides the capability for monitoring the IMC flight computer system during qualification testing for fault detection and fault isolation.

  6. SIRU development. Volume 3: Software description and program documentation

    NASA Technical Reports Server (NTRS)

    Oehrle, J.

    1973-01-01

    The development and initial evaluation of a strapdown inertial reference unit (SIRU) system are discussed. The SIRU configuration is a modular inertial subsystem with hardware and software features that achieve fault tolerant operational capabilities. The SIRU redundant hardware design is formulated about a six gyro and six accelerometer instrument module package. The six-axis array provides redundant independent sensing, and the symmetry enables the formulation of an optimal software redundant data processing structure with self-contained fault detection and isolation (FDI) capabilities. The basic SIRU software coding system used in the DDP-516 computer is documented.
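
    The redundant-axis FDI idea can be sketched as follows (an illustrative toy: the sensor geometry, noise-free data, and threshold are assumptions, not the actual SIRU dodecahedral layout). Six skewed single-axis sensors measure a 3-vector; the least-squares fit leaves a residual that both flags and isolates a failed instrument:

      import numpy as np

      # Measurement matrix: each row is one sensor's sensing direction.
      H = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1],
                    [0.577, 0.577, 0.577],
                    [0.577, -0.577, 0.577],
                    [-0.577, 0.577, 0.577]])
      truth = np.array([0.1, -0.2, 0.05])            # true angular rate
      m = H @ truth
      m[4] += 0.3                                    # sensor 4 fails hard

      x_hat, *_ = np.linalg.lstsq(H, m, rcond=None)  # least-squares rate estimate
      residual = m - H @ x_hat                       # parity residual
      if np.abs(residual).max() > 0.05:              # detection threshold (assumed)
          print("failed sensor:", int(np.abs(residual).argmax()))   # -> 4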

  7. Towards Certification of a Space System Application of Fault Detection and Isolation

    NASA Technical Reports Server (NTRS)

    Feather, Martin S.; Markosian, Lawrence Z.

    2008-01-01

    Advanced fault detection, isolation and recovery (FDIR) software is being investigated at NASA as a means to improve the reliability and availability of its space systems. Certification is a critical step in the acceptance of such software. Its attainment hinges on performing the necessary verification and validation to show that the software will fulfill its requirements in the intended setting. Presented herein is our ongoing work to plan for the certification of a pilot application of advanced FDIR software in a NASA setting. We describe the application, and the key challenges and opportunities it offers for certification.

  8. Development and evaluation of a Fault-Tolerant Multiprocessor (FTMP) computer. Volume 3: FTMP test and evaluation

    NASA Technical Reports Server (NTRS)

    Lala, J. H.; Smith, T. B., III

    1983-01-01

    The experimental test and evaluation of the Fault-Tolerant Multiprocessor (FTMP) is described. Major objectives of this exercise include expanding the validation envelope, building confidence in the system, revealing any weaknesses in the architectural concepts and in their execution in hardware and software, and, in general, stressing the hardware and software. To this end, pin-level faults were injected into one LRU of the FTMP, and the FTMP response was measured in terms of fault detection, isolation, and recovery times. A total of 21,055 stuck-at-0, stuck-at-1 and invert-signal faults were injected in the CPU, memory, bus interface circuits, Bus Guardian Units, and voters and error latches. Of these, 17,418 were detected. At least 80 percent of the undetected faults are estimated to be on unused pins. The multiprocessor identified all detected faults correctly and recovered successfully in each case. Total recovery time for all faults averaged a little over one second. This can be reduced to half a second by including appropriate self-tests.

  9. Fault Tolerant Software Technology for Distributed Computer Systems

    DTIC Science & Technology

    1989-03-01

    RADC-TR-88-296, Final Technical Report, 1989 ... FAULT TOLERANT SOFTWARE TECHNOLOGY FOR DISTRIBUTED COMPUTER SYSTEMS, Georgia Institute of Technology ... "Fault Tolerant Software Technology for Distributed Computing Systems," a two year effort performed at Georgia Institute of Technology as part of the Clouds Project. The Clouds

  10. Petascale computation of multi-physics seismic simulations

    NASA Astrophysics Data System (ADS)

    Gabriel, Alice-Agnes; Madden, Elizabeth H.; Ulrich, Thomas; Wollherr, Stephanie; Duru, Kenneth C.

    2017-04-01

    Capturing the observed complexity of earthquake sources in concurrence with seismic wave propagation simulations is an inherently multi-scale, multi-physics problem. In this presentation, we present simulations of earthquake scenarios resolving high-detail dynamic rupture evolution and high frequency ground motion. The simulations combine a multitude of representations of model complexity, such as non-linear fault friction, thermal and fluid effects, heterogeneous fault stress and fault strength initial conditions, fault curvature and roughness, and on- and off-fault non-elastic failure to capture dynamic rupture behavior at the source; and seismic wave attenuation, 3D subsurface structure, and bathymetry impacting seismic wave propagation. Performing such scenarios at the necessary spatio-temporal resolution requires highly optimized and massively parallel simulation tools which can efficiently exploit HPC facilities. Our up to multi-PetaFLOP simulations are performed with SeisSol (www.seissol.org), an open-source software package based on an ADER-Discontinuous Galerkin (DG) scheme solving the seismic wave equations in velocity-stress formulation in elastic, viscoelastic, and viscoplastic media with high-order accuracy in time and space. Our flux-based implementation of frictional failure remains free of spurious oscillations. Tetrahedral unstructured meshes allow for complicated model geometry. SeisSol has been optimized on all software levels, including: assembler-level DG kernels which obtain 50% peak performance on some of the largest supercomputers worldwide; an overlapping MPI-OpenMP parallelization shadowing the multiphysics computations; usage of local time stepping; and parallel input and output schemes and direct interfaces to community standard data formats. All these factors aim to minimise the time-to-solution. The results presented highlight the fact that modern numerical methods and hardware-aware optimization for modern supercomputers are essential to further our understanding of earthquake source physics and complement both physics-based ground motion research and empirical approaches in seismic hazard analysis. Lastly, we will conclude with an outlook on future exascale ADER-DG solvers for seismological applications.

  11. Study on nondestructive detection system based on x-ray for wire ropes conveyer belt

    NASA Astrophysics Data System (ADS)

    Miao, Changyun; Shi, Boya; Wan, Peng; Li, Jie

    2008-03-01

    A nondestructive detection system for wire-rope conveyer belts, based on X-ray detection technology, is designed. In this paper the X-ray detection principle is analyzed and a design scheme for the system is presented; image processing of the conveyer belt is investigated and image processing algorithms are given; the X-ray acquisition receiving board is designed using an FPGA and a DSP; and the system software is programmed in C#.NET on the WINXP/WIN2000 platform. Experiments indicate that the system can perform remote real-time inspection of wire-rope conveyer belt images, find faults, and raise alarms in time. The system is intuitive, operates in real time, and is highly accurate. It can be used for fault detection of wire-rope conveyer belts in mines, ports, terminals and other fields.

  12. Evaluating Predictive Models of Software Quality

    NASA Astrophysics Data System (ADS)

    Ciaschini, V.; Canaparo, M.; Ronchieri, E.; Salomoni, D.

    2014-06-01

    Applications from the High Energy Physics scientific community are constantly growing and are implemented by a large number of developers. This implies a strong churn on the code and an associated risk of faults, which is unavoidable as long as the software undergoes active evolution. However, the necessities of production systems run counter to this. Stability and predictability are of paramount importance; in addition, a short turn-around time for the defect discovery-correction-deployment cycle is required. A way to reconcile these opposite foci is to use a software quality model to obtain an approximation of the risk before releasing a program, and to only deliver software with a risk lower than an agreed threshold. In this article we evaluated two quality predictive models to identify the operational risk and the quality of some software products. We applied these models to the development history of several EMI packages with the intent to discover the risk factor of each product and compare it with its real history. We attempted to determine whether the models reasonably map reality for the applications under evaluation, and finally we conclude by suggesting directions for further studies.

  13. Cost-Sensitive Radial Basis Function Neural Network Classifier for Software Defect Prediction

    PubMed Central

    Venkatesan, R.

    2016-01-01

    Effective prediction of defect-prone software modules will enable software developers to achieve efficient allocation of resources and to concentrate on quality assurance activities. The software development life cycle basically includes design, analysis, implementation, testing, and release phases. Software testing is a critical task in the software development process, wherein the aim is to save time and budget by detecting defects as early as possible and to deliver a defect-free (bug-free) product to the customers. This testing phase should be operated carefully and effectively in order to release a defect-free software product. To improve the software testing process, fault prediction methods identify the software parts that are more likely to be defect-prone. This paper proposes a prediction approach based on a conventional radial basis function neural network (RBFNN) and the novel adaptive dimensional biogeography based optimization (ADBBO) model. The developed ADBBO-based RBFNN model is tested with five publicly available datasets from the NASA data program repository. The computed results prove the effectiveness of the proposed ADBBO-RBFNN classifier approach with respect to the considered metrics in comparison with the early predictors available in the literature for the same datasets. PMID:27738649
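
    A minimal RBFNN classifier sketch (numpy) may help fix ideas; the paper's contribution is tuning centers and widths with ADBBO, which is replaced here by randomly sampled centers and a fixed width, and the data are synthetic stand-ins for software metrics:

      import numpy as np

      rng = np.random.default_rng(0)

      def rbf_features(X, centers, width):
          """Gaussian hidden-layer activations for each sample/center pair."""
          d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
          return np.exp(-d2 / (2 * width ** 2))

      # Toy "metrics" data: two clusters = non-defective vs. defect-prone modules.
      X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(3, 1, (50, 3))])
      y = np.array([0] * 50 + [1] * 50)

      centers = X[rng.choice(len(X), 10, replace=False)]   # assumed: random centers
      Phi = rbf_features(X, centers, width=1.5)
      w, *_ = np.linalg.lstsq(Phi, y, rcond=None)          # output-layer weights

      pred = (rbf_features(X, centers, 1.5) @ w > 0.5).astype(int)
      print("training accuracy:", (pred == y).mean())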

  14. Cost-Sensitive Radial Basis Function Neural Network Classifier for Software Defect Prediction.

    PubMed

    Kumudha, P; Venkatesan, R

    Effective prediction of defect-prone software modules will enable software developers to achieve efficient allocation of resources and to concentrate on quality assurance activities. The software development life cycle basically includes design, analysis, implementation, testing, and release phases. Software testing is a critical task in the software development process, wherein the aim is to save time and budget by detecting defects as early as possible and to deliver a defect-free (bug-free) product to the customers. This testing phase should be operated carefully and effectively in order to release a defect-free software product. To improve the software testing process, fault prediction methods identify the software parts that are more likely to be defect-prone. This paper proposes a prediction approach based on a conventional radial basis function neural network (RBFNN) and the novel adaptive dimensional biogeography based optimization (ADBBO) model. The developed ADBBO-based RBFNN model is tested with five publicly available datasets from the NASA data program repository. The computed results prove the effectiveness of the proposed ADBBO-RBFNN classifier approach with respect to the considered metrics in comparison with the early predictors available in the literature for the same datasets.

  15. Staged-Fault Testing of Distance Protection Relay Settings

    NASA Astrophysics Data System (ADS)

    Havelka, J.; Malarić, R.; Frlan, K.

    2012-01-01

    In order to analyze the operation of the protection system during induced fault testing in the Croatian power system, a simulation using the CAPE software has been performed. The CAPE software (Computer-Aided Protection Engineering) is expert software intended primarily for relay protection engineers, which calculates current and voltage values during faults in the power system, so that relay protection devices can be properly set up. Once the accuracy of the simulation model had been confirmed, a series of simulations were performed in order to obtain the optimal fault location to test the protection system. The simulation results were used to specify the test sequence definitions for the end-to-end relay testing using advanced testing equipment with GPS synchronization for secondary injection in protection schemes based on communication. The objective of the end-to-end testing was to perform field validation of the protection settings, including verification of the circuit breaker operation, telecommunication channel time and the effectiveness of the relay algorithms. Once the end-to-end secondary injection testing had been completed, the induced fault testing was performed with three-end lines loaded and in service. This paper describes and analyses the test procedure, consisting of CAPE simulations, end-to-end test with advanced secondary equipment and staged-fault test of a three-end power line in the Croatian transmission system.

  16. Functional requirements for an intelligent RPC. [remote power controller for spaceborne electrical distribution system

    NASA Technical Reports Server (NTRS)

    Aucoin, B. M.; Heller, R. P.

    1990-01-01

    An intelligent remote power controller (RPC) based on microcomputer technology can implement advanced functions for the accurate and secure detection of all types of faults on a spaceborne electrical distribution system. The intelligent RPC will implement conventional protection functions such as overcurrent, under-voltage, and ground fault protection. Advanced functions for the detection of soft faults, which cannot presently be detected, can also be implemented. Adaptive overcurrent protection changes overcurrent settings based on connected load. Incipient and high-impedance fault detection provides early detection of arcing conditions to prevent fires, and to clear and reconfigure circuits before soft faults progress to a hard-fault condition. Power electronics techniques can be used to implement fault current limiting to prevent voltage dips during hard faults. It is concluded that these techniques will enhance the overall safety and reliability of the distribution system.
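
    The adaptive overcurrent idea can be sketched as follows (the margin and trip delay are illustrative assumptions, not values from the paper): the trip threshold tracks the currently connected load rather than a fixed worst-case rating.

      def make_adaptive_trip(margin=1.5, trip_delay=3):
          over_count = 0
          def step(current, connected_load_current):
              nonlocal over_count
              threshold = margin * connected_load_current   # adapts to connected load
              over_count = over_count + 1 if current > threshold else 0
              return over_count >= trip_delay               # trip only on sustained excess
          return step

      rpc = make_adaptive_trip()
      samples = [(10, 10), (11, 10), (25, 10), (26, 10), (27, 10)]  # (amps, load amps)
      for current, load in samples:
          if rpc(current, load):
              print("trip at", current, "A")   # fires on the third consecutive excess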

  17. The tectonic origin of the Aurora and Concordia Trenches, Dome C area, East Antarctica

    NASA Astrophysics Data System (ADS)

    Cianfarra, P.; Bianchi, C.; Forieri, A.; Salvini, F.; Tabacco, I. E.

    2003-04-01

    The bedrock below the Ice Cap in the Dome C area, East Antarctica, is characterised by a series of elongated depressions separating ridges, with the Aurora and Concordia Trenches representing the major depressions. At these depressions the ice cap reaches a thickness of over 4000 m, leaving open the possibility of water deposits at their bottoms. The well known Lake Vostok represents by far the largest and most famous of these structures. The relatively young age of the Antarctic Ice Cap, about 38 Ma, compared with the old, Mesozoic age of the former continental landscape, constrains the age of these structures to Cenozoic times. The Aurora and Concordia trenches show a characteristic asymmetric shape, difficult to explain by erosional processes alone. On the other hand, this asymmetric shape is typical of morphologies resulting from fault activity, and specifically from the presence of active normal faults with planes of variable dip. The bedrock morphologies at these trenches were compared with normal faulting processes through a series of numerical models to evaluate the possibility of a tectonic origin. Modelling of the bedrock morphology was simulated by the Hybrid Cellular Automata (HCA) method through the Forc2D software implementation. Within the Italian PNRA (Programma Nazionale Ricerche in Antartide), a series of airborne radar surveys was performed in the Lake Vostok-Dome C region in the last decade. Four meaningful bedrock profiles were selected to provide, as closely as possible, across-strike sections of the Aurora and Concordia trenches. The optimal orientation was then achieved by projecting the data along a perfectly across-strike trajectory. In this way it was possible to simulate the faulting as a cylindrical deformation, suitable to be modelled by 2D software. Two sections were prepared for each trench and the same fault setting was applied to each pair. The match was obtained by a forward modelling approach, in that the fault trace and the displacement were tuned until a satisfactory match was obtained. The obtained results confirm the feasibility of a tectonic origin for the Aurora and Concordia trenches.

  18. A Vehicle Management End-to-End Testing and Analysis Platform for Validation of Mission and Fault Management Algorithms to Reduce Risk for NASA's Space Launch System

    NASA Technical Reports Server (NTRS)

    Trevino, Luis; Patterson, Jonathan; Teare, David; Johnson, Stephen

    2015-01-01

    The engineering development of the new Space Launch System (SLS) launch vehicle requires cross discipline teams with extensive knowledge of launch vehicle subsystems, information theory, and autonomous algorithms dealing with all operations from pre-launch through on-orbit operations. The characteristics of these spacecraft systems must be matched with the autonomous algorithm monitoring and mitigation capabilities for accurate control and response to abnormal conditions throughout all vehicle mission flight phases, including precipitating safing actions and crew aborts. This presents a large and complex systems engineering challenge, which is being addressed in part by focusing on the specific subsystems involved in the handling of off-nominal mission and fault tolerance with response management. Using traditional model based system and software engineering design principles from the Unified Modeling Language (UML) and Systems Modeling Language (SysML), the Mission and Fault Management (M&FM) algorithms for the vehicle are crafted and vetted in specialized Integrated Development Teams (IDTs) composed of multiple development disciplines such as Systems Engineering (SE), Flight Software (FSW), Safety and Mission Assurance (S&MA) and the major subsystems and vehicle elements such as Main Propulsion Systems (MPS), boosters, avionics, Guidance, Navigation, and Control (GNC), Thrust Vector Control (TVC), and liquid engines. These model based algorithms and their development lifecycle from inception through flight software certification are an important focus of this development effort to further ensure reliable detection and response to off-nominal vehicle states during all phases of vehicle operation from pre-launch through end of flight. NASA formed a dedicated M&FM team for addressing fault management early in the development lifecycle for the SLS initiative. As part of the development of the M&FM capabilities, this team has developed a dedicated testbed that integrates specific M&FM algorithms, specialized nominal and off-nominal test cases, and vendor-supplied physics-based launch vehicle subsystem models. Additionally, the team has developed processes for implementing and validating these algorithms for concept validation and risk reduction for the SLS program. The flexibility of the Vehicle Management End-to-end Testbed (VMET) enables thorough testing of the M&FM algorithms by providing configurable suites of both nominal and off-nominal test cases to validate the developed algorithms utilizing actual subsystem models such as MPS. The intent of VMET is to validate the M&FM algorithms and substantiate them with performance baselines for each of the target vehicle subsystems in an independent platform exterior to the flight software development infrastructure and its related testing entities. In any software development process there is inherent risk in the interpretation and implementation of concepts into software through requirements and test cases into flight software, compounded with potential human errors throughout the development lifecycle. Risk reduction is addressed by the M&FM analysis group working with other organizations such as S&MA, Structures and Environments, GNC, Orion, the Crew Office, Flight Operations, and Ground Operations by assessing performance of the M&FM algorithms in terms of their ability to reduce Loss of Mission and Loss of Crew probabilities.
In addition, through state machine and diagnostic modeling, analysis efforts investigate a broader suite of failure effects and associated detections and responses that can be tested in VMET to ensure that failures can be detected, and to confirm that responses do not create additional risks or cause undesired states through interactive dynamic effects with other algorithms and systems. VMET further contributes to risk reduction by prototyping and exercising the M&FM algorithms early in their implementation and without inherent hindrances such as meeting FSW processor scheduling constraints due to their target platform (an ARINC 653 partitioned OS), resource limitations, and other factors related to integration with subsystems not directly involved with M&FM, such as telemetry packing and processing. The baseline plan for use of VMET encompasses testing the original M&FM algorithms coded in the same C++ language and state machine architectural concepts as used by Flight Software. This enables the development of performance standards and test cases to characterize the M&FM algorithms and sets a benchmark from which to measure the effectiveness of M&FM algorithm performance in the FSW development and test processes.
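
    As a loose illustration of the state-machine style of fault management exercised in VMET (the real M&FM algorithms are C++ and far richer; the states, events, and responses below are generic placeholders):

      # Hypothetical fault-to-response table; not actual SLS M&FM content.
      FAULT_RESPONSES = {"mps_low_pressure": "switch_to_backup", "tvc_stuck": "abort"}

      def mfm_step(state, event):
          """One transition of a nominal -> detected -> response state machine."""
          if state == "nominal" and event in FAULT_RESPONSES:
              return "fault_detected"
          if state == "fault_detected":
              return FAULT_RESPONSES.get(event, "safing")   # default to safing
          return state

      state = "nominal"
      for event in ["telemetry_ok", "mps_low_pressure", "mps_low_pressure"]:
          state = mfm_step(state, event)
          print(event, "->", state)
      # nominal telemetry leaves the state alone; the first fault event is
      # flagged, and its confirmation selects the tabulated response.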

  19. Finite-Fault and Other New Capabilities of CISN ShakeAlert

    NASA Astrophysics Data System (ADS)

    Boese, M.; Felizardo, C.; Heaton, T. H.; Hudnut, K. W.; Hauksson, E.

    2013-12-01

    Over the past 6 years, scientists at Caltech, UC Berkeley, the Univ. of Southern California, the Univ. of Washington, the US Geological Survey, and ETH Zurich (Switzerland) have developed the 'ShakeAlert' earthquake early warning demonstration system for California and the Pacific Northwest. We have now started to transform this system into a stable end-to-end production system that will be integrated into the daily routine operations of the CISN and PNSN networks. To quickly determine the earthquake magnitude and location, ShakeAlert currently processes and interprets real-time data streams from several hundred seismic stations within the California Integrated Seismic Network (CISN) and the Pacific Northwest Seismic Network (PNSN). Based on these parameters, the 'UserDisplay' software predicts and displays the arrival and intensity of shaking at a given user site. Real-time ShakeAlert feeds are currently being shared with around 160 individuals, companies, and emergency response organizations to gather feedback about the system performance, to educate potential users about EEW, and to identify needs and applications of EEW in a future operational warning system. To improve the performance during large earthquakes (M>6.5), we have started to develop, implement, and test a number of new algorithms for the ShakeAlert system: the 'FinDer' (Finite Fault Rupture Detector) algorithm provides real-time estimates of locations and extents of finite-fault ruptures from high-frequency seismic data; the 'GPSlip' algorithm estimates the fault slip along these ruptures using high-rate real-time GPS data; and, third, a new type of ground-motion prediction model derived from over 415,000 rupture simulations along active faults in southern California improves MMI intensity predictions for large earthquakes with consideration of finite-fault, rupture directivity, and basin response effects. FinDer and GPSlip are currently being tested, both in real time and offline, in a separate internal ShakeAlert installation at Caltech. Real-time position and displacement time series from around 100 GPS sensors are obtained in JSON format from RTK/PPP(AR) solutions using the RTNet software at USGS Pasadena. However, we have also started to investigate the use of onsite (in-receiver) processing using NetR9 with RTX and tracebuf2 output format. A number of changes to the ShakeAlert processing, the XML message format, and the usage of this information in the UserDisplay software were necessary to handle the new finite-fault and slip information from the FinDer and GPSlip algorithms. In addition, we have developed a framework for end-to-end off-line testing with archived and simulated waveform data using the Earthworm tankplayer. Detailed background information about the algorithms, processing, and results from these test runs will be presented.

  20. Tutorial: Advanced fault tree applications using HARP

    NASA Technical Reports Server (NTRS)

    Dugan, Joanne Bechta; Bavuso, Salvatore J.; Boyd, Mark A.

    1993-01-01

    Reliability analysis of fault tolerant computer systems for critical applications is complicated by several factors. These modeling difficulties are discussed and dynamic fault tree modeling techniques for handling them are described and demonstrated. Several advanced fault tolerant computer systems are described, and fault tree models for their analysis are presented. HARP (Hybrid Automated Reliability Predictor) is a software package developed at Duke University and NASA Langley Research Center that is capable of solving the fault tree models presented.
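
    For orientation, a minimal static fault-tree evaluator is sketched below; HARP's value lies precisely in the dynamic gates, imperfect coverage, and Markov solution techniques that such a sketch omits (independent basic events are assumed):

      def failure_prob(node, basic):
          """Evaluate a fault tree of ('and'/'or', children) and ('event', name)."""
          kind = node[0]
          if kind == "event":
              return basic[node[1]]
          probs = [failure_prob(child, basic) for child in node[1]]
          if kind == "and":                    # all children must fail
              p = 1.0
              for q in probs:
                  p *= q
              return p
          if kind == "or":                     # any child failing suffices
              p = 1.0
              for q in probs:
                  p *= (1.0 - q)
              return 1.0 - p
          raise ValueError(kind)

      # System fails if both redundant CPUs fail, or the single supply fails.
      tree = ("or", [("and", [("event", "cpu_a"), ("event", "cpu_b")]),
                     ("event", "power")])
      print(failure_prob(tree, {"cpu_a": 1e-3, "cpu_b": 1e-3, "power": 1e-5}))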

  1. Experiments in fault tolerant software reliability

    NASA Technical Reports Server (NTRS)

    Mcallister, David F.; Vouk, Mladen A.

    1989-01-01

    Twenty functionally equivalent programs were built and tested in a multiversion software experiment. Following unit testing, all programs were subjected to an extensive system test. In the process, sixty-one distinct faults were identified among the versions. Less than 12 percent of the faults exhibited varying degrees of positive correlation. The common-cause (or similar) faults spanned as many as 14 components. However, a majority of these faults were trivial, and easily detected by proper unit and/or system testing. Only two of the seven similar faults were difficult faults, and both were caused by specification ambiguities. One of these faults exhibited a variable identical-and-wrong response span, i.e., a response span which varied with the testing conditions and input data. Techniques that could have been used to avoid the faults are discussed. For example, it was determined that back-to-back testing of 2-tuples could have been used to eliminate about 90 percent of the faults. In addition, four of the seven similar faults could have been detected by using back-to-back testing of 5-tuples. It is believed that most, if not all, similar faults could have been avoided had the specifications been written using more formal notation, had the unit testing phase been subject to more stringent standards and controls, and had better tools for measuring the quality and adequacy of the test data (e.g., coverage) been used.

  2. Computing and Visualizing the Complex Dynamics of Earthquake Fault Systems: Towards Ensemble Earthquake Forecasting

    NASA Astrophysics Data System (ADS)

    Rundle, J.; Rundle, P.; Donnellan, A.; Li, P.

    2003-12-01

    We consider the problem of the complex dynamics of earthquake fault systems, and whether numerical simulations can be used to define an ensemble forecasting technology similar to that used in weather and climate research. To effectively carry out such a program, we need 1) a topologically realistic model to simulate the fault system; 2) data sets to constrain the model parameters through a systematic program of data assimilation; 3) a computational technology making use of modern paradigms of high performance and parallel computing systems; and 4) software to visualize and analyze the results. In particular, we focus attention on a new version of our code Virtual California (version 2001) in which we model all of the major strike slip faults extending throughout California, from the Mexico-California border to the Mendocino Triple Junction. We use the historic data set of earthquakes larger than magnitude M > 6 to define the frictional properties of all 654 fault segments (degrees of freedom) in the model. Previous versions of Virtual California had used only 215 fault segments to model the strike slip faults in southern California. To compute the dynamics and the associated surface deformation, we use message passing as implemented in the MPICH standard distribution on a small Beowulf cluster consisting of 10 CPUs. We are also planning to run the code on significantly larger machines so that we can begin to examine much finer spatial scales of resolution, and to assess scaling properties of the code. We present results of simulations both as static images and as mpeg movies, so that the dynamical aspects of the computation can be assessed by the viewer. We also compute a variety of statistics from the simulations, including magnitude-frequency relations, and compare these with data from real fault systems.

  3. An experimental investigation of fault tolerant software structures in an avionics application

    NASA Technical Reports Server (NTRS)

    Caglayan, Alper K.; Eckhardt, Dave E., Jr.

    1989-01-01

    The objective of this experimental investigation is to compare the functional performance and software reliability of competing fault tolerant software structures utilizing software diversity. In this experiment, three versions of the redundancy management software for a skewed sensor array have been developed using three diverse failure detection and isolation algorithms and incorporated into various N-version, recovery block and hybrid software structures. The empirical results show that, for maximum functional performance improvement in the selected application domain, the results of diverse algorithms should be voted before being processed by multiple versions without enforced diversity. Results also suggest that when the reliability gain with an N-version structure is modest, recovery block structures are more feasible since higher reliability can be obtained using an acceptance check with a modest reliability.

  4. Using recurrence plot analysis for software execution interpretation and fault detection

    NASA Astrophysics Data System (ADS)

    Mosdorf, M.

    2015-09-01

    This paper shows a method targeted at software execution interpretation and fault detection using recurrence plot analysis. In the proposed approach, recurrence plot analysis is applied to a software execution trace that contains the executed assembly instructions. The results of this analysis are subject to further processing with the PCA (Principal Component Analysis) method, which reduces the number of coefficients used for software execution classification. This method was used for the analysis of five algorithms: Bubble Sort, Quick Sort, Median Filter, FIR, SHA-1. Results show that some of the collected traces could be easily assigned to particular algorithms (logs from the Bubble Sort and FIR algorithms) while others are more difficult to distinguish.
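
    A hedged sketch of the described pipeline follows: it builds a recurrence matrix from a numerically encoded execution trace and reduces per-trace recurrence features with PCA. The threshold, the instruction encoding, and the feature choice are illustrative assumptions, not the paper's actual parameters.

        # Illustrative recurrence-plot + PCA pipeline for execution traces.
        import numpy as np
        from sklearn.decomposition import PCA

        def recurrence_matrix(trace, eps=0.5):
            """R[i, j] = 1 where trace points i and j are within eps."""
            t = np.asarray(trace, dtype=float)
            return (np.abs(t[:, None] - t[None, :]) <= eps).astype(int)

        # Stand-in traces: each value is a numeric encoding of an instruction.
        traces = [np.random.default_rng(s).integers(0, 16, 200) for s in range(5)]
        # One simple feature vector per trace: per-point recurrence rates.
        features = np.array([recurrence_matrix(t).mean(axis=1) for t in traces])
        # Reduce to two coefficients per trace for classification/plotting.
        components = PCA(n_components=2).fit_transform(features)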

  5. Design of a fault tolerant airborne digital computer. Volume 1: Architecture

    NASA Technical Reports Server (NTRS)

    Wensley, J. H.; Levitt, K. N.; Green, M. W.; Goldberg, J.; Neumann, P. G.

    1973-01-01

    This volume is concerned with the architecture of a fault tolerant digital computer for an advanced commercial aircraft. All of the computations of the aircraft, including those presently carried out by analogue techniques, are to be carried out in this digital computer. Among the important qualities of the computer are the following: (1) The capacity is to be matched to the aircraft environment. (2) The reliability is to be selectively matched to the criticality and deadline requirements of each of the computations. (3) The system is to be readily expandable and contractible. (4) The design is to be appropriate to post-1975 technology. Three candidate architectures are discussed and assessed in terms of the above qualities. Of the three candidates, a newly conceived architecture, Software Implemented Fault Tolerance (SIFT), provides the best match to the above qualities. In addition, SIFT is particularly simple and believable. The other candidates, the Bus Checker System (BUCS), also newly conceived in this project, and the Hopkins multiprocessor, are potentially more efficient than SIFT in the use of redundancy but otherwise are not as attractive.

  6. Research in computer science

    NASA Technical Reports Server (NTRS)

    Ortega, J. M.

    1984-01-01

    Several short summaries of the work performed during this reporting period are presented. Topics discussed in this document include: (1) resilient seeded errors via simple techniques; (2) knowledge representation for engineering design; (3) analysis of faults in a multiversion software experiment; (4) implementation of a parallel programming environment; (5) symbolic execution of concurrent programs; (6) two computer graphics systems for visualization of pressure distributions and convective density particles; (7) design of a source code management system; (8) vectorizing incomplete conjugate gradient on the Cyber 203/205; (9) extensions of domain testing theory; and (10) a performance analyzer for the PISCES system.

  7. Scalable and reusable emulator for evaluating the performance of SS7 networks

    NASA Astrophysics Data System (ADS)

    Lazar, Aurel A.; Tseng, Kent H.; Lim, Koon Seng; Choe, Winston

    1994-04-01

    A scalable and reusable emulator was designed and implemented for studying the behavior of SS7 networks. The emulator design was largely based on public domain software. It was developed on top of an environment supported by PVM, the Parallel Virtual Machine, and managed by OSIMIS, the OSI Management Information Service platform. The emulator runs on top of a commercially available ATM LAN interconnecting engineering workstations. As a case study for evaluating the emulator, the behavior of the Singapore National SS7 Network under fault and unbalanced loading conditions was investigated.

  8. Slicken 1.0: Program for calculating the orientation of shear on reactivated faults

    NASA Astrophysics Data System (ADS)

    Xu, Hong; Xu, Shunshan; Nieto-Samaniego, Ángel F.; Alaniz-Álvarez, Susana A.

    2017-07-01

    The slip vector on a fault is an important parameter in the study of the movement history of a fault and its faulting mechanism. Although there exist many graphical programs to represent the shear stress (or slickenline) orientations on faults, programs to quantitatively calculate the orientation of fault slip based on a given stress field are scarce. In consequence, we developed Slicken 1.0, a software program that rapidly calculates the orientation of maximum shear stress on any fault plane. For this direct method of calculating the resolved shear stress on a planar surface, the input data are the unit vector normal to the involved plane, the unit vectors of the three principal stress axes, and the stress ratio. The advantage of this program is that the vertical or horizontal principal stresses are not necessarily required. Owing to its lightweight design in Java SE 8.0, it runs on most operating systems with the corresponding Java VM. The program will be practical for geoscience students, geologists and engineers, and will help fill a gap in field, structural, and engineering geology.
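
    The direct method lends itself to a compact sketch. The following fragment (Python for brevity; not the Slicken 1.0 source, which is written in Java) assembles a stress tensor from the three principal axes and the stress ratio, normalizing the magnitudes to s1 = 1, s2 = phi, s3 = 0 since only the shear orientation matters, and returns the direction of maximum shear stress on the plane.

        # Hedged sketch of resolving the maximum shear stress direction on a
        # plane from principal stress orientations and the stress ratio.
        import numpy as np

        def max_shear_direction(n, e1, e2, e3, phi):
            """Unit vector of maximum shear stress on a plane with unit normal n.

            e1, e2, e3: unit vectors of the principal stress axes.
            phi: stress ratio (s2 - s3) / (s1 - s3).
            """
            # Stress tensor with normalized principal magnitudes 1, phi, 0.
            sigma = np.outer(e1, e1) + phi * np.outer(e2, e2)
            t = sigma @ n                    # traction on the plane (Cauchy)
            shear = t - (t @ n) * n          # remove the normal component
            return shear / np.linalg.norm(shear)

        # Vertical s1, N-S s2, E-W s3, on a plane dipping 60 degrees east:
        n = np.array([np.sin(np.radians(60)), 0.0, np.cos(np.radians(60))])
        print(max_shear_direction(n, np.array([0.0, 0.0, 1.0]),
                                  np.array([0.0, 1.0, 0.0]),
                                  np.array([1.0, 0.0, 0.0]), phi=0.5))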

  9. A theoretical basis for the analysis of redundant software subject to coincident errors

    NASA Technical Reports Server (NTRS)

    Eckhardt, D. E., Jr.; Lee, L. D.

    1985-01-01

    Fundamental to the development of redundant software techniques, known as fault-tolerant software, is an understanding of the impact of multiple joint occurrences of coincident errors. A theoretical basis for the study of redundant software is developed which provides a probabilistic framework for empirically evaluating the effectiveness of the general (N-version) strategy when component versions are subject to coincident errors, and permits an analytical study of the effects of these errors. The basic assumptions of the model are: (1) independently designed software components are chosen in a random sample; and (2) in the user environment, the system is required to execute on a stationary input series. The intensity of coincident errors has a central role in the model. This function describes the propensity to introduce design faults in such a way that software components fail together when executing in the user environment. The model is used to give conditions under which an N-version system is a better strategy for reducing system failure probability than relying on a single version of software. A condition which limits the effectiveness of a fault-tolerant strategy is studied, and the question is posed whether system failure probability varies monotonically with increasing N or whether an optimal choice of N exists.
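
    The simplest special case of the model, fully independent versions, already shows the flavor of the comparison between an N-version system and a single version. The sketch below computes the majority-voting failure probability under that independence assumption; coincident errors violate exactly this assumption and are what the intensity function is introduced to capture.

        # Worked example under an independence assumption (not the paper's
        # general model): an n-version majority voter fails when more than
        # half of the versions fail on an input.
        from math import comb

        def majority_failure_prob(n, p):
            """P(system failure) for an n-version majority voter, i.i.d. failures."""
            need = n // 2 + 1  # smallest number of failed versions defeating the vote
            return sum(comb(n, k) * p**k * (1 - p)**(n - k)
                       for k in range(need, n + 1))

        for n in (1, 3, 5, 7):
            print(n, majority_failure_prob(n, p=0.01))
        # With p = 0.01 the failure probability drops rapidly with N -- but
        # only while the independence assumption holds.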

  10. Prognostic and health management of active assets in nuclear power plants

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Agarwal, Vivek; Lybeck, Nancy; Pham, Binh T.

    This study presents the development of diagnostic and prognostic capabilities for active assets in nuclear power plants (NPPs). The research was performed under the Advanced Instrumentation, Information, and Control Technologies Pathway of the Light Water Reactor Sustainability Program. Idaho National Laboratory researched, developed, implemented, and demonstrated diagnostic and prognostic models for generator step-up transformers (GSUs). The Fleet-Wide Prognostic and Health Management (FW-PHM) Suite software developed by the Electric Power Research Institute was used to perform diagnosis and prognosis. As part of the research activity, Idaho National Laboratory implemented 22 GSU diagnostic models in the Asset Fault Signature Database and two well-established GSU prognostic models for the paper winding insulation in the Remaining Useful Life Database of the FW-PHM Suite. The implemented models, along with a simulated fault data stream, were used to evaluate the diagnostic and prognostic capabilities of the FW-PHM Suite. Knowledge of the operating condition of plant assets gained from diagnosis and prognosis is critical for the safe, productive, and economical long-term operation of the current fleet of NPPs. This research addresses some of the gaps in the current state of technology development and enables effective application of diagnostics and prognostics to nuclear plant assets.

  11. Prognostic and health management of active assets in nuclear power plants

    DOE PAGES

    Agarwal, Vivek; Lybeck, Nancy; Pham, Binh T.; ...

    2015-06-04

    This study presents the development of diagnostic and prognostic capabilities for active assets in nuclear power plants (NPPs). The research was performed under the Advanced Instrumentation, Information, and Control Technologies Pathway of the Light Water Reactor Sustainability Program. Idaho National Laboratory researched, developed, implemented, and demonstrated diagnostic and prognostic models for generator step-up transformers (GSUs). The Fleet-Wide Prognostic and Health Management (FW-PHM) Suite software developed by the Electric Power Research Institute was used to perform diagnosis and prognosis. As part of the research activity, Idaho National Laboratory implemented 22 GSU diagnostic models in the Asset Fault Signature Database and two well-established GSU prognostic models for the paper winding insulation in the Remaining Useful Life Database of the FW-PHM Suite. The implemented models, along with a simulated fault data stream, were used to evaluate the diagnostic and prognostic capabilities of the FW-PHM Suite. Knowledge of the operating condition of plant assets gained from diagnosis and prognosis is critical for the safe, productive, and economical long-term operation of the current fleet of NPPs. This research addresses some of the gaps in the current state of technology development and enables effective application of diagnostics and prognostics to nuclear plant assets.

  12. ALLIANCE: An architecture for fault tolerant, cooperative control of heterogeneous mobile robots

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Parker, L.E.

    1995-02-01

    This research addresses the problem of achieving fault tolerant cooperation within small- to medium-sized teams of heterogeneous mobile robots. The author describes a novel behavior-based, fully distributed architecture, called ALLIANCE, that utilizes adaptive action selection to achieve fault tolerant cooperative control in robot missions involving loosely coupled, largely independent tasks. The robots in this architecture possess a variety of high-level functions that they can perform during a mission, and must at all times select an appropriate action based on the requirements of the mission, the activities of other robots, the current environmental conditions, and their own internal states. Since such cooperative teams often work in dynamic and unpredictable environments, the software architecture allows the team members to respond robustly and reliably to unexpected environmental changes and modifications in the robot team that may occur due to mechanical failure, the learning of new skills, or the addition or removal of robots from the team by human intervention. After presenting ALLIANCE, the author describes in detail experimental results of an implementation of this architecture on a team of physical mobile robots performing a cooperative box pushing demonstration. These experiments illustrate the ability of ALLIANCE to achieve adaptive, fault-tolerant cooperative control amidst dynamic changes in the capabilities of the robot team.

  13. Development and evaluation of a Fault-Tolerant Multiprocessor (FTMP) computer. Volume 2: FTMP software

    NASA Technical Reports Server (NTRS)

    Lala, J. H.; Smith, T. B., III

    1983-01-01

    The software developed for the Fault-Tolerant Multiprocessor (FTMP) is described. The FTMP executive is a timer-interrupt driven dispatcher that schedules iterative tasks which run at 3.125, 12.5, and 25 Hz. Major tasks which run under the executive include system configuration control, flight control, and display. The flight control task includes autopilot and autoland functions for a jet transport aircraft. System displays include status displays of all hardware elements (processors, memories, I/O ports, buses), failure log displays showing transient and hard faults, and an autopilot display. All software is written in a higher-order language (AED, an ALGOL derivative). The executive is a fully distributed general-purpose executive which automatically balances the load among available processor triads. Provisions for graceful performance degradation under processing overload are an integral part of the scheduling algorithms.
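
    The rate-group structure can be illustrated with a toy dispatcher (Python here for brevity; the actual executive was written in AED, and the task names below are invented). A 25 Hz base tick runs the fast group every tick, the 12.5 Hz group every second tick, and the 3.125 Hz group every eighth tick.

        # Illustrative sketch of a timer-driven rate-group dispatcher.
        def dispatch(tick, groups):
            """Run every rate group whose tick divisor divides this tick."""
            for divisor, tasks in groups.items():
                if tick % divisor == 0:
                    for task in tasks:
                        task(tick)

        groups = {
            1: [lambda t: print(t, "25 Hz: flight control")],
            2: [lambda t: print(t, "12.5 Hz: display update")],
            8: [lambda t: print(t, "3.125 Hz: configuration control")],
        }
        for tick in range(8):   # one 3.125 Hz frame of the 25 Hz base rate
            dispatch(tick, groups)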

  14. Software for determining the true displacement of faults

    NASA Astrophysics Data System (ADS)

    Nieto-Fuentes, R.; Nieto-Samaniego, Á. F.; Xu, S.-S.; Alaniz-Álvarez, S. A.

    2014-03-01

    One of the most important parameters of faults is the true (or net) displacement, which is measured by restoring two originally adjacent points, called "piercing points", to their original positions. This measurement is typically not possible because piercing points are rarely observed in natural outcrops. Much more common is the measurement of the apparent displacement of a marker. Methods to calculate the true displacement of faults using descriptive geometry, trigonometry or vector algebra are common in the literature, and most of them solve a specific situation from a large number of possible combinations of the fault parameters. Despite their importance and the relatively simple methodology, true displacements are not routinely calculated because doing so by hand is a tedious task. We believe that the solution is to develop software capable of performing this work. In a previous publication, our research group proposed a method to calculate the true displacement of faults by solving most combinations of fault parameters using simple trigonometric equations. The purpose of this contribution is to present a computer program for calculating the true displacement of faults. The input data are the dip of the fault; the pitch angles of the markers, slickenlines and observation lines; and the marker separation. To avoid the common difficulties involved in switching between operating systems, the software is developed in the Java programming language. The computer program can be used as a tool in education and will also be useful for the calculation of true fault displacement in geological and engineering work. The application resolves cases with a known direction of net slip, which is commonly assumed to be parallel to the slickenlines. This assumption is not always valid and must be used with caution, because slickenlines are formed during a step of the incremental displacement on the fault surface, whereas the net slip is related to the finite slip.

  15. Risk-Significant Adverse Condition Awareness Strengthens Assurance of Fault Management Systems

    NASA Technical Reports Server (NTRS)

    Fitz, Rhonda

    2017-01-01

    As spaceflight systems increase in complexity, Fault Management (FM) systems are ranked high in risk-based assessment of software criticality, emphasizing the importance of establishing highly competent domain expertise to provide assurance. Adverse conditions (ACs) and specific vulnerabilities encountered by safety- and mission-critical software systems have been identified through efforts to reduce the risk posture of software-intensive NASA missions. Acknowledgement of potential off-nominal conditions and analysis to determine software system resiliency are important aspects of hazard analysis and FM. A key component of assuring FM is an assessment of how well software addresses susceptibility to failure through consideration of ACs. Focus on significant risk predicted through experienced analysis conducted at the NASA Independent Verification & Validation (IV&V) Program enables the scoping of effective assurance strategies with regard to overall asset protection of complex spaceflight as well as ground systems. Research efforts sponsored by NASA's Office of Safety and Mission Assurance (OSMA) defined terminology, categorized data fields, and designed a baseline repository that centralizes and compiles a comprehensive listing of ACs and correlated data relevant across many NASA missions. This prototype tool helps projects improve analysis by tracking ACs and allowing queries based on project, mission type, domain/component, causal fault, and other key characteristics. Vulnerability in off-nominal situations, architectural design weaknesses, and unexpected or undesirable system behaviors in reaction to faults are curtailed with the awareness of ACs and risk-significant scenarios modeled for analysts through this database. Integration within the Enterprise Architecture at NASA IV&V enables interfacing with other tools and datasets, technical support, and accessibility across the Agency. This paper discusses the development of an improved workflow process utilizing this database for adaptive, risk-informed FM assurance that critical software systems will safely and securely protect against faults and respond to ACs in order to achieve successful missions.

  16. Risk-Significant Adverse Condition Awareness Strengthens Assurance of Fault Management Systems

    NASA Technical Reports Server (NTRS)

    Fitz, Rhonda

    2017-01-01

    As spaceflight systems increase in complexity, Fault Management (FM) systems are ranked high in risk-based assessment of software criticality, emphasizing the importance of establishing highly competent domain expertise to provide assurance. Adverse conditions (ACs) and specific vulnerabilities encountered by safety- and mission-critical software systems have been identified through efforts to reduce the risk posture of software-intensive NASA missions. Acknowledgement of potential off-nominal conditions and analysis to determine software system resiliency are important aspects of hazard analysis and FM. A key component of assuring FM is an assessment of how well software addresses susceptibility to failure through consideration of ACs. Focus on significant risk predicted through experienced analysis conducted at the NASA Independent Verification & Validation (IV&V) Program enables the scoping of effective assurance strategies with regard to overall asset protection of complex spaceflight as well as ground systems. Research efforts sponsored by NASA's Office of Safety and Mission Assurance defined terminology, categorized data fields, and designed a baseline repository that centralizes and compiles a comprehensive listing of ACs and correlated data relevant across many NASA missions. This prototype tool helps projects improve analysis by tracking ACs and allowing queries based on project, mission type, domain/component, causal fault, and other key characteristics. Vulnerability in off-nominal situations, architectural design weaknesses, and unexpected or undesirable system behaviors in reaction to faults are curtailed with the awareness of ACs and risk-significant scenarios modeled for analysts through this database. Integration within the Enterprise Architecture at NASA IV&V enables interfacing with other tools and datasets, technical support, and accessibility across the Agency. This paper discusses the development of an improved workflow process utilizing this database for adaptive, risk-informed FM assurance that critical software systems will safely and securely protect against faults and respond to ACs in order to achieve successful missions.

  17. Preferential Flow Paths In A Karstified Spring Catchment: A Study Of Fault Zones As Conduits To Rapid Groundwater Flow

    NASA Astrophysics Data System (ADS)

    Kordilla, J.; Terrell, A. N.; Veltri, M.; Sauter, M.; Schmidt, S.

    2017-12-01

    In this study we model saturated and unsaturated flow in the karstified Weende spring catchment, located within the Leinetal graben in Goettingen, Germany. We employ the finite element COMSOL Multiphysics software to model variably saturated flow using the Richards equation with a van Genuchten type parameterization. As part of the graben structure, the Weende spring catchment is intersected by seven fault zones along the main flow path of the 7400 m cross section of the catchment. As the Weende spring is part of the drinking water supply of Goettingen, it is particularly important to understand the vulnerability of the catchment and the effect of fault zones on the rapid transport of contaminants. Nitrate signals have been observed at the spring only a few days after the application of fertilizers within the catchment at a distance of approximately 2 km. As the underlying layers are known to be highly impermeable, fault zones within the area are likely to create rapid flow paths to the water table and the spring. The model conceptualizes the catchment as containing three hydrogeological limestone units with varying degrees of karstification: the lower Muschelkalk limestone as a highly conductive layer, the middle Muschelkalk as an aquitard, and the upper Muschelkalk as another conductive layer. The fault zones are parameterized based on a combination of field data from quarries, remote sensing and literature data. Each fault zone is modeled considering the fault core as well as the surrounding damage zone, with separate, specific hydraulic properties. The 2D conceptual model was implemented in COMSOL to study unsaturated flow at the catchment scale using van Genuchten parameters. The study demonstrates the importance of fault zones for preferential flow within the catchment and their effect on the spatial distribution of vulnerability.
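
    For reference, a minimal sketch of the van Genuchten parameterization used in such Richards-equation models is given below; the parameter values are placeholders, not calibrated Weende catchment values.

        # Hedged sketch of the van Genuchten retention function: effective
        # saturation Se as a function of pressure head h (h < 0 unsaturated).
        import numpy as np

        def van_genuchten_saturation(h, alpha=1.0, n=2.0):
            """Se(h) = (1 + (alpha*|h|)^n)^(-m) for h < 0, else 1; m = 1 - 1/n."""
            m = 1.0 - 1.0 / n
            h = np.asarray(h, dtype=float)
            return np.where(h < 0, (1.0 + (alpha * np.abs(h)) ** n) ** (-m), 1.0)

        print(van_genuchten_saturation([-10.0, -1.0, 0.5]))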

  18. AC-DCFS: a toolchain implementation to Automatically Compute Coulomb Failure Stress changes after relevant earthquakes.

    NASA Astrophysics Data System (ADS)

    Alvarez-Gómez, José A.; García-Mayordomo, Julián

    2017-04-01

    We present an automated free-software-based toolchain to obtain Coulomb Failure Stress change maps on fault planes of interest following the occurrence of a relevant earthquake. The system uses as input the focal mechanism data of the event and an active fault database for the region. From the focal mechanism, the orientations of the possible rupture planes, the location of the event and the size of the earthquake are obtained. From the size of the earthquake, the dimensions of the rupture plane are obtained by means of an algorithm based on empirical relations. Using the active fault database of the area, the stress-receiving planes are obtained, and a verisimilitude index is assigned to the source plane from the two nodal planes of the focal mechanism. The product is a series of layers in a format compatible with any type of GIS (or a fully edited map in PDF format) showing the possible stress change maps on the different families of fault planes present in the epicentral zone. Such products are generally presented in technical reports developed in the weeks following the occurrence of the event, or in scientific publications; however, they have proven useful for emergency management in the hours and days after a major event, since these stress changes are responsible for aftershocks, as well as for mid-term earthquake forecasting. Automating the calculation allows its incorporation into the products generated by alert and surveillance agencies shortly after an earthquake occurs. The toolchain is now being implemented at the Spanish Geological Survey as one of the products that the agency would provide after the occurrence of relevant seismic series in Spain.
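
    The quantity being mapped can be sketched compactly. The fragment below computes the Coulomb failure stress change, delta_CFS = delta_tau + mu' * delta_sigma_n (tension positive), on a receiver plane; the stress tensor here is a made-up example, whereas such a toolchain derives it from an elastic dislocation model of the source rupture.

        # Hedged sketch of a Coulomb failure stress change calculation.
        import numpy as np

        def coulomb_stress_change(d_sigma, n, slip_dir, mu_eff=0.4):
            """delta_CFS on a plane with unit normal n, resolved along slip_dir."""
            traction = d_sigma @ n
            d_sigma_n = traction @ n        # normal stress change (tension positive)
            d_tau = traction @ slip_dir     # shear stress change along assumed slip
            return d_tau + mu_eff * d_sigma_n

        d_sigma = np.array([[0.5, 0.1, 0.0],
                            [0.1, -0.2, 0.0],
                            [0.0, 0.0, -0.3]])   # MPa, illustrative values
        n = np.array([1.0, 0.0, 0.0])            # receiver plane unit normal
        slip_dir = np.array([0.0, 1.0, 0.0])     # assumed rake direction
        print(coulomb_stress_change(d_sigma, n, slip_dir))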

  19. Integration of Advanced Probabilistic Analysis Techniques with Multi-Physics Models

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Cetiner, Mustafa Sacit; none,; Flanagan, George F.

    2014-07-30

    An integrated simulation platform that couples probabilistic analysis-based tools with model-based simulation tools can provide valuable insights for reactive and proactive responses to plant operating conditions. The objective of this work is to demonstrate the benefits of a partial implementation of the Small Modular Reactor (SMR) Probabilistic Risk Assessment (PRA) Detailed Framework Specification through the coupling of advanced PRA capabilities and accurate multi-physics plant models. Coupling a probabilistic model with a multi-physics model will aid in design, operations, and safety by providing a more accurate understanding of plant behavior. This represents the first attempt at actually integrating these two types of analyses for a control system used for operations, on a faster than real-time basis. This report documents the development of the basic communication capability to exchange data with the probabilistic model using Reliability Workbench (RWB) and the multi-physics model using Dymola. The communication pathways from injecting a fault (i.e., failing a component) to the probabilistic and multi-physics models were successfully completed. This first version was tested with prototypic models represented in both RWB and Modelica. First, a simple event tree/fault tree (ET/FT) model was created to develop the software code that implements the communication capabilities between the dynamic-link library (DLL) and RWB. A program, written in C#, successfully communicates faults to the probabilistic model through the DLL. A systems model of the Advanced Liquid-Metal Reactor-Power Reactor Inherently Safe Module (ALMR-PRISM) design developed under another DOE project was upgraded using Dymola to include proper interfaces to allow data exchange with the control application (ConApp). A program, written in C++, successfully communicates faults to the multi-physics model. The results of the example simulation were successfully plotted.

  20. Investigation of the applicability of a functional programming model to fault-tolerant parallel processing for knowledge-based systems

    NASA Technical Reports Server (NTRS)

    Harper, Richard

    1989-01-01

    In a fault-tolerant parallel computer, a functional programming model can facilitate distributed checkpointing, error recovery, load balancing, and graceful degradation. Such a model has been implemented on the Draper Fault-Tolerant Parallel Processor (FTPP). When used in conjunction with the FTPP's fault detection and masking capabilities, this implementation results in a graceful degradation of system performance after faults. Three graceful degradation algorithms have been implemented and are presented. A user interface has been implemented which requires minimal cognitive overhead by the application programmer, masking such complexities as the system's redundancy, distributed nature, variable complement of processing resources, load balancing, fault occurrence and recovery. This user interface is described and its use demonstrated. The applicability of the functional programming style to the Activation Framework, a paradigm for intelligent systems, is then briefly described.

  1. Modelling Active Faults in Probabilistic Seismic Hazard Analysis (PSHA) with OpenQuake: Definition, Design and Experience

    NASA Astrophysics Data System (ADS)

    Weatherill, Graeme; Garcia, Julio; Poggi, Valerio; Chen, Yen-Shin; Pagani, Marco

    2016-04-01

    The Global Earthquake Model (GEM) has, since its inception in 2009, made many contributions to the practice of seismic hazard modeling in different regions of the globe. The OpenQuake-engine (hereafter referred to simply as OpenQuake), GEM's open-source software for calculation of earthquake hazard and risk, has found application in many countries, spanning a diversity of tectonic environments. GEM itself has produced a database of national and regional seismic hazard models, harmonizing into OpenQuake's own definition the varied seismogenic sources found therein. The characterization of active faults in probabilistic seismic hazard analysis (PSHA) is at the centre of this process, motivating many of the developments in OpenQuake and presenting hazard modellers with the challenge of reconciling seismological, geological and geodetic information for the different regions of the world. Faced with these challenges, and from the experience gained in the process of harmonizing existing models of seismic hazard, four critical issues are addressed. The challenge GEM has faced in the development of software is how to define a representation of an active fault (both in terms of geometry and earthquake behaviour) that is sufficiently flexible to adapt to different tectonic conditions and levels of data completeness. By exploring the different fault typologies supported by OpenQuake we illustrate how seismic hazard calculations can, and do, take into account complexities such as geometrical irregularity of faults in the prediction of ground motion, highlighting some of the potential pitfalls and inconsistencies that can arise. This exploration leads to the second main challenge in active fault modeling: what elements of the fault source model impact most upon the hazard at a site, and when does this matter? Through a series of sensitivity studies we show how different configurations of fault geometry, and the corresponding characterisation of near-fault phenomena (including hanging wall and directivity effects) within modern ground motion prediction equations, can have an influence on the seismic hazard at a site. Yet we also illustrate the conditions under which these effects may be partially tempered when considering the full uncertainty in rupture behaviour within the fault system. The third challenge is the development of efficient means for representing both aleatory and epistemic uncertainties from active fault models in PSHA. In implementing state-of-the-art seismic hazard models into OpenQuake, such as those recently undertaken in California and Japan, new modeling techniques are needed that redefine how we treat interdependence of ruptures within the model (such as mutual exclusivity), and the propagation of uncertainties emerging from geology. Finally, we illustrate how OpenQuake, and GEM's additional toolkits for model preparation, can be applied to address long-standing issues in active fault modeling in PSHA. These include constraining the seismogenic coupling of a fault and the partitioning of seismic moment between the active fault surfaces and the surrounding seismogenic crust. We illustrate some of the possible roles that geodesy can play in the process, but highlight where this may introduce new uncertainties and potential biases into the seismic hazard process, and how these can be addressed.

  2. The Dangers of Failure Masking in Fault-Tolerant Software: Aspects of a Recent In-Flight Upset Event

    NASA Technical Reports Server (NTRS)

    Johnson, C. W.; Holloway, C. M.

    2007-01-01

    On 1 August 2005, a Boeing Company 777-200 aircraft, operating on an international passenger flight from Australia to Malaysia, was involved in a significant upset event while flying on autopilot. The Australian Transport Safety Bureau's investigation into the event discovered that an anomaly existed in the component software hierarchy that allowed inputs from a known faulty accelerometer to be processed by the air data inertial reference unit (ADIRU) and used by the primary flight computer, autopilot and other aircraft systems. This anomaly had existed in the original ADIRU software, and had not been detected in the testing and certification process for the unit. This paper describes the software aspects of the incident in detail, and suggests possible implications concerning complex, safety-critical, fault-tolerant software.

  3. Fault-Tree Compiler

    NASA Technical Reports Server (NTRS)

    Butler, Ricky W.; Boerschlein, David P.

    1993-01-01

    The Fault-Tree Compiler (FTC) program is a software tool used to calculate the probability of the top event in a fault tree. Five gate types are allowed in a fault tree: AND, OR, EXCLUSIVE OR, INVERT, and M OF N. The high-level input language is easy to understand and use. In addition, the program supports a hierarchical fault-tree definition feature, which simplifies the tree-description process and reduces execution time. FTC belongs to a set of programs forming the basis for a reliability-analysis workstation: SURE, ASSIST, PAWS/STEM, and the FTC fault-tree tool (LAR-14586). Written in PASCAL, ANSI-compliant C language, and FORTRAN 77. Other versions available upon request.
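
    The core calculation such a tool performs can be sketched as a recursive evaluation of AND/OR gates over statistically independent basic events. The input format below is invented for illustration and is not FTC's input language (which also supports EXCLUSIVE OR, INVERT, and M OF N gates).

        # Hedged sketch of top-event probability evaluation for a small
        # fault tree of AND/OR gates with independent basic events.
        def evaluate(node, basic):
            """Recursively compute the failure probability of a node."""
            if isinstance(node, str):
                return basic[node]                   # basic event probability
            gate, children = node
            probs = [evaluate(c, basic) for c in children]
            if gate == "AND":                        # all children must fail
                p = 1.0
                for q in probs:
                    p *= q
                return p
            if gate == "OR":                         # at least one child fails
                p = 1.0
                for q in probs:
                    p *= (1.0 - q)
                return 1.0 - p
            raise ValueError(f"unknown gate {gate}")

        tree = ("OR", [("AND", ["pump_a", "pump_b"]), "valve"])
        print(evaluate(tree, {"pump_a": 1e-3, "pump_b": 1e-3, "valve": 1e-5}))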

  4. HETDEX tracker control system design and implementation

    NASA Astrophysics Data System (ADS)

    Beno, Joseph H.; Hayes, Richard; Leck, Ron; Penney, Charles; Soukup, Ian

    2012-09-01

    To enable the Hobby-Eberly Telescope Dark Energy Experiment, The University of Texas at Austin Center for Electromechanics and McDonald Observatory developed a precision tracker and control system - an 18,000 kg robot to position a 3,100 kg payload within 10 microns of a desired dynamic track. Performance requirements to meet science needs and safety requirements that emerged from detailed Failure Modes and Effects Analysis resulted in a system of 13 precision controlled actuators and 100 additional analog and digital devices (primarily sensors and safety limit switches). Due to this complexity, demanding accuracy requirements, and stringent safety requirements, two independent control systems were developed. First, a versatile and easily configurable centralized control system that links with modeling and simulation tools during the hardware and software design process was deemed essential for normal operation including motion control. A second, parallel, control system, the Hardware Fault Controller (HFC) provides independent monitoring and fault control through a dedicated microcontroller to force a safe, controlled shutdown of the entire system in the event a fault is detected. Motion controls were developed in a Matlab-Simulink simulation environment, and coupled with dSPACE controller hardware. The dSPACE real-time operating system collects sensor information; motor commands are transmitted over a PROFIBUS network to servo amplifiers and drive motor status is received over the same network. To interface the dSPACE controller directly to absolute Heidenhain sensors with EnDat 2.2 protocol, a custom communication board was developed. This paper covers details of operational control software, the HFC, algorithms, tuning, debugging, testing, and lessons learned.

  5. Runtime Speculative Software-Only Fault Tolerance

    DTIC Science & Technology

    2012-06-01

    reliability of RSFT, an in-depth analysis of its window of vulnerability is also discussed and measured via simulated fault injection. The performance...propagation of faults through the entire program. For optimal performance, these techniques have to use heroic alias analysis to find the minimum set of...affect program output. No program source code or alias analysis is needed to analyze the fault propagation ahead of time. 2.3 Limitations of Existing

  6. An Empirical Approach to Logical Clustering of Software Failure Regions

    DTIC Science & Technology

    1994-03-01

    this is a coincidence or normal behavior of failure regions. Software faults were numbered in order as they were discovered, by the various testing...locations of the associated faults. The goal of this research will be an improved testing technique that incorporates failure region behavior. To do this...clustering behavior. This, however, does not correlate with the structural clustering of failure regions observed by Ginn (1991) on the same set of data

  7. NO-FAULT COMPENSATION FOR MEDICAL INJURIES: TRENDS AND CHALLENGES.

    PubMed

    Kassim, Puteri Nemie

    2014-12-01

    As an alternative to the tort or fault-based system, a no-fault compensation system has been viewed as having the potential to overcome problems inherent in the tort system by providing fair, speedy and adequate compensation for medically injured victims. Proponents of the suggested no-fault compensation system have argued that this system is more efficient in terms of time and money, as well as in making the circumstances in which compensation is paid much clearer. However, the arguments against no-fault compensation systems mainly concern issues of funding difficulties, accountability and deterrence, particularly once fault is taken out of the equation. Nonetheless, the no-fault compensation system has been successfully implemented in various countries but, at the same time, rejected in some others as not being implementable. In the present trend, the no-fault system seems to fit the needs of society by offering greater access to justice for medically injured victims and providing a clearer "road map" towards obtaining suitable redress. This paper aims at providing the reader with an overview of the characteristics of the no-fault compensation system and some examples of countries that have implemented it. Methods: qualitative research (content analysis). Given the many problems and hurdles posed by the tort or fault-based system, it is questionable whether it can efficiently play its role as a mechanism that affords fair and adequate compensation for victims of medical injuries. However, while a comprehensive no-fault compensation system offers a tempting alternative to the tort or fault-based system, importing such a change into our local scenario requires a great deal of consideration. There are major differences, mainly in terms of social standing, size of population, political ideology and financial commitment, between Malaysia and the countries that have successfully implemented no-fault systems. Nevertheless, implementing a no-fault compensation system in Malaysia is not entirely impossible. A custom-made no-fault model tailored to suit the local scenario can be promising, provided that thorough research is done, assessing the viability of a no-fault system in Malaysia, addressing the inherent problems and, consequently, designing a workable no-fault system for Malaysia.

  8. Using certification trails to achieve software fault tolerance

    NASA Technical Reports Server (NTRS)

    Sullivan, Gregory F.; Masson, Gerald M.

    1993-01-01

    A conceptually novel and powerful technique to achieve fault tolerance in hardware and software systems is introduced. When used for software fault tolerance, this new technique uses time and software redundancy and can be outlined as follows. In the initial phase, a program is run to solve a problem and store the result. In addition, this program leaves behind a trail of data called a certification trail. In the second phase, another program is run which solves the original problem again. This program, however, has access to the certification trail left by the first program. Because of the availability of the certification trail, the second phase can be performed by a less complex program and can execute more quickly. In the final phase, the two results are compared; if they agree, they are accepted as correct, otherwise an error is indicated. An essential aspect of this approach is that the second program must always generate either an error indication or a correct output even when the certification trail it receives from the first program is incorrect. The certification trail approach to fault tolerance was formalized and illustrated by applying it to the fundamental problem of finding a minimum spanning tree. Cases in which the second phase can be run concurrently with the first and act as a monitor are discussed. The certification trail approach was compared to other approaches to fault tolerance. Because of space limitations we have omitted examples of our technique applied to the Huffman tree and convex hull problems. These can be found in the full version of this paper.
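
    A simpler stand-in for the paper's minimum-spanning-tree example conveys the two-phase structure: the first program leaves the permutation it applied as a certification trail, and the second program re-derives and checks the answer from the trail, rejecting any trail that is inconsistent. This sketch is an illustrative simplification, not the paper's construction.

        # Hedged sketch of the certification trail idea, using sorting.
        def phase_one(xs):
            """Solve the problem and emit a certification trail (a permutation)."""
            trail = sorted(range(len(xs)), key=lambda i: xs[i])
            return [xs[i] for i in trail], trail

        def phase_two(xs, trail):
            """Recompute the answer from the trail; reject any lying trail."""
            if sorted(trail) != list(range(len(xs))):
                raise ValueError("trail is not a permutation")
            result = [xs[i] for i in trail]
            if any(a > b for a, b in zip(result, result[1:])):
                raise ValueError("trail does not yield a sorted sequence")
            return result

        xs = [5, 2, 9, 1]
        out1, trail = phase_one(xs)
        assert phase_two(xs, trail) == out1   # results agree => accept as correct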

  9. Automatic Detection of Previously-Unseen Application States for Deployment Environment Testing and Analysis

    PubMed Central

    Murphy, Christian; Vaughan, Moses; Ilahi, Waseem; Kaiser, Gail

    2010-01-01

    For large, complex software systems, it is typically impossible in terms of time and cost to reliably test the application in all possible execution states and configurations before releasing it into production. One proposed way of addressing this problem has been to continue testing and analysis of the application in the field, after it has been deployed. A practical limitation of many such automated approaches is the potentially high performance overhead incurred by the necessary instrumentation. However, it may be possible to reduce this overhead by selecting test cases and performing analysis only in previously-unseen application states, thus reducing the number of redundant tests and analyses that are run. Solutions for fault detection, model checking, security testing, and fault localization in deployed software may all benefit from a technique that ignores application states that have already been tested or explored. In this paper, we present a solution that ensures that deployment environment tests are only executed in states that the application has not previously encountered. In addition to discussing our implementation, we present the results of an empirical study that demonstrates its effectiveness, and explain how the new approach can be generalized to assist other automated testing and analysis techniques intended for the deployment environment. PMID:21197140
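
    The gating idea can be sketched in a few lines: hash an abstract view of the application state and trigger in-the-field testing or analysis only on first sight of a state. The state abstraction below (a dictionary of selected fields) is an assumption for illustration, not the paper's implementation.

        # Hedged sketch: run deployment-environment tests only in
        # previously-unseen application states.
        import hashlib

        class UnseenStateGate:
            def __init__(self):
                self.seen = set()

            def should_test(self, state_fields):
                """Return True only the first time this abstract state appears."""
                key = repr(sorted(state_fields.items())).encode()
                digest = hashlib.sha256(key).hexdigest()
                if digest in self.seen:
                    return False      # redundant: state already tested/analyzed
                self.seen.add(digest)
                return True

        gate = UnseenStateGate()
        print(gate.should_test({"mode": "idle", "queue_len": 0}))   # True: new state
        print(gate.should_test({"queue_len": 0, "mode": "idle"}))   # False: seen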

  10. Evaluating software development characteristics: Assessment of software measures in the Software Engineering Laboratory. [reliability engineering

    NASA Technical Reports Server (NTRS)

    Basili, V. R.

    1981-01-01

    Work on software metrics is discussed. Factors that affect software quality are reviewed. Metrics are discussed in terms of criteria achievement, reliability, and fault tolerance. Subjective and objective metrics are distinguished. Product/process and cost/quality metrics are characterized and discussed.

  11. Emerging technologies for V&V of ISHM software for space exploration

    NASA Technical Reports Server (NTRS)

    Feather, Martin S.; Markosian, Lawrence Z.

    2006-01-01

    Systems required to exhibit high operational reliability often rely on some form of fault protection to recognize and respond to faults, preventing faults' escalation to catastrophic failures. Integrated System Health Management (ISHM) extends the functionality of fault protection to both scale to more complex systems (and systems of systems), and to maintain capability rather than just avert catastrophe. Forms of ISHM have been utilized to good effect in the maintenance phase of systems' total lifecycles (often referred to as 'condition-based maintenance'), but less so in a 'fault protection' role during actual operations. One of the impediments to such use lies in the challenges of verification, validation and certification of ISHM systems themselves. This paper makes the case that state-of-the-practice V&V and certification techniques will not suffice for emerging forms of ISHM systems; however, a number of maturing software engineering assurance technologies show particular promise for addressing these ISHM V&V challenges.

  12. The KATE shell: An implementation of model-based control, monitor and diagnosis

    NASA Technical Reports Server (NTRS)

    Cornell, Matthew

    1987-01-01

    The conventional control and monitor software currently used by the Space Center for Space Shuttle processing has many limitations, such as high maintenance costs, limited diagnostic capabilities and limited simulation support. These limitations have led to the development of a knowledge-based (or model-based) shell to generically control and monitor electro-mechanical systems. The knowledge base describes the system's structure and function and is used by a software shell to do real-time constraint checking, low-level control of components, diagnosis of detected faults, sensor validation, automatic generation of schematic diagrams and automatic recovery from failures. This approach is more versatile and more powerful than the conventional hard-coded approach and offers many advantages over it, although, for systems which require high-speed reaction times or are not well understood, knowledge-based control and monitor systems may not be appropriate.

  13. Measurement of SIFT operating system overhead

    NASA Technical Reports Server (NTRS)

    Palumbo, D. L.; Butler, R. W.

    1985-01-01

    The overhead of the software implemented fault tolerance (SIFT) operating system was measured. Several versions of the operating system evolved. Each version represents different strategies employed to improve the measured performance. Three of these versions are analyzed. The internal data structures of the operating systems are discussed. The overhead of the SIFT operating system was found to be of two types: vote overhead and executive task overhead. Both types of overhead were found to be significant in all versions of the system. Improvements substantially reduced this overhead; even with these improvements, the operating system consumed well over 50% of the available processing time.

  14. Specification, Synthesis, and Verification of Software-based Control Protocols for Fault-Tolerant Space Systems

    DTIC Science & Technology

    2016-08-16


  15. Software reliability studies

    NASA Technical Reports Server (NTRS)

    Hoppa, Mary Ann; Wilson, Larry W.

    1994-01-01

    There are many software reliability models which try to predict future performance of software based on data generated by the debugging process. Our research has shown that by improving the quality of the data one can greatly improve the predictions. We are working on methodologies which control some of the randomness inherent in the standard data generation processes in order to improve the accuracy of predictions. Our contribution is twofold in that we describe an experimental methodology using a data structure called the debugging graph and apply this methodology to assess the robustness of existing models. The debugging graph is used to analyze the effects of various fault recovery orders on the predictive accuracy of several well-known software reliability algorithms. We found that, along a particular debugging path in the graph, the predictive performance of different models can vary greatly. Similarly, just because a model 'fits' a given path's data well does not guarantee that the model would perform well on a different path. Further we observed bug interactions and noted their potential effects on the predictive process. We saw that not only do different faults fail at different rates, but that those rates can be affected by the particular debugging stage at which the rates are evaluated. Based on our experiment, we conjecture that the accuracy of a reliability prediction is affected by the fault recovery order as well as by fault interaction.

  16. Numerical modeling of fracking fluid and methane migration through fault zones in shale gas reservoirs

    NASA Astrophysics Data System (ADS)

    Taherdangkoo, Reza; Tatomir, Alexandru; Sauter, Martin

    2017-04-01

    Hydraulic fracturing operations in shale gas reservoirs have gained growing interest over the last few years. Groundwater contamination is one of the most important environmental concerns that have emerged surrounding shale gas development (Reagan et al., 2015). The potential impacts of hydraulic fracturing can be studied through the possible pathways for subsurface migration of contaminants towards overlying aquifers (Kissinger et al., 2013; Myers, 2012). The intent of this study is to investigate, by means of numerical simulation, two failure scenarios, both based on the presence of a fault zone that penetrates the full thickness of the overburden and connects the shale gas reservoir to an aquifer. Scenario 1 addresses the potential transport of fracturing fluid from the shale into the overlying formations; this scenario was modeled with the COMSOL Multiphysics software. Scenario 2 deals with the leakage of methane from the reservoir into the overburden; this scenario was implemented in DuMux, a free and open-source discrete fracture model (DFM) simulator (Tatomir, 2012). The modeling results are used to evaluate the influence of several important parameters (reservoir pressure, aquifer-reservoir separation thickness, fault zone inclination, porosity, permeability, etc.) that could affect fluid transport through the fault zone. Furthermore, we determined the main transport mechanisms and the circumstances that would allow fracturing fluid or methane to migrate through the fault zone into overlying geological layers. The results show that the presence of a conductive fault can considerably reduce contaminant travel times and that, under certain hydraulic conditions, significant contaminant leakage is likely to occur. Bibliography: Kissinger, A., Helmig, R., Ebigbo, A., Class, H., Lange, T., Sauter, M., Heitfeld, M., Klünker, J., Jahnke, W., 2013. Hydraulic fracturing in unconventional gas reservoirs: risks in the geological system, part 2. Environ Earth Sci 70, 3855-3873. Myers, T., 2012. Potential contaminant pathways from hydraulically fractured shale to aquifers. Groundwater 50(6), 872-882. Reagan, M.T., Moridis, G.J., Keen, N.D., Johnson, J.N., 2015. Numerical simulation of the environmental impact of hydraulic fracturing of tight/shale gas reservoirs on near-surface groundwater: Background, base cases, shallow reservoirs, short-term gas, and water transport. Water Resources Research 51, 2543-2573. Tatomir, A., 2012. From Discrete to Continuum Concepts of Flow in Fractured Porous Media. University of Stuttgart.

  17. A DICOM-based 2nd generation Molecular Imaging Data Grid implementing the IHE XDS-i integration profile.

    PubMed

    Lee, Jasper; Zhang, Jianguo; Park, Ryan; Dagliyan, Grant; Liu, Brent; Huang, H K

    2012-07-01

    A Molecular Imaging Data Grid (MIDG) was developed to address current informatics challenges in the archival, sharing, search, and distribution of preclinical imaging studies between animal imaging facilities and investigator sites. This manuscript presents a 2nd generation MIDG, replacing the Globus Toolkit with a new system architecture that implements the IHE XDS-i integration profile. Implementation and evaluation were conducted using a 3-site interdisciplinary test-bed at the University of Southern California. The 2nd generation MIDG design architecture replaces the initial design's Globus Toolkit with dedicated web services and XML-based messaging for dedicated management and delivery of multi-modality DICOM imaging datasets. The Cross-enterprise Document Sharing for Imaging (XDS-i) integration profile from the field of enterprise radiology informatics was adopted into the MIDG design because streamlined image registration, management, and distribution dataflows are likewise needed in preclinical imaging informatics systems as in enterprise PACS applications. Implementation of the MIDG is demonstrated at the University of Southern California Molecular Imaging Center (MIC) and two other sites with specified hardware, software, and network bandwidth. Evaluation of the MIDG involves data upload, download, and fault-tolerance testing scenarios using multi-modality animal imaging datasets collected at the USC Molecular Imaging Center. The upload, download, and fault-tolerance tests of the MIDG were performed multiple times using 12 collected animal study datasets. Upload and download times demonstrated reproducibility and improved real-world performance. Fault-tolerance tests showed that automated failover between Grid Node Servers has minimal impact on normal download times. Building upon the 1st generation concepts and experiences, the 2nd generation MIDG system improves accessibility of disparate animal-model molecular imaging datasets to users outside a molecular imaging facility's LAN using a new architecture, dataflow, and dedicated DICOM-based management web services. Productivity and efficiency of preclinical research for translational sciences investigators has been further streamlined for multi-center study data registration, management, and distribution.

  18. A Wireless Sensor System for Real-Time Monitoring and Fault Detection of Motor Arrays

    PubMed Central

    Medina-García, Jonathan; Sánchez-Rodríguez, Trinidad; Galán, Juan Antonio Gómez; Delgado, Aránzazu; Gómez-Bravo, Fernando; Jiménez, Raúl

    2017-01-01

    This paper presents a wireless fault detection system for industrial motors that combines vibration, motor current and temperature analysis, thus improving the detection of mechanical faults. The design also considers the time of detection and further possible actions, which are also important for the early detection of possible malfunctions, and thus for avoiding irreversible damage to the motor. The remote motor condition monitoring is implemented through a wireless sensor network (WSN) based on the IEEE 802.15.4 standard. The deployed network uses the beacon-enabled mode to synchronize several sensor nodes with the coordinator node, and the guaranteed time slot mechanism provides data monitoring with a predetermined latency. A graphic user interface offers remote access to motor conditions and real-time monitoring of several parameters. The developed wireless sensor node exhibits very low power consumption since it has been optimized both in terms of hardware and software. The result is a low cost, highly reliable and compact design, achieving a high degree of autonomy of more than two years with just one 3.3 V/2600 mAh battery. Laboratory and field tests confirm the feasibility of the wireless system. PMID:28245623

  19. A Wireless Sensor System for Real-Time Monitoring and Fault Detection of Motor Arrays.

    PubMed

    Medina-García, Jonathan; Sánchez-Rodríguez, Trinidad; Galán, Juan Antonio Gómez; Delgado, Aránzazu; Gómez-Bravo, Fernando; Jiménez, Raúl

    2017-02-25

    This paper presents a wireless fault detection system for industrial motors that combines vibration, motor current and temperature analysis, thus improving the detection of mechanical faults. The design also considers the time of detection and further possible actions, which are also important for the early detection of possible malfunctions, and thus for avoiding irreversible damage to the motor. The remote motor condition monitoring is implemented through a wireless sensor network (WSN) based on the IEEE 802.15.4 standard. The deployed network uses the beacon-enabled mode to synchronize several sensor nodes with the coordinator node, and the guaranteed time slot mechanism provides data monitoring with a predetermined latency. A graphic user interface offers remote access to motor conditions and real-time monitoring of several parameters. The developed wireless sensor node exhibits very low power consumption since it has been optimized both in terms of hardware and software. The result is a low cost, highly reliable and compact design, achieving a high degree of autonomy of more than two years with just one 3.3 V/2600 mAh battery. Laboratory and field tests confirm the feasibility of the wireless system.

  20. Diagnosis diagrams for passing signals on an automatic block signaling railway section

    NASA Astrophysics Data System (ADS)

    Spunei, E.; Piroi, I.; Chioncel, C. P.; Piroi, F.

    2018-01-01

    This work presents a diagnosis method for railway traffic security installations. More specifically, the authors present a series of diagnosis charts for passing signals on a railway block equipped with an automatic block signaling installation. These charts are based on the electric schematics used in operation and are subsequently used to develop a diagnosis software package. The resulting software package contributes substantially to reducing failure-detection and repair times for these types of installation faults. Use of the software package also eliminates wrong decisions in the fault detection process, decisions that may lead to longer repair times and, sometimes, to railway traffic events.

  1. From experiment to design -- Fault characterization and detection in parallel computer systems using computational accelerators

    NASA Astrophysics Data System (ADS)

    Yim, Keun Soo

    This dissertation summarizes experimental validation and co-design studies conducted to optimize the fault detection capabilities and overheads in hybrid computer systems (e.g., using CPUs and Graphics Processing Units, or GPUs), and consequently to improve the scalability of parallel computer systems using computational accelerators. The experimental validation studies were conducted to help us understand the failure characteristics of CPU-GPU hybrid computer systems under various types of hardware faults. The main characterization targets were faults that are difficult to detect and/or recover from, e.g., faults that cause long latency failures (Ch. 3), faults in dynamically allocated resources (Ch. 4), faults in GPUs (Ch. 5), faults in MPI programs (Ch. 6), and microarchitecture-level faults with specific timing features (Ch. 7). The co-design studies were based on the characterization results. One of the co-designed systems has a set of source-to-source translators that customize and strategically place error detectors in the source code of target GPU programs (Ch. 5). Another co-designed system uses an extension card to learn the normal behavioral and semantic execution patterns of message-passing processes executing on CPUs, and to detect abnormal behaviors of those parallel processes (Ch. 6). The third co-designed system is a co-processor that has a set of new instructions in order to support software-implemented fault detection techniques (Ch. 7). The work described in this dissertation gains more importance because heterogeneous processors have become an essential component of state-of-the-art supercomputers. GPUs were used in three of the five fastest supercomputers that were operating in 2011. Our work included comprehensive fault characterization studies in CPU-GPU hybrid computers. In CPUs, we monitored the target systems for a long period of time after injecting faults (a temporally comprehensive experiment), and injected faults into various types of program states that included dynamically allocated memory (to be spatially comprehensive). In GPUs, we used fault injection studies to demonstrate the importance of detecting silent data corruption (SDC) errors that are mainly due to the lack of fine-grained protections and the massive use of fault-insensitive data. This dissertation also presents transparent fault tolerance frameworks and techniques that are directly applicable to hybrid computers built using only commercial off-the-shelf hardware components. This dissertation shows that by developing understanding of the failure characteristics and error propagation paths of target programs, we were able to create fault tolerance frameworks and techniques that can quickly detect and recover from hardware faults with low performance and hardware overheads.
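
    The characterization studies summarized above rest on software-implemented fault injection. A minimal Python sketch of the idea follows: flip one bit of program state at a random point, then classify the run as benign, crash, or silent data corruption (SDC). The stand-in workload, in which only half of the state is live, is our own invention to illustrate the "fault-insensitive data" effect mentioned in the abstract.

        import random

        def workload(data):
            # Stand-in computation; only the first half of the state is live,
            # so faults injected into the dead half are benign.
            return sum(x * x for x in data[:len(data) // 2])

        def flip_bit(value, bit):
            # Software-implemented fault: flip one bit of an integer state element.
            return value ^ (1 << bit)

        def run_with_injection(data, seed):
            rng = random.Random(seed)
            faulty = list(data)
            idx = rng.randrange(len(faulty))
            faulty[idx] = flip_bit(faulty[idx], rng.randrange(16))
            try:
                result = workload(faulty)
            except Exception:
                return "crash"
            return "benign" if result == workload(data) else "SDC"

        data = list(range(100))
        outcomes = [run_with_injection(data, s) for s in range(1000)]
        print({o: outcomes.count(o) for o in set(outcomes)})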

  2. Certification of computational results

    NASA Technical Reports Server (NTRS)

    Sullivan, Gregory F.; Wilson, Dwight S.; Masson, Gerald M.

    1993-01-01

    A conceptually novel and powerful technique to achieve fault detection and fault tolerance in hardware and software systems is described. When used for software fault detection, this new technique uses time and software redundancy and can be outlined as follows. In the initial phase, a program is run to solve a problem and store the result. In addition, this program leaves behind a trail of data called a certification trail. In the second phase, another program is run which solves the original problem again. This program, however, has access to the certification trail left by the first program. Because of the availability of the certification trail, the second phase can be performed by a less complex program and can execute more quickly. In the final phase, the two results are compared and if they agree the results are accepted as correct; otherwise an error is indicated. An essential aspect of this approach is that the second program must always generate either an error indication or a correct output even when the certification trail it receives from the first program is incorrect. The certification trail approach to fault tolerance is formalized and realizations of it are illustrated by considering algorithms for the following problems: convex hull, sorting, and shortest path. Cases in which the second phase can be run concurrently with the first and act as a monitor are discussed. The certification trail approach are compared to other approaches to fault tolerance.
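
    To make the three-phase scheme concrete, here is a small Python sketch of a certification trail for sorting: the first phase sorts and emits the ordering permutation as its trail; the simpler second phase rebuilds the output from the trail and checks the order and permutation properties, so even a corrupt trail yields an error indication rather than a wrong acceptance. This example is ours, following the paper's general scheme rather than its exact algorithms.

        def first_phase(xs):
            # Solve the problem and leave a certification trail
            # (here, the permutation that sorts the input).
            trail = sorted(range(len(xs)), key=lambda i: xs[i])
            return [xs[i] for i in trail], trail

        def second_phase(xs, trail):
            # Simpler, faster check; must reject any corrupt trail.
            if sorted(trail) != list(range(len(xs))):
                return None                      # not a permutation -> error
            out = [xs[i] for i in trail]
            if any(a > b for a, b in zip(out, out[1:])):
                return None                      # not ordered -> error
            return out

        xs = [5, 3, 9, 1]
        result, trail = first_phase(xs)
        assert second_phase(xs, trail) == result          # agreement: accept
        assert second_phase(xs, [0, 0, 2, 3]) is None     # corrupt trail caught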

  3. Quantitative method of medication system interface evaluation.

    PubMed

    Pingenot, Alleene Anne; Shanteau, James; Pingenot, James D F

    2007-01-01

    The objective of this study was to develop a quantitative method of evaluating the user interface for medication system software. A detailed task analysis provided a description of user goals and essential activity. A structural fault analysis was used to develop a detailed description of the system interface. Nurses experienced with use of the system under evaluation provided estimates of failure rates for each point in this simplified fault tree. Means of estimated failure rates provided quantitative data for fault analysis. Authors note that, although failures of steps in the program were frequent, participants reported numerous methods of working around these failures so that overall system failure was rare. However, frequent process failure can affect the time required for processing medications, making a system inefficient. This method of interface analysis, called Software Efficiency Evaluation and Fault Identification Method, provides quantitative information with which prototypes can be compared and problems within an interface identified.
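
    A hedged sketch of how mean failure-rate estimates at the leaves of such a simplified fault tree can be rolled up: an OR gate fails if any input step fails, while an AND gate (a step with workarounds) fails only if every alternative fails, which is how frequent step failures can still yield rare overall system failure. The numbers below are invented for illustration, not taken from the study.

        from functools import reduce

        def p_or(ps):
            # OR gate: the path fails if any step fails (independence assumed).
            return 1 - reduce(lambda acc, p: acc * (1 - p), ps, 1.0)

        def p_and(ps):
            # AND gate: failure only if all redundant alternatives fail.
            return reduce(lambda acc, p: acc * p, ps, 1.0)

        # Invented mean failure estimates for three interface steps; step 2
        # has two workarounds, so it fails only if all three paths fail.
        step1 = 0.02
        step2 = p_and([0.10, 0.30, 0.25])
        step3 = 0.01
        print(f"overall process failure: {p_or([step1, step2, step3]):.4f}")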

  4. The Integrated Safety-Critical Advanced Avionics Communication and Control (ISAACC) System Concept: Infrastructure for ISHM

    NASA Technical Reports Server (NTRS)

    Gwaltney, David A.; Briscoe, Jeri M.

    2005-01-01

    Integrated System Health Management (ISHM) architectures for spacecraft will include hard real-time, critical subsystems and soft real-time monitoring subsystems. Interaction between these subsystems will be necessary and an architecture supporting multiple criticality levels will be required. Demonstration hardware for the Integrated Safety-Critical Advanced Avionics Communication & Control (ISAACC) system has been developed at NASA Marshall Space Flight Center. It is a modular system using a commercially available time-triggered protocol, TTP/C, that supports hard real-time distributed control systems independent of the data transmission medium. The protocol is implemented in hardware and provides guaranteed low-latency messaging with inherent fault-tolerance and fault-containment. Interoperability between modules and systems of modules using TTP/C is guaranteed through definition of messages and the precise message schedule implemented by the master-less Time Division Multiple Access (TDMA) communications protocol. "Plug-and-play" capability for sensors and actuators provides automatically configurable modules supporting sensor recalibration and control algorithm re-tuning without software modification. Modular components of controlled physical system(s) critical to control algorithm tuning, such as pumps or valve components in an engine, can be replaced or upgraded as "plug and play" components without modification to the ISAACC module hardware or software. ISAACC modules can communicate with other vehicle subsystems through time-triggered protocols or other communications protocols implemented over Ethernet, MIL-STD-1553 and RS-485/422. Other communication bus physical layers and protocols can be included as required. In this way, the ISAACC modules can be part of a system-of-systems in a vehicle with multi-tier subsystems of varying criticality. The goal of the ISAACC architecture development is control and monitoring of safety-critical systems of a manned spacecraft. These systems include spacecraft navigation and attitude control, propulsion, automated docking, vehicle health management and life support. ISAACC can integrate local critical subsystem health management with subsystems performing long-term health monitoring. The ISAACC system and its relationship to ISHM will be presented.

  5. Implementation of a model based fault detection and diagnosis for actuation faults of the Space Shuttle main engine

    NASA Technical Reports Server (NTRS)

    Duyar, A.; Guo, T.-H.; Merrill, W.; Musgrave, J.

    1992-01-01

    In a previous study, Guo, Merrill and Duyar, 1990, reported a conceptual development of a fault detection and diagnosis system for actuation faults of the space shuttle main engine. This study, which is a continuation of the previous work, implements the developed fault detection and diagnosis scheme for the real time actuation fault diagnosis of the space shuttle main engine. The scheme will be used as an integral part of an intelligent control system demonstration experiment at NASA Lewis. The diagnosis system utilizes a model based method with real time identification and hypothesis testing for actuation, sensor, and performance degradation faults.
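
    A minimal sketch of the model-based idea described here: compare measured outputs against a nominal model's prediction and threshold the residual, a crude stand-in for the paper's real-time identification and hypothesis testing. The model, signals, and threshold below are illustrative, not the SSME's.

        def nominal_model(u_cmd):
            # Illustrative static actuator model: output tracks 2x the command.
            return 2.0 * u_cmd

        def detect(u_cmd, y_meas, threshold=0.5):
            # Residual-based detection: large model mismatch -> fault hypothesis.
            residual = y_meas - nominal_model(u_cmd)
            if abs(residual) > threshold:
                return "actuation fault suspected", residual
            return "nominal", residual

        for u, y in [(1.0, 2.1), (1.0, 3.4), (0.5, 0.2)]:
            print(f"cmd={u}, meas={y} -> {detect(u, y)}")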

  6. Assurance of Fault Management: Risk-Significant Adverse Condition Awareness

    NASA Technical Reports Server (NTRS)

    Fitz, Rhonda

    2016-01-01

    Fault Management (FM) systems are ranked high in risk-based assessment of criticality within flight software, emphasizing the importance of establishing highly competent domain expertise to provide assurance for NASA projects, especially as spaceflight systems continue to increase in complexity. Insight into specific characteristics of FM architectures seen embedded within safety- and mission-critical software systems analyzed by the NASA Independent Verification & Validation (IV&V) Program has been enhanced with an FM Technical Reference (TR) suite. Benefits are aimed beyond the IV&V community to those that seek ways to efficiently and effectively provide software assurance to reduce the FM risk posture of NASA and other space missions. The identification of particular FM architectures, visibility, and associated IV&V techniques provides a TR suite that enables greater assurance that critical software systems will adequately protect against faults and respond to adverse conditions. The role FM has with regard to overall asset protection of flight software systems is being addressed with the development of an adverse condition (AC) database encompassing flight software vulnerabilities. Identification of potential off-nominal conditions and analysis to determine how a system responds to these conditions are important aspects of hazard analysis and fault management. Understanding what ACs the mission may face, and ensuring they are prevented or addressed, is the responsibility of the assurance team, which necessarily should have insight into ACs beyond those defined by the project itself. Research efforts sponsored by NASA's Office of Safety and Mission Assurance defined terminology, categorized data fields, and designed a baseline repository that centralizes and compiles a comprehensive listing of ACs and correlated data relevant across many NASA missions. This prototype tool helps projects improve analysis by tracking ACs, and allowing queries based on project, mission type, domain component, causal fault, and other key characteristics. The repository has a firm structure, initial collection of data, and an interface established for informational queries, with plans for integration within the Enterprise Architecture at NASA IV&V, enabling support and accessibility across the Agency. The development of an improved workflow process for adaptive, risk-informed FM assurance is currently underway.

  7. Large-scale virtual screening on public cloud resources with Apache Spark.

    PubMed

    Capuccini, Marco; Ahmed, Laeeq; Schaal, Wesley; Laure, Erwin; Spjuth, Ola

    2017-01-01

    Structure-based virtual screening is an in-silico method to screen a target receptor against a virtual molecular library. Applying docking-based screening to large molecular libraries can be computationally expensive; however, it constitutes a trivially parallelizable task. Most of the available parallel implementations are based on the message passing interface, relying on low-failure-rate hardware and fast network connections. Google's MapReduce revolutionized large-scale analysis, enabling the processing of massive datasets on commodity hardware and cloud resources, providing transparent scalability and fault tolerance at the software level. Open source implementations of MapReduce include Apache Hadoop and the more recent Apache Spark. We developed a method to run existing docking-based screening software on distributed cloud resources, utilizing the MapReduce approach. We benchmarked our method, which is implemented in Apache Spark, docking a publicly available target receptor against approximately 2.2 M compounds. The performance experiments show a good parallel efficiency (87%) when running in a public cloud environment. Our method enables parallel structure-based virtual screening on public cloud resources or commodity computer clusters. The degree of scalability that we achieve allows for trying out our method on relatively small libraries first and then scaling to larger libraries. Our implementation is named Spark-VS and it is freely available as open source from GitHub (https://github.com/mcapuccini/spark-vs).
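
    The screening decomposition is embarrassingly parallel over ligands, which is what the MapReduce formulation exploits. A minimal PySpark sketch of the pattern follows; the dock scoring function and the library file name are placeholders (Spark-VS, linked above, is the real implementation).

        from pyspark import SparkContext

        def dock(ligand):
            # Placeholder scoring function; a real pipeline would invoke
            # docking software here and parse its score.
            return ligand, float(len(ligand))

        sc = SparkContext(appName="toy-virtual-screen")
        ligands = sc.textFile("library.smi")        # one molecule per line
        top10 = (ligands.map(dock)                  # trivially parallel map
                        .takeOrdered(10, key=lambda kv: -kv[1]))
        print(top10)
        sc.stop()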

  8. Fault Detection and Correction for the Solar Dynamics Observatory Attitude Control System

    NASA Technical Reports Server (NTRS)

    Starin, Scott R.; Vess, Melissa F.; Kenney, Thomas M.; Maldonado, Manuel D.; Morgenstern, Wendy M.

    2007-01-01

    The Solar Dynamics Observatory is an Explorer-class mission that will launch in early 2009. The spacecraft will operate in a geosynchronous orbit, sending data 24 hours a day to a dedicated ground station in White Sands, New Mexico. It will carry a suite of instruments designed to observe the Sun in multiple wavelengths at unprecedented resolution. The Atmospheric Imaging Assembly includes four telescopes with focal plane CCDs that can image the full solar disk in four different visible wavelengths. The Extreme-ultraviolet Variability Experiment will collect time-correlated data on the activity of the Sun's corona. The Helioseismic and Magnetic Imager will enable study of pressure waves moving through the body of the Sun. The attitude control system on Solar Dynamics Observatory is responsible for four main phases of activity. The physical safety of the spacecraft after separation must be guaranteed. Fine attitude determination and control must be sufficient for instrument calibration maneuvers. The mission science mode requires 2-arcsecond control according to error signals provided by guide telescopes on the Atmospheric Imaging Assembly, one of the three instruments to be carried. Lastly, accurate execution of linear and angular momentum changes to the spacecraft must be provided for momentum management and orbit maintenance. In this paper, single-fault-tolerant fault detection and correction of the Solar Dynamics Observatory attitude control system is described. The attitude control hardware suite for the mission is catalogued, with special attention to redundancy at the hardware level. Four reaction wheels are used where any three are satisfactory. Four pairs of redundant thrusters are employed for orbit change maneuvers and momentum management. Three two-axis gyroscopes provide full redundancy for rate sensing. A digital Sun sensor and two autonomous star trackers provide two-out-of-three redundancy for fine attitude determination. The use of software to maximize chances of recovery from any hardware or software fault is detailed. A generic fault detection and correction software structure is used, allowing additions, deletions, and adjustments to fault detection and correction rules. This software structure is fed by in-line fault tests that are also able to take appropriate actions to avoid corruption of the data stream.
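
    A hedged sketch of what a "generic fault detection and correction software structure" of the kind described can look like: a table of rules, each pairing a persistence-filtered test with a corrective action, so rules can be added, deleted, or adjusted without touching the engine. The names, telemetry, and threshold below are illustrative, not SDO flight software.

        class FDCRule:
            def __init__(self, name, test, action, persistence):
                self.name, self.test, self.action = name, test, action
                self.persistence = persistence   # consecutive hits before acting
                self.count = 0

            def step(self, telemetry):
                self.count = self.count + 1 if self.test(telemetry) else 0
                if self.count >= self.persistence:
                    self.action(telemetry)
                    self.count = 0

        def switch_to_spare_wheel(tm):
            print("FDC: wheel overspeed confirmed, reconfiguring to spare")

        rules = [FDCRule("RW2_SPEED_HIGH",
                         lambda tm: tm["rw2_rpm"] > 6000.0,
                         switch_to_spare_wheel,
                         persistence=3)]

        for tm in ({"rw2_rpm": r} for r in (5900, 6100, 6200, 6300)):
            for rule in rules:
                rule.step(tm)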

  9. Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale

    DOE PAGES

    Engelmann, Christian; Hukerikar, Saurabh

    2017-09-01

    Reliability is a serious concern for future extreme-scale high-performance computing (HPC) systems. Projections based on the current generation of HPC systems and technology roadmaps suggest the prevalence of very high fault rates in future systems. While the HPC community has developed various resilience solutions, application-level techniques as well as system-based solutions, the solution space remains fragmented. There are no formal methods and metrics to integrate the various HPC resilience techniques into composite solutions, nor are there methods to holistically evaluate the adequacy and efficacy of such solutions in terms of their protection coverage, and their performance & power efficiency characteristics. Additionally, few of the current approaches are portable to newer architectures and software environments that will be deployed on future systems. In this paper, we develop a structured approach to the design, evaluation and optimization of HPC resilience using the concept of design patterns. A design pattern is a general repeatable solution to a commonly occurring problem. We identify the problems caused by various types of faults, errors and failures in HPC systems and the techniques used to deal with these events. Each well-known solution that addresses a specific HPC resilience challenge is described in the form of a pattern. We develop a complete catalog of such resilience design patterns, which may be used by system architects, system software and tools developers, application programmers, as well as users and operators as essential building blocks when designing and deploying resilience solutions. We also develop a design framework that enhances a designer's understanding of the opportunities for integrating multiple patterns across layers of the system stack and the important constraints during implementation of the individual patterns. It is also useful for defining mechanisms and interfaces to coordinate flexible fault management across hardware and software components. The resilience patterns and the design framework also enable exploration and evaluation of design alternatives and support optimization of the cost-benefit trade-offs among performance, protection coverage, and power consumption of resilience solutions. The overall goal of this work is to establish a systematic methodology for the design and evaluation of resilience technologies in extreme-scale HPC systems that keep scientific applications running to a correct solution in a timely and cost-efficient manner despite frequent faults, errors, and failures of various types.

  10. Resilience Design Patterns: A Structured Approach to Resilience at Extreme Scale

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Engelmann, Christian; Hukerikar, Saurabh

    Reliability is a serious concern for future extreme-scale high-performance computing (HPC) systems. Projections based on the current generation of HPC systems and technology roadmaps suggest the prevalence of very high fault rates in future systems. While the HPC community has developed various resilience solutions, application-level techniques as well as system-based solutions, the solution space remains fragmented. There are no formal methods and metrics to integrate the various HPC resilience techniques into composite solutions, nor are there methods to holistically evaluate the adequacy and efficacy of such solutions in terms of their protection coverage, and their performance & power efficiency characteristics. Additionally, few of the current approaches are portable to newer architectures and software environments that will be deployed on future systems. In this paper, we develop a structured approach to the design, evaluation and optimization of HPC resilience using the concept of design patterns. A design pattern is a general repeatable solution to a commonly occurring problem. We identify the problems caused by various types of faults, errors and failures in HPC systems and the techniques used to deal with these events. Each well-known solution that addresses a specific HPC resilience challenge is described in the form of a pattern. We develop a complete catalog of such resilience design patterns, which may be used by system architects, system software and tools developers, application programmers, as well as users and operators as essential building blocks when designing and deploying resilience solutions. We also develop a design framework that enhances a designer's understanding of the opportunities for integrating multiple patterns across layers of the system stack and the important constraints during implementation of the individual patterns. It is also useful for defining mechanisms and interfaces to coordinate flexible fault management across hardware and software components. The resilience patterns and the design framework also enable exploration and evaluation of design alternatives and support optimization of the cost-benefit trade-offs among performance, protection coverage, and power consumption of resilience solutions. The overall goal of this work is to establish a systematic methodology for the design and evaluation of resilience technologies in extreme-scale HPC systems that keep scientific applications running to a correct solution in a timely and cost-efficient manner despite frequent faults, errors, and failures of various types.

  11. Fault-tolerant clock synchronization in distributed systems

    NASA Technical Reports Server (NTRS)

    Ramanathan, Parameswaran; Shin, Kang G.; Butler, Ricky W.

    1990-01-01

    Existing fault-tolerant clock synchronization algorithms are compared and contrasted. These include the following: software synchronization algorithms, such as convergence-averaging, convergence-nonaveraging, and consistency algorithms, as well as probabilistic synchronization; hardware synchronization algorithms; and hybrid synchronization. The worst-case clock skews guaranteed by representative algorithms are compared, along with other important aspects such as time, message, and cost overhead imposed by the algorithms. More recent developments such as hardware-assisted software synchronization and algorithms for synchronizing large, partially connected distributed systems are especially emphasized.
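
    As a worked example of the convergence-averaging family surveyed here: each node reads every clock, discards the f largest and f smallest readings so that up to f faulty (even two-faced) clocks cannot drag the result outside the range of good clocks, and resets its clock to the mean of the survivors. This is the generic textbook algorithm, sketched in Python, not any single published variant.

        def converge(readings, f):
            # Fault-tolerant averaging: drop the f lowest and f highest
            # readings, then average the survivors.
            if len(readings) <= 3 * f:
                raise ValueError("need more than 3f clocks to tolerate f faults")
            trimmed = sorted(readings)[f:len(readings) - f]
            return sum(trimmed) / len(trimmed)

        # Four clock readings, one wildly faulty; tolerate f = 1.
        print(converge([100.2, 99.8, 100.1, 57.0], f=1))   # ~99.95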

  12. The development of an interim generalized gate logic software simulator

    NASA Technical Reports Server (NTRS)

    Mcgough, J. G.; Nemeroff, S.

    1985-01-01

    A proof-of-concept computer program called IGGLOSS (Interim Generalized Gate Logic Software Simulator) was developed and is discussed. The simulator engine was designed to perform stochastic estimation of self-test coverage (fault-detection latency times) of digital computers or systems. A major attribute of IGGLOSS is its high-speed simulation: 9.5 million gates/CPU second for non-faulted circuits and 4.4 million gates/CPU second for faulted circuits on a VAX 11/780 host computer.
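
    A toy illustration of the kind of stochastic latency estimation such a simulator performs: run a small gate-level netlist with and without an injected stuck-at fault under random inputs and count the input vectors elapsed until the outputs diverge. The two-gate circuit below is invented; IGGLOSS of course operated on far larger models.

        import random

        def circuit(a, b, c, stuck_n1=False):
            # Tiny netlist: n1 = a AND b; out = n1 OR c.
            n1 = 0 if stuck_n1 else (a & b)      # optional stuck-at-0 on n1
            return n1 | c

        def detection_latency(seed, limit=10_000):
            rng = random.Random(seed)
            for cycle in range(1, limit):
                a, b, c = (rng.randint(0, 1) for _ in range(3))
                if circuit(a, b, c) != circuit(a, b, c, stuck_n1=True):
                    return cycle                 # first output divergence
            return None                          # fault still latent

        lat = [detection_latency(s) for s in range(1000)]
        lat = [x for x in lat if x is not None]
        print("mean fault-detection latency:", sum(lat) / len(lat), "vectors")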

  13. AADL and Model-based Engineering

    DTIC Science & Technology

    2014-10-20

    AADL and MBE, Feiler, Oct 20, 2014, © 2014 Carnegie Mellon University. Recoverable slide text: "We Rely on Software for Safe Aircraft Operation" (embedded software systems) ... "Why do system level failures still occur despite fault tolerance techniques being deployed in systems?"

  14. Software Health Management with Bayesian Networks

    NASA Technical Reports Server (NTRS)

    Mengshoel, Ole; Schumann, Johann

    2011-01-01

    Most modern aircraft, as well as other complex machinery, are equipped with diagnostics systems for their major subsystems. During operation, sensors provide important information about the subsystem (e.g., the engine) and that information is used to detect and diagnose faults. Most of these systems focus on the monitoring of a mechanical, hydraulic, or electromechanical subsystem of the vehicle or machinery. Only recently have health management systems that monitor software been developed. In this paper, we discuss our approach to using Bayesian networks for Software Health Management (SWHM). We discuss SWHM requirements, which make advanced reasoning capabilities for detection and diagnosis important. Then we present our approach to using Bayesian networks for the construction of health models that dynamically monitor a software system and are capable of detecting and diagnosing faults.

  15. A theoretical basis for the analysis of multiversion software subject to coincident errors

    NASA Technical Reports Server (NTRS)

    Eckhardt, D. E., Jr.; Lee, L. D.

    1985-01-01

    Fundamental to the development of redundant software techniques (known as fault-tolerant software) is an understanding of the impact of multiple joint occurrences of errors, referred to here as coincident errors. A theoretical basis for the study of redundant software is developed which: (1) provides a probabilistic framework for empirically evaluating the effectiveness of a general multiversion strategy when component versions are subject to coincident errors, and (2) permits an analytical study of the effects of these errors. An intensity function, called the intensity of coincident errors, has a central role in this analysis. This function describes the propensity of programmers to introduce design faults in such a way that software components fail together when executing in the application environment. A condition under which a multiversion system is a better strategy than relying on a single version is given.
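
    A small worked computation of why the intensity of coincident errors matters: under the independence assumption, a three-version majority vote fails with probability 3p^2(1-p) + p^3, but even a modest probability q of a common design fault failing all versions together can dominate that figure. The numbers are purely illustrative.

        def p_majority_fail_independent(p):
            # A 2-out-of-3 vote fails when at least two versions fail together.
            return 3 * p**2 * (1 - p) + p**3

        p = 0.01    # per-version failure probability on a random input
        q = 0.001   # probability of a coincident (common-cause) error
        independent = p_majority_fail_independent(p)
        with_coincident = q + (1 - q) * independent
        print(f"independent model     : {independent:.6f}")      # ~0.000298
        print(f"with coincident errors: {with_coincident:.6f}")  # ~0.001297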

  16. Integrated Hardware and Software for No-Loss Computing

    NASA Technical Reports Server (NTRS)

    James, Mark

    2007-01-01

    When an algorithm is distributed across multiple threads executing on many distinct processors, a loss of one of those threads or processors can potentially result in the total loss of all the incremental results up to that point. When implementation is massively hardware distributed, the probability of a hardware failure during the course of a long execution is potentially high. Traditionally, this problem has been addressed by establishing checkpoints where the current state of some or part of the execution is saved. Then, in the event of a failure, this state information can be used to recompute that point in the execution and resume the computation from that point. A serious problem that arises when one distributes a problem across multiple threads and physical processors is that one increases the likelihood of the algorithm failing due to no fault of the scientist but as a result of hardware faults coupled with operating system problems. With good reason, scientists expect their computing tools to serve them and not the other way around. What is novel here is a unique combination of hardware and software that reformulates an application into a monolithic structure that can be monitored in real time and dynamically reconfigured in the event of a failure. This unique reformulation of hardware and software will provide advanced aeronautical technologies to meet the challenges of next-generation systems in aviation, for civilian and scientific purposes, in our atmosphere and in atmospheres of other worlds. In particular, with respect to NASA's manned flight to Mars, this technology addresses the critical requirements for improving safety and increasing reliability of manned spacecraft.
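
    A minimal sketch of the checkpoint-and-resume pattern this paragraph describes, using Python's pickle for state capture. Real large-scale checkpointing is far more involved (coordinated across processes, incremental, written atomically); the file name and checkpoint interval here are arbitrary.

        import os
        import pickle

        CKPT = "state.ckpt"

        def load_state():
            # Resume from the last checkpoint if one exists.
            if os.path.exists(CKPT):
                with open(CKPT, "rb") as fh:
                    return pickle.load(fh)
            return {"i": 0, "acc": 0}            # fresh start

        state = load_state()
        for i in range(state["i"], 1_000_000):
            state["acc"] += i                    # the "incremental results"
            if i % 100_000 == 0:                 # periodic checkpoint
                state["i"] = i + 1
                with open(CKPT, "wb") as fh:
                    pickle.dump(state, fh)
        print(state["acc"])                      # survives a mid-run crash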

  17. Spacecraft fault tolerance: The Magellan experience

    NASA Technical Reports Server (NTRS)

    Kasuda, Rick; Packard, Donna Sexton

    1993-01-01

    Interplanetary and earth orbiting missions are now imposing unique fault tolerant requirements upon spacecraft design. Mission success is the prime motivator for building spacecraft with fault tolerant systems. The Magellan spacecraft had many such requirements imposed upon its design. Magellan met these requirements by building redundancy into all the major subsystem components and designing the onboard hardware and software with the capability to detect a fault, isolate it to a component, and issue commands to achieve a back-up configuration. This discussion is limited to fault protection, which is the autonomous capability to respond to a fault. The Magellan fault protection design is discussed, as well as the developmental and flight experiences and a summary of the lessons learned.

  18. Results of an electrical power system fault study (CDDF)

    NASA Technical Reports Server (NTRS)

    Dugal-Whitehead, N. R.; Johnson, Y. B.

    1993-01-01

    This report gives the results of an electrical power system fault study conducted over the last two and one-half years. First, the results of the literature search into electrical power system faults in space and terrestrial power system applications are reported. A description of the intended implementations of the power system faults into the Large Autonomous Spacecraft Electrical Power System (LASEPS) breadboard is then presented. Next, the actual implementation of the faults into the breadboard is discussed, along with a description of the LASEPS breadboard. Finally, the results of the injected faults and breadboard failures are discussed.

  19. Multi-Agent Diagnosis and Control of an Air Revitalization System for Life Support in Space

    NASA Technical Reports Server (NTRS)

    Malin, Jane T.; Kowing, Jeffrey; Nieten, Joseph; Graham, Jeffrey S.; Schreckenghost, Debra; Bonasso, Pete; Fleming, Land D.; MacMahon, Matt; Thronesbery, Carroll

    2000-01-01

    An architecture of interoperating agents has been developed to provide control and fault management for advanced life support systems in space. In this adjustable autonomy architecture, software agents coordinate with human agents and provide support in novel fault management situations. This architecture combines the Livingstone model-based mode identification and reconfiguration (MIR) system with the 3T architecture for autonomous flexible command and control. The MIR software agent performs model-based state identification and diagnosis. MIR identifies novel recovery configurations and the set of commands required for the recovery. The 3T procedural executive and the human operator use the diagnoses and recovery recommendations, and provide command sequencing. User interface extensions have been developed to support human monitoring of both 3T and MIR data and activities. This architecture has been demonstrated performing control and fault management for an oxygen production system for air revitalization in space. The software operates in a dynamic simulation testbed.

  20. Transparent Ada rendezvous in a fault tolerant distributed system

    NASA Technical Reports Server (NTRS)

    Racine, Roger

    1986-01-01

    There are many problems associated with distributing an Ada program over a loosely coupled communication network. Some of these problems involve the various aspects of the distributed rendezvous. The problems addressed involve supporting the delay statement in a selective call and supporting the else clause in a selective call. Most of these difficulties are compounded by the need for an efficient communication system. The difficulties are compounded even more by considering the possibility of hardware faults occurring while the program is running. With a hardware fault tolerant computer system, it is possible to design a distribution scheme and communication software which is efficient and allows Ada semantics to be preserved. An Ada design for the communications software of one such system will be presented, including a description of the services provided in the seven layers of an International Standards Organization (ISO) Open System Interconnect (OSI) model communications system. The system capabilities (hardware and software) that allow this communication system will also be described.

  1. Characterization of the faulted behavior of digital computers and fault tolerant systems

    NASA Technical Reports Server (NTRS)

    Bavuso, Salvatore J.; Miner, Paul S.

    1989-01-01

    A development status evaluation is presented for efforts conducted at NASA-Langley since 1977, toward the characterization of the latent fault in digital fault-tolerant systems. Attention is given to the practical, high speed, generalized gate-level logic system simulator developed, as well as to the validation methodology used for the simulator, on the basis of faultable software and hardware simulations employing a prototype MIL-STD-1750A processor. After validation, latency tests will be performed.

  2. Computer Sciences and Data Systems, volume 1

    NASA Technical Reports Server (NTRS)

    1987-01-01

    Topics addressed include: software engineering; university grants; institutes; concurrent processing; sparse distributed memory; distributed operating systems; intelligent data management processes; expert system for image analysis; fault tolerant software; and architecture research.

  3. Survey of Verification and Validation Techniques for Small Satellite Software Development

    NASA Technical Reports Server (NTRS)

    Jacklin, Stephen A.

    2015-01-01

    The purpose of this paper is to provide an overview of the current trends and practices in small-satellite software verification and validation. This document is not intended to promote a specific software assurance method. Rather, it seeks to present an unbiased survey of software assurance methods used to verify and validate small satellite software and to make mention of the benefits and value of each approach. These methods include simulation and testing, verification and validation with model-based design, formal methods, and fault-tolerant software design with run-time monitoring. Although the literature reveals that simulation and testing has by far the longest legacy, model-based design methods are proving to be useful for software verification and validation. Some work in formal methods, though not widely used for any satellites, may offer new ways to improve small satellite software verification and validation. These methods need to be further advanced to deal with the state explosion problem and to make them more usable by small-satellite software engineers to be regularly applied to software verification. Last, it is explained how run-time monitoring, combined with fault-tolerant software design methods, provides an important means to detect and correct software errors that escape the verification process or those errors that are produced after launch through the effects of ionizing radiation.

  4. Space station data system analysis/architecture study. Task 2: Options development DR-5. Volume 1: Technology options

    NASA Technical Reports Server (NTRS)

    1985-01-01

    The second task in the Space Station Data System (SSDS) Analysis/Architecture Study is the development of an information base that will support the conduct of trade studies and provide sufficient data to make key design/programmatic decisions. This volume identifies the preferred options in the technology category and characterizes these options with respect to performance attributes, constraints, cost, and risk. The technology category includes advanced materials, processes, and techniques that can be used to enhance the implementation of SSDS design structures. The specific areas discussed are mass storage, including space and ground on-line storage and off-line storage; man/machine interface; data processing hardware, including flight computers and advanced/fault-tolerant computer architectures; and software, including data compression algorithms, on-board high-level languages, and software tools. Also discussed are artificial intelligence applications and hard-wire communications.

  5. TERMTrial--terminology-based documentation systems for cooperative clinical trials.

    PubMed

    Merzweiler, A; Weber, R; Garde, S; Haux, R; Knaup-Gregori, P

    2005-04-01

    Within cooperative groups of multi-center clinical trials, standardized documentation is a prerequisite for communication and sharing of data. Standardizing documentation systems means standardizing the underlying terminology. The management and consistent application of terminology systems is a difficult and fault-prone task, which should be supported by appropriate software tools. Today, documentation systems for clinical trials are often implemented as so-called Remote Data Entry Systems (RDE systems). Although there are many commercial systems that support the development of RDE systems, none offers comprehensive terminological support. Therefore, we developed the software system TERMTrial, which consists of a component for the definition and management of terminology systems for cooperative groups of clinical trials and two components for the terminology-based automatic generation of trial databases and terminology-based interactive design of electronic case report forms (eCRFs). TERMTrial combines the advantages of remote data entry with comprehensive terminological control.

  6. Using Remote Sensing Data to Constrain Models of Fault Interactions and Plate Boundary Deformation

    NASA Astrophysics Data System (ADS)

    Glasscoe, M. T.; Donnellan, A.; Lyzenga, G. A.; Parker, J. W.; Milliner, C. W. D.

    2016-12-01

    Determining the distribution of slip and behavior of fault interactions at plate boundaries is a complex problem. Field and remotely sensed data often lack the necessary coverage to fully resolve fault behavior. However, realistic physical models may be used to more accurately characterize the complex behavior of faults constrained with observed data, such as GPS, InSAR, and SfM. These results will improve the utility of using combined models and data to estimate earthquake potential and characterize plate boundary behavior. Plate boundary faults exhibit complex behavior, with partitioned slip and distributed deformation. To investigate what fraction of slip becomes distributed deformation off major faults, we examine a model fault embedded within a damage zone of reduced elastic rigidity that narrows with depth and forward model the slip and resulting surface deformation. The fault segments and slip distributions are modeled using the JPL GeoFEST software. GeoFEST (Geophysical Finite Element Simulation Tool) is a two- and three-dimensional finite element software package for modeling solid stress and strain in geophysical and other continuum domain applications [Lyzenga, et al., 2000; Glasscoe, et al., 2004; Parker, et al., 2008, 2010]. New methods to advance geohazards research using computer simulations and remotely sensed observations for model validation are required to understand fault slip, the complex nature of fault interaction and plate boundary deformation. These models help enhance our understanding of the underlying processes, such as transient deformation and fault creep, and can aid in developing observation strategies for sUAV, airborne, and upcoming satellite missions seeking to determine how faults behave and interact and assess their associated hazard. Models will also help to characterize this behavior, which will enable improvements in hazard estimation. Validating the model results against remotely sensed observations will allow us to better constrain fault zone rheology and physical properties, having implications for the overall understanding of earthquake physics, fault interactions, plate boundary deformation and earthquake hazard, preparedness and risk reduction.

  7. Fault Tolerant Real-Time Systems

    DTIC Science & Technology

    1993-09-30

    The ART (Advanced Real-Time Technology) Project of Carnegie Mellon University is engaged in wide-ranging research on hard real-time systems ... including hardware and software fault tolerance using temporal redundancy and analytic redundancy to permit the construction of real-time systems whose ...

  8. A Vehicle Management End-to-End Testing and Analysis Platform for Validation of Mission and Fault Management Algorithms to Reduce Risk for NASA's Space Launch System

    NASA Technical Reports Server (NTRS)

    Trevino, Luis; Johnson, Stephen B.; Patterson, Jonathan; Teare, David

    2015-01-01

    The development of the Space Launch System (SLS) launch vehicle requires cross-discipline teams with extensive knowledge of launch vehicle subsystems, information theory, and autonomous algorithms dealing with all operations from pre-launch through on-orbit operations. The characteristics of these systems must be matched with the autonomous algorithm monitoring and mitigation capabilities for accurate control and response to abnormal conditions throughout all vehicle mission flight phases, including precipitating safing actions and crew aborts. This presents a large, complex systems engineering challenge being addressed in part by focusing on the specific subsystems' handling of off-nominal mission and fault tolerance. Using traditional model-based system and software engineering design principles from the Unified Modeling Language (UML), the Mission and Fault Management (M&FM) algorithms are crafted and vetted in specialized Integrated Development Teams composed of multiple development disciplines. NASA also has formed an M&FM team for addressing fault management early in the development lifecycle. This team has developed a dedicated Vehicle Management End-to-End Testbed (VMET) that integrates specific M&FM algorithms, specialized nominal and off-nominal test cases, and vendor-supplied physics-based launch vehicle subsystem models. The flexibility of VMET enables thorough testing of the M&FM algorithms by providing configurable suites of both nominal and off-nominal test cases to validate the algorithms utilizing actual subsystem models. The intent is to validate the algorithms and substantiate them with performance baselines for each of the vehicle subsystems in an independent platform exterior to flight software test processes. In any software development process there is inherent risk in the interpretation and implementation of concepts into software through requirements and test processes. Risk reduction is addressed by working with other organizations such as S&MA, Structures and Environments, GNC, Orion, the Crew Office, Flight Operations, and Ground Operations by assessing performance of the M&FM algorithms in terms of their ability to reduce Loss of Mission and Loss of Crew probabilities. In addition, through state machine and diagnostic modeling, analysis efforts investigate a broader suite of failure effects and detection and responses that can be tested in VMET and confirm that responses do not create additional risks or cause undesired states through interactive dynamic effects with other algorithms and systems. VMET further contributes to risk reduction by prototyping and exercising the M&FM algorithms early in their implementation and without any inherent hindrances such as meeting FSW processor scheduling constraints due to their target platform - ARINC 653 partitioned OS, resource limitations, and other factors related to integration with other subsystems not directly involved with M&FM. The plan for VMET encompasses testing the original M&FM algorithms coded in the same C++ language and state machine architectural concepts as that used by Flight Software. This enables the development of performance standards and test cases to characterize the M&FM algorithms and sets a benchmark from which to measure the effectiveness of M&FM algorithms performance in the FSW development and test processes. This paper is outlined in a systematic fashion analogous to a lifecycle process flow for engineering development of algorithms into software and testing. Section I describes the NASA SLS M&FM context, presenting the current infrastructure, leading principles, methods, and participants. Section II defines the testing philosophy of the M&FM algorithms as related to VMET, followed by Section III, which presents the modeling methods of the algorithms to be tested and validated in VMET. Its details are then further presented in Section IV, followed by Section V, presenting integration, test status, and state analysis. Finally, Section VI addresses the summary and forward directions, followed by the appendices presenting relevant information on terminology and documentation.

  9. Calculation and use of an environment's characteristic software metric set

    NASA Technical Reports Server (NTRS)

    Basili, Victor R.; Selby, Richard W., Jr.

    1985-01-01

    Since both cost/quality and production environments differ, this study presents an approach for customizing a characteristic set of software metrics to an environment. The approach is applied in the Software Engineering Laboratory (SEL), a NASA Goddard production environment, to 49 candidate process and product metrics of 652 modules from six (51,000 to 112,000 lines) projects. For this particular environment, the method yielded the characteristic metric set (source lines, fault correction effort per executable statement, design effort, code effort, number of I/O parameters, number of versions). The uses examined for a characteristic metric set include forecasting the effort for development, modification, and fault correction of modules based on historical data.

  10. Simulation loop between cad systems, GEANT-4 and GeoModel: Implementation and results

    NASA Astrophysics Data System (ADS)

    Sharmazanashvili, A.; Tsutskiridze, Niko

    2016-09-01

    Comparative analysis of simulated and as-built geometry descriptions of a detector is an important field of study for data-vs-Monte-Carlo discrepancies. Consistency and level of detail of shapes are less important, whereas adequate volumes and weights of detector components are essential for tracking. There are two main causes of faults in the geometry descriptions used in simulation: (1) differences between the simulated and as-built geometry descriptions; (2) internal inaccuracies in geometry transformations added by the simulation software infrastructure itself. The Georgian engineering team developed a hub based on the CATIA platform, with several tools enabling different descriptions used by simulation packages to be read into CATIA, such as XML->CATIA, VP1->CATIA, GeoModel->CATIA, and Geant4->CATIA. As a result, it becomes possible to compare the different descriptions with each other using the full power of CATIA and to investigate both classes of faults in geometry descriptions. The paper presents results of case studies of the ATLAS Coils and End-Cap toroid structures.

  11. Attitude Determination and Control System (ADCS) and Maintenance and Diagnostic System (MDS): A maintenance and diagnostic system for Space Station Freedom

    NASA Technical Reports Server (NTRS)

    Toms, David; Hadden, George D.; Harrington, Jim

    1990-01-01

    The Maintenance and Diagnostic System (MDS) that is being developed at Honeywell to enhance the Fault Detection, Isolation and Recovery (FDIR) system for the Attitude Determination and Control System on Space Station Freedom is described. The MDS demonstrates ways that AI-based techniques can be used to improve the maintainability and safety of the Station by helping to resolve fault anomalies that cannot be fully determined by built-in test, by providing predictive maintenance capabilities, and by providing expert maintenance assistance. The MDS will address the problems associated with reasoning about dynamic, continuous information rather than only about static data, the concerns of porting software based on AI techniques to embedded targets, and the difficulties associated with real-time response. An initial prototype of the MDS was built. The prototype executes on Sun and IBM PS/2 hardware and is implemented in Common Lisp; further work will evaluate its functionality and develop mechanisms to port the code to Ada.

  12. Multi-focus and multi-level techniques for visualization and analysis of networks with thematic data

    NASA Astrophysics Data System (ADS)

    Cossalter, Michele; Mengshoel, Ole J.; Selker, Ted

    2013-01-01

    Information-rich data sets bring several challenges in the areas of visualization and analysis, even when associated with node-link network visualizations. This paper presents an integration of multi-focus and multi-level techniques that enable interactive, multi-step comparisons in node-link networks. We describe NetEx, a visualization tool that enables users to simultaneously explore different parts of a network and its thematic data, such as time series or conditional probability tables. NetEx, implemented as a Cytoscape plug-in, has been applied to the analysis of electrical power networks, Bayesian networks, and the Enron e-mail repository. In this paper we briefly discuss visualization and analysis of the Enron social network, but focus on data from an electrical power network. Specifically, we demonstrate how NetEx supports the analytical task of electrical power system fault diagnosis. Results from a user study with 25 subjects suggest that NetEx enables more accurate isolation of complex faults compared to a specially designed software tool.

  13. Security Vulnerability Profiles of Mission Critical Software: Empirical Analysis of Security Related Bug Reports

    NASA Technical Reports Server (NTRS)

    Goseva-Popstojanova, Katerina; Tyo, Jacob

    2017-01-01

    While some prior research work exists on characteristics of software faults (i.e., bugs) and failures, very little work has been published on analysis of software application vulnerabilities. This paper aims to contribute towards filling that gap by presenting an empirical investigation of application vulnerabilities. The results are based on data extracted from issue tracking systems of two NASA missions. These data were organized in three datasets: Ground mission IV&V issues, Flight mission IV&V issues, and Flight mission Developers issues. In each dataset, we identified security related software bugs and classified them in specific vulnerability classes. Then, we created the security vulnerability profiles, i.e., determined where and when the security vulnerabilities were introduced and what were the dominating vulnerability classes. Our main findings include: (1) In the IV&V issues datasets the majority of vulnerabilities were code related and were introduced in the Implementation phase. (2) For all datasets, around 90% of the vulnerabilities were located in two to four subsystems. (3) Out of 21 primary classes, five dominated: Exception Management, Memory Access, Other, Risky Values, and Unused Entities. Together, they contributed from 80% to 90% of the vulnerabilities in each dataset.

  14. Towards Real-time, On-board, Hardware-Supported Sensor and Software Health Management for Unmanned Aerial Systems

    NASA Technical Reports Server (NTRS)

    Schumann, Johann; Rozier, Kristin Y.; Reinbacher, Thomas; Mengshoel, Ole J.; Mbaya, Timmy; Ippolito, Corey

    2013-01-01

    Unmanned aerial systems (UASs) can only be deployed if they can effectively complete their missions and respond to failures and uncertain environmental conditions while maintaining safety with respect to other aircraft as well as humans and property on the ground. In this paper, we design a real-time, on-board system health management (SHM) capability to continuously monitor sensors, software, and hardware components for detection and diagnosis of failures and violations of safety or performance rules during the flight of a UAS. Our approach to SHM is three-pronged, providing: (1) real-time monitoring of sensor and/or software signals; (2) signal analysis, preprocessing, and advanced on-the-fly temporal and Bayesian probabilistic fault diagnosis; (3) an unobtrusive, lightweight, read-only, low-power realization using Field Programmable Gate Arrays (FPGAs) that avoids overburdening limited computing resources or costly re-certification of flight software due to instrumentation. Our implementation provides a novel approach of combining modular building blocks, integrating responsive runtime monitoring of temporal logic system safety requirements with model-based diagnosis and Bayesian network-based probabilistic analysis. We demonstrate this approach using actual data from the NASA Swift UAS, an experimental all-electric aircraft.
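
    A software sketch of the kind of temporal rule such a monitor evaluates, e.g. "every command is acknowledged within k steps"; the paper's contribution is realizing checks like this in FPGA hardware over live signal streams, without instrumenting the flight software. The rule and trace below are invented.

        def monitor(trace, k=3):
            # Temporal rule: every "cmd" must be followed by "ack" within k steps.
            pending = None                       # step index of unacknowledged cmd
            for step, event in enumerate(trace):
                if pending is not None and step - pending > k:
                    return f"violation: cmd at step {pending} unacknowledged"
                if event == "cmd":
                    pending = step
                elif event == "ack":
                    pending = None
            return "trace satisfies the rule"

        print(monitor(["cmd", "idle", "ack", "cmd", "idle", "idle", "idle", "idle"]))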

  15. An Investigation of Network Enterprise Risk Management Techniques to Support Military Net-Centric Operations

    DTIC Science & Technology

    2009-09-01

    this information supports the decision-making process as it is applied to the management of risk. Operational risk is the threat ... However, to make a software system fault tolerant, the system needs to recognize and fix a system state condition. To detect a fault, a fault ...

  16. Secure Embedded System Design Methodologies for Military Cryptographic Systems

    DTIC Science & Technology

    2016-03-31

    Fault-Tree Analysis (FTA); Built-In Self-Test (BIST). Introduction: Secure access-control systems restrict operations to authorized users via methods ... failures in the individual software/processor elements, the question of exactly how unlikely is difficult to answer. Fault-Tree Analysis (FTA) has a ... Collins of Sandia National Laboratories for years of sharing his extensive knowledge of Fail-Safe Design Assurance and Fault-Tree Analysis

  17. Intelligent fault management for the Space Station active thermal control system

    NASA Technical Reports Server (NTRS)

    Hill, Tim; Faltisco, Robert M.

    1992-01-01

    The Thermal Advanced Automation Project (TAAP) approach and architecture for automating the Space Station Freedom (SSF) Active Thermal Control System (ATCS) are described. The baseline functionality and advanced automation techniques for Fault Detection, Isolation, and Recovery (FDIR) will be compared and contrasted. Advanced automation techniques such as rule-based systems and model-based reasoning should be utilized to efficiently control, monitor, and diagnose this extremely complex physical system. TAAP is developing advanced FDIR software for use on the SSF thermal control system. The goal of TAAP is to join Knowledge-Based System (KBS) technology, using a combination of rules and model-based reasoning, with conventional monitoring and control software in order to maximize autonomy of the ATCS. TAAP's predecessor was NASA's Thermal Expert System (TEXSYS) project, which was the first large real-time expert system to use both extensive rules and model-based reasoning to control and perform FDIR on a large, complex physical system. TEXSYS showed that a method is needed for safely and inexpensively testing all possible faults of the ATCS, particularly those potentially damaging to the hardware, in order to develop a fully capable FDIR system. TAAP therefore includes the development of a high-fidelity simulation of the thermal control system. The simulation provides realistic, dynamic ATCS behavior and fault insertion capability for software testing without hardware-related risks or expense. In addition, thermal engineers will gain greater confidence in the KBS FDIR software than was possible prior to this kind of simulation testing. The TAAP KBS will initially be a ground-based extension of the baseline ATCS monitoring and control software and could be migrated on-board as additional computation resources are made available.

  18. Comparative Study of Fault Diagnostic Methods in Voltage Source Inverter Fed Three Phase Induction Motor Drive

    NASA Astrophysics Data System (ADS)

    Dhumale, R. B.; Lokhande, S. D.

    2017-05-01

    The three-phase Pulse Width Modulation inverter plays a vital role in industrial applications, but its performance degrades as faults develop within it. The most widely used switching devices in power electronics are Insulated Gate Bipolar Transistors (IGBTs) and Metal-Oxide-Semiconductor Field-Effect Transistors (MOSFETs). IGBT faults are broadly classified as gate or collector open-circuit faults, misfiring faults, and short-circuit faults. To improve the consistency and performance of the inverter, knowledge of the fault mode is extremely important. This paper presents a comparative study of IGBT fault diagnosis methods. An experimental setup was implemented for data acquisition under various faulty and healthy conditions. Recent methods are implemented in MATLAB-Simulink and compared on key parameters: average accuracy, fault detection time, implementation effort, threshold dependency, detection parameter, robustness to noise, and load dependency.
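    One of the classic families compared in such studies, the normalized average-current method for open-circuit faults, is easy to sketch: an open upper switch removes the positive half-cycle of its phase current, driving that phase's normalized cycle mean strongly negative. The waveforms and the threshold below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

f, fs = 50.0, 10_000.0
t = np.arange(0, 0.04, 1 / fs)              # two fundamental cycles
phases = {ph: np.sin(2 * np.pi * f * t - k * 2 * np.pi / 3)
          for k, ph in enumerate("abc")}
phases["b"] = np.minimum(phases["b"], 0.0)  # emulate an open upper IGBT in leg b

for ph, i in phases.items():
    mean = i.mean() / np.abs(i).mean()      # normalized cycle-average current
    if mean < -0.1:
        print(f"phase {ph}: upper-switch open-circuit suspected")
    elif mean > 0.1:
        print(f"phase {ph}: lower-switch open-circuit suspected")
    else:
        print(f"phase {ph}: healthy")
```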

  19. An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics

    PubMed Central

    2010-01-01

    Background Bioinformatics researchers are now confronted with analysis of ultra large-scale data sets, a problem that will only increase at an alarming rate in coming years. Recent developments in open source software, that is, the Hadoop project and associated software, provide a foundation for scaling to petabyte scale data warehouses on Linux clusters, providing fault-tolerant parallelized analysis on such data using a programming style named MapReduce. Description An overview is given of the current usage within the bioinformatics community of Hadoop, a top-level Apache Software Foundation project, and of associated open source software projects. The concepts behind Hadoop and the associated HBase project are defined, and current bioinformatics software that employs Hadoop is described. The focus is on next-generation sequencing, as the leading application area to date. Conclusions Hadoop and the MapReduce programming paradigm already have a substantial base in the bioinformatics community, especially in the field of next-generation sequencing analysis, and such use is increasing. This is due to the cost-effectiveness of Hadoop-based analysis on commodity Linux clusters, and in the cloud via data upload to cloud vendors who have implemented Hadoop/HBase; and due to the effectiveness and ease-of-use of the MapReduce method in parallelization of many data analysis algorithms. PMID:21210976

  20. An overview of the Hadoop/MapReduce/HBase framework and its current applications in bioinformatics.

    PubMed

    Taylor, Ronald C

    2010-12-21

    Bioinformatics researchers are now confronted with analysis of ultra large-scale data sets, a problem that will only increase at an alarming rate in coming years. Recent developments in open source software, that is, the Hadoop project and associated software, provide a foundation for scaling to petabyte scale data warehouses on Linux clusters, providing fault-tolerant parallelized analysis on such data using a programming style named MapReduce. An overview is given of the current usage within the bioinformatics community of Hadoop, a top-level Apache Software Foundation project, and of associated open source software projects. The concepts behind Hadoop and the associated HBase project are defined, and current bioinformatics software that employs Hadoop is described. The focus is on next-generation sequencing, as the leading application area to date. Hadoop and the MapReduce programming paradigm already have a substantial base in the bioinformatics community, especially in the field of next-generation sequencing analysis, and such use is increasing. This is due to the cost-effectiveness of Hadoop-based analysis on commodity Linux clusters, and in the cloud via data upload to cloud vendors who have implemented Hadoop/HBase; and due to the effectiveness and ease-of-use of the MapReduce method in parallelization of many data analysis algorithms.
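    The MapReduce pattern itself is small enough to sketch in pure Python on a toy bioinformatics task, counting k-mers across sequencing reads. Hadoop's contribution is running the same two phases fault-tolerantly across a cluster of machines; this local simulation of map, shuffle, and reduce (with invented reads) does not attempt that.

```python
from collections import defaultdict
from itertools import chain

reads = ["ACGTAC", "GTACGT", "ACGT"]   # illustrative sequencing reads
K = 3

def mapper(read):                      # map: read -> (k-mer, 1) pairs
    for i in range(len(read) - K + 1):
        yield read[i:i + K], 1

def shuffle(pairs):                    # shuffle: group values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):              # reduce: sum the counts per k-mer
    return key, sum(values)

counts = dict(reducer(k, v) for k, v in
              shuffle(chain.from_iterable(map(mapper, reads))).items())
print(counts)   # {'ACG': 3, 'CGT': 3, 'GTA': 2, 'TAC': 2}
```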

  1. Development of an Environment for Software Reliability Model Selection

    DTIC Science & Technology

    1992-09-01

    now is directed to other related problems such as tools for model selection, multiversion programming, and software fault tolerance modeling... multiversion programming, 7. Hardware can be repaired by spare modules, which is not the case for software... Preventive maintenance is very important

  2. Power plant fault detection using artificial neural network

    NASA Astrophysics Data System (ADS)

    Thanakodi, Suresh; Nazar, Nazatul Shiema Moh; Joini, Nur Fazriana; Hidzir, Hidzrin Dayana Mohd; Awira, Mohammad Zulfikar Khairul

    2018-02-01

    Faults in power plants arise from various factors and can cause system outages. There are many types of faults in power systems, such as single line-to-ground, double line-to-ground, and line-to-line faults. The primary aim of this paper is to diagnose faults in a 14-bus power system by using an Artificial Neural Network (ANN). A Multilayer Perceptron (MLP) network was trained for fault detection using offline training methods: Gradient Descent Backpropagation (GDBP), Levenberg-Marquardt (LM), and Bayesian Regularization (BR). The best-performing method was used to build the Graphical User Interface (GUI). The 14-bus power system model, network training, and GUI were implemented in MATLAB.
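    As a sketch of the approach, the snippet below trains a small MLP fault classifier on synthetic bus features. scikit-learn stands in for MATLAB's neural network toolbox (its solvers differ from the GDBP, LM, and BR trainers named in the abstract), and the features and fault labels are invented for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(1)
n = 600
X = rng.normal(size=(n, 4))                     # e.g. voltage/current features per case
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)   # 0: line-to-ground, 1: line-to-line
X += 0.1 * rng.normal(size=X.shape)             # measurement noise

clf = MLPClassifier(hidden_layer_sizes=(10,), solver="lbfgs",
                    max_iter=2000, random_state=0)
clf.fit(X[:500], y[:500])                       # offline training
print(f"held-out accuracy: {clf.score(X[500:], y[500:]):.2f}")
```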

  3. Development of N-version software samples for an experiment in software fault tolerance

    NASA Technical Reports Server (NTRS)

    Lauterbach, L.

    1987-01-01

    The report documents the task planning and software development phases of an effort to obtain twenty versions of code independently designed and developed from a common specification. These versions were created for use in future experiments in software fault tolerance, in continuation of the experimental series underway at the Systems Validation Methods Branch (SVMB) at NASA Langley Research Center. The 20 versions were developed under controlled conditions at four U.S. universities, by 20 teams of two researchers each. The versions process raw data from a modified Redundant Strapped Down Inertial Measurement Unit (RSDIMU). The specifications, and over 200 questions submitted by the developers concerning the specifications, are included as appendices to this report. Design documents, and design and code walkthrough reports for each version, were also obtained in this task for use in future studies.

  4. Logic flowgraph methodology - A tool for modeling embedded systems

    NASA Technical Reports Server (NTRS)

    Muthukumar, C. T.; Guarro, S. B.; Apostolakis, G. E.

    1991-01-01

    The logic flowgraph methodology (LFM), a method for modeling hardware in terms of its process parameters, has been extended to form an analytical tool for the analysis of integrated (hardware/software) embedded systems. In the software part of a given embedded system model, timing and the control flow among different software components are modeled by augmenting LFM with modified Petri net structures. The objective of the use of such an augmented LFM model is to uncover possible errors and the potential for unanticipated software/hardware interactions. This is done by backtracking through the augmented LFM model according to established procedures which allow the semiautomated construction of fault trees for any chosen state of the embedded system (top event). These fault trees, in turn, produce the possible combinations of lower-level states (events) that may lead to the top event.

  5. Surrogate oracles, generalized dependency and simpler models

    NASA Technical Reports Server (NTRS)

    Wilson, Larry

    1990-01-01

    Software reliability models require the sequence of interfailure times from the debugging process as input. It was previously illustrated that using data from replicated debugging could greatly improve reliability predictions. However, inexpensive replication of the debugging process requires the existence of a cheap, fast error detector. Laboratory experiments can be designed around a gold version which is used as an oracle or around an n-version error detector. Unfortunately, software developers cannot be expected to have an oracle or to bear the expense of n-versions. A generic technique is being investigated for approximating replicated data by using the partially debugged software as a difference detector. It is believed that the failure rate of each fault has significant dependence on the presence or absence of other faults. Thus, in order to discuss a failure rate for a known fault, the presence or absence of each of the other known faults needs to be specified. Also of interest are simpler models which use shorter input sequences without sacrificing accuracy; in fact, a possible gain in performance is conjectured. To investigate these propositions, NASA computers running LIC (RTI) versions are used to generate data. This data will be used to label the debugging graph associated with each version. These labeled graphs will be used to test the utility of a surrogate oracle, to analyze the dependent nature of fault failure rates and to explore the feasibility of reliability models which use the data of only the most recent failures.

  6. An optimized implementation of a fault-tolerant clock synchronization circuit

    NASA Technical Reports Server (NTRS)

    Torres-Pomales, Wilfredo

    1995-01-01

    A fault-tolerant clock synchronization circuit was designed and tested. A comparison to a previous design and the procedure followed to achieve the current optimization are included. The report also includes a description of the system and the results of tests performed to study the synchronization and fault-tolerant characteristics of the implementation.

  7. Automated Operations Development for Advanced Exploration Systems

    NASA Technical Reports Server (NTRS)

    Haddock, Angie; Stetson, Howard K.

    2012-01-01

    Automated space operations command and control software development and its implementation must be an integral part of the vehicle design effort. The software design must encompass autonomous fault detection, isolation, recovery capabilities and also provide single button intelligent functions for the crew. Development, operations and safety approval experience with the Timeliner system on-board the International Space Station (ISS), which provided autonomous monitoring with response and single command functionality of payload systems, can be built upon for future automated operations, as the ISS Payload effort was the first and only autonomous command and control system to be in continuous execution (6 years), 24 hours a day, 7 days a week within a crewed spacecraft environment. Utilizing proven capabilities from the ISS Higher Active Logic (HAL) System [1], along with the execution component design from within the HAL 9000 Space Operating System [2], this design paper will detail the initial HAL System software architecture and interfaces as applied to NASA's Habitat Demonstration Unit (HDU) in support of the Advanced Exploration Systems, Autonomous Mission Operations project. The development and implementation of integrated simulators within this development effort will also be detailed and is the first step in verifying the effectiveness of the HAL 9000 Integrated Test-Bed Component [2] designs. This design paper will conclude with a summary of the current development status and future development goals as it pertains to automated command and control for the HDU.

  8. Automated Operations Development for Advanced Exploration Systems

    NASA Technical Reports Server (NTRS)

    Haddock, Angie T.; Stetson, Howard

    2012-01-01

    Automated space operations command and control software development and its implementation must be an integral part of the vehicle design effort. The software design must encompass autonomous fault detection, isolation, recovery capabilities and also provide "single button" intelligent functions for the crew. Development, operations and safety approval experience with the Timeliner system onboard the International Space Station (ISS), which provided autonomous monitoring with response and single command functionality of payload systems, can be built upon for future automated operations, as the ISS Payload effort was the first and only autonomous command and control system to be in continuous execution (6 years), 24 hours a day, 7 days a week within a crewed spacecraft environment. Utilizing proven capabilities from the ISS Higher Active Logic (HAL) System, along with the execution component design from within the HAL 9000 Space Operating System, this design paper will detail the initial HAL System software architecture and interfaces as applied to NASA's Habitat Demonstration Unit (HDU) in support of the Advanced Exploration Systems, Autonomous Mission Operations project. The development and implementation of integrated simulators within this development effort will also be detailed and is the first step in verifying the effectiveness of the HAL 9000 Integrated Test-Bed Component [2] designs. This design paper will conclude with a summary of the current development status and future development goals as it pertains to automated command and control for the HDU.

  9. Intelligent systems technology infrastructure for integrated systems

    NASA Technical Reports Server (NTRS)

    Lum, Henry, Jr.

    1991-01-01

    Significant advances have occurred during the last decade in intelligent systems technologies (a.k.a. knowledge-based systems, KBS) including research, feasibility demonstrations, and technology implementations in operational environments. Evaluation and simulation data obtained to date in real-time operational environments suggest that cost-effective utilization of intelligent systems technologies can be realized for Automated Rendezvous and Capture applications. The successful implementation of these technologies involve a complex system infrastructure integrating the requirements of transportation, vehicle checkout and health management, and communication systems without compromise to systems reliability and performance. The resources that must be invoked to accomplish these tasks include remote ground operations and control, built-in system fault management and control, and intelligent robotics. To ensure long-term evolution and integration of new validated technologies over the lifetime of the vehicle, system interfaces must also be addressed and integrated into the overall system interface requirements. An approach for defining and evaluating the system infrastructures including the testbed currently being used to support the on-going evaluations for the evolutionary Space Station Freedom Data Management System is presented and discussed. Intelligent system technologies discussed include artificial intelligence (real-time replanning and scheduling), high performance computational elements (parallel processors, photonic processors, and neural networks), real-time fault management and control, and system software development tools for rapid prototyping capabilities.

  10. Scheduling lessons learned from the Autonomous Power System

    NASA Technical Reports Server (NTRS)

    Ringer, Mark J.

    1992-01-01

    The Autonomous Power System (APS) project at NASA LeRC is designed to demonstrate the applications of integrated intelligent diagnosis, control, and scheduling techniques to space power distribution systems. The project consists of three elements: the Autonomous Power Expert System (APEX) for Fault Diagnosis, Isolation, and Recovery (FDIR); the Autonomous Intelligent Power Scheduler (AIPS) to efficiently assign activity start times and resources; and power hardware (Brassboard) to emulate a space-based power system. The AIPS scheduler was tested within the APS system. This scheduler is able to efficiently assign available power to the requesting activities and share this information with other software agents within the APS system in order to implement the generated schedule. The AIPS scheduler is also able to cooperatively recover from fault situations by rescheduling the affected loads on the Brassboard in conjunction with the APEX FDIR system. AIPS served as a learning tool and an initial scheduling testbed for the integration of FDIR and automated scheduling systems. Many lessons were learned from the AIPS scheduler and are now being integrated into a new scheduler called SCRAP (Scheduler for Continuous Resource Allocation and Planning). This paper serves three purposes: an overview of the AIPS implementation, lessons learned from the AIPS scheduler, and a brief section on how these lessons are being applied to the new SCRAP scheduler.
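    The core AIPS task, placing each activity at a start time where spare power covers its demand and deferring what cannot fit, can be illustrated with a toy greedy placement. The activities, power profile, and largest-demand-first heuristic below are illustrative assumptions, not the AIPS algorithm itself.

```python
horizon = 12
capacity = [10] * horizon                     # available power per time slot (kW)
activities = [("heater", 4, 3), ("pump", 6, 2), ("exp1", 5, 4)]  # (name, kW, duration)

schedule = {}
for name, power, dur in sorted(activities, key=lambda a: -a[1]):
    for start in range(horizon - dur + 1):
        window = range(start, start + dur)
        if all(capacity[t] >= power for t in window):
            for t in window:
                capacity[t] -= power          # reserve the power
            schedule[name] = start
            break
    else:
        schedule[name] = None                 # unschedulable: defer or reschedule

print(schedule)   # e.g. {'pump': 0, 'exp1': 2, 'heater': 0}
```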

  11. Large Scale Screening of Low Cost Ferritic Steel Designs For Advanced Ultra Supercritical Boiler Using First Principles Methods

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Ouyang, Lizhi

    Advanced Ultra Supercritical Boiler (AUSC) requires materials that can operate in a corrosive environment at temperatures and pressures as high as 760°C (1400°F) and 5000 psi, respectively, while at the same time maintaining good ductility at low temperature. We developed automated simulation software tools to enable fast, large-scale screening studies of candidate designs. While direct evaluation of creep rupture strength and ductility is currently not feasible, properties such as energy, elastic constants, surface energy, interface energy, and stacking fault energy can be used to assess relative ductility and creep strength. We implemented software to automate the complex calculations and to minimize human input in the tedious screening studies, which involve model structure generation, settings for first-principles calculations, results analysis, and reporting. The software developed in the project, and the library of computed mechanical properties of phases found in ferritic steels (many of them complex solid solutions estimated for the first time), will help the development of low-cost ferritic steel for AUSC.

  12. When probabilistic seismic hazard climbs volcanoes: the Mt. Etna case, Italy - Part 1: Model components for sources parameterization

    NASA Astrophysics Data System (ADS)

    Azzaro, Raffaele; Barberi, Graziella; D'Amico, Salvatore; Pace, Bruno; Peruzza, Laura; Tuvè, Tiziana

    2017-11-01

    The volcanic region of Mt. Etna (Sicily, Italy) represents a perfect lab for testing innovative approaches to seismic hazard assessment. This is largely due to the long record of historical and recent observations of seismic and tectonic phenomena, the high quality of various geophysical monitoring networks and, particularly, the rapid geodynamics that clearly express several seismotectonic processes. We present here the model components and the procedures adopted for defining seismic sources to be used in a new generation of probabilistic seismic hazard assessment (PSHA), the first results and maps of which are presented in a companion paper, Peruzza et al. (2017). The sources include, with increasing complexity, seismic zones, individual faults and gridded point sources that are obtained by integrating geological field data with long and short earthquake datasets (the historical macroseismic catalogue, which covers about 3 centuries, and a high-quality instrumental location database for the last decades). The analysis of the frequency-magnitude distribution identifies two main fault systems within the volcanic complex featuring different seismic rates that are controlled essentially by volcano-tectonic processes. We discuss the variability of the mean occurrence times of major earthquakes along the main Etnean faults by using an historical approach and a purely geologic method. We derive a magnitude-size scaling relationship specifically for this volcanic area, which has been implemented into a recently developed software tool - FiSH (Pace et al., 2016) - that we use to calculate the characteristic magnitudes and the related mean recurrence times expected for each fault. Results suggest that for the Mt. Etna area, the traditional assumptions of uniform and Poissonian seismicity can be relaxed; a time-dependent fault-based modeling, joined with a 3-D imaging of volcano-tectonic sources depicted by the recent instrumental seismicity, can therefore be implemented in PSHA maps. These maps can be relevant for the retrofitting of the existing building stock and for driving risk reduction interventions. The analyses do not account for regional M > 6 seismogenic sources, which dominate the hazard over long return times (≥ 500 years).
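    For readers unfamiliar with the tapered Gutenberg-Richter (TGR) distribution used in such models, the sketch below computes annual exceedance rates in the moment-space form of Kagan (2002), where the corner magnitude applies an exponential taper to the power-law moment distribution. The zone activity rate, b-value, and corner magnitude in the example are illustrative numbers, not Etna values.

```python
import numpy as np

def moment(mw):
    """Seismic moment in N*m from moment magnitude (Hanks & Kanamori)."""
    return 10.0 ** (1.5 * mw + 9.05)

def tgr_annual_rate(mw, rate_mt, mw_t, b, mw_corner):
    """Annual rate of events with magnitude >= mw under a tapered G-R law.

    rate_mt: annual rate at or above the threshold magnitude mw_t.
    b: Gutenberg-Richter b-value; beta = 2b/3 in moment space.
    mw_corner: corner magnitude controlling the exponential taper.
    """
    beta = 2.0 * b / 3.0
    m, mt, mc = moment(mw), moment(mw_t), moment(mw_corner)
    return rate_mt * (mt / m) ** beta * np.exp((mt - m) / mc)

# Example zone: 0.2 events/yr above Mw 4, b = 1.0, corner magnitude 6.5
for mw in (4.5, 5.0, 5.5, 6.0, 6.5):
    print(f"Mw >= {mw}: {tgr_annual_rate(mw, 0.2, 4.0, 1.0, 6.5):.5f} /yr")
```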

  13. Integrated Environment for Development and Assurance

    DTIC Science & Technology

    2015-01-26

    Jan 26, 2015. © 2015 Carnegie Mellon University. We rely on software for safe aircraft operation; embedded software systems introduce a new class of... developer, compute platform, runtime architecture, application software, embedded SW system engineer, data stream characteristics; latency jitter affects... Why do system-level failures still occur despite fault tolerance techniques being deployed in systems? Embedded software system as major source of

  14. Machine-checked proofs of the design and implementation of a fault-tolerant circuit

    NASA Technical Reports Server (NTRS)

    Bevier, William R.; Young, William D.

    1990-01-01

    A formally verified implementation of the 'oral messages' algorithm of Pease, Shostak, and Lamport is described. An abstract implementation of the algorithm is verified to achieve interactive consistency in the presence of faults. This abstract characterization is then mapped down to a hardware-level implementation which inherits the fault-tolerant characteristics of the abstract version. All steps in the proof were checked with the Boyer-Moore theorem prover. A significant result is the demonstration of a fault-tolerant device that is formally specified and whose implementation is proved correct with respect to this specification. A significant simplifying assumption is that the redundant processors behave synchronously. A mechanically checked proof that the oral messages algorithm is 'optimal', in the sense that no algorithm which achieves agreement via similar message passing can tolerate a larger proportion of faulty processors, is also described.
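    For orientation, here is a compact, unverified Python rendering of the recursive Oral Messages algorithm OM(m). The process names, the traitor model (a lying commander sending alternating bits), and the majority tie-break are illustrative assumptions, quite unlike the machine-checked Boyer-Moore formalization the paper describes.

```python
def om(commander, lieutenants, value, m, traitor):
    """Return each lieutenant's decided value after OM(m); `traitor` is a predicate."""
    # A traitorous commander sends arbitrary values (here: alternating bits).
    sent = {lt: (1 - value if traitor(commander) and i % 2 else value)
            for i, lt in enumerate(lieutenants)}
    if m == 0:
        return sent                      # base case: use the value received directly
    decided = {}
    for lt in lieutenants:
        votes = [sent[lt]]
        for p in lieutenants:
            if p == lt:
                continue
            sub = [q for q in lieutenants if q != p]
            # p relays its received value to the others via OM(m-1)
            votes.append(om(p, sub, sent[p], m - 1, traitor)[lt])
        decided[lt] = max(set(votes), key=votes.count)  # majority (arbitrary tie-break)
    return decided

# n = 4, m = 1, traitorous commander: all loyal lieutenants still agree.
print(om("C", ["L1", "L2", "L3"], 0, 1, traitor=lambda p: p == "C"))
```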

  15. Tsunami Amplitude Estimation from Real-Time GNSS.

    NASA Astrophysics Data System (ADS)

    Jeffries, C.; MacInnes, B. T.; Melbourne, T. I.

    2017-12-01

    Tsunami early warning systems currently comprise modeling of observations from the global seismic network, deep-ocean DART buoys, and a global distribution of tide gauges. While these tools work well for tsunamis traveling teleseismic distances, saturation of seismic magnitude estimation in the near field can result in significant underestimation of tsunami excitation for local warning. Moreover, DART buoy and tide gauge observations cannot be used to rectify the underestimation in the available time, typically 10-20 minutes, before local runup occurs. Real-time GNSS measurements of coseismic offsets may be used to estimate finite faulting within 1-2 minutes and, in turn, tsunami excitation for local warning purposes. We describe here a tsunami amplitude estimation algorithm, implemented for the Cascadia subduction zone, that uses continuous GNSS position streams to estimate finite faulting. The system is based on a time-domain convolution of fault slip that uses a pre-computed catalog of hydrodynamic Green's functions generated with the GeoClaw shallow-water wave simulation software, maps seismic slip along each section of the fault to points located off the Cascadia coast in 20 m of water depth, and relies on the principle of linearity in tsunami wave propagation. The system draws continuous slip estimates from a message broker and convolves the slip with the appropriate Green's functions, which are then superimposed to produce wave amplitude at each coastal location. The maximum amplitude and its arrival time are then passed into a database for subsequent monitoring and display. We plan on testing this system using a suite of synthetic earthquakes calculated for Cascadia whose ground motions are simulated at 500 existing Cascadia GPS sites, as well as real earthquakes for which we have continuous GNSS time series and surveyed runup heights, including Maule, Chile 2010 and Tohoku, Japan 2011. This system has been implemented in the CWU Geodesy Lab for the Cascadia subduction zone but will be expanded to the circum-Pacific as real-time processing of international GNSS data streams becomes available.
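    The superposition step is essentially a sum of convolutions, as the sketch below illustrates for one coastal point. The sampling, slip-rate pulses, and Gaussian stand-ins for the GeoClaw Green's functions are all illustrative assumptions.

```python
import numpy as np

dt = 1.0                        # sample interval, seconds
t = np.arange(0, 600, dt)       # ten minutes of signal

n_sections = 3
slip_rate = np.zeros((n_sections, t.size))       # m/s per fault section
slip_rate[:, 60:120] = [[0.02], [0.05], [0.01]]  # a one-minute rupture pulse

# greens[k]: response at the coastal node to unit slip rate on section k
greens = [np.exp(-((t - 180 - 30 * k) / 40.0) ** 2) for k in range(n_sections)]

# linearity: convolve each section's slip with its Green's function and sum
wave = sum(np.convolve(slip_rate[k], greens[k])[: t.size] * dt
           for k in range(n_sections))

print(f"peak amplitude {wave.max():.2f} m at t = {t[wave.argmax()]:.0f} s")
```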

  16. A nodal discontinuous Galerkin approach to 3-D viscoelastic wave propagation in complex geological media

    NASA Astrophysics Data System (ADS)

    Lambrecht, L.; Lamert, A.; Friederich, W.; Möller, T.; Boxberg, M. S.

    2018-03-01

    A nodal discontinuous Galerkin (NDG) approach is developed and implemented for the computation of viscoelastic wavefields in complex geological media. The NDG approach combines unstructured tetrahedral meshes with an element-wise, high-order spatial interpolation of the wavefield based on Lagrange polynomials. Numerical fluxes are computed from an exact solution of the heterogeneous Riemann problem. Our implementation offers capabilities for modelling viscoelastic wave propagation in 1-D, 2-D and 3-D settings of very different spatial scale with little logistical overhead. It allows the import of external tetrahedral meshes provided by independent meshing software and can be run in a parallel computing environment. Computation of adjoint wavefields and an interface for the computation of waveform sensitivity kernels are offered. The method is validated in 2-D and 3-D by comparison to analytical solutions and results from a spectral element method. The capabilities of the NDG method are demonstrated through a 3-D example case taken from tunnel seismics which considers high-frequency elastic wave propagation around a curved underground tunnel cutting through inclined and faulted sedimentary strata. The NDG method was coded into the open-source software package NEXD and is available from GitHub.

  17. Techniques and implementation of the embedded rule-based expert system using Ada

    NASA Technical Reports Server (NTRS)

    Liberman, Eugene M.; Jones, Robert E.

    1991-01-01

    Ada is becoming an increasingly popular programming language for large Government-funded software projects. Ada, with its portability, transportability, and maintainability, lends itself well to today's complex programming environment. In addition, expert systems have assumed a growing role in providing human-like reasoning capability and expertise for computer systems. The integration of expert system technology with the Ada programming language, specifically a rule-based expert system using an ART-Ada (Automated Reasoning Tool for Ada) system shell, is discussed. The NASA Lewis Research Center was chosen as a beta test site for ART-Ada. The test was conducted by implementing the existing Autonomous Power EXpert System (APEX), a Lisp-based power expert system, in ART-Ada. Three components, the rule-based expert system, a graphics user interface, and communications software, make up SMART-Ada (Systems fault Management with ART-Ada). The main objective, to conduct a beta test on the ART-Ada rule-based expert system shell, was achieved. The system is operational. New Ada tools will assist in future successful projects. ART-Ada is one such tool and is a viable alternative to straight Ada code when an application requires a rule-based or knowledge-based approach.
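    To give a feel for the rule-based style such shells support, the toy forward-chaining loop below fires rules until no new facts can be derived. The facts and rules are invented for illustration; ART-Ada itself is an Ada product with a far richer match-resolve-act cycle.

```python
# Working memory: facts currently believed true (illustrative names).
facts = {"bus_voltage_low", "load_shed_inactive"}

# Rules: (set of conditions, conclusion asserted when all conditions hold).
rules = [
    ({"bus_voltage_low", "load_shed_inactive"}, "suspect_overload"),
    ({"suspect_overload"}, "recommend_load_shed"),
]

changed = True
while changed:                       # fire rules until a fixed point is reached
    changed = False
    for conditions, conclusion in rules:
        if conditions <= facts and conclusion not in facts:
            facts.add(conclusion)    # assert the rule's conclusion
            changed = True

print(sorted(facts))
```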

  18. Combined Application of Shallow Seismic Reflection and High-resolution Refraction Exploration Approach to Active Fault Survey, Central Orogenic Belt, China

    NASA Astrophysics Data System (ADS)

    Lin, S.; Luo, D.; Yanlin, F.; Li, Y.

    2016-12-01

    Shallow Seismic Reflection (SSR) is a major geophysical exploration method in urban active fault exploration because of its exploration depth range and high resolution. In this paper, we carried out SSR and High-Resolution Refraction (HRR) tests in the Liangyun Basin to explore a buried fault. We used a 64-channel distributed seismic instrument, 60 Hz high-sensitivity detectors, a Geode multi-channel portable acquisition system, and a hammer source. We selected single-side hammer shots with multiple stacking, 48 receiving channels, and 12-fold coverage. As some SSR and HRR measuring lines coincide, we chose a multi-chase and encounter observation system. Based on satellite positioning, we arranged 11 survey lines in our study area with a total length of 8132 meters. GEOGIGA seismic reflection data processing software was used to process the SSR data. After repeated tests of single-shot record editing, interference wave suppression, static correction, velocity parameter extraction, and dynamic correction, we eventually obtained the shallow seismic reflection profile images. Meanwhile, we processed the HRR seismic data with refraction and tomographic imaging software based on nonlinear first-arrival traveltime tomography. Combined with drilling geological profiles, we interpreted the 11 measured seismic profiles. Results show 18 clear fault-feature breakpoints, including 4 southwest-dipping normal faults, 7 southwest-dipping reverse faults, one northeast-dipping normal fault, and 6 northeast-dipping reverse faults. Breakpoint burial depths are 15-18 meters, and the inferred fault offset is 3-12 meters. Comprehensive analysis shows that the fault is a reverse fault with a northeast-dipping section, with a few branch normal faults presenting southwest-dipping sections. Given the good correspondence between the seismic interpretation results, drilling data, and SEM results on the property, occurrence, and broken length of the fault, we consider the Liangyun fault to be an active fault with strong activity during the Neogene Pliocene, early Pleistocene, and Middle Pleistocene periods. The combined application of SSR and HRR can provide more parameters to explain the seismic results and improve the accuracy of the interpretation.

  19. Fly-By-Light/Power-By-Wire Fault-Tolerant Fiber-Optic Backplane

    NASA Technical Reports Server (NTRS)

    Malekpour, Mahyar R.

    2002-01-01

    The design and development of a fault-tolerant fiber-optic backplane to demonstrate the feasibility of such an architecture is presented. Simulation results of test cases on the backplane in the event of induced faults are presented, and the fault recovery capability of the architecture is demonstrated. The architecture was designed, developed, and implemented using the Very High Speed Integrated Circuits (VHSIC) Hardware Description Language (VHDL). The architecture was synthesized and implemented in hardware using Field Programmable Gate Arrays (FPGAs) on multiple prototype boards.

  20. TES: A modular systems approach to expert system development for real-time space applications

    NASA Technical Reports Server (NTRS)

    Cacace, Ralph; England, Brenda

    1988-01-01

    A major goal of the Space Station era is to reduce reliance on support from ground-based experts. The development of software programs using expert systems technology is one means of reaching this goal without requiring crew members to become intimately familiar with the many complex spacecraft subsystems. Development of an expert systems program requires a validation of the software with actual flight hardware. By combining accurate hardware and software modelling techniques with a modular systems approach to expert systems development, the validation of these software programs can be successfully completed with minimum risk and effort. The TIMES Expert System (TES) is an application that monitors and evaluates real-time data to perform fault detection and fault isolation tasks as they would otherwise be carried out by a knowledgeable designer. The development process and primary features of TES, a modular systems approach, and the lessons learned are discussed.

  1. An Assessment of Integrated Health Management (IHM) Frameworks

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    N. Lybeck; M. Tawfik; L. Bond

    In order to meet the ever-increasing demand for energy, the United States nuclear industry is turning to life extension of existing nuclear power plants (NPPs). Economically ensuring the safe, secure, and reliable operation of aging nuclear power plants presents many challenges. The 2009 Light Water Reactor Sustainability Workshop identified online monitoring of active and structural components as essential to the better understanding and management of the challenges posed by aging nuclear power plants. Additionally, there is increasing adoption of condition-based maintenance (CBM) for active components in NPPs. These techniques provide a foundation upon which a variety of advanced online surveillance, diagnostic, and prognostic techniques can be deployed to continuously monitor and assess the health of NPP systems and components. The next step in the development of advanced online monitoring is to move beyond CBM to estimating the remaining useful life of active components using prognostic tools. Deployment of prognostic health management (PHM) on the scale of a NPP requires the use of an integrated health management (IHM) framework - a software product (or suite of products) used to manage the necessary elements needed for a complete implementation of online monitoring and prognostics. This paper provides a thoughtful look at the desirable functions and features of IHM architectures. A full PHM system involves several modules, including data acquisition, system modeling, fault detection, fault diagnostics, system prognostics, and advisory generation (operations and maintenance planning). The standards applicable to PHM applications are identified and summarized. A list of evaluation criteria for PHM software products, developed to ensure scalability of the toolset to an environment with the complexity of a NPP, is presented. Fourteen commercially available PHM software products are identified and classified into four groups: research tools, PHM system development tools, deployable architectures, and peripheral tools.

  2. Fault Detection, Isolation and Recovery (FDIR) Portable Liquid Oxygen Hardware Demonstrator

    NASA Technical Reports Server (NTRS)

    Oostdyk, Rebecca L.; Perotti, Jose M.

    2011-01-01

    The Fault Detection, Isolation and Recovery (FDIR) hardware demonstration will highlight the effort being conducted by Constellation's Ground Operations (GO) to provide the Launch Control System (LCS) with system-level health management during vehicle processing and countdown activities. A proof-of-concept demonstration of the FDIR prototype established the capability of the software to provide real-time fault detection and isolation using generated Liquid Hydrogen data. The FDIR portable testbed unit (presented here) aims to enhance FDIR by providing a dynamic simulation of Constellation subsystems that feeds the FDIR software live data based on Liquid Oxygen system properties. The LO2 cryogenic ground system has key properties that are analogous to the properties of an electronic circuit. The LO2 system is modeled using electrical components, and an equivalent circuit is designed on a printed circuit board to simulate the live data. The portable testbed is also equipped with data acquisition and communication hardware to relay the measurements to the FDIR application running on a PC. This portable testbed is an ideal capability for FDIR software testing, troubleshooting, and training, among other tasks.
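    The electrical analogy the testbed exploits (pressure as voltage, mass flow as current, line losses as resistance, tank storage as capacitance) can be sketched in a few lines, as below; the values are illustrative, not Constellation data.

```python
# First-order hydraulic-electrical analog: charging a "tank capacitor"
# through a "line resistor" from a constant supply pressure.
R, C = 2.0, 5.0                  # flow resistance, tank capacitance (analog units)
dt, steps = 0.01, 1000
p_tank, p_supply = 0.0, 100.0    # pressures, i.e. voltages in the analog circuit

history = []
for _ in range(steps):
    flow = (p_supply - p_tank) / R   # Ohm's law analog: flow driven by pressure drop
    p_tank += flow / C * dt          # capacitor charging: tank pressure rises
    history.append(p_tank)

print(f"tank pressure after {steps * dt:.0f} s: {history[-1]:.1f}")
```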

  3. Fault-tolerant wait-free shared objects

    NASA Technical Reports Server (NTRS)

    Jayanti, Prasad; Chandra, Tushar D.; Toueg, Sam

    1992-01-01

    A concurrent system consists of processes communicating via shared objects, such as shared variables, queues, etc. The concept of wait-freedom was introduced to cope with process failures: each process that accesses a wait-free object is guaranteed to get a response even if all the other processes crash. However, if a wait-free object 'crashes,' all the processes that access that object are prevented from making progress. In this paper, we introduce the concept of fault-tolerant wait-free objects, and study the problem of implementing them. We give a universal method to construct fault-tolerant wait-free objects, for all types of 'responsive' failures (including one in which faulty objects may 'lie'). In sharp contrast, we prove that many common and interesting types (such as queues, sets, and test&set) have no fault-tolerant wait-free implementations even under the most benign of the 'non-responsive' types of failure. We also introduce several concepts and techniques that are central to the design of fault-tolerant concurrent systems: the concepts of self-implementation and graceful degradation, and techniques to automatically increase the fault-tolerance of implementations. We prove matching lower bounds on the resource complexity of most of our algorithms.

  4. Onboard Nonlinear Engine Sensor and Component Fault Diagnosis and Isolation Scheme

    NASA Technical Reports Server (NTRS)

    Tang, Liang; DeCastro, Jonathan A.; Zhang, Xiaodong

    2011-01-01

    A method detects and isolates in-flight sensor, actuator, and component faults for advanced propulsion systems. In sharp contrast to many conventional methods, which deal with either sensor faults or component faults, but not both, this method considers sensor faults, actuator faults, and component faults under one systematic and unified framework. The proposed solution consists of two main components: a bank of real-time, nonlinear adaptive fault diagnostic estimators for residual generation, and a residual evaluation module that includes adaptive thresholds and a Transferable Belief Model (TBM)-based residual evaluation scheme. By employing a nonlinear adaptive learning architecture, the developed approach is capable of directly dealing with nonlinear engine models and nonlinear faults without the need for linearization. Software modules have been developed and evaluated with the NASA C-MAPSS engine model. Several typical engine-fault modes, including a subset of sensor/actuator/component faults, were tested with a mild transient operation scenario. The simulation results demonstrated that the algorithm was able to successfully detect and isolate all simulated faults as long as the fault magnitudes were larger than the minimum detectable/isolable sizes, and no misdiagnosis occurred.
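    A minimal residual-plus-adaptive-threshold detector, the pattern at the heart of this approach, can be sketched in a few lines. The plant model, noise levels, injected bias fault, and threshold law below are illustrative assumptions, not the C-MAPSS setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
u = np.sin(np.linspace(0, 10, n))           # commanded input
y_true = 0.8 * u                            # nominal plant response
y_meas = y_true + 0.02 * rng.standard_normal(n)
y_meas[300:] += 0.3                         # inject a sensor bias fault at k = 300

residual = np.abs(y_meas - 0.8 * u)         # model-based residual
threshold = 0.08 + 0.05 * np.abs(u)         # adaptive: grows with excitation

alarms = np.flatnonzero(residual > threshold)
print(f"first alarm at sample {alarms[0]}" if alarms.size else "no alarm")
```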

  5. A New Seismic Hazard Model for Mainland China

    NASA Astrophysics Data System (ADS)

    Rong, Y.; Xu, X.; Chen, G.; Cheng, J.; Magistrale, H.; Shen, Z. K.

    2017-12-01

    We are developing a new seismic hazard model for Mainland China by integrating historical earthquake catalogs, geological faults, geodetic GPS data, and geology maps. To build the model, we construct an Mw-based homogeneous historical earthquake catalog spanning from 780 B.C. to present, create fault models from active fault data, and derive a strain rate model based on the most complete GPS measurements and a new strain derivation algorithm. We divide China and the surrounding regions into about 20 large seismic source zones. For each zone, a tapered Gutenberg-Richter (TGR) magnitude-frequency distribution is used to model the seismic activity rates. The a- and b-values of the TGR distribution are calculated using observed earthquake data, while the corner magnitude is constrained independently using the seismic moment rate inferred from the geodetically-based strain rate model. Small and medium sized earthquakes are distributed within the source zones following the location and magnitude patterns of historical earthquakes. Some of the larger earthquakes are distributed onto active faults, based on their geological characteristics such as slip rate, fault length, down-dip width, and various paleoseismic data. The remaining larger earthquakes are then placed into the background. A new set of magnitude-rupture scaling relationships is developed based on earthquake data from China and vicinity. We evaluate and select appropriate ground motion prediction equations by comparing them with observed ground motion data and performing residual analysis. To implement the modeling workflow, we develop a tool that builds upon the functionalities of GEM's Hazard Modeler's Toolkit. The GEM OpenQuake software is used to calculate seismic hazard at various ground motion periods and various return periods. To account for site amplification, we construct a site condition map based on geology. The resulting new seismic hazard maps can be used for seismic risk analysis and management.

  6. Non-Pilot Protection of the HVDC Grid

    NASA Astrophysics Data System (ADS)

    Badrkhani Ajaei, Firouz

    This thesis develops a non-pilot protection system for the next generation power transmission system, the High-Voltage Direct Current (HVDC) grid. The HVDC grid protection system is required to be (i) adequately fast to prevent damages and/or converter blocking and (ii) reliable to minimize the impacts of faults. This study is mainly focused on the Modular Multilevel Converter (MMC) -based HVDC grid since the MMC is considered as the building block of the future HVDC systems. The studies reported in this thesis include (i) developing an enhanced equivalent model of the MMC to enable accurate representation of its DC-side fault response, (ii) developing a realistic HVDC-AC test system that includes a five-terminal MMC-based HVDC grid embedded in a large interconnected AC network, (iii) investigating the transient response of the developed test system to AC-side and DC-side disturbances in order to determine the HVDC grid protection requirements, (iv) investigating the fault surge propagation in the HVDC grid to determine the impacts of the DC-side fault location on the measured signals at each relay location, (v) designing a protection algorithm that detects and locates DC-side faults reliably and sufficiently fast to prevent relay malfunction and unnecessary blocking of the converters, and (vi) performing hardware-in-the-loop tests on the designed relay to verify its potential to be implemented in hardware. The results of the off-line time domain transients studies in the PSCAD software platform and the real-time hardware-in-the-loop tests using an enhanced version of the RTDS platform indicate that the developed HVDC grid relay meets all technical requirements including speed, dependability, security, selectivity, and robustness. Moreover, the developed protection algorithm does not impose considerable computational burden on the hardware.

  7. Object-Oriented Algorithm For Evaluation Of Fault Trees

    NASA Technical Reports Server (NTRS)

    Patterson-Hine, F. A.; Koen, B. V.

    1992-01-01

    An algorithm for direct evaluation of fault trees incorporates techniques of object-oriented programming. It reduces the number of calls needed to solve trees with repeated events and provides a significantly improved software environment for such computations as quantitative analyses of the safety and reliability of complicated systems of equipment (e.g., spacecraft or factories).
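    The abstract's two ideas, direct evaluation by message-passing among gate objects and avoiding repeated work on shared subtrees, can be sketched as follows. The gate classes and probabilities are illustrative, and the complement-product in the OR gate treats inputs as independent, which is only an approximation when basic events repeat across branches.

```python
class Node:
    def prob(self, cache):
        if id(self) not in cache:            # shared subtrees evaluated only once
            cache[id(self)] = self._prob(cache)
        return cache[id(self)]

class Basic(Node):
    def __init__(self, p): self.p = p
    def _prob(self, cache): return self.p

class And(Node):
    def __init__(self, *kids): self.kids = kids
    def _prob(self, cache):
        out = 1.0
        for k in self.kids:
            out *= k.prob(cache)             # all inputs must fail
        return out

class Or(Node):
    def __init__(self, *kids): self.kids = kids
    def _prob(self, cache):
        out = 1.0
        for k in self.kids:
            out *= 1.0 - k.prob(cache)       # independence approximation
        return 1.0 - out

pump = Basic(1e-3)
shared = And(pump, Basic(5e-4))              # subtree reused under two branches
top = Or(shared, And(shared, Basic(2e-3)))
print(f"top event probability (approx.): {top.prob({}):.2e}")
```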

  8. An experimental evaluation of software redundancy as a strategy for improving reliability

    NASA Technical Reports Server (NTRS)

    Eckhardt, Dave E., Jr.; Caglayan, Alper K.; Knight, John C.; Lee, Larry D.; Mcallister, David F.; Vouk, Mladen A.; Kelly, John P. J.

    1990-01-01

    The strategy of using multiple versions of independently developed software as a means to tolerate residual software design faults is suggested by the success of hardware redundancy for tolerating hardware failures. Although, as generally accepted, the independence of hardware failures resulting from physical wearout can lead to substantial increases in reliability for redundant hardware structures, a similar conclusion is not immediate for software. The degree to which design faults are manifested as independent failures determines the effectiveness of redundancy as a method for improving software reliability. Interest in multi-version software centers on whether it provides an adequate measure of increased reliability to warrant its use in critical applications. The effectiveness of multi-version software is studied by comparing estimates of the failure probabilities of these systems with the failure probabilities of single versions. The estimates are obtained under a model of dependent failures and compared with estimates obtained when failures are assumed to be independent. The experimental results are based on twenty versions of an aerospace application developed and certified by sixty programmers from four universities. Descriptions of the application, development and certification processes, and operational evaluation are given together with an analysis of the twenty versions.
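    A back-of-the-envelope calculation shows what is at stake in the independence question the experiment addresses. The per-version failure probability and the coincident-failure fraction below are illustrative numbers, not the experiment's estimates.

```python
p = 1e-3                      # per-demand failure probability of one version

# Independent failures: a 3-version majority voter fails only when
# at least 2 of 3 versions fail on the same input.
p_ind = 3 * p**2 * (1 - p) + p**3
print(f"independent model: {p_ind:.2e}")    # ~3e-6: a large reliability gain

# Dependent model: a fraction q of failures strike >= 2 versions at once
# (e.g., a common specification fault triggered by the same inputs).
q = 0.05
p_dep = q * p + (1 - q) * p_ind
print(f"5% coincident:     {p_dep:.2e}")    # the common-failure term dominates
```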

  9. Agile deployment and code coverage testing metrics of the boot software on-board Solar Orbiter's Energetic Particle Detector

    NASA Astrophysics Data System (ADS)

    Parra, Pablo; da Silva, Antonio; Polo, Óscar R.; Sánchez, Sebastián

    2018-02-01

    In this day and age, successful embedded critical software needs agile and continuous development and testing procedures. This paper presents the overall testing and code coverage metrics obtained during the unit testing procedure carried out to verify the correctness of the boot software that will run in the Instrument Control Unit (ICU) of the Energetic Particle Detector (EPD) on-board Solar Orbiter. The ICU boot software is a critical part of the project, so its verification should be addressed at an early development stage; any test case missed in this process may affect the quality of the overall on-board software. According to the European Cooperation for Space Standardization (ECSS) standards, testing this kind of critical software must cover 100% of the source code statements and decision paths. This leads to the complete testing of the fault tolerance and recovery mechanisms that have to resolve every possible memory corruption or communication error brought about by the space environment. The introduced procedure enables fault injection from the beginning of the development process and makes it possible to fulfill the exacting code coverage demands on the boot software.
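    The flavor of coverage-driven fault injection on a boot loader can be conveyed by a self-contained test pair like the one below. `load_image`, its CRC check, and the golden-image fallback are hypothetical stand-ins, not the EPD ICU code; the point is that the injected bit flip forces the recovery branch that nominal inputs never reach.

```python
import zlib

def load_image(blob: bytes):
    """Return the payload if its trailing CRC32 matches, else fall back."""
    payload, crc = blob[:-4], int.from_bytes(blob[-4:], "big")
    if zlib.crc32(payload) != crc:
        return b"GOLDEN"        # recovery path: boot the golden image
    return payload

def make_blob(payload: bytes) -> bytes:
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def test_bitflip_triggers_recovery():
    blob = bytearray(make_blob(b"PRIMARY"))
    blob[2] ^= 0x40             # inject a single-event-upset style bit flip
    assert load_image(bytes(blob)) == b"GOLDEN"

def test_clean_image_boots_primary():
    assert load_image(make_blob(b"PRIMARY")) == b"PRIMARY"

test_bitflip_triggers_recovery(); test_clean_image_boots_primary()
print("both decision paths of the CRC check covered")
```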

  10. Hardware fault insertion and instrumentation system: Mechanization and validation

    NASA Technical Reports Server (NTRS)

    Benson, J. W.

    1987-01-01

    Automated test capability for extensive low-level hardware fault insertion testing is developed. The test capability is used to calibrate fault detection coverage and associated latency times as relevant to projecting overall system reliability. Described are modifications made to the NASA Ames Reconfigurable Flight Control System (RDFCS) Facility to fully automate the total test loop involving the Draper Laboratories' Fault Injector Unit. The automated capability provided included the application of sequences of simulated low-level hardware faults, the precise measurement of fault latency times, the identification of fault symptoms, and bulk storage of test case results. A PDP-11/60 served as a test coordinator, and a PDP-11/04 as an instrumentation device. The fault injector was controlled by applications test software in the PDP-11/60, rather than by manual commands from a terminal keyboard. The time base was especially developed for this application to use a variety of signal sources in the system simulator.

  11. On Why It Is Impossible to Prove that the BDX90 Dispatcher Implements a Time-sharing System

    NASA Technical Reports Server (NTRS)

    Boyer, R. S.; Moore, J. S.

    1983-01-01

    The Software Implemented Fault Tolerance (SIFT) system is written in PASCAL except for about a page of machine code. The SIFT system implements a small time-sharing system in which PASCAL programs for separate application tasks are executed according to a schedule with real-time constraints. The PASCAL language has no provision for handling the notion of an interrupt, such as the BDX930 clock interrupt. The PASCAL language also lacks the notion of running a PASCAL subroutine for a given amount of time, suspending it, saving away the suspension, and later activating the suspension. Machine code was used to overcome these inadequacies of PASCAL. Code which handles clock interrupts and suspends processes is called a dispatcher. The time-sharing/virtual-machine idea is completely destroyed by the reconfiguration task. After termination of the reconfiguration task, the tasks run by the dispatcher have no relation to those run before reconfiguration. It is impossible to view the dispatcher as a time-sharing system implementing virtual BDX930s running concurrently when one process can wipe out the others.

  12. RAMP: A fault tolerant distributed microcomputer structure for aircraft navigation and control

    NASA Technical Reports Server (NTRS)

    Dunn, W. R.

    1980-01-01

    RAMP consists of distributed sets of parallel computers partitioned on the basis of software and packaging constraints. To minimize hardware and software complexity, the processors operate asynchronously. It was shown that through the design of asymptotically stable control laws, data errors due to the asynchronism were minimized. It was further shown that by designing control laws with this property and making minor hardware modifications to the RAMP modules, the system became inherently tolerant to intermittent faults. A laboratory version of RAMP was constructed and is described in the paper along with the experimental results.

  13. Validation environment for AIPS/ALS: Implementation and results

    NASA Technical Reports Server (NTRS)

    Segall, Zary; Siewiorek, Daniel; Caplan, Eddie; Chung, Alan; Czeck, Edward; Vrsalovic, Dalibor

    1990-01-01

    This paper presents the work performed in porting the Fault Injection-based Automated Testing (FIAT) and Programming and Instrumentation Environments (PIE) validation tools to the Advanced Information Processing System (AIPS) in the context of the Ada Language System (ALS) application, as well as an initial fault-free validation of the available AIPS system. The PIE components implemented on AIPS provide the monitoring mechanisms required for validation. These mechanisms represent a substantial portion of the FIAT system and are required for the implementation of the FIAT environment on AIPS. Using these components, an initial fault-free validation of the AIPS system was performed. The implementation of the FIAT/PIE system, configured for fault-free validation of the AIPS fault-tolerant computer system, is described. The PIE components were modified to support the Ada language. Special-purpose AIPS/Ada runtime monitoring and data collection was implemented. A number of initial Ada programs running on the PIE/AIPS system were implemented. The instrumentation of the Ada programs was accomplished automatically inside the PIE programming environment. PIE's on-line graphical views show vividly and accurately the performance characteristics of Ada programs, the AIPS kernel, and the application's interaction with the AIPS kernel. The data collection mechanisms were written in a high-level language, Ada, and provide a high degree of flexibility for implementation under various system conditions.

  14. Development of a software safety process and a case study of its use

    NASA Technical Reports Server (NTRS)

    Knight, John C.

    1993-01-01

    The goal of this research is to continue the development of a comprehensive approach to software safety and to evaluate the approach with a case study. The case study is a major part of the project, and it involves the analysis of a specific safety-critical system from the medical equipment domain. The particular application being used was selected because of the availability of a suitable candidate system. We consider the results to be generally applicable and in no way particularly limited by the domain. The research is concentrating on issues raised by the specification and verification phases of the software lifecycle since they are central to our previously-developed rigorous definitions of software safety. The theoretical research is based on our framework of definitions for software safety. In the area of specification, the main topics being investigated are the development of techniques for building system fault trees that correctly incorporate software issues and the development of rigorous techniques for the preparation of software safety specifications. The research results are documented. Another area of theoretical investigation is the development of verification methods tailored to the characteristics of safety requirements. Verification of the correct implementation of the safety specification is central to the goal of establishing safe software. The empirical component of this research is focusing on a case study in order to provide detailed characterizations of the issues as they appear in practice, and to provide a testbed for the evaluation of various existing and new theoretical results, tools, and techniques. The Magnetic Stereotaxis System is summarized.

  15. Finite Element Simulations of Kaikoura, NZ Earthquake using DInSAR and High-Resolution DSMs

    NASA Astrophysics Data System (ADS)

    Barba, M.; Willis, M. J.; Tiampo, K. F.; Glasscoe, M. T.; Clark, M. K.; Zekkos, D.; Stahl, T. A.; Massey, C. I.

    2017-12-01

    Three-dimensional displacements from the Kaikoura, NZ, earthquake in November 2016 are imaged here using Differential Interferometric Synthetic Aperture Radar (DInSAR) and high-resolution Digital Surface Model (DSM) differencing and optical pixel tracking. Full-resolution co- and post-seismic interferograms of Sentinel-1A/B images are constructed using the JPL ISCE software. The OSU SETSM software is used to produce repeat 0.5 m posting DSMs from commercial satellite imagery, which are supplemented with UAV derived DSMs over the Kaikoura fault rupture on the eastern South Island, NZ. DInSAR provides long-wavelength motions while DSM differencing and optical pixel tracking provides both horizontal and vertical near fault motions, improving the modeling of shallow rupture dynamics. JPL GeoFEST software is used to perform finite element modeling of the fault segments and slip distributions and, in turn, the associated asperity distribution. The asperity profile is then used to simulate event rupture, the spatial distribution of stress drop, and the associated stress changes. Finite element modeling of slope stability is accomplished using the ultra high-resolution UAV derived DSMs to examine the evolution of post-earthquake topography, landslide dynamics and volumes. Results include new insights into shallow dynamics of fault slip and partitioning, estimates of stress change, and improved understanding of its relationship with the associated seismicity, deformation, and triggered cascading hazards.

  16. SSME digital control design characteristics

    NASA Technical Reports Server (NTRS)

    Mitchell, W. T.; Searle, R. F.

    1985-01-01

    To protect against a latent programming error (software fault) existing in an untried branch combination that would render the space shuttle out of control in a critical flight phase, the Backup Flight System (BFS) was chartered to provide a safety alternative. The BFS is designed to operate in critical flight phases (ascent and descent) by monitoring the activities of the space shuttle flight subsystems that are under control of the primary flight software (PFS) (e.g., navigation, crew interface, propulsion), then, upon manual command by the flightcrew, to assume control of the space shuttle and deliver it to a noncritical flight condition (safe orbit or touchdown). The problems associated with the selection of the PFS/BFS system architecture, the internal BFS architecture, the fault tolerant software mechanisms, and the long term BFS utility are discussed.

  17. Advanced information processing system: Hosting of advanced guidance, navigation and control algorithms on AIPS using ASTER

    NASA Technical Reports Server (NTRS)

    Brenner, Richard; Lala, Jaynarayan H.; Nagle, Gail A.; Schor, Andrei; Turkovich, John

    1994-01-01

    This program demonstrated the integration of a number of technologies that can increase the availability and reliability of launch vehicles while lowering costs. Availability is increased with an advanced guidance algorithm that adapts trajectories in real-time. Reliability is increased with fault-tolerant computers and communication protocols. Costs are reduced by automatically generating code and documentation. This program was realized through the cooperative efforts of academia, industry, and government. The NASA-LaRC coordinated the effort, while Draper performed the integration. Georgia Institute of Technology supplied a weak Hamiltonian finite element method for optimal control problems. Martin Marietta used MATLAB to apply this method to a launch vehicle (FENOC). Draper supplied the fault-tolerant computing and software automation technology. The fault-tolerant technology includes sequential and parallel fault-tolerant processors (FTP & FTPP) and authentication protocols (AP) for communication. Fault-tolerant technology was incrementally incorporated. Development culminated with a heterogeneous network of workstations and fault-tolerant computers using AP. Draper's software automation system, ASTER, was used to specify a static guidance system based on FENOC, navigation, flight control (GN&C), models, and the interface to a user interface for mission control. ASTER generated Ada code for GN&C and C code for models. An algebraic transform engine (ATE) was developed to automatically translate MATLAB scripts into ASTER.

  18. Software safety

    NASA Technical Reports Server (NTRS)

    Leveson, Nancy

    1987-01-01

    Software safety and its relationship to other qualities are discussed. It is shown that standard reliability and fault tolerance techniques will not solve the safety problem for the present. A new attitude is required: look at what you do NOT want the software to do along with what you want it to do, and assume that things will go wrong. New procedures and changes to the entire software development process are necessary: special software safety analysis techniques are needed, and design techniques, especially eliminating complexity, can be very helpful.

  19. Expanding the KATE toolbox

    NASA Technical Reports Server (NTRS)

    Thomas, Stan J.

    1993-01-01

    KATE (Knowledge-based Autonomous Test Engineer) is a model-based software system developed in the Artificial Intelligence Laboratory at the Kennedy Space Center for monitoring, fault detection, and control of launch vehicles and ground support systems. In order to bring KATE to the level of performance, functionality, and integratability needed for firing room applications, efforts are underway to implement KATE in the C++ programming language using an X-windows interface. Two programs which were designed and added to the collection of tools that comprise the KATE toolbox are described. The first tool, called the schematic viewer, gives the KATE user the capability to view digitized schematic drawings in the KATE environment. The second tool, called the model editor, gives the KATE model builder a tool for creating and editing knowledge base files. Design and implementation issues having to do with these two tools are discussed; the discussion will be useful to anyone maintaining or extending either the schematic viewer or the model editor.

  20. A Genetic Representation for Evolutionary Fault Recovery in Virtex FPGAs

    NASA Technical Reports Server (NTRS)

    Lohn, Jason; Larchev, Greg; DeMara, Ronald; Korsmeyer, David (Technical Monitor)

    2003-01-01

    Most evolutionary approaches to fault recovery in FPGAs focus on evolving alternative logic configurations as opposed to evolving the intra-cell routing. Since the majority of transistors in a typical FPGA are dedicated to interconnect, nearly 80% according to one estimate, evolutionary fault-recovery systems should benefit by accommodating routing. In this paper, we propose an evolutionary fault-recovery system employing a genetic representation that takes into account both logic and routing configurations. Experiments were run using a software model of the Xilinx Virtex FPGA. We report that using four Virtex combinational logic blocks, we were able to evolve a 100% accurate quadrature decoder finite state machine in the presence of a stuck-at-zero fault.

  1. Reinforcement and Drill by Microcomputer.

    ERIC Educational Resources Information Center

    Balajthy, Ernest

    1984-01-01

    Points out why drill work has a role in the language arts classroom, explores the possibilities of using a microcomputer to give children drill work, and discusses the characteristics of a good software program, along with faults found in many software programs. (FL)

  2. Design and reliability analysis of DP-3 dynamic positioning control architecture

    NASA Astrophysics Data System (ADS)

    Wang, Fang; Wan, Lei; Jiang, Da-Peng; Xu, Yu-Ru

    2011-12-01

    As the exploration and exploitation of oil and gas proliferate in deepwater areas, the requirements on the reliability of dynamic positioning systems become increasingly stringent. The control objective of ensuring safe operation in deep water cannot be met by a single dynamic positioning controller. In order to increase the availability and reliability of the dynamic positioning control system, triple-redundant hardware and software control architectures were designed and developed according to the safety specifications of the DP-3 classification notation for dynamically positioned ships and rigs. The hardware redundant configuration takes the form of a triple-redundant hot-standby configuration comprising three identical operator stations and three real-time control computers connected through dual networks. The motion control and redundancy management functions of the control computers were implemented in software on the real-time operating system VxWorks. The software realization of loose task synchronization, majority voting, and fault detection is presented in detail. A hierarchical software architecture was planned during development, consisting of an application layer, a real-time layer, and a physical layer. The behavior of the DP-3 dynamic positioning control system was modeled with a Markov model to analyze its reliability, and the effects of parameter variations on the reliability measures were investigated. A time-domain dynamic simulation was carried out for a deepwater drilling rig to prove the feasibility of the proposed control architecture.
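
    The majority-voting and fault-detection step described above can be sketched compactly. The following is a minimal illustration, assuming three redundant controller outputs and a pairwise agreement tolerance; the function, tolerance, and channel model are illustrative, not the paper's VxWorks implementation.

    ```python
    # Minimal sketch of triple-redundant majority voting with fault detection,
    # in the spirit of the DP-3 architecture described above. Names and the
    # agreement tolerance are illustrative assumptions.

    def vote(outputs, tol=1e-3):
        """Majority-vote three redundant controller outputs.

        Returns (voted_value, suspected_faulty_channels); two outputs agree
        when they differ by less than `tol`.
        """
        assert len(outputs) == 3
        def agree(a, b):
            return abs(a - b) < tol
        for i, j in ((0, 1), (0, 2), (1, 2)):
            if agree(outputs[i], outputs[j]):
                k = 3 - i - j                      # index of the remaining channel
                faulty = [] if agree(outputs[i], outputs[k]) else [k]
                return (outputs[i] + outputs[j]) / 2.0, faulty
        return None, [0, 1, 2]                     # no majority: flag all channels

    # Example: channel 2 has drifted and is outvoted.
    value, faulty = vote([10.01, 10.02, 13.7], tol=0.1)
    print(value, faulty)   # -> 10.015 [2]
    ```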

  3. Control algorithm implementation for a redundant degree of freedom manipulator

    NASA Technical Reports Server (NTRS)

    Cohan, Steve

    1991-01-01

    This project's purpose is to develop and implement control algorithms for a kinematically redundant robotic manipulator. The manipulator is being developed concurrently by Odetics Inc., under internal research and development funding. This SBIR contract supports algorithm conception, development, and simulation, as well as software implementation and integration with the manipulator hardware. The Odetics Dexterous Manipulator is a lightweight, high strength, modular manipulator being developed for space and commercial applications. It has seven fully active degrees of freedom, is electrically powered, and is fully operational in 1 G. The manipulator consists of five self-contained modules. These modules join via simple quick-disconnect couplings and self-mating connectors which allow rapid assembly/disassembly for reconfiguration, transport, or servicing. Each joint incorporates a unique drive train design which provides zero backlash operation, is insensitive to wear, and is single fault tolerant to motor or servo amplifier failure. The sensing system is also designed to be single fault tolerant. Although the initial prototype is not space qualified, the design is well-suited to meeting space qualification requirements. The control algorithm design approach is to develop a hierarchical system with well defined access and interfaces at each level. The high level endpoint/configuration control algorithm transforms manipulator endpoint position/orientation commands to joint angle commands, providing task space motion. At the same time, the kinematic redundancy is resolved by controlling the configuration (pose) of the manipulator, using several different optimizing criteria. The center level of the hierarchy servos the joints to their commanded trajectories using both linear feedback and model-based nonlinear control techniques. The lowest control level uses sensed joint torque to close torque servo loops, with the goal of improving the manipulator dynamic behavior. The control algorithms are subjected to a dynamic simulation before implementation.

  4. An evaluation of a real-time fault diagnosis expert system for aircraft applications

    NASA Technical Reports Server (NTRS)

    Schutte, Paul C.; Abbott, Kathy H.; Palmer, Michael T.; Ricks, Wendell R.

    1987-01-01

    A fault monitoring and diagnosis expert system called Faultfinder was conceived and developed to detect and diagnose in-flight failures in an aircraft. Faultfinder is an automated intelligent aid whose purpose is to assist the flight crew in fault monitoring, fault diagnosis, and recovery planning. The present implementation of this concept performs monitoring and diagnosis for a generic aircraft's propulsion and hydraulic subsystems. This implementation is capable of detecting and diagnosing failures of known and unknown (i.e., unforeseeable) type in a real-time environment. Faultfinder uses both rule-based and model-based reasoning strategies which operate on causal, temporal, and qualitative information. A preliminary evaluation is made of the diagnostic concepts implemented in Faultfinder. The evaluation used actual aircraft accident and incident cases which were simulated to assess the effectiveness of Faultfinder in detecting and diagnosing failures. Results of this evaluation, together with the description of the current Faultfinder implementation, are presented.

  5. AGSM Functional Fault Models for Fault Isolation Project

    NASA Technical Reports Server (NTRS)

    Harp, Janicce Leshay

    2014-01-01

    This project implements functional fault models to automate the isolation of failures during ground systems operations. FFMs will also be used to recommend sensor placement to improve fault isolation capabilities. The project enables the delivery of system health advisories to ground system operators.

  6. Three-dimensional geologic map of the Hayward fault, northern California: Correlation of rock units with variations in seismicity, creep rate, and fault dip

    USGS Publications Warehouse

    Graymer, R.W.; Ponce, D.A.; Jachens, R.C.; Simpson, R.W.; Phelps, G.A.; Wentworth, C.M.

    2005-01-01

    In order to better understand mechanisms of active faults, we studied relationships between fault behavior and rock units along the Hayward fault using a three-dimensional geologic map. The three-dimensional map, constructed from hypocenters, potential field data, and surface map data, provided a geologic map of each fault surface, showing rock units on either side of the fault truncated by the fault. The two fault-surface maps were superimposed to create a rock-rock juxtaposition map. The three maps were compared with seismicity, including aseismic patches, surface creep, and fault dip along the fault, by using visualization software to explore three-dimensional relationships. Fault behavior appears to be correlated to the fault-surface maps, but not to the rock-rock juxtaposition map, suggesting that properties of individual wall-rock units, including rock strength, play an important role in fault behavior. Although preliminary, these results suggest that any attempt to understand the detailed distribution of earthquakes or creep along a fault should include consideration of the rock types that abut the fault surface, including the incorporation of observations of physical properties of the rock bodies that intersect the fault at depth. © 2005 Geological Society of America.

  7. Technical Reference Suite Addressing Challenges of Providing Assurance for Fault Management Architectural Design

    NASA Technical Reports Server (NTRS)

    Fitz, Rhonda; Whitman, Gerek

    2016-01-01

    Research into complexities of software systems Fault Management (FM) and how architectural design decisions affect safety, preservation of assets, and maintenance of desired system functionality has coalesced into a technical reference (TR) suite that advances the provision of safety and mission assurance. The NASA Independent Verification and Validation (IV&V) Program, with Software Assurance Research Program support, extracted FM architectures across the IV&V portfolio to evaluate robustness, assess visibility for validation and test, and define software assurance methods applied to the architectures and designs. This investigation spanned IV&V projects with seven different primary developers, a wide range of sizes and complexities, and encompassed Deep Space Robotic, Human Spaceflight, and Earth Orbiter mission FM architectures. The initiative continues with an expansion of the TR suite to include Launch Vehicles, adding the benefit of investigating differences intrinsic to model-based FM architectures and insight into complexities of FM within an Agile software development environment, in order to improve awareness of how nontraditional processes affect FM architectural design and system health management. The identification of particular FM architectures, visibility, and associated IV&V techniques provides a TR suite that enables greater assurance that critical software systems will adequately protect against faults and respond to adverse conditions. Additionally, the role FM has with regard to strengthened security requirements, with potential to advance overall asset protection of flight software systems, is being addressed with the development of an adverse conditions database encompassing flight software vulnerabilities. Capitalizing on the established framework, this TR suite provides assurance capability for a variety of FM architectures and varied development approaches. Research results are being disseminated across NASA, other agencies, and the software community. This paper discusses the findings and TR suite informing the FM domain in best practices for FM architectural design, visibility observations, and methods employed for IV&V and mission assurance.

  8. Minimizing student’s faults in determining the design of experiment through inquiry-based learning

    NASA Astrophysics Data System (ADS)

    Nilakusmawati, D. P. E.; Susilawati, M.

    2017-10-01

    The purpose of this study was to describe the use of the inquiry method in an effort to minimize student faults in designing an experiment, and to determine the effectiveness of the inquiry method in minimizing those faults in an experimental design course. The research is participatory action research, following an action research design. The data source was fifth-semester students taking the experimental design course at the Mathematics Department, Faculty of Mathematics and Natural Sciences, Udayana University. Data were collected through tests, interviews, and observations, and the hypothesis was tested by t-test. The results showed that implementing the inquiry method to minimize student faults in designing experiments, analyzing experimental data, and interpreting them reduced faults by an average of 10.5% in Cycle 1 and by an average of 8.78% in Cycle 2. Based on the t-test results, it can be concluded that the inquiry method is effective in minimizing student faults in designing experiments, analyzing experimental data, and interpreting them. The nature of the teaching materials in the experimental design course, which demand the ability to think systematically, logically, and critically in analyzing data and interpreting test cases, makes inquiry an appropriate method. In addition, the use of learning tools, in this case the teaching materials and the student worksheets, is one of the factors that makes the inquiry method effective in minimizing student faults when designing experiments.

  9. Mark 4A antenna control system data handling architecture study

    NASA Technical Reports Server (NTRS)

    Briggs, H. C.; Eldred, D. B.

    1991-01-01

    A high-level review was conducted to provide an analysis of the existing architecture used to handle data and implement control algorithms for NASA's Deep Space Network (DSN) antennas and to make system-level recommendations for improving this architecture so that the DSN antennas can support the ever-tightening requirements of the next decade and beyond. It was found that the existing system is seriously overloaded, with processor utilization approaching 100 percent. A number of factors contribute to this overloading, including dated hardware, inefficient software, and a message-passing strategy that depends on serial connections between machines. At the same time, the system has shortcomings and idiosyncrasies that require extensive human intervention. A custom operating system kernel and an obscure programming language exacerbate the problems and should be modernized. A new architecture is presented that addresses these and other issues. Key features of the new architecture include a simplified message passing hierarchy that utilizes a high-speed local area network, redesign of particular processing function algorithms, consolidation of functions, and implementation of the architecture in modern hardware and software using mainstream computer languages and operating systems. The system would also allow incremental hardware improvements as better and faster hardware for such systems becomes available, and costs could potentially be low enough that redundancy would be provided economically. Such a system could support DSN requirements for the foreseeable future, though thorough consideration must be given to hard computational requirements, porting existing software functionality to the new system, and issues of fault tolerance and recovery.

  10. Automatically generated acceptance test: A software reliability experiment

    NASA Technical Reports Server (NTRS)

    Protzel, Peter W.

    1988-01-01

    This study presents results of a software reliability experiment investigating the feasibility of a new error detection method. The method can be used as an acceptance test and is solely based on empirical data about the behavior of internal states of a program. The experimental design uses the existing environment of a multi-version experiment previously conducted at the NASA Langley Research Center, in which the launch interceptor problem is used as a model. This allows the controlled experimental investigation of versions with well-known single and multiple faults, and the availability of an oracle permits the determination of the error detection performance of the test. Fault interaction phenomena are observed that have an amplifying effect on the number of error occurrences. Preliminary results indicate that all faults examined so far are detected by the acceptance test. This shows promise for further investigations, and for the employment of this test method on other applications.
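
    The abstract does not spell out the mechanics of the test; one plausible reading is that empirical envelopes of internal state variables are learned from known-correct executions, and excursions are flagged at acceptance-test time. A hedged sketch of that reading, with assumed variable names:

    ```python
    # Hedged sketch of an acceptance test driven purely by empirical data on
    # internal program states (one plausible reading of the method above; the
    # variable names and range-based check are assumptions).

    def learn_state_envelope(good_runs):
        """good_runs: list of dicts mapping state-variable name -> observed value.
        Returns per-variable (min, max) envelopes from known-correct executions."""
        env = {}
        for run in good_runs:
            for name, value in run.items():
                lo, hi = env.get(name, (value, value))
                env[name] = (min(lo, value), max(hi, value))
        return env

    def acceptance_test(run, env):
        """Flag any internal state that leaves its empirically learned envelope."""
        return [n for n, v in run.items()
                if n in env and not (env[n][0] <= v <= env[n][1])]

    envelope = learn_state_envelope([{"speed": 3.1, "range": 40.0},
                                     {"speed": 2.9, "range": 55.0}])
    print(acceptance_test({"speed": 9.9, "range": 50.0}, envelope))  # ['speed']
    ```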

  11. Impact of mineralization on carbon dioxide migration in terms of critical values of fault permeability.

    NASA Astrophysics Data System (ADS)

    Alshammari, A.; Brantley, D.; Knapp, C. C.; Lakshmi, V.

    2017-12-01

    In this study, multiple chemical components (H2O, H2S) will be injected with supercritical carbon dioxide in the onshore part of the South Georgia Rift (SGR) Basin model. Chemical reactions are expected between these components, producing stable carbonate minerals over time. The 3D geological model was extracted with the Petrel software, and the Computer Modelling Group (CMG) software package was used to build a simulation model explaining the effect of mineralization on fault permeability, which critically controls plume migration, for permeabilities between 0 and 0.05 mD. The expected results will be correlated with a single-component case (CO2 only) to evaluate the importance of mineralization on CO2 plume migration in structural and stratigraphic traps, and to detect the variation of fault leakage at critical (low-permeability) values. The results will also show the ratio of each trapped phase in the SGR basin reservoir model.

  12. VLSI Implementation of Fault Tolerance Multiplier based on Reversible Logic Gate

    NASA Astrophysics Data System (ADS)

    Ahmad, Nabihah; Hakimi Mokhtar, Ahmad; Othman, Nurmiza binti; Fhong Soon, Chin; Rahman, Ab Al Hadi Ab

    2017-08-01

    The multiplier is one of the essential components in the digital world, used in digital signal processing, microprocessors, quantum computing, and arithmetic units generally. Due to the complexity of the multiplier, the tendency for errors is very high. This paper aims to design a 2×2-bit fault-tolerant multiplier based on reversible logic gates with low power consumption and high performance. The design has been implemented using 90 nm Complementary Metal Oxide Semiconductor (CMOS) technology in Synopsys Electronic Design Automation (EDA) tools. The multiplier architecture is implemented using reversible logic gates: a combination of three gates, the double Feynman gate (F2G), the New Fault Tolerant (NFT) gate, and the Islam gate (IG), with an area of 160 μm × 420.3 μm (approximately 0.067 mm²). The design achieved a low power consumption of 122.85 μW and a propagation delay of 16.99 ns. The proposed fault-tolerant multiplier thus achieves low power consumption and high performance, making it suitable for modern computing applications while providing fault tolerance capabilities.
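
    For illustration, the standard mappings of the Feynman and double Feynman (F2G) gates, together with the bijectivity check that characterizes reversibility, are sketched below; the complete F2G/NFT/IG wiring of the 2×2 multiplier is not reproduced.

    ```python
    # Sketch of two reversible gates used in the multiplier above, plus a
    # bijectivity check. The Feynman and F2G mappings are the standard ones;
    # the full NFT/IG wiring of the 2x2 multiplier is not reproduced here.
    from itertools import product

    def feynman(a, b):           # CNOT / Feynman gate: (A, B) -> (A, A xor B)
        return a, a ^ b

    def double_feynman(a, b, c): # F2G: (A, B, C) -> (A, A xor B, A xor C)
        return a, a ^ b, a ^ c

    def is_reversible(gate, n_inputs):
        """A gate is reversible iff its truth table is a bijection."""
        outputs = {gate(*bits) for bits in product((0, 1), repeat=n_inputs)}
        return len(outputs) == 2 ** n_inputs

    print(is_reversible(feynman, 2), is_reversible(double_feynman, 3))  # True True
    ```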

  13. Progressive retry for software error recovery in distributed systems

    NASA Technical Reports Server (NTRS)

    Wang, Yi-Min; Huang, Yennun; Fuchs, W. K.

    1993-01-01

    In this paper, we describe a method of execution retry for bypassing software errors based on checkpointing, rollback, message reordering and replaying. We demonstrate how rollback techniques, previously developed for transient hardware failure recovery, can also be used to recover from software faults by exploiting message reordering to bypass software errors. Our approach intentionally increases the degree of nondeterminism and the scope of rollback when a previous retry fails. Examples from our experience with telecommunications software systems illustrate the benefits of the scheme.
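
    A toy sketch of the progressive-retry idea, under stated assumptions: roll back to a checkpoint and replay pending messages in a different order, widening the retry until some order bypasses the error. The order-dependent failure below is injected for illustration and does not model the telecommunications systems of the paper.

    ```python
    # Toy sketch of progressive retry: roll back to a checkpoint and replay the
    # pending messages in a different order, hoping to bypass an order-dependent
    # software error. The failing handler below is a stand-in, not the paper's.
    import copy
    import itertools

    def handler(state, msg):
        state.append(msg)
        # Injected order-dependent bug: fails if "b" is processed before "a".
        if state[:2] == ["b", "a"]:
            raise RuntimeError("latent software error")
        return state

    def progressive_retry(checkpoint, messages):
        for attempt, order in enumerate(itertools.permutations(messages)):
            state = copy.deepcopy(checkpoint)   # roll back to the checkpoint
            try:
                for msg in order:
                    state = handler(state, msg)
                return state, attempt           # this replay order succeeded
            except RuntimeError:
                continue                        # retry with messages reordered
        raise RuntimeError("all reorderings failed")

    state, retries = progressive_retry([], ["b", "a", "c"])
    print(state, "after", retries, "retries")   # -> ['b', 'c', 'a'] after 1 retries
    ```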

  14. Performance Monitoring of Chilled-Water Distribution Systems Using HVAC-Cx

    PubMed Central

    Ferretti, Natascha Milesi; Galler, Michael A.; Bushby, Steven T.

    2017-01-01

    In this research we develop, test, and demonstrate the newest extension of the software HVAC-Cx (NIST and CSTB 2014), an automated commissioning tool for detecting common mechanical faults and control errors in chilled-water distribution systems (loops). The commissioning process can improve occupant comfort, ensure the persistence of correct system operation, and reduce energy consumption. Automated tools support the process by decreasing the time and the skill level required to carry out necessary quality assurance measures, and as a result they enable more thorough testing of building heating, ventilating, and air-conditioning (HVAC) systems. This paper describes the algorithm, developed by National Institute of Standards and Technology (NIST), to analyze chilled-water loops and presents the results of a passive monitoring investigation using field data obtained from BACnet® (ASHRAE 2016) controllers and presents field validation of the findings. The tool was successful in detecting faults in system operation in its first field implementation supporting the investigation phase through performance monitoring. Its findings led to a full energy retrocommissioning of the field site. PMID:29167584

  15. Analysis of Seismotectonic Patterns in Sumatra Region Based on the Focal Mechanism of Earthquakes, Period 1976-2016

    NASA Astrophysics Data System (ADS)

    Indah, F. P.; Syafriani, S.; Andiyansyah, Z. S.

    2018-04-01

    Sumatra lies in an active subduction zone between the Indo-Australian plate and the Eurasian plate and is cut by the Sumatran fault, making it vulnerable to earthquakes. One way to determine the cause of an earthquake is to identify the type of earthquake-generating fault from the earthquake focal mechanism. The data used to identify the fault types are moment tensor data sourced from the Global CMT catalog for the period 1976-2016, restricted to magnitudes M ≥ 6. This research uses the GMT (Generic Mapping Tools) software to depict the fault geometry. Based on the data processing and the earthquake history of the 1976-2016 period, the fault characteristics in each region of the island of Sumatra are as follows: the Sumatran fault is strike-slip, the Mentawai fault shows reverse (thrust) and dip-slip faulting, and the subduction zone shows dip-slip faulting.

  16. A research program in empirical computer science

    NASA Technical Reports Server (NTRS)

    Knight, J. C.

    1991-01-01

    During the grant reporting period our primary activities have been to begin preparation for the establishment of a research program in experimental computer science. The focus of research in this program will be safety-critical systems. Many questions that arise in the effort to improve software dependability can only be addressed empirically. For example, there is no way to predict the performance of the various proposed approaches to building fault-tolerant software. Performance models, though valuable, are parameterized and cannot be used to make quantitative predictions without experimental determination of underlying distributions. In the past, experimentation has been able to shed some light on the practical benefits and limitations of software fault tolerance. It is common, also, for experimentation to reveal new questions or new aspects of problems that were previously unknown. A good example is the Consistent Comparison Problem that was revealed by experimentation and subsequently studied in depth. The result was a clear understanding of a previously unknown problem with software fault tolerance. The purpose of a research program in empirical computer science is to perform controlled experiments in the area of real-time, embedded control systems. The goal of the various experiments will be to determine better approaches to the construction of the software for computing systems that have to be relied upon. As such it will validate research concepts from other sources, provide new research results, and facilitate the transition of research results from concepts to practical procedures that can be applied with low risk to NASA flight projects. The target of experimentation will be the production software development activities undertaken by any organization prepared to contribute to the research program. Experimental goals, procedures, data analysis and result reporting will be performed for the most part by the University of Virginia.

  17. Application of neural networks to software quality modeling of a very large telecommunications system.

    PubMed

    Khoshgoftaar, T M; Allen, E B; Hudepohl, J P; Aud, S J

    1997-01-01

    Society relies on telecommunications to such an extent that telecommunications software must have high reliability. Enhanced measurement for early risk assessment of latent defects (EMERALD) is a joint project of Nortel and Bell Canada for improving the reliability of telecommunications software products. This paper reports a case study of neural-network modeling techniques developed for the EMERALD system. The resulting neural network is currently in the prototype testing phase at Nortel. Neural-network models can be used to identify fault-prone modules for extra attention early in development, and thus reduce the risk of operational problems with those modules. We modeled a subset of modules representing over seven million lines of code from a very large telecommunications software system. The set consisted of those modules reused with changes from the previous release. The dependent variable was membership in the class of fault-prone modules. The independent variables were principal components of nine measures of software design attributes. We compared the neural-network model with a nonparametric discriminant model and found the neural-network model had better predictive accuracy.
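
    A minimal sketch of the modeling pipeline described above: principal components of nine design measures feed a small neural-network classifier of fault-prone modules. The synthetic data and scikit-learn stand in for the EMERALD data and tooling.

    ```python
    # Minimal sketch of the modeling approach above: principal components of
    # nine software design measures feed a small neural-network classifier of
    # fault-prone modules. Synthetic data and scikit-learn stand in for EMERALD.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 9))                  # nine design measures per module
    y = (X[:, :3].sum(axis=1) > 1.5).astype(int)   # synthetic fault-prone label

    model = make_pipeline(StandardScaler(),
                          PCA(n_components=4),
                          MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000,
                                        random_state=0))
    model.fit(X[:400], y[:400])
    print("holdout accuracy:", model.score(X[400:], y[400:]))
    ```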

  18. Extended Testability Analysis Tool

    NASA Technical Reports Server (NTRS)

    Melcher, Kevin; Maul, William A.; Fulton, Christopher

    2012-01-01

    The Extended Testability Analysis (ETA) Tool is a software application that supports fault management (FM) by performing testability analyses on the fault propagation model of a given system. Fault management includes the prevention of faults through robust design margins and quality assurance methods, or the mitigation of system failures. Fault management requires an understanding of the system design and operation, potential failure mechanisms within the system, and the propagation of those potential failures through the system. The purpose of the ETA Tool software is to process the testability analysis results from a commercial software program called TEAMS Designer in order to provide a detailed set of diagnostic assessment reports. The ETA Tool is a command-line process with several user-selectable report output options. The ETA Tool also extends the COTS testability analysis and enables variation studies with sensor sensitivity impacts on system diagnostics and component isolation using a single testability output. The ETA Tool can also provide extended analyses from a single set of testability output files. The following analysis reports are available to the user: (1) the Detectability Report provides a breakdown of how each tested failure mode was detected, (2) the Test Utilization Report identifies all the failure modes that each test detects, (3) the Failure Mode Isolation Report demonstrates the system's ability to discriminate between failure modes, (4) the Component Isolation Report demonstrates the system's ability to discriminate between failure modes relative to the components containing the failure modes, (5) the Sensor Sensitivity Analysis Report shows the diagnostic impact due to loss of sensor information, and (6) the Effect Mapping Report identifies failure modes that result in specified system-level effects.

  19. Modeling of a latent fault detector in a digital system

    NASA Technical Reports Server (NTRS)

    Nagel, P. M.

    1978-01-01

    Methods of modeling the detection time or latency period of a hardware fault in a digital system are proposed that explain how a computer detects faults in a computational mode. The objectives were to study how software reacts to a fault, to account for as many variables as possible affecting detection, and to forecast a given program's detecting ability prior to computation. A series of experiments was conducted on a small emulated microprocessor with fault injection capability. Results indicate that the detecting capability of a program largely depends on the instruction subset used during computation and the frequency of its use, and has little direct dependence on such variables as fault mode, number set, degree of branching, and program length. A model is discussed which employs an analogy with balls in an urn to explain the rate at which subsequent repetitions of an instruction or instruction set detect a given fault.
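
    The urn analogy can be simulated directly: each executed instruction is a draw from the urn, and a latent fault is detected the first time a "detecting" instruction is drawn, so latency is geometric with mean 1/p. The 5% detecting fraction below is an illustrative assumption.

    ```python
    # Direct simulation of the balls-in-an-urn analogy above: each executed
    # instruction is a draw, and the latent fault is detected the first time a
    # "detecting" instruction is drawn. The 5% detecting fraction is assumed.
    import random

    def detection_latency(p_detect, rng):
        """Instruction executions until a detecting instruction is drawn."""
        n = 1
        while rng.random() >= p_detect:
            n += 1
        return n

    rng = random.Random(42)
    samples = [detection_latency(0.05, rng) for _ in range(10_000)]
    print("mean latency:", sum(samples) / len(samples))   # ~ 1/0.05 = 20 draws
    ```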

  1. Comparative analysis of techniques for evaluating the effectiveness of aircraft computing systems

    NASA Technical Reports Server (NTRS)

    Hitt, E. F.; Bridgman, M. S.; Robinson, A. C.

    1981-01-01

    Performability analysis is a technique developed for evaluating the effectiveness of fault-tolerant computing systems in multiphase missions. Performability was evaluated for its accuracy, practical usefulness, and relative cost. The evaluation was performed by applying performability and the fault tree method to a set of sample problems ranging from simple to moderately complex. The problems involved as many as five outcomes, two to five mission phases, permanent faults, and some functional dependencies. Transient faults and software errors were not considered. A different analyst was responsible for each technique. Significantly more time and effort were required to learn performability analysis than the fault tree method. Performability is inherently as accurate as fault tree analysis. For the sample problems, fault trees were more practical and less time consuming to apply, while performability required less ingenuity and was more checkable. Performability offers some advantages for evaluating very complex problems.

  2. Proceedings of the Twenty-Third Annual Software Engineering Workshop

    NASA Technical Reports Server (NTRS)

    1999-01-01

    The Twenty-third Annual Software Engineering Workshop (SEW) provided 20 presentations designed to further the goals of the Software Engineering Laboratory (SEL) of NASA-GSFC. The presentations were selected for their creativity. The sessions, held on 2-3 December 1998, centered on the SEL, Experimentation, Inspections, Fault Prediction, Verification and Validation, and Embedded Systems and Safety-Critical Systems.

  3. Operational Suitability Guide. Volume 2. Templates

    DTIC Science & Technology

    1990-05-01

    Intended mission, and the required technical and operational characteristics. The mission must be adequately defined and key hardware and software ... operational availability. With the use of fault-tolerant computer hardware and software, the system R&M will significantly improve end-to-end ... should include both hardware and software elements, as appropriate. Unique characteristics or unique support concepts should be identified if they result

  4. An Incremental Life-cycle Assurance Strategy for Critical System Certification

    DTIC Science & Technology

    2014-11-04

    for Safe Aircraft Operation ... Embedded software systems introduce a new class of problems not addressed by traditional system modeling & analysis ... Platform Runtime Architecture, Application Software, Embedded SW System Engineer, Data Stream Characteristics ... Latency jitter affects control behavior ... do system level failures still occur despite fault tolerance techniques being deployed in systems? Embedded software system as major source of

  5. Resilience Design Patterns - A Structured Approach to Resilience at Extreme Scale (version 1.0)

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hukerikar, Saurabh; Engelmann, Christian

    Reliability is a serious concern for future extreme-scale high-performance computing (HPC) systems. Projections based on the current generation of HPC systems and technology roadmaps suggest very high fault rates in future systems. The errors resulting from these faults will propagate and generate various kinds of failures, which may result in outcomes ranging from result corruptions to catastrophic application crashes. Practical limits on power consumption in HPC systems will require future systems to embrace innovative architectures, increasing the levels of hardware and software complexity. The resilience challenge for extreme-scale HPC systems requires management of various hardware and software technologies that are capable of handling a broad set of fault models at accelerated fault rates. These techniques must seek to improve resilience at reasonable overheads to power consumption and performance. While the HPC community has developed various solutions, application-level as well as system-based, the solution space of HPC resilience techniques remains fragmented. There are no formal methods and metrics to investigate and evaluate resilience holistically in HPC systems that consider impact scope, handling coverage, and performance and power efficiency across the system stack. Additionally, few of the current approaches are portable to the newer architectures and software ecosystems expected to be deployed on future systems. In this document, we develop a structured approach to the management of HPC resilience based on the concept of resilience design patterns. A design pattern is a general repeatable solution to a commonly occurring problem. We identify the commonly occurring problems and solutions used to deal with faults, errors, and failures in HPC systems. The catalog of resilience design patterns provides designers with reusable design elements. We define a design framework that enhances our understanding of the important constraints and opportunities for solutions deployed at various layers of the system stack. The framework may be used to establish mechanisms and interfaces to coordinate flexible fault management across hardware and software components. The framework also enables optimization of the cost-benefit trade-offs among performance, resilience, and power consumption. The overall goal of this work is to enable a systematic methodology for the design and evaluation of resilience technologies in extreme-scale HPC systems that keep scientific applications running to a correct solution in a timely and cost-efficient manner in spite of frequent faults, errors, and failures of various types.

  6. Advanced Ground Systems Maintenance Functional Fault Models For Fault Isolation Project

    NASA Technical Reports Server (NTRS)

    Perotti, Jose M. (Compiler)

    2014-01-01

    This project implements functional fault models (FFM) to automate the isolation of failures during ground systems operations. FFMs will also be used to recommend sensor placement to improve fault isolation capabilities. The project enables the delivery of system health advisories to ground system operators.

  7. Fuzzy model-based fault detection and diagnosis for a pilot heat exchanger

    NASA Astrophysics Data System (ADS)

    Habbi, Hacene; Kidouche, Madjid; Kinnaert, Michel; Zelmat, Mimoun

    2011-04-01

    This article addresses the design and real-time implementation of a fuzzy model-based fault detection and diagnosis (FDD) system for a pilot co-current heat exchanger. The design method is based on a three-step procedure which involves the identification of data-driven fuzzy rule-based models, the design of a fuzzy residual generator and the evaluation of the residuals for fault diagnosis using statistical tests. The fuzzy FDD mechanism has been implemented and validated on the real co-current heat exchanger, and has been proven to be efficient in detecting and isolating process, sensor and actuator faults.
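
    A hedged sketch of the residual-based part of the three-step procedure: a model predicts the plant output, the residual is measured minus predicted, and a statistical threshold test flags a fault. A known linear stand-in replaces the paper's identified fuzzy rule-based model, and the window and threshold are illustrative.

    ```python
    # Hedged sketch of residual-based FDD: a model predicts the plant output, the
    # residual is measured minus predicted, and a windowed statistical test flags
    # a fault. A known linear model stands in for the paper's identified fuzzy
    # rule-based models; window and threshold are illustrative.
    import numpy as np

    def residual_test(measured, predicted, window=20, k=3.0):
        """Flag a fault when the windowed mean residual exceeds k standard
        errors of the nominal (fault-free) residual noise."""
        r = np.asarray(measured) - np.asarray(predicted)
        nominal_std = r[:window].std() + 1e-12        # noise level, fault-free window
        means = np.convolve(r, np.ones(window) / window, mode="valid")
        return np.abs(means) > k * nominal_std / np.sqrt(window)

    t = np.arange(200)
    predicted = 50.0 + 0.1 * t                        # model output, e.g. outlet temp
    rng = np.random.default_rng(1)
    measured = predicted + rng.normal(0.0, 0.2, t.size)
    measured[120:] += 1.5                             # injected sensor bias fault
    alarms = residual_test(measured, predicted)
    print("first alarm at sample:", int(np.argmax(alarms)))
    ```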

  8. Software Requirements Analysis as Fault Predictor

    NASA Technical Reports Server (NTRS)

    Wallace, Dolores

    2003-01-01

    Waiting until the integration and system test phase to discover errors leads to more costly rework than resolving those same errors earlier in the lifecycle. Costs increase even more significantly once a software system has become operational. We can assess the quality of system requirements, but do little to correlate this information either to system assurance activities or long-term reliability projections, both of which remain unclear and anecdotal. Extending earlier work on requirements analysis accomplished with the ARM tool, requirements quality information measured against code complexity and test data for the same system may be used to predict specific software modules containing high-impact or deeply embedded faults now escaping into operational systems. Such knowledge would lead to more effective and efficient test programs. It may enable insight into whether a program should be maintained or started over.

  9. Technical Reference Suite Addressing Challenges of Providing Assurance for Fault Management Architectural Design

    NASA Technical Reports Server (NTRS)

    Fitz, Rhonda; Whitman, Gerek

    2016-01-01

    Research into complexities of software systems Fault Management (FM) and how architectural design decisions affect safety, preservation of assets, and maintenance of desired system functionality has coalesced into a technical reference (TR) suite that advances the provision of safety and mission assurance. The NASA Independent Verification and Validation (IV&V) Program, with Software Assurance Research Program support, extracted FM architectures across the IV&V portfolio to evaluate robustness, assess visibility for validation and test, and define software assurance methods applied to the architectures and designs. This investigation spanned IV&V projects with seven different primary developers, a wide range of sizes and complexities, and encompassed Deep Space Robotic, Human Spaceflight, and Earth Orbiter mission FM architectures. The initiative continues with an expansion of the TR suite to include Launch Vehicles, adding the benefit of investigating differences intrinsic to model-based FM architectures and insight into complexities of FM within an Agile software development environment, in order to improve awareness of how nontraditional processes affect FM architectural design and system health management.

  10. A framework for software fault tolerance in real-time systems

    NASA Technical Reports Server (NTRS)

    Anderson, T.; Knight, J. C.

    1983-01-01

    A classification scheme for errors and a technique for the provision of software fault tolerance in cyclic real-time systems are presented. The technique requires that the process structure of a system be represented by a synchronization graph, which is used by an executive as a specification of the relative times at which processes will communicate during execution. Communication between concurrent processes is severely limited and may only take place between processes engaged in an exchange. A history of error occurrences is maintained by an error handler. When an error is detected, the error handler classifies it using the error history information and then initiates appropriate recovery action.

  11. Regional tectonic evaluation of the Tuscan Apennine, volcanism, thermal anomalies and the relation to structural units

    NASA Technical Reports Server (NTRS)

    Bodechtel, J. (Principal Investigator)

    1975-01-01

    The author has identified the following significant results. The geological interpretation of data covering the Italian peninsula led to the recognition of tectonic features which are explained by a clockwise rotation of various blocks along left-handed transform faults. These faults can be interpreted as resulting from shear due to a main stress directed northeastwards. A land use map of the mountainous regions of Italy was produced at a scale of 1:250,000. For the digital processing of MSS CCTs, image processing software was written in FORTRAN IV. The software package includes descriptive statistics as well as classification algorithms.

  12. The Development of Design Tools for Fault Tolerant Quantum Dot Cellular Automata Based Logic

    NASA Technical Reports Server (NTRS)

    Armstrong, Curtis D.; Humphreys, William M.

    2003-01-01

    We are developing software to explore the fault tolerance of quantum dot cellular automata gate architectures in the presence of manufacturing variations and device defects. The Topology Optimization Methodology using Applied Statistics (TOMAS) framework extends the capabilities of AQUINAS (A Quantum Interconnected Network Array Simulator) by adding front-end and back-end software and creating an environment that integrates all of these components. The front-end tools establish all simulation parameters, configure the simulation system, automate the Monte Carlo generation of simulation files, and execute the simulation of these files. The back-end tools perform automated data parsing, statistical analysis, and report generation.
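
    The Monte Carlo front end might resemble the following sketch: perturb nominal cell parameters with assumed manufacturing spreads and emit one simulation configuration per trial. The parameter names, spreads, and JSON format are illustrative assumptions, not the TOMAS schema.

    ```python
    # Sketch of the Monte Carlo front end described above: perturb nominal QCA
    # cell parameters with assumed manufacturing spreads and emit one simulation
    # configuration per trial. Names, spreads, and the JSON format are
    # illustrative assumptions, not the TOMAS schema.
    import json
    import random

    NOMINAL = {"cell_spacing_nm": 20.0, "dot_diameter_nm": 5.0, "defect_rate": 0.0}
    SPREAD = {"cell_spacing_nm": 0.5, "dot_diameter_nm": 0.2}

    def generate_trials(n, seed=0):
        rng = random.Random(seed)
        for trial in range(n):
            cfg = dict(NOMINAL)
            for key, sigma in SPREAD.items():
                cfg[key] = rng.gauss(NOMINAL[key], sigma)   # manufacturing variation
            cfg["defect_rate"] = rng.uniform(0.0, 0.02)     # random device defects
            yield trial, cfg

    for trial, cfg in generate_trials(3):
        with open(f"trial_{trial:04d}.json", "w") as f:
            json.dump(cfg, f)          # one input file per Monte Carlo simulation run
    ```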

  13. Monitoring microearthquakes with the San Andreas fault observatory at depth

    USGS Publications Warehouse

    Oye, V.; Ellsworth, W.L.

    2007-01-01

    In 2005, the San Andreas Fault Observatory at Depth (SAFOD) was drilled through the San Andreas Fault zone at a depth of about 3.1 km. The borehole has subsequently been instrumented with high-frequency geophones in order to better constrain locations and source processes of nearby microearthquakes that will be targeted in the upcoming phase of SAFOD. The microseismic monitoring software MIMO, developed by NORSAR, has been installed at SAFOD to provide near-real-time locations and magnitude estimates using the high-sampling-rate (4000 Hz) waveform data. To improve the detection and location accuracy, we incorporate data from the nearby, shallow borehole (~250 m) seismometers of the High Resolution Seismic Network (HRSN). The event association algorithm of the MIMO software incorporates HRSN detections provided by the USGS real-time Earthworm software. The new event association concept is based on generalized beamforming, primarily used in array seismology. The method requires the pre-computation of theoretical travel times in a 3D grid of potential microearthquake locations to the seismometers of the current station network. By minimizing the differences between theoretical and observed detection times, an event is associated and the location accuracy is significantly improved.
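
    The grid-based association idea (precompute theoretical travel times from candidate locations to each station, then pick the node that minimizes the misfit with observed detection times) can be sketched as follows; the homogeneous velocity, station geometry, and coarse grid are assumptions for illustration.

    ```python
    # Sketch of grid-based event association: precompute travel times from
    # candidate source locations to each station, then choose the location and
    # origin time minimizing the misfit between theoretical and observed
    # detection times. Velocity, stations, and grid are illustrative.
    import itertools
    import math

    V_P = 5.0   # km/s, assumed homogeneous P-wave velocity
    STATIONS = {"S1": (0.0, 0.0, 0.0), "S2": (3.0, 0.0, 0.2), "S3": (0.0, 3.0, 0.1)}

    def travel_time(src, sta):
        return math.dist(src, sta) / V_P

    def locate(observed):               # observed: station name -> detection time
        best = None
        for src in itertools.product((0.0, 1.0, 2.0), repeat=3):  # coarse grid, km
            tts = {s: travel_time(src, xyz) for s, xyz in STATIONS.items()}
            # Best origin time for this node is the mean of (observed - predicted).
            t0 = sum(observed[s] - tts[s] for s in STATIONS) / len(STATIONS)
            misfit = sum((observed[s] - (t0 + tts[s])) ** 2 for s in STATIONS)
            if best is None or misfit < best[0]:
                best = (misfit, src, t0)
        return best

    true_src, true_t0 = (1.0, 2.0, 1.0), 0.3
    obs = {s: true_t0 + travel_time(true_src, xyz) for s, xyz in STATIONS.items()}
    print(locate(obs))   # recovers src (1.0, 2.0, 1.0) and t0 0.3 with ~zero misfit
    ```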

  14. Fuzzy logic based on-line fault detection and classification in transmission line.

    PubMed

    Adhikari, Shuma; Sinha, Nidul; Dorendrajit, Thingam

    2016-01-01

    This study presents fuzzy logic based online fault detection and classification for transmission lines using Programmable Automation and Control technology based on National Instruments Compact Reconfigurable I/O (CompactRIO) devices. The LabVIEW software combined with CompactRIO can perform real-time data acquisition on the transmission line. When a fault occurs in the system, the current waveforms are distorted by transients, and their pattern changes according to the type of fault. The three-phase alternating current, zero-sequence, and positive-sequence current data generated by LabVIEW through the cRIO-9067 are processed directly for relaying. The results show that the proposed technique is capable of correct tripping action and classification of the fault type at high speed, and can therefore be employed in practical applications.

  15. Various Indices for Diagnosis of Air-gap Eccentricity Fault in Induction Motor-A Review

    NASA Astrophysics Data System (ADS)

    Nikhil; Mathew, Lini, Dr.; Sharma, Amandeep

    2018-03-01

    Over the past few years, research in the field of fault detection and diagnosis in induction motors has gained considerable pace. In the current scenario, software with diagnostic features is being introduced to improve the stability and reliability of fault diagnostic techniques, and human involvement in decision making for fault detection is slowly being replaced by artificial intelligence techniques. In this paper, a brief introduction to the air-gap eccentricity fault is presented along with its causes and effects on the health of induction motors. Various indices used to detect eccentricity are introduced along with their boundary conditions and future scope of research. Finally, the merits and demerits of all indices are discussed and a comparison is made between them.

  16. Optimization of the coherence function estimation for multi-core central processing unit

    NASA Astrophysics Data System (ADS)

    Cheremnov, A. G.; Faerman, V. A.; Avramchuk, V. S.

    2017-02-01

    The paper considers the use of parallel processing on a multi-core central processing unit for optimizing the evaluation of the coherence function arising in digital signal processing. The coherence function, along with other methods of spectral analysis, is commonly used for vibration diagnosis of rotating machinery and its particular nodes. An algorithm is given for evaluating the function for signals represented by digital samples, and the algorithm is analyzed with respect to its software implementation and computational problems. Optimization measures are described, including algorithmic, architectural, and compiler optimizations, and their results are assessed for multi-core processors from different manufacturers. The speed-up of parallel execution with respect to sequential execution was studied, and results are presented for Intel Core i7-4720HQ and AMD FX-9590 processors. The results show comparatively high efficiency of the optimization measures taken; in particular, acceleration indicators and average CPU utilization were significantly improved, showing a high degree of parallelism in the constructed calculation functions. The developed software underwent state registration and will be used as part of a software and hardware solution for rotating machinery fault diagnosis and pipeline leak location with the acoustic correlation method.
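
    For reference, the magnitude-squared coherence that the paper's parallel implementation accelerates can be estimated sequentially with SciPy's Welch-averaged estimator; the test signals below are illustrative.

    ```python
    # Sequential reference for the coherence estimate that the paper's parallel
    # implementation accelerates, via SciPy's Welch-averaged estimator. The test
    # signals are illustrative.
    import numpy as np
    from scipy.signal import coherence

    fs = 10_000.0
    t = np.arange(0.0, 2.0, 1.0 / fs)
    rng = np.random.default_rng(0)
    common = np.sin(2 * np.pi * 500.0 * t)                       # shared 500 Hz tone
    x = common + 0.5 * rng.standard_normal(t.size)
    y = np.roll(common, 25) + 0.5 * rng.standard_normal(t.size)  # delayed copy

    f, Cxy = coherence(x, y, fs=fs, nperseg=1024)
    print("coherence near 500 Hz:", Cxy[np.argmin(np.abs(f - 500.0))])  # near 1
    ```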

  17. The role of thin, mechanical discontinuities on the propagation of reverse faults: insights from analogue models

    NASA Astrophysics Data System (ADS)

    Bonanno, Emanuele; Bonini, Lorenzo; Basili, Roberto; Toscani, Giovanni; Seno, Silvio

    2016-04-01

    Fault-related folding kinematic models are widely used to explain how crustal shortening is accommodated. These models, however, include simplifications, such as the assumption of a constant fault growth rate. This rate is sometimes not constant even in isotropic materials, and it is even more variable in naturally anisotropic geological systems, which means that these simplifications could lead to incorrect interpretations. In this study, we use analogue models to evaluate how thin mechanical discontinuities, such as bedding or thin weak layers, influence the propagation of reverse faults and related folds. The experiments are performed with two different settings to simulate initially blind master faults dipping at 30° and 45°. The 30° dip represents one of the Andersonian conjugate faults, and the 45° dip is very frequent in positive reactivation of normal faults. The experimental apparatus consists of a clay layer placed above two plates: one plate, the footwall, is fixed; the other, the hanging wall, is mobile. Motor-controlled sliding of the hanging-wall plate along an inclined plane reproduces the reverse fault movement. We ran thirty-six experiments: eighteen with a dip of 30° and eighteen with a dip of 45°. For each dip-angle setting, we initially ran isotropic experiments that serve as a reference; the remaining experiments include one or two discontinuities (horizontal precuts made in the clay layer). We monitored the experiments by collecting side photographs every 1.0 mm of displacement on the master fault. These images were analyzed with the PIVlab software, a tool based on the Digital Image Correlation method. With the displacement field analysis (one of the PIVlab tools) we evaluated the variation of the trishear zone shape and how the master-fault tip and newly formed faults propagate into the clay medium. With the strain distribution analysis, we observed the amount of on-fault and off-fault deformation with respect to the faulting pattern and its evolution. Secondly, using the MOVE software, we extracted the positions of fault tips and folds every 5 mm of displacement on the master fault. Analyzing these positions in all of the experiments, we found that the growth rate of the faults and the related fold shape vary depending on the number of discontinuities in the clay medium. Other results can be summarized as follows: 1) the fault growth rate is not constant, but varies especially while a new fault interacts with the precuts; 2) new faults tend to crosscut the discontinuities when the angle between them is approximately 90°; 3) the trishear zone changes its shape during the experiments, especially when the main fault interacts with the discontinuities.

  18. Dataflow models for fault-tolerant control systems

    NASA Technical Reports Server (NTRS)

    Papadopoulos, G. M.

    1984-01-01

    Dataflow concepts are used to generate a unified hardware/software model of redundant physical systems which are prone to faults. Basic results in input congruence and synchronization are shown to reduce to a simple model of data exchanges between processing sites. Procedures are given for the construction of congruence schemata, the distinguishing features of any correctly designed redundant system.

  19. Fault Injection Campaign for a Fault Tolerant Duplex Framework

    NASA Technical Reports Server (NTRS)

    Sacco, Gian Franco; Ferraro, Robert D.; von llmen, Paul; Rennels, Dave A.

    2007-01-01

    Fault tolerance is an efficient approach adopted to avoid or reduce the damage of a system failure. In this work we present the results of a fault injection campaign we conducted on the Duplex Framework (DF). The DF is a software framework developed by the UCLA group [1, 2] that takes a fault-tolerant approach, running two replicas of the same process on two different nodes of a commercial off-the-shelf (COTS) computer cluster. A third process, running on a different node, constantly monitors the results computed by the two replicas and restarts the two replica processes if an inconsistency in their computation is detected. This approach is very cost efficient and can be adopted to control processes on spacecraft where the fault rate produced by cosmic rays is not very high.
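
    A toy sketch of the monitor's compare-and-restart loop, under stated assumptions: in-process functions stand in for the paper's cluster nodes, and the restart policy is simplified (the failed step is discarded rather than replayed).

    ```python
    # Toy sketch of the Duplex Framework's monitor loop: two replicas compute each
    # step, a third routine compares their results and restarts both from a safe
    # state on a mismatch. In-process functions stand in for the cluster nodes.
    def run_step(state, x, fault=False):
        result = state + x
        return result ^ 1 if fault else result   # optional injected bit flip

    def monitor(inputs, fault_at=None, restart_state=0):
        a = b = restart_state
        restarts = 0
        for i, x in enumerate(inputs):
            ra = run_step(a, x, fault=(i == fault_at))   # replica A (maybe faulty)
            rb = run_step(b, x)                          # replica B
            if ra != rb:              # mismatch: the comparison detects the fault
                a = b = restart_state # restart both replicas from the safe state
                restarts += 1         # (replay of the lost step omitted here)
                continue
            a, b = ra, rb
        return a, restarts

    print(monitor([1, 2, 3]))              # -> (6, 0) in the fault-free case
    print(monitor([1, 2, 3], fault_at=1))  # fault detected at step 1, one restart
    ```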

  20. Lessons Learned on Implementing Fault Detection, Isolation, and Recovery (FDIR) in a Ground Launch Environment

    NASA Technical Reports Server (NTRS)

    Ferell, Bob; Lewis, Mark; Perotti, Jose; Oostdyk, Rebecca; Goerz, Jesse; Brown, Barbara

    2010-01-01

    This paper's main purpose is to detail issues and lessons learned regarding designing, integrating, and implementing Fault Detection Isolation and Recovery (FDIR) for Constellation Exploration Program (CxP) Ground Operations at Kennedy Space Center (KSC).

  1. Adjustable Autonomy Testbed

    NASA Technical Reports Server (NTRS)

    Malin, Jane T.; Schrenkenghost, Debra K.

    2001-01-01

    The Adjustable Autonomy Testbed (AAT) is a simulation-based testbed located in the Intelligent Systems Laboratory in the Automation, Robotics and Simulation Division at NASA Johnson Space Center. The purpose of the testbed is to support evaluation and validation of prototypes of adjustable autonomous agent software for control and fault management for complex systems. The AAT project has developed prototype adjustable autonomous agent software and human interfaces for cooperative fault management. This software builds on current autonomous agent technology by altering the architecture, components and interfaces for effective teamwork between autonomous systems and human experts. Autonomous agents include a planner, flexible executive, low level control and deductive model-based fault isolation. Adjustable autonomy is intended to increase the flexibility and effectiveness of fault management with an autonomous system. The test domain for this work is control of advanced life support systems for habitats for planetary exploration. The CONFIG hybrid discrete event simulation environment provides flexible and dynamically reconfigurable models of the behavior of components and fluids in the life support systems. Both discrete event and continuous (discrete time) simulation are supported, and flows and pressures are computed globally. This provides fast dynamic simulations of interacting hardware systems in closed loops that can be reconfigured during operations scenarios, producing complex cascading effects of operations and failures. Current object-oriented model libraries support modeling of fluid systems, and models have been developed of physico-chemical and biological subsystems for processing advanced life support gases. In FY01, water recovery system models will be developed.

  2. Remote monitoring and fault recovery for FPGA-based field controllers of telescope and instruments

    NASA Astrophysics Data System (ADS)

    Zhu, Yuhua; Zhu, Dan; Wang, Jianing

    2012-09-01

    As telescopes grow in size and functionality, modern telescopes have widely adopted a control architecture consisting of a central control unit plus field controllers. An FPGA-based field controller has the advantage of being field programmable, which makes it convenient to modify the software and hardware of the control system and provides a good platform for implementing new control schemes. Because there are many controlled nodes operating in poor environments at scattered locations, the reliability and stability of the field controllers must be fully addressed. This paper mainly describes how we use FPGA-based field controllers and remote Ethernet access to construct a multi-node monitoring system. When a failure appears, the FPGA chip first attempts self-recovery in accordance with pre-defined recovery strategies. If the chip cannot restore itself, the field controller can be reconstructed remotely through network intervention. This paper also introduces the network-based remote reconstruction solution for the controller, the system structure and transport protocol, and the implementation methods. The hardware and software design approach, based on the FPGA, is given. After actual operation on large telescopes, the desired results have been achieved. The improvement increases system reliability and reduces the maintenance workload, showing good prospects for application and popularization.
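
    The two-stage recovery policy described above can be sketched as follows; the device and network calls here are hypothetical placeholders, not a real FPGA API:

    ```python
    # Schematic of the escalating recovery policy: local self-recovery
    # first, remote reconstruction over the network only if that fails.
    import time

    def local_self_recovery(node):
        """Stage 1: the FPGA reloads its own pre-stored configuration."""
        node["healthy"] = node.get("self_repairable", False)
        return node["healthy"]

    def remote_reconstruction(node):
        """Stage 2: push a fresh configuration over Ethernet."""
        time.sleep(0.1)  # stand-in for the network transfer
        node["healthy"] = True
        return True

    def monitor(nodes):
        for name, node in nodes.items():
            if node["healthy"]:
                continue
            if local_self_recovery(node):
                print(name, "recovered locally")
            elif remote_reconstruction(node):
                print(name, "reconstructed remotely")

    monitor({"axis1": {"healthy": False, "self_repairable": True},
             "dome":  {"healthy": False, "self_repairable": False}})
    ```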

  3. Solar Photovoltaic (PV) Distributed Generation Systems - Control and Protection

    NASA Astrophysics Data System (ADS)

    Yi, Zhehan

    This dissertation proposes a comprehensive control, power management, and fault detection strategy for solar photovoltaic (PV) distributed generation systems. Battery storage is typically employed in PV systems to mitigate the power fluctuation caused by unstable solar irradiance. With AC and DC loads, a PV-battery system can be treated as a hybrid microgrid which contains both DC and AC power resources and buses. In this thesis, a control and power management system (CAPMS) for PV-battery hybrid microgrids is proposed, which provides 1) a DC and AC bus voltage and AC frequency regulating scheme, with controllers designed to track set points; 2) a power flow management strategy in the hybrid microgrid to achieve system generation and demand balance in both grid-connected and islanded modes; and 3) smooth transition control during grid reconnection by frequency and phase synchronization between the main grid and the microgrid. Due to the increasing demand for PV power, PV systems are growing in scale and fault detection in PV arrays is becoming challenging. High-impedance faults, low-mismatch faults, and faults occurring in low irradiance conditions tend to be hidden due to low fault currents, particularly when a PV maximum power point tracking (MPPT) algorithm is in service. If they remain undetected, these faults can considerably lower the output energy of solar systems, damage the panels, and potentially cause fire hazards. In this dissertation, fault detection challenges in PV arrays are analyzed in depth, considering the interrelations among the characteristics of PV, interactions with MPPT algorithms, and the nature of solar irradiance. Two fault detection schemes are then designed to address these technical issues; they detect faults inside PV arrays accurately even under challenging circumstances, e.g., faults in low irradiance conditions or high-impedance faults. Taking advantage of multi-resolution signal decomposition (MSD), a powerful signal processing technique based on the discrete wavelet transform (DWT), the first scheme extracts the features of both line-to-line (L-L) and line-to-ground (L-G) faults and employs a fuzzy inference system (FIS) for the decision-making stage of fault detection. This scheme is then improved in the second attempt by further studying the system's behavior during L-L faults, extracting more efficient fault features, and devising a more advanced decision-making stage: a two-stage support vector machine (SVM). For the first time, the two-stage SVM method is proposed in this dissertation to detect L-L faults in PV systems with satisfactory accuracy. Numerous simulation and experimental case studies are carried out to verify the proposed control and protection strategies. The simulation environment is set up using the PSCAD/EMTDC and Matlab/Simulink software packages. Experimental case studies are conducted in a PV-battery hybrid microgrid using the dSPACE real-time controller to demonstrate the ease of hardware implementation and the controller performance. Another small-scale grid-connected PV system is set up to verify both fault detection algorithms, which demonstrate promising performance and fault detection accuracy.
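
    A minimal sketch of the wavelet-energy-plus-SVM idea, assuming synthetic stand-ins for measured PV string currents and a single SVM in place of the dissertation's two-stage classifier (PyWavelets and scikit-learn):

    ```python
    # Hedged sketch: decompose a waveform with a DWT and classify on
    # per-band energies. Synthetic signals stand in for fault/normal data.
    import numpy as np
    import pywt                      # PyWavelets
    from sklearn.svm import SVC

    def band_energies(signal, wavelet="db4", level=4):
        coeffs = pywt.wavedec(signal, wavelet, level=level)
        return np.log([np.sum(c ** 2) + 1e-12 for c in coeffs])

    rng = np.random.default_rng(0)
    t = np.linspace(0, 1, 512)
    normal = [np.sin(2*np.pi*5*t) + 0.05*rng.standard_normal(512)
              for _ in range(40)]
    faulty = [0.4*np.sin(2*np.pi*5*t) + 0.3*rng.standard_normal(512)
              for _ in range(40)]

    X = np.array([band_energies(s) for s in normal + faulty])
    y = np.array([0]*40 + [1]*40)
    clf = SVC(kernel="rbf").fit(X, y)
    print("training accuracy:", clf.score(X, y))
    ```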

  4. Reliable High Performance Peta- and Exa-Scale Computing

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Bronevetsky, G

    2012-04-02

    As supercomputers become larger and more powerful, they are growing increasingly complex. This is reflected both in the exponentially increasing numbers of components in HPC systems (LLNL is currently installing the 1.6 million core Sequoia system) and in the wide variety of software and hardware components that a typical system includes. At this scale it becomes infeasible to make each component sufficiently reliable to prevent regular faults somewhere in the system or to account for all possible cross-component interactions. The resulting faults and instability cause HPC applications to crash, perform sub-optimally or even produce erroneous results. As supercomputers continue to approach Exascale performance and full system reliability becomes prohibitively expensive, we will require novel techniques to bridge the gap between the lower reliability provided by hardware systems and users' unchanging need for consistent performance and reliable results. Previous research on HPC system reliability has developed various techniques for tolerating and detecting faults. However, these techniques have seen very limited real applicability because of our poor understanding of how real systems are affected by complex faults such as soft fault-induced bit flips or performance degradations. Prior work on such techniques has had very limited practical utility because it has generally focused on analyzing the behavior of entire software/hardware systems both during normal operation and in the face of faults. Because such behaviors are extremely complex, such studies have only produced coarse behavioral models of limited sets of software/hardware system stacks. Since this provides little insight into the many different system stacks and applications used in practice, this work has had little real-world impact. My project addresses this problem by developing a modular methodology to analyze the behavior of applications and systems during both normal and faulty operation. By synthesizing models of individual components into whole-system behavior models, my work is making it possible to automatically understand the behavior of arbitrary real-world systems so that they can tolerate a wide range of system faults. My project follows a multi-pronged research strategy. Section II discusses my work on modeling the behavior of existing applications and systems. Section II.A discusses resilience in the face of soft faults and Section II.B looks at techniques to tolerate performance faults. Finally, Section III presents an alternative approach that studies how a system should be designed from the ground up to make resilience natural and easy.

  5. Software Health Management: A Short Review of Challenges and Existing Techniques

    NASA Technical Reports Server (NTRS)

    Pipatsrisawat, Knot; Darwiche, Adnan; Mengshoel, Ole J.; Schumann, Johann

    2009-01-01

    Modern spacecraft (as well as most other complex mechanisms like aircraft, automobiles, and chemical plants) rely more and more on software, to a point where software failures have caused severe accidents and loss of missions. Software failures during a manned mission can cause loss of life, so there are severe requirements to make the software as safe and reliable as possible. Typically, verification and validation (V&V) has the task of making sure that all software errors are found before the software is deployed and that it always conforms to the requirements. Experience, however, shows that this gold standard of error-free software cannot be reached in practice. Even if the software alone is free of glitches, its interoperation with the hardware (e.g., with sensors or actuators) can cause problems. Unexpected operational conditions or changes in the environment may ultimately cause a software system to fail. Is there a way to surmount this problem? In most modern aircraft and many automobiles, hardware such as central electrical, mechanical, and hydraulic components are monitored by IVHM (Integrated Vehicle Health Management) systems. These systems can recognize, isolate, and identify faults and failures, both those that already occurred as well as imminent ones. With the help of diagnostics and prognostics, appropriate mitigation strategies can be selected (replacement or repair, switch to redundant systems, etc.). In this short paper, we discuss some challenges and promising techniques for software health management (SWHM). In particular, we identify unique challenges for preventing software failure in systems which involve both software and hardware components. We then present our classifications of techniques related to SWHM. These classifications are performed based on dimensions of interest to both developers and users of the techniques, and hopefully provide a map for dealing with software faults and failures.

  6. Fault-tolerant Control of a Cyber-physical System

    NASA Astrophysics Data System (ADS)

    Roxana, Rusu-Both; Eva-Henrietta, Dulf

    2017-10-01

    Cyber-physical systems represent a new emerging field in automatic control. Fault-tolerant operation is a key requirement, because modern, large-scale processes must meet high standards of performance, reliability and safety. Fault propagation in large scale chemical processes can lead to loss of production, energy, raw materials and even environmental hazard. The present paper develops a multi-agent fault-tolerant control architecture using robust fractional order controllers for a (13C) cryogenic separation column cascade. The JADE (Java Agent DEvelopment Framework) platform was used to implement the multi-agent fault tolerant control system, while the operational model of the process was implemented in the Matlab/SIMULINK environment. The MACSimJX (Multiagent Control Using Simulink with Jade Extension) toolbox was used to link the control system and the process model. In order to verify the performance and prove the feasibility of the proposed control architecture, several fault simulation scenarios were performed.
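
    A conceptual sketch of supervisory switching on a detected fault (not the paper's JADE/fractional-order design; the controllers, residual, and threshold below are invented purely for illustration):

    ```python
    # Toy fault-tolerant control loop: a supervisory agent watches a
    # residual and hands control to a backup controller when the primary
    # appears faulty. All numbers are illustrative.
    def primary(e):  return 2.0 * e            # nominal proportional law
    def backup(e):   return 1.2 * e + 0.1      # degraded-mode law

    def supervise(errors, threshold=5.0):
        controller, residual = primary, 0.0
        for e in errors:
            residual = 0.9 * residual + abs(e)  # leaky integral of error
            if controller is primary and residual > threshold:
                controller = backup             # fault suspected: switch
                print("switched to backup controller")
            yield controller(e)

    outputs = list(supervise([0.1, 0.2, 4.0, 6.0, 5.5, 0.3]))
    print(outputs)
    ```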

  7. Fault-tolerant building-block computer study

    NASA Technical Reports Server (NTRS)

    Rennels, D. A.

    1978-01-01

    Ultra-reliable core computers are required for improving the reliability of complex military systems. Such computers can provide reliable fault diagnosis, failure circumvention, and, in some cases, serve as an automated repairman for their host systems. A small set of building-block circuits which can be implemented as single very large scale integration (VLSI) devices, and which can be used with off-the-shelf microprocessors and memories to build self checking computer modules (SCCM), is described. Each SCCM is a microcomputer which is capable of detecting its own faults during normal operation and is designed to communicate with other identical modules over one or more Mil Standard 1553A buses. Several SCCMs can be connected into a network with backup spares to provide fault-tolerant operation, i.e. automated recovery from faults. Alternative fault-tolerant SCCM configurations are discussed along with the cost and reliability associated with their implementation.

  8. Abstract for 1999 Rational Software User Conference

    NASA Technical Reports Server (NTRS)

    Dunphy, Julia; Rouquette, Nicolas; Feather, Martin; Tung, Yu-Wen

    1999-01-01

    We develop spacecraft fault-protection software at NASA/JPL. Challenges exemplified by our task: 1) high-quality systems - need for extensive validation & verification; 2) multi-disciplinary context - involves experts from diverse areas; 3) embedded systems - must adapt to external practices, notations, etc.; and 4) development pressures - NASA's mandate of "better, faster, cheaper".

  9. Software quality: Process or people

    NASA Technical Reports Server (NTRS)

    Palmer, Regina; Labaugh, Modenna

    1993-01-01

    This paper will present data related to software development processes and personnel involvement from the perspective of software quality assurance. We examine eight years of data collected from six projects. Data collected varied by project but usually included defect and fault density, with limited use of code metrics, schedule adherence, and budget growth information. The data are a blend of AFSCP 800-14 and suggested productivity measures in Software Metrics: A Practitioner's Guide to Improved Product Development. A software quality assurance database tool, SQUID, was used to store and tabulate the data.

  10. A software upgrade method for micro-electronics medical implants.

    PubMed

    Cao, Yang; Hao, Hongwei; Xue, Lin; Li, Luming; Ma, Bozhi

    2006-01-01

    A software upgrade method for microelectronics medical implants is designed to enhance the devices' function, or to renew the software if bugs are found, the software needs updating, or some memory units become disabled. The implants need not be replaced surgically if the faults can be corrected through reprogramming, which reduces the patients' pain and effectively improves safety. This paper introduces the software upgrade method using in-application programming (IAP) and emphasizes how to ensure the reliability and stability of the system, especially of the implanted part, while upgrading.
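
    The safety pattern behind in-application programming can be sketched as follows, assuming a hypothetical dual-bank flash layout and CRC32 verification; this is an illustration of the general idea, not the paper's protocol:

    ```python
    # Illustrative sketch of a safe IAP-style upgrade: verify the uplinked
    # image before committing, keep the old bank intact as a fallback.
    # The flash layout and names are hypothetical.
    import zlib

    def upgrade(flash, payload, claimed_crc):
        """Program the spare bank only after the CRC check passes."""
        if zlib.crc32(payload) != claimed_crc:
            return "rejected: bad CRC, old firmware still active"
        flash["spare_bank"] = payload            # write outside running bank
        flash["boot_bank"], flash["spare_bank"] = (
            flash["spare_bank"], flash["boot_bank"])   # bank swap to commit
        return "upgraded: reboot into new image"

    flash = {"boot_bank": b"firmware-v1", "spare_bank": b""}
    new_image = b"firmware-v2"
    print(upgrade(flash, new_image, zlib.crc32(new_image)))
    ```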

  11. Design Description of the X-33 Avionics Architecture

    NASA Technical Reports Server (NTRS)

    Reichenfeld, Curtis J.; Jones, Paul G.

    1999-01-01

    In this paper, we provide a design description of the X-33 avionics architecture. The X-33 is an autonomous Single Stage to Orbit (SSTO) launch vehicle currently being developed by Lockheed Martin for NASA as a technology demonstrator for the VentureStar Reusable Launch Vehicle (RLV). The X-33 avionics provides autonomous control of the vehicle throughout takeoff, ascent, descent, approach, landing, rollout, and vehicle safing. During flight the avionics provides communication to the range through uplinked commands and downlinked telemetry. During pre-launch and post-safing activities, the avionics provides interfaces to ground support consoles that perform vehicle flight preparations and maintenance. The X-33 avionics is a hybrid of centralized and distributed processing elements connected by three dual redundant Mil-Std 1553 data buses. These data buses are controlled by a central processing suite located in the avionics bay and composed of triplex redundant Vehicle Mission Computers (VMCs). The VMCs integrate mission management, guidance, navigation, flight control, subsystem control and redundancy management functions. The vehicle sensors, effectors and subsystems are interfaced directly to the centralized VMCs as remote terminals or through dual redundant Data Interface Units (DIUs). The DIUs are located forward and aft of the avionics bay and provide signal conditioning, health monitoring, low level subsystem control and data interface functions. Each VMC is connected to all three redundant 1553 data buses for monitoring and provides a complete identical data set to the processing algorithms. This enables bus faults to be detected and reconfigured through a voted bus control configuration. Data is also shared between VMCs through a cross channel data link that is implemented in hardware and controlled by AlliedSignal's Fault Tolerant Executive (FTE). The FTE synchronizes processors within the VMC and synchronizes redundant VMCs to each other. The FTE provides an output-voting plane to detect, isolate and contain faults due to internal hardware or software faults and reconfigures the VMCs to accommodate these faults. Critical data in the 1553 messages are scheduled and synchronized to specific processing frames in order to minimize data latency. In order to achieve an open architecture, military and commercial off-the-shelf equipment is incorporated using common processors, standard VME backplanes and chassis, the VxWorks operating system, and MatrixX for automatic code generation. The use of off-the-shelf tools and equipment helps reduce development time and enables software reuse. The open architecture allows for technology insertion, while the distributed modular elements allow for expansion to increased redundancy levels to meet the higher reliability goals of future RLVs.
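
    A minimal majority-vote sketch in the spirit of the triplex output voting described above (the actual Fault Tolerant Executive is AlliedSignal's hardware/software design; this fragment is purely illustrative):

    ```python
    # Majority voter over three redundant channel outputs, flagging the
    # disagreeing channel for fault isolation.
    from collections import Counter

    def vote3(a, b, c):
        """Return the majority value of three redundant channels, plus
        the index of a disagreeing channel (None if all agree)."""
        counts = Counter([a, b, c])
        value, n = counts.most_common(1)[0]
        if n == 3:
            return value, None
        if n == 2:
            bad = [i for i, v in enumerate((a, b, c)) if v != value][0]
            return value, bad            # fault isolated to one channel
        raise RuntimeError("three-way disagreement: no majority")

    print(vote3(42, 42, 41))   # -> (42, 2): channel 2 flagged
    ```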

  12. Use of Soft Computing Technologies for a Qualitative and Reliable Engine Control System for Propulsion Systems

    NASA Technical Reports Server (NTRS)

    Trevino, Luis; Brown, Terry; Crumbley, R. T. (Technical Monitor)

    2001-01-01

    The problem to be addressed in this paper is to explore how the use of Soft Computing Technologies (SCT) could be employed to improve overall vehicle system safety, reliability, and rocket engine performance by development of a qualitative and reliable engine control system (QRECS). Specifically, this will be addressed by enhancing rocket engine control using SCT, innovative data mining tools, and sound software engineering practices used in Marshall's Flight Software Group (FSG). The principal goals for addressing the issue of quality are to improve software management, software development time, software maintenance, processor execution, fault tolerance and mitigation, and nonlinear control in power level transitions. The intent is not to discuss any shortcomings of existing engine control methodologies, but to provide alternative design choices for control, implementation, performance, and sustaining engineering, all relative to addressing the issue of reliability. The approaches outlined in this paper will require knowledge in the fields of rocket engine propulsion (system level), software engineering for embedded flight software systems, and soft computing technologies (i.e., neural networks, fuzzy logic, data mining, and Bayesian belief networks), some of which are briefed in this paper. For this effort, the targeted demonstration rocket engine testbed is the MC-1 engine (formerly FASTRAC), which is simulated with hardware and software in the Marshall Avionics & Software Testbed (MAST) laboratory that currently resides at NASA's Marshall Space Flight Center, building 4476, and is managed by the Avionics Department. A brief plan of action for designing, developing, implementing, and testing a Phase One effort for QRECS is given, along with expected results. Phase One will focus on development of a Smart Start Engine Module and a Mainstage Engine Module for proper engine start and mainstage engine operations. The overall intent is to demonstrate that by employing soft computing technologies, the quality and reliability of the overall approach to engine controller development is further improved and vehicle safety is further ensured. The final product that this paper proposes is an approach to the development of an alternative low cost engine controller capable of performing in unique vision spacecraft vehicles requiring low cost advanced avionics architectures for autonomous operations from engine pre-start to engine shutdown.

  13. Fault Diagnosis from Raw Sensor Data Using Deep Neural Networks Considering Temporal Coherence.

    PubMed

    Zhang, Ran; Peng, Zhen; Wu, Lifeng; Yao, Beibei; Guan, Yong

    2017-03-09

    Intelligent condition monitoring and fault diagnosis by analyzing sensor data can assure the safety of machinery. Conventional fault diagnosis and classification methods usually apply pretreatments to decrease noise and extract time-domain or frequency-domain features from raw time series sensor data, and then use classifiers to make the diagnosis. However, these conventional approaches depend on expert feature selection and do not consider the temporal coherence of time series data. This paper proposes a fault diagnosis model based on Deep Neural Networks (DNN). The model can directly recognize raw time series sensor data without feature selection or separate signal processing, and it takes advantage of the temporal coherence of the data. First, raw time series training data collected by sensors are used to train the DNN until its cost function reaches a minimum; second, test data are used to measure the classification accuracy of the DNN on local time series data; finally, fault diagnosis considering temporal coherence with preceding time series data is implemented. Experimental results show that the classification accuracy for bearing faults can reach 100%. The proposed fault diagnosis approach is effective in recognizing the type of bearing faults.
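
    A sketch of the raw-input idea, using a small dense network as a stand-in for the paper's DNN and synthetic signals in place of bearing data:

    ```python
    # Classify raw time-series windows with no hand-crafted features;
    # the "bearing" signals below are synthetic, for demonstration only.
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(1)
    t = np.linspace(0, 1, 256)
    healthy = [np.sin(2*np.pi*30*t) + 0.1*rng.standard_normal(256)
               for _ in range(60)]
    faulted = [np.sin(2*np.pi*30*t) * (1 + 0.5*np.sign(np.sin(2*np.pi*5*t)))
               + 0.1*rng.standard_normal(256) for _ in range(60)]

    X = np.array(healthy + faulted)     # raw samples, no feature extraction
    y = np.array([0]*60 + [1]*60)
    clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500,
                        random_state=0).fit(X, y)
    print("training accuracy:", clf.score(X, y))
    ```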

  14. Fault Diagnosis from Raw Sensor Data Using Deep Neural Networks Considering Temporal Coherence

    PubMed Central

    Zhang, Ran; Peng, Zhen; Wu, Lifeng; Yao, Beibei; Guan, Yong

    2017-01-01

    Intelligent condition monitoring and fault diagnosis by analyzing sensor data can assure the safety of machinery. Conventional fault diagnosis and classification methods usually apply pretreatments to decrease noise and extract time-domain or frequency-domain features from raw time series sensor data, and then use classifiers to make the diagnosis. However, these conventional approaches depend on expert feature selection and do not consider the temporal coherence of time series data. This paper proposes a fault diagnosis model based on Deep Neural Networks (DNN). The model can directly recognize raw time series sensor data without feature selection or separate signal processing, and it takes advantage of the temporal coherence of the data. First, raw time series training data collected by sensors are used to train the DNN until its cost function reaches a minimum; second, test data are used to measure the classification accuracy of the DNN on local time series data; finally, fault diagnosis considering temporal coherence with preceding time series data is implemented. Experimental results show that the classification accuracy for bearing faults can reach 100%. The proposed fault diagnosis approach is effective in recognizing the type of bearing faults. PMID:28282936

  15. Advanced information processing system: Local system services

    NASA Technical Reports Server (NTRS)

    Burkhardt, Laura; Alger, Linda; Whittredge, Roy; Stasiowski, Peter

    1989-01-01

    The Advanced Information Processing System (AIPS) is a multi-computer architecture composed of hardware and software building blocks that can be configured to meet a broad range of application requirements. The hardware building blocks are fault-tolerant, general-purpose computers, fault-and damage-tolerant networks (both computer and input/output), and interfaces between the networks and the computers. The software building blocks are the major software functions: local system services, input/output, system services, inter-computer system services, and the system manager. The foundation of the local system services is an operating system with the functions required for a traditional real-time multi-tasking computer, such as task scheduling, inter-task communication, memory management, interrupt handling, and time maintenance. Resting on this foundation are the redundancy management functions necessary in a redundant computer and the status reporting functions required for an operator interface. The functional requirements, functional design and detailed specifications for all the local system services are documented.

  16. A second generation experiment in fault-tolerant software

    NASA Technical Reports Server (NTRS)

    Knight, J. C.

    1986-01-01

    Information was collected on the efficacy of fault-tolerant software by conducting two large-scale controlled experiments. In the first, an empirical study of multi-version software (MVS) was conducted. The second experiment is an empirical evaluation of self-testing as a method of error detection (STED). The purpose of the MVS experiment was to obtain empirical measurements of the performance of multi-version systems. Twenty versions of a program were prepared at four different sites under reasonably realistic development conditions from the same specifications. The purpose of the STED experiment was to obtain empirical measurements of the performance of assertions in error detection. Eight versions of a program were modified to include assertions at two different sites under controlled conditions. The overall structure of the testing environment for the MVS experiment and its status are described. Work to date in the STED experiment is also presented.
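
    Executable assertions of the kind evaluated in the STED experiment can be illustrated as follows; the checked routine and the specific assertions are invented for this sketch:

    ```python
    # Run-time (executable) assertions embedded in a routine flag
    # erroneous results as they occur.
    from collections import Counter

    untrusted_sort = sorted          # stand-in for the routine under test

    def checked_sort(xs):
        out = untrusted_sort(xs)
        # assertion 1: result is ordered
        assert all(a <= b for a, b in zip(out, out[1:])), "order violated"
        # assertion 2: result is a permutation of the input
        assert Counter(out) == Counter(xs), "elements changed"
        return out

    print(checked_sort([3, 1, 2]))
    ```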

  17. An Integrated Crustal Dynamics Simulator

    NASA Astrophysics Data System (ADS)

    Xing, H. L.; Mora, P.

    2007-12-01

    Numerical modelling offers an outstanding opportunity to gain an understanding of crustal dynamics and complex crustal system behaviour. This presentation describes our long-term, ongoing effort in finite element based computational model and software development to simulate interacting fault systems for earthquake forecasting. An R-minimum strategy based finite-element computational model and software tool, PANDAS, for modelling 3-dimensional nonlinear frictional contact behaviour between multiple deformable bodies with an arbitrarily-shaped contact element strategy has been developed by the authors. It provides a virtual laboratory to simulate interacting fault systems including crustal boundary conditions and various nonlinearities (e.g. from frictional contact, materials, geometry and thermal coupling). It has been successfully applied to large scale computing of complex nonlinear phenomena in non-continuum media involving nonlinear frictional instability, multiple material properties and complex geometries on supercomputers, such as the South Australia (SA) interacting fault system, the Southern California fault model and the Sumatra subduction model. It has also been extended to simulate the hot fractured rock (HFR) geothermal reservoir system, in collaboration with Geodynamics Ltd, which is constructing the first geothermal reservoir system in Australia, and to model tsunami generation induced by earthquakes. Both are supported by the Australian Research Council.

  18. Three-Dimensional Geologic Map of the Hayward Fault Zone, San Francisco Bay Region, California

    USGS Publications Warehouse

    Phelps, G.A.; Graymer, R.W.; Jachens, R.C.; Ponce, D.A.; Simpson, R.W.; Wentworth, C.M.

    2008-01-01

    A three-dimensional (3D) geologic map of the Hayward Fault zone was created by integrating the results from geologic mapping, potential field geophysics, and seismology investigations. The map volume is 100 km long, 20 km wide, and extends to a depth of 12 km below sea level. The map volume is oriented northwest and is approximately bisected by the Hayward Fault. The complex geologic structure of the region makes it difficult to trace many geologic units into the subsurface. Therefore, the map units are generalized from 1:24,000-scale geologic maps. Descriptions of geologic units and structures are offered, along with a discussion of the methods used to map them and incorporate them into the 3D geologic map. The map spatial database and associated viewing software are provided. Elements of the map, such as individual fault surfaces, are also provided in a non-proprietary format so that the user can access the map via open-source software. The sheet accompanying this manuscript shows views taken from the 3D geologic map for the user to access. The 3D geologic map is designed as a multi-purpose resource for further geologic investigations and process modeling.

  19. A hierarchical distributed control model for coordinating intelligent systems

    NASA Technical Reports Server (NTRS)

    Adler, Richard M.

    1991-01-01

    A hierarchical distributed control (HDC) model for coordinating cooperative problem-solving among intelligent systems is described. The model was implemented using SOCIAL, an innovative object-oriented tool for integrating heterogeneous, distributed software systems. SOCIAL embeds applications in 'wrapper' objects called Agents, which supply predefined capabilities for distributed communication, control, data specification, and translation. The HDC model is realized in SOCIAL as a 'Manager' Agent that coordinates interactions among application Agents. The HDC Manager indexes the capabilities of application Agents, routes request messages to suitable server Agents, and stores results in a commonly accessible 'Bulletin-Board'. This centralized control model is illustrated in a fault diagnosis application for launch operations support of the Space Shuttle fleet at NASA Kennedy Space Center.
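
    A toy rendering of the Manager pattern described above; SOCIAL's actual API is not reproduced here, so all names below are illustrative:

    ```python
    # Capability index routes requests to server agents; results land on
    # a shared bulletin board, as in the HDC Manager description.
    class Manager:
        def __init__(self):
            self.capabilities = {}     # capability name -> server agent
            self.bulletin_board = {}   # request id -> result

        def register(self, capability, agent):
            self.capabilities[capability] = agent

        def request(self, req_id, capability, payload):
            agent = self.capabilities[capability]          # route
            self.bulletin_board[req_id] = agent(payload)   # post result
            return self.bulletin_board[req_id]

    mgr = Manager()
    mgr.register("diagnose", lambda data: f"fault in {max(data, key=data.get)}")
    print(mgr.request(1, "diagnose", {"valve": 0.9, "pump": 0.2}))
    ```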

  20. A highly reliable, high performance open avionics architecture for real time Nap-of-the-Earth operations

    NASA Technical Reports Server (NTRS)

    Harper, Richard E.; Elks, Carl

    1995-01-01

    An Army Fault Tolerant Architecture (AFTA) has been developed to meet real-time fault tolerant processing requirements of future Army applications. AFTA is the enabling technology that will allow the Army to configure existing processors and other hardware to provide high throughput and ultrahigh reliability necessary for TF/TA/NOE flight control and other advanced Army applications. A comprehensive conceptual study of AFTA has been completed that addresses a wide range of issues including requirements, architecture, hardware, software, testability, producibility, analytical models, validation and verification, common mode faults, VHDL, and a fault tolerant data bus. A Brassboard AFTA for demonstration and validation has been fabricated, and two operating systems and a flight-critical Army application have been ported to it. Detailed performance measurements have been made of fault tolerance and operating system overheads while AFTA was executing the flight application in the presence of faults.

  1. Novel elastic protection against DDF failures in an enhanced software-defined SIEPON

    NASA Astrophysics Data System (ADS)

    Pakpahan, Andrew Fernando; Hwang, I.-Shyan; Yu, Yu-Ming; Hsu, Wu-Hsiao; Liem, Andrew Tanny; Nikoukar, AliAkbar

    2017-07-01

    Ever-increasing bandwidth demands on passive optical networks (PONs) are pushing the utilization of every fiber strand to its limit, which mandates comprehensive protection up to the end of the distribution drop fiber (DDF). Hence, it is important to provide refined protection with an advanced fault-protection architecture and recovery mechanism able to cope with various DDF failures. We propose a novel elastic protection against DDF failures that incorporates a software-defined networking (SDN) capability and a bus protection line to enhance the resiliency of the existing Service Interoperability in Ethernet Passive Optical Networks (SIEPON) system. We propose the addition of an integrated SDN controller and flow tables to the optical line terminal and optical network units (ONUs) in order to deliver various DDF protection scenarios. The proposed architecture enables flexible assignment of backup ONU(s) in pre/post-fault conditions depending on the PON traffic load. A transient backup ONU and multiple backup ONUs can be deployed in the pre-fault and post-fault scenarios, respectively. Our simulation results show that the proposed architecture provides better overall throughput and drop probability than an architecture with a fixed DDF protection mechanism, while still maintaining overall QoS performance in terms of packet delay, mean jitter, packet loss, and throughput under various fault conditions.

  2. An Analytical Model for Assessing Stability of Pre-Existing Faults in Caprock Caused by Fluid Injection and Extraction in a Reservoir

    NASA Astrophysics Data System (ADS)

    Wang, Lei; Bai, Bing; Li, Xiaochun; Liu, Mingze; Wu, Haiqing; Hu, Shaobin

    2016-07-01

    Induced seismicity and fault reactivation associated with fluid injection and depletion have been reported in hydrocarbon, geothermal, and waste fluid injection fields worldwide. Here, we establish an analytical model to assess fault reactivation surrounding a reservoir during fluid injection and extraction that considers the stress concentrations at the fault tips and the effects of fault length. In this model, induced stress analysis in a full-space under the plane strain condition is implemented based on Eshelby's theory of inclusions in terms of a homogeneous, isotropic, and poroelastic medium. The stress intensity factor concept in linear elastic fracture mechanics is adopted as an instability criterion for pre-existing faults in surrounding rocks. To characterize the fault reactivation caused by fluid injection and extraction, we define a new index, the "fault reactivation factor" η, which can be interpreted as an index of fault stability in response to fluid pressure changes per unit within a reservoir resulting from injection or extraction. The critical fluid pressure change within a reservoir is also determined by the superposition principle using the in situ stress surrounding a fault. Our parameter sensitivity analyses show that the fault reactivation tendency is strongly sensitive to fault location, fault length, fault dip angle, and Poisson's ratio of the surrounding rock. Our case study demonstrates that the proposed model focuses on the mechanical behavior of the whole fault, unlike conventional methodologies. The proposed method can be applied to engineering cases related to injection and depletion within a reservoir owing to its efficient computational implementation.
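
    The instability criterion follows linear elastic fracture mechanics. As a standard textbook illustration (not the paper's full expressions, which also account for the induced poroelastic stresses), a pre-existing fault of half-length a subjected to a uniform shear stress change Δτ has the mode II stress intensity factor

    ```latex
    % Textbook LEFM relation of the kind used as the instability criterion
    % (illustrative; the paper's expressions include the induced stresses).
    K_{\mathrm{II}} = \Delta\tau \,\sqrt{\pi a},
    \qquad \text{reactivation when } K_{\mathrm{II}} \ge K_{\mathrm{IIc}} .
    ```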

  3. FINDS: A fault inferring nonlinear detection system programmers manual, version 3.0

    NASA Technical Reports Server (NTRS)

    Lancraft, R. E.

    1985-01-01

    Detailed software documentation of the digital computer program FINDS (Fault Inferring Nonlinear Detection System) Version 3.0 is provided. FINDS is a highly modular and extensible computer program designed to monitor and detect sensor failures, while at the same time providing reliable state estimates. In this version of the program the FINDS methodology is used to detect, isolate, and compensate for failures in simulated avionics sensors used by the Advanced Transport Operating Systems (ATOPS) Transport System Research Vehicle (TSRV) in a Microwave Landing System (MLS) environment. It is intended that this report serve as a programmers guide to aid in the maintenance, modification, and revision of the FINDS software.

  4. Adopting software quality measures for healthcare processes.

    PubMed

    Yildiz, Ozkan; Demirörs, Onur

    2009-01-01

    In this study, we investigated the adoptability of software quality measures for healthcare process measurement. Quality measures of ISO/IEC 9126 are redefined from a process perspective to build a generic healthcare process quality measurement model. The case study research method is used, and the model is applied to a public hospital's Entry to Care process. After the application, weak and strong aspects of the process can be easily observed. Access auditability, fault removal, completeness of documentation, and machine utilization are weak aspects, and these aspects are candidates for process improvement. On the other hand, functional completeness, fault ratio, input validity checking, response time, and throughput time are the strong aspects of the process.

  5. Agent Based Fault Tolerance for the Mobile Environment

    NASA Astrophysics Data System (ADS)

    Park, Taesoon

    This paper presents a fault-tolerance scheme based on mobile agents for reliable mobile computing systems. The mobility of the agent makes it suitable for tracing mobile hosts, and the intelligence of the agent makes it efficient at supporting fault tolerance services. This paper presents two approaches to implementing the mobile agent based fault-tolerant service; their performance is evaluated and compared with other fault-tolerant schemes.

  6. A domain decomposition approach to implementing fault slip in finite-element models of quasi-static and dynamic crustal deformation

    USGS Publications Warehouse

    Aagaard, Brad T.; Knepley, M.G.; Williams, C.A.

    2013-01-01

    We employ a domain decomposition approach with Lagrange multipliers to implement fault slip in a finite-element code, PyLith, for use in both quasi-static and dynamic crustal deformation applications. This integrated approach to solving both quasi-static and dynamic simulations leverages common finite-element data structures and implementations of various boundary conditions, discretization schemes, and bulk and fault rheologies. We have developed a custom preconditioner for the Lagrange multiplier portion of the system of equations that provides excellent scalability with problem size compared to conventional additive Schwarz methods. We demonstrate application of this approach using benchmarks for both quasi-static viscoelastic deformation and dynamic spontaneous rupture propagation that verify the numerical implementation in PyLith.
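
    Schematically, the Lagrange-multiplier treatment of fault slip yields a saddle-point system of the form below (notation illustrative, not PyLith's exact assembly): u collects the displacement unknowns, λ the fault-surface Lagrange multipliers (tractions), and d the prescribed slip; the custom preconditioner mentioned above targets the Lagrange-multiplier block.

    ```latex
    % Schematic saddle-point system from the domain-decomposition treatment
    % of fault slip; K is the elasticity stiffness matrix and L the coupling
    % operator that measures the displacement jump across the fault.
    \begin{pmatrix} K & L^{T} \\ L & 0 \end{pmatrix}
    \begin{pmatrix} u \\ \lambda \end{pmatrix}
    =
    \begin{pmatrix} f \\ d \end{pmatrix}
    ```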

  7. Digital Database of Recently Active Traces of the Hayward Fault, California

    USGS Publications Warehouse

    Lienkaemper, James J.

    2006-01-01

    The purpose of this map is to show the location of and evidence for recent movement on active fault traces within the Hayward Fault Zone, California. The mapped traces represent the integration of the following three different types of data: (1) geomorphic expression, (2) creep (aseismic fault slip), and (3) trench exposures. This publication is a major revision of an earlier map (Lienkaemper, 1992), which both brings the evidence for faulting up to date and makes it available both as a digital database for use within a geographic information system (GIS) and for broader public access interactively using widely available viewing software. The pamphlet describes in detail the types of scientific observations used to make the map, gives references pertaining to the fault and the evidence of faulting, and provides guidance for use of and limitations of the map. [Last revised Nov. 2008, a minor update for 2007 LiDAR and recent trench investigations; see version history below.]

  8. Real-time fault diagnosis for propulsion systems

    NASA Technical Reports Server (NTRS)

    Merrill, Walter C.; Guo, Ten-Huei; Delaat, John C.; Duyar, Ahmet

    1991-01-01

    Current research toward real time fault diagnosis for propulsion systems at NASA-Lewis is described. The research is being applied to both air breathing and rocket propulsion systems. Topics include fault detection methods including neural networks, system modeling, and real time implementations.

  9. Semi-automatic mapping of fault rocks on a Digital Outcrop Model, Gole Larghe Fault Zone (Southern Alps, Italy)

    NASA Astrophysics Data System (ADS)

    Mittempergher, Silvia; Vho, Alice; Bistacchi, Andrea

    2016-04-01

    A quantitative analysis of fault-rock distribution in outcrops of exhumed fault zones is of fundamental importance for studies of fault zone architecture, fault and earthquake mechanics, and fluid circulation. We present a semi-automatic workflow for fault-rock mapping on a Digital Outcrop Model (DOM), developed on the Gole Larghe Fault Zone (GLFZ), a well exposed strike-slip fault in the Adamello batholith (Italian Southern Alps). The GLFZ has been exhumed from ca. 8-10 km depth and consists of hundreds of individual seismogenic slip surfaces lined by green cataclasites (crushed wall rocks cemented by hydrothermal epidote and K-feldspar) and black pseudotachylytes (solidified frictional melts, considered a marker for seismic slip). A digital model of selected outcrop exposures was reconstructed with photogrammetric techniques, using a large number of high resolution digital photographs processed with the VisualSFM software. The resulting DOM has a resolution up to 0.2 mm/pixel. Most of the outcrop was imaged with photographs each covering a 1 x 1 m2 area, while selected structural features, such as sidewall ripouts or stepovers, were covered with higher-resolution images covering 30 x 40 cm2 areas. Image processing algorithms were preliminarily tested using the ImageJ-Fiji package, then a workflow in Matlab was developed to process a large collection of images sequentially. Particularly in the detailed 30 x 40 cm images, cataclasites and hydrothermal veins were successfully identified using spectral analysis in the RGB and HSV color spaces. This allows mapping the network of cataclasites and veins which provided the pathway for hydrothermal fluid circulation, and also the volume of mineralization, since we are able to measure the thickness of cataclasites and veins on the outcrop surface. The spectral signature of pseudotachylyte veins is indistinguishable from that of biotite grains in the wall rock (tonalite), so we tested morphological analysis tools to discriminate them from biotite. In the higher resolution images this could be done using circularity and size thresholds; however, it could not easily be automated, since the thresholds must be varied by the interpreter for almost every image. In the 1 x 1 m images the resolution is generally too low to distinguish cataclasite from pseudotachylyte, so in most cases the fault rocks were treated together. For this analysis we developed a fully automated workflow that, after applying noise correction, classification and skeletonization algorithms, returns labeled edge images of fault segments together with vector polylines associated with edge properties. Vector and edge properties are a useful format for further quantitative analysis, for instance for classifying fault segments based on structural criteria, detecting continuous fault traces, and determining the kind of termination of faults/fractures. This approach makes it possible to collect statistically relevant datasets useful for further quantitative structural analysis.
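
    The two image-analysis steps described, HSV thresholding for the green cataclasite/vein pixels and a circularity filter to separate elongated veins from rounded biotite grains, can be sketched with scikit-image; all thresholds below are illustrative, not the study's calibrated values:

    ```python
    # HSV threshold + circularity (4*pi*A / P**2) filter, as a sketch of
    # the two classification steps described in the abstract.
    import numpy as np
    from skimage.color import rgb2hsv
    from skimage.measure import label, regionprops

    def green_mask(rgb):
        """Flag greenish (cataclasite/epidote-like) pixels; thresholds
        are illustrative."""
        hsv = rgb2hsv(rgb)
        return (hsv[..., 0] > 0.2) & (hsv[..., 0] < 0.45) & (hsv[..., 1] > 0.3)

    def elongated_regions(mask, min_area=50, circ_max=0.6):
        """Keep regions that are elongated (vein-like), dropping rounded
        grains via the circularity measure."""
        keep = np.zeros_like(mask)
        for r in regionprops(label(mask)):
            circ = 4 * np.pi * r.area / (r.perimeter ** 2 + 1e-9)
            if r.area >= min_area and circ < circ_max:
                keep[tuple(np.array(r.coords).T)] = True
        return keep

    img = np.random.rand(64, 64, 3)     # stand-in for a DOM image tile
    print(green_mask(img).sum(), "green pixels flagged")
    ```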

  10. The Orion GN and C Data-Driven Flight Software Architecture for Automated Sequencing and Fault Recovery

    NASA Technical Reports Server (NTRS)

    King, Ellis; Hart, Jeremy; Odegard, Ryan

    2010-01-01

    The Orion Crew Exploration Vehicle (CEV) is being designed to include significantly more automation capability than either the Space Shuttle or the International Space Station (ISS). In particular, the vehicle flight software has requirements to accommodate increasingly automated missions throughout all phases of flight. A data-driven flight software architecture will provide an evolvable automation capability to sequence through Guidance, Navigation & Control (GN&C) flight software modes and configurations while maintaining the required flexibility and human control over the automation. This flexibility is a key aspect needed to address the maturation of operational concepts, to permit ground and crew operators to gain trust in the system, and to mitigate unpredictability in human spaceflight. To allow for mission flexibility and reconfigurability, a data-driven approach is being taken to load the mission event plan as well as the flight software artifacts associated with the GN&C subsystem. A database of GN&C-level sequencing data is presented which manages and tracks the mission-specific and algorithm parameters to provide a capability to schedule GN&C events within mission segments. The flight software data schema for performing automated mission sequencing is presented with a concept of operations for interactions with ground and onboard crew members. A prototype architecture for fault identification, isolation and recovery interactions with the automation software is presented and discussed as a forward work item.
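
    Data-driven sequencing in miniature: in the sketch below the mission event table is pure data and the engine simply walks it. The events and GN&C modes are invented for illustration:

    ```python
    # Table-driven mode sequencer: changing the mission means changing
    # the data, not the engine. Entries are hypothetical.
    EVENT_TABLE = [
        {"event": "launch",  "gnc_mode": "ascent_guidance"},
        {"event": "meco",    "gnc_mode": "coast_nav"},
        {"event": "deorbit", "gnc_mode": "entry_guidance"},
    ]

    def sequencer(event_stream, table=EVENT_TABLE):
        index = {row["event"]: row["gnc_mode"] for row in table}
        mode = "prelaunch"
        for ev in event_stream:
            mode = index.get(ev, mode)   # unknown events leave mode alone
            yield ev, mode

    for ev, mode in sequencer(["launch", "meco", "crew_override", "deorbit"]):
        print(f"{ev:14s} -> {mode}")
    ```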

  11. Thermal Expert System (TEXSYS): Systems autonomy demonstration project, volume 2. Results

    NASA Technical Reports Server (NTRS)

    Glass, B. J. (Editor)

    1992-01-01

    The Systems Autonomy Demonstration Project (SADP) produced a knowledge-based real-time control system for control and fault detection, isolation, and recovery (FDIR) of a prototype two-phase Space Station Freedom external active thermal control system (EATCS). The Thermal Expert System (TEXSYS) was demonstrated in recent tests to be capable of reliable fault anticipation and detection, as well as ordinary control of the thermal bus. Performance requirements were addressed by adopting a hierarchical symbolic control approach-layering model-based expert system software on a conventional, numerical data acquisition and control system. The model-based reasoning capabilities of TEXSYS were shown to be advantageous over typical rule-based expert systems, particularly for detection of unforeseen faults and sensor failures. Volume 1 gives a project overview and testing highlights. Volume 2 provides detail on the EATCS testbed, test operations, and online test results. Appendix A is a test archive, while Appendix B is a compendium of design and user manuals for the TEXSYS software.

  12. Thermal Expert System (TEXSYS): Systems autonomy demonstration project, volume 1. Overview

    NASA Technical Reports Server (NTRS)

    Glass, B. J. (Editor)

    1992-01-01

    The Systems Autonomy Demonstration Project (SADP) produced a knowledge-based real-time control system for control and fault detection, isolation, and recovery (FDIR) of a prototype two-phase Space Station Freedom external active thermal control system (EATCS). The Thermal Expert System (TEXSYS) was demonstrated in recent tests to be capable of reliable fault anticipation and detection, as well as ordinary control of the thermal bus. Performance requirements were addressed by adopting a hierarchical symbolic control approach-layering model-based expert system software on a conventional, numerical data acquisition and control system. The model-based reasoning capabilities of TEXSYS were shown to be advantageous over typical rule-based expert systems, particularly for detection of unforeseen faults and sensor failures. Volume 1 gives a project overview and testing highlights. Volume 2 provides detail on the EATCS test bed, test operations, and online test results. Appendix A is a test archive, while Appendix B is a compendium of design and user manuals for the TEXSYS software.

  13. Development and realization of the open fault diagnosis system based on XPE

    NASA Astrophysics Data System (ADS)

    Deng, Hui; Wang, TaiYong; He, HuiLong; Xu, YongGang; Zeng, JuXiang

    2005-12-01

    To keep complex mechanical equipment in good service, the technology for realizing an embedded open system is introduced systematically, including the open hardware configuration, a customized embedded operating system, and an open software structure. The ETX technology is adopted in this system, integrating the CPU main-board functions and achieving quick, real-time signal acquisition and intelligent data analysis by applying a DSP- and CPLD-based data acquisition card. Under the open configuration, signal bus modes such as PCI, ISA and PC/104 can be selected, and the styles of the signals can be chosen as well. In addition, by customizing the XPE system and adopting the EWF (Enhanced Write Filter), an authentically open system is realized and the stability of the system is enhanced. Multi-thread and multi-task programming techniques are adopted in the software. By interconnecting with a remote fault diagnosis center via the network interface, cooperative diagnosis is conducted and the intelligence of the fault diagnosis is improved.

  14. Thermal Expert System (TEXSYS): Systems autonomy demonstration project, volume 2. Results

    NASA Astrophysics Data System (ADS)

    Glass, B. J.

    1992-10-01

    The Systems Autonomy Demonstration Project (SADP) produced a knowledge-based real-time control system for control and fault detection, isolation, and recovery (FDIR) of a prototype two-phase Space Station Freedom external active thermal control system (EATCS). The Thermal Expert System (TEXSYS) was demonstrated in recent tests to be capable of reliable fault anticipation and detection, as well as ordinary control of the thermal bus. Performance requirements were addressed by adopting a hierarchical symbolic control approach-layering model-based expert system software on a conventional, numerical data acquisition and control system. The model-based reasoning capabilities of TEXSYS were shown to be advantageous over typical rule-based expert systems, particularly for detection of unforeseen faults and sensor failures. Volume 1 gives a project overview and testing highlights. Volume 2 provides detail on the EATCS testbed, test operations, and online test results. Appendix A is a test archive, while Appendix B is a compendium of design and user manuals for the TEXSYS software.

  15. Quantitative Measures for Software Independent Verification and Validation

    NASA Technical Reports Server (NTRS)

    Lee, Alice

    1996-01-01

    As software is maintained or reused, it undergoes an evolution which tends to increase the overall complexity of the code. To understand the effects of this, we brought in statistics experts and leading researchers in software complexity, reliability, and their interrelationships. Their project has resulted in our ability to statistically correlate specific code complexity attributes, in orthogonal domains, to errors found over time in the HAL/S flight software which flies in the Space Shuttle. Although only a prototype-tools experiment, the result of this research appears to be extendable to all other NASA software, given appropriate data similar to that logged for the Shuttle onboard software. Our research has demonstrated that a more complete domain coverage can be mathematically demonstrated with the approach we have applied, thereby ensuring full insight into the cause-and-effect relationship between the complexity of a software system and the fault density of that system. By applying the operational profile we can characterize the dynamic effects of software path complexity under this same approach. We now have the ability to measure specific attributes which have been statistically demonstrated to correlate to increased error probability, and to know which actions to take, for each complexity domain. Shuttle software verifiers can now monitor the changes in software complexity, assess the added or decreased risk of software faults in modified code, and determine necessary corrections. The reports, tool documentation, user's guides, and new approach that have resulted from this research effort represent advances in the state of the art of software quality and reliability assurance. Details describing how to apply this technique to other NASA code are contained in this document.

  16. Software reliability models for critical applications

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Pham, H.; Pham, M.

    This report presents the results of the first phase of the ongoing EG&G Idaho, Inc. Software Reliability Research Program. The program is studying the existing software reliability models and proposes a state-of-the-art software reliability model that is relevant to the nuclear reactor control environment. This report consists of three parts: (1) summaries of the literature review of existing software reliability and fault tolerant software reliability models and their related issues, (2) a proposed technique for software reliability enhancement, and (3) general discussion and future research. The development of this proposed state-of-the-art software reliability model will be performed in the second phase. 407 refs., 4 figs., 2 tabs.

  17. Software reliability models for critical applications

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Pham, H.; Pham, M.

    This report presents the results of the first phase of the ongoing EG&G Idaho, Inc. Software Reliability Research Program. The program is studying the existing software reliability models and proposes a state-of-the-art software reliability model that is relevant to the nuclear reactor control environment. This report consists of three parts: (1) summaries of the literature review of existing software reliability and fault tolerant software reliability models and their related issues, (2) a proposed technique for software reliability enhancement, and (3) general discussion and future research. The development of this proposed state-of-the-art software reliability model will be performed in the second phase. 407 refs., 4 figs., 2 tabs.

  18. FTMP (Fault Tolerant Multiprocessor) programmer's manual

    NASA Technical Reports Server (NTRS)

    Feather, F. E.; Liceaga, C. A.; Padilla, P. A.

    1986-01-01

    The Fault Tolerant Multiprocessor (FTMP) computer system was constructed using the Rockwell/Collins CAPS-6 processor. It is installed in the Avionics Integration Research Laboratory (AIRLAB) of NASA Langley Research Center. It is hosted by AIRLAB's System 10, a VAX 11/750, for the loading of programs and experimentation. The FTMP support software includes a cross compiler for a high level language called the Automated Engineering Design (AED) System, an assembler for the CAPS-6 processor assembly language, and a linker. Access to this support software is through an automated remote access facility on the VAX which relieves the user of the burden of learning how to use the IBM 4381. This manual is a compilation of information about the FTMP support environment. It explains the FTMP software and support environment along with many of the finer points of running programs on FTMP. This will be helpful to the researcher trying to run an experiment on FTMP and even to the person probing FTMP with fault injections. Much of the information in this manual can be found in other sources; we are only attempting to bring together the basic points in a single source. If the reader should need points clarified, there is a list of support documentation in the back of this manual.

  19. Universal fault-tolerant quantum computation with only transversal gates and error correction.

    PubMed

    Paetznick, Adam; Reichardt, Ben W

    2013-08-30

    Transversal implementations of encoded unitary gates are highly desirable for fault-tolerant quantum computation. Though transversal gates alone cannot be computationally universal, they can be combined with specially distilled resource states in order to achieve universality. We show that "triorthogonal" stabilizer codes, introduced for state distillation by Bravyi and Haah [Phys. Rev. A 86, 052329 (2012)], admit transversal implementation of the controlled-controlled-Z gate. We then construct a universal set of fault-tolerant gates without state distillation by using only transversal controlled-controlled-Z, transversal Hadamard, and fault-tolerant error correction. We also adapt the distillation procedure of Bravyi and Haah to Toffoli gates, improving on existing Toffoli distillation schemes.

  20. A note on adding viscoelasticity to earthquake simulators

    USGS Publications Warehouse

    Pollitz, Fred

    2017-01-01

    Here, I describe how time‐dependent quasi‐static stress transfer can be implemented in an earthquake simulator code that is used to generate long synthetic seismicity catalogs. Most existing seismicity simulators use precomputed static stress interaction coefficients to rapidly implement static stress transfer in fault networks with typically tens of thousands of fault patches. The extension to quasi‐static deformation, which accounts for viscoelasticity of Earth’s ductile lower crust and mantle, involves the precomputation of additional interaction coefficients that represent time‐dependent stress transfer among the model fault patches, combined with defining and evolving additional state variables that track this stress transfer. The new approach is illustrated with application to a California‐wide synthetic fault network.
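
    The extension can be sketched as a stress update that sums the usual static interaction term with a precomputed viscoelastic term that relaxes in over time. A single relaxation time and random kernels are assumed below purely for illustration; the actual coefficients come from a layered viscoelastic Earth model:

    ```python
    # Schematic time-dependent stress transfer among fault patches:
    # static kernel plus a viscoelastic kernel whose contribution grows
    # as 1 - exp(-t/tau). All numbers are invented for illustration.
    import numpy as np

    n, tau = 4, 50.0                       # patches, relaxation time
    K_static = 0.1 * np.random.default_rng(2).standard_normal((n, n))
    K_visco = 0.3 * K_static               # hypothetical viscoelastic kernel

    def stress_from_slip(slip_events, t_now):
        """Sum static transfer plus the time-dependent (relaxed) part."""
        s = np.zeros(n)
        for t_ev, dslip in slip_events:    # (event time, per-patch slip)
            relax = 1.0 - np.exp(-(t_now - t_ev) / tau)
            s += K_static @ dslip + relax * (K_visco @ dslip)
        return s

    events = [(0.0, np.array([1.0, 0, 0, 0]))]
    print(stress_from_slip(events, t_now=100.0))
    ```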

  1. Numerical simulation of the stress distribution in a coal mine caused by a normal fault

    NASA Astrophysics Data System (ADS)

    Zhang, Hongmei; Wu, Jiwen; Zhai, Xiaorong

    2017-06-01

    The Luling coal mine was studied using FLAC3D software to analyze the stress distribution characteristics on the two sides of a normal fault zone with two different working face models. The working faces were located, respectively, on the hanging wall and the foot wall, with both mining directions advancing toward the fault. The stress distributions differed across the fault. When the working face was located on the hanging wall, the stress became concentrated and the range influenced by the stress grew gradually larger. The fault zone had a negative effect on stress transmission: the fault evidently impeded stress transmission, and the stress concentrated in the fault zone and the hanging wall. In the second model, the stress on the two sides decreased at first, but then increased, continuing to transmit to the hanging wall; the concentrated stress in the fault zone decreased and the stress transmission was evident. These results could be used to minimize roadway damage and lengthen the time available for coal mining through careful design of the roadway and working face.

  2. Trade Studies of Space Launch Architectures using Modular Probabilistic Risk Analysis

    NASA Technical Reports Server (NTRS)

    Mathias, Donovan L.; Go, Susie

    2006-01-01

    A top-down risk assessment in the early phases of space exploration architecture development can provide understanding and intuition of the potential risks associated with new designs and technologies. In this approach, risk analysts draw from their past experience and the heritage of similar existing systems as a source for reliability data. This top-down approach captures the complex interactions of the risk driving parts of the integrated system without requiring detailed knowledge of the parts themselves, which is often unavailable in the early design stages. Traditional probabilistic risk analysis (PRA) technologies, however, suffer several drawbacks that limit their timely application to complex technology development programs. The most restrictive of these is a dependence on static planning scenarios, expressed through fault and event trees. Fault trees incorporating comprehensive mission scenarios are routinely constructed for complex space systems, and several commercial software products are available for evaluating fault statistics. These static representations cannot capture the dynamic behavior of system failures without substantial modification of the initial tree. Consequently, the development of dynamic models using fault tree analysis has been an active area of research in recent years. This paper discusses the implementation and demonstration of dynamic, modular scenario modeling for integration of subsystem fault evaluation modules using the Space Architecture Failure Evaluation (SAFE) tool. SAFE is a C++ code that was originally developed to support NASA's Space Launch Initiative. It provides a flexible framework for system architecture definition and trade studies. SAFE supports extensible modeling of dynamic, time-dependent risk drivers of the system and functions at the level of fidelity for which design and failure data exist. The approach is scalable, allowing inclusion of additional information as detailed data becomes available. The tool performs a Monte Carlo analysis to provide statistical estimates. Example results of an architecture system reliability study are summarized for an exploration system concept using heritage data from liquid-fueled expendable Saturn V/Apollo launch vehicles.
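
    Since the paragraph contrasts static fault trees with dynamic, Monte Carlo-based scenario evaluation, a toy example may help. The sketch below is Python rather than SAFE's C++, and the failure rates and two-stage scenario are invented; it only illustrates why per-trial simulation can capture time-dependent behavior that a static tree cannot.

      import random

      RATES = {"engine": 1e-4, "avionics": 5e-5, "separation": 2e-3}  # assumed values

      def trial():
          # Stage 1 burn: per-second failure chances, a time-dependent risk driver.
          for _ in range(120):
              if random.random() < RATES["engine"] or random.random() < RATES["avionics"]:
                  return False
          # Staging is a discrete event rather than a static fault-tree leaf.
          return random.random() >= RATES["separation"]

      n = 100_000
      successes = sum(trial() for _ in range(n))
      print(f"estimated mission reliability: {successes / n:.4f}")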

  3. Windows .NET Network Distributed Basic Local Alignment Search Toolkit (W.ND-BLAST)

    PubMed Central

    Dowd, Scot E; Zaragoza, Joaquin; Rodriguez, Javier R; Oliver, Melvin J; Payton, Paxton R

    2005-01-01

    Background BLAST is one of the most common and useful tools for genetic research. This paper describes a software application we have termed Windows .NET Distributed Basic Local Alignment Search Toolkit (W.ND-BLAST), which enhances the BLAST utility by improving usability, fault recovery, and scalability in a Windows desktop environment. Our goal was to develop an easy-to-use, fault-tolerant, high-throughput BLAST solution that incorporates a comprehensive BLAST result viewer with curation and annotation functionality. Results W.ND-BLAST is a comprehensive Windows-based software toolkit that targets researchers, including those with minimal computer skills, and provides the ability to increase the performance of BLAST by distributing BLAST queries to any number of Windows-based machines across local area networks (LAN). W.ND-BLAST provides intuitive Graphic User Interfaces (GUI) for BLAST database creation, BLAST execution, BLAST output evaluation and BLAST result exportation. This software also provides several layers of fault tolerance and fault recovery to prevent loss of data if nodes or master machines fail. This paper lays out the functionality of W.ND-BLAST. W.ND-BLAST displays close to 100% performance efficiency when distributing tasks to 12 remote computers of the same performance class. A high-throughput BLAST job which took 662.68 minutes (11 hours) on one average machine was completed in 44.97 minutes when distributed to 17 nodes, which included lower performance class machines. Finally, there are comprehensive high-throughput BLAST Output Viewer (BOV) and Annotation Engine components, which provide comprehensive exportation of BLAST hits to text files, annotated fasta files, tables, or association files. Conclusion W.ND-BLAST provides an interactive tool that allows scientists to easily utilize their available computing resources for high-throughput and comprehensive sequence analyses. The install package for W.ND-BLAST is freely downloadable from . With registration the software is free; installation, networking, and usage instructions are provided, as well as a support forum. PMID:15819992

  4. An Integrated Fault Tolerant Robotic Controller System for High Reliability and Safety

    NASA Technical Reports Server (NTRS)

    Marzwell, Neville I.; Tso, Kam S.; Hecht, Myron

    1994-01-01

    This paper describes the concepts and features of a fault-tolerant intelligent robotic control system being developed for applications that require high dependability (reliability, availability, and safety). The system consists of two major elements: a fault-tolerant controller and an operator workstation. The fault-tolerant controller uses a strategy which allows for detection and recovery of hardware, operating system, and application software failures. The fault-tolerant controller can be used by itself in a wide variety of applications in industry, process control, and communications. The controller in combination with the operator workstation can be applied to robotic applications such as spaceborne extravehicular activities, hazardous materials handling, inspection and maintenance of high value items (e.g., space vehicles, reactor internals, or aircraft), medicine, and other tasks where a robot system failure poses a significant risk to life or property.

  5. Reliability of Fault Tolerant Control Systems. Part 1

    NASA Technical Reports Server (NTRS)

    Wu, N. Eva

    2001-01-01

    This paper reports Part I of a two-part effort intended to delineate the relationship between reliability and fault-tolerant control in a quantitative manner. Reliability analysis of fault-tolerant control systems is performed using Markov models. Reliability properties peculiar to fault-tolerant control systems are emphasized. As a consequence, coverage of failures through redundancy management can be severely limited. It is shown that in the early life of a system composed of highly reliable subsystems, the reliability of the overall system is affine with respect to coverage, and inadequate coverage induces dominant single point failures. The utility of some existing software tools for assessing the reliability of fault-tolerant control systems is also discussed. Coverage modeling is attempted in Part II in a way that captures its dependence on the control performance and on the diagnostic resolution.
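
    The affine dependence on coverage is easy to see numerically. The snippet below is an illustration based on the standard duplex Markov model, not taken from the paper: it computes the reliability of two redundant units with failure rate lam when a first failure is detected and handled with probability c, and shows that early in life the unreliability is dominated by the (1 - c) single-point-failure term.

      import math

      def duplex_reliability(t, lam, c):
          """Markov model: both units up, or first failure covered and one unit up."""
          return math.exp(-2 * lam * t) + 2 * c * (math.exp(-lam * t) - math.exp(-2 * lam * t))

      lam, t = 1e-4, 10.0  # highly reliable units, early life (lam * t << 1)
      for c in (0.90, 0.95, 0.99, 1.00):
          print(f"c = {c:.2f}  unreliability = {1 - duplex_reliability(t, lam, c):.2e}")
      # The printed unreliability grows almost exactly linearly in (1 - c).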

  6. Predictive modelling of fault related fracturing in carbonate damage-zones: analytical and numerical models of field data (Central Apennines, Italy)

    NASA Astrophysics Data System (ADS)

    Mannino, Irene; Cianfarra, Paola; Salvini, Francesco

    2010-05-01

    Permeability in carbonates is strongly influenced by the presence of brittle deformation patterns, i.e., pressure-solution surfaces, extensional fractures, and faults. Carbonate rocks achieve fracturing both during diagenesis and tectonic processes. Attitude, spatial distribution and connectivity of brittle deformation features rule the secondary permeability of carbonate rocks and therefore the accumulation and the pathways of deep fluids (groundwater, hydrocarbons). This is particularly true in fault zones, where the damage zone and the fault core show hydraulic properties different from the pristine rock as well as from each other. To improve the knowledge of fault architecture and fault hydraulic properties we study the brittle deformation patterns related to fault kinematics in carbonate successions. In particular we focussed on the evolution of damage-zone fracturing. Fieldwork was performed in Meso-Cenozoic carbonate units of the Latium-Abruzzi Platform, Central Apennines, Italy. These units represent field analogues of reservoir rocks in the Southern Apennines. We combine the study of rock physical characteristics of 22 faults and quantitative analyses of brittle deformation for the same faults, including bedding attitudes, fracturing type, attitudes, and spatial intensity distribution by using the dimension/spacing ratio, namely the H/S ratio, where H is the dimension of the fracture and S is the spacing between two analogous fractures of the same set. Statistical analyses of structural data (stereonets, contouring and H/S transects) were performed to infer a general algorithm that describes the expected intensity of the fracturing process. The analytical model was fit to field measurements by a Monte Carlo convergent approach. This method proved a useful tool to quantify complex relations with a high number of variables. It creates a large sequence of possible solution parameter sets whose results are compared with the field data. For each candidate set a mean error value (RMS) is computed, representing the effectiveness of the fit and thus the validity of the analysis. Eventually, the method selects the set of parameters that produced the smallest error. The tested algorithm describes the expected H/S values as a function of the distance from the fault core (D), the clay content (S), and the fault throw (T). The preliminary results of the Monte Carlo inversion show that the distance (D) has the strongest influence on the H/S spatial distribution, with the H/S value decreasing with distance from the fault core. The rheological parameter shows a value similar to the diagenetic H/S values (1-1.5). The resulting equation has a reasonable RMS value of 0.116. The results of the Monte Carlo models were finally implemented in FRAP, a fault environment modelling software. It is a true 4D tool that can predict the stress conditions and permeability architecture associated with a given fault during single or multiple tectonic events. We present models of fault-related fracturing for some of the studied faults computed with FRAP and compare them with the field measurements to test the validity of our methodology.
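
    The Monte Carlo convergent fit described above can be sketched in a few lines. In the fragment below the functional form of H/S(D, S, T), the parameter ranges, and the "field" data are all assumptions made for illustration; only the fitting strategy — sample candidate parameter sets, score each by RMS misfit, keep the best — mirrors the text.

      import math, random

      # Synthetic stand-ins for field data: (distance D [m], clay content S, throw T [m], H/S).
      field = [(5, 0.1, 50, 4.1), (20, 0.1, 50, 2.4), (80, 0.1, 50, 1.5),
               (10, 0.3, 120, 3.6), (60, 0.3, 120, 1.8)]

      def model(D, S, T, a, b, c0):
          # Assumed form: fracture intensity decays with distance from the core.
          return c0 + a * T * math.exp(-b * D) * (1.0 - S)

      best, best_rms = None, float("inf")
      for _ in range(50_000):
          a, b, c0 = random.uniform(0, 0.2), random.uniform(0, 0.5), random.uniform(0.5, 2.0)
          rms = math.sqrt(sum((model(D, S, T, a, b, c0) - hs) ** 2
                              for D, S, T, hs in field) / len(field))
          if rms < best_rms:
              best, best_rms = (a, b, c0), rms

      print("best parameters:", best, "RMS:", round(best_rms, 3))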

  7. Traffic accident reconstruction and an approach for prediction of fault rates using artificial neural networks: A case study in Turkey.

    PubMed

    Can Yilmaz, Ali; Aci, Cigdem; Aydin, Kadir

    2016-08-17

    Currently, in Turkey, fault rates in traffic accidents are determined according to the initiative of accident experts (no speed analyses of vehicles, just consideration of the accident type), and there are no specific quantitative instructions on fault rates related to the procession of accidents that merely represent the type of collision (side impact, head to head, rear end, etc.) in the No. 2918 Turkish Highway Traffic Act (THTA 1983). The aim of this study is to introduce a scientific and systematic approach for determination of fault rates in the most frequent property damage-only (PDO) traffic accidents in Turkey. In this study, data (police reports, skid marks, deformation, crush depth, etc.) collected from the most frequent and controversial accident types (4 sample vehicle-vehicle scenarios) that involve PDO were inserted into reconstruction software called vCrash. Sample real-world scenarios were simulated on the software to generate different vehicle deformations that also correspond to energy-equivalent speed data just before the crash. These values were used to train a multilayer feedforward artificial neural network (MFANN), function fitting neural network (FITNET, a specialized version of MFANN), and generalized regression neural network (GRNN) models within 10-fold cross-validation to predict fault rates without using the software. The performance of the artificial neural network (ANN) prediction models was evaluated using mean square error (MSE) and multiple correlation coefficient (R). It was shown that the MFANN model performed better for predicting fault rates (i.e., lower MSE and higher R) than the FITNET and GRNN models for accident scenarios 1, 2, and 3, whereas FITNET performed the best for scenario 4. The FITNET model showed the second-best prediction results for the first 3 scenarios. Because there is no training phase in GRNN, the GRNN model produced results much faster than the MFANN and FITNET models; however, it had the worst prediction results. The R values for prediction of fault rates were close to 1 for all folds and scenarios. This study exhibits new aspects and scientific approaches for determining fault rates of involvement in the most frequent PDO accidents occurring in Turkey, by discussing some deficiencies in the THTA and without regard to the initiative and/or experience of experts. It supports judicious decisions to be made especially in forensic investigations and cases involving insurance companies. Following this approach, injury/fatal and/or pedestrian-related accidents may be analyzed as future work by developing new scientific models.
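
    The modeling protocol — train a feedforward network on crash-derived features under 10-fold cross-validation and score it by MSE — can be reproduced with off-the-shelf tools. The sketch below uses scikit-learn in place of the authors' MFANN/FITNET/GRNN models, and the features and target rule are synthetic stand-ins for the vCrash-derived data.

      import numpy as np
      from sklearn.neural_network import MLPRegressor
      from sklearn.model_selection import KFold, cross_val_score

      rng = np.random.default_rng(1)
      X = rng.uniform([0.0, 0.0], [0.6, 40.0], size=(200, 2))  # crush depth (m), skid length (m)
      y = 100.0 * X[:, 0] / (X[:, 0] + 0.01 * X[:, 1] + 0.05)  # toy fault-rate rule (%)
      y += rng.normal(0.0, 2.0, size=200)                      # measurement noise

      model = MLPRegressor(hidden_layer_sizes=(10,), max_iter=5000, random_state=0)
      cv = KFold(n_splits=10, shuffle=True, random_state=0)
      scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
      print("10-fold mean MSE:", -scores.mean())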

  8. The Assistant for Specifying the Quality Software (ASQS) Operational Concept Document. Volume 1

    DTIC Science & Technology

    1990-09-01

    Assistant in which the manager supplies system-specific characteristics and needs and the Assistant fills in the software quality concepts and methods. The...member(s) of the Computer Resources Working Group (CRWG) to aid in performing a software quality engineering study. Figure 3.4-1 outlines the...need to recover from faults more likely than need to provide alternative functions or interfaces), and more on Autonomy...Modularity

  9. An Open Avionics and Software Architecture to Support Future NASA Exploration Missions

    NASA Technical Reports Server (NTRS)

    Schlesinger, Adam

    2017-01-01

    The presentation describes an avionics and software architecture that has been developed through NASA's Advanced Exploration Systems (AES) division. The architecture is open-source, highly reliable with fault tolerance, and utilizes standard capabilities and interfaces, which are scalable and customizable to support future exploration missions. Specific focus areas of discussion will include command and data handling, software, human interfaces, communication and wireless systems, and systems engineering and integration.

  10. 1393 Ring Bus at JPL: Description and Status

    NASA Technical Reports Server (NTRS)

    Wysocky, Terry R.

    2007-01-01

    Completed Ring Bus IC V&V Phase - Ring Bus Test Plan Completed for SIM Project - Applicable to Other Projects. Implemented an Avionics Bus Based upon the IEEE 1393 Standard - Excellent Starting Point for a General Purpose High-Speed Spacecraft Bus - Designed to Meet SIM Requirements for: real-time, deterministic, distributed systems; control system requirements; fault detection and recovery. Other JPL Projects Considering Implementation. Flight Software Ring Bus Driver Module Began in 2006, Continues. Participating in Standard Revision. SIM will search for Earth-like planets orbiting nearby stars and measure the masses and orbits of the planets it finds; survey 2000 nearby stars for planetary systems to learn whether our Solar System is unusual or typical; make a new catalog of star positions 100 times more accurate than current measurements; learn how our galaxy formed and will evolve by studying the dynamics of its stars; and critically test models of exactly how stars shine, including exotic objects like black holes, neutron stars and white dwarfs.

  11. The Software Design for the Wide-Field Infrared Explorer Attitude Control System

    NASA Technical Reports Server (NTRS)

    Anderson, Mark O.; Barnes, Kenneth C.; Melhorn, Charles M.; Phillips, Tom

    1998-01-01

    The Wide-Field Infrared Explorer (WIRE), currently scheduled for launch in September 1998, is the fifth of five spacecraft in the NASA/Goddard Small Explorer (SMEX) series. This paper presents the design of WIRE's Attitude Control System flight software (ACS FSW). WIRE is a momentum-biased, three-axis stabilized stellar pointer which provides high-accuracy pointing and autonomous acquisition for eight to ten stellar targets per orbit. WIRE's short mission life and limited cryogen supply motivate requirements for Sun and Earth avoidance constraints which are designed to prevent catastrophic instrument damage and to minimize the heat load on the cryostat. The FSW implements autonomous fault detection and handling (FDH) to enforce these instrument constraints and to perform several other checks which ensure the safety of the spacecraft. The ACS FSW implements modules for sensor data processing, attitude determination, attitude control, guide star acquisition, actuator command generation, command/telemetry processing, and FDH. These software components are integrated with a hierarchical control mode managing module that dictates which software components are currently active. The lowest mode in the hierarchy is the 'safest' one, in the sense that it utilizes a minimal complement of sensors and actuators to keep the spacecraft in a stable configuration (power and pointing constraints are maintained). As higher modes in the hierarchy are achieved, the various software functions are activated by the mode manager, and an increasing level of attitude control accuracy is provided. If FDH detects a constraint violation or other anomaly, it triggers a safing transition to a lower control mode. The WIRE ACS FSW satisfies all target acquisition and pointing accuracy requirements, enforces all pointing constraints, provides the ground with a simple means for reconfiguring the system via table load, and meets all the demands of its real-time embedded environment (16 MHz Intel 80386 processor with 80387 coprocessor running under the VRTX operating system). The mode manager organizes and controls all the software modules used to accomplish these goals, and in particular, the FDH module is tightly coupled with the mode manager.
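
    The hierarchical mode manager and its interaction with FDH lend themselves to a compact sketch. The Python below is an assumption-laden toy (the mode names and function sets are invented, and WIRE's flight code was not written this way): higher modes activate more functions, and any FDH violation demotes the system toward the safest mode.

      MODES = ["safehold", "sun_point", "acquire", "fine_point"]  # safest first

      FUNCTIONS = {
          "safehold":   {"coarse_sensors"},
          "sun_point":  {"coarse_sensors", "wheel_control"},
          "acquire":    {"coarse_sensors", "wheel_control", "star_tracker"},
          "fine_point": {"coarse_sensors", "wheel_control", "star_tracker", "gyros"},
      }

      class ModeManager:
          def __init__(self):
              self.mode = "safehold"

          def active_functions(self):
              return FUNCTIONS[self.mode]

          def promote(self):
              i = MODES.index(self.mode)
              if i + 1 < len(MODES):
                  self.mode = MODES[i + 1]

          def fdh_violation(self, anomaly):
              # A constraint violation triggers a safing transition to a lower mode.
              i = MODES.index(self.mode)
              self.mode = MODES[max(0, i - 1)]
              print(f"FDH: {anomaly} -> safing to {self.mode}")

      mm = ModeManager()
      mm.promote(); mm.promote(); mm.promote()       # work up to fine pointing
      mm.fdh_violation("sun avoidance constraint")   # demoted one level
      print(mm.mode, mm.active_functions())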

  12. On the engineering of crucial software

    NASA Technical Reports Server (NTRS)

    Pratt, T. W.; Knight, J. C.; Gregory, S. T.

    1983-01-01

    The various aspects of the conventional software development cycle are examined. This cycle was the basis of the augmented approach contained in the original grant proposal. This cycle was found inadequate for crucial software development, and the justification for this opinion is presented. Several possible enhancements to the conventional software cycle are discussed. Software fault tolerance, a possible enhancement of major importance, is discussed separately. Formal verification using mathematical proof is considered. Automatic programming is a radical alternative to the conventional cycle and is discussed. Recommendations for a comprehensive approach are presented, and various experiments which could be conducted in AIRLAB are described.

  13. Information technologies in optimization process of monitoring of software and hardware status

    NASA Astrophysics Data System (ADS)

    Nikitin, P. V.; Savinov, A. N.; Bazhenov, R. I.; Ryabov, I. V.

    2018-05-01

    The article describes a model of a hardware and software monitoring system for a large company that provides customers with software as a service (SaaS solution) using information technology. The main functions of the monitoring system are: provision of up-to-date data for analyzing the state of the IT infrastructure, rapid detection of faults, and their effective elimination. The main risks associated with the provision of these services are described, the comparative characteristics of the software are given, and the authors' methods of monitoring the status of software and hardware are proposed.

  14. Software architecture of the Magdalena Ridge Observatory Interferometer

    NASA Astrophysics Data System (ADS)

    Farris, Allen; Klinglesmith, Dan; Seamons, John; Torres, Nicolas; Buscher, David; Young, John

    2010-07-01

    Merging software from 36 independent work packages into a coherent, unified software system with a lifespan of twenty years is the challenge faced by the Magdalena Ridge Observatory Interferometer (MROI). We solve this problem by using standardized interface software automatically generated from simple high-level descriptions of these systems, relying only on Linux, GNU, and POSIX without complex software such as CORBA. This approach, based on gigabit Ethernet with a TCP/IP protocol, provides the flexibility to integrate and manage diverse, independent systems using a centralized supervisory system that provides a database manager, data collectors, fault handling, and an operator interface.
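
    The key idea — generating standardized interface code from a simple high-level description of each work package — can be suggested with a toy sketch. The description format and message protocol below are invented for illustration (MROI's actual generator and wire format are not documented here); only the pattern of declaration-driven TCP servers follows the text.

      import json, socketserver

      DESCRIPTION = {
          "system": "delay_line",
          "commands": {"move": ["position_m"], "stop": []},
          "telemetry": ["position_m", "status"],
      }

      def make_handler(description):
          """Generate a TCP request handler from the high-level description."""
          class Handler(socketserver.StreamRequestHandler):
              def handle(self):
                  msg = json.loads(self.rfile.readline())
                  if msg.get("cmd") in description["commands"]:
                      reply = {"ok": True, "cmd": msg["cmd"]}  # real action goes here
                  else:
                      reply = {"ok": False, "error": "unknown command"}
                  self.wfile.write((json.dumps(reply) + "\n").encode())
          return Handler

      # One such generated server per work package, all speaking the same protocol:
      # server = socketserver.TCPServer(("0.0.0.0", 9000), make_handler(DESCRIPTION))
      # server.serve_forever()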

  15. Operations management system advanced automation: Fault detection isolation and recovery prototyping

    NASA Technical Reports Server (NTRS)

    Hanson, Matt

    1990-01-01

    The purpose of this project is to address the global fault detection, isolation and recovery (FDIR) requirements for Operations Management System (OMS) automation within the Space Station Freedom program. This shall be accomplished by developing a selected FDIR prototype for the Space Station Freedom distributed processing systems. The prototype shall be based on advanced automation methodologies in addition to traditional software methods to meet the requirements for automation. A secondary objective is to expand the scope of the prototyping to encompass multiple aspects of station-wide fault management (SWFM) as discussed in OMS requirements documentation.

  16. Scientific Research Program for Power, Energy, and Thermal Technologies. Task Order 0002: Power, Thermal and Control Technologies and Processes Experimental Research. Subtask: Laboratory Test Set-up to Evaluate Electromechanical Actuation Systems for Aircraft Flight Control

    DTIC Science & Technology

    2015-08-01

    faults are incorporated into the system in order to better understand the EMA reliability, and to aid in designing fault detection software for real...to a fixed angle repeatedly and accurately [16]. The motor in the EHA is used to drive a reversible pump tied to a hydraulic cylinder which moves...[24] [25] [26]. These test stands are used for the prognostic testing of EMAs that have had mechanical or electrical faults injected into them.

  17. Physical fault tolerance of nanoelectronics.

    PubMed

    Szkopek, Thomas; Roychowdhury, Vwani P; Antoniadis, Dimitri A; Damoulakis, John N

    2011-04-29

    The error rate in complementary transistor circuits is suppressed exponentially in electron number, arising from an intrinsic physical implementation of fault-tolerant error correction. Contrariwise, explicit assembly of gates into the most efficient known fault-tolerant architecture is characterized by a subexponential suppression of error rate with electron number, and incurs significant overhead in wiring and complexity. We conclude that it is more efficient to prevent logical errors with physical fault tolerance than to correct logical errors with fault-tolerant architecture.

  18. An empirical study of software design practices

    NASA Technical Reports Server (NTRS)

    Card, David N.; Church, Victor E.; Agresti, William W.

    1986-01-01

    Software engineers have developed a large body of software design theory and folklore, much of which was never validated. The results of an empirical study of software design practices in one specific environment are presented. The practices examined affect module size, module strength, data coupling, descendant span, unreferenced variables, and software reuse. Measures characteristic of these practices were extracted from 887 FORTRAN modules developed for five flight dynamics software projects monitored by the Software Engineering Laboratory (SEL). The relationship of these measures to cost and fault rate was analyzed using a contingency table procedure. The results show that some recommended design practices, despite their intuitive appeal, are ineffective in this environment, whereas others are very effective.

  19. Modelling earthquake ruptures with dynamic off-fault damage

    NASA Astrophysics Data System (ADS)

    Okubo, Kurama; Bhat, Harsha S.; Klinger, Yann; Rougier, Esteban

    2017-04-01

    Earthquake rupture modelling has been developed for producing scenario earthquakes. This includes understanding the source mechanisms and estimating far-field ground motion given a priori constraints like fault geometry, the constitutive law of the medium and the friction law operating on the fault. It is necessary to consider all of the above complexities of fault systems to conduct realistic earthquake rupture modelling. In addition to the complexity of the fault geometry in nature, coseismic off-fault damage, which is observed by a variety of geological and seismological methods, plays a considerable role in the resultant ground motion and its spectrum compared to a model with a simple planar fault surrounded by purely elastic media. Ideally all of these complexities should be considered in earthquake modelling. State of the art techniques developed so far, however, cannot treat all of them simultaneously due to a variety of computational restrictions. Therefore, we adopt the combined finite-discrete element method (FDEM), which can effectively deal with pre-existing complex fault geometry such as fault branches and kinks and can describe coseismic off-fault damage generated during the dynamic rupture. The advantage of FDEM is that it can handle a wide range of length scales, from metric to kilometric scale, corresponding to the off-fault damage and complex fault geometry respectively. We used the FDEM-based software tool called HOSSedu (Hybrid Optimization Software Suite - Educational Version) for the earthquake rupture modelling, which was developed by Los Alamos National Laboratory. We first conducted the cross-validation of this new methodology against other conventional numerical schemes such as the finite difference method (FDM), the spectral element method (SEM) and the boundary integral equation method (BIEM), to evaluate the accuracy with various element sizes and artificial viscous damping values. We demonstrate the capability of the FDEM tool for modelling earthquake ruptures. We then modelled earthquake ruptures allowing for coseismic off-fault damage with appropriate fracture nucleation and growth criteria. We studied the effect of different conditions such as rupture speed (sub-Rayleigh or supershear), the orientation of the initial maximum principal stress with respect to the fault and the magnitude of the initial stress (to mimic depth). The comparison between the sub-Rayleigh and supershear cases shows that the coseismic off-fault damage is enhanced in the supershear case. The orientation of the maximum principal stress also makes a significant difference, such that dynamic off-fault cracking is more likely to occur on the extensional side of the fault for high principal stress orientations. It is found that the coseismic off-fault damage reduces the rupture speed due to the dissipation of energy by dynamic off-fault cracking generated in the vicinity of the rupture front. In terms of the ground motion amplitude spectra, it is shown that the high-frequency radiation is enhanced by the coseismic off-fault damage, though it is quickly attenuated. This is caused by the intricate superposition of the radiation generated by the off-fault damage and the perturbation of the rupture speed on the main fault.

  20. Lessons learned in creating spacecraft computer systems: Implications for using Ada (R) for the space station

    NASA Technical Reports Server (NTRS)

    Tomayko, James E.

    1986-01-01

    Twenty-five years of spacecraft onboard computer development have resulted in a better understanding of the requirements for effective, efficient, and fault tolerant flight computer systems. Lessons from eight flight programs (Gemini, Apollo, Skylab, Shuttle, Mariner, Voyager, and Galileo) and three research programs (digital fly-by-wire, STAR, and the Unified Data System) are useful in projecting the computer hardware configuration of the Space Station and the ways in which the Ada programming language will enhance the development of the necessary software. The evolution of hardware technology, fault protection methods, and software architectures used in space flight is reviewed in order to provide insight into the pending development of such items for the Space Station.

  1. Analytical sensor redundancy assessment

    NASA Technical Reports Server (NTRS)

    Mulcare, D. B.; Downing, L. E.; Smith, M. K.

    1988-01-01

    The rationale and mechanization of sensor fault tolerance based on analytical redundancy principles are described. The concept involves the substitution of software procedures, such as an observer algorithm, to supplant additional hardware components. The observer synthesizes values of sensor states in lieu of their direct measurement. Such information can then be used, for example, to determine which of two disagreeing sensors is more correct, thus enhancing sensor fault survivability. Here a stability augmentation system is used as an example application, with required modifications being made to a quadruplex digital flight control system. The impact on software structure and the resultant revalidation effort are illustrated as well. Also, the use of an observer algorithm for wind gust filtering of the angle-of-attack sensor signal is presented.
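
    The observer-based voting scheme described above can be illustrated briefly. The fragment below is a generic Luenberger-style sketch with invented plant and gain matrices, not the cited flight implementation: the observer synthesizes the sensor value, and the synthesized value arbitrates between two disagreeing sensors.

      import numpy as np

      A = np.array([[0.95, 0.10], [0.0, 0.90]])  # assumed discrete-time plant model
      B = np.array([[0.0], [0.1]])
      C = np.array([[1.0, 0.0]])                 # the measured state
      L = np.array([[0.5], [0.2]])               # assumed observer gain

      x_hat = np.zeros((2, 1))                   # observer state estimate

      def observer_step(u, y_meas):
          """Advance the estimate using control input u and the accepted measurement."""
          global x_hat
          innovation = y_meas - (C @ x_hat).item()
          x_hat = A @ x_hat + B * u + L * innovation

      def vote(sensor_a, sensor_b, u):
          y_hat = (C @ x_hat).item()             # synthesized sensor value
          good = sensor_a if abs(sensor_a - y_hat) < abs(sensor_b - y_hat) else sensor_b
          observer_step(u, good)                 # update with the more credible sensor
          return good

      print(vote(sensor_a=0.02, sensor_b=3.50, u=0.1))  # 3.50 is flagged as suspect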

  2. TASK ALLOCATION IN GEO-DISTRIBUTED CYBER-PHYSICAL SYSTEMS

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Aggarwal, Rachit; Smidts, Carol

    This paper studies the task allocation algorithm for a distributed test facility (DTF), which aims to assemble geo-distributed cyber (software) and physical (hardware-in-the-loop) components into a prototype cyber-physical system (CPS). This allows low-cost testing on an early conceptual prototype (ECP) of the ultimate CPS (UCPS) to be developed. The DTF provides an instrumentation interface for carrying out reliability experiments remotely, such as fault propagation analysis and in-situ testing of hardware and software components in a simulated environment. Unfortunately, the geo-distribution introduces an overhead that is not inherent to the UCPS, i.e., a significant time delay in communication that threatens the stability of the ECP and is not an appropriate representation of the behavior of the UCPS. This can be mitigated by implementing a task allocation algorithm to find a suitable configuration and assign the software components to appropriate computational locations, dynamically. This would allow the ECP to operate more efficiently with less probability of being unstable due to the delays introduced by geo-distribution. The task allocation algorithm proposed in this work uses a Monte Carlo approach along with dynamic programming to identify the optimal network configuration that keeps the time delays to a minimum.
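
    As a rough illustration of the Monte Carlo part of that algorithm (the paper also uses dynamic programming, and the sites, components, and delay matrix here are invented), random component-to-site assignments can be sampled and scored by total communication delay:

      import random

      SITES = ["lab_A", "lab_B", "lab_C"]
      DELAY_MS = {("lab_A", "lab_B"): 40, ("lab_A", "lab_C"): 90, ("lab_B", "lab_C"): 60}
      LINKS = [("controller", "plant_model"), ("controller", "logger"),
               ("plant_model", "logger")]  # communicating component pairs

      def delay(s1, s2):
          return 0 if s1 == s2 else DELAY_MS[tuple(sorted((s1, s2)))]

      def cost(assignment):
          return sum(delay(assignment[a], assignment[b]) for a, b in LINKS)

      components = ["controller", "plant_model", "logger"]
      best, best_cost = None, float("inf")
      for _ in range(1000):
          assignment = {c: random.choice(SITES) for c in components}
          if cost(assignment) < best_cost:
              best, best_cost = assignment, cost(assignment)
      print(best, best_cost)

    In the real facility the hardware-in-the-loop components are pinned to their physical sites, so only the software components would be reassigned dynamically.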

  3. The fault-tree compiler

    NASA Technical Reports Server (NTRS)

    Martensen, Anna L.; Butler, Ricky W.

    1987-01-01

    The Fault Tree Compiler Program is a new reliability tool used to predict the top event probability for a fault tree. Five different gate types are allowed in the fault tree: AND, OR, EXCLUSIVE OR, INVERT, and M OF N gates. The high-level input language is easy to understand and use when describing the system tree. In addition, the use of the hierarchical fault tree capability can simplify the tree description and decrease program execution time. The current solution technique provides an answer precise (within the limits of double-precision floating-point arithmetic) to five digits. The user may vary one failure rate or failure probability over a range of values and plot the results for sensitivity analyses. The solution technique is implemented in FORTRAN; the remaining program code is implemented in Pascal. The program is written to run on a Digital Equipment Corporation (DEC) VAX with the VMS operating system.
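
    For readers unfamiliar with how a top-event probability follows from the gate types listed above, the toy evaluator below may help. It is a didactic sketch, not the FTC program: it assumes independent basic events and evaluates each gate by brute-force enumeration rather than the FTC's solution technique.

      from itertools import product

      def gate_prob(kind, probs, m=None):
          """Probability that one gate fires, given independent input probabilities."""
          total = 0.0
          for outcomes in product((0, 1), repeat=len(probs)):
              p = 1.0
              for bit, q in zip(outcomes, probs):
                  p *= q if bit else (1.0 - q)
              k = sum(outcomes)
              fired = {"AND": k == len(probs),
                       "OR": k >= 1,
                       "XOR": k == 1,
                       "INVERT": k == 0,                  # single-input NOT
                       "M_OF_N": m is not None and k >= m}[kind]
              if fired:
                  total += p
          return total

      print(gate_prob("AND", [0.01, 0.02]))              # 2.0e-04
      print(gate_prob("OR", [0.01, 0.02]))               # 2.98e-02
      print(gate_prob("M_OF_N", [0.1, 0.1, 0.1], m=2))   # 2.8e-02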

  4. The Fault Tree Compiler (FTC): Program and mathematics

    NASA Technical Reports Server (NTRS)

    Butler, Ricky W.; Martensen, Anna L.

    1989-01-01

    The Fault Tree Compiler Program is a new reliability tool used to predict the top-event probability for a fault tree. Five different gate types are allowed in the fault tree: AND, OR, EXCLUSIVE OR, INVERT, and M OF N gates. The high-level input language is easy to understand and use when describing the system tree. In addition, the use of the hierarchical fault tree capability can simplify the tree description and decrease program execution time. The current solution technique provides an answer precise (within the limits of double-precision floating-point arithmetic) to a user-specified number of digits of accuracy. The user may vary one failure rate or failure probability over a range of values and plot the results for sensitivity analyses. The solution technique is implemented in FORTRAN; the remaining program code is implemented in Pascal. The program is written to run on a Digital Equipment Corporation (DEC) VAX computer with the VMS operating system.

  5. Single Event Analysis and Fault Injection Techniques Targeting Complex Designs Implemented in Xilinx-Virtex Family Field Programmable Gate Array (FPGA) Devices

    NASA Technical Reports Server (NTRS)

    Berg, Melanie D.; LaBel, Kenneth; Kim, Hak

    2014-01-01

    An informative session regarding SRAM FPGA basics, presenting a framework for fault injection techniques applied to Xilinx Field Programmable Gate Arrays (FPGAs). The session introduces an overlooked time component which illustrates that fault injection is impractical for most real designs as a stand-alone characterization tool, and demonstrates procedures that benefit from fault injection error analysis.

  6. Analysis of Multilayered Printed Circuit Boards using Computed Tomography

    DTIC Science & Technology

    2014-05-01

    complex PCBs that present a challenge for any testing or fault analysis. Set-to-work testing and fault analysis of any electronic circuit require...Electronic Warfare and Radar Division in December 2010. He is currently in the Electro-Optic Countermeasures Group. Samuel works on embedded system design...and software optimisation of complex electro-optical systems, including the set to work and characterisation of these systems. He has a Bachelor of

  7. Flight test results of the Strapdown hexad Inertial Reference Unit (SIRU). Volume 1: Flight test summary

    NASA Technical Reports Server (NTRS)

    Hruby, R. J.; Bjorkman, W. S.

    1977-01-01

    Flight test results of the strapdown inertial reference unit (SIRU) navigation system are presented. The fault-tolerant SIRU navigation system features a redundant inertial sensor unit and dual computers. System software provides for detection and isolation of inertial sensor failures and continued operation in the event of failures. Flight test results include assessments of the system's navigational performance and fault tolerance.

  8. Pratt and Whitney Overview and Advanced Health Management Program

    NASA Technical Reports Server (NTRS)

    Inabinett, Calvin

    2008-01-01

    Hardware Development Activity: Design and Test Custom Multi-layer Circuit Boards for use in the Fault Emulation Unit; Logic design performed using VHDL; Layout power system for lab hardware; Work lab issues with software developers and software testers; Interface with Engine Systems personnel with performance of Engine hardware components; Perform off nominal testing with new engine hardware.

  9. Human factors in software development

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Curtis, B.

    1986-01-01

    This book presents an overview of ergonomics/human factors in software development, recent research, and classic papers. Articles are drawn from the following areas of psychological research on programming: cognitive ergonomics, cognitive psychology, and psycholinguistics. Topics examined include: theoretical models of how programmers solve technical problems, the characteristics of programming languages, specification formats in behavioral research and psychological aspects of fault diagnosis.

  10. Developing Software For Monitoring And Diagnosis

    NASA Technical Reports Server (NTRS)

    Edwards, S. J.; Caglayan, A. K.

    1993-01-01

    Expert-system software shell produces executable code. Report discusses beginning phase of research directed toward development of artificial intelligence for real-time monitoring of, and diagnosis of faults in, complicated systems of equipment. Motivated by need for onboard monitoring and diagnosis of electronic sensing and controlling systems of advanced aircraft. Also applicable to such equipment systems as refineries, factories, and powerplants.

  11. Reliable and Fault-Tolerant Software-Defined Network Operations Scheme for Remote 3D Printing

    NASA Astrophysics Data System (ADS)

    Kim, Dongkyun; Gil, Joon-Min

    2015-03-01

    The recent wide expansion of applicable three-dimensional (3D) printing and software-defined networking (SDN) technologies has led to a great deal of attention being focused on efficient remote control of manufacturing processes. SDN is a renowned paradigm for network softwarization, which has helped facilitate remote manufacturing in association with high network performance, since SDN is designed to control network paths and traffic flows, guaranteeing improved quality of services by obtaining network requests from end-applications on demand through the separated SDN controller or control plane. However, current SDN approaches are generally focused on the controls and automation of the networks, which indicates that there is a lack of management plane development designed for a reliable and fault-tolerant SDN environment. Therefore, in addition to the inherent advantage of SDN, this paper proposes a new software-defined network operations center (SD-NOC) architecture to strengthen the reliability and fault-tolerance of SDN in terms of network operations and management in particular. The cooperation and orchestration between SDN and SD-NOC are also introduced for the SDN failover processes based on four principal SDN breakdown scenarios derived from the failures of the controller, SDN nodes, and connected links. The abovementioned SDN troubles significantly reduce the network reachability to remote devices (e.g., 3D printers, super high-definition cameras, etc.) and the reliability of relevant control processes. Our performance consideration and analysis results show that the proposed scheme can shrink operations and management overheads of SDN, which leads to the enhancement of responsiveness and reliability of SDN for remote 3D printing and control processes.
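
    Of the four breakdown scenarios, controller loss is the simplest to sketch. The toy below (invented names and timeout; not the SD-NOC code) shows the management-plane pattern the paper argues for: watch controller heartbeats and repoint switches to a standby controller when the active one goes silent.

      import time

      CONTROLLERS = ["ctl-primary", "ctl-standby"]
      HEARTBEAT_TIMEOUT_S = 3.0
      last_heartbeat = {c: time.time() for c in CONTROLLERS}
      active = "ctl-primary"

      def on_heartbeat(controller):
          last_heartbeat[controller] = time.time()

      def check_failover():
          global active
          if time.time() - last_heartbeat[active] > HEARTBEAT_TIMEOUT_S:
              standby = next(c for c in CONTROLLERS if c != active)
              print(f"{active} silent; repointing switches to {standby}")
              active = standby  # in a real deployment: update each switch's controller

      on_heartbeat("ctl-standby")
      last_heartbeat["ctl-primary"] -= 10.0  # simulate the primary going silent
      check_failover()
      print("active controller:", active)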

  12. Real time software for a heat recovery steam generator control system

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Valdes, R.; Delgadillo, M.A.; Chavez, R.

    1995-12-31

    This paper addresses the development and successful implementation of real-time software for the Heat Recovery Steam Generator (HRSG) control system of a combined cycle power plant. The real-time software for the HRSG control system physically resides in a Control and Acquisition System (SAC), which is a component of a distributed control system (DCS). The SAC is a programmable controller. The DCS installed at the Gomez Palacio power plant in Mexico accomplishes the functions of logic, analog and supervisory control. The DCS is based on microprocessors and the architecture consists of workstations operating as a Man-Machine Interface (MMI), linked to SAC controllers by means of a communication system. The HRSG real-time software is composed of an operating system, drivers, dedicated computer programs and application computer programs. The operating system used for the development of this software was the MultiTasking Operating System (MTOS). The application software developed at IIE for the HRSG control system basically consisted of a set of digital algorithms for the regulation of the main process variables of the HRSG. By using the multitasking feature of MTOS, the algorithms are executed pseudo-concurrently. In this way, the application programs continuously use the resources of the operating system to perform their functions through a uniform service interface. The application software of the HRSG consists of three tasks, each of which has dedicated responsibilities. The drivers were developed for the handling of hardware resources of the SAC controller, which in turn allows signal acquisition and data communication with the MMI. The dedicated programs were developed for hardware diagnostics, task initialization, access to the database and fault tolerance. The application software and the dedicated software for the HRSG control system were developed using the C programming language due to its compactness, portability and efficiency.

  13. Lessons Learned in the Livingstone 2 on Earth Observing One Flight Experiment

    NASA Technical Reports Server (NTRS)

    Hayden, Sandra C.; Sweet, Adam J.; Shulman, Seth

    2005-01-01

    The Livingstone 2 (L2) model-based diagnosis software is a reusable diagnostic tool for monitoring complex systems. In 2004, L2 was integrated with the JPL Autonomous Sciencecraft Experiment (ASE) and deployed on-board Goddard's Earth Observing One (EO-1) remote sensing satellite, to monitor and diagnose the EO-1 space science instruments and imaging sequence. This paper reports on lessons learned from this flight experiment. The goals for this experiment, including validation of minimum success criteria and of a series of diagnostic scenarios, have all been successfully met. Long-term operations in space are on-going, as a test of the maturity of the system, with L2 performance remaining flawless. L2 has demonstrated the ability to track the state of the system during nominal operations, detect simulated abnormalities in operations and isolate failures to their root cause fault. Specific advances demonstrated include diagnosis of ambiguity groups rather than a single fault candidate; hypothesis revision given new sensor evidence about the state of the system; and the capability to check for faults in a dynamic system without having to wait until the system is quiescent. The major benefits of this advanced health management technology are to increase mission duration and reliability through intelligent fault protection, and robust autonomous operations with reduced dependency on supervisory operations from Earth. The workload for operators will be reduced by telemetry of processed state-of-health information rather than raw data. The long-term vision is that of making diagnosis available to the onboard planner or executive, allowing autonomy software to re-plan in order to work around known component failures. For a system that is expected to evolve substantially over its lifetime, as for the International Space Station, the model-based approach has definite advantages over rule-based expert systems and limit-checking fault protection systems, as these do not scale well. The model-based approach facilitates reuse of the L2 diagnostic software; only the model of the system to be diagnosed and the telemetry monitoring software have to be rebuilt for a new system or expanded for a growing system. The hierarchical L2 model supports modularity and expandability, and as such is a suitable solution for integrated system health management as envisioned for systems-of-systems.

  14. The Importance of Architecture in DoD Software

    DTIC Science & Technology

    1991-07-01

    M91-35. The Importance of Architecture in DOD Software. Dr. Barry M. Horowitz, July 1991. ...resource utilization: architecture determines how the system sustains operations when parts of the system fail. The architecture also determines...software maintainers to ensure that we deliver to them whatever is necessary for them to sustain and use the architecture. Fault Rate 37%...

  15. Software-Controlled Caches in the VMP Multiprocessor

    DTIC Science & Technology

    1986-03-01

    ...programming system level that is tuned for the VMP design. In this vein, we are interested in exploring how far the software support can go to ...handled in software, analogously to the handling of virtual memory page faults. Hardware support for...ensure good behavior. Each cache miss results in bus traffic. Table 2 provides the bus cost for the "average" cache miss.

  16. Functional Fault Modeling Conventions and Practices for Real-Time Fault Isolation

    NASA Technical Reports Server (NTRS)

    Ferrell, Bob; Lewis, Mark; Perotti, Jose; Oostdyk, Rebecca; Brown, Barbara

    2010-01-01

    The purpose of this paper is to present the conventions, best practices, and processes that were established based on the prototype development of a Functional Fault Model (FFM) for a Cryogenic System that would be used for real-time fault isolation in a Fault Detection, Isolation, and Recovery (FDIR) system. The FDIR system is envisioned to perform health management functions for both a launch vehicle and the ground systems that support the vehicle during checkout and launch countdown by using a suite of complementary software tools that alert operators to anomalies and failures in real-time. The FFMs were created offline but would eventually be used by a real-time reasoner to isolate faults in a Cryogenic System. Through their development and review, a set of modeling conventions and best practices were established. The prototype FFM development also provided a pathfinder for future FFM development processes. This paper documents the rationale and considerations for robust FFMs that can easily be transitioned to a real-time operating environment.

  17. Fault tolerance in computational grids: perspectives, challenges, and issues.

    PubMed

    Haider, Sajjad; Nazir, Babar

    2016-01-01

    Computational grids are established with the intention of providing shared access to hardware- and software-based resources, with special reference to increased computational capabilities. Fault tolerance is one of the most important issues faced by computational grids. The main contribution of this survey is the creation of an extended classification of problems that occur in computational grid environments. The proposed classification will help researchers, developers, and maintainers of grids to understand the types of issues to be anticipated. Moreover, different types of problems, such as omission, interaction, and timing-related faults, have been identified that need to be handled on various layers of the computational grid. In this survey, an analysis and examination are also performed pertaining to fault tolerance and fault detection mechanisms. Our conclusion is that a dependable and reliable grid can only be established when more emphasis is placed on fault identification. Moreover, our survey reveals that adaptive and intelligent fault identification and tolerance techniques can improve the dependability of grid working environments.

  18. Investigating an API for resilient exascale computing.

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Stearley, Jon R.; Tomkins, James; VanDyke, John P.

    2013-05-01

    Increased HPC capability comes with increased complexity, part counts, and fault occurrences. Increasing the resilience of systems and applications to faults is a critical requirement facing the viability of exascale systems, as the overhead of traditional checkpoint/restart is projected to outweigh its benefits due to fault rates outpacing I/O bandwidths. As faults occur and propagate throughout hardware and software layers, pervasive notification and handling mechanisms are necessary. This report describes an initial investigation of fault types and programming interfaces to mitigate them. Proof-of-concept APIs are presented for the frequent and important cases of memory errors and node failures, and a strategy is proposed for filesystem failures. These involve changes to the operating system, runtime, I/O library, and application layers. While a single API for fault handling among hardware and OS and application system-wide remains elusive, the effort increased our understanding of both the mountainous challenges and the promising trailheads.
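
    To give the flavor of such a proof-of-concept API (the names and signatures below are invented for illustration; the report's actual interfaces are not reproduced here), an application might register per-fault-class handlers that the runtime invokes instead of aborting:

      HANDLERS = {}

      def register_fault_handler(fault_class, handler):
          """Application-side registration for a class of faults."""
          HANDLERS.setdefault(fault_class, []).append(handler)

      def raise_fault(fault_class, **info):
          """Runtime-side notification when a fault is detected."""
          for handler in HANDLERS.get(fault_class, []):
              handler(info)

      def on_memory_error(info):
          print(f"uncorrectable memory error at {info['address']:#x}; recomputing block")

      def on_node_failure(info):
          print(f"node {info['node']} lost; redistributing its work")

      register_fault_handler("memory_error", on_memory_error)
      register_fault_handler("node_failure", on_node_failure)

      raise_fault("memory_error", address=0x7f3a0000)
      raise_fault("node_failure", node=42)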

  19. Error rates and resource overheads of encoded three-qubit gates

    NASA Astrophysics Data System (ADS)

    Takagi, Ryuji; Yoder, Theodore J.; Chuang, Isaac L.

    2017-10-01

    A non-Clifford gate is required for universal quantum computation, and, typically, this is the most error-prone and resource-intensive logical operation on an error-correcting code. Small, single-qubit rotations are popular choices for this non-Clifford gate, but certain three-qubit gates, such as Toffoli or controlled-controlled-Z (ccz), are equivalent options that are also more suited for implementing some quantum algorithms, for instance, those with coherent classical subroutines. Here, we calculate error rates and resource overheads for implementing logical ccz with pieceable fault tolerance, a nontransversal method for implementing logical gates. We provide a comparison with a nonlocal magic-state scheme on a concatenated code and a local magic-state scheme on the surface code. We find the pieceable fault-tolerance scheme particularly advantaged over magic states on concatenated codes and in certain regimes over magic states on the surface code. Our results suggest that pieceable fault tolerance is a promising candidate for fault tolerance in a near-future quantum computer.

  20. Fault-free behavior of reliable multiprocessor systems: FTMP experiments in AIRLAB

    NASA Technical Reports Server (NTRS)

    Clune, E.; Segall, Z.; Siewiorek, D.

    1985-01-01

    This report describes a set of experiments which were implemented on the Fault Tolerant Multiprocessor (FTMP) at NASA Langley's AIRLAB facility. These experiments are part of an effort to formulate and evaluate validation methodologies for fault-tolerant computers. This report deals with the measurement of single parameters (baselines) of a fault-free system. The initial set of baseline experiments led to the following conclusions: (1) the system clock is constant and independent of workload in the tested cases; (2) the instruction execution times are constant; (3) the R4 frame size is 40 ms, with some variation; (4) the frame stretching mechanism has some flaws in its implementation that allow the possibility of an infinite stretching of frame duration. Future experiments are planned. Some will broaden the results of these initial experiments. Others will measure the system more dynamically. The implementation of a synthetic workload generation mechanism for FTMP is planned to enhance the experimental environment of the system.

  1. The 26 May 2006 Yogyakarta earthquake fault observed by seismic data and satellite data based surface features

    NASA Astrophysics Data System (ADS)

    Anggraini, Ade; Sobiesiak, Monika; Walter, Thomas R.

    2010-05-01

    The Mw 6.3 May 26, 2006 Yogyakarta earthquake caused severe damage and claimed thousands of lives in the Yogyakarta Special Province and the Klaten District of Central Java Province. The nearby Opak River fault was thought to be the source of this earthquake disaster. However, no significant surface movement was observed along the fault which could confirm that it was really the source of the earthquake. To investigate the earthquake source and to understand the earthquake mechanism, a rapid response team of the German Task Force for Earthquakes, together with the Seismological Division of Badan Meteorologi Klimatologi dan Geofisika and Gadjah Mada University in Yogyakarta, installed a temporary seismic network of 12 short-period seismometers. More than 3000 aftershocks were recorded during the 3-month campaign. Here we present the results of several hundred processed aftershocks. We used the integrated software package GIANTPitsa to pick P and S phases manually and HYPO71 to determine the hypocenters. HypoDD software was used for hypocenter relocation to obtain high-precision aftershock locations. Our aftershock distribution shows a system of lineaments in a southwest-northeast direction, about 10 km east of the Opak River fault, at 5-18 km depth. The b-value map from the aftershocks shows that the main lineaments have a relatively low b-value in the middle part, which suggests this part is still under stress. We also observe several aftershock clusters cutting these lineaments in a nearly perpendicular direction. To verify the interpretation of our aftershock analysis, we will overlay it on surface features we delineate from satellite data. We hope our results will contribute significantly to understanding the near-surface fault systems around the Yogyakarta area in order to mitigate similar earthquake hazards in the future.

  2. Project Morpheus: Morpheus 1.5A Lander Failure Investigation Results

    NASA Technical Reports Server (NTRS)

    Devolites, Jennifer L.; Olansen, Jon B.; Munday, Stephen R.

    2013-01-01

    On August 9, 2012 the Morpheus 1.5A vehicle crashed shortly after liftoff from the Kennedy Space Center. The loss was limited to the vehicle itself, and the event had been pre-declared a test failure rather than a mishap. The Morpheus project is demonstrating advanced technologies for in-space and planetary surface vehicles, including: autonomous flight control; landing site hazard identification and safe site selection; relative surface and hazard navigation; precision landing; modular, reusable flight software; and a high-performance, non-toxic, cryogenic liquid oxygen/liquid methane integrated main engine and attitude control propulsion system. A comprehensive failure investigation isolated the fault to the Inertial Measurement Unit (IMU) data path to the flight computer. Several improvements have been identified and implemented for the 1.5B and 1.5C vehicles.

  3. A circuit-based photovoltaic module simulator with shadow and fault settings

    NASA Astrophysics Data System (ADS)

    Chao, Kuei-Hsiang; Chao, Yuan-Wei; Chen, Jyun-Ping

    2016-03-01

    The main purpose of this study was to develop a photovoltaic (PV) module simulator. The proposed simulator, using electrical parameters from solar cells, can simulate output characteristics not only under normal operating conditions but also under partial shadow and fault conditions. Such a simulator possesses the advantages of low cost, small size, and easy realizability. Experiments have shown that results from the proposed PV simulator are very close to those from simulation software under partial shadow conditions, with negligible differences during fault occurrence. Meanwhile, the PV module simulator, as developed, can be used in various types of series-parallel connections to form PV arrays, to conduct experiments on partial shadow and fault events occurring in some of the modules. Such experiments are designed to explore the impact of shadow and fault conditions on the output characteristics of the system as a whole.
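
    The electrical model underlying such a simulator can be sketched with the standard single-diode equation; the parameter values below, and the way shading (reduced photocurrent) and a fault (degraded shunt resistance) are injected, are assumptions for illustration rather than the authors' circuit.

      import numpy as np

      q_kT = 1.0 / 0.0257  # q/kT at about 25 C, in 1/V

      def cell_current(v, irradiance=1.0, r_shunt=1e3):
          """Single-diode model: photocurrent minus diode and shunt losses."""
          i_ph = 8.0 * irradiance      # photocurrent scales with irradiance
          i_0, n = 1e-9, 1.3           # saturation current, ideality factor
          return i_ph - i_0 * (np.exp(q_kT * v / n) - 1.0) - v / r_shunt

      v = np.linspace(0.0, 0.65, 200)
      for label, i in (("normal", cell_current(v)),
                       ("shaded", cell_current(v, irradiance=0.3)),
                       ("shunt fault", cell_current(v, r_shunt=0.05))):
          print(f"{label:11s} max power: {(v * i).max():.2f} W")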

  4. Narrowing the scope of failure prediction using targeted fault load injection

    NASA Astrophysics Data System (ADS)

    Jordan, Paul L.; Peterson, Gilbert L.; Lin, Alan C.; Mendenhall, Michael J.; Sellers, Andrew J.

    2018-05-01

    As society becomes more dependent upon computer systems to perform increasingly critical tasks, ensuring that those systems do not fail becomes increasingly important. Many organizations depend heavily on desktop computers for day-to-day operations. Unfortunately, the software that runs on these computers is written by humans and, as such, is still subject to human error and consequent failure. A natural solution is to use statistical machine learning to predict failure. However, since failure is still a relatively rare event, obtaining labelled training data to train these models is not a trivial task. This work presents new simulated fault-inducing loads that extend the focus of traditional fault injection techniques to predict failure in the Microsoft enterprise authentication service and Apache web server. These new fault loads were successful in creating failure conditions that were identifiable using statistical learning methods, with fewer irrelevant faults being created.
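
    As a hedged illustration of a targeted fault load, the sketch below steadily grows resident memory while emitting labelled samples for later classifier training; the step sizes, labels, and fields are invented and are not the paper's actual fault loads.

        import time

        def memory_pressure_load(mb_per_step=64, steps=8, dwell_s=0.5):
            """Hypothetical fault-inducing load: grow resident memory to push a
            target service toward failure while labelled metrics are emitted."""
            hog = []
            for step in range(steps):
                hog.append(bytearray(mb_per_step * 1024 * 1024))  # allocate and keep a chunk
                time.sleep(dwell_s)
                yield {"step": step, "alloc_mb": (step + 1) * mb_per_step,
                       "label": "pre-failure" if step >= steps - 2 else "nominal"}

        for sample in memory_pressure_load(mb_per_step=1, steps=4, dwell_s=0.01):
            print(sample)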

  5. On-line early fault detection and diagnosis of municipal solid waste incinerators

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Zhao Jinsong; Huang Jianchao; Sun Wei

    A fault detection and diagnosis framework is proposed in this paper for early fault detection and diagnosis (FDD) of municipal solid waste incinerators (MSWIs), in order to improve the safety and continuity of production. In this framework, principal component analysis (PCA), one of the multivariate statistical technologies, is used to detect abnormal events, while rule-based reasoning performs fault diagnosis and consequence prediction, and also generates recommendations for fault mitigation once an abnormal event is detected. A software package, SWIFT, was developed based on the proposed framework and has been applied in an actual industrial MSWI. The application shows that automated real-time abnormal situation management (ASM) of the MSWI can be achieved by using SWIFT, with an industrially acceptable low rate of wrong diagnoses, resulting in improved process continuity and environmental performance of the MSWI.
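
    A minimal sketch of PCA-based detection follows, using the standard Hotelling's T² and SPE (Q) statistics from multivariate statistical process monitoring; the data and retained component count are illustrative, not SWIFT's configuration.

        import numpy as np

        def fit_pca_monitor(X, n_pc=2):
            """Fit a PCA monitoring model on normal operating data X (samples x variables)."""
            mu, sigma = X.mean(axis=0), X.std(axis=0)
            Z = (X - mu) / sigma
            U, s, Vt = np.linalg.svd(Z, full_matrices=False)
            P = Vt[:n_pc].T                        # retained loadings
            lam = (s[:n_pc] ** 2) / (len(X) - 1)   # retained eigenvalues
            return mu, sigma, P, lam

        def t2_spe(x, mu, sigma, P, lam):
            """Hotelling's T^2 in the model subspace and SPE (Q) in the residual subspace."""
            z = (x - mu) / sigma
            t = P.T @ z
            resid = z - P @ t
            return float(np.sum(t ** 2 / lam)), float(resid @ resid)

        rng = np.random.default_rng(0)
        X = rng.normal(size=(500, 5))             # stand-in for normal plant data
        model = fit_pca_monitor(X)
        print(t2_spe(X[0], *model))                                # in-control sample
        print(t2_spe(X[0] + np.array([0, 0, 6, 0, 0]), *model))    # abnormal event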

  6. A Testbed for Evaluating Lunar Habitat Autonomy Architectures

    NASA Technical Reports Server (NTRS)

    Lawler, Dennis G.

    2008-01-01

    A lunar outpost will involve a habitat with an integrated set of hardware and software that will maintain a safe environment for human activities. There is a desire for a paradigm shift whereby crew will be the primary mission operators, not ground controllers. There will also be significant periods when the outpost is uncrewed. This will require that significant automation software be resident in the habitat to maintain all system functions and respond to faults. JSC is developing a testbed to allow for early testing and evaluation of different autonomy architectures. This will allow evaluation of different software configurations in order to: 1) understand different operational concepts; 2) assess the impact of failures and perturbations on the system; and 3) mitigate software and hardware integration risks. The testbed will provide an environment in which habitat hardware simulations can interact with autonomous control software. Faults can be injected into the simulations and different mission scenarios can be scripted. The testbed allows for logging, replaying and re-initializing mission scenarios. An initial testbed configuration has been developed by combining an existing life support simulation and an existing simulation of the space station power distribution system. Results from this initial configuration will be presented along with suggested requirements and designs for the incremental development of a more sophisticated lunar habitat testbed.
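
    A hypothetical sketch of the scenario-scripting idea follows: scripted events inject named faults into named simulations at given mission times. The event schema, inject callback, and clock are assumptions for illustration, not the JSC testbed's actual interfaces.

        # Hypothetical scenario script: each event injects a named fault into a
        # named simulation at a mission time (all names invented).
        scenario = [
            {"t": 120.0, "sim": "life_support", "fault": "co2_scrubber_degraded", "value": 0.6},
            {"t": 300.0, "sim": "power",        "fault": "bus_undervolt",         "value": 26.5},
        ]

        def run_scenario(scenario, inject, clock):
            """Replay scripted fault events against running simulations;
            inject(sim, fault, value) and clock() are supplied by the testbed."""
            for event in sorted(scenario, key=lambda e: e["t"]):
                while clock() < event["t"]:
                    pass                    # in practice: wait on the sim scheduler tick
                inject(event["sim"], event["fault"], event["value"])

        # Demo with a fake clock that advances one minute per poll
        t = [0.0]
        def clock():
            t[0] += 60.0
            return t[0]
        run_scenario(scenario, lambda s, f, v: print(f"t={t[0]:.0f}s: {f} -> {s} ({v})"), clock)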

  7. A Thermal Expert System (TEXSYS) development overview - AI-based control of a Space Station prototype thermal bus

    NASA Technical Reports Server (NTRS)

    Glass, B. J.; Hack, E. C.

    1990-01-01

    A knowledge-based control system for real-time control and fault detection, isolation and recovery (FDIR) of a prototype two-phase Space Station Freedom external thermal control system (TCS) is discussed in this paper. The Thermal Expert System (TEXSYS) has been demonstrated in recent tests to be capable of both fault anticipation and detection and real-time control of the thermal bus. Performance requirements were achieved by using a symbolic control approach, layering model-based expert system software on a conventional numerical data acquisition and control system. The model-based capabilities of TEXSYS were shown to be advantageous during software development and testing. One representative example is given from on-line TCS tests of TEXSYS. The integration and testing of TEXSYS with a live TCS testbed provides some insight on the use of formal software design, development and documentation methodologies to qualify knowledge-based systems for on-line or flight applications.
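
    As a toy illustration of fault anticipation in a symbolic layer over numeric telemetry, the sketch below extrapolates a linear trend and warns before a limit crossing; the signal, limit, and horizon are invented and are not TEXSYS rules.

        def anticipate_fault(history, limit, horizon_s=60.0, dt_s=1.0):
            """Toy 'fault anticipation' check: extrapolate a linear trend from
            recent samples and warn if the signal will cross its limit within
            the horizon, before a hard threshold alarm would fire."""
            if len(history) < 2:
                return False
            rate = (history[-1] - history[0]) / ((len(history) - 1) * dt_s)
            projected = history[-1] + rate * horizon_s
            return projected > limit

        # A bus temperature creeping toward its 40 C limit at 0.5 C/s
        print(anticipate_fault([35.0, 35.5, 36.0, 36.5], limit=40.0))   # True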

  8. Model-based reasoning for power system management using KATE and the SSM/PMAD

    NASA Technical Reports Server (NTRS)

    Morris, Robert A.; Gonzalez, Avelino J.; Carreira, Daniel J.; Mckenzie, F. D.; Gann, Brian

    1993-01-01

    The overall goal of this research effort has been the development of a software system that automates tasks related to monitoring and controlling electrical power distribution in spacecraft electrical power systems. The resulting software system is called the Intelligent Power Controller (IPC). The specific tasks performed by the IPC include continuous monitoring of the flow of power from a source to a set of loads; fast detection of anomalous behavior indicating a fault in one of the components of the distribution system; generation of a diagnosis (explanation) of the anomalous behavior; isolation of the faulty object from the remainder of the system; and maintenance of the flow of power to critical loads and systems (e.g., life support) despite the presence of fault conditions (recovery). The IPC system evolved out of KATE (Knowledge-based Autonomous Test Engineer), developed at NASA-KSC. KATE consists of a set of software tools for developing and applying structure and behavior models to monitoring, diagnostic, and control applications.
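
    In the spirit of KATE-style model-based monitoring, the sketch below compares a measured load current against a model prediction (here just Ohm's law) and flags deviations; the component values and tolerance are illustrative assumptions, not KATE's models.

        def check_load_current(measured_a, source_v, load_ohms, tol=0.15):
            """Minimal model-based monitoring sketch: compare a measured load
            current against the value predicted by a behavior model (Ohm's law
            here) and flag an anomaly when the deviation exceeds tolerance."""
            predicted_a = source_v / load_ohms
            deviation = abs(measured_a - predicted_a) / predicted_a
            return ("anomaly" if deviation > tol else "nominal", predicted_a)

        print(check_load_current(measured_a=2.3, source_v=28.0, load_ohms=12.0))  # nominal (~2.33 A)
        print(check_load_current(measured_a=0.4, source_v=28.0, load_ohms=12.0))  # anomaly: possible open circuit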

  9. Evaluation of an expert system for fault detection, isolation, and recovery in the manned maneuvering unit

    NASA Technical Reports Server (NTRS)

    Rushby, John; Crow, Judith

    1990-01-01

    The authors explore issues in the specification, verification, and validation of artificial intelligence (AI) based software, using a prototype fault detection, isolation, and recovery (FDIR) system for the Manned Maneuvering Unit (MMU). They use this system as a vehicle for exploring issues in the semantics of C-Language Integrated Production System (CLIPS)-style rule-based languages, the verification of properties relating to safety and reliability, and the static and dynamic analysis of knowledge-based systems. This analysis reveals errors and shortcomings in the MMU FDIR system and raises a number of issues concerning software engineering in CLIPS. The authors came to realize that the MMU FDIR system does not conform to conventional definitions of AI software, despite the fact that it was intended, and indeed presented, as an AI system. The authors discuss this apparent disparity and related questions, such as the role of AI techniques in space and aircraft operations and the suitability of CLIPS for critical applications.
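
    For readers unfamiliar with CLIPS-style semantics, the toy forward-chaining loop below shows the fire-until-fixed-point behavior whose subtleties such evaluations probe; the facts and rules are invented examples, not the MMU FDIR rule base.

        def forward_chain(facts, rules):
            """Toy forward-chaining loop illustrating CLIPS-style semantics:
            fire every rule whose conditions all hold, asserting new facts
            until a fixed point is reached. Rule-ordering subtleties like
            these are one source of the semantic issues examined above."""
            facts = set(facts)
            changed = True
            while changed:
                changed = False
                for conditions, assertion in rules:
                    if conditions <= facts and assertion not in facts:
                        facts.add(assertion)
                        changed = True
            return facts

        rules = [({"gyro_out_of_range"}, "attitude_data_suspect"),
                 ({"attitude_data_suspect", "thruster_firing"}, "inhibit_thruster")]
        print(forward_chain({"gyro_out_of_range", "thruster_firing"}, rules))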

  10. MAX - An advanced parallel computer for space applications

    NASA Technical Reports Server (NTRS)

    Lewis, Blair F.; Bunker, Robert L.

    1991-01-01

    MAX is a fault-tolerant multicomputer hardware and software architecture designed to meet the needs of NASA spacecraft systems. It consists of conventional computing modules (computers) connected via a dual network topology. One network is used to transfer data among the computers and between computers and I/O devices; this network's topology is arbitrary. The second network operates as a broadcast medium for operating system synchronization messages and supports the operating system's Byzantine resilience. A fully distributed operating system supports multitasking in an asynchronous, event- and data-driven environment. A large-grain dataflow paradigm is used to coordinate the multitasking and provide easy control of concurrency; it is the basis of the system's fault tolerance and allows both static and dynamic allocation of tasks. Redundant execution of tasks with software voting of results may be specified for critical tasks. The dataflow paradigm also supports simplified software design, testing, and maintenance. A unique feature is a method for reliably patching code in an executing dataflow application.
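
    The redundant-execution feature can be pictured with a simple software voter; a minimal sketch, assuming each replica returns a comparable value:

        from collections import Counter

        def vote(results):
            """Software voter for redundant task execution: return the majority
            value, or raise if no value achieves a strict majority (a
            Byzantine-style disagreement that must be handled upstream)."""
            value, count = Counter(results).most_common(1)[0]
            if count * 2 <= len(results):
                raise RuntimeError(f"no majority among {results}")
            return value

        print(vote([42, 42, 41]))   # one faulty replica is outvoted -> 42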

  11. Advanced diagnostic system for piston slap faults in IC engines, based on the non-stationary characteristics of the vibration signals

    NASA Astrophysics Data System (ADS)

    Chen, Jian; Randall, Robert Bond; Peeters, Bart

    2016-06-01

    Artificial Neural Networks (ANNs) have the potential to solve the problem of automated diagnostics of piston slap faults, but the critical issue for their successful application is training the network with a large amount of data covering various engine conditions (different speed/load combinations in the normal condition, and different locations/levels of faults). On the other hand, the latest simulation technology provides a useful alternative: the effect of clearance changes may readily be explored without recourse to cutting metal, in order to create enough training data for the ANNs. In this paper, based on existing simplified models of piston slap, advanced multi-body dynamic simulation software was used to simulate piston slap faults under different speed/load and clearance conditions, and the simulation models were validated and updated by a series of experiments. A three-stage network system is proposed to diagnose piston faults: fault detection, fault localisation, and fault severity identification. Multi-Layer Perceptron (MLP) networks were used in the detection stage and the severity/prognosis stage, and a Probabilistic Neural Network (PNN) was used to identify which cylinder has faults. Finally, it was demonstrated that networks trained purely on simulated data can efficiently detect piston slap faults in real tests and identify the location and severity of the faults as well.
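
    A minimal sketch of the detection stage follows, training an MLP on stand-in features; the feature distributions and network size are invented for illustration, whereas the real system uses vibration features derived from the multi-body simulations.

        import numpy as np
        from sklearn.neural_network import MLPClassifier

        # Illustrative stand-in for simulation-derived vibration features:
        # rows are feature vectors, labels are 0 = healthy, 1 = piston slap.
        rng = np.random.default_rng(1)
        X_healthy = rng.normal(0.0, 1.0, size=(200, 8))
        X_faulty = rng.normal(1.5, 1.0, size=(200, 8))   # the fault shifts the features
        X = np.vstack([X_healthy, X_faulty])
        y = np.array([0] * 200 + [1] * 200)

        # Detection stage: an MLP trained purely on the simulated data
        clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0).fit(X, y)
        print(clf.score(X, y))                            # training accuracy on the toy data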

  12. Identification of active fault using analysis of derivatives with vertical second based on gravity anomaly data (Case study: Seulimeum fault in Sumatera fault system)

    NASA Astrophysics Data System (ADS)

    Hududillah, Teuku Hafid; Simanjuntak, Andrean V. H.; Husni, Muhammad

    2017-07-01

    Gravity surveying is a non-destructive geophysical technique with numerous applications in engineering and environmental fields, such as locating fault zones. The purpose of this study is to locate the Seulimeum fault system in Iejue, Aceh Besar (Indonesia) using a gravity technique, to correlate the result with the geologic map, and to identify the trend pattern of the fault system. An estimate of the subsurface geological structure of the Seulimeum fault was obtained from gravity field anomaly data. The gravity anomaly data used in this study are from TOPEX and were processed up to the free-air correction; Bouguer and terrain corrections were then applied to obtain the complete Bouguer anomaly, which is topography dependent. Subsurface modeling was done using the Gav2DC for Windows software. The results show lower residual gravity values in the northern half of the study area than in the southern part, indicating a fault zone. The residual gravity correlates well with the geologic map, which shows the existence of the Seulimeum fault in the study area. Earthquake records can be used to differentiate active from inactive fault elements; they give an indication that the delineated fault elements are active.
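
    The reduction chain can be summarized in code; the sketch below applies the textbook free-air, Bouguer slab, and terrain corrections (in mGal, with an assumed crustal density), which may differ in detail from the authors' workflow.

        def complete_bouguer_anomaly(g_obs_mgal, g_normal_mgal, h_m, terrain_corr_mgal,
                                     rho=2670.0):
            """Textbook reduction assumed here (not necessarily the authors'):
            free-air correction 0.3086*h mGal, Bouguer slab 0.04193*(rho/1000)*h
            mGal for elevation h in metres, plus a terrain correction."""
            free_air = 0.3086 * h_m
            slab = 0.04193 * (rho / 1000.0) * h_m
            return g_obs_mgal - g_normal_mgal + free_air - slab + terrain_corr_mgal

        # Illustrative station at 150 m elevation
        print(complete_bouguer_anomaly(978030.0, 978000.0, h_m=150.0, terrain_corr_mgal=1.2))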

  13. Fault-tolerant communication channel structures

    NASA Technical Reports Server (NTRS)

    Tai, Ann T. (Inventor); Alkalai, Leon (Inventor); Chau, Savio N. (Inventor)

    2006-01-01

    Systems and techniques for implementing fault-tolerant communication channels and features in communication systems. Selected commercial-off-the-shelf devices can be integrated in such systems to reduce the cost.

  14. Effect of off-fault low-velocity elastic inclusions on supershear rupture dynamics

    NASA Astrophysics Data System (ADS)

    Ma, Xiao; Elbanna, A. E.

    2015-10-01

    Heterogeneous velocity structures are expected to affect fault rupture dynamics. To quantitatively evaluate some of these effects, we examine a model of dynamic rupture on a frictional fault embedded in an elastic full space, governed by plane strain elasticity, with a pair of off-fault inclusions that have a lower rigidity than the background medium. We solve the elastodynamic problem using the finite element software PyLith. The fault operates under a linear slip-weakening friction law. We initiate the rupture by artificially overstressing a localized region near the left edge of the fault. We primarily consider embedded soft inclusions with a 20 per cent reduction in both the pressure-wave and shear-wave speeds. The embedded inclusions are placed at different distances from the fault surface and have different sizes. We show that the existence of a soft inclusion may significantly shorten the transition length to supershear propagation through the Burridge-Andrews mechanism. We also observe that supershear rupture is generated at pre-stress values lower than those theoretically predicted for a homogeneous medium. We discuss the implications of our results for dynamic rupture propagation in complex velocity structures, as well as for supershear propagation on understressed faults.
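
    The linear slip-weakening law used on the fault is compact enough to state directly; a minimal sketch with illustrative stress levels and critical slip distance follows.

        def slip_weakening_strength(slip, tau_s, tau_d, d_c):
            """Linear slip-weakening friction: fault strength drops linearly from
            the static level tau_s to the dynamic level tau_d over the critical
            slip distance d_c, then stays at tau_d."""
            if slip >= d_c:
                return tau_d
            return tau_s - (tau_s - tau_d) * slip / d_c

        # Strength (Pa) at increasing slip, with illustrative parameter values
        for s in (0.0, 0.2, 0.4):
            print(s, slip_weakening_strength(s, tau_s=80e6, tau_d=60e6, d_c=0.4))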

  15. The correlation of 2D-resistivity and magnetic methods in fault verification at northern Sumatra, Indonesia

    NASA Astrophysics Data System (ADS)

    Kamaruddin, Nur Aminuda; Saad, Rosli; Nordiana, M. M.; Azwin, I. N.

    2015-04-01

    The Great Sumatra Fault system splits into two sub-parallel lines, or segments, in northern Sumatra. This event is one of the impacts of the powerful earthquakes that have hit Sumatra Island, especially the one that occurred in 2004. These two sub-parallel segments are known as the Aceh and Seulimeum faults. The study focuses on the Seulimeum fault, and two geophysical methods were chosen so that the results obtained from each could be compared and verified. The 2-D resistivity method is commonly used to determine near-surface structures such as faults, cavities, voids, and sinkholes, while the magnetic method is often chosen to delineate subsurface structures, determine the depth of magnetic source bodies, and possibly estimate sediment thickness. Three resistivity survey lines and randomly distributed magnetic stations were carried out covering the Krueng district. The resistivity data were processed using Res2Dinv and the results presented using the Surfer software. The fault is identified by the contrast between low and high resistivity values. The magnetic data were presented as a residual magnetic contour map, and the extension of the fault system is suspected to be represented by the contrast in magnetic anomalies. Within the suspected fault zone, the resistivity results tally with the magnetic results.

  16. Using Combined SFTA and SFMECA Techniques for Space Critical Software

    NASA Astrophysics Data System (ADS)

    Nicodemos, F. G.; Lahoz, C. H. N.; Abdala, M. A. D.; Saotome, O.

    2012-01-01

    This work addresses the combined Software Fault Tree Analysis (SFTA) and Software Failure Modes, Effects and Criticality Analysis (SFMECA) techniques applied to space-critical software of satellite launch vehicles. The combined approach is under research as part of Verification and Validation (V&V) efforts to increase software dependability, and for future application in other projects under development at Instituto de Aeronáutica e Espaço (IAE). The applicability of this approach was assessed on the system software specification and applied in a case study based on the Brazilian Satellite Launcher (VLS). The main goal is to identify possible failure causes and obtain compensating provisions that lead to the inclusion of new functional and non-functional system software requirements.
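
    SFTA ultimately reduces to evaluating AND/OR gate combinations of basic-event probabilities; the toy evaluation below assumes independent events and invented probabilities, purely to illustrate the mechanics.

        def gate_or(*p):
            """Probability that an OR gate fires, assuming independent basic events."""
            q = 1.0
            for pi in p:
                q *= (1.0 - pi)
            return 1.0 - q

        def gate_and(*p):
            """Probability that an AND gate fires, assuming independent basic events."""
            q = 1.0
            for pi in p:
                q *= pi
            return q

        # Toy top event: 'loss of guidance output' if the sensor task fails OR
        # both redundant filter tasks fail. Probabilities are illustrative only.
        p_top = gate_or(1e-4, gate_and(1e-3, 1e-3))
        print(f"P(top event) = {p_top:.3e}")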

  17. Design and evaluation of a fault-tolerant multiprocessor using hardware recovery blocks

    NASA Technical Reports Server (NTRS)

    Lee, Y. H.; Shin, K. G.

    1982-01-01

    A fault-tolerant multiprocessor with a rollback recovery mechanism is discussed. The rollback mechanism is based on the hardware recovery block which is a hardware equivalent to the software recovery block. The hardware recovery block is constructed by consecutive state-save operations and several state-save units in every processor and memory module. When a fault is detected, the multiprocessor reconfigures itself to replace the faulty component and then the process originally assigned to the faulty component retreats to one of the previously saved states in order to resume fault-free execution. A mathematical model is proposed to calculate both the coverage of multi-step rollback recovery and the risk of restart. A performance evaluation in terms of task execution time is also presented.
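
    The rollback idea can be sketched in software terms: save state before a step, run an acceptance test, and retreat to the last saved state on failure. The sketch below is a software analogue of the hardware mechanism described above, with invented state and tests.

        import copy

        def run_with_rollback(state, step, acceptance_test, max_saves=4):
            """Save state before each step; if the acceptance test fails,
            retreat to the most recent saved state and report a rollback
            instead of propagating the fault (cf. consecutive state-saves)."""
            saves = [copy.deepcopy(state)]
            new_state = step(state)
            if acceptance_test(new_state):
                saves.append(copy.deepcopy(new_state))
                del saves[:-max_saves]          # keep a bounded number of save states
                return new_state, False
            return saves[-1], True              # rollback to the last good state

        state = {"x": 1.0}
        print(run_with_rollback(state, lambda s: {"x": s["x"] + 1}, lambda s: s["x"] < 10))
        print(run_with_rollback(state, lambda s: {"x": float("nan")}, lambda s: s["x"] < 10))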

  18. Havens: Explicit Reliable Memory Regions for HPC Applications

    DOE Office of Scientific and Technical Information (OSTI.GOV)

    Hukerikar, Saurabh; Engelmann, Christian

    2016-01-01

    Supporting error resilience in future exascale-class supercomputing systems is a critical challenge. Due to transistor scaling trends and increasing memory density, scientific simulations are expected to experience more interruptions caused by transient errors in system memory. Existing hardware-based detection and recovery techniques will be inadequate to manage high memory fault rates. In this paper we propose a partial memory protection scheme based on region-based memory management. We define the concept of regions called havens that provide fault protection for program objects, and we provide reliability for the regions through a software-based parity protection mechanism. Our approach enables critical program objects to be placed in these havens. The fault coverage provided by our approach is application agnostic, unlike algorithm-based fault tolerance techniques.
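
    A toy software analogue of parity-protected havens follows: a parity word is kept per stored object and checked on every read. The class and interface are invented for illustration; a real implementation would operate on raw memory pages rather than Python containers.

        class Haven:
            """Toy parity-protected region: one XOR parity byte per object,
            verified on every read, so a corrupted read raises instead of
            silently returning bad data."""
            def __init__(self):
                self._data, self._parity = {}, {}

            @staticmethod
            def _parity_of(buf):
                p = 0
                for b in buf:
                    p ^= b
                return p

            def put(self, key, buf):
                self._data[key] = bytearray(buf)
                self._parity[key] = self._parity_of(buf)

            def get(self, key):
                buf = self._data[key]
                if self._parity_of(buf) != self._parity[key]:
                    raise MemoryError(f"parity mismatch in haven object {key!r}")
                return bytes(buf)

        h = Haven()
        h.put("grid", b"\x01\x02\x03")
        h._data["grid"][1] ^= 0x10        # simulate a transient bit flip
        try:
            h.get("grid")
        except MemoryError as e:
            print(e)                       # fault detected on read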

  19. Portable Automated Test Station: Using Engineering-Design Partnerships to Replace Obsolete Test Systems

    DTIC Science & Technology

    2015-04-01

    troubleshooting avionics system faults while the aircraft is on the ground. The core component of the PATS-30, the ruggedized laptop, is no longer sustainable... The PATS-70 utilizes up-to-date, sustainable technology for... Operational Flight Program (OFP) software loading and diagnostic avionics system testing and includes additional TPSs to enhance its capability

  20. MAGMA: A Liquid Software Approach to Fault Tolerance, Computer Network Security, and Survivable Networking

    DTIC Science & Technology

    2001-12-01

