multiprocessor computing system: Topics by Science.gov

Sample records for multiprocessor computing system

Programmable partitioning for high-performance coherence domains in a multiprocessor system

DOEpatents

Blumrich, Matthias A [Ridgefield, CT; Salapura, Valentina [Chappaqua, NY

2011-01-25

A multiprocessor computing system and a method of logically partitioning a multiprocessor computing system are disclosed. The multiprocessor computing system comprises a multitude of processing units, and a multitude of snoop units. Each of the processing units includes a local cache, and the snoop units are provided for supporting cache coherency in the multiprocessor system. Each of the snoop units is connected to a respective one of the processing units and to all of the other snoop units. The multiprocessor computing system further includes a partitioning system for using the snoop units to partition the multitude of processing units into a plurality of independent, memory-consistent, adjustable-size processing groups. Preferably, when the processor units are partitioned into these processing groups, the partitioning system also configures the snoop units to maintain cache coherency within each of said groups.
Insertion of coherence requests for debugging a multiprocessor

DOEpatents

Blumrich, Matthias A.; Salapura, Valentina

2010-02-23

A method and system are disclosed to insert coherence events in a multiprocessor computer system, and to present those coherence events to the processors of the multiprocessor computer system for analysis and debugging purposes. The coherence events are inserted in the computer system by adding one or more special insert registers. By writing into the insert registers, coherence events are inserted in the multiprocessor system as if they were generated by the normal coherence protocol. Once these coherence events are processed, the processing of coherence events can continue in the normal operation mode.
File-System Workload on a Scientific Multiprocessor

NASA Technical Reports Server (NTRS)

Kotz, David; Nieuwejaar, Nils

1995-01-01

Many scientific applications have intense computational and I/O requirements. Although multiprocessors have permitted astounding increases in computational performance, the formidable I/O needs of these applications cannot be met by current multiprocessors a their I/O subsystems. To prevent I/O subsystems from forever bottlenecking multiprocessors and limiting the range of feasible applications, new I/O subsystems must be designed. The successful design of computer systems (both hardware and software) depends on a thorough understanding of their intended use. A system designer optimizes the policies and mechanisms for the cases expected to most common in the user's workload. In the case of multiprocessor file systems, however, designers have been forced to build file systems based only on speculation about how they would be used, extrapolating from file-system characterizations of general-purpose workloads on uniprocessor and distributed systems or scientific workloads on vector supercomputers (see sidebar on related work). To help these system designers, in June 1993 we began the Charisma Project, so named because the project sought to characterize 1/0 in scientific multiprocessor applications from a variety of production parallel computing platforms and sites. The Charisma project is unique in recording individual read and write requests-in live, multiprogramming, parallel workloads (rather than from selected or nonparallel applications). In this article, we present the first results from the project: a characterization of the file-system workload an iPSC/860 multiprocessor running production, parallel scientific applications at NASA's Ames Research Center.
Two Fundamental Issues in Multiprocessing.

DTIC Science & Technology

1987-10-01

Structural Model of a Multiprocessor 6 Figure 5: Operational Model of a Multiprocessor 7 Figure 6: The von Neumann Processor (from Gajski and Peir [201) 10...Computer Society, June, 1983. 20. Gajski , D. D. & J-K. Peir. "Essential Issues in Multiprocessor Systems". Computer 18, 6 (June 1985), 9-27. 21. Gurd
Modelling parallel programs and multiprocessor architectures with AXE

NASA Technical Reports Server (NTRS)

Yan, Jerry C.; Fineman, Charles E.

1991-01-01

AXE, An Experimental Environment for Parallel Systems, was designed to model and simulate for parallel systems at the process level. It provides an integrated environment for specifying computation models, multiprocessor architectures, data collection, and performance visualization. AXE is being used at NASA-Ames for developing resource management strategies, parallel problem formulation, multiprocessor architectures, and operating system issues related to the High Performance Computing and Communications Program. AXE's simple, structured user-interface enables the user to model parallel programs and machines precisely and efficiently. Its quick turn-around time keeps the user interested and productive. AXE models multicomputers. The user may easily modify various architectural parameters including the number of sites, connection topologies, and overhead for operating system activities. Parallel computations in AXE are represented as collections of autonomous computing objects known as players. Their use and behavior is described. Performance data of the multiprocessor model can be observed on a color screen. These include CPU and message routing bottlenecks, and the dynamic status of the software.
A large-grain mapping approach for multiprocessor systems through data flow model. Ph.D. Thesis

NASA Technical Reports Server (NTRS)

Kim, Hwa-Soo

1991-01-01

A large-grain level mapping method is presented of numerical oriented applications onto multiprocessor systems. The method is based on the large-grain data flow representation of the input application and it assumes a general interconnection topology of the multiprocessor system. The large-grain data flow model was used because such representation best exhibits inherited parallelism in many important applications, e.g., CFD models based on partial differential equations can be presented in large-grain data flow format, very effectively. A generalized interconnection topology of the multiprocessor architecture is considered, including such architectural issues as interprocessor communication cost, with the aim to identify the 'best matching' between the application and the multiprocessor structure. The objective is to minimize the total execution time of the input algorithm running on the target system. The mapping strategy consists of the following: (1) large-grain data flow graph generation from the input application using compilation techniques; (2) data flow graph partitioning into basic computation blocks; and (3) physical mapping onto the target multiprocessor using a priority allocation scheme for the computation blocks.
Shared versus distributed memory multiprocessors

NASA Technical Reports Server (NTRS)

Jordan, Harry F.

1991-01-01

The question of whether multiprocessors should have shared or distributed memory has attracted a great deal of attention. Some researchers argue strongly for building distributed memory machines, while others argue just as strongly for programming shared memory multiprocessors. A great deal of research is underway on both types of parallel systems. Special emphasis is placed on systems with a very large number of processors for computation intensive tasks and considers research and implementation trends. It appears that the two types of systems will likely converge to a common form for large scale multiprocessors.
High Performance, Dependable Multiprocessor

NASA Technical Reports Server (NTRS)

Ramos, Jeremy; Samson, John R.; Troxel, Ian; Subramaniyan, Rajagopal; Jacobs, Adam; Greco, James; Cieslewski, Grzegorz; Curreri, John; Fischer, Michael; Grobelny, Eric;

2006-01-01

With the ever increasing demand for higher bandwidth and processing capacity of today's space exploration, space science, and defense missions, the ability to efficiently apply commercial-off-the-shelf (COTS) processors for on-board computing is now a critical need. In response to this need, NASA's New Millennium Program office has commissioned the development of Dependable Multiprocessor (DM) technology for use in payload and robotic missions. The Dependable Multiprocessor technology is a COTS-based, power efficient, high performance, highly dependable, fault tolerant cluster computer. To date, Honeywell has successfully demonstrated a TRL4 prototype of the Dependable Multiprocessor [I], and is now working on the development of a TRLS prototype. For the present effort Honeywell has teamed up with the University of Florida's High-performance Computing and Simulation (HCS) Lab, and together the team has demonstrated major elements of the Dependable Multiprocessor TRLS system.

Queueing analysis of a canonical model of real-time multiprocessors

NASA Technical Reports Server (NTRS)

Krishna, C. M.; Shin, K. G.

1983-01-01

A logical classification of multiprocessor structures from the point of view of control applications is presented. A computation of the response time distribution for a canonical model of a real time multiprocessor is presented. The multiprocessor is approximated by a blocking model. Two separate models are derived: one created from the system's point of view, and the other from the point of view of an incoming task.
Dataflow computing approach in high-speed digital simulation

NASA Technical Reports Server (NTRS)

Ercegovac, M. D.; Karplus, W. J.

1984-01-01

New computational tools and methodologies for the digital simulation of continuous systems were explored. Programmability, and cost effective performance in multiprocessor organizations for real time simulation was investigated. Approach is based on functional style languages and data flow computing principles, which allow for the natural representation of parallelism in algorithms and provides a suitable basis for the design of cost effective high performance distributed systems. The objectives of this research are to: (1) perform comparative evaluation of several existing data flow languages and develop an experimental data flow language suitable for real time simulation using multiprocessor systems; (2) investigate the main issues that arise in the architecture and organization of data flow multiprocessors for real time simulation; and (3) develop and apply performance evaluation models in typical applications.
Operating system for a real-time multiprocessor propulsion system simulator. User's manual

NASA Technical Reports Server (NTRS)

Cole, G. L.

1985-01-01

The NASA Lewis Research Center is developing and evaluating experimental hardware and software systems to help meet future needs for real-time, high-fidelity simulations of air-breathing propulsion systems. Specifically, the real-time multiprocessor simulator project focuses on the use of multiple microprocessors to achieve the required computing speed and accuracy at relatively low cost. Operating systems for such hardware configurations are generally not available. A real time multiprocessor operating system (RTMPOS) that supports a variety of multiprocessor configurations was developed at Lewis. With some modification, RTMPOS can also support various microprocessors. RTMPOS, by means of menus and prompts, provides the user with a versatile, user-friendly environment for interactively loading, running, and obtaining results from a multiprocessor-based simulator. The menu functions are described and an example simulation session is included to demonstrate the steps required to go from the simulation loading phase to the execution phase.
Multiprocessor architectural study

NASA Technical Reports Server (NTRS)

Kosmala, A. L.; Stanten, S. F.; Vandever, W. H.

1972-01-01

An architectural design study was made of a multiprocessor computing system intended to meet functional and performance specifications appropriate to a manned space station application. Intermetrics, previous experience, and accumulated knowledge of the multiprocessor field is used to generate a baseline philosophy for the design of a future SUMC* multiprocessor. Interrupts are defined and the crucial questions of interrupt structure, such as processor selection and response time, are discussed. Memory hierarchy and performance is discussed extensively with particular attention to the design approach which utilizes a cache memory associated with each processor. The ability of an individual processor to approach its theoretical maximum performance is then analyzed in terms of a hit ratio. Memory management is envisioned as a virtual memory system implemented either through segmentation or paging. Addressing is discussed in terms of various register design adopted by current computers and those of advanced design.
Backend Control Processor for a Multi-Processor Relational Database Computer System.

DTIC Science & Technology

1984-12-01

SCHOOL OF ENGI. UNCRSIFID MPONTIFF DEC 84 AFXT/GCS/ENG/84D-22 F/O 9/2 L ommhhhhmhhml mhhhommhhhhhm i-2 8 -- U0. 11111= Q. 2 111.8IIII- 1111111..6...THESIS Presented to the Faculty of the School of Engineering of the Air Force Institute of Technology Air University In Partial Fulfillment of the...development of a Backend Multi-Processor Relational Database Computer System. This thesis addresses a single component of this system, the Backend Control
Multiprocessor programming environment

DOE Office of Scientific and Technical Information (OSTI.GOV)

Smith, M.B.; Fornaro, R.

Programming tools and techniques have been well developed for traditional uniprocessor computer systems. The focus of this research project is on the development of a programming environment for a high speed real time heterogeneous multiprocessor system, with special emphasis on languages and compilers. The new tools and techniques will allow a smooth transition for programmers with experience only on single processor systems.
Static Scheduler for Hard Real-Time Tasks on Multiprocessor Systems

DTIC Science & Technology

1992-09-01

Foundation of Computer Science, 1980 . [SIM83] Simons, B., "Multiprocessor Scheduling of Unit-Time Jobs with Arbitrary Release Times and Deadlines", SIAM...Research Office Attn: Dr. David Hislop P. O. Box 12211 Research Triangle Park, NC 27709-2211 31. Persistent Data Systems 75 W. Chapel Ridge Road Attn: Dr
The force on the flex: Global parallelism and portability

NASA Technical Reports Server (NTRS)

Jordan, H. F.

1986-01-01

A parallel programming methodology, called the force, supports the construction of programs to be executed in parallel by an unspecified, but potentially large, number of processes. The methodology was originally developed on a pipelined, shared memory multiprocessor, the Denelcor HEP, and embodies the primitive operations of the force in a set of macros which expand into multiprocessor Fortran code. A small set of primitives is sufficient to write large parallel programs, and the system has been used to produce 10,000 line programs in computational fluid dynamics. The level of complexity of the force primitives is intermediate. It is high enough to mask detailed architectural differences between multiprocessors but low enough to give the user control over performance. The system is being ported to a medium scale multiprocessor, the Flex/32, which is a 20 processor system with a mixture of shared and local memory. Memory organization and the type of processor synchronization supported by the hardware on the two machines lead to some differences in efficient implementations of the force primitives, but the user interface remains the same. An initial implementation was done by retargeting the macros to Flexible Computer Corporation's ConCurrent C language. Subsequently, the macros were caused to directly produce the system calls which form the basis for ConCurrent C. The implementation of the Fortran based system is in step with Flexible Computer Corporations's implementation of a Fortran system in the parallel environment.
FTMP - A highly reliable Fault-Tolerant Multiprocessor for aircraft

NASA Technical Reports Server (NTRS)

Hopkins, A. L., Jr.; Smith, T. B., III; Lala, J. H.

1978-01-01

The FTMP (Fault-Tolerant Multiprocessor) is a complex multiprocessor computer that employs a form of redundancy related to systems considered by Mathur (1971), in which each major module can substitute for any other module of the same type. Despite the conceptual simplicity of the redundancy form, the implementation has many intricacies owing partly to the low target failure rate, and partly to the difficulty of eliminating single-fault vulnerability. An extensive analysis of the computer through the use of such modeling techniques as Markov processes and combinatorial mathematics shows that for random hard faults the computer can meet its requirements. It is also shown that the maintenance scheduled at intervals of 200 hr or more can be adequate most of the time.
A multiarchitecture parallel-processing development environment

NASA Technical Reports Server (NTRS)

Townsend, Scott; Blech, Richard; Cole, Gary

1993-01-01

A description is given of the hardware and software of a multiprocessor test bed - the second generation Hypercluster system. The Hypercluster architecture consists of a standard hypercube distributed-memory topology, with multiprocessor shared-memory nodes. By using standard, off-the-shelf hardware, the system can be upgraded to use rapidly improving computer technology. The Hypercluster's multiarchitecture nature makes it suitable for researching parallel algorithms in computational field simulation applications (e.g., computational fluid dynamics). The dedicated test-bed environment of the Hypercluster and its custom-built software allows experiments with various parallel-processing concepts such as message passing algorithms, debugging tools, and computational 'steering'. Such research would be difficult, if not impossible, to achieve on shared, commercial systems.
Dynamic file-access characteristics of a production parallel scientific workload

NASA Technical Reports Server (NTRS)

Kotz, David; Nieuwejaar, Nils

1994-01-01

Multiprocessors have permitted astounding increases in computational performance, but many cannot meet the intense I/O requirements of some scientific applications. An important component of any solution to this I/O bottleneck is a parallel file system that can provide high-bandwidth access to tremendous amounts of data in parallel to hundreds or thousands of processors. Most successful systems are based on a solid understanding of the expected workload, but thus far there have been no comprehensive workload characterizations of multiprocessor file systems. This paper presents the results of a three week tracing study in which all file-related activity on a massively parallel computer was recorded. Our instrumentation differs from previous efforts in that it collects information about every I/O request and about the mix of jobs running in a production environment. We also present the results of a trace-driven caching simulation and recommendations for designers of multiprocessor file systems.
HyperForest: A high performance multi-processor architecture for real-time intelligent systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Garcia, P. Jr.; Rebeil, J.P.; Pollard, H.

1997-04-01

Intelligent Systems are characterized by the intensive use of computer power. The computer revolution of the last few years is what has made possible the development of the first generation of Intelligent Systems. Software for second generation Intelligent Systems will be more complex and will require more powerful computing engines in order to meet real-time constraints imposed by new robots, sensors, and applications. A multiprocessor architecture was developed that merges the advantages of message-passing and shared-memory structures: expendability and real-time compliance. The HyperForest architecture will provide an expandable real-time computing platform for computationally intensive Intelligent Systems and open the doorsmore » for the application of these systems to more complex tasks in environmental restoration and cleanup projects, flexible manufacturing systems, and DOE`s own production and disassembly activities.« less

C-MOS array design techniques: SUMC multiprocessor system study

NASA Technical Reports Server (NTRS)

Clapp, W. A.; Helbig, W. A.; Merriam, A. S.

1972-01-01

The current capabilities of LSI techniques for speed and reliability, plus the possibilities of assembling large configurations of LSI logic and storage elements, have demanded the study of multiprocessors and multiprocessing techniques, problems, and potentialities. Evaluated are three previous systems studies for a space ultrareliable modular computer multiprocessing system, and a new multiprocessing system is proposed that is flexibly configured with up to four central processors, four 1/0 processors, and 16 main memory units, plus auxiliary memory and peripheral devices. This multiprocessor system features a multilevel interrupt, qualified S/360 compatibility for ground-based generation of programs, virtual memory management of a storage hierarchy through 1/0 processors, and multiport access to multiple and shared memory units.
Multiprocessor computer overset grid method and apparatus

DOEpatents

Barnette, Daniel W.; Ober, Curtis C.

2003-01-01

A multiprocessor computer overset grid method and apparatus comprises associating points in each overset grid with processors and using mapped interpolation transformations to communicate intermediate values between processors assigned base and target points of the interpolation transformations. The method allows a multiprocessor computer to operate with effective load balance on overset grid applications.
A multiprocessor computer simulation model employing a feedback scheduler/allocator for memory space and bandwidth matching and TMR processing

NASA Technical Reports Server (NTRS)

Bradley, D. B.; Irwin, J. D.

1974-01-01

A computer simulation model for a multiprocessor computer is developed that is useful for studying the problem of matching multiprocessor's memory space, memory bandwidth and numbers and speeds of processors with aggregate job set characteristics. The model assumes an input work load of a set of recurrent jobs. The model includes a feedback scheduler/allocator which attempts to improve system performance through higher memory bandwidth utilization by matching individual job requirements for space and bandwidth with space availability and estimates of bandwidth availability at the times of memory allocation. The simulation model includes provisions for specifying precedence relations among the jobs in a job set, and provisions for specifying precedence execution of TMR (Triple Modular Redundant and SIMPLEX (non redundant) jobs.
Development and evaluation of a fault-tolerant multiprocessor (FTMP) computer. Volume 1: FTMP principles of operation

NASA Technical Reports Server (NTRS)

Smith, T. B., Jr.; Lala, J. H.

1983-01-01

The basic organization of the fault tolerant multiprocessor, (FTMP) is that of a general purpose homogeneous multiprocessor. Three processors operate on a shared system (memory and I/O) bus. Replication and tight synchronization of all elements and hardware voting is employed to detect and correct any single fault. Reconfiguration is then employed to repair a fault. Multiple faults may be tolerated as a sequence of single faults with repair between fault occurrences.
Bibliography On Multiprocessors And Distributed Processing

NASA Technical Reports Server (NTRS)

Miya, Eugene N.

1988-01-01

Multiprocessor and Distributed Processing Bibliography package consists of large machine-readable bibliographic data base, which in addition to usual keyword searches, used for producing citations, indexes, and cross-references. Data base contains UNIX(R) "refer" -formatted ASCII data and implemented on any computer running under UNIX(R) operating system. Easily convertible to other operating systems. Requires approximately one megabyte of secondary storage. Bibliography compiled in 1985.
A language comparison for scientific computing on MIMD architectures

NASA Technical Reports Server (NTRS)

Jones, Mark T.; Patrick, Merrell L.; Voigt, Robert G.

1989-01-01

Choleski's method for solving banded symmetric, positive definite systems is implemented on a multiprocessor computer using three FORTRAN based parallel programming languages, the Force, PISCES and Concurrent FORTRAN. The capabilities of the language for expressing parallelism and their user friendliness are discussed, including readability of the code, debugging assistance offered, and expressiveness of the languages. The performance of the different implementations is compared. It is argued that PISCES, using the Force for medium-grained parallelism, is the appropriate choice for programming Choleski's method on the multiprocessor computer, Flex/32.
Conference on Real-Time Computer Applications in Nuclear, Particle and Plasma Physics, 6th, Williamsburg, VA, May 15-19, 1989, Proceedings

NASA Technical Reports Server (NTRS)

Pordes, Ruth (Editor)

1989-01-01

Papers on real-time computer applications in nuclear, particle, and plasma physics are presented, covering topics such as expert systems tactics in testing FASTBUS segment interconnect modules, trigger control in a high energy physcis experiment, the FASTBUS read-out system for the Aleph time projection chamber, a multiprocessor data acquisition systems, DAQ software architecture for Aleph, a VME multiprocessor system for plasma control at the JT-60 upgrade, and a multiasking, multisinked, multiprocessor data acquisition front end. Other topics include real-time data reduction using a microVAX processor, a transputer based coprocessor for VEDAS, simulation of a macropipelined multi-CPU event processor for use in FASTBUS, a distributed VME control system for the LISA superconducting Linac, a distributed system for laboratory process automation, and a distributed system for laboratory process automation. Additional topics include a structure macro assembler for the event handler, a data acquisition and control system for Thomson scattering on ATF, remote procedure execution software for distributed systems, and a PC-based graphic display real-time particle beam uniformity.
Design of a modular digital computer system, DRL 4. [for meeting future requirements of spaceborne computers

NASA Technical Reports Server (NTRS)

1972-01-01

The design is reported of an advanced modular computer system designated the Automatically Reconfigurable Modular Multiprocessor System, which anticipates requirements for higher computing capacity and reliability for future spaceborne computers. Subjects discussed include: an overview of the architecture, mission analysis, synchronous and nonsynchronous scheduling control, reliability, and data transmission.
Real-Time Multiprocessor Programming Language (RTMPL) user's manual

NASA Technical Reports Server (NTRS)

Arpasi, D. J.

1985-01-01

A real-time multiprocessor programming language (RTMPL) has been developed to provide for high-order programming of real-time simulations on systems of distributed computers. RTMPL is a structured, engineering-oriented language. The RTMPL utility supports a variety of multiprocessor configurations and types by generating assembly language programs according to user-specified targeting information. Many programming functions are assumed by the utility (e.g., data transfer and scaling) to reduce the programming chore. This manual describes RTMPL from a user's viewpoint. Source generation, applications, utility operation, and utility output are detailed. An example simulation is generated to illustrate many RTMPL features.
A simple modern correctness condition for a space-based high-performance multiprocessor

NASA Technical Reports Server (NTRS)

Probst, David K.; Li, Hon F.

1992-01-01

A number of U.S. national programs, including space-based detection of ballistic missile launches, envisage putting significant computing power into space. Given sufficient progress in low-power VLSI, multichip-module packaging and liquid-cooling technologies, we will see design of high-performance multiprocessors for individual satellites. In very high speed implementations, performance depends critically on tolerating large latencies in interprocessor communication; without latency tolerance, performance is limited by the vastly differing time scales in processor and data-memory modules, including interconnect times. The modern approach to tolerating remote-communication cost in scalable, shared-memory multiprocessors is to use a multithreaded architecture, and alter the semantics of shared memory slightly, at the price of forcing the programmer either to reason about program correctness in a relaxed consistency model or to agree to program in a constrained style. The literature on multiprocessor correctness conditions has become increasingly complex, and sometimes confusing, which may hinder its practical application. We propose a simple modern correctness condition for a high-performance, shared-memory multiprocessor; the correctness condition is based on a simple interface between the multiprocessor architecture and a high-performance, shared-memory multiprocessor; the correctness condition is based on a simple interface between the multiprocessor architecture and the parallel programming system.
Event parallelism: Distributed memory parallel computing for high energy physics experiments

NASA Astrophysics Data System (ADS)

Nash, Thomas

1989-12-01

This paper describes the present and expected future development of distributed memory parallel computers for high energy physics experiments. It covers the use of event parallel microprocessor farms, particularly at Fermilab, including both ACP multiprocessors and farms of MicroVAXES. These systems have proven very cost effective in the past. A case is made for moving to the more open environment of UNIX and RISC processors. The 2nd Generation ACP Multiprocessor System, which is based on powerful RISC system, is described. Given the promise of still more extraordinary increases in processor performance, a new emphasis on point to point, rather than bussed, communication will be required. Developments in this direction are described.
Parallel implementation and evaluation of motion estimation system algorithms on a distributed memory multiprocessor using knowledge based mappings

NASA Technical Reports Server (NTRS)

Choudhary, Alok Nidhi; Leung, Mun K.; Huang, Thomas S.; Patel, Janak H.

1989-01-01

Several techniques to perform static and dynamic load balancing techniques for vision systems are presented. These techniques are novel in the sense that they capture the computational requirements of a task by examining the data when it is produced. Furthermore, they can be applied to many vision systems because many algorithms in different systems are either the same, or have similar computational characteristics. These techniques are evaluated by applying them on a parallel implementation of the algorithms in a motion estimation system on a hypercube multiprocessor system. The motion estimation system consists of the following steps: (1) extraction of features; (2) stereo match of images in one time instant; (3) time match of images from different time instants; (4) stereo match to compute final unambiguous points; and (5) computation of motion parameters. It is shown that the performance gains when these data decomposition and load balancing techniques are used are significant and the overhead of using these techniques is minimal.
Systems analysis of the space shuttle. [communication systems, computer systems, and power distribution

NASA Technical Reports Server (NTRS)

Schilling, D. L.; Oh, S. J.; Thau, F.

1975-01-01

Developments in communications systems, computer systems, and power distribution systems for the space shuttle are described. The use of high speed delta modulation for bit rate compression in the transmission of television signals is discussed. Simultaneous Multiprocessor Organization, an approach to computer organization, is presented. Methods of computer simulation and automatic malfunction detection for the shuttle power distribution system are also described.
Solution of large nonlinear quasistatic structural mechanics problems on distributed-memory multiprocessor computers

DOE Office of Scientific and Technical Information (OSTI.GOV)

Blanford, M.

1997-12-31

Most commercially-available quasistatic finite element programs assemble element stiffnesses into a global stiffness matrix, then use a direct linear equation solver to obtain nodal displacements. However, for large problems (greater than a few hundred thousand degrees of freedom), the memory size and computation time required for this approach becomes prohibitive. Moreover, direct solution does not lend itself to the parallel processing needed for today`s multiprocessor systems. This talk gives an overview of the iterative solution strategy of JAS3D, the nonlinear large-deformation quasistatic finite element program. Because its architecture is derived from an explicit transient-dynamics code, it does not ever assemblemore » a global stiffness matrix. The author describes the approach he used to implement the solver on multiprocessor computers, and shows examples of problems run on hundreds of processors and more than a million degrees of freedom. Finally, he describes some of the work he is presently doing to address the challenges of iterative convergence for ill-conditioned problems.« less
Modeling and measurement of fault-tolerant multiprocessors

NASA Technical Reports Server (NTRS)

Shin, K. G.; Woodbury, M. H.; Lee, Y. H.

1985-01-01

The workload effects on computer performance are addressed first for a highly reliable unibus multiprocessor used in real-time control. As an approach to studing these effects, a modified Stochastic Petri Net (SPN) is used to describe the synchronous operation of the multiprocessor system. From this model the vital components affecting performance can be determined. However, because of the complexity in solving the modified SPN, a simpler model, i.e., a closed priority queuing network, is constructed that represents the same critical aspects. The use of this model for a specific application requires the partitioning of the workload into job classes. It is shown that the steady state solution of the queuing model directly produces useful results. The use of this model in evaluating an existing system, the Fault Tolerant Multiprocessor (FTMP) at the NASA AIRLAB, is outlined with some experimental results. Also addressed is the technique of measuring fault latency, an important microscopic system parameter. Most related works have assumed no or a negligible fault latency and then performed approximate analyses. To eliminate this deficiency, a new methodology for indirectly measuring fault latency is presented.
Distributed parallel messaging for multiprocessor systems

DOEpatents

Chen, Dong; Heidelberger, Philip; Salapura, Valentina; Senger, Robert M; Steinmacher-Burrow, Burhard; Sugawara, Yutaka

2013-06-04

A method and apparatus for distributed parallel messaging in a parallel computing system. The apparatus includes, at each node of a multiprocessor network, multiple injection messaging engine units and reception messaging engine units, each implementing a DMA engine and each supporting both multiple packet injection into and multiple reception from a network, in parallel. The reception side of the messaging unit (MU) includes a switch interface enabling writing of data of a packet received from the network to the memory system. The transmission side of the messaging unit, includes switch interface for reading from the memory system when injecting packets into the network.
Job-mix modeling and system analysis of an aerospace multiprocessor.

NASA Technical Reports Server (NTRS)

Mallach, E. G.

1972-01-01

An aerospace guidance computer organization, consisting of multiple processors and memory units attached to a central time-multiplexed data bus, is described. A job mix for this type of computer is obtained by analysis of Apollo mission programs. Multiprocessor performance is then analyzed using: 1) queuing theory, under certain 'limiting case' assumptions; 2) Markov process methods; and 3) system simulation. Results of the analyses indicate: 1) Markov process analysis is a useful and efficient predictor of simulation results; 2) efficient job execution is not seriously impaired even when the system is so overloaded that new jobs are inordinately delayed in starting; 3) job scheduling is significant in determining system performance; and 4) a system having many slow processors may or may not perform better than a system of equal power having few fast processors, but will not perform significantly worse.
Instrumentation, performance visualization, and debugging tools for multiprocessors

NASA Technical Reports Server (NTRS)

Yan, Jerry C.; Fineman, Charles E.; Hontalas, Philip J.

1991-01-01

The need for computing power has forced a migration from serial computation on a single processor to parallel processing on multiprocessor architectures. However, without effective means to monitor (and visualize) program execution, debugging, and tuning parallel programs becomes intractably difficult as program complexity increases with the number of processors. Research on performance evaluation tools for multiprocessors is being carried out at ARC. Besides investigating new techniques for instrumenting, monitoring, and presenting the state of parallel program execution in a coherent and user-friendly manner, prototypes of software tools are being incorporated into the run-time environments of various hardware testbeds to evaluate their impact on user productivity. Our current tool set, the Ames Instrumentation Systems (AIMS), incorporates features from various software systems developed in academia and industry. The execution of FORTRAN programs on the Intel iPSC/860 can be automatically instrumented and monitored. Performance data collected in this manner can be displayed graphically on workstations supporting X-Windows. We have successfully compared various parallel algorithms for computational fluid dynamics (CFD) applications in collaboration with scientists from the Numerical Aerodynamic Simulation Systems Division. By performing these comparisons, we show that performance monitors and debuggers such as AIMS are practical and can illuminate the complex dynamics that occur within parallel programs.
Scalable Multiprocessor for High-Speed Computing in Space

NASA Technical Reports Server (NTRS)

Lux, James; Lang, Minh; Nishimoto, Kouji; Clark, Douglas; Stosic, Dorothy; Bachmann, Alex; Wilkinson, William; Steffke, Richard

2004-01-01

A report discusses the continuing development of a scalable multiprocessor computing system for hard real-time applications aboard a spacecraft. "Hard realtime applications" signifies applications, like real-time radar signal processing, in which the data to be processed are generated at "hundreds" of pulses per second, each pulse "requiring" millions of arithmetic operations. In these applications, the digital processors must be tightly integrated with analog instrumentation (e.g., radar equipment), and data input/output must be synchronized with analog instrumentation, controlled to within fractions of a microsecond. The scalable multiprocessor is a cluster of identical commercial-off-the-shelf generic DSP (digital-signal-processing) computers plus generic interface circuits, including analog-to-digital converters, all controlled by software. The processors are computers interconnected by high-speed serial links. Performance can be increased by adding hardware modules and correspondingly modifying the software. Work is distributed among the processors in a parallel or pipeline fashion by means of a flexible master/slave control and timing scheme. Each processor operates under its own local clock; synchronization is achieved by broadcasting master time signals to all the processors, which compute offsets between the master clock and their local clocks.
Operating system for a real-time multiprocessor propulsion system simulator

NASA Technical Reports Server (NTRS)

Cole, G. L.

1984-01-01

The success of the Real Time Multiprocessor Operating System (RTMPOS) in the development and evaluation of experimental hardware and software systems for real time interactive simulation of air breathing propulsion systems was evaluated. The Real Time Multiprocessor Operating System (RTMPOS) provides the user with a versatile, interactive means for loading, running, debugging and obtaining results from a multiprocessor based simulator. A front end processor (FEP) serves as the simulator controller and interface between the user and the simulator. These functions are facilitated by the RTMPOS which resides on the FEP. The RTMPOS acts in conjunction with the FEP's manufacturer supplied disk operating system that provides typical utilities like an assembler, linkage editor, text editor, file handling services, etc. Once a simulation is formulated, the RTMPOS provides for engineering level, run time operations such as loading, modifying and specifying computation flow of programs, simulator mode control, data handling and run time monitoring. Run time monitoring is a powerful feature of RTMPOS that allows the user to record all actions taken during a simulation session and to receive advisories from the simulator via the FEP. The RTMPOS is programmed mainly in PASCAL along with some assembly language routines. The RTMPOS software is easily modified to be applicable to hardware from different manufacturers.

Experience with a UNIX based batch computing facility for H1

DOE Office of Scientific and Technical Information (OSTI.GOV)

Gerhards, R.; Kruener-Marquis, U.; Szkutnik, Z.

1994-12-31

A UNIX based batch computing facility for the H1 experiment at DESY is described. The ultimate goal is to replace the DESY IBM mainframe by a multiprocessor SGI Challenge series computer, using the UNIX operating system, for most of the computing tasks in H1.
Operating experience with a VMEbus multiprocessor system for data acquisition and reduction in nuclear physics

NASA Astrophysics Data System (ADS)

Kutt, P. H.; Balamuth, D. P.

1989-10-01

Summary form only given, as follows. A multiprocessor system based on commercially available VMEbus components has been developed for the acquisition and reduction of event-mode data in nuclear physics experiments. The system contains seven 68000 CPUs and 14 Mbyte of memory. A minimal operating system handles data transfer and task allocation, and a compiler for a specially designed event analysis language produces code for the processors. The system has been in operation for four years at the University of Pennsylvania Tandem Accelerator Laboratory. Computation rates over three times that of a MicroVAX II have been achieved at a fraction of the cost. The use of WORM optical disks for event recording allows the processing of gigabyte data sets without operator intervention. A more powerful system is being planned which will make use of recently developed RISC (reduced instruction set computer) processors to obtain an order of magnitude increase in computing power per node.
System architecture for asynchronous multi-processor robotic control system

NASA Technical Reports Server (NTRS)

Steele, Robert D.; Long, Mark; Backes, Paul

1993-01-01

The architecture for the Modular Telerobot Task Execution System (MOTES) as implemented in the Supervisory Telerobotics (STELER) Laboratory is described. MOTES is the software component of the remote site of a local-remote telerobotic system which is being developed for NASA for space applications, in particular Space Station Freedom applications. The system is being developed to provide control and supervised autonomous control to support both space based operation and ground-remote control with time delay. The local-remote architecture places task planning responsibilities at the local site and task execution responsibilities at the remote site. This separation allows the remote site to be designed to optimize task execution capability within a limited computational environment such as is expected in flight systems. The local site task planning system could be placed on the ground where few computational limitations are expected. MOTES is written in the Ada programming language for a multiprocessor environment.
Engineering study for the functional design of a multiprocessor system

NASA Technical Reports Server (NTRS)

Miller, J. S.; Vandever, W. H.; Stanten, S. F.; Avakian, A. E.; Kosmala, A. L.

1972-01-01

The results are presented of a study to generate a functional system design of a multiprocessing computer system capable of satisfying the computational requirements of a space station. These data management system requirements were specified to include: (1) real time control, (2) data processing and storage, (3) data retrieval, and (4) remote terminal servicing.
Strategies for concurrent processing of complex algorithms in data driven architectures

NASA Technical Reports Server (NTRS)

Stoughton, John W.; Mielke, Roland R.

1988-01-01

The purpose is to document research to develop strategies for concurrent processing of complex algorithms in data driven architectures. The problem domain consists of decision-free algorithms having large-grained, computationally complex primitive operations. Such are often found in signal processing and control applications. The anticipated multiprocessor environment is a data flow architecture containing between two and twenty computing elements. Each computing element is a processor having local program memory, and which communicates with a common global data memory. A new graph theoretic model called ATAMM which establishes rules for relating a decomposed algorithm to its execution in a data flow architecture is presented. The ATAMM model is used to determine strategies to achieve optimum time performance and to develop a system diagnostic software tool. In addition, preliminary work on a new multiprocessor operating system based on the ATAMM specifications is described.
Memory management and compiler support for rapid recovery from failures in computer systems

NASA Technical Reports Server (NTRS)

Fuchs, W. K.

1991-01-01

This paper describes recent developments in the use of memory management and compiler technology to support rapid recovery from failures in computer systems. The techniques described include cache coherence protocols for user transparent checkpointing in multiprocessor systems, compiler-based checkpoint placement, compiler-based code modification for multiple instruction retry, and forward recovery in distributed systems utilizing optimistic execution.
Estimating Performance of Single Bus, Shared Memory Multiprocessors

DTIC Science & Technology

1987-05-01

Chandy78] K.M. Chandy, C.M. Sauer, "Approximate methods for analyzing queuing network models of computing systems," Computing Surveys, vol10 , no 3...Denning78] P. Denning, J. Buzen, "The operational analysis of queueing network models", Computing Sur- veys, vol10 , no 3, September 1978, pp 225-261
Partitioning and packing mathematical simulation models for calculation on parallel computers

NASA Technical Reports Server (NTRS)

Arpasi, D. J.; Milner, E. J.

1986-01-01

The development of multiprocessor simulations from a serial set of ordinary differential equations describing a physical system is described. Degrees of parallelism (i.e., coupling between the equations) and their impact on parallel processing are discussed. The problem of identifying computational parallelism within sets of closely coupled equations that require the exchange of current values of variables is described. A technique is presented for identifying this parallelism and for partitioning the equations for parallel solution on a multiprocessor. An algorithm which packs the equations into a minimum number of processors is also described. The results of the packing algorithm when applied to a turbojet engine model are presented in terms of processor utilization.
Development and evaluation of a fault-tolerant multiprocessor (FTMP) computer. Volume 4: FTMP executive summary

NASA Technical Reports Server (NTRS)

Smith, T. B., III; Lala, J. H.

1984-01-01

The FTMP architecture is a high reliability computer concept modeled after a homogeneous multiprocessor architecture. Elements of the FTMP are operated in tight synchronism with one another and hardware fault-detection and fault-masking is provided which is transparent to the software. Operating system design and user software design is thus greatly simplified. Performance of the FTMP is also comparable to that of a simplex equivalent due to the efficiency of fault handling hardware. The FTMP project constructed an engineering module of the FTMP, programmed the machine and extensively tested the architecture through fault injection and other stress testing. This testing confirmed the soundness of the FTMP concepts.
A Low-Cost and Energy-Efficient Multiprocessor System-on-Chip for UWB MAC Layer

NASA Astrophysics Data System (ADS)

Xiao, Hao; Isshiki, Tsuyoshi; Khan, Arif Ullah; Li, Dongju; Kunieda, Hiroaki; Nakase, Yuko; Kimura, Sadahiro

Ultra-wideband (UWB) technology has attracted much attention recently due to its high data rate and low emission power. Its media access control (MAC) protocol, WiMedia MAC, promises a lot of facilities for high-speed and high-quality wireless communication. However, these benefits in turn involve a large amount of computational load, which challenges the traditional uniprocessor architecture based implementation method to provide the required performance. However, the constrained cost and power budget, on the other hand, makes using commercial multiprocessor solutions unrealistic. In this paper, a low-cost and energy-efficient multiprocessor system-on-chip (MPSoC), which tackles at once the aspects of system design, software migration and hardware architecture, is presented for the implementation of UWB MAC layer. Experimental results show that the proposed MPSoC, based on four simple RISC processors and shared-memory infrastructure, achieves up to 45% performance improvement and 65% power saving, but takes 15% less area than the uniprocessor implementation.
Exploring the use of I/O nodes for computation in a MIMD multiprocessor

NASA Technical Reports Server (NTRS)

Kotz, David; Cai, Ting

1995-01-01

As parallel systems move into the production scientific-computing world, the emphasis will be on cost-effective solutions that provide high throughput for a mix of applications. Cost effective solutions demand that a system make effective use of all of its resources. Many MIMD multiprocessors today, however, distinguish between 'compute' and 'I/O' nodes, the latter having attached disks and being dedicated to running the file-system server. This static division of responsibilities simplifies system management but does not necessarily lead to the best performance in workloads that need a different balance of computation and I/O. Of course, computational processes sharing a node with a file-system service may receive less CPU time, network bandwidth, and memory bandwidth than they would on a computation-only node. In this paper we begin to examine this issue experimentally. We found that high performance I/O does not necessarily require substantial CPU time, leaving plenty of time for application computation. There were some complex file-system requests, however, which left little CPU time available to the application. (The impact on network and memory bandwidth still needs to be determined.) For applications (or users) that cannot tolerate an occasional interruption, we recommend that they continue to use only compute nodes. For tolerant applications needing more cycles than those provided by the compute nodes, we recommend that they take full advantage of both compute and I/O nodes for computation, and that operating systems should make this possible.
USSR Report: Cybernetics, Computers and Automation Technology. No. 69.

DTIC Science & Technology

1983-05-06

computers in multiprocessor and multistation design , control and scientific research automation systems. The results of comparing the efficiency of...Podvizhnaya, Scientific Research Institute of Control Computers, Severodonetsk] [Text] The most significant change in the design of the SM-2M compared to...UPRAVLYAYUSHCHIYE SISTEMY I MASHINY, Nov-Dec 82) 95 APPLICATIONS Kiev Automated Control System, Design Features and Prospects for Development (V. A
Generic Software for Emulating Multiprocessor Architectures.

DTIC Science & Technology

1985-05-01

RD-A157 662 GENERIC SOFTWARE FOR EMULATING MULTIPROCESSOR 1/2 AlRCHITECTURES(J) MASSACHUSETTS INST OF TECH CAMBRIDGE U LRS LAB FOR COMPUTER SCIENCE R...AREA & WORK UNIT NUMBERS MIT Laboratory for Computer Science 545 Technology Square Cambridge, MA 02139 ____________ I I. CONTROLLING OFFICE NAME AND...aide If neceeasy end Identify by block number) Computer architecture, emulation, simulation, dataf low 20. ABSTRACT (Continue an reverse slde It
Validation of multiprocessor systems

NASA Technical Reports Server (NTRS)

Siewiorek, D. P.; Segall, Z.; Kong, T.

1982-01-01

Experiments that can be used to validate fault free performance of multiprocessor systems in aerospace systems integrating flight controls and avionics are discussed. Engineering prototypes for two fault tolerant multiprocessors are tested.
Optimum spaceborne computer system design by simulation

NASA Technical Reports Server (NTRS)

Williams, T.; Kerner, H.; Weatherbee, J. E.; Taylor, D. S.; Hodges, B.

1973-01-01

A deterministic simulator is described which models the Automatically Reconfigurable Modular Multiprocessor System (ARMMS), a candidate computer system for future manned and unmanned space missions. Its use as a tool to study and determine the minimum computer system configuration necessary to satisfy the on-board computational requirements of a typical mission is presented. The paper describes how the computer system configuration is determined in order to satisfy the data processing demand of the various shuttle booster subsytems. The configuration which is developed as a result of studies with the simulator is optimal with respect to the efficient use of computer system resources.
Using parallel banded linear system solvers in generalized eigenvalue problems

NASA Technical Reports Server (NTRS)

Zhang, Hong; Moss, William F.

1993-01-01

Subspace iteration is a reliable and cost effective method for solving positive definite banded symmetric generalized eigenproblems, especially in the case of large scale problems. This paper discusses an algorithm that makes use of two parallel banded solvers in subspace iteration. A shift is introduced to decompose the banded linear systems into relatively independent subsystems and to accelerate the iterations. With this shift, an eigenproblem is mapped efficiently into the memories of a multiprocessor and a high speed-up is obtained for parallel implementations. An optimal shift is a shift that balances total computation and communication costs. Under certain conditions, we show how to estimate an optimal shift analytically using the decay rate for the inverse of a banded matrix, and how to improve this estimate. Computational results on iPSC/2 and iPSC/860 multiprocessors are presented.
A Performance Prediction Model for a Fault-Tolerant Computer During Recovery and Restoration

NASA Technical Reports Server (NTRS)

Obando, Rodrigo A.; Stoughton, John W.

1995-01-01

The modeling and design of a fault-tolerant multiprocessor system is addressed. Of interest is the behavior of the system during recovery and restoration after a fault has occurred. The multiprocessor systems are based on the Algorithm to Architecture Mapping Model (ATAMM) and the fault considered is the death of a processor. The developed model is useful in the determination of performance bounds of the system during recovery and restoration. The performance bounds include time to recover from the fault, time to restore the system, and determination of any permanent delay in the input to output latency after the system has regained steady state. Implementation of an ATAMM based computer was developed for a four-processor generic VHSIC spaceborne computer (GVSC) as the target system. A simulation of the GVSC was also written on the code used in the ATAMM Multicomputer Operating System (AMOS). The simulation is used to verify the new model for tracking the propagation of the delay through the system and predicting the behavior of the transient state of recovery and restoration. The model is shown to accurately predict the transient behavior of an ATAMM based multicomputer during recovery and restoration.
Laboratory for Computer Science Progress Report 19, 1 July 1981-30 June 1982.

DTIC Science & Technology

1984-05-01

Multiprocessor Architectures 202 4. TRIX Operating System 209 5. VLSI Tools 212 ’SYSTEMATIC PROGRAM DEVELOPMENT, 221 1. Introduction 222 2. Specification...exploring distributed operating systems and the architecture of single-user powerful computers that are interconnected by communication networks. The...to now. In particular, we expect to experiment with languages, operating systems , and applications that establish the feasibility of distributed
Parallel-Processing Test Bed For Simulation Software

NASA Technical Reports Server (NTRS)

Blech, Richard; Cole, Gary; Townsend, Scott

1996-01-01

Second-generation Hypercluster computing system is multiprocessor test bed for research on parallel algorithms for simulation in fluid dynamics, electromagnetics, chemistry, and other fields with large computational requirements but relatively low input/output requirements. Built from standard, off-shelf hardware readily upgraded as improved technology becomes available. System used for experiments with such parallel-processing concepts as message-passing algorithms, debugging software tools, and computational steering. First-generation Hypercluster system described in "Hypercluster Parallel Processor" (LEW-15283).
Embedded Multiprocessor Technology for VHSIC Insertion

NASA Technical Reports Server (NTRS)

Hayes, Paul J.

1990-01-01

Viewgraphs on embedded multiprocessor technology for VHSIC insertion are presented. The objective was to develop multiprocessor system technology providing user-selectable fault tolerance, increased throughput, and ease of application representation for concurrent operation. The approach was to develop graph management mapping theory for proper performance, model multiprocessor performance, and demonstrate performance in selected hardware systems.

Analysis of a Multiprocessor Guidance Computer. Ph.D. Thesis

NASA Technical Reports Server (NTRS)

Maltach, E. G.

1969-01-01

The design of the next generation of spaceborne digital computers is described. It analyzes a possible multiprocessor computer configuration. For the analysis, a set of representative space computing tasks was abstracted from the Lunar Module Guidance Computer programs as executed during the lunar landing, from the Apollo program. This computer performs at this time about 24 concurrent functions, with iteration rates from 10 times per second to once every two seconds. These jobs were tabulated in a machine-independent form, and statistics of the overall job set were obtained. It was concluded, based on a comparison of simulation and Markov results, that the Markov process analysis is accurate in predicting overall trends and in configuration comparisons, but does not provide useful detailed information in specific situations. Using both types of analysis, it was determined that the job scheduling function is a critical one for efficiency of the multiprocessor. It is recommended that research into the area of automatic job scheduling be performed.
A robot arm simulation with a shared memory multiprocessor machine

NASA Technical Reports Server (NTRS)

Kim, Sung-Soo; Chuang, Li-Ping

1989-01-01

A parallel processing scheme for a single chain robot arm is presented for high speed computation on a shared memory multiprocessor. A recursive formulation that is derived from a virtual work form of the d'Alembert equations of motion is utilized for robot arm dynamics. A joint drive system that consists of a motor rotor and gears is included in the arm dynamics model, in order to take into account gyroscopic effects due to the spinning of the rotor. The fine grain parallelism of mechanical and control subsystem models is exploited, based on independent computation associated with bodies, joint drive systems, and controllers. Efficiency and effectiveness of the parallel scheme are demonstrated through simulations of a telerobotic manipulator arm. Two different mechanical subsystem models, i.e., with and without gyroscopic effects, are compared, to show the trade-off between efficiency and accuracy.
Shared performance monitor in a multiprocessor system

DOEpatents

Chiu, George; Gara, Alan G; Salapura, Valentina

2014-12-02

A performance monitoring unit (PMU) and method for monitoring performance of events occurring in a multiprocessor system. The multiprocessor system comprises a plurality of processor devices units, each processor device for generating signals representing occurrences of events in the processor device, and, a single shared counter resource for performance monitoring. The performance monitor unit is shared by all processor cores in the multiprocessor system. The PMU is further programmed to monitor event signals issued from non-processor devices.
Partitioning strategy for efficient nonlinear finite element dynamic analysis on multiprocessor computers

NASA Technical Reports Server (NTRS)

Noor, Ahmed K.; Peters, Jeanne M.

1989-01-01

A computational procedure is presented for the nonlinear dynamic analysis of unsymmetric structures on vector multiprocessor systems. The procedure is based on a novel hierarchical partitioning strategy in which the response of the unsymmetric and antisymmetric response vectors (modes), each obtained by using only a fraction of the degrees of freedom of the original finite element model. The three key elements of the procedure which result in high degree of concurrency throughout the solution process are: (1) mixed (or primitive variable) formulation with independent shape functions for the different fields; (2) operator splitting or restructuring of the discrete equations at each time step to delineate the symmetric and antisymmetric vectors constituting the response; and (3) two level iterative process for generating the response of the structure. An assessment is made of the effectiveness of the procedure on the CRAY X-MP/4 computers.
Parallel processing and expert systems

NASA Technical Reports Server (NTRS)

Lau, Sonie; Yan, Jerry C.

1991-01-01

Whether it be monitoring the thermal subsystem of Space Station Freedom, or controlling the navigation of the autonomous rover on Mars, NASA missions in the 1990s cannot enjoy an increased level of autonomy without the efficient implementation of expert systems. Merely increasing the computational speed of uniprocessors may not be able to guarantee that real-time demands are met for larger systems. Speedup via parallel processing must be pursued alongside the optimization of sequential implementations. Prototypes of parallel expert systems have been built at universities and industrial laboratories in the U.S. and Japan. The state-of-the-art research in progress related to parallel execution of expert systems is surveyed. The survey discusses multiprocessors for expert systems, parallel languages for symbolic computations, and mapping expert systems to multiprocessors. Results to date indicate that the parallelism achieved for these systems is small. The main reasons are (1) the body of knowledge applicable in any given situation and the amount of computation executed by each rule firing are small, (2) dividing the problem solving process into relatively independent partitions is difficult, and (3) implementation decisions that enable expert systems to be incrementally refined hamper compile-time optimization. In order to obtain greater speedups, data parallelism and application parallelism must be exploited.
Research in the Aloha system

NASA Technical Reports Server (NTRS)

Abramson, N.

1974-01-01

The Aloha system was studied and developed and extended to advanced forms of computer communications networks. Theoretical and simulation studies of Aloha type radio channels for use in packet switched communications networks were performed. Improved versions of the Aloha communications techniques and their extensions were tested experimentally. A packet radio repeater suitable for use with the Aloha system operational network was developed. General studies of the organization of multiprocessor systems centered on the development of the BCC 500 computer were concluded.
The MIT Alewife Machine: A Large-Scale Distributed-Memory Multiprocessor

DTIC Science & Technology

1991-06-01

Symposium on Compiler Construction, June 1986. [14] Daniel Gajski , David Kuck, Duncan Lawrie, and Ahmed Saleh. Cedar - A Large Scale Multiprocessor. In...Directory Methods. In Proceedings 17th Annual International Symposium on Computer Architecture, June 1990. [31] G . M. Papadopoulos and D.E. Culler...Monsoon: An Explicit Token-Store Ar- chitecture. In Proceedings 17th Annual International Symposium on Computer Architecture, June 1990. [32] G . F
Models for evaluating the performability of degradable computing systems

NASA Technical Reports Server (NTRS)

Wu, L. T.

1982-01-01

Recent advances in multiprocessor technology established the need for unified methods to evaluate computing systems performance and reliability. In response to this modeling need, a general modeling framework that permits the modeling, analysis and evaluation of degradable computing systems is considered. Within this framework, several user oriented performance variables are identified and shown to be proper generalizations of the traditional notions of system performance and reliability. Furthermore, a time varying version of the model is developed to generalize the traditional fault tree reliability evaluation methods of phased missions.
Process for predicting structural performance of mechanical systems

DOEpatents

Gardner, David R.; Hendrickson, Bruce A.; Plimpton, Steven J.; Attaway, Stephen W.; Heinstein, Martin W.; Vaughan, Courtenay T.

1998-01-01

A process for predicting the structural performance of a mechanical system represents the mechanical system by a plurality of surface elements. The surface elements are grouped according to their location in the volume occupied by the mechanical system so that contacts between surface elements can be efficiently located. The process is well suited for efficient practice on multiprocessor computers.
Method and apparatus for single-stepping coherence events in a multiprocessor system under software control

DOEpatents

Blumrich, Matthias A.; Salapura, Valentina

2010-11-02

An apparatus and method are disclosed for single-stepping coherence events in a multiprocessor system under software control in order to monitor the behavior of a memory coherence mechanism. Single-stepping coherence events in a multiprocessor system is made possible by adding one or more step registers. By accessing these step registers, one or more coherence requests are processed by the multiprocessor system. The step registers determine if the snoop unit will operate by proceeding in a normal execution mode, or operate in a single-step mode.
Parallel processing for scientific computations

NASA Technical Reports Server (NTRS)

Alkhatib, Hasan S.

1995-01-01

The scope of this project dealt with the investigation of the requirements to support distributed computing of scientific computations over a cluster of cooperative workstations. Various experiments on computations for the solution of simultaneous linear equations were performed in the early phase of the project to gain experience in the general nature and requirements of scientific applications. A specification of a distributed integrated computing environment, DICE, based on a distributed shared memory communication paradigm has been developed and evaluated. The distributed shared memory model facilitates porting existing parallel algorithms that have been designed for shared memory multiprocessor systems to the new environment. The potential of this new environment is to provide supercomputing capability through the utilization of the aggregate power of workstations cooperating in a cluster interconnected via a local area network. Workstations, generally, do not have the computing power to tackle complex scientific applications, making them primarily useful for visualization, data reduction, and filtering as far as complex scientific applications are concerned. There is a tremendous amount of computing power that is left unused in a network of workstations. Very often a workstation is simply sitting idle on a desk. A set of tools can be developed to take advantage of this potential computing power to create a platform suitable for large scientific computations. The integration of several workstations into a logical cluster of distributed, cooperative, computing stations presents an alternative to shared memory multiprocessor systems. In this project we designed and evaluated such a system.
Multiprocessor shared-memory information exchange

DOE Office of Scientific and Technical Information (OSTI.GOV)

Santoline, L.L.; Bowers, M.D.; Crew, A.W.

1989-02-01

In distributed microprocessor-based instrumentation and control systems, the inter-and intra-subsystem communication requirements ultimately form the basis for the overall system architecture. This paper describes a software protocol which addresses the intra-subsystem communications problem. Specifically the protocol allows for multiple processors to exchange information via a shared-memory interface. The authors primary goal is to provide a reliable means for information to be exchanged between central application processor boards (masters) and dedicated function processor boards (slaves) in a single computer chassis. The resultant Multiprocessor Shared-Memory Information Exchange (MSMIE) protocol, a standard master-slave shared-memory interface suitable for use in nuclear safety systems, ismore » designed to pass unidirectional buffers of information between the processors while providing a minimum, deterministic cycle time for this data exchange.« less
Resource Management for Distributed Parallel Systems

NASA Technical Reports Server (NTRS)

Neuman, B. Clifford; Rao, Santosh

1993-01-01

Multiprocessor systems should exist in the the larger context of distributed systems, allowing multiprocessor resources to be shared by those that need them. Unfortunately, typical multiprocessor resource management techniques do not scale to large networks. The Prospero Resource Manager (PRM) is a scalable resource allocation system that supports the allocation of processing resources in large networks and multiprocessor systems. To manage resources in such distributed parallel systems, PRM employs three types of managers: system managers, job managers, and node managers. There exist multiple independent instances of each type of manager, reducing bottlenecks. The complexity of each manager is further reduced because each is designed to utilize information at an appropriate level of abstraction.
Optimum spaceborne computer system design by simulation

NASA Technical Reports Server (NTRS)

Williams, T.; Weatherbee, J. E.; Taylor, D. S.

1972-01-01

A deterministic digital simulation model is described which models the Automatically Reconfigurable Modular Multiprocessor System (ARMMS), a candidate computer system for future manned and unmanned space missions. Use of the model as a tool in configuring a minimum computer system for a typical mission is demonstrated. The configuration which is developed as a result of studies with the simulator is optimal with respect to the efficient use of computer system resources, i.e., the configuration derived is a minimal one. Other considerations such as increased reliability through the use of standby spares would be taken into account in the definition of a practical system for a given mission.
Research in Structures and Dynamics, 1984

NASA Technical Reports Server (NTRS)

Hayduk, R. J. (Compiler); Noor, A. K. (Compiler)

1984-01-01

A symposium on advanced and trends in structures and dynamics was held to communicate new insights into physical behavior and to identify trends in the solution procedures for structures and dynamics problems. Pertinent areas of concern were (1) multiprocessors, parallel computation, and database management systems, (2) advances in finite element technology, (3) interactive computing and optimization, (4) mechanics of materials, (5) structural stability, (6) dynamic response of structures, and (7) advanced computer applications.
Laboratory for Computer Science Progress Report 21, July 1983-June 1984.

DTIC Science & Technology

1984-06-01

Systems 269 4. Distributed Consensus 270 5. Election of a Leader in a Distributed Ring of Processors 273 6. Distributed Network Algorithms 274 7. Diagnosis...multiprocessor systems. This facility, funded by the new!y formed Strategic Computing Program of the Defense Advanced Research Projects Agency, will enable...Academic Staff P. Szo)ovits, Group Leader R. Patil Collaborating Investigators M. Criscitiello, M.D., Tufts-New England Medical Center Hospital R
Polymorphous Computing Architectures

DTIC Science & Technology

2007-12-12

provide a multiprocessor implementation. In this work, we introduce the Atomos transactional programming language, which is the first to include...implicit transactions, strong atomicity, and a scalable multiprocessor implementation [47]. Atomos is derived from Java, but replaces its synchronization...and conditional waiting constructs with transactional alternatives. The Atomos conditional waiting proposal is tailored to allow efficient
Process for predicting structural performance of mechanical systems

DOEpatents

Gardner, D.R.; Hendrickson, B.A.; Plimpton, S.J.; Attaway, S.W.; Heinstein, M.W.; Vaughan, C.T.

1998-05-19

A process for predicting the structural performance of a mechanical system represents the mechanical system by a plurality of surface elements. The surface elements are grouped according to their location in the volume occupied by the mechanical system so that contacts between surface elements can be efficiently located. The process is well suited for efficient practice on multiprocessor computers. 12 figs.
A design fix to supervisory control for fault-tolerant scheduling of real-time multiprocessor systems with aperiodic tasks

NASA Astrophysics Data System (ADS)

Devaraj, Rajesh; Sarkar, Arnab; Biswas, Santosh

2015-11-01

In the article 'Supervisory control for fault-tolerant scheduling of real-time multiprocessor systems with aperiodic tasks', Park and Cho presented a systematic way of computing a largest fault-tolerant and schedulable language that provides information on whether the scheduler (i.e., supervisor) should accept or reject a newly arrived aperiodic task. The computation of such a language is mainly dependent on the task execution model presented in their paper. However, the task execution model is unable to capture the situation when the fault of a processor occurs even before the task has arrived. Consequently, a task execution model that does not capture this fact may possibly be assigned for execution on a faulty processor. This problem has been illustrated with an appropriate example. Then, the task execution model of Park and Cho has been modified to strengthen the requirement that none of the tasks are assigned for execution on a faulty processor.
Parallelization of NAS Benchmarks for Shared Memory Multiprocessors

NASA Technical Reports Server (NTRS)

Waheed, Abdul; Yan, Jerry C.; Saini, Subhash (Technical Monitor)

1998-01-01

This paper presents our experiences of parallelizing the sequential implementation of NAS benchmarks using compiler directives on SGI Origin2000 distributed shared memory (DSM) system. Porting existing applications to new high performance parallel and distributed computing platforms is a challenging task. Ideally, a user develops a sequential version of the application, leaving the task of porting to new generations of high performance computing systems to parallelization tools and compilers. Due to the simplicity of programming shared-memory multiprocessors, compiler developers have provided various facilities to allow the users to exploit parallelism. Native compilers on SGI Origin2000 support multiprocessing directives to allow users to exploit loop-level parallelism in their programs. Additionally, supporting tools can accomplish this process automatically and present the results of parallelization to the users. We experimented with these compiler directives and supporting tools by parallelizing sequential implementation of NAS benchmarks. Results reported in this paper indicate that with minimal effort, the performance gain is comparable with the hand-parallelized, carefully optimized, message-passing implementations of the same benchmarks.

The Automated Instrumentation and Monitoring System (AIMS) reference manual

NASA Technical Reports Server (NTRS)

Yan, Jerry; Hontalas, Philip; Listgarten, Sherry

1993-01-01

Whether a researcher is designing the 'next parallel programming paradigm,' another 'scalable multiprocessor' or investigating resource allocation algorithms for multiprocessors, a facility that enables parallel program execution to be captured and displayed is invaluable. Careful analysis of execution traces can help computer designers and software architects to uncover system behavior and to take advantage of specific application characteristics and hardware features. A software tool kit that facilitates performance evaluation of parallel applications on multiprocessors is described. The Automated Instrumentation and Monitoring System (AIMS) has four major software components: a source code instrumentor which automatically inserts active event recorders into the program's source code before compilation; a run time performance-monitoring library, which collects performance data; a trace file animation and analysis tool kit which reconstructs program execution from the trace file; and a trace post-processor which compensate for data collection overhead. Besides being used as prototype for developing new techniques for instrumenting, monitoring, and visualizing parallel program execution, AIMS is also being incorporated into the run-time environments of various hardware test beds to evaluate their impact on user productivity. Currently, AIMS instrumentors accept FORTRAN and C parallel programs written for Intel's NX operating system on the iPSC family of multi computers. A run-time performance-monitoring library for the iPSC/860 is included in this release. We plan to release monitors for other platforms (such as PVM and TMC's CM-5) in the near future. Performance data collected can be graphically displayed on workstations (e.g. Sun Sparc and SGI) supporting X-Windows (in particular, Xl IR5, Motif 1.1.3).
Partitioning in parallel processing of production systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Oflazer, K.

1987-01-01

This thesis presents research on certain issues related to parallel processing of production systems. It first presents a parallel production system interpreter that has been implemented on a four-processor multiprocessor. This parallel interpreter is based on Forgy's OPS5 interpreter and exploits production-level parallelism in production systems. Runs on the multiprocessor system indicate that it is possible to obtain speed-up of around 1.7 in the match computation for certain production systems when productions are split into three sets that are processed in parallel. The next issue addressed is that of partitioning a set of rules to processors in a parallel interpretermore » with production-level parallelism, and the extent of additional improvement in performance. The partitioning problem is formulated and an algorithm for approximate solutions is presented. The thesis next presents a parallel processing scheme for OPS5 production systems that allows some redundancy in the match computation. This redundancy enables the processing of a production to be divided into units of medium granularity each of which can be processed in parallel. Subsequently, a parallel processor architecture for implementing the parallel processing algorithm is presented.« less
VME rollback hardware for time warp multiprocessor systems

NASA Technical Reports Server (NTRS)

Robb, Michael J.; Buzzell, Calvin A.

1992-01-01

The purpose of the research effort is to develop and demonstrate innovative hardware to implement specific rollback and timing functions required for efficient queue management and precision timekeeping in multiprocessor discrete event simulations. The previously completed phase 1 effort demonstrated the technical feasibility of building hardware modules which eliminate the state saving overhead of the Time Warp paradigm used in distributed simulations on multiprocessor systems. The current phase 2 effort will build multiple pre-production rollback hardware modules integrated with a network of Sun workstations, and the integrated system will be tested by executing a Time Warp simulation. The rollback hardware will be designed to interface with the greatest number of multiprocessor systems possible. The authors believe that the rollback hardware will provide for significant speedup of large scale discrete event simulation problems and allow multiprocessors using Time Warp to dramatically increase performance.
MIT Laboratory for Computer Science Progress Report, July 1984-June 1985

DTIC Science & Technology

1985-06-01

larger (up to several thousand machines) multiprocessor systems. This facility, funded by the newly formed Strategic Computing Program of the Defense...Szolovits, Group Leader R. Patil Collaborating Investigators M. Criscitiello, M.D., Tufts-New England Medical Center Hospital J. Dzierzanowski, Ph.D., Dept...COMPUTATION STRUCTURES Academic Staff J. B. Dennis, Group Leader Research Staff W. B. Ackerman G. A. Boughton W. Y-P. Lim Graduate Students T-A. Chu S
Design and control of a macro-micro robot for precise force applications

NASA Technical Reports Server (NTRS)

Wang, Yulun; Mangaser, Amante; Laby, Keith; Jordan, Steve; Wilson, Jeff

1993-01-01

Creating a robot which can delicately interact with its environment has been the goal of much research. Primarily two difficulties have made this goal hard to attain. The execution of control strategies which enable precise force manipulations are difficult to implement in real time because such algorithms have been too computationally complex for available controllers. Also, a robot mechanism which can quickly and precisely execute a force command is difficult to design. Actuation joints must be sufficiently stiff, frictionless, and lightweight so that desired torques can be accurately applied. This paper describes a robotic system which is capable of delicate manipulations. A modular high-performance multiprocessor control system was designed to provide sufficient compute power for executing advanced control methods. An 8 degree of freedom macro-micro mechanism was constructed to enable accurate tip forces. Control algorithms based on the impedance control method were derived, coded, and load balanced for maximum execution speed on the multiprocessor system. Delicate force tasks such as polishing, finishing, cleaning, and deburring, are the target applications of the robot.
Systematic generation of multibody equations of motion suitable for recursive and parallel manipulation

NASA Technical Reports Server (NTRS)

Nikravesh, Parviz E.; Gim, Gwanghum; Arabyan, Ara; Rein, Udo

1989-01-01

The formulation of a method known as the joint coordinate method for automatic generation of the equations of motion for multibody systems is summarized. For systems containing open or closed kinematic loops, the equations of motion can be reduced systematically to a minimum number of second order differential equations. The application of recursive and nonrecursive algorithms to this formulation, computational considerations and the feasibility of implementing this formulation on multiprocessor computers are discussed.
Fault tolerance in a supercomputer through dynamic repartitioning

DOEpatents

Chen, Dong; Coteus, Paul W.; Gara, Alan G.; Takken, Todd E.

2007-02-27

A multiprocessor, parallel computer is made tolerant to hardware failures by providing extra groups of redundant standby processors and by designing the system so that these extra groups of processors can be swapped with any group which experiences a hardware failure. This swapping can be under software control, thereby permitting the entire computer to sustain a hardware failure but, after swapping in the standby processors, to still appear to software as a pristine, fully functioning system.
Strategies for concurrent processing of complex algorithms in data driven architectures

NASA Technical Reports Server (NTRS)

Stoughton, John W.; Mielke, Roland R.

1987-01-01

The results of ongoing research directed at developing a graph theoretical model for describing data and control flow associated with the execution of large grained algorithms in a spatial distributed computer environment is presented. This model is identified by the acronym ATAMM (Algorithm/Architecture Mapping Model). The purpose of such a model is to provide a basis for establishing rules for relating an algorithm to its execution in a multiprocessor environment. Specifications derived from the model lead directly to the description of a data flow architecture which is a consequence of the inherent behavior of the data and control flow described by the model. The purpose of the ATAMM based architecture is to optimize computational concurrency in the multiprocessor environment and to provide an analytical basis for performance evaluation. The ATAMM model and architecture specifications are demonstrated on a prototype system for concept validation.
Shared performance monitor in a multiprocessor system

DOEpatents

Chiu, George; Gara, Alan G.; Salapura, Valentina

2012-07-24

A performance monitoring unit (PMU) and method for monitoring performance of events occurring in a multiprocessor system. The multiprocessor system comprises a plurality of processor devices units, each processor device for generating signals representing occurrences of events in the processor device, and, a single shared counter resource for performance monitoring. The performance monitor unit is shared by all processor cores in the multiprocessor system. The PMU comprises: a plurality of performance counters each for counting signals representing occurrences of events from one or more the plurality of processor units in the multiprocessor system; and, a plurality of input devices for receiving the event signals from one or more processor devices of the plurality of processor units, the plurality of input devices programmable to select event signals for receipt by one or more of the plurality of performance counters for counting, wherein the PMU is shared between multiple processing units, or within a group of processors in the multiprocessing system. The PMU is further programmed to monitor event signals issued from non-processor devices.
A new taxonomy for distributed computer systems based upon operating system structure

NASA Technical Reports Server (NTRS)

Foudriat, E. C.

1985-01-01

Characteristics of the resource structure found in the operating system are considered as a mechanism for classifying distributed computer systems. Since the operating system resources, themselves, are too diversified to provide a consistent classification, the structure upon which resources are built and shared are examined. The location and control character of this indivisibility provides the taxonomy for separating uniprocessors, computer networks, network computers (fully distributed processing systems or decentralized computers) and algorithm and/or data control multiprocessors. The taxonomy is important because it divides machines into a classification that is relevant or important to the client and not the hardware architect. It also defines the character of the kernel O/S structure needed for future computer systems. What constitutes an operating system for a fully distributed processor is discussed in detail.
A multitasking, multisinked, multiprocessor data acquisition front end

DOE Office of Scientific and Technical Information (OSTI.GOV)

Fox, R.; Au, R.; Molen, A.V.

1989-10-01

The authors have developed a generalized data acquisition front end system which is based on MC68020 processors running a commercial real time kernel (rhoSOS), and implemented primarily in a high level language (C). This system has been attached to the back end on-line computing system at NSCL via our high performance ETHERNET protocol. Data may be simultaneously sent to any number of back end systems. Fixed fraction sampling along links to back end computing is also supported. A nonprocedural program generator simplifies the development of experiment specific code.
A heterogeneous hierarchical architecture for real-time computing

DOE Office of Scientific and Technical Information (OSTI.GOV)

Skroch, D.A.; Fornaro, R.J.

The need for high-speed data acquisition and control algorithms has prompted continued research in the area of multiprocessor systems and related programming techniques. The result presented here is a unique hardware and software architecture for high-speed real-time computer systems. The implementation of a prototype of this architecture has required the integration of architecture, operating systems and programming languages into a cohesive unit. This report describes a Heterogeneous Hierarchial Architecture for Real-Time (H{sup 2} ART) and system software for program loading and interprocessor communication.
Model Checking, Abstraction, and Compositional Verification

DTIC Science & Technology

1993-07-01

the ( alois connections used by Bensalrnu el al. [6], and also has some relation to Kurshan’s automata homonuor- phisms [62]. (Actually. we can impose a...multiprocessor simulation model. ACM Transactions on Computer Systems, 4(4):273-298, November 1986. [41 D. L. Beatty, R. E. Bryant, and C.-J. Seger
Developing Software to Use Parallel Processing Effectively

DTIC Science & Technology

1988-10-01

Experience, Vol 15(6), June 1985, p53 Gajski85 Gajski , Daniel D. and Jih-Kwon Peir, "Essential Issues in Multiprocessor Systems", IEEE Computer, June...Treleaven (eds.), Springer-Verlag, pp. 213-225 (June 1987). Kuck83 David Kuck, Duncan Lawrie, Ron Cytron, Ahmed Sameh and Daniel Gajski , The Architecture and
Parallelising a molecular dynamics algorithm on a multi-processor workstation

NASA Astrophysics Data System (ADS)

Müller-Plathe, Florian

1990-12-01

The Verlet neighbour-list algorithm is parallelised for a multi-processor Hewlett-Packard/Apollo DN10000 workstation. The implementation makes use of memory shared between the processors. It is a genuine master-slave approach by which most of the computational tasks are kept in the master process and the slaves are only called to do part of the nonbonded forces calculation. The implementation features elements of both fine-grain and coarse-grain parallelism. Apart from three calls to library routines, two of which are standard UNIX calls, and two machine-specific language extensions, the whole code is written in standard Fortran 77. Hence, it may be expected that this parallelisation concept can be transfered in parts or as a whole to other multi-processor shared-memory computers. The parallel code is routinely used in production work.
Techniques for video compression

NASA Technical Reports Server (NTRS)

Wu, Chwan-Hwa

1995-01-01

In this report, we present our study on multiprocessor implementation of a MPEG2 encoding algorithm. First, we compare two approaches to implementing video standards, VLSI technology and multiprocessor processing, in terms of design complexity, applications, and cost. Then we evaluate the functional modules of MPEG2 encoding process in terms of their computation time. Two crucial modules are identified based on this evaluation. Then we present our experimental study on the multiprocessor implementation of the two crucial modules. Data partitioning is used for job assignment. Experimental results show that high speedup ratio and good scalability can be achieved by using this kind of job assignment strategy.
Methodologies and systems for heterogeneous concurrent computing

NASA Technical Reports Server (NTRS)

Sunderam, V. S.

1994-01-01

Heterogeneous concurrent computing is gaining increasing acceptance as an alternative or complementary paradigm to multiprocessor-based parallel processing as well as to conventional supercomputing. While algorithmic and programming aspects of heterogeneous concurrent computing are similar to their parallel processing counterparts, system issues, partitioning and scheduling, and performance aspects are significantly different. In this paper, we discuss critical design and implementation issues in heterogeneous concurrent computing, and describe techniques for enhancing its effectiveness. In particular, we highlight the system level infrastructures that are required, aspects of parallel algorithm development that most affect performance, system capabilities and limitations, and tools and methodologies for effective computing in heterogeneous networked environments. We also present recent developments and experiences in the context of the PVM system and comment on ongoing and future work.
Integrating Software Modules For Robot Control

NASA Technical Reports Server (NTRS)

Volpe, Richard A.; Khosla, Pradeep; Stewart, David B.

1993-01-01

Reconfigurable, sensor-based control system uses state variables in systematic integration of reusable control modules. Designed for open-architecture hardware including many general-purpose microprocessors, each having own local memory plus access to global shared memory. Implemented in software as extension of Chimera II real-time operating system. Provides transparent computing mechanism for intertask communication between control modules and generic process-module architecture for multiprocessor realtime computation. Used to control robot arm. Proves useful in variety of other control and robotic applications.
A fault-tolerant multiprocessor architecture for aircraft, volume 1. [autopilot configuration

NASA Technical Reports Server (NTRS)

Smith, T. B.; Hopkins, A. L.; Taylor, W.; Ausrotas, R. A.; Lala, J. H.; Hanley, L. D.; Martin, J. H.

1978-01-01

A fault-tolerant multiprocessor architecture is reported. This architecture, together with a comprehensive information system architecture, has important potential for future aircraft applications. A preliminary definition and assessment of a suitable multiprocessor architecture for such applications is developed.
A macro-micro robot for precise force applications

NASA Technical Reports Server (NTRS)

Marzwell, Neville I.; Wang, Yulun

1993-01-01

This paper describes an 8 degree-of-freedom macro-micro robot capable of performing tasks which require accurate force control. Applications such as polishing, finishing, grinding, deburring, and cleaning are a few examples of tasks which need this capability. Currently these tasks are either performed manually or with dedicated machinery because of the lack of a flexible and cost effective tool, such as a programmable force-controlled robot. The basic design and control of the macro-micro robot is described in this paper. A modular high-performance multiprocessor control system was designed to provide sufficient compute power for executing advanced control methods. An 8 degree of freedom macro-micro mechanism was constructed to enable accurate tip forces. Control algorithms based on the impedance control method were derived, coded, and load balanced for maximum execution speed on the multiprocessor system.

Development and evaluation of a Fault-Tolerant Multiprocessor (FTMP) computer. Volume 2: FTMP software

NASA Technical Reports Server (NTRS)

Lala, J. H.; Smith, T. B., III

1983-01-01

The software developed for the Fault-Tolerant Multiprocessor (FTMP) is described. The FTMP executive is a timer-interrupt driven dispatcher that schedules iterative tasks which run at 3.125, 12.5, and 25 Hz. Major tasks which run under the executive include system configuration control, flight control, and display. The flight control task includes autopilot and autoland functions for a jet transport aircraft. System Displays include status displays of all hardware elements (processors, memories, I/O ports, buses), failure log displays showing transient and hard faults, and an autopilot display. All software is in a higher order language (AED, an ALGOL derivative). The executive is a fully distributed general purpose executive which automatically balances the load among available processor triads. Provisions for graceful performance degradation under processing overload are an integral part of the scheduling algorithms.
Design and implementation of a modulator-based free-space optical backplane for multiprocessor applications.

PubMed

Kirk, Andrew G; Plant, David V; Szymanski, Ted H; Vranesic, Zvonko G; Tooley, Frank A P; Rolston, David R; Ayliffe, Michael H; Lacroix, Frederic K; Robertson, Brian; Bernier, Eric; Brosseau, Daniel F

2003-05-10

Design and implementation of a free-space optical backplane for multiprocessor applications is presented. The system is designed to interconnect four multiprocessor nodes that communicate by using multiplexed 32-bit packets. Each multiprocessor node is electrically connected to an optoelectronic VLSI chip which implements the hyperplane interconnection architecture. The chips each contain 256 optical transmitters (implemented as dual-rail multiple quantum-well modulators) and 256 optical receivers. A rigid free-space microoptical interconnection system that interconnects the transceiver chips in a 512-channel unidirectional ring is implemented. Full design, implementation, and operational details are provided.
Design and implementation of a modulator-based free-space optical backplane for multiprocessor applications

NASA Astrophysics Data System (ADS)

Kirk, Andrew G.; Plant, David V.; Szymanski, Ted H.; Vranesic, Zvonko G.; Tooley, Frank A. P.; Rolston, David R.; Ayliffe, Michael H.; Lacroix, Frederic K.; Robertson, Brian; Bernier, Eric; Brosseau, Daniel F.

2003-05-01

Design and implementation of a free-space optical backplane for multiprocessor applications is presented. The system is designed to interconnect four multiprocessor nodes that communicate by using multiplexed 32-bit packets. Each multiprocessor node is electrically connected to an optoelectronic VLSI chip which implements the hyperplane interconnection architecture. The chips each contain 256 optical transmitters (implemented as dual-rail multiple quantum-well modulators) and 256 optical receivers. A rigid free-space microoptical interconnection system that interconnects the transceiver chips in a 512-channel unidirectional ring is implemented. Full design, implementation, and operational details are provided.
Portable programming on parallel/networked computers using the Application Portable Parallel Library (APPL)

NASA Technical Reports Server (NTRS)

Quealy, Angela; Cole, Gary L.; Blech, Richard A.

1993-01-01

The Application Portable Parallel Library (APPL) is a subroutine-based library of communication primitives that is callable from applications written in FORTRAN or C. APPL provides a consistent programmer interface to a variety of distributed and shared-memory multiprocessor MIMD machines. The objective of APPL is to minimize the effort required to move parallel applications from one machine to another, or to a network of homogeneous machines. APPL encompasses many of the message-passing primitives that are currently available on commercial multiprocessor systems. This paper describes APPL (version 2.3.1) and its usage, reports the status of the APPL project, and indicates possible directions for the future. Several applications using APPL are discussed, as well as performance and overhead results.
Clocking and Synchronization Circuits in Multiprocessor Systems

DTIC Science & Technology

1989-04-01

18 3.4 Inter -chip Clocking Strategies...may occur when two or more of the switches make transitions at different times during the inter - val during which those inputs are being processed...increased without any fruitful computation. The sources of the inter -chip clock skew are the electromagnetic propagation delay, the buffer delay within
Hardware for Accelerating N-Modular Redundant Systems for High-Reliability Computing

NASA Technical Reports Server (NTRS)

Dobbs, Carl, Sr.

2012-01-01

A hardware unit has been designed that reduces the cost, in terms of performance and power consumption, for implementing N-modular redundancy (NMR) in a multiprocessor device. The innovation monitors transactions to memory, and calculates a form of sumcheck on-the-fly, thereby relieving the processors of calculating the sumcheck in software
Fault-free behavior of reliable multiprocessor systems: FTMP experiments in AIRLAB

NASA Technical Reports Server (NTRS)

Clune, E.; Segall, Z.; Siewiorek, D.

1985-01-01

This report describes a set of experiments which were implemented on the Fault tolerant Multi-Processor (FTMP) at NASA/Langley's AIRLAB facility. These experiments are part of an effort to formulate and evaluate validation methodologies for fault-tolerant computers. This report deals with the measurement of single parameters (baselines) of a fault free system. The initial set of baseline experiments lead to the following conclusions: (1) The system clock is constant and independent of workload in the tested cases; (2) the instruction execution times are constant; (3) the R4 frame size is 40mS with some variation; (4) the frame stretching mechanism has some flaws in its implementation that allow the possibility of an infinite stretching of frame duration. Future experiments are planned. Some will broaden the results of these initial experiments. Others will measure the system more dynamically. The implementation of a synthetic workload generation mechanism for FTMP is planned to enhance the experimental environment of the system.
MULTIPROCESSOR AND DISTRIBUTED PROCESSING BIBLIOGRAPHIC DATA BASE SOFTWARE SYSTEM

NASA Technical Reports Server (NTRS)

Miya, E. N.

1994-01-01

Multiprocessors and distributed processing are undergoing increased scientific scrutiny for many reasons. It is more and more difficult to keep track of the existing research in these fields. This package consists of a large machine-readable bibliographic data base which, in addition to the usual keyword searches, can be used for producing citations, indexes, and cross-references. The data base is compiled from smaller existing multiprocessing bibliographies, and tables of contents from journals and significant conferences. There are approximately 4,000 entries covering topics such as parallel and vector processing, networks, supercomputers, fault-tolerant computers, and cellular automata. Each entry is represented by 21 fields including keywords, author, referencing book or journal title, volume and page number, and date and city of publication. The data base contains UNIX 'refer' formatted ASCII data and can be implemented on any computer running under the UNIX operating system. The data base requires approximately one megabyte of secondary storage. The documentation for this program is included with the distribution tape, although it can be purchased for the price below. This bibliography was compiled in 1985 and updated in 1988.
Prefetching in file systems for MIMD multiprocessors

NASA Technical Reports Server (NTRS)

Kotz, David F.; Ellis, Carla Schlatter

1990-01-01

The question of whether prefetching blocks on the file into the block cache can effectively reduce overall execution time of a parallel computation, even under favorable assumptions, is considered. Experiments have been conducted with an interleaved file system testbed on the Butterfly Plus multiprocessor. Results of these experiments suggest that (1) the hit ratio, the accepted measure in traditional caching studies, may not be an adequate measure of performance when the workload consists of parallel computations and parallel file access patterns, (2) caching with prefetching can significantly improve the hit ratio and the average time to perform an I/O (input/output) operation, and (3) an improvement in overall execution time has been observed in most cases. In spite of these gains, prefetching sometimes results in increased execution times (a negative result, given the optimistic nature of the study). The authors explore why it is not trivial to translate savings on individual I/O requests into consistently better overall performance and identify the key problems that need to be addressed in order to improve the potential of prefetching techniques in the environment.
Experimental evaluation of multiprocessor cache-based error recovery

NASA Technical Reports Server (NTRS)

Janssens, Bob; Fuchs, W. K.

1991-01-01

Several variations of cache-based checkpointing for rollback error recovery in shared-memory multiprocessors have been recently developed. By modifying the cache replacement policy, these techniques use the inherent redundancy in the memory hierarchy to periodically checkpoint the computation state. Three schemes, different in the manner in which they avoid rollback propagation, are evaluated. By simulation with address traces from parallel applications running on an Encore Multimax shared-memory multiprocessor, the performance effect of integrating the recovery schemes in the cache coherence protocol are evaluated. The results indicate that the cache-based schemes can provide checkpointing capability with low performance overhead but uncontrollable high variability in the checkpoint interval.
UTDallas Offline Computing System for B Physics with the Babar Experiment at SLAC

NASA Astrophysics Data System (ADS)

Benninger, Tracy L.

1998-10-01

The University of Texas at Dallas High Energy Physics group is building a high performance, large storage computing system for B physics research with the BaBar experiment (``factory'') at the Stanford Linear Accelerator Center. The goal of this system is to analyze one terabyte of complex Event Store data from BaBar in one to two days. The foundation of the computing system is a Sun E6000 Enterprise multiprocessor system, with additions of a Sun StorEdge L1800 Tape Library, a Sun Workstation for processing batch jobs, staging disks and interface cards. The design considerations, current status, projects underway, and possible upgrade paths will be discussed.
Multilevel Parallelization of AutoDock 4.2.

PubMed

Norgan, Andrew P; Coffman, Paul K; Kocher, Jean-Pierre A; Katzmann, David J; Sosa, Carlos P

2011-04-28

Virtual (computational) screening is an increasingly important tool for drug discovery. AutoDock is a popular open-source application for performing molecular docking, the prediction of ligand-receptor interactions. AutoDock is a serial application, though several previous efforts have parallelized various aspects of the program. In this paper, we report on a multi-level parallelization of AutoDock 4.2 (mpAD4). Using MPI and OpenMP, AutoDock 4.2 was parallelized for use on MPI-enabled systems and to multithread the execution of individual docking jobs. In addition, code was implemented to reduce input/output (I/O) traffic by reusing grid maps at each node from docking to docking. Performance of mpAD4 was examined on two multiprocessor computers. Using MPI with OpenMP multithreading, mpAD4 scales with near linearity on the multiprocessor systems tested. In situations where I/O is limiting, reuse of grid maps reduces both system I/O and overall screening time. Multithreading of AutoDock's Lamarkian Genetic Algorithm with OpenMP increases the speed of execution of individual docking jobs, and when combined with MPI parallelization can significantly reduce the execution time of virtual screens. This work is significant in that mpAD4 speeds the execution of certain molecular docking workloads and allows the user to optimize the degree of system-level (MPI) and node-level (OpenMP) parallelization to best fit both workloads and computational resources.
Development and evaluation of a Fault-Tolerant Multiprocessor (FTMP) computer. Volume 3: FTMP test and evaluation

NASA Technical Reports Server (NTRS)

Lala, J. H.; Smith, T. B., III

1983-01-01

The experimental test and evaluation of the Fault-Tolerant Multiprocessor (FTMP) is described. Major objectives of this exercise include expanding validation envelope, building confidence in the system, revealing any weaknesses in the architectural concepts and in their execution in hardware and software, and in general, stressing the hardware and software. To this end, pin-level faults were injected into one LRU of the FTMP and the FTMP response was measured in terms of fault detection, isolation, and recovery times. A total of 21,055 stuck-at-0, stuck-at-1 and invert-signal faults were injected in the CPU, memory, bus interface circuits, Bus Guardian Units, and voters and error latches. Of these, 17,418 were detected. At least 80 percent of undetected faults are estimated to be on unused pins. The multiprocessor identified all detected faults correctly and recovered successfully in each case. Total recovery time for all faults averaged a little over one second. This can be reduced to half a second by including appropriate self-tests.
NASA Workshop on Computational Structural Mechanics 1987, part 1

NASA Technical Reports Server (NTRS)

Sykes, Nancy P. (Editor)

1989-01-01

Topics in Computational Structural Mechanics (CSM) are reviewed. CSM parallel structural methods, a transputer finite element solver, architectures for multiprocessor computers, and parallel eigenvalue extraction are among the topics discussed.
Image Understanding and Intelligent Parallel Systems

DTIC Science & Technology

1991-05-09

a common user interface for the interactive , graphical manipulation of those histories, and...Circuits and Systems, August 1987. Yap, S.-K. and M.L. Scott, "PenGuin: A language for reactive graphical user interface programming," to appear, INTERACT 󈨞, Cambridge, United Kingdom, 1990. 74 ...of up to a factor of 100 over single-workstation implementations. User interfaces to large multiprocessor computers are a difficult issue addressed
Method for wiring allocation and switch configuration in a multiprocessor environment

DOEpatents

Aridor, Yariv [Zichron Ya'akov, IL; Domany, Tamar [Kiryat Tivon, IL; Frachtenberg, Eitan [Jerusalem, IL; Gal, Yoav [Haifa, IL; Shmueli, Edi [Haifa, IL; Stockmeyer, legal representative, Robert E.; Stockmeyer, Larry Joseph [San Jose, CA

2008-07-15

A method for wiring allocation and switch configuration in a multiprocessor computer, the method including employing depth-first tree traversal to determine a plurality of paths among a plurality of processing elements allocated to a job along a plurality of switches and wires in a plurality of D-lines, and selecting one of the paths in accordance with at least one selection criterion.
The Galley Parallel File System

NASA Technical Reports Server (NTRS)

Nieuwejaar, Nils; Kotz, David

1996-01-01

Most current multiprocessor file systems are designed to use multiple disks in parallel, using the high aggregate bandwidth to meet the growing I/0 requirements of parallel scientific applications. Many multiprocessor file systems provide applications with a conventional Unix-like interface, allowing the application to access multiple disks transparently. This interface conceals the parallelism within the file system, increasing the ease of programmability, but making it difficult or impossible for sophisticated programmers and libraries to use knowledge about their I/O needs to exploit that parallelism. In addition to providing an insufficient interface, most current multiprocessor file systems are optimized for a different workload than they are being asked to support. We introduce Galley, a new parallel file system that is intended to efficiently support realistic scientific multiprocessor workloads. We discuss Galley's file structure and application interface, as well as the performance advantages offered by that interface.
Preliminary basic performance analysis of the Cedar multiprocessor memory system

NASA Technical Reports Server (NTRS)

Gallivan, K.; Jalby, W.; Turner, S.; Veidenbaum, A.; Wijshoff, H.

1991-01-01

Some preliminary basic results on the performance of the Cedar multiprocessor memory system are presented. Empirical results are presented and used to calibrate a memory system simulator which is then used to discuss the scalability of the system.
Design and implementation of highly parallel pipelined VLSI systems

NASA Astrophysics Data System (ADS)

Delange, Alphonsus Anthonius Jozef

A methodology and its realization as a prototype CAD (Computer Aided Design) system for the design and analysis of complex multiprocessor systems is presented. The design is an iterative process in which the behavioral specifications of the system components are refined into structural descriptions consisting of interconnections and lower level components etc. A model for the representation and analysis of multiprocessor systems at several levels of abstraction and an implementation of a CAD system based on this model are described. A high level design language, an object oriented development kit for tool design, a design data management system, and design and analysis tools such as a high level simulator and graphics design interface which are integrated into the prototype system and graphics interface are described. Procedures for the synthesis of semiregular processor arrays, and to compute the switching of input/output signals, memory management and control of processor array, and sequencing and segmentation of input/output data streams due to partitioning and clustering of the processor array during the subsequent synthesis steps, are described. The architecture and control of a parallel system is designed and each component mapped to a module or module generator in a symbolic layout library, compacted for design rules of VLSI (Very Large Scale Integration) technology. An example of the design of a processor that is a useful building block for highly parallel pipelined systems in the signal/image processing domains is given.
USC orthogonal multiprocessor for image processing with neural networks

NASA Astrophysics Data System (ADS)

Hwang, Kai; Panda, Dhabaleswar K.; Haddadi, Navid

1990-07-01

This paper presents the architectural features and imaging applications of the Orthogonal MultiProcessor (OMP) system, which is under construction at the University of Southern California with research funding from NSF and assistance from several industrial partners. The prototype OMP is being built with 16 Intel i860 RISC microprocessors and 256 parallel memory modules using custom-designed spanning buses, which are 2-D interleaved and orthogonally accessed without conflicts. The 16-processor OMP prototype is targeted to achieve 430 MIPS and 600 Mflops, which have been verified by simulation experiments based on the design parameters used. The prototype OMP machine will be initially applied for image processing, computer vision, and neural network simulation applications. We summarize important vision and imaging algorithms that can be restructured with neural network models. These algorithms can efficiently run on the OMP hardware with linear speedup. The ultimate goal is to develop a high-performance Visual Computer (Viscom) for integrated low- and high-level image processing and vision tasks.

Proceedings of the second SISAL users` conference

DOE Office of Scientific and Technical Information (OSTI.GOV)

Feo, J T; Frerking, C; Miller, P J

1992-12-01

This report contains papers on the following topics: A sisal code for computing the fourier transform on S{sub N}; five ways to fill your knapsack; simulating material dislocation motion in sisal; candis as an interface for sisal; parallelisation and performance of the burg algorithm on a shared-memory multiprocessor; use of genetic algorithm in sisal to solve the file design problem; implementing FFT`s in sisal; programming and evaluating the performance of signal processing applications in the sisal programming environment; sisal and Von Neumann-based languages: translation and intercommunication; an IF2 code generator for ADAM architecture; program partitioning for NUMA multiprocessor computer systems;more » mapping functional parallelism on distributed memory machines; implicit array copying: prevention is better than cure ; mathematical syntax for sisal; an approach for optimizing recursive functions; implementing arrays in sisal 2.0; Fol: an object oriented extension to the sisal language; twine: a portable, extensible sisal execution kernel; and investigating the memory performance of the optimizing sisal compiler.« less
Efficient mapping algorithms for scheduling robot inverse dynamics computation on a multiprocessor system

NASA Technical Reports Server (NTRS)

Lee, C. S. G.; Chen, C. L.

1989-01-01

Two efficient mapping algorithms for scheduling the robot inverse dynamics computation consisting of m computational modules with precedence relationship to be executed on a multiprocessor system consisting of p identical homogeneous processors with processor and communication costs to achieve minimum computation time are presented. An objective function is defined in terms of the sum of the processor finishing time and the interprocessor communication time. The minimax optimization is performed on the objective function to obtain the best mapping. This mapping problem can be formulated as a combination of the graph partitioning and the scheduling problems; both have been known to be NP-complete. Thus, to speed up the searching for a solution, two heuristic algorithms were proposed to obtain fast but suboptimal mapping solutions. The first algorithm utilizes the level and the communication intensity of the task modules to construct an ordered priority list of ready modules and the module assignment is performed by a weighted bipartite matching algorithm. For a near-optimal mapping solution, the problem can be solved by the heuristic algorithm with simulated annealing. These proposed optimization algorithms can solve various large-scale problems within a reasonable time. Computer simulations were performed to evaluate and verify the performance and the validity of the proposed mapping algorithms. Finally, experiments for computing the inverse dynamics of a six-jointed PUMA-like manipulator based on the Newton-Euler dynamic equations were implemented on an NCUBE/ten hypercube computer to verify the proposed mapping algorithms. Computer simulation and experimental results are compared and discussed.
Spaceborne VHSIC multiprocessor system for AI applications

NASA Technical Reports Server (NTRS)

Lum, Henry, Jr.; Shrobe, Howard E.; Aspinall, John G.

1988-01-01

A multiprocessor system, under design for space-station applications, makes use of the latest generation symbolic processor and packaging technology. The result will be a compact, space-qualified system two to three orders of magnitude more powerful than present-day symbolic processing systems.
Efficient partitioning and assignment on programs for multiprocessor execution

NASA Technical Reports Server (NTRS)

Standley, Hilda M.

1993-01-01

The general problem studied is that of segmenting or partitioning programs for distribution across a multiprocessor system. Efficient partitioning and the assignment of program elements are of great importance since the time consumed in this overhead activity may easily dominate the computation, effectively eliminating any gains made by the use of the parallelism. In this study, the partitioning of sequentially structured programs (written in FORTRAN) is evaluated. Heuristics, developed for similar applications are examined. Finally, a model for queueing networks with finite queues is developed which may be used to analyze multiprocessor system architectures with a shared memory approach to the problem of partitioning. The properties of sequentially written programs form obstacles to large scale (at the procedure or subroutine level) parallelization. Data dependencies of even the minutest nature, reflecting the sequential development of the program, severely limit parallelism. The design of heuristic algorithms is tied to the experience gained in the parallel splitting. Parallelism obtained through the physical separation of data has seen some success, especially at the data element level. Data parallelism on a grander scale requires models that accurately reflect the effects of blocking caused by finite queues. A model for the approximation of the performance of finite queueing networks is developed. This model makes use of the decomposition approach combined with the efficiency of product form solutions.
Processor tradeoffs in distributed real-time systems

NASA Technical Reports Server (NTRS)

Krishna, C. M.; Shin, Kang G.; Bhandari, Inderpal S.

1987-01-01

The problem of the optimization of the design of real-time distributed systems is examined with reference to a class of computer architectures similar to the continuously reconfigurable multiprocessor flight control system structure, CM2FCS. Particular attention is given to the impact of processor replacement and the burn-in time on the probability of dynamic failure and mean cost. The solution is obtained numerically and interpreted in the context of real-time applications.
Design Tools for Evaluating Multiprocessor Programs

DTIC Science & Technology

1976-07-01

than large uniprocessing machines, and 2. economies of scale in manufacturing. Perhaps the most compelling reason (possibly a consequence of the...speed, redundancy, (inefficiency, resource utilization, and economies of the components. [Browne 73, Lehman 66] 6. How can the system be scheduled...mejsures are interesting about the computation? Somn may be: speed, redundancy, (inefficiency, resource utilization, and economies of the components
Algorithms and software for solving finite element equations on serial and parallel architectures

NASA Technical Reports Server (NTRS)

George, Alan

1989-01-01

Over the past 15 years numerous new techniques have been developed for solving systems of equations and eigenvalue problems arising in finite element computations. A package called SPARSPAK has been developed by the author and his co-workers which exploits these new methods. The broad objective of this research project is to incorporate some of this software in the Computational Structural Mechanics (CSM) testbed, and to extend the techniques for use on multiprocessor architectures.
Investigation into the Use of Texturing for Real-Time Computer Animation.

DTIC Science & Technology

1987-12-01

produce a rough polygon surface [7]. Research in the area of real time texturing has also been conducted. Using a specially designed multi-processor system ...Oka, Tsutsui, Ohba, Kurauchi and Tago have introduced real-time manipulation of texture mapped surfaces [8]. Using multi- processors, systems will...a call to the system function defpattern(n,size,mask) short n,size; short *mask, which takes as input an index to a system table of patterns, a
Closed-form solutions of performability. [in computer systems

NASA Technical Reports Server (NTRS)

Meyer, J. F.

1982-01-01

It is noted that if computing system performance is degradable then system evaluation must deal simultaneously with aspects of both performance and reliability. One approach is the evaluation of a system's performability which, relative to a specified performance variable Y, generally requires solution of the probability distribution function of Y. The feasibility of closed-form solutions of performability when Y is continuous are examined. In particular, the modeling of a degradable buffer/multiprocessor system is considered whose performance Y is the (normalized) average throughput rate realized during a bounded interval of time. Employing an approximate decomposition of the model, it is shown that a closed-form solution can indeed be obtained.
A Common Interface Real-Time Multiprocessor Operating System for Embedded Systems

DTIC Science & Technology

1991-03-04

Pressman , a design methodology should show hierarchical organization, lead to modules exhibiting independent functional characteristics, and be derived...Boehm, Barry W. "Software Engineering," Tutorial: Software Design Strategies, 2nd Edition. 35-50. Los Angeles CA: IEEE Computer Society Press, 1981... Pressman , Roger S. Software Engineering: A Practitioner’s Approach, Second Edi- tion. McGraw-Hill Book Company, New York, 1988. 59. Quinn, Michael J
Multitasking runtime systems for the Cedar Multiprocessor

DOE Office of Scientific and Technical Information (OSTI.GOV)

Guzzi, M.D.

1986-07-01

The programming of a MIMD machine is more complex than for SISD and SIMD machines. The multiple computational resources of the machine must be made available to the programming language compiler and to the programmer so that multitasking programs may be written. This thesis will explore the additional complexity of programming a MIMD machine, the Cedar Multiprocessor specifically, and the multitasking runtime system necessary to provide multitasking resources to the user. First, the problem will be well defined: the Cedar machine, its operating system, the programming language, and multitasking concepts will be described. Second, a solution to the problem, calledmore » macrotasking, will be proposed. This solution provides multitasking facilities to the programmer at a very coarse level with many visible machine dependencies. Third, an alternate solution, called microtasking, will be proposed. This solution provides multitasking facilities of a much finer grain. This solution does not depend so rigidly on the specific architecture of the machine. Finally, the two solutions will be compared for effectiveness. 12 refs., 16 figs.« less
Multi-processor including data flow accelerator module

DOEpatents

Davidson, George S.; Pierce, Paul E.

1990-01-01

An accelerator module for a data flow computer includes an intelligent memory. The module is added to a multiprocessor arrangement and uses a shared tagged memory architecture in the data flow computer. The intelligent memory module assigns locations for holding data values in correspondence with arcs leading to a node in a data dependency graph. Each primitive computation is associated with a corresponding memory cell, including a number of slots for operands needed to execute a primitive computation, a primitive identifying pointer, and linking slots for distributing the result of the cell computation to other cells requiring that result as an operand. Circuitry is provided for utilizing tag bits to determine automatically when all operands required by a processor are available and for scheduling the primitive for execution in a queue. Each memory cell of the module may be associated with any of the primitives, and the particular primitive to be executed by the processor associated with the cell is identified by providing an index, such as the cell number for the primitive, to the primitive lookup table of starting addresses. The module thus serves to perform functions previously performed by a number of sections of data flow architectures and coexists with conventional shared memory therein. A multiprocessing system including the module operates in a hybrid mode, wherein the same processing modules are used to perform some processing in a sequential mode, under immediate control of an operating system, while performing other processing in a data flow mode.
Compiler-directed cache management in multiprocessors

NASA Technical Reports Server (NTRS)

Cheong, Hoichi; Veidenbaum, Alexander V.

1990-01-01

The necessity of finding alternatives to hardware-based cache coherence strategies for large-scale multiprocessor systems is discussed. Three different software-based strategies sharing the same goals and general approach are presented. They consist of a simple invalidation approach, a fast selective invalidation scheme, and a version control scheme. The strategies are suitable for shared-memory multiprocessor systems with interconnection networks and a large number of processors. Results of trace-driven simulations conducted on numerical benchmark routines to compare the performance of the three schemes are presented.
A real-time programming system.

PubMed

Townsend, H R

1979-03-01

The paper describes a Basic Operating and Scheduling System (BOSS) designed for a small computer. User programs are organised as self-contained modular 'processes' and the way in which the scheduler divides the time of the computer equally between them, while arranging for any process which has to respond to an interrupt from a peripheral device to be given the necessary priority, is described in detail. Next the procedures provided by the operating system to organise communication between processes are described, and how they are used to construct dynamically self-modifying real-time systems. Finally, the general philosophy of BOSS and applications to a multi-processor assembly are discussed.
Shared Memory Parallelization of an Implicit ADI-type CFD Code

NASA Technical Reports Server (NTRS)

Hauser, Th.; Huang, P. G.

1999-01-01

A parallelization study designed for ADI-type algorithms is presented using the OpenMP specification for shared-memory multiprocessor programming. Details of optimizations specifically addressed to cache-based computer architectures are described and performance measurements for the single and multiprocessor implementation are summarized. The paper demonstrates that optimization of memory access on a cache-based computer architecture controls the performance of the computational algorithm. A hybrid MPI/OpenMP approach is proposed for clusters of shared memory machines to further enhance the parallel performance. The method is applied to develop a new LES/DNS code, named LESTool. A preliminary DNS calculation of a fully developed channel flow at a Reynolds number of 180, Re(sub tau) = 180, has shown good agreement with existing data.
Distributed optimization system and method

DOEpatents

Hurtado, John E.; Dohrmann, Clark R.; Robinett, III, Rush D.

2003-06-10

A search system and method for controlling multiple agents to optimize an objective using distributed sensing and cooperative control. The search agent can be one or more physical agents, such as a robot, and can be software agents for searching cyberspace. The objective can be: chemical sources, temperature sources, radiation sources, light sources, evaders, trespassers, explosive sources, time dependent sources, time independent sources, function surfaces, maximization points, minimization points, and optimal control of a system such as a communication system, an economy, a crane, and a multi-processor computer.
Distributed Optimization System

DOEpatents

Hurtado, John E.; Dohrmann, Clark R.; Robinett, III, Rush D.

2004-11-30

A search system and method for controlling multiple agents to optimize an objective using distributed sensing and cooperative control. The search agent can be one or more physical agents, such as a robot, and can be software agents for searching cyberspace. The objective can be: chemical sources, temperature sources, radiation sources, light sources, evaders, trespassers, explosive sources, time dependent sources, time independent sources, function surfaces, maximization points, minimization points, and optimal control of a system such as a communication system, an economy, a crane, and a multi-processor computer.
An efficient 3-dim FFT for plane wave electronic structure calculations on massively parallel machines composed of multiprocessor nodes

NASA Astrophysics Data System (ADS)

Goedecker, Stefan; Boulet, Mireille; Deutsch, Thierry

2003-08-01

Three-dimensional Fast Fourier Transforms (FFTs) are the main computational task in plane wave electronic structure calculations. Obtaining a high performance on a large numbers of processors is non-trivial on the latest generation of parallel computers that consist of nodes made up of a shared memory multiprocessors. A non-dogmatic method for obtaining high performance for such 3-dim FFTs in a combined MPI/OpenMP programming paradigm will be presented. Exploiting the peculiarities of plane wave electronic structure calculations, speedups of up to 160 and speeds of up to 130 Gflops were obtained on 256 processors.
Automatic partitioning of unstructured meshes for the parallel solution of problems in computational mechanics

NASA Technical Reports Server (NTRS)

Farhat, Charbel; Lesoinne, Michel

1993-01-01

Most of the recently proposed computational methods for solving partial differential equations on multiprocessor architectures stem from the 'divide and conquer' paradigm and involve some form of domain decomposition. For those methods which also require grids of points or patches of elements, it is often necessary to explicitly partition the underlying mesh, especially when working with local memory parallel processors. In this paper, a family of cost-effective algorithms for the automatic partitioning of arbitrary two- and three-dimensional finite element and finite difference meshes is presented and discussed in view of a domain decomposed solution procedure and parallel processing. The influence of the algorithmic aspects of a solution method (implicit/explicit computations), and the architectural specifics of a multiprocessor (SIMD/MIMD, startup/transmission time), on the design of a mesh partitioning algorithm are discussed. The impact of the partitioning strategy on load balancing, operation count, operator conditioning, rate of convergence and processor mapping is also addressed. Finally, the proposed mesh decomposition algorithms are demonstrated with realistic examples of finite element, finite volume, and finite difference meshes associated with the parallel solution of solid and fluid mechanics problems on the iPSC/2 and iPSC/860 multiprocessors.
Validation of fault-free behavior of a reliable multiprocessor system - FTMP: A case study. [Fault-Tolerant Multi-Processor avionics

NASA Technical Reports Server (NTRS)

Clune, E.; Segall, Z.; Siewiorek, D.

1984-01-01

A program of experiments has been conducted at NASA-Langley to test the fault-free performance of a Fault-Tolerant Multiprocessor (FTMP) avionics system for next-generation aircraft. Baseline measurements of an operating FTMP system were obtained with respect to the following parameters: instruction execution time, frame size, and the variation of clock ticks. The mechanisms of frame stretching were also investigated. The experimental results are summarized in a table. Areas of interest for future tests are identified, with emphasis given to the implementation of a synthetic workload generation mechanism on FTMP.

Mechanism of supporting sub-communicator collectives with O(64) counters as opposed to one counter for each sub-communicator

DOE Office of Scientific and Technical Information (OSTI.GOV)

Kumar, Sameer; Mamidala, Amith R.; Ratterman, Joseph D.

A system and method for enhancing barrier collective synchronization on a computer system comprises a computer system including a data storage device. The computer system includes a program stored in the data storage device and steps of the program being executed by a processor. The system includes providing a plurality of communicators for storing state information for a bather algorithm. Each communicator designates a master core in a multi-processor environment of the computer system. The system allocates or designates one counter for each of a plurality of threads. The system configures a table with a number of entries equal tomore » the maximum number of threads. The system sets a table entry with an ID associated with a communicator when a process thread initiates a collective. The system determines an allocated or designated counter by searching entries in the table.« less
Mechanism of supporting sub-communicator collectives with o(64) counters as opposed to one counter for each sub-communicator

DOE Office of Scientific and Technical Information (OSTI.GOV)

Blocksome, Michael; Kumar, Sameer; Mamidala, Amith R.

A system and method for enhancing barrier collective synchronization on a computer system comprises a computer system including a data storage device. The computer system includes a program stored in the data storage device and steps of the program being executed by a processor. The system includes providing a plurality of communicators for storing state information for a barrier algorithm. Each communicator designates a master core in a multi-processor environment of the computer system. The system allocates or designates one counter for each of a plurality of threads. The system configures a table with a number of entries equal tomore » the maximum number of threads. The system sets a table entry with an ID associated with a communicator when a process thread initiates a collective. The system determines an allocated or designated counter by searching entries in the table.« less
Mechanism of supporting sub-communicator collectives with O(64) counters as opposed to one counter for each sub-communicator

DOEpatents

Kumar, Sameer; Mamidala, Amith R.; Ratterman, Joseph D.; Blocksome, Michael; Miller, Douglas

2013-09-03

A system and method for enhancing barrier collective synchronization on a computer system comprises a computer system including a data storage device. The computer system includes a program stored in the data storage device and steps of the program being executed by a processor. The system includes providing a plurality of communicators for storing state information for a bather algorithm. Each communicator designates a master core in a multi-processor environment of the computer system. The system allocates or designates one counter for each of a plurality of threads. The system configures a table with a number of entries equal to the maximum number of threads. The system sets a table entry with an ID associated with a communicator when a process thread initiates a collective. The system determines an allocated or designated counter by searching entries in the table.
A general model for memory interference in a multiprocessor system with memory hierarchy

NASA Technical Reports Server (NTRS)

Taha, Badie A.; Standley, Hilda M.

1989-01-01

The problem of memory interference in a multiprocessor system with a hierarchy of shared buses and memories is addressed. The behavior of the processors is represented by a sequence of memory requests with each followed by a determined amount of processing time. A statistical queuing network model for determining the extent of memory interference in multiprocessor systems with clusters of memory hierarchies is presented. The performance of the system is measured by the expected number of busy memory clusters. The results of the analytic model are compared with simulation results, and the correlation between them is found to be very high.
Characterizing parallel file-access patterns on a large-scale multiprocessor

NASA Technical Reports Server (NTRS)

Purakayastha, Apratim; Ellis, Carla Schlatter; Kotz, David; Nieuwejaar, Nils; Best, Michael

1994-01-01

Rapid increases in the computational speeds of multiprocessors have not been matched by corresponding performance enhancements in the I/O subsystem. To satisfy the large and growing I/O requirements of some parallel scientific applications, we need parallel file systems that can provide high-bandwidth and high-volume data transfer between the I/O subsystem and thousands of processors. Design of such high-performance parallel file systems depends on a thorough grasp of the expected workload. So far there have been no comprehensive usage studies of multiprocessor file systems. Our CHARISMA project intends to fill this void. The first results from our study involve an iPSC/860 at NASA Ames. This paper presents results from a different platform, the CM-5 at the National Center for Supercomputing Applications. The CHARISMA studies are unique because we collect information about every individual read and write request and about the entire mix of applications running on the machines. The results of our trace analysis lead to recommendations for parallel file system design. First the file system should support efficient concurrent access to many files, and I/O requests from many jobs under varying load conditions. Second, it must efficiently manage large files kept open for long periods. Third, it should expect to see small requests predominantly sequential access patterns, application-wide synchronous access, no concurrent file-sharing between jobs appreciable byte and block sharing between processes within jobs, and strong interprocess locality. Finally, the trace data suggest that node-level write caches and collective I/O request interfaces may be useful in certain environments.
Parallel and fault-tolerant algorithms for hypercube multiprocessors

DOE Office of Scientific and Technical Information (OSTI.GOV)

Aykanat, C.

1988-01-01

Several techniques for increasing the performance of parallel algorithms on distributed-memory message-passing multi-processor systems are investigated. These techniques are effectively implemented for the parallelization of the Scaled Conjugate Gradient (SCG) algorithm on a hypercube connected message-passing multi-processor. Significant performance improvement is achieved by using these techniques. The SCG algorithm is used for the solution phase of an FE modeling system. Almost linear speed-up is achieved, and it is shown that hypercube topology is scalable for an FE class of problem. The SCG algorithm is also shown to be suitable for vectorization, and near supercomputer performance is achieved on a vectormore » hypercube multiprocessor by exploiting both parallelization and vectorization. Fault-tolerance issues for the parallel SCG algorithm and for the hypercube topology are also addressed.« less
Memory interface simulator: A computer design aid

NASA Technical Reports Server (NTRS)

Taylor, D. S.; Williams, T.; Weatherbee, J. E.

1972-01-01

Results are presented of a study conducted with a digital simulation model being used in the design of the Automatically Reconfigurable Modular Multiprocessor System (ARMMS), a candidate computer system for future manned and unmanned space missions. The model simulates the activity involved as instructions are fetched from random access memory for execution in one of the system central processing units. A series of model runs measured instruction execution time under various assumptions pertaining to the CPU's and the interface between the CPU's and RAM. Design tradeoffs are presented in the following areas: Bus widths, CPU microprogram read only memory cycle time, multiple instruction fetch, and instruction mix.
Reproducibility in a multiprocessor system

DOEpatents

Bellofatto, Ralph A; Chen, Dong; Coteus, Paul W; Eisley, Noel A; Gara, Alan; Gooding, Thomas M; Haring, Rudolf A; Heidelberger, Philip; Kopcsay, Gerard V; Liebsch, Thomas A; Ohmacht, Martin; Reed, Don D; Senger, Robert M; Steinmacher-Burow, Burkhard; Sugawara, Yutaka

2013-11-26

Fixing a problem is usually greatly aided if the problem is reproducible. To ensure reproducibility of a multiprocessor system, the following aspects are proposed; a deterministic system start state, a single system clock, phase alignment of clocks in the system, system-wide synchronization events, reproducible execution of system components, deterministic chip interfaces, zero-impact communication with the system, precise stop of the system and a scan of the system state.
Design consideration in constructing high performance embedded Knowledge-Based Systems (KBS)

NASA Technical Reports Server (NTRS)

Dalton, Shelly D.; Daley, Philip C.

1988-01-01

As the hardware trends for artificial intelligence (AI) involve more and more complexity, the process of optimizing the computer system design for a particular problem will also increase in complexity. Space applications of knowledge based systems (KBS) will often require an ability to perform both numerically intensive vector computations and real time symbolic computations. Although parallel machines can theoretically achieve the speeds necessary for most of these problems, if the application itself is not highly parallel, the machine's power cannot be utilized. A scheme is presented which will provide the computer systems engineer with a tool for analyzing machines with various configurations of array, symbolic, scaler, and multiprocessors. High speed networks and interconnections make customized, distributed, intelligent systems feasible for the application of AI in space. The method presented can be used to optimize such AI system configurations and to make comparisons between existing computer systems. It is an open question whether or not, for a given mission requirement, a suitable computer system design can be constructed for any amount of money.
Address tracing for parallel machines

NASA Technical Reports Server (NTRS)

Stunkel, Craig B.; Janssens, Bob; Fuchs, W. Kent

1991-01-01

Recently implemented parallel system address-tracing methods based on several metrics are surveyed. The issues specific to collection of traces for both shared and distributed memory parallel computers are highlighted. Five general categories of address-trace collection methods are examined: hardware-captured, interrupt-based, simulation-based, altered microcode-based, and instrumented program-based traces. The problems unique to shared memory and distributed memory multiprocessors are examined separately.
Sequoia: A fault-tolerant tightly coupled multiprocessor for transaction processing

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bernstein, P.A.

1988-02-01

The Sequoia computer is a tightly coupled multiprocessor, and thus attains the performance advantages of this style of architecture. It avoids most of the fault-tolerance disadvantages of tight coupling by using a new fault-tolerance design. The Sequoia architecture is similar to other multimicroprocessor architectures, such as those of Encore and Sequent, in that it gives dozens of microprocessors shared access to a large main memory. It resembles the Stratus architecture in its extensive use of hardware fault-detection techniques. It resembles Stratus and Auragen in its ability to quickly recover all processes after a single point failure, transparently to the user.more » However, Sequoia is unique in its combination of a large-scale tightly coupled architecture with a hardware approach to fault tolerance. This article gives an overview of how the hardware architecture and operating systems (OS) work together to provide a high degree of fault tolerance with good system performance.« less
Comparative Implementation of High Performance Computing for Power System Dynamic Simulations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Jin, Shuangshuang; Huang, Zhenyu; Diao, Ruisheng

Dynamic simulation for transient stability assessment is one of the most important, but intensive, computations for power system planning and operation. Present commercial software is mainly designed for sequential computation to run a single simulation, which is very time consuming with a single processer. The application of High Performance Computing (HPC) to dynamic simulations is very promising in accelerating the computing process by parallelizing its kernel algorithms while maintaining the same level of computation accuracy. This paper describes the comparative implementation of four parallel dynamic simulation schemes in two state-of-the-art HPC environments: Message Passing Interface (MPI) and Open Multi-Processing (OpenMP).more » These implementations serve to match the application with dedicated multi-processor computing hardware and maximize the utilization and benefits of HPC during the development process.« less
Multiprocessor graphics computation and display using transputers

NASA Technical Reports Server (NTRS)

Ellis, Graham K.

1988-01-01

A package of two-dimensional graphics routines was developed to run on a transputer-based parallel processing system. These routines were designed to enable applications programmers to easily generate and display results from the transputer network in a graphic format. The graphics procedures were designed for the lowest possible network communication overhead for increased performance. The routines were designed for ease of use and to present an intuitive approach to generating graphics on the transputer parallel processing system.
Performance issues for domain-oriented time-driven distributed simulations

NASA Technical Reports Server (NTRS)

Nicol, David M.

1987-01-01

It has long been recognized that simulations form an interesting and important class of computations that may benefit from distributed or parallel processing. Since the point of parallel processing is improved performance, the recent proliferation of multiprocessors requires that we consider the performance issues that naturally arise when attempting to implement a distributed simulation. Three such issues are: (1) the problem of mapping the simulation onto the architecture, (2) the possibilities for performing redundant computation in order to reduce communication, and (3) the avoidance of deadlock due to distributed contention for message-buffer space. These issues are discussed in the context of a battlefield simulation implemented on a medium-scale multiprocessor message-passing architecture.
Performances of multiprocessor multidisk architectures for continuous media storage

NASA Astrophysics Data System (ADS)

Gennart, Benoit A.; Messerli, Vincent; Hersch, Roger D.

1996-03-01

Multimedia interfaces increase the need for large image databases, capable of storing and reading streams of data with strict synchronicity and isochronicity requirements. In order to fulfill these requirements, we consider a parallel image server architecture which relies on arrays of intelligent disk nodes, each disk node being composed of one processor and one or more disks. This contribution analyzes through bottleneck performance evaluation and simulation the behavior of two multi-processor multi-disk architectures: a point-to-point architecture and a shared-bus architecture similar to current multiprocessor workstation architectures. We compare the two architectures on the basis of two multimedia algorithms: the compute-bound frame resizing by resampling and the data-bound disk-to-client stream transfer. The results suggest that the shared bus is a potential bottleneck despite its very high hardware throughput (400Mbytes/s) and that an architecture with addressable local memories located closely to their respective processors could partially remove this bottleneck. The point- to-point architecture is scalable and able to sustain high throughputs for simultaneous compute- bound and data-bound operations.
An Evaluation of Architectural Platforms for Parallel Navier-Stokes Computations

NASA Technical Reports Server (NTRS)

Jayasimha, D. N.; Hayder, M. E.; Pillay, S. K.

1996-01-01

We study the computational, communication, and scalability characteristics of a computational fluid dynamics application, which solves the time accurate flow field of a jet using the compressible Navier-Stokes equations, on a variety of parallel architecture platforms. The platforms chosen for this study are a cluster of workstations (the LACE experimental testbed at NASA Lewis), a shared memory multiprocessor (the Cray YMP), and distributed memory multiprocessors with different topologies - the IBM SP and the Cray T3D. We investigate the impact of various networks connecting the cluster of workstations on the performance of the application and the overheads induced by popular message passing libraries used for parallelization. The work also highlights the importance of matching the memory bandwidth to the processor speed for good single processor performance. By studying the performance of an application on a variety of architectures, we are able to point out the strengths and weaknesses of each of the example computing platforms.
Parallelizing Navier-Stokes Computations on a Variety of Architectural Platforms

NASA Technical Reports Server (NTRS)

Jayasimha, D. N.; Hayder, M. E.; Pillay, S. K.

1997-01-01

We study the computational, communication, and scalability characteristics of a Computational Fluid Dynamics application, which solves the time accurate flow field of a jet using the compressible Navier-Stokes equations, on a variety of parallel architectural platforms. The platforms chosen for this study are a cluster of workstations (the LACE experimental testbed at NASA Lewis), a shared memory multiprocessor (the Cray YMP), distributed memory multiprocessors with different topologies-the IBM SP and the Cray T3D. We investigate the impact of various networks, connecting the cluster of workstations, on the performance of the application and the overheads induced by popular message passing libraries used for parallelization. The work also highlights the importance of matching the memory bandwidth to the processor speed for good single processor performance. By studying the performance of an application on a variety of architectures, we are able to point out the strengths and weaknesses of each of the example computing platforms.
The Automated Instrumentation and Monitoring System (AIMS): Design and Architecture. 3.2

NASA Technical Reports Server (NTRS)

Yan, Jerry C.; Schmidt, Melisa; Schulbach, Cathy; Bailey, David (Technical Monitor)

1997-01-01

Whether a researcher is designing the 'next parallel programming paradigm', another 'scalable multiprocessor' or investigating resource allocation algorithms for multiprocessors, a facility that enables parallel program execution to be captured and displayed is invaluable. Careful analysis of such information can help computer and software architects to capture, and therefore, exploit behavioral variations among/within various parallel programs to take advantage of specific hardware characteristics. A software tool-set that facilitates performance evaluation of parallel applications on multiprocessors has been put together at NASA Ames Research Center under the sponsorship of NASA's High Performance Computing and Communications Program over the past five years. The Automated Instrumentation and Monitoring Systematic has three major software components: a source code instrumentor which automatically inserts active event recorders into program source code before compilation; a run-time performance monitoring library which collects performance data; and a visualization tool-set which reconstructs program execution based on the data collected. Besides being used as a prototype for developing new techniques for instrumenting, monitoring and presenting parallel program execution, AIMS is also being incorporated into the run-time environments of various hardware testbeds to evaluate their impact on user productivity. Currently, the execution of FORTRAN and C programs on the Intel Paragon and PALM workstations can be automatically instrumented and monitored. Performance data thus collected can be displayed graphically on various workstations. The process of performance tuning with AIMS will be illustrated using various NAB Parallel Benchmarks. This report includes a description of the internal architecture of AIMS and a listing of the source code.
Highway Traffic Simulations on Multi-Processor Computers

DOT National Transportation Integrated Search

1997-01-01

A computer model has been developed to simulate highway traffic for various degrees of automation with a high degree of fidelity in regard to driver control and vehicle characteristics. The model simulates vehicle maneuvering in a multi-lane highway ...
Characterization of real-time computers

NASA Technical Reports Server (NTRS)

Shin, K. G.; Krishna, C. M.

1984-01-01

A real-time system consists of a computer controller and controlled processes. Despite the synergistic relationship between these two components, they have been traditionally designed and analyzed independently of and separately from each other; namely, computer controllers by computer scientists/engineers and controlled processes by control scientists. As a remedy for this problem, in this report real-time computers are characterized by performance measures based on computer controller response time that are: (1) congruent to the real-time applications, (2) able to offer an objective comparison of rival computer systems, and (3) experimentally measurable/determinable. These measures, unlike others, provide the real-time computer controller with a natural link to controlled processes. In order to demonstrate their utility and power, these measures are first determined for example controlled processes on the basis of control performance functionals. They are then used for two important real-time multiprocessor design applications - the number-power tradeoff and fault-masking and synchronization.

Hyperswitch communication network

NASA Technical Reports Server (NTRS)

Peterson, J.; Pniel, M.; Upchurch, E.

1991-01-01

The Hyperswitch Communication Network (HCN) is a large scale parallel computer prototype being developed at JPL. Commercial versions of the HCN computer are planned. The HCN computer being designed is a message passing multiple instruction multiple data (MIMD) computer, and offers many advantages in price-performance ratio, reliability and availability, and manufacturing over traditional uniprocessors and bus based multiprocessors. The design of the HCN operating system is a uniquely flexible environment that combines both parallel processing and distributed processing. This programming paradigm can achieve a balance among the following competing factors: performance in processing and communications, user friendliness, and fault tolerance. The prototype is being designed to accommodate a maximum of 64 state of the art microprocessors. The HCN is classified as a distributed supercomputer. The HCN system is described, and the performance/cost analysis and other competing factors within the system design are reviewed.
A Heterogeneous Multiprocessor Graphics System Using Processor-Enhanced Memories

DTIC Science & Technology

1989-02-01

frames per second, font generation directly from conic spline descriptions, and rapid calculation of radiosity form factors. The hardware consists of...generality for rendering curved surfaces, volume data, objects dcscri id with Constructive Solid Geometry, for rendering scenes using the radiosity ...f.aces and for computing a spherical radiosity lighting model (see Section 7.6). Custom Memory Chips \\ 208 bits x 128 pixels - Renderer Board ix p o a
Evaluation of SuperLU on multicore architectures

NASA Astrophysics Data System (ADS)

Li, X. S.

2008-07-01

The Chip Multiprocessor (CMP) will be the basic building block for computer systems ranging from laptops to supercomputers. New software developments at all levels are needed to fully utilize these systems. In this work, we evaluate performance of different high-performance sparse LU factorization and triangular solution algorithms on several representative multicore machines. We included both Pthreads and MPI implementations in this study and found that the Pthreads implementation consistently delivers good performance and that a left-looking algorithm is usually superior.
Efficient ICCG on a shared memory multiprocessor

NASA Technical Reports Server (NTRS)

Hammond, Steven W.; Schreiber, Robert

1989-01-01

Different approaches are discussed for exploiting parallelism in the ICCG (Incomplete Cholesky Conjugate Gradient) method for solving large sparse symmetric positive definite systems of equations on a shared memory parallel computer. Techniques for efficiently solving triangular systems and computing sparse matrix-vector products are explored. Three methods for scheduling the tasks in solving triangular systems are implemented on the Sequent Balance 21000. Sample problems that are representative of a large class of problems solved using iterative methods are used. We show that a static analysis to determine data dependences in the triangular solve can greatly improve its parallel efficiency. We also show that ignoring symmetry and storing the whole matrix can reduce solution time substantially.
The Experimental Mathematician: The Pleasure of Discovery and the Role of Proof

ERIC Educational Resources Information Center

Borwein, Jonathan M.

2005-01-01

The emergence of powerful mathematical computing environments, the growing availability of correspondingly powerful (multi-processor) computers and the pervasive presence of the Internet allow for mathematicians, students and teachers, to proceed heuristically and "quasi-inductively." We may increasingly use symbolic and numeric computation,…
Satisfiability Test with Synchronous Simulated Annealing on the Fujitsu AP1000 Massively-Parallel Multiprocessor

NASA Technical Reports Server (NTRS)

Sohn, Andrew; Biswas, Rupak

1996-01-01

Solving the hard Satisfiability Problem is time consuming even for modest-sized problem instances. Solving the Random L-SAT Problem is especially difficult due to the ratio of clauses to variables. This report presents a parallel synchronous simulated annealing method for solving the Random L-SAT Problem on a large-scale distributed-memory multiprocessor. In particular, we use a parallel synchronous simulated annealing procedure, called Generalized Speculative Computation, which guarantees the same decision sequence as sequential simulated annealing. To demonstrate the performance of the parallel method, we have selected problem instances varying in size from 100-variables/425-clauses to 5000-variables/21,250-clauses. Experimental results on the AP1000 multiprocessor indicate that our approach can satisfy 99.9 percent of the clauses while giving almost a 70-fold speedup on 500 processors.
Nuclear Science Symposium, 31st and Symposium on Nuclear Power Systems, 16th, Orlando, FL, October 31-November 2, 1984, Proceedings

NASA Technical Reports Server (NTRS)

Biggerstaff, J. A. (Editor)

1985-01-01

Topics related to physics instrumentation are discussed, taking into account cryostat and electronic development associated with multidetector spectrometer systems, the influence of materials and counting-rate effects on He-3 neutron spectrometry, a data acquisition system for time-resolved muscle experiments, and a sensitive null detector for precise measurements of integral linearity. Other subjects explored are concerned with space instrumentation, computer applications, detectors, instrumentation for high energy physics, instrumentation for nuclear medicine, environmental monitoring and health physics instrumentation, nuclear safeguards and reactor instrumentation, and a 1984 symposium on nuclear power systems. Attention is given to the application of multiprocessors to scientific problems, a large-scale computer facility for computational aerodynamics, a single-board 32-bit computer for the Fastbus, the integration of detector arrays and readout electronics on a single chip, and three-dimensional Monte Carlo simulation of the electron avalanche in a proportional counter.
Supercomputer optimizations for stochastic optimal control applications

NASA Technical Reports Server (NTRS)

Chung, Siu-Leung; Hanson, Floyd B.; Xu, Huihuang

1991-01-01

Supercomputer optimizations for a computational method of solving stochastic, multibody, dynamic programming problems are presented. The computational method is valid for a general class of optimal control problems that are nonlinear, multibody dynamical systems, perturbed by general Markov noise in continuous time, i.e., nonsmooth Gaussian as well as jump Poisson random white noise. Optimization techniques for vector multiprocessors or vectorizing supercomputers include advanced data structures, loop restructuring, loop collapsing, blocking, and compiler directives. These advanced computing techniques and superconducting hardware help alleviate Bellman's curse of dimensionality in dynamic programming computations, by permitting the solution of large multibody problems. Possible applications include lumped flight dynamics models for uncertain environments, such as large scale and background random aerospace fluctuations.
Realtime multiprocessor for mobile ad hoc networks

NASA Astrophysics Data System (ADS)

Jungeblut, T.; Grünewald, M.; Porrmann, M.; Rückert, U.

2008-05-01

This paper introduces a real-time Multiprocessor System-On-Chip (MPSoC) for low power wireless applications. The multiprocessor is based on eight 32bit RISC processors that are connected via an Network-On-Chip (NoC). The NoC follows a novel approach with guaranteed bandwidth to the application that meets hard realtime requirements. At a clock frequency of 100 MHz the total power consumption of the MPSoC that has been fabricated in 180 nm UMC standard cell technology is 772 mW.
A Multiprocessor Operating System Simulator

NASA Technical Reports Server (NTRS)

Johnston, Gary M.; Campbell, Roy H.

1988-01-01

This paper describes a multiprocessor operating system simulator that was developed by the authors in the Fall semester of 1987. The simulator was built in response to the need to provide students with an environment in which to build and test operating system concepts as part of the coursework of a third-year undergraduate operating systems course. Written in C++, the simulator uses the co-routine style task package that is distributed with the AT&T C++ Translator to provide a hierarchy of classes that represents a broad range of operating system software and hardware components. The class hierarchy closely follows that of the 'Choices' family of operating systems for loosely- and tightly-coupled multiprocessors. During an operating system course, these classes are refined and specialized by students in homework assignments to facilitate experimentation with different aspects of operating system design and policy decisions. The current implementation runs on the IBM RT PC under 4.3bsd UNIX.
Fast neural net simulation with a DSP processor array.

PubMed

Muller, U A; Gunzinger, A; Guggenbuhl, W

1995-01-01

This paper describes the implementation of a fast neural net simulator on a novel parallel distributed-memory computer. A 60-processor system, named MUSIC (multiprocessor system with intelligent communication), is operational and runs the backpropagation algorithm at a speed of 330 million connection updates per second (continuous weight update) using 32-b floating-point precision. This is equal to 1.4 Gflops sustained performance. The complete system with 3.8 Gflops peak performance consumes less than 800 W of electrical power and fits into a 19-in rack. While reaching the speed of modern supercomputers, MUSIC still can be used as a personal desktop computer at a researcher's own disposal. In neural net simulation, this gives a computing performance to a single user which was unthinkable before. The system's real-time interfaces make it especially useful for embedded applications.
The Minerva Multi-Microprocessor.

DTIC Science & Technology

A multiprocessor system is described which is an experiment in low cost, extensible, multiprocessor architectures. Global issues such as inclusion of a central bus, design of the bus arbiter, and methods of interrupt handling are considered. The system initially includes two processor types, based on microprocessors, and these are discussed. Methods for reducing processor demand for the central bus are described.
Design and Implementation of Embedded Computer Vision Systems Based on Particle Filters

DTIC Science & Technology

2010-01-01

for hardware/software implementa- tion of multi-dimensional particle filter application and we explore this in the third application which is a 3D...methodology for hardware/software implementation of multi-dimensional particle filter application and we explore this in the third application which is a...and hence multiprocessor implementation of parti- cle filters is an important option to examine. A significant body of work exists on optimizing generic
Multiprocessor smalltalk: Implementation, performance, and analysis

DOE Office of Scientific and Technical Information (OSTI.GOV)

Pallas, J.I.

1990-01-01

Multiprocessor Smalltalk demonstrates the value of object-oriented programming on a multiprocessor. Its implementation and analysis shed light on three areas: concurrent programming in an object oriented language without special extensions, implementation techniques for adapting to multiprocessors, and performance factors in the resulting system. Adding parallelism to Smalltalk code is easy, because programs already use control abstractions like iterators. Smalltalk's basic control and concurrency primitives (lambda expressions, processes and semaphores) can be used to build parallel control abstractions, including parallel iterators, parallel objects, atomic objects, and futures. Language extensions for concurrency are not required. This implementation demonstrates that it is possiblemore » to build an efficient parallel object-oriented programming system and illustrates techniques for doing so. Three modification tools-serialization, replication, and reorganization-adapted the Berkeley Smalltalk interpreter to the Firefly multiprocessor. Multiprocessor Smalltalk's performance shows that the combination of multiprocessing and object-oriented programming can be effective: speedups (relative to the original serial version) exceed 2.0 for five processors on all the benchmarks; the median efficiency is 48%. Analysis shows both where performance is lost and how to improve and generalize the experimental results. Changes in the interpreter to support concurrency add at most 12% overhead; better access to per-process variables could eliminate much of that. Changes in the user code to express concurrency add as much as 70% overhead; this overhead could be reduced to 54% if blocks (lambda expressions) were reentrant. Performance is also lost when the program cannot keep all five processors busy.« less
Parallel Navier-Stokes computations on shared and distributed memory architectures

NASA Technical Reports Server (NTRS)

Hayder, M. Ehtesham; Jayasimha, D. N.; Pillay, Sasi Kumar

1995-01-01

We study a high order finite difference scheme to solve the time accurate flow field of a jet using the compressible Navier-Stokes equations. As part of our ongoing efforts, we have implemented our numerical model on three parallel computing platforms to study the computational, communication, and scalability characteristics. The platforms chosen for this study are a cluster of workstations connected through fast networks (the LACE experimental testbed at NASA Lewis), a shared memory multiprocessor (the Cray YMP), and a distributed memory multiprocessor (the IBM SPI). Our focus in this study is on the LACE testbed. We present some results for the Cray YMP and the IBM SP1 mainly for comparison purposes. On the LACE testbed, we study: (1) the communication characteristics of Ethernet, FDDI, and the ALLNODE networks and (2) the overheads induced by the PVM message passing library used for parallelizing the application. We demonstrate that clustering of workstations is effective and has the potential to be computationally competitive with supercomputers at a fraction of the cost.
SIAM Conference on Parallel Processing for Scientific Computing, 4th, Chicago, IL, Dec. 11-13, 1989, Proceedings

NASA Technical Reports Server (NTRS)

Dongarra, Jack (Editor); Messina, Paul (Editor); Sorensen, Danny C. (Editor); Voigt, Robert G. (Editor)

1990-01-01

Attention is given to such topics as an evaluation of block algorithm variants in LAPACK and presents a large-grain parallel sparse system solver, a multiprocessor method for the solution of the generalized Eigenvalue problem on an interval, and a parallel QR algorithm for iterative subspace methods on the CM2. A discussion of numerical methods includes the topics of asynchronous numerical solutions of PDEs on parallel computers, parallel homotopy curve tracking on a hypercube, and solving Navier-Stokes equations on the Cedar Multi-Cluster system. A section on differential equations includes a discussion of a six-color procedure for the parallel solution of elliptic systems using the finite quadtree structure, data parallel algorithms for the finite element method, and domain decomposition methods in aerodynamics. Topics dealing with massively parallel computing include hypercube vs. 2-dimensional meshes and massively parallel computation of conservation laws. Performance and tools are also discussed.
Multiprocessor switch with selective pairing

DOEpatents

Gara, Alan; Gschwind, Michael K; Salapura, Valentina

2014-03-11

System, method and computer program product for a multiprocessing system to offer selective pairing of processor cores for increased processing reliability. A selective pairing facility is provided that selectively connects, i.e., pairs, multiple microprocessor or processor cores to provide one highly reliable thread (or thread group). Each paired microprocessor or processor cores that provide one highly reliable thread for high-reliability connect with a system components such as a memory "nest" (or memory hierarchy), an optional system controller, and optional interrupt controller, optional I/O or peripheral devices, etc. The memory nest is attached to a selective pairing facility via a switch or a bus
The high hall ventilation with the simplified simulation of the fan

NASA Astrophysics Data System (ADS)

Kyncl, Martin; Pelant, Jaroslav

2018-06-01

Here we work with the system of equations describing the non-stationary compressible turbulent multi-component flow in the gravitational field. We focus on the numerical simulation of the fan situated inside the high hall. The RANS equations are discretized with the use of the finite volume method. The original modification of the Riemann problem and its solution is used at the boundaries. The combination of specific boundary conditions is used for the simulation of the fan. The presented computational results are computed with own-developed code (C, FORTRAN, multiprocessor, unstructured meshes in general).
A Real-Time Linux for Multicore Platforms

DTIC Science & Technology

2013-12-20

under ARO support) to obtain a fully-functional OS for supporting real-time workloads on multicore platforms. This system, called LITMUS -RT...to be specified as plugin components. LITMUS -RT is open-source software (available at The views, opinions and/or findings contained in this report... LITMUS -RT (LInux Testbed for MUltiprocessor Scheduling in Real-Time systems), allows different multiprocessor real-time scheduling and
HEP - A semaphore-synchronized multiprocessor with central control. [Heterogeneous Element Processor

NASA Technical Reports Server (NTRS)

Gilliland, M. C.; Smith, B. J.; Calvert, W.

1976-01-01

The paper describes the design concept of the Heterogeneous Element Processor (HEP), a system tailored to the special needs of scientific simulation. In order to achieve high-speed computation required by simulation, HEP features a hierarchy of processes executing in parallel on a number of processors, with synchronization being largely accomplished by hardware. A full-empty-reserve scheme of synchronization is realized by zero-one-valued hardware semaphores. A typical system has, besides the control computer and the scheduler, an algebraic module, a memory module, a first-in first-out (FIFO) module, an integrator module, and an I/O module. The architecture of the scheduler and the algebraic module is examined in detail.

A fault-tolerant information processing concept for space vehicles.

NASA Technical Reports Server (NTRS)

Hopkins, A. L., Jr.

1971-01-01

A distributed fault-tolerant information processing system is proposed, comprising a central multiprocessor, dedicated local processors, and multiplexed input-output buses connecting them together. The processors in the multiprocessor are duplicated for error detection, which is felt to be less expensive than using coded redundancy of comparable effectiveness. Error recovery is made possible by a triplicated scratchpad memory in each processor. The main multiprocessor memory uses replicated memory for error detection and correction. Local processors use any of three conventional redundancy techniques: voting, duplex pairs with backup, and duplex pairs in independent subsystems.
Cache-based error recovery for shared memory multiprocessor systems

NASA Technical Reports Server (NTRS)

Wu, Kun-Lung; Fuchs, W. Kent; Patel, Janak H.

1989-01-01

A multiprocessor cache-based checkpointing and recovery scheme for of recovering from transient processor errors in a shared-memory multiprocessor with private caches is presented. New implementation techniques that use checkpoint identifiers and recovery stacks to reduce performance degradation in processor utilization during normal execution are examined. This cache-based checkpointing technique prevents rollback propagation, provides for rapid recovery, and can be integrated into standard cache coherence protocols. An analytical model is used to estimate the relative performance of the scheme during normal execution. Extensions that take error latency into account are presented.
Safe and Efficient Support for Embeded Multi-Processors in ADA

NASA Astrophysics Data System (ADS)

Ruiz, Jose F.

2010-08-01

New software demands increasing processing power, and multi-processor platforms are spreading as the answer to achieve the required performance. Embedded real-time systems are also subject to this trend, but in the case of real-time mission-critical systems, the properties of reliability, predictability and analyzability are also paramount. The Ada 2005 language defined a subset of its tasking model, the Ravenscar profile, that provides the basis for the implementation of deterministic and time analyzable applications on top of a streamlined run-time system. This Ravenscar tasking profile, originally designed for single processors, has proven remarkably useful for modelling verifiable real-time single-processor systems. This paper proposes a simple extension to the Ravenscar profile to support multi-processor systems using a fully partitioned approach. The implementation of this scheme is simple, and it can be used to develop applications amenable to schedulability analysis.
A CAMAC-VME-Macintosh data acquisition system for nuclear experiments

NASA Astrophysics Data System (ADS)

Anzalone, A.; Giustolisi, F.

1989-10-01

A multiprocessor system for data acquisition and analysis in low-energy nuclear physics has been realized. The system is built around CAMAC, the VMEbus, and the Macintosh PC. Multiprocessor software has been developed, using RTF, MACsys, and CERN cross-software. The execution of several programs that run on several VME CPUs and on an external PC is coordinated by a mailbox protocol. No operating system is used on the VME CPUs. The hardware, software, and system performance are described.
Performance analysis of parallel branch and bound search with the hypercube architecture

NASA Technical Reports Server (NTRS)

Mraz, Richard T.

1987-01-01

With the availability of commercial parallel computers, researchers are examining new classes of problems which might benefit from parallel computing. This paper presents results of an investigation of the class of search intensive problems. The specific problem discussed is the Least-Cost Branch and Bound search method of deadline job scheduling. The object-oriented design methodology was used to map the problem into a parallel solution. While the initial design was good for a prototype, the best performance resulted from fine-tuning the algorithm for a specific computer. The experiments analyze the computation time, the speed up over a VAX 11/785, and the load balance of the problem when using loosely coupled multiprocessor system based on the hypercube architecture.
The state of the Java universe

ScienceCinema

Gosling, James

2018-05-22

Speaker Bio: James Gosling received a B.Sc. in computer science from the University of Calgary, Canada in 1977. He received a Ph.D. in computer science from Carnegie-Mellon University in 1983. The title of his thesis was The Algebraic Manipulation of Constraints. He has built satellite data acquisition systems, a multiprocessor version of UNIXÂ®, several compilers, mail systems, and window managers. He has also built a WYSIWYG text editor, a constraint-based drawing editor, and a text editor called Emacs, for UNIX systems. At Sun his early activity was as lead engineer of the NeWS window system. He did the original design of the Java programming language and implemented its original compiler and virtual machine. He has recently been a contributor to the Real-Time Specification for Java.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Gosling, James

Speaker Bio: James Gosling received a B.Sc. in computer science from the University of Calgary, Canada in 1977. He received a Ph.D. in computer science from Carnegie-Mellon University in 1983. The title of his thesis was The Algebraic Manipulation of Constraints. He has built satellite data acquisition systems, a multiprocessor version of UNIX®, several compilers, mail systems, and window managers. He has also built a WYSIWYG text editor, a constraint-based drawing editor, and a text editor called Emacs, for UNIX systems. At Sun his early activity was as lead engineer of the NeWS window system. He did the original designmore » of the Java programming language and implemented its original compiler and virtual machine. He has recently been a contributor to the Real-Time Specification for Java.« less
A multiprocessor operating system simulator

DOE Office of Scientific and Technical Information (OSTI.GOV)

Johnston, G.M.; Campbell, R.H.

1988-01-01

This paper describes a multiprocessor operating system simulator that was developed by the authors in the Fall of 1987. The simulator was built in response to the need to provide students with an environment in which to build and test operating system concepts as part of the coursework of a third-year undergraduate operating systems course. Written in C++, the simulator uses the co-routine style task package that is distributed with the AT and T C++ Translator to provide a hierarchy of classes that represents a broad range of operating system software and hardware components. The class hierarchy closely follows thatmore » of the Choices family of operating systems for loosely and tightly coupled multiprocessors. During an operating system course, these classes are refined and specialized by students in homework assignments to facilitate experimentation with different aspects of operating system design and policy decisions. The current implementation runs on the IBM RT PC under 4.3bsd UNIX.« less
Development of Parallel Code for the Alaska Tsunami Forecast Model

NASA Astrophysics Data System (ADS)

Bahng, B.; Knight, W. R.; Whitmore, P.

2014-12-01

The Alaska Tsunami Forecast Model (ATFM) is a numerical model used to forecast propagation and inundation of tsunamis generated by earthquakes and other means in both the Pacific and Atlantic Oceans. At the U.S. National Tsunami Warning Center (NTWC), the model is mainly used in a pre-computed fashion. That is, results for hundreds of hypothetical events are computed before alerts, and are accessed and calibrated with observations during tsunamis to immediately produce forecasts. ATFM uses the non-linear, depth-averaged, shallow-water equations of motion with multiply nested grids in two-way communications between domains of each parent-child pair as waves get closer to coastal waters. Even with the pre-computation the task becomes non-trivial as sub-grid resolution gets finer. Currently, the finest resolution Digital Elevation Models (DEM) used by ATFM are 1/3 arc-seconds. With a serial code, large or multiple areas of very high resolution can produce run-times that are unrealistic even in a pre-computed approach. One way to increase the model performance is code parallelization used in conjunction with a multi-processor computing environment. NTWC developers have undertaken an ATFM code-parallelization effort to streamline the creation of the pre-computed database of results with the long term aim of tsunami forecasts from source to high resolution shoreline grids in real time. Parallelization will also permit timely regeneration of the forecast model database with new DEMs; and, will make possible future inclusion of new physics such as the non-hydrostatic treatment of tsunami propagation. The purpose of our presentation is to elaborate on the parallelization approach and to show the compute speed increase on various multi-processor systems.
Scheduler for multiprocessor system switch with selective pairing

DOEpatents

Gara, Alan; Gschwind, Michael Karl; Salapura, Valentina

2015-01-06

System, method and computer program product for scheduling threads in a multiprocessing system with selective pairing of processor cores for increased processing reliability. A selective pairing facility is provided that selectively connects, i.e., pairs, multiple microprocessor or processor cores to provide one highly reliable thread (or thread group). The method configures the selective pairing facility to use checking provide one highly reliable thread for high-reliability and allocate threads to corresponding processor cores indicating need for hardware checking. The method configures the selective pairing facility to provide multiple independent cores and allocate threads to corresponding processor cores indicating inherent resilience.
The MSFC UNIVAC 1108 EXEC 8 simulation model

NASA Technical Reports Server (NTRS)

Williams, T. G.; Richards, F. M.; Weatherbee, J. E.; Paul, L. K.

1972-01-01

A model is presented which simulates the MSFC Univac 1108 multiprocessor system. The hardware/operating system is described to enable a good statistical measurement of the system behavior. The performance of the 1108 is evaluated by performing twenty-four different experiments designed to locate system bottlenecks and also to test the sensitivity of system throughput with respect to perturbation of the various Exec 8 scheduling algorithms. The model is implemented in the general purpose system simulation language and the techniques described can be used to assist in the design, development, and evaluation of multiprocessor systems.
Visual monitoring of autonomous life sciences experimentation

NASA Technical Reports Server (NTRS)

Blank, G. E.; Martin, W. N.

1987-01-01

The design and implementation of a computerized visual monitoring system to aid in the monitoring and control of life sciences experiments on board a space station was investigated. A likely multiprocessor design was chosen, a plausible life science experiment with which to work was defined, the theoretical issues involved in the programming of a visual monitoring system for the experiment was considered on the multiprocessor, a system for monitoring the experiment was designed, and simulations of such a system was implemented on a network of Apollo workstations.
Importance of balanced architectures in the design of high-performance imaging systems

NASA Astrophysics Data System (ADS)

Sgro, Joseph A.; Stanton, Paul C.

1999-03-01

Imaging systems employed in demanding military and industrial applications, such as automatic target recognition and computer vision, typically require real-time high-performance computing resources. While high- performances computing systems have traditionally relied on proprietary architectures and custom components, recent advances in high performance general-purpose microprocessor technology have produced an abundance of low cost components suitable for use in high-performance computing systems. A common pitfall in the design of high performance imaging system, particularly systems employing scalable multiprocessor architectures, is the failure to balance computational and memory bandwidth. The performance of standard cluster designs, for example, in which several processors share a common memory bus, is typically constrained by memory bandwidth. The symptom characteristic of this problem is failure to the performance of the system to scale as more processors are added. The problem becomes exacerbated if I/O and memory functions share the same bus. The recent introduction of microprocessors with large internal caches and high performance external memory interfaces makes it practical to design high performance imaging system with balanced computational and memory bandwidth. Real word examples of such designs will be presented, along with a discussion of adapting algorithm design to best utilize available memory bandwidth.
Fault tree models for fault tolerant hypercube multiprocessors

NASA Technical Reports Server (NTRS)

Boyd, Mark A.; Tuazon, Jezus O.

1991-01-01

Three candidate fault tolerant hypercube architectures are modeled, their reliability analyses are compared, and the resulting implications of these methods of incorporating fault tolerance into hypercube multiprocessors are discussed. In the course of performing the reliability analyses, the use of HARP and fault trees in modeling sequence dependent system behaviors is demonstrated.
The economics of data acquisition computers for ST and MST radars

NASA Technical Reports Server (NTRS)

Watkins, B. J.

1983-01-01

Some low cost options for data acquisition computers for ST (stratosphere, troposphere) and MST (mesosphere, stratosphere, troposphere) are presented. The particular equipment discussed reflects choices made by the University of Alaska group but of course many other options exist. The low cost microprocessor and array processor approach presented here has several advantages because of its modularity. An inexpensive system may be configured for a minimum performance ST radar, whereas a multiprocessor and/or a multiarray processor system may be used for a higher performance MST radar. This modularity is important for a network of radars because the initial cost is minimized while future upgrades will still be possible at minimal expense. This modularity also aids in lowering the cost of software development because system expansions should rquire little software changes. The functions of the radar computer will be to obtain Doppler spectra in near real time with some minor analysis such as vector wind determination.
Conditional load and store in a shared memory

DOEpatents

Blumrich, Matthias A; Ohmacht, Martin

2015-02-03

A method, system and computer program product for implementing load-reserve and store-conditional instructions in a multi-processor computing system. The computing system includes a multitude of processor units and a shared memory cache, and each of the processor units has access to the memory cache. In one embodiment, the method comprises providing the memory cache with a series of reservation registers, and storing in these registers addresses reserved in the memory cache for the processor units as a result of issuing load-reserve requests. In this embodiment, when one of the processor units makes a request to store data in the memory cache using a store-conditional request, the reservation registers are checked to determine if an address in the memory cache is reserved for that processor unit. If an address in the memory cache is reserved for that processor, the data are stored at this address.
A Scheduling Algorithm for Replicated Real-Time Tasks

NASA Technical Reports Server (NTRS)

Yu, Albert C.; Lin, Kwei-Jay

1991-01-01

We present an algorithm for scheduling real-time periodic tasks on a multiprocessor system under fault-tolerant requirement. Our approach incorporates both the redundancy and masking technique and the imprecise computation model. Since the tasks in hard real-time systems have stringent timing constraints, the redundancy and masking technique are more appropriate than the rollback techniques which usually require extra time for error recovery. The imprecise computation model provides flexible functionality by trading off the quality of the result produced by a task with the amount of processing time required to produce it. It therefore permits the performance of a real-time system to degrade gracefully. We evaluate the algorithm by stochastic analysis and Monte Carlo simulations. The results show that the algorithm is resilient under hardware failures.
Self-powered information measuring wireless networks using the distribution of tasks within multicore processors

NASA Astrophysics Data System (ADS)

Zhuravska, Iryna M.; Koretska, Oleksandra O.; Musiyenko, Maksym P.; Surtel, Wojciech; Assembay, Azat; Kovalev, Vladimir; Tleshova, Akmaral

2017-08-01

The article contains basic approaches to develop the self-powered information measuring wireless networks (SPIM-WN) using the distribution of tasks within multicore processors critical applying based on the interaction of movable components - as in the direction of data transmission as wireless transfer of energy coming from polymetric sensors. Base mathematic model of scheduling tasks within multiprocessor systems was modernized to schedule and allocate tasks between cores of one-crystal computer (SoC) to increase energy efficiency SPIM-WN objects.
A model for the distributed storage and processing of large arrays

NASA Technical Reports Server (NTRS)

Mehrota, P.; Pratt, T. W.

1983-01-01

A conceptual model for parallel computations on large arrays is developed. The model provides a set of language concepts appropriate for processing arrays which are generally too large to fit in the primary memories of a multiprocessor system. The semantic model is used to represent arrays on a concurrent architecture in such a way that the performance realities inherent in the distributed storage and processing can be adequately represented. An implementation of the large array concept as an Ada package is also described.
Software environment for implementing engineering applications on MIMD computers

NASA Technical Reports Server (NTRS)

Lopez, L. A.; Valimohamed, K. A.; Schiff, S.

1990-01-01

In this paper the concept for a software environment for developing engineering application systems for multiprocessor hardware (MIMD) is presented. The philosophy employed is to solve the largest problems possible in a reasonable amount of time, rather than solve existing problems faster. In the proposed environment most of the problems concerning parallel computation and handling of large distributed data spaces are hidden from the application program developer, thereby facilitating the development of large-scale software applications. Applications developed under the environment can be executed on a variety of MIMD hardware; it protects the application software from the effects of a rapidly changing MIMD hardware technology.

High-performance computing — an overview

NASA Astrophysics Data System (ADS)

Marksteiner, Peter

1996-08-01

An overview of high-performance computing (HPC) is given. Different types of computer architectures used in HPC are discussed: vector supercomputers, high-performance RISC processors, various parallel computers like symmetric multiprocessors, workstation clusters, massively parallel processors. Software tools and programming techniques used in HPC are reviewed: vectorizing compilers, optimization and vector tuning, optimization for RISC processors; parallel programming techniques like shared-memory parallelism, message passing and data parallelism; and numerical libraries.
Towards developing robust algorithms for solving partial differential equations on MIMD machines

NASA Technical Reports Server (NTRS)

Saltz, Joel H.; Naik, Vijay K.

1988-01-01

Methods for efficient computation of numerical algorithms on a wide variety of MIMD machines are proposed. These techniques reorganize the data dependency patterns to improve the processor utilization. The model problem finds the time-accurate solution to a parabolic partial differential equation discretized in space and implicitly marched forward in time. The algorithms are extensions of Jacobi and SOR. The extensions consist of iterating over a window of several timesteps, allowing efficient overlap of computation with communication. The methods increase the degree to which work can be performed while data are communicated between processors. The effect of the window size and of domain partitioning on the system performance is examined both by implementing the algorithm on a simulated multiprocessor system.
Towards developing robust algorithms for solving partial differential equations on MIMD machines

NASA Technical Reports Server (NTRS)

Saltz, J. H.; Naik, V. K.

1985-01-01

Methods for efficient computation of numerical algorithms on a wide variety of MIMD machines are proposed. These techniques reorganize the data dependency patterns to improve the processor utilization. The model problem finds the time-accurate solution to a parabolic partial differential equation discretized in space and implicitly marched forward in time. The algorithms are extensions of Jacobi and SOR. The extensions consist of iterating over a window of several timesteps, allowing efficient overlap of computation with communication. The methods increase the degree to which work can be performed while data are communicated between processors. The effect of the window size and of domain partitioning on the system performance is examined both by implementing the algorithm on a simulated multiprocessor system.
Scheduling algorithms for automatic control systems for technological processes

NASA Astrophysics Data System (ADS)

Chernigovskiy, A. S.; Tsarev, R. Yu; Kapulin, D. V.

2017-01-01

Wide use of automatic process control systems and the usage of high-performance systems containing a number of computers (processors) give opportunities for creation of high-quality and fast production that increases competitiveness of an enterprise. Exact and fast calculations, control computation, and processing of the big data arrays - all of this requires the high level of productivity and, at the same time, minimum time of data handling and result receiving. In order to reach the best time, it is necessary not only to use computing resources optimally, but also to design and develop the software so that time gain will be maximal. For this purpose task (jobs or operations), scheduling techniques for the multi-machine/multiprocessor systems are applied. Some of basic task scheduling methods for the multi-machine process control systems are considered in this paper, their advantages and disadvantages come to light, and also some usage considerations, in case of the software for automatic process control systems developing, are made.
Multiprocessor and memory architecture of the neurocomputer SYNAPSE-1.

PubMed

Ramacher, U; Raab, W; Anlauf, J; Hachmann, U; Beichter, J; Brüls, N; Wesseling, M; Sicheneder, E; Männer, R; Glass, J

1993-12-01

A general purpose neurocomputer, SYNAPSE-1, which exhibits a multiprocessor and memory architecture is presented. It offers wide flexibility with respect to neural algorithms and a speed-up factor of several orders of magnitude--including learning. The computational power is provided by a 2-dimensional systolic array of neural signal processors. Since the weights are stored outside these NSPs, memory size and processing power can be adapted individually to the application needs. A neural algorithms programming language, embedded in C(+2) has been defined for the user to cope with the neurocomputer. In a benchmark test, the prototype of SYNAPSE-1 was 8000 times as fast as a standard workstation.
Reducing Response Time Bounds for DAG-Based Task Systems on Heterogeneous Multicore Platforms

DTIC Science & Technology

2016-01-01

synchronous parallel tasks on multicore platforms. In 25th ECRTS, 2013. [10] U. Devi. Soft Real - Time Scheduling on Multiprocessors. PhD thesis...report, Washington University in St Louis, 2014. [18] C. Liu and J. Anderson. Supporting soft real - time DAG-based sys- tems on multiprocessors with...analysis for DAG-based real - time task systems im- plemented on heterogeneous multicore platforms. The spe- cific analysis problem that is considered was
Study of Thread Level Parallelism in a Video Encoding Application for Chip Multiprocessor Design

NASA Astrophysics Data System (ADS)

Debes, Eric; Kaine, Greg

2002-11-01

In media applications there is a high level of available thread level parallelism (TLP). In this paper we study the intra TLP in a video encoder. We show that a well-distributed highly optimized encoder running on a symmetric multiprocessor (SMP) system can run 3.2 faster on a 4-way SMP machine than on a single processor. The multithreaded encoder running on an SMP system is then used to understand the requirements of a chip multiprocessor (CMP) architecture, which is one possible architectural direction to better exploit TLP. In the framework of this study, we use a software approach to evaluate the dataflow between processors for the video encoder running on an SMP system. An estimation of the dataflow is done with L2 cache miss event counters using Intel® VTuneTM performance analyzer. The experimental measurements are compared to theoretical results.
Optical interconnection networks for high-performance computing systems

NASA Astrophysics Data System (ADS)

Biberman, Aleksandr; Bergman, Keren

2012-04-01

Enabled by silicon photonic technology, optical interconnection networks have the potential to be a key disruptive technology in computing and communication industries. The enduring pursuit of performance gains in computing, combined with stringent power constraints, has fostered the ever-growing computational parallelism associated with chip multiprocessors, memory systems, high-performance computing systems and data centers. Sustaining these parallelism growths introduces unique challenges for on- and off-chip communications, shifting the focus toward novel and fundamentally different communication approaches. Chip-scale photonic interconnection networks, enabled by high-performance silicon photonic devices, offer unprecedented bandwidth scalability with reduced power consumption. We demonstrate that the silicon photonic platforms have already produced all the high-performance photonic devices required to realize these types of networks. Through extensive empirical characterization in much of our work, we demonstrate such feasibility of waveguides, modulators, switches and photodetectors. We also demonstrate systems that simultaneously combine many functionalities to achieve more complex building blocks. We propose novel silicon photonic devices, subsystems, network topologies and architectures to enable unprecedented performance of these photonic interconnection networks. Furthermore, the advantages of photonic interconnection networks extend far beyond the chip, offering advanced communication environments for memory systems, high-performance computing systems, and data centers.
A multiprocessor airborne lidar data system

NASA Technical Reports Server (NTRS)

Wright, C. W.; Bailey, S. A.; Heath, G. E.; Piazza, C. R.

1988-01-01

A new multiprocessor data acquisition system was developed for the existing Airborne Oceanographic Lidar (AOL). This implementation simultaneously utilizes five single board 68010 microcomputers, the UNIX system V operating system, and the real time executive VRTX. The original data acquisition system was implemented on a Hewlett Packard HP 21-MX 16 bit minicomputer using a multi-tasking real time operating system and a mixture of assembly and FORTRAN languages. The present collection of data sources produce data at widely varied rates and require varied amounts of burdensome real time processing and formatting. It was decided to replace the aging HP 21-MX minicomputer with a multiprocessor system. A new and flexible recording format was devised and implemented to accommodate the constantly changing sensor configuration. A central feature of this data system is the minimization of non-remote sensing bus traffic. Therefore, it is highly desirable that each micro be capable of functioning as much as possible on-card or via private peripherals. The bus is used primarily for the transfer of remote sensing data to or from the buffer queue.
Neural networks and MIMD-multiprocessors

NASA Technical Reports Server (NTRS)

Vanhala, Jukka; Kaski, Kimmo

1990-01-01

Two artificial neural network models are compared. They are the Hopfield Neural Network Model and the Sparse Distributed Memory model. Distributed algorithms for both of them are designed and implemented. The run time characteristics of the algorithms are analyzed theoretically and tested in practice. The storage capacities of the networks are compared. Implementations are done using a distributed multiprocessor system.
TRAMP; The next generation data acquisition for RTP

DOE Office of Scientific and Technical Information (OSTI.GOV)

van Haren, P.C.; Wijnoltz, F.

1992-04-01

The Rijnhuizen Tokamak Project RTP is a medium-sized tokamak experiment, which requires a very reliable data-acquisition system, due to its pulsed nature. Analyzing the limitations of an existing CAMAC-based data-acquisition system showed, that substantial increase of performance and flexibility could best be obtained by the construction of an entirely new system. This paper discusses this system, CALLED TRAMP (Transient Recorder and Amoeba Multi Processor), based on tailor-made transient recorders with a multiprocessor computer system in VME running Amoeba. The performance of TRAMP exceeds the performance of the CAMAC system by a factor of four. The plans to increase the flexibilitymore » and for a further increase of performance are presented.« less
On the impact of communication complexity in the design of parallel numerical algorithms

NASA Technical Reports Server (NTRS)

Gannon, D.; Vanrosendale, J.

1984-01-01

This paper describes two models of the cost of data movement in parallel numerical algorithms. One model is a generalization of an approach due to Hockney, and is suitable for shared memory multiprocessors where each processor has vector capabilities. The other model is applicable to highly parallel nonshared memory MIMD systems. In the second model, algorithm performance is characterized in terms of the communication network design. Techniques used in VLSI complexity theory are also brought in, and algorithm independent upper bounds on system performance are derived for several problems that are important to scientific computation.
On the impact of communication complexity on the design of parallel numerical algorithms

NASA Technical Reports Server (NTRS)

Gannon, D. B.; Van Rosendale, J.

1984-01-01

This paper describes two models of the cost of data movement in parallel numerical alorithms. One model is a generalization of an approach due to Hockney, and is suitable for shared memory multiprocessors where each processor has vector capabilities. The other model is applicable to highly parallel nonshared memory MIMD systems. In this second model, algorithm performance is characterized in terms of the communication network design. Techniques used in VLSI complexity theory are also brought in, and algorithm-independent upper bounds on system performance are derived for several problems that are important to scientific computation.
Parallel Computer System for 3D Visualization Stereo on GPU

NASA Astrophysics Data System (ADS)

Al-Oraiqat, Anas M.; Zori, Sergii A.

2018-03-01

This paper proposes the organization of a parallel computer system based on Graphic Processors Unit (GPU) for 3D stereo image synthesis. The development is based on the modified ray tracing method developed by the authors for fast search of tracing rays intersections with scene objects. The system allows significant increase in the productivity for the 3D stereo synthesis of photorealistic quality. The generalized procedure of 3D stereo image synthesis on the Graphics Processing Unit/Graphics Processing Clusters (GPU/GPC) is proposed. The efficiency of the proposed solutions by GPU implementation is compared with single-threaded and multithreaded implementations on the CPU. The achieved average acceleration in multi-thread implementation on the test GPU and CPU is about 7.5 and 1.6 times, respectively. Studying the influence of choosing the size and configuration of the computational Compute Unified Device Archi-tecture (CUDA) network on the computational speed shows the importance of their correct selection. The obtained experimental estimations can be significantly improved by new GPUs with a large number of processing cores and multiprocessors, as well as optimized configuration of the computing CUDA network.
Instructional image processing on a university mainframe: The Kansas system

NASA Technical Reports Server (NTRS)

Williams, T. H. L.; Siebert, J.; Gunn, C.

1981-01-01

An interactive digital image processing program package was developed that runs on the University of Kansas central computer, a Honeywell Level 66 multi-processor system. The module form of the package allows easy and rapid upgrades and extensions of the system and is used in remote sensing courses in the Department of Geography, in regional five-day short courses for academics and professionals, and also in remote sensing projects and research. The package comprises three self-contained modules of processing functions: Subimage extraction and rectification; image enhancement, preprocessing and data reduction; and classification. Its use in a typical course setting is described. Availability and costs are considered.
Cache as point of coherence in multiprocessor system

DOEpatents

Blumrich, Matthias A.; Ceze, Luis H.; Chen, Dong; Gara, Alan; Heidelberger, Phlip; Ohmacht, Martin; Steinmacher-Burow, Burkhard; Zhuang, Xiaotong

2016-11-29

In a multiprocessor system, a conflict checking mechanism is implemented in the L2 cache memory. Different versions of speculative writes are maintained in different ways of the cache. A record of speculative writes is maintained in the cache directory. Conflict checking occurs as part of directory lookup. Speculative versions that do not conflict are aggregated into an aggregated version in a different way of the cache. Speculative memory access requests do not go to main memory.
FTMP (Fault Tolerant Multiprocessor) programmer's manual

NASA Technical Reports Server (NTRS)

Feather, F. E.; Liceaga, C. A.; Padilla, P. A.

1986-01-01

The Fault Tolerant Multiprocessor (FTMP) computer system was constructed using the Rockwell/Collins CAPS-6 processor. It is installed in the Avionics Integration Research Laboratory (AIRLAB) of NASA Langley Research Center. It is hosted by AIRLAB's System 10, a VAX 11/750, for the loading of programs and experimentation. The FTMP support software includes a cross compiler for a high level language called Automated Engineering Design (AED) System, an assembler for the CAPS-6 processor assembly language, and a linker. Access to this support software is through an automated remote access facility on the VAX which relieves the user of the burden of learning how to use the IBM 4381. This manual is a compilation of information about the FTMP support environment. It explains the FTMP software and support environment along many of the finer points of running programs on FTMP. This will be helpful to the researcher trying to run an experiment on FTMP and even to the person probing FTMP with fault injections. Much of the information in this manual can be found in other sources; we are only attempting to bring together the basic points in a single source. If the reader should need points clarified, there is a list of support documentation in the back of this manual.
Parallel solution of the symmetric tridiagonal eigenproblem. Research report

DOE Office of Scientific and Technical Information (OSTI.GOV)

Jessup, E.R.

1989-10-01

This thesis discusses methods for computing all eigenvalues and eigenvectors of a symmetric tridiagonal matrix on a distributed-memory Multiple Instruction, Multiple Data multiprocessor. Only those techniques having the potential for both high numerical accuracy and significant large-grained parallelism are investigated. These include the QL method or Cuppen's divide and conquer method based on rank-one updating to compute both eigenvalues and eigenvectors, bisection to determine eigenvalues and inverse iteration to compute eigenvectors. To begin, the methods are compared with respect to computation time, communication time, parallel speed up, and accuracy. Experiments on an IPSC hypercube multiprocessor reveal that Cuppen's method ismore » the most accurate approach, but bisection with inverse iteration is the fastest and most parallel. Because the accuracy of the latter combination is determined by the quality of the computed eigenvectors, the factors influencing the accuracy of inverse iteration are examined. This includes, in part, statistical analysis of the effect of a starting vector with random components. These results are used to develop an implementation of inverse iteration producing eigenvectors with lower residual error and better orthogonality than those generated by the EISPACK routine TINVIT. This thesis concludes with adaptions of methods for the symmetric tridiagonal eigenproblem to the related problem of computing the singular value decomposition (SVD) of a bidiagonal matrix.« less
Parallel solution of the symmetric tridiagonal eigenproblem

DOE Office of Scientific and Technical Information (OSTI.GOV)

Jessup, E.R.

1989-01-01

This thesis discusses methods for computing all eigenvalues and eigenvectors of a symmetric tridiagonal matrix on a distributed memory MIMD multiprocessor. Only those techniques having the potential for both high numerical accuracy and significant large-grained parallelism are investigated. These include the QL method or Cuppen's divide and conquer method based on rank-one updating to compute both eigenvalues and eigenvectors, bisection to determine eigenvalues, and inverse iteration to compute eigenvectors. To begin, the methods are compared with respect to computation time, communication time, parallel speedup, and accuracy. Experiments on an iPSC hyper-cube multiprocessor reveal that Cuppen's method is the most accuratemore » approach, but bisection with inverse iteration is the fastest and most parallel. Because the accuracy of the latter combination is determined by the quality of the computed eigenvectors, the factors influencing the accuracy of inverse iteration are examined. This includes, in part, statistical analysis of the effects of a starting vector with random components. These results are used to develop an implementation of inverse iteration producing eigenvectors with lower residual error and better orthogonality than those generated by the EISPACK routine TINVIT. This thesis concludes with adaptations of methods for the symmetric tridiagonal eigenproblem to the related problem of computing the singular value decomposition (SVD) of a bidiagonal matrix.« less
Application Portable Parallel Library

NASA Technical Reports Server (NTRS)

Cole, Gary L.; Blech, Richard A.; Quealy, Angela; Townsend, Scott

1995-01-01

Application Portable Parallel Library (APPL) computer program is subroutine-based message-passing software library intended to provide consistent interface to variety of multiprocessor computers on market today. Minimizes effort needed to move application program from one computer to another. User develops application program once and then easily moves application program from parallel computer on which created to another parallel computer. ("Parallel computer" also include heterogeneous collection of networked computers). Written in C language with one FORTRAN 77 subroutine for UNIX-based computers and callable from application programs written in C language or FORTRAN 77.

Software/hardware distributed processing network supporting the Ada environment

NASA Astrophysics Data System (ADS)

Wood, Richard J.; Pryk, Zen

1993-09-01

A high-performance, fault-tolerant, distributed network has been developed, tested, and demonstrated. The network is based on the MIPS Computer Systems, Inc. R3000 Risc for processing, VHSIC ASICs for high speed, reliable, inter-node communications and compatible commercial memory and I/O boards. The network is an evolution of the Advanced Onboard Signal Processor (AOSP) architecture. It supports Ada application software with an Ada- implemented operating system. A six-node implementation (capable of expansion up to 256 nodes) of the RISC multiprocessor architecture provides 120 MIPS of scalar throughput, 96 Mbytes of RAM and 24 Mbytes of non-volatile memory. The network provides for all ground processing applications, has merit for space-qualified RISC-based network, and interfaces to advanced Computer Aided Software Engineering (CASE) tools for application software development.
Energy-efficient fault tolerance in multiprocessor real-time systems

NASA Astrophysics Data System (ADS)

Guo, Yifeng

The recent progress in the multiprocessor/multicore systems has important implications for real-time system design and operation. From vehicle navigation to space applications as well as industrial control systems, the trend is to deploy multiple processors in real-time systems: systems with 4 -- 8 processors are common, and it is expected that many-core systems with dozens of processing cores will be available in near future. For such systems, in addition to general temporal requirement common for all real-time systems, two additional operational objectives are seen as critical: energy efficiency and fault tolerance. An intriguing dimension of the problem is that energy efficiency and fault tolerance are typically conflicting objectives, due to the fact that tolerating faults (e.g., permanent/transient) often requires extra resources with high energy consumption potential. In this dissertation, various techniques for energy-efficient fault tolerance in multiprocessor real-time systems have been investigated. First, the Reliability-Aware Power Management (RAPM) framework, which can preserve the system reliability with respect to transient faults when Dynamic Voltage Scaling (DVS) is applied for energy savings, is extended to support parallel real-time applications with precedence constraints. Next, the traditional Standby-Sparing (SS) technique for dual processor systems, which takes both transient and permanent faults into consideration while saving energy, is generalized to support multiprocessor systems with arbitrary number of identical processors. Observing the inefficient usage of slack time in the SS technique, a Preference-Oriented Scheduling Framework is designed to address the problem where tasks are given preferences for being executed as soon as possible (ASAP) or as late as possible (ALAP). A preference-oriented earliest deadline (POED) scheduler is proposed and its application in multiprocessor systems for energy-efficient fault tolerance is investigated, where tasks' main copies are executed ASAP while backup copies ALAP to reduce the overlapped execution of main and backup copies of the same task and thus reduce energy consumption. All proposed techniques are evaluated through extensive simulations and compared with other state-of-the-art approaches. The simulation results confirm that the proposed schemes can preserve the system reliability while still achieving substantial energy savings. Finally, for both SS and POED based Energy-Efficient Fault-Tolerant (EEFT) schemes, a series of recovery strategies are designed when more than one (transient and permanent) faults need to be tolerated.
A Multiprocessor SoC Architecture with Efficient Communication Infrastructure and Advanced Compiler Support for Easy Application Development

NASA Astrophysics Data System (ADS)

Urfianto, Mohammad Zalfany; Isshiki, Tsuyoshi; Khan, Arif Ullah; Li, Dongju; Kunieda, Hiroaki

This paper presentss a Multiprocessor System-on-Chips (MPSoC) architecture used as an execution platform for the new C-language based MPSoC design framework we are currently developing. The MPSoC architecture is based on an existing SoC platform with a commercial RISC core acting as the host CPU. We extend the existing SoC with a multiprocessor-array block that is used as the main engine to run parallel applications modeled in our design framework. Utilizing several optimizations provided by our compiler, an efficient inter-communication between processing elements with minimum overhead is implemented. A host-interface is designed to integrate the existing RISC core to the multiprocessor-array. The experimental results show that an efficacious integration is achieved, proving that the designed communication module can be used to efficiently incorporate off-the-shelf processors as a processing element for MPSoC architectures designed using our framework.
Communication Studies of DMP and SMP Machines

NASA Technical Reports Server (NTRS)

Sohn, Andrew; Biswas, Rupak; Chancellor, Marisa K. (Technical Monitor)

1997-01-01

Understanding the interplay between machines and problems is key to obtaining high performance on parallel machines. This paper investigates the interplay between programming paradigms and communication capabilities of parallel machines. In particular, we explicate the communication capabilities of the IBM SP-2 distributed-memory multiprocessor and the SGI PowerCHALLENGEarray symmetric multiprocessor. Two benchmark problems of bitonic sorting and Fast Fourier Transform are selected for experiments. Communication-efficient algorithms are developed to exploit the overlapping capabilities of the machines. Programs are written in Message-Passing Interface for portability and identical codes are used for both machines. Various data sizes and message sizes are used to test the machines' communication capabilities. Experimental results indicate that the communication performance of the multiprocessors are consistent with the size of messages. The SP-2 is sensitive to message size but yields a much higher communication overlapping because of the communication co-processor. The PowerCHALLENGEarray is not highly sensitive to message size and yields a low communication overlapping. Bitonic sorting yields lower performance compared to FFT due to a smaller computation-to-communication ratio.
Experience with a Genetic Algorithm Implemented on a Multiprocessor Computer

NASA Technical Reports Server (NTRS)

Plassman, Gerald E.; Sobieszczanski-Sobieski, Jaroslaw

2000-01-01

Numerical experiments were conducted to find out the extent to which a Genetic Algorithm (GA) may benefit from a multiprocessor implementation, considering, on one hand, that analyses of individual designs in a population are independent of each other so that they may be executed concurrently on separate processors, and, on the other hand, that there are some operations in a GA that cannot be so distributed. The algorithm experimented with was based on a gaussian distribution rather than bit exchange in the GA reproductive mechanism, and the test case was a hub frame structure of up to 1080 design variables. The experimentation engaging up to 128 processors confirmed expectations of radical elapsed time reductions comparing to a conventional single processor implementation. It also demonstrated that the time spent in the non-distributable parts of the algorithm and the attendant cross-processor communication may have a very detrimental effect on the efficient utilization of the multiprocessor machine and on the number of processors that can be used effectively in a concurrent manner. Three techniques were devised and tested to mitigate that effect, resulting in efficiency increasing to exceed 99 percent.
Process Management and Exception Handling in Multiprocessor Operating Systems Using Object-Oriented Design Techniques. Revised Sep. 1988

NASA Technical Reports Server (NTRS)

Russo, Vincent; Johnston, Gary; Campbell, Roy

1988-01-01

The programming of the interrupt handling mechanisms, process switching primitives, scheduling mechanism, and synchronization primitives of an operating system for a multiprocessor require both efficient code in order to support the needs of high- performance or real-time applications and careful organization to facilitate maintenance. Although many advantages have been claimed for object-oriented class hierarchical languages and their corresponding design methodologies, the application of these techniques to the design of the primitives within an operating system has not been widely demonstrated. To investigate the role of class hierarchical design in systems programming, the authors have constructed the Choices multiprocessor operating system architecture the C++ programming language. During the implementation, it was found that many operating system design concerns can be represented advantageously using a class hierarchical approach, including: the separation of mechanism and policy; the organization of an operating system into layers, each of which represents an abstract machine; and the notions of process and exception management. In this paper, we discuss an implementation of the low-level primitives of this system and outline the strategy by which we developed our solution.
Runtime Performance Monitoring Tool for RTEMS System Software

NASA Astrophysics Data System (ADS)

Cho, B.; Kim, S.; Park, H.; Kim, H.; Choi, J.; Chae, D.; Lee, J.

2007-08-01

RTEMS is a commercial-grade real-time operating system that supports multi-processor computers. However, there are not many development tools for RTEMS. In this paper, we report new RTEMS-based runtime performance monitoring tool. We have implemented a light weight runtime monitoring task with an extension to the RTEMS APIs. Using our tool, software developers can verify various performance- related parameters during runtime. Our tool can be used during software development phase and in-orbit operation as well. Our implemented target agent is light weight and has small overhead using SpaceWire interface. Efforts to reduce overhead and to add other monitoring parameters are currently under research.
Parallel processing and expert systems

NASA Technical Reports Server (NTRS)

Yan, Jerry C.; Lau, Sonie

1991-01-01

Whether it be monitoring the thermal subsystem of Space Station Freedom, or controlling the navigation of the autonomous rover on Mars, NASA missions in the 90's cannot enjoy an increased level of autonomy without the efficient use of expert systems. Merely increasing the computational speed of uniprocessors may not be able to guarantee that real time demands are met for large expert systems. Speed-up via parallel processing must be pursued alongside the optimization of sequential implementations. Prototypes of parallel expert systems have been built at universities and industrial labs in the U.S. and Japan. The state-of-the-art research in progress related to parallel execution of expert systems was surveyed. The survey is divided into three major sections: (1) multiprocessors for parallel expert systems; (2) parallel languages for symbolic computations; and (3) measurements of parallelism of expert system. Results to date indicate that the parallelism achieved for these systems is small. In order to obtain greater speed-ups, data parallelism and application parallelism must be exploited.
Design of a fault tolerant airborne digital computer. Volume 1: Architecture

NASA Technical Reports Server (NTRS)

Wensley, J. H.; Levitt, K. N.; Green, M. W.; Goldberg, J.; Neumann, P. G.

1973-01-01

This volume is concerned with the architecture of a fault tolerant digital computer for an advanced commercial aircraft. All of the computations of the aircraft, including those presently carried out by analogue techniques, are to be carried out in this digital computer. Among the important qualities of the computer are the following: (1) The capacity is to be matched to the aircraft environment. (2) The reliability is to be selectively matched to the criticality and deadline requirements of each of the computations. (3) The system is to be readily expandable. contractible, and (4) The design is to appropriate to post 1975 technology. Three candidate architectures are discussed and assessed in terms of the above qualities. Of the three candidates, a newly conceived architecture, Software Implemented Fault Tolerance (SIFT), provides the best match to the above qualities. In addition SIFT is particularly simple and believable. The other candidates, Bus Checker System (BUCS), also newly conceived in this project, and the Hopkins multiprocessor are potentially more efficient than SIFT in the use of redundancy, but otherwise are not as attractive.
Scheduling for Locality in Shared-Memory Multiprocessors

DTIC Science & Technology

1993-05-01

Submitted in Partial Fulfillment of the Requirements for the Degree ’)iIC Q(JALfryT INSPECTED 5 DOCTOR OF PHILOSOPHY I Accesion For Supervised by NTIS CRAM... architecture on parallel program performance, explain the implications of this trend on popular parallel programming models, and propose system software to 0...decomoosition and scheduling algorithms. I. SUIUECT TERMS IS. NUMBER OF PAGES shared-memory multiprocessors; architecture trends; loop 110 scheduling
An Efficient Solution Method for Multibody Systems with Loops Using Multiple Processors

NASA Technical Reports Server (NTRS)

Ghosh, Tushar K.; Nguyen, Luong A.; Quiocho, Leslie J.

2015-01-01

This paper describes a multibody dynamics algorithm formulated for parallel implementation on multiprocessor computing platforms using the divide-and-conquer approach. The system of interest is a general topology of rigid and elastic articulated bodies with or without loops. The algorithm divides the multibody system into a number of smaller sets of bodies in chain or tree structures, called "branches" at convenient joints called "connection points", and uses an Order-N (O (N)) approach to formulate the dynamics of each branch in terms of the unknown spatial connection forces. The equations of motion for the branches, leaving the connection forces as unknowns, are implemented in separate processors in parallel for computational efficiency, and the equations for all the unknown connection forces are synthesized and solved in one or several processors. The performances of two implementations of this divide-and-conquer algorithm in multiple processors are compared with an existing method implemented on a single processor.
Efficient parallel architecture for highly coupled real-time linear system applications

NASA Technical Reports Server (NTRS)

Carroll, Chester C.; Homaifar, Abdollah; Barua, Soumavo

1988-01-01

A systematic procedure is developed for exploiting the parallel constructs of computation in a highly coupled, linear system application. An overall top-down design approach is adopted. Differential equations governing the application under consideration are partitioned into subtasks on the basis of a data flow analysis. The interconnected task units constitute a task graph which has to be computed in every update interval. Multiprocessing concepts utilizing parallel integration algorithms are then applied for efficient task graph execution. A simple scheduling routine is developed to handle task allocation while in the multiprocessor mode. Results of simulation and scheduling are compared on the basis of standard performance indices. Processor timing diagrams are developed on the basis of program output accruing to an optimal set of processors. Basic architectural attributes for implementing the system are discussed together with suggestions for processing element design. Emphasis is placed on flexible architectures capable of accommodating widely varying application specifics.
Portable parallel stochastic optimization for the design of aeropropulsion components

NASA Technical Reports Server (NTRS)

Sues, Robert H.; Rhodes, G. S.

1994-01-01

This report presents the results of Phase 1 research to develop a methodology for performing large-scale Multi-disciplinary Stochastic Optimization (MSO) for the design of aerospace systems ranging from aeropropulsion components to complete aircraft configurations. The current research recognizes that such design optimization problems are computationally expensive, and require the use of either massively parallel or multiple-processor computers. The methodology also recognizes that many operational and performance parameters are uncertain, and that uncertainty must be considered explicitly to achieve optimum performance and cost. The objective of this Phase 1 research was to initialize the development of an MSO methodology that is portable to a wide variety of hardware platforms, while achieving efficient, large-scale parallelism when multiple processors are available. The first effort in the project was a literature review of available computer hardware, as well as review of portable, parallel programming environments. The first effort was to implement the MSO methodology for a problem using the portable parallel programming language, Parallel Virtual Machine (PVM). The third and final effort was to demonstrate the example on a variety of computers, including a distributed-memory multiprocessor, a distributed-memory network of workstations, and a single-processor workstation. Results indicate the MSO methodology can be well-applied towards large-scale aerospace design problems. Nearly perfect linear speedup was demonstrated for computation of optimization sensitivity coefficients on both a 128-node distributed-memory multiprocessor (the Intel iPSC/860) and a network of workstations (speedups of almost 19 times achieved for 20 workstations). Very high parallel efficiencies (75 percent for 31 processors and 60 percent for 50 processors) were also achieved for computation of aerodynamic influence coefficients on the Intel. Finally, the multi-level parallelization strategy that will be needed for large-scale MSO problems was demonstrated to be highly efficient. The same parallel code instructions were used on both platforms, demonstrating portability. There are many applications for which MSO can be applied, including NASA's High-Speed-Civil Transport, and advanced propulsion systems. The use of MSO will reduce design and development time and testing costs dramatically.
Generation-based memory synchronization in a multiprocessor system with weakly consistent memory accesses

DOEpatents

Ohmacht, Martin

2017-08-15

In a multiprocessor system, a central memory synchronization module coordinates memory synchronization requests responsive to memory access requests in flight, a generation counter, and a reclaim pointer. The central module communicates via point-to-point communication. The module includes a global OR reduce tree for each memory access requesting device, for detecting memory access requests in flight. An interface unit is implemented associated with each processor requesting synchronization. The interface unit includes multiple generation completion detectors. The generation count and reclaim pointer do not pass one another.
Generation-based memory synchronization in a multiprocessor system with weakly consistent memory accesses

DOEpatents

Ohmacht, Martin

2014-09-09

In a multiprocessor system, a central memory synchronization module coordinates memory synchronization requests responsive to memory access requests in flight, a generation counter, and a reclaim pointer. The central module communicates via point-to-point communication. The module includes a global OR reduce tree for each memory access requesting device, for detecting memory access requests in flight. An interface unit is implemented associated with each processor requesting synchronization. The interface unit includes multiple generation completion detectors. The generation count and reclaim pointer do not pass one another.
Multiprocessor system with multiple concurrent modes of execution

DOEpatents

Ahn, Daniel; Ceze, Luis H; Chen, Dong; Gara, Alan; Heidelberger, Philip; Ohmacht, Martin

2013-12-31

A multiprocessor system supports multiple concurrent modes of speculative execution. Speculation identification numbers (IDs) are allocated to speculative threads from a pool of available numbers. The pool is divided into domains, with each domain being assigned to a mode of speculation. Modes of speculation include TM, TLS, and rollback. Allocation of the IDs is carried out with respect to a central state table and using hardware pointers. The IDs are used for writing different versions of speculative results in different ways of a set in a cache memory.
Multiprocessor system with multiple concurrent modes of execution

DOEpatents

Ahn, Daniel; Ceze, Luis H.; Chen, Dong Chen; Gara, Alan; Heidelberger, Philip; Ohmacht, Martin

2016-11-22

A multiprocessor system supports multiple concurrent modes of speculative execution. Speculation identification numbers (IDs) are allocated to speculative threads from a pool of available numbers. The pool is divided into domains, with each domain being assigned to a mode of speculation. Modes of speculation include TM, TLS, and rollback. Allocation of the IDs is carried out with respect to a central state table and using hardware pointers. The IDs are used for writing different versions of speculative results in different ways of a set in a cache memory.
The Unification of Space Qualified Integrated Circuits by Example of International Space Project GAMMA-400

NASA Astrophysics Data System (ADS)

Bobkov, S. G.; Serdin, O. V.; Arkhangelskiy, A. I.; Arkhangelskaja, I. V.; Suchkov, S. I.; Topchiev, N. P.

The problem of electronic component unification at the different levels (circuits, interfaces, hardware and software) used in space industry is considered. The task of computer systems for space purposes developing is discussed by example of scientific data acquisition system for space project GAMMA-400. The basic characteristics of high reliable and fault tolerant chips developed by SRISA RAS for space applicable computational systems are given. To reduce power consumption and enhance data reliability, embedded system interconnect made hierarchical: upper level is Serial RapidIO 1x or 4x with rate transfer 1.25 Gbaud; next level - SpaceWire with rate transfer up to 400 Mbaud and lower level - MIL-STD-1553B and RS232/RS485. The Ethernet 10/100 is technology interface and provided connection with the previously released modules too. Systems interconnection allows creating different redundancy systems. Designers can develop heterogeneous systems that employ the peer-to-peer networking performance of Serial RapidIO using multiprocessor clusters interconnected by SpaceWire.
The Triangle: a Multiprocessor Architecture for Fast Curve and Surface Generation.

DTIC Science & Technology

1987-08-01

design , curves and surfaces, graphics hardware. 20...curves, B-splines, computer-aided geometric design ; curves and sur- faces, graphics hardware. (k 12). -/ .... This work was supported in part by the...34 Electronic Design , October 30, 1986. 21. M. A. Penna and R. R. Patterson, Projective Geometry and its Applications to Computer Graphics , Prentice-Hall, Englewood Cliffs, N.J., 1985. 70,e, 41100vr -~ ~ - -- --
High-performance multiprocessor architecture for a 3-D lattice gas model

NASA Technical Reports Server (NTRS)

Lee, F.; Flynn, M.; Morf, M.

1991-01-01

The lattice gas method has recently emerged as a promising discrete particle simulation method in areas such as fluid dynamics. We present a very high-performance scalable multiprocessor architecture, called ALGE, proposed for the simulation of a realistic 3-D lattice gas model, Henon's 24-bit FCHC isometric model. Each of these VLSI processors is as powerful as a CRAY-2 for this application. ALGE is scalable in the sense that it achieves linear speedup for both fixed and increasing problem sizes with more processors. The core computation of a lattice gas model consists of many repetitions of two alternating phases: particle collision and propagation. Functional decomposition by symmetry group and virtual move are the respective keys to efficient implementation of collision and propagation.

Dynamic programming on a shared-memory multiprocessor

NASA Technical Reports Server (NTRS)

Edmonds, Phil; Chu, Eleanor; George, Alan

1993-01-01

Three new algorithms for solving dynamic programming problems on a shared-memory parallel computer are described. All three algorithms attempt to balance work load, while keeping synchronization cost low. In particular, for a multiprocessor having p processors, an analysis of the best algorithm shows that the arithmetic cost is O(n-cubed/6p) and that the synchronization cost is O(absolute value of log sub C n) if p much less than n, where C = (2p-1)/(2p + 1) and n is the size of the problem. The low synchronization cost is important for machines where synchronization is expensive. Analysis and experiments show that the best algorithm is effective in balancing the work load and producing high efficiency.
A measurement-based performability model for a multiprocessor system

NASA Technical Reports Server (NTRS)

Ilsueh, M. C.; Iyer, Ravi K.; Trivedi, K. S.

1987-01-01

A measurement-based performability model based on real error-data collected on a multiprocessor system is described. Model development from the raw errror-data to the estimation of cumulative reward is described. Both normal and failure behavior of the system are characterized. The measured data show that the holding times in key operational and failure states are not simple exponential and that semi-Markov process is necessary to model the system behavior. A reward function, based on the service rate and the error rate in each state, is then defined in order to estimate the performability of the system and to depict the cost of different failure types and recovery procedures.
Managing coherence via put/get windows

DOEpatents

Blumrich, Matthias A [Ridgefield, CT; Chen, Dong [Croton on Hudson, NY; Coteus, Paul W [Yorktown Heights, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Philip [Cortlandt Manor, NY; Hoenicke, Dirk [Ossining, NY; Ohmacht, Martin [Yorktown Heights, NY

2011-01-11

A method and apparatus for managing coherence between two processors of a two processor node of a multi-processor computer system. Generally the present invention relates to a software algorithm that simplifies and significantly speeds the management of cache coherence in a message passing parallel computer, and to hardware apparatus that assists this cache coherence algorithm. The software algorithm uses the opening and closing of put/get windows to coordinate the activated required to achieve cache coherence. The hardware apparatus may be an extension to the hardware address decode, that creates, in the physical memory address space of the node, an area of virtual memory that (a) does not actually exist, and (b) is therefore able to respond instantly to read and write requests from the processing elements.
Managing coherence via put/get windows

DOEpatents

Blumrich, Matthias A [Ridgefield, CT; Chen, Dong [Croton on Hudson, NY; Coteus, Paul W [Yorktown Heights, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Philip [Cortlandt Manor, NY; Hoenicke, Dirk [Ossining, NY; Ohmacht, Martin [Yorktown Heights, NY

2012-02-21

A method and apparatus for managing coherence between two processors of a two processor node of a multi-processor computer system. Generally the present invention relates to a software algorithm that simplifies and significantly speeds the management of cache coherence in a message passing parallel computer, and to hardware apparatus that assists this cache coherence algorithm. The software algorithm uses the opening and closing of put/get windows to coordinate the activated required to achieve cache coherence. The hardware apparatus may be an extension to the hardware address decode, that creates, in the physical memory address space of the node, an area of virtual memory that (a) does not actually exist, and (b) is therefore able to respond instantly to read and write requests from the processing elements.
Flood predictions using the parallel version of distributed numerical physical rainfall-runoff model TOPKAPI

NASA Astrophysics Data System (ADS)

Boyko, Oleksiy; Zheleznyak, Mark

2015-04-01

The original numerical code TOPKAPI-IMMS of the distributed rainfall-runoff model TOPKAPI ( Todini et al, 1996-2014) is developed and implemented in Ukraine. The parallel version of the code has been developed recently to be used on multiprocessors systems - multicore/processors PC and clusters. Algorithm is based on binary-tree decomposition of the watershed for the balancing of the amount of computation for all processors/cores. Message passing interface (MPI) protocol is used as a parallel computing framework. The numerical efficiency of the parallelization algorithms is demonstrated for the case studies for the flood predictions of the mountain watersheds of the Ukrainian Carpathian regions. The modeling results is compared with the predictions based on the lumped parameters models.
Framework for analysis of guaranteed QOS systems

NASA Astrophysics Data System (ADS)

Chaudhry, Shailender; Choudhary, Alok

1997-01-01

Multimedia data is isochronous in nature and entails managing and delivering high volumes of data. Multiprocessors with their large processing power, vast memory, and fast interconnects, are an ideal candidate for the implementation of multimedia applications. Initially, multiprocessors were designed to execute scientific programs and thus their architecture was optimized to provide low message latency and efficiently support regular communication patterns. Hence, they have a regular network topology and most use wormhole routing. The design offers the benefits of a simple router, small buffer size, and network latency that is almost independent of path length. Among the various multimedia applications, video on demand (VOD) server is well-suited for implementation using parallel multiprocessors. Logical models for VOD servers are presently mapped onto multiprocessors. Our paper provides a framework for calculating bounds on utilization of system resources with which QoS parameters for each isochronous stream can be guaranteed. Effects of the architecture of multiprocessors, and efficiency of various local models and mapping on particular architectures can be investigated within our framework. Our framework is based on rigorous proofs and provides tight bounds. The results obtained may be used as the basis for admission control tests. To illustrate the versatility of our framework, we provide bounds on utilization for various logical models applied to mesh connected architectures for a video on demand server. Our results show that worm hole routing can lead to packets waiting for transmission of other packets that apparently share no common resources. This situation is analogous to head-of-the-line blocking. We find that the provision of multiple VCs per link and multiple flit buffers improves utilization (even under guaranteed QoS parameters). This analogous to parallel iterative matching.
Montage Version 3.0

NASA Technical Reports Server (NTRS)

Jacob, Joseph; Katz, Daniel; Prince, Thomas; Berriman, Graham; Good, John; Laity, Anastasia

2006-01-01

The final version (3.0) of the Montage software has been released. To recapitulate from previous NASA Tech Briefs articles about Montage: This software generates custom, science-grade mosaics of astronomical images on demand from input files that comply with the Flexible Image Transport System (FITS) standard and contain image data registered on projections that comply with the World Coordinate System (WCS) standards. This software can be executed on single-processor computers, multi-processor computers, and such networks of geographically dispersed computers as the National Science Foundation s TeraGrid or NASA s Information Power Grid. The primary advantage of running Montage in a grid environment is that computations can be done on a remote supercomputer for efficiency. Multiple computers at different sites can be used for different parts of a computation a significant advantage in cases of computations for large mosaics that demand more processor time than is available at any one site. Version 3.0 incorporates several improvements over prior versions. The most significant improvement is that this version is accessible to scientists located anywhere, through operational Web services that provide access to data from several large astronomical surveys and construct mosaics on either local workstations or remote computational grids as needed.
Analysis and design of algorithm-based fault-tolerant systems

NASA Technical Reports Server (NTRS)

Nair, V. S. Sukumaran

1990-01-01

An important consideration in the design of high performance multiprocessor systems is to ensure the correctness of the results computed in the presence of transient and intermittent failures. Concurrent error detection and correction have been applied to such systems in order to achieve reliability. Algorithm Based Fault Tolerance (ABFT) was suggested as a cost-effective concurrent error detection scheme. The research was motivated by the complexity involved in the analysis and design of ABFT systems. To that end, a matrix-based model was developed and, based on that, algorithms for both the design and analysis of ABFT systems are formulated. These algorithms are less complex than the existing ones. In order to reduce the complexity further, a hierarchical approach is developed for the analysis of large systems.
Abnormal fault-recovery characteristics of the fault-tolerant multiprocessor uncovered using a new fault-injection methodology

NASA Technical Reports Server (NTRS)

Padilla, Peter A.

1991-01-01

An investigation was made in AIRLAB of the fault handling performance of the Fault Tolerant MultiProcessor (FTMP). Fault handling errors detected during fault injection experiments were characterized. In these fault injection experiments, the FTMP disabled a working unit instead of the faulted unit once in every 500 faults, on the average. System design weaknesses allow active faults to exercise a part of the fault management software that handles Byzantine or lying faults. Byzantine faults behave such that the faulted unit points to a working unit as the source of errors. The design's problems involve: (1) the design and interface between the simplex error detection hardware and the error processing software, (2) the functional capabilities of the FTMP system bus, and (3) the communication requirements of a multiprocessor architecture. These weak areas in the FTMP's design increase the probability that, for any hardware fault, a good line replacement unit (LRU) is mistakenly disabled by the fault management software.
Force user's manual, revised

NASA Technical Reports Server (NTRS)

Jordan, Harry F.; Benten, Muhammad S.; Arenstorf, Norbert S.; Ramanan, Aruna V.

1987-01-01

A methodology for writing parallel programs for shared memory multiprocessors has been formalized as an extension to the Fortran language and implemented as a macro preprocessor. The extended language is known as the Force, and this manual describes how to write Force programs and execute them on the Flexible Computer Corporation Flex/32, the Encore Multimax and the Sequent Balance computers. The parallel extension macros are described in detail, but knowledge of Fortran is assumed.
Optimization of the computational load of a hypercube supercomputer onboard a mobile robot.

PubMed

Barhen, J; Toomarian, N; Protopopescu, V

1987-12-01

A combinatorial optimization methodology is developed, which enables the efficient use of hypercube multiprocessors onboard mobile intelligent robots dedicated to time-critical missions. The methodology is implemented in terms of large-scale concurrent algorithms based either on fast simulated annealing, or on nonlinear asynchronous neural networks. In particular, analytic expressions are given for the effect of singleneuron perturbations on the systems' configuration energy. Compact neuromorphic data structures are used to model effects such as prec xdence constraints, processor idling times, and task-schedule overlaps. Results for a typical robot-dynamics benchmark are presented.
Optimization of the computational load of a hypercube supercomputer onboard a mobile robot

NASA Technical Reports Server (NTRS)

Barhen, Jacob; Toomarian, N.; Protopopescu, V.

1987-01-01

A combinatorial optimization methodology is developed, which enables the efficient use of hypercube multiprocessors onboard mobile intelligent robots dedicated to time-critical missions. The methodology is implemented in terms of large-scale concurrent algorithms based either on fast simulated annealing, or on nonlinear asynchronous neural networks. In particular, analytic expressions are given for the effect of single-neuron perturbations on the systems' configuration energy. Compact neuromorphic data structures are used to model effects such as precedence constraints, processor idling times, and task-schedule overlaps. Results for a typical robot-dynamics benchmark are presented.
Parallel processing of real-time dynamic systems simulation on OSCAR (Optimally SCheduled Advanced multiprocessoR)

NASA Technical Reports Server (NTRS)

Kasahara, Hironori; Honda, Hiroki; Narita, Seinosuke

1989-01-01

Parallel processing of real-time dynamic systems simulation on a multiprocessor system named OSCAR is presented. In the simulation of dynamic systems, generally, the same calculation are repeated every time step. However, we cannot apply to Do-all or the Do-across techniques for parallel processing of the simulation since there exist data dependencies from the end of an iteration to the beginning of the next iteration and furthermore data-input and data-output are required every sampling time period. Therefore, parallelism inside the calculation required for a single time step, or a large basic block which consists of arithmetic assignment statements, must be used. In the proposed method, near fine grain tasks, each of which consists of one or more floating point operations, are generated to extract the parallelism from the calculation and assigned to processors by using optimal static scheduling at compile time in order to reduce large run time overhead caused by the use of near fine grain tasks. The practicality of the scheme is demonstrated on OSCAR (Optimally SCheduled Advanced multiprocessoR) which has been developed to extract advantageous features of static scheduling algorithms to the maximum extent.
Development of a model of machine hand eye coordination and program specifications for a topological machine vision system

NASA Technical Reports Server (NTRS)

1972-01-01

A unified approach to computer vision and manipulation is developed which is called choreographic vision. In the model, objects to be viewed by a projected robot in the Viking missions to Mars are seen as objects to be manipulated within choreographic contexts controlled by a multimoded remote, supervisory control system on Earth. A new theory of context relations is introduced as a basis for choreographic programming languages. A topological vision model is developed for recognizing objects by shape and contour. This model is integrated with a projected vision system consisting of a multiaperture image dissector TV camera and a ranging laser system. System program specifications integrate eye-hand coordination and topological vision functions and an aerospace multiprocessor implementation is described.
On-line data analysis and monitoring for H1 drift chambers

NASA Astrophysics Data System (ADS)

Düllmann, Dirk

1992-05-01

The on-line monitoring, slow control and calibration of the H1 central jet chamber uses a VME multiprocessor system to perform the analysis and a connected Macintosh computer as graphical interface to the operator on shift. Task of this system are: - analysis of event data including on-line track search, - on-line calibration from normal events and testpulse events, - control of the high voltage and monitoring of settings and currents, - monitoring of temperature, pressure and mixture of the chambergas. A program package is described which controls the dataflow between data aquisition, differnt VME CPUs and Macintosh. It allows to run off-line style programs for the different tasks.
An embedded multi-core parallel model for real-time stereo imaging

NASA Astrophysics Data System (ADS)

He, Wenjing; Hu, Jian; Niu, Jingyu; Li, Chuanrong; Liu, Guangyu

2018-04-01

The real-time processing based on embedded system will enhance the application capability of stereo imaging for LiDAR and hyperspectral sensor. The task partitioning and scheduling strategies for embedded multiprocessor system starts relatively late, compared with that for PC computer. In this paper, aimed at embedded multi-core processing platform, a parallel model for stereo imaging is studied and verified. After analyzing the computing amount, throughout capacity and buffering requirements, a two-stage pipeline parallel model based on message transmission is established. This model can be applied to fast stereo imaging for airborne sensors with various characteristics. To demonstrate the feasibility and effectiveness of the parallel model, a parallel software was designed using test flight data, based on the 8-core DSP processor TMS320C6678. The results indicate that the design performed well in workload distribution and had a speed-up ratio up to 6.4.
Distributed computing feasibility in a non-dedicated homogeneous distributed system

NASA Technical Reports Server (NTRS)

Leutenegger, Scott T.; Sun, Xian-He

1993-01-01

The low cost and availability of clusters of workstations have lead researchers to re-explore distributed computing using independent workstations. This approach may provide better cost/performance than tightly coupled multiprocessors. In practice, this approach often utilizes wasted cycles to run parallel jobs. The feasibility of such a non-dedicated parallel processing environment assuming workstation processes have preemptive priority over parallel tasks is addressed. An analytical model is developed to predict parallel job response times. Our model provides insight into how significantly workstation owner interference degrades parallel program performance. A new term task ratio, which relates the parallel task demand to the mean service demand of nonparallel workstation processes, is introduced. It was proposed that task ratio is a useful metric for determining how large the demand of a parallel applications must be in order to make efficient use of a non-dedicated distributed system.
Iterative algorithms for tridiagonal matrices on a WSI-multiprocessor

DOE Office of Scientific and Technical Information (OSTI.GOV)

Gajski, D.D.; Sameh, A.H.; Wisniewski, J.A.

1982-01-01

With the rapid advances in semiconductor technology, the construction of Wafer Scale Integration (WSI)-multiprocessors consisting of a large number of processors is now feasible. We illustrate the implementation of some basic linear algebra algorithms on such multiprocessors.
Performance Evaluation of Parallel Algorithms and Architectures in Concurrent Multiprocessor Systems

DTIC Science & Technology

1988-09-01

HEP and Other Parallel processors, Report no. ANL-83-97, Argonne National Laboratory, Argonne, Ill. 1983. [19] Davidson, G . S. A Practical Paradigm for...IEEE Comp. Soc., 1986. [241 Peir, Jih-kwon, and D. Gajski , "CAMP: A Programming Aide For Multiprocessors," Proc. 1986 ICPP, IEEE Comp. Soc., pp475...482. [251 Pfister, G . F., and V. A. Norton, "Hot Spot Contention and Combining in Multistage Interconnection Networks,"IEEE Trans. Comp., C-34, Oct
Probabilistic evaluation of on-line checks in fault-tolerant multiprocessor systems

NASA Technical Reports Server (NTRS)

Nair, V. S. S.; Hoskote, Yatin V.; Abraham, Jacob A.

1992-01-01

The analysis of fault-tolerant multiprocessor systems that use concurrent error detection (CED) schemes is much more difficult than the analysis of conventional fault-tolerant architectures. Various analytical techniques have been proposed to evaluate CED schemes deterministically. However, these approaches are based on worst-case assumptions related to the failure of system components. Often, the evaluation results do not reflect the actual fault tolerance capabilities of the system. A probabilistic approach to evaluate the fault detecting and locating capabilities of on-line checks in a system is developed. The various probabilities associated with the checking schemes are identified and used in the framework of the matrix-based model. Based on these probabilistic matrices, estimates for the fault tolerance capabilities of various systems are derived analytically.

Simulation of a master-slave event set processor

DOE Office of Scientific and Technical Information (OSTI.GOV)

Comfort, J.C.

1984-03-01

Event set manipulation may consume a considerable amount of the computation time spent in performing a discrete-event simulation. One way of minimizing this time is to allow event set processing to proceed in parallel with the remainder of the simulation computation. The paper describes a multiprocessor simulation computer, in which all non-event set processing is performed by the principal processor (called the host). Event set processing is coordinated by a front end processor (the master) and actually performed by several other functionally identical processors (the slaves). A trace-driven simulation program modeling this system was constructed, and was run with tracemore » output taken from two different simulation programs. Output from this simulation suggests that a significant reduction in run time may be realized by this approach. Sensitivity analysis was performed on the significant parameters to the system (number of slave processors, relative processor speeds, and interprocessor communication times). A comparison between actual and simulation run times for a one-processor system was used to assist in the validation of the simulation. 7 references.« less
Asynchronous Data Retrieval from an Object-Oriented Database

NASA Astrophysics Data System (ADS)

Gilbert, Jonathan P.; Bic, Lubomir

We present an object-oriented semantic database model which, similar to other object-oriented systems, combines the virtues of four concepts: the functional data model, a property inheritance hierarchy, abstract data types and message-driven computation. The main emphasis is on the last of these four concepts. We describe generic procedures that permit queries to be processed in a purely message-driven manner. A database is represented as a network of nodes and directed arcs, in which each node is a logical processing element, capable of communicating with other nodes by exchanging messages. This eliminates the need for shared memory and for centralized control during query processing. Hence, the model is suitable for implementation on a multiprocessor computer architecture, consisting of large numbers of loosely coupled processing elements.
Simplifying and speeding the management of intra-node cache coherence

DOEpatents

Blumrich, Matthias A [Ridgefield, CT; Chen, Dong [Croton on Hudson, NY; Coteus, Paul W [Yorktown Heights, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Phillip [Cortlandt Manor, NY; Hoenicke, Dirk [Ossining, NY; Ohmacht, Martin [Yorktown Heights, NY

2012-04-17

A method and apparatus for managing coherence between two processors of a two processor node of a multi-processor computer system. Generally the present invention relates to a software algorithm that simplifies and significantly speeds the management of cache coherence in a message passing parallel computer, and to hardware apparatus that assists this cache coherence algorithm. The software algorithm uses the opening and closing of put/get windows to coordinate the activated required to achieve cache coherence. The hardware apparatus may be an extension to the hardware address decode, that creates, in the physical memory address space of the node, an area of virtual memory that (a) does not actually exist, and (b) is therefore able to respond instantly to read and write requests from the processing elements.
Managing coherence via put/get windows

DOE Office of Scientific and Technical Information (OSTI.GOV)

Blumrich, Matthias A; Chen, Dong; Coteus, Paul W

A method and apparatus for managing coherence between two processors of a two processor node of a multi-processor computer system. Generally the present invention relates to a software algorithm that simplifies and significantly speeds the management of cache coherence in a message passing parallel computer, and to hardware apparatus that assists this cache coherence algorithm. The software algorithm uses the opening and closing of put/get windows to coordinate the activated required to achieve cache coherence. The hardware apparatus may be an extension to the hardware address decode, that creates, in the physical memory address space of the node, an areamore » of virtual memory that (a) does not actually exist, and (b) is therefore able to respond instantly to read and write requests from the processing elements.« less
Three Program Architecture for Design Optimization

NASA Technical Reports Server (NTRS)

Miura, Hirokazu; Olson, Lawrence E. (Technical Monitor)

1998-01-01

In this presentation, I would like to review historical perspective on the program architecture used to build design optimization capabilities based on mathematical programming and other numerical search techniques. It is rather straightforward to classify the program architecture in three categories as shown above. However, the relative importance of each of the three approaches has not been static, instead dynamically changing as the capabilities of available computational resource increases. For example, we considered that the direct coupling architecture would never be used for practical problems, but availability of such computer systems as multi-processor. In this presentation, I would like to review the roles of three architecture from historical as well as current and future perspective. There may also be some possibility for emergence of hybrid architecture. I hope to provide some seeds for active discussion where we are heading to in the very dynamic environment for high speed computing and communication.
Essential issues in multiprocessor systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Gajski, D.D.; Peir, J.K.

1985-06-01

During the past several years, a great number of proposals have been made with the objective to increase supercomputer performance by an order of magnitude on the basis of a utilization of new computer architectures. The present paper is concerned with a suitable classification scheme for comparing these architectures. It is pointed out that there are basically four schools of thought as to the most important factor for an enhancement of computer performance. According to one school, the development of faster circuits will make it possible to retain present architectures, except, possibly, for a mechanism providing synchronization of parallel processes.more » A second school assigns priority to the optimization and vectorization of compilers, which will detect parallelism and help users to write better parallel programs. A third school believes in the predominant importance of new parallel algorithms, while the fourth school supports new models of computation. The merits of the four approaches are critically evaluated. 50 references.« less
Parallelization of a Monte Carlo particle transport simulation code

NASA Astrophysics Data System (ADS)

Hadjidoukas, P.; Bousis, C.; Emfietzoglou, D.

2010-05-01

We have developed a high performance version of the Monte Carlo particle transport simulation code MC4. The original application code, developed in Visual Basic for Applications (VBA) for Microsoft Excel, was first rewritten in the C programming language for improving code portability. Several pseudo-random number generators have been also integrated and studied. The new MC4 version was then parallelized for shared and distributed-memory multiprocessor systems using the Message Passing Interface. Two parallel pseudo-random number generator libraries (SPRNG and DCMT) have been seamlessly integrated. The performance speedup of parallel MC4 has been studied on a variety of parallel computing architectures including an Intel Xeon server with 4 dual-core processors, a Sun cluster consisting of 16 nodes of 2 dual-core AMD Opteron processors and a 200 dual-processor HP cluster. For large problem size, which is limited only by the physical memory of the multiprocessor server, the speedup results are almost linear on all systems. We have validated the parallel implementation against the serial VBA and C implementations using the same random number generator. Our experimental results on the transport and energy loss of electrons in a water medium show that the serial and parallel codes are equivalent in accuracy. The present improvements allow for studying of higher particle energies with the use of more accurate physical models, and improve statistics as more particles tracks can be simulated in low response time.
A parallel computing engine for a class of time critical processes.

PubMed

Nabhan, T M; Zomaya, A Y

1997-01-01

This paper focuses on the efficient parallel implementation of systems of numerically intensive nature over loosely coupled multiprocessor architectures. These analytical models are of significant importance to many real-time systems that have to meet severe time constants. A parallel computing engine (PCE) has been developed in this work for the efficient simplification and the near optimal scheduling of numerical models over the different cooperating processors of the parallel computer. First, the analytical system is efficiently coded in its general form. The model is then simplified by using any available information (e.g., constant parameters). A task graph representing the interconnections among the different components (or equations) is generated. The graph can then be compressed to control the computation/communication requirements. The task scheduler employs a graph-based iterative scheme, based on the simulated annealing algorithm, to map the vertices of the task graph onto a Multiple-Instruction-stream Multiple-Data-stream (MIMD) type of architecture. The algorithm uses a nonanalytical cost function that properly considers the computation capability of the processors, the network topology, the communication time, and congestion possibilities. Moreover, the proposed technique is simple, flexible, and computationally viable. The efficiency of the algorithm is demonstrated by two case studies with good results.
3D Navier-Stokes Flow Analysis for a Large-Array Multiprocessor

DTIC Science & Technology

1989-04-17

computer, Alliant’s FX /8, Intel’s Hypercube, and Encore’s Multimax. Unfortunately, the current algorithms have been developed pri- marily for SISD machines...Reversing and Thrust-Vectoring Nozzle Flows," Ph.D. Dissertation in the Dept. of Aero. and Astro ., Univ. of Wash., Washington, 1986. [11] Anderson
Avoiding and tolerating latency in large-scale next-generation shared-memory multiprocessors

NASA Technical Reports Server (NTRS)

Probst, David K.

1993-01-01

A scalable solution to the memory-latency problem is necessary to prevent the large latencies of synchronization and memory operations inherent in large-scale shared-memory multiprocessors from reducing high performance. We distinguish latency avoidance and latency tolerance. Latency is avoided when data is brought to nearby locales for future reference. Latency is tolerated when references are overlapped with other computation. Latency-avoiding locales include: processor registers, data caches used temporally, and nearby memory modules. Tolerating communication latency requires parallelism, allowing the overlap of communication and computation. Latency-tolerating techniques include: vector pipelining, data caches used spatially, prefetching in various forms, and multithreading in various forms. Relaxing the consistency model permits increased use of avoidance and tolerance techniques. Each model is a mapping from the program text to sets of partial orders on program operations; it is a convention about which temporal precedences among program operations are necessary. Information about temporal locality and parallelism constrains the use of avoidance and tolerance techniques. Suitable architectural primitives and compiler technology are required to exploit the increased freedom to reorder and overlap operations in relaxed models.
Proteus: a reconfigurable computational network for computer vision

NASA Astrophysics Data System (ADS)

Haralick, Robert M.; Somani, Arun K.; Wittenbrink, Craig M.; Johnson, Robert; Cooper, Kenneth; Shapiro, Linda G.; Phillips, Ihsin T.; Hwang, Jenq N.; Cheung, William; Yao, Yung H.; Chen, Chung-Ho; Yang, Larry; Daugherty, Brian; Lorbeski, Bob; Loving, Kent; Miller, Tom; Parkins, Larye; Soos, Steven L.

1992-04-01

The Proteus architecture is a highly parallel MIMD, multiple instruction, multiple-data machine, optimized for large granularity tasks such as machine vision and image processing The system can achieve 20 Giga-flops (80 Giga-flops peak). It accepts data via multiple serial links at a rate of up to 640 megabytes/second. The system employs a hierarchical reconfigurable interconnection network with the highest level being a circuit switched Enhanced Hypercube serial interconnection network for internal data transfers. The system is designed to use 256 to 1,024 RISC processors. The processors use one megabyte external Read/Write Allocating Caches for reduced multiprocessor contention. The system detects, locates, and replaces faulty subsystems using redundant hardware to facilitate fault tolerance. The parallelism is directly controllable through an advanced software system for partitioning, scheduling, and development. System software includes a translator for the INSIGHT language, a parallel debugger, low and high level simulators, and a message passing system for all control needs. Image processing application software includes a variety of point operators neighborhood, operators, convolution, and the mathematical morphology operations of binary and gray scale dilation, erosion, opening, and closing.
Multiprocessor Neural Network in Healthcare.

PubMed

Godó, Zoltán Attila; Kiss, Gábor; Kocsis, Dénes

2015-01-01

A possible way of creating a multiprocessor artificial neural network is by the use of microcontrollers. The RISC processors' high performance and the large number of I/O ports mean they are greatly suitable for creating such a system. During our research, we wanted to see if it is possible to efficiently create interaction between the artifical neural network and the natural nervous system. To achieve as much analogy to the living nervous system as possible, we created a frequency-modulated analog connection between the units. Our system is connected to the living nervous system through 128 microelectrodes. Two-way communication is provided through A/D transformation, which is even capable of testing psychopharmacons. The microcontroller-based analog artificial neural network can play a great role in medical singal processing, such as ECG, EEG etc.
Characterizing parallel file-access patterns on a large-scale multiprocessor

NASA Technical Reports Server (NTRS)

Purakayastha, A.; Ellis, Carla; Kotz, David; Nieuwejaar, Nils; Best, Michael L.

1995-01-01

High-performance parallel file systems are needed to satisfy tremendous I/O requirements of parallel scientific applications. The design of such high-performance parallel file systems depends on a comprehensive understanding of the expected workload, but so far there have been very few usage studies of multiprocessor file systems. This paper is part of the CHARISMA project, which intends to fill this void by measuring real file-system workloads on various production parallel machines. In particular, we present results from the CM-5 at the National Center for Supercomputing Applications. Our results are unique because we collect information about nearly every individual I/O request from the mix of jobs running on the machine. Analysis of the traces leads to various recommendations for parallel file-system design.
Parallel algorithm of VLBI software correlator under multiprocessor environment

NASA Astrophysics Data System (ADS)

Zheng, Weimin; Zhang, Dong

2007-11-01

The correlator is the key signal processing equipment of a Very Lone Baseline Interferometry (VLBI) synthetic aperture telescope. It receives the mass data collected by the VLBI observatories and produces the visibility function of the target, which can be used to spacecraft position, baseline length measurement, synthesis imaging, and other scientific applications. VLBI data correlation is a task of data intensive and computation intensive. This paper presents the algorithms of two parallel software correlators under multiprocessor environments. A near real-time correlator for spacecraft tracking adopts the pipelining and thread-parallel technology, and runs on the SMP (Symmetric Multiple Processor) servers. Another high speed prototype correlator using the mixed Pthreads and MPI (Massage Passing Interface) parallel algorithm is realized on a small Beowulf cluster platform. Both correlators have the characteristic of flexible structure, scalability, and with 10-station data correlating abilities.
Population-based learning of load balancing policies for a distributed computer system

NASA Technical Reports Server (NTRS)

Mehra, Pankaj; Wah, Benjamin W.

1993-01-01

Effective load-balancing policies use dynamic resource information to schedule tasks in a distributed computer system. We present a novel method for automatically learning such policies. At each site in our system, we use a comparator neural network to predict the relative speedup of an incoming task using only the resource-utilization patterns obtained prior to the task's arrival. Outputs of these comparator networks are broadcast periodically over the distributed system, and the resource schedulers at each site use these values to determine the best site for executing an incoming task. The delays incurred in propagating workload information and tasks from one site to another, as well as the dynamic and unpredictable nature of workloads in multiprogrammed multiprocessors, may cause the workload pattern at the time of execution to differ from patterns prevailing at the times of load-index computation and decision making. Our load-balancing policy accommodates this uncertainty by using certain tunable parameters. We present a population-based machine-learning algorithm that adjusts these parameters in order to achieve high average speedups with respect to local execution. Our results show that our load-balancing policy, when combined with the comparator neural network for workload characterization, is effective in exploiting idle resources in a distributed computer system.
Speeding up parallel processing

NASA Technical Reports Server (NTRS)

Denning, Peter J.

1988-01-01

In 1967 Amdahl expressed doubts about the ultimate utility of multiprocessors. The formulation, now called Amdahl's law, became part of the computing folklore and has inspired much skepticism about the ability of the current generation of massively parallel processors to efficiently deliver all their computing power to programs. The widely publicized recent results of a group at Sandia National Laboratory, which showed speedup on a 1024 node hypercube of over 500 for three fixed size problems and over 1000 for three scalable problems, have convincingly challenged this bit of folklore and have given new impetus to parallel scientific computing.
Functional requirements document for the Earth Observing System Data and Information System (EOSDIS) Scientific Computing Facilities (SCF) of the NASA/MSFC Earth Science and Applications Division, 1992

NASA Technical Reports Server (NTRS)

Botts, Michael E.; Phillips, Ron J.; Parker, John V.; Wright, Patrick D.

1992-01-01

Five scientists at MSFC/ESAD have EOS SCF investigator status. Each SCF has unique tasks which require the establishment of a computing facility dedicated to accomplishing those tasks. A SCF Working Group was established at ESAD with the charter of defining the computing requirements of the individual SCFs and recommending options for meeting these requirements. The primary goal of the working group was to determine which computing needs can be satisfied using either shared resources or separate but compatible resources, and which needs require unique individual resources. The requirements investigated included CPU-intensive vector and scalar processing, visualization, data storage, connectivity, and I/O peripherals. A review of computer industry directions and a market survey of computing hardware provided information regarding important industry standards and candidate computing platforms. It was determined that the total SCF computing requirements might be most effectively met using a hierarchy consisting of shared and individual resources. This hierarchy is composed of five major system types: (1) a supercomputer class vector processor; (2) a high-end scalar multiprocessor workstation; (3) a file server; (4) a few medium- to high-end visualization workstations; and (5) several low- to medium-range personal graphics workstations. Specific recommendations for meeting the needs of each of these types are presented.
Domain Decomposition: A Bridge between Nature and Parallel Computers

DTIC Science & Technology

1992-09-01

B., "Domain Decomposition Algorithms for Indefinite Elliptic Problems," S"IAM Journal of S; cientific and Statistical (’omputing, Vol. 13, 1992, pp...AD-A256 575 NASA Contractor Report 189709 ICASE Report No. 92-44 ICASE DOMAIN DECOMPOSITION: A BRIDGE BETWEEN NATURE AND PARALLEL COMPUTERS DTIC dE...effectively implemented on dis- tributed memory multiprocessors. In 1990 (as reported in Ref. 38 using the tile algo- rithm), a 103,201-unknown 2D elliptic
Dynamic grid refinement for partial differential equations on parallel computers

NASA Technical Reports Server (NTRS)

Mccormick, S.; Quinlan, D.

1989-01-01

The fast adaptive composite grid method (FAC) is an algorithm that uses various levels of uniform grids to provide adaptive resolution and fast solution of PDEs. An asynchronous version of FAC, called AFAC, that completely eliminates the bottleneck to parallelism is presented. This paper describes the advantage that this algorithm has in adaptive refinement for moving singularities on multiprocessor computers. This work is applicable to the parallel solution of two- and three-dimensional shock tracking problems.
A comparison of multiprocessor scheduling methods for iterative data flow architectures

NASA Technical Reports Server (NTRS)

Storch, Matthew

1993-01-01

A comparative study is made between the Algorithm to Architecture Mapping Model (ATAMM) and three other related multiprocessing models from the published literature. The primary focus of all four models is the non-preemptive scheduling of large-grain iterative data flow graphs as required in real-time systems, control applications, signal processing, and pipelined computations. Important characteristics of the models such as injection control, dynamic assignment, multiple node instantiations, static optimum unfolding, range-chart guided scheduling, and mathematical optimization are identified. The models from the literature are compared with the ATAMM for performance, scheduling methods, memory requirements, and complexity of scheduling and design procedures.

Real-time dynamics simulation of the Cassini spacecraft using DARTS. Part 1: Functional capabilities and the spatial algebra algorithm

NASA Technical Reports Server (NTRS)

Jain, A.; Man, G. K.

1993-01-01

This paper describes the Dynamics Algorithms for Real-Time Simulation (DARTS) real-time hardware-in-the-loop dynamics simulator for the National Aeronautics and Space Administration's Cassini spacecraft. The spacecraft model consists of a central flexible body with a number of articulated rigid-body appendages. The demanding performance requirements from the spacecraft control system require the use of a high fidelity simulator for control system design and testing. The DARTS algorithm provides a new algorithmic and hardware approach to the solution of this hardware-in-the-loop simulation problem. It is based upon the efficient spatial algebra dynamics for flexible multibody systems. A parallel and vectorized version of this algorithm is implemented on a low-cost, multiprocessor computer to meet the simulation timing requirements.
Mobile Thread Task Manager

NASA Technical Reports Server (NTRS)

Clement, Bradley J.; Estlin, Tara A.; Bornstein, Benjamin J.

2013-01-01

The Mobile Thread Task Manager (MTTM) is being applied to parallelizing existing flight software to understand the benefits and to develop new techniques and architectural concepts for adapting software to multicore architectures. It allocates and load-balances tasks for a group of threads that migrate across processors to improve cache performance. In order to balance-load across threads, the MTTM augments a basic map-reduce strategy to draw jobs from a global queue. In a multicore processor, memory may be "homed" to the cache of a specific processor and must be accessed from that processor. The MTTB architecture wraps access to data with thread management to move threads to the home processor for that data so that the computation follows the data in an attempt to avoid L2 cache misses. Cache homing is also handled by a memory manager that translates identifiers to processor IDs where the data will be homed (according to rules defined by the user). The user can also specify the number of threads and processors separately, which is important for tuning performance for different patterns of computation and memory access. MTTM efficiently processes tasks in parallel on a multiprocessor computer. It also provides an interface to make it easier to adapt existing software to a multiprocessor environment.
The Design and Evaluation of "CAPTools"--A Computer Aided Parallelization Toolkit

NASA Technical Reports Server (NTRS)

Yan, Jerry; Frumkin, Michael; Hribar, Michelle; Jin, Haoqiang; Waheed, Abdul; Johnson, Steve; Cross, Jark; Evans, Emyr; Ierotheou, Constantinos; Leggett, Pete;

1998-01-01

Writing applications for high performance computers is a challenging task. Although writing code by hand still offers the best performance, it is extremely costly and often not very portable. The Computer Aided Parallelization Tools (CAPTools) are a toolkit designed to help automate the mapping of sequential FORTRAN scientific applications onto multiprocessors. CAPTools consists of the following major components: an inter-procedural dependence analysis module that incorporates user knowledge; a 'self-propagating' data partitioning module driven via user guidance; an execution control mask generation and optimization module for the user to fine tune parallel processing of individual partitions; a program transformation/restructuring facility for source code clean up and optimization; a set of browsers through which the user interacts with CAPTools at each stage of the parallelization process; and a code generator supporting multiple programming paradigms on various multiprocessors. Besides describing the rationale behind the architecture of CAPTools, the parallelization process is illustrated via case studies involving structured and unstructured meshes. The programming process and the performance of the generated parallel programs are compared against other programming alternatives based on the NAS Parallel Benchmarks, ARC3D and other scientific applications. Based on these results, a discussion on the feasibility of constructing architectural independent parallel applications is presented.

Ordered fast fourier transforms on a massively parallel hypercube multiprocessor

NASA Technical Reports Server (NTRS)

Tong, Charles; Swarztrauber, Paul N.

1989-01-01

Design alternatives for ordered Fast Fourier Transformation (FFT) algorithms were examined on massively parallel hypercube multiprocessors such as the Connection Machine. Particular emphasis is placed on reducing communication which is known to dominate the overall computing time. To this end, the order and computational phases of the FFT were combined, and the sequence to processor maps that reduce communication were used. The class of ordered transforms is expanded to include any FFT in which the order of the transform is the same as that of the input sequence. Two such orderings are examined, namely, standard-order and A-order which can be implemented with equal ease on the Connection Machine where orderings are determined by geometries and priorities. If the sequence has N = 2 exp r elements and the hypercube has P = 2 exp d processors, then a standard-order FFT can be implemented with d + r/2 + 1 parallel transmissions. An A-order sequence can be transformed with 2d - r/2 parallel transmissions which is r - d + 1 fewer than the standard order. A parallel method for computing the trigonometric coefficients is presented that does not use trigonometric functions or interprocessor communication. A performance of 0.9 GFLOPS was obtained for an A-order transform on the Connection Machine.
The performance of disk arrays in shared-memory database machines

NASA Technical Reports Server (NTRS)

Katz, Randy H.; Hong, Wei

1993-01-01

In this paper, we examine how disk arrays and shared memory multiprocessors lead to an effective method for constructing database machines for general-purpose complex query processing. We show that disk arrays can lead to cost-effective storage systems if they are configured from suitably small formfactor disk drives. We introduce the storage system metric data temperature as a way to evaluate how well a disk configuration can sustain its workload, and we show that disk arrays can sustain the same data temperature as a more expensive mirrored-disk configuration. We use the metric to evaluate the performance of disk arrays in XPRS, an operational shared-memory multiprocessor database system being developed at the University of California, Berkeley.
Dynamic modelling and estimation of the error due to asynchronism in a redundant asynchronous multiprocessor system

NASA Technical Reports Server (NTRS)

Huynh, Loc C.; Duval, R. W.

1986-01-01

The use of Redundant Asynchronous Multiprocessor System to achieve ultrareliable Fault Tolerant Control Systems shows great promise. The development has been hampered by the inability to determine whether differences in the outputs of redundant CPU's are due to failures or to accrued error built up by slight differences in CPU clock intervals. This study derives an analytical dynamic model of the difference between redundant CPU's due to differences in their clock intervals and uses this model with on-line parameter identification to idenitify the differences in the clock intervals. The ability of this methodology to accurately track errors due to asynchronisity generate an error signal with the effect of asynchronisity removed and this signal may be used to detect and isolate actual system failures.
Random Walk Method for Potential Problems

NASA Technical Reports Server (NTRS)

Krishnamurthy, T.; Raju, I. S.

2002-01-01

A local Random Walk Method (RWM) for potential problems governed by Lapalace's and Paragon's equations is developed for two- and three-dimensional problems. The RWM is implemented and demonstrated in a multiprocessor parallel environment on a Beowulf cluster of computers. A speed gain of 16 is achieved as the number of processors is increased from 1 to 23.
Statistical methodologies for the control of dynamic remapping

NASA Technical Reports Server (NTRS)

Saltz, J. H.; Nicol, D. M.

1986-01-01

Following an initial mapping of a problem onto a multiprocessor machine or computer network, system performance often deteriorates with time. In order to maintain high performance, it may be necessary to remap the problem. The decision to remap must take into account measurements of performance deterioration, the cost of remapping, and the estimated benefits achieved by remapping. We examine the tradeoff between the costs and the benefits of remapping two qualitatively different kinds of problems. One problem assumes that performance deteriorates gradually, the other assumes that performance deteriorates suddenly. We consider a variety of policies for governing when to remap. In order to evaluate these policies, statistical models of problem behaviors are developed. Simulation results are presented which compare simple policies with computationally expensive optimal decision policies; these results demonstrate that for each problem type, the proposed simple policies are effective and robust.
Realization of preconditioned Lanczos and conjugate gradient algorithms on optical linear algebra processors.

PubMed

Ghosh, A

1988-08-01

Lanczos and conjugate gradient algorithms are important in computational linear algebra. In this paper, a parallel pipelined realization of these algorithms on a ring of optical linear algebra processors is described. The flow of data is designed to minimize the idle times of the optical multiprocessor and the redundancy of computations. The effects of optical round-off errors on the solutions obtained by the optical Lanczos and conjugate gradient algorithms are analyzed, and it is shown that optical preconditioning can improve the accuracy of these algorithms substantially. Algorithms for optical preconditioning and results of numerical experiments on solving linear systems of equations arising from partial differential equations are discussed. Since the Lanczos algorithm is used mostly with sparse matrices, a folded storage scheme to represent sparse matrices on spatial light modulators is also described.
Real-Time Considerations for Rugged Embedded Systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Tumeo, Antonino; Ceriani, Marco; Palermo, Gianluca

This chapter introduces the characterizing aspects of embedded systems, and discusses the specific features that a designer should address to an embedded system “rugged”, i.e., able to operate reliably in harsh environments. The chapter addresses both the hardware and the less obvious software aspect. After presenting a current list of certifications for ruggedization, the chapters present a case study that focuses on the interaction of the hardware and software layers in reactive real-time system. In particular, it shows how the use of fast FPGA prototyping could provide insights on unexpected factors that influence the performance and thus responsiveness to eventsmore » of a scheduling algorithm for multiprocessor systems that manages both periodic, hard real-time task, and aperiodic tasks. The main lesson is that to make the system “rugged”, a designer should consider these issues by, for example, overprovisioning resources and/or computation capabilities.« less
Experiences with Transitioning Science Data Production from a Symmetric Multiprocessor Platform to a Linux Cluster Environment

NASA Astrophysics Data System (ADS)

Walter, R. J.; Protack, S. P.; Harris, C. J.; Caruthers, C.; Kusterer, J. M.

2008-12-01

NASA's Atmospheric Science Data Center at the NASA Langley Research Center performs all of the science data processing for the Multi-angle Imaging SpectroRadiometer (MISR) instrument. MISR is one of the five remote sensing instruments flying aboard NASA's Terra spacecraft. From the time of Terra launch in December 1999 until February 2008, all MISR science data processing was performed on a Silicon Graphics, Inc. (SGI) platform. However, dramatic improvements in commodity computing technology coupled with steadily declining project budgets during that period eventually made transitioning MISR processing to a commodity computing environment both feasible and necessary. The Atmospheric Science Data Center has successfully ported the MISR science data processing environment from the SGI platform to a Linux cluster environment. There were a multitude of technical challenges associated with this transition. Even though the core architecture of the production system did not change, the manner in which it interacted with underlying hardware was fundamentally different. In addition, there are more potential throughput bottlenecks in a cluster environment than there are in a symmetric multiprocessor environment like the SGI platform and each of these had to be addressed. Once all the technical issues associated with the transition were resolved, the Atmospheric Science Data Center had a MISR science data processing system with significantly higher throughput than the SGI platform at a fraction of the cost. In addition to the commodity hardware, free and open source software such as S4PM, Sun Grid Engine, PostgreSQL and Ganglia play a significant role in the new system. Details of the technical challenges and resolutions, software systems, performance improvements, and cost savings associated with the transition will be discussed. The Atmospheric Science Data Center in Langley's Science Directorate leads NASA's program for the processing, archival and distribution of Earth science data in the areas of radiation budget, clouds, aerosols, and tropospheric chemistry. The Data Center was established in 1991 to support NASA's Earth Observing System and the U.S. Global Change Research Program. It is unique among NASA data centers in the size of its archive, cutting edge computing technology, and full range of data services. For more information regarding ASDC data holdings, documentation, tools and services, visit http://eosweb.larc.nasa.gov
3-D parallel program for numerical calculation of gas dynamics problems with heat conductivity on distributed memory computational systems (CS)

DOE Office of Scientific and Technical Information (OSTI.GOV)

Sofronov, I.D.; Voronin, B.L.; Butnev, O.I.

1997-12-31

The aim of the work performed is to develop a 3D parallel program for numerical calculation of gas dynamics problem with heat conductivity on distributed memory computational systems (CS), satisfying the condition of numerical result independence from the number of processors involved. Two basically different approaches to the structure of massive parallel computations have been developed. The first approach uses the 3D data matrix decomposition reconstructed at temporal cycle and is a development of parallelization algorithms for multiprocessor CS with shareable memory. The second approach is based on using a 3D data matrix decomposition not reconstructed during a temporal cycle.more » The program was developed on 8-processor CS MP-3 made in VNIIEF and was adapted to a massive parallel CS Meiko-2 in LLNL by joint efforts of VNIIEF and LLNL staffs. A large number of numerical experiments has been carried out with different number of processors up to 256 and the efficiency of parallelization has been evaluated in dependence on processor number and their parameters.« less
A tool for modeling concurrent real-time computation

NASA Technical Reports Server (NTRS)

Sharma, D. D.; Huang, Shie-Rei; Bhatt, Rahul; Sridharan, N. S.

1990-01-01

Real-time computation is a significant area of research in general, and in AI in particular. The complexity of practical real-time problems demands use of knowledge-based problem solving techniques while satisfying real-time performance constraints. Since the demands of a complex real-time problem cannot be predicted (owing to the dynamic nature of the environment) powerful dynamic resource control techniques are needed to monitor and control the performance. A real-time computation model for a real-time tool, an implementation of the QP-Net simulator on a Symbolics machine, and an implementation on a Butterfly multiprocessor machine are briefly described.
Parallel solution of sparse one-dimensional dynamic programming problems

NASA Technical Reports Server (NTRS)

Nicol, David M.

1989-01-01

Parallel computation offers the potential for quickly solving large computational problems. However, it is often a non-trivial task to effectively use parallel computers. Solution methods must sometimes be reformulated to exploit parallelism; the reformulations are often more complex than their slower serial counterparts. We illustrate these points by studying the parallelization of sparse one-dimensional dynamic programming problems, those which do not obviously admit substantial parallelization. We propose a new method for parallelizing such problems, develop analytic models which help us to identify problems which parallelize well, and compare the performance of our algorithm with existing algorithms on a multiprocessor.
Improving Quantum Gate Simulation using a GPU

NASA Astrophysics Data System (ADS)

Gutierrez, Eladio; Romero, Sergio; Trenas, Maria A.; Zapata, Emilio L.

2008-11-01

Due to the increasing computing power of the graphics processing units (GPU), they are becoming more and more popular when solving general purpose algorithms. As the simulation of quantum computers results on a problem with exponential complexity, it is advisable to perform a parallel computation, such as the one provided by the SIMD multiprocessors present in recent GPUs. In this paper, we focus on an important quantum algorithm, the quantum Fourier transform (QTF), in order to evaluate different parallelization strategies on a novel GPU architecture. Our implementation makes use of the new CUDA software/hardware architecture developed recently by NVIDIA.
Thriving on Chaos: The Development of a Surgical Information System

PubMed Central

Olund, Steven R.

1988-01-01

Hospitals present unique challenges to the computer industry, generating a greater quantity and variety of data than nearly any other enterprise. This is complicated by the fact that a hospital is not one homogenous organization, but a bundle of semi-independent groups with unique data requirements. Therefore hospital information systems must be fast, flexible, reliable, easy to use and maintain, and cost-effective. The Surgical Information System at Rush Presbyterian-St. Luke's Medical Center, Chicago is such system. It uses a Sequent Balance 21000 multi-processor superminicomputer, running industry standard tools such as the Unix operating system, a 4th generation programming language (4GL), and Structured Query Language (SQL) relational database management software. This treatise illustrates a comprehensive yet generic approach which can be applied to almost any clinical situation where access to patient data is required by a variety of medical professionals.
Method for prefetching non-contiguous data structures

DOEpatents

Blumrich, Matthias A [Ridgefield, CT; Chen, Dong [Croton On Hudson, NY; Coteus, Paul W [Yorktown Heights, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Philip [Cortlandt Manor, NY; Hoenicke, Dirk [Ossining, NY; Ohmacht, Martin [Brewster, NY; Steinmacher-Burow, Burkhard D [Mount Kisco, NY; Takken, Todd E [Mount Kisco, NY; Vranas, Pavlos M [Bedford Hills, NY

2009-05-05

A low latency memory system access is provided in association with a weakly-ordered multiprocessor system. Each processor in the multiprocessor shares resources, and each shared resource has an associated lock within a locking device that provides support for synchronization between the multiple processors in the multiprocessor and the orderly sharing of the resources. A processor only has permission to access a resource when it owns the lock associated with that resource, and an attempt by a processor to own a lock requires only a single load operation, rather than a traditional atomic load followed by store, such that the processor only performs a read operation and the hardware locking device performs a subsequent write operation rather than the processor. A simple perfecting for non-contiguous data structures is also disclosed. A memory line is redefined so that in addition to the normal physical memory data, every line includes a pointer that is large enough to point to any other line in the memory, wherein the pointers to determine which memory line to prefect rather than some other predictive algorithm. This enables hardware to effectively prefect memory access patterns that are non-contiguous, but repetitive.
Low latency memory access and synchronization

DOE Office of Scientific and Technical Information (OSTI.GOV)

Blumrich, Matthias A.; Chen, Dong; Coteus, Paul W.

A low latency memory system access is provided in association with a weakly-ordered multiprocessor system. Each processor in the multiprocessor shares resources, and each shared resource has an associated lock within a locking device that provides support for synchronization between the multiple processors in the multiprocessor and the orderly sharing of the resources. A processor only has permission to access a resource when it owns the lock associated with that resource, and an attempt by a processor to own a lock requires only a single load operation, rather than a traditional atomic load followed by store, such that the processormore » only performs a read operation and the hardware locking device performs a subsequent write operation rather than the processor. A simple prefetching for non-contiguous data structures is also disclosed. A memory line is redefined so that in addition to the normal physical memory data, every line includes a pointer that is large enough to point to any other line in the memory, wherein the pointers to determine which memory line to prefetch rather than some other predictive algorithm. This enables hardware to effectively prefetch memory access patterns that are non-contiguous, but repetitive.« less
Low latency memory access and synchronization

DOE Office of Scientific and Technical Information (OSTI.GOV)

Blumrich, Matthias A.; Chen, Dong; Coteus, Paul W.

A low latency memory system access is provided in association with a weakly-ordered multiprocessor system. Bach processor in the multiprocessor shares resources, and each shared resource has an associated lock within a locking device that provides support for synchronization between the multiple processors in the multiprocessor and the orderly sharing of the resources. A processor only has permission to access a resource when it owns the lock associated with that resource, and an attempt by a processor to own a lock requires only a single load operation, rather than a traditional atomic load followed by store, such that the processormore » only performs a read operation and the hardware locking device performs a subsequent write operation rather than the processor. A simple prefetching for non-contiguous data structures is also disclosed. A memory line is redefined so that in addition to the normal physical memory data, every line includes a pointer that is large enough to point to any other line in the memory, wherein the pointers to determine which memory line to prefetch rather than some other predictive algorithm. This enables hardware to effectively prefetch memory access patterns that are non-contiguous, but repetitive.« less
Accelerating the Gillespie Exact Stochastic Simulation Algorithm using hybrid parallel execution on graphics processing units.

PubMed

Komarov, Ivan; D'Souza, Roshan M

2012-01-01

The Gillespie Stochastic Simulation Algorithm (GSSA) and its variants are cornerstone techniques to simulate reaction kinetics in situations where the concentration of the reactant is too low to allow deterministic techniques such as differential equations. The inherent limitations of the GSSA include the time required for executing a single run and the need for multiple runs for parameter sweep exercises due to the stochastic nature of the simulation. Even very efficient variants of GSSA are prohibitively expensive to compute and perform parameter sweeps. Here we present a novel variant of the exact GSSA that is amenable to acceleration by using graphics processing units (GPUs). We parallelize the execution of a single realization across threads in a warp (fine-grained parallelism). A warp is a collection of threads that are executed synchronously on a single multi-processor. Warps executing in parallel on different multi-processors (coarse-grained parallelism) simultaneously generate multiple trajectories. Novel data-structures and algorithms reduce memory traffic, which is the bottleneck in computing the GSSA. Our benchmarks show an 8×-120× performance gain over various state-of-the-art serial algorithms when simulating different types of models.

Recent Progress on the Parallel Implementation of Moving-Body Overset Grid Schemes

NASA Technical Reports Server (NTRS)

Wissink, Andrew; Allen, Edwin (Technical Monitor)

1998-01-01

Viscous calculations about geometrically complex bodies in which there is relative motion between component parts is one of the most computationally demanding problems facing CFD researchers today. This presentation documents results from the first two years of a CHSSI-funded effort within the U.S. Army AFDD to develop scalable dynamic overset grid methods for unsteady viscous calculations with moving-body problems. The first pan of the presentation will focus on results from OVERFLOW-D1, a parallelized moving-body overset grid scheme that employs traditional Chimera methodology. The two processes that dominate the cost of such problems are the flow solution on each component and the intergrid connectivity solution. Parallel implementations of the OVERFLOW flow solver and DCF3D connectivity software are coupled with a proposed two-part static-dynamic load balancing scheme and tested on the IBM SP and Cray T3E multi-processors. The second part of the presentation will cover some recent results from OVERFLOW-D2, a new flow solver that employs Cartesian grids with various levels of refinement, facilitating solution adaption. A study of the parallel performance of the scheme on large distributed- memory multiprocessor computer architectures will be reported.
Multiprocessor architecture: Synthesis and evaluation

NASA Technical Reports Server (NTRS)

Standley, Hilda M.

1990-01-01

Multiprocessor computed architecture evaluation for structural computations is the focus of the research effort described. Results obtained are expected to lead to more efficient use of existing architectures and to suggest designs for new, application specific, architectures. The brief descriptions given outline a number of related efforts directed toward this purpose. The difficulty is analyzing an existing architecture or in designing a new computer architecture lies in the fact that the performance of a particular architecture, within the context of a given application, is determined by a number of factors. These include, but are not limited to, the efficiency of the computation algorithm, the programming language and support environment, the quality of the program written in the programming language, the multiplicity of the processing elements, the characteristics of the individual processing elements, the interconnection network connecting processors and non-local memories, and the shared memory organization covering the spectrum from no shared memory (all local memory) to one global access memory. These performance determiners may be loosely classified as being software or hardware related. This distinction is not clear or even appropriate in many cases. The effect of the choice of algorithm is ignored by assuming that the algorithm is specified as given. Effort directed toward the removal of the effect of the programming language and program resulted in the design of a high-level parallel programming language. Two characteristics of the fundamental structure of the architecture (memory organization and interconnection network) are examined.
Multitasking kernel for the C and Fortran programming languages

DOE Office of Scientific and Technical Information (OSTI.GOV)

Brooks, E.D. III

1984-09-01

A multitasking kernel for the C and Fortran programming languages which runs on the Unix operating system is presented. The kernel provides a multitasking environment which serves two purposes. The first is to provide an efficient portable environment for the coding, debugging and execution of production multiprocessor programs. The second is to provide a means of evaluating the performance of a multitasking program on model multiprocessors. The performance evaluation features require no changes in the source code of the application and are implemented as a set of compile and run time options in the kernel.
Measurement and analysis of workload effects on fault latency in real-time systems

NASA Technical Reports Server (NTRS)

Woodbury, Michael H.; Shin, Kang G.

1990-01-01

The authors demonstrate the need to address fault latency in highly reliable real-time control computer systems. It is noted that the effectiveness of all known recovery mechanisms is greatly reduced in the presence of multiple latent faults. The presence of multiple latent faults increases the possibility of multiple errors, which could result in coverage failure. The authors present experimental evidence indicating that the duration of fault latency is dependent on workload. A synthetic workload generator is used to vary the workload, and a hardware fault injector is applied to inject transient faults of varying durations. This method makes it possible to derive the distribution of fault latency duration. Experimental results obtained from the fault-tolerant multiprocessor at the NASA Airlab are presented and discussed.
Error recovery in shared memory multiprocessors using private caches

NASA Technical Reports Server (NTRS)

Wu, Kun-Lung; Fuchs, W. Kent; Patel, Janak H.

1990-01-01

The problem of recovering from processor transient faults in shared memory multiprocesses systems is examined. A user-transparent checkpointing and recovery scheme using private caches is presented. Processes can recover from errors due to faulty processors by restarting from the checkpointed computation state. Implementation techniques using checkpoint identifiers and recovery stacks are examined as a means of reducing performance degradation in processor utilization during normal execution. This cache-based checkpointing technique prevents rollback propagation, provides rapid recovery, and can be integrated into standard cache coherence protocols. An analytical model is used to estimate the relative performance of the scheme during normal execution. Extensions to take error latency into account are presented.
Evaluation of the Cedar memory system: Configuration of 16 by 16

NASA Technical Reports Server (NTRS)

Gallivan, K.; Jalby, W.; Wijshoff, H.

1991-01-01

Some basic results on the performance of the Cedar multiprocessor system are presented. Empirical results on the 16 processor 16 memory bank system configuration, which show the behavior of the Cedar system under different modes of operation are presented.
Efficient Parallelization of a Dynamic Unstructured Application on the Tera MTA

NASA Technical Reports Server (NTRS)

Oliker, Leonid; Biswas, Rupak

1999-01-01

The success of parallel computing in solving real-life computationally-intensive problems relies on their efficient mapping and execution on large-scale multiprocessor architectures. Many important applications are both unstructured and dynamic in nature, making their efficient parallel implementation a daunting task. This paper presents the parallelization of a dynamic unstructured mesh adaptation algorithm using three popular programming paradigms on three leading supercomputers. We examine an MPI message-passing implementation on the Cray T3E and the SGI Origin2OOO, a shared-memory implementation using cache coherent nonuniform memory access (CC-NUMA) of the Origin2OOO, and a multi-threaded version on the newly-released Tera Multi-threaded Architecture (MTA). We compare several critical factors of this parallel code development, including runtime, scalability, programmability, and memory overhead. Our overall results demonstrate that multi-threaded systems offer tremendous potential for quickly and efficiently solving some of the most challenging real-life problems on parallel computers.
Low-power, transparent optical network interface for high bandwidth off-chip interconnects.

PubMed

Liboiron-Ladouceur, Odile; Wang, Howard; Garg, Ajay S; Bergman, Keren

2009-04-13

The recent emergence of multicore architectures and chip multiprocessors (CMPs) has accelerated the bandwidth requirements in high-performance processors for both on-chip and off-chip interconnects. For next generation computing clusters, the delivery of scalable power efficient off-chip communications to each compute node has emerged as a key bottleneck to realizing the full computational performance of these systems. The power dissipation is dominated by the off-chip interface and the necessity to drive high-speed signals over long distances. We present a scalable photonic network interface approach that fully exploits the bandwidth capacity offered by optical interconnects while offering significant power savings over traditional E/O and O/E approaches. The power-efficient interface optically aggregates electronic serial data streams into a multiple WDM channel packet structure at time-of-flight latencies. We demonstrate a scalable optical network interface with 70% improvement in power efficiency for a complete end-to-end PCI Express data transfer.
Portable Parallel Programming for the Dynamic Load Balancing of Unstructured Grid Applications

NASA Technical Reports Server (NTRS)

Biswas, Rupak; Das, Sajal K.; Harvey, Daniel; Oliker, Leonid

1999-01-01

The ability to dynamically adapt an unstructured -rid (or mesh) is a powerful tool for solving computational problems with evolving physical features; however, an efficient parallel implementation is rather difficult, particularly from the view point of portability on various multiprocessor platforms We address this problem by developing PLUM, tin automatic anti architecture-independent framework for adaptive numerical computations in a message-passing environment. Portability is demonstrated by comparing performance on an SP2, an Origin2000, and a T3E, without any code modifications. We also present a general-purpose load balancer that utilizes symmetric broadcast networks (SBN) as the underlying communication pattern, with a goal to providing a global view of system loads across processors. Experiments on, an SP2 and an Origin2000 demonstrate the portability of our approach which achieves superb load balance at the cost of minimal extra overhead.
A cache-aided multiprocessor rollback recovery scheme

NASA Technical Reports Server (NTRS)

Wu, Kun-Lung; Fuchs, W. Kent

1989-01-01

This paper demonstrates how previous uniprocessor cache-aided recovery schemes can be applied to multiprocessor architectures, for recovering from transient processor failures, utilizing private caches and a global shared memory. As with cache-aided uniprocessor recovery, the multiprocessor cache-aided recovery scheme of this paper can be easily integrated into standard bus-based snoopy cache coherence protocols. A consistent shared memory state is maintained without the necessity of global check-pointing.
Closed-form solutions of performability. [modeling of a degradable buffer/multiprocessor system

NASA Technical Reports Server (NTRS)

Meyer, J. F.

1981-01-01

Methods which yield closed form performability solutions for continuous valued variables are developed. The models are similar to those employed in performance modeling (i.e., Markovian queueing models) but are extended so as to account for variations in structure due to faults. In particular, the modeling of a degradable buffer/multiprocessor system is considered whose performance Y is the (normalized) average throughput rate realized during a bounded interval of time. To avoid known difficulties associated with exact transient solutions, an approximate decomposition of the model is employed permitting certain submodels to be solved in equilibrium. These solutions are then incorporated in a model with fewer transient states and by solving the latter, a closed form solution of the system's performability is obtained. In conclusion, some applications of this solution are discussed and illustrated, including an example of design optimization.
By Hand or Not By-Hand: A Case Study of Alternative Approaches to Parallelize CFD Applications

NASA Technical Reports Server (NTRS)

Yan, Jerry C.; Bailey, David (Technical Monitor)

1997-01-01

While parallel processing promises to speed up applications by several orders of magnitude, the performance achieved still depends upon several factors, including the multiprocessor architecture, system software, data distribution and alignment, as well as the methods used for partitioning the application and mapping its components onto the architecture. The existence of the Gorden Bell Prize given out at Supercomputing every year suggests that while good performance can be attained for real applications on general purpose multiprocessors, the large investment in man-power and time still has to be repeated for each application-machine combination. As applications and machine architectures become more complex, the cost and time-delays for obtaining performance by hand will become prohibitive. Computer users today can turn to three possible avenues for help: parallel libraries, parallel languages and compilers, interactive parallelization tools. The success of these methodologies, in turn, depends on proper application of data dependency analysis, program structure recognition and transformation, performance prediction as well as exploitation of user supplied knowledge. NASA has been developing multidisciplinary applications on highly parallel architectures under the High Performance Computing and Communications Program. Over the past six years, the transition of underlying hardware and system software have forced the scientists to spend a large effort to migrate and recede their applications. Various attempts to exploit software tools to automate the parallelization process have not produced favorable results. In this paper, we report our most recent experience with CAPTOOL, a package developed at Greenwich University. We have chosen CAPTOOL for three reasons: 1. CAPTOOL accepts a FORTRAN 77 program as input. This suggests its potential applicability to a large collection of legacy codes currently in use. 2. CAPTOOL employs domain decomposition to obtain parallelism. Although the fact that not all kinds of parallelism are handled may seem unappealing, many NASA applications in computational aerosciences as well as earth and space sciences are amenable to domain decomposition. 3. CAPTOOL generates code for a large variety of environments employed across NASA centers: MPI/PVM on network of workstations to the IBS/SP2 and CRAY/T3D.
Analysis of a parallelized nonlinear elliptic boundary value problem solver with application to reacting flows

NASA Technical Reports Server (NTRS)

Keyes, David E.; Smooke, Mitchell D.

1987-01-01

A parallelized finite difference code based on the Newton method for systems of nonlinear elliptic boundary value problems in two dimensions is analyzed in terms of computational complexity and parallel efficiency. An approximate cost function depending on 15 dimensionless parameters is derived for algorithms based on stripwise and boxwise decompositions of the domain and a one-to-one assignment of the strip or box subdomains to processors. The sensitivity of the cost functions to the parameters is explored in regions of parameter space corresponding to model small-order systems with inexpensive function evaluations and also a coupled system of nineteen equations with very expensive function evaluations. The algorithm was implemented on the Intel Hypercube, and some experimental results for the model problems with stripwise decompositions are presented and compared with the theory. In the context of computational combustion problems, multiprocessors of either message-passing or shared-memory type may be employed with stripwise decompositions to realize speedup of O(n), where n is mesh resolution in one direction, for reasonable n.
Customizing FP-growth algorithm to parallel mining with Charm++ library

NASA Astrophysics Data System (ADS)

Puścian, Marek

2017-08-01

This paper presents a frequent item mining algorithm that was customized to handle growing data repositories. The proposed solution applies Master Slave scheme to frequent pattern growth technique. Efficient utilization of available computation units is achieved by dynamic reallocation of tasks. Conditional frequent trees are assigned to parallel workers basing on their workload. Proposed enhancements have been successfully implemented using Charm++ library. This paper discusses results of the performance of parallelized FP-growth algorithm against different datasets. The approach has been illustrated with many experiments and measurements performed using multiprocessor and multithreaded computer.
Parallel Computing for Probabilistic Response Analysis of High Temperature Composites

NASA Technical Reports Server (NTRS)

Sues, R. H.; Lua, Y. J.; Smith, M. D.

1994-01-01

The objective of this Phase I research was to establish the required software and hardware strategies to achieve large scale parallelism in solving PCM problems. To meet this objective, several investigations were conducted. First, we identified the multiple levels of parallelism in PCM and the computational strategies to exploit these parallelisms. Next, several software and hardware efficiency investigations were conducted. These involved the use of three different parallel programming paradigms and solution of two example problems on both a shared-memory multiprocessor and a distributed-memory network of workstations.
Compiling for Application Specific Computational Acceleration in Reconfigurable Architectures Final Report CRADA No. TSB-2033-01

DOE Office of Scientific and Technical Information (OSTI.GOV)

De Supinski, B.; Caliga, D.

2017-09-28

The primary objective of this project was to develop memory optimization technology to efficiently deliver data to, and distribute data within, the SRC-6's Field Programmable Gate Array- ("FPGA") based Multi-Adaptive Processors (MAPs). The hardware/software approach was to explore efficient MAP configurations and generate the compiler technology to exploit those configurations. This memory accessing technology represents an important step towards making reconfigurable symmetric multi-processor (SMP) architectures that will be a costeffective solution for large-scale scientific computing.
A divide and conquer approach to the nonsymmetric eigenvalue problem

DOE Office of Scientific and Technical Information (OSTI.GOV)

Jessup, E.R.

1991-01-01

Serial computation combined with high communication costs on distributed-memory multiprocessors make parallel implementations of the QR method for the nonsymmetric eigenvalue problem inefficient. This paper introduces an alternative algorithm for the nonsymmetric tridiagonal eigenvalue problem based on rank two tearing and updating of the matrix. The parallelism of this divide and conquer approach stems from independent solution of the updating problems. 11 refs.
Research on computer systems benchmarking

NASA Technical Reports Server (NTRS)

Smith, Alan Jay (Principal Investigator)

1996-01-01

This grant addresses the topic of research on computer systems benchmarking and is more generally concerned with performance issues in computer systems. This report reviews work in those areas during the period of NASA support under this grant. The bulk of the work performed concerned benchmarking and analysis of CPUs, compilers, caches, and benchmark programs. The first part of this work concerned the issue of benchmark performance prediction. A new approach to benchmarking and machine characterization was reported, using a machine characterizer that measures the performance of a given system in terms of a Fortran abstract machine. Another report focused on analyzing compiler performance. The performance impact of optimization in the context of our methodology for CPU performance characterization was based on the abstract machine model. Benchmark programs are analyzed in another paper. A machine-independent model of program execution was developed to characterize both machine performance and program execution. By merging these machine and program characterizations, execution time can be estimated for arbitrary machine/program combinations. The work was continued into the domain of parallel and vector machines, including the issue of caches in vector processors and multiprocessors. All of the afore-mentioned accomplishments are more specifically summarized in this report, as well as those smaller in magnitude supported by this grant.
Custom Sky-Image Mosaics from NASA's Information Power Grid

NASA Technical Reports Server (NTRS)

Jacob, Joseph; Collier, James; Craymer, Loring; Curkendall, David

2005-01-01

yourSkyG is the second generation of the software described in yourSky: Custom Sky-Image Mosaics via the Internet (NPO-30556), NASA Tech Briefs, Vol. 27, No. 6 (June 2003), page 45. Like its predecessor, yourSkyG supplies custom astronomical image mosaics of sky regions specified by requesters using client computers connected to the Internet. Whereas yourSky constructs mosaics on a local multiprocessor system, yourSkyG performs the computations on NASA s Information Power Grid (IPG), which is capable of performing much larger mosaicking tasks. (The IPG is high-performance computation and data grid that integrates geographically distributed 18 NASA Tech Briefs, September 2005 computers, databases, and instruments.) A user of yourSkyG can specify parameters describing a mosaic to be constructed. yourSkyG then constructs the mosaic on the IPG and makes it available for downloading by the user. The complexities of determining which input images are required to construct a mosaic, retrieving the required input images from remote sky-survey archives, uploading the images to the computers on the IPG, performing the computations remotely on the Grid, and downloading the resulting mosaic from the Grid are all transparent to the user
Synchronizing Data-Bus Messages

NASA Technical Reports Server (NTRS)

Harris, L. H.

1985-01-01

Adapter allows communications among as many as 30 data processors without central bus controller. Adapter improves reliability of multiprocessor system by eliminating point of failure that causes entire system to fail. Scheme prevents data collisions and eliminates nonessential polling, thereby reducing power consumption.

Evaluation of Methods for Multidisciplinary Design Optimization (MDO). Part 2

NASA Technical Reports Server (NTRS)

Kodiyalam, Srinivas; Yuan, Charles; Sobieski, Jaroslaw (Technical Monitor)

2000-01-01

A new MDO method, BLISS, and two different variants of the method, BLISS/RS and BLISS/S, have been implemented using iSIGHT's scripting language and evaluated in this report on multidisciplinary problems. All of these methods are based on decomposing a modular system optimization system into several subtasks optimization, that may be executed concurrently, and the system optimization that coordinates the subtasks optimization. The BLISS method and its variants are well suited for exploiting the concurrent processing capabilities in a multiprocessor machine. Several steps, including the local sensitivity analysis, local optimization, response surfaces construction and updates are all ideally suited for concurrent processing. Needless to mention, such algorithms that can effectively exploit the concurrent processing capabilities of the compute servers will be a key requirement for solving large-scale industrial design problems, such as the automotive vehicle problem detailed in Section 3.4.
A real time, FEM based optimal control algorithm and its implementation using parallel processing hardware (transistors) in a microprocessor environment

NASA Technical Reports Server (NTRS)

Patten, William Neff

1989-01-01

There is an evident need to discover a means of establishing reliable, implementable controls for systems that are plagued by nonlinear and, or uncertain, model dynamics. The development of a generic controller design tool for tough-to-control systems is reported. The method utilizes a moving grid, time infinite element based solution of the necessary conditions that describe an optimal controller for a system. The technique produces a discrete feedback controller. Real time laboratory experiments are now being conducted to demonstrate the viability of the method. The algorithm that results is being implemented in a microprocessor environment. Critical computational tasks are accomplished using a low cost, on-board, multiprocessor (INMOS T800 Transputers) and parallel processing. Progress to date validates the methodology presented. Applications of the technique to the control of highly flexible robotic appendages are suggested.
A Performance Prediction Model for a Fault-Tolerant Computer During Recovery and Restoration. Ph.D. Thesis Report, 1 Jan. - 31 Dec. 1992

NASA Technical Reports Server (NTRS)

Stoughton, John W.; Obando, Rodrigo A.

1993-01-01

The modeling and design of a fault-tolerant multiprocessor system is addressed. In particular, the behavior of the system during recovery and restoration after a fault has occurred is investigated. Given that a multicomputer system is designed using the Algorithm to Architecture to Mapping Model (ATAMM), and that a fault (death of a computing resource) occurs during its normal steady-state operation, a model is presented as a viable research tool for predicting the performance bounds of the system during its recovery and restoration phases. Furthermore, the bounds of the performance behavior of the system during this transient mode can be assessed. These bounds include: time to recover from the fault (t(sub rec)), time to restore the system (t(sub rec)) and whether there is a permanent delay in the system's Time Between Input and Output (TBIO) after the system has reached a steady state. An implementation of an ATAMM based computer was developed with the Generic VHSIC Spaceborne Computer (GVSC) as the target system. A simulation of the GVSC was also written based on the code used in ATAMM Multicomputer Operating System (AMOS). The simulation is in turn used to validate the new model in the usefulness and accuracy in tracking the propagation of the delay through the system and predicting the behavior in the transient state of recovery and restoration. The model is validated as an accurate method to predict the transient behavior of an ATAMM based multicomputer during recovery and restoration.
An autonomous rendezvous and docking system using cruise missile technologies

NASA Technical Reports Server (NTRS)

Jones, Ruel Edwin

1991-01-01

In November 1990 the Autonomous Rendezvous & Docking (AR&D) system was first demonstrated for members of NASA's Strategic Avionics Technology Working Group. This simulation utilized prototype hardware from the Cruise Missile and Advanced Centaur Avionics systems. The object was to show that all the accuracy, reliability and operational requirements established for a space craft to dock with Space Station Freedom could be met by the proposed system. The rapid prototyping capabilities of the Advanced Avionics Systems Development Laboratory were used to evaluate the proposed system in a real time, hardware in the loop simulation of the rendezvous and docking reference mission. The simulation permits manual, supervised automatic and fully autonomous operations to be evaluated. It is also being upgraded to be able to test an Autonomous Approach and Landing (AA&L) system. The AA&L and AR&D systems are very similar. Both use inertial guidance and control systems supplemented by GPS. Both use an Image Processing System (IPS), for target recognition and tracking. The IPS includes a general purpose multiprocessor computer and a selected suite of sensors that will provide the required relative position and orientation data. Graphic displays can also be generated by the computer, providing the astronaut / operator with real-time guidance and navigation data with enhanced video or sensor imagery.
DIRAC universal pilots

NASA Astrophysics Data System (ADS)

Stagni, F.; McNab, A.; Luzzi, C.; Krzemien, W.; Consortium, DIRAC

2017-10-01

In the last few years, new types of computing models, such as IAAS (Infrastructure as a Service) and IAAC (Infrastructure as a Client), gained popularity. New resources may come as part of pledged resources, while others are in the form of opportunistic ones. Most but not all of these new infrastructures are based on virtualization techniques. In addition, some of them, present opportunities for multi-processor computing slots to the users. Virtual Organizations are therefore facing heterogeneity of the available resources and the use of an Interware software like DIRAC to provide the transparent, uniform interface has become essential. The transparent access to the underlying resources is realized by implementing the pilot model. DIRAC’s newest generation of generic pilots (the so-called Pilots 2.0) are the “pilots for all the skies”, and have been successfully released in production more than a year ago. They use a plugin mechanism that makes them easily adaptable. Pilots 2.0 have been used for fetching and running jobs on every type of resource, being it a Worker Node (WN) behind a CREAM/ARC/HTCondor/DIRAC Computing element, a Virtual Machine running on IaaC infrastructures like Vac or BOINC, on IaaS cloud resources managed by Vcycle, the LHCb High Level Trigger farm nodes, and any type of opportunistic computing resource. Make a machine a “Pilot Machine”, and all diversities between them will disappear. This contribution describes how pilots are made suitable for different resources, and the recent steps taken towards a fully unified framework, including monitoring. Also, the cases of multi-processor computing slots either on real or virtual machines, with the whole node or a partition of it, is discussed.
Exploiting parallel computing with limited program changes using a network of microcomputers

NASA Technical Reports Server (NTRS)

Rogers, J. L., Jr.; Sobieszczanski-Sobieski, J.

1985-01-01

Network computing and multiprocessor computers are two discernible trends in parallel processing. The computational behavior of an iterative distributed process in which some subtasks are completed later than others because of an imbalance in computational requirements is of significant interest. The effects of asynchronus processing was studied. A small existing program was converted to perform finite element analysis by distributing substructure analysis over a network of four Apple IIe microcomputers connected to a shared disk, simulating a parallel computer. The substructure analysis uses an iterative, fully stressed, structural resizing procedure. A framework of beams divided into three substructures is used as the finite element model. The effects of asynchronous processing on the convergence of the design variables are determined by not resizing particular substructures on various iterations.
Performance and economy of a fault-tolerant multiprocessor

NASA Technical Reports Server (NTRS)

Lala, J. H.; Smith, C. J.

1979-01-01

The FTMP (Fault-Tolerant Multiprocessor) is one of two central aircraft fault-tolerant architectures now in the prototype phase under NASA sponsorship. The intended application of the computer includes such critical real-time tasks as 'fly-by-wire' active control and completely automatic Category III landings of commercial aircraft. The FTMP architecture is briefly described and it is shown that it is a viable solution to the multi-faceted problems of safety, speed, and cost. Three job dispatch strategies are described, and their results with respect to job-starting delay are presented. The first strategy is a simple First-Come-First-Serve (FCFS) job dispatch executive. The other two schedulers are an adaptive FCFS and an interrupt driven scheduler. Three failure modes are discussed, and the FTMP survival probability in the face of random hard failures is evaluated. It is noted that the hourly cost of operating two FTMPs in a transport aircraft can be as little as one-to-two percent of the total flight-hour cost of the aircraft.
Acceleration of linear stationary iterative processes in multiprocessor computers. II

DOE Office of Scientific and Technical Information (OSTI.GOV)

Romm, Ya.E.

1982-05-01

For pt.I, see Kibernetika, vol.18, no.1, p.47 (1982). For pt.I, see Cybernetics, vol.18, no.1, p.54 (1982). Considers a reduced system of linear algebraic equations x=ax+b, where a=(a/sub ij/) is a real n*n matrix; b is a real vector with common euclidean norm >>>. It is supposed that the existence and uniqueness of solution det (0-a) not equal to e is given, where e is a unit matrix. The linear iterative process converging to x x/sup (k+1)/=fx/sup (k)/, k=0, 1, 2, ..., where the operator f translates r/sup n/ into r/sup n/. In considering implementation of the iterative process (ip) inmore » a multiprocessor system, it is assumed that the number of processors is constant, and are various values of the latter investigated; it is assumed in addition, that the processors perform elementary binary arithmetic operations of addition and multiestimates only include the time of execution of arithmetic operations. With any paralleling of individual iteration, the execution time of the ip is proportional to the number of sequential steps k+1. The author sets the task of reducing the number of sequential steps in the ip so as to execute it in a time proportional to a value smaller than k+1. He also sets the goal of formulating a method of accelerated bit serial-parallel execution of each successive step of the ip, with, in the modification sought, a reduced number of steps in a time comparable to the operation time of logical elements. 6 references.« less
Fault recovery characteristics of the fault tolerant multi-processor

NASA Technical Reports Server (NTRS)

Padilla, Peter A.

1990-01-01

The fault handling performance of the fault tolerant multiprocessor (FTMP) was investigated. Fault handling errors detected during fault injection experiments were characterized. In these fault injection experiments, the FTMP disabled a working unit instead of the faulted unit once every 500 faults, on the average. System design weaknesses allow active faults to exercise a part of the fault management software that handles byzantine or lying faults. It is pointed out that these weak areas in the FTMP's design increase the probability that, for any hardware fault, a good LRU (line replaceable unit) is mistakenly disabled by the fault management software. It is concluded that fault injection can help detect and analyze the behavior of a system in the ultra-reliable regime. Although fault injection testing cannot be exhaustive, it has been demonstrated that it provides a unique capability to unmask problems and to characterize the behavior of a fault-tolerant system.
Multilevel Summation of Electrostatic Potentials Using Graphics Processing Units*

PubMed Central

Hardy, David J.; Stone, John E.; Schulten, Klaus

2009-01-01

Physical and engineering practicalities involved in microprocessor design have resulted in flat performance growth for traditional single-core microprocessors. The urgent need for continuing increases in the performance of scientific applications requires the use of many-core processors and accelerators such as graphics processing units (GPUs). This paper discusses GPU acceleration of the multilevel summation method for computing electrostatic potentials and forces for a system of charged atoms, which is a problem of paramount importance in biomolecular modeling applications. We present and test a new GPU algorithm for the long-range part of the potentials that computes a cutoff pair potential between lattice points, essentially convolving a fixed 3-D lattice of “weights” over all sub-cubes of a much larger lattice. The implementation exploits the different memory subsystems provided on the GPU to stream optimally sized data sets through the multiprocessors. We demonstrate for the full multilevel summation calculation speedups of up to 26 using a single GPU and 46 using multiple GPUs, enabling the computation of a high-resolution map of the electrostatic potential for a system of 1.5 million atoms in under 12 seconds. PMID:20161132
Algorithms and architecture for multiprocessor based circuit simulation

DOE Office of Scientific and Technical Information (OSTI.GOV)

Deutsch, J.T.

Accurate electrical simulation is critical to the design of high performance integrated circuits. Logic simulators can verify function and give first-order timing information. Switch level simulators are more effective at dealing with charge sharing than standard logic simulators, but cannot provide accurate timing information or discover DC problems. Delay estimation techniques and cell level simulation can be used in constrained design methods, but must be tuned for each application, and circuit simulation must still be used to generate the cell models. None of these methods has the guaranteed accuracy that many circuit designers desire, and none can provide detailed waveformmore » information. Detailed electrical-level simulation can predict circuit performance if devices and parasitics are modeled accurately. However, the computational requirements of conventional circuit simulators make it impractical to simulate current large circuits. In this dissertation, the implementation of Iterated Timing Analysis (ITA), a relaxation-based technique for accurate circuit simulation, on a special-purpose multiprocessor is presented. The ITA method is an SOR-Newton, relaxation-based method which uses event-driven analysis and selective trace to exploit the temporal sparsity of the electrical network. Because event-driven selective trace techniques are employed, this algorithm lends itself to implementation on a data-driven computer.« less
Efficient Approximation Algorithms for Weighted $b$-Matching

DOE Office of Scientific and Technical Information (OSTI.GOV)

Khan, Arif; Pothen, Alex; Mostofa Ali Patwary, Md.

2016-01-01

We describe a half-approximation algorithm, b-Suitor, for computing a b-Matching of maximum weight in a graph with weights on the edges. b-Matching is a generalization of the well-known Matching problem in graphs, where the objective is to choose a subset of M edges in the graph such that at most a specified number b(v) of edges in M are incident on each vertex v. Subject to this restriction we maximize the sum of the weights of the edges in M. We prove that the b-Suitor algorithm computes the same b-Matching as the one obtained by the greedy algorithm for themore » problem. We implement the algorithm on serial and shared-memory parallel processors, and compare its performance against a collection of approximation algorithms that have been proposed for the Matching problem. Our results show that the b-Suitor algorithm outperforms the Greedy and Locally Dominant edge algorithms by one to two orders of magnitude on a serial processor. The b-Suitor algorithm has a high degree of concurrency, and it scales well up to 240 threads on a shared memory multiprocessor. The b-Suitor algorithm outperforms the Locally Dominant edge algorithm by a factor of fourteen on 16 cores of an Intel Xeon multiprocessor.« less
Alpha: A real-time decentralized operating system for mission-oriented system integration and operation

NASA Technical Reports Server (NTRS)

Jensen, E. Douglas

1988-01-01

Alpha is a new kind of operating system that is unique in two highly significant ways. First, it is decentralized transparently providing reliable resource management across physically dispersed nodes, so that distributed applications programming can be done largely as though it were centralized. And second, it provides comprehensive, high technology support for real-time system integration and operation, an application area which consists predominately of aperiodic activities having critical time constraints such as deadlines. Alpha is extremely adaptable so that it can be easily optimized for a wide range of problem-specific functionality, performance, and cost. Alpha is the first systems effort of the Archons Project, and the prototype was created at Carnegie-Mellon University directly on modified Sun multiprocessor workstation hardware. It has been demonstrated with a real-time C(sup 2) application. Continuing research is leading to a series of enhanced follow-ons to Alpha; these are portable but initially hosted on Concurrent's MASSCOMP line of multiprocessor products.
A pipelined architecture for real time correction of non-uniformity in infrared focal plane arrays imaging system using multiprocessors

NASA Astrophysics Data System (ADS)

Zou, Liang; Fu, Zhuang; Zhao, YanZheng; Yang, JunYan

2010-07-01

This paper proposes a kind of pipelined electric circuit architecture implemented in FPGA, a very large scale integrated circuit (VLSI), which efficiently deals with the real time non-uniformity correction (NUC) algorithm for infrared focal plane arrays (IRFPA). Dual Nios II soft-core processors and a DSP with a 64+ core together constitute this image system. Each processor undertakes own systematic task, coordinating its work with each other's. The system on programmable chip (SOPC) in FPGA works steadily under the global clock frequency of 96Mhz. Adequate time allowance makes FPGA perform NUC image pre-processing algorithm with ease, which has offered favorable guarantee for the work of post image processing in DSP. And at the meantime, this paper presents a hardware (HW) and software (SW) co-design in FPGA. Thus, this systematic architecture yields an image processing system with multiprocessor, and a smart solution to the satisfaction with the performance of the system.
State recovery and lockstep execution restart in a system with multiprocessor pairing

DOE Office of Scientific and Technical Information (OSTI.GOV)

Gara, Alan; Gschwind, Michael K; Salapura, Valentina

System, method and computer program product for a multiprocessing system to offer selective pairing of processor cores for increased processing reliability. A selective pairing facility is provided that selectively connects, i.e., pairs, multiple microprocessor or processor cores to provide one highly reliable thread (or thread group). Each paired microprocessor or processor cores that provide one highly reliable thread for high-reliability connect with a system components such as a memory "nest" (or memory hierarchy), an optional system controller, and optional interrupt controller, optional I/O or peripheral devices, etc. The memory nest is attached to a selective pairing facility via a switchmore » or a bus. Each selectively paired processor core is includes a transactional execution facility, whereing the system is configured to enable processor rollback to a previous state and reinitialize lockstep execution in order to recover from an incorrect execution when an incorrect execution has been detected by the selective pairing facility.« less
Modeling and simulation

DOE Office of Scientific and Technical Information (OSTI.GOV)

Hanham, R.; Vogt, W.G.; Mickle, M.H.

1986-01-01

This book presents the papers given at a conference on computerized simulation. Topics considered at the conference included expert systems, modeling in electric power systems, power systems operating strategies, energy analysis, a linear programming approach to optimum load shedding in transmission systems, econometrics, simulation in natural gas engineering, solar energy studies, artificial intelligence, vision systems, hydrology, multiprocessors, and flow models.
FPGA-based distributed computing microarchitecture for complex physical dynamics investigation.

PubMed

Borgese, Gianluca; Pace, Calogero; Pantano, Pietro; Bilotta, Eleonora

2013-09-01

In this paper, we present a distributed computing system, called DCMARK, aimed at solving partial differential equations at the basis of many investigation fields, such as solid state physics, nuclear physics, and plasma physics. This distributed architecture is based on the cellular neural network paradigm, which allows us to divide the differential equation system solving into many parallel integration operations to be executed by a custom multiprocessor system. We push the number of processors to the limit of one processor for each equation. In order to test the present idea, we choose to implement DCMARK on a single FPGA, designing the single processor in order to minimize its hardware requirements and to obtain a large number of easily interconnected processors. This approach is particularly suited to study the properties of 1-, 2- and 3-D locally interconnected dynamical systems. In order to test the computing platform, we implement a 200 cells, Korteweg-de Vries (KdV) equation solver and perform a comparison between simulations conducted on a high performance PC and on our system. Since our distributed architecture takes a constant computing time to solve the equation system, independently of the number of dynamical elements (cells) of the CNN array, it allows us to reduce the elaboration time more than other similar systems in the literature. To ensure a high level of reconfigurability, we design a compact system on programmable chip managed by a softcore processor, which controls the fast data/control communication between our system and a PC Host. An intuitively graphical user interface allows us to change the calculation parameters and plot the results.
Parallel Implementation of Triangular Cellular Automata for Computing Two-Dimensional Elastodynamic Response on Arbitrary Domains

NASA Astrophysics Data System (ADS)

Leamy, Michael J.; Springer, Adam C.

In this research we report parallel implementation of a Cellular Automata-based simulation tool for computing elastodynamic response on complex, two-dimensional domains. Elastodynamic simulation using Cellular Automata (CA) has recently been presented as an alternative, inherently object-oriented technique for accurately and efficiently computing linear and nonlinear wave propagation in arbitrarily-shaped geometries. The local, autonomous nature of the method should lead to straight-forward and efficient parallelization. We address this notion on symmetric multiprocessor (SMP) hardware using a Java-based object-oriented CA code implementing triangular state machines (i.e., automata) and the MPI bindings written in Java (MPJ Express). We use MPJ Express to reconfigure our existing CA code to distribute a domain's automata to cores present on a dual quad-core shared-memory system (eight total processors). We note that this message passing parallelization strategy is directly applicable to computer clustered computing, which will be the focus of follow-on research. Results on the shared memory platform indicate nearly-ideal, linear speed-up. We conclude that the CA-based elastodynamic simulator is easily configured to run in parallel, and yields excellent speed-up on SMP hardware.
Computational design of RNA parts, devices, and transcripts with kinetic folding algorithms implemented on multiprocessor clusters.

PubMed

Thimmaiah, Tim; Voje, William E; Carothers, James M

2015-01-01

With progress toward inexpensive, large-scale DNA assembly, the demand for simulation tools that allow the rapid construction of synthetic biological devices with predictable behaviors continues to increase. By combining engineered transcript components, such as ribosome binding sites, transcriptional terminators, ligand-binding aptamers, catalytic ribozymes, and aptamer-controlled ribozymes (aptazymes), gene expression in bacteria can be fine-tuned, with many corollaries and applications in yeast and mammalian cells. The successful design of genetic constructs that implement these kinds of RNA-based control mechanisms requires modeling and analyzing kinetically determined co-transcriptional folding pathways. Transcript design methods using stochastic kinetic folding simulations to search spacer sequence libraries for motifs enabling the assembly of RNA component parts into static ribozyme- and dynamic aptazyme-regulated expression devices with quantitatively predictable functions (rREDs and aREDs, respectively) have been described (Carothers et al., Science 334:1716-1719, 2011). Here, we provide a detailed practical procedure for computational transcript design by illustrating a high throughput, multiprocessor approach for evaluating spacer sequences and generating functional rREDs. This chapter is written as a tutorial, complete with pseudo-code and step-by-step instructions for setting up a computational cluster with an Amazon, Inc. web server and performing the large numbers of kinefold-based stochastic kinetic co-transcriptional folding simulations needed to design functional rREDs and aREDs. The method described here should be broadly applicable for designing and analyzing a variety of synthetic RNA parts, devices and transcripts.
Software platform virtualization in chemistry research and university teaching

PubMed Central

2009-01-01

Background Modern chemistry laboratories operate with a wide range of software applications under different operating systems, such as Windows, LINUX or Mac OS X. Instead of installing software on different computers it is possible to install those applications on a single computer using Virtual Machine software. Software platform virtualization allows a single guest operating system to execute multiple other operating systems on the same computer. We apply and discuss the use of virtual machines in chemistry research and teaching laboratories. Results Virtual machines are commonly used for cheminformatics software development and testing. Benchmarking multiple chemistry software packages we have confirmed that the computational speed penalty for using virtual machines is low and around 5% to 10%. Software virtualization in a teaching environment allows faster deployment and easy use of commercial and open source software in hands-on computer teaching labs. Conclusion Software virtualization in chemistry, mass spectrometry and cheminformatics is needed for software testing and development of software for different operating systems. In order to obtain maximum performance the virtualization software should be multi-core enabled and allow the use of multiprocessor configurations in the virtual machine environment. Server consolidation, by running multiple tasks and operating systems on a single physical machine, can lead to lower maintenance and hardware costs especially in small research labs. The use of virtual machines can prevent software virus infections and security breaches when used as a sandbox system for internet access and software testing. Complex software setups can be created with virtual machines and are easily deployed later to multiple computers for hands-on teaching classes. We discuss the popularity of bioinformatics compared to cheminformatics as well as the missing cheminformatics education at universities worldwide. PMID:20150997

Software platform virtualization in chemistry research and university teaching.

PubMed

Kind, Tobias; Leamy, Tim; Leary, Julie A; Fiehn, Oliver

2009-11-16

Modern chemistry laboratories operate with a wide range of software applications under different operating systems, such as Windows, LINUX or Mac OS X. Instead of installing software on different computers it is possible to install those applications on a single computer using Virtual Machine software. Software platform virtualization allows a single guest operating system to execute multiple other operating systems on the same computer. We apply and discuss the use of virtual machines in chemistry research and teaching laboratories. Virtual machines are commonly used for cheminformatics software development and testing. Benchmarking multiple chemistry software packages we have confirmed that the computational speed penalty for using virtual machines is low and around 5% to 10%. Software virtualization in a teaching environment allows faster deployment and easy use of commercial and open source software in hands-on computer teaching labs. Software virtualization in chemistry, mass spectrometry and cheminformatics is needed for software testing and development of software for different operating systems. In order to obtain maximum performance the virtualization software should be multi-core enabled and allow the use of multiprocessor configurations in the virtual machine environment. Server consolidation, by running multiple tasks and operating systems on a single physical machine, can lead to lower maintenance and hardware costs especially in small research labs. The use of virtual machines can prevent software virus infections and security breaches when used as a sandbox system for internet access and software testing. Complex software setups can be created with virtual machines and are easily deployed later to multiple computers for hands-on teaching classes. We discuss the popularity of bioinformatics compared to cheminformatics as well as the missing cheminformatics education at universities worldwide.
Overview of the LINCS architecture

DOE Office of Scientific and Technical Information (OSTI.GOV)

Fletcher, J.G.; Watson, R.W.

1982-01-13

Computing at the Lawrence Livermore National Laboratory (LLNL) has evolved over the past 15 years with a computer network based resource sharing environment. The increasing use of low cost and high performance micro, mini and midi computers and commercially available local networking systems will accelerate this trend. Further, even the large scale computer systems, on which much of the LLNL scientific computing depends, are evolving into multiprocessor systems. It is our belief that the most cost effective use of this environment will depend on the development of application systems structured into cooperating concurrent program modules (processes) distributed appropriately over differentmore » nodes of the environment. A node is defined as one or more processors with a local (shared) high speed memory. Given the latter view, the environment can be characterized as consisting of: multiple nodes communicating over noisy channels with arbitrary delays and throughput, heterogenous base resources and information encodings, no single administration controlling all resources, distributed system state, and no uniform time base. The system design problem is - how to turn the heterogeneous base hardware/firmware/software resources of this environment into a coherent set of resources that facilitate development of cost effective, reliable, and human engineered applications. We believe the answer lies in developing a layered, communication oriented distributed system architecture; layered and modular to support ease of understanding, reconfiguration, extensibility, and hiding of implementation or nonessential local details; communication oriented because that is a central feature of the environment. The Livermore Interactive Network Communication System (LINCS) is a hierarchical architecture designed to meet the above needs. While having characteristics in common with other architectures, it differs in several respects.« less
ATAMM enhancement and multiprocessor performance evaluation

NASA Technical Reports Server (NTRS)

Stoughton, John W.; Mielke, Roland R.; Som, Sukhamoy; Obando, Rodrigo; Malekpour, Mahyar R.; Jones, Robert L., III; Mandala, Brij Mohan V.

1991-01-01

ATAMM (Algorithm To Architecture Mapping Model) enhancement and multiprocessor performance evaluation is discussed. The following topics are included: the ATAMM model; ATAMM enhancement; ADM (Advanced Development Model) implementation of ATAMM; and ATAMM support tools.
MPF: A portable message passing facility for shared memory multiprocessors

NASA Technical Reports Server (NTRS)

Malony, Allen D.; Reed, Daniel A.; Mcguire, Patrick J.

1987-01-01

The design, implementation, and performance evaluation of a message passing facility (MPF) for shared memory multiprocessors are presented. The MPF is based on a message passing model conceptually similar to conversations. Participants (parallel processors) can enter or leave a conversation at any time. The message passing primitives for this model are implemented as a portable library of C function calls. The MPF is currently operational on a Sequent Balance 21000, and several parallel applications were developed and tested. Several simple benchmark programs are presented to establish interprocess communication performance for common patterns of interprocess communication. Finally, performance figures are presented for two parallel applications, linear systems solution, and iterative solution of partial differential equations.
Distributed Virtual System (DIVIRS) Project

NASA Technical Reports Server (NTRS)

Schorr, Herbert; Neuman, B. Clifford

1993-01-01

As outlined in our continuation proposal 92-ISI-50R (revised) on contract NCC 2-539, we are (1) developing software, including a system manager and a job manager, that will manage available resources and that will enable programmers to program parallel applications in terms of a virtual configuration of processors, hiding the mapping to physical nodes; (2) developing communications routines that support the abstractions implemented in item one; (3) continuing the development of file and information systems based on the virtual system model; and (4) incorporating appropriate security measures to allow the mechanisms developed in items 1 through 3 to be used on an open network. The goal throughout our work is to provide a uniform model that can be applied to both parallel and distributed systems. We believe that multiprocessor systems should exist in the context of distributed systems, allowing them to be more easily shared by those that need them. Our work provides the mechanisms through which nodes on multiprocessors are allocated to jobs running within the distributed system and the mechanisms through which files needed by those jobs can be located and accessed.
DIstributed VIRtual System (DIVIRS) project

NASA Technical Reports Server (NTRS)

Schorr, Herbert; Neuman, B. Clifford

1994-01-01

As outlined in our continuation proposal 92-ISI-. OR (revised) on NASA cooperative agreement NCC2-539, we are (1) developing software, including a system manager and a job manager, that will manage available resources and that will enable programmers to develop and execute parallel applications in terms of a virtual configuration of processors, hiding the mapping to physical nodes; (2) developing communications routines that support the abstractions implemented in item one; (3) continuing the development of file and information systems based on the Virtual System Model; and (4) incorporating appropriate security measures to allow the mechanisms developed in items 1 through 3 to be used on an open network. The goal throughout our work is to provide a uniform model that can be applied to both parallel and distributed systems. We believe that multiprocessor systems should exist in the context of distributed systems, allowing them to be more easily shared by those that need them. Our work provides the mechanisms through which nodes on multiprocessors are allocated to jobs running within the distributed system and the mechanisms through which files needed by those jobs can be located and accessed.
DIstributed VIRtual System (DIVIRS) project

NASA Technical Reports Server (NTRS)

Schorr, Herbert; Neuman, Clifford B.

1995-01-01

As outlined in our continuation proposal 92-ISI-50R (revised) on NASA cooperative agreement NCC2-539, we are (1) developing software, including a system manager and a job manager, that will manage available resources and that will enable programmers to develop and execute parallel applications in terms of a virtual configuration of processors, hiding the mapping to physical nodes; (2) developing communications routines that support the abstractions implemented in item one; (3) continuing the development of file and information systems based on the Virtual System Model; and (4) incorporating appropriate security measures to allow the mechanisms developed in items 1 through 3 to be used on an open network. The goal throughout our work is to provide a uniform model that can be applied to both parallel and distributed systems. We believe that multiprocessor systems should exist in the context of distributed systems, allowing them to be more easily shared by those that need them. Our work provides the mechanisms through which nodes on multiprocessors are allocated to jobs running within the distributed system and the mechanisms through which files needed by those jobs can be located and accessed.
Distributed Virtual System (DIVIRS) project

NASA Technical Reports Server (NTRS)

Schorr, Herbert; Neuman, B. Clifford

1993-01-01

As outlined in the continuation proposal 92-ISI-50R (revised) on NASA cooperative agreement NCC 2-539, the investigators are developing software, including a system manager and a job manager, that will manage available resources and that will enable programmers to develop and execute parallel applications in terms of a virtual configuration of processors, hiding the mapping to physical nodes; developing communications routines that support the abstractions implemented; continuing the development of file and information systems based on the Virtual System Model; and incorporating appropriate security measures to allow the mechanisms developed to be used on an open network. The goal throughout the work is to provide a uniform model that can be applied to both parallel and distributed systems. The authors believe that multiprocessor systems should exist in the context of distributed systems, allowing them to be more easily shared by those that need them. The work provides the mechanisms through which nodes on multiprocessors are allocated to jobs running within the distributed system and the mechanisms through which files needed by those jobs can be located and accessed.
Bi-Level Integrated System Synthesis (BLISS) for Concurrent and Distributed Processing

NASA Technical Reports Server (NTRS)

Sobieszczanski-Sobieski, Jaroslaw; Altus, Troy D.; Phillips, Matthew; Sandusky, Robert

2002-01-01

The paper introduces a new version of the Bi-Level Integrated System Synthesis (BLISS) methods intended for optimization of engineering systems conducted by distributed specialty groups working concurrently and using a multiprocessor computing environment. The method decomposes the overall optimization task into subtasks associated with disciplines or subsystems where the local design variables are numerous and a single, system-level optimization whose design variables are relatively few. The subtasks are fully autonomous as to their inner operations and decision making. Their purpose is to eliminate the local design variables and generate a wide spectrum of feasible designs whose behavior is represented by Response Surfaces to be accessed by a system-level optimization. It is shown that, if the problem is convex, the solution of the decomposed problem is the same as that obtained without decomposition. A simplified example of an aircraft design shows the method working as intended. The paper includes a discussion of the method merits and demerits and recommendations for further research.
Information management system breadboard data acquisition and control system.

NASA Technical Reports Server (NTRS)

Mallary, W. E.

1972-01-01

Description of a breadboard configuration of an advanced information management system based on requirements for high data rates and local and centralized computation for subsystems and experiments to be housed on a space station. The system is to contain a 10-megabit-per-second digital data bus, remote terminals with preprocessor capabilities, and a central multiprocessor. A concept definition is presented for the data acquisition and control system breadboard, and a detailed account is given of the operation of the bus control unit, the bus itself, and the remote acquisition and control unit. The data bus control unit is capable of operating under control of both its own test panel and the test processor. In either mode it is capable of both single- and multiple-message operation in that it can accept a block of data requests or update commands for transmission to the remote acquisition and control unit, which in turn is capable of three levels of data-handling complexity.
Ordered fast Fourier transforms on a massively parallel hypercube multiprocessor

NASA Technical Reports Server (NTRS)

Tong, Charles; Swarztrauber, Paul N.

1991-01-01

The present evaluation of alternative, massively parallel hypercube processor-applicable designs for ordered radix-2 decimation-in-frequency FFT algorithms gives attention to the reduction of computation time-dominating communication. A combination of the order and computational phases of the FFT is accordingly employed, in conjunction with sequence-to-processor maps which reduce communication. Two orderings, 'standard' and 'cyclic', in which the order of the transform is the same as that of the input sequence, can be implemented with ease on the Connection Machine (where orderings are determined by geometries and priorities. A parallel method for trigonometric coefficient computation is presented which does not employ trigonometric functions or interprocessor communication.
Parallel block schemes for large scale least squares computations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Golub, G.H.; Plemmons, R.J.; Sameh, A.

1986-04-01

Large scale least squares computations arise in a variety of scientific and engineering problems, including geodetic adjustments and surveys, medical image analysis, molecular structures, partial differential equations and substructuring methods in structural engineering. In each of these problems, matrices often arise which possess a block structure which reflects the local connection nature of the underlying physical problem. For example, such super-large nonlinear least squares computations arise in geodesy. Here the coordinates of positions are calculated by iteratively solving overdetermined systems of nonlinear equations by the Gauss-Newton method. The US National Geodetic Survey will complete this year (1986) the readjustment ofmore » the North American Datum, a problem which involves over 540 thousand unknowns and over 6.5 million observations (equations). The observation matrix for these least squares computations has a block angular form with 161 diagnonal blocks, each containing 3 to 4 thousand unknowns. In this paper parallel schemes are suggested for the orthogonal factorization of matrices in block angular form and for the associated backsubstitution phase of the least squares computations. In addition, a parallel scheme for the calculation of certain elements of the covariance matrix for such problems is described. It is shown that these algorithms are ideally suited for multiprocessors with three levels of parallelism such as the Cedar system at the University of Illinois. 20 refs., 7 figs.« less
Interface Message Processors for the ARPA Computer Network

DTIC Science & Technology

1976-07-01

and then clear the location) as its primitive locking facility (i.e., as the necessary multiprocessor lock equivalent to Dijkstra semaphores )[37]. To...of the extra storage required for the redundant copies. There is the problem of maintaining synchronization of multiple copy data bases in the presence...through any of the data base sites. I Update synchronization . Races between conflicting, "concurrent" update requests are resolved in a manner that j
A parallel algorithm for generation and assembly of finite element stiffness and mass matrices

NASA Technical Reports Server (NTRS)

Storaasli, O. O.; Carmona, E. A.; Nguyen, D. T.; Baddourah, M. A.

1991-01-01

A new algorithm is proposed for parallel generation and assembly of the finite element stiffness and mass matrices. The proposed assembly algorithm is based on a node-by-node approach rather than the more conventional element-by-element approach. The new algorithm's generality and computation speed-up when using multiple processors are demonstrated for several practical applications on multi-processor Cray Y-MP and Cray 2 supercomputers.
The Design and Implementation of a Data Flow Multiprocessor.

DTIC Science & Technology

1981-12-01

to thank Captain Charles Papp who taught me how to use the logic analyzer and the storage oscilloscope. Without these tools, I could never have...debugged and repaired the microprocessors. Finally, I wish to thank my thesis readers, Major Charles Lillie and Major Walt Seward, for taking valuable time...Neumann/ Babbage architecture with the a data flow architecture. The next section describes the benefits of data flow computing. The following section
Hybrid Memory Management for Parallel Execution of Prolog on Shared Memory Multiprocessors

DTIC Science & Technology

1990-06-01

organizing data to increase locality. The stack structure exhibits greater locality than the heap structure. Tradeoff decisions can also be made on...PROGRAM ELEMENT NUMBER 6. AUTHOR(S) 5d. PROJECT NUMBER 5e. TASK NUMBER 5f. WORK UNIT NUMBER 7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES...University of California at Berkeley,Department of Electrical Engineering and Computer Sciences,Berkeley,CA,94720 8. PERFORMING ORGANIZATION REPORT
Performance Comparison of Mainframe, Workstations, Clusters, and Desktop Computers

NASA Technical Reports Server (NTRS)

Farley, Douglas L.

2005-01-01

A performance evaluation of a variety of computers frequently found in a scientific or engineering research environment was conducted using a synthetic and application program benchmarks. From a performance perspective, emerging commodity processors have superior performance relative to legacy mainframe computers. In many cases, the PC clusters exhibited comparable performance with traditional mainframe hardware when 8-12 processors were used. The main advantage of the PC clusters was related to their cost. Regardless of whether the clusters were built from new computers or whether they were created from retired computers their performance to cost ratio was superior to the legacy mainframe computers. Finally, the typical annual maintenance cost of legacy mainframe computers is several times the cost of new equipment such as multiprocessor PC workstations. The savings from eliminating the annual maintenance fee on legacy hardware can result in a yearly increase in total computational capability for an organization.
Performance and evaluation of real-time multicomputer control systems

NASA Technical Reports Server (NTRS)

Shin, K. G.

1985-01-01

Three experiments on fault tolerant multiprocessors (FTMP) were begun. They are: (1) measurement of fault latency in FTMP; (2) validation and analysis of FTMP synchronization protocols; and investigation of error propagation in FTMP.
Numerical methods on some structured matrix algebra problems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Jessup, E.R.

1996-06-01

This proposal concerned the design, analysis, and implementation of serial and parallel algorithms for certain structured matrix algebra problems. It emphasized large order problems and so focused on methods that can be implemented efficiently on distributed-memory MIMD multiprocessors. Such machines supply the computing power and extensive memory demanded by the large order problems. We proposed to examine three classes of matrix algebra problems: the symmetric and nonsymmetric eigenvalue problems (especially the tridiagonal cases) and the solution of linear systems with specially structured coefficient matrices. As all of these are of practical interest, a major goal of this work was tomore » translate our research in linear algebra into useful tools for use by the computational scientists interested in these and related applications. Thus, in addition to software specific to the linear algebra problems, we proposed to produce a programming paradigm and library to aid in the design and implementation of programs for distributed-memory MIMD computers. We now report on our progress on each of the problems and on the programming tools.« less
NETRA: A parallel architecture for integrated vision systems. 1: Architecture and organization

NASA Technical Reports Server (NTRS)

Choudhary, Alok N.; Patel, Janak H.; Ahuja, Narendra

1989-01-01

Computer vision is regarded as one of the most complex and computationally intensive problems. An integrated vision system (IVS) is considered to be a system that uses vision algorithms from all levels of processing for a high level application (such as object recognition). A model of computation is presented for parallel processing for an IVS. Using the model, desired features and capabilities of a parallel architecture suitable for IVSs are derived. Then a multiprocessor architecture (called NETRA) is presented. This architecture is highly flexible without the use of complex interconnection schemes. The topology of NETRA is recursively defined and hence is easily scalable from small to large systems. Homogeneity of NETRA permits fault tolerance and graceful degradation under faults. It is a recursively defined tree-type hierarchical architecture where each of the leaf nodes consists of a cluster of processors connected with a programmable crossbar with selective broadcast capability to provide for desired flexibility. A qualitative evaluation of NETRA is presented. Then general schemes are described to map parallel algorithms onto NETRA. Algorithms are classified according to their communication requirements for parallel processing. An extensive analysis of inter-cluster communication strategies in NETRA is presented, and parameters affecting performance of parallel algorithms when mapped on NETRA are discussed. Finally, a methodology to evaluate performance of algorithms on NETRA is described.

Cache coherency without line exclusivity in MP systems having store-in caches

DOE Office of Scientific and Technical Information (OSTI.GOV)

Pomerene, J.H.; Puzak, T.R.; Rechtschaffen, R.N.

1983-11-01

By modifying the function of the storage control unit, a multiprocessor (MP) system having store-in caches is enabled to operate with the same versatility as an MP system having store-through caches, thereby eliminating the requirement for line exclusivity and greatly reducing the occurrence of cross-interrogates.
Smart-Pixel Array Processors Based on Optimal Cellular Neural Networks for Space Sensor Applications

NASA Technical Reports Server (NTRS)

Fang, Wai-Chi; Sheu, Bing J.; Venus, Holger; Sandau, Rainer

1997-01-01

A smart-pixel cellular neural network (CNN) with hardware annealing capability, digitally programmable synaptic weights, and multisensor parallel interface has been under development for advanced space sensor applications. The smart-pixel CNN architecture is a programmable multi-dimensional array of optoelectronic neurons which are locally connected with their local neurons and associated active-pixel sensors. Integration of the neuroprocessor in each processor node of a scalable multiprocessor system offers orders-of-magnitude computing performance enhancements for on-board real-time intelligent multisensor processing and control tasks of advanced small satellites. The smart-pixel CNN operation theory, architecture, design and implementation, and system applications are investigated in detail. The VLSI (Very Large Scale Integration) implementation feasibility was illustrated by a prototype smart-pixel 5x5 neuroprocessor array chip of active dimensions 1380 micron x 746 micron in a 2-micron CMOS technology.
Design and Implementation of a Threaded Search Engine for Tour Recommendation Systems

NASA Astrophysics Data System (ADS)

Lee, Junghoon; Park, Gyung-Leen; Ko, Jin-Hee; Shin, In-Hye; Kang, Mikyung

This paper implements a threaded scan engine for the O(n!) search space and measures its performance, aiming at providing a responsive tour recommendation and scheduling service. As a preliminary step of integrating POI ontology, mobile object database, and personalization profile for the development of new vehicular telematics services, this implementation can give a useful guideline to design a challenging and computation-intensive vehicular telematics service. The implemented engine allocates the subtree to the respective threads and makes them run concurrently exploiting the primitives provided by the operating system and the underlying multiprocessor architecture. It also makes it easy to add a variety of constraints, for example, the search tree is pruned if the cost of partial allocation already exceeds the current best. The performance measurement result shows that the service can run even in the low-power telematics device when the number of destinations does not exceed 15, with an appropriate constraint processing.
Implementation of MPEG-2 encoder to multiprocessor system using multiple MVPs (TMS320C80)

NASA Astrophysics Data System (ADS)

Kim, HyungSun; Boo, Kenny; Chung, SeokWoo; Choi, Geon Y.; Lee, YongJin; Jeon, JaeHo; Park, Hyun Wook

1997-05-01

This paper presents the efficient algorithm mapping for the real-time MPEG-2 encoding on the KAIST image computing system (KICS), which has a parallel architecture using five multimedia video processors (MVPs). The MVP is a general purpose digital signal processor (DSP) of Texas Instrument. It combines one floating-point processor and four fixed- point DSPs on a single chip. The KICS uses the MVP as a primary processing element (PE). Two PEs form a cluster, and there are two processing clusters in the KICS. Real-time MPEG-2 encoder is implemented through the spatial and the functional partitioning strategies. Encoding process of spatially partitioned half of the video input frame is assigned to ne processing cluster. Two PEs perform the functionally partitioned MPEG-2 encoding tasks in the pipelined operation mode. One PE of a cluster carries out the transform coding part and the other performs the predictive coding part of the MPEG-2 encoding algorithm. One MVP among five MVPs is used for system control and interface with host computer. This paper introduces an implementation of the MPEG-2 algorithm with a parallel processing architecture.
Parallel algorithms for quantum chemistry. I. Integral transformations on a hypercube multiprocessor

DOE Office of Scientific and Technical Information (OSTI.GOV)

Whiteside, R.A.; Binkley, J.S.; Colvin, M.E.

1987-02-15

For many years it has been recognized that fundamental physical constraints such as the speed of light will limit the ultimate speed of single processor computers to less than about three billion floating point operations per second (3 GFLOPS). This limitation is becoming increasingly restrictive as commercially available machines are now within an order of magnitude of this asymptotic limit. A natural way to avoid this limit is to harness together many processors to work on a single computational problem. In principle, these parallel processing computers have speeds limited only by the number of processors one chooses to acquire. Themore » usefulness of potentially unlimited processing speed to a computationally intensive field such as quantum chemistry is obvious. If these methods are to be applied to significantly larger chemical systems, parallel schemes will have to be employed. For this reason we have developed distributed-memory algorithms for a number of standard quantum chemical methods. We are currently implementing these on a 32 processor Intel hypercube. In this paper we present our algorithm and benchmark results for one of the bottleneck steps in quantum chemical calculations: the four index integral transformation.« less
ARTS III/Parallel Processor Design Study

DOT National Transportation Integrated Search

1975-04-01

It was the purpose of this design study to investigate the feasibility, suitability, and cost-effectiveness of augmenting the ARTS III failsafe/failsoft multiprocessor system with a form of parallel processor to accomodate a large growth in air traff...
Evict on write, a management strategy for a prefetch unit and/or first level cache in a multiprocessor system with speculative execution

DOEpatents

Gara, Alan; Ohmacht, Martin

2014-09-16

In a multiprocessor system with at least two levels of cache, a speculative thread may run on a core processor in parallel with other threads. When the thread seeks to do a write to main memory, this access is to be written through the first level cache to the second level cache. After the write though, the corresponding line is deleted from the first level cache and/or prefetch unit, so that any further accesses to the same location in main memory have to be retrieved from the second level cache. The second level cache keeps track of multiple versions of data, where more than one speculative thread is running in parallel, while the first level cache does not have any of the versions during speculation. A switch allows choosing between modes of operation of a speculation blind first level cache.
Platform-Independence and Scheduling In a Multi-Threaded Real-Time Simulation

NASA Technical Reports Server (NTRS)

Sugden, Paul P.; Rau, Melissa A.; Kenney, P. Sean

2001-01-01

Aviation research often relies on real-time, pilot-in-the-loop flight simulation as a means to develop new flight software, flight hardware, or pilot procedures. Often these simulations become so complex that a single processor is incapable of performing the necessary computations within a fixed time-step. Threads are an elegant means to distribute the computational work-load when running on a symmetric multi-processor machine. However, programming with threads often requires operating system specific calls that reduce code portability and maintainability. While a multi-threaded simulation allows a significant increase in the simulation complexity, it also increases the workload of a simulation operator by requiring that the operator determine which models run on which thread. To address these concerns an object-oriented design was implemented in the NASA Langley Standard Real-Time Simulation in C++ (LaSRS++) application framework. The design provides a portable and maintainable means to use threads and also provides a mechanism to automatically load balance the simulation models.
Proceedings of the international conference on cybernetics and societ

DOE Office of Scientific and Technical Information (OSTI.GOV)

Not Available

1985-01-01

This book presents the papers given at a conference on artificial intelligence, expert systems and knowledge bases. Topics considered at the conference included automating expert system development, modeling expert systems, causal maps, data covariances, robot vision, image processing, multiprocessors, parallel processing, VLSI structures, man-machine systems, human factors engineering, cognitive decision analysis, natural language, computerized control systems, and cybernetics.
Scheduling for energy and reliability management on multiprocessor real-time systems

NASA Astrophysics Data System (ADS)

Qi, Xuan

Scheduling algorithms for multiprocessor real-time systems have been studied for years with many well-recognized algorithms proposed. However, it is still an evolving research area and many problems remain open due to their intrinsic complexities. With the emergence of multicore processors, it is necessary to re-investigate the scheduling problems and design/develop efficient algorithms for better system utilization, low scheduling overhead, high energy efficiency, and better system reliability. Focusing cluster schedulings with optimal global schedulers, we study the utilization bound and scheduling overhead for a class of cluster-optimal schedulers. Then, taking energy/power consumption into consideration, we developed energy-efficient scheduling algorithms for real-time systems, especially for the proliferating embedded systems with limited energy budget. As the commonly deployed energy-saving technique (e.g. dynamic voltage frequency scaling (DVFS)) will significantly affect system reliability, we study schedulers that have intelligent mechanisms to recuperate system reliability to satisfy the quality assurance requirements. Extensive simulation is conducted to evaluate the performance of the proposed algorithms on reduction of scheduling overhead, energy saving, and reliability improvement. The simulation results show that the proposed reliability-aware power management schemes could preserve the system reliability while still achieving substantial energy saving.
Sparse Gaussian elimination with controlled fill-in on a shared memory multiprocessor

NASA Technical Reports Server (NTRS)

Alaghband, Gita; Jordan, Harry F.

1989-01-01

It is shown that in sparse matrices arising from electronic circuits, it is possible to do computations on many diagonal elements simultaneously. A technique for obtaining an ordered compatible set directly from the ordered incompatible table is given. The ordering is based on the Markowitz number of the pivot candidates. This technique generates a set of compatible pivots with the property of generating few fills. A novel heuristic algorithm is presented that combines the idea of an order-compatible set with a limited binary tree search to generate several sets of compatible pivots in linear time. An elimination set for reducing the matrix is generated and selected on the basis of a minimum Markowitz sum number. The parallel pivoting technique presented is a stepwise algorithm and can be applied to any submatrix of the original matrix. Thus, it is not a preordering of the sparse matrix and is applied dynamically as the decomposition proceeds. Parameters are suggested to obtain a balance between parallelism and fill-ins. Results of applying the proposed algorithms on several large application matrices using the HEP multiprocessor (Kowalik, 1985) are presented and analyzed.
Comparing the Performance of Two Dynamic Load Distribution Methods

NASA Technical Reports Server (NTRS)

Kale, L. V.

1987-01-01

Parallel processing of symbolic computations on a message-passing multi-processor presents one challenge: To effectively utilize the available processors, the load must be distributed uniformly to all the processors. However, the structure of these computations cannot be predicted in advance. go, static scheduling methods are not applicable. In this paper, we compare the performance of two dynamic, distributed load balancing methods with extensive simulation studies. The two schemes are: the Contracting Within a Neighborhood (CWN) scheme proposed by us, and the Gradient Model proposed by Lin and Keller. We conclude that although simpler, the CWN is significantly more effective at distributing the work than the Gradient model.
Scheduler-Conscious Synchronization.

DTIC Science & Technology

1994-12-01

SPONSORING I MONITORING Office of Naval Research ARPA AGENCY REPORT NUMBER Information Systems 3701 N. Fairfax Drive TR 550 Arlington VA 22217 Arlington VA...Broughton. A New Approach to Exclusive Data Access in Shared Memory Multiprocessors. Technical Report UCRL -97663, Lawrence Livermore National Laboratory
Desktop supercomputer: what can it do?

NASA Astrophysics Data System (ADS)

Bogdanov, A.; Degtyarev, A.; Korkhov, V.

2017-12-01

The paper addresses the issues of solving complex problems that require using supercomputers or multiprocessor clusters available for most researchers nowadays. Efficient distribution of high performance computing resources according to actual application needs has been a major research topic since high-performance computing (HPC) technologies became widely introduced. At the same time, comfortable and transparent access to these resources was a key user requirement. In this paper we discuss approaches to build a virtual private supercomputer available at user's desktop: a virtual computing environment tailored specifically for a target user with a particular target application. We describe and evaluate possibilities to create the virtual supercomputer based on light-weight virtualization technologies, and analyze the efficiency of our approach compared to traditional methods of HPC resource management.
Surprise

DOE Office of Scientific and Technical Information (OSTI.GOV)

Curran, L.

1988-03-03

Interest has been building in recent months over the imminent arrival of a new class of supercomputer, called the ''supercomputer on a desk'' or the single-user model. Most observers expected the first such product to come from either of two startups, Ardent Computer Corp. or Stellar Computer Inc. But a surprise entry has shown up. Apollo Computer Inc. is launching a new work station this week that racks up an impressive list of industry first as it puts supercomputer power at the disposal of a single user. The new series 10000 from the Chelmsford, Mass., a company is built aroundmore » a reduced-instruction-set architecture that the company calls Prism, for parallel reduced-instruction-set multiprocessor. This article describes the 10000 and Prism.« less
FX-87 performance measurements: data-flow implementation. Technical report

DOE Office of Scientific and Technical Information (OSTI.GOV)

Hammel, R.T.; Gifford, D.K.

1988-11-01

This report documents a series of experiments performed to explore the thesis that the FX-87 effect system permits a compiler to schedule imperative programs (i.e., programs that may contain side-effects) for execution on a parallel computer. The authors analyze how much the FX-87 static effect system can improve the execution times of five benchmark programs on a parallel graph interpreter. Three of their benchmark programs do not use side-effects (factorial, fibonacci, and polynomial division) and thus did not have any effect-induced constraints. Their FX-87 performance was comparable to their performance in a purely functional language. Two of the benchmark programsmore » use side effects (DNA sequence matching and Scheme interpretation) and the compiler was able to use effect information to reduce their execution times by factors of 1.7 to 5.4 when compared with sequential execution times. These results support the thesis that a static effect system is a powerful tool for compilation to multiprocessor computers. However, the graph interpreter we used was based on unrealistic assumptions, and thus our results may not accurately reflect the performance of a practical FX-87 implementation. The results also suggest that conventional loop analysis would complement the FX-87 effect system« less
Expert Systems on Multiprocessor Architectures. Phase 1

DTIC Science & Technology

1988-08-01

great rate) as early experience indicates what alternative aspect of system operation should have been monitored in any given completed run. The... system operation should have been monitored in any given completed run. The design goals that emerged then were (1) that the simulation system should...ORGANIZATION 6b OFFICE SYMBOL 7a. NAME OF MONITORING ORGANIZATION Stanford University (If applicable) Knowledge Systems Laboratory Rome Air Development
Nearest Neighbor Searching in Binary Search Trees: Simulation of a Multiprocessor System.

ERIC Educational Resources Information Center

Stewart, Mark; Willett, Peter

1987-01-01

Describes the simulation of a nearest neighbor searching algorithm for document retrieval using a pool of microprocessors. Three techniques are described which allow parallel searching of a binary search tree as well as a PASCAL-based system, PASSIM, which can simulate these techniques. Fifty-six references are provided. (Author/LRW)
Data Traffic Reduction Schemes for Cholesky Factorization on Asynchronous Multiprocessor Systems

DTIC Science & Technology

1989-06-01

Engineering NASA Langley Research Center Hampton, Virginia 23665-5225 Operated by the Universities Space Research Association DTIC ELECTE NASA jAUG 23...Hampton, VA 23665. ti- 1. Introduction Consider the problem of solving a system of linear equations Ax=b where .4 is an n x n symmetric, positive
Software Coherence in Multiprocessor Memory Systems. Ph.D. Thesis

NASA Technical Reports Server (NTRS)

Bolosky, William Joseph

1993-01-01

Processors are becoming faster and multiprocessor memory interconnection systems are not keeping up. Therefore, it is necessary to have threads and the memory they access as near one another as possible. Typically, this involves putting memory or caches with the processors, which gives rise to the problem of coherence: if one processor writes an address, any other processor reading that address must see the new value. This coherence can be maintained by the hardware or with software intervention. Systems of both types have been built in the past; the hardware-based systems tended to outperform the software ones. However, the ratio of processor to interconnect speed is now so high that the extra overhead of the software systems may no longer be significant. This issue is explored both by implementing a software maintained system and by introducing and using the technique of offline optimal analysis of memory reference traces. It finds that in properly built systems, software maintained coherence can perform comparably to or even better than hardware maintained coherence. The architectural features necessary for efficient software coherence to be profitable include a small page size, a fast trap mechanism, and the ability to execute instructions while remote memory references are outstanding.

Optical RAM-enabled cache memory and optical routing for chip multiprocessors: technologies and architectures

NASA Astrophysics Data System (ADS)

Pleros, Nikos; Maniotis, Pavlos; Alexoudi, Theonitsa; Fitsios, Dimitris; Vagionas, Christos; Papaioannou, Sotiris; Vyrsokinos, K.; Kanellos, George T.

2014-03-01

The processor-memory performance gap, commonly referred to as "Memory Wall" problem, owes to the speed mismatch between processor and electronic RAM clock frequencies, forcing current Chip Multiprocessor (CMP) configurations to consume more than 50% of the chip real-estate for caching purposes. In this article, we present our recent work spanning from Si-based integrated optical RAM cell architectures up to complete optical cache memory architectures for Chip Multiprocessor configurations. Moreover, we discuss on e/o router subsystems with up to Tb/s routing capacity for cache interconnection purposes within CMP configurations, currently pursued within the FP7 PhoxTrot project.
MPI implementation of PHOENICS: A general purpose computational fluid dynamics code

NASA Astrophysics Data System (ADS)

Simunovic, S.; Zacharia, T.; Baltas, N.; Spalding, D. B.

1995-03-01

PHOENICS is a suite of computational analysis programs that are used for simulation of fluid flow, heat transfer, and dynamical reaction processes. The parallel version of the solver EARTH for the Computational Fluid Dynamics (CFD) program PHOENICS has been implemented using Message Passing Interface (MPI) standard. Implementation of MPI version of PHOENICS makes this computational tool portable to a wide range of parallel machines and enables the use of high performance computing for large scale computational simulations. MPI libraries are available on several parallel architectures making the program usable across different architectures as well as on heterogeneous computer networks. The Intel Paragon NX and MPI versions of the program have been developed and tested on massively parallel supercomputers Intel Paragon XP/S 5, XP/S 35, and Kendall Square Research, and on the multiprocessor SGI Onyx computer at Oak Ridge National Laboratory. The preliminary testing results of the developed program have shown scalable performance for reasonably sized computational domains.
MPI implementation of PHOENICS: A general purpose computational fluid dynamics code

DOE Office of Scientific and Technical Information (OSTI.GOV)

Simunovic, S.; Zacharia, T.; Baltas, N.

1995-04-01

PHOENICS is a suite of computational analysis programs that are used for simulation of fluid flow, heat transfer, and dynamical reaction processes. The parallel version of the solver EARTH for the Computational Fluid Dynamics (CFD) program PHOENICS has been implemented using Message Passing Interface (MPI) standard. Implementation of MPI version of PHOENICS makes this computational tool portable to a wide range of parallel machines and enables the use of high performance computing for large scale computational simulations. MPI libraries are available on several parallel architectures making the program usable across different architectures as well as on heterogeneous computer networks. Themore » Intel Paragon NX and MPI versions of the program have been developed and tested on massively parallel supercomputers Intel Paragon XP/S 5, XP/S 35, and Kendall Square Research, and on the multiprocessor SGI Onyx computer at Oak Ridge National Laboratory. The preliminary testing results of the developed program have shown scalable performance for reasonably sized computational domains.« less
Multiprocessor Z-Buffer Architecture for High-Speed, High Complexity Computer Image Generation.

DTIC Science & Technology

1983-12-01

Oversampling 50 17. "Poking Through" Effects 51 18. Sampling Paths 52 19. Triangle Variables 54 20. Intelligent Tiling Algorithm 61 21. Tiler Functional Blocks...64 * 22. HSD Interface 65 23. Tiling Machine Setup 67 24. Tiling Machine 68 25. Tile Accumulate 69 26. A lx$ Sorting Machine 77 27. A 2x8 Sorting...Delay 227 87. Effect of Triangle Size on Tiler Throughput Rates 229 88. Tiling Machine Setup Stage Performance for Oversample Mode 234 89. Tiling
A Linked-Cell Domain Decomposition Method for Molecular Dynamics Simulation on a Scalable Multiprocessor

DOE PAGES

Yang, L. H.; Brooks III, E. D.; Belak, J.

1992-01-01

A molecular dynamics algorithm for performing large-scale simulations using the Parallel C Preprocessor (PCP) programming paradigm on the BBN TC2000, a massively parallel computer, is discussed. The algorithm uses a linked-cell data structure to obtain the near neighbors of each atom as time evoles. Each processor is assigned to a geometric domain containing many subcells and the storage for that domain is private to the processor. Within this scheme, the interdomain (i.e., interprocessor) communication is minimized.
Macro-actor execution on multilevel data-driven architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

Gaudiot, J.L.; Najjar, W.

1988-12-31

The data-flow model of computation brings to multiprocessors high programmability at the expense of increased overhead. Applying the model at a higher level leads to better performance but also introduces loss of parallelism. We demonstrate here syntax directed program decomposition methods for the creation of large macro-actors in numerical algorithms. In order to alleviate some of the problems introduced by the lower resolution interpretation, we describe a multi-level of resolution and analyze the requirements for its actual hardware and software integration.
Superelement model based parallel algorithm for vehicle dynamics

DOE Office of Scientific and Technical Information (OSTI.GOV)

Agrawal, O.P.; Danhof, K.J.; Kumar, R.

1994-05-01

This paper presents a superelement model based parallel algorithm for a planar vehicle dynamics. The vehicle model is made up of a chassis and two suspension systems each of which consists of an axle-wheel assembly and two trailing arms. In this model, the chassis is treated as a Cartesian element and each suspension system is treated as a superelement. The parameters associated with the superelements are computed using an inverse dynamics technique. Suspension shock absorbers and the tires are modeled by nonlinear springs and dampers. The Euler-Lagrange approach is used to develop the system equations of motion. This leads tomore » a system of differential and algebraic equations in which the constraints internal to superelements appear only explicitly. The above formulation is implemented on a multiprocessor machine. The numerical flow chart is divided into modules and the computation of several modules is performed in parallel to gain computational efficiency. In this implementation, the master (parent processor) creates a pool of slaves (child processors) at the beginning of the program. The slaves remain in the pool until they are needed to perform certain tasks. Upon completion of a particular task, a slave returns to the pool. This improves the overall response time of the algorithm. The formulation presented is general which makes it attractive for a general purpose code development. Speedups obtained in the different modules of the dynamic analysis computation are also presented. Results show that the superelement model based parallel algorithm can significantly reduce the vehicle dynamics simulation time. 52 refs.« less
Multithreaded Model for Dynamic Load Balancing Parallel Adaptive PDE Computations

NASA Technical Reports Server (NTRS)

Chrisochoides, Nikos

1995-01-01

We present a multithreaded model for the dynamic load-balancing of numerical, adaptive computations required for the solution of Partial Differential Equations (PDE's) on multiprocessors. Multithreading is used as a means of exploring concurrency in the processor level in order to tolerate synchronization costs inherent to traditional (non-threaded) parallel adaptive PDE solvers. Our preliminary analysis for parallel, adaptive PDE solvers indicates that multithreading can be used an a mechanism to mask overheads required for the dynamic balancing of processor workloads with computations required for the actual numerical solution of the PDE's. Also, multithreading can simplify the implementation of dynamic load-balancing algorithms, a task that is very difficult for traditional data parallel adaptive PDE computations. Unfortunately, multithreading does not always simplify program complexity, often makes code re-usability not an easy task, and increases software complexity.
A fast parallel 3D Poisson solver with longitudinal periodic and transverse open boundary conditions for space-charge simulations

NASA Astrophysics Data System (ADS)

Qiang, Ji

2017-10-01

A three-dimensional (3D) Poisson solver with longitudinal periodic and transverse open boundary conditions can have important applications in beam physics of particle accelerators. In this paper, we present a fast efficient method to solve the Poisson equation using a spectral finite-difference method. This method uses a computational domain that contains the charged particle beam only and has a computational complexity of O(Nu(logNmode)) , where Nu is the total number of unknowns and Nmode is the maximum number of longitudinal or azimuthal modes. This saves both the computational time and the memory usage of using an artificial boundary condition in a large extended computational domain. The new 3D Poisson solver is parallelized using a message passing interface (MPI) on multi-processor computers and shows a reasonable parallel performance up to hundreds of processor cores.
Specification and Verification of Secure Concurrent and Distributed Software Systems

DTIC Science & Technology

1992-02-01

primitive search strategies work for operating systems that contain relatively few operations . As the number of operations increases, so does the the...others have granted him access to, etc . The burden of security falls on the operating system , although appropriate hardware support can minimize the...Guttag, J. Horning, and R. Levin. Synchronization primitives for a multiprocessor: a formal specification. Symposium on Operating System Principles
File-access characteristics of parallel scientific workloads

NASA Technical Reports Server (NTRS)

Nieuwejaar, Nils; Kotz, David; Purakayastha, Apratim; Best, Michael; Ellis, Carla Schlatter

1995-01-01

Phenomenal improvements in the computational performance of multiprocessors have not been matched by comparable gains in I/O system performance. This imbalance has resulted in I/O becoming a significant bottleneck for many scientific applications. One key to overcoming this bottleneck is improving the performance of parallel file systems. The design of a high-performance parallel file system requires a comprehensive understanding of the expected workload. Unfortunately, until recently, no general workload studies of parallel file systems have been conducted. The goal of the CHARISMA project was to remedy this problem by characterizing the behavior of several production workloads, on different machines, at the level of individual reads and writes. The first set of results from the CHARISMA project describe the workloads observed on an Intel iPSC/860 and a Thinking Machines CM-5. This paper is intended to compare and contrast these two workloads for an understanding of their essential similarities and differences, isolating common trends and platform-dependent variances. Using this comparison, we are able to gain more insight into the general principles that should guide parallel file-system design.
Parallel Architectures for Planetary Exploration Requirements (PAPER)

NASA Technical Reports Server (NTRS)

Cezzar, Ruknet; Sen, Ranjan K.

1989-01-01

The Parallel Architectures for Planetary Exploration Requirements (PAPER) project is essentially research oriented towards technology insertion issues for NASA's unmanned planetary probes. It was initiated to complement and augment the long-term efforts for space exploration with particular reference to NASA/LaRC's (NASA Langley Research Center) research needs for planetary exploration missions of the mid and late 1990s. The requirements for space missions as given in the somewhat dated Advanced Information Processing Systems (AIPS) requirements document are contrasted with the new requirements from JPL/Caltech involving sensor data capture and scene analysis. It is shown that more stringent requirements have arisen as a result of technological advancements. Two possible architectures, the AIPS Proof of Concept (POC) configuration and the MAX Fault-tolerant dataflow multiprocessor, were evaluated. The main observation was that the AIPS design is biased towards fault tolerance and may not be an ideal architecture for planetary and deep space probes due to high cost and complexity. The MAX concepts appears to be a promising candidate, except that more detailed information is required. The feasibility for adding neural computation capability to this architecture needs to be studied. Key impact issues for architectural design of computing systems meant for planetary missions were also identified.
Realtime system for GLAS on WHT

NASA Astrophysics Data System (ADS)

Skvarč, Jure; Tulloch, Simon; Myers, Richard M.

2006-06-01

The new ground layer adaptive optics system (GLAS) on the William Herschel Telescope (WHT) on La Palma will be based on the existing natural guide star adaptive optics system called NAOMI. A part of the new developments is a new control system for the tip-tilt mirror. Instead of the existing system, built around a custom built multiprocessor computer made of C40 DSPs, this system uses an ordinary PC machine and a Linux operating system. It is equipped with a high sensitivity L3 CCD camera with effective readout noise of nearly zero. The software design for the tip-tilt system is being completely redeveloped, in order to make a use of object oriented design which should facilitate easier integration with the rest of the observing system at the WHT. The modular design of the system allows incorporation of different centroiding and loop control methods. To test the system off-sky, we have built a laboratory bench using an artificial light source and a tip-tilt mirror. We present results of tip-tilt correction quality using different centroiding algorithms and different control loop methods at different light levels. This system will serve as a testing ground for a transition to a completely PC-based real-time control system.
A pluggable framework for parallel pairwise sequence search.

PubMed

Archuleta, Jeremy; Feng, Wu-chun; Tilevich, Eli

2007-01-01

The current and near future of the computing industry is one of multi-core and multi-processor technology. Most existing sequence-search tools have been designed with a focus on single-core, single-processor systems. This discrepancy between software design and hardware architecture substantially hinders sequence-search performance by not allowing full utilization of the hardware. This paper presents a novel framework that will aid the conversion of serial sequence-search tools into a parallel version that can take full advantage of the available hardware. The framework, which is based on a software architecture called mixin layers with refined roles, enables modules to be plugged into the framework with minimal effort. The inherent modular design improves maintenance and extensibility, thus opening up a plethora of opportunities for advanced algorithmic features to be developed and incorporated while routine maintenance of the codebase persists.
Strategies for concurrent processing of complex algorithms in data driven architectures

NASA Technical Reports Server (NTRS)

Stoughton, John W.; Mielke, Roland R.

1988-01-01

Research directed at developing a graph theoretical model for describing data and control flow associated with the execution of large grained algorithms in a special distributed computer environment is presented. This model is identified by the acronym ATAMM which represents Algorithms To Architecture Mapping Model. The purpose of such a model is to provide a basis for establishing rules for relating an algorithm to its execution in a multiprocessor environment. Specifications derived from the model lead directly to the description of a data flow architecture which is a consequence of the inherent behavior of the data and control flow described by the model. The purpose of the ATAMM based architecture is to provide an analytical basis for performance evaluation. The ATAMM model and architecture specifications are demonstrated on a prototype system for concept validation.
Recent Improvements in Aerodynamic Design Optimization on Unstructured Meshes

NASA Technical Reports Server (NTRS)

Nielsen, Eric J.; Anderson, W. Kyle

2000-01-01

Recent improvements in an unstructured-grid method for large-scale aerodynamic design are presented. Previous work had shown such computations to be prohibitively long in a sequential processing environment. Also, robust adjoint solutions and mesh movement procedures were difficult to realize, particularly for viscous flows. To overcome these limiting factors, a set of design codes based on a discrete adjoint method is extended to a multiprocessor environment using a shared memory approach. A nearly linear speedup is demonstrated, and the consistency of the linearizations is shown to remain valid. The full linearization of the residual is used to precondition the adjoint system, and a significantly improved convergence rate is obtained. A new mesh movement algorithm is implemented and several advantages over an existing technique are presented. Several design cases are shown for turbulent flows in two and three dimensions.
Proceedings: Sisal `93

DOE Office of Scientific and Technical Information (OSTI.GOV)

Feo, J.T.

1993-10-01

This report contain papers on: Programmability and performance issues; The case of an iterative partial differential equation solver; Implementing the kernal of the Australian Region Weather Prediction Model in Sisal; Even and quarter-even prime length symmetric FFTs and their Sisal Implementations; Top-down thread generation for Sisal; Overlapping communications and computations on NUMA architechtures; Compiling technique based on dataflow analysis for funtional programming language Valid; Copy elimination for true multidimensional arrays in Sisal 2.0; Increasing parallelism for an optimization that reduces copying in IF2 graphs; Caching in on Sisal; Cache performance of Sisal Vs. FORTRAN; FFT algorithms on a shared-memory multiprocessor;more » A parallel implementation of nonnumeric search problems in Sisal; Computer vision algorithms in Sisal; Compilation of Sisal for a high-performance data driven vector processor; Sisal on distributed memory machines; A virtual shared addressing system for distributed memory Sisal; Developing a high-performance FFT algorithm in Sisal for a vector supercomputer; Implementation issues for IF2 on a static data-flow architechture; and Systematic control of parallelism in array-based data-flow computation. Selected papers have been indexed separately for inclusion in the Energy Science and Technology Database.« less
Impact of Load Balancing on Unstructured Adaptive Grid Computations for Distributed-Memory Multiprocessors

NASA Technical Reports Server (NTRS)

Sohn, Andrew; Biswas, Rupak; Simon, Horst D.

1996-01-01

The computational requirements for an adaptive solution of unsteady problems change as the simulation progresses. This causes workload imbalance among processors on a parallel machine which, in turn, requires significant data movement at runtime. We present a new dynamic load-balancing framework, called JOVE, that balances the workload across all processors with a global view. Whenever the computational mesh is adapted, JOVE is activated to eliminate the load imbalance. JOVE has been implemented on an IBM SP2 distributed-memory machine in MPI for portability. Experimental results for two model meshes demonstrate that mesh adaption with load balancing gives more than a sixfold improvement over one without load balancing. We also show that JOVE gives a 24-fold speedup on 64 processors compared to sequential execution.
Software Epistemology

DTIC Science & Technology

2016-03-01

in-vitro decision to incubate a startup, Lexumo [7], which is developing a commercial Software as a Service ( SaaS ) vulnerability assessment...LTS Label Transition System MUSE Mining and Understanding Software Enclaves RTEMS Real-Time Executive for Multi-processor Systems SaaS Software ...as a Service SSA Static Single Assignment SWE Software Epistemology UD/DU Def-Use/Use-Def Chains (Dataflow Graph)
Load Balancing Unstructured Adaptive Grids for CFD Problems

NASA Technical Reports Server (NTRS)

Biswas, Rupak; Oliker, Leonid

1996-01-01

Mesh adaption is a powerful tool for efficient unstructured-grid computations but causes load imbalance among processors on a parallel machine. A dynamic load balancing method is presented that balances the workload across all processors with a global view. After each parallel tetrahedral mesh adaption, the method first determines if the new mesh is sufficiently unbalanced to warrant a repartitioning. If so, the adapted mesh is repartitioned, with new partitions assigned to processors so that the redistribution cost is minimized. The new partitions are accepted only if the remapping cost is compensated by the improved load balance. Results indicate that this strategy is effective for large-scale scientific computations on distributed-memory multiprocessors.

OPAD-EDIFIS Real-Time Processing

NASA Technical Reports Server (NTRS)

Katsinis, Constantine

1997-01-01

The Optical Plume Anomaly Detection (OPAD) detects engine hardware degradation of flight vehicles through identification and quantification of elemental species found in the plume by analyzing the plume emission spectra in a real-time mode. Real-time performance of OPAD relies on extensive software which must report metal amounts in the plume faster than once every 0.5 sec. OPAD software previously written by NASA scientists performed most necessary functions at speeds which were far below what is needed for real-time operation. The research presented in this report improved the execution speed of the software by optimizing the code without changing the algorithms and converting it into a parallelized form which is executed in a shared-memory multiprocessor system. The resulting code was subjected to extensive timing analysis. The report also provides suggestions for further performance improvement by (1) identifying areas of algorithm optimization, (2) recommending commercially available multiprocessor architectures and operating systems to support real-time execution and (3) presenting an initial study of fault-tolerance requirements.
cuTauLeaping: A GPU-Powered Tau-Leaping Stochastic Simulator for Massive Parallel Analyses of Biological Systems

PubMed Central

Besozzi, Daniela; Pescini, Dario; Mauri, Giancarlo

2014-01-01

Tau-leaping is a stochastic simulation algorithm that efficiently reconstructs the temporal evolution of biological systems, modeled according to the stochastic formulation of chemical kinetics. The analysis of dynamical properties of these systems in physiological and perturbed conditions usually requires the execution of a large number of simulations, leading to high computational costs. Since each simulation can be executed independently from the others, a massive parallelization of tau-leaping can bring to relevant reductions of the overall running time. The emerging field of General Purpose Graphic Processing Units (GPGPU) provides power-efficient high-performance computing at a relatively low cost. In this work we introduce cuTauLeaping, a stochastic simulator of biological systems that makes use of GPGPU computing to execute multiple parallel tau-leaping simulations, by fully exploiting the Nvidia's Fermi GPU architecture. We show how a considerable computational speedup is achieved on GPU by partitioning the execution of tau-leaping into multiple separated phases, and we describe how to avoid some implementation pitfalls related to the scarcity of memory resources on the GPU streaming multiprocessors. Our results show that cuTauLeaping largely outperforms the CPU-based tau-leaping implementation when the number of parallel simulations increases, with a break-even directly depending on the size of the biological system and on the complexity of its emergent dynamics. In particular, cuTauLeaping is exploited to investigate the probability distribution of bistable states in the Schlögl model, and to carry out a bidimensional parameter sweep analysis to study the oscillatory regimes in the Ras/cAMP/PKA pathway in S. cerevisiae. PMID:24663957
Fault-Tolerant Multiprocessor and VLSI-Based Systems.

DTIC Science & Technology

1987-03-15

54590 170 Table 1: Statistics for the Benchmark Programs pages are distributed amongst the groups of the reconfigured memory in proportion to the...distances are proportional to only the logarithm of the sure that possesses relevance to a system which consists of alare nmbe ofhomgenouseleent...and comn.unication overhead resulting from faults communicating with all of the other elements in the system the network to degrade proportionately to
Adaptive Signal Processing Testbed: VME-based DSP board market survey

NASA Astrophysics Data System (ADS)

Ingram, Rick E.

1992-04-01

The Adaptive Signal Processing Testbed (ASPT) is a real-time multiprocessor system utilizing digital signal processor technology on VMEbus based printed circuit boards installed on a Sun workstation. The ASPT has specific requirements, particularly as regards to the signal excision application, with respect to interfacing with current and planned data generation equipment, processing of the data, storage to disk of final and intermediate results, and the development tools for applications development and integration into the overall EW/COM computing environment. A prototype ASPT was implemented using three VME-C-30 boards from Applied Silicon. Experience gained during the prototype development led to the conclusions that interprocessor communications capability is the most significant contributor to overall ASPT performance. In addition, the host involvement should be minimized. Boards using different processors were evaluated with respect to the ASPT system requirements, pricing, and availability. Specific recommendations based on various priorities are made as well as recommendations concerning the integration and interaction of various tools developed during the prototype implementation.
UPEML Version 2. 0: A machine-portable CDC Update emulator

DOE Office of Scientific and Technical Information (OSTI.GOV)

Mehlhorn, T.A.; Young, M.F.

1987-05-01

UPEML is a machine-portable CDC Update emulation program. UPEML is written in ANSI standard Fortran-77 and is relatively simple and compact. It is capable of emulating a significant subset of the standard CDC Update functions, including program library creation and subsequent modification. Machine-portability is an essential attribute of UPEML. UPEML was written primarily to facilitate the use of CDC-based scientific packages on alternate computer systems such as the VAX 11/780 and the IBM 3081. UPEML has also been successfully used on the multiprocessor ELXSI, on CRAYs under both COS and CTSS operating systems, on APOLLO workstations, and on the HP-9000.more » Version 2.0 includes enhanced error checking, full ASCI character support, a program library audit capability, and a partial update option in which only selected or modified decks are written to the compile file. Further enhancements include checks for overlapping corrections, processing of nested calls to common decks, and reads and addfiles from alternate input files.« less
Support for non-locking parallel reception of packets belonging to a single memory reception FIFO

DOEpatents

Chen, Dong [Yorktown Heights, NY; Heidelberger, Philip [Yorktown Heights, NY; Salapura, Valentina [Yorktown Heights, NY; Senger, Robert M [Yorktown Heights, NY; Steinmacher-Burow, Burkhard [Boeblingen, DE; Sugawara, Yutaka [Yorktown Heights, NY

2011-01-27

A method and apparatus for distributed parallel messaging in a parallel computing system. A plurality of DMA engine units are configured in a multiprocessor system to operate in parallel, one DMA engine unit for transferring a current packet received at a network reception queue to a memory location in a memory FIFO (rmFIFO) region of a memory. A control unit implements logic to determine whether any prior received packet destined for that rmFIFO is still in a process of being stored in the associated memory by another DMA engine unit of the plurality, and prevent the one DMA engine unit from indicating completion of storing the current received packet in the reception memory FIFO (rmFIFO) until all prior received packets destined for that rmFIFO are completely stored by the other DMA engine units. Thus, there is provided non-locking support so that multiple packets destined for a single rmFIFO are transferred and stored in parallel to predetermined locations in a memory.
Distributed Scene Analysis For Autonomous Road Vehicle Guidance

NASA Astrophysics Data System (ADS)

Mysliwetz, Birger D.; Dickmanns, E. D.

1987-01-01

An efficient distributed processing scheme has been developed for visual road boundary tracking by 'VaMoRs', a testbed vehicle for autonomous mobility and computer vision. Ongoing work described here is directed to improving the robustness of the road boundary detection process in the presence of shadows, ill-defined edges and other disturbing real world effects. The system structure and the techniques applied for real-time scene analysis are presented along with experimental results. All subfunctions of road boundary detection for vehicle guidance, such as edge extraction, feature aggregation and camera pointing control, are executed in parallel by an onboard multiprocessor system. On the image processing level local oriented edge extraction is performed in multiple 'windows', tighly controlled from a hierarchically higher, modelbased level. The interpretation process involving a geometric road model and the observer's relative position to the road boundaries is capable of coping with ambiguity in measurement data. By using only selected measurements to update the model parameters even high noise levels can be dealt with and misleading edges be rejected.
Shared Versus Distributed Memory Multiprocessors

DTIC Science & Technology

1991-01-01

multiprocessors should hawe shared or dis.trimuted meieo-% ha~ trr ~ g ’’~ de~i c4~accio;, S Cm teaicners argue S trongly tor Outiding (li15 tri huted...Applications, MIT Press (1985). 161 D. Gajski et el., "Cedar," Proc. Compcon, pp. 306-309 (Spring 19S9). 171 S. Ahuja, N. Carriero and D. Gelernter, "Linda
A generalized computer code for developing dynamic gas turbine engine models (DIGTEM)

NASA Technical Reports Server (NTRS)

Daniele, C. J.

1984-01-01

This paper describes DIGTEM (digital turbofan engine model), a computer program that simulates two spool, two stream (turbofan) engines. DIGTEM was developed to support the development of a real time multiprocessor based engine simulator being designed at the Lewis Research Center. The turbofan engine model in DIGTEM contains steady state performance maps for all the components and has control volumes where continuity and energy balances are maintained. Rotor dynamics and duct momentum dynamics are also included. DIGTEM features an implicit integration scheme for integrating stiff systems and trims the model equations to match a prescribed design point by calculating correction coefficients that balance out the dynamic equations. It uses the same coefficients at off design points and iterates to a balanced engine condition. Transients are generated by defining the engine inputs as functions of time in a user written subroutine (TMRSP). Closed loop controls can also be simulated. DIGTEM is generalized in the aerothermodynamic treatment of components. This feature, along with DIGTEM's trimming at a design point, make it a very useful tool for developing a model of a specific turbofan engine.
A generalized computer code for developing dynamic gas turbine engine models (DIGTEM)

NASA Technical Reports Server (NTRS)

Daniele, C. J.

1983-01-01

This paper describes DIGTEM (digital turbofan engine model), a computer program that simulates two spool, two stream (turbofan) engines. DIGTEM was developed to support the development of a real time multiprocessor based engine simulator being designed at the Lewis Research Center. The turbofan engine model in DIGTEM contains steady state performance maps for all the components and has control volumes where continuity and energy balances are maintained. Rotor dynamics and duct momentum dynamics are also included. DIGTEM features an implicit integration scheme for integrating stiff systems and trims the model equations to match a prescribed design point by calculating correction coefficients that balance out the dynamic equations. It uses the same coefficients at off design points and iterates to a balanced engine condition. Transients are generated by defining the engine inputs as functions of time in a user written subroutine (TMRSP). Closed loop controls can also be simulated. DIGTEM is generalized in the aerothermodynamic treatment of components. This feature, along with DIGTEM's trimming at a design point, make it a very useful tool for developing a model of a specific turbofan engine.
Dynamic remapping of parallel computations with varying resource demands

NASA Technical Reports Server (NTRS)

Nicol, D. M.; Saltz, J. H.

1986-01-01

A large class of computational problems is characterized by frequent synchronization, and computational requirements which change as a function of time. When such a problem must be solved on a message passing multiprocessor machine, the combination of these characteristics lead to system performance which decreases in time. Performance can be improved with periodic redistribution of computational load; however, redistribution can exact a sometimes large delay cost. We study the issue of deciding when to invoke a global load remapping mechanism. Such a decision policy must effectively weigh the costs of remapping against the performance benefits. We treat this problem by constructing two analytic models which exhibit stochastically decreasing performance. One model is quite tractable; we are able to describe the optimal remapping algorithm, and the optimal decision policy governing when to invoke that algorithm. However, computational complexity prohibits the use of the optimal remapping decision policy. We then study the performance of a general remapping policy on both analytic models. This policy attempts to minimize a statistic W(n) which measures the system degradation (including the cost of remapping) per computation step over a period of n steps. We show that as a function of time, the expected value of W(n) has at most one minimum, and that when this minimum exists it defines the optimal fixed-interval remapping policy. Our decision policy appeals to this result by remapping when it estimates that W(n) is minimized. Our performance data suggests that this policy effectively finds the natural frequency of remapping. We also use the analytic models to express the relationship between performance and remapping cost, number of processors, and the computation's stochastic activity.
Quantitative Image Feature Engine (QIFE): an Open-Source, Modular Engine for 3D Quantitative Feature Extraction from Volumetric Medical Images.

PubMed

Echegaray, Sebastian; Bakr, Shaimaa; Rubin, Daniel L; Napel, Sandy

2017-10-06

The aim of this study was to develop an open-source, modular, locally run or server-based system for 3D radiomics feature computation that can be used on any computer system and included in existing workflows for understanding associations and building predictive models between image features and clinical data, such as survival. The QIFE exploits various levels of parallelization for use on multiprocessor systems. It consists of a managing framework and four stages: input, pre-processing, feature computation, and output. Each stage contains one or more swappable components, allowing run-time customization. We benchmarked the engine using various levels of parallelization on a cohort of CT scans presenting 108 lung tumors. Two versions of the QIFE have been released: (1) the open-source MATLAB code posted to Github, (2) a compiled version loaded in a Docker container, posted to DockerHub, which can be easily deployed on any computer. The QIFE processed 108 objects (tumors) in 2:12 (h/mm) using 1 core, and 1:04 (h/mm) hours using four cores with object-level parallelization. We developed the Quantitative Image Feature Engine (QIFE), an open-source feature-extraction framework that focuses on modularity, standards, parallelism, provenance, and integration. Researchers can easily integrate it with their existing segmentation and imaging workflows by creating input and output components that implement their existing interfaces. Computational efficiency can be improved by parallelizing execution at the cost of memory usage. Different parallelization levels provide different trade-offs, and the optimal setting will depend on the size and composition of the dataset to be processed.
Method of predicting a change in an economy

DOEpatents

Pryor, Richard J [Albuquerque, NM; Basu, Nipa [Albany, NY

2006-01-10

An economy whose activity is to be predicted comprises a plurality of decision makers. Decision makers include, for example, households, government, industry, and banks. The decision makers are represented by agents, where an agent can represent one or more decision makers. Each agent has decision rules that determine the agent's actions. Each agent can affect the economy by affecting variable conditions characteristic of the economy or the internal state of other agents. Agents can communicate actions through messages. On a multiprocessor computer, the agents can be assigned to processing elements.
Simulation of electrical and thermal fields in a multimode microwave oven using software written in C++

NASA Astrophysics Data System (ADS)

Abrudean, C.

2017-05-01

Due to multiple reflexions on walls, the electromagnetic field in a multimode microwave oven is difficult to estimate analytically. This paper presents a C++ program that calculates the electromagnetic field in a resonating cavity with an absorbing payload, uses the result to calculate heating in the payload taking its properties into account and then repeats. This results in a simulation of microwave heating, including phenomena like thermal runaway. The program is multithreaded to make use of today’s common multiprocessor/multicore computers.
KALI - An environment for the programming and control of cooperative manipulators

NASA Technical Reports Server (NTRS)

Hayward, Vincent; Hayati, Samad

1988-01-01

A design description is given of a controller for cooperative robots. The background and motivation for multiple arm control are discussed. A set of programming primitives which permit a programmer to specify cooperative tasks are described. Motion primitives specify asynchronous motions, master/slave motions, and cooperative motions. In the context of cooperative robots, trajectory generation issues are discussed and the authors' implementation briefly described. The relations between programming and control in the case of multiple robots are examined. The allocation of various tasks among a multiprocessor computer is described.
Ring-array processor distribution topology for optical interconnects

NASA Technical Reports Server (NTRS)

Li, Yao; Ha, Berlin; Wang, Ting; Wang, Sunyu; Katz, A.; Lu, X. J.; Kanterakis, E.

1992-01-01

The existing linear and rectangular processor distribution topologies for optical interconnects, although promising in many respects, cannot solve problems such as clock skews, the lack of supporting elements for efficient optical implementation, etc. The use of a ring-array processor distribution topology, however, can overcome these problems. Here, a study of the ring-array topology is conducted with an aim of implementing various fast clock rate, high-performance, compact optical networks for digital electronic multiprocessor computers. Practical design issues are addressed. Some proof-of-principle experimental results are included.
Parallel Lattice Basis Reduction Using a Multi-threaded Schnorr-Euchner LLL Algorithm

NASA Astrophysics Data System (ADS)

Backes, Werner; Wetzel, Susanne

In this paper, we introduce a new parallel variant of the LLL lattice basis reduction algorithm. Our new, multi-threaded algorithm is the first to provide an efficient, parallel implementation of the Schorr-Euchner algorithm for today’s multi-processor, multi-core computer architectures. Experiments with sparse and dense lattice bases show a speed-up factor of about 1.8 for the 2-thread and about factor 3.2 for the 4-thread version of our new parallel lattice basis reduction algorithm in comparison to the traditional non-parallel algorithm.
Better, Cheaper, Faster Molecular Dynamics

NASA Technical Reports Server (NTRS)

Pohorille, Andrew; DeVincenzi, Donald L. (Technical Monitor)

2001-01-01

Recent, revolutionary progress in genomics and structural, molecular and cellular biology has created new opportunities for molecular-level computer simulations of biological systems by providing vast amounts of data that require interpretation. These opportunities are further enhanced by the increasing availability of massively parallel computers. For many problems, the method of choice is classical molecular dynamics (iterative solving of Newton's equations of motion). It focuses on two main objectives. One is to calculate the relative stability of different states of the system. A typical problem that has' such an objective is computer-aided drug design. Another common objective is to describe evolution of the system towards a low energy (possibly the global minimum energy), "native" state. Perhaps the best example of such a problem is protein folding. Both types of problems share the same difficulty. Often, different states of the system are separated by high energy barriers, which implies that transitions between these states are rare events. This, in turn, can greatly impede exploration of phase space. In some instances this can lead to "quasi non-ergodicity", whereby a part of phase space is inaccessible on time scales of the simulation. To overcome this difficulty and to extend molecular dynamics to "biological" time scales (millisecond or longer) new physical formulations and new algorithmic developments are required. To be efficient they should account for natural limitations of multi-processor computer architecture. I will present work along these lines done in my group. In particular, I will focus on a new approach to calculating the free energies (stability) of different states and to overcoming "the curse of rare events". I will also discuss algorithmic improvements to multiple time step methods and to the treatment of slowly decaying, log-ranged, electrostatic effects.
Expert Systems on Multiprocessor Architectures. Volume 2. Technical Reports

DTIC Science & Technology

1991-06-01

Report RC 12936 (#58037). IBM T. J. Wartson Reiearch Center. July 1987. � Alan Jay Smith. Cache memories. Coniputing Sitrry., 1.1(3): I.3-5:30...basic-shared is an instrument for ashared memory design. The components panels are processor- qload-scrolling-bar-panel, memory-qload-scrolling-bar-panel
Silicon photonics for high-performance interconnection networks

NASA Astrophysics Data System (ADS)

Biberman, Aleksandr

2011-12-01

We assert in the course of this work that silicon photonics has the potential to be a key disruptive technology in computing and communication industries. The enduring pursuit of performance gains in computing, combined with stringent power constraints, has fostered the ever-growing computational parallelism associated with chip multiprocessors, memory systems, high-performance computing systems, and data centers. Sustaining these parallelism growths introduces unique challenges for on- and off-chip communications, shifting the focus toward novel and fundamentally different communication approaches. This work showcases that chip-scale photonic interconnection networks, enabled by high-performance silicon photonic devices, enable unprecedented bandwidth scalability with reduced power consumption. We demonstrate that the silicon photonic platforms have already produced all the high-performance photonic devices required to realize these types of networks. Through extensive empirical characterization in much of this work, we demonstrate such feasibility of waveguides, modulators, switches, and photodetectors. We also demonstrate systems that simultaneously combine many functionalities to achieve more complex building blocks. Furthermore, we leverage the unique properties of available silicon photonic materials to create novel silicon photonic devices, subsystems, network topologies, and architectures to enable unprecedented performance of these photonic interconnection networks and computing systems. We show that the advantages of photonic interconnection networks extend far beyond the chip, offering advanced communication environments for memory systems, high-performance computing systems, and data centers. Furthermore, we explore the immense potential of all-optical functionalities implemented using parametric processing in the silicon platform, demonstrating unique methods that have the ability to revolutionize computation and communication. Silicon photonics enables new sets of opportunities that we can leverage for performance gains, as well as new sets of challenges that we must solve. Leveraging its inherent compatibility with standard fabrication techniques of the semiconductor industry, combined with its capability of dense integration with advanced microelectronics, silicon photonics also offers a clear path toward commercialization through low-cost mass-volume production. Combining empirical validations of feasibility, demonstrations of massive performance gains in large-scale systems, and the potential for commercial penetration of silicon photonics, the impact of this work will become evident in the many decades that follow.

Advanced Numerical Techniques of Performance Evaluation. Volume 2

DTIC Science & Technology

1990-06-01

multiprocessor environment. This factor is determined by the overhead of the primitives available in the system ( semaphore , monitor , or message... semaphore , monitor , or message passing primitives ) and U the programming ability of the user who implements the simulation. " t,: the sequential...Warp Operating System . i Pro" lftevcnth ACM Symposum on Operating Systems Princlplcs, pages 77 9:3, Auslin, TX, Nov wicr 1987. ACM. [121 D.R. Jefferson
Knowledge-based vision for space station object motion detection, recognition, and tracking

NASA Technical Reports Server (NTRS)

Symosek, P.; Panda, D.; Yalamanchili, S.; Wehner, W., III

1987-01-01

Computer vision, especially color image analysis and understanding, has much to offer in the area of the automation of Space Station tasks such as construction, satellite servicing, rendezvous and proximity operations, inspection, experiment monitoring, data management and training. Knowledge-based techniques improve the performance of vision algorithms for unstructured environments because of their ability to deal with imprecise a priori information or inaccurately estimated feature data and still produce useful results. Conventional techniques using statistical and purely model-based approaches lack flexibility in dealing with the variabilities anticipated in the unstructured viewing environment of space. Algorithms developed under NASA sponsorship for Space Station applications to demonstrate the value of a hypothesized architecture for a Video Image Processor (VIP) are presented. Approaches to the enhancement of the performance of these algorithms with knowledge-based techniques and the potential for deployment of highly-parallel multi-processor systems for these algorithms are discussed.
Performance tradeoffs in static and dynamic load balancing strategies

NASA Technical Reports Server (NTRS)

Iqbal, M. A.; Saltz, J. H.; Bokhart, S. H.

1986-01-01

The problem of uniformly distributing the load of a parallel program over a multiprocessor system was considered. A program was analyzed whose structure permits the computation of the optimal static solution. Then four strategies for load balancing were described and their performance compared. The strategies are: (1) the optimal static assignment algorithm which is guaranteed to yield the best static solution, (2) the static binary dissection method which is very fast but sub-optimal, (3) the greedy algorithm, a static fully polynomial time approximation scheme, which estimates the optimal solution to arbitrary accuracy, and (4) the predictive dynamic load balancing heuristic which uses information on the precedence relationships within the program and outperforms any of the static methods. It is also shown that the overhead incurred by the dynamic heuristic is reduced considerably if it is started off with a static assignment provided by either of the other three strategies.
Direct numerical simulation of the laminar-turbulent transition at hypersonic flow speeds on a supercomputer

NASA Astrophysics Data System (ADS)

Egorov, I. V.; Novikov, A. V.; Fedorov, A. V.

2017-08-01

A method for direct numerical simulation of three-dimensional unsteady disturbances leading to a laminar-turbulent transition at hypersonic flow speeds is proposed. The simulation relies on solving the full three-dimensional unsteady Navier-Stokes equations. The computational technique is intended for multiprocessor supercomputers and is based on a fully implicit monotone approximation scheme and the Newton-Raphson method for solving systems of nonlinear difference equations. This approach is used to study the development of three-dimensional unstable disturbances in a flat-plate and compression-corner boundary layers in early laminar-turbulent transition stages at the free-stream Mach number M = 5.37. The three-dimensional disturbance field is visualized in order to reveal and discuss features of the instability development at the linear and nonlinear stages. The distribution of the skin friction coefficient is used to detect laminar and transient flow regimes and determine the onset of the laminar-turbulent transition.
Benchmarking GNU Radio Kernels and Multi-Processor Scheduling

DTIC Science & Technology

2013-01-14

AMD E350 APU , comparable to Atom • ARM Cortex A8 running on a Gumstix Overo on an Ettus USRP E110 The general testing procedure consists of • Build...Intel Atom, and the AMD E350 APU . 3.2 Multi-Processor Scheduling Figure 1: GFLOPs per second through an FFT array on an Intel i7. Example output from
Considerations for Multiprocessor Topologies

NASA Technical Reports Server (NTRS)

Byrd, Gregory T.; Delagi, Bruce A.

1987-01-01

Choosing a multiprocessor interconnection topology may depend on high-level considerations, such as the intended application domain and the expected number of processors. It certainly depends on low-level implementation details, such as packaging and communications protocols. The authors first use rough measures of cost and performance to characterize several topologies. They then examine how implementation details can affect the realizable performance of a topology.
A microeconomic scheduler for parallel computers

NASA Technical Reports Server (NTRS)

Stoica, Ion; Abdel-Wahab, Hussein; Pothen, Alex

1995-01-01

We describe a scheduler based on the microeconomic paradigm for scheduling on-line a set of parallel jobs in a multiprocessor system. In addition to the classical objectives of increasing the system throughput and reducing the response time, we consider fairness in allocating system resources among the users, and providing the user with control over the relative performances of his jobs. We associate with every user a savings account in which he receives money at a constant rate. When a user wants to run a job, he creates an expense account for that job to which he transfers money from his savings account. The job uses the funds in its expense account to obtain the system resources it needs for execution. The share of the system resources allocated to the user is directly related to the rate at which the user receives money; the rate at which the user transfers money into a job expense account controls the job's performance. We prove that starvation is not possible in our model. Simulation results show that our scheduler improves both system and user performances in comparison with two different variable partitioning policies. It is also shown to be effective in guaranteeing fairness and providing control over the performance of jobs.
Design and evaluation of a fault-tolerant multiprocessor using hardware recovery blocks

NASA Technical Reports Server (NTRS)

Lee, Y. H.; Shin, K. G.

1982-01-01

A fault-tolerant multiprocessor with a rollback recovery mechanism is discussed. The rollback mechanism is based on the hardware recovery block which is a hardware equivalent to the software recovery block. The hardware recovery block is constructed by consecutive state-save operations and several state-save units in every processor and memory module. When a fault is detected, the multiprocessor reconfigures itself to replace the faulty component and then the process originally assigned to the faulty component retreats to one of the previously saved states in order to resume fault-free execution. A mathematical model is proposed to calculate both the coverage of multi-step rollback recovery and the risk of restart. A performance evaluation in terms of task execution time is also presented.
Progress and supercomputing in computational fluid dynamics; Proceedings of U.S.-Israel Workshop, Jerusalem, Israel, December 1984

NASA Technical Reports Server (NTRS)

Murman, E. M. (Editor); Abarbanel, S. S. (Editor)

1985-01-01

Current developments and future trends in the application of supercomputers to computational fluid dynamics are discussed in reviews and reports. Topics examined include algorithm development for personal-size supercomputers, a multiblock three-dimensional Euler code for out-of-core and multiprocessor calculations, simulation of compressible inviscid and viscous flow, high-resolution solutions of the Euler equations for vortex flows, algorithms for the Navier-Stokes equations, and viscous-flow simulation by FEM and related techniques. Consideration is given to marching iterative methods for the parabolized and thin-layer Navier-Stokes equations, multigrid solutions to quasi-elliptic schemes, secondary instability of free shear flows, simulation of turbulent flow, and problems connected with weather prediction.
Testing and operating a multiprocessor chip with processor redundancy

DOEpatents

Bellofatto, Ralph E; Douskey, Steven M; Haring, Rudolf A; McManus, Moyra K; Ohmacht, Martin; Schmunkamp, Dietmar; Sugavanam, Krishnan; Weatherford, Bryan J

2014-10-21

A system and method for improving the yield rate of a multiprocessor semiconductor chip that includes primary processor cores and one or more redundant processor cores. A first tester conducts a first test on one or more processor cores, and encodes results of the first test in an on-chip non-volatile memory. A second tester conducts a second test on the processor cores, and encodes results of the second test in an external non-volatile storage device. An override bit of a multiplexer is set if a processor core fails the second test. In response to the override bit, the multiplexer selects a physical-to-logical mapping of processor IDs according to one of: the encoded results in the memory device or the encoded results in the external storage device. On-chip logic configures the processor cores according to the selected physical-to-logical mapping.
Nonrecursive formulations of multibody dynamics and concurrent multiprocessing

NASA Technical Reports Server (NTRS)

Kurdila, Andrew J.; Menon, Ramesh

1993-01-01

Since the late 1980's, research in recursive formulations of multibody dynamics has flourished. Historically, much of this research can be traced to applications of low dimensionality in mechanism and vehicle dynamics. Indeed, there is little doubt that recursive order N methods are the method of choice for this class of systems. This approach has the advantage that a minimal number of coordinates are utilized, parallelism can be induced for certain system topologies, and the method is of order N computational cost for systems of N rigid bodies. Despite the fact that many authors have dismissed redundant coordinate formulations as being of order N(exp 3), and hence less attractive than recursive formulations, we present recent research that demonstrates that at least three distinct classes of redundant, nonrecursive multibody formulations consistently achieve order N computational cost for systems of rigid and/or flexible bodies. These formulations are as follows: (1) the preconditioned range space formulation; (2) penalty methods; and (3) augmented Lagrangian methods for nonlinear multibody dynamics. The first method can be traced to its foundation in equality constrained quadratic optimization, while the last two methods have been studied extensively in the context of coercive variational boundary value problems in computational mechanics. Until recently, however, they have not been investigated in the context of multibody simulation, and present theoretical questions unique to nonlinear dynamics. All of these nonrecursive methods have additional advantages with respect to recursive order N methods: (1) the formalisms retain the highly desirable order N computational cost; (2) the techniques are amenable to concurrent simulation strategies; (3) the approaches do not depend upon system topology to induce concurrency; and (4) the methods can be derived to balance the computational load automatically on concurrent multiprocessors. In addition to the presentation of the fundamental formulations, this paper presents new theoretical results regarding the rate of convergence of order N constraint stabilization schemes associated with the newly introduced class of methods.
Parallel Computing Strategies for Irregular Algorithms

NASA Technical Reports Server (NTRS)

Biswas, Rupak; Oliker, Leonid; Shan, Hongzhang; Biegel, Bryan (Technical Monitor)

2002-01-01

Parallel computing promises several orders of magnitude increase in our ability to solve realistic computationally-intensive problems, but relies on their efficient mapping and execution on large-scale multiprocessor architectures. Unfortunately, many important applications are irregular and dynamic in nature, making their effective parallel implementation a daunting task. Moreover, with the proliferation of parallel architectures and programming paradigms, the typical scientist is faced with a plethora of questions that must be answered in order to obtain an acceptable parallel implementation of the solution algorithm. In this paper, we consider three representative irregular applications: unstructured remeshing, sparse matrix computations, and N-body problems, and parallelize them using various popular programming paradigms on a wide spectrum of computer platforms ranging from state-of-the-art supercomputers to PC clusters. We present the underlying problems, the solution algorithms, and the parallel implementation strategies. Smart load-balancing, partitioning, and ordering techniques are used to enhance parallel performance. Overall results demonstrate the complexity of efficiently parallelizing irregular algorithms.
A parallel solver for huge dense linear systems

NASA Astrophysics Data System (ADS)

Badia, J. M.; Movilla, J. L.; Climente, J. I.; Castillo, M.; Marqués, M.; Mayo, R.; Quintana-Ortí, E. S.; Planelles, J.

2011-11-01

HDSS (Huge Dense Linear System Solver) is a Fortran Application Programming Interface (API) to facilitate the parallel solution of very large dense systems to scientists and engineers. The API makes use of parallelism to yield an efficient solution of the systems on a wide range of parallel platforms, from clusters of processors to massively parallel multiprocessors. It exploits out-of-core strategies to leverage the secondary memory in order to solve huge linear systems O(100.000). The API is based on the parallel linear algebra library PLAPACK, and on its Out-Of-Core (OOC) extension POOCLAPACK. Both PLAPACK and POOCLAPACK use the Message Passing Interface (MPI) as the communication layer and BLAS to perform the local matrix operations. The API provides a friendly interface to the users, hiding almost all the technical aspects related to the parallel execution of the code and the use of the secondary memory to solve the systems. In particular, the API can automatically select the best way to store and solve the systems, depending of the dimension of the system, the number of processes and the main memory of the platform. Experimental results on several parallel platforms report high performance, reaching more than 1 TFLOP with 64 cores to solve a system with more than 200 000 equations and more than 10 000 right-hand side vectors. New version program summaryProgram title: Huge Dense System Solver (HDSS) Catalogue identifier: AEHU_v1_1 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEHU_v1_1.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 87 062 No. of bytes in distributed program, including test data, etc.: 1 069 110 Distribution format: tar.gz Programming language: Fortran90, C Computer: Parallel architectures: multiprocessors, computer clusters Operating system: Linux/Unix Has the code been vectorized or parallelized?: Yes, includes MPI primitives. RAM: Tested for up to 190 GB Classification: 6.5 External routines: MPI ( http://www.mpi-forum.org/), BLAS ( http://www.netlib.org/blas/), PLAPACK ( http://www.cs.utexas.edu/~plapack/), POOCLAPACK ( ftp://ftp.cs.utexas.edu/pub/rvdg/PLAPACK/pooclapack.ps) (code for PLAPACK and POOCLAPACK is included in the distribution). Catalogue identifier of previous version: AEHU_v1_0 Journal reference of previous version: Comput. Phys. Comm. 182 (2011) 533 Does the new version supersede the previous version?: Yes Nature of problem: Huge scale dense systems of linear equations, Ax=B, beyond standard LAPACK capabilities. Solution method: The linear systems are solved by means of parallelized routines based on the LU factorization, using efficient secondary storage algorithms when the available main memory is insufficient. Reasons for new version: In many applications we need to guarantee a high accuracy in the solution of very large linear systems and we can do it by using double-precision arithmetic. Summary of revisions: Version 1.1 Can be used to solve linear systems using double-precision arithmetic. New version of the initialization routine. The user can choose the kind of arithmetic and the values of several parameters of the environment. Running time: About 5 hours to solve a system with more than 200 000 equations and more than 10 000 right-hand side vectors using double-precision arithmetic on an eight-node commodity cluster with a total of 64 Intel cores.
Analysis of Photonic Networks for a Chip Multiprocessor Using Scientific Applications

DTIC Science & Technology

2009-05-01

Analysis of Photonic Networks for a Chip Multiprocessor Using Scientific Applications Gilbert Hendry†, Shoaib Kamil‡?, Aleksandr Biberman†, Johnnie...electronic networks -on-chip warrants investigating real application traces on functionally compa- rable photonic and electronic network designs. We... network can achieve 75× improvement in energy ef- ficiency for synthetic benchmarks and up to 37× improve- ment for real scientific applications
Adaptive Backoff Synchronization Techniques

DTIC Science & Technology

1989-07-01

The Simple Code. Technical Report, Lawrence Livermore Laboratory, February 1978. [6] F. Darems-Rogers, D. A. George, V. A. Norton, and G . F. Pfister...Heights, November 1986. 20 [7] Daniel Gajski , David Kuck, Duncan Lawrie, and Ahmed Saleh. Cedar - A Large Scale Multiprocessor. In International...17] Janak H. Patel. Analysis of Multiprocessors with Private Cache Memories. IEEE Transactions on Com- puters, C-31(4):296-304, April 1982. [18] G
Expert Systems on Multiprocessor Architectures. Volume 4. Technical Reports

DTIC Science & Technology

1991-06-01

Floated-Current-Time0 -> The time that this function is called in user time uflts, expressed as a floating point number. Halt- Poligono Arrests the...default a statistics file will be printed out, if it can be. To prevent this make No-Statistics true. Unhalt- Poligono Unarrests the process in which the
Controlled replication: reduce the capacity occupied by redundant replicas in tiled chip multiprocessors

NASA Astrophysics Data System (ADS)

Li, Hao; Xie, Lunguo

2013-03-01

The design of cache system for Chip Multiprocessor (CMP) face many challenges because future CMPs will have more cores and greater on-chip cache capacity. There are two base design schemes about L2 cache: private scheme in which each L2 slice is treated as a private L2 cache and shared scheme in which all L2 slices are treated as a large L2 cache shared by all cores. Private caches provide the lowest hit latency but reduce the total effective cache capacity. A shared L2 cache increases the effective cache capacity but has long hit latencies when data is on a remote tile. This paper present a new Controlled Replication (CR) policy to reduce the capacities occupied by redundant shared replicas. the new CR policy increases the effective capacity than victim replication scheme and has lower hit latency than shared scheme. We evaluate the various schemes using full-system simulation of parallel applications. Results show that CR reduces the average memory access latency of shared scheme by an average of 13%, providing better overall performance than victim replication and shared schemes.
Performance comparison analysis library communication cluster system using merge sort

NASA Astrophysics Data System (ADS)

Wulandari, D. A. R.; Ramadhan, M. E.

2018-04-01

Begins by using a single processor, to increase the speed of computing time, the use of multi-processor was introduced. The second paradigm is known as parallel computing, example cluster. The cluster must have the communication potocol for processing, one of it is message passing Interface (MPI). MPI have many library, both of them OPENMPI and MPICH2. Performance of the cluster machine depend on suitable between performance characters of library communication and characters of the problem so this study aims to analyze the comparative performances libraries in handling parallel computing process. The case study in this research are MPICH2 and OpenMPI. This case research execute sorting’s problem to know the performance of cluster system. The sorting problem use mergesort method. The research method is by implementing OpenMPI and MPICH2 on a Linux-based cluster by using five computer virtual then analyze the performance of the system by different scenario tests and three parameters for to know the performance of MPICH2 and OpenMPI. These performances are execution time, speedup and efficiency. The results of this study showed that the addition of each data size makes OpenMPI and MPICH2 have an average speed-up and efficiency tend to increase but at a large data size decreases. increased data size doesn’t necessarily increased speed up and efficiency but only execution time example in 100000 data size. OpenMPI has a execution time greater than MPICH2 example in 1000 data size average execution time with MPICH2 is 0,009721 and OpenMPI is 0,003895 OpenMPI can customize communication needs.
Adaptive Backoff Synchronization Techniques

DTIC Science & Technology

1989-06-01

The Simple Code. Technical Report, Lawrence Livermore Laboratory, February 1978. [6J F. Darems-Rogers, D. A. George, V. A. Norton, and G . F. Pfister...Heights, November 1986. 20 [7] Daniel Gajski , David Kuck, Duncan Lawrie, and Ahmed Saleh. Cedar - A Large Scale Multiprocessor. In International Conference...17] Janak H. Patel. Analysis of Multiprocessors with Private Cache Memories. IEEE Transactions on Com- puters, C-31(4):296-304, April 1982. [18] G
Design of a high-speed digital processing element for parallel simulation

NASA Technical Reports Server (NTRS)

Milner, E. J.; Cwynar, D. S.

1983-01-01

A prototype of a custom designed computer to be used as a processing element in a multiprocessor based jet engine simulator is described. The purpose of the custom design was to give the computer the speed and versatility required to simulate a jet engine in real time. Real time simulations are needed for closed loop testing of digital electronic engine controls. The prototype computer has a microcycle time of 133 nanoseconds. This speed was achieved by: prefetching the next instruction while the current one is executing, transporting data using high speed data busses, and using state of the art components such as a very large scale integration (VLSI) multiplier. Included are discussions of processing element requirements, design philosophy, the architecture of the custom designed processing element, the comprehensive instruction set, the diagnostic support software, and the development status of the custom design.

Dynamic Load Balancing for Adaptive Computations on Distributed-Memory Machines

NASA Technical Reports Server (NTRS)

1999-01-01

Dynamic load balancing is central to adaptive mesh-based computations on large-scale parallel computers. The principal investigator has investigated various issues on the dynamic load balancing problem under NASA JOVE and JAG rants. The major accomplishments of the project are two graph partitioning algorithms and a load balancing framework. The S-HARP dynamic graph partitioner is known to be the fastest among the known dynamic graph partitioners to date. It can partition a graph of over 100,000 vertices in 0.25 seconds on a 64- processor Cray T3E distributed-memory multiprocessor while maintaining the scalability of over 16-fold speedup. Other known and widely used dynamic graph partitioners take over a second or two while giving low scalability of a few fold speedup on 64 processors. These results have been published in journals and peer-reviewed flagship conferences.
Droplet flow along the wall of rectangular channel with gradient of wettability

NASA Astrophysics Data System (ADS)

Kupershtokh, A. L.

2018-03-01

The lattice Boltzmann equations (LBE) method (LBM) is applicable for simulating the multiphysics problems of fluid flows with free boundaries, taking into account the viscosity, surface tension, evaporation and wetting degree of a solid surface. Modeling of the nonstationary motion of a drop of liquid along a solid surface with a variable level of wettability is carried out. For the computer simulation of such a problem, the three-dimensional lattice Boltzmann equations method D3Q19 is used. The LBE method allows us to parallelize the calculations on multiprocessor graphics accelerators using the CUDA programming technology.
Hierarchical Poly Tree Configurations for the Solution of Dynamically Refined Finte Element Models

NASA Technical Reports Server (NTRS)

Gute, G. D.; Padovan, J.

1993-01-01

This paper demonstrates how a multilevel substructuring technique, called the Hierarchical Poly Tree (HPT), can be used to integrate a localized mesh refinement into the original finite element model more efficiently. The optimal HPT configurations for solving isoparametrically square h-, p-, and hp-extensions on single and multiprocessor computers is derived. In addition, the reduced number of stiffness matrix elements that must be stored when employing this type of solution strategy is quantified. Moreover, the HPT inherently provides localize 'error-trapping' and a logical, efficient means with which to isolate physically anomalous and analytically singular behavior.
State-of-the-art review of computational fluid dynamics modeling for fluid-solids systems

NASA Astrophysics Data System (ADS)

Lyczkowski, R. W.; Bouillard, J. X.; Ding, J.; Chang, S. L.; Burge, S. W.

1994-05-01

As the result of 15 years of research (50 staff years of effort) Argonne National Laboratory (ANL), through its involvement in fluidized-bed combustion, magnetohydrodynamics, and a variety of environmental programs, has produced extensive computational fluid dynamics (CFD) software and models to predict the multiphase hydrodynamic and reactive behavior of fluid-solids motions and interactions in complex fluidized-bed reactors (FBR's) and slurry systems. This has resulted in the FLUFIX, IRF, and SLUFIX computer programs. These programs are based on fluid-solids hydrodynamic models and can predict information important to the designer of atmospheric or pressurized bubbling and circulating FBR, fluid catalytic cracking (FCC) and slurry units to guarantee optimum efficiency with minimum release of pollutants into the environment. This latter issue will become of paramount importance with the enactment of the Clean Air Act Amendment (CAAA) of 1995. Solids motion is also the key to understanding erosion processes. Erosion rates in FBR's and pneumatic and slurry components are computed by ANL's EROSION code to predict the potential metal wastage of FBR walls, intervals, feed distributors, and cyclones. Only the FLUFIX and IRF codes will be reviewed in the paper together with highlights of the validations because of length limitations. It is envisioned that one day, these codes with user-friendly pre- and post-processor software and tailored for massively parallel multiprocessor shared memory computational platforms will be used by industry and researchers to assist in reducing and/or eliminating the environmental and economic barriers which limit full consideration of coal, shale, and biomass as energy sources; to retain energy security; and to remediate waste and ecological problems.
Optimal mapping of irregular finite element domains to parallel processors

NASA Technical Reports Server (NTRS)

Flower, J.; Otto, S.; Salama, M.

1987-01-01

Mapping the solution domain of n-finite elements into N-subdomains that may be processed in parallel by N-processors is an optimal one if the subdomain decomposition results in a well-balanced workload distribution among the processors. The problem is discussed in the context of irregular finite element domains as an important aspect of the efficient utilization of the capabilities of emerging multiprocessor computers. Finding the optimal mapping is an intractable combinatorial optimization problem, for which a satisfactory approximate solution is obtained here by analogy to a method used in statistical mechanics for simulating the annealing process in solids. The simulated annealing analogy and algorithm are described, and numerical results are given for mapping an irregular two-dimensional finite element domain containing a singularity onto the Hypercube computer.
A C++ Thread Package for Concurrent and Parallel Programming

DOE Office of Scientific and Technical Information (OSTI.GOV)

Jie Chen; William Watson

1999-11-01

Recently thread libraries have become a common entity on various operating systems such as Unix, Windows NT and VxWorks. Those thread libraries offer significant performance enhancement by allowing applications to use multiple threads running either concurrently or in parallel on multiprocessors. However, the incompatibilities between native libraries introduces challenges for those who wish to develop portable applications.
The FORCE - A highly portable parallel programming language

NASA Technical Reports Server (NTRS)

Jordan, Harry F.; Benten, Muhammad S.; Alaghband, Gita; Jakob, Ruediger

1989-01-01

This paper explains why the FORCE parallel programming language is easily portable among six different shared-memory multiprocessors, and how a two-level macro preprocessor makes it possible to hide low-level machine dependencies and to build machine-independent high-level constructs on top of them. These FORCE constructs make it possible to write portable parallel programs largely independent of the number of processes and the specific shared-memory multiprocessor executing them.
Graphics applications utilizing parallel processing

NASA Technical Reports Server (NTRS)

Rice, John R.

1990-01-01

The results are presented of research conducted to develop a parallel graphic application algorithm to depict the numerical solution of the 1-D wave equation, the vibrating string. The research was conducted on a Flexible Flex/32 multiprocessor and a Sequent Balance 21000 multiprocessor. The wave equation is implemented using the finite difference method. The synchronization issues that arose from the parallel implementation and the strategies used to alleviate the effects of the synchronization overhead are discussed.
Power-Aware Compiler Controllable Chip Multiprocessor

NASA Astrophysics Data System (ADS)

Shikano, Hiroaki; Shirako, Jun; Wada, Yasutaka; Kimura, Keiji; Kasahara, Hironori

A power-aware compiler controllable chip multiprocessor (CMP) is presented and its performance and power consumption are evaluated with the optimally scheduled advanced multiprocessor (OSCAR) parallelizing compiler. The CMP is equipped with power control registers that change clock frequency and power supply voltage to functional units including processor cores, memories, and an interconnection network. The OSCAR compiler carries out coarse-grain task parallelization of programs and reduces power consumption using architectural power control support and the compiler's power saving scheme. The performance evaluation shows that MPEG-2 encoding on the proposed CMP with four CPUs results in 82.6% power reduction in real-time execution mode with a deadline constraint on its sequential execution time. Furthermore, MP3 encoding on a heterogeneous CMP with four CPUs and four accelerators results in 53.9% power reduction at 21.1-fold speed-up in performance against its sequential execution in the fastest execution mode.
3-dimensional magnetotelluric inversion including topography using deformed hexahedral edge finite elements and direct solvers parallelized on symmetric multiprocessor computers - Part II: direct data-space inverse solution

NASA Astrophysics Data System (ADS)

Kordy, M.; Wannamaker, P.; Maris, V.; Cherkaev, E.; Hill, G.

2016-01-01

Following the creation described in Part I of a deformable edge finite-element simulator for 3-D magnetotelluric (MT) responses using direct solvers, in Part II we develop an algorithm named HexMT for 3-D regularized inversion of MT data including topography. Direct solvers parallelized on large-RAM, symmetric multiprocessor (SMP) workstations are used also for the Gauss-Newton model update. By exploiting the data-space approach, the computational cost of the model update becomes much less in both time and computer memory than the cost of the forward simulation. In order to regularize using the second norm of the gradient, we factor the matrix related to the regularization term and apply its inverse to the Jacobian, which is done using the MKL PARDISO library. For dense matrix multiplication and factorization related to the model update, we use the PLASMA library which shows very good scalability across processor cores. A synthetic test inversion using a simple hill model shows that including topography can be important; in this case depression of the electric field by the hill can cause false conductors at depth or mask the presence of resistive structure. With a simple model of two buried bricks, a uniform spatial weighting for the norm of model smoothing recovered more accurate locations for the tomographic images compared to weightings which were a function of parameter Jacobians. We implement joint inversion for static distortion matrices tested using the Dublin secret model 2, for which we are able to reduce nRMS to ˜1.1 while avoiding oscillatory convergence. Finally we test the code on field data by inverting full impedance and tipper MT responses collected around Mount St Helens in the Cascade volcanic chain. Among several prominent structures, the north-south trending, eruption-controlling shear zone is clearly imaged in the inversion.
Performance of a Bounce-Averaged Global Model of Super-Thermal Electron Transport in the Earth's Magnetic Field

NASA Technical Reports Server (NTRS)

McGuire, Tim

1998-01-01

In this paper, we report the results of our recent research on the application of a multiprocessor Cray T916 supercomputer in modeling super-thermal electron transport in the earth's magnetic field. In general, this mathematical model requires numerical solution of a system of partial differential equations. The code we use for this model is moderately vectorized. By using Amdahl's Law for vector processors, it can be verified that the code is about 60% vectorized on a Cray computer. Speedup factors on the order of 2.5 were obtained compared to the unvectorized code. In the following sections, we discuss the methodology of improving the code. In addition to our goal of optimizing the code for solution on the Cray computer, we had the goal of scalability in mind. Scalability combines the concepts of portabilty with near-linear speedup. Specifically, a scalable program is one whose performance is portable across many different architectures with differing numbers of processors for many different problem sizes. Though we have access to a Cray at this time, the goal was to also have code which would run well on a variety of architectures.
Communication-Driven Codesign for Multiprocessor Systems

DTIC Science & Technology

2004-01-01

processors, FPGA or ASIC subsystems, mi- croprocessors, and microcontrollers. When a processor is embedded within a SLOT architecture, one or more...Broderson, Low-power CMOS digital design, IEEE Journal of Solid-State Circuits 27 (1992), no. 4, 473–484. [25] L. Chao and E. Sha , Scheduling data-flow...1997), 239– 256 . [82] P. K. Murthy, E. G. Cohen, and S. Rowland, System Canvas: A new design en- vironment for embedded DSP and telecommunications
Expert Systems on Multiprocessor Architectures. Volume 3. Technical Reports

DTIC Science & Technology

1991-06-01

choice of load balancing vs. load sharing 1141. While load balancing strives to keep all sites equally loaded, load sharing merely tries to prevent ...unnecessary idleness. Loo. balancing is appropriate to object- oriented real- time systems because * real-time systems ne ,l to prevent long waits for...oetavir ConClass siy51cr Iz a n ubjeU rephitation ’-enare ir order wo prevent a partic=Lar abiec:;ram heing (ntrlu ~lel Ar iic]en:f etautaan ire chanw
Multiprocessor Real-Time Locking Protocols for Replicated Resources

DTIC Science & Technology

2016-07-01

circular buffer of slots, each representing a discrete segment of time . For example, if the maintenance of a timing wheel occurs af- ter an interrupt ...Experimental Evaluation To evaluate Algs. 2, 3, and 4, we conducted a series of ex- periments in which we measured relevant overheads and blocking times . We...Multiprocessor Real- Time Locking Protocols for Replicated Resources ∗ Catherine E. Jarrett1, Kecheng Yang1, Ming Yang1, Pontus Ekberg2, and James H
Cedar-a large scale multiprocessor

DOE Office of Scientific and Technical Information (OSTI.GOV)

Gajski, D.; Kuck, D.; Lawrie, D.

1983-01-01

This paper presents an overview of Cedar, a large scale multiprocessor being designed at the University of Illinois. This machine is designed to accommodate several thousand high performance processors which are capable of working together on a single job, or they can be partitioned into groups of processors where each group of one or more processors can work on separate jobs. Various aspects of the machine are described including the control methodology, communication network, optimizing compiler and plans for construction. 13 references.
An Expert System Interfaced with a Database System to Perform Troubleshooting of Aircraft Carrier Piping Systems

DTIC Science & Technology

1988-12-01

interval of four feet, and are numbered sequentially bow to stem. * "wing tank" is a tank or void, outboard of the holding bulkhead, away from the center...system and DBMS simultaneously with a multi-processor, allowing queries to the DBMS without terminating the expert system. This method was judged...RECIRC). eductor -strip("Y"):- ask _ques _read_ans(OVBD,"ovbd dis open"),ovbd dis-open(OVBD). eductor-strip("N"):- ask_ques read_ans( LINEUP , "strip lineup
Solving Partial Differential Equations in a data-driven multiprocessor environment

DOE Office of Scientific and Technical Information (OSTI.GOV)

Gaudiot, J.L.; Lin, C.M.; Hosseiniyar, M.

1988-12-31

Partial differential equations can be found in a host of engineering and scientific problems. The emergence of new parallel architectures has spurred research in the definition of parallel PDE solvers. Concurrently, highly programmable systems such as data-how architectures have been proposed for the exploitation of large scale parallelism. The implementation of some Partial Differential Equation solvers (such as the Jacobi method) on a tagged token data-flow graph is demonstrated here. Asynchronous methods (chaotic relaxation) are studied and new scheduling approaches (the Token No-Labeling scheme) are introduced in order to support the implementation of the asychronous methods in a data-driven environment.more » New high-level data-flow language program constructs are introduced in order to handle chaotic operations. Finally, the performance of the program graphs is demonstrated by a deterministic simulation of a message passing data-flow multiprocessor. An analysis of the overhead in the data-flow graphs is undertaken to demonstrate the limits of parallel operations in dataflow PDE program graphs.« less
An Adaptive Insertion and Promotion Policy for Partitioned Shared Caches

NASA Astrophysics Data System (ADS)

Mahrom, Norfadila; Liebelt, Michael; Raof, Rafikha Aliana A.; Daud, Shuhaizar; Hafizah Ghazali, Nur

2018-03-01

Cache replacement policies in chip multiprocessors (CMP) have been investigated extensively and proven able to enhance shared cache management. However, competition among multiple processors executing different threads that require simultaneous access to a shared memory may cause cache contention and memory coherence problems on the chip. These issues also exist due to some drawbacks of the commonly used Least Recently Used (LRU) policy employed in multiprocessor systems, which are because of the cache lines residing in the cache longer than required. In image processing analysis of for example extra pulmonary tuberculosis (TB), an accurate diagnosis for tissue specimen is required. Therefore, a fast and reliable shared memory management system to execute algorithms for processing vast amount of specimen image is needed. In this paper, the effects of the cache replacement policy in a partitioned shared cache are investigated. The goal is to quantify whether better performance can be achieved by using less complex replacement strategies. This paper proposes a Middle Insertion 2 Positions Promotion (MI2PP) policy to eliminate cache misses that could adversely affect the access patterns and the throughput of the processors in the system. The policy employs a static predefined insertion point, near distance promotion, and the concept of ownership in the eviction policy to effectively improve cache thrashing and to avoid resource stealing among the processors.
Solutions and debugging for data consistency in multiprocessors with noncoherent caches

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bernstein, D.; Mendelson, B.; Breternitz, M. Jr.

1995-02-01

We analyze two important problems that arise in shared-memory multiprocessor systems. The stale data problem involves ensuring that data items in local memory of individual processors are current, independent of writes done by other processors. False sharing occurs when two processors have copies of the same shared data block but update different portions of the block. The false sharing problem involves guaranteeing that subsequent writes are properly combined. In modern architectures these problems are usually solved in hardware, by exploiting mechanisms for hardware controlled cache consistency. This leads to more expensive and nonscalable designs. Therefore, we are concentrating on softwaremore » methods for ensuring cache consistency that would allow for affordable and scalable multiprocessing systems. Unfortunately, providing software control is nontrivial, both for the compiler writer and for the application programmer. For this reason we are developing a debugging environment that will facilitate the development of compiler-based techniques and will help the programmer to tune his or her application using explicit cache management mechanisms. We extend the notion of a race condition for IBM Shared Memory System POWER/4, taking into consideration its noncoherent caches, and propose techniques for detection of false sharing problems. Identification of the stale data problem is discussed as well, and solutions are suggested.« less
Partitioning sparse matrices with eigenvectors of graphs

NASA Technical Reports Server (NTRS)

Pothen, Alex; Simon, Horst D.; Liou, Kang-Pu

1990-01-01

The problem of computing a small vertex separator in a graph arises in the context of computing a good ordering for the parallel factorization of sparse, symmetric matrices. An algebraic approach for computing vertex separators is considered in this paper. It is shown that lower bounds on separator sizes can be obtained in terms of the eigenvalues of the Laplacian matrix associated with a graph. The Laplacian eigenvectors of grid graphs can be computed from Kronecker products involving the eigenvectors of path graphs, and these eigenvectors can be used to compute good separators in grid graphs. A heuristic algorithm is designed to compute a vertex separator in a general graph by first computing an edge separator in the graph from an eigenvector of the Laplacian matrix, and then using a maximum matching in a subgraph to compute the vertex separator. Results on the quality of the separators computed by the spectral algorithm are presented, and these are compared with separators obtained from other algorithms for computing separators. Finally, the time required to compute the Laplacian eigenvector is reported, and the accuracy with which the eigenvector must be computed to obtain good separators is considered. The spectral algorithm has the advantage that it can be implemented on a medium-size multiprocessor in a straightforward manner.

Cache directory lookup reader set encoding for partial cache line speculation support

DOEpatents

Gara, Alan; Ohmacht, Martin

2014-10-21

In a multiprocessor system, with conflict checking implemented in a directory lookup of a shared cache memory, a reader set encoding permits dynamic recordation of read accesses. The reader set encoding includes an indication of a portion of a line read, for instance by indicating boundaries of read accesses. Different encodings may apply to different types of speculative execution.
Issues Concerning The Development Of A Mobile Platform For Health Care Applications

NASA Astrophysics Data System (ADS)

Korba, Larry W.; Liscano, Ramiro; Green, David; Durie, Nelson

1989-03-01

There are a number of problems that must yet be overcome before robotic technology can be applied in a hospital or a home care setting. The four basic problems are: cost, safety, finding appropriate applications and developing application specific solutions. Advanced robotics technology is now costly because of the complexity associated with autonomous systems. In any application, it is most important that the safety of the individuals using or exposed to the vehicle is ensured. Often in the health care field, innovative and useful new devices require an inordinate amount of time before they are accepted. The technical and ergonomic problems associated with any application must be solved so that cost containment, safety, ease of use, and quality of life are ensured. This paper discusses these issues in relation to our own development of an autonomous vehicle for health care applications. In this advancement, a commercially available platform is being equipped with an on-board, multiprocessor computer system and a variety of sensor systems. In order to develop pertinent solutions to the technical problems, there must be a framework wherein there is a focus upon the practical issues associated with the end application.
Performance Monitoring of Distributed Data Processing Systems

NASA Technical Reports Server (NTRS)

Ojha, Anand K.

2000-01-01

Test and checkout systems are essential components in ensuring safety and reliability of aircraft and related systems for space missions. A variety of systems, developed over several years, are in use at the NASA/KSC. Many of these systems are configured as distributed data processing systems with the functionality spread over several multiprocessor nodes interconnected through networks. To be cost-effective, a system should take the least amount of resource and perform a given testing task in the least amount of time. There are two aspects of performance evaluation: monitoring and benchmarking. While monitoring is valuable to system administrators in operating and maintaining, benchmarking is important in designing and upgrading computer-based systems. These two aspects of performance evaluation are the foci of this project. This paper first discusses various issues related to software, hardware, and hybrid performance monitoring as applicable to distributed systems, and specifically to the TCMS (Test Control and Monitoring System). Next, a comparison of several probing instructions are made to show that the hybrid monitoring technique developed by the NIST (National Institutes for Standards and Technology) is the least intrusive and takes only one-fourth of the time taken by software monitoring probes. In the rest of the paper, issues related to benchmarking a distributed system have been discussed and finally a prescription for developing a micro-benchmark for the TCMS has been provided.
Measurement of fault latency in a digital avionic miniprocessor

NASA Technical Reports Server (NTRS)

Mcgough, J. G.; Swern, F. L.

1981-01-01

The results of fault injection experiments utilizing a gate-level emulation of the central processor unit of the Bendix BDX-930 digital computer are presented. The failure detection coverage of comparison-monitoring and a typical avionics CPU self-test program was determined. The specific tasks and experiments included: (1) inject randomly selected gate-level and pin-level faults and emulate six software programs using comparison-monitoring to detect the faults; (2) based upon the derived empirical data develop and validate a model of fault latency that will forecast a software program's detecting ability; (3) given a typical avionics self-test program, inject randomly selected faults at both the gate-level and pin-level and determine the proportion of faults detected; (4) determine why faults were undetected; (5) recommend how the emulation can be extended to multiprocessor systems such as SIFT; and (6) determine the proportion of faults detected by a uniprocessor BIT (built-in-test) irrespective of self-test.
Spectral analysis method and sample generation for real time visualization of speech

NASA Astrophysics Data System (ADS)

Hobohm, Klaus

A method for translating speech signals into optical models, characterized by high sound discrimination and learnability and designed to provide to deaf persons a feedback towards control of their way of speaking, is presented. Important properties of speech production and perception processes and organs involved in these mechanisms are recalled in order to define requirements for speech visualization. It is established that the spectral representation of time, frequency and amplitude resolution of hearing must be fair and continuous variations of acoustic parameters of speech signal must be depicted by a continuous variation of images. A color table was developed for dynamic illustration and sonograms were generated with five spectral analysis methods such as Fourier transformations and linear prediction coding. For evaluating sonogram quality, test persons had to recognize consonant/vocal/consonant words and an optimized analysis method was achieved with a fast Fourier transformation and a postprocessor. A hardware concept of a real time speech visualization system, based on multiprocessor technology in a personal computer, is presented.
Programming methodology for a general purpose automation controller

NASA Technical Reports Server (NTRS)

Sturzenbecker, M. C.; Korein, J. U.; Taylor, R. H.

1987-01-01

The General Purpose Automation Controller is a multi-processor architecture for automation programming. A methodology has been developed whose aim is to simplify the task of programming distributed real-time systems for users in research or manufacturing. Programs are built by configuring function blocks (low-level computations) into processes using data flow principles. These processes are activated through the verb mechanism. Verbs are divided into two classes: those which support devices, such as robot joint servos, and those which perform actions on devices, such as motion control. This programming methodology was developed in order to achieve the following goals: (1) specifications for real-time programs which are to a high degree independent of hardware considerations such as processor, bus, and interconnect technology; (2) a component approach to software, so that software required to support new devices and technologies can be integrated by reconfiguring existing building blocks; (3) resistance to error and ease of debugging; and (4) a powerful command language interface.
Benchmarking Ada tasking on tightly coupled multiprocessor architectures

NASA Technical Reports Server (NTRS)

Collard, Philippe; Goforth, Andre; Marquardt, Matthew

1989-01-01

The development of benchmarks and performance measures for parallel Ada tasking is reported with emphasis on the macroscopic behavior of the benchmark across a set of load parameters. The application chosen for the study was the NASREM model for telerobot control, relevant to many NASA missions. The results of the study demonstrate the potential of parallel Ada in accomplishing the task of developing a control system for a system such as the Flight Telerobotic Servicer using the NASREM framework.
NASA Accelerates SpaceCube Technology into Orbit

NASA Technical Reports Server (NTRS)

Petrick, David

2010-01-01

On May 11, 2009, STS-125 Space Shuttle Atlantis blasted off from Kennedy Space Center on a historic mission to service the Hubble Space Telescope (HST). In addition to sending up the hardware and tools required to repair the observatory, the servicing team at NASA's Goddard Space Flight Center also sent along a complex experimental payload called Relative Navigation Sensors (RNS). The main objective of the RNS payload was to provide real-time image tracking of HST during rendezvous and docking operations. RNS was a complete success, and was brought to life by four Xilinx FPGAs (Field Programmable Gate Arrays) tightly packed into one integrated computer called SpaceCube. SpaceCube is a compact, reconfigurable, multiprocessor computing platform for space applications demanding extreme processing capabilities based on Xilinx Virtex 4 FX60 FPGAs. In a matter of months, the concept quickly went from the white board to a fully funded flight project. The 4-inch by 4-inch SpaceCube processor card was prototyped by a group of Goddard engineers using internal research funding. Once engineers were able to demonstrate the processing power of SpaceCube to NASA, HST management stood behind the product and invested in a flight qualified version, inserting it into the heart of the RNS system. With the determination of putting Xilinx into space, the team strengthened to a small army and delivered a fully functional, space qualified system to the mission.
An efficient parallel algorithm for matrix-vector multiplication

DOE Office of Scientific and Technical Information (OSTI.GOV)

Hendrickson, B.; Leland, R.; Plimpton, S.

The multiplication of a vector by a matrix is the kernel computation of many algorithms in scientific computation. A fast parallel algorithm for this calculation is therefore necessary if one is to make full use of the new generation of parallel supercomputers. This paper presents a high performance, parallel matrix-vector multiplication algorithm that is particularly well suited to hypercube multiprocessors. For an n x n matrix on p processors, the communication cost of this algorithm is O(n/[radical]p + log(p)), independent of the matrix sparsity pattern. The performance of the algorithm is demonstrated by employing it as the kernel in themore » well-known NAS conjugate gradient benchmark, where a run time of 6.09 seconds was observed. This is the best published performance on this benchmark achieved to date using a massively parallel supercomputer.« less
Design tool for multiprocessor scheduling and evaluation of iterative dataflow algorithms

NASA Technical Reports Server (NTRS)

Jones, Robert L., III

1995-01-01

A graph-theoretic design process and software tool is defined for selecting a multiprocessing scheduling solution for a class of computational problems. The problems of interest are those that can be described with a dataflow graph and are intended to be executed repetitively on a set of identical processors. Typical applications include signal processing and control law problems. Graph-search algorithms and analysis techniques are introduced and shown to effectively determine performance bounds, scheduling constraints, and resource requirements. The software tool applies the design process to a given problem and includes performance optimization through the inclusion of additional precedence constraints among the schedulable tasks.
Parallel computation with the force

NASA Technical Reports Server (NTRS)

Jordan, H. F.

1985-01-01

A methodology, called the force, supports the construction of programs to be executed in parallel by a force of processes. The number of processes in the force is unspecified, but potentially very large. The force idea is embodied in a set of macros which produce multiproceossor FORTRAN code and has been studied on two shared memory multiprocessors of fairly different character. The method has simplified the writing of highly parallel programs within a limited class of parallel algorithms and is being extended to cover a broader class. The individual parallel constructs which comprise the force methodology are discussed. Of central concern are their semantics, implementation on different architectures and performance implications.
A partitioning strategy for nonuniform problems on multiprocessors

NASA Technical Reports Server (NTRS)

Berger, M. J.; Bokhari, S.

1985-01-01

The partitioning of a problem on a domain with unequal work estimates in different subddomains is considered in a way that balances the work load across multiple processors. Such a problem arises for example in solving partial differential equations using an adaptive method that places extra grid points in certain subregions of the domain. A binary decomposition of the domain is used to partition it into rectangles requiring equal computational effort. The communication costs of mapping this partitioning onto different microprocessors: a mesh-connected array, a tree machine and a hypercube is then studied. The communication cost expressions can be used to determine the optimal depth of the above partitioning.
Numerical Simulation of the Flow over a Segment-Conical Body on the Basis of Reynolds Equations

NASA Astrophysics Data System (ADS)

Egorov, I. V.; Novikov, A. V.; Palchekovskaya, N. V.

2018-01-01

Numerical simulation was used to study the 3D supersonic flow over a segment-conical body similar in shape to the ExoMars space vehicle. The nonmonotone behavior of the normal force acting on the body placed in a supersonic gas flow was analyzed depending on the angle of attack. The simulation was based on the numerical solution of the unsteady Reynolds-averaged Navier-Stokes equations with a two-parameter differential turbulence model. The solution of the problem was obtained using the in-house solver HSFlow with an efficient parallel algorithm intended for multiprocessor super computers.
Solving optimization problems on computational grids.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Wright, S. J.; Mathematics and Computer Science

2001-05-01

Multiprocessor computing platforms, which have become more and more widely available since the mid-1980s, are now heavily used by organizations that need to solve very demanding computational problems. Parallel computing is now central to the culture of many research communities. Novel parallel approaches were developed for global optimization, network optimization, and direct-search methods for nonlinear optimization. Activity was particularly widespread in parallel branch-and-bound approaches for various problems in combinatorial and network optimization. As the cost of personal computers and low-end workstations has continued to fall, while the speed and capacity of processors and networks have increased dramatically, 'cluster' platforms havemore » become popular in many settings. A somewhat different type of parallel computing platform know as a computational grid (alternatively, metacomputer) has arisen in comparatively recent times. Broadly speaking, this term refers not to a multiprocessor with identical processing nodes but rather to a heterogeneous collection of devices that are widely distributed, possibly around the globe. The advantage of such platforms is obvious: they have the potential to deliver enormous computing power. Just as obviously, however, the complexity of grids makes them very difficult to use. The Condor team, headed by Miron Livny at the University of Wisconsin, were among the pioneers in providing infrastructure for grid computations. More recently, the Globus project has developed technologies to support computations on geographically distributed platforms consisting of high-end computers, storage and visualization devices, and other scientific instruments. In 1997, we started the metaneos project as a collaborative effort between optimization specialists and the Condor and Globus groups. Our aim was to address complex, difficult optimization problems in several areas, designing and implementing the algorithms and the software infrastructure need to solve these problems on computational grids. This article describes some of the results we have obtained during the first three years of the metaneos project. Our efforts have led to development of the runtime support library MW for implementing algorithms with master-worker control structure on Condor platforms. This work is discussed here, along with work on algorithms and codes for integer linear programming, the quadratic assignment problem, and stochastic linear programmming. Our experiences in the metaneos project have shown that cheap, powerful computational grids can be used to tackle large optimization problems of various types. In an industrial or commercial setting, the results demonstrate that one may not have to buy powerful computational servers to solve many of the large problems arising in areas such as scheduling, portfolio optimization, or logistics; the idle time on employee workstations (or, at worst, an investment in a modest cluster of PCs) may do the job. For the optimization research community, our results motivate further work on parallel, grid-enabled algorithms for solving very large problems of other types. The fact that very large problems can be solved cheaply allows researchers to better understand issues of 'practical' complexity and of the role of heuristics.« less
Checkpointing Shared Memory Programs at the Application-level

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bronevetsky, G; Schulz, M; Szwed, P

2004-09-08

Trends in high-performance computing are making it necessary for long-running applications to tolerate hardware faults. The most commonly used approach is checkpoint and restart(CPR)-the state of the computation is saved periodically on disk, and when a failure occurs, the computation is restarted from the last saved state. At present, it is the responsibility of the programmer to instrument applications for CPR. Our group is investigating the use of compiler technology to instrument codes to make them self-checkpointing and self-restarting, thereby providing an automatic solution to the problem of making long-running scientific applications resilient to hardware faults. Our previous work focusedmore » on message-passing programs. In this paper, we describe such a system for shared-memory programs running on symmetric multiprocessors. The system has two components: (i)a pre-compiler for source-to-source modification of applications, and (ii) a runtime system that implements a protocol for coordinating CPR among the threads of the parallel application. For the sake of concreteness, we focus on a non-trivial subset of OpenMP that includes barriers and locks. One of the advantages of this approach is that the ability to tolerate faults becomes embedded within the application itself, so applications become self-checkpointing and self-restarting on any platform. We demonstrate this by showing that our transformed benchmarks can checkpoint and restart on three different platforms (Windows/x86, Linux/x86, and Tru64/Alpha). Our experiments show that the overhead introduced by this approach is usually quite small; they also suggest ways in which the current implementation can be tuned to reduced overheads further.« less
Memory Benchmarks for SMP-Based High Performance Parallel Computers

DOE Office of Scientific and Technical Information (OSTI.GOV)

Yoo, A B; de Supinski, B; Mueller, F

2001-11-20

As the speed gap between CPU and main memory continues to grow, memory accesses increasingly dominates the performance of many applications. The problem is particularly acute for symmetric multiprocessor (SMP) systems, where the shared memory may be accessed concurrently by a group of threads running on separate CPUs. Unfortunately, several key issues governing memory system performance in current systems are not well understood. Complex interactions between the levels of the memory hierarchy, buses or switches, DRAM back-ends, system software, and application access patterns can make it difficult to pinpoint bottlenecks and determine appropriate optimizations, and the situation is even moremore » complex for SMP systems. To partially address this problem, we formulated a set of multi-threaded microbenchmarks for characterizing and measuring the performance of the underlying memory system in SMP-based high-performance computers. We report our use of these microbenchmarks on two important SMP-based machines. This paper has four primary contributions. First, we introduce a microbenchmark suite to systematically assess and compare the performance of different levels in SMP memory hierarchies. Second, we present a new tool based on hardware performance monitors to determine a wide array of memory system characteristics, such as cache sizes, quickly and easily; by using this tool, memory performance studies can be targeted to the full spectrum of performance regimes with many fewer data points than is otherwise required. Third, we present experimental results indicating that the performance of applications with large memory footprints remains largely constrained by memory. Fourth, we demonstrate that thread-level parallelism further degrades memory performance, even for the latest SMPs with hardware prefetching and switch-based memory interconnects.« less
Domain decomposition preconditioners for the spectral collocation method

NASA Technical Reports Server (NTRS)

Quarteroni, Alfio; Sacchilandriani, Giovanni

1988-01-01

Several block iteration preconditioners are proposed and analyzed for the solution of elliptic problems by spectral collocation methods in a region partitioned into several rectangles. It is shown that convergence is achieved with a rate which does not depend on the polynomial degree of the spectral solution. The iterative methods here presented can be effectively implemented on multiprocessor systems due to their high degree of parallelism.
Comparisons between Intel 386 and i486 microprecessors

NASA Technical Reports Server (NTRS)

Liu, Yuan-Kwei

1989-01-01

A quick and preliminary comparison is made between the Intel 386 and i486 microprocessors. The following topics are discussed: the i486 key elements, comparison of instruction set architecture, the i486 on-chip cache characteristics, the i486 multiprocessor support, comparison of performance, comparison of power consumption, comparison of radiation hardening potential, and recommendations for the Space Station Freedom (SSF) Data Management System (DMS).
Cache directory look-up re-use as conflict check mechanism for speculative memory requests

DOEpatents

Ohmacht, Martin

2013-09-10

In a cache memory, energy and other efficiencies can be realized by saving a result of a cache directory lookup for sequential accesses to a same memory address. Where the cache is a point of coherence for speculative execution in a multiprocessor system, with directory lookups serving as the point of conflict detection, such saving becomes particularly advantageous.
Performance analysis of a large-grain dataflow scheduling paradigm

NASA Technical Reports Server (NTRS)

Young, Steven D.; Wills, Robert W.

1993-01-01

A paradigm for scheduling computations on a network of multiprocessors using large-grain data flow scheduling at run time is described and analyzed. The computations to be scheduled must follow a static flow graph, while the schedule itself will be dynamic (i.e., determined at run time). Many applications characterized by static flow exist, and they include real-time control and digital signal processing. With the advent of computer-aided software engineering (CASE) tools for capturing software designs in dataflow-like structures, macro-dataflow scheduling becomes increasingly attractive, if not necessary. For parallel implementations, using the macro-dataflow method allows the scheduling to be insulated from the application designer and enables the maximum utilization of available resources. Further, by allowing multitasking, processor utilizations can approach 100 percent while they maintain maximum speedup. Extensive simulation studies are performed on 4-, 8-, and 16-processor architectures that reflect the effects of communication delays, scheduling delays, algorithm class, and multitasking on performance and speedup gains.

Contention Modeling for Multithreaded Distributed Shared Memory Machines: The Cray XMT

DOE Office of Scientific and Technical Information (OSTI.GOV)

Secchi, Simone; Tumeo, Antonino; Villa, Oreste

Distributed Shared Memory (DSM) machines are a wide class of multi-processor computing systems where a large virtually-shared address space is mapped on a network of physically distributed memories. High memory latency and network contention are two of the main factors that limit performance scaling of such architectures. Modern high-performance computing DSM systems have evolved toward exploitation of massive hardware multi-threading and fine-grained memory hashing to tolerate irregular latencies, avoid network hot-spots and enable high scaling. In order to model the performance of such large-scale machines, parallel simulation has been proved to be a promising approach to achieve good accuracy inmore » reasonable times. One of the most critical factors in solving the simulation speed-accuracy trade-off is network modeling. The Cray XMT is a massively multi-threaded supercomputing architecture that belongs to the DSM class, since it implements a globally-shared address space abstraction on top of a physically distributed memory substrate. In this paper, we discuss the development of a contention-aware network model intended to be integrated in a full-system XMT simulator. We start by measuring the effects of network contention in a 128-processor XMT machine and then investigate the trade-off that exists between simulation accuracy and speed, by comparing three network models which operate at different levels of accuracy. The comparison and model validation is performed by executing a string-matching algorithm on the full-system simulator and on the XMT, using three datasets that generate noticeably different contention patterns.« less
Parallel Processing with Digital Signal Processing Hardware and Software

NASA Technical Reports Server (NTRS)

Swenson, Cory V.

1995-01-01

The assembling and testing of a parallel processing system is described which will allow a user to move a Digital Signal Processing (DSP) application from the design stage to the execution/analysis stage through the use of several software tools and hardware devices. The system will be used to demonstrate the feasibility of the Algorithm To Architecture Mapping Model (ATAMM) dataflow paradigm for static multiprocessor solutions of DSP applications. The individual components comprising the system are described followed by the installation procedure, research topics, and initial program development.
Feasibility study for the implementation of NASTRAN on the ILLIAC 4 parallel processor

NASA Technical Reports Server (NTRS)

Field, E. I.

1975-01-01

The ILLIAC IV, a fourth generation multiprocessor using parallel processing hardware concepts, is operational at Moffett Field, California. Its capability to excel at matrix manipulation, makes the ILLIAC well suited for performing structural analyses using the finite element displacement method. The feasibility of modifying the NASTRAN (NASA structural analysis) computer program to make effective use of the ILLIAC IV was investigated. The characteristics are summarized of the ILLIAC and the ARPANET, a telecommunications network which spans the continent making the ILLIAC accessible to nearly all major industrial centers in the United States. Two distinct approaches are studied: retaining NASTRAN as it now operates on many of the host computers of the ARPANET to process the input and output while using the ILLIAC only for the major computational tasks, and installing NASTRAN to operate entirely in the ILLIAC environment. Though both alternatives offer similar and significant increases in computational speed over modern third generation processors, the full installation of NASTRAN on the ILLIAC is recommended. Specifications are presented for performing that task with manpower estimates and schedules to correspond.
CMOL/CMOS hardware architectures and performance/price for Bayesian memory - The building block of intelligent systems

NASA Astrophysics Data System (ADS)

Zaveri, Mazad Shaheriar

The semiconductor/computer industry has been following Moore's law for several decades and has reaped the benefits in speed and density of the resultant scaling. Transistor density has reached almost one billion per chip, and transistor delays are in picoseconds. However, scaling has slowed down, and the semiconductor industry is now facing several challenges. Hybrid CMOS/nano technologies, such as CMOL, are considered as an interim solution to some of the challenges. Another potential architectural solution includes specialized architectures for applications/models in the intelligent computing domain, one aspect of which includes abstract computational models inspired from the neuro/cognitive sciences. Consequently in this dissertation, we focus on the hardware implementations of Bayesian Memory (BM), which is a (Bayesian) Biologically Inspired Computational Model (BICM). This model is a simplified version of George and Hawkins' model of the visual cortex, which includes an inference framework based on Judea Pearl's belief propagation. We then present a "hardware design space exploration" methodology for implementing and analyzing the (digital and mixed-signal) hardware for the BM. This particular methodology involves: analyzing the computational/operational cost and the related micro-architecture, exploring candidate hardware components, proposing various custom hardware architectures using both traditional CMOS and hybrid nanotechnology - CMOL, and investigating the baseline performance/price of these architectures. The results suggest that CMOL is a promising candidate for implementing a BM. Such implementations can utilize the very high density storage/computation benefits of these new nano-scale technologies much more efficiently; for example, the throughput per 858 mm2 (TPM) obtained for CMOL based architectures is 32 to 40 times better than the TPM for a CMOS based multiprocessor/multi-FPGA system, and almost 2000 times better than the TPM for a PC implementation. We later use this methodology to investigate the hardware implementations of cortex-scale spiking neural system, which is an approximate neural equivalent of BICM based cortex-scale system. The results of this investigation also suggest that CMOL is a promising candidate to implement such large-scale neuromorphic systems. In general, the assessment of such hypothetical baseline hardware architectures provides the prospects for building large-scale (mammalian cortex-scale) implementations of neuromorphic/Bayesian/intelligent systems using state-of-the-art and beyond state-of-the-art silicon structures.
Scalable parallel communications

NASA Technical Reports Server (NTRS)

Maly, K.; Khanna, S.; Overstreet, C. M.; Mukkamala, R.; Zubair, M.; Sekhar, Y. S.; Foudriat, E. C.

1992-01-01

Coarse-grain parallelism in networking (that is, the use of multiple protocol processors running replicated software sending over several physical channels) can be used to provide gigabit communications for a single application. Since parallel network performance is highly dependent on real issues such as hardware properties (e.g., memory speeds and cache hit rates), operating system overhead (e.g., interrupt handling), and protocol performance (e.g., effect of timeouts), we have performed detailed simulations studies of both a bus-based multiprocessor workstation node (based on the Sun Galaxy MP multiprocessor) and a distributed-memory parallel computer node (based on the Touchstone DELTA) to evaluate the behavior of coarse-grain parallelism. Our results indicate: (1) coarse-grain parallelism can deliver multiple 100 Mbps with currently available hardware platforms and existing networking protocols (such as Transmission Control Protocol/Internet Protocol (TCP/IP) and parallel Fiber Distributed Data Interface (FDDI) rings); (2) scale-up is near linear in n, the number of protocol processors, and channels (for small n and up to a few hundred Mbps); and (3) since these results are based on existing hardware without specialized devices (except perhaps for some simple modifications of the FDDI boards), this is a low cost solution to providing multiple 100 Mbps on current machines. In addition, from both the performance analysis and the properties of these architectures, we conclude: (1) multiple processors providing identical services and the use of space division multiplexing for the physical channels can provide better reliability than monolithic approaches (it also provides graceful degradation and low-cost load balancing); (2) coarse-grain parallelism supports running several transport protocols in parallel to provide different types of service (for example, one TCP handles small messages for many users, other TCP's running in parallel provide high bandwidth service to a single application); and (3) coarse grain parallelism will be able to incorporate many future improvements from related work (e.g., reduced data movement, fast TCP, fine-grain parallelism) also with near linear speed-ups.
Fault tolerant hypercube computer system architecture

NASA Technical Reports Server (NTRS)

Madan, Herb S. (Inventor); Chow, Edward (Inventor)

1989-01-01

A fault-tolerant multiprocessor computer system of the hypercube type comprising a hierarchy of computers of like kind which can be functionally substituted for one another as necessary is disclosed. Communication between the working nodes is via one communications network while communications between the working nodes and watch dog nodes and load balancing nodes higher in the structure is via another communications network separate from the first. A typical branch of the hierarchy reporting to a master node or host computer comprises, a plurality of first computing nodes; a first network of message conducting paths for interconnecting the first computing nodes as a hypercube. The first network provides a path for message transfer between the first computing nodes; a first watch dog node; and a second network of message connecting paths for connecting the first computing nodes to the first watch dog node independent from the first network, the second network provides an independent path for test message and reconfiguration affecting transfers between the first computing nodes and the first switch watch dog node. There is additionally, a plurality of second computing nodes; a third network of message conducting paths for interconnecting the second computing nodes as a hypercube. The third network provides a path for message transfer between the second computing nodes; a fourth network of message conducting paths for connecting the second computing nodes to the first watch dog node independent from the third network. The fourth network provides an independent path for test message and reconfiguration affecting transfers between the second computing nodes and the first watch dog node; and a first multiplexer disposed between the first watch dog node and the second and fourth networks for allowing the first watch dog node to selectively communicate with individual ones of the computing nodes through the second and fourth networks; as well as, a second watch dog node operably connected to the first multiplexer whereby the second watch dog node can selectively communicate with individual ones of the computing nodes through the second and fourth networks. The branch is completed by a first load balancing node; and a second multiplexer connected between the first load balancing node and the first and second watch dog nodes, allowing the first load balancing node to selectively communicate with the first and second watch dog nodes.
The DANTE Boltzmann transport solver: An unstructured mesh, 3-D, spherical harmonics algorithm compatible with parallel computer architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

McGhee, J.M.; Roberts, R.M.; Morel, J.E.

1997-06-01

A spherical harmonics research code (DANTE) has been developed which is compatible with parallel computer architectures. DANTE provides 3-D, multi-material, deterministic, transport capabilities using an arbitrary finite element mesh. The linearized Boltzmann transport equation is solved in a second order self-adjoint form utilizing a Galerkin finite element spatial differencing scheme. The core solver utilizes a preconditioned conjugate gradient algorithm. Other distinguishing features of the code include options for discrete-ordinates and simplified spherical harmonics angular differencing, an exact Marshak boundary treatment for arbitrarily oriented boundary faces, in-line matrix construction techniques to minimize memory consumption, and an effective diffusion based preconditioner formore » scattering dominated problems. Algorithm efficiency is demonstrated for a massively parallel SIMD architecture (CM-5), and compatibility with MPP multiprocessor platforms or workstation clusters is anticipated.« less
Validation of CFD/Heat Transfer Software for Turbine Blade Analysis

NASA Technical Reports Server (NTRS)

Kiefer, Walter D.

2004-01-01

I am an intern in the Turbine Branch of the Turbomachinery and Propulsion Systems Division. The division is primarily concerned with experimental and computational methods of calculating heat transfer effects of turbine blades during operation in jet engines and land-based power systems. These include modeling flow in internal cooling passages and film cooling, as well as calculating heat flux and peak temperatures to ensure safe and efficient operation. The branch is research-oriented, emphasizing the development of tools that may be used by gas turbine designers in industry. The branch has been developing a computational fluid dynamics (CFD) and heat transfer code called GlennHT to achieve the computational end of this analysis. The code was originally written in FORTRAN 77 and run on Silicon Graphics machines. However the code has been rewritten and compiled in FORTRAN 90 to take advantage of more modem computer memory systems. In addition the branch has made a switch in system architectures from SGI's to Linux PC's. The newly modified code therefore needs to be tested and validated. This is the primary goal of my internship. To validate the GlennHT code, it must be run using benchmark fluid mechanics and heat transfer test cases, for which there are either analytical solutions or widely accepted experimental data. From the solutions generated by the code, comparisons can be made to the correct solutions to establish the accuracy of the code. To design and create these test cases, there are many steps and programs that must be used. Before a test case can be run, pre-processing steps must be accomplished. These include generating a grid to describe the geometry, using a software package called GridPro. Also various files required by the GlennHT code must be created including a boundary condition file, a file for multi-processor computing, and a file to describe problem and algorithm parameters. A good deal of this internship will be to become familiar with these programs and the structure of the GlennHT code. Additional information is included in the original extended abstract.
A Model and Algorithms For a Software Evolution Control System

DTIC Science & Technology

1993-12-01

dynamic scheduling approaches can be found in [67). Task scheduling can also be characterized as preemptive and nonpreemptive . A task is preemptive ...is NP-hard for both the preemptive and nonpreemptive cases [671 [84). Scheduling nonpreemptive tasks with arbitrary ready times is NP-hard in both...the preemptive and nonpreemptive cases [671 [841. Scheduling nonpreemptive tasks with arbitrary ready times is NP-hard in both multiprocessor and
Information Processing Research

DTIC Science & Technology

1988-01-01

the Hitech chess machine, which achieves its success from parallelism in the right places. Hitech has now reached a National rating of 2359, making it...outset that success depended on building real systems and subjecting them to use by a large number of faculty and students within the Department. We...central server workstations each acting as a host for a Warp machine, and a few Warp multiprocessors. The command interpreter is executed in Lisp on
Communications Patterns in a Symbolic Multiprocessor.

DTIC Science & Technology

1987-06-01

instruction references that Multilisp programs make. The cache hit ratio is greatest when instruction references have a high degree of -- locality. Another...future touches hit an undetermined future. N, The only exception is Consim, in which one third of future touches hit unde- termined futures. Task...Cambridge, MA, June 1985. [52] S. Sugimoto, K. Agusa, K. Tabata , and Y. Ohno. A multi-microprocessor system for concurrent Lisp. In Proceedings of
Principles for problem aggregation and assignment in medium scale multiprocessors

NASA Technical Reports Server (NTRS)

Nicol, David M.; Saltz, Joel H.

1987-01-01

One of the most important issues in parallel processing is the mapping of workload to processors. This paper considers a large class of problems having a high degree of potential fine grained parallelism, and execution requirements that are either not predictable, or are too costly to predict. The main issues in mapping such a problem onto medium scale multiprocessors are those of aggregation and assignment. We study a method of parameterized aggregation that makes few assumptions about the workload. The mapping of aggregate units of work onto processors is uniform, and exploits locality of workload intensity to balance the unknown workload. In general, a finer aggregate granularity leads to a better balance at the price of increased communication/synchronization costs; the aggregation parameters can be adjusted to find a reasonable granularity. The effectiveness of this scheme is demonstrated on three model problems: an adaptive one-dimensional fluid dynamics problem with message passing, a sparse triangular linear system solver on both a shared memory and a message-passing machine, and a two-dimensional time-driven battlefield simulation employing message passing. Using the model problems, the tradeoffs are studied between balanced workload and the communication/synchronization costs. Finally, an analytical model is used to explain why the method balances workload and minimizes the variance in system behavior.
NASA Tech Briefs, October 2004

NASA Technical Reports Server (NTRS)

2004-01-01

Topics include: Relative-Motion Sensors and Actuators for Two Optical Tables; Improved Position Sensor for Feedback Control of Levitation; Compact Tactile Sensors for Robot Fingers; Improved Ion-Channel Biosensors; Suspended-Patch Antenna With Inverted, EM-Coupled Feed; System Would Predictively Preempt Traffic Lights for Emergency Vehicles; Optical Position Encoders for High or Low Temperatures; Inter-Valence-Subband/Conduction-Band-Transport IR Detectors; Additional Drive Circuitry for Piezoelectric Screw Motors; Software for Use with Optoelectronic Measuring Tool; Coordinating Shared Activities; Software Reduces Radio-Interference Effects in Radar Data; Using Iron to Treat Chlorohydrocarbon-Contaminated Soil; Thermally Insulating, Kinematic Tensioned-Fiber Suspension; Back Actuators for Segmented Mirrors and Other Applications; Mechanism for Self-Reacted Friction Stir Welding; Lightweight Exoskeletons with Controllable Actuators; Miniature Robotic Submarine for Exploring Harsh Environments; Electron-Spin Filters Based on the Rashba Effect; Diffusion-Cooled Tantalum Hot-Electron Bolometer Mixers; Tunable Optical True-Time Delay Devices Would Exploit EIT; Fast Query-Optimized Kernel-Machine Classification; Indentured Parts List Maintenance and Part Assembly Capture Tool - IMPACT; An Architecture for Controlling Multiple Robots; Progress in Fabrication of Rocket Combustion Chambers by VPS; CHEM-Based Self-Deploying Spacecraft Radar Antennas; Scalable Multiprocessor for High-Speed Computing in Space; and Simple Systems for Detecting Spacecraft Meteoroid Punctures.
Dynamic resource allocation in a hierarchical multiprocessor system: A preliminary study

NASA Technical Reports Server (NTRS)

Ngai, Tin-Fook

1986-01-01

An integrated system approach to dynamic resource allocation is proposed. Some of the problems in dynamic resource allocation and the relationship of these problems to system structures are examined. A general dynamic resource allocation scheme is presented. A hierarchial system architecture which dynamically maps between processor structure and programs at multiple levels of instantiations is described. Simulation experiments were conducted to study dynamic resource allocation on the proposed system. Preliminary evaluation based on simple dynamic resource allocation algorithms indicates that with the proposed system approach, the complexity of dynamic resource management could be significantly reduced while achieving reasonable effective dynamic resource allocation.
Dynamic Load Balancing for Grid Partitioning on a SP-2 Multiprocessor: A Framework

NASA Technical Reports Server (NTRS)

Sohn, Andrew; Simon, Horst; Lasinski, T. A. (Technical Monitor)

1994-01-01

Computational requirements of full scale computational fluid dynamics change as computation progresses on a parallel machine. The change in computational intensity causes workload imbalance of processors, which in turn requires a large amount of data movement at runtime. If parallel CFD is to be successful on a parallel or massively parallel machine, balancing of the runtime load is indispensable. Here a framework is presented for dynamic load balancing for CFD applications, called Jove. One processor is designated as a decision maker Jove while others are assigned to computational fluid dynamics. Processors running CFD send flags to Jove in a predetermined number of iterations to initiate load balancing. Jove starts working on load balancing while other processors continue working with the current data and load distribution. Jove goes through several steps to decide if the new data should be taken, including preliminary evaluate, partition, processor reassignment, cost evaluation, and decision. Jove running on a single EBM SP2 node has been completely implemented. Preliminary experimental results show that the Jove approach to dynamic load balancing can be effective for full scale grid partitioning on the target machine IBM SP2.
Dynamic Load Balancing For Grid Partitioning on a SP-2 Multiprocessor: A Framework

NASA Technical Reports Server (NTRS)

Sohn, Andrew; Simon, Horst; Lasinski, T. A. (Technical Monitor)

1994-01-01

Computational requirements of full scale computational fluid dynamics change as computation progresses on a parallel machine. The change in computational intensity causes workload imbalance of processors, which in turn requires a large amount of data movement at runtime. If parallel CFD is to be successful on a parallel or massively parallel machine, balancing of the runtime load is indispensable. Here a framework is presented for dynamic load balancing for CFD applications, called Jove. One processor is designated as a decision maker Jove while others are assigned to computational fluid dynamics. Processors running CFD send flags to Jove in a predetermined number of iterations to initiate load balancing. Jove starts working on load balancing while other processors continue working with the current data and load distribution. Jove goes through several steps to decide if the new data should be taken, including preliminary evaluate, partition, processor reassignment, cost evaluation, and decision. Jove running on a single IBM SP2 node has been completely implemented. Preliminary experimental results show that the Jove approach to dynamic load balancing can be effective for full scale grid partitioning on the target machine IBM SP2.
Optical backplane interconnect switch for data processors and computers

NASA Technical Reports Server (NTRS)

Hendricks, Herbert D.; Benz, Harry F.; Hammer, Jacob M.

1989-01-01

An optoelectronic integrated device design is reported which can be used to implement an all-optical backplane interconnect switch. The switch is sized to accommodate an array of processors and memories suitable for direct replacement into the basic avionic multiprocessor backplane. The optical backplane interconnect switch is also suitable for direct replacement of the PI bus traffic switch and at the same time, suitable for supporting pipelining of the processor and memory. The 32 bidirectional switchable interconnects are configured with broadcast capability for controls, reconfiguration, and messages. The approach described here can handle a serial interconnection of data processors or a line-to-link interconnection of data processors. An optical fiber demonstration of this approach is presented.
From Wheatstone to Cameron and beyond: overview in 3-D and 4-D imaging technology

NASA Astrophysics Data System (ADS)

Gilbreath, G. Charmaine

2012-02-01

This paper reviews three-dimensional (3-D) and four-dimensional (4-D) imaging technology, from Wheatstone through today, with some prognostications for near future applications. This field is rich in variety, subject specialty, and applications. A major trend, multi-view stereoscopy, is moving the field forward to real-time wide-angle 3-D reconstruction as breakthroughs in parallel processing and multi-processor computers enable very fast processing. Real-time holography meets 4-D imaging reconstruction at the goal of achieving real-time, interactive, 3-D imaging. Applications to telesurgery and telemedicine as well as to the needs of the defense and intelligence communities are also discussed.
Programming parallel architectures: The BLAZE family of languages

NASA Technical Reports Server (NTRS)

Mehrotra, Piyush

1988-01-01

Programming multiprocessor architectures is a critical research issue. An overview is given of the various approaches to programming these architectures that are currently being explored. It is argued that two of these approaches, interactive programming environments and functional parallel languages, are particularly attractive since they remove much of the burden of exploiting parallel architectures from the user. Also described is recent work by the author in the design of parallel languages. Research on languages for both shared and nonshared memory multiprocessors is described, as well as the relations of this work to other current language research projects.
Automatic Data Partitioning on Distributed Memory Multiprocessors

DTIC Science & Technology

1990-10-01

DISTRIBUTED MEMORY MULTIPROCESSORS D 1"’ 1 C . Manish Gupta NOV,1 41990.NOV 1 41990m Prithviraj Banerjee D Coordinated Science Laboratory College of...developed on the partitioning of arrays can as well be applied to other programming languages, such as C . 3 The rest of this paper is organized as follows...value 1, as in Fortran. a) N= 4, N 2 = 1: f(i) = J,(j) = 03 b) =Ni 1, N 2 =4: fA(i) =, f() - c ) NI 2, X) 2: f()=[., f2(j) = [L-.j d) N 1 , N 2 =4: fA(i

Some links on this page may take you to non-federal websites. Their policies may differ from this site.