Science.gov

Sample records for shared memory multiprocessors

  1. Shared versus distributed memory multiprocessors

    NASA Technical Reports Server (NTRS)

    Jordan, Harry F.

    1991-01-01

    The question of whether multiprocessors should have shared or distributed memory has attracted a great deal of attention. Some researchers argue strongly for building distributed memory machines, while others argue just as strongly for programming shared memory multiprocessors. A great deal of research is underway on both types of parallel systems. Special emphasis is placed on systems with a very large number of processors for computation-intensive tasks, and research and implementation trends are considered. It appears likely that the two types of systems will converge to a common form for large-scale multiprocessors.

  2. Parallel language support on shared memory multiprocessors

    SciTech Connect

    Sah, A.

    1991-01-01

    The study of general purpose parallel computing requires efficient and inexpensive platforms for parallel program execution. This helps in ascertaining tradeoff choices between hardware complexity and software solutions for massively parallel systems design. In this paper, the authors present an implementation of an efficient parallel execution model on shared memory multiprocessors based on a Threaded Abstract Machine. The authors discuss a k-way generalized locking strategy suitable for this model. The authors study the performance gains obtained by a queuing strategy which uses multiple queues with reduced access contention. The authors also present performance models in shared memory machines, related to lock contention and serialization in shared memory allocation. A bin-based memory management technique which reduces this serialization is presented. These issues are critical for obtaining an efficient parallel execution environment.
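
    A minimal C sketch of such a bin-based allocator (illustrative, not the record's code; names and size classes are assumptions): each size class gets its own lock and free list, so concurrent allocations of different classes never serialize on one global lock.

      /* Hedged sketch: per-bin locks confine allocation serialization. */
      #include <pthread.h>
      #include <stdlib.h>

      #define NBINS 8                       /* size classes 16, 32, ..., 2048 */

      typedef struct block { struct block *next; } block_t;

      typedef struct {
          pthread_mutex_t lock;             /* contention confined to one bin */
          block_t *free_list;
      } bin_t;

      static bin_t bins[NBINS];

      void bins_init(void) {
          for (int i = 0; i < NBINS; i++) {
              pthread_mutex_init(&bins[i].lock, NULL);
              bins[i].free_list = NULL;
          }
      }

      static int bin_index(size_t size) {   /* smallest class that fits */
          int i = 0;
          for (size_t cls = 16; cls < size && i < NBINS - 1; cls <<= 1)
              i++;
          return i;
      }

      void *bin_alloc(size_t size) {
          int i = bin_index(size);
          pthread_mutex_lock(&bins[i].lock);
          block_t *b = bins[i].free_list;
          if (b) bins[i].free_list = b->next;
          pthread_mutex_unlock(&bins[i].lock);
          /* an empty bin falls back to the heap for a fresh block */
          return b ? (void *)b : malloc((size_t)16 << i);
      }

      void bin_free(void *p, size_t size) {
          int i = bin_index(size);
          block_t *b = p;
          pthread_mutex_lock(&bins[i].lock);
          b->next = bins[i].free_list;      /* push onto this bin's list */
          bins[i].free_list = b;
          pthread_mutex_unlock(&bins[i].lock);
      }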

  3. Efficient ICCG on a shared memory multiprocessor

    NASA Technical Reports Server (NTRS)

    Hammond, Steven W.; Schreiber, Robert

    1989-01-01

    Different approaches are discussed for exploiting parallelism in the ICCG (Incomplete Cholesky Conjugate Gradient) method for solving large sparse symmetric positive definite systems of equations on a shared memory parallel computer. Techniques for efficiently solving triangular systems and computing sparse matrix-vector products are explored. Three methods for scheduling the tasks in solving triangular systems are implemented on the Sequent Balance 21000. Sample problems that are representative of a large class of problems solved using iterative methods are used. We show that a static analysis to determine data dependences in the triangular solve can greatly improve its parallel efficiency. We also show that ignoring symmetry and storing the whole matrix can reduce solution time substantially.
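
    One common way to realize the static dependence analysis described above is level scheduling: rows of the triangular factor are grouped into levels so that each row depends only on rows in earlier levels, and all rows within a level can be solved in parallel. A hedged C sketch (the argument names and the CSR layout, with the diagonal stored last in each row, are assumptions):

      /* Level-scheduled solve of Lx = b for lower-triangular L in CSR.
         Rows in one level have no mutual dependences; compile with
         -fopenmp, or the loops simply run serially. */
      void level_solve(int nlevels, const int *level_ptr, const int *level_rows,
                       const int *rowptr, const int *col, const double *val,
                       const double *b, double *x)
      {
          for (int l = 0; l < nlevels; l++) {
              /* level l covers rows level_rows[level_ptr[l] .. level_ptr[l+1]-1] */
              #pragma omp parallel for schedule(static)
              for (int k = level_ptr[l]; k < level_ptr[l + 1]; k++) {
                  int i = level_rows[k];
                  double s = b[i];
                  int j = rowptr[i];
                  for (; j < rowptr[i + 1] - 1; j++)
                      s -= val[j] * x[col[j]];  /* set in an earlier level */
                  x[i] = s / val[j];            /* diagonal entry is last */
              }
          }
      }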

  4. A robot arm simulation with a shared memory multiprocessor machine

    NASA Technical Reports Server (NTRS)

    Kim, Sung-Soo; Chuang, Li-Ping

    1989-01-01

    A parallel processing scheme for a single chain robot arm is presented for high speed computation on a shared memory multiprocessor. A recursive formulation that is derived from a virtual work form of the d'Alembert equations of motion is utilized for robot arm dynamics. A joint drive system that consists of a motor rotor and gears is included in the arm dynamics model, in order to take into account gyroscopic effects due to the spinning of the rotor. The fine grain parallelism of mechanical and control subsystem models is exploited, based on independent computation associated with bodies, joint drive systems, and controllers. Efficiency and effectiveness of the parallel scheme are demonstrated through simulations of a telerobotic manipulator arm. Two different mechanical subsystem models, i.e., with and without gyroscopic effects, are compared, to show the trade-off between efficiency and accuracy.

  5. MPF: A portable message passing facility for shared memory multiprocessors

    NASA Technical Reports Server (NTRS)

    Malony, Allen D.; Reed, Daniel A.; Mcguire, Patrick J.

    1987-01-01

    The design, implementation, and performance evaluation of a message passing facility (MPF) for shared memory multiprocessors are presented. The MPF is based on a message passing model conceptually similar to conversations. Participants (parallel processors) can enter or leave a conversation at any time. The message passing primitives for this model are implemented as a portable library of C function calls. The MPF is currently operational on a Sequent Balance 21000, and several parallel applications were developed and tested. Several simple benchmark programs are presented to establish interprocess communication performance for common patterns of interprocess communication. Finally, performance figures are presented for two parallel applications: linear system solution and iterative solution of partial differential equations.
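
    The flavor of such a library can be suggested with a bounded message channel in shared memory (a hedged sketch, not MPF's actual interface; for separate processes the mutex and condition variables would need PTHREAD_PROCESS_SHARED attributes):

      #include <pthread.h>
      #include <string.h>

      #define QCAP 64                /* messages buffered per channel */
      #define MSGMAX 256             /* bytes per message (illustrative) */

      typedef struct {
          pthread_mutex_t lock;
          pthread_cond_t  not_full, not_empty;
          size_t head, tail, count;
          size_t len[QCAP];
          char   buf[QCAP][MSGMAX];
      } channel_t;

      void chan_init(channel_t *c) {
          pthread_mutex_init(&c->lock, NULL);
          pthread_cond_init(&c->not_full, NULL);
          pthread_cond_init(&c->not_empty, NULL);
          c->head = c->tail = c->count = 0;
      }

      void chan_send(channel_t *c, const void *msg, size_t len) {
          pthread_mutex_lock(&c->lock);
          while (c->count == QCAP)                 /* block until space */
              pthread_cond_wait(&c->not_full, &c->lock);
          memcpy(c->buf[c->tail], msg, len);
          c->len[c->tail] = len;
          c->tail = (c->tail + 1) % QCAP;
          c->count++;
          pthread_cond_signal(&c->not_empty);
          pthread_mutex_unlock(&c->lock);
      }

      size_t chan_recv(channel_t *c, void *msg) {
          pthread_mutex_lock(&c->lock);
          while (c->count == 0)                    /* block until a message */
              pthread_cond_wait(&c->not_empty, &c->lock);
          size_t len = c->len[c->head];
          memcpy(msg, c->buf[c->head], len);
          c->head = (c->head + 1) % QCAP;
          c->count--;
          pthread_cond_signal(&c->not_full);
          pthread_mutex_unlock(&c->lock);
          return len;
      }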

  6. Parallelization of NAS Benchmarks for Shared Memory Multiprocessors

    NASA Technical Reports Server (NTRS)

    Waheed, Abdul; Yan, Jerry C.; Saini, Subhash (Technical Monitor)

    1998-01-01

    This paper presents our experiences parallelizing the sequential implementation of the NAS benchmarks using compiler directives on the SGI Origin2000 distributed shared memory (DSM) system. Porting existing applications to new high performance parallel and distributed computing platforms is a challenging task. Ideally, a user develops a sequential version of the application, leaving the task of porting to new generations of high performance computing systems to parallelization tools and compilers. Due to the simplicity of programming shared-memory multiprocessors, compiler developers have provided various facilities to allow users to exploit parallelism. Native compilers on the SGI Origin2000 support multiprocessing directives that allow users to exploit loop-level parallelism in their programs. Additionally, supporting tools can accomplish this process automatically and present the results of parallelization to the users. We experimented with these compiler directives and supporting tools by parallelizing the sequential implementation of the NAS benchmarks. Results reported in this paper indicate that with minimal effort, the performance gain is comparable with that of the hand-parallelized, carefully optimized, message-passing implementations of the same benchmarks.
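
    The multiprocessing directives in the record apply to Fortran loops; the same loop-level parallelism expressed with an OpenMP directive in C looks like this hedged sketch (the kernel is illustrative, not taken from the benchmarks):

      /* Iterations are independent, so the directive lets the runtime
         distribute them across the processors of the DSM system. */
      void smooth(int n, double *restrict u, const double *restrict r) {
          #pragma omp parallel for schedule(static)
          for (int i = 1; i < n - 1; i++)
              u[i] += 0.5 * (r[i - 1] - 2.0 * r[i] + r[i + 1]);
      }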

  7. Paging tradeoffs in distributed-shared-memory multiprocessors

    SciTech Connect

    Burger, D.C.; Hyder, R.S.; Miller, B.P.; Wood, D.A.

    1994-12-31

    Massively parallel processors have begun using commodity operating systems that support demand-paged virtual memory. To evaluate the utility of virtual memory, the authors measured the behavior of seven shared memory parallel application programs on a simulated distributed-shared-memory machine. The results (1) confirm the importance of gang CPU scheduling, (2) show that a page-faulting processor should spin rather than invoke a parallel context switch, (3) show that the parallel programs frequently touch most of their data, and (4) indicate that memory, not just CPUs, must be "gang scheduled". Overall, the experiments demonstrate that demand paging has limited value on current parallel machines because of the applications' synchronization and memory reference patterns and the machines' high page-fault and parallel-context-switch overheads.

  8. Fast, contention-free combining tree barriers for shared-memory multiprocessors

    SciTech Connect

    Scott, M.L.; Mellor-Crummey, J.M.

    1994-08-01

    In a previous article, Gupta and Hill introduced an adaptive combining tree algorithm for busy-wait barrier synchronization on shared-memory multiprocessors. The intent of the algorithm was to achieve a barrier in logarithmic time when processes arrive simultaneously, and in constant time after the last arrival when arrival times are skewed. A fuzzy version of the algorithm allows a process to perform useful work between the point at which it notifies other processes of its arrival and the point at which it waits for all other processes to arrive. Unfortunately, adaptive combining tree barriers as originally devised perform a large amount of work at each node of the tree, including the acquisition and release of locks. They also perform an unbounded number of accesses to nonlocal locations, inducing large amounts of memory and interconnect contention. We present new adaptive combining tree barriers that eliminate these problems. We compare the performance of the new algorithms to that of other fast barriers on a 64-node BBN Butterfly 1 multiprocessor, a 35-node BBN TC2000, and a 126-node KSR 1. The results reveal scenarios in which our algorithms outperform all known alternatives, and suggest that both adaptation and the combination of fuzziness with tree-style synchronization will be of increasing importance on future generations of shared-memory multiprocessors.
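
    The non-adaptive starting point for these algorithms can be sketched with C11 atomics: arrivals combine up a small-fan-in tree, bounding contention at any one location, and the last arrival flips a sense flag on which the others spin in their caches. The record's adaptive and fuzzy barriers refine this basic structure (layout and names here are assumptions):

      #include <stdatomic.h>
      #include <stddef.h>

      typedef struct node {
          atomic_int count;            /* arrivals so far in this episode */
          int fanin;                   /* arrivals expected at this node */
          struct node *parent;         /* NULL at the root */
      } node_t;

      static atomic_int global_sense;  /* flipped once per barrier episode */

      void barrier(node_t *leaf, int *local_sense) {
          int sense = !*local_sense;
          *local_sense = sense;

          /* combine: only the last arrival at each node climbs further */
          node_t *n = leaf;
          while (atomic_fetch_add(&n->count, 1) == n->fanin - 1) {
              atomic_store(&n->count, 0);       /* reset for next episode */
              if (n->parent == NULL) {          /* last arrival overall */
                  atomic_store(&global_sense, sense);
                  return;
              }
              n = n->parent;
          }
          while (atomic_load(&global_sense) != sense)
              ;                                 /* spin on a cached flag */
      }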

  9. Implementation of two projection methods on a shared memory multiprocessor - DEC VAX 6240

    NASA Technical Reports Server (NTRS)

    Kamath, C.; Weeratunga, S.

    1990-01-01

    The relative performance of two iterative schemes, based on projection techniques, is compared on a shared memory multiprocessor - VAX 6240. The CG accelerated Block-SSOR method and the CG accelerated Symmetric-Kaczmarz method are considered for the solution of large sparse nonsymmetric systems of linear equations. It is shown that the regular structure of many matrices can be exploited by the CG-accelerated Block-SSOR method to provide good speedup in a multiprocessing environment. However, the CG accelerated Symmetric-Kaczmarz method, while being a viable alternative on a scalar machine, is unable to benefit from multiprocessing.

  10. Performance Modeling and Measurement of Parallelized Code for Distributed Shared Memory Multiprocessors

    NASA Technical Reports Server (NTRS)

    Waheed, Abdul; Yan, Jerry

    1998-01-01

    This paper presents a model to evaluate the performance and overhead of parallelizing sequential code using compiler directives for multiprocessing on distributed shared memory (DSM) systems. With the increasing popularity of shared address space architectures, it is essential to understand their performance impact on programs that benefit from shared memory multiprocessing. We present a simple model to characterize the performance of programs that are parallelized using compiler directives for shared memory multiprocessing. We parallelized the sequential implementation of the NAS benchmarks using native Fortran77 compiler directives for an Origin2000, which is a DSM system based on a cache-coherent Non-Uniform Memory Access (ccNUMA) architecture. We report measurement-based performance of these parallelized benchmarks from four perspectives: efficacy of the parallelization process; scalability; parallelization overhead; and comparison with hand-parallelized and -optimized versions of the same benchmarks. Our results indicate that sequential programs can conveniently be parallelized for DSM systems using compiler directives, but realizing performance gains as predicted by the performance model depends primarily on minimizing architecture-specific data locality overhead.

  11. Reader set encoding for directory of shared cache memory in multiprocessor system

    DOEpatents

    Ahn, Daniel; Ceze, Luis H.; Gara, Alan; Ohmacht, Martin; Zhuang, Xiaotong

    2014-06-10

    In a parallel processing system with speculative execution, conflict checking occurs in a directory lookup of a cache memory that is shared by all processors. In each case, the same physical memory address will map to the same set of that cache, no matter which processor originated that access. The directory includes a dynamic reader set encoding, indicating what speculative threads have read a particular line. This reader set encoding is used in conflict checking. A bitset encoding is used to specify particular threads that have read the line.
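
    In C terms the reader-set encoding reduces to a bitset per directory entry (a hedged sketch; the widths and names are assumptions):

      #include <stdint.h>
      #include <stdbool.h>

      typedef uint64_t reader_set_t;        /* one bit per speculative thread */

      static inline void record_read(reader_set_t *dir, int tid) {
          *dir |= (reader_set_t)1 << tid;   /* thread tid has read this line */
      }

      /* a write by tid conflicts if any other thread has read the line */
      static inline bool write_conflicts(reader_set_t dir, int tid) {
          return (dir & ~((reader_set_t)1 << tid)) != 0;
      }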

  12. Shared performance monitor in a multiprocessor system

    DOEpatents

    Chiu, George; Gara, Alan G.; Salapura, Valentina

    2014-12-02

    A performance monitoring unit (PMU) and method for monitoring performance of events occurring in a multiprocessor system. The multiprocessor system comprises a plurality of processor devices, each processor device for generating signals representing occurrences of events in the processor device, and a single shared counter resource for performance monitoring. The performance monitor unit is shared by all processor cores in the multiprocessor system. The PMU is further programmed to monitor event signals issued from non-processor devices.

  13. Sparse Gaussian elimination with controlled fill-in on a shared memory multiprocessor

    NASA Technical Reports Server (NTRS)

    Alaghband, Gita; Jordan, Harry F.

    1989-01-01

    It is shown that in sparse matrices arising from electronic circuits, it is possible to do computations on many diagonal elements simultaneously. A technique for obtaining an ordered compatible set directly from the ordered incompatible table is given. The ordering is based on the Markowitz number of the pivot candidates. This technique generates a set of compatible pivots with the property of generating few fills. A novel heuristic algorithm is presented that combines the idea of an ordered compatible set with a limited binary tree search to generate several sets of compatible pivots in linear time. An elimination set for reducing the matrix is generated and selected on the basis of a minimum Markowitz sum number. The parallel pivoting technique presented is a stepwise algorithm and can be applied to any submatrix of the original matrix. Thus, it is not a preordering of the sparse matrix and is applied dynamically as the decomposition proceeds. Parameters are suggested to obtain a balance between parallelism and fill-ins. Results of applying the proposed algorithms on several large application matrices using the HEP multiprocessor (Kowalik, 1985) are presented and analyzed.

  14. Interleaved synchronous bus access protocol for a shared memory multi-processor system

    SciTech Connect

    Moore, W.T.

    1989-01-10

    A method is described for providing asynchronous processors with inter-processor communication and access to several memory modules over a common bus which includes a first bus and a second bus, comprising: providing clock pulses on the common bus, each pulse having a period; asserting a request signal and placing a priority signal on the common bus; polling the processors during the first period to determine whether the processors request access to the common bus and to determine which one processor has priority; sending a destination address from the one processor to a destination during a second period, the destination being chosen from the processors and the several memory modules; performing one of reading input data and writing output data between the destination and the one processor; multiplexing priority and reading input data signals on the first bus, and multiplexing address and writing output data signals on the second bus; generating poll inhibit signals prior to each reading input data signal and prior to each memory address signal preceding a writing output data operation; and queuing the input data in a first-in-first-out manner for each of the processors when the input data indicates an interprocessor interrupt.

  15. Shared performance monitor in a multiprocessor system

    DOEpatents

    Chiu, George; Gara, Alan G.; Salapura, Valentina

    2012-07-24

    A performance monitoring unit (PMU) and method for monitoring performance of events occurring in a multiprocessor system. The multiprocessor system comprises a plurality of processor devices, each processor device for generating signals representing occurrences of events in the processor device, and a single shared counter resource for performance monitoring. The performance monitor unit is shared by all processor cores in the multiprocessor system. The PMU comprises: a plurality of performance counters, each for counting signals representing occurrences of events from one or more of the plurality of processor units in the multiprocessor system; and a plurality of input devices for receiving the event signals from one or more processor devices of the plurality of processor units, the plurality of input devices programmable to select event signals for receipt by one or more of the plurality of performance counters for counting, wherein the PMU is shared between multiple processing units, or within a group of processors in the multiprocessing system. The PMU is further programmed to monitor event signals issued from non-processor devices.

  16. A general model for memory interference in a multiprocessor system with memory hierarchy

    NASA Technical Reports Server (NTRS)

    Taha, Badie A.; Standley, Hilda M.

    1989-01-01

    The problem of memory interference in a multiprocessor system with a hierarchy of shared buses and memories is addressed. The behavior of the processors is represented by a sequence of memory requests with each followed by a determined amount of processing time. A statistical queuing network model for determining the extent of memory interference in multiprocessor systems with clusters of memory hierarchies is presented. The performance of the system is measured by the expected number of busy memory clusters. The results of the analytic model are compared with simulation results, and the correlation between them is found to be very high.

  17. Preliminary basic performance analysis of the Cedar multiprocessor memory system

    NASA Technical Reports Server (NTRS)

    Gallivan, K.; Jalby, W.; Turner, S.; Veidenbaum, A.; Wijshoff, H.

    1991-01-01

    Some preliminary basic results on the performance of the Cedar multiprocessor memory system are presented. Empirical results are presented and used to calibrate a memory system simulator which is then used to discuss the scalability of the system.

  18. Scalable Triadic Analysis of Large-Scale Graphs: Multi-Core vs. Multi-Processor vs. Multi-Threaded Shared Memory Architectures

    SciTech Connect

    Chin, George; Marquez, Andres; Choudhury, Sutanay; Feo, John T.

    2012-09-01

    Triadic analysis encompasses a useful set of graph mining methods that is centered on the concept of a triad, which is a subgraph of three nodes and the configuration of directed edges across the nodes. Such methods are often applied in the social sciences as well as many other diverse fields. Triadic methods commonly operate on a triad census that counts the number of triads of every possible edge configuration in a graph. Like other graph algorithms, triadic census algorithms do not scale well when graphs reach tens of millions to billions of nodes. To enable the triadic analysis of large-scale graphs, we developed and optimized a triad census algorithm to efficiently execute on shared memory architectures. We will retrace the development and evolution of a parallel triad census algorithm. Over the course of several versions, we continually adapted the code’s data structures and program logic to expose more opportunities to exploit parallelism on shared memory that would translate into improved computational performance. We will recall the critical steps and modifications that occurred during code development and optimization. Furthermore, we will compare the performances of triad census algorithm versions on three specific systems: Cray XMT, HP Superdome, and AMD multi-core NUMA machine. These three systems have shared memory architectures but with markedly different hardware capabilities to manage parallelism.

  19. Program partitioning for NUMA multiprocessor computer systems. [Nonuniform memory access]

    SciTech Connect

    Wolski, R.M.; Feo, J.T.

    1993-11-01

    Program partitioning and scheduling are essential steps in programming non-shared-memory computer systems. Partitioning is the separation of program operations into sequential tasks, and scheduling is the assignment of tasks to processors. To be effective, automatic methods require an accurate representation of the model of computation and the target architecture. Current partitioning methods assume today's most prevalent models -- macro dataflow and a homogeneous/two-level multicomputer system. Because both models are based on communication channels, neither represents well the emerging class of NUMA multiprocessor computer systems consisting of hierarchical read/write memories. Consequently, the partitions generated by extant methods do not execute well on these systems. In this paper, the authors extend the conventional graph representation of the macro-dataflow model to enable mapping heuristics to consider the complex communication options supported by NUMA architectures. They describe two such heuristics. Simulated execution times of program graphs show that the model and heuristics generate higher quality program mappings than current methods for NUMA architectures.

  20. Using Pin as a Memory Reference Generator for Multiprocessor Simulation

    SciTech Connect

    McCurdy, C

    2005-10-22

    In this paper we describe how we have used Pin to generate a multithreaded reference stream for simulation of a multiprocessor on a uniprocessor. We have taken special care to model as accurately as possible the effects of cache coherence protocol state, and lock and barrier synchronization on the performance of multithreaded applications running on multiprocessor hardware. We first describe a simplified version of the algorithm, which uses semaphores to synchronize instrumented application threads and the simulator on every memory reference. We then describe modifications to that algorithm to model the microarchitectural features of the Itanium2 that affect the timing of memory reference issue. An experimental evaluation determines that while cycle-accurate multithreaded simulation is possible using our approach, the use of semaphores has a negative impact on the performance of the simulator.
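
    The simplified algorithm can be pictured as a per-thread handshake (a hedged sketch with illustrative names, not the paper's code); the two kernel-mediated semaphore operations per simulated reference are what make the approach costly:

      #include <semaphore.h>
      #include <stdint.h>

      typedef struct {
          sem_t ref_ready;         /* app thread -> simulator: posted */
          sem_t ref_consumed;      /* simulator -> app thread: proceed */
          uintptr_t addr;
          int is_write;
      } ref_slot_t;                /* sem_init both semaphores to 0 */

      /* called by instrumentation before every memory access */
      void post_reference(ref_slot_t *s, uintptr_t addr, int is_write) {
          s->addr = addr;
          s->is_write = is_write;
          sem_post(&s->ref_ready);      /* hand the reference over */
          sem_wait(&s->ref_consumed);   /* stall until it is simulated */
      }

      /* simulator side: consume one reference from a chosen thread */
      void consume_reference(ref_slot_t *s, uintptr_t *addr, int *is_write) {
          sem_wait(&s->ref_ready);
          *addr = s->addr;
          *is_write = s->is_write;
          sem_post(&s->ref_consumed);
      }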

  1. Low Latency Messages on Distributed Memory Multiprocessors

    DOE PAGES Beta

    Rosing, Matt; Saltz, Joel

    1995-01-01

    This article describes many of the issues in developing an efficient interface for communication on distributed memory machines. Although the hardware component of message latency is less than 1 μs on many distributed memory machines, the software latency associated with sending and receiving typed messages is on the order of 50 μs. The reason for this imbalance is that the software interface does not match the hardware. By changing the interface to match the hardware more closely, applications with fine grained communication can be put on these machines. This article describes several tests performed and many of the issues involved in supporting low latency messages on distributed memory machines.

  2. Low latency messages on distributed memory multiprocessors

    NASA Technical Reports Server (NTRS)

    Rosing, Matthew; Saltz, Joel

    1993-01-01

    Many of the issues in developing an efficient interface for communication on distributed memory machines are described and a portable interface is proposed. Although the hardware component of message latency is less than one microsecond on many distributed memory machines, the software latency associated with sending and receiving typed messages is on the order of 50 microseconds. The reason for this imbalance is that the software interface does not match the hardware. By changing the interface to match the hardware more closely, applications with fine grained communication can be put on these machines. Based on several tests that were run on the iPSC/860, an interface that will better match current distributed memory machines is proposed. The model used in the proposed interface consists of a computation processor and a communication processor on each node. Communication between these processors and other nodes in the system is done through a buffered network. Information that is transmitted is either data or procedures to be executed on the remote processor. The dual processor system is better suited for efficiently handling asynchronous communications than a single processor system. The ability to send either data or procedures provides great flexibility for minimizing message latency, based on the type of communication being performed. The tests performed and the proposed interface are described.

  3. Software Coherence in Multiprocessor Memory Systems. Ph.D. Thesis

    NASA Technical Reports Server (NTRS)

    Bolosky, William Joseph

    1993-01-01

    Processors are becoming faster and multiprocessor memory interconnection systems are not keeping up. Therefore, it is necessary to have threads and the memory they access as near one another as possible. Typically, this involves putting memory or caches with the processors, which gives rise to the problem of coherence: if one processor writes an address, any other processor reading that address must see the new value. This coherence can be maintained by the hardware or with software intervention. Systems of both types have been built in the past; the hardware-based systems tended to outperform the software ones. However, the ratio of processor to interconnect speed is now so high that the extra overhead of the software systems may no longer be significant. This issue is explored both by implementing a software-maintained system and by introducing and using the technique of offline optimal analysis of memory reference traces. It is found that in properly built systems, software-maintained coherence can perform comparably to, or even better than, hardware-maintained coherence. The architectural features necessary for efficient software coherence to be profitable include a small page size, a fast trap mechanism, and the ability to execute instructions while remote memory references are outstanding.
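
    The enabling mechanism, a trap on writes to protected pages, can be sketched with POSIX primitives (hedged; a full system would invalidate or update remote copies inside the handler):

      #define _GNU_SOURCE              /* for MAP_ANONYMOUS on some systems */
      #include <signal.h>
      #include <stdint.h>
      #include <sys/mman.h>
      #include <unistd.h>

      #define NPAGES 1024
      static long PAGE;
      static uint8_t *region;                    /* software-coherent region */
      static volatile sig_atomic_t dirty[NPAGES];

      static void on_write_fault(int sig, siginfo_t *si, void *ctx) {
          (void)sig; (void)ctx;
          uintptr_t p = (uintptr_t)si->si_addr & ~(uintptr_t)(PAGE - 1);
          dirty[(p - (uintptr_t)region) / PAGE] = 1;  /* record the write */
          /* a real protocol would notify other processors here */
          mprotect((void *)p, PAGE, PROT_READ | PROT_WRITE);
      }

      int main(void) {
          PAGE = sysconf(_SC_PAGESIZE);
          region = mmap(NULL, NPAGES * PAGE, PROT_READ,   /* read-only copy */
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          struct sigaction sa = {0};
          sa.sa_sigaction = on_write_fault;
          sa.sa_flags = SA_SIGINFO;
          sigaction(SIGSEGV, &sa, NULL);
          region[5 * PAGE] = 42;      /* traps once, then the store retries */
          return dirty[5] ? 0 : 1;
      }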

  4. Generation-based memory synchronization in a multiprocessor system with weakly consistent memory accesses

    DOEpatents

    Ohmacht, Martin

    2014-09-09

    In a multiprocessor system, a central memory synchronization module coordinates memory synchronization requests responsive to memory access requests in flight, a generation counter, and a reclaim pointer. The central module communicates via point-to-point communication. The module includes a global OR reduce tree for each memory access requesting device, for detecting memory access requests in flight. An interface unit is implemented associated with each processor requesting synchronization. The interface unit includes multiple generation completion detectors. The generation count and reclaim pointer do not pass one another.

  5. Denelcor HEP multiprocessor simulator

    SciTech Connect

    Dunigan, T.H.

    1986-06-01

    The structure and use of a simulator for the Denelcor HEP multiprocessor are described. The simulator provides a multitasking environment for the development of parallel programs in C or FORTRAN using a library of subroutines that simulate the parallel programming constructs available on the HEP, a shared-memory multiprocessor. The simulator also provides a trace file that can be used for debugging, performance analysis, or graphical display. 7 refs., 4 figs.

  6. Multi-ring performance of the Kendall square multiprocessor

    SciTech Connect

    Dunigan, T.H.

    1994-03-01

    Performance of the hierarchical shared-memory system of the Kendall Square Research multiprocessor is measured and characterized. The performance of prefetch is measured. Latency, bandwidth, and contention are analyzed on a 4-ring, 128 processor system. Scalability comparisons are made with other shared-memory and distributed-memory multiprocessors.

  7. Optical RAM-enabled cache memory and optical routing for chip multiprocessors: technologies and architectures

    NASA Astrophysics Data System (ADS)

    Pleros, Nikos; Maniotis, Pavlos; Alexoudi, Theonitsa; Fitsios, Dimitris; Vagionas, Christos; Papaioannou, Sotiris; Vyrsokinos, K.; Kanellos, George T.

    2014-03-01

    The processor-memory performance gap, commonly referred to as the "Memory Wall" problem, is due to the speed mismatch between processor and electronic RAM clock frequencies, forcing current Chip Multiprocessor (CMP) configurations to consume more than 50% of the chip real estate for caching purposes. In this article, we present our recent work spanning from Si-based integrated optical RAM cell architectures up to complete optical cache memory architectures for Chip Multiprocessor configurations. Moreover, we discuss e/o router subsystems with up to Tb/s routing capacity for cache interconnection purposes within CMP configurations, currently pursued within the FP7 PhoxTrot project.

  8. Vienna FORTRAN: A FORTRAN language extension for distributed memory multiprocessors

    NASA Technical Reports Server (NTRS)

    Chapman, Barbara; Mehrotra, Piyush; Zima, Hans

    1991-01-01

    Exploiting the performance potential of distributed memory machines requires a careful distribution of data across the processors. Vienna FORTRAN is a language extension of FORTRAN which provides the user with a wide range of facilities for such mapping of data structures. However, programs in Vienna FORTRAN are written using global data references. Thus, the user has the advantage of a shared memory programming paradigm while explicitly controlling the placement of data. The basic features of Vienna FORTRAN are presented along with a set of examples illustrating the use of these features.

  9. Kendall Square multiprocessor: Early experiences and performance

    SciTech Connect

    Dunigan, T.H.

    1992-04-01

    Initial performance results and early experiences are reported for the Kendall Square Research multiprocessor. The basic architecture of the shared-memory multiprocessor is described, and computational and I/O performance is measured for both serial and parallel programs. Experiences in porting various applications are described.

  10. Recoverable distributed shared virtual memory

    NASA Technical Reports Server (NTRS)

    Wu, Kun-Lung; Fuchs, W. Kent

    1990-01-01

    The problem of rollback recovery in distributed shared virtual memory environments, in which the shared memory is implemented in software in a loosely coupled distributed multicomputer system, is examined. A user-transparent checkpointing recovery scheme and a new twin-page disk storage management technique are presented for implementing recoverable distributed shared virtual memory. The checkpointing scheme can be integrated with the memory coherence protocol for managing the shared virtual memory. The twin-page disk design allows checkpointing to proceed in an incremental fashion without an explicit undo at the time of recovery. The recoverable distributed shared virtual memory allows the system to restart computation from a checkpoint without a global restart.
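
    A hedged C sketch of the twin-page idea (structures are illustrative; a real implementation commits atomically through a single checkpoint generation record rather than per-page flips):

      #include <stdint.h>
      #include <string.h>

      #define PAGESZ 4096

      typedef struct {
          uint8_t twin[2][PAGESZ];   /* two stable copies of each page */
          int     current;           /* which twin holds checkpoint state */
          int     dirty;             /* written since the last checkpoint? */
      } twin_page_t;

      /* write-back never overwrites the checkpointed twin */
      void page_writeback(twin_page_t *p, const uint8_t *mem) {
          memcpy(p->twin[1 - p->current], mem, PAGESZ);
          p->dirty = 1;
      }

      /* committing a checkpoint just flips which twin is current,
         so no undo is ever needed at recovery time */
      void checkpoint_commit(twin_page_t *pages, int n) {
          for (int i = 0; i < n; i++)
              if (pages[i].dirty) {
                  pages[i].current = 1 - pages[i].current;
                  pages[i].dirty = 0;
              }
      }

      /* recovery reads the current twin of every page, nothing more */
      const uint8_t *page_recover(const twin_page_t *p) {
          return p->twin[p->current];
      }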

  11. A single-assignment language in a distributed memory multiprocessor

    NASA Technical Reports Server (NTRS)

    Evripidou, P.; Najjar, W.; Gaudiot, J.-L.

    1989-01-01

    The implementation of the single-assignment programming language SISAL (McGraw et al., 1985) on a Symult 2010 parallel computer is described. The advantages of single-assignment languages over imperative languages in a multiprocessor environment are reviewed; the characteristics of SISAL are summarized; the program-graph generation and dynamic data partitioning procedures are explained; and the application of SISAL in constructing a concurrent iterative multigrid algorithm is discussed in detail and illustrated with diagrams.

  12. Conditional load and store in a shared memory

    DOEpatents

    Blumrich, Matthias A.; Ohmacht, Martin

    2015-02-03

    A method, system and computer program product for implementing load-reserve and store-conditional instructions in a multi-processor computing system. The computing system includes a multitude of processor units and a shared memory cache, and each of the processor units has access to the memory cache. In one embodiment, the method comprises providing the memory cache with a series of reservation registers, and storing in these registers addresses reserved in the memory cache for the processor units as a result of issuing load-reserve requests. In this embodiment, when one of the processor units makes a request to store data in the memory cache using a store-conditional request, the reservation registers are checked to determine if an address in the memory cache is reserved for that processor unit. If an address in the memory cache is reserved for that processor, the data are stored at this address.
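
    Functionally, the reservation registers behave like this hedged C model (the shared cache is assumed to serialize these operations; the names are illustrative):

      #include <stdint.h>
      #include <stdbool.h>

      #define NPROC 16
      #define NO_RESERVATION UINT64_MAX

      static uint64_t reservation[NPROC];   /* one register per processor */

      uint64_t load_reserve(int cpu, const uint64_t *mem, uint64_t addr) {
          reservation[cpu] = addr;          /* remember the reserved address */
          return mem[addr];
      }

      bool store_conditional(int cpu, uint64_t *mem, uint64_t addr, uint64_t v) {
          if (reservation[cpu] != addr)
              return false;                 /* reservation lost: store fails */
          mem[addr] = v;
          for (int p = 0; p < NPROC; p++)   /* a store clears every matching */
              if (reservation[p] == addr)   /* reservation, ours included */
                  reservation[p] = NO_RESERVATION;
          return true;
      }

      /* typical use, e.g. an atomic increment:
         do { v = load_reserve(cpu, mem, a); }
         while (!store_conditional(cpu, mem, a, v + 1)); */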

  13. Combinatorial reliability analysis of multiprocessor computers

    SciTech Connect

    Hwang, K.; Tian-Pong Chang

    1982-12-01

    The authors propose a combinatorial method to evaluate the reliability of multiprocessor computers. Multiprocessor structures are classified as crossbar switch, time-shared buses, and multiport memories. Closed-form reliability expressions are derived via combinatorial path enumeration on the probabilistic-graph representation of a multiprocessor system. The method can analyze the reliability performance of real systems like C.mmp, Tandem 16, and Univac 1100/80. User-oriented performance levels are defined for measuring the performability of degradable multiprocessor systems. For a regularly structured multiprocessor system, it is fast and easy to use this technique for evaluating system reliability with statistically independent component reliabilities. System availability can also be evaluated by this reliability study. 6 references.

  14. Memory access in shared virtual memory

    SciTech Connect

    Berrendorf, R.

    1992-01-01

    Shared virtual memory (SVM) is a virtual memory layer with a single address space on top of a distributed real memory on parallel computers. We examine the behavior and performance of SVM running a parallel program with medium-grained, loop-level parallelism on top of it. A simulator for the underlying parallel architecture can be used to examine the behavior of SVM more deeply. The influence of several parameters, such as the number of processors, page size, cold or warm start, and restricted page replication, is studied.

  15. Memory access in shared virtual memory

    SciTech Connect

    Berrendorf, R.

    1992-09-01

    Shared virtual memory (SVM) is a virtual memory layer with a single address space on top of a distributed real memory on parallel computers. We examine the behavior and performance of SVM running a parallel program with medium-grained, loop-level parallelism on top of it. A simulator for the underlying parallel architecture can be used to examine the behavior of SVM more deeply. The influence of several parameters, such as the number of processors, page size, cold or warm start, and restricted page replication, is studied.

  16. Multiprocessor execution of functional programs

    SciTech Connect

    Goldberg, B. )

    1988-10-01

    Functional languages have recently gained attention as vehicles for programming in a concise and elegant manner. In addition, it has been suggested that functional programming provides a natural methodology for programming multiprocessor computers. This paper describes research that was performed to demonstrate that multiprocessor execution of functional programs on current multiprocessors is feasible, and results in a significant reduction in their execution times. Two implementations of the functional language ALFL were built on commercially available multiprocessors. Alfalfa is an implementation on the Intel iPSC hypercube multiprocessor, and Buckwheat is an implementation on the Encore Multimax shared-memory multiprocessor. Each implementation includes a compiler that performs automatic decomposition of ALFL programs and a run-time system that supports their execution. The compiler is responsible for detecting the inherent parallelism in a program, and decomposing the program into a collection of tasks, called serial combinators, that can be executed in parallel. The abstract machine model supported by Alfalfa and Buckwheat is called heterogeneous graph reduction, which is a hybrid of graph reduction and conventional stack-oriented execution. This model supports parallelism, lazy evaluation, and higher order functions while at the same time making efficient use of the processors in the system. The Alfalfa and Buckwheat runtime systems support dynamic load balancing, interprocessor communication (if required), and storage management. A large number of experiments were performed on Alfalfa and Buckwheat for a variety of programs. The results of these experiments, as well as the conclusions drawn from them, are presented.

  17. Shared Memory Parallelization of an Implicit ADI-type CFD Code

    NASA Technical Reports Server (NTRS)

    Hauser, Th.; Huang, P. G.

    1999-01-01

    A parallelization study designed for ADI-type algorithms is presented using the OpenMP specification for shared-memory multiprocessor programming. Details of optimizations specifically addressed to cache-based computer architectures are described, and performance measurements for the single- and multiprocessor implementations are summarized. The paper demonstrates that optimization of memory access on a cache-based computer architecture controls the performance of the computational algorithm. A hybrid MPI/OpenMP approach is proposed for clusters of shared memory machines to further enhance the parallel performance. The method is applied to develop a new LES/DNS code, named LESTool. A preliminary DNS calculation of a fully developed channel flow at Re(sub tau) = 180 has shown good agreement with existing data.
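
    The shape of an ADI-type sweep under OpenMP can be suggested as follows (a hedged sketch, not LESTool code): the sweep direction carries a recurrence, so the parallel loop runs over the independent lines, and each line's data stays contiguous for cache reuse.

      /* j-sweep: recurrent in j, parallel across independent i-lines */
      void adi_sweep_j(int ni, int nj, double a[ni][nj], const double c[ni][nj])
      {
          #pragma omp parallel for schedule(static)
          for (int i = 0; i < ni; i++)
              for (int j = 1; j < nj; j++)
                  a[i][j] = c[i][j] - 0.5 * a[i][j - 1];
      }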

  18. Solution of large nonlinear quasistatic structural mechanics problems on distributed-memory multiprocessor computers

    SciTech Connect

    Blanford, M.

    1997-12-31

    Most commercially-available quasistatic finite element programs assemble element stiffnesses into a global stiffness matrix, then use a direct linear equation solver to obtain nodal displacements. However, for large problems (greater than a few hundred thousand degrees of freedom), the memory size and computation time required for this approach becomes prohibitive. Moreover, direct solution does not lend itself to the parallel processing needed for today's multiprocessor systems. This talk gives an overview of the iterative solution strategy of JAS3D, the nonlinear large-deformation quasistatic finite element program. Because its architecture is derived from an explicit transient-dynamics code, it does not ever assemble a global stiffness matrix. The author describes the approach he used to implement the solver on multiprocessor computers, and shows examples of problems run on hundreds of processors and more than a million degrees of freedom. Finally, he describes some of the work he is presently doing to address the challenges of iterative convergence for ill-conditioned problems.

  19. A multiprocessor computer simulation model employing a feedback scheduler/allocator for memory space and bandwidth matching and TMR processing

    NASA Technical Reports Server (NTRS)

    Bradley, D. B.; Irwin, J. D.

    1974-01-01

    A computer simulation model for a multiprocessor computer is developed that is useful for studying the problem of matching a multiprocessor's memory space, memory bandwidth, and numbers and speeds of processors with aggregate job set characteristics. The model assumes an input work load of a set of recurrent jobs. The model includes a feedback scheduler/allocator which attempts to improve system performance through higher memory bandwidth utilization by matching individual job requirements for space and bandwidth with space availability and estimates of bandwidth availability at the times of memory allocation. The simulation model includes provisions for specifying precedence relations among the jobs in a job set, and provisions for specifying precedence execution of TMR (Triple Modular Redundant) and SIMPLEX (non-redundant) jobs.

  20. The performance of disk arrays in shared-memory database machines

    NASA Technical Reports Server (NTRS)

    Katz, Randy H.; Hong, Wei

    1993-01-01

    In this paper, we examine how disk arrays and shared memory multiprocessors lead to an effective method for constructing database machines for general-purpose complex query processing. We show that disk arrays can lead to cost-effective storage systems if they are configured from suitably small form-factor disk drives. We introduce the storage system metric data temperature as a way to evaluate how well a disk configuration can sustain its workload, and we show that disk arrays can sustain the same data temperature as a more expensive mirrored-disk configuration. We use the metric to evaluate the performance of disk arrays in XPRS, an operational shared-memory multiprocessor database system being developed at the University of California, Berkeley.

  1. A simple modern correctness condition for a space-based high-performance multiprocessor

    NASA Technical Reports Server (NTRS)

    Probst, David K.; Li, Hon F.

    1992-01-01

    A number of U.S. national programs, including space-based detection of ballistic missile launches, envisage putting significant computing power into space. Given sufficient progress in low-power VLSI, multichip-module packaging and liquid-cooling technologies, we will see design of high-performance multiprocessors for individual satellites. In very high speed implementations, performance depends critically on tolerating large latencies in interprocessor communication; without latency tolerance, performance is limited by the vastly differing time scales in processor and data-memory modules, including interconnect times. The modern approach to tolerating remote-communication cost in scalable, shared-memory multiprocessors is to use a multithreaded architecture, and alter the semantics of shared memory slightly, at the price of forcing the programmer either to reason about program correctness in a relaxed consistency model or to agree to program in a constrained style. The literature on multiprocessor correctness conditions has become increasingly complex, and sometimes confusing, which may hinder its practical application. We propose a simple modern correctness condition for a high-performance, shared-memory multiprocessor; the correctness condition is based on a simple interface between the multiprocessor architecture and the parallel programming system.

  2. Scheduling and process migration in partitioned multiprocessors

    SciTech Connect

    Gait, J. )

    1990-03-01

    A partitioned multiprocessor (PM) has a shared global bus and nonshared local memories. This paper studies a process scheduler, called the two-tier scheduler (TTS), for a PM. In a PM, local scheduling amortizes the cost of loading processes into local memory. Global scheduling migrates processes to balance load. A tunable time quantum is adjusted so that the average process completes execution on the processor on which it is first scheduled, and only relatively long-lived processes are rescheduled globally.

  3. Firefly: A multiprocessor workstation

    SciTech Connect

    Thacker, C.P.; Stewart, L.C.; Satterthwaite, E.H.

    1988-08-01

    Firefly is a shared memory multiprocessor workstation developed at the Digital Equipment Corporation Systems Research Center (SRC). A Firefly system consists of from one to nine VLSI VAX processors, each with a floating point accelerator and a cache. The caches are coherent, so that all processors see a consistent view of main memory. The Firefly runs a software system that emulates the Ultrix system call interface, and in addition provides support for multiprocessing through multiple threads of control in a single address space. Communication is provided uniformly through the use of remote procedure call. The authors describe the goals, hardware, software system, and performance of the Firefly, and discuss the extent to which SRC has been successful in providing software to take advantage of multi-processing.

  4. Performance modeling and measurement of real-time multiprocessors with time-shared buses

    SciTech Connect

    Woodbury, M.H.; Shin, K.G.

    1988-02-01

    A closed queueing network model is constructed to address workload effects on computer performance for a highly reliable unibus multiprocessor used in real-time control. The queueing model consists of multiserver nodes and a nonpreemptive priority queue. Use of this model requires partitioning the workload into task classes. The time average steady-state solution of the queueing model directly produces useful results that are necessary in performance evaluation. The model is experimentally justified with the Fault-Tolerant Multiprocessor (FTMP) located at the NASA AIRLAB. Extensive experiments are performed on FTMP with a synthetic workload generator (SWG) to directly measure performance parameters, such as processor idle time, system bus contention, and task processing times. These measurements determine values for parameters in the queueing model. Experimental and analytic results are then compared.

  5. Performing an allreduce operation using shared memory

    DOEpatents

    Archer, Charles J.; Dozsa, Gabor; Ratterman, Joseph D.; Smith, Brian E.

    2014-06-10

    Methods, apparatus, and products are disclosed for performing an allreduce operation using shared memory that include: receiving, by at least one of a plurality of processing cores on a compute node, an instruction to perform an allreduce operation; establishing, by the core that received the instruction, a job status object for specifying a plurality of shared memory allreduce work units, the plurality of shared memory allreduce work units together performing the allreduce operation on the compute node; determining, by an available core on the compute node, a next shared memory allreduce work unit in the job status object; and performing, by that available core on the compute node, that next shared memory allreduce work unit.
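
    In C11 terms the claimed scheme amounts to cores atomically claiming disjoint work units of the reduction (hedged; the names, the sum operation, and the chunking are assumptions, and a barrier is still required before the result is consumed):

      #include <stdatomic.h>

      typedef struct {
          atomic_int next_unit;    /* next unclaimed work unit */
          int nunits;              /* work units in the job status object */
      } job_status_t;

      /* every available core calls this concurrently; chunks are disjoint */
      void allreduce_sum(job_status_t *job, int ncores, double *const in[],
                         double *out, int n, int chunk)
      {
          int u;
          while ((u = atomic_fetch_add(&job->next_unit, 1)) < job->nunits) {
              int lo = u * chunk;
              int hi = lo + chunk < n ? lo + chunk : n;
              for (int i = lo; i < hi; i++) {
                  double s = 0.0;
                  for (int c = 0; c < ncores; c++)
                      s += in[c][i];       /* combine every core's input */
                  out[i] = s;              /* no races: chunks disjoint */
              }
          }
      }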

  6. Performing an allreduce operation using shared memory

    DOEpatents

    Archer, Charles J.; Dozsa, Gabor; Ratterman, Joseph D.; Smith, Brian E.

    2012-04-17

    Methods, apparatus, and products are disclosed for performing an allreduce operation using shared memory that include: receiving, by at least one of a plurality of processing cores on a compute node, an instruction to perform an allreduce operation; establishing, by the core that received the instruction, a job status object for specifying a plurality of shared memory allreduce work units, the plurality of shared memory allreduce work units together performing the allreduce operation on the compute node; determining, by an available core on the compute node, a next shared memory allreduce work unit in the job status object; and performing, by that available core on the compute node, that next shared memory allreduce work unit.

  7. Dynamically controlling false sharing in distributed shared memory

    SciTech Connect

    Freeh, V.W.; Andrews, G.R.

    1996-12-31

    Distributed shared memory (DSM) alleviates the need to program message passing explicitly on a distributed-memory machine. In order to reduce memory latency, a DSM replicates data. This paper examines several current approaches to controlling thrashing caused by false sharing in a DSM. Then it introduces a novel memory consistency protocol, writer-owns, which detects and eliminates false sharing at run time. In iterative computations, where the data is accessed similarly every iteration, the writer-owns protocol can have tremendous benefits because the overhead of eliminating false sharing is only incurred once. Performance results show that the writer-owns protocol is competitive with and often better than existing approaches.

  8. Recoverable distributed shared virtual memory - Memory coherence and storage structures

    NASA Technical Reports Server (NTRS)

    Wu, Kun-Lung; Fuchs, W. Kent

    1989-01-01

    This paper examines the problem of implementing rollback recovery in multicomputer distributed shared virtual memory environments, in which the shared memory is implemented in software and exists only virtually. A user-transparent checkpointing recovery scheme and new twin-page disk storage management are presented to implement a recoverable distributed shared virtual memory. The checkpointing scheme is integrated with the shared virtual memory management. The twin-page disk approach allows incremental checkpointing without an explicit undo at the time of recovery. A single consistent checkpoint state is maintained on stable disk storage. The recoverable distributed shared virtual memory allows the system to restart computation from a previous checkpoint due to a processor failure without a global restart.

  9. Multiprocessor execution of functional programs

    SciTech Connect

    Goldberg, B.F.

    1988-01-01

    Functional languages have recently gained attention as vehicles for programming in a concise and elegant manner. In addition, it has been suggested that functional programming provides a natural methodology for programming multiprocessor computers. This dissertation demonstrates that multiprocessor execution of functional programs is feasible, and results in a significant reduction in their execution times. Two implementations of the functional language ALFL were built on commercially available multiprocessors. Alfalfa is an implementation on the Intel iPSC hypercube multiprocessor, and Buckwheat is an implementation on the Encore Multimax shared-memory multiprocessor. Each implementation includes a compiler that performs automatic decomposition of ALFL programs. The compiler is responsible for detecting the inherent parallelism in a program, and decomposing the program into a collection of tasks, called serial combinators, that can be executed in parallel. One of the primary goals of the compiler is to generate serial combinators exhibiting the coarsest granularity possible without sacrificing useful parallelism. This dissertation describes the algorithms used by the compiler to analyze, decompose, and optimize functional programs. The abstract machine model supported by Alfalfa and Buckwheat is called heterogeneous graph reduction, which is a hybrid of graph reduction and conventional stack-oriented execution. This model supports parallelism, lazy evaluation, and higher order functions while at the same time making efficient use of the processors in the system. The Alfalfa and Buckwheat run-time systems support dynamic load balancing, interprocessor communication (if required) and storage management. A large number of experiments were performed on Alfalfa and Buckwheat for a variety of programs. The results of these experiments, as well as the conclusions drawn from them, are presented.

  10. Direct access inter-process shared memory

    DOEpatents

    Brightwell, Ronald B.; Pedretti, Kevin; Hudson, Trammell B.

    2013-10-22

    A technique for directly sharing physical memory between processes executing on processor cores is described. The technique includes loading a plurality of processes into the physical memory for execution on a corresponding plurality of processor cores sharing the physical memory. An address space is mapped to each of the processes by populating a first entry in a top level virtual address table for each of the processes. The address space of each of the processes is cross-mapped into each of the processes by populating one or more subsequent entries of the top level virtual address table with the first entry in the top level virtual address table from other processes.

  11. Multiprocessor architectural study

    NASA Technical Reports Server (NTRS)

    Kosmala, A. L.; Stanten, S. F.; Vandever, W. H.

    1972-01-01

    An architectural design study was made of a multiprocessor computing system intended to meet functional and performance specifications appropriate to a manned space station application. Intermetrics' previous experience and accumulated knowledge of the multiprocessor field are used to generate a baseline philosophy for the design of a future SUMC* multiprocessor. Interrupts are defined and the crucial questions of interrupt structure, such as processor selection and response time, are discussed. Memory hierarchy and performance are discussed extensively, with particular attention to the design approach which utilizes a cache memory associated with each processor. The ability of an individual processor to approach its theoretical maximum performance is then analyzed in terms of a hit ratio. Memory management is envisioned as a virtual memory system implemented either through segmentation or paging. Addressing is discussed in terms of the various register designs adopted by current computers and those of advanced design.

  12. Debugging in a multi-processor environment

    SciTech Connect

    Spann, J.M.

    1981-09-29

    The Supervisory Control and Diagnostic System (SCDS) for the Mirror Fusion Test Facility (MFTF) consists of nine 32-bit minicomputers arranged in a tightly coupled distributed computer system utilizing a shared memory as the data exchange medium. Debugging of more than one program in the multi-processor environment is a difficult process. This paper describes what new tools were developed and how the testing of software is performed in the SCDS for the MFTF project.

  13. Comparative Study of Message Passing and Shared Memory Parallel Programming Models in Neural Network Training

    SciTech Connect

    Vitela, J.; Gordillo, J.; Cortina, L.; Hanebutte, U.

    1999-12-14

    A comparative performance study is presented of a coarse-grained parallel neural network training code, implemented in both OpenMP and MPI, the standards for shared memory and message passing parallel programming environments, respectively. In addition, these versions of the parallel training code are compared to an implementation utilizing SHMEM, the native SGI/Cray environment for shared memory programming. The multiprocessor platform used is an SGI/Cray Origin 2000 with up to 32 processors. It is shown that in this study the native Cray environment outperforms MPI for the entire range of processors used, while OpenMP shows better performance than the other two environments when using more than 19 processors. The efficiency is always greater than 60%, regardless of the parallel programming environment used and the number of processors.

  14. A shared memory environment for hypercubes

    SciTech Connect

    Agarwala, A.; Das, C.R.

    1994-12-31

    This paper describes the design and implementation of a shared virtual memory (SVM) system for the nCUBE 2 hypercube multicomputer. The SVM system provides the user a single coherent address space across all nodes. It is implemented at the user level in a C programming environment using high level constructs to support data sharing. Shared variables are treated as objects rather than pages. We have improved upon an existing algorithm for maintaining coherency in the SVM system, thus achieving a reduction in the number of inter-node messages required in coherency maintenance. Detailed timing analysis is conducted to analyze the feasibility of this shared environment. Experimental results indicate that parallel programs running under the SVM system show linear speedup, suggesting that SVM systems could provide an effective programming environment for the next generation of distributed memory parallel computers. A bottleneck of this implementation appears to be the expensive interrupt handling by the nCUBE 2 kernel.

  15. Rollback-recovery techniques and architectural support for multiprocessor systems

    SciTech Connect

    Chiang, Chungyang

    1991-01-01

    The author proposes efficient and robust fault diagnosis and rollback-recovery techniques to enhance system availability as well as performance in both distributed-memory and shared-bus shared-memory multiprocessor systems. Architectural support for the proposed rollback-recovery technique in a bus-based shared-memory multiprocessor system is also investigated to adaptively fine tune the proposed rollback-recovery technique in this type of system. A comparison of the performance of the proposed techniques with other existing techniques is made, a topic on which little quantitative information is available in the literature. New diagnosis concepts are introduced to show that the author's diagnosis technique yields higher diagnosis coverage and facilitates the performance evaluation of various fault-diagnosis techniques.

  16. Parallel Navier-Stokes computations on shared and distributed memory architectures

    NASA Technical Reports Server (NTRS)

    Hayder, M. Ehtesham; Jayasimha, D. N.; Pillay, Sasi Kumar

    1995-01-01

    We study a high order finite difference scheme to solve the time accurate flow field of a jet using the compressible Navier-Stokes equations. As part of our ongoing efforts, we have implemented our numerical model on three parallel computing platforms to study the computational, communication, and scalability characteristics. The platforms chosen for this study are a cluster of workstations connected through fast networks (the LACE experimental testbed at NASA Lewis), a shared memory multiprocessor (the Cray YMP), and a distributed memory multiprocessor (the IBM SP1). Our focus in this study is on the LACE testbed. We present some results for the Cray YMP and the IBM SP1 mainly for comparison purposes. On the LACE testbed, we study: (1) the communication characteristics of Ethernet, FDDI, and the ALLNODE networks and (2) the overheads induced by the PVM message passing library used for parallelizing the application. We demonstrate that clustering of workstations is effective and has the potential to be computationally competitive with supercomputers at a fraction of the cost.
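
    Since part of the study concerns the overhead induced by PVM itself, a minimal PVM round trip of the kind used to expose per-message cost is sketched below; this is illustrative only (the spawned executable name "pingpong" is an assumption), not the paper's instrumentation.

        #include <stdio.h>
        #include "pvm3.h"

        int main(void) {
            int ptid = pvm_parent();
            if (ptid == PvmNoParent) {                  /* master side */
                int child;
                double x = 3.14;
                pvm_spawn("pingpong", NULL, PvmTaskDefault, "", 1, &child);
                pvm_initsend(PvmDataDefault);           /* pack and send */
                pvm_pkdouble(&x, 1, 1);
                pvm_send(child, 1);
                pvm_recv(child, 2);                     /* wait for echo */
                pvm_upkdouble(&x, 1, 1);
                printf("round trip complete: %g\n", x);
            } else {                                    /* worker: echo back */
                double x;
                pvm_recv(ptid, 1);
                pvm_upkdouble(&x, 1, 1);
                pvm_initsend(PvmDataDefault);
                pvm_pkdouble(&x, 1, 1);
                pvm_send(ptid, 2);
            }
            pvm_exit();
            return 0;
        }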

  17. Exploring Shared Memory Protocols in FLASH

    SciTech Connect

    Horowitz, Mark; Kunz, Robert; Hall, Mary; Lucas, Robert; Chame, Jacqueline

    2007-04-01

    The goal of this project was to improve the performance of large scientific and engineering applications through collaborative hardware and software mechanisms to manage the memory hierarchy of non-uniform memory access time (NUMA) shared-memory machines, as well as their component individual processors. In spite of the programming advantages of shared-memory platforms, obtaining good performance for large scientific and engineering applications on such machines can be challenging. Because communication between processors is managed implicitly by the hardware, rather than expressed by the programmer, application performance may suffer from unintended communication – communication that the programmer did not consider when developing his/her application. In this project, we developed and evaluated a collection of hardware, compiler, languages and performance monitoring tools to obtain high performance on scientific and engineering applications on NUMA platforms by managing communication through alternative coherence mechanisms. Alternative coherence mechanisms have often been discussed as a means for reducing unintended communication, although architecture implementations of such mechanisms are quite rare. This report describes an actual implementation of a set of coherence protocols that support coherent, non-coherent and write-update accesses for a CC-NUMA shared-memory architecture, the Stanford FLASH machine. Such an approach has the advantages of using alternative coherence only where it is beneficial, and also provides an evolutionary migration path for improving application performance. We present data on two computations, RandomAccess from the HPC Challenge benchmarks and a forward solver derived from LS-DYNA, showing the performance advantages of the alternative coherence mechanisms. For RandomAccess, the non-coherent and write-update versions can outperform the coherent version by factors of 5 and 2.5, respectively. In LS-DYNA, we obtain

  18. Checkpointing Shared Memory Programs at the Application-level

    SciTech Connect

    Bronevetsky, G; Schulz, M; Szwed, P; Marques, D; Pingali, K

    2004-09-08

    Trends in high-performance computing are making it necessary for long-running applications to tolerate hardware faults. The most commonly used approach is checkpoint and restart (CPR): the state of the computation is saved periodically on disk, and when a failure occurs, the computation is restarted from the last saved state. At present, it is the responsibility of the programmer to instrument applications for CPR. Our group is investigating the use of compiler technology to instrument codes to make them self-checkpointing and self-restarting, thereby providing an automatic solution to the problem of making long-running scientific applications resilient to hardware faults. Our previous work focused on message-passing programs. In this paper, we describe such a system for shared-memory programs running on symmetric multiprocessors. The system has two components: (i) a pre-compiler for source-to-source modification of applications, and (ii) a runtime system that implements a protocol for coordinating CPR among the threads of the parallel application. For the sake of concreteness, we focus on a non-trivial subset of OpenMP that includes barriers and locks. One of the advantages of this approach is that the ability to tolerate faults becomes embedded within the application itself, so applications become self-checkpointing and self-restarting on any platform. We demonstrate this by showing that our transformed benchmarks can checkpoint and restart on three different platforms (Windows/x86, Linux/x86, and Tru64/Alpha). Our experiments show that the overhead introduced by this approach is usually quite small; they also suggest ways in which the current implementation can be tuned to reduce overheads further.
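
    The flavor of the transformation can be suggested with a hand-written sketch: all threads reach a barrier, one thread saves the shared data, every thread saves its private data, and the same point serves as the restart location. The real system inserts such code automatically, and its protocol also copes with locks and in-flight synchronization; the file names and the division of state below are our assumptions.

        #include <omp.h>
        #include <stdio.h>

        void checkpoint(const double *shared_state, size_t shared_n,
                        const double *priv_state, size_t priv_n) {
            #pragma omp barrier              /* quiesce all threads first */
            #pragma omp master               /* one thread saves shared data */
            {
                FILE *f = fopen("ckpt_shared.dat", "wb");
                fwrite(shared_state, sizeof(double), shared_n, f);
                fclose(f);
            }
            /* every thread saves its own private data */
            char name[64];
            snprintf(name, sizeof name, "ckpt_thread_%d.dat",
                     omp_get_thread_num());
            FILE *g = fopen(name, "wb");
            fwrite(priv_state, sizeof(double), priv_n, g);
            fclose(g);
            #pragma omp barrier              /* none resume until all saved */
        }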

  19. C-MOS array design techniques: SUMC multiprocessor system study

    NASA Technical Reports Server (NTRS)

    Clapp, W. A.; Helbig, W. A.; Merriam, A. S.

    1972-01-01

    The current capabilities of LSI techniques for speed and reliability, plus the possibilities of assembling large configurations of LSI logic and storage elements, have demanded the study of multiprocessors and multiprocessing techniques, problems, and potentialities. Evaluated are three previous systems studies for a space ultrareliable modular computer multiprocessing system, and a new multiprocessing system is proposed that is flexibly configured with up to four central processors, four I/O processors, and 16 main memory units, plus auxiliary memory and peripheral devices. This multiprocessor system features a multilevel interrupt, qualified S/360 compatibility for ground-based generation of programs, virtual memory management of a storage hierarchy through I/O processors, and multiport access to multiple and shared memory units.

  20. Development and evaluation of a fault-tolerant multiprocessor (FTMP) computer. Volume 1: FTMP principles of operation

    NASA Technical Reports Server (NTRS)

    Smith, T. B., Jr.; Lala, J. H.

    1983-01-01

    The basic organization of the fault-tolerant multiprocessor (FTMP) is that of a general purpose homogeneous multiprocessor. Three processors operate on a shared system (memory and I/O) bus. Replication and tight synchronization of all elements, together with hardware voting, are employed to detect and correct any single fault. Reconfiguration is then employed to repair a fault. Multiple faults may be tolerated as a sequence of single faults with repair between fault occurrences.

  1. Distributed shared memory for roaming large volumes.

    PubMed

    Castanié, Laurent; Mion, Christophe; Cavin, Xavier; Lévy, Bruno

    2006-01-01

    We present a cluster-based volume rendering system for roaming very large volumes. This system makes it possible to move a gigabyte-sized probe inside a total volume of several tens or hundreds of gigabytes in real time. While the size of the probe is limited by the total amount of texture memory on the cluster, the size of the total data set has no theoretical limit. The cluster is used as a distributed graphics processing unit that aggregates both graphics power and graphics memory. A hardware-accelerated volume renderer runs in parallel on the cluster nodes, and the final image compositing is implemented using a pipelined sort-last rendering algorithm. Meanwhile, volume bricking and volume paging allow efficient data caching. On each rendering node, a distributed hierarchical cache system implements a global software-based distributed shared memory on the cluster. In case of a cache miss, this system first checks page residency on the other cluster nodes instead of directly accessing local disks. Using two Gigabit Ethernet network interfaces per node, we accelerate data fetching by a factor of 4 compared to directly accessing local disks. The system also implements asynchronous disk access and texture loading, which makes it possible to overlap data loading, volume slicing and rendering for optimal volume roaming. PMID:17080865

  2. Shared virtual memory and generalized speedup

    NASA Technical Reports Server (NTRS)

    Sun, Xian-He; Zhu, Jianping

    1994-01-01

    Generalized speedup is defined as parallel speed over sequential speed. The generalized speedup and its relation to other existing performance metrics, such as traditional speedup, efficiency, and scalability, are carefully studied. In terms of the introduced asymptotic speed, it is shown that the difference between the generalized speedup and the traditional speedup lies in the definition of the efficiency of uniprocessor processing, which is a very important issue for shared virtual memory machines. A scientific application was implemented on a KSR-1 parallel computer. Experimental and theoretical results show that the generalized speedup is distinct from the traditional speedup and provides a more reasonable measurement. In the study of different speedups, various causes of superlinear speedup are also presented.
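
    Written out, with W and T denoting the work and elapsed time of the p-processor and uniprocessor runs (our notation, not the paper's):

        \[
          S_{\mathrm{gen}}(p)
            \;=\; \frac{\text{parallel speed}}{\text{sequential speed}}
            \;=\; \frac{W_p / T_p}{W_1 / T_1}
        \]

    When both runs perform identical work (W_p = W_1), this reduces to the traditional speedup T_1/T_p; the two metrics part company exactly when uniprocessor efficiency must be defined with care, as on shared virtual memory machines where a single processor may not be able to process the full problem at its natural speed.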

  3. Final Report: Programming Models for Shared Memory Clusters

    SciTech Connect

    May, J.; de Supinski, B.; Pudliner, B.; Taylor, S.; Baden, S.

    2000-01-04

    Most large parallel computers now built use a hybrid architecture called a shared memory cluster. In this design, a computer consists of several nodes connected by an interconnection network. Each node contains a pool of memory and multiple processors that share direct access to it. Because shared memory clusters combine architectural features of shared memory computers and distributed memory computers, they support several different styles of parallel programming or programming models. (Further information on the design of these systems and their programming models appears in Section 2.) The purpose of this project was to investigate the programming models available on these systems and to answer three questions: (1) How easy to use are the different programming models in real applications? (2) How do the hardware and system software on different computers affect the performance of these programming models? (3) What are the performance characteristics of different programming models for typical LLNL applications on various shared memory clusters?

  4. Reducing communication costs in the conjugate gradient algorithm on distributed memory multiprocessors

    SciTech Connect

    D'Azevedo, E.F.; Romine, C.H.

    1992-09-01

    The standard formulation of the conjugate gradient algorithm involves two inner product computations. The results of these two inner products are needed to update the search direction and the computed solution. In a distributed memory parallel environment, the computation and subsequent distribution of these two values requires two separate communication and synchronization phases. In this paper, we present a mathematically equivalent rearrangement of the standard algorithm that reduces the number of communication phases. We give a second derivation of the modified conjugate gradient algorithm in terms of the natural relationship with the underlying Lanczos process. We also present empirical evidence of the stability of this modified algorithm.
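
    The saving can be seen in miniature below: rather than paying for two global reductions per iteration, the two partial sums travel in one message. This is a sketch of the communication pattern only (the names are ours); the report's rearranged recurrences determine which inner products may be grouped this way.

        #include <mpi.h>

        /* Compute u.v and w.z across all ranks with ONE reduction phase
           instead of two, by packing both partial sums into one buffer. */
        void fused_dots(const double *u, const double *v,
                        const double *w, const double *z, int n,
                        double *uv, double *wz) {
            double local[2] = {0.0, 0.0}, global[2];
            for (int i = 0; i < n; i++) {
                local[0] += u[i] * v[i];
                local[1] += w[i] * z[i];
            }
            MPI_Allreduce(local, global, 2, MPI_DOUBLE, MPI_SUM,
                          MPI_COMM_WORLD);
            *uv = global[0];
            *wz = global[1];
        }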

  5. Impact of Load Balancing on Unstructured Adaptive Grid Computations for Distributed-Memory Multiprocessors

    NASA Technical Reports Server (NTRS)

    Sohn, Andrew; Biswas, Rupak; Simon, Horst D.

    1996-01-01

    The computational requirements for an adaptive solution of unsteady problems change as the simulation progresses. This causes workload imbalance among processors on a parallel machine which, in turn, requires significant data movement at runtime. We present a new dynamic load-balancing framework, called JOVE, that balances the workload across all processors with a global view. Whenever the computational mesh is adapted, JOVE is activated to eliminate the load imbalance. JOVE has been implemented on an IBM SP2 distributed-memory machine in MPI for portability. Experimental results for two model meshes demonstrate that mesh adaption with load balancing gives more than a sixfold improvement over one without load balancing. We also show that JOVE gives a 24-fold speedup on 64 processors compared to sequential execution.

  6. A parallel numerical simulation for supersonic flows using zonal overlapped grids and local time steps for common and distributed memory multiprocessors

    SciTech Connect

    Patel, N.R.; Sturek, W.B.; Hiromoto, R.

    1989-01-01

    Parallel Navier-Stokes codes are developed to solve both two-dimensional and three-dimensional flow fields in and around ramjet and nose tip configurations. A multi-zone overlapped grid technique is used to extend an explicit finite-difference method to more complicated geometries. Parallel implementations are developed for execution on both distributed and common-memory multiprocessor architectures. For the steady-state solutions, the use of the local time-step method has the inherent advantage of reducing the communications overhead commonly incurred by parallel implementations. Computational results of the codes are given for a series of test problems. The parallel partitioning of computational zones is also discussed. 5 refs., 18 figs.

  7. Solution of the Euler and Navier-Stokes equations on MIMD distributed memory multiprocessors using cyclic reduction

    SciTech Connect

    Curchitser, E.N.; Pelz, R.B.; Marconi, F. (Grumman Aerospace Corp., Bethpage, NY)

    1992-01-01

    The Euler and Navier-Stokes equations are solved for the steady, two-dimensional flow over a NACA 0012 airfoil using a 1024-node nCUBE/2 multiprocessor. Second-order, upwind-discretized difference equations are solved implicitly using ADI factorization. Parallel cyclic reduction is employed to solve the block tridiagonal systems. For realistic problems, communication times are negligible compared to calculation times. The processors are tightly synchronized, and their loads are well balanced. When the flux Jacobians are frozen, the wall-clock time for one implicit timestep is about equal to that of a multistage explicit scheme. 10 refs.
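
    For reference, the scalar form of cyclic reduction (the paper applies the block variant) eliminates every other unknown at each level, and within a level the row updates are mutually independent, which is what a parallel implementation exploits. The sketch below is a serial illustration under the assumption n = 2^k - 1:

        /* Solve a[i]x[i-1] + b[i]x[i] + c[i]x[i+1] = d[i], i = 1..n.
           Arrays are sized n+2 with a[1] = c[n] = 0. Coefficients are
           overwritten; the solution is returned in x. */
        void cyclic_reduction(double *a, double *b, double *c, double *d,
                              double *x, int n) {
            /* forward elimination: each level halves the unknowns */
            for (int s = 1; 2 * s <= n; s *= 2)
                for (int i = 2 * s; i + s <= n; i += 2 * s) {
                    double al = -a[i] / b[i - s];
                    double ga = -c[i] / b[i + s];
                    b[i] += al * c[i - s] + ga * a[i + s];
                    d[i] += al * d[i - s] + ga * d[i + s];
                    a[i]  = al * a[i - s];
                    c[i]  = ga * c[i + s];
                }
            int mid = (n + 1) / 2;
            x[0] = x[n + 1] = 0.0;
            x[mid] = d[mid] / b[mid];        /* single remaining equation */
            /* back substitution: rows within a level are independent */
            for (int s = mid / 2; s >= 1; s /= 2)
                for (int i = s; i <= n; i += 2 * s)
                    x[i] = (d[i] - a[i] * x[i - s] - c[i] * x[i + s]) / b[i];
        }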

  8. The hierarchical spatial decomposition of three-dimensional particle-in-cell plasma simulations on MIMD distributed memory multiprocessors

    SciTech Connect

    Walker, D.W.

    1992-07-01

    The hierarchical spatial decomposition method is a promising approach to decomposing the particles and computational grid in parallel particle-in-cell application codes, since it is able to maintain approximate dynamic load balance while keeping communication costs low. In this paper we investigate issues in implementing a hierarchical spatial decomposition on a hypercube multiprocessor. Particular attention is focused on the communication needed to update guard ring data, and on the load balancing method. The hierarchical approach is compared with other dynamic load balancing schemes.

  9. Supporting shared data structures on distributed memory architectures

    NASA Technical Reports Server (NTRS)

    Koelbel, Charles; Mehrotra, Piyush; Vanrosendale, John

    1990-01-01

    Programming nonshared memory systems is more difficult than programming shared memory systems, since there is no support for shared data structures. Current programming languages for distributed memory architectures force the user to decompose all data structures into separate pieces, with each piece owned by one of the processors in the machine, and with all communication explicitly specified by low-level message-passing primitives. A new programming environment is presented for distributed memory architectures, providing a global name space and allowing direct access to remote parts of data values. The analysis and program transformations required to implement this environment are described, and the efficiency of the resulting code on the nCUBE/7 and iPSC/2 hypercubes is described.

  10. A comparison of distributed memory and virtual shared memory parallel programming models

    SciTech Connect

    Keane, J.A.; Grant, A.J.; Xu, M.Q.

    1993-04-01

    The virtues of the different parallel programming models, shared memory and distributed memory, have been much debated. Conventionally the debate could be reduced to programming convenience on the one hand, and high scalability on the other. More recently the debate has become somewhat blurred with the provision of virtual shared memory models built on machines with physically distributed memory. The intention of such models/machines is to provide scalable shared memory, i.e. to provide both programmer convenience and high scalability. In this paper, the different models are considered in light of experiences gained with a number of systems ranging from applications in both commerce and science to languages and operating systems. Case studies are introduced as appropriate.

  11. Parallel implementation and evaluation of motion estimation system algorithms on a distributed memory multiprocessor using knowledge based mappings

    NASA Technical Reports Server (NTRS)

    Choudhary, Alok Nidhi; Leung, Mun K.; Huang, Thomas S.; Patel, Janak H.

    1989-01-01

    Several techniques for static and dynamic load balancing in vision systems are presented. These techniques are novel in that they capture the computational requirements of a task by examining the data as they are produced. Furthermore, they can be applied to many vision systems because many algorithms in different systems are either the same or have similar computational characteristics. These techniques are evaluated by applying them to a parallel implementation of the algorithms in a motion estimation system on a hypercube multiprocessor system. The motion estimation system consists of the following steps: (1) extraction of features; (2) stereo match of images in one time instant; (3) time match of images from different time instants; (4) stereo match to compute final unambiguous points; and (5) computation of motion parameters. It is shown that the performance gains are significant when these data decomposition and load balancing techniques are used, while the overhead of using them is minimal.

  12. Efficient multiprocessor architecture for digital signal processing

    SciTech Connect

    Auguin, M.; Boeri, F.

    1982-01-01

    There is continuing pressure for better processing performance in numerical signal processing. Effective utilization of LSI semiconductor technology allows the consideration of multiprocessor architectures, but the problem of interconnecting the components of the architecture arises. The authors describe a control algorithm for the Benes interconnection network in an asynchronous multiprocessor system. A simulation study of the time-shared bus, the omega network, the Benes network, and the crossbar network gives a comparison of their performance. 8 references.

  13. Distributed simulation using a real-time shared memory network

    NASA Technical Reports Server (NTRS)

    Simon, Donald L.; Mattern, Duane L.; Wong, Edmond; Musgrave, Jeffrey L.

    1993-01-01

    The Advanced Control Technology Branch of the NASA Lewis Research Center performs research in the area of advanced digital controls for aeronautic and space propulsion systems. This work requires the real-time implementation of both control software and complex dynamical models of the propulsion system. We are implementing these systems in a distributed, multi-vendor computer environment. Therefore, a need exists for real-time communication and synchronization between the distributed multi-vendor computers. A shared memory network is a potential solution which offers several advantages over other real-time communication approaches. A candidate shared memory network was tested for basic performance. The shared memory network was then used to implement a distributed simulation of a ramjet engine. The accuracy and execution time of the distributed simulation were measured and compared to the performance of the non-partitioned simulation. The ease of partitioning the simulation, the minimal time required to develop communication between the processors, and the resulting execution time all indicate that the shared memory network is a real-time communication technique worthy of serious consideration.

  14. Performance Analysis of Multilevel Parallel Applications on Shared Memory Architectures

    NASA Technical Reports Server (NTRS)

    Biegel, Bryan A. (Technical Monitor); Jost, G.; Jin, H.; Labarta, J.; Gimenez, J.; Caubet, J.

    2003-01-01

    Parallel programming paradigms include process level parallelism, thread level parallelization, and multilevel parallelism. This viewgraph presentation describes a detailed performance analysis of these paradigms for Shared Memory Architecture (SMA). This analysis uses the Paraver Performance Analysis System. The presentation includes diagrams of a flow of useful computations.

  15. Analysis of fork-join program response times on multiprocessors

    SciTech Connect

    Towsley, D.; Stankovic, J.A.; Rommel, C.G.

    1990-07-01

    In this paper, the authors develop analytic models for a shared memory multiprocessor that executes fork-join parallel programs. Here a fork-join program is one that consists of a set of n ≥ 1 parallel tasks. All of the tasks of a program arrive simultaneously to the system and the job is assumed to complete when the last task completes. They develop and analyze models for two processor sharing policies, called task scheduling processor sharing and job scheduling processor sharing. The first policy schedules tasks independently of each other and allows parallel execution of an individual program, whereas the second policy schedules each job as a unit and thereby does not allow parallel execution of an individual program.
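
    The fork-join structure being modeled, stripped to code: all tasks of a job start together, and the join makes the slowest task determine the job's response time. A minimal sketch (the task count and body are placeholders):

        #include <pthread.h>

        #define N_TASKS 8

        void *task(void *arg) {
            /* ... per-task work ... */
            return NULL;
        }

        void run_job(void) {
            pthread_t t[N_TASKS];
            for (int i = 0; i < N_TASKS; i++)      /* fork: all n tasks */
                pthread_create(&t[i], NULL, task, NULL); /* arrive at once */
            for (int i = 0; i < N_TASKS; i++)      /* join: job completes */
                pthread_join(t[i], NULL);          /* with its last task  */
        }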

  16. Performance Evaluation of Remote Memory Access (RMA) Programming on Shared Memory Parallel Computers

    NASA Technical Reports Server (NTRS)

    Jin, Hao-Qiang; Jost, Gabriele; Biegel, Bryan A. (Technical Monitor)

    2002-01-01

    The purpose of this study is to evaluate the feasibility of remote memory access (RMA) programming on shared memory parallel computers. We discuss different RMA-based implementations of selected CFD application benchmark kernels and compare them to corresponding message-passing based codes. For the message-passing implementation we use MPI point-to-point and global communication routines. For the RMA-based approach we consider two different libraries supporting this programming model. One is a shared memory parallelization library (SMPlib) developed at NASA Ames; the other is the MPI-2 extensions to the MPI Standard. We give timing comparisons for the different implementation strategies and discuss the performance.
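
    As one concrete flavor of RMA, here is a minimal MPI-2 one-sided exchange of the kind such studies time: each rank exposes a window of memory and a neighbor writes into it directly, with fences bounding the access epoch. This sketch is illustrative (the buffer names and ring pattern are our choices), not the paper's benchmark code.

        #include <mpi.h>

        void ring_put(double *local, int n, double *win_buf,
                      int rank, int size) {
            MPI_Win win;
            MPI_Win_create(win_buf, n * sizeof(double), sizeof(double),
                           MPI_INFO_NULL, MPI_COMM_WORLD, &win);
            MPI_Win_fence(0, win);               /* open the access epoch */
            int right = (rank + 1) % size;
            MPI_Put(local, n, MPI_DOUBLE,        /* write directly into   */
                    right, 0, n, MPI_DOUBLE,     /* the neighbor's window */
                    win);
            MPI_Win_fence(0, win);               /* data visible here */
            MPI_Win_free(&win);
        }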

  17. Parallel processing and medium-scale multiprocessors

    SciTech Connect

    Wouk, A.

    1989-01-01

    For some time, the community interested in large-scale scientific computing has been attempting to come to terms with parallel computation using a number of processors sufficient to make their concurrent utilization interesting, challenging, and, in the long run, beneficial. Unexpected consequences of parallelization have been discovered. It is possible to obtain reduced performance, both relative and absolute, from an increased number of processors, as a result of inappropriate use of resources in a multiprocessor environment. This exemplifies one of the paradoxes that result from our cultural bias toward sequential thought processes, a bias that carries over into sequential styles of program development in a multiprocessor environment. The authors have learned that the problem of automatic optimization in compilation of parallel programs is computationally hard. Early hopes have been dashed that automatic, optimal parallelization of sequentially conceived programs would be as achievable as automatic vectorization had been. The authors lack the insights and folklore needed to develop useful methodologies and heuristics in the area of parallel computation. The authors are embarked on a voyage of exploration of this new territory, and the work described in this volume can provide helpful guidance. The differences between distributed memory systems, shared memory systems, and combinations, as well as the relative applicability of SIMD and MIMD architectures, remain to be fully explored. Based on the information obtained in such exploration, useful steps towards efficient utilization of many processors should become possible. This paper covers several areas: systems programming, parallel languages and programming systems, and applications programming.

  18. Analysis and comparison of cache coherence protocols for a packet-switched multiprocessor

    SciTech Connect

    Yang, Q.; Bhuyan, L.N.; Liu, B.C.

    1989-08-01

    The use of private caches in a multiprocessor system causes inconsistency of the shared data among the caches and among caches and the main memory. A large number of protocols have been proposed to solve this coherence problem. In this paper, the authors develop analytical models for seven existing cache protocols, namely: Write-Once, Write-Through, Synapse, Berkeley, Illinois, Firefly, and Dragon. The protocols are implemented on a multiprocessor with a packet-switched shared bus. The models are based on queueing networks that consist of both open and closed classes of customers. The models incorporate the requests for invalidation signals, write-through, and write-back operations and the solution is based on the mean value analysis (MVA) algorithm. Performance comparison among these protocols under various system parameters is carried out based on our models.

  19. Multi-processor including data flow accelerator module

    DOEpatents

    Davidson, George S.; Pierce, Paul E.

    1990-01-01

    An accelerator module for a data flow computer includes an intelligent memory. The module is added to a multiprocessor arrangement and uses a shared tagged memory architecture in the data flow computer. The intelligent memory module assigns locations for holding data values in correspondence with arcs leading to a node in a data dependency graph. Each primitive computation is associated with a corresponding memory cell, including a number of slots for operands needed to execute a primitive computation, a primitive identifying pointer, and linking slots for distributing the result of the cell computation to other cells requiring that result as an operand. Circuitry is provided for utilizing tag bits to determine automatically when all operands required by a processor are available and for scheduling the primitive for execution in a queue. Each memory cell of the module may be associated with any of the primitives, and the particular primitive to be executed by the processor associated with the cell is identified by providing an index, such as the cell number for the primitive, to the primitive lookup table of starting addresses. The module thus serves to perform functions previously performed by a number of sections of data flow architectures and coexists with conventional shared memory therein. A multiprocessing system including the module operates in a hybrid mode, wherein the same processing modules are used to perform some processing in a sequential mode, under immediate control of an operating system, while performing other processing in a data flow mode.
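
    An illustrative rendering of such a memory cell in C may help fix the idea; the field names and sizes below are our assumptions, not the patent's layout.

        #include <stdint.h>

        #define MAX_OPERANDS 2
        #define MAX_LINKS    4

        struct dataflow_cell {
            uint32_t tag_bits;        /* one bit per operand: set when present */
            double   operand[MAX_OPERANDS];
            uint32_t primitive_index; /* index into the table of primitive
                                         routine starting addresses */
            uint32_t link[MAX_LINKS]; /* cells that consume this result
                                         as one of their operands */
        };

        /* the cell is queued for execution once every operand has arrived */
        static int cell_ready(const struct dataflow_cell *c,
                              uint32_t all_operands_mask) {
            return (c->tag_bits & all_operands_mask) == all_operands_mask;
        }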

  20. Rollback Hardware For Time Warp Multiprocessor Systems

    NASA Technical Reports Server (NTRS)

    Robb, Michael J.; Buzzell, Calvin A.

    1996-01-01

    The Rollback Chip (RBC) module is a computer circuit board containing special-purpose memory circuits for use in a multiprocessor computer system. It is designed to help realize the speedup potential of parallel processing for discrete-event simulation under the Time Warp operating system.

  1. Visual Tutoring System for Programming Multiprocessor Networks.

    ERIC Educational Resources Information Center

    Trichina, Elena

    1996-01-01

    Describes a visual tutoring system for programming distributed-memory multiprocessor networks. Highlights include difficulties of parallel programming, and three instructional modes in the system, including a hypertext-like lecture, a question-answer mode, and an expert aid mode. (Author/LRW)

  2. Neural networks and MIMD-multiprocessors

    NASA Technical Reports Server (NTRS)

    Vanhala, Jukka; Kaski, Kimmo

    1990-01-01

    Two artificial neural network models are compared. They are the Hopfield Neural Network Model and the Sparse Distributed Memory model. Distributed algorithms for both of them are designed and implemented. The run time characteristics of the algorithms are analyzed theoretically and tested in practice. The storage capacities of the networks are compared. Implementations are done using a distributed multiprocessor system.

  3. Experimenting With Multiprocessor Simulator Concepts

    NASA Technical Reports Server (NTRS)

    Blech, Richard A.; Williams, Anthony D.

    1989-01-01

    A multiple-microcomputer system is used to investigate the application of parallel processing to real-time simulation. With the dual-bus architecture, each microcomputer communicates with the corresponding microcomputer on the opposite bus through a dual-port interface memory. Transfers of data to and from the front-end processor occur on the interactive-information bus. Transfers of data related to simulation calculations occur on the real-time-information bus. The system, called the real-time multiprocessor simulator (RTMPS), is a tool for developing low-cost, portable, user-friendly simulators.

  4. A garbage collection algorithm for shared memory parallel processors

    SciTech Connect

    Crammond, J.

    1988-12-01

    This paper describes a technique for adapting the Morris sliding garbage collection algorithm to execute on parallel machines with shared memory. The algorithm is described within the framework of an implementation of the parallel logic language Parlog. However, the algorithm is a general one and can easily be adapted to parallel Prolog systems and to other languages. The performance of the algorithm executing a few simple Parlog benchmarks is analyzed. Finally, it is shown how the technique for parallelizing the sequential algorithm can be adapted for a semi-space copying algorithm.

  5. Performance Analysis of Multilevel Parallel Applications on Shared Memory Architectures

    NASA Technical Reports Server (NTRS)

    Jost, Gabriele; Jin, Haoqiang; Labarta, Jesus; Gimenez, Judit; Caubet, Jordi; Biegel, Bryan A. (Technical Monitor)

    2002-01-01

    In this paper we describe how to apply powerful performance analysis techniques to understand the behavior of multilevel parallel applications. We use the Paraver/OMPItrace performance analysis system for our study. This system consists of two major components: the OMPItrace dynamic instrumentation mechanism, which allows the tracing of processes and threads, and the Paraver graphical user interface for inspection and analysis of the generated traces. We describe how to use the system to conduct a detailed comparative study of a benchmark code implemented in five different programming paradigms applicable for shared memory

  6. High Performance, Dependable Multiprocessor

    NASA Technical Reports Server (NTRS)

    Ramos, Jeremy; Samson, John R.; Troxel, Ian; Subramaniyan, Rajagopal; Jacobs, Adam; Greco, James; Cieslewski, Grzegorz; Curreri, John; Fischer, Michael; Grobelny, Eric; George, Alan; Aggarwal, Vikas; Patel, Minesh; Some, Raphael

    2006-01-01

    With the ever increasing demand for higher bandwidth and processing capacity in today's space exploration, space science, and defense missions, the ability to efficiently apply commercial-off-the-shelf (COTS) processors for on-board computing is now a critical need. In response to this need, NASA's New Millennium Program office has commissioned the development of Dependable Multiprocessor (DM) technology for use in payload and robotic missions. The Dependable Multiprocessor technology is a COTS-based, power efficient, high performance, highly dependable, fault tolerant cluster computer. To date, Honeywell has successfully demonstrated a TRL4 prototype of the Dependable Multiprocessor [1], and is now working on the development of a TRL5 prototype. For the present effort Honeywell has teamed up with the University of Florida's High-performance Computing and Simulation (HCS) Lab, and together the team has demonstrated major elements of the Dependable Multiprocessor TRL5 system.

  7. Multiprocessors and runtime compilation

    NASA Technical Reports Server (NTRS)

    Saltz, Joel; Berryman, Harry; Wu, Janet

    1990-01-01

    Runtime preprocessing plays a major role in many efficient algorithms in computer science, as well as playing an important role in exploiting multiprocessor architectures. Examples are given that elucidate the importance of runtime preprocessing and show how these optimizations can be integrated into compilers. To support the arguments, transformations implemented in prototype multiprocessor compilers are described and benchmarks from the iPSC2/860, the CM-2, and the Encore Multimax/320 are presented.

  8. Ensuring correct rollback recovery in distributed shared memory systems

    NASA Technical Reports Server (NTRS)

    Janssens, Bob; Fuchs, W. Kent

    1995-01-01

    Distributed shared memory (DSM) implemented on a cluster of workstations is an increasingly attractive platform for executing parallel scientific applications. Checkpointing and rollback techniques can be used in such a system to allow the computation to progress in spite of the temporary failure of one or more processing nodes. This paper presents the design of an independent checkpointing method for DSM that takes advantage of DSM's specific properties to reduce error-free and rollback overhead. The scheme reduces the dependencies that need to be considered for correct rollback to those resulting from transfers of pages. Furthermore, in-transit messages can be recovered without the use of logging. We extend the scheme to a DSM implementation using lazy release consistency, where the frequency of dependencies is further reduced.

  9. Reducing Interprocessor Dependence in Recoverable Distributed Shared Memory

    NASA Technical Reports Server (NTRS)

    Janssens, Bob; Fuchs, W. Kent

    1994-01-01

    Checkpointing techniques in parallel systems use dependency tracking and/or message logging to ensure that a system rolls back to a consistent state. Traditional dependency tracking in distributed shared memory (DSM) systems is expensive because of high communication frequency. In this paper we show that, if designed correctly, a DSM system only needs to consider dependencies due to the transfer of blocks of data, resulting in reduced dependency tracking overhead and reduced potential for rollback propagation. We develop an ownership timestamp scheme to tolerate the loss of block state information and develop a passive server model of execution where interactions between processors are considered atomic. With our scheme, dependencies are significantly reduced compared to the traditional message-passing model.

  10. Parallel discrete event simulation: A shared memory approach

    NASA Technical Reports Server (NTRS)

    Reed, Daniel A.; Malony, Allen D.; Mccredie, Bradley D.

    1987-01-01

    With traditional event list techniques, evaluating a detailed discrete event simulation model can often require hours or even days of computation time. Parallel simulation mimics the interacting servers and queues of a real system by assigning each simulated entity to a processor. By eliminating the event list and maintaining only sufficient synchronization to ensure causality, parallel simulation can potentially provide speedups that are linear in the number of processors. A set of shared memory experiments is presented using the Chandy-Misra distributed simulation algorithm to simulate networks of queues. Parameters include queueing network topology and routing probabilities, number of processors, and assignment of network nodes to processors. These experiments show that Chandy-Misra distributed simulation is a questionable alternative to sequential simulation of most queueing network models.

  11. A Parallel Saturation Algorithm on Shared Memory Architectures

    NASA Technical Reports Server (NTRS)

    Ezekiel, Jonathan; Siminiceanu

    2007-01-01

    Symbolic state-space generators are notoriously hard to parallelize. However, the Saturation algorithm implemented in the SMART verification tool differs from other sequential symbolic state-space generators in that it exploits the locality of firing events in asynchronous system models. This paper explores whether event locality can be utilized to efficiently parallelize Saturation on shared-memory architectures. Conceptually, we propose to parallelize the firing of events within a decision diagram node, which is technically realized via a thread pool. We discuss the challenges involved in our parallel design and conduct experimental studies on its prototypical implementation. On a dual-processor dual-core PC, our studies show speed-ups for several example models, e.g., of up to 50% for a Kanban model, when compared to running our algorithm only on a single core.

  12. Translation techniques for distributed-shared memory programming models

    SciTech Connect

    Fuller, Douglas James

    2005-08-01

    The high performance computing community has experienced an explosive improvement in distributed-shared memory hardware. Driven by increasing real-world problem complexity, this explosion has ushered in vast numbers of new systems. Each new system presents new challenges to programmers and application developers. Part of the challenge is adapting to new architectures with new performance characteristics. Different vendors release systems with widely varying architectures that perform differently in different situations. Furthermore, since vendors need only provide a single performance number (total MFLOPS, typically for a single benchmark), they only have strong incentive initially to optimize the API of their choice. Consequently, only a fraction of the available APIs are well optimized on most systems. This causes issues porting and writing maintainable software, let alone issues for programmers burdened with mastering each new API as it is released. Also, programmers wishing to use a certain machine must choose their API based on the underlying hardware instead of the application. This thesis argues that a flexible, extensible translator for distributed-shared memory APIs can help address some of these issues. For example, a translator might take as input code in one API and output an equivalent program in another. Such a translator could provide instant porting for applications to new systems that do not support the application's library or language natively. While open-source APIs are abundant, they do not perform optimally everywhere. A translator would also allow performance testing using a single base code translated to a number of different APIs. Most significantly, this type of translator frees programmers to select the most appropriate API for a given application based on the application (and developer) itself instead of the underlying hardware.

  13. A Comparison of Shared Memory Parallel Programming Models

    SciTech Connect

    Mogill, Jace A; Haglin, David J

    2010-05-24

    The dominant parallel programming models for shared memory computers, Pthreads and OpenMP, are both thread-centric in that they are based on explicit management of tasks and enforce data dependencies and output ordering through task management. By comparison, the Cray XMT programming model is data-centric where the primary concern of the programmer is managing data dependencies, allowing threads to progress in a data flow fashion. The XMT implements this programming model by associating tag bits with each word of memory, affording efficient fine grained data synchronization independent of the number of processors or how tasks are scheduled. When task management is implicit and synchronization is abundant, efficient, and easy to use, programmers have viable alternatives to traditional thread-centric algorithms. In this paper we compare the amount of available parallelism relative to the amount of work in a variety of different algorithms and data structures when synchronization does not need to be rationed, as well as identify opportunities for platform and performance portability of the data-centric programming model on multi-core processors.
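
    The mechanism can be approximated in software to show the programming style. The sketch below emulates a full/empty tagged word with C11 atomics; on the XMT the tag bit is a hardware property of every memory word (and a consuming read can atomically reset the bit to empty), so no spinning cost is paid. Names here are illustrative.

        #include <stdatomic.h>

        typedef struct {
            _Atomic int full;     /* stands in for the hardware tag bit */
            double value;
        } tagged_word;

        /* producer: write the value, then mark the word full */
        void write_full(tagged_word *w, double v) {
            w->value = v;
            atomic_store_explicit(&w->full, 1, memory_order_release);
        }

        /* consumer: a "read when full" blocks until the value is produced */
        double read_when_full(tagged_word *w) {
            while (!atomic_load_explicit(&w->full, memory_order_acquire))
                ;                  /* spin; hardware tag bits avoid this */
            return w->value;
        }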

  14. Coupled cluster algorithms for networks of shared memory parallel processors

    NASA Astrophysics Data System (ADS)

    Bentz, Jonathan L.; Olson, Ryan M.; Gordon, Mark S.; Schmidt, Michael W.; Kendall, Ricky A.

    2007-05-01

    As the popularity of using SMP systems as the building blocks for high performance supercomputers increases, so too does the need for applications that can utilize the multiple levels of parallelism available in clusters of SMPs. This paper presents a dual-layer distributed algorithm, using both shared-memory and distributed-memory techniques to parallelize a very important algorithm (often called the "gold standard") used in computational chemistry, the single and double excitation coupled cluster method with perturbative triples, i.e. CCSD(T). The algorithm is presented within the framework of the GAMESS [M.W. Schmidt, K.K. Baldridge, J.A. Boatz, S.T. Elbert, M.S. Gordon, J.J. Jensen, S. Koseki, N. Matsunaga, K.A. Nguyen, S. Su, T.L. Windus, M. Dupuis, J.A. Montgomery, General atomic and molecular electronic structure system, J. Comput. Chem. 14 (1993) 1347-1363] (General Atomic and Molecular Electronic Structure System) program suite and the Distributed Data Interface [M.W. Schmidt, G.D. Fletcher, B.M. Bode, M.S. Gordon, The distributed data interface in GAMESS, Comput. Phys. Comm. 128 (2000) 190] (DDI); however, the essential features of the algorithm (data distribution, load-balancing and communication overhead) can be applied to more general computational problems. Timing and performance data for our dual-level algorithm are presented on several large-scale clusters of SMPs.

  15. Efficient partitioning and assignment of programs for multiprocessor execution

    NASA Technical Reports Server (NTRS)

    Standley, Hilda M.

    1993-01-01

    The general problem studied is that of segmenting or partitioning programs for distribution across a multiprocessor system. Efficient partitioning and assignment of program elements are of great importance, since the time consumed in this overhead activity may easily dominate the computation, effectively eliminating any gains made by the use of parallelism. In this study, the partitioning of sequentially structured programs (written in FORTRAN) is evaluated. Heuristics developed for similar applications are examined. Finally, a model for queueing networks with finite queues is developed, which may be used to analyze multiprocessor system architectures that take a shared memory approach to the problem of partitioning. The properties of sequentially written programs form obstacles to large-scale (procedure- or subroutine-level) parallelization: data dependencies of even the minutest nature, reflecting the sequential development of the program, severely limit parallelism. The design of heuristic algorithms is tied to the experience gained in the parallel splitting. Parallelism obtained through the physical separation of data has seen some success, especially at the data element level. Data parallelism on a grander scale requires models that accurately reflect the effects of blocking caused by finite queues. A model for approximating the performance of finite queueing networks is developed. This model makes use of the decomposition approach combined with the efficiency of product form solutions.

  16. Implementation of Parallel Dynamic Simulation on Shared-Memory vs. Distributed-Memory Environments

    SciTech Connect

    Jin, Shuangshuang; Chen, Yousu; Wu, Di; Diao, Ruisheng; Huang, Zhenyu

    2015-12-09

    Power system dynamic simulation computes the system response to a sequence of large disturbances, such as sudden changes in generation or load, or a network short circuit followed by protective branch switching operations. It consists of a large set of differential and algebraic equations, which is computationally intensive and challenging to solve using single-processor based dynamic simulation solutions. High-performance computing (HPC) based parallel computing is a very promising technology to speed up the computation and facilitate the simulation process. This paper presents two different parallel implementations of power grid dynamic simulation using Open Multi-Processing (OpenMP) on a shared-memory platform, and the Message Passing Interface (MPI) on distributed-memory clusters, respectively. The differences between the parallel simulation algorithms and architectures of the two HPC technologies are illustrated, and their performances for running parallel dynamic simulation are compared and demonstrated.

  17. Shared Memory Parallelism for 3D Cartesian Discrete Ordinates Solver

    NASA Astrophysics Data System (ADS)

    Moustafa, Salli; Dutka-Malen, Ivan; Plagne, Laurent; Ponçot, Angélique; Ramet, Pierre

    2014-06-01

    This paper describes the design and the performance of DOMINO, a 3D Cartesian SN solver that implements two nested levels of parallelism (multicore+SIMD) on shared memory computation nodes. DOMINO is written in C++, a multi-paradigm programming language that enables the use of powerful and generic parallel programming tools such as Intel TBB and Eigen. These two libraries allow us to combine multi-thread parallelism with vector operations in an efficient and yet portable way. As a result, DOMINO can exploit the full power of modern multi-core processors and is able to tackle very large simulations that usually require large HPC clusters, using a single computing node. For example, DOMINO solves a 3D full core PWR eigenvalue problem involving 26 energy groups, 288 angular directions (S16), 46 × 10^6 spatial cells and 1 × 10^12 DoFs within 11 hours on a single 32-core SMP node. This represents a sustained performance of 235 GFlops and 40.74% of the SMP node peak performance for the DOMINO sweep implementation. The very high Flops/Watt ratio of DOMINO makes it a very interesting building block for a future many-node nuclear simulation tool.

  18. Low latency memory access and synchronization

    DOEpatents

    Blumrich, Matthias A.; Chen, Dong; Coteus, Paul W.; Gara, Alan G.; Giampapa, Mark E.; Heidelberger, Philip; Hoenicke, Dirk; Ohmacht, Martin; Steinmacher-Burow, Burkhard D.; Takken, Todd E.; Vranas, Pavlos M.

    2007-02-06

    A low latency memory system access is provided in association with a weakly-ordered multiprocessor system. Each processor in the multiprocessor shares resources, and each shared resource has an associated lock within a locking device that provides support for synchronization between the multiple processors in the multiprocessor and the orderly sharing of the resources. A processor only has permission to access a resource when it owns the lock associated with that resource, and an attempt by a processor to own a lock requires only a single load operation, rather than a traditional atomic load followed by store, such that the processor only performs a read operation and the hardware locking device performs the subsequent write operation rather than the processor. A simple prefetching for non-contiguous data structures is also disclosed. A memory line is redefined so that in addition to the normal physical memory data, every line includes a pointer that is large enough to point to any other line in the memory, wherein the pointers, rather than some other predictive algorithm, determine which memory line to prefetch. This enables hardware to effectively prefetch memory access patterns that are non-contiguous, but repetitive.
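
    A hypothetical software-level view of the locking claim may clarify it (the names and layout are ours, not the patent's): the lock lives in a hardware locking device, and a single load both tests and acquires it, the device performing the write that a conventional atomic load-then-store sequence would require of the processor.

        #include <stdint.h>

        /* memory-mapped hardware lock registers (address is platform-
           specific and assumed to be set up elsewhere) */
        extern volatile uint32_t *lock_bank;

        static inline int try_acquire(int lock_id) {
            /* ONE read: the device returns 0 if the lock was free and
               simultaneously marks it owned; nonzero means it is held */
            return lock_bank[lock_id] == 0;
        }

        static inline void release(int lock_id) {
            lock_bank[lock_id] = 0;   /* a store returns the lock */
        }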

  19. Low latency memory access and synchronization

    DOEpatents

    Blumrich, Matthias A.; Chen, Dong; Coteus, Paul W.; Gara, Alan G.; Giampapa, Mark E.; Heidelberger, Philip; Hoenicke, Dirk; Ohmacht, Martin; Steinmacher-Burow, Burkhard D.; Takken, Todd E.; Vranas, Pavlos M.

    2010-10-19

    A low latency memory system access is provided in association with a weakly-ordered multiprocessor system. Each processor in the multiprocessor shares resources, and each shared resource has an associated lock within a locking device that provides support for synchronization between the multiple processors in the multiprocessor and the orderly sharing of the resources. A processor only has permission to access a resource when it owns the lock associated with that resource, and an attempt by a processor to own a lock requires only a single load operation, rather than a traditional atomic load followed by store, such that the processor only performs a read operation and the hardware locking device performs the subsequent write operation rather than the processor. A simple prefetching for non-contiguous data structures is also disclosed. A memory line is redefined so that in addition to the normal physical memory data, every line includes a pointer that is large enough to point to any other line in the memory, wherein the pointers, rather than some other predictive algorithm, determine which memory line to prefetch. This enables hardware to effectively prefetch memory access patterns that are non-contiguous, but repetitive.

  20. Compiled execution of the reduce-or process model on multiprocessors

    SciTech Connect

    Ramkumar, B.; Kale, L.V.

    1989-01-01

    In this paper, the authors describe the abstract machine developed for the reduce-or process model (ROPM) and its implementation on a variety of multiprocessors. In keeping with the objective behind the ROPM, the abstract machine is suitable for execution on both shared and nonshared memory machines. Unlike most abstract machines based on the WAM, it uses structure sharing, which offers significant benefits in a nonshared memory context. It has currently been implemented on the Encore Multimax, the Sequent Symmetry, the Alliant, and the Intel iPSC/2. The authors provide preliminary performance data for their implementations on these machines. The benchmarks chosen illustrate the range of programs that ROPM can parallelize: AND-parallel, OR-parallel, and AND/OR-parallel programs are all effectively parallelized and sped up.

  1. Multiprocessor programming environment

    SciTech Connect

    Smith, M.B.; Fornaro, R.

    1988-12-01

    Programming tools and techniques have been well developed for traditional uniprocessor computer systems. The focus of this research project is on the development of a programming environment for a high speed real time heterogeneous multiprocessor system, with special emphasis on languages and compilers. The new tools and techniques will allow a smooth transition for programmers with experience only on single processor systems.

  2. MPI Support for Multi-core Architectures: Optimized Shared Memory Collectives

    SciTech Connect

    Graham, Richard L; Shipman, Galen M

    2008-01-01

    With local core counts on the rise, taking advantage of shared memory to optimize collective operations can improve performance. We study several on-host shared memory optimized algorithms for MPI Bcast, MPI Reduce, and MPI Allreduce, using tree-based and reduce-scatter algorithms. For small data operations with relatively large synchronization costs, fan-in/fan-out algorithms generally perform best. For large messages, data manipulation constitutes the largest cost, and reduce-scatter algorithms are best for reductions. These optimizations improve performance by up to a factor of three. Memory and cache sharing effects require deliberate process layout and careful radix selection for tree-based methods.
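
    The fan-in shape favored for small reductions can be sketched with C11 atomics standing in for the MPI library's on-host shared-memory segment (all names are illustrative, not the paper's implementation): ranks add their contributions into a shared buffer, and the last to arrive completes the operation.

        #include <stdatomic.h>

        typedef struct {
            _Atomic long sum;      /* accumulated value (integer, for brevity) */
            _Atomic int  arrived;  /* fan-in counter */
        } shm_reduce;

        /* returns 1 on the rank that observes the last arrival and so
           acts as the root of the fan-in */
        int fan_in_add(shm_reduce *r, long contribution, int nranks) {
            atomic_fetch_add(&r->sum, contribution);
            int before = atomic_fetch_add(&r->arrived, 1);
            return before == nranks - 1;
        }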

  3. Distributed parallel messaging for multiprocessor systems

    DOEpatents

    Chen, Dong; Heidelberger, Philip; Salapura, Valentina; Senger, Robert M; Steinmacher-Burow, Burkhard; Sugawara, Yutaka

    2013-06-04

    A method and apparatus for distributed parallel messaging in a parallel computing system. The apparatus includes, at each node of a multiprocessor network, multiple injection messaging engine units and reception messaging engine units, each implementing a DMA engine and each supporting both multiple packet injection into and multiple reception from a network, in parallel. The reception side of the messaging unit (MU) includes a switch interface enabling writing of data of a packet received from the network to the memory system. The transmission side of the messaging unit includes a switch interface for reading from the memory system when injecting packets into the network.

  4. Solutions and debugging for data consistency in multiprocessors with noncoherent caches

    SciTech Connect

    Bernstein, D.; Mendelson, B.; Breternitz, M. Jr.; Gheith, A.M.

    1995-02-01

    We analyze two important problems that arise in shared-memory multiprocessor systems. The stale data problem involves ensuring that data items in local memory of individual processors are current, independent of writes done by other processors. False sharing occurs when two processors have copies of the same shared data block but update different portions of the block. The false sharing problem involves guaranteeing that subsequent writes are properly combined. In modern architectures these problems are usually solved in hardware, by exploiting mechanisms for hardware controlled cache consistency. This leads to more expensive and nonscalable designs. Therefore, we are concentrating on software methods for ensuring cache consistency that would allow for affordable and scalable multiprocessing systems. Unfortunately, providing software control is nontrivial, both for the compiler writer and for the application programmer. For this reason we are developing a debugging environment that will facilitate the development of compiler-based techniques and will help the programmer to tune his or her application using explicit cache management mechanisms. We extend the notion of a race condition for IBM Shared Memory System POWER/4, taking into consideration its noncoherent caches, and propose techniques for detection of false sharing problems. Identification of the stale data problem is discussed as well, and solutions are suggested.

  5. Considerations for Multiprocessor Topologies

    NASA Technical Reports Server (NTRS)

    Byrd, Gregory T.; Delagi, Bruce A.

    1987-01-01

    Choosing a multiprocessor interconnection topology may depend on high-level considerations, such as the intended application domain and the expected number of processors. It certainly depends on low-level implementation details, such as packaging and communications protocols. The authors first use rough measures of cost and performance to characterize several topologies. They then examine how implementation details can affect the realizable performance of a topology.

  6. Working Memory Span Development: A Time-Based Resource-Sharing Model Account

    ERIC Educational Resources Information Center

    Barrouillet, Pierre; Gavens, Nathalie; Vergauwe, Evie; Gaillard, Vinciane; Camos, Valerie

    2009-01-01

    The time-based resource-sharing model (P. Barrouillet, S. Bernardin, & V. Camos, 2004) assumes that during complex working memory span tasks, attention is frequently and surreptitiously switched from processing to reactivate decaying memory traces before their complete loss. Three experiments involving children from 5 to 14 years of age…

  7. Implementation of a parallel unstructured Euler solver on shared and distributed memory architectures

    NASA Technical Reports Server (NTRS)

    Mavriplis, D. J.; Das, Raja; Saltz, Joel; Vermeland, R. E.

    1992-01-01

    An efficient three dimensional unstructured Euler solver is parallelized on a Cray Y-MP C90 shared memory computer and on an Intel Touchstone Delta distributed memory computer. This paper relates the experiences gained and describes the software tools and hardware used in this study. Performance comparisons between two differing architectures are made.

  8. A shared neural ensemble links distinct contextual memories encoded close in time.

    PubMed

    Cai, Denise J; Aharoni, Daniel; Shuman, Tristan; Shobe, Justin; Biane, Jeremy; Song, Weilin; Wei, Brandon; Veshkini, Michael; La-Vu, Mimi; Lou, Jerry; Flores, Sergio E; Kim, Isaac; Sano, Yoshitake; Zhou, Miou; Baumgaertel, Karsten; Lavi, Ayal; Kamata, Masakazu; Tuszynski, Mark; Mayford, Mark; Golshani, Peyman; Silva, Alcino J

    2016-06-01

    Recent studies suggest that a shared neural ensemble may link distinct memories encoded close in time. According to the memory allocation hypothesis, learning triggers a temporary increase in neuronal excitability that biases the representation of a subsequent memory to the neuronal ensemble encoding the first memory, such that recall of one memory increases the likelihood of recalling the other memory. Here we show in mice that the overlap between the hippocampal CA1 ensembles activated by two distinct contexts acquired within a day is higher than when they are separated by a week. Several findings indicate that this overlap of neuronal ensembles links two contextual memories. First, fear paired with one context is transferred to a neutral context when the two contexts are acquired within a day but not across a week. Second, the first memory strengthens the second memory within a day but not across a week. Older mice, known to have lower CA1 excitability, do not show the overlap between ensembles, the transfer of fear between contexts, or the strengthening of the second memory. Finally, in aged mice, increasing cellular excitability and activating a common ensemble of CA1 neurons during two distinct context exposures rescued the deficit in linking memories. Taken together, these findings demonstrate that contextual memories encoded close in time are linked by directing storage into overlapping ensembles. Alteration of these processes by ageing could affect the temporal structure of memories, thus impairing efficient recall of related information. PMID:27251287

  9. A shared neural ensemble links distinct contextual memories encoded close in time

    NASA Astrophysics Data System (ADS)

    Cai, Denise J.; Aharoni, Daniel; Shuman, Tristan; Shobe, Justin; Biane, Jeremy; Song, Weilin; Wei, Brandon; Veshkini, Michael; La-Vu, Mimi; Lou, Jerry; Flores, Sergio E.; Kim, Isaac; Sano, Yoshitake; Zhou, Miou; Baumgaertel, Karsten; Lavi, Ayal; Kamata, Masakazu; Tuszynski, Mark; Mayford, Mark; Golshani, Peyman; Silva, Alcino J.

    2016-06-01

    Recent studies suggest that a shared neural ensemble may link distinct memories encoded close in time. According to the memory allocation hypothesis, learning triggers a temporary increase in neuronal excitability that biases the representation of a subsequent memory to the neuronal ensemble encoding the first memory, such that recall of one memory increases the likelihood of recalling the other memory. Here we show in mice that the overlap between the hippocampal CA1 ensembles activated by two distinct contexts acquired within a day is higher than when they are separated by a week. Several findings indicate that this overlap of neuronal ensembles links two contextual memories. First, fear paired with one context is transferred to a neutral context when the two contexts are acquired within a day but not across a week. Second, the first memory strengthens the second memory within a day but not across a week. Older mice, known to have lower CA1 excitability, do not show the overlap between ensembles, the transfer of fear between contexts, or the strengthening of the second memory. Finally, in aged mice, increasing cellular excitability and activating a common ensemble of CA1 neurons during two distinct context exposures rescued the deficit in linking memories. Taken together, these findings demonstrate that contextual memories encoded close in time are linked by directing storage into overlapping ensembles. Alteration of these processes by ageing could affect the temporal structure of memories, thus impairing efficient recall of related information.

  10. A system for simulating shared memory in heterogeneous distributed-memory networks with specialization for robotics applications

    SciTech Connect

    Jones, J.P.; Bangs, A.L.; Butler, P.L.

    1991-01-01

    Hetero Helix is a programming environment which simulates shared memory on a heterogeneous network of distributed-memory computers. The machines in the network may vary with respect to their native operating systems and internal representation of numbers. Hetero Helix presents a simple programming model to developers, and also considers the needs of designers, system integrators, and maintainers. The key software technology underlying Hetero Helix is the use of a "compiler" which analyzes the data structures in shared memory and automatically generates code which translates data representations from the format native to each machine into a common format, and vice versa. The design of Hetero Helix was motivated in particular by the requirements of robotics applications. Hetero Helix has been used successfully in an integration effort involving 27 CPUs in a heterogeneous network and a body of software totaling roughly 100,000 lines of code. 25 refs., 6 figs.
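
    The representation-translation idea can be sketched with standard byte-order conversion: data is converted to a common format on write and back to each machine's native format on read. The pose_t type and the two translator functions are hypothetical stand-ins for the code the Hetero Helix "compiler" would generate.

    ```c
    /* Sketch of compiler-generated representation translation: writers
     * convert to a common (network byte order) format, readers convert
     * back to native. htonl/ntohl are the standard POSIX calls. */
    #include <arpa/inet.h>
    #include <stdint.h>
    #include <string.h>

    typedef struct { uint32_t joint_id; uint32_t encoder_ticks; } pose_t;

    /* native -> common format (runs on the writing machine) */
    void pose_to_common(const pose_t *in, unsigned char out[8]) {
        uint32_t a = htonl(in->joint_id), b = htonl(in->encoder_ticks);
        memcpy(out,     &a, 4);
        memcpy(out + 4, &b, 4);
    }

    /* common format -> native (runs on each reading machine) */
    void pose_from_common(const unsigned char in[8], pose_t *out) {
        uint32_t a, b;
        memcpy(&a, in, 4);
        memcpy(&b, in + 4, 4);
        out->joint_id      = ntohl(a);
        out->encoder_ticks = ntohl(b);
    }

    int main(void) {
        pose_t p = { 7, 123456 }, q;
        unsigned char wire[8];
        pose_to_common(&p, wire);           /* as on the writer */
        pose_from_common(wire, &q);         /* as on a reader */
        return !(q.joint_id == 7 && q.encoder_ticks == 123456);
    }
    ```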

  11. Matrix factorization on a hypercube multiprocessor

    SciTech Connect

    Geist, G.A.; Heath, M.T.

    1985-08-01

    This paper is concerned with parallel algorithms for matrix factorization on distributed-memory, message-passing multiprocessors, with special emphasis on the hypercube. Both Cholesky factorization of symmetric positive definite matrices and LU factorization of nonsymmetric matrices using partial pivoting are considered. The use of the resulting triangular factors to solve systems of linear equations by forward and back substitutions is also considered. Efficiencies of various parallel computational approaches are compared in terms of empirical results obtained on an Intel iPSC hypercube. 19 refs., 6 figs., 2 tabs.
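
    For reference, a serial column-oriented Cholesky kernel in C; this is the computation whose columns hypercube algorithms of this kind distribute across nodes, not the paper's message-passing code.

    ```c
    /* Left-looking Cholesky: overwrite the lower triangle of a symmetric
     * positive definite matrix A (n x n, row-major) with L, A = L L^T. */
    #include <math.h>
    #include <stdio.h>

    void cholesky(int n, double *A) {
        for (int j = 0; j < n; j++) {
            for (int k = 0; k < j; k++)          /* update column j */
                for (int i = j; i < n; i++)
                    A[i*n + j] -= A[i*n + k] * A[j*n + k];
            double d = sqrt(A[j*n + j]);         /* assumes A is SPD */
            A[j*n + j] = d;
            for (int i = j + 1; i < n; i++)      /* scale below diagonal */
                A[i*n + j] /= d;
        }
    }

    int main(void) {
        double A[9] = {  4, 12, -16,
                        12, 37, -43,
                       -16,-43,  98 };
        cholesky(3, A);
        printf("diag(L) = %g %g %g\n", A[0], A[4], A[8]);  /* expect 2 1 3 */
        return 0;
    }
    ```

    In a column-oriented distributed scheme, the columns of the j loop are dealt to processors (a wrap mapping is one common choice), and each completed column is communicated to the processors holding later columns.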

  12. Forgetting the unforgettable through conversation: socially shared retrieval-induced forgetting of September 11 memories.

    PubMed

    Coman, Alin; Manier, David; Hirst, William

    2009-05-01

    A speaker's selective recounting of memories shared with a listener will induce both the speaker and the listener to forget unmentioned, related material more than unmentioned, unrelated material. We extended this finding of within-individual and socially shared retrieval-induced forgetting to well-rehearsed, emotionally intense memories that are similar for the speaker and listener, but differ in specifics. A questionnaire probed participants' memory of the September 11 terrorist attacks. Questions and responses were grouped into category-exemplar structures. Then, participants selectively rehearsed their answers (using a structured interview in Experiment 1 and a joint recounting between pairs in Experiment 2). In subsequent recognition tests, response times yielded evidence of within-individual retrieval-induced forgetting and socially shared retrieval-induced forgetting. This result indicates that conversations can alter memories of speakers and listeners in similar ways, even when the memories differ. We discuss socially shared retrieval-induced forgetting as a mechanism for the formation of collective memories. PMID:19476592

  13. Automatic Generation of Directive-Based Parallel Programs for Shared Memory Parallel Systems

    NASA Technical Reports Server (NTRS)

    Jin, Hao-Qiang; Yan, Jerry; Frumkin, Michael

    2000-01-01

    The shared-memory programming model is a very effective way to achieve parallelism on shared memory parallel computers. As great progress has been made in hardware and software technologies, the performance of parallel programs using compiler directives has improved substantially. The introduction of OpenMP directives, the industry standard for shared-memory programming, has minimized the issue of portability. Owing to its ease of programming and its good performance, the technique has become very popular. In this study, we have extended CAPTools, a computer-aided parallelization toolkit, to automatically generate directive-based, OpenMP, parallel programs. We outline techniques used in the implementation of the tool and present test results on the NAS parallel benchmarks and ARC3D, a CFD application. This work demonstrates the great potential of using computer-aided tools to quickly port parallel programs and also achieve good performance.
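
    The kind of directive-based code such a tool emits can be illustrated with plain OpenMP worksharing and reduction directives; the fragment below is a generic example, not actual CAPTools output.

    ```c
    /* One pragma parallelizes an independent loop; a second handles the
     * reduction. This is the directive style the toolkit generates. */
    #include <omp.h>
    #include <stdio.h>

    #define N 1000000

    static double x[N], y[N];

    int main(void) {
        double dot = 0.0;
        #pragma omp parallel for
        for (int i = 0; i < N; i++) {        /* independent iterations */
            x[i] = i * 0.5;
            y[i] = (double)(N - i);
        }
        #pragma omp parallel for reduction(+:dot)
        for (int i = 0; i < N; i++)          /* OpenMP combines partial sums */
            dot += x[i] * y[i];
        printf("dot = %g using up to %d threads\n", dot, omp_get_max_threads());
        return 0;
    }
    ```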

  14. Programmable partitioning for high-performance coherence domains in a multiprocessor system

    DOEpatents

    Blumrich, Matthias A.; Salapura, Valentina

    2011-01-25

    A multiprocessor computing system and a method of logically partitioning a multiprocessor computing system are disclosed. The multiprocessor computing system comprises a multitude of processing units, and a multitude of snoop units. Each of the processing units includes a local cache, and the snoop units are provided for supporting cache coherency in the multiprocessor system. Each of the snoop units is connected to a respective one of the processing units and to all of the other snoop units. The multiprocessor computing system further includes a partitioning system for using the snoop units to partition the multitude of processing units into a plurality of independent, memory-consistent, adjustable-size processing groups. Preferably, when the processor units are partitioned into these processing groups, the partitioning system also configures the snoop units to maintain cache coherency within each of said groups.

  15. Socially shared mourning: construction and consumption of collective memory

    NASA Astrophysics Data System (ADS)

    Harju, Anu

    2015-04-01

    Social media, such as YouTube, is increasingly a site of collective remembering where personal tributes to celebrity figures become sites of public mourning. YouTube, especially, is rife with celebrity commemorations. Examining fans' online mourning practices on YouTube, this paper examines video tributes dedicated to the late Steve Jobs, with a focus on collective remembering and collective construction of memory. Combining netnography with critical discourse analysis, the analysis focuses on the user comments where the past unfolds in interaction and meanings are negotiated and contested. The paper argues that celebrity death may, for avid fans, be a source of disenfranchised grief, a type of grief characterised by inadequate social support, usually arising from lack of empathy for the loss. The paper sheds light on the functions digital memorials have for mourning fans (and fandom) and argues that social media sites have come to function as spaces of negotiation, legitimisation and alleviation of disenfranchised grief. It is also suggested that when it comes to disenfranchised grief, and grief work generally, the concept of community be widened to include communities of weak ties, a typical form of communal belonging on social media.

  16. A New Shared-Memory Programming Paradigm for Molecular Dynamics Simulations on the Intel Paragon

    SciTech Connect

    D'Azevedo, E.F.

    1995-01-01

    This report describes the use of shared memory emulation with DOLIB (Distributed Object Library) to simplify parallel programming on the Intel Paragon. A molecular dynamics application is used as an example to illustrate the use of the DOLIB shared memory library. SOTON-PAR, a parallel molecular dynamics code with explicit message-passing using a Lennard-Jones 6-12 potential, is rewritten using DOLIB primitives. The resulting code has no explicit message primitives and resembles a serial code. The new code can perform dynamic load balancing and achieves better performance than the original parallel code with explicit message-passing.

  17. A new shared-memory programming paradigm for molecular dynamics simulations on the Intel Paragon

    SciTech Connect

    D`Azevedo, E.F.; Romine, C.H.

    1994-12-01

    This report describes the use of shared memory emulation with DOLIB (Distributed Object Library) to simplify parallel programming on the Intel Paragon. A molecular dynamics application is used as an example to illustrate the use of the DOLIB shared memory library. SOTON-PAR, a parallel molecular dynamics code with explicit message-passing using a Lennard-Jones 6-12 potential, is rewritten using DOLIB primitives. The resulting code has no explicit message primitives and resembles a serial code. The new code can perform dynamic load balancing and achieves better performance than the original parallel code with explicit message-passing.

  18. Shared Representations in Language Processing and Verbal Short-Term Memory: The Case of Grammatical Gender

    ERIC Educational Resources Information Center

    Schweppe, Judith; Rummer, Ralf

    2007-01-01

    The general idea of language-based accounts of short-term memory is that retention of linguistic materials is based on representations within the language processing system. In the present sentence recall study, we address the question whether the assumption of shared representations holds for morphosyntactic information (here: grammatical gender…

  19. Functions of Memory Sharing and Mother-Child Reminiscing Behaviors: Individual and Cultural Variations

    ERIC Educational Resources Information Center

    Kulkofsky, Sarah; Wang, Qi; Koh, Jessie Bee Kim

    2009-01-01

    This study examined maternal beliefs about the functions of memory sharing and the relations between these beliefs and mother-child reminiscing behaviors in a cross-cultural context. Sixty-three European American and 47 Chinese mothers completed an open-ended questionnaire concerning their beliefs about the functions of parent-child memory…

  20. Method and apparatus for single-stepping coherence events in a multiprocessor system under software control

    DOEpatents

    Blumrich, Matthias A.; Salapura, Valentina

    2010-11-02

    An apparatus and method are disclosed for single-stepping coherence events in a multiprocessor system under software control in order to monitor the behavior of a memory coherence mechanism. Single-stepping coherence events in a multiprocessor system is made possible by adding one or more step registers. By accessing these step registers, one or more coherence requests are processed by the multiprocessor system. The step registers determine whether the snoop unit proceeds in a normal execution mode or operates in a single-step mode.

  1. Optical Shared Memory Computing and Multiple Access Protocols for Photonic Networks

    NASA Astrophysics Data System (ADS)

    Li, Kuang-Yu.

    In this research we investigate potential applications of optics in massively parallel computer systems, especially focusing on design issues in three-dimensional optical data storage and free-space photonic networks. An optical implementation of a shared memory uses a single photorefractive crystal and can realize the set of memory modules in a digital shared memory computer. A complete instruction set consists of READ, WRITE, SELECTIVE ERASE, and REFRESH, which can be applied to any memory module independent of (and in parallel with) instructions to the other memory modules. In addition, a memory module can execute a sequence of READ operations simultaneously with the execution of a WRITE operation to accommodate differences in optical recording and readout times common to optical volume storage media. An experimental shared memory system is demonstrated and its projected performance is analyzed. A multiplexing technique is presented to significantly reduce both grating- and beam-degeneracy crosstalk in volume holographic systems, by incorporating space, angle, and wavelength as the multiplexing parameters. In this approach, each hologram, which results from the interference between a single input node and an object array, partially overlaps with the other holograms in its neighborhood. This technique can offer improved interconnection density, optical throughput, signal fidelity, and space-bandwidth product utilization. Design principles and numerical simulation results are presented. A free-space photonic cellular hypercube parallel computer, with emphasis on the design of a collisionless multiple access protocol, is presented. This design incorporates wavelength-, space-, and time-multiplexing to achieve multiple access, wavelength reuse, dense connectivity, collisionless communications, and a simple control mechanism. Analytic models based on semi-Markov processes are employed to analyze this protocol. The performance of the

  2. Visual and Spatial Working Memory Are Not that Dissociated after All: A Time-Based Resource-Sharing Account

    ERIC Educational Resources Information Center

    Vergauwe, Evie; Barrouillet, Pierre; Camos, Valerie

    2009-01-01

    Examinations of interference between visual and spatial materials in working memory have suggested domain- and process-based fractionations of visuo-spatial working memory. The present study examined the role of central time-based resource sharing in visuo-spatial working memory and assessed its role in obtained interference patterns. Visual and…

  3. Principles for problem aggregation and assignment in medium scale multiprocessors

    NASA Technical Reports Server (NTRS)

    Nicol, David M.; Saltz, Joel H.

    1987-01-01

    One of the most important issues in parallel processing is the mapping of workload to processors. This paper considers a large class of problems having a high degree of potential fine grained parallelism, and execution requirements that are either not predictable, or are too costly to predict. The main issues in mapping such a problem onto medium scale multiprocessors are those of aggregation and assignment. We study a method of parameterized aggregation that makes few assumptions about the workload. The mapping of aggregate units of work onto processors is uniform, and exploits locality of workload intensity to balance the unknown workload. In general, a finer aggregate granularity leads to a better balance at the price of increased communication/synchronization costs; the aggregation parameters can be adjusted to find a reasonable granularity. The effectiveness of this scheme is demonstrated on three model problems: an adaptive one-dimensional fluid dynamics problem with message passing, a sparse triangular linear system solver on both a shared memory and a message-passing machine, and a two-dimensional time-driven battlefield simulation employing message passing. Using the model problems, the tradeoffs are studied between balanced workload and the communication/synchronization costs. Finally, an analytical model is used to explain why the method balances workload and minimizes the variance in system behavior.
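
    A toy version of parameterized aggregation, under stated assumptions: N fine-grained units are grouped into G contiguous aggregates (G is the tunable granularity knob), and aggregates are assigned to P processors cyclically so that locally intense regions of the workload spread across processors. The function is invented for illustration; the paper's scheme is more general.

    ```c
    /* Map work unit -> processor: contiguous units form an aggregate
     * (preserving locality); aggregates are dealt cyclically (balancing
     * unknown intensity). A larger G means finer grain: better balance,
     * but higher communication/synchronization cost, as the abstract notes. */
    #include <stdio.h>

    int owner_of_unit(int unit, int n_units, int n_aggs, int n_procs) {
        int agg_size = (n_units + n_aggs - 1) / n_aggs; /* units per aggregate */
        int agg = unit / agg_size;                      /* which aggregate */
        return agg % n_procs;                           /* cyclic assignment */
    }

    int main(void) {
        int n = 24, g = 8, p = 3;
        for (int u = 0; u < n; u++)
            printf("unit %2d -> aggregate %d -> proc %d\n",
                   u, u / ((n + g - 1) / g), owner_of_unit(u, n, g, p));
        return 0;
    }
    ```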

  4. A prototype functional language implementation for hierarchical-memory architectures

    SciTech Connect

    Wolski, R.; Feo, J.; Cann, D.

    1991-06-05

    The first implementation of Sisal was designed for general shared-memory architectures. Since then, we have optimized the system for vector and coherent-cache multiprocessors. Coherent-cache systems can be thought of as simple, two-level hierarchical memory systems, where the memory hierarchy is managed by the hardware. The compiler and run-time system for such an architecture needs to maintain data locality so that the processor caches are used as much as possible. In this paper, we extend the coherent-cache implementation to include explicit compiler and run-time control for medium-grain and coarse-grain hierarchical-memory architectures. We implemented the extended system on the BBN Butterfly using interleaved shared memory exclusively for the purposes of data sharing and exploiting the per-processor local memories. We give preliminary performance results for this extended system. 10 refs., 7 figs.

  5. Parallel calculations on shared memory, NUMA-based computers using MATLAB

    NASA Astrophysics Data System (ADS)

    Krotkiewski, Marcin; Dabrowski, Marcin

    2014-05-01

    Achieving satisfactory computational performance in numerical simulations on modern computer architectures can be a complex task. Multi-core design makes it necessary to parallelize the code. Efficient parallelization on NUMA (Non-Uniform Memory Access) shared memory architectures necessitates explicit placement of the data in the memory close to the CPU that uses it. In addition, using more than 8 CPUs (~100 cores) requires a cluster solution of interconnected nodes, which involves (expensive) communication between the processors. It takes significant effort to overcome these challenges even when programming in low-level languages, which give the programmer full control over data placement and work distribution. Instead, many modelers use high-level tools such as MATLAB, which severely limit the optimization/tuning options available. Nonetheless, the advantage of programming simplicity and a large available code base can tip the scale in favor of MATLAB. We investigate whether MATLAB can be used for efficient, parallel computations on modern shared memory architectures. A common approach to performance optimization of MATLAB programs is to identify a bottleneck and migrate the corresponding code block to a MEX file implemented in, e.g. C. Instead, we aim at achieving a scalable parallel performance of MATLAB's core functionality. Some of MATLAB's internal functions (e.g., bsxfun, sort, BLAS3, operations on vectors) are multi-threaded. Achieving high parallel efficiency of those may potentially improve the performance of a significant portion of MATLAB's code base. Since we do not have MATLAB's source code, our performance tuning relies on the tools provided by the operating system alone. Most importantly, we use custom memory allocation routines, thread-to-CPU binding, and memory page migration. The performance tests are carried out on multi-socket shared memory systems (2- and 4-way Intel-based computers), as well as a Distributed Shared Memory machine with 96 CPU
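
    The two operating-system tools the abstract relies on, thread-to-CPU binding and first-touch page placement, look roughly like this on Linux; pthread_setaffinity_np is a glibc extension, and the sketch is independent of MATLAB itself.

    ```c
    /* Pin each thread to one CPU, then have it initialize ("first touch")
     * the block of memory it will later use, so the NUMA pages are
     * allocated near that CPU. Error checking omitted for brevity. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NTHREADS 4
    #define CHUNK (1L << 20)            /* doubles per thread */

    static double *buf;

    static void *worker(void *arg) {
        long id = (long)arg;
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET((int)id, &set);         /* bind this thread to CPU id */
        pthread_setaffinity_np(pthread_self(), sizeof set, &set);
        double *mine = buf + id * CHUNK;
        for (long i = 0; i < CHUNK; i++)
            mine[i] = 0.0;              /* first touch places these pages */
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        buf = malloc(sizeof(double) * CHUNK * NTHREADS);
        if (!buf) return 1;
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("threads pinned; each thread's pages placed by first touch\n");
        free(buf);
        return 0;
    }
    ```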

  6. Sequoia: A fault-tolerant tightly coupled multiprocessor for transaction processing

    SciTech Connect

    Bernstein, P.A.

    1988-02-01

    The Sequoia computer is a tightly coupled multiprocessor, and thus attains the performance advantages of this style of architecture. It avoids most of the fault-tolerance disadvantages of tight coupling by using a new fault-tolerance design. The Sequoia architecture is similar to other multimicroprocessor architectures, such as those of Encore and Sequent, in that it gives dozens of microprocessors shared access to a large main memory. It resembles the Stratus architecture in its extensive use of hardware fault-detection techniques. It resembles Stratus and Auragen in its ability to quickly recover all processes after a single point failure, transparently to the user. However, Sequoia is unique in its combination of a large-scale tightly coupled architecture with a hardware approach to fault tolerance. This article gives an overview of how the hardware architecture and operating systems (OS) work together to provide a high degree of fault tolerance with good system performance.

  7. HyperForest: A high performance multi-processor architecture for real-time intelligent systems

    SciTech Connect

    Garcia, P. Jr.; Rebeil, J.P.; Pollard, H.

    1997-04-01

    Intelligent Systems are characterized by the intensive use of computer power. The computer revolution of the last few years is what has made possible the development of the first generation of Intelligent Systems. Software for second generation Intelligent Systems will be more complex and will require more powerful computing engines in order to meet real-time constraints imposed by new robots, sensors, and applications. A multiprocessor architecture was developed that merges the advantages of message-passing and shared-memory structures: expandability and real-time compliance. The HyperForest architecture will provide an expandable real-time computing platform for computationally intensive Intelligent Systems and open the doors for the application of these systems to more complex tasks in environmental restoration and cleanup projects, flexible manufacturing systems, and DOE's own production and disassembly activities.

  8. Using memory in the Cedar system

    SciTech Connect

    McGrath, R.E.; Emrath, P.

    1987-01-01

    The design of the virtual memory system for the Cedar multiprocessor under construction at the University of Illinois is discussed. The Cedar architecture features a hierarchy of memory, some shared by all processors, and some shared by subsets of processors. The Xylem operating system is based on Alliant Computer Systems CONCENTRIX(TM) operating system, which is based on 4.2BSD UNIX(TM). Xylem supports multi-tasking and demand paging of parts of the memory hierarchy into a linear virtual address space. Memory may be private to a task or shared between all the tasks. The locality and attributes of a page may be modified during the execution of a program. Examples of how these mechanisms can be used are discussed. 14 figs.

  9. Vascular system modeling in parallel environment - distributed and shared memory approaches

    PubMed Central

    Jurczuk, Krzysztof; Kretowski, Marek; Bezy-Wendling, Johanne

    2011-01-01

    The paper presents two approaches in parallel modeling of vascular system development in internal organs. In the first approach, new parts of tissue are distributed among processors and each processor is responsible for perfusing its assigned parts of tissue to all vascular trees. Communication between processors is accomplished by passing messages and therefore this algorithm is perfectly suited for distributed memory architectures. The second approach is designed for shared memory machines. It parallelizes the perfusion process during which individual processing units perform calculations concerning different vascular trees. The experimental results, performed on a computing cluster and multi-core machines, show that both algorithms provide a significant speedup. PMID:21550891

  10. Embedded Multiprocessor Technology for VHSIC Insertion

    NASA Technical Reports Server (NTRS)

    Hayes, Paul J.

    1990-01-01

    Viewgraphs on embedded multiprocessor technology for VHSIC insertion are presented. The objective was to develop multiprocessor system technology providing user-selectable fault tolerance, increased throughput, and ease of application representation for concurrent operation. The approach was to develop graph management mapping theory for proper performance, model multiprocessor performance, and demonstrate performance in selected hardware systems.

  11. Fault tolerant onboard packet switch architecture for communication satellites: Shared memory per beam approach

    NASA Technical Reports Server (NTRS)

    Shalkhauser, Mary Jo; Quintana, Jorge A.; Soni, Nitin J.

    1994-01-01

    The NASA Lewis Research Center is developing a multichannel communication signal processing satellite (MCSPS) system which will provide low data rate, direct to user, commercial communications services. The focus of current space segment developments is a flexible, high-throughput, fault tolerant onboard information switching processor. This information switching processor (ISP) is a destination-directed packet switch which performs both space and time switching to route user information among numerous user ground terminals. Through both industry study contracts and in-house investigations, several packet switching architectures were examined. A contention-free approach, the shared memory per beam architecture, was selected for implementation. The shared memory per beam architecture, fault tolerance insertion, implementation, and demonstration plans are described.

  12. IMPACC: A Tightly Integrated MPI+OpenACC Framework Exploiting Shared Memory Parallelism

    SciTech Connect

    Lee, Seyong; Vetter, Jeffrey S

    2016-01-01

    We propose IMPACC, an MPI+OpenACC framework for heterogeneous accelerator clusters. IMPACC tightly integrates MPI and OpenACC, while exploiting the shared memory parallelism in the target system. IMPACC dynamically adapts the input MPI+OpenACC applications on the target heterogeneous accelerator clusters to fully exploit target system-specific features. IMPACC provides programmers with a unified virtual address space, automatic NUMA-friendly task-device mapping, efficient integrated communication routines, seamless streamlining of asynchronous executions, and transparent memory sharing. We have implemented IMPACC and evaluated its performance using three heterogeneous accelerator systems, including the Titan supercomputer. Results show that IMPACC can achieve easier programming, higher performance, and better scalability than the current MPI+OpenACC model.

  13. Performance measures for multiprocessor controllers

    NASA Technical Reports Server (NTRS)

    Krishna, C. M.; Shin, K. G.

    1982-01-01

    Performance measures to characterize fault tolerant multiprocessors used in the control of critical processes are considered. Our performance indices are based on controller response time. By relating this to the needs of the application, we have been able to derive indices that faithfully reflect the performance of the multiprocessor in the context of the application, that permit the objective comparison of rival computer systems, and that can either be definitively estimated or objectively measured. An example of a controller in an idealized satellite application is provided.

  14. Global arrays: A portable "shared-memory" programming model for distributed memory computers

    SciTech Connect

    Harrison, R.J.; Nieplocha, J.; Littlefield, R.J.

    1994-11-01

    Portability, efficiency, and ease of coding are all important considerations in choosing the programming model for a scalable parallel application. The message-passing programming model is widely used because of its portability, yet some applications are too complex to code in it while also trying to maintain a balanced computation load and avoid redundant computations. The shared-memory programming model simplifies coding, but it is not portable and often provides little control over interprocessor data transfer costs. This paper describes a new approach, called Global Arrays (GA), that combines the better features of both other models, leading to both simple coding and efficient execution. The key concept of GA is that it provides a portable interface through which each process in a MIMD parallel program can asynchronously access logical blocks of physically distributed matrices, with no need for explicit cooperation by other processes. The authors have implemented GA libraries on a variety of computer systems, including the Intel DELTA and Paragon, the IBM SP-1 (all message-passers), the Kendall Square KSR-2 (a nonuniform access shared-memory machine), and networks of Unix workstations. They discuss the design and implementation of these libraries, report their performance, illustrate the use of GA in the context of computational chemistry applications, and describe the use of a GA performance visualization tool.
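
    The GA programming model in miniature: any process reads or writes a logical block of a distributed matrix through one-sided put/get calls, with no cooperation from other processes. The single-process mock below only conveys the interface shape; the names and signatures are hypothetical, not GA's real C API.

    ```c
    /* Mock of a one-sided block put/get interface. In the real library
     * the matrix is physically distributed, and these calls move data to
     * and from remote memory without the owner's participation. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { int rows, cols; double *data; } ga_t;

    ga_t *ga_create(int rows, int cols) {   /* in GA: collective, distributed */
        ga_t *g = malloc(sizeof *g);
        g->rows = rows; g->cols = cols;
        g->data = calloc((size_t)rows * cols, sizeof(double));
        return g;
    }

    /* one-sided write of block [rlo..rhi] x [clo..chi] from buf (row-major) */
    void ga_put(ga_t *g, int rlo, int rhi, int clo, int chi, const double *buf) {
        int w = chi - clo + 1;
        for (int i = rlo; i <= rhi; i++)
            for (int j = clo; j <= chi; j++)
                g->data[i * g->cols + j] = buf[(i - rlo) * w + (j - clo)];
    }

    /* one-sided read of the same block shape into buf */
    void ga_get(ga_t *g, int rlo, int rhi, int clo, int chi, double *buf) {
        int w = chi - clo + 1;
        for (int i = rlo; i <= rhi; i++)
            for (int j = clo; j <= chi; j++)
                buf[(i - rlo) * w + (j - clo)] = g->data[i * g->cols + j];
    }

    int main(void) {
        ga_t *g = ga_create(8, 8);
        double blk[4] = { 1, 2, 3, 4 }, out[4];
        ga_put(g, 2, 3, 2, 3, blk);         /* write a 2x2 block */
        ga_get(g, 2, 3, 2, 3, out);         /* read it back */
        printf("%g %g %g %g\n", out[0], out[1], out[2], out[3]);
        free(g->data); free(g);
        return 0;
    }
    ```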

  15. Shared and Distributed Memory Parallel Security Analysis of Large-Scale Source Code and Binary Applications

    SciTech Connect

    Quinlan, D; Barany, G; Panas, T

    2007-08-30

    Many forms of security analysis on large scale applications can be substantially automated but the size and complexity can exceed the time and memory available on conventional desktop computers. Most commercial tools are understandably focused on such conventional desktop resources. This paper presents research work on the parallelization of security analysis of both source code and binaries within our Compass tool, which is implemented using the ROSE source-to-source open compiler infrastructure. We have focused on both shared and distributed memory parallelization of the evaluation of rules implemented as checkers for a wide range of secure programming rules, applicable to desktop machines, networks of workstations and dedicated clusters. While Compass as a tool focuses on source code analysis and reports violations of an extensible set of rules, the binary analysis work uses the exact same infrastructure but is less well developed into an equivalent final tool.

  16. Data traffic reduction schemes for Cholesky factorization on asynchronous multiprocessor systems

    NASA Technical Reports Server (NTRS)

    Naik, Vijay K.; Patrick, Merrell L.

    1989-01-01

    Communication requirements of Cholesky factorization of dense and sparse symmetric, positive definite matrices are analyzed. The communication requirement is characterized by the data traffic generated on multiprocessor systems with local and shared memory. Lower bound proofs are given to show that when the load is uniformly distributed, the data traffic associated with factoring an n x n dense matrix using n^α (α ≤ 2) processors is Ω(n^(2+α/2)). For n x n sparse matrices representing a √n x √n regular grid graph, the data traffic is shown to be Ω(n^(1+α/2)), α ≤ 1. Partitioning schemes that are variations of the block assignment scheme are described, and it is shown that the data traffic generated by these schemes is asymptotically optimal. The schemes allow efficient use of up to O(n^2) processors in the dense case and up to O(n) processors in the sparse case before the total data traffic reaches the maximum values of O(n^3) and O(n^(3/2)), respectively. It is shown that the block-based partitioning schemes allow better utilization of the data accessed from shared memory, and thus generate less data traffic, than schemes based on column-wise wrap-around assignment.
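
    Substituting the extreme processor counts into these bounds recovers the stated maxima; the block below is direct arithmetic on the formulas above, not a new result.

    ```latex
    % With n^{\alpha} processors, the lower bounds evaluate as:
    %   dense,  \alpha = 2 (n^2 processors): \Omega(n^{2+2/2}) = \Omega(n^3)
    %   sparse, \alpha = 1 (n processors):   \Omega(n^{1+1/2}) = \Omega(n^{3/2})
    % matching the stated maxima O(n^3) and O(n^{3/2}).
    \[
      T_{\mathrm{dense}}  = \Omega\bigl(n^{\,2+\alpha/2}\bigr)\ (\alpha \le 2),
      \qquad
      T_{\mathrm{sparse}} = \Omega\bigl(n^{\,1+\alpha/2}\bigr)\ (\alpha \le 1).
    \]
    ```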

  17. The Terracorrelator: a shared memory HPC facility for real-time seismological cross-correlation analyses

    NASA Astrophysics Data System (ADS)

    Atkinson, Malcolm; Bell, Andrew; Curtis, Andrew; Entwistle, Elizabeth; Filgueira, Rosa; Krause, Amrey; Main, Ian; Meles, Giovani; Miniter, Mike; Zhao, Youqian

    2015-04-01

    Earthquakes and volcanic eruptions may in some instances be preceded or accompanied by changes in the geophysical properties of the Earth, such as seismic velocities or event rates. The development of reliable probabilistic forecasting methods for these hazards requires real-time analysis of seismic data and truly prospective forecasting and testing to reduce bias. However, potential forecasting techniques, including seismic interferometry and earthquake "repeater" analysis, require a large number of waveform cross-correlations; this is computationally intensive, and is particularly challenging in real-time. Here we describe the "Terracorrelator", a new high performance computing facility at the University of Edinburgh designed for real-time cross-correlational analyses. The machine consists of two 2TB shared memory nodes for cross-correlation and post-processing, and two Intel Xeon Phi nodes for pre-processing. The Terracorrelator has been tested on a seismic interferometry case study using ObsPy for seismic operations and processing, and Dispel4Py for writing and executing the workflow. The workflow is distributed automatically for parallel processing in a shared memory multicore environment. Preliminary results have demonstrated that data from 1000 seismic stations can be pre-processed, and each station cross-correlated with all others (499500 cross-correlations) in hourly or daily intervals sufficiently quickly to keep ahead of new data arriving, on one of the shared memory nodes. The second node is therefore free to perform interpretative analysis on the outputs, for example to look at changes in the resulting correlations. These promising results suggest that it will be possible to undertake real-time interferometric analysis using ~1000 stations, and to test the predictive power of current seismic velocity changes for future hazard occurrence.

  18. A multiprocessor operating system simulator

    SciTech Connect

    Johnston, G.M.; Campbell, R.H. (Dept. of Computer Science)

    1988-01-01

    This paper describes a multiprocessor operating system simulator that was developed by the authors in the Fall of 1987. The simulator was built in response to the need to provide students with an environment in which to build and test operating system concepts as part of the coursework of a third-year undergraduate operating systems course. Written in C++, the simulator uses the co-routine style task package that is distributed with the AT&T C++ Translator to provide a hierarchy of classes that represents a broad range of operating system software and hardware components. The class hierarchy closely follows that of the Choices family of operating systems for loosely and tightly coupled multiprocessors. During an operating system course, these classes are refined and specialized by students in homework assignments to facilitate experimentation with different aspects of operating system design and policy decisions. The current implementation runs on the IBM RT PC under 4.3bsd UNIX.

  19. A Multiprocessor Operating System Simulator

    NASA Technical Reports Server (NTRS)

    Johnston, Gary M.; Campbell, Roy H.

    1988-01-01

    This paper describes a multiprocessor operating system simulator that was developed by the authors in the Fall semester of 1987. The simulator was built in response to the need to provide students with an environment in which to build and test operating system concepts as part of the coursework of a third-year undergraduate operating systems course. Written in C++, the simulator uses the co-routine style task package that is distributed with the AT&T C++ Translator to provide a hierarchy of classes that represents a broad range of operating system software and hardware components. The class hierarchy closely follows that of the 'Choices' family of operating systems for loosely- and tightly-coupled multiprocessors. During an operating system course, these classes are refined and specialized by students in homework assignments to facilitate experimentation with different aspects of operating system design and policy decisions. The current implementation runs on the IBM RT PC under 4.3bsd UNIX.

  20. Reproducibility in a multiprocessor system

    DOEpatents

    Bellofatto, Ralph A; Chen, Dong; Coteus, Paul W; Eisley, Noel A; Gara, Alan; Gooding, Thomas M; Haring, Rudolf A; Heidelberger, Philip; Kopcsay, Gerard V; Liebsch, Thomas A; Ohmacht, Martin; Reed, Don D; Senger, Robert M; Steinmacher-Burow, Burkhard; Sugawara, Yutaka

    2013-11-26

    Fixing a problem is usually greatly aided if the problem is reproducible. To ensure reproducibility of a multiprocessor system, the following aspects are proposed: a deterministic system start state, a single system clock, phase alignment of clocks in the system, system-wide synchronization events, reproducible execution of system components, deterministic chip interfaces, zero-impact communication with the system, precise stop of the system, and a scan of the system state.

  1. Computational performance of a smoothed particle hydrodynamics simulation for shared-memory parallel computing

    NASA Astrophysics Data System (ADS)

    Nishiura, Daisuke; Furuichi, Mikito; Sakaguchi, Hide

    2015-09-01

    The computational performance of a smoothed particle hydrodynamics (SPH) simulation is investigated for three types of current shared-memory parallel computer devices: many integrated core (MIC) processors, graphics processing units (GPUs), and multi-core CPUs. We are especially interested in efficient shared-memory allocation methods for each chipset, because the efficient data access patterns differ between compute unified device architecture (CUDA) programming for GPUs and OpenMP programming for MIC processors and multi-core CPUs. We first introduce several parallel implementation techniques for the SPH code, and then examine these on our target computer architectures to determine the most effective algorithms for each processor unit. In addition, we evaluate the effective computing performance and power efficiency of the SPH simulation on each architecture, as these are critical metrics for overall performance in a multi-device environment. In our benchmark test, the GPU is found to produce the best arithmetic performance as a standalone device unit, and gives the most efficient power consumption. The multi-core CPU obtains the most effective computing performance. The computational speed of the MIC processor on Xeon Phi approached that of two Xeon CPUs. This indicates that using MICs is an attractive choice for existing SPH codes on multi-core CPUs parallelized by OpenMP, as it gains computational acceleration without the need for significant changes to the source code.

  2. An experimental distributed microprocessor implementation with a shared memory communications and control medium

    NASA Technical Reports Server (NTRS)

    Mejzak, R. S.

    1980-01-01

    The distributed processing concept is defined in terms of control primitives, variables, and structures and their use in performing a decomposed discrete Fourier transform (DFT) application function. The design assumes interprocessor communications to be anonymous. In this scheme, all processors can access an entire common database by employing control primitives. Access to selected areas within the common database is random, enforced by a hardware lock, and determined by task and subtask pointers. This enables the number of processors to be varied in the configuration without any modifications to the control structure. Decompositional elements of the DFT application function in terms of tasks and subtasks are also described. The experimental hardware configuration consists of IMSAI 8080 chassis, which are independent 8-bit microcomputer units. These chassis are linked together to form a multiple processing system by means of a shared memory facility. This facility consists of hardware which provides a bus structure to enable up to six microcomputers to be interconnected. It provides polling and arbitration logic so that only one processor has access to shared memory at any one time.
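
    A C11 sketch of the access discipline described: a single lock, standing in for the hardware lock, guards the common database, and a task pointer hands out DFT subtasks to whichever processor acquires it. All names are illustrative.

    ```c
    /* Processors spin on a lock for the shared-memory facility, then use
     * the task pointer in the common database to claim the next subtask.
     * atomic_flag plays the role of the original hardware lock. */
    #include <stdatomic.h>
    #include <stdio.h>

    #define NTASKS 16

    static atomic_flag mem_lock = ATOMIC_FLAG_INIT; /* "hardware" lock */
    static int next_task = 0;                       /* task pointer */

    int claim_next_task(void) {
        while (atomic_flag_test_and_set(&mem_lock)) /* spin: bus arbitration */
            ;
        int t = (next_task < NTASKS) ? next_task++ : -1;
        atomic_flag_clear(&mem_lock);               /* release shared memory */
        return t;                                   /* -1: nothing left */
    }

    int main(void) {
        int t;
        while ((t = claim_next_task()) >= 0)
            printf("processor would compute DFT subtask %d\n", t);
        return 0;
    }
    ```

    Because processors interact only through the lock and the shared pointers, adding or removing a processor needs no change to this control structure, which is the property the abstract emphasizes.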

  3. An implementation of SISAL for distributed-memory architectures

    SciTech Connect

    Beard, P.C.

    1995-06-01

    This thesis describes a new implementation of the implicitly parallel functional programming language SISAL, for massively parallel processor supercomputers. The Optimizing SISAL Compiler (OSC), developed at Lawrence Livermore National Laboratory, was originally designed for shared-memory multiprocessor machines and has been adapted to distributed-memory architectures. OSC has been relatively portable between shared-memory architectures, because they are architecturally similar, and OSC generates portable C code. However, distributed-memory architectures are not standardized -- each has a different programming model. Distributed-memory SISAL depends on a layer of software that provides a portable, distributed, shared-memory abstraction. This layer is provided by Split-C, a dialect of the C programming language developed at U.C. Berkeley, which has demonstrated good performance on distributed-memory architectures. Split-C provides important capabilities for good performance: support for program-specific distributed data structures, and split-phase memory operations. Distributed data structures help achieve good memory locality, while split-phase memory operations help tolerate the longer communication latencies inherent in distributed-memory architectures. The distributed-memory SISAL compiler and run-time system take advantage of these capabilities. The result of these efforts is a compiler that runs identically on the Thinking Machines Connection Machine (CM-5) and the Meiko Computing Surface (CS-2).

  4. Parallel Reduction of Large Radar Interferometry Scenes on a Mid-scale, Symmetric Multiprocessor Mainframe Computer

    NASA Astrophysics Data System (ADS)

    Harcke, L. J.; Zebker, H. A.

    2006-12-01

    We report on experiences in processing repeat-orbit interferometry data sets on a mid-scale multiprocessor mainframe computer. Newer applications of interferometric and polarimetric data processing, such as permanent scatterer deformation monitoring, require the generation of many tens of repeat-pass interferometry data pairs, perhaps 30 to 50, to provide sufficient input to the deformation model. Moving existing radar processing techniques toward massively parallel computation provides a path to coping with such large data sets, which can consist of 30 to 50 gigabytes (GB) of raw data. In June 2006, the Stanford School of Earth Sciences dedicated a new computation center for general research use. Two large machines compose the center: a single-node, symmetric multiprocessor (SMP) machine with 48 processor cores and a single 192 GB memory, and a 64 node distributed cluster containing 128 processor cores with at least 2 GB of memory per node. Distributed processing of the matched filter for synthetic aperture radar image formation requires a high communication-to-computation ratio. Experiments performed over a decade ago on distributed memory supercomputers, and repeated a half-decade ago on commodity workstation clusters, both demonstrated saturation of inter-node communication links. For this reason, we chose to parallelize the interferometric processor on the shared memory computer using the OpenMP programming standard. We find, not unexpectedly, that the input/output stage of processing standard 100-by-100 kilometer ERS-1 scenes quickly dominates the total computation time, and that only modest increases in processing time are achieved after 8 to 16 processor cores are brought to bear on a single data set. The input and output data sit in single, serially accessed disk files, creating a bottleneck for overall throughput. This points to a scheme for efficient partitioning of mid-size (24 to 48 core) machines for reducing large Earth science data sets, where 3 to

  5. Multiprocessor computer overset grid method and apparatus

    DOEpatents

    Barnette, Daniel W.; Ober, Curtis C.

    2003-01-01

    A multiprocessor computer overset grid method and apparatus comprises associating points in each overset grid with processors and using mapped interpolation transformations to communicate intermediate values between processors assigned base and target points of the interpolation transformations. The method allows a multiprocessor computer to operate with effective load balance on overset grid applications.

  6. Coscheduling Technique for Symmetric Multiprocessor Clusters

    SciTech Connect

    Yoo, A B; Jette, M A

    2000-09-18

    Coscheduling is essential for obtaining good performance in a time-shared symmetric multiprocessor (SMP) cluster environment. However, the most common technique, gang scheduling, has limitations such as poor scalability and vulnerability to faults, mainly due to explicit synchronization between its components. A decentralized approach called dynamic coscheduling (DCS) has been shown to be effective for networks of workstations (NOW), but this technique is not suitable for the workloads on a very large SMP-cluster with thousands of processors. Furthermore, its implementation can be prohibitively expensive for such a large-scale machine. In this paper, they propose a novel coscheduling technique based on the DCS approach which can achieve coscheduling on very large SMP-clusters in a scalable, efficient, and cost-effective way. In the proposed technique, each local scheduler achieves coscheduling based upon message traffic between the components of parallel jobs. Message trapping is carried out at the user-level, eliminating the need for unsupported hardware or device-level programming. A sending process attaches its status to outgoing messages so local schedulers on remote nodes can make more intelligent scheduling decisions. Once scheduled, processes are guaranteed some minimum period of time to execute. This provides an opportunity to synchronize the parallel job's components across all nodes and achieve good program performance. The results from a performance study reveal that the proposed technique is a promising approach that can reduce response time significantly over uncoordinated time-sharing and batch scheduling.

  7. Real-time topological image smoothing on shared memory parallel machines

    NASA Astrophysics Data System (ADS)

    Mahmoudi, Ramzi; Akil, Mohamed

    2011-03-01

    Smoothing filters are the method of choice for image preprocessing and pattern recognition. We present a new concurrent method for smoothing 2D objects in the binary case. The proposed method provides parallel computation while preserving topology through homotopic transformations. We introduce an adapted parallelization strategy called the split, distribute and merge (SDM) strategy, which allows efficient parallelization of a large class of topological operators including, mainly, smoothing, skeletonization, and watershed algorithms. To achieve a good speedup, we paid particular attention to task scheduling; work during the smoothing process is distributed across a variable number of threads. Tests on a 2D binary image (512*512), using a shared memory parallel machine (SMPM) with 8 CPU cores (2× Xeon E5405 running at 2 GHz), showed a speedup of 5.2, i.e., a rate of 32 images per second.
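
    The parallel skeleton of a split, distribute and merge pass, assuming OpenMP and a plain 3x3 box filter; the homotopic tests that make the real operator topology-preserving are omitted.

    ```c
    /* SDM shape: rows are split among threads (split + distribute); the
     * merge is the shared output array, to which threads write disjoint
     * rows. The smoothing kernel here is a simple 3x3 box filter. */
    #include <omp.h>
    #include <stdio.h>

    #define W 512
    #define H 512

    static unsigned char in[H][W], out[H][W];

    void smooth_sdm(void) {
        #pragma omp parallel for schedule(static)
        for (int y = 1; y < H - 1; y++)
            for (int x = 1; x < W - 1; x++) {
                int s = 0;
                for (int dy = -1; dy <= 1; dy++)
                    for (int dx = -1; dx <= 1; dx++)
                        s += in[y + dy][x + dx];
                out[y][x] = (unsigned char)(s / 9);  /* disjoint writes */
            }
    }

    int main(void) {
        for (int y = 0; y < H; y++)
            for (int x = 0; x < W; x++)
                in[y][x] = (unsigned char)((x ^ y) & 0xFF);  /* test pattern */
        smooth_sdm();
        printf("out[256][256] = %d\n", out[256][256]);
        return 0;
    }
    ```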

  8. Exploring the use of I/O nodes for computation in a MIMD multiprocessor

    NASA Technical Reports Server (NTRS)

    Kotz, David; Cai, Ting

    1995-01-01

    As parallel systems move into the production scientific-computing world, the emphasis will be on cost-effective solutions that provide high throughput for a mix of applications. Cost effective solutions demand that a system make effective use of all of its resources. Many MIMD multiprocessors today, however, distinguish between 'compute' and 'I/O' nodes, the latter having attached disks and being dedicated to running the file-system server. This static division of responsibilities simplifies system management but does not necessarily lead to the best performance in workloads that need a different balance of computation and I/O. Of course, computational processes sharing a node with a file-system service may receive less CPU time, network bandwidth, and memory bandwidth than they would on a computation-only node. In this paper we begin to examine this issue experimentally. We found that high performance I/O does not necessarily require substantial CPU time, leaving plenty of time for application computation. There were some complex file-system requests, however, which left little CPU time available to the application. (The impact on network and memory bandwidth still needs to be determined.) For applications (or users) that cannot tolerate an occasional interruption, we recommend that they continue to use only compute nodes. For tolerant applications needing more cycles than those provided by the compute nodes, we recommend that they take full advantage of both compute and I/O nodes for computation, and that operating systems should make this possible.

  9. Improved multiprocessor garbage collection algorithms

    SciTech Connect

    Newman, I.A.; Stallard, R.P.; Woodward, M.C.

    1983-01-01

    Outlines the results of an investigation of existing multiprocessor garbage collection algorithms and introduces two new algorithms which significantly improve some aspects of the performance of their predecessors. The two algorithms arise from different starting assumptions. One considers the case where the algorithm will terminate successfully whatever list structure is being processed and assumes that the extra data space should be minimised. The other seeks a very fast garbage collection time for list structures that do not contain loops. Results of both theoretical and experimental investigations are given to demonstrate the efficacy of the algorithms. 7 references.

  10. Simulating an aerospace multiprocessor. [for space guidance computers

    NASA Technical Reports Server (NTRS)

    Mallach, E. G.

    1976-01-01

    The paper describes a simulator which was used to evaluate the architecture of an aerospace multiprocessor. The simulator models interactions among the processors, memories, the central data bus, and a possible 'job stack'. Special features of the simulator are discussed, including the use of explicitly coded and individually distinguishable 'job models' instead of a statistically defined 'job mix' and a specialized Job Model Definition Language to automate the detailed coding of the models. Some results are presented which show that when the simulator was employed in conjunction with queuing theory and Markov-process analysis, more insight into system behavior was obtained than would have been with any one technique alone.

  11. Multiprocessor system with multiple concurrent modes of execution

    SciTech Connect

    Ahn, Daniel; Ceze, Luis H; Chen, Dong; Gara, Alan; Heidelberger, Philip; Ohmacht, Martin

    2013-12-31

    A multiprocessor system supports multiple concurrent modes of speculative execution. Speculation identification numbers (IDs) are allocated to speculative threads from a pool of available numbers. The pool is divided into domains, with each domain being assigned to a mode of speculation. Modes of speculation include TM, TLS, and rollback. Allocation of the IDs is carried out with respect to a central state table and using hardware pointers. The IDs are used for writing different versions of speculative results in different ways of a set in a cache memory.

  12. Aho-Corasick String Matching on Shared and Distributed Memory Parallel Architectures

    SciTech Connect

    Tumeo, Antonino; Villa, Oreste; Chavarría-Miranda, Daniel

    2012-03-01

    String matching is at the core of many critical applications, including network intrusion detection systems, search engines, virus scanners, spam filters, DNA and protein sequencing, and data mining. For all of these applications string matching requires a combination of (sometimes all) the following characteristics: high and/or predictable performance, support for large data sets and flexibility of integration and customization. Many software based implementations targeting conventional cache-based microprocessors fail to achieve high and predictable performance requirements, while Field-Programmable Gate Array (FPGA) implementations and dedicated hardware solutions fail to support large data sets (dictionary sizes) and are difficult to integrate and customize. The advent of multicore, multithreaded, and GPU-based systems is opening the possibility for software based solutions to reach very high performance at a sustained rate. This paper compares several software-based implementations of the Aho-Corasick string searching algorithm for high performance systems. We discuss the implementation of the algorithm on several types of shared-memory high-performance architectures (Niagara 2, large x86 SMPs and Cray XMT), distributed memory with homogeneous processing elements (InfiniBand cluster of x86 multicores) and heterogeneous processing elements (InfiniBand cluster of x86 multicores with NVIDIA Tesla C10 GPUs). We describe in detail how each solution achieves the objectives of supporting large dictionaries, sustaining high performance, and enabling customization and flexibility using various data sets.
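
    Common to all these implementations is the data-parallel decomposition: the text is split into per-thread chunks that overlap by (pattern length - 1) bytes so that matches straddling a chunk boundary are not lost. The sketch below uses a naive single-pattern matcher in place of the Aho-Corasick automaton.

    ```c
    /* Each chunk owns the match *start* positions in [lo, hi); reads may
     * extend up to m-1 bytes past hi (the overlap), so every match is
     * found exactly once. A real implementation runs an Aho-Corasick
     * automaton over each chunk instead of memcmp. */
    #include <omp.h>
    #include <stdio.h>
    #include <string.h>

    long count_matches(const char *text, long n, const char *pat) {
        long m = (long)strlen(pat), total = 0;
        int nc = omp_get_max_threads();
        #pragma omp parallel for reduction(+:total)
        for (int c = 0; c < nc; c++) {
            long lo = n * (long)c / nc;
            long hi = n * (long)(c + 1) / nc;   /* exclusive end of starts */
            for (long i = lo; i < hi && i + m <= n; i++)
                if (memcmp(text + i, pat, (size_t)m) == 0)
                    total++;
        }
        return total;
    }

    int main(void) {
        const char *t = "abracadabra abracadabra";
        printf("%ld matches\n", count_matches(t, (long)strlen(t), "abra"));
        return 0;
    }
    ```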

  13. Pushing away the communication bottleneck with optical interconnects in symmetric multiprocessors

    NASA Astrophysics Data System (ADS)

    Hlayhel, Wissam; Collet, Jacques H.; Rochange, Christine; Litaize, Daniel

    2000-05-01

    We analyze the bandwidth needed for transmitting addresses in future symmetric multiprocessor machines (SMPs) constructed around a shared bus, owing to the critical obligation to preserve the coherence of the memory hierarchy. We show that an address-transaction bandwidth as high as several hundred Gbit/s will be necessary to avoid slowing down the execution of most applications in large SMPs. This communication bandwidth seems incompatible with the operating constraints of shared electrical busses, making it necessary to search for other implementations of the address transmission network. We consider the introduction of optical interconnects (OI) in this context. We review several solutions in ascending order of complexity of the optical subsystems, since one critical issue concerns the degree of sophistication of the optical solutions and their cost. We first consider simple point-to-point OIs for an SMP chipset. The interest in OIs comes from their low energy consumption and from the possibility, in the future, of integrating several thousand optical inputs/outputs per electronic chip. Then we consider the implementation of an optical bus, that is, a multipoint optical line involving more optical functionality. We discuss the possibility of multiple accesses to the bus, and the constraints related to the necessity of maintaining cache coherence.

  14. Performance and Application of Parallel OVERFLOW Codes on Distributed and Shared Memory Platforms

    NASA Technical Reports Server (NTRS)

    Djomehri, M. Jahed; Rizk, Yehia M.

    1999-01-01

    The presentation discusses recent studies on the performance of the two parallel versions of the aerodynamics CFD code, OVERFLOW_MPI and _MLP. Developed at NASA Ames, the serial version, OVERFLOW, is a multidimensional Navier-Stokes flow solver based on overset (Chimera) grid technology. The code has recently been parallelized in two ways. One is based on the explicit message-passing interface (MPI) across processors and uses the _MPI communication package. This approach is primarily suited for distributed memory systems and workstation clusters. The second, termed the multi-level parallel (MLP) method, is simple and uses shared memory for all communications. The _MLP code is suitable for distributed-shared memory systems. For both methods, the message passing takes place across the processors or processes at the advancement of each time step. This procedure is, in effect, the Chimera boundary conditions update, which is done in an explicit "Jacobi" style. In contrast, the update in the serial code is done in more of a "Gauss-Seidel" fashion. The programming effort for the _MPI code is greater than for the _MLP code; the former requires modification of the outer and some inner shells of the serial code, whereas the latter focuses only on the outer shell of the code. The _MPI version offers a great deal of flexibility in distributing grid zones across a specified number of processors in order to achieve load balancing. The approach is capable of partitioning zones across multiple processors or sending each zone and/or cluster of several zones into a single processor. The message passing across the processors consists of Chimera boundary and/or an overlap of "halo" boundary points for each partitioned zone. The MLP version is a new coarse-grain parallel concept at the zonal and intra-zonal levels. A grouping strategy is used to distribute zones into several groups forming sub-processes which will run in parallel. The total volume of grid points in each

  15. Multiprocessor performance modeling with ADAS

    NASA Technical Reports Server (NTRS)

    Hayes, Paul J.; Andrews, Asa M.

    1989-01-01

    A graph managing strategy referred to as the Algorithm to Architecture Mapping Model (ATAMM) appears useful for the time-optimized execution of application algorithm graphs in embedded multiprocessors and for the performance prediction of graph designs. This paper reports the modeling of ATAMM in the Architecture Design and Assessment System (ADAS) to make an independent verification of ATAMM's performance prediction capability and to provide a user framework for the evaluation of arbitrary algorithm graphs. Following an overview of ATAMM and its major functional rules are descriptions of the ADAS model of ATAMM, methods to enter an arbitrary graph into the model, and techniques to analyze the simulation results. The performance of a 7-node graph example is evaluated using the ADAS model and verifies the ATAMM concept by substantiating previously published performance results.

  16. Multiprocessor Neural Network in Healthcare.

    PubMed

    Godó, Zoltán Attila; Kiss, Gábor; Kocsis, Dénes

    2015-01-01

    A possible way of creating a multiprocessor artificial neural network is by the use of microcontrollers. The RISC processors' high performance and the large number of I/O ports make them well suited for creating such a system. During our research, we wanted to see if it is possible to efficiently create interaction between the artificial neural network and the natural nervous system. To achieve as much analogy to the living nervous system as possible, we created a frequency-modulated analog connection between the units. Our system is connected to the living nervous system through 128 microelectrodes. Two-way communication is provided through A/D transformation, which is even capable of testing psychopharmacons. The microcontroller-based analog artificial neural network can play a great role in medical signal processing, such as ECG, EEG, etc. PMID:26152990

  17. Improvement of multiprocessing performance by using optical centralized shared bus

    NASA Astrophysics Data System (ADS)

    Han, Xuliang; Chen, Ray T.

    2004-06-01

    With the ever-increasing need to solve larger and more complex problems, multiprocessing is attracting more and more research efforts. One of the challenges facing multiprocessor designers is to fulfill, in an effective manner, the communications among the processes running in parallel on multiple processors. The conventional electrical backplane bus provides narrow bandwidth as restricted by the physical limitations of electrical interconnects. In the electrical domain, in order to operate at high frequency, the backplane topology has been changed from the simple shared bus to the complicated switched medium. However, the switched medium is an indirect network. It cannot support multicast/broadcast as effectively as the shared bus. Besides the additional latency of going through the intermediate switching nodes, signal routing introduces substantial delay and considerable system complexity. Alternatively, optics has been well known for its interconnect capability. Therefore, it has become imperative to investigate how to improve multiprocessing performance by utilizing optical interconnects. From the implementation standpoint, the existing optical technologies still cannot fulfill the intelligent functions that a switch fabric should provide as effectively as their electronic counterparts. Thus, an innovative optical technology that can provide sufficient bandwidth capacity, while at the same time retaining the essential merits of the shared bus topology, is highly desirable for the multiprocessing performance improvement. In this paper, the optical centralized shared bus is proposed for use in the multiprocessing systems. This novel optical interconnect architecture not only utilizes the beneficial characteristics of optics, but also retains the desirable properties of the shared bus topology. Meanwhile, from the architecture standpoint, it fits well in the centralized shared-memory multiprocessing scheme. Therefore, a smooth migration with substantial

  18. ATAMM enhancement and multiprocessor performance evaluation

    NASA Technical Reports Server (NTRS)

    Stoughton, John W.; Mielke, Roland R.; Som, Sukhamoy; Obando, Rodrigo; Malekpour, Mahyar R.; Jones, Robert L., III; Mandala, Brij Mohan V.

    1991-01-01

    ATAMM (Algorithm To Architecture Mapping Model) enhancement and multiprocessor performance evaluation is discussed. The following topics are included: the ATAMM model; ATAMM enhancement; ADM (Advanced Development Model) implementation of ATAMM; and ATAMM support tools.

  19. Parallel computational steering for HPC applications using HDF5 files in distributed shared memory.

    PubMed

    Biddiscombe, John; Soumagne, Jerome; Oger, Guillaume; Guibert, David; Piccinali, Jean-Guillaume

    2012-06-01

    Interfacing a GUI-driven visualization/analysis package to an HPC application enables a supercomputer to be used as an interactive instrument. We achieve this by replacing the IO layer in the HDF5 library with a custom driver which transfers data in parallel between simulation and analysis. Our implementation, using ParaView as the interface, allows a flexible combination of parallel simulation, concurrent parallel analysis, and GUI client, either on the same or separate machines. Each MPI job may use different core counts or hardware configurations, allowing fine tuning of the amount of resources dedicated to each part of the workload. By making use of a distributed shared memory file, one may read data from the simulation, modify it using ParaView pipelines, and write it back to be reused by the simulation (or vice versa). This allows not only simple parameter changes, but complete remeshing of grids, or operations involving regeneration of field values over the entire domain. To avoid the problem of manually customizing the GUI for each application that is to be steered, we make use of XML templates that describe outputs from the simulation (and inputs back to it) to automatically generate GUI controls for manipulation of the simulation. PMID:22350196

  20. Shared Etiology of Phonological Memory and Vocabulary Deficits in School-Age Children

    PubMed Central

    Peterson, Robin L.; Pennington, Bruce F.; Samuelsson, Stefan; Byrne, Brian; Olson, Richard K.

    2012-01-01

    Purpose The goal of this study was to investigate the etiologic basis for the association between deficits in phonological memory (PM) and vocabulary in school-age children. Method Children with deficits in PM or vocabulary were identified within the International Longitudinal Twin Study (ILTS). The ILTS includes 1,045 twin pairs from the United States, Australia, and Scandinavia aged 5 to 8 years. We applied the DeFries-Fulker regression method to determine whether problems in PM and vocabulary tend to co-occur because of overlapping genes, overlapping environmental risk factors, or both. Results Among children with isolated PM deficits, we found significant bivariate heritability of PM and vocabulary weaknesses both within and across time. However, when probands were selected for a vocabulary deficit, there was no evidence for bivariate heritability. In this case, the PM-vocabulary relationship appeared to be due to shared environmental experiences. Conclusions The findings are consistent with previous research on the heritability of specific language impairment and suggest that there are etiologic subgroups of children with poor vocabulary for different reasons, one more influenced by genes and another more influenced by environment. PMID:23275423

  1. MLP: A Parallel Programming Alternative to MPI for New Shared Memory Parallel Systems

    NASA Technical Reports Server (NTRS)

    Taft, James R.

    1999-01-01

    Recent developments at the NASA Ames Research Center's NAS Division have demonstrated that the new generation of NUMA-based Symmetric Multi-Processing systems (SMPs), such as the Silicon Graphics Origin 2000, can successfully execute legacy vector-oriented CFD production codes at sustained rates far exceeding processing rates possible on dedicated 16-CPU Cray C90 systems. This high level of performance is achieved via shared-memory-based Multi-Level Parallelism (MLP). This programming approach, developed at NAS and outlined below, is distinct from the message passing paradigm of MPI. It offers parallelism at both the fine- and coarse-grained levels, with communication latencies that are approximately 50-100 times lower than typical MPI implementations on the same platform. Such latency reductions offer the promise of performance scaling to very large CPU counts. The method draws on, but is also distinct from, the newly defined OpenMP specification, which uses compiler directives to support a limited subset of multi-level parallel operations. The NAS MLP method is general, and applicable to a large class of NASA CFD codes.
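
    The coarse-grained layer of the MLP idea can be pictured with plain POSIX primitives: worker processes forked from a single parent share one memory arena, so communication is an ordinary store rather than a message. The sketch below is our illustration only (the production MLP codes are Fortran and add fine-grained loop-level parallelism inside each process):

        #include <cstdio>
        #include <sys/mman.h>
        #include <sys/wait.h>
        #include <unistd.h>

        int main() {
            const int N = 16, NPROC = 4;
            // One anonymous shared mapping, visible to every forked worker.
            double* a = (double*)mmap(nullptr, N * sizeof(double),
                                      PROT_READ | PROT_WRITE,
                                      MAP_SHARED | MAP_ANONYMOUS, -1, 0);
            if ((void*)a == MAP_FAILED) return 1;
            for (int p = 0; p < NPROC; ++p) {
                if (fork() == 0) {                 // one worker process per "group"
                    for (int i = p; i < N; i += NPROC)
                        a[i] = 100.0 * p + i;      // communication is a plain store
                    _exit(0);
                }
            }
            while (wait(nullptr) > 0) {}           // crude barrier: reap all workers
            for (int i = 0; i < N; ++i) printf("a[%2d] = %6.1f\n", i, a[i]);
            munmap(a, N * sizeof(double));
        }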

  2. Testing and operating a multiprocessor chip with processor redundancy

    DOEpatents

    Bellofatto, Ralph E; Douskey, Steven M; Haring, Rudolf A; McManus, Moyra K; Ohmacht, Martin; Schmunkamp, Dietmar; Sugavanam, Krishnan; Weatherford, Bryan J

    2014-10-21

    A system and method for improving the yield rate of a multiprocessor semiconductor chip that includes primary processor cores and one or more redundant processor cores. A first tester conducts a first test on one or more processor cores, and encodes results of the first test in an on-chip non-volatile memory. A second tester conducts a second test on the processor cores, and encodes results of the second test in an external non-volatile storage device. An override bit of a multiplexer is set if a processor core fails the second test. In response to the override bit, the multiplexer selects a physical-to-logical mapping of processor IDs according to one of: the encoded results in the memory device or the encoded results in the external storage device. On-chip logic configures the processor cores according to the selected physical-to-logical mapping.

  3. Multiprocessor architecture to handle TJ-II VXI-based digitization channels

    NASA Astrophysics Data System (ADS)

    Crémy, C.; Vega, J.; Sánchez, E.; Dulya, C. M.; Portas, A.

    1999-01-01

    The data acquisition system (DAS) of the TJ-II stellarator provides up to 300 digitization channels integrated in register-based VXI modules designed in CIEMAT Laboratories. The modules are embedded into six 13-slot VXI chassis connected to the TJ-II DAS central computer by means of a dual LAN topology. During normal operation, remote control of the VXI systems and channel setup are accomplished through an Ethernet LAN, while two FDDI rings are dedicated to postdischarge fast data transfer. The former network link is performed by the bus controller whereas the latter one is provided through an FDDI node controller installed in the mainframe, thus creating a multiprocessor architecture. Dedicated software, running on the VxWorks operating system, has been developed to provide handling of the VXI systems including the following facilities: mainframe information readout, channel setup, real time digitization handling, and data transfer. This software, implemented in C++, is distributed over the two CPUs. Interprocessor communication for synchronization purposes is based on a backplane shared memory pool.

  4. Thread mapping using system-level model for shared memory multicores

    NASA Astrophysics Data System (ADS)

    Mitra, Reshmi

    Exploring thread-to-core mapping options for a parallel application on a multicore architecture is computationally very expensive. For the same algorithm, the mapping strategy (MS) with the best response time may change with data size and thread counts. The primary challenge is to design a fast, accurate, and automatic framework for exploring these MSs for large data-intensive applications. This is to ensure that the users can explore the design space within reasonable machine hours, without a thorough understanding of how the code interacts with the platform. Response time is related to the cycles per instruction retired (CPI), taking into account both active and sleep states of the pipeline. This work establishes a hybrid approach, based on a Markov Chain Model (MCM) and a Model Tree (MT), for system-level steady-state CPI prediction. It is designed for shared memory multicore processors with coarse-grained multithreading. The thread status is represented by the MCM states. The program characteristics are modeled as the transition probabilities, representing the system moving between active and suspended thread states. The MT model extrapolates these probabilities for the actual application size (AS) from the smaller AS performance. This aspect of the framework, along with the use of mathematical expressions for the actual AS performance information, results in a tremendous reduction in the CPI prediction time. The framework is validated using an electromagnetics application. The average performance prediction error for steady-state CPI results with 12 different MSs is less than 1%. The total run time of the model is on the order of minutes, whereas the actual application execution time is on the order of days.
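
    The steady-state piece of such a model is small enough to show. For a two-state chain (thread active or suspended) with per-step transition probabilities p and q, the stationary probability of the active state has a closed form, and a base CPI can be scaled by it. The parameter values and the final scaling below are our illustrative assumptions, not the paper's exact formulation:

        #include <cstdio>

        int main() {
            // Hypothetical per-step transition probabilities (not from the paper):
            double p = 0.10;   // P(active -> suspended), e.g. a long-latency stall
            double q = 0.30;   // P(suspended -> active), e.g. the stall resolving
            double pi_active = q / (p + q);             // stationary P(thread active)
            double cpi_active = 1.5;                    // CPI while actually executing
            double cpi_steady = cpi_active / pi_active; // stalls inflate effective CPI
            printf("P(active) = %.3f, effective CPI = %.3f\n", pi_active, cpi_steady);
        }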

  5. Parallelization of a multiregion flow and transport code using software emulated global shared memory and high performance FORTRAN

    SciTech Connect

    D'Azevedo, E.F.; Gwo, Jin-Ping

    1997-02-01

    The objectives of this research are (1) to parallelize a suite of multiregion groundwater flow and solute transport codes that use Galerkin and Lagrangian-Eulerian finite element methods, (2) to test the compatibility of global shared memory emulation software with a High Performance FORTRAN (HPF) compiler, and (3) to obtain performance characteristics and scalability of the parallel codes. The suite of multiregion flow and transport codes, 3DMURF and 3DMURT, were parallelized using the DOLIB shared memory emulation, in conjunction with the PGI HPF compiler, to run on the Intel Paragons at the Oak Ridge National Laboratory (ORNL) and a network of workstations. The novelty of this effort is, first, the use of HPF and global shared memory emulation concurrently to facilitate the conversion of a serial code to a parallel code, and second, that the shared memory library enables efficient implementation of Lagrangian particle tracking along flow characteristics. The latter allows long-time-step-size simulation with particle tracking and dynamic particle redistribution for load balancing, thereby reducing the number of time steps needed for most transient problems. The parallel codes were applied to a pumping well problem to test the efficiency of the domain decomposition and particle tracking algorithms. The full problem domain consists of over 200,000 degrees of freedom with highly nonlinear soil property functions. Relatively good scalability was obtained for a preliminary test run on the Intel Paragons at the Center for Computational Sciences (CCS), ORNL. However, due to the difficulties we encountered in the PGI HPF compiler, as of the writing of this manuscript we are able to report results from 3DMURF only.

  6. Linear solvers on multiprocessor machines

    SciTech Connect

    Kalogerakis, M.A.

    1986-01-01

    Two new methods are introduced for the parallel solution of banded linear systems on multiprocessor machines. Moreover, some new techniques are obtained as variations of the two methods that are applicable to special instances of the problem. Comparisons with the best known methods are performed, from which it is concluded that the two methods are superior, while their variations for special instances are, in general, competitive and in some cases best. In the process, some new results on the parallel prefix problem are obtained and a new design for this problem is presented that is suitable for VLSI implementation. Furthermore, a general model is introduced for the analysis and classification of methods that are based on row transformations of matrices. It is seen that most known methods are included in this model. It is demonstrated that this model may be used as a basis for the analysis as well as the generation of important aspects of those methods, such as their arithmetic complexity and interprocessor communication requirements.
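
    The parallel prefix primitive mentioned above admits a compact shared-memory formulation: each thread scans its own block, the per-block totals are combined serially, and a final parallel pass adds each block's offset. A hedged OpenMP sketch (our own illustration, unrelated to the dissertation's VLSI design):

        #include <cstdio>
        #include <vector>
        #include <omp.h>

        // Block-wise inclusive prefix sum: local scan per thread, serial combine
        // of the per-block totals, then a parallel pass adding each block offset.
        void prefix_sum(std::vector<double>& x) {
            int nt = omp_get_max_threads();
            std::vector<double> block(nt + 1, 0.0);   // block[k] = sum of blocks < k
            size_t n = x.size();
            #pragma omp parallel num_threads(nt)
            {
                int t = omp_get_thread_num();
                size_t lo = n * t / nt, hi = n * (t + 1) / nt;
                double s = 0.0;
                for (size_t i = lo; i < hi; ++i) { s += x[i]; x[i] = s; } // local scan
                block[t + 1] = s;
                #pragma omp barrier
                #pragma omp single
                for (int k = 1; k <= nt; ++k) block[k] += block[k - 1];   // combine
                for (size_t i = lo; i < hi; ++i) x[i] += block[t];        // offsets
            }
        }

        int main() {
            std::vector<double> x(10, 1.0);
            prefix_sum(x);
            for (double v : x) printf("%.0f ", v);    // prints 1 2 3 ... 10
            printf("\n");
        }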

  7. Iterative algorithms for tridiagonal matrices on a WSI-multiprocessor

    NASA Astrophysics Data System (ADS)

    Gajski, D. D.; Sameh, A. H.; Wisniewski, J. A.

    With the rapid advances in semiconductor technology, the construction of Wafer Scale Integration (WSI)-multiprocessors consisting of a large number of processors is now feasible. The implementation of some basic linear algebra algorithms on such multiprocessors is illustrated.
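
    For a flavor of such iterative schemes, the sketch below runs Jacobi sweeps on a diagonally dominant tridiagonal system; every row update within a sweep is independent, which is what makes the method attractive for arrays of many simple processing elements. The coefficients are made up for illustration:

        #include <cstdio>
        #include <vector>

        // Jacobi sweeps on a diagonally dominant tridiagonal system. Each row
        // update within a sweep is independent, so rows (or bands of rows) can
        // each be assigned to their own processing element.
        int main() {
            const int n = 8;
            std::vector<double> sub(n, -1.0), diag(n, 4.0), sup(n, -1.0), rhs(n, 1.0);
            std::vector<double> x(n, 0.0), xn(n);
            for (int it = 0; it < 50; ++it) {
                for (int i = 0; i < n; ++i) {          // independent row updates
                    double s = rhs[i];
                    if (i > 0) s -= sub[i] * x[i - 1];
                    if (i < n - 1) s -= sup[i] * x[i + 1];
                    xn[i] = s / diag[i];
                }
                x.swap(xn);
            }
            for (int i = 0; i < n; ++i) printf("x[%d] = %.6f\n", i, x[i]);
        }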

  8. Multiprocessor smalltalk: Implementation, performance, and analysis

    SciTech Connect

    Pallas, J.I.

    1990-01-01

    Multiprocessor Smalltalk demonstrates the value of object-oriented programming on a multiprocessor. Its implementation and analysis shed light on three areas: concurrent programming in an object-oriented language without special extensions, implementation techniques for adapting to multiprocessors, and performance factors in the resulting system. Adding parallelism to Smalltalk code is easy, because programs already use control abstractions like iterators. Smalltalk's basic control and concurrency primitives (lambda expressions, processes, and semaphores) can be used to build parallel control abstractions, including parallel iterators, parallel objects, atomic objects, and futures. Language extensions for concurrency are not required. This implementation demonstrates that it is possible to build an efficient parallel object-oriented programming system and illustrates techniques for doing so. Three modification tools (serialization, replication, and reorganization) adapted the Berkeley Smalltalk interpreter to the Firefly multiprocessor. Multiprocessor Smalltalk's performance shows that the combination of multiprocessing and object-oriented programming can be effective: speedups (relative to the original serial version) exceed 2.0 for five processors on all the benchmarks; the median efficiency is 48%. Analysis shows both where performance is lost and how to improve and generalize the experimental results. Changes in the interpreter to support concurrency add at most 12% overhead; better access to per-process variables could eliminate much of that. Changes in the user code to express concurrency add as much as 70% overhead; this overhead could be reduced to 54% if blocks (lambda expressions) were reentrant. Performance is also lost when the program cannot keep all five processors busy.

  9. Multiprocessor switch with selective pairing

    SciTech Connect

    Gara, Alan; Gschwind, Michael K; Salapura, Valentina

    2014-03-11

    System, method and computer program product for a multiprocessing system to offer selective pairing of processor cores for increased processing reliability. A selective pairing facility is provided that selectively connects, i.e., pairs, multiple microprocessor or processor cores to provide one highly reliable thread (or thread group). Each pair of cores providing a highly reliable thread connects with system components such as a memory "nest" (or memory hierarchy), an optional system controller, an optional interrupt controller, and optional I/O or peripheral devices. The memory nest is attached to the selective pairing facility via a switch or a bus.

  10. A combined PLC and CPU approach to multiprocessor control

    SciTech Connect

    Harris, J.J.; Broesch, J.D.; Coon, R.M.

    1995-10-01

    A sophisticated multiprocessor control system has been developed for use in the E-Power Supply System Integrated Control (EPSSIC) on the DIII-D tokamak. EPSSIC provides control and interlocks for the ohmic heating coil power supply and its associated systems. Of particular interest is the architecture of this system: both a Programmable Logic Controller (PLC) and a Central Processor Unit (CPU) have been combined on a standard VME bus. The PLC and CPU input and output signals are routed through signal conditioning modules, which provide the necessary voltage and ground isolation. Additionally these modules adapt the signal levels to that of the VME I/O boards. One set of I/O signals is shared between the two processors. The resulting multiprocessor system provides a number of advantages: redundant operation for mission critical situations, flexible communications using conventional TCP/IP protocols, the simplicity of ladder logic programming for the majority of the control code, and an easily maintained and expandable non-proprietary system.

  11. Associative-memory representations emerge as shared spatial patterns of theta activity spanning the primate temporal cortex.

    PubMed

    Nakahara, Kiyoshi; Adachi, Ken; Kawasaki, Keisuke; Matsuo, Takeshi; Sawahata, Hirohito; Majima, Kei; Takeda, Masaki; Sugiyama, Sayaka; Nakata, Ryota; Iijima, Atsuhiko; Tanigawa, Hisashi; Suzuki, Takafumi; Kamitani, Yukiyasu; Hasegawa, Isao

    2016-01-01

    Highly localized neuronal spikes in primate temporal cortex can encode associative memory; however, whether memory formation involves area-wide reorganization of ensemble activity, which often accompanies rhythmicity, or just local microcircuit-level plasticity, remains elusive. Using high-density electrocorticography, we capture local-field potentials spanning the monkey temporal lobes, and show that the visual pair-association (PA) memory is encoded in spatial patterns of theta activity in areas TE, 36, and, partially, in the parahippocampal cortex, but not in the entorhinal cortex. The theta patterns elicited by learned paired associates are distinct between pairs, but similar within pairs. This pattern similarity, emerging through novel PA learning, allows a machine-learning decoder trained on theta patterns elicited by a particular visual item to correctly predict the identity of those elicited by its paired associate. Our results suggest that the formation and sharing of widespread cortical theta patterns via learning-induced reorganization are involved in the mechanisms of associative memory representation. PMID:27282247

  13. Numerical Linear Algebra On The CEDAR Multiprocessor

    NASA Astrophysics Data System (ADS)

    Meier, Ulrike; Sameh, Ahmed

    1988-01-01

    In this paper we describe in some detail the architectural features of the CEDAR multiprocessor. We also discuss strategies for the implementation of dense matrix computations, and present performance results on one cluster for a variety of linear system solvers and eigenvalue problem solvers, as well as algorithms for solving linear least squares problems.

  14. Spaceborne VHSIC multiprocessor system for AI applications

    NASA Technical Reports Server (NTRS)

    Lum, Henry, Jr.; Shrobe, Howard E.; Aspinall, John G.

    1988-01-01

    A multiprocessor system, under design for space-station applications, makes use of the latest generation symbolic processor and packaging technology. The result will be a compact, space-qualified system two to three orders of magnitude more powerful than present-day symbolic processing systems.

  15. Fault-tolerant interconnection networks for multiprocessor systems

    SciTech Connect

    Nassar, H.M.

    1989-01-01

    Interconnection networks represent the backbone of multiprocessor systems. A failure in the network, therefore, could seriously degrade the system performance. For this reason, fault tolerance has been regarded as a major consideration in interconnection network design. This thesis presents two novel techniques to provide fault tolerance capabilities to three major networks: the Baseline network, the Benes network, and the Clos network. First, the Simple Fault Tolerance Technique (SFT) is presented. The SFT technique is in fact the result of merging two widely known interconnection mechanisms: a normal interconnection network and a shared bus. This technique is most suitable for networks with small switches, such as the Baseline network and the Benes network. For the Clos network, whose switches may be too large for the SFT, another technique is developed to produce the Fault-Tolerant Clos (FTC) network. In the FTC, one switch is added to each stage. The two techniques are described and thoroughly analyzed.

  16. Multi-core and Many-core Shared-memory Parallel Raycasting Volume Rendering Optimization and Tuning

    SciTech Connect

    Howison, Mark

    2012-01-31

    Given the computing industry trend of increasing processing capacity by adding more cores to a chip, the focus of this work is tuning the performance of a staple visualization algorithm, raycasting volume rendering, for shared-memory parallelism on multi-core CPUs and many-core GPUs. Our approach is to vary tunable algorithmic settings, along with known algorithmic optimizations and two different memory layouts, and measure performance in terms of absolute runtime and L2 memory cache misses. Our results indicate there is a wide variation in runtime performance on all platforms, as much as 254% for the tunable parameters we test on multi-core CPUs and 265% on many-core GPUs, and the optimal configurations vary across platforms, often in a non-obvious way. For example, our results indicate the optimal configurations on the GPU occur at a crossover point between those that maintain good cache utilization and those that saturate computational throughput. This result is likely to be extremely difficult to predict with an empirical performance model for this particular algorithm because it has an unstructured memory access pattern that varies locally for individual rays and globally for the selected viewpoint. Our results also show that optimal parameters on modern architectures are markedly different from those in previous studies run on older architectures. And, given the dramatic performance variation across platforms for both optimal algorithm settings and performance results, there is a clear benefit for production visualization and analysis codes to adopt a strategy for performance optimization through auto-tuning. These benefits will likely become more pronounced in the future as the number of cores per chip and the cost of moving data through the memory hierarchy both increase.

  17. Shared and distinct contributions of rostrolateral prefrontal cortex to analogical reasoning and episodic memory retrieval.

    PubMed

    Westphal, Andrew J; Reggente, Nicco; Ito, Kaori L; Rissman, Jesse

    2016-03-01

    Rostrolateral prefrontal cortex (RLPFC) is widely appreciated to support higher cognitive functions, including analogical reasoning and episodic memory retrieval. However, these tasks have typically been studied in isolation, and thus it is unclear whether they involve common or distinct RLPFC mechanisms. Here, we introduce a novel functional magnetic resonance imaging (fMRI) task paradigm to compare brain activity during reasoning and memory tasks while holding bottom-up perceptual stimulation and response demands constant. Univariate analyses on fMRI data from twenty participants identified a large swath of left lateral prefrontal cortex, including RLPFC, that showed common engagement on reasoning trials with valid analogies and memory trials with accurately retrieved source details. Despite broadly overlapping recruitment, multi-voxel activity patterns within left RLPFC reliably differentiated these two trial types, highlighting the presence of at least partially distinct information processing modes. Functional connectivity analyses demonstrated that while left RLPFC showed consistent coupling with the fronto-parietal control network across tasks, its coupling with other cortical areas varied in a task-dependent manner. During the memory task, this region strengthened its connectivity with the default mode and memory retrieval networks, whereas during the reasoning task it coupled more strongly with a nearby left prefrontal region (BA 45) associated with semantic processing, as well as with a superior parietal region associated with visuospatial processing. Taken together, these data suggest a domain-general role for left RLPFC in monitoring and/or integrating task-relevant knowledge representations and showcase how its function cannot solely be attributed to episodic memory or analogical reasoning computations. Hum Brain Mapp 37:896-912, 2016. © 2015 Wiley Periodicals, Inc. PMID:26663572

  18. Shared Etiology of Phonological Memory and Vocabulary Deficits in School-Age Children

    ERIC Educational Resources Information Center

    Peterson, Robin L.; Pennington, Bruce F.; Samuelsson, Stefan; Byrne, Brian; Olson, Richard K.

    2013-01-01

    Purpose: The goal of this study was to investigate the etiologic basis for the association between deficits in phonological memory (PM) and vocabulary in school-age children. Method: Children with deficits in PM or vocabulary were identified within the International Longitudinal Twin Study (ILTS; Samuelsson et al., 2005). The ILTS includes 1,045…

  19. Fault detection, isolation and reconfiguration in FTMP Methods and experimental results. [fault tolerant multiprocessor

    NASA Technical Reports Server (NTRS)

    Lala, J. H.

    1983-01-01

    The Fault-Tolerant Multiprocessor (FTMP) is a highly reliable computer designed to meet a goal of 10 to the -10th failures per hour and built with the objective of flying an active-control transport aircraft. Fault detection, identification, and recovery software is described, and experimental results obtained by injecting faults at the pin level in the FTMP are presented. Over 21,000 faults were injected in the CPU, memory, bus interface circuits, and error detection, masking, and error reporting circuits of one LRU of the multiprocessor. Detection, isolation, and reconfiguration times were recorded for each fault, and the results were found to agree well with earlier assumptions made in reliability modeling.

  20. Satisfiability Test with Synchronous Simulated Annealing on the Fujitsu AP1000 Massively-Parallel Multiprocessor

    NASA Technical Reports Server (NTRS)

    Sohn, Andrew; Biswas, Rupak

    1996-01-01

    Solving the hard Satisfiability Problem is time consuming even for modest-sized problem instances. Solving the Random L-SAT Problem is especially difficult due to the ratio of clauses to variables. This report presents a parallel synchronous simulated annealing method for solving the Random L-SAT Problem on a large-scale distributed-memory multiprocessor. In particular, we use a parallel synchronous simulated annealing procedure, called Generalized Speculative Computation, which guarantees the same decision sequence as sequential simulated annealing. To demonstrate the performance of the parallel method, we have selected problem instances varying in size from 100-variables/425-clauses to 5000-variables/21,250-clauses. Experimental results on the AP1000 multiprocessor indicate that our approach can satisfy 99.9 percent of the clauses while giving almost a 70-fold speedup on 500 processors.
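
    The annealing loop underneath is straightforward to sketch: the energy is the number of unsatisfied clauses, and a random single-variable flip is accepted by the Metropolis criterion. The toy below is sequential and uses the smallest instance's 100-variable/425-clause ratio; the paper's contribution, Generalized Speculative Computation, parallelizes this loop while preserving the sequential decision sequence and is omitted here:

        #include <array>
        #include <cmath>
        #include <cstdio>
        #include <cstdlib>
        #include <vector>

        int main() {
            const int V = 100, C = 425;               // clause/variable ratio 4.25
            srand(42);
            std::vector<std::array<int, 3>> clauses(C);
            for (auto& cl : clauses)
                for (int k = 0; k < 3; ++k) {
                    int v = rand() % V + 1;
                    cl[k] = (rand() % 2) ? v : -v;    // signed literal
                }
            std::vector<int> x(V + 1, 0);
            for (int i = 1; i <= V; ++i) x[i] = rand() % 2;
            auto unsat = [&]() {                      // energy = unsatisfied clauses
                int u = 0;
                for (const auto& cl : clauses) {
                    bool sat = false;
                    for (int lit : cl)
                        sat = sat || (lit > 0 ? x[lit] == 1 : x[-lit] == 0);
                    u += sat ? 0 : 1;
                }
                return u;
            };
            int e = unsat();
            for (double T = 2.0; T > 0.01 && e > 0; T *= 0.95) {   // cooling schedule
                for (int step = 0; step < 500 && e > 0; ++step) {
                    int v = rand() % V + 1;
                    x[v] ^= 1;                        // propose flipping one variable
                    int en = unsat();
                    bool accept = en <= e ||
                                  std::exp((e - en) / T) > (double)rand() / RAND_MAX;
                    if (accept) e = en; else x[v] ^= 1;   // undo rejected flip
                }
            }
            printf("unsatisfied clauses: %d of %d\n", e, C);
        }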

  1. Neural substrates of shared attention as social memory: A hyperscanning functional magnetic resonance imaging study.

    PubMed

    Koike, Takahiko; Tanabe, Hiroki C; Okazaki, Shuntaro; Nakagawa, Eri; Sasaki, Akihiro T; Shimada, Koji; Sugawara, Sho K; Takahashi, Haruka K; Yoshihara, Kazufumi; Bosch-Bayard, Jorge; Sadato, Norihiro

    2016-01-15

    During a dyadic social interaction, two individuals can share visual attention through gaze, directed to each other (mutual gaze) or to a third person or an object (joint attention). Shared attention is fundamental to dyadic face-to-face interaction, but how attention is shared, retained, and neurally represented in a pair-specific manner has not been well studied. Here, we conducted a two-day hyperscanning functional magnetic resonance imaging study in which pairs of participants performed a real-time mutual gaze task followed by a joint attention task on the first day, and mutual gaze tasks several days later. The joint attention task enhanced eye-blink synchronization, which is believed to be a behavioral index of shared attention. When the same participant pairs underwent mutual gaze without joint attention on the second day, enhanced eye-blink synchronization persisted, and this was positively correlated with inter-individual neural synchronization within the right inferior frontal gyrus. Neural synchronization was also positively correlated with enhanced eye-blink synchronization during the previous joint attention task session. Consistent with the Hebbian association hypothesis, the right inferior frontal gyrus had been activated both by initiating and responding to joint attention. These results indicate that shared attention is represented and retained by pair-specific neural synchronization that cannot be reduced to the individual level. PMID:26514295

  2. LDRD final report : managing shared memory data distribution in hybrid HPC applications.

    SciTech Connect

    Merritt, Alexander M.; Pedretti, Kevin Thomas Tauke

    2010-09-01

    MPI is the dominant programming model for distributed memory parallel computers, and is often used as the intra-node programming model on multi-core compute nodes. However, application developers are increasingly turning to hybrid models that use threading within a node and MPI between nodes. In contrast to MPI, most current threaded models do not require application developers to deal explicitly with data locality. With increasing core counts and deeper NUMA hierarchies seen in the upcoming LANL/SNL 'Cielo' capability supercomputer, data distribution places an upper bound on intra-node scalability within threaded applications. Data locality therefore has to be identified at runtime using static memory allocation policies such as first-touch or next-touch, or specified by the application user at launch time. We evaluate several existing techniques for managing data distribution using micro-benchmarks on an AMD 'Magny-Cours' system with 24 cores among 4 NUMA domains and argue for the adoption of a dynamic runtime system implemented at the kernel level, employing a novel page table replication scheme to gather per-NUMA domain memory access traces.
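
    The first-touch policy mentioned above is easy to demonstrate: the thread that first writes a page determines which NUMA domain backs it, so initializing an array with the same loop schedule as the later compute loop keeps each thread's accesses local. A hedged OpenMP sketch (actual placement also depends on the OS and on thread pinning):

        #include <cstdio>

        int main() {
            const size_t N = 1 << 22;
            double* a = new double[N];               // pages not yet placed
            // First touch: each thread writes its own static chunk, so under a
            // first-touch policy those pages land in that thread's NUMA domain.
            #pragma omp parallel for schedule(static)
            for (size_t i = 0; i < N; ++i) a[i] = 0.0;
            double sum = 0.0;
            // Same static schedule: each thread mostly reads local pages.
            #pragma omp parallel for schedule(static) reduction(+ : sum)
            for (size_t i = 0; i < N; ++i) sum += a[i] + 1.0;
            printf("sum = %.0f\n", sum);
            delete[] a;
        }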

  3. Shared Memory Performance of Multi-Computer Terminals in Distributed Information Systems.

    ERIC Educational Resources Information Center

    Reddi, Arumalla V.

    1984-01-01

    Presents a system model for transmission of input data that is coming from terminals of users in a limited user resource-sharing environment. Performance of a mini/microcomputer receiving mixture of picture-phone terminal data is analyzed with constant service times, synchronous transmission, and single-server interruptions through first-order…

  4. Autobiographical Memory Sharing in Everyday Life: Characteristics of a Good Story

    ERIC Educational Resources Information Center

    Baron, Jacqueline M.; Bluck, Susan

    2009-01-01

    Storytelling is a ubiquitous human activity that occurs across the lifespan as part of everyday life. Studies from three disparate literatures suggest that older adults (as compared to younger adults) are (a) less likely to recall story details, (b) more likely to go off-target when sharing stories, and, in contrast, (c) more likely to receive…

  5. The fault-tolerant multiprocessor computer

    NASA Technical Reports Server (NTRS)

    Smith, T. B., III (Editor); Lala, J. H. (Editor); Goldberg, J. (Editor); Kautz, W. H. (Editor); Melliar-Smith, P. M. (Editor); Green, M. W. (Editor); Levitt, K. N. (Editor); Schwartz, R. L. (Editor); Weinstock, C. B. (Editor); Palumbo, D. L. (Editor)

    1986-01-01

    The development and evaluation of fault-tolerant computer architectures and software-implemented fault tolerance (SIFT) for use in advanced NASA vehicles and potentially in flight-control systems are described in a collection of previously published reports prepared for NASA. Topics addressed include the principles of fault-tolerant multiprocessor (FTMP) operation; processor and slave regional designs; FTMP executive, facilities, acceptance-test/diagnostic, applications, and support software; FTMP reliability and availability models; SIFT hardware design; and SIFT validation and verification.

  6. Pipeline multiprocessor architecture for high speed cell image analysis

    SciTech Connect

    Castleman, K.R.; Price, K.H.; Eskenazi, R.; Ovadya, M.M.; Navon, M.A.

    1983-10-01

    A pipeline multiple-microprocessor architecture for high-speed digital image processing is being developed. The goal is a compact, fast, and low-cost pap smear analyzer for cervical cancer detection. Each processor communicates with one or two upstream processors and from one to 13 downstream processors via shared memory. Each of the identical pipeline modules (PC boards) has a Motorola MC6809 microprocessor with a 2 megabyte memory management unit, two 64kbyte dual-port image memories (shared with upstream processors) and one 64kbyte dual-port program memory (shared with a host computer). Intermodule communication is achieved by ribbon cables connected to connectors at the top of the boards. This allows considerable flexibility in configuring the system. This architecture should facilitate efficient (fast, low-cost) implementations of complex single-purpose image processing systems.

  7. Developmental improvements in the resolution and capacity of visual working memory share a common source.

    PubMed

    Simmering, Vanessa R; Miller, Hilary E

    2016-08-01

    The nature of visual working memory (VWM) representations is currently a source of debate between characterizations as slot-like versus a flexibly-divided pool of resources. Recently, a dynamic neural field model has been proposed as an alternative account that focuses more on the processes by which VWM representations are formed, maintained, and used in service of behavior. This dynamic model has explained developmental increases in VWM capacity and resolution through strengthening excitatory and inhibitory connections. Simulations of developmental improvements in VWM resolution suggest that one important change is the accuracy of comparisons between items held in memory and new inputs. Thus, the ability to detect changes is a critical component of developmental improvements in VWM performance across tasks, leading to the prediction that capacity and resolution should correlate during childhood. Comparing 5- to 8-year-old children's performance across color discrimination and change detection tasks revealed the predicted correlation between estimates of VWM capacity and resolution, supporting the hypothesis that increasing connectivity underlies improvements in VWM during childhood. These results demonstrate the importance of formalizing the processes that support the use of VWM, rather than focusing solely on the nature of representations. We conclude by considering our results in the broader context of VWM development. PMID:27329264

  8. Memory.

    ERIC Educational Resources Information Center

    McKean, Kevin

    1983-01-01

    Discusses current research (including that involving amnesiacs and snails) into the nature of the memory process, differentiating between and providing examples of "fact" memory and "skill" memory. Suggests that three brain parts (thalamus, fornix, mammillary body) are involved in the memory process. (JN)

  9. The change probability effect: incidental learning, adaptability, and shared visual working memory resources.

    PubMed

    van Lamsweerde, Amanda E; Beck, Melissa R

    2011-12-01

    Statistical properties in the visual environment can be used to improve performance on visual working memory (VWM) tasks. The current study examined the ability to incidentally learn that a change is more likely to occur to a particular feature dimension (shape, color, or location) and use this information to improve change detection performance for that dimension (the change probability effect). Participants completed a change detection task in which one change type was more probable than others. Change probability effects were found for color and shape changes, but not location changes, and intentional strategies did not improve the effect. Furthermore, the change probability effect developed and adapted to new probability information quickly. Finally, in some conditions, an improvement in change detection performance for a probable change led to an impairment in change detection for improbable changes. PMID:21963330

  10. Directions in parallel programming: HPF, shared virtual memory and object parallelism in pC++

    NASA Technical Reports Server (NTRS)

    Bodin, Francois; Priol, Thierry; Mehrotra, Piyush; Gannon, Dennis

    1994-01-01

    Fortran and C++ are the dominant programming languages used in scientific computation. Consequently, extensions to these languages are the most popular for programming massively parallel computers. We discuss two such approaches to parallel Fortran and one approach to C++. The High Performance Fortran Forum has designed HPF with the intent of supporting data parallelism on Fortran 90 applications. HPF works by asking the user to help the compiler distribute and align the data structures with the distributed memory modules in the system. Fortran-S takes a different approach in which the data distribution is managed by the operating system and the user provides annotations to indicate parallel control regions. In the case of C++, we look at pC++ which is based on a concurrent aggregate parallel model.
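
    The data distributions at the center of the HPF approach reduce to simple ownership maps. Under a one-dimensional BLOCK distribution, element i of an N-element array lives on processor i / ceil(N/P); the sketch below shows that mapping (our illustration of the concept, not compiler output):

        #include <cstdio>

        // Owner-computes rule under an HPF-style BLOCK distribution: element i
        // of an N-element array belongs to processor i / ceil(N/P).
        int main() {
            const int N = 10, P = 3;
            const int blk = (N + P - 1) / P;      // ceil(N / P)
            for (int i = 0; i < N; ++i)
                printf("element %d -> processor %d\n", i, i / blk);
        }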

  11. Use of a multiprocessor for control of a robotic system

    NASA Technical Reports Server (NTRS)

    Klein, C. A.; Wahawisan, W.

    1982-01-01

    In the case of complex industrial operations, the use of a multiprocessor for the control of a robotic system has a number of potential advantages over a single-processor design. In addition to the increased speed of parallel computation, multiprocessors can provide greater reliability. However, the control of a system with a multiprocessor is much more complicated, and requires resolution of a number of design criteria. The degree of coupling, the processor interconnection configuration, and the method of fault detection and correction are factors in the design of a multiprocessor which need to be considered for each particular application. In connection with the present investigation, a five-unit multiprocessor was built to experiment with the multiprocessor control of a robotic system. While the primary application of the multiprocessor involves the control of the Ohio State University (OSU) Hexapod, an 18-degree-of-freedom, motor-driven walking machine, the multiprocessor was designed to be versatile enough for a number of other uses.

  12. An Analysis of an Improved Bus-Based Multiprocessor Architecture

    NASA Technical Reports Server (NTRS)

    Ricks, Kenneth G.; Wells, B. Earl

    1998-01-01

    This paper analyses the effectiveness of a hybrid multiprocessing/multicomputing architecture that is based upon a single-board-computer multiprocessor (SBCM) architecture. Based upon empirical analysis using discrete event simulations and Monte Carlo techniques, this hybrid architecture, called the enhanced single-board-computer multiprocessor (ESBCM), is shown to have improved performance and scalability characteristics over current SBCM designs.

  13. Allocation for the SANDAC multiprocessor system

    SciTech Connect

    Ravl, T.M.; Ercegovac, M.D.

    1986-01-01

    This report describes an algorithm for the static allocation of tasks in a general Dataflow Multiprocessor and the SANDAC IV system in particular. Initially a model of execution and the underlying assumptions about the architecture are outlined. The authors then discuss a Graph Reduction algorithm for preprocessing the computation graph. The Graph Reduction algorithm reduces a fine grain graph to an optimal grain graph. The heuristic allocation algorithm is presented and is based on giving precedence to critical paths and minimizing the communication time between tasks. The performance of the algorithm is analyzed and the effect of varying parameters is studied. Subsequently an alternative variation with better characteristics is proposed.

  14. Partitioning of regular computation on multiprocessor systems

    NASA Technical Reports Server (NTRS)

    Lee, Fung Fung

    1988-01-01

    Problem partitioning of regular computation over two-dimensional meshes on multiprocessor systems is examined. The regular computation model considered involves repetitive evaluation of values at each mesh point with local communication. The computational workload and the communication pattern are the same at each mesh point. The regular computation model arises in numerical solutions of partial differential equations and simulations of cellular automata. Given a communication pattern, a systematic way to generate a family of partitions is presented. The influence of various partitioning schemes on performance is compared on the basis of computation to communication ratio.
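
    The computation-to-communication ratio that drives such comparisons is just area over perimeter for a 5-point stencil: an h x w block updates h*w points per step and exchanges 2*(h+w) boundary points. The sketch below (our own arithmetic illustration) contrasts strip and square-block partitions of a hypothetical 1024 x 1024 mesh on 16 processors:

        #include <cmath>
        #include <cstdio>

        // For a 5-point stencil, an h x w partition computes h*w points per step
        // and exchanges its perimeter, 2*(h+w) points; higher ratio is better.
        double ratio(double h, double w) { return (h * w) / (2.0 * (h + w)); }

        int main() {
            const double N = 1024, P = 16;
            printf("strips : %.1f\n", ratio(N / P, N));   // 64 x 1024 strips
            double s = N / std::sqrt(P);
            printf("blocks : %.1f\n", ratio(s, s));       // 256 x 256 blocks
        }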

  15. A class Hierarchical, object-oriented approach to virtual memory management

    NASA Technical Reports Server (NTRS)

    Russo, Vincent F.; Campbell, Roy H.; Johnston, Gary M.

    1989-01-01

    The Choices family of operating systems exploits class hierarchies and object-oriented programming to facilitate the construction of customized operating systems for shared memory and networked multiprocessors. The software is being used in the Tapestry laboratory to study the performance of algorithms, mechanisms, and policies for parallel systems. Described here are the architectural design and class hierarchy of the Choices virtual memory management system. The software and hardware mechanisms and policies of a virtual memory system implement a memory hierarchy that exploits the trade-off between response times and storage capacities. In Choices, the notion of a memory hierarchy is captured by abstract classes. Concrete subclasses of those abstractions implement a virtual address space, segmentation, paging, physical memory management, secondary storage, and remote (that is, networked) storage. Captured in the notion of a memory hierarchy are classes that represent memory objects. These classes provide a storage mechanism that contains encapsulated data and have methods to read or write the memory object. Each of these classes provides specializations to represent the memory hierarchy.
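
    The flavor of such a hierarchy can be suggested in a few lines: an abstract memory object exposes read and write methods, and concrete subclasses supply storage at some level of the hierarchy. The class names below are ours, not the actual Choices classes:

        #include <cstdio>
        #include <cstring>
        #include <memory>
        #include <vector>

        // Abstract memory object: encapsulated storage with read/write methods.
        class MemoryObject {
        public:
            virtual ~MemoryObject() = default;
            virtual void read(size_t off, void* buf, size_t n) = 0;
            virtual void write(size_t off, const void* buf, size_t n) = 0;
        };

        // One concrete level of the hierarchy, e.g. a RAM-backed store; other
        // subclasses could represent paging, secondary, or remote storage.
        class PhysicalMemory : public MemoryObject {
            std::vector<unsigned char> bytes_;
        public:
            explicit PhysicalMemory(size_t n) : bytes_(n, 0) {}
            void read(size_t off, void* buf, size_t n) override {
                std::memcpy(buf, bytes_.data() + off, n);
            }
            void write(size_t off, const void* buf, size_t n) override {
                std::memcpy(bytes_.data() + off, buf, n);
            }
        };

        int main() {
            std::unique_ptr<MemoryObject> m = std::make_unique<PhysicalMemory>(64);
            const char msg[] = "hierarchy";
            m->write(0, msg, sizeof msg);
            char out[16];
            m->read(0, out, sizeof msg);
            printf("%s\n", out);
        }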

  17. A multiprocessor airborne lidar data system

    NASA Technical Reports Server (NTRS)

    Wright, C. W.; Bailey, S. A.; Heath, G. E.; Piazza, C. R.

    1988-01-01

    A new multiprocessor data acquisition system was developed for the existing Airborne Oceanographic Lidar (AOL). This implementation simultaneously utilizes five single board 68010 microcomputers, the UNIX system V operating system, and the real time executive VRTX. The original data acquisition system was implemented on a Hewlett Packard HP 21-MX 16 bit minicomputer using a multi-tasking real time operating system and a mixture of assembly and FORTRAN languages. The present collection of data sources produces data at widely varied rates and requires varied amounts of burdensome real time processing and formatting. It was decided to replace the aging HP 21-MX minicomputer with a multiprocessor system. A new and flexible recording format was devised and implemented to accommodate the constantly changing sensor configuration. A central feature of this data system is the minimization of non-remote sensing bus traffic. Therefore, it is highly desirable that each micro be capable of functioning as much as possible on-card or via private peripherals. The bus is used primarily for the transfer of remote sensing data to or from the buffer queue.

  18. Selection in spatial working memory is independent of perceptual selective attention, but they interact in a shared spatial priority map.

    PubMed

    Hedge, Craig; Oberauer, Klaus; Leonards, Ute

    2015-11-01

    We examined the relationship between the attentional selection of perceptual information and of information in working memory (WM) through four experiments, using a spatial WM-updating task. Participants remembered the locations of two objects in a matrix and worked through a sequence of updating operations, each mentally shifting one dot to a new location according to an arrow cue. Repeatedly updating the same object in two successive steps is typically faster than switching to the other object; this object switch cost reflects the shifting of attention in WM. In Experiment 1, the arrows were presented in random peripheral locations, drawing perceptual attention away from the selected object in WM. This manipulation did not eliminate the object switch cost, indicating that the mechanisms of perceptual selection do not underlie selection in WM. Experiments 2a and 2b corroborated the independence of selection observed in Experiment 1, but showed a benefit to reaction times when the placement of the arrow cue was aligned with the locations of relevant objects in WM. Experiment 2c showed that the same benefit also occurs when participants are not able to mark an updating location through eye fixations. Together, these data can be accounted for by a framework in which perceptual selection and selection in WM are separate mechanisms that interact through a shared spatial priority map. PMID:26341873

  19. Mapping of H.264 decoding on a multiprocessor architecture

    NASA Astrophysics Data System (ADS)

    van der Tol, Erik B.; Jaspers, Egbert G.; Gelderblom, Rob H.

    2003-05-01

    Due to the increasing significance of development costs in the competitive domain of high-volume consumer electronics, generic solutions are required to enable reuse of the design effort and to increase the potential market volume. As a result from this, Systems-on-Chip (SoCs) contain a growing amount of fully programmable media processing devices as opposed to application-specific systems, which offered the most attractive solutions due to a high performance density. The following motivates this trend. First, SoCs are increasingly dominated by their communication infrastructure and embedded memory, thereby making the cost of the functional units less significant. Moreover, the continuously growing design costs require generic solutions that can be applied over a broad product range. Hence, powerful programmable SoCs are becoming increasingly attractive. However, to enable power-efficient designs, that are also scalable over the advancing VLSI technology, parallelism should be fully exploited. Both task-level and instruction-level parallelism can be provided by means of e.g. a VLIW multiprocessor architecture. To provide the above-mentioned scalability, we propose to partition the data over the processors, instead of traditional functional partitioning. An advantage of this approach is the inherent locality of data, which is extremely important for communication-efficient software implementations. Consequently, a software implementation is discussed, enabling e.g. SD resolution H.264 decoding with a two-processor architecture, whereas High-Definition (HD) decoding can be achieved with an eight-processor system, executing the same software. Experimental results show that the data communication considerably reduces up to 65% directly improving the overall performance. Apart from considerable improvement in memory bandwidth, this novel concept of partitioning offers a natural approach for optimally balancing the load of all processors, thereby further improving the
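
    The data partitioning exploited here rests on H.264's macroblock dependency structure: each macroblock needs its decoded left and top neighbors, so all macroblocks on an anti-diagonal of the grid are mutually independent and can be processed in parallel as a wavefront. The sketch below is a generic illustration of that ordering, not the paper's mapping:

        #include <cstdio>
        #include <vector>

        int main() {
            const int ROWS = 4, COLS = 6;                 // macroblock grid
            std::vector<int> step(ROWS * COLS, 0);
            for (int d = 0; d < ROWS + COLS - 1; ++d) {   // one anti-diagonal per pass
                #pragma omp parallel for
                for (int r = 0; r < ROWS; ++r) {
                    int c = d - r;
                    if (c < 0 || c >= COLS) continue;     // cell not on this diagonal
                    int left = (c > 0) ? step[r * COLS + c - 1] : 0;
                    int top  = (r > 0) ? step[(r - 1) * COLS + c] : 0;
                    step[r * COLS + c] = 1 + (left > top ? left : top); // deps done
                }
            }
            for (int r = 0; r < ROWS; ++r) {              // print the wavefront order
                for (int c = 0; c < COLS; ++c) printf("%2d ", step[r * COLS + c]);
                printf("\n");
            }
        }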

  20. Insertion of coherence requests for debugging a multiprocessor

    DOEpatents

    Blumrich, Matthias A.; Salapura, Valentina

    2010-02-23

    A method and system are disclosed to insert coherence events in a multiprocessor computer system, and to present those coherence events to the processors of the multiprocessor computer system for analysis and debugging purposes. The coherence events are inserted in the computer system by adding one or more special insert registers. By writing into the insert registers, coherence events are inserted in the multiprocessor system as if they were generated by the normal coherence protocol. Once these coherence events are processed, the processing of coherence events can continue in the normal operation mode.
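
    As a concept sketch only (the register address, event encoding, and field widths below are invented; the patent defines none of them), injecting a coherence event amounts to a single store to a memory-mapped insert register:

      #include <stdint.h>
      #include <stdio.h>

      #define INSERT_REG_ADDR 0xFFFF8000u   /* hypothetical MMIO address */
      #define EVT_INVALIDATE  0x1u          /* hypothetical event opcode  */

      /* Encode an event the way the (hypothetical) insert register expects. */
      static uint32_t encode_event(uint32_t opcode, uint32_t cache_line) {
          return (opcode << 28) | (cache_line & 0x0FFFFFFFu);
      }

      /* On real hardware this single store would make the event visible to
       * the target processor as if the coherence protocol had generated it. */
      static void inject_coherence_event(uint32_t word) {
          volatile uint32_t *insert_reg =
              (volatile uint32_t *)(uintptr_t)INSERT_REG_ADDR;
          *insert_reg = word;
      }

      int main(void) {
          uint32_t w = encode_event(EVT_INVALIDATE, 0x1234u);
          printf("would write 0x%08x to the insert register\n", w);
          (void)inject_coherence_event;     /* not called here: no such device */
          return 0;
      }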

  1. A New coscheduling technique for a cluster of symmetric multiprocessors

    SciTech Connect

    Yoo, A B; Jette, M A

    2000-04-17

    Coscheduling is essential for obtaining good performance in a time-shared symmetric multiprocessor (SMP) cluster environment. However, the most common technique, gang scheduling, has limitations such as poor scalability and vulnerability to faults, mainly due to explicit synchronization between its components. A decentralized approach called dynamic coscheduling (DCS) has been shown to be effective for networks of workstations (NOW), but this technique is not suitable for the workloads on a very large SMP-cluster with thousands of processors. Furthermore, its implementation can be prohibitively expensive for such a large-scale machine. In this paper, we propose a novel coscheduling technique based on the DCS approach which can achieve coscheduling on very large SMP-clusters in a scalable, efficient, and cost-effective way. In the proposed technique, each local scheduler achieves coscheduling based upon message traffic between the components of parallel jobs. Message trapping is carried out at the user-level, eliminating the need for unsupported hardware or device-level programming. A sending process attaches its status to outgoing messages so local schedulers on remote nodes can make more intelligent scheduling decisions. Once scheduled, processes are guaranteed some minimum period of time to execute. This provides an opportunity to synchronize the parallel job's components across all nodes and achieve good program performance. The results from a performance study reveal that the proposed technique is a promising approach that can reduce response time significantly over uncoordinated scheduling.
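
    In C, the piggybacking idea might look like the following sketch (msg_header_t and the arrival hook are hypothetical names, and the actual priority boost is left as a comment):

      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      typedef struct {
          uint32_t job_id;          /* parallel job this message belongs to  */
          uint32_t src_rank;        /* sending process within the job        */
          bool     sender_running;  /* was the sender scheduled when it sent */
      } msg_header_t;

      /* Hypothetical hook invoked by user-level message trapping on arrival. */
      static void on_message_arrival(const msg_header_t *h, int local_pid) {
          if (h->sender_running) {
              /* The peer is on CPU elsewhere: run the local component now so
               * both sides of the job execute together (coscheduling). */
              printf("boosting pid %d for job %u (peer %u is running)\n",
                     local_pid, h->job_id, h->src_rank);
              /* boost_priority(local_pid); grant a minimum quantum ... */
          }
      }

      int main(void) {
          msg_header_t h = { .job_id = 7, .src_rank = 3, .sender_running = true };
          on_message_arrival(&h, 4242);
          return 0;
      }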

  2. Clocking and synchronization circuits in multiprocessor systems

    SciTech Connect

    Jeong, D.K.

    1989-01-01

    Microprocessors based on RISC (Reduced Instruction Set Computer) concepts have demonstrated an ability to provide more computing power at a given level of integration than conventional microprocessors. The next step is multiprocessors composed of RISC processing elements. Communication bandwidth among such microprocessors is critical in achieving efficient hardware utilization. This thesis focuses on the communication capability of VLSI circuits and presents new circuit techniques as a guide to build an interconnection network of VLSI microprocessors. Circuit techniques for PLL-based clock generation are described along with stability criteria. The main objective of the circuit is to realize a zero delay buffer. Experimental results show the feasibility of such circuits in VLSI. Synchronizer circuit configurations in both bipolar and MOS technology that best utilize each device, or overcome the technology limit using a bandwidth doubling technique are shown. Interface techniques including handshake mechanisms in such a system are also described.

  3. High throughput network for multiprocessor interconnections

    NASA Astrophysics Data System (ADS)

    Raatikainen, Pertti; Zidbeck, Juha

    1993-05-01

    Multiprocessor architectures are needed to support modern broadband applications, since traditional bus structures are not capable of providing high throughput. New bus structures are needed, especially in the area of network components and terminals. A study to find an efficient and cost effective interconnection topology for the future high speed products is presented. The most common bus topologies are introduced, and their characteristics are estimated to decide which one of them offers the best performance and lowest implementation cost. The ring topology is chosen to be studied in more detail. Four competing bus access schemes for the high throughput ring are introduced as well as simulation models for each of them. Using transfer delay and throughput results, as well as keeping the implementation point of view in mind, the best candidate is selected to be studied and experimented with in the succeeding research project.

  4. Fault diagnosis in sparse multiprocessor systems

    NASA Technical Reports Server (NTRS)

    Blough, Douglas M.; Sullivan, Gregory F.; Masson, Gerald M.

    1988-01-01

    The problem of fault diagnosis in multiprocessor systems is considered under a uniformly probabilistic model in which processors are faulty with probability p. This work focuses on minimizing the number of tests that must be conducted in order to correctly diagnose the state of every processor in the system with high probability. A diagnosis algorithm that can correctly diagnose the state of every processor with probability approaching one in a class of systems performing slightly more than a linear number of tests is presented. A nearly matching lower bound on the number of tests required to achieve correct diagnosis in arbitrary systems is also proven. The number of tests required under this probabilistic model is shown to be significantly less than under a bounded-size fault set model. Because the number of tests that must be conducted is a measure of the diagnosis overhead, these results represent a dramatic improvement in the performance of system-level diagnosis techniques.
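
    A toy Monte Carlo rendition of the probabilistic model (a sketch, not the paper's algorithm, which needs only slightly more than a linear number of tests): processors are faulty with probability p, each is tested by T random testers, fault-free testers report the truth, faulty testers answer arbitrarily, and diagnosis is by majority vote.

      #include <stdio.h>
      #include <stdlib.h>

      #define N 1000          /* processors */
      #define T 15            /* tests per processor */
      #define P_FAULT 0.1     /* independent fault probability p */

      int main(void) {
          int faulty[N];
          srand(12345);
          for (int i = 0; i < N; i++)
              faulty[i] = ((double)rand() / RAND_MAX) < P_FAULT;
          int errors = 0;
          for (int i = 0; i < N; i++) {
              int says_faulty = 0;
              for (int t = 0; t < T; t++) {
                  int tester;
                  do { tester = rand() % N; } while (tester == i);
                  /* fault-free testers tell the truth; faulty ones guess */
                  says_faulty += faulty[tester] ? (rand() & 1) : faulty[i];
              }
              int diagnosis = (2 * says_faulty > T);   /* majority vote */
              errors += (diagnosis != faulty[i]);
          }
          printf("misdiagnosed %d of %d processors\n", errors, N);
          return 0;
      }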

  5. Hardware for a real-time multiprocessor simulator

    NASA Technical Reports Server (NTRS)

    Blech, R. A.; Arpasi, D. J.

    1984-01-01

    The hardware for a real time multiprocessor simulator (RTMPS) developed at the NASA Lewis Research Center is described. The RTMPS is a multiple microprocessor system used to investigate the application of parallel processing concepts to real time simulation. It is designed to provide flexible data exchange paths between processors by using off-the-shelf microcomputer boards and minimal customized interfacing. A dedicated operator interface allows easy setup of the simulator and quick interpretation of simulation data. Simulations for the RTMPS are coded in a NASA designed real time multiprocessor language (RTMPL). This language is high level and geared to the multiprocessor environment. A real time multiprocessor operating system (RTMPOS) has also been developed that provides a user friendly operator interface. The RTMPS and supporting software are currently operational and are being evaluated at Lewis. The results of this evaluation will be used to specify the design of an optimized parallel processing system for real time simulation of dynamic systems.

  6. VME rollback hardware for time warp multiprocessor systems

    NASA Technical Reports Server (NTRS)

    Robb, Michael J.; Buzzell, Calvin A.

    1992-01-01

    The purpose of the research effort is to develop and demonstrate innovative hardware to implement specific rollback and timing functions required for efficient queue management and precision timekeeping in multiprocessor discrete event simulations. The previously completed phase 1 effort demonstrated the technical feasibility of building hardware modules which eliminate the state saving overhead of the Time Warp paradigm used in distributed simulations on multiprocessor systems. The current phase 2 effort will build multiple pre-production rollback hardware modules integrated with a network of Sun workstations, and the integrated system will be tested by executing a Time Warp simulation. The rollback hardware will be designed to interface with the greatest number of multiprocessor systems possible. The authors believe that the rollback hardware will provide for significant speedup of large scale discrete event simulation problems and allow multiprocessors using Time Warp to dramatically increase performance.

  7. File-System Workload on a Scientific Multiprocessor

    NASA Technical Reports Server (NTRS)

    Kotz, David; Nieuwejaar, Nils

    1995-01-01

    Many scientific applications have intense computational and I/O requirements. Although multiprocessors have permitted astounding increases in computational performance, the formidable I/O needs of these applications cannot be met by current multiprocessors and their I/O subsystems. To prevent I/O subsystems from forever bottlenecking multiprocessors and limiting the range of feasible applications, new I/O subsystems must be designed. The successful design of computer systems (both hardware and software) depends on a thorough understanding of their intended use. A system designer optimizes the policies and mechanisms for the cases expected to be most common in the user's workload. In the case of multiprocessor file systems, however, designers have been forced to build file systems based only on speculation about how they would be used, extrapolating from file-system characterizations of general-purpose workloads on uniprocessor and distributed systems or scientific workloads on vector supercomputers (see sidebar on related work). To help these system designers, in June 1993 we began the Charisma Project, so named because the project sought to characterize I/O in scientific multiprocessor applications from a variety of production parallel computing platforms and sites. The Charisma project is unique in recording individual read and write requests in live, multiprogramming, parallel workloads (rather than from selected or nonparallel applications). In this article, we present the first results from the project: a characterization of the file-system workload of an iPSC/860 multiprocessor running production, parallel scientific applications at NASA's Ames Research Center.

  8. Evaluation of the Cedar memory system: Configuration of 16 by 16

    NASA Technical Reports Server (NTRS)

    Gallivan, K.; Jalby, W.; Wijshoff, H.

    1991-01-01

    Some basic results on the performance of the Cedar multiprocessor system are presented. Empirical results on the 16-processor, 16-memory-bank system configuration, which show the behavior of the Cedar system under different modes of operation, are presented.

  9. FTMP (Fault Tolerant Multiprocessor) programmer's manual

    NASA Technical Reports Server (NTRS)

    Feather, F. E.; Liceaga, C. A.; Padilla, P. A.

    1986-01-01

    The Fault Tolerant Multiprocessor (FTMP) computer system was constructed using the Rockwell/Collins CAPS-6 processor. It is installed in the Avionics Integration Research Laboratory (AIRLAB) of NASA Langley Research Center. It is hosted by AIRLAB's System 10, a VAX 11/750, for the loading of programs and experimentation. The FTMP support software includes a cross compiler for a high level language called Automated Engineering Design (AED) System, an assembler for the CAPS-6 processor assembly language, and a linker. Access to this support software is through an automated remote access facility on the VAX which relieves the user of the burden of learning how to use the IBM 4381. This manual is a compilation of information about the FTMP support environment. It explains the FTMP software and support environment along with many of the finer points of running programs on FTMP. This will be helpful to the researcher trying to run an experiment on FTMP and even to the person probing FTMP with fault injections. Much of the information in this manual can be found in other sources; we are only attempting to bring together the basic points in a single source. If the reader should need points clarified, there is a list of support documentation in the back of this manual.

  10. Prefetching in file systems for MIMD multiprocessors

    NASA Technical Reports Server (NTRS)

    Kotz, David F.; Ellis, Carla Schlatter

    1990-01-01

    The question of whether prefetching blocks of the file into the block cache can effectively reduce overall execution time of a parallel computation, even under favorable assumptions, is considered. Experiments have been conducted with an interleaved file system testbed on the Butterfly Plus multiprocessor. Results of these experiments suggest that (1) the hit ratio, the accepted measure in traditional caching studies, may not be an adequate measure of performance when the workload consists of parallel computations and parallel file access patterns, (2) caching with prefetching can significantly improve the hit ratio and the average time to perform an I/O (input/output) operation, and (3) an improvement in overall execution time has been observed in most cases. In spite of these gains, prefetching sometimes results in increased execution times (a negative result, given the optimistic nature of the study). The authors explore why it is not trivial to translate savings on individual I/O requests into consistently better overall performance and identify the key problems that need to be addressed in order to improve the potential of prefetching techniques in this environment.
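
    A toy C simulation of one-block read-ahead (a sketch with invented parameters, not the testbed's code): a tiny FIFO block cache prefetches block b+1 on every access to block b, turning a purely sequential scan into nearly all hits, while the question of overall execution time remains one for measurement.

      #include <stdbool.h>
      #include <stdio.h>

      #define CACHE_SLOTS 8

      static int cache[CACHE_SLOTS];
      static int next_slot = 0;

      static bool cache_lookup(int block) {
          for (int i = 0; i < CACHE_SLOTS; i++)
              if (cache[i] == block) return true;
          return false;
      }

      static void cache_insert(int block) {
          cache[next_slot] = block;              /* trivial FIFO replacement */
          next_slot = (next_slot + 1) % CACHE_SLOTS;
      }

      /* Read block b; always make sure b+1 is resident (one-block read-ahead). */
      static bool read_block(int b, bool prefetch) {
          bool hit = cache_lookup(b);
          if (!hit) cache_insert(b);             /* demand fetch */
          if (prefetch && !cache_lookup(b + 1))
              cache_insert(b + 1);               /* prefetch the next block */
          return hit;
      }

      int main(void) {
          for (int i = 0; i < CACHE_SLOTS; i++) cache[i] = -1;
          int hits = 0, n = 100;
          for (int b = 0; b < n; b++)
              hits += read_block(b, true);
          printf("sequential scan: %d/%d hits with one-block prefetch\n", hits, n);
          return 0;
      }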

  11. Interrupt handling in a multiprocessor computing system

    SciTech Connect

    D'Amico, L.W.; Guyer, J.M.

    1989-01-03

    A multiprocessor computing system is described comprising: a system bus, including an address bus for carrying an address phase of an instruction and a data bus for carrying a data phase of an instruction; a plurality of processing units connected to the system bus, each processing unit including means for generating broadcast interrupt origin request instructions on the system bus; asynchronous input/output channel controllers connected to the system bus, each of the input/output channel controllers including means for generating a synchronizing signal in response to completion of an address phase of a broadcast instruction on the system bus; and priority lines, each corresponding to a different one of the processing units and connected through each of the input/output channel controllers, the input/output channel controllers being arranged on the priority lines in order of priority, the priority lines being gated in an input/output channel controller so that priority is asserted over all lower-priority input/output channel controllers on a priority line by an input/output channel controller if that controller has an interrupt pending for the processing unit corresponding to the priority line.

  12. Particle simulation in a multiprocessor environment

    NASA Technical Reports Server (NTRS)

    Mcdonald, Jeffrey D.

    1991-01-01

    A parallel implementation of a particle simulation method that is portable between a wide class of multiprocessor computers is presented. A fine grain spatial decomposition is utilized where several subdomains having a regular structure are computed at each processing node. This leads directly to an efficient and straightforward load balancing scheme if the number of subdomains at each processor is permitted to vary in an appropriate manner. Three dimensional simulations incorporating full thermochemical nonequilibrium are possible using the resulting code. Vectorizable algorithms are retained from earlier work allowing efficient use of deeply pipelined node processors where available. Performance results are presented from three different machine architectures demonstrating the portability of the code. On a 128-node Intel iPSC/860, performance is twice that of a single Cray-Y/MP CPU running a highly vectorized simulation code. Speedup is linear over the full range of number of processors on all target machines, indicating scalability of the method to higher degrees of parallelism.
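
    The load-balancing idea, a variable number of regular subdomains per processor, can be sketched with a greedy largest-first assignment; the subdomain costs below are invented stand-ins for per-subdomain particle counts.

      #include <stdio.h>
      #include <stdlib.h>

      #define NPROC 4
      #define NSUB  32

      static int cmp_desc(const void *a, const void *b) {
          return *(const int *)b - *(const int *)a;   /* sort descending */
      }

      int main(void) {
          int cost[NSUB], load[NPROC] = {0}, count[NPROC] = {0};
          srand(7);
          for (int i = 0; i < NSUB; i++)
              cost[i] = 1 + rand() % 100;     /* e.g., particles per subdomain */
          qsort(cost, NSUB, sizeof cost[0], cmp_desc);

          for (int i = 0; i < NSUB; i++) {    /* give each piece to the least-loaded node */
              int p = 0;
              for (int q = 1; q < NPROC; q++)
                  if (load[q] < load[p]) p = q;
              load[p] += cost[i];
              count[p]++;                     /* subdomain counts per node vary */
          }
          for (int p = 0; p < NPROC; p++)
              printf("proc %d: %2d subdomains, load %d\n", p, count[p], load[p]);
          return 0;
      }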

  13. Memories.

    ERIC Educational Resources Information Center

    Brand, Judith, Ed.

    1998-01-01

    This theme issue of the journal "Exploring" covers the topic of "memories" and describes an exhibition at San Francisco's Exploratorium that ran from May 22, 1998 through January 1999 and that contained over 40 hands-on exhibits, demonstrations, artworks, images, sounds, smells, and tastes that demonstrated and depicted the biological,…

  14. A fault-tolerant multiprocessor architecture for aircraft, volume 1. [autopilot configuration

    NASA Technical Reports Server (NTRS)

    Smith, T. B.; Hopkins, A. L.; Taylor, W.; Ausrotas, R. A.; Lala, J. H.; Hanley, L. D.; Martin, J. H.

    1978-01-01

    A fault-tolerant multiprocessor architecture is reported. This architecture, together with a comprehensive information system architecture, has important potential for future aircraft applications. A preliminary definition and assessment of a suitable multiprocessor architecture for such applications is developed.

  15. Efficient static scheduling of loops on synchronous multiprocessors

    SciTech Connect

    Zaky, A.M.

    1989-01-01

    This dissertation investigates efficient compile-time scheduling techniques for exploiting parallelism on synchronous multiprocessors. Synchronous multiprocessors, e.g. Very Long Instruction Word (VLIW) machines, are very effective in utilizing unstructured fine-grained parallelism in programs. The effectiveness of such machines is crucially dependent on the static compile-time analysis and detection of potential parallelism. The first part of the dissertation focuses on scheduling sequential loops on multiprocessors with multiple identical processor units. The problem of determining the maximal initiation rate for the execution of a sequential loop with uniform dependence distances on a synchronous multiprocessor is addressed and cast as an eigenvalue problem in a path algebra. A low-order polynomial algorithm for the determination of the optimal loop initiation rate is developed, and a schedule that exploits fine-grained parallelism and achieves the optimal initiation rate is developed under an idealized unbounded processor model. Next, the concepts developed above are extended to deal with perfectly-nested loops with uniform dependences. A strategy is developed to identify both the loop level and fine-grained expression level parallelism in nested loops, and to efficiently schedule such loops on synchronous multiprocessors. Loop scheduling techniques such as Do-Across, Wavefront scheduling, and fine-grained scheduling techniques such as loop unfolding are shown to be derivable within the presented framework.
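
    The recurrence-constrained bound the dissertation computes (via an eigenvalue problem in a path algebra) can be illustrated directly: for each dependence cycle of a uniform-dependence loop, the initiation interval is at least latency/distance, so the optimal rate is the reciprocal of the maximum such ratio. A minimal C sketch with made-up cycles:

      #include <stdio.h>

      typedef struct { int latency, distance; } cycle_t;

      int main(void) {
          /* invented dependence cycles of a uniform-dependence loop:
           * (total operation latency around the cycle, total iteration distance) */
          cycle_t cycles[] = { {3, 1}, {5, 2}, {7, 3} };
          int n = sizeof cycles / sizeof cycles[0];

          double mii = 0.0;                  /* minimum initiation interval */
          for (int i = 0; i < n; i++) {
              double r = (double)cycles[i].latency / cycles[i].distance;
              if (r > mii) mii = r;          /* binding recurrence dominates */
          }
          printf("minimum initiation interval = %.2f cycles\n", mii);
          printf("maximal initiation rate     = %.3f iterations/cycle\n", 1.0 / mii);
          return 0;
      }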

  16. Clocking and synchronization circuits in multiprocessor systems

    SciTech Connect

    Jeong, Deog-Kyoon.

    1989-01-01

    Microprocessors based on RISC (Reduced Instruction Set Computer) concepts have demonstrated an ability to provide more computing power at a given level of integration than conventional microprocessors. The next step is multiprocessors composed of RISC processing elements. Communication bandwidth among such microprocessors is critical in achieving efficient hardware utilization. This thesis focuses on the communication capability of VLSI circuits and presents new circuit techniques as a guide to build an interconnection network of VLSI microprocessors. Two of the most prominent problems in a synchronous system, which most of the current computer systems are based on, have been clock skew and synchronization failure. A new concept called self-timed systems solves such problems but has not been accepted in microprocessor implementations yet because of its complex design procedure and increased overhead. With this in mind, this thesis concentrates on a system in which individual synchronous subsystems are connected asynchronously. Synchronous subsystems operate with a better control over clock skew using a phase locked loop (PLL) technique. Communication among subsystems is done asynchronously with a controlled synchronization failure rate. One advantage is that conventional VLSI design methodologies which are more efficient can still be applied. Circuit techniques for PLL-based clock generation are described along with stability criteria. The main objective of the circuit is to realize a zero delay buffer. Experimental results show the feasibility of such circuits in VLSI. Synchronizer circuit configurations in both bipolar and MOS technology that best utilize each device, or overcome the technology limit using a bandwidth doubling technique are shown. Interface techniques including handshake mechanisms in such a system are also described.

  17. Real-Time Multiprocessor Programming Language (RTMPL) user's manual

    NASA Technical Reports Server (NTRS)

    Arpasi, D. J.

    1985-01-01

    A real-time multiprocessor programming language (RTMPL) has been developed to provide for high-order programming of real-time simulations on systems of distributed computers. RTMPL is a structured, engineering-oriented language. The RTMPL utility supports a variety of multiprocessor configurations and types by generating assembly language programs according to user-specified targeting information. Many programming functions are assumed by the utility (e.g., data transfer and scaling) to reduce the programming chore. This manual describes RTMPL from a user's viewpoint. Source generation, applications, utility operation, and utility output are detailed. An example simulation is generated to illustrate many RTMPL features.

  18. Development and evaluation of a Fault-Tolerant Multiprocessor (FTMP) computer. Volume 2: FTMP software

    NASA Technical Reports Server (NTRS)

    Lala, J. H.; Smith, T. B., III

    1983-01-01

    The software developed for the Fault-Tolerant Multiprocessor (FTMP) is described. The FTMP executive is a timer-interrupt driven dispatcher that schedules iterative tasks which run at 3.125, 12.5, and 25 Hz. Major tasks which run under the executive include system configuration control, flight control, and display. The flight control task includes autopilot and autoland functions for a jet transport aircraft. System Displays include status displays of all hardware elements (processors, memories, I/O ports, buses), failure log displays showing transient and hard faults, and an autopilot display. All software is in a higher order language (AED, an ALGOL derivative). The executive is a fully distributed general purpose executive which automatically balances the load among available processor triads. Provisions for graceful performance degradation under processing overload are an integral part of the scheduling algorithms.

  19. Job-mix modeling and system analysis of an aerospace multiprocessor.

    NASA Technical Reports Server (NTRS)

    Mallach, E. G.

    1972-01-01

    An aerospace guidance computer organization, consisting of multiple processors and memory units attached to a central time-multiplexed data bus, is described. A job mix for this type of computer is obtained by analysis of Apollo mission programs. Multiprocessor performance is then analyzed using: 1) queuing theory, under certain 'limiting case' assumptions; 2) Markov process methods; and 3) system simulation. Results of the analyses indicate: 1) Markov process analysis is a useful and efficient predictor of simulation results; 2) efficient job execution is not seriously impaired even when the system is so overloaded that new jobs are inordinately delayed in starting; 3) job scheduling is significant in determining system performance; and 4) a system having many slow processors may or may not perform better than a system of equal power having few fast processors, but will not perform significantly worse.

  20. Memory System Technologies for Future High-End Computing Systems

    SciTech Connect

    McKee, S A; de Supinski, B R; Mueller, F; Tyson, G S

    2003-05-16

    Our ability to solve Grand Challenge Problems in computing hinges on the development of reliable and efficient High-End Computing systems. Unfortunately, the increasing gap between memory and processor speeds remains one of the major bottlenecks in modern architectures. Uniprocessor nodes still suffer, but symmetric multiprocessor nodes--where access to physical memory is shared among all processors--are among the hardest hit. In the latter case, the memory system must juggle multiple working sets and maintain memory coherence, on top of simply responding to access requests. To illustrate the severity of the current situation, consider two important examples: even the high-performance parallel supercomputers in use at Department of Energy National labs observe single-processor utilization rates as low as 5%, and transaction processing commercial workloads see utilizations of at most about 33%. A wealth of research demonstrates that traditional memory systems are incapable of bridging the processor/memory performance gap, and the problem continues to grow. The success of future High-End Computing platforms therefore depends on our developing hardware and software technologies to dramatically relieve the memory bottleneck. In order to take better advantage of the tremendous computing power of modern microprocessors and future High-End systems, we consider it crucial to develop the hardware for intelligent, adaptable memory systems; the middleware and OS modifications to manage them; and the compiler technology and performance tools to exploit them. Taken together, these will provide the foundations for meeting the requirements of future generations of performance-critical, parallel systems based on either uniprocessor or SMP nodes (including PIM organizations). We feel that such solutions should not be vendor-specific, but should be sufficiently general and adaptable such that the technologies could be leveraged by any commercial vendor of High-End Computing systems.

  1. Fault tree models for fault tolerant hypercube multiprocessors

    NASA Technical Reports Server (NTRS)

    Boyd, Mark A.; Tuazon, Jezus O.

    1991-01-01

    Three candidate fault tolerant hypercube architectures are modeled, their reliability analyses are compared, and the resulting implications of these methods of incorporating fault tolerance into hypercube multiprocessors are discussed. In the course of performing the reliability analyses, the use of HARP and fault trees in modeling sequence dependent system behaviors is demonstrated.

  2. Shared Attention.

    PubMed

    Shteynberg, Garriy

    2015-09-01

    Shared attention is extremely common. In stadiums, public squares, and private living rooms, people attend to the world with others. Humans do so across all sensory modalities, sharing the sights, sounds, tastes, smells, and textures of everyday life with one another. The potential for attending with others has grown considerably with the emergence of mass media technologies, which allow for the sharing of attention in the absence of physical co-presence. In the last several years, studies have begun to outline the conditions under which attending together is consequential for human memory, motivation, judgment, emotion, and behavior. Here, I advance a psychological theory of shared attention, defining its properties as a mental state and outlining its cognitive, affective, and behavioral consequences. I review empirical findings that are uniquely predicted by shared-attention theory and discuss the possibility of integrating shared-attention, social-facilitation, and social-loafing perspectives. Finally, I reflect on what shared-attention theory implies for living in the digital world. PMID:26385997

  3. Characterizing parallel file-access patterns on a large-scale multiprocessor

    NASA Technical Reports Server (NTRS)

    Purakayastha, Apratim; Ellis, Carla Schlatter; Kotz, David; Nieuwejaar, Nils; Best, Michael

    1994-01-01

    Rapid increases in the computational speeds of multiprocessors have not been matched by corresponding performance enhancements in the I/O subsystem. To satisfy the large and growing I/O requirements of some parallel scientific applications, we need parallel file systems that can provide high-bandwidth and high-volume data transfer between the I/O subsystem and thousands of processors. Design of such high-performance parallel file systems depends on a thorough grasp of the expected workload. So far there have been no comprehensive usage studies of multiprocessor file systems. Our CHARISMA project intends to fill this void. The first results from our study involve an iPSC/860 at NASA Ames. This paper presents results from a different platform, the CM-5 at the National Center for Supercomputing Applications. The CHARISMA studies are unique because we collect information about every individual read and write request and about the entire mix of applications running on the machines. The results of our trace analysis lead to recommendations for parallel file system design. First, the file system should support efficient concurrent access to many files and I/O requests from many jobs under varying load conditions. Second, it must efficiently manage large files kept open for long periods. Third, it should expect to see small requests, predominantly sequential access patterns, application-wide synchronous access, no concurrent file-sharing between jobs, appreciable byte and block sharing between processes within jobs, and strong interprocess locality. Finally, the trace data suggest that node-level write caches and collective I/O request interfaces may be useful in certain environments.

  4. Fast decision tree-based method to index large DNA-protein sequence databases using hybrid distributed-shared memory programming model.

    PubMed

    Jaber, Khalid Mohammad; Abdullah, Rosni; Rashid, Nur'Aini Abdul

    2014-01-01

    In recent times, the size of biological databases has increased significantly, with the continuous growth in the number of users and rate of queries, such that some databases have reached the terabyte size. There is, therefore, an increasing need to access databases at the fastest rates possible. In this paper, the decision tree indexing model (PDTIM) was parallelised, using a hybrid of distributed and shared memory on a resident database, with horizontal and vertical growth through Message Passing Interface (MPI) and POSIX Thread (PThread), to accelerate the index building time. The PDTIM was implemented using 1, 2, 4 and 5 processors on 1, 2, 3 and 4 threads respectively. The results show that the hybrid technique improved the speedup, compared to a sequential version. It could be concluded from the results that the proposed PDTIM is appropriate for large data sets, in terms of index building time. PMID:24794073
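
    A minimal C sketch of the hybrid model (an assumed structure, not the PDTIM code): MPI ranks hold horizontal partitions of the database while PThreads within each rank build index parts concurrently; build_index_part() is a placeholder for the indexing work.

      #include <mpi.h>
      #include <pthread.h>
      #include <stdio.h>

      #define THREADS_PER_RANK 4

      static void *build_index_part(void *arg) {
          long part = (long)arg;           /* vertical slice of this rank's data */
          printf("  thread indexing part %ld\n", part);
          return NULL;
      }

      int main(int argc, char **argv) {
          int provided, rank, size;
          MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &size);
          printf("rank %d of %d: indexing its horizontal partition\n", rank, size);

          pthread_t tid[THREADS_PER_RANK];
          for (long t = 0; t < THREADS_PER_RANK; t++)
              pthread_create(&tid[t], NULL, build_index_part, (void *)t);
          for (int t = 0; t < THREADS_PER_RANK; t++)
              pthread_join(tid[t], NULL);

          MPI_Barrier(MPI_COMM_WORLD);     /* all partitions are indexed */
          MPI_Finalize();
          return 0;
      }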

  5. Combined shared and distributed memory ab-initio computations of molecular-hydrogen systems in the correlated state: Process pool solution and two-level parallelism

    NASA Astrophysics Data System (ADS)

    Biborski, Andrzej; Kądzielawa, Andrzej P.; Spałek, Józef

    2015-12-01

    An efficient computational scheme devised for investigations of ground state properties of the electronically correlated systems is presented. As an example, an (H2)n chain is considered with the long-range electron-electron interactions taken into account. The implemented procedure covers: (i) single-particle Wannier wave-function basis construction in the correlated state, (ii) microscopic parameters calculation, and (iii) ground state energy optimization. The optimization loop is based on a highly effective process-pool solution with a specific root-workers approach. Hierarchical, two-level parallelism was applied: both shared (by use of Open Multi-Processing) and distributed (by use of Message Passing Interface) memory models were utilized. We discuss in detail how this approach results in a substantial increase of the calculation speed, reaching a factor of 300 for the fully parallelized solution. The scheme elaborated in detail reflects the situation in which the most demanding task is the single-particle basis optimization.
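
    A minimal C/MPI sketch of a root-workers process pool of this general kind (the tags and the placeholder "energy" task are invented): rank 0 dispatches tasks one at a time and refills each worker as results return, which keeps the pool busy when task costs vary.

      #include <mpi.h>
      #include <stdio.h>

      #define TAG_WORK 1
      #define TAG_STOP 2
      #define NTASKS   16

      int main(int argc, char **argv) {
          int rank, size;
          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &size);
          if (rank == 0) {                      /* root: hand out work, gather results */
              int sent = 0, done = 0;
              for (int w = 1; w < size && sent < NTASKS; w++) {
                  MPI_Send(&sent, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                  sent++;
              }
              while (done < sent) {
                  double e;
                  MPI_Status st;
                  MPI_Recv(&e, 1, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
                           MPI_COMM_WORLD, &st);
                  done++;
                  printf("result %.1f from rank %d\n", e, st.MPI_SOURCE);
                  if (sent < NTASKS) {          /* refill the worker immediately */
                      MPI_Send(&sent, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                               MPI_COMM_WORLD);
                      sent++;
                  }
              }
              for (int w = 1; w < size; w++)    /* tell every worker to stop */
                  MPI_Send(&sent, 1, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
          } else {                              /* worker: compute until stopped */
              for (;;) {
                  int task;
                  MPI_Status st;
                  MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
                  if (st.MPI_TAG == TAG_STOP) break;
                  double e = -1.0 * task;       /* placeholder "energy" evaluation */
                  MPI_Send(&e, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
              }
          }
          MPI_Finalize();
          return 0;
      }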

  6. A prototype functional language implementation for hierarchical- memory architectures

    SciTech Connect

    Wolski, R.; Feo, J.; Cann, D.

    1992-01-14

    Programming languages are the most important tool at a programmer's disposal. All other tools correct, visualize, or evaluate the product crafted by this tool. The advent of multiprocessor computer systems has greatly complicated the programmer's task and increased the need for high-level languages capable of automatically taming these architectures. In this paper, we describe a prototype implementation of Sisal for multiprocessor, hierarchical-memory systems. The implementation includes explicit compiler and runtime control that effectively exploits the different levels of memory and manages interprocess communications (IPC). We give preliminary performance results for this system on the BBN TC2000.

  7. An Or Processing Multiprocessor System For Artificial Intelligence

    NASA Astrophysics Data System (ADS)

    Fu, Hsin-Chia; Chiang, Cheng-Chin

    1989-03-01

    In this paper, an OR-Parallel Execution model based multiprocessor system is proposed. Our OR-parallel execution model addresses the following features: (1) Run-time Intelligent Backtracking, (2) Distributed process control and execution, (3) Minimization of data communication between processors, and (4) Minimization of parallel processing management overhead. Special hardware modules such as the Intelligent Backtracking Controller and the Forward Execution Controller are designed to further enhance these features at run-time. A bus-connected multiprocessor system is designed to exercise the proposed OR-parallel execution model. Recent simulation results indicate that the OR-parallel execution model can be successfully used to conduct the parallel processing of most non-deterministic Prolog applications such as database systems, rule-based expert systems, natural language processing and theorem proving, etc.

  8. FTMP - A highly reliable Fault-Tolerant Multiprocessor for aircraft

    NASA Technical Reports Server (NTRS)

    Hopkins, A. L., Jr.; Smith, T. B., III; Lala, J. H.

    1978-01-01

    The FTMP (Fault-Tolerant Multiprocessor) is a complex multiprocessor computer that employs a form of redundancy related to systems considered by Mathur (1971), in which each major module can substitute for any other module of the same type. Despite the conceptual simplicity of the redundancy form, the implementation has many intricacies owing partly to the low target failure rate, and partly to the difficulty of eliminating single-fault vulnerability. An extensive analysis of the computer through the use of such modeling techniques as Markov processes and combinatorial mathematics shows that for random hard faults the computer can meet its requirements. It is also shown that the maintenance scheduled at intervals of 200 hr or more can be adequate most of the time.

  9. Analysis of a Multiprocessor Guidance Computer. Ph.D. Thesis

    NASA Technical Reports Server (NTRS)

    Maltach, E. G.

    1969-01-01

    The design of the next generation of spaceborne digital computers is described. It analyzes a possible multiprocessor computer configuration. For the analysis, a set of representative space computing tasks was abstracted from the Lunar Module Guidance Computer programs as executed during the lunar landing, from the Apollo program. This computer performs at this time about 24 concurrent functions, with iteration rates from 10 times per second to once every two seconds. These jobs were tabulated in a machine-independent form, and statistics of the overall job set were obtained. It was concluded, based on a comparison of simulation and Markov results, that the Markov process analysis is accurate in predicting overall trends and in configuration comparisons, but does not provide useful detailed information in specific situations. Using both types of analysis, it was determined that the job scheduling function is a critical one for efficiency of the multiprocessor. It is recommended that research into the area of automatic job scheduling be performed.

  10. Queueing analysis of a canonical model of real-time multiprocessors

    NASA Technical Reports Server (NTRS)

    Krishna, C. M.; Shin, K. G.

    1983-01-01

    A logical classification of multiprocessor structures from the point of view of control applications is presented. A computation of the response time distribution for a canonical model of a real time multiprocessor is presented. The multiprocessor is approximated by a blocking model. Two separate models are derived: one created from the system's point of view, and the other from the point of view of an incoming task.

  11. VME multiprocessor system for data acquisition at OSIRIS

    SciTech Connect

    Ziem, P.; Kiehne, T.; Beschorner, C.; Drescher, B.; Zahn, J.

    1987-08-01

    A VME multiprocessor system for data acquisition and data display utilizing several MC68XXX based CPUs and VMEbuses is described. The design of the VME system was stimulated by the data handling requirements of experiments using the anti-Compton spectrometer OSIRIS, i.e. data storage on optical disks and on-line accumulation of large 2-dimensional histograms (4096 x 4096 channels). Due to the general approach the VME system is easily applicable for other nuclear physics experiments.

  12. Modeling and measurement of fault-tolerant multiprocessors

    NASA Technical Reports Server (NTRS)

    Shin, K. G.; Woodbury, M. H.; Lee, Y. H.

    1985-01-01

    The workload effects on computer performance are addressed first for a highly reliable unibus multiprocessor used in real-time control. As an approach to studying these effects, a modified Stochastic Petri Net (SPN) is used to describe the synchronous operation of the multiprocessor system. From this model the vital components affecting performance can be determined. However, because of the complexity in solving the modified SPN, a simpler model, i.e., a closed priority queuing network, is constructed that represents the same critical aspects. The use of this model for a specific application requires the partitioning of the workload into job classes. It is shown that the steady state solution of the queuing model directly produces useful results. The use of this model in evaluating an existing system, the Fault Tolerant Multiprocessor (FTMP) at the NASA AIRLAB, is outlined with some experimental results. Also addressed is the technique of measuring fault latency, an important microscopic system parameter. Most related works have assumed no or a negligible fault latency and then performed approximate analyses. To eliminate this deficiency, a new methodology for indirectly measuring fault latency is presented.

  13. Fault-free performance validation of fault-tolerant multiprocessors

    NASA Technical Reports Server (NTRS)

    Czeck, Edward W.; Feather, Frank E.; Grizzaffi, Ann Marie; Segall, Zary Z.; Siewiorek, Daniel P.

    1987-01-01

    A validation methodology for testing the performance of fault-tolerant computer systems was developed and applied to the Fault-Tolerant Multiprocessor (FTMP) at NASA-Langley's AIRLAB facility. This methodology was claimed to be general enough to apply to any ultrareliable computer system. The goal of this research was to extend the validation methodology and to demonstrate the robustness of the validation methodology by its more extensive application to NASA's Fault-Tolerant Multiprocessor System (FTMP) and to the Software Implemented Fault-Tolerance (SIFT) Computer System. Furthermore, the performance of these two multiprocessors was compared by conducting similar experiments. An analysis of the results shows high level language instruction execution times for both SIFT and FTMP were consistent and predictable, with SIFT having greater throughput. At the operating system level, FTMP consumes 60% of the throughput for its real-time dispatcher and 5% on fault-handling tasks. In contrast, SIFT consumes 16% of its throughput for the dispatcher, but consumes 66% in fault-handling software overhead.

  14. Modelling parallel programs and multiprocessor architectures with AXE

    NASA Technical Reports Server (NTRS)

    Yan, Jerry C.; Fineman, Charles E.

    1991-01-01

    AXE, An Experimental Environment for Parallel Systems, was designed to model and simulate parallel systems at the process level. It provides an integrated environment for specifying computation models, multiprocessor architectures, data collection, and performance visualization. AXE is being used at NASA-Ames for developing resource management strategies, parallel problem formulation, multiprocessor architectures, and operating system issues related to the High Performance Computing and Communications Program. AXE's simple, structured user-interface enables the user to model parallel programs and machines precisely and efficiently. Its quick turn-around time keeps the user interested and productive. AXE models multicomputers. The user may easily modify various architectural parameters including the number of sites, connection topologies, and overhead for operating system activities. Parallel computations in AXE are represented as collections of autonomous computing objects known as players. Their use and behavior are described. Performance data of the multiprocessor model can be observed on a color screen. These include CPU and message routing bottlenecks, and the dynamic status of the software.

  15. Input/output system for multiprocessors

    SciTech Connect

    Bernick, D.L.; Chan, K.K.; Chan, W.M.; Dan, Y.F.; Hoang, D.M.; Hussain, Z.; Iswandhi, G.I.; Korpi, J.E.; Sanner, M.W.; Zwangerman, J.A.

    1989-04-11

    A device controller is described, comprising: a first port-input/output controller coupled to a first input/output channel bus; and a second port-input/output controller coupled to a second input/output channel bus; each of the first and second port-input/output controllers having: a first ownership latch means for granting shared ownership of the device controller to a first host processor to provide a first data path on a first I/O channel through the first port I/O controller between the first host processor and any peripheral, and at least a second ownership latch means operative independently of the first ownership latch means for granting shared ownership of the device controller to a second host processor independently of the first port input/output controller to provide a second data path on a second I/O channel through the second port I/O controller between the second host processor and any peripheral devices coupled to the device controller.

  16. Development and evaluation of a Fault-Tolerant Multiprocessor (FTMP) computer. Volume 3: FTMP test and evaluation

    NASA Technical Reports Server (NTRS)

    Lala, J. H.; Smith, T. B., III

    1983-01-01

    The experimental test and evaluation of the Fault-Tolerant Multiprocessor (FTMP) is described. Major objectives of this exercise include expanding the validation envelope, building confidence in the system, revealing any weaknesses in the architectural concepts and in their execution in hardware and software, and, in general, stressing the hardware and software. To this end, pin-level faults were injected into one LRU of the FTMP and the FTMP response was measured in terms of fault detection, isolation, and recovery times. A total of 21,055 stuck-at-0, stuck-at-1 and invert-signal faults were injected in the CPU, memory, bus interface circuits, Bus Guardian Units, and voters and error latches. Of these, 17,418 were detected. At least 80 percent of undetected faults are estimated to be on unused pins. The multiprocessor identified all detected faults correctly and recovered successfully in each case. Total recovery time for all faults averaged a little over one second. This can be reduced to half a second by including appropriate self-tests.

  17. Reconfigurable high-speed optoelectronic interconnect technology for multiprocessor computers

    NASA Astrophysics Data System (ADS)

    Cheng, Julian

    1995-06-01

    We describe a compact optoelectronic switching technology for interconnecting multiple computer processors and shared memory modules together through dynamically reconfigurable optical paths to provide simultaneous, high speed communication amongst different nodes. Each switch provides an optical link to other nodes as well as electrical access to an individual processor, and it can perform optical and optoelectronic switching to convert digital data between various electrical and optical input/output formats. This multifunctional switching technology is based on the monolithic integration of arrays of vertical-cavity surface-emitting lasers with photodetectors and heterojunction bipolar transistors. The various digital switching and routing functions, as well as optically cascaded multistage operation, have been experimentally demonstrated.

  18. Cell Multiprocessor Communication Network: Built for Speed

    SciTech Connect

    Kistler, Mike; Perrone, Michael; Petrini, Fabrizio

    2006-06-01

    The existence of major obstacles to the traditional path to processor performance improvement has led chip manufacturers to consider multi-core designs. These architectural solutions promise a variety of power/performance and area/performance benefits. But additional care must be taken to ensure that these benefits are not lost due to inadequate design of the on-chip communication network. This paper presents the design challenges of the on-chip network of the Cell Broadband Engine (Cell BE) processor, and describes in detail its architectural design and the network, communication and synchronization protocols. In the experimental evaluation, performed on an early prototype, we analyze the communication characteristics of the Cell BE processor, using a series of microbenchmarks involving various DMA traffic patterns and synchronization protocols. We find that the on-chip communication subsystem is well matched to the computational capacity of the processor. A Synergistic Processing Element (SPE) can issue an internal direct memory access (DMA) operation in less than 4 nanoseconds, and a DMA of a single cache line can be executed in less than 100 nanoseconds. SPEs can achieve the optimal bandwidth of 25.6 GB/second in point-to-point communication with surprisingly small messages of only a few KB, using batches of non-blocking DMAs. The aggregate network behavior under heavy load is also remarkably efficient, reaching almost 200 GB/second with collective patterns and optimal contention resolution under hot-spot traffic. Additionally, we demonstrate the consistency of these hardware results with identical experiments carried out using the Mambo simulator software for Cell BE.

  19. Cache directory look-up re-use as conflict check mechanism for speculative memory requests

    DOEpatents

    Ohmacht, Martin

    2013-09-10

    In a cache memory, energy and other efficiencies can be realized by saving a result of a cache directory lookup for sequential accesses to a same memory address. Where the cache is a point of coherence for speculative execution in a multiprocessor system, with directory lookups serving as the point of conflict detection, such saving becomes particularly advantageous.

  20. Fault-free validation of a fault-tolerant multiprocessor: Baseline experiments and workload implementation

    NASA Technical Reports Server (NTRS)

    Feather, F.; Siewiorek, D.; Segall, Z.

    1986-01-01

    In the future, aircraft employing active control technology must use highly reliable multiprocessors in order to achieve flight safety. Such computers must be experimentally validated before they are deployed. This project outlines a methodology for doing fault-free validation of reliable multiprocessors. The methodology begins with baseline experiments, which test a single phenomenon. As experiments progress, tools for performance testing are developed. This report presents the results of interrupt baseline experiments performed on the Fault-Tolerant Multiprocessor (FTMP) at NASA-Langley's AIRLAB. Interrupt-causing exception conditions were tested, and several were found to have unimplemented interrupt handling software while one had an unimplemented interrupt vector. A synthetic workload model for real-time multiprocessors is then developed as an application level performance analysis tool. Details of the workload implementation and calibration are presented. Both the experimental methodology and the synthetic workload model are general enough to be applicable to reliable multiprocessors besides FTMP.

  1. A reconfigurable optoelectronic interconnect technology for multi-processor networks

    SciTech Connect

    Lu, Y.C.; Cheng, J.; Zolper, J.C.; Klem, J.

    1995-05-01

    This paper describes a new optical interconnect architecture and the integrated optoelectronic circuit technology for implementing a parallel, reconfigurable, multiprocessor network. The technology consists of monolithic arrays of optoelectronic switches that integrate vertical-cavity surface-emitting lasers with three-terminal heterojunction phototransistors, which effectively combine the functions of an optical transceiver and an optical spatial routing switch. These switches have demonstrated optical switching at 200 Mb/s, and electrical-to-optical data conversion at > 500 Mb/s, with a small-signal electrical-to-optical modulation bandwidth of approximately 4 GHz.

  2. The design of a high performance dataflow processor for multiprocessor systems

    SciTech Connect

    Luc, K.Q.

    1989-01-01

    The objective of this work is to design a high performance dynamic dataflow processor for multiprocessor systems. The performance of contemporary dataflow processors is limited due to the presence of a component, called a matching unit. The function of this unit is to match instruction tokens in order to detect the executability of instructions. Since activities within the matching unit are sequential in nature and require multiple memory accesses, the unit has been identified as a major performance bottleneck in a prototype processor. The author proposes a natural way to partition the set of tokens and present a new implementation for the matching unit, called an Instance-Based Matching Unit. The new unit requires tokens to be partitioned into blocks and allows matching of these blocks of tokens to proceed concurrently. With the new matching unit, substantial throughput enhancement for the unit is reported. He then analyzes the throughputs at various stages of a conventional dataflow processor. The results thus obtained direct us to propose an optimum configuration for an effective sub-processor. The maximum throughput of this sub-processor is determined by the throughput of a queue. With the sub-processor as a building block, a high performance dataflow processor is presented which consists of multiple copies of the sub-processor. Characteristics of the processor are studied with the Livermore Fortran Kernels as inputs. The performance of this processor is high, and the performance increases with the number of Sub-Processors.

  3. Multi-processor developments in the United States for future high energy physics experiments and accelerators

    SciTech Connect

    Gaines, I.

    1988-03-01

    The use of multi-processors for analysis and high-level triggering in High Energy Physics experiments, pioneered by the early emulator systems, has reached maturity, in particular with the multiple microprocessor systems in use at Fermilab. It is widely acknowledged that such systems will fulfill the major portion of the computing needs of future large experiments. Recent developments at Fermilab's Advanced Computer Program will make such systems even more powerful, cost-effective, and easier to use than they are at present. The next generation of microprocessors, already available, will provide CPU power of about one VAX 780 equivalent/$300, while supporting most VMS FORTRAN extensions and large (>8MB) amounts of memory. Low cost high density mass storage devices (based on video tape cartridge technology) will allow parallel I/O to remove potential I/O bottlenecks in systems of over 1000 VAX-equivalent processors. New interconnection schemes and system software will allow more flexible topologies and extremely high data bandwidth, especially for on-line systems. This talk will summarize the work at the Advanced Computer Program and the rest of the US in this field. 3 refs., 4 figs.

  4. Experience with a Genetic Algorithm Implemented on a Multiprocessor Computer

    NASA Technical Reports Server (NTRS)

    Plassman, Gerald E.; Sobieszczanski-Sobieski, Jaroslaw

    2000-01-01

    Numerical experiments were conducted to find out the extent to which a Genetic Algorithm (GA) may benefit from a multiprocessor implementation, considering, on one hand, that analyses of individual designs in a population are independent of each other so that they may be executed concurrently on separate processors, and, on the other hand, that there are some operations in a GA that cannot be so distributed. The algorithm experimented with was based on a gaussian distribution rather than bit exchange in the GA reproductive mechanism, and the test case was a hub frame structure of up to 1080 design variables. The experimentation, engaging up to 128 processors, confirmed expectations of radical elapsed-time reductions compared to a conventional single-processor implementation. It also demonstrated that the time spent in the non-distributable parts of the algorithm and the attendant cross-processor communication may have a very detrimental effect on the efficient utilization of the multiprocessor machine and on the number of processors that can be used effectively in a concurrent manner. Three techniques were devised and tested to mitigate that effect, resulting in efficiency increasing to exceed 99 percent.
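
    The concurrency pattern described, independent design analyses distributed across processors with inherently serial GA operations in between, can be sketched in C with POSIX threads; the one-variable objective below is a toy stand-in for the hub-frame analysis.

      #include <math.h>
      #include <pthread.h>
      #include <stdio.h>

      #define POP 8

      static double genome[POP], fitness[POP];

      static void *evaluate(void *arg) {
          long i = (long)arg;
          fitness[i] = -fabs(genome[i] - 3.0);   /* toy objective: prefer x near 3 */
          return NULL;
      }

      int main(void) {
          pthread_t tid[POP];
          for (int i = 0; i < POP; i++) genome[i] = i;

          /* distributable part: one independent analysis per processor */
          for (long i = 0; i < POP; i++)
              pthread_create(&tid[i], NULL, evaluate, (void *)i);
          for (int i = 0; i < POP; i++)
              pthread_join(tid[i], NULL);

          /* non-distributable part: selection/reproduction (here: report best) */
          int best = 0;
          for (int i = 1; i < POP; i++)
              if (fitness[i] > fitness[best]) best = i;
          printf("best design: genome=%.1f fitness=%.2f\n",
                 genome[best], fitness[best]);
          return 0;
      }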

  5. Scalable Multiprocessor for High-Speed Computing in Space

    NASA Technical Reports Server (NTRS)

    Lux, James; Lang, Minh; Nishimoto, Kouji; Clark, Douglas; Stosic, Dorothy; Bachmann, Alex; Wilkinson, William; Steffke, Richard

    2004-01-01

    A report discusses the continuing development of a scalable multiprocessor computing system for hard real-time applications aboard a spacecraft. "Hard realtime applications" signifies applications, like real-time radar signal processing, in which the data to be processed are generated at "hundreds" of pulses per second, each pulse "requiring" millions of arithmetic operations. In these applications, the digital processors must be tightly integrated with analog instrumentation (e.g., radar equipment), and data input/output must be synchronized with analog instrumentation, controlled to within fractions of a microsecond. The scalable multiprocessor is a cluster of identical commercial-off-the-shelf generic DSP (digital-signal-processing) computers plus generic interface circuits, including analog-to-digital converters, all controlled by software. The processors are computers interconnected by high-speed serial links. Performance can be increased by adding hardware modules and correspondingly modifying the software. Work is distributed among the processors in a parallel or pipeline fashion by means of a flexible master/slave control and timing scheme. Each processor operates under its own local clock; synchronization is achieved by broadcasting master time signals to all the processors, which compute offsets between the master clock and their local clocks.

  6. A prototype functional language implementation for hierarchical- memory architectures. Revision 1

    SciTech Connect

    Wolski, R.; Feo, J.; Cann, D.

    1992-01-14

    Programming languages are the most important tool at a programmer's disposal. All other tools correct, visualize, or evaluate the product crafted by this tool. The advent of multiprocessor computer systems has greatly complicated the programmer's task and increased the need for high-level languages capable of automatically taming these architectures. In this paper, we describe a prototype implementation of Sisal for multiprocessor, hierarchical-memory systems. The implementation includes explicit compiler and runtime control that effectively exploits the different levels of memory and manages interprocess communications (IPC). We give preliminary performance results for this system on the BBN TC2000.

  7. On some parallel algorithms on a ring of processors

    NASA Astrophysics Data System (ADS)

    Sameh, A.

    1985-07-01

    In this paper we describe some linear algebra multiprocessor algorithms suitable for a ring of processors. These algorithms are organized so that they can be easily modified for general-purpose multiprocessors with shared global memories.
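
    As a flavor of what such ring algorithms look like, the Python sketch below simulates a matrix-vector product with the matrix rows partitioned across p processors and the vector segments circulating around the ring; it is a generic textbook pattern, not one of the paper's algorithms.

        # Ring matrix-vector multiply: after each step every node passes its
        # vector segment to its ring neighbor (modeled here by index arithmetic).
        import numpy as np

        def ring_matvec(A, x, p):
            n = A.shape[0]
            rows = np.array_split(np.arange(n), p)   # row block owned by each node
            cols = np.array_split(np.arange(n), p)
            segs = np.array_split(x, p)              # vector segment held by each node
            y = np.zeros(n)
            for step in range(p):
                for me in range(p):
                    src = (me + step) % p            # segment currently at node `me`
                    y[rows[me]] += A[np.ix_(rows[me], cols[src])] @ segs[src]
            return y

        A = np.random.default_rng(0).normal(size=(12, 12))
        x = np.random.default_rng(1).normal(size=12)
        assert np.allclose(ring_matvec(A, x, 4), A @ x)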

  8. A Tensor Product Formulation of Strassen's Matrix Multiplication Algorithm with Memory Reduction

    DOE PAGES

    Kumar, B.; Huang, C. -H.; Sadayappan, P.; Johnson, R. W.

    1995-01-01

    In this article, we present a program generation strategy for Strassen's matrix multiplication algorithm using a programming methodology based on tensor product formulas. In this methodology, block recursive programs such as the fast Fourier transform and Strassen's matrix multiplication algorithm are expressed as algebraic formulas involving tensor products and other matrix operations. Such formulas can be systematically translated into high-performance parallel/vector codes for various architectures. In this article, we present a nonrecursive implementation of Strassen's algorithm for shared memory vector processors such as the Cray Y-MP. A previous implementation of Strassen's algorithm synthesized from tensor product formulas required working storage of size O(7^n) for multiplying 2^n × 2^n matrices. We present a modified formulation in which the working storage requirement is reduced to O(4^n). The modified formulation exhibits sufficient parallelism for efficient implementation on a shared memory multiprocessor. Performance results on a Cray Y-MP8/64 are presented.
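
    For reference, one level of Strassen's recursion, the unit that the tensor-product formulation unrolls into a nonrecursive program, looks as follows in Python. This naive version allocates all seven products as temporaries; the paper's contribution is scheduling the formula so that working storage drops from O(7^n) to O(4^n), which this sketch does not attempt.

        # One level of Strassen's algorithm (standard seven-product form).
        import numpy as np

        def strassen_once(A, B):
            n = A.shape[0] // 2
            A11, A12, A21, A22 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
            B11, B12, B21, B22 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]
            M1 = (A11 + A22) @ (B11 + B22)
            M2 = (A21 + A22) @ B11
            M3 = A11 @ (B12 - B22)
            M4 = A22 @ (B21 - B11)
            M5 = (A11 + A12) @ B22
            M6 = (A21 - A11) @ (B11 + B12)
            M7 = (A12 - A22) @ (B21 + B22)
            C = np.empty_like(A)
            C[:n, :n] = M1 + M4 - M5 + M7
            C[:n, n:] = M3 + M5
            C[n:, :n] = M2 + M4
            C[n:, n:] = M1 - M2 + M3 + M6
            return C

        A = np.random.default_rng(0).normal(size=(8, 8))
        B = np.random.default_rng(1).normal(size=(8, 8))
        assert np.allclose(strassen_once(A, B), A @ B)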

  9. A measurement-based performability model for a multiprocessor system

    NASA Technical Reports Server (NTRS)

    Hsueh, M. C.; Iyer, Ravi K.; Trivedi, K. S.

    1987-01-01

    A measurement-based performability model based on real error data collected on a multiprocessor system is described. Model development from the raw error data to the estimation of cumulative reward is described. Both normal and failure behavior of the system are characterized. The measured data show that the holding times in key operational and failure states are not simple exponentials and that a semi-Markov process is necessary to model the system behavior. A reward function, based on the service rate and the error rate in each state, is then defined in order to estimate the performability of the system and to depict the cost of different failure types and recovery procedures.
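
    The structure of such a model is simple to sketch: an embedded transition matrix, non-exponential holding times, and a per-state reward rate. The Python sketch below uses made-up numbers and gamma-distributed holding times purely for illustration; the paper estimates these quantities from measured error data.

        # Toy semi-Markov reward model: long-run reward rate by simulation.
        import numpy as np

        P = np.array([[0.0, 0.8, 0.2],   # embedded-chain transition probabilities
                      [0.7, 0.0, 0.3],   # states: normal, degraded, failed
                      [1.0, 0.0, 0.0]])
        mean_hold = np.array([100.0, 5.0, 1.0])  # mean holding time per state
        reward_rate = np.array([1.0, 0.5, 0.0])  # service delivered per unit time

        def long_run_reward(n_jumps=100_000, seed=0):
            rng = np.random.default_rng(seed)
            s, reward, elapsed = 0, 0.0, 0.0
            for _ in range(n_jumps):
                # Gamma holding times: non-exponential, as the measurements require.
                hold = rng.gamma(2.0, mean_hold[s] / 2.0)
                reward += reward_rate[s] * hold
                elapsed += hold
                s = rng.choice(3, p=P[s])
            return reward / elapsed

        print(long_run_reward())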

  10. Fault-free performance validation of avionic multiprocessors

    NASA Technical Reports Server (NTRS)

    Czeck, Edward W.; Feather, Frank E.; Grizzaffi, Ann Marie; Segall, Zary Z.; Finelli, George B.

    1986-01-01

    This paper describes the application of a portion of a validation methodology to NASA's Fault-Tolerant Multiprocessor (FTMP) and the Software Implemented Fault Tolerance (SIFT) computer system. The methodology entails a building-block approach, starting with simple baseline experiments and building to more complex ones. Its goal is to thoroughly test and characterize the performance and behavior of ultrareliable computer systems. The application presented in this paper showed that the methodology is not machine-specific and can be used in lieu of life-testing approaches. By applying the building-block approach at the systems level, the machine complexity was broken down to manageable levels independent of system implementation.