Shared versus distributed memory multiprocessors
NASA Technical Reports Server (NTRS)
Jordan, Harry F.
1991-01-01
The question of whether multiprocessors should have shared or distributed memory has attracted a great deal of attention. Some researchers argue strongly for building distributed memory machines, while others argue just as strongly for programming shared memory multiprocessors. A great deal of research is underway on both types of parallel systems. Special emphasis is placed on systems with a very large number of processors for computation intensive tasks and considers research and implementation trends. It appears that the two types of systems will likely converge to a common form for large scale multiprocessors.
A simple modern correctness condition for a space-based high-performance multiprocessor
NASA Technical Reports Server (NTRS)
Probst, David K.; Li, Hon F.
1992-01-01
A number of U.S. national programs, including space-based detection of ballistic missile launches, envisage putting significant computing power into space. Given sufficient progress in low-power VLSI, multichip-module packaging and liquid-cooling technologies, we will see design of high-performance multiprocessors for individual satellites. In very high speed implementations, performance depends critically on tolerating large latencies in interprocessor communication; without latency tolerance, performance is limited by the vastly differing time scales in processor and data-memory modules, including interconnect times. The modern approach to tolerating remote-communication cost in scalable, shared-memory multiprocessors is to use a multithreaded architecture, and alter the semantics of shared memory slightly, at the price of forcing the programmer either to reason about program correctness in a relaxed consistency model or to agree to program in a constrained style. The literature on multiprocessor correctness conditions has become increasingly complex, and sometimes confusing, which may hinder its practical application. We propose a simple modern correctness condition for a high-performance, shared-memory multiprocessor; the correctness condition is based on a simple interface between the multiprocessor architecture and a high-performance, shared-memory multiprocessor; the correctness condition is based on a simple interface between the multiprocessor architecture and the parallel programming system.
A general model for memory interference in a multiprocessor system with memory hierarchy
NASA Technical Reports Server (NTRS)
Taha, Badie A.; Standley, Hilda M.
1989-01-01
The problem of memory interference in a multiprocessor system with a hierarchy of shared buses and memories is addressed. The behavior of the processors is represented by a sequence of memory requests with each followed by a determined amount of processing time. A statistical queuing network model for determining the extent of memory interference in multiprocessor systems with clusters of memory hierarchies is presented. The performance of the system is measured by the expected number of busy memory clusters. The results of the analytic model are compared with simulation results, and the correlation between them is found to be very high.
Cache-based error recovery for shared memory multiprocessor systems
NASA Technical Reports Server (NTRS)
Wu, Kun-Lung; Fuchs, W. Kent; Patel, Janak H.
1989-01-01
A multiprocessor cache-based checkpointing and recovery scheme for of recovering from transient processor errors in a shared-memory multiprocessor with private caches is presented. New implementation techniques that use checkpoint identifiers and recovery stacks to reduce performance degradation in processor utilization during normal execution are examined. This cache-based checkpointing technique prevents rollback propagation, provides for rapid recovery, and can be integrated into standard cache coherence protocols. An analytical model is used to estimate the relative performance of the scheme during normal execution. Extensions that take error latency into account are presented.
Multiprocessor shared-memory information exchange
DOE Office of Scientific and Technical Information (OSTI.GOV)
Santoline, L.L.; Bowers, M.D.; Crew, A.W.
1989-02-01
In distributed microprocessor-based instrumentation and control systems, the inter-and intra-subsystem communication requirements ultimately form the basis for the overall system architecture. This paper describes a software protocol which addresses the intra-subsystem communications problem. Specifically the protocol allows for multiple processors to exchange information via a shared-memory interface. The authors primary goal is to provide a reliable means for information to be exchanged between central application processor boards (masters) and dedicated function processor boards (slaves) in a single computer chassis. The resultant Multiprocessor Shared-Memory Information Exchange (MSMIE) protocol, a standard master-slave shared-memory interface suitable for use in nuclear safety systems, ismore » designed to pass unidirectional buffers of information between the processors while providing a minimum, deterministic cycle time for this data exchange.« less
C-MOS array design techniques: SUMC multiprocessor system study
NASA Technical Reports Server (NTRS)
Clapp, W. A.; Helbig, W. A.; Merriam, A. S.
1972-01-01
The current capabilities of LSI techniques for speed and reliability, plus the possibilities of assembling large configurations of LSI logic and storage elements, have demanded the study of multiprocessors and multiprocessing techniques, problems, and potentialities. Evaluated are three previous systems studies for a space ultrareliable modular computer multiprocessing system, and a new multiprocessing system is proposed that is flexibly configured with up to four central processors, four 1/0 processors, and 16 main memory units, plus auxiliary memory and peripheral devices. This multiprocessor system features a multilevel interrupt, qualified S/360 compatibility for ground-based generation of programs, virtual memory management of a storage hierarchy through 1/0 processors, and multiport access to multiple and shared memory units.
Scheduling for Locality in Shared-Memory Multiprocessors
1993-05-01
Submitted in Partial Fulfillment of the Requirements for the Degree ’)iIC Q(JALfryT INSPECTED 5 DOCTOR OF PHILOSOPHY I Accesion For Supervised by NTIS CRAM... architecture on parallel program performance, explain the implications of this trend on popular parallel programming models, and propose system software to 0...decomoosition and scheduling algorithms. I. SUIUECT TERMS IS. NUMBER OF PAGES shared-memory multiprocessors; architecture trends; loop 110 scheduling
Compiler-directed cache management in multiprocessors
NASA Technical Reports Server (NTRS)
Cheong, Hoichi; Veidenbaum, Alexander V.
1990-01-01
The necessity of finding alternatives to hardware-based cache coherence strategies for large-scale multiprocessor systems is discussed. Three different software-based strategies sharing the same goals and general approach are presented. They consist of a simple invalidation approach, a fast selective invalidation scheme, and a version control scheme. The strategies are suitable for shared-memory multiprocessor systems with interconnection networks and a large number of processors. Results of trace-driven simulations conducted on numerical benchmark routines to compare the performance of the three schemes are presented.
Low latency memory access and synchronization
DOE Office of Scientific and Technical Information (OSTI.GOV)
Blumrich, Matthias A.; Chen, Dong; Coteus, Paul W.
A low latency memory system access is provided in association with a weakly-ordered multiprocessor system. Each processor in the multiprocessor shares resources, and each shared resource has an associated lock within a locking device that provides support for synchronization between the multiple processors in the multiprocessor and the orderly sharing of the resources. A processor only has permission to access a resource when it owns the lock associated with that resource, and an attempt by a processor to own a lock requires only a single load operation, rather than a traditional atomic load followed by store, such that the processormore » only performs a read operation and the hardware locking device performs a subsequent write operation rather than the processor. A simple prefetching for non-contiguous data structures is also disclosed. A memory line is redefined so that in addition to the normal physical memory data, every line includes a pointer that is large enough to point to any other line in the memory, wherein the pointers to determine which memory line to prefetch rather than some other predictive algorithm. This enables hardware to effectively prefetch memory access patterns that are non-contiguous, but repetitive.« less
Low latency memory access and synchronization
DOE Office of Scientific and Technical Information (OSTI.GOV)
Blumrich, Matthias A.; Chen, Dong; Coteus, Paul W.
A low latency memory system access is provided in association with a weakly-ordered multiprocessor system. Bach processor in the multiprocessor shares resources, and each shared resource has an associated lock within a locking device that provides support for synchronization between the multiple processors in the multiprocessor and the orderly sharing of the resources. A processor only has permission to access a resource when it owns the lock associated with that resource, and an attempt by a processor to own a lock requires only a single load operation, rather than a traditional atomic load followed by store, such that the processormore » only performs a read operation and the hardware locking device performs a subsequent write operation rather than the processor. A simple prefetching for non-contiguous data structures is also disclosed. A memory line is redefined so that in addition to the normal physical memory data, every line includes a pointer that is large enough to point to any other line in the memory, wherein the pointers to determine which memory line to prefetch rather than some other predictive algorithm. This enables hardware to effectively prefetch memory access patterns that are non-contiguous, but repetitive.« less
The performance of disk arrays in shared-memory database machines
NASA Technical Reports Server (NTRS)
Katz, Randy H.; Hong, Wei
1993-01-01
In this paper, we examine how disk arrays and shared memory multiprocessors lead to an effective method for constructing database machines for general-purpose complex query processing. We show that disk arrays can lead to cost-effective storage systems if they are configured from suitably small formfactor disk drives. We introduce the storage system metric data temperature as a way to evaluate how well a disk configuration can sustain its workload, and we show that disk arrays can sustain the same data temperature as a more expensive mirrored-disk configuration. We use the metric to evaluate the performance of disk arrays in XPRS, an operational shared-memory multiprocessor database system being developed at the University of California, Berkeley.
Method for prefetching non-contiguous data structures
Blumrich, Matthias A [Ridgefield, CT; Chen, Dong [Croton On Hudson, NY; Coteus, Paul W [Yorktown Heights, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Philip [Cortlandt Manor, NY; Hoenicke, Dirk [Ossining, NY; Ohmacht, Martin [Brewster, NY; Steinmacher-Burow, Burkhard D [Mount Kisco, NY; Takken, Todd E [Mount Kisco, NY; Vranas, Pavlos M [Bedford Hills, NY
2009-05-05
A low latency memory system access is provided in association with a weakly-ordered multiprocessor system. Each processor in the multiprocessor shares resources, and each shared resource has an associated lock within a locking device that provides support for synchronization between the multiple processors in the multiprocessor and the orderly sharing of the resources. A processor only has permission to access a resource when it owns the lock associated with that resource, and an attempt by a processor to own a lock requires only a single load operation, rather than a traditional atomic load followed by store, such that the processor only performs a read operation and the hardware locking device performs a subsequent write operation rather than the processor. A simple perfecting for non-contiguous data structures is also disclosed. A memory line is redefined so that in addition to the normal physical memory data, every line includes a pointer that is large enough to point to any other line in the memory, wherein the pointers to determine which memory line to prefect rather than some other predictive algorithm. This enables hardware to effectively prefect memory access patterns that are non-contiguous, but repetitive.
A cache-aided multiprocessor rollback recovery scheme
NASA Technical Reports Server (NTRS)
Wu, Kun-Lung; Fuchs, W. Kent
1989-01-01
This paper demonstrates how previous uniprocessor cache-aided recovery schemes can be applied to multiprocessor architectures, for recovering from transient processor failures, utilizing private caches and a global shared memory. As with cache-aided uniprocessor recovery, the multiprocessor cache-aided recovery scheme of this paper can be easily integrated into standard bus-based snoopy cache coherence protocols. A consistent shared memory state is maintained without the necessity of global check-pointing.
A multiarchitecture parallel-processing development environment
NASA Technical Reports Server (NTRS)
Townsend, Scott; Blech, Richard; Cole, Gary
1993-01-01
A description is given of the hardware and software of a multiprocessor test bed - the second generation Hypercluster system. The Hypercluster architecture consists of a standard hypercube distributed-memory topology, with multiprocessor shared-memory nodes. By using standard, off-the-shelf hardware, the system can be upgraded to use rapidly improving computer technology. The Hypercluster's multiarchitecture nature makes it suitable for researching parallel algorithms in computational field simulation applications (e.g., computational fluid dynamics). The dedicated test-bed environment of the Hypercluster and its custom-built software allows experiments with various parallel-processing concepts such as message passing algorithms, debugging tools, and computational 'steering'. Such research would be difficult, if not impossible, to achieve on shared, commercial systems.
The force on the flex: Global parallelism and portability
NASA Technical Reports Server (NTRS)
Jordan, H. F.
1986-01-01
A parallel programming methodology, called the force, supports the construction of programs to be executed in parallel by an unspecified, but potentially large, number of processes. The methodology was originally developed on a pipelined, shared memory multiprocessor, the Denelcor HEP, and embodies the primitive operations of the force in a set of macros which expand into multiprocessor Fortran code. A small set of primitives is sufficient to write large parallel programs, and the system has been used to produce 10,000 line programs in computational fluid dynamics. The level of complexity of the force primitives is intermediate. It is high enough to mask detailed architectural differences between multiprocessors but low enough to give the user control over performance. The system is being ported to a medium scale multiprocessor, the Flex/32, which is a 20 processor system with a mixture of shared and local memory. Memory organization and the type of processor synchronization supported by the hardware on the two machines lead to some differences in efficient implementations of the force primitives, but the user interface remains the same. An initial implementation was done by retargeting the macros to Flexible Computer Corporation's ConCurrent C language. Subsequently, the macros were caused to directly produce the system calls which form the basis for ConCurrent C. The implementation of the Fortran based system is in step with Flexible Computer Corporations's implementation of a Fortran system in the parallel environment.
Experimental evaluation of multiprocessor cache-based error recovery
NASA Technical Reports Server (NTRS)
Janssens, Bob; Fuchs, W. K.
1991-01-01
Several variations of cache-based checkpointing for rollback error recovery in shared-memory multiprocessors have been recently developed. By modifying the cache replacement policy, these techniques use the inherent redundancy in the memory hierarchy to periodically checkpoint the computation state. Three schemes, different in the manner in which they avoid rollback propagation, are evaluated. By simulation with address traces from parallel applications running on an Encore Multimax shared-memory multiprocessor, the performance effect of integrating the recovery schemes in the cache coherence protocol are evaluated. The results indicate that the cache-based schemes can provide checkpointing capability with low performance overhead but uncontrollable high variability in the checkpoint interval.
Address tracing for parallel machines
NASA Technical Reports Server (NTRS)
Stunkel, Craig B.; Janssens, Bob; Fuchs, W. Kent
1991-01-01
Recently implemented parallel system address-tracing methods based on several metrics are surveyed. The issues specific to collection of traces for both shared and distributed memory parallel computers are highlighted. Five general categories of address-trace collection methods are examined: hardware-captured, interrupt-based, simulation-based, altered microcode-based, and instrumented program-based traces. The problems unique to shared memory and distributed memory multiprocessors are examined separately.
NASA Technical Reports Server (NTRS)
Smith, T. B., Jr.; Lala, J. H.
1983-01-01
The basic organization of the fault tolerant multiprocessor, (FTMP) is that of a general purpose homogeneous multiprocessor. Three processors operate on a shared system (memory and I/O) bus. Replication and tight synchronization of all elements and hardware voting is employed to detect and correct any single fault. Reconfiguration is then employed to repair a fault. Multiple faults may be tolerated as a sequence of single faults with repair between fault occurrences.
Expert Systems on Multiprocessor Architectures. Volume 2. Technical Reports
1991-06-01
Report RC 12936 (#58037). IBM T. J. Wartson Reiearch Center. July 1987. Alan Jay Smith. Cache memories. Coniputing Sitrry., 1.1(3): I.3-5:30...basic-shared is an instrument for ashared memory design. The components panels are processor- qload-scrolling-bar-panel, memory-qload-scrolling-bar-panel
Parallelization of NAS Benchmarks for Shared Memory Multiprocessors
NASA Technical Reports Server (NTRS)
Waheed, Abdul; Yan, Jerry C.; Saini, Subhash (Technical Monitor)
1998-01-01
This paper presents our experiences of parallelizing the sequential implementation of NAS benchmarks using compiler directives on SGI Origin2000 distributed shared memory (DSM) system. Porting existing applications to new high performance parallel and distributed computing platforms is a challenging task. Ideally, a user develops a sequential version of the application, leaving the task of porting to new generations of high performance computing systems to parallelization tools and compilers. Due to the simplicity of programming shared-memory multiprocessors, compiler developers have provided various facilities to allow the users to exploit parallelism. Native compilers on SGI Origin2000 support multiprocessing directives to allow users to exploit loop-level parallelism in their programs. Additionally, supporting tools can accomplish this process automatically and present the results of parallelization to the users. We experimented with these compiler directives and supporting tools by parallelizing sequential implementation of NAS benchmarks. Results reported in this paper indicate that with minimal effort, the performance gain is comparable with the hand-parallelized, carefully optimized, message-passing implementations of the same benchmarks.
MPF: A portable message passing facility for shared memory multiprocessors
NASA Technical Reports Server (NTRS)
Malony, Allen D.; Reed, Daniel A.; Mcguire, Patrick J.
1987-01-01
The design, implementation, and performance evaluation of a message passing facility (MPF) for shared memory multiprocessors are presented. The MPF is based on a message passing model conceptually similar to conversations. Participants (parallel processors) can enter or leave a conversation at any time. The message passing primitives for this model are implemented as a portable library of C function calls. The MPF is currently operational on a Sequent Balance 21000, and several parallel applications were developed and tested. Several simple benchmark programs are presented to establish interprocess communication performance for common patterns of interprocess communication. Finally, performance figures are presented for two parallel applications, linear systems solution, and iterative solution of partial differential equations.
The FORCE - A highly portable parallel programming language
NASA Technical Reports Server (NTRS)
Jordan, Harry F.; Benten, Muhammad S.; Alaghband, Gita; Jakob, Ruediger
1989-01-01
This paper explains why the FORCE parallel programming language is easily portable among six different shared-memory multiprocessors, and how a two-level macro preprocessor makes it possible to hide low-level machine dependencies and to build machine-independent high-level constructs on top of them. These FORCE constructs make it possible to write portable parallel programs largely independent of the number of processes and the specific shared-memory multiprocessor executing them.
Solutions and debugging for data consistency in multiprocessors with noncoherent caches
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bernstein, D.; Mendelson, B.; Breternitz, M. Jr.
1995-02-01
We analyze two important problems that arise in shared-memory multiprocessor systems. The stale data problem involves ensuring that data items in local memory of individual processors are current, independent of writes done by other processors. False sharing occurs when two processors have copies of the same shared data block but update different portions of the block. The false sharing problem involves guaranteeing that subsequent writes are properly combined. In modern architectures these problems are usually solved in hardware, by exploiting mechanisms for hardware controlled cache consistency. This leads to more expensive and nonscalable designs. Therefore, we are concentrating on softwaremore » methods for ensuring cache consistency that would allow for affordable and scalable multiprocessing systems. Unfortunately, providing software control is nontrivial, both for the compiler writer and for the application programmer. For this reason we are developing a debugging environment that will facilitate the development of compiler-based techniques and will help the programmer to tune his or her application using explicit cache management mechanisms. We extend the notion of a race condition for IBM Shared Memory System POWER/4, taking into consideration its noncoherent caches, and propose techniques for detection of false sharing problems. Identification of the stale data problem is discussed as well, and solutions are suggested.« less
An Adaptive Insertion and Promotion Policy for Partitioned Shared Caches
NASA Astrophysics Data System (ADS)
Mahrom, Norfadila; Liebelt, Michael; Raof, Rafikha Aliana A.; Daud, Shuhaizar; Hafizah Ghazali, Nur
2018-03-01
Cache replacement policies in chip multiprocessors (CMP) have been investigated extensively and proven able to enhance shared cache management. However, competition among multiple processors executing different threads that require simultaneous access to a shared memory may cause cache contention and memory coherence problems on the chip. These issues also exist due to some drawbacks of the commonly used Least Recently Used (LRU) policy employed in multiprocessor systems, which are because of the cache lines residing in the cache longer than required. In image processing analysis of for example extra pulmonary tuberculosis (TB), an accurate diagnosis for tissue specimen is required. Therefore, a fast and reliable shared memory management system to execute algorithms for processing vast amount of specimen image is needed. In this paper, the effects of the cache replacement policy in a partitioned shared cache are investigated. The goal is to quantify whether better performance can be achieved by using less complex replacement strategies. This paper proposes a Middle Insertion 2 Positions Promotion (MI2PP) policy to eliminate cache misses that could adversely affect the access patterns and the throughput of the processors in the system. The policy employs a static predefined insertion point, near distance promotion, and the concept of ownership in the eviction policy to effectively improve cache thrashing and to avoid resource stealing among the processors.
Shared Memory Parallelization of an Implicit ADI-type CFD Code
NASA Technical Reports Server (NTRS)
Hauser, Th.; Huang, P. G.
1999-01-01
A parallelization study designed for ADI-type algorithms is presented using the OpenMP specification for shared-memory multiprocessor programming. Details of optimizations specifically addressed to cache-based computer architectures are described and performance measurements for the single and multiprocessor implementation are summarized. The paper demonstrates that optimization of memory access on a cache-based computer architecture controls the performance of the computational algorithm. A hybrid MPI/OpenMP approach is proposed for clusters of shared memory machines to further enhance the parallel performance. The method is applied to develop a new LES/DNS code, named LESTool. A preliminary DNS calculation of a fully developed channel flow at a Reynolds number of 180, Re(sub tau) = 180, has shown good agreement with existing data.
Reader set encoding for directory of shared cache memory in multiprocessor system
Ahn, Dnaiel; Ceze, Luis H.; Gara, Alan; Ohmacht, Martin; Xiaotong, Zhuang
2014-06-10
In a parallel processing system with speculative execution, conflict checking occurs in a directory lookup of a cache memory that is shared by all processors. In each case, the same physical memory address will map to the same set of that cache, no matter which processor originated that access. The directory includes a dynamic reader set encoding, indicating what speculative threads have read a particular line. This reader set encoding is used in conflict checking. A bitset encoding is used to specify particular threads that have read the line.
Shared Versus Distributed Memory Multiprocessors
1991-01-01
multiprocessors should hawe shared or dis.trimuted meieo-% ha~ trr ~ g ’’~ de~i c4~accio;, S Cm teaicners argue S trongly tor Outiding (li15 tri huted...Applications, MIT Press (1985). 161 D. Gajski et el., "Cedar," Proc. Compcon, pp. 306-309 (Spring 19S9). 171 S. Ahuja, N. Carriero and D. Gelernter, "Linda
A robot arm simulation with a shared memory multiprocessor machine
NASA Technical Reports Server (NTRS)
Kim, Sung-Soo; Chuang, Li-Ping
1989-01-01
A parallel processing scheme for a single chain robot arm is presented for high speed computation on a shared memory multiprocessor. A recursive formulation that is derived from a virtual work form of the d'Alembert equations of motion is utilized for robot arm dynamics. A joint drive system that consists of a motor rotor and gears is included in the arm dynamics model, in order to take into account gyroscopic effects due to the spinning of the rotor. The fine grain parallelism of mechanical and control subsystem models is exploited, based on independent computation associated with bodies, joint drive systems, and controllers. Efficiency and effectiveness of the parallel scheme are demonstrated through simulations of a telerobotic manipulator arm. Two different mechanical subsystem models, i.e., with and without gyroscopic effects, are compared, to show the trade-off between efficiency and accuracy.
Scheduler-Conscious Synchronization.
1994-12-01
SPONSORING I MONITORING Office of Naval Research ARPA AGENCY REPORT NUMBER Information Systems 3701 N. Fairfax Drive TR 550 Arlington VA 22217 Arlington VA...Broughton. A New Approach to Exclusive Data Access in Shared Memory Multiprocessors. Technical Report UCRL -97663, Lawrence Livermore National Laboratory
Parallel processing for scientific computations
NASA Technical Reports Server (NTRS)
Alkhatib, Hasan S.
1995-01-01
The scope of this project dealt with the investigation of the requirements to support distributed computing of scientific computations over a cluster of cooperative workstations. Various experiments on computations for the solution of simultaneous linear equations were performed in the early phase of the project to gain experience in the general nature and requirements of scientific applications. A specification of a distributed integrated computing environment, DICE, based on a distributed shared memory communication paradigm has been developed and evaluated. The distributed shared memory model facilitates porting existing parallel algorithms that have been designed for shared memory multiprocessor systems to the new environment. The potential of this new environment is to provide supercomputing capability through the utilization of the aggregate power of workstations cooperating in a cluster interconnected via a local area network. Workstations, generally, do not have the computing power to tackle complex scientific applications, making them primarily useful for visualization, data reduction, and filtering as far as complex scientific applications are concerned. There is a tremendous amount of computing power that is left unused in a network of workstations. Very often a workstation is simply sitting idle on a desk. A set of tools can be developed to take advantage of this potential computing power to create a platform suitable for large scientific computations. The integration of several workstations into a logical cluster of distributed, cooperative, computing stations presents an alternative to shared memory multiprocessor systems. In this project we designed and evaluated such a system.
Parallelising a molecular dynamics algorithm on a multi-processor workstation
NASA Astrophysics Data System (ADS)
Müller-Plathe, Florian
1990-12-01
The Verlet neighbour-list algorithm is parallelised for a multi-processor Hewlett-Packard/Apollo DN10000 workstation. The implementation makes use of memory shared between the processors. It is a genuine master-slave approach by which most of the computational tasks are kept in the master process and the slaves are only called to do part of the nonbonded forces calculation. The implementation features elements of both fine-grain and coarse-grain parallelism. Apart from three calls to library routines, two of which are standard UNIX calls, and two machine-specific language extensions, the whole code is written in standard Fortran 77. Hence, it may be expected that this parallelisation concept can be transfered in parts or as a whole to other multi-processor shared-memory computers. The parallel code is routinely used in production work.
1988-02-29
by memory copyin g will degrade system performance on shared-memory multiprocessors. Virtual memor y (VM) remapping, as opposed to memory copying...Bershad, G.D. Giuseppe Facchetti, Kevin Fall, G . Scott Graham, Ellen Nelson , P. Venkat Rangan, Bruno Sartirana, Shin-Yuan Tzou, Raj Vaswani, and Robert...Remote Execution in NEST", IEEE Trans. on Software Eng. 13, 8 (August 1987), 905-912. 3. G . T. Almes, A. P. Black, E. Lazowska and J. Noe, "The Eden
Preliminary basic performance analysis of the Cedar multiprocessor memory system
NASA Technical Reports Server (NTRS)
Gallivan, K.; Jalby, W.; Turner, S.; Veidenbaum, A.; Wijshoff, H.
1991-01-01
Some preliminary basic results on the performance of the Cedar multiprocessor memory system are presented. Empirical results are presented and used to calibrate a memory system simulator which is then used to discuss the scalability of the system.
Shared performance monitor in a multiprocessor system
Chiu, George; Gara, Alan G; Salapura, Valentina
2014-12-02
A performance monitoring unit (PMU) and method for monitoring performance of events occurring in a multiprocessor system. The multiprocessor system comprises a plurality of processor devices units, each processor device for generating signals representing occurrences of events in the processor device, and, a single shared counter resource for performance monitoring. The performance monitor unit is shared by all processor cores in the multiprocessor system. The PMU is further programmed to monitor event signals issued from non-processor devices.
Conditional load and store in a shared memory
Blumrich, Matthias A; Ohmacht, Martin
2015-02-03
A method, system and computer program product for implementing load-reserve and store-conditional instructions in a multi-processor computing system. The computing system includes a multitude of processor units and a shared memory cache, and each of the processor units has access to the memory cache. In one embodiment, the method comprises providing the memory cache with a series of reservation registers, and storing in these registers addresses reserved in the memory cache for the processor units as a result of issuing load-reserve requests. In this embodiment, when one of the processor units makes a request to store data in the memory cache using a store-conditional request, the reservation registers are checked to determine if an address in the memory cache is reserved for that processor unit. If an address in the memory cache is reserved for that processor, the data are stored at this address.
Implementation and performance of parallel Prolog interpreter
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wei, S.; Kale, L.V.; Balkrishna, R.
1988-01-01
In this paper, the authors discuss the implementation of a parallel Prolog interpreter on different parallel machines. The implementation is based on the REDUCE--OR process model which exploits both AND and OR parallelism in logic programs. It is machine independent as it runs on top of the chare-kernel--a machine-independent parallel programming system. The authors also give the performance of the interpreter running a diverse set of benchmark pargrams on parallel machines including shared memory systems: an Alliant FX/8, Sequent and a MultiMax, and a non-shared memory systems: Intel iPSC/32 hypercube, in addition to its performance on a multiprocessor simulation system.
NASA Astrophysics Data System (ADS)
Li, Hao; Xie, Lunguo
2013-03-01
The design of cache system for Chip Multiprocessor (CMP) face many challenges because future CMPs will have more cores and greater on-chip cache capacity. There are two base design schemes about L2 cache: private scheme in which each L2 slice is treated as a private L2 cache and shared scheme in which all L2 slices are treated as a large L2 cache shared by all cores. Private caches provide the lowest hit latency but reduce the total effective cache capacity. A shared L2 cache increases the effective cache capacity but has long hit latencies when data is on a remote tile. This paper present a new Controlled Replication (CR) policy to reduce the capacities occupied by redundant shared replicas. the new CR policy increases the effective capacity than victim replication scheme and has lower hit latency than shared scheme. We evaluate the various schemes using full-system simulation of parallel applications. Results show that CR reduces the average memory access latency of shared scheme by an average of 13%, providing better overall performance than victim replication and shared schemes.
NASA Technical Reports Server (NTRS)
Waheed, Abdul; Yan, Jerry
1998-01-01
This paper presents a model to evaluate the performance and overhead of parallelizing sequential code using compiler directives for multiprocessing on distributed shared memory (DSM) systems. With increasing popularity of shared address space architectures, it is essential to understand their performance impact on programs that benefit from shared memory multiprocessing. We present a simple model to characterize the performance of programs that are parallelized using compiler directives for shared memory multiprocessing. We parallelized the sequential implementation of NAS benchmarks using native Fortran77 compiler directives for an Origin2000, which is a DSM system based on a cache-coherent Non Uniform Memory Access (ccNUMA) architecture. We report measurement based performance of these parallelized benchmarks from four perspectives: efficacy of parallelization process; scalability; parallelization overhead; and comparison with hand-parallelized and -optimized version of the same benchmarks. Our results indicate that sequential programs can conveniently be parallelized for DSM systems using compiler directives but realizing performance gains as predicted by the performance model depends primarily on minimizing architecture-specific data locality overhead.
Integrating Software Modules For Robot Control
NASA Technical Reports Server (NTRS)
Volpe, Richard A.; Khosla, Pradeep; Stewart, David B.
1993-01-01
Reconfigurable, sensor-based control system uses state variables in systematic integration of reusable control modules. Designed for open-architecture hardware including many general-purpose microprocessors, each having own local memory plus access to global shared memory. Implemented in software as extension of Chimera II real-time operating system. Provides transparent computing mechanism for intertask communication between control modules and generic process-module architecture for multiprocessor realtime computation. Used to control robot arm. Proves useful in variety of other control and robotic applications.
Estimating Performance of Single Bus, Shared Memory Multiprocessors
1987-05-01
Chandy78] K.M. Chandy, C.M. Sauer, "Approximate methods for analyzing queuing network models of computing systems," Computing Surveys, vol10 , no 3...Denning78] P. Denning, J. Buzen, "The operational analysis of queueing network models", Computing Sur- veys, vol10 , no 3, September 1978, pp 225-261
Efficient ICCG on a shared memory multiprocessor
NASA Technical Reports Server (NTRS)
Hammond, Steven W.; Schreiber, Robert
1989-01-01
Different approaches are discussed for exploiting parallelism in the ICCG (Incomplete Cholesky Conjugate Gradient) method for solving large sparse symmetric positive definite systems of equations on a shared memory parallel computer. Techniques for efficiently solving triangular systems and computing sparse matrix-vector products are explored. Three methods for scheduling the tasks in solving triangular systems are implemented on the Sequent Balance 21000. Sample problems that are representative of a large class of problems solved using iterative methods are used. We show that a static analysis to determine data dependences in the triangular solve can greatly improve its parallel efficiency. We also show that ignoring symmetry and storing the whole matrix can reduce solution time substantially.
Multi-processor including data flow accelerator module
Davidson, George S.; Pierce, Paul E.
1990-01-01
An accelerator module for a data flow computer includes an intelligent memory. The module is added to a multiprocessor arrangement and uses a shared tagged memory architecture in the data flow computer. The intelligent memory module assigns locations for holding data values in correspondence with arcs leading to a node in a data dependency graph. Each primitive computation is associated with a corresponding memory cell, including a number of slots for operands needed to execute a primitive computation, a primitive identifying pointer, and linking slots for distributing the result of the cell computation to other cells requiring that result as an operand. Circuitry is provided for utilizing tag bits to determine automatically when all operands required by a processor are available and for scheduling the primitive for execution in a queue. Each memory cell of the module may be associated with any of the primitives, and the particular primitive to be executed by the processor associated with the cell is identified by providing an index, such as the cell number for the primitive, to the primitive lookup table of starting addresses. The module thus serves to perform functions previously performed by a number of sections of data flow architectures and coexists with conventional shared memory therein. A multiprocessing system including the module operates in a hybrid mode, wherein the same processing modules are used to perform some processing in a sequential mode, under immediate control of an operating system, while performing other processing in a data flow mode.
Dynamic programming on a shared-memory multiprocessor
NASA Technical Reports Server (NTRS)
Edmonds, Phil; Chu, Eleanor; George, Alan
1993-01-01
Three new algorithms for solving dynamic programming problems on a shared-memory parallel computer are described. All three algorithms attempt to balance work load, while keeping synchronization cost low. In particular, for a multiprocessor having p processors, an analysis of the best algorithm shows that the arithmetic cost is O(n-cubed/6p) and that the synchronization cost is O(absolute value of log sub C n) if p much less than n, where C = (2p-1)/(2p + 1) and n is the size of the problem. The low synchronization cost is important for machines where synchronization is expensive. Analysis and experiments show that the best algorithm is effective in balancing the work load and producing high efficiency.
Shared performance monitor in a multiprocessor system
Chiu, George; Gara, Alan G.; Salapura, Valentina
2012-07-24
A performance monitoring unit (PMU) and method for monitoring performance of events occurring in a multiprocessor system. The multiprocessor system comprises a plurality of processor devices units, each processor device for generating signals representing occurrences of events in the processor device, and, a single shared counter resource for performance monitoring. The performance monitor unit is shared by all processor cores in the multiprocessor system. The PMU comprises: a plurality of performance counters each for counting signals representing occurrences of events from one or more the plurality of processor units in the multiprocessor system; and, a plurality of input devices for receiving the event signals from one or more processor devices of the plurality of processor units, the plurality of input devices programmable to select event signals for receipt by one or more of the plurality of performance counters for counting, wherein the PMU is shared between multiple processing units, or within a group of processors in the multiprocessing system. The PMU is further programmed to monitor event signals issued from non-processor devices.
Multiprocessor architectural study
NASA Technical Reports Server (NTRS)
Kosmala, A. L.; Stanten, S. F.; Vandever, W. H.
1972-01-01
An architectural design study was made of a multiprocessor computing system intended to meet functional and performance specifications appropriate to a manned space station application. Intermetrics, previous experience, and accumulated knowledge of the multiprocessor field is used to generate a baseline philosophy for the design of a future SUMC* multiprocessor. Interrupts are defined and the crucial questions of interrupt structure, such as processor selection and response time, are discussed. Memory hierarchy and performance is discussed extensively with particular attention to the design approach which utilizes a cache memory associated with each processor. The ability of an individual processor to approach its theoretical maximum performance is then analyzed in terms of a hit ratio. Memory management is envisioned as a virtual memory system implemented either through segmentation or paging. Addressing is discussed in terms of various register design adopted by current computers and those of advanced design.
PANDA: A distributed multiprocessor operating system
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chubb, P.
1989-01-01
PANDA is a design for a distributed multiprocessor and an operating system. PANDA is designed to allow easy expansion of both hardware and software. As such, the PANDA kernel provides only message passing and memory and process management. The other features needed for the system (device drivers, secondary storage management, etc.) are provided as replaceable user tasks. The thesis presents PANDA's design and implementation, both hardware and software. PANDA uses multiple 68010 processors sharing memory on a VME bus, each such node potentially connected to others via a high speed network. The machine is completely homogeneous: there are no differencesmore » between processors that are detectable by programs running on the machine. A single two-processor node has been constructed. Each processor contains memory management circuits designed to allow processors to share page tables safely. PANDA presents a programmers' model similar to the hardware model: a job is divided into multiple tasks, each having its own address space. Within each task, multiple processes share code and data. Tasks can send messages to each other, and set up virtual circuits between themselves. Peripheral devices such as disc drives are represented within PANDA by tasks. PANDA divides secondary storage into volumes, each volume being accessed by a volume access task, or VAT. All knowledge about the way that data is stored on a disc is kept in its volume's VAT. The design is such that PANDA should provide a useful testbed for file systems and device drivers, as these can be installed without recompiling PANDA itself, and without rebooting the machine.« less
A Low-Cost and Energy-Efficient Multiprocessor System-on-Chip for UWB MAC Layer
NASA Astrophysics Data System (ADS)
Xiao, Hao; Isshiki, Tsuyoshi; Khan, Arif Ullah; Li, Dongju; Kunieda, Hiroaki; Nakase, Yuko; Kimura, Sadahiro
Ultra-wideband (UWB) technology has attracted much attention recently due to its high data rate and low emission power. Its media access control (MAC) protocol, WiMedia MAC, promises a lot of facilities for high-speed and high-quality wireless communication. However, these benefits in turn involve a large amount of computational load, which challenges the traditional uniprocessor architecture based implementation method to provide the required performance. However, the constrained cost and power budget, on the other hand, makes using commercial multiprocessor solutions unrealistic. In this paper, a low-cost and energy-efficient multiprocessor system-on-chip (MPSoC), which tackles at once the aspects of system design, software migration and hardware architecture, is presented for the implementation of UWB MAC layer. Experimental results show that the proposed MPSoC, based on four simple RISC processors and shared-memory infrastructure, achieves up to 45% performance improvement and 65% power saving, but takes 15% less area than the uniprocessor implementation.
Cache directory lookup reader set encoding for partial cache line speculation support
Gara, Alan; Ohmacht, Martin
2014-10-21
In a multiprocessor system, with conflict checking implemented in a directory lookup of a shared cache memory, a reader set encoding permits dynamic recordation of read accesses. The reader set encoding includes an indication of a portion of a line read, for instance by indicating boundaries of read accesses. Different encodings may apply to different types of speculative execution.
HyperForest: A high performance multi-processor architecture for real-time intelligent systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Garcia, P. Jr.; Rebeil, J.P.; Pollard, H.
1997-04-01
Intelligent Systems are characterized by the intensive use of computer power. The computer revolution of the last few years is what has made possible the development of the first generation of Intelligent Systems. Software for second generation Intelligent Systems will be more complex and will require more powerful computing engines in order to meet real-time constraints imposed by new robots, sensors, and applications. A multiprocessor architecture was developed that merges the advantages of message-passing and shared-memory structures: expendability and real-time compliance. The HyperForest architecture will provide an expandable real-time computing platform for computationally intensive Intelligent Systems and open the doorsmore » for the application of these systems to more complex tasks in environmental restoration and cleanup projects, flexible manufacturing systems, and DOE`s own production and disassembly activities.« less
Proceedings of the second SISAL users` conference
DOE Office of Scientific and Technical Information (OSTI.GOV)
Feo, J T; Frerking, C; Miller, P J
1992-12-01
This report contains papers on the following topics: A sisal code for computing the fourier transform on S{sub N}; five ways to fill your knapsack; simulating material dislocation motion in sisal; candis as an interface for sisal; parallelisation and performance of the burg algorithm on a shared-memory multiprocessor; use of genetic algorithm in sisal to solve the file design problem; implementing FFT`s in sisal; programming and evaluating the performance of signal processing applications in the sisal programming environment; sisal and Von Neumann-based languages: translation and intercommunication; an IF2 code generator for ADAM architecture; program partitioning for NUMA multiprocessor computer systems;more » mapping functional parallelism on distributed memory machines; implicit array copying: prevention is better than cure ; mathematical syntax for sisal; an approach for optimizing recursive functions; implementing arrays in sisal 2.0; Fol: an object oriented extension to the sisal language; twine: a portable, extensible sisal execution kernel; and investigating the memory performance of the optimizing sisal compiler.« less
Avoiding and tolerating latency in large-scale next-generation shared-memory multiprocessors
NASA Technical Reports Server (NTRS)
Probst, David K.
1993-01-01
A scalable solution to the memory-latency problem is necessary to prevent the large latencies of synchronization and memory operations inherent in large-scale shared-memory multiprocessors from reducing high performance. We distinguish latency avoidance and latency tolerance. Latency is avoided when data is brought to nearby locales for future reference. Latency is tolerated when references are overlapped with other computation. Latency-avoiding locales include: processor registers, data caches used temporally, and nearby memory modules. Tolerating communication latency requires parallelism, allowing the overlap of communication and computation. Latency-tolerating techniques include: vector pipelining, data caches used spatially, prefetching in various forms, and multithreading in various forms. Relaxing the consistency model permits increased use of avoidance and tolerance techniques. Each model is a mapping from the program text to sets of partial orders on program operations; it is a convention about which temporal precedences among program operations are necessary. Information about temporal locality and parallelism constrains the use of avoidance and tolerance techniques. Suitable architectural primitives and compiler technology are required to exploit the increased freedom to reorder and overlap operations in relaxed models.
Ohmacht, Martin
2017-08-15
In a multiprocessor system, a central memory synchronization module coordinates memory synchronization requests responsive to memory access requests in flight, a generation counter, and a reclaim pointer. The central module communicates via point-to-point communication. The module includes a global OR reduce tree for each memory access requesting device, for detecting memory access requests in flight. An interface unit is implemented associated with each processor requesting synchronization. The interface unit includes multiple generation completion detectors. The generation count and reclaim pointer do not pass one another.
Ohmacht, Martin
2014-09-09
In a multiprocessor system, a central memory synchronization module coordinates memory synchronization requests responsive to memory access requests in flight, a generation counter, and a reclaim pointer. The central module communicates via point-to-point communication. The module includes a global OR reduce tree for each memory access requesting device, for detecting memory access requests in flight. An interface unit is implemented associated with each processor requesting synchronization. The interface unit includes multiple generation completion detectors. The generation count and reclaim pointer do not pass one another.
Importance of balanced architectures in the design of high-performance imaging systems
NASA Astrophysics Data System (ADS)
Sgro, Joseph A.; Stanton, Paul C.
1999-03-01
Imaging systems employed in demanding military and industrial applications, such as automatic target recognition and computer vision, typically require real-time high-performance computing resources. While high- performances computing systems have traditionally relied on proprietary architectures and custom components, recent advances in high performance general-purpose microprocessor technology have produced an abundance of low cost components suitable for use in high-performance computing systems. A common pitfall in the design of high performance imaging system, particularly systems employing scalable multiprocessor architectures, is the failure to balance computational and memory bandwidth. The performance of standard cluster designs, for example, in which several processors share a common memory bus, is typically constrained by memory bandwidth. The symptom characteristic of this problem is failure to the performance of the system to scale as more processors are added. The problem becomes exacerbated if I/O and memory functions share the same bus. The recent introduction of microprocessors with large internal caches and high performance external memory interfaces makes it practical to design high performance imaging system with balanced computational and memory bandwidth. Real word examples of such designs will be presented, along with a discussion of adapting algorithm design to best utilize available memory bandwidth.
Applications considerations in the system design of highly concurrent multiprocessors
NASA Technical Reports Server (NTRS)
Lundstrom, Stephen F.
1987-01-01
A flow model processor approach to parallel processing is described, using very-high-performance individual processors, high-speed circuit switched interconnection networks, and a high-speed synchronization capability to minimize the effect of the inherently serial portions of applications on performance. Design studies related to the determination of the number of processors, the memory organization, and the structure of the networks used to interconnect the processor and memory resources are discussed. Simulations indicate that applications centered on the large shared data memory should be able to sustain over 500 million floating point operations per second.
A fault-tolerant information processing concept for space vehicles.
NASA Technical Reports Server (NTRS)
Hopkins, A. L., Jr.
1971-01-01
A distributed fault-tolerant information processing system is proposed, comprising a central multiprocessor, dedicated local processors, and multiplexed input-output buses connecting them together. The processors in the multiprocessor are duplicated for error detection, which is felt to be less expensive than using coded redundancy of comparable effectiveness. Error recovery is made possible by a triplicated scratchpad memory in each processor. The main multiprocessor memory uses replicated memory for error detection and correction. Local processors use any of three conventional redundancy techniques: voting, duplex pairs with backup, and duplex pairs in independent subsystems.
On the impact of communication complexity in the design of parallel numerical algorithms
NASA Technical Reports Server (NTRS)
Gannon, D.; Vanrosendale, J.
1984-01-01
This paper describes two models of the cost of data movement in parallel numerical algorithms. One model is a generalization of an approach due to Hockney, and is suitable for shared memory multiprocessors where each processor has vector capabilities. The other model is applicable to highly parallel nonshared memory MIMD systems. In the second model, algorithm performance is characterized in terms of the communication network design. Techniques used in VLSI complexity theory are also brought in, and algorithm independent upper bounds on system performance are derived for several problems that are important to scientific computation.
On the impact of communication complexity on the design of parallel numerical algorithms
NASA Technical Reports Server (NTRS)
Gannon, D. B.; Van Rosendale, J.
1984-01-01
This paper describes two models of the cost of data movement in parallel numerical alorithms. One model is a generalization of an approach due to Hockney, and is suitable for shared memory multiprocessors where each processor has vector capabilities. The other model is applicable to highly parallel nonshared memory MIMD systems. In this second model, algorithm performance is characterized in terms of the communication network design. Techniques used in VLSI complexity theory are also brought in, and algorithm-independent upper bounds on system performance are derived for several problems that are important to scientific computation.
A class Hierarchical, object-oriented approach to virtual memory management
NASA Technical Reports Server (NTRS)
Russo, Vincent F.; Campbell, Roy H.; Johnston, Gary M.
1989-01-01
The Choices family of operating systems exploits class hierarchies and object-oriented programming to facilitate the construction of customized operating systems for shared memory and networked multiprocessors. The software is being used in the Tapestry laboratory to study the performance of algorithms, mechanisms, and policies for parallel systems. Described here are the architectural design and class hierarchy of the Choices virtual memory management system. The software and hardware mechanisms and policies of a virtual memory system implement a memory hierarchy that exploits the trade-off between response times and storage capacities. In Choices, the notion of a memory hierarchy is captured by abstract classes. Concrete subclasses of those abstractions implement a virtual address space, segmentation, paging, physical memory management, secondary storage, and remote (that is, networked) storage. Captured in the notion of a memory hierarchy are classes that represent memory objects. These classes provide a storage mechanism that contains encapsulated data and have methods to read or write the memory object. Each of these classes provides specializations to represent the memory hierarchy.
The FORCE: A highly portable parallel programming language
NASA Technical Reports Server (NTRS)
Jordan, Harry F.; Benten, Muhammad S.; Alaghband, Gita; Jakob, Ruediger
1989-01-01
Here, it is explained why the FORCE parallel programming language is easily portable among six different shared-memory microprocessors, and how a two-level macro preprocessor makes it possible to hide low level machine dependencies and to build machine-independent high level constructs on top of them. These FORCE constructs make it possible to write portable parallel programs largely independent of the number of processes and the specific shared memory multiprocessor executing them.
Error recovery in shared memory multiprocessors using private caches
NASA Technical Reports Server (NTRS)
Wu, Kun-Lung; Fuchs, W. Kent; Patel, Janak H.
1990-01-01
The problem of recovering from processor transient faults in shared memory multiprocesses systems is examined. A user-transparent checkpointing and recovery scheme using private caches is presented. Processes can recover from errors due to faulty processors by restarting from the checkpointed computation state. Implementation techniques using checkpoint identifiers and recovery stacks are examined as a means of reducing performance degradation in processor utilization during normal execution. This cache-based checkpointing technique prevents rollback propagation, provides rapid recovery, and can be integrated into standard cache coherence protocols. An analytical model is used to estimate the relative performance of the scheme during normal execution. Extensions to take error latency into account are presented.
Parallel Navier-Stokes computations on shared and distributed memory architectures
NASA Technical Reports Server (NTRS)
Hayder, M. Ehtesham; Jayasimha, D. N.; Pillay, Sasi Kumar
1995-01-01
We study a high order finite difference scheme to solve the time accurate flow field of a jet using the compressible Navier-Stokes equations. As part of our ongoing efforts, we have implemented our numerical model on three parallel computing platforms to study the computational, communication, and scalability characteristics. The platforms chosen for this study are a cluster of workstations connected through fast networks (the LACE experimental testbed at NASA Lewis), a shared memory multiprocessor (the Cray YMP), and a distributed memory multiprocessor (the IBM SPI). Our focus in this study is on the LACE testbed. We present some results for the Cray YMP and the IBM SP1 mainly for comparison purposes. On the LACE testbed, we study: (1) the communication characteristics of Ethernet, FDDI, and the ALLNODE networks and (2) the overheads induced by the PVM message passing library used for parallelizing the application. We demonstrate that clustering of workstations is effective and has the potential to be computationally competitive with supercomputers at a fraction of the cost.
NASA Technical Reports Server (NTRS)
Quealy, Angela; Cole, Gary L.; Blech, Richard A.
1993-01-01
The Application Portable Parallel Library (APPL) is a subroutine-based library of communication primitives that is callable from applications written in FORTRAN or C. APPL provides a consistent programmer interface to a variety of distributed and shared-memory multiprocessor MIMD machines. The objective of APPL is to minimize the effort required to move parallel applications from one machine to another, or to a network of homogeneous machines. APPL encompasses many of the message-passing primitives that are currently available on commercial multiprocessor systems. This paper describes APPL (version 2.3.1) and its usage, reports the status of the APPL project, and indicates possible directions for the future. Several applications using APPL are discussed, as well as performance and overhead results.
Cache as point of coherence in multiprocessor system
Blumrich, Matthias A.; Ceze, Luis H.; Chen, Dong; Gara, Alan; Heidelberger, Phlip; Ohmacht, Martin; Steinmacher-Burow, Burkhard; Zhuang, Xiaotong
2016-11-29
In a multiprocessor system, a conflict checking mechanism is implemented in the L2 cache memory. Different versions of speculative writes are maintained in different ways of the cache. A record of speculative writes is maintained in the cache directory. Conflict checking occurs as part of directory lookup. Speculative versions that do not conflict are aggregated into an aggregated version in a different way of the cache. Speculative memory access requests do not go to main memory.
Blumrich, Matthias A.; Salapura, Valentina
2010-11-02
An apparatus and method are disclosed for single-stepping coherence events in a multiprocessor system under software control in order to monitor the behavior of a memory coherence mechanism. Single-stepping coherence events in a multiprocessor system is made possible by adding one or more step registers. By accessing these step registers, one or more coherence requests are processed by the multiprocessor system. The step registers determine if the snoop unit will operate by proceeding in a normal execution mode, or operate in a single-step mode.
Programmable partitioning for high-performance coherence domains in a multiprocessor system
Blumrich, Matthias A [Ridgefield, CT; Salapura, Valentina [Chappaqua, NY
2011-01-25
A multiprocessor computing system and a method of logically partitioning a multiprocessor computing system are disclosed. The multiprocessor computing system comprises a multitude of processing units, and a multitude of snoop units. Each of the processing units includes a local cache, and the snoop units are provided for supporting cache coherency in the multiprocessor system. Each of the snoop units is connected to a respective one of the processing units and to all of the other snoop units. The multiprocessor computing system further includes a partitioning system for using the snoop units to partition the multitude of processing units into a plurality of independent, memory-consistent, adjustable-size processing groups. Preferably, when the processor units are partitioned into these processing groups, the partitioning system also configures the snoop units to maintain cache coherency within each of said groups.
NASA Technical Reports Server (NTRS)
Bradley, D. B.; Irwin, J. D.
1974-01-01
A computer simulation model for a multiprocessor computer is developed that is useful for studying the problem of matching multiprocessor's memory space, memory bandwidth and numbers and speeds of processors with aggregate job set characteristics. The model assumes an input work load of a set of recurrent jobs. The model includes a feedback scheduler/allocator which attempts to improve system performance through higher memory bandwidth utilization by matching individual job requirements for space and bandwidth with space availability and estimates of bandwidth availability at the times of memory allocation. The simulation model includes provisions for specifying precedence relations among the jobs in a job set, and provisions for specifying precedence execution of TMR (Triple Modular Redundant and SIMPLEX (non redundant) jobs.
Improvement of multiprocessing performance by using optical centralized shared bus
NASA Astrophysics Data System (ADS)
Han, Xuliang; Chen, Ray T.
2004-06-01
With the ever-increasing need to solve larger and more complex problems, multiprocessing is attracting more and more research efforts. One of the challenges facing the multiprocessor designers is to fulfill in an effective manner the communications among the processes running in parallel on multiple multiprocessors. The conventional electrical backplane bus provides narrow bandwidth as restricted by the physical limitations of electrical interconnects. In the electrical domain, in order to operate at high frequency, the backplane topology has been changed from the simple shared bus to the complicated switched medium. However, the switched medium is an indirect network. It cannot support multicast/broadcast as effectively as the shared bus. Besides the additional latency of going through the intermediate switching nodes, signal routing introduces substantial delay and considerable system complexity. Alternatively, optics has been well known for its interconnect capability. Therefore, it has become imperative to investigate how to improve multiprocessing performance by utilizing optical interconnects. From the implementation standpoint, the existing optical technologies still cannot fulfill the intelligent functions that a switch fabric should provide as effectively as their electronic counterparts. Thus, an innovative optical technology that can provide sufficient bandwidth capacity, while at the same time, retaining the essential merits of the shared bus topology, is highly desirable for the multiprocessing performance improvement. In this paper, the optical centralized shared bus is proposed for use in the multiprocessing systems. This novel optical interconnect architecture not only utilizes the beneficial characteristics of optics, but also retains the desirable properties of the shared bus topology. Meanwhile, from the architecture standpoint, it fits well in the centralized shared-memory multiprocessing scheme. Therefore, a smooth migration with substantial multiprocessing performance improvement is expected. To prove the technical feasibility from the architecture standpoint, a conceptual emulation of the centralized shared-memory multiprocessing scheme is demonstrated on a generic PCI subsystem with an optical centralized shared bus.
Local concurrent error detection and correction in data structures using virtual backpointers
NASA Technical Reports Server (NTRS)
Li, C. C.; Chen, P. P.; Fuchs, W. K.
1987-01-01
A new technique, based on virtual backpointers, for local concurrent error detection and correction in linked data structures is presented. Two new data structures, the Virtual Double Linked List, and the B-tree with Virtual Backpointers, are described. For these structures, double errors can be detected in 0(1) time and errors detected during forward moves can be corrected in 0(1) time. The application of a concurrent auditor process to data structure error detection and correction is analyzed, and an implementation is described, to determine the effect on mean time to failure of a multi-user shared database system. The implementation utilizes a Sequent shared memory multiprocessor system operating on a shared databased of Virtual Double Linked Lists.
Local concurrent error detection and correction in data structures using virtual backpointers
NASA Technical Reports Server (NTRS)
Li, Chung-Chi Jim; Chen, Paul Peichuan; Fuchs, W. Kent
1989-01-01
A new technique, based on virtual backpointers, for local concurrent error detection and correction in linked data strutures is presented. Two new data structures, the Virtual Double Linked List, and the B-tree with Virtual Backpointers, are described. For these structures, double errors can be detected in 0(1) time and errors detected during forward moves can be corrected in 0(1) time. The application of a concurrent auditor process to data structure error detection and correction is analyzed, and an implementation is described, to determine the effect on mean time to failure of a multi-user shared database system. The implementation utilizes a Sequent shared memory multiprocessor system operating on a shared database of Virtual Double Linked Lists.
Software-Controlled Caches in the VMP Multiprocessor
1986-03-01
programming system level that Processors is tuned for the VMP design. In this vein, we are interested in exploring how far the software support can go to ...handled in software, analogously to the handling agement of the shared program state is familiar and of virtual memory page faults. Hardware support for...ensure good behavior, as opposed to how Each cache miss results in bus traffic. Table 2 pro- vides the bus cost for the "average" cache miss. Fig
An Evaluation of Architectural Platforms for Parallel Navier-Stokes Computations
NASA Technical Reports Server (NTRS)
Jayasimha, D. N.; Hayder, M. E.; Pillay, S. K.
1996-01-01
We study the computational, communication, and scalability characteristics of a computational fluid dynamics application, which solves the time accurate flow field of a jet using the compressible Navier-Stokes equations, on a variety of parallel architecture platforms. The platforms chosen for this study are a cluster of workstations (the LACE experimental testbed at NASA Lewis), a shared memory multiprocessor (the Cray YMP), and distributed memory multiprocessors with different topologies - the IBM SP and the Cray T3D. We investigate the impact of various networks connecting the cluster of workstations on the performance of the application and the overheads induced by popular message passing libraries used for parallelization. The work also highlights the importance of matching the memory bandwidth to the processor speed for good single processor performance. By studying the performance of an application on a variety of architectures, we are able to point out the strengths and weaknesses of each of the example computing platforms.
Parallelizing Navier-Stokes Computations on a Variety of Architectural Platforms
NASA Technical Reports Server (NTRS)
Jayasimha, D. N.; Hayder, M. E.; Pillay, S. K.
1997-01-01
We study the computational, communication, and scalability characteristics of a Computational Fluid Dynamics application, which solves the time accurate flow field of a jet using the compressible Navier-Stokes equations, on a variety of parallel architectural platforms. The platforms chosen for this study are a cluster of workstations (the LACE experimental testbed at NASA Lewis), a shared memory multiprocessor (the Cray YMP), distributed memory multiprocessors with different topologies-the IBM SP and the Cray T3D. We investigate the impact of various networks, connecting the cluster of workstations, on the performance of the application and the overheads induced by popular message passing libraries used for parallelization. The work also highlights the importance of matching the memory bandwidth to the processor speed for good single processor performance. By studying the performance of an application on a variety of architectures, we are able to point out the strengths and weaknesses of each of the example computing platforms.
Parallel computing for probabilistic fatigue analysis
NASA Technical Reports Server (NTRS)
Sues, Robert H.; Lua, Yuan J.; Smith, Mark D.
1993-01-01
This paper presents the results of Phase I research to investigate the most effective parallel processing software strategies and hardware configurations for probabilistic structural analysis. We investigate the efficiency of both shared and distributed-memory architectures via a probabilistic fatigue life analysis problem. We also present a parallel programming approach, the virtual shared-memory paradigm, that is applicable across both types of hardware. Using this approach, problems can be solved on a variety of parallel configurations, including networks of single or multiprocessor workstations. We conclude that it is possible to effectively parallelize probabilistic fatigue analysis codes; however, special strategies will be needed to achieve large-scale parallelism to keep large number of processors busy and to treat problems with the large memory requirements encountered in practice. We also conclude that distributed-memory architecture is preferable to shared-memory for achieving large scale parallelism; however, in the future, the currently emerging hybrid-memory architectures will likely be optimal.
Contention Modeling for Multithreaded Distributed Shared Memory Machines: The Cray XMT
DOE Office of Scientific and Technical Information (OSTI.GOV)
Secchi, Simone; Tumeo, Antonino; Villa, Oreste
Distributed Shared Memory (DSM) machines are a wide class of multi-processor computing systems where a large virtually-shared address space is mapped on a network of physically distributed memories. High memory latency and network contention are two of the main factors that limit performance scaling of such architectures. Modern high-performance computing DSM systems have evolved toward exploitation of massive hardware multi-threading and fine-grained memory hashing to tolerate irregular latencies, avoid network hot-spots and enable high scaling. In order to model the performance of such large-scale machines, parallel simulation has been proved to be a promising approach to achieve good accuracy inmore » reasonable times. One of the most critical factors in solving the simulation speed-accuracy trade-off is network modeling. The Cray XMT is a massively multi-threaded supercomputing architecture that belongs to the DSM class, since it implements a globally-shared address space abstraction on top of a physically distributed memory substrate. In this paper, we discuss the development of a contention-aware network model intended to be integrated in a full-system XMT simulator. We start by measuring the effects of network contention in a 128-processor XMT machine and then investigate the trade-off that exists between simulation accuracy and speed, by comparing three network models which operate at different levels of accuracy. The comparison and model validation is performed by executing a string-matching algorithm on the full-system simulator and on the XMT, using three datasets that generate noticeably different contention patterns.« less
NASA Technical Reports Server (NTRS)
Fatoohi, Rod; Saini, Subbash; Ciotti, Robert
2006-01-01
We study the performance of inter-process communication on four high-speed multiprocessor systems using a set of communication benchmarks. The goal is to identify certain limiting factors and bottlenecks with the interconnect of these systems as well as to compare these interconnects. We measured network bandwidth using different number of communicating processors and communication patterns, such as point-to-point communication, collective communication, and dense communication patterns. The four platforms are: a 512-processor SGI Altix 3700 BX2 shared-memory machine with 3.2 GB/s links; a 64-processor (single-streaming) Cray XI shared-memory machine with 32 1.6 GB/s links; a 128-processor Cray Opteron cluster using a Myrinet network; and a 1280-node Dell PowerEdge cluster with an InfiniBand network. Our, results show the impact of the network bandwidth and topology on the overall performance of each interconnect.
High speed quantitative digital microscopy
NASA Technical Reports Server (NTRS)
Castleman, K. R.; Price, K. H.; Eskenazi, R.; Ovadya, M. M.; Navon, M. A.
1984-01-01
Modern digital image processing hardware makes possible quantitative analysis of microscope images at high speed. This paper describes an application to automatic screening for cervical cancer. The system uses twelve MC6809 microprocessors arranged in a pipeline multiprocessor configuration. Each processor executes one part of the algorithm on each cell image as it passes through the pipeline. Each processor communicates with its upstream and downstream neighbors via shared two-port memory. Thus no time is devoted to input-output operations as such. This configuration is expected to be at least ten times faster than previous systems.
Resource Management for Distributed Parallel Systems
NASA Technical Reports Server (NTRS)
Neuman, B. Clifford; Rao, Santosh
1993-01-01
Multiprocessor systems should exist in the the larger context of distributed systems, allowing multiprocessor resources to be shared by those that need them. Unfortunately, typical multiprocessor resource management techniques do not scale to large networks. The Prospero Resource Manager (PRM) is a scalable resource allocation system that supports the allocation of processing resources in large networks and multiprocessor systems. To manage resources in such distributed parallel systems, PRM employs three types of managers: system managers, job managers, and node managers. There exist multiple independent instances of each type of manager, reducing bottlenecks. The complexity of each manager is further reduced because each is designed to utilize information at an appropriate level of abstraction.
Programming parallel architectures: The BLAZE family of languages
NASA Technical Reports Server (NTRS)
Mehrotra, Piyush
1988-01-01
Programming multiprocessor architectures is a critical research issue. An overview is given of the various approaches to programming these architectures that are currently being explored. It is argued that two of these approaches, interactive programming environments and functional parallel languages, are particularly attractive since they remove much of the burden of exploiting parallel architectures from the user. Also described is recent work by the author in the design of parallel languages. Research on languages for both shared and nonshared memory multiprocessors is described, as well as the relations of this work to other current language research projects.
Parallel Gaussian elimination of a block tridiagonal matrix using multiple microcomputers
NASA Technical Reports Server (NTRS)
Blech, Richard A.
1989-01-01
The solution of a block tridiagonal matrix using parallel processing is demonstrated. The multiprocessor system on which results were obtained and the software environment used to program that system are described. Theoretical partitioning and resource allocation for the Gaussian elimination method used to solve the matrix are discussed. The results obtained from running 1, 2 and 3 processor versions of the block tridiagonal solver are presented. The PASCAL source code for these solvers is given in the appendix, and may be transportable to other shared memory parallel processors provided that the synchronization outlines are reproduced on the target system.
Simulation Analysis of Data Sharing in Shared Memory Multiprocessors
1989-02-24
LIMITATION OF ABSTRACT Same as Report (SAR) 18. NUMBER OF PAGES 178 19a. NAME OF RESPONSIBLE PERSON a. REPORT unclassified b . ABSTRACT unclassified...work. Andrea Casotto (CELL), Steve McGrogan (SPICE), Srinivas Devadas (TOPOP1) and Hi-Keung Tony Ma (VERIFY) donated the parallel programs and a con...Effect of Block Size on B us Utilization 120 5-14 Ratio of Sharing Bus Cyc les to Total Bus Cycles 120 5-15 Oassification of Bus Cyc les for
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chin, George; Marquez, Andres; Choudhury, Sutanay
2012-09-01
Triadic analysis encompasses a useful set of graph mining methods that is centered on the concept of a triad, which is a subgraph of three nodes and the configuration of directed edges across the nodes. Such methods are often applied in the social sciences as well as many other diverse fields. Triadic methods commonly operate on a triad census that counts the number of triads of every possible edge configuration in a graph. Like other graph algorithms, triadic census algorithms do not scale well when graphs reach tens of millions to billions of nodes. To enable the triadic analysis ofmore » large-scale graphs, we developed and optimized a triad census algorithm to efficiently execute on shared memory architectures. We will retrace the development and evolution of a parallel triad census algorithm. Over the course of several versions, we continually adapted the code’s data structures and program logic to expose more opportunities to exploit parallelism on shared memory that would translate into improved computational performance. We will recall the critical steps and modifications that occurred during code development and optimization. Furthermore, we will compare the performances of triad census algorithm versions on three specific systems: Cray XMT, HP Superdome, and AMD multi-core NUMA machine. These three systems have shared memory architectures but with markedly different hardware capabilities to manage parallelism.« less
Data traffic reduction schemes for Cholesky factorization on asynchronous multiprocessor systems
NASA Technical Reports Server (NTRS)
Naik, Vijay K.; Patrick, Merrell L.
1989-01-01
Communication requirements of Cholesky factorization of dense and sparse symmetric, positive definite matrices are analyzed. The communication requirement is characterized by the data traffic generated on multiprocessor systems with local and shared memory. Lower bound proofs are given to show that when the load is uniformly distributed the data traffic associated with factoring an n x n dense matrix using n to the alpha power (alpha less than or equal 2) processors is omega(n to the 2 + alpha/2 power). For n x n sparse matrices representing a square root of n x square root of n regular grid graph the data traffic is shown to be omega(n to the 1 + alpha/2 power), alpha less than or equal 1. Partitioning schemes that are variations of block assignment scheme are described and it is shown that the data traffic generated by these schemes are asymptotically optimal. The schemes allow efficient use of up to O(n to the 2nd power) processors in the dense case and up to O(n) processors in the sparse case before the total data traffic reaches the maximum value of O(n to the 3rd power) and O(n to the 3/2 power), respectively. It is shown that the block based partitioning schemes allow a better utilization of the data accessed from shared memory and thus reduce the data traffic than those based on column-wise wrap around assignment schemes.
Distributed parallel messaging for multiprocessor systems
Chen, Dong; Heidelberger, Philip; Salapura, Valentina; Senger, Robert M; Steinmacher-Burrow, Burhard; Sugawara, Yutaka
2013-06-04
A method and apparatus for distributed parallel messaging in a parallel computing system. The apparatus includes, at each node of a multiprocessor network, multiple injection messaging engine units and reception messaging engine units, each implementing a DMA engine and each supporting both multiple packet injection into and multiple reception from a network, in parallel. The reception side of the messaging unit (MU) includes a switch interface enabling writing of data of a packet received from the network to the memory system. The transmission side of the messaging unit, includes switch interface for reading from the memory system when injecting packets into the network.
Hybrid Memory Management for Parallel Execution of Prolog on Shared Memory Multiprocessors
1990-06-01
organizing data to increase locality. The stack structure exhibits greater locality than the heap structure. Tradeoff decisions can also be made on...PROGRAM ELEMENT NUMBER 6. AUTHOR(S) 5d. PROJECT NUMBER 5e. TASK NUMBER 5f. WORK UNIT NUMBER 7. PERFORMING ORGANIZATION NAME(S) AND ADDRESS(ES...University of California at Berkeley,Department of Electrical Engineering and Computer Sciences,Berkeley,CA,94720 8. PERFORMING ORGANIZATION REPORT
Multiprocessor architecture: Synthesis and evaluation
NASA Technical Reports Server (NTRS)
Standley, Hilda M.
1990-01-01
Multiprocessor computed architecture evaluation for structural computations is the focus of the research effort described. Results obtained are expected to lead to more efficient use of existing architectures and to suggest designs for new, application specific, architectures. The brief descriptions given outline a number of related efforts directed toward this purpose. The difficulty is analyzing an existing architecture or in designing a new computer architecture lies in the fact that the performance of a particular architecture, within the context of a given application, is determined by a number of factors. These include, but are not limited to, the efficiency of the computation algorithm, the programming language and support environment, the quality of the program written in the programming language, the multiplicity of the processing elements, the characteristics of the individual processing elements, the interconnection network connecting processors and non-local memories, and the shared memory organization covering the spectrum from no shared memory (all local memory) to one global access memory. These performance determiners may be loosely classified as being software or hardware related. This distinction is not clear or even appropriate in many cases. The effect of the choice of algorithm is ignored by assuming that the algorithm is specified as given. Effort directed toward the removal of the effect of the programming language and program resulted in the design of a high-level parallel programming language. Two characteristics of the fundamental structure of the architecture (memory organization and interconnection network) are examined.
Performances of multiprocessor multidisk architectures for continuous media storage
NASA Astrophysics Data System (ADS)
Gennart, Benoit A.; Messerli, Vincent; Hersch, Roger D.
1996-03-01
Multimedia interfaces increase the need for large image databases, capable of storing and reading streams of data with strict synchronicity and isochronicity requirements. In order to fulfill these requirements, we consider a parallel image server architecture which relies on arrays of intelligent disk nodes, each disk node being composed of one processor and one or more disks. This contribution analyzes through bottleneck performance evaluation and simulation the behavior of two multi-processor multi-disk architectures: a point-to-point architecture and a shared-bus architecture similar to current multiprocessor workstation architectures. We compare the two architectures on the basis of two multimedia algorithms: the compute-bound frame resizing by resampling and the data-bound disk-to-client stream transfer. The results suggest that the shared bus is a potential bottleneck despite its very high hardware throughput (400Mbytes/s) and that an architecture with addressable local memories located closely to their respective processors could partially remove this bottleneck. The point- to-point architecture is scalable and able to sustain high throughputs for simultaneous compute- bound and data-bound operations.
Sequoia: A fault-tolerant tightly coupled multiprocessor for transaction processing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bernstein, P.A.
1988-02-01
The Sequoia computer is a tightly coupled multiprocessor, and thus attains the performance advantages of this style of architecture. It avoids most of the fault-tolerance disadvantages of tight coupling by using a new fault-tolerance design. The Sequoia architecture is similar to other multimicroprocessor architectures, such as those of Encore and Sequent, in that it gives dozens of microprocessors shared access to a large main memory. It resembles the Stratus architecture in its extensive use of hardware fault-detection techniques. It resembles Stratus and Auragen in its ability to quickly recover all processes after a single point failure, transparently to the user.more » However, Sequoia is unique in its combination of a large-scale tightly coupled architecture with a hardware approach to fault tolerance. This article gives an overview of how the hardware architecture and operating systems (OS) work together to provide a high degree of fault tolerance with good system performance.« less
OPAD-EDIFIS Real-Time Processing
NASA Technical Reports Server (NTRS)
Katsinis, Constantine
1997-01-01
The Optical Plume Anomaly Detection (OPAD) detects engine hardware degradation of flight vehicles through identification and quantification of elemental species found in the plume by analyzing the plume emission spectra in a real-time mode. Real-time performance of OPAD relies on extensive software which must report metal amounts in the plume faster than once every 0.5 sec. OPAD software previously written by NASA scientists performed most necessary functions at speeds which were far below what is needed for real-time operation. The research presented in this report improved the execution speed of the software by optimizing the code without changing the algorithms and converting it into a parallelized form which is executed in a shared-memory multiprocessor system. The resulting code was subjected to extensive timing analysis. The report also provides suggestions for further performance improvement by (1) identifying areas of algorithm optimization, (2) recommending commercially available multiprocessor architectures and operating systems to support real-time execution and (3) presenting an initial study of fault-tolerance requirements.
Framework for analysis of guaranteed QOS systems
NASA Astrophysics Data System (ADS)
Chaudhry, Shailender; Choudhary, Alok
1997-01-01
Multimedia data is isochronous in nature and entails managing and delivering high volumes of data. Multiprocessors with their large processing power, vast memory, and fast interconnects, are an ideal candidate for the implementation of multimedia applications. Initially, multiprocessors were designed to execute scientific programs and thus their architecture was optimized to provide low message latency and efficiently support regular communication patterns. Hence, they have a regular network topology and most use wormhole routing. The design offers the benefits of a simple router, small buffer size, and network latency that is almost independent of path length. Among the various multimedia applications, video on demand (VOD) server is well-suited for implementation using parallel multiprocessors. Logical models for VOD servers are presently mapped onto multiprocessors. Our paper provides a framework for calculating bounds on utilization of system resources with which QoS parameters for each isochronous stream can be guaranteed. Effects of the architecture of multiprocessors, and efficiency of various local models and mapping on particular architectures can be investigated within our framework. Our framework is based on rigorous proofs and provides tight bounds. The results obtained may be used as the basis for admission control tests. To illustrate the versatility of our framework, we provide bounds on utilization for various logical models applied to mesh connected architectures for a video on demand server. Our results show that worm hole routing can lead to packets waiting for transmission of other packets that apparently share no common resources. This situation is analogous to head-of-the-line blocking. We find that the provision of multiple VCs per link and multiple flit buffers improves utilization (even under guaranteed QoS parameters). This analogous to parallel iterative matching.
NASA Astrophysics Data System (ADS)
Goedecker, Stefan; Boulet, Mireille; Deutsch, Thierry
2003-08-01
Three-dimensional Fast Fourier Transforms (FFTs) are the main computational task in plane wave electronic structure calculations. Obtaining a high performance on a large numbers of processors is non-trivial on the latest generation of parallel computers that consist of nodes made up of a shared memory multiprocessors. A non-dogmatic method for obtaining high performance for such 3-dim FFTs in a combined MPI/OpenMP programming paradigm will be presented. Exploiting the peculiarities of plane wave electronic structure calculations, speedups of up to 160 and speeds of up to 130 Gflops were obtained on 256 processors.
Vienna FORTRAN: A FORTRAN language extension for distributed memory multiprocessors
NASA Technical Reports Server (NTRS)
Chapman, Barbara; Mehrotra, Piyush; Zima, Hans
1991-01-01
Exploiting the performance potential of distributed memory machines requires a careful distribution of data across the processors. Vienna FORTRAN is a language extension of FORTRAN which provides the user with a wide range of facilities for such mapping of data structures. However, programs in Vienna FORTRAN are written using global data references. Thus, the user has the advantage of a shared memory programming paradigm while explicitly controlling the placement of data. The basic features of Vienna FORTRAN are presented along with a set of examples illustrating the use of these features.
Efficient partitioning and assignment on programs for multiprocessor execution
NASA Technical Reports Server (NTRS)
Standley, Hilda M.
1993-01-01
The general problem studied is that of segmenting or partitioning programs for distribution across a multiprocessor system. Efficient partitioning and the assignment of program elements are of great importance since the time consumed in this overhead activity may easily dominate the computation, effectively eliminating any gains made by the use of the parallelism. In this study, the partitioning of sequentially structured programs (written in FORTRAN) is evaluated. Heuristics, developed for similar applications are examined. Finally, a model for queueing networks with finite queues is developed which may be used to analyze multiprocessor system architectures with a shared memory approach to the problem of partitioning. The properties of sequentially written programs form obstacles to large scale (at the procedure or subroutine level) parallelization. Data dependencies of even the minutest nature, reflecting the sequential development of the program, severely limit parallelism. The design of heuristic algorithms is tied to the experience gained in the parallel splitting. Parallelism obtained through the physical separation of data has seen some success, especially at the data element level. Data parallelism on a grander scale requires models that accurately reflect the effects of blocking caused by finite queues. A model for the approximation of the performance of finite queueing networks is developed. This model makes use of the decomposition approach combined with the efficiency of product form solutions.
Evaluation of the Cedar memory system: Configuration of 16 by 16
NASA Technical Reports Server (NTRS)
Gallivan, K.; Jalby, W.; Wijshoff, H.
1991-01-01
Some basic results on the performance of the Cedar multiprocessor system are presented. Empirical results on the 16 processor 16 memory bank system configuration, which show the behavior of the Cedar system under different modes of operation are presented.
Architectures for reasoning in parallel
NASA Technical Reports Server (NTRS)
Hall, Lawrence O.
1989-01-01
The research conducted has dealt with rule-based expert systems. The algorithms that may lead to effective parallelization of them were investigated. Both the forward and backward chained control paradigms were investigated in the course of this work. The best computer architecture for the developed and investigated algorithms has been researched. Two experimental vehicles were developed to facilitate this research. They are Backpac, a parallel backward chained rule-based reasoning system and Datapac, a parallel forward chained rule-based reasoning system. Both systems have been written in Multilisp, a version of Lisp which contains the parallel construct, future. Applying the future function to a function causes the function to become a task parallel to the spawning task. Additionally, Backpac and Datapac have been run on several disparate parallel processors. The machines are an Encore Multimax with 10 processors, the Concert Multiprocessor with 64 processors, and a 32 processor BBN GP1000. Both the Concert and the GP1000 are switch-based machines. The Multimax has all its processors hung off a common bus. All are shared memory machines, but have different schemes for sharing the memory and different locales for the shared memory. The main results of the investigations come from experiments on the 10 processor Encore and the Concert with partitions of 32 or less processors. Additionally, experiments have been run with a stripped down version of EMYCIN.
Exploring the use of I/O nodes for computation in a MIMD multiprocessor
NASA Technical Reports Server (NTRS)
Kotz, David; Cai, Ting
1995-01-01
As parallel systems move into the production scientific-computing world, the emphasis will be on cost-effective solutions that provide high throughput for a mix of applications. Cost effective solutions demand that a system make effective use of all of its resources. Many MIMD multiprocessors today, however, distinguish between 'compute' and 'I/O' nodes, the latter having attached disks and being dedicated to running the file-system server. This static division of responsibilities simplifies system management but does not necessarily lead to the best performance in workloads that need a different balance of computation and I/O. Of course, computational processes sharing a node with a file-system service may receive less CPU time, network bandwidth, and memory bandwidth than they would on a computation-only node. In this paper we begin to examine this issue experimentally. We found that high performance I/O does not necessarily require substantial CPU time, leaving plenty of time for application computation. There were some complex file-system requests, however, which left little CPU time available to the application. (The impact on network and memory bandwidth still needs to be determined.) For applications (or users) that cannot tolerate an occasional interruption, we recommend that they continue to use only compute nodes. For tolerant applications needing more cycles than those provided by the compute nodes, we recommend that they take full advantage of both compute and I/O nodes for computation, and that operating systems should make this possible.
Multiprogramming performance degradation - Case study on a shared memory multiprocessor
NASA Technical Reports Server (NTRS)
Dimpsey, R. T.; Iyer, R. K.
1989-01-01
The performance degradation due to multiprogramming overhead is quantified for a parallel-processing machine. Measurements of real workloads were taken, and it was found that there is a moderate correlation between the completion time of a program and the amount of system overhead measured during program execution. Experiments in controlled environments were then conducted to calculate a lower bound on the performance degradation of parallel jobs caused by multiprogramming overhead. The results show that the multiprogramming overhead of parallel jobs consumes at least 4 percent of the processor time. When two or more serial jobs are introduced into the system, this amount increases to 5.3 percent
Neural networks and MIMD-multiprocessors
NASA Technical Reports Server (NTRS)
Vanhala, Jukka; Kaski, Kimmo
1990-01-01
Two artificial neural network models are compared. They are the Hopfield Neural Network Model and the Sparse Distributed Memory model. Distributed algorithms for both of them are designed and implemented. The run time characteristics of the algorithms are analyzed theoretically and tested in practice. The storage capacities of the networks are compared. Implementations are done using a distributed multiprocessor system.
Parallel and fault-tolerant algorithms for hypercube multiprocessors
DOE Office of Scientific and Technical Information (OSTI.GOV)
Aykanat, C.
1988-01-01
Several techniques for increasing the performance of parallel algorithms on distributed-memory message-passing multi-processor systems are investigated. These techniques are effectively implemented for the parallelization of the Scaled Conjugate Gradient (SCG) algorithm on a hypercube connected message-passing multi-processor. Significant performance improvement is achieved by using these techniques. The SCG algorithm is used for the solution phase of an FE modeling system. Almost linear speed-up is achieved, and it is shown that hypercube topology is scalable for an FE class of problem. The SCG algorithm is also shown to be suitable for vectorization, and near supercomputer performance is achieved on a vectormore » hypercube multiprocessor by exploiting both parallelization and vectorization. Fault-tolerance issues for the parallel SCG algorithm and for the hypercube topology are also addressed.« less
Event parallelism: Distributed memory parallel computing for high energy physics experiments
NASA Astrophysics Data System (ADS)
Nash, Thomas
1989-12-01
This paper describes the present and expected future development of distributed memory parallel computers for high energy physics experiments. It covers the use of event parallel microprocessor farms, particularly at Fermilab, including both ACP multiprocessors and farms of MicroVAXES. These systems have proven very cost effective in the past. A case is made for moving to the more open environment of UNIX and RISC processors. The 2nd Generation ACP Multiprocessor System, which is based on powerful RISC system, is described. Given the promise of still more extraordinary increases in processor performance, a new emphasis on point to point, rather than bussed, communication will be required. Developments in this direction are described.
Software Coherence in Multiprocessor Memory Systems. Ph.D. Thesis
NASA Technical Reports Server (NTRS)
Bolosky, William Joseph
1993-01-01
Processors are becoming faster and multiprocessor memory interconnection systems are not keeping up. Therefore, it is necessary to have threads and the memory they access as near one another as possible. Typically, this involves putting memory or caches with the processors, which gives rise to the problem of coherence: if one processor writes an address, any other processor reading that address must see the new value. This coherence can be maintained by the hardware or with software intervention. Systems of both types have been built in the past; the hardware-based systems tended to outperform the software ones. However, the ratio of processor to interconnect speed is now so high that the extra overhead of the software systems may no longer be significant. This issue is explored both by implementing a software maintained system and by introducing and using the technique of offline optimal analysis of memory reference traces. It finds that in properly built systems, software maintained coherence can perform comparably to or even better than hardware maintained coherence. The architectural features necessary for efficient software coherence to be profitable include a small page size, a fast trap mechanism, and the ability to execute instructions while remote memory references are outstanding.
NASA Astrophysics Data System (ADS)
Pleros, Nikos; Maniotis, Pavlos; Alexoudi, Theonitsa; Fitsios, Dimitris; Vagionas, Christos; Papaioannou, Sotiris; Vyrsokinos, K.; Kanellos, George T.
2014-03-01
The processor-memory performance gap, commonly referred to as "Memory Wall" problem, owes to the speed mismatch between processor and electronic RAM clock frequencies, forcing current Chip Multiprocessor (CMP) configurations to consume more than 50% of the chip real-estate for caching purposes. In this article, we present our recent work spanning from Si-based integrated optical RAM cell architectures up to complete optical cache memory architectures for Chip Multiprocessor configurations. Moreover, we discuss on e/o router subsystems with up to Tb/s routing capacity for cache interconnection purposes within CMP configurations, currently pursued within the FP7 PhoxTrot project.
Parallel Computing for Probabilistic Response Analysis of High Temperature Composites
NASA Technical Reports Server (NTRS)
Sues, R. H.; Lua, Y. J.; Smith, M. D.
1994-01-01
The objective of this Phase I research was to establish the required software and hardware strategies to achieve large scale parallelism in solving PCM problems. To meet this objective, several investigations were conducted. First, we identified the multiple levels of parallelism in PCM and the computational strategies to exploit these parallelisms. Next, several software and hardware efficiency investigations were conducted. These involved the use of three different parallel programming paradigms and solution of two example problems on both a shared-memory multiprocessor and a distributed-memory network of workstations.
Parallelization of a Monte Carlo particle transport simulation code
NASA Astrophysics Data System (ADS)
Hadjidoukas, P.; Bousis, C.; Emfietzoglou, D.
2010-05-01
We have developed a high performance version of the Monte Carlo particle transport simulation code MC4. The original application code, developed in Visual Basic for Applications (VBA) for Microsoft Excel, was first rewritten in the C programming language for improving code portability. Several pseudo-random number generators have been also integrated and studied. The new MC4 version was then parallelized for shared and distributed-memory multiprocessor systems using the Message Passing Interface. Two parallel pseudo-random number generator libraries (SPRNG and DCMT) have been seamlessly integrated. The performance speedup of parallel MC4 has been studied on a variety of parallel computing architectures including an Intel Xeon server with 4 dual-core processors, a Sun cluster consisting of 16 nodes of 2 dual-core AMD Opteron processors and a 200 dual-processor HP cluster. For large problem size, which is limited only by the physical memory of the multiprocessor server, the speedup results are almost linear on all systems. We have validated the parallel implementation against the serial VBA and C implementations using the same random number generator. Our experimental results on the transport and energy loss of electrons in a water medium show that the serial and parallel codes are equivalent in accuracy. The present improvements allow for studying of higher particle energies with the use of more accurate physical models, and improve statistics as more particles tracks can be simulated in low response time.
Data traffic reduction schemes for sparse Cholesky factorizations
NASA Technical Reports Server (NTRS)
Naik, Vijay K.; Patrick, Merrell L.
1988-01-01
Load distribution schemes are presented which minimize the total data traffic in the Cholesky factorization of dense and sparse, symmetric, positive definite matrices on multiprocessor systems with local and shared memory. The total data traffic in factoring an n x n sparse, symmetric, positive definite matrix representing an n-vertex regular 2-D grid graph using n (sup alpha), alpha is equal to or less than 1, processors are shown to be O(n(sup 1 + alpha/2)). It is O(n(sup 3/2)), when n (sup alpha), alpha is equal to or greater than 1, processors are used. Under the conditions of uniform load distribution, these results are shown to be asymptotically optimal. The schemes allow efficient use of up to O(n) processors before the total data traffic reaches the maximum value of O(n(sup 3/2)). The partitioning employed within the scheme, allows a better utilization of the data accessed from shared memory than those of previously published methods.
NASA Technical Reports Server (NTRS)
Jost, Gabriele; Labarta, Jesus; Gimenez, Judit
2004-01-01
With the current trend in parallel computer architectures towards clusters of shared memory symmetric multi-processors, parallel programming techniques have evolved that support parallelism beyond a single level. When comparing the performance of applications based on different programming paradigms, it is important to differentiate between the influence of the programming model itself and other factors, such as implementation specific behavior of the operating system (OS) or architectural issues. Rewriting-a large scientific application in order to employ a new programming paradigms is usually a time consuming and error prone task. Before embarking on such an endeavor it is important to determine that there is really a gain that would not be possible with the current implementation. A detailed performance analysis is crucial to clarify these issues. The multilevel programming paradigms considered in this study are hybrid MPI/OpenMP, MLP, and nested OpenMP. The hybrid MPI/OpenMP approach is based on using MPI [7] for the coarse grained parallelization and OpenMP [9] for fine grained loop level parallelism. The MPI programming paradigm assumes a private address space for each process. Data is transferred by explicitly exchanging messages via calls to the MPI library. This model was originally designed for distributed memory architectures but is also suitable for shared memory systems. The second paradigm under consideration is MLP which was developed by Taft. The approach is similar to MPi/OpenMP, using a mix of coarse grain process level parallelization and loop level OpenMP parallelization. As it is the case with MPI, a private address space is assumed for each process. The MLP approach was developed for ccNUMA architectures and explicitly takes advantage of the availability of shared memory. A shared memory arena which is accessible by all processes is required. Communication is done by reading from and writing to the shared memory.
High-performance computing — an overview
NASA Astrophysics Data System (ADS)
Marksteiner, Peter
1996-08-01
An overview of high-performance computing (HPC) is given. Different types of computer architectures used in HPC are discussed: vector supercomputers, high-performance RISC processors, various parallel computers like symmetric multiprocessors, workstation clusters, massively parallel processors. Software tools and programming techniques used in HPC are reviewed: vectorizing compilers, optimization and vector tuning, optimization for RISC processors; parallel programming techniques like shared-memory parallelism, message passing and data parallelism; and numerical libraries.
NASA Technical Reports Server (NTRS)
Jordan, Harry F.; Benten, Muhammad S.; Arenstorf, Norbert S.; Ramanan, Aruna V.
1987-01-01
A methodology for writing parallel programs for shared memory multiprocessors has been formalized as an extension to the Fortran language and implemented as a macro preprocessor. The extended language is known as the Force, and this manual describes how to write Force programs and execute them on the Flexible Computer Corporation Flex/32, the Encore Multimax and the Sequent Balance computers. The parallel extension macros are described in detail, but knowledge of Fortran is assumed.
Efficient Approximation Algorithms for Weighted $b$-Matching
DOE Office of Scientific and Technical Information (OSTI.GOV)
Khan, Arif; Pothen, Alex; Mostofa Ali Patwary, Md.
2016-01-01
We describe a half-approximation algorithm, b-Suitor, for computing a b-Matching of maximum weight in a graph with weights on the edges. b-Matching is a generalization of the well-known Matching problem in graphs, where the objective is to choose a subset of M edges in the graph such that at most a specified number b(v) of edges in M are incident on each vertex v. Subject to this restriction we maximize the sum of the weights of the edges in M. We prove that the b-Suitor algorithm computes the same b-Matching as the one obtained by the greedy algorithm for themore » problem. We implement the algorithm on serial and shared-memory parallel processors, and compare its performance against a collection of approximation algorithms that have been proposed for the Matching problem. Our results show that the b-Suitor algorithm outperforms the Greedy and Locally Dominant edge algorithms by one to two orders of magnitude on a serial processor. The b-Suitor algorithm has a high degree of concurrency, and it scales well up to 240 threads on a shared memory multiprocessor. The b-Suitor algorithm outperforms the Locally Dominant edge algorithm by a factor of fourteen on 16 cores of an Intel Xeon multiprocessor.« less
A mechanism for efficient debugging of parallel programs
DOE Office of Scientific and Technical Information (OSTI.GOV)
Miller, B.P.; Choi, J.D.
1988-01-01
This paper addresses the design and implementation of an integrated debugging system for parallel programs running on shared memory multi-processors (SMMP). The authors describe the use of flowback analysis to provide information on causal relationships between events in a program's execution without re-executing the program for debugging. The authors introduce a mechanism called incremental tracing that, by using semantic analyses of the debugged program, makes the flowback analysis practical with only a small amount of trace generated during execution. The extend flowback analysis to apply to parallel programs and describe a method to detect race conditions in the interactions ofmore » the co-operating processes.« less
Memory management and compiler support for rapid recovery from failures in computer systems
NASA Technical Reports Server (NTRS)
Fuchs, W. K.
1991-01-01
This paper describes recent developments in the use of memory management and compiler technology to support rapid recovery from failures in computer systems. The techniques described include cache coherence protocols for user transparent checkpointing in multiprocessor systems, compiler-based checkpoint placement, compiler-based code modification for multiple instruction retry, and forward recovery in distributed systems utilizing optimistic execution.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Feo, J.T.
1993-10-01
This report contain papers on: Programmability and performance issues; The case of an iterative partial differential equation solver; Implementing the kernal of the Australian Region Weather Prediction Model in Sisal; Even and quarter-even prime length symmetric FFTs and their Sisal Implementations; Top-down thread generation for Sisal; Overlapping communications and computations on NUMA architechtures; Compiling technique based on dataflow analysis for funtional programming language Valid; Copy elimination for true multidimensional arrays in Sisal 2.0; Increasing parallelism for an optimization that reduces copying in IF2 graphs; Caching in on Sisal; Cache performance of Sisal Vs. FORTRAN; FFT algorithms on a shared-memory multiprocessor;more » A parallel implementation of nonnumeric search problems in Sisal; Computer vision algorithms in Sisal; Compilation of Sisal for a high-performance data driven vector processor; Sisal on distributed memory machines; A virtual shared addressing system for distributed memory Sisal; Developing a high-performance FFT algorithm in Sisal for a vector supercomputer; Implementation issues for IF2 on a static data-flow architechture; and Systematic control of parallelism in array-based data-flow computation. Selected papers have been indexed separately for inclusion in the Energy Science and Technology Database.« less
Memory Benchmarks for SMP-Based High Performance Parallel Computers
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yoo, A B; de Supinski, B; Mueller, F
2001-11-20
As the speed gap between CPU and main memory continues to grow, memory accesses increasingly dominates the performance of many applications. The problem is particularly acute for symmetric multiprocessor (SMP) systems, where the shared memory may be accessed concurrently by a group of threads running on separate CPUs. Unfortunately, several key issues governing memory system performance in current systems are not well understood. Complex interactions between the levels of the memory hierarchy, buses or switches, DRAM back-ends, system software, and application access patterns can make it difficult to pinpoint bottlenecks and determine appropriate optimizations, and the situation is even moremore » complex for SMP systems. To partially address this problem, we formulated a set of multi-threaded microbenchmarks for characterizing and measuring the performance of the underlying memory system in SMP-based high-performance computers. We report our use of these microbenchmarks on two important SMP-based machines. This paper has four primary contributions. First, we introduce a microbenchmark suite to systematically assess and compare the performance of different levels in SMP memory hierarchies. Second, we present a new tool based on hardware performance monitors to determine a wide array of memory system characteristics, such as cache sizes, quickly and easily; by using this tool, memory performance studies can be targeted to the full spectrum of performance regimes with many fewer data points than is otherwise required. Third, we present experimental results indicating that the performance of applications with large memory footprints remains largely constrained by memory. Fourth, we demonstrate that thread-level parallelism further degrades memory performance, even for the latest SMPs with hardware prefetching and switch-based memory interconnects.« less
Multiprocessor system with multiple concurrent modes of execution
Ahn, Daniel; Ceze, Luis H; Chen, Dong; Gara, Alan; Heidelberger, Philip; Ohmacht, Martin
2013-12-31
A multiprocessor system supports multiple concurrent modes of speculative execution. Speculation identification numbers (IDs) are allocated to speculative threads from a pool of available numbers. The pool is divided into domains, with each domain being assigned to a mode of speculation. Modes of speculation include TM, TLS, and rollback. Allocation of the IDs is carried out with respect to a central state table and using hardware pointers. The IDs are used for writing different versions of speculative results in different ways of a set in a cache memory.
Multiprocessor system with multiple concurrent modes of execution
Ahn, Daniel; Ceze, Luis H.; Chen, Dong Chen; Gara, Alan; Heidelberger, Philip; Ohmacht, Martin
2016-11-22
A multiprocessor system supports multiple concurrent modes of speculative execution. Speculation identification numbers (IDs) are allocated to speculative threads from a pool of available numbers. The pool is divided into domains, with each domain being assigned to a mode of speculation. Modes of speculation include TM, TLS, and rollback. Allocation of the IDs is carried out with respect to a central state table and using hardware pointers. The IDs are used for writing different versions of speculative results in different ways of a set in a cache memory.
Composing Data Parallel Code for a SPARQL Graph Engine
DOE Office of Scientific and Technical Information (OSTI.GOV)
Castellana, Vito G.; Tumeo, Antonino; Villa, Oreste
Big data analytics process large amount of data to extract knowledge from them. Semantic databases are big data applications that adopt the Resource Description Framework (RDF) to structure metadata through a graph-based representation. The graph based representation provides several benefits, such as the possibility to perform in memory processing with large amounts of parallelism. SPARQL is a language used to perform queries on RDF-structured data through graph matching. In this paper we present a tool that automatically translates SPARQL queries to parallel graph crawling and graph matching operations. The tool also supports complex SPARQL constructs, which requires more than basicmore » graph matching for their implementation. The tool generates parallel code annotated with OpenMP pragmas for x86 Shared-memory Multiprocessors (SMPs). With respect to commercial database systems such as Virtuoso, our approach reduces memory occupation due to join operations and provides higher performance. We show the scaling of the automatically generated graph-matching code on a 48-core SMP.« less
Efficient Parallelization of a Dynamic Unstructured Application on the Tera MTA
NASA Technical Reports Server (NTRS)
Oliker, Leonid; Biswas, Rupak
1999-01-01
The success of parallel computing in solving real-life computationally-intensive problems relies on their efficient mapping and execution on large-scale multiprocessor architectures. Many important applications are both unstructured and dynamic in nature, making their efficient parallel implementation a daunting task. This paper presents the parallelization of a dynamic unstructured mesh adaptation algorithm using three popular programming paradigms on three leading supercomputers. We examine an MPI message-passing implementation on the Cray T3E and the SGI Origin2OOO, a shared-memory implementation using cache coherent nonuniform memory access (CC-NUMA) of the Origin2OOO, and a multi-threaded version on the newly-released Tera Multi-threaded Architecture (MTA). We compare several critical factors of this parallel code development, including runtime, scalability, programmability, and memory overhead. Our overall results demonstrate that multi-threaded systems offer tremendous potential for quickly and efficiently solving some of the most challenging real-life problems on parallel computers.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Blanford, M.
1997-12-31
Most commercially-available quasistatic finite element programs assemble element stiffnesses into a global stiffness matrix, then use a direct linear equation solver to obtain nodal displacements. However, for large problems (greater than a few hundred thousand degrees of freedom), the memory size and computation time required for this approach becomes prohibitive. Moreover, direct solution does not lend itself to the parallel processing needed for today`s multiprocessor systems. This talk gives an overview of the iterative solution strategy of JAS3D, the nonlinear large-deformation quasistatic finite element program. Because its architecture is derived from an explicit transient-dynamics code, it does not ever assemblemore » a global stiffness matrix. The author describes the approach he used to implement the solver on multiprocessor computers, and shows examples of problems run on hundreds of processors and more than a million degrees of freedom. Finally, he describes some of the work he is presently doing to address the challenges of iterative convergence for ill-conditioned problems.« less
Cache directory look-up re-use as conflict check mechanism for speculative memory requests
Ohmacht, Martin
2013-09-10
In a cache memory, energy and other efficiencies can be realized by saving a result of a cache directory lookup for sequential accesses to a same memory address. Where the cache is a point of coherence for speculative execution in a multiprocessor system, with directory lookups serving as the point of conflict detection, such saving becomes particularly advantageous.
The Effects of Block Size on the Performance of Coherent Caches in Shared-Memory Multiprocessors
1993-05-01
increase with the bandwidth and latency. For those applications with poor spatial locality, the best choice of cache line size is determined by the...observation was used in the design of two schemes: LimitLESS di- rectories and Tag caches. LimitLESS directories [15] were designed for the ALEWIFE...small packets may be used to avoid network congestion. The most important factor influencing the choice of cache line size for a multipro- cessor is the
A Tensor Product Formulation of Strassen's Matrix Multiplication Algorithm with Memory Reduction
Kumar, B.; Huang, C. -H.; Sadayappan, P.; ...
1995-01-01
In this article, we present a program generation strategy of Strassen's matrix multiplication algorithm using a programming methodology based on tensor product formulas. In this methodology, block recursive programs such as the fast Fourier Transforms and Strassen's matrix multiplication algorithm are expressed as algebraic formulas involving tensor products and other matrix operations. Such formulas can be systematically translated to high-performance parallel/vector codes for various architectures. In this article, we present a nonrecursive implementation of Strassen's algorithm for shared memory vector processors such as the Cray Y-MP. A previous implementation of Strassen's algorithm synthesized from tensor product formulas required working storagemore » of size O(7 n ) for multiplying 2 n × 2 n matrices. We present a modified formulation in which the working storage requirement is reduced to O(4 n ). The modified formulation exhibits sufficient parallelism for efficient implementation on a shared memory multiprocessor. Performance results on a Cray Y-MP8/64 are presented.« less
Principles for problem aggregation and assignment in medium scale multiprocessors
NASA Technical Reports Server (NTRS)
Nicol, David M.; Saltz, Joel H.
1987-01-01
One of the most important issues in parallel processing is the mapping of workload to processors. This paper considers a large class of problems having a high degree of potential fine grained parallelism, and execution requirements that are either not predictable, or are too costly to predict. The main issues in mapping such a problem onto medium scale multiprocessors are those of aggregation and assignment. We study a method of parameterized aggregation that makes few assumptions about the workload. The mapping of aggregate units of work onto processors is uniform, and exploits locality of workload intensity to balance the unknown workload. In general, a finer aggregate granularity leads to a better balance at the price of increased communication/synchronization costs; the aggregation parameters can be adjusted to find a reasonable granularity. The effectiveness of this scheme is demonstrated on three model problems: an adaptive one-dimensional fluid dynamics problem with message passing, a sparse triangular linear system solver on both a shared memory and a message-passing machine, and a two-dimensional time-driven battlefield simulation employing message passing. Using the model problems, the tradeoffs are studied between balanced workload and the communication/synchronization costs. Finally, an analytical model is used to explain why the method balances workload and minimizes the variance in system behavior.
A parallel algorithm for multi-level logic synthesis using the transduction method. M.S. Thesis
NASA Technical Reports Server (NTRS)
Lim, Chieng-Fai
1991-01-01
The Transduction Method has been shown to be a powerful tool in the optimization of multilevel networks. Many tools such as the SYLON synthesis system (X90), (CM89), (LM90) have been developed based on this method. A parallel implementation is presented of SYLON-XTRANS (XM89) on an eight processor Encore Multimax shared memory multiprocessor. It minimizes multilevel networks consisting of simple gates through parallel pruning, gate substitution, gate merging, generalized gate substitution, and gate input reduction. This implementation, called Parallel TRANSduction (PTRANS), also uses partitioning to break large circuits up and performs inter- and intra-partition dynamic load balancing. With this, good speedups and high processor efficiencies are achievable without sacrificing the resulting circuit quality.
Gara, Alan; Ohmacht, Martin
2014-09-16
In a multiprocessor system with at least two levels of cache, a speculative thread may run on a core processor in parallel with other threads. When the thread seeks to do a write to main memory, this access is to be written through the first level cache to the second level cache. After the write though, the corresponding line is deleted from the first level cache and/or prefetch unit, so that any further accesses to the same location in main memory have to be retrieved from the second level cache. The second level cache keeps track of multiple versions of data, where more than one speculative thread is running in parallel, while the first level cache does not have any of the versions during speculation. A switch allows choosing between modes of operation of a speculation blind first level cache.
Strategies for concurrent processing of complex algorithms in data driven architectures
NASA Technical Reports Server (NTRS)
Stoughton, John W.; Mielke, Roland R.
1988-01-01
The purpose is to document research to develop strategies for concurrent processing of complex algorithms in data driven architectures. The problem domain consists of decision-free algorithms having large-grained, computationally complex primitive operations. Such are often found in signal processing and control applications. The anticipated multiprocessor environment is a data flow architecture containing between two and twenty computing elements. Each computing element is a processor having local program memory, and which communicates with a common global data memory. A new graph theoretic model called ATAMM which establishes rules for relating a decomposed algorithm to its execution in a data flow architecture is presented. The ATAMM model is used to determine strategies to achieve optimum time performance and to develop a system diagnostic software tool. In addition, preliminary work on a new multiprocessor operating system based on the ATAMM specifications is described.
NASA Astrophysics Data System (ADS)
Leamy, Michael J.; Springer, Adam C.
In this research we report parallel implementation of a Cellular Automata-based simulation tool for computing elastodynamic response on complex, two-dimensional domains. Elastodynamic simulation using Cellular Automata (CA) has recently been presented as an alternative, inherently object-oriented technique for accurately and efficiently computing linear and nonlinear wave propagation in arbitrarily-shaped geometries. The local, autonomous nature of the method should lead to straight-forward and efficient parallelization. We address this notion on symmetric multiprocessor (SMP) hardware using a Java-based object-oriented CA code implementing triangular state machines (i.e., automata) and the MPI bindings written in Java (MPJ Express). We use MPJ Express to reconfigure our existing CA code to distribute a domain's automata to cores present on a dual quad-core shared-memory system (eight total processors). We note that this message passing parallelization strategy is directly applicable to computer clustered computing, which will be the focus of follow-on research. Results on the shared memory platform indicate nearly-ideal, linear speed-up. We conclude that the CA-based elastodynamic simulator is easily configured to run in parallel, and yields excellent speed-up on SMP hardware.
Multiprocessor switch with selective pairing
Gara, Alan; Gschwind, Michael K; Salapura, Valentina
2014-03-11
System, method and computer program product for a multiprocessing system to offer selective pairing of processor cores for increased processing reliability. A selective pairing facility is provided that selectively connects, i.e., pairs, multiple microprocessor or processor cores to provide one highly reliable thread (or thread group). Each paired microprocessor or processor cores that provide one highly reliable thread for high-reliability connect with a system components such as a memory "nest" (or memory hierarchy), an optional system controller, and optional interrupt controller, optional I/O or peripheral devices, etc. The memory nest is attached to a selective pairing facility via a switch or a bus
Asynchronous Data Retrieval from an Object-Oriented Database
NASA Astrophysics Data System (ADS)
Gilbert, Jonathan P.; Bic, Lubomir
We present an object-oriented semantic database model which, similar to other object-oriented systems, combines the virtues of four concepts: the functional data model, a property inheritance hierarchy, abstract data types and message-driven computation. The main emphasis is on the last of these four concepts. We describe generic procedures that permit queries to be processed in a purely message-driven manner. A database is represented as a network of nodes and directed arcs, in which each node is a logical processing element, capable of communicating with other nodes by exchanging messages. This eliminates the need for shared memory and for centralized control during query processing. Hence, the model is suitable for implementation on a multiprocessor computer architecture, consisting of large numbers of loosely coupled processing elements.
NASA Technical Reports Server (NTRS)
OKeefe, Matthew (Editor); Kerr, Christopher L. (Editor)
1998-01-01
This report contains the abstracts and technical papers from the Second International Workshop on Software Engineering and Code Design in Parallel Meteorological and Oceanographic Applications, held June 15-18, 1998, in Scottsdale, Arizona. The purpose of the workshop is to bring together software developers in meteorology and oceanography to discuss software engineering and code design issues for parallel architectures, including Massively Parallel Processors (MPP's), Parallel Vector Processors (PVP's), Symmetric Multi-Processors (SMP's), Distributed Shared Memory (DSM) multi-processors, and clusters. Issues to be discussed include: (1) code architectures for current parallel models, including basic data structures, storage allocation, variable naming conventions, coding rules and styles, i/o and pre/post-processing of data; (2) designing modular code; (3) load balancing and domain decomposition; (4) techniques that exploit parallelism efficiently yet hide the machine-related details from the programmer; (5) tools for making the programmer more productive; and (6) the proliferation of programming models (F--, OpenMP, MPI, and HPF).
Multiprocessor and memory architecture of the neurocomputer SYNAPSE-1.
Ramacher, U; Raab, W; Anlauf, J; Hachmann, U; Beichter, J; Brüls, N; Wesseling, M; Sicheneder, E; Männer, R; Glass, J
1993-12-01
A general purpose neurocomputer, SYNAPSE-1, which exhibits a multiprocessor and memory architecture is presented. It offers wide flexibility with respect to neural algorithms and a speed-up factor of several orders of magnitude--including learning. The computational power is provided by a 2-dimensional systolic array of neural signal processors. Since the weights are stored outside these NSPs, memory size and processing power can be adapted individually to the application needs. A neural algorithms programming language, embedded in C(+2) has been defined for the user to cope with the neurocomputer. In a benchmark test, the prototype of SYNAPSE-1 was 8000 times as fast as a standard workstation.
A Heterogeneous Multiprocessor Graphics System Using Processor-Enhanced Memories
1989-02-01
frames per second, font generation directly from conic spline descriptions, and rapid calculation of radiosity form factors. The hardware consists of...generality for rendering curved surfaces, volume data, objects dcscri id with Constructive Solid Geometry, for rendering scenes using the radiosity ...f.aces and for computing a spherical radiosity lighting model (see Section 7.6). Custom Memory Chips \\ 208 bits x 128 pixels - Renderer Board ix p o a
Job-mix modeling and system analysis of an aerospace multiprocessor.
NASA Technical Reports Server (NTRS)
Mallach, E. G.
1972-01-01
An aerospace guidance computer organization, consisting of multiple processors and memory units attached to a central time-multiplexed data bus, is described. A job mix for this type of computer is obtained by analysis of Apollo mission programs. Multiprocessor performance is then analyzed using: 1) queuing theory, under certain 'limiting case' assumptions; 2) Markov process methods; and 3) system simulation. Results of the analyses indicate: 1) Markov process analysis is a useful and efficient predictor of simulation results; 2) efficient job execution is not seriously impaired even when the system is so overloaded that new jobs are inordinately delayed in starting; 3) job scheduling is significant in determining system performance; and 4) a system having many slow processors may or may not perform better than a system of equal power having few fast processors, but will not perform significantly worse.
Memory interface simulator: A computer design aid
NASA Technical Reports Server (NTRS)
Taylor, D. S.; Williams, T.; Weatherbee, J. E.
1972-01-01
Results are presented of a study conducted with a digital simulation model being used in the design of the Automatically Reconfigurable Modular Multiprocessor System (ARMMS), a candidate computer system for future manned and unmanned space missions. The model simulates the activity involved as instructions are fetched from random access memory for execution in one of the system central processing units. A series of model runs measured instruction execution time under various assumptions pertaining to the CPU's and the interface between the CPU's and RAM. Design tradeoffs are presented in the following areas: Bus widths, CPU microprogram read only memory cycle time, multiple instruction fetch, and instruction mix.
Testing and operating a multiprocessor chip with processor redundancy
Bellofatto, Ralph E; Douskey, Steven M; Haring, Rudolf A; McManus, Moyra K; Ohmacht, Martin; Schmunkamp, Dietmar; Sugavanam, Krishnan; Weatherford, Bryan J
2014-10-21
A system and method for improving the yield rate of a multiprocessor semiconductor chip that includes primary processor cores and one or more redundant processor cores. A first tester conducts a first test on one or more processor cores, and encodes results of the first test in an on-chip non-volatile memory. A second tester conducts a second test on the processor cores, and encodes results of the second test in an external non-volatile storage device. An override bit of a multiplexer is set if a processor core fails the second test. In response to the override bit, the multiplexer selects a physical-to-logical mapping of processor IDs according to one of: the encoded results in the memory device or the encoded results in the external storage device. On-chip logic configures the processor cores according to the selected physical-to-logical mapping.
Recent Improvements in Aerodynamic Design Optimization on Unstructured Meshes
NASA Technical Reports Server (NTRS)
Nielsen, Eric J.; Anderson, W. Kyle
2000-01-01
Recent improvements in an unstructured-grid method for large-scale aerodynamic design are presented. Previous work had shown such computations to be prohibitively long in a sequential processing environment. Also, robust adjoint solutions and mesh movement procedures were difficult to realize, particularly for viscous flows. To overcome these limiting factors, a set of design codes based on a discrete adjoint method is extended to a multiprocessor environment using a shared memory approach. A nearly linear speedup is demonstrated, and the consistency of the linearizations is shown to remain valid. The full linearization of the residual is used to precondition the adjoint system, and a significantly improved convergence rate is obtained. A new mesh movement algorithm is implemented and several advantages over an existing technique are presented. Several design cases are shown for turbulent flows in two and three dimensions.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wasserman, H.J.
1996-02-01
The second generation of the Digital Equipment Corp. (DEC) DECchip Alpha AXP microprocessor is referred to as the 21164. From the viewpoint of numerically-intensive computing, the primary difference between it and its predecessor, the 21064, is that the 21164 has twice the multiply/add throughput per clock period (CP), a maximum of two floating point operations (FLOPS) per CP vs. one for 21064. The AlphaServer 8400 is a shared-memory multiprocessor server system that can accommodate up to 12 CPUs and up to 14 GB of memory. In this report we will compare single processor performance of the 8400 system with thatmore » of the International Business Machines Corp. (IBM) RISC System/6000 POWER-2 microprocessor running at 66 MHz, the Silicon Graphics, Inc. (SGI) MIPS R8000 microprocessor running at 75 MHz, and the Cray Research, Inc. CRAY J90. The performance comparison is based on a set of Fortran benchmark codes that represent a portion of the Los Alamos National Laboratory supercomputer workload. The advantage of using these codes, is that the codes also span a wide range of computational characteristics, such as vectorizability, problem size, and memory access pattern. The primary disadvantage of using them is that detailed, quantitative analysis of performance behavior of all codes on all machines is difficult. One important addition to the benchmark set appears for the first time in this report. Whereas the older version was written for a vector processor, the newer version is more optimized for microprocessor architectures. Therefore, we have for the first time, an opportunity to measure performance on a single application using implementations that expose the respective strengths of vector and superscalar architecture. All results in this report are from single processors. A subsequent article will explore shared-memory multiprocessing performance of the 8400 system.« less
Expert Systems on Multiprocessor Architectures. Volume 3. Technical Reports
1991-06-01
choice of load balancing vs. load sharing 1141. While load balancing strives to keep all sites equally loaded, load sharing merely tries to prevent ...unnecessary idleness. Loo. balancing is appropriate to object- oriented real- time systems because * real-time systems ne ,l to prevent long waits for...oetavir ConClass siy51cr Iz a n ubjeU rephitation ’-enare ir order wo prevent a partic=Lar abiec:;ram heing (ntrlu ~lel Ar iic]en:f etautaan ire chanw
Fault-Tolerant Multiprocessor and VLSI-Based Systems.
1987-03-15
54590 170 Table 1: Statistics for the Benchmark Programs pages are distributed amongst the groups of the reconfigured memory in proportion to the...distances are proportional to only the logarithm of the sure that possesses relevance to a system which consists of alare nmbe ofhomgenouseleent...and comn.unication overhead resulting from faults communicating with all of the other elements in the system the network to degrade proportionately to
Automatic Data Partitioning on Distributed Memory Multiprocessors
1990-10-01
DISTRIBUTED MEMORY MULTIPROCESSORS D 1"’ 1 C . Manish Gupta NOV,1 41990.NOV 1 41990m Prithviraj Banerjee D Coordinated Science Laboratory College of...developed on the partitioning of arrays can as well be applied to other programming languages, such as C . 3 The rest of this paper is organized as follows...value 1, as in Fortran. a) N= 4, N 2 = 1: f(i) = J,(j) = 03 b) =Ni 1, N 2 =4: fA(i) =, f() - c ) NI 2, X) 2: f()=[., f2(j) = [L-.j d) N 1 , N 2 =4: fA(i
Sparse Gaussian elimination with controlled fill-in on a shared memory multiprocessor
NASA Technical Reports Server (NTRS)
Alaghband, Gita; Jordan, Harry F.
1989-01-01
It is shown that in sparse matrices arising from electronic circuits, it is possible to do computations on many diagonal elements simultaneously. A technique for obtaining an ordered compatible set directly from the ordered incompatible table is given. The ordering is based on the Markowitz number of the pivot candidates. This technique generates a set of compatible pivots with the property of generating few fills. A novel heuristic algorithm is presented that combines the idea of an order-compatible set with a limited binary tree search to generate several sets of compatible pivots in linear time. An elimination set for reducing the matrix is generated and selected on the basis of a minimum Markowitz sum number. The parallel pivoting technique presented is a stepwise algorithm and can be applied to any submatrix of the original matrix. Thus, it is not a preordering of the sparse matrix and is applied dynamically as the decomposition proceeds. Parameters are suggested to obtain a balance between parallelism and fill-ins. Results of applying the proposed algorithms on several large application matrices using the HEP multiprocessor (Kowalik, 1985) are presented and analyzed.
Hardware for Accelerating N-Modular Redundant Systems for High-Reliability Computing
NASA Technical Reports Server (NTRS)
Dobbs, Carl, Sr.
2012-01-01
A hardware unit has been designed that reduces the cost, in terms of performance and power consumption, for implementing N-modular redundancy (NMR) in a multiprocessor device. The innovation monitors transactions to memory, and calculates a form of sumcheck on-the-fly, thereby relieving the processors of calculating the sumcheck in software
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gala, Alan; Ohmacht, Martin
A multiprocessor system includes nodes. Each node includes a data path that includes a core, a TLB, and a first level cache implementing disambiguation. The system also includes at least one second level cache and a main memory. For thread memory access requests, the core uses an address associated with an instruction format of the core. The first level cache uses an address format related to the size of the main memory plus an offset corresponding to hardware thread meta data. The second level cache uses a physical main memory address plus software thread meta data to store the memorymore » access request. The second level cache accesses the main memory using the physical address with neither the offset nor the thread meta data after resolving speculation. In short, this system includes mapping of a virtual address to a different physical addresses for value disambiguation for different threads.« less
Parallel computation with the force
NASA Technical Reports Server (NTRS)
Jordan, H. F.
1985-01-01
A methodology, called the force, supports the construction of programs to be executed in parallel by a force of processes. The number of processes in the force is unspecified, but potentially very large. The force idea is embodied in a set of macros which produce multiproceossor FORTRAN code and has been studied on two shared memory multiprocessors of fairly different character. The method has simplified the writing of highly parallel programs within a limited class of parallel algorithms and is being extended to cover a broader class. The individual parallel constructs which comprise the force methodology are discussed. Of central concern are their semantics, implementation on different architectures and performance implications.
Characterizing parallel file-access patterns on a large-scale multiprocessor
NASA Technical Reports Server (NTRS)
Purakayastha, Apratim; Ellis, Carla Schlatter; Kotz, David; Nieuwejaar, Nils; Best, Michael
1994-01-01
Rapid increases in the computational speeds of multiprocessors have not been matched by corresponding performance enhancements in the I/O subsystem. To satisfy the large and growing I/O requirements of some parallel scientific applications, we need parallel file systems that can provide high-bandwidth and high-volume data transfer between the I/O subsystem and thousands of processors. Design of such high-performance parallel file systems depends on a thorough grasp of the expected workload. So far there have been no comprehensive usage studies of multiprocessor file systems. Our CHARISMA project intends to fill this void. The first results from our study involve an iPSC/860 at NASA Ames. This paper presents results from a different platform, the CM-5 at the National Center for Supercomputing Applications. The CHARISMA studies are unique because we collect information about every individual read and write request and about the entire mix of applications running on the machines. The results of our trace analysis lead to recommendations for parallel file system design. First the file system should support efficient concurrent access to many files, and I/O requests from many jobs under varying load conditions. Second, it must efficiently manage large files kept open for long periods. Third, it should expect to see small requests predominantly sequential access patterns, application-wide synchronous access, no concurrent file-sharing between jobs appreciable byte and block sharing between processes within jobs, and strong interprocess locality. Finally, the trace data suggest that node-level write caches and collective I/O request interfaces may be useful in certain environments.
Utilizing Dynamically Coupled Cores to Form a Resilient Chip Multiprocessor
2007-06-01
requires a significant deviation from previous work. For instance, we find that using the relaxed input replication model from Reunion incurs a...Circuit Width Delay Count CRC-16 16 6.65 754 CRC- SDLC -16 16 6.10 888 CRC-32 16 7.28 2260 CRC-32 32 8.60 4240 Table 1. FO4 delay and transistor count for...the operation of our proposed system is the same in all other respects. 4.4 Compatibility Across Memory Consis- tency Models The memory consistency
Two-dimensional systolic-array architecture for pixel-level vision tasks
NASA Astrophysics Data System (ADS)
Vijverberg, Julien A.; de With, Peter H. N.
2010-05-01
This paper presents ongoing work on the design of a two-dimensional (2D) systolic array for image processing. This component is designed to operate on a multi-processor system-on-chip. In contrast with other 2D systolic-array architectures and many other hardware accelerators, we investigate the applicability of executing multiple tasks in a time-interleaved fashion on the Systolic Array (SA). This leads to a lower external memory bandwidth and better load balancing of the tasks on the different processing tiles. To enable the interleaving of tasks, we add a shadow-state register for fast task switching. To reduce the number of accesses to the external memory, we propose to share the communication assist between consecutive tasks. A preliminary, non-functional version of the SA has been synthesized for an XV4S25 FPGA device and yields a maximum clock frequency of 150 MHz requiring 1,447 slices and 5 memory blocks. Mapping tasks from video content-analysis applications from literature on the SA yields reductions in the execution time of 1-2 orders of magnitude compared to the software implementation. We conclude that the choice for an SA architecture is useful, but a scaled version of the SA featuring less logic with fewer processing and pipeline stages yielding a lower clock frequency, would be sufficient for a video analysis system-on-chip.
Adaptive Backoff Synchronization Techniques
1989-07-01
The Simple Code. Technical Report, Lawrence Livermore Laboratory, February 1978. [6] F. Darems-Rogers, D. A. George, V. A. Norton, and G . F. Pfister...Heights, November 1986. 20 [7] Daniel Gajski , David Kuck, Duncan Lawrie, and Ahmed Saleh. Cedar - A Large Scale Multiprocessor. In International...17] Janak H. Patel. Analysis of Multiprocessors with Private Cache Memories. IEEE Transactions on Com- puters, C-31(4):296-304, April 1982. [18] G
The Simulation of Read-time Scalable Coherent Interface
NASA Technical Reports Server (NTRS)
Li, Qiang; Grant, Terry; Grover, Radhika S.
1997-01-01
Scalable Coherent Interface (SCI, IEEE/ANSI Std 1596-1992) (SCI1, SCI2) is a high performance interconnect for shared memory multiprocessor systems. In this project we investigate an SCI Real Time Protocols (RTSCI1) using Directed Flow Control Symbols. We studied the issues of efficient generation of control symbols, and created a simulation model of the protocol on a ring-based SCI system. This report presents the results of the study. The project has been implemented using SES/Workbench. The details that follow encompass aspects of both SCI and Flow Control Protocols, as well as the effect of realistic client/server processing delay. The report is organized as follows. Section 2 provides a description of the simulation model. Section 3 describes the protocol implementation details. The next three sections of the report elaborate on the workload, results and conclusions. Appended to the report is a description of the tool, SES/Workbench, used in our simulation, and internal details of our implementation of the protocol.
Using parallel banded linear system solvers in generalized eigenvalue problems
NASA Technical Reports Server (NTRS)
Zhang, Hong; Moss, William F.
1993-01-01
Subspace iteration is a reliable and cost effective method for solving positive definite banded symmetric generalized eigenproblems, especially in the case of large scale problems. This paper discusses an algorithm that makes use of two parallel banded solvers in subspace iteration. A shift is introduced to decompose the banded linear systems into relatively independent subsystems and to accelerate the iterations. With this shift, an eigenproblem is mapped efficiently into the memories of a multiprocessor and a high speed-up is obtained for parallel implementations. An optimal shift is a shift that balances total computation and communication costs. Under certain conditions, we show how to estimate an optimal shift analytically using the decay rate for the inverse of a banded matrix, and how to improve this estimate. Computational results on iPSC/2 and iPSC/860 multiprocessors are presented.
NASA Technical Reports Server (NTRS)
Lala, J. H.; Smith, T. B., III
1983-01-01
The software developed for the Fault-Tolerant Multiprocessor (FTMP) is described. The FTMP executive is a timer-interrupt driven dispatcher that schedules iterative tasks which run at 3.125, 12.5, and 25 Hz. Major tasks which run under the executive include system configuration control, flight control, and display. The flight control task includes autopilot and autoland functions for a jet transport aircraft. System Displays include status displays of all hardware elements (processors, memories, I/O ports, buses), failure log displays showing transient and hard faults, and an autopilot display. All software is in a higher order language (AED, an ALGOL derivative). The executive is a fully distributed general purpose executive which automatically balances the load among available processor triads. Provisions for graceful performance degradation under processing overload are an integral part of the scheduling algorithms.
NASA Technical Reports Server (NTRS)
Choudhary, Alok Nidhi; Leung, Mun K.; Huang, Thomas S.; Patel, Janak H.
1989-01-01
Several techniques to perform static and dynamic load balancing techniques for vision systems are presented. These techniques are novel in the sense that they capture the computational requirements of a task by examining the data when it is produced. Furthermore, they can be applied to many vision systems because many algorithms in different systems are either the same, or have similar computational characteristics. These techniques are evaluated by applying them on a parallel implementation of the algorithms in a motion estimation system on a hypercube multiprocessor system. The motion estimation system consists of the following steps: (1) extraction of features; (2) stereo match of images in one time instant; (3) time match of images from different time instants; (4) stereo match to compute final unambiguous points; and (5) computation of motion parameters. It is shown that the performance gains when these data decomposition and load balancing techniques are used are significant and the overhead of using these techniques is minimal.
The MIT Alewife Machine: A Large-Scale Distributed-Memory Multiprocessor
1991-06-01
Symposium on Compiler Construction, June 1986. [14] Daniel Gajski , David Kuck, Duncan Lawrie, and Ahmed Saleh. Cedar - A Large Scale Multiprocessor. In...Directory Methods. In Proceedings 17th Annual International Symposium on Computer Architecture, June 1990. [31] G . M. Papadopoulos and D.E. Culler...Monsoon: An Explicit Token-Store Ar- chitecture. In Proceedings 17th Annual International Symposium on Computer Architecture, June 1990. [32] G . F
Adaptive Backoff Synchronization Techniques
1989-06-01
The Simple Code. Technical Report, Lawrence Livermore Laboratory, February 1978. [6J F. Darems-Rogers, D. A. George, V. A. Norton, and G . F. Pfister...Heights, November 1986. 20 [7] Daniel Gajski , David Kuck, Duncan Lawrie, and Ahmed Saleh. Cedar - A Large Scale Multiprocessor. In International Conference...17] Janak H. Patel. Analysis of Multiprocessors with Private Cache Memories. IEEE Transactions on Com- puters, C-31(4):296-304, April 1982. [18] G
NASA Astrophysics Data System (ADS)
Kutt, P. H.; Balamuth, D. P.
1989-10-01
Summary form only given, as follows. A multiprocessor system based on commercially available VMEbus components has been developed for the acquisition and reduction of event-mode data in nuclear physics experiments. The system contains seven 68000 CPUs and 14 Mbyte of memory. A minimal operating system handles data transfer and task allocation, and a compiler for a specially designed event analysis language produces code for the processors. The system has been in operation for four years at the University of Pennsylvania Tandem Accelerator Laboratory. Computation rates over three times that of a MicroVAX II have been achieved at a fraction of the cost. The use of WORM optical disks for event recording allows the processing of gigabyte data sets without operator intervention. A more powerful system is being planned which will make use of recently developed RISC (reduced instruction set computer) processors to obtain an order of magnitude increase in computing power per node.
A Framework for Parallel Unstructured Grid Generation for Complex Aerodynamic Simulations
NASA Technical Reports Server (NTRS)
Zagaris, George; Pirzadeh, Shahyar Z.; Chrisochoides, Nikos
2009-01-01
A framework for parallel unstructured grid generation targeting both shared memory multi-processors and distributed memory architectures is presented. The two fundamental building-blocks of the framework consist of: (1) the Advancing-Partition (AP) method used for domain decomposition and (2) the Advancing Front (AF) method used for mesh generation. Starting from the surface mesh of the computational domain, the AP method is applied recursively to generate a set of sub-domains. Next, the sub-domains are meshed in parallel using the AF method. The recursive nature of domain decomposition naturally maps to a divide-and-conquer algorithm which exhibits inherent parallelism. For the parallel implementation, the Master/Worker pattern is employed to dynamically balance the varying workloads of each task on the set of available CPUs. Performance results by this approach are presented and discussed in detail as well as future work and improvements.
NASA Technical Reports Server (NTRS)
Keyes, David E.; Smooke, Mitchell D.
1987-01-01
A parallelized finite difference code based on the Newton method for systems of nonlinear elliptic boundary value problems in two dimensions is analyzed in terms of computational complexity and parallel efficiency. An approximate cost function depending on 15 dimensionless parameters is derived for algorithms based on stripwise and boxwise decompositions of the domain and a one-to-one assignment of the strip or box subdomains to processors. The sensitivity of the cost functions to the parameters is explored in regions of parameter space corresponding to model small-order systems with inexpensive function evaluations and also a coupled system of nineteen equations with very expensive function evaluations. The algorithm was implemented on the Intel Hypercube, and some experimental results for the model problems with stripwise decompositions are presented and compared with the theory. In the context of computational combustion problems, multiprocessors of either message-passing or shared-memory type may be employed with stripwise decompositions to realize speedup of O(n), where n is mesh resolution in one direction, for reasonable n.
NASA Technical Reports Server (NTRS)
Lala, J. H.; Smith, T. B., III
1983-01-01
The experimental test and evaluation of the Fault-Tolerant Multiprocessor (FTMP) is described. Major objectives of this exercise include expanding validation envelope, building confidence in the system, revealing any weaknesses in the architectural concepts and in their execution in hardware and software, and in general, stressing the hardware and software. To this end, pin-level faults were injected into one LRU of the FTMP and the FTMP response was measured in terms of fault detection, isolation, and recovery times. A total of 21,055 stuck-at-0, stuck-at-1 and invert-signal faults were injected in the CPU, memory, bus interface circuits, Bus Guardian Units, and voters and error latches. Of these, 17,418 were detected. At least 80 percent of undetected faults are estimated to be on unused pins. The multiprocessor identified all detected faults correctly and recovered successfully in each case. Total recovery time for all faults averaged a little over one second. This can be reduced to half a second by including appropriate self-tests.
Support for non-locking parallel reception of packets belonging to a single memory reception FIFO
Chen, Dong [Yorktown Heights, NY; Heidelberger, Philip [Yorktown Heights, NY; Salapura, Valentina [Yorktown Heights, NY; Senger, Robert M [Yorktown Heights, NY; Steinmacher-Burow, Burkhard [Boeblingen, DE; Sugawara, Yutaka [Yorktown Heights, NY
2011-01-27
A method and apparatus for distributed parallel messaging in a parallel computing system. A plurality of DMA engine units are configured in a multiprocessor system to operate in parallel, one DMA engine unit for transferring a current packet received at a network reception queue to a memory location in a memory FIFO (rmFIFO) region of a memory. A control unit implements logic to determine whether any prior received packet destined for that rmFIFO is still in a process of being stored in the associated memory by another DMA engine unit of the plurality, and prevent the one DMA engine unit from indicating completion of storing the current received packet in the reception memory FIFO (rmFIFO) until all prior received packets destined for that rmFIFO are completely stored by the other DMA engine units. Thus, there is provided non-locking support so that multiple packets destined for a single rmFIFO are transferred and stored in parallel to predetermined locations in a memory.
Validation of multiprocessor systems
NASA Technical Reports Server (NTRS)
Siewiorek, D. P.; Segall, Z.; Kong, T.
1982-01-01
Experiments that can be used to validate fault free performance of multiprocessor systems in aerospace systems integrating flight controls and avionics are discussed. Engineering prototypes for two fault tolerant multiprocessors are tested.
Checkpointing Shared Memory Programs at the Application-level
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bronevetsky, G; Schulz, M; Szwed, P
2004-09-08
Trends in high-performance computing are making it necessary for long-running applications to tolerate hardware faults. The most commonly used approach is checkpoint and restart(CPR)-the state of the computation is saved periodically on disk, and when a failure occurs, the computation is restarted from the last saved state. At present, it is the responsibility of the programmer to instrument applications for CPR. Our group is investigating the use of compiler technology to instrument codes to make them self-checkpointing and self-restarting, thereby providing an automatic solution to the problem of making long-running scientific applications resilient to hardware faults. Our previous work focusedmore » on message-passing programs. In this paper, we describe such a system for shared-memory programs running on symmetric multiprocessors. The system has two components: (i)a pre-compiler for source-to-source modification of applications, and (ii) a runtime system that implements a protocol for coordinating CPR among the threads of the parallel application. For the sake of concreteness, we focus on a non-trivial subset of OpenMP that includes barriers and locks. One of the advantages of this approach is that the ability to tolerate faults becomes embedded within the application itself, so applications become self-checkpointing and self-restarting on any platform. We demonstrate this by showing that our transformed benchmarks can checkpoint and restart on three different platforms (Windows/x86, Linux/x86, and Tru64/Alpha). Our experiments show that the overhead introduced by this approach is usually quite small; they also suggest ways in which the current implementation can be tuned to reduced overheads further.« less
CUDA Optimization Strategies for Compute- and Memory-Bound Neuroimaging Algorithms
Lee, Daren; Dinov, Ivo; Dong, Bin; Gutman, Boris; Yanovsky, Igor; Toga, Arthur W.
2011-01-01
As neuroimaging algorithms and technology continue to grow faster than CPU performance in complexity and image resolution, data-parallel computing methods will be increasingly important. The high performance, data-parallel architecture of modern graphical processing units (GPUs) can reduce computational times by orders of magnitude. However, its massively threaded architecture introduces challenges when GPU resources are exceeded. This paper presents optimization strategies for compute- and memory-bound algorithms for the CUDA architecture. For compute-bound algorithms, the registers are reduced through variable reuse via shared memory and the data throughput is increased through heavier thread workloads and maximizing the thread configuration for a single thread block per multiprocessor. For memory-bound algorithms, fitting the data into the fast but limited GPU resources is achieved through reorganizing the data into self-contained structures and employing a multi-pass approach. Memory latencies are reduced by selecting memory resources whose cache performance are optimized for the algorithm's access patterns. We demonstrate the strategies on two computationally expensive algorithms and achieve optimized GPU implementations that perform up to 6× faster than unoptimized ones. Compared to CPU implementations, we achieve peak GPU speedups of 129× for the 3D unbiased nonlinear image registration technique and 93× for the non-local means surface denoising algorithm. PMID:21159404
CUDA optimization strategies for compute- and memory-bound neuroimaging algorithms.
Lee, Daren; Dinov, Ivo; Dong, Bin; Gutman, Boris; Yanovsky, Igor; Toga, Arthur W
2012-06-01
As neuroimaging algorithms and technology continue to grow faster than CPU performance in complexity and image resolution, data-parallel computing methods will be increasingly important. The high performance, data-parallel architecture of modern graphical processing units (GPUs) can reduce computational times by orders of magnitude. However, its massively threaded architecture introduces challenges when GPU resources are exceeded. This paper presents optimization strategies for compute- and memory-bound algorithms for the CUDA architecture. For compute-bound algorithms, the registers are reduced through variable reuse via shared memory and the data throughput is increased through heavier thread workloads and maximizing the thread configuration for a single thread block per multiprocessor. For memory-bound algorithms, fitting the data into the fast but limited GPU resources is achieved through reorganizing the data into self-contained structures and employing a multi-pass approach. Memory latencies are reduced by selecting memory resources whose cache performance are optimized for the algorithm's access patterns. We demonstrate the strategies on two computationally expensive algorithms and achieve optimized GPU implementations that perform up to 6× faster than unoptimized ones. Compared to CPU implementations, we achieve peak GPU speedups of 129× for the 3D unbiased nonlinear image registration technique and 93× for the non-local means surface denoising algorithm. Copyright © 2010 Elsevier Ireland Ltd. All rights reserved.
Optoelectronic-cache memory system architecture.
Chiarulli, D M; Levitan, S P
1996-05-10
We present an investigation of the architecture of an optoelectronic cache that can integrate terabit optical memories with the electronic caches associated with high-performance uniprocessors and multiprocessors. The use of optoelectronic-cache memories enables these terabit technologies to provide transparently low-latency secondary memory with frame sizes comparable with disk pages but with latencies that approach those of electronic secondary-cache memories. This enables the implementation of terabit memories with effective access times comparable with the cycle times of current microprocessors. The cache design is based on the use of a smart-pixel array and combines parallel free-space optical input-output to-and-from optical memory with conventional electronic communication to the processor caches. This cache and the optical memory system to which it will interface provide a large random-access memory space that has a lower overall latency than that of magnetic disks and disk arrays. In addition, as a consequence of the high-bandwidth parallel input-output capabilities of optical memories, fault service times for the optoelectronic cache are substantially less than those currently achievable with any rotational media.
Parallel Discrete Molecular Dynamics Simulation With Speculation and In-Order Commitment*†
Khan, Md. Ashfaquzzaman; Herbordt, Martin C.
2011-01-01
Discrete molecular dynamics simulation (DMD) uses simplified and discretized models enabling simulations to advance by event rather than by timestep. DMD is an instance of discrete event simulation and so is difficult to scale: even in this multi-core era, all reported DMD codes are serial. In this paper we discuss the inherent difficulties of scaling DMD and present our method of parallelizing DMD through event-based decomposition. Our method is microarchitecture inspired: speculative processing of events exposes parallelism, while in-order commitment ensures correctness. We analyze the potential of this parallelization method for shared-memory multiprocessors. Achieving scalability required extensive experimentation with scheduling and synchronization methods to mitigate serialization. The speed-up achieved for a variety of system sizes and complexities is nearly 6× on an 8-core and over 9× on a 12-core processor. We present and verify analytical models that account for the achieved performance as a function of available concurrency and architectural limitations. PMID:21822327
Parallel Discrete Molecular Dynamics Simulation With Speculation and In-Order Commitment.
Khan, Md Ashfaquzzaman; Herbordt, Martin C
2011-07-20
Discrete molecular dynamics simulation (DMD) uses simplified and discretized models enabling simulations to advance by event rather than by timestep. DMD is an instance of discrete event simulation and so is difficult to scale: even in this multi-core era, all reported DMD codes are serial. In this paper we discuss the inherent difficulties of scaling DMD and present our method of parallelizing DMD through event-based decomposition. Our method is microarchitecture inspired: speculative processing of events exposes parallelism, while in-order commitment ensures correctness. We analyze the potential of this parallelization method for shared-memory multiprocessors. Achieving scalability required extensive experimentation with scheduling and synchronization methods to mitigate serialization. The speed-up achieved for a variety of system sizes and complexities is nearly 6× on an 8-core and over 9× on a 12-core processor. We present and verify analytical models that account for the achieved performance as a function of available concurrency and architectural limitations.
Embedded Multiprocessor Technology for VHSIC Insertion
NASA Technical Reports Server (NTRS)
Hayes, Paul J.
1990-01-01
Viewgraphs on embedded multiprocessor technology for VHSIC insertion are presented. The objective was to develop multiprocessor system technology providing user-selectable fault tolerance, increased throughput, and ease of application representation for concurrent operation. The approach was to develop graph management mapping theory for proper performance, model multiprocessor performance, and demonstrate performance in selected hardware systems.
Design and evaluation of a fault-tolerant multiprocessor using hardware recovery blocks
NASA Technical Reports Server (NTRS)
Lee, Y. H.; Shin, K. G.
1982-01-01
A fault-tolerant multiprocessor with a rollback recovery mechanism is discussed. The rollback mechanism is based on the hardware recovery block which is a hardware equivalent to the software recovery block. The hardware recovery block is constructed by consecutive state-save operations and several state-save units in every processor and memory module. When a fault is detected, the multiprocessor reconfigures itself to replace the faulty component and then the process originally assigned to the faulty component retreats to one of the previously saved states in order to resume fault-free execution. A mathematical model is proposed to calculate both the coverage of multi-step rollback recovery and the risk of restart. A performance evaluation in terms of task execution time is also presented.
Flexible language constructs for large parallel programs
NASA Technical Reports Server (NTRS)
Rosing, Matthew; Schnabel, Robert
1993-01-01
The goal of the research described is to develop flexible language constructs for writing large data parallel numerical programs for distributed memory (MIMD) multiprocessors. Previously, several models have been developed to support synchronization and communication. Models for global synchronization include SIMD (Single Instruction Multiple Data), SPMD (Single Program Multiple Data), and sequential programs annotated with data distribution statements. The two primary models for communication include implicit communication based on shared memory and explicit communication based on messages. None of these models by themselves seem sufficient to permit the natural and efficient expression of the variety of algorithms that occur in large scientific computations. An overview of a new language that combines many of these programming models in a clean manner is given. This is done in a modular fashion such that different models can be combined to support large programs. Within a module, the selection of a model depends on the algorithm and its efficiency requirements. An overview of the language and discussion of some of the critical implementation details is given.
Insertion of coherence requests for debugging a multiprocessor
Blumrich, Matthias A.; Salapura, Valentina
2010-02-23
A method and system are disclosed to insert coherence events in a multiprocessor computer system, and to present those coherence events to the processors of the multiprocessor computer system for analysis and debugging purposes. The coherence events are inserted in the computer system by adding one or more special insert registers. By writing into the insert registers, coherence events are inserted in the multiprocessor system as if they were generated by the normal coherence protocol. Once these coherence events are processed, the processing of coherence events can continue in the normal operation mode.
Distributed Virtual System (DIVIRS) Project
NASA Technical Reports Server (NTRS)
Schorr, Herbert; Neuman, B. Clifford
1993-01-01
As outlined in our continuation proposal 92-ISI-50R (revised) on contract NCC 2-539, we are (1) developing software, including a system manager and a job manager, that will manage available resources and that will enable programmers to program parallel applications in terms of a virtual configuration of processors, hiding the mapping to physical nodes; (2) developing communications routines that support the abstractions implemented in item one; (3) continuing the development of file and information systems based on the virtual system model; and (4) incorporating appropriate security measures to allow the mechanisms developed in items 1 through 3 to be used on an open network. The goal throughout our work is to provide a uniform model that can be applied to both parallel and distributed systems. We believe that multiprocessor systems should exist in the context of distributed systems, allowing them to be more easily shared by those that need them. Our work provides the mechanisms through which nodes on multiprocessors are allocated to jobs running within the distributed system and the mechanisms through which files needed by those jobs can be located and accessed.
DIstributed VIRtual System (DIVIRS) project
NASA Technical Reports Server (NTRS)
Schorr, Herbert; Neuman, B. Clifford
1994-01-01
As outlined in our continuation proposal 92-ISI-. OR (revised) on NASA cooperative agreement NCC2-539, we are (1) developing software, including a system manager and a job manager, that will manage available resources and that will enable programmers to develop and execute parallel applications in terms of a virtual configuration of processors, hiding the mapping to physical nodes; (2) developing communications routines that support the abstractions implemented in item one; (3) continuing the development of file and information systems based on the Virtual System Model; and (4) incorporating appropriate security measures to allow the mechanisms developed in items 1 through 3 to be used on an open network. The goal throughout our work is to provide a uniform model that can be applied to both parallel and distributed systems. We believe that multiprocessor systems should exist in the context of distributed systems, allowing them to be more easily shared by those that need them. Our work provides the mechanisms through which nodes on multiprocessors are allocated to jobs running within the distributed system and the mechanisms through which files needed by those jobs can be located and accessed.
DIstributed VIRtual System (DIVIRS) project
NASA Technical Reports Server (NTRS)
Schorr, Herbert; Neuman, Clifford B.
1995-01-01
As outlined in our continuation proposal 92-ISI-50R (revised) on NASA cooperative agreement NCC2-539, we are (1) developing software, including a system manager and a job manager, that will manage available resources and that will enable programmers to develop and execute parallel applications in terms of a virtual configuration of processors, hiding the mapping to physical nodes; (2) developing communications routines that support the abstractions implemented in item one; (3) continuing the development of file and information systems based on the Virtual System Model; and (4) incorporating appropriate security measures to allow the mechanisms developed in items 1 through 3 to be used on an open network. The goal throughout our work is to provide a uniform model that can be applied to both parallel and distributed systems. We believe that multiprocessor systems should exist in the context of distributed systems, allowing them to be more easily shared by those that need them. Our work provides the mechanisms through which nodes on multiprocessors are allocated to jobs running within the distributed system and the mechanisms through which files needed by those jobs can be located and accessed.
Distributed Virtual System (DIVIRS) project
NASA Technical Reports Server (NTRS)
Schorr, Herbert; Neuman, B. Clifford
1993-01-01
As outlined in the continuation proposal 92-ISI-50R (revised) on NASA cooperative agreement NCC 2-539, the investigators are developing software, including a system manager and a job manager, that will manage available resources and that will enable programmers to develop and execute parallel applications in terms of a virtual configuration of processors, hiding the mapping to physical nodes; developing communications routines that support the abstractions implemented; continuing the development of file and information systems based on the Virtual System Model; and incorporating appropriate security measures to allow the mechanisms developed to be used on an open network. The goal throughout the work is to provide a uniform model that can be applied to both parallel and distributed systems. The authors believe that multiprocessor systems should exist in the context of distributed systems, allowing them to be more easily shared by those that need them. The work provides the mechanisms through which nodes on multiprocessors are allocated to jobs running within the distributed system and the mechanisms through which files needed by those jobs can be located and accessed.
State recovery and lockstep execution restart in a system with multiprocessor pairing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gara, Alan; Gschwind, Michael K; Salapura, Valentina
System, method and computer program product for a multiprocessing system to offer selective pairing of processor cores for increased processing reliability. A selective pairing facility is provided that selectively connects, i.e., pairs, multiple microprocessor or processor cores to provide one highly reliable thread (or thread group). Each paired microprocessor or processor cores that provide one highly reliable thread for high-reliability connect with a system components such as a memory "nest" (or memory hierarchy), an optional system controller, and optional interrupt controller, optional I/O or peripheral devices, etc. The memory nest is attached to a selective pairing facility via a switchmore » or a bus. Each selectively paired processor core is includes a transactional execution facility, whereing the system is configured to enable processor rollback to a previous state and reinitialize lockstep execution in order to recover from an incorrect execution when an incorrect execution has been detected by the selective pairing facility.« less
A manual for PARTI runtime primitives
NASA Technical Reports Server (NTRS)
Berryman, Harry; Saltz, Joel
1990-01-01
Primitives are presented that are designed to help users efficiently program irregular problems (e.g., unstructured mesh sweeps, sparse matrix codes, adaptive mesh partial differential equations solvers) on distributed memory machines. These primitives are also designed for use in compilers for distributed memory multiprocessors. Communications patterns are captured at runtime, and the appropriate send and receive messages are automatically generated.
NASA Technical Reports Server (NTRS)
Sohn, Andrew; Biswas, Rupak
1996-01-01
Solving the hard Satisfiability Problem is time consuming even for modest-sized problem instances. Solving the Random L-SAT Problem is especially difficult due to the ratio of clauses to variables. This report presents a parallel synchronous simulated annealing method for solving the Random L-SAT Problem on a large-scale distributed-memory multiprocessor. In particular, we use a parallel synchronous simulated annealing procedure, called Generalized Speculative Computation, which guarantees the same decision sequence as sequential simulated annealing. To demonstrate the performance of the parallel method, we have selected problem instances varying in size from 100-variables/425-clauses to 5000-variables/21,250-clauses. Experimental results on the AP1000 multiprocessor indicate that our approach can satisfy 99.9 percent of the clauses while giving almost a 70-fold speedup on 500 processors.
NASA Technical Reports Server (NTRS)
Arenstorf, Norbert S.; Jordan, Harry F.
1987-01-01
A barrier is a method for synchronizing a large number of concurrent computer processes. After considering some basic synchronization mechanisms, a collection of barrier algorithms with either linear or logarithmic depth are presented. A graphical model is described that profiles the execution of the barriers and other parallel programming constructs. This model shows how the interaction between the barrier algorithms and the work that they synchronize can impact their performance. One result is that logarithmic tree structured barriers show good performance when synchronizing fixed length work, while linear self-scheduled barriers show better performance when synchronizing fixed length work with an imbedded critical section. The linear barriers are better able to exploit the process skew associated with critical sections. Timing experiments, performed on an eighteen processor Flex/32 shared memory multiprocessor, that support these conclusions are detailed.
Operating system for a real-time multiprocessor propulsion system simulator. User's manual
NASA Technical Reports Server (NTRS)
Cole, G. L.
1985-01-01
The NASA Lewis Research Center is developing and evaluating experimental hardware and software systems to help meet future needs for real-time, high-fidelity simulations of air-breathing propulsion systems. Specifically, the real-time multiprocessor simulator project focuses on the use of multiple microprocessors to achieve the required computing speed and accuracy at relatively low cost. Operating systems for such hardware configurations are generally not available. A real time multiprocessor operating system (RTMPOS) that supports a variety of multiprocessor configurations was developed at Lewis. With some modification, RTMPOS can also support various microprocessors. RTMPOS, by means of menus and prompts, provides the user with a versatile, user-friendly environment for interactively loading, running, and obtaining results from a multiprocessor-based simulator. The menu functions are described and an example simulation session is included to demonstrate the steps required to go from the simulation loading phase to the execution phase.
Cache Coherence Protocols for Large-Scale Multiprocessors
1990-09-01
and is compared with the other protocols for large-scale machines. In later analysis, this coherence method is designated by the acronym OCPD , which...private read misses 2 6 6 ( OCPD ) private write misses 2 6 6 Table 4.2: Transaction Types and Costs. the performance of the memory system. These...methodologies. Figure 4-2 shows the processor utiliza- tions of the Weather program, with special code in the dyn-nic post-mortem sched- 94 OCPD DlrINB
1980-08-31
DEPARTMENT OF THE ARMY POSITION, UNLESS SO DESIGNATED BY OTHER AUTHORIZED DOCUMENTS. -I , ! unclassified SECURITY CLASSIICATION Of THIS PAGE ’Whm bate...ADORE.SE4I dFfemtar CIatblftaE Office) IS. SECURITY CLASS. (of ftio #sNt) unclassified IS. DOCL ASSI FICATION/DOWNGRADING SCHEDULENA INA IS. DISTRIBUTION...In this section, we describe the characteristics of the access sequence of a pipelined processor. A pipelined organization in the most general sense
A manual for PARTI runtime primitives, revision 1
NASA Technical Reports Server (NTRS)
Das, Raja; Saltz, Joel; Berryman, Harry
1991-01-01
Primitives are presented that are designed to help users efficiently program irregular problems (e.g., unstructured mesh sweeps, sparse matrix codes, adaptive mesh partial differential equations solvers) on distributed memory machines. These primitives are also designed for use in compilers for distributed memory multiprocessors. Communications patterns are captured at runtime, and the appropriate send and receive messages are automatically generated.
File-System Workload on a Scientific Multiprocessor
NASA Technical Reports Server (NTRS)
Kotz, David; Nieuwejaar, Nils
1995-01-01
Many scientific applications have intense computational and I/O requirements. Although multiprocessors have permitted astounding increases in computational performance, the formidable I/O needs of these applications cannot be met by current multiprocessors a their I/O subsystems. To prevent I/O subsystems from forever bottlenecking multiprocessors and limiting the range of feasible applications, new I/O subsystems must be designed. The successful design of computer systems (both hardware and software) depends on a thorough understanding of their intended use. A system designer optimizes the policies and mechanisms for the cases expected to most common in the user's workload. In the case of multiprocessor file systems, however, designers have been forced to build file systems based only on speculation about how they would be used, extrapolating from file-system characterizations of general-purpose workloads on uniprocessor and distributed systems or scientific workloads on vector supercomputers (see sidebar on related work). To help these system designers, in June 1993 we began the Charisma Project, so named because the project sought to characterize 1/0 in scientific multiprocessor applications from a variety of production parallel computing platforms and sites. The Charisma project is unique in recording individual read and write requests-in live, multiprogramming, parallel workloads (rather than from selected or nonparallel applications). In this article, we present the first results from the project: a characterization of the file-system workload an iPSC/860 multiprocessor running production, parallel scientific applications at NASA's Ames Research Center.
VME rollback hardware for time warp multiprocessor systems
NASA Technical Reports Server (NTRS)
Robb, Michael J.; Buzzell, Calvin A.
1992-01-01
The purpose of the research effort is to develop and demonstrate innovative hardware to implement specific rollback and timing functions required for efficient queue management and precision timekeeping in multiprocessor discrete event simulations. The previously completed phase 1 effort demonstrated the technical feasibility of building hardware modules which eliminate the state saving overhead of the Time Warp paradigm used in distributed simulations on multiprocessor systems. The current phase 2 effort will build multiple pre-production rollback hardware modules integrated with a network of Sun workstations, and the integrated system will be tested by executing a Time Warp simulation. The rollback hardware will be designed to interface with the greatest number of multiprocessor systems possible. The authors believe that the rollback hardware will provide for significant speedup of large scale discrete event simulation problems and allow multiprocessors using Time Warp to dramatically increase performance.
Power-Aware Compiler Controllable Chip Multiprocessor
NASA Astrophysics Data System (ADS)
Shikano, Hiroaki; Shirako, Jun; Wada, Yasutaka; Kimura, Keiji; Kasahara, Hironori
A power-aware compiler controllable chip multiprocessor (CMP) is presented and its performance and power consumption are evaluated with the optimally scheduled advanced multiprocessor (OSCAR) parallelizing compiler. The CMP is equipped with power control registers that change clock frequency and power supply voltage to functional units including processor cores, memories, and an interconnection network. The OSCAR compiler carries out coarse-grain task parallelization of programs and reduces power consumption using architectural power control support and the compiler's power saving scheme. The performance evaluation shows that MPEG-2 encoding on the proposed CMP with four CPUs results in 82.6% power reduction in real-time execution mode with a deadline constraint on its sequential execution time. Furthermore, MP3 encoding on a heterogeneous CMP with four CPUs and four accelerators results in 53.9% power reduction at 21.1-fold speed-up in performance against its sequential execution in the fastest execution mode.
A large-grain mapping approach for multiprocessor systems through data flow model. Ph.D. Thesis
NASA Technical Reports Server (NTRS)
Kim, Hwa-Soo
1991-01-01
A large-grain level mapping method is presented of numerical oriented applications onto multiprocessor systems. The method is based on the large-grain data flow representation of the input application and it assumes a general interconnection topology of the multiprocessor system. The large-grain data flow model was used because such representation best exhibits inherited parallelism in many important applications, e.g., CFD models based on partial differential equations can be presented in large-grain data flow format, very effectively. A generalized interconnection topology of the multiprocessor architecture is considered, including such architectural issues as interprocessor communication cost, with the aim to identify the 'best matching' between the application and the multiprocessor structure. The objective is to minimize the total execution time of the input algorithm running on the target system. The mapping strategy consists of the following: (1) large-grain data flow graph generation from the input application using compilation techniques; (2) data flow graph partitioning into basic computation blocks; and (3) physical mapping onto the target multiprocessor using a priority allocation scheme for the computation blocks.
A model for the distributed storage and processing of large arrays
NASA Technical Reports Server (NTRS)
Mehrota, P.; Pratt, T. W.
1983-01-01
A conceptual model for parallel computations on large arrays is developed. The model provides a set of language concepts appropriate for processing arrays which are generally too large to fit in the primary memories of a multiprocessor system. The semantic model is used to represent arrays on a concurrent architecture in such a way that the performance realities inherent in the distributed storage and processing can be adequately represented. An implementation of the large array concept as an Ada package is also described.
Software/hardware distributed processing network supporting the Ada environment
NASA Astrophysics Data System (ADS)
Wood, Richard J.; Pryk, Zen
1993-09-01
A high-performance, fault-tolerant, distributed network has been developed, tested, and demonstrated. The network is based on the MIPS Computer Systems, Inc. R3000 Risc for processing, VHSIC ASICs for high speed, reliable, inter-node communications and compatible commercial memory and I/O boards. The network is an evolution of the Advanced Onboard Signal Processor (AOSP) architecture. It supports Ada application software with an Ada- implemented operating system. A six-node implementation (capable of expansion up to 256 nodes) of the RISC multiprocessor architecture provides 120 MIPS of scalar throughput, 96 Mbytes of RAM and 24 Mbytes of non-volatile memory. The network provides for all ground processing applications, has merit for space-qualified RISC-based network, and interfaces to advanced Computer Aided Software Engineering (CASE) tools for application software development.
Overview of the LINCS architecture
DOE Office of Scientific and Technical Information (OSTI.GOV)
Fletcher, J.G.; Watson, R.W.
1982-01-13
Computing at the Lawrence Livermore National Laboratory (LLNL) has evolved over the past 15 years with a computer network based resource sharing environment. The increasing use of low cost and high performance micro, mini and midi computers and commercially available local networking systems will accelerate this trend. Further, even the large scale computer systems, on which much of the LLNL scientific computing depends, are evolving into multiprocessor systems. It is our belief that the most cost effective use of this environment will depend on the development of application systems structured into cooperating concurrent program modules (processes) distributed appropriately over differentmore » nodes of the environment. A node is defined as one or more processors with a local (shared) high speed memory. Given the latter view, the environment can be characterized as consisting of: multiple nodes communicating over noisy channels with arbitrary delays and throughput, heterogenous base resources and information encodings, no single administration controlling all resources, distributed system state, and no uniform time base. The system design problem is - how to turn the heterogeneous base hardware/firmware/software resources of this environment into a coherent set of resources that facilitate development of cost effective, reliable, and human engineered applications. We believe the answer lies in developing a layered, communication oriented distributed system architecture; layered and modular to support ease of understanding, reconfiguration, extensibility, and hiding of implementation or nonessential local details; communication oriented because that is a central feature of the environment. The Livermore Interactive Network Communication System (LINCS) is a hierarchical architecture designed to meet the above needs. While having characteristics in common with other architectures, it differs in several respects.« less
Flexible Language Constructs for Large Parallel Programs
Rosing, Matt; Schnabel, Robert
1994-01-01
The goal of the research described in this article is to develop flexible language constructs for writing large data parallel numerical programs for distributed memory (multiple instruction multiple data [MIMD]) multiprocessors. Previously, several models have been developed to support synchronization and communication. Models for global synchronization include single instruction multiple data (SIMD), single program multiple data (SPMD), and sequential programs annotated with data distribution statements. The two primary models for communication include implicit communication based on shared memory and explicit communication based on messages. None of these models by themselves seem sufficient to permit the natural and efficient expression ofmore » the variety of algorithms that occur in large scientific computations. In this article, we give an overview of a new language that combines many of these programming models in a clean manner. This is done in a modular fashion such that different models can be combined to support large programs. Within a module, the selection of a model depends on the algorithm and its efficiency requirements. In this article, we give an overview of the language and discuss some of the critical implementation details.« less
A fault-tolerant multiprocessor architecture for aircraft, volume 1. [autopilot configuration
NASA Technical Reports Server (NTRS)
Smith, T. B.; Hopkins, A. L.; Taylor, W.; Ausrotas, R. A.; Lala, J. H.; Hanley, L. D.; Martin, J. H.
1978-01-01
A fault-tolerant multiprocessor architecture is reported. This architecture, together with a comprehensive information system architecture, has important potential for future aircraft applications. A preliminary definition and assessment of a suitable multiprocessor architecture for such applications is developed.
USC orthogonal multiprocessor for image processing with neural networks
NASA Astrophysics Data System (ADS)
Hwang, Kai; Panda, Dhabaleswar K.; Haddadi, Navid
1990-07-01
This paper presents the architectural features and imaging applications of the Orthogonal MultiProcessor (OMP) system, which is under construction at the University of Southern California with research funding from NSF and assistance from several industrial partners. The prototype OMP is being built with 16 Intel i860 RISC microprocessors and 256 parallel memory modules using custom-designed spanning buses, which are 2-D interleaved and orthogonally accessed without conflicts. The 16-processor OMP prototype is targeted to achieve 430 MIPS and 600 Mflops, which have been verified by simulation experiments based on the design parameters used. The prototype OMP machine will be initially applied for image processing, computer vision, and neural network simulation applications. We summarize important vision and imaging algorithms that can be restructured with neural network models. These algorithms can efficiently run on the OMP hardware with linear speedup. The ultimate goal is to develop a high-performance Visual Computer (Viscom) for integrated low- and high-level image processing and vision tasks.
Queueing analysis of a canonical model of real-time multiprocessors
NASA Technical Reports Server (NTRS)
Krishna, C. M.; Shin, K. G.
1983-01-01
A logical classification of multiprocessor structures from the point of view of control applications is presented. A computation of the response time distribution for a canonical model of a real time multiprocessor is presented. The multiprocessor is approximated by a blocking model. Two separate models are derived: one created from the system's point of view, and the other from the point of view of an incoming task.
Fast and Accurate Simulation of the Cray XMT Multithreaded Supercomputer
DOE Office of Scientific and Technical Information (OSTI.GOV)
Villa, Oreste; Tumeo, Antonino; Secchi, Simone
Irregular applications, such as data mining and analysis or graph-based computations, show unpredictable memory/network access patterns and control structures. Highly multithreaded architectures with large processor counts, like the Cray MTA-1, MTA-2 and XMT, appear to address their requirements better than commodity clusters. However, the research on highly multithreaded systems is currently limited by the lack of adequate architectural simulation infrastructures due to issues such as size of the machines, memory footprint, simulation speed, accuracy and customization. At the same time, Shared-memory MultiProcessors (SMPs) with multi-core processors have become an attractive platform to simulate large scale machines. In this paper, wemore » introduce a cycle-level simulator of the highly multithreaded Cray XMT supercomputer. The simulator runs unmodified XMT applications. We discuss how we tackled the challenges posed by its development, detailing the techniques introduced to make the simulation as fast as possible while maintaining a high accuracy. By mapping XMT processors (ThreadStorm with 128 hardware threads) to host computing cores, the simulation speed remains constant as the number of simulated processors increases, up to the number of available host cores. The simulator supports zero-overhead switching among different accuracy levels at run-time and includes a network model that takes into account contention. On a modern 48-core SMP host, our infrastructure simulates a large set of irregular applications 500 to 2000 times slower than real time when compared to a 128-processor XMT, while remaining within 10\\% of accuracy. Emulation is only from 25 to 200 times slower than real time.« less
Kirk, Andrew G; Plant, David V; Szymanski, Ted H; Vranesic, Zvonko G; Tooley, Frank A P; Rolston, David R; Ayliffe, Michael H; Lacroix, Frederic K; Robertson, Brian; Bernier, Eric; Brosseau, Daniel F
2003-05-10
Design and implementation of a free-space optical backplane for multiprocessor applications is presented. The system is designed to interconnect four multiprocessor nodes that communicate by using multiplexed 32-bit packets. Each multiprocessor node is electrically connected to an optoelectronic VLSI chip which implements the hyperplane interconnection architecture. The chips each contain 256 optical transmitters (implemented as dual-rail multiple quantum-well modulators) and 256 optical receivers. A rigid free-space microoptical interconnection system that interconnects the transceiver chips in a 512-channel unidirectional ring is implemented. Full design, implementation, and operational details are provided.
NASA Astrophysics Data System (ADS)
Kirk, Andrew G.; Plant, David V.; Szymanski, Ted H.; Vranesic, Zvonko G.; Tooley, Frank A. P.; Rolston, David R.; Ayliffe, Michael H.; Lacroix, Frederic K.; Robertson, Brian; Bernier, Eric; Brosseau, Daniel F.
2003-05-01
Design and implementation of a free-space optical backplane for multiprocessor applications is presented. The system is designed to interconnect four multiprocessor nodes that communicate by using multiplexed 32-bit packets. Each multiprocessor node is electrically connected to an optoelectronic VLSI chip which implements the hyperplane interconnection architecture. The chips each contain 256 optical transmitters (implemented as dual-rail multiple quantum-well modulators) and 256 optical receivers. A rigid free-space microoptical interconnection system that interconnects the transceiver chips in a 512-channel unidirectional ring is implemented. Full design, implementation, and operational details are provided.
Managing coherence via put/get windows
Blumrich, Matthias A [Ridgefield, CT; Chen, Dong [Croton on Hudson, NY; Coteus, Paul W [Yorktown Heights, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Philip [Cortlandt Manor, NY; Hoenicke, Dirk [Ossining, NY; Ohmacht, Martin [Yorktown Heights, NY
2011-01-11
A method and apparatus for managing coherence between two processors of a two processor node of a multi-processor computer system. Generally the present invention relates to a software algorithm that simplifies and significantly speeds the management of cache coherence in a message passing parallel computer, and to hardware apparatus that assists this cache coherence algorithm. The software algorithm uses the opening and closing of put/get windows to coordinate the activated required to achieve cache coherence. The hardware apparatus may be an extension to the hardware address decode, that creates, in the physical memory address space of the node, an area of virtual memory that (a) does not actually exist, and (b) is therefore able to respond instantly to read and write requests from the processing elements.
Managing coherence via put/get windows
Blumrich, Matthias A [Ridgefield, CT; Chen, Dong [Croton on Hudson, NY; Coteus, Paul W [Yorktown Heights, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Philip [Cortlandt Manor, NY; Hoenicke, Dirk [Ossining, NY; Ohmacht, Martin [Yorktown Heights, NY
2012-02-21
A method and apparatus for managing coherence between two processors of a two processor node of a multi-processor computer system. Generally the present invention relates to a software algorithm that simplifies and significantly speeds the management of cache coherence in a message passing parallel computer, and to hardware apparatus that assists this cache coherence algorithm. The software algorithm uses the opening and closing of put/get windows to coordinate the activated required to achieve cache coherence. The hardware apparatus may be an extension to the hardware address decode, that creates, in the physical memory address space of the node, an area of virtual memory that (a) does not actually exist, and (b) is therefore able to respond instantly to read and write requests from the processing elements.
Design and implementation of highly parallel pipelined VLSI systems
NASA Astrophysics Data System (ADS)
Delange, Alphonsus Anthonius Jozef
A methodology and its realization as a prototype CAD (Computer Aided Design) system for the design and analysis of complex multiprocessor systems is presented. The design is an iterative process in which the behavioral specifications of the system components are refined into structural descriptions consisting of interconnections and lower level components etc. A model for the representation and analysis of multiprocessor systems at several levels of abstraction and an implementation of a CAD system based on this model are described. A high level design language, an object oriented development kit for tool design, a design data management system, and design and analysis tools such as a high level simulator and graphics design interface which are integrated into the prototype system and graphics interface are described. Procedures for the synthesis of semiregular processor arrays, and to compute the switching of input/output signals, memory management and control of processor array, and sequencing and segmentation of input/output data streams due to partitioning and clustering of the processor array during the subsequent synthesis steps, are described. The architecture and control of a parallel system is designed and each component mapped to a module or module generator in a symbolic layout library, compacted for design rules of VLSI (Very Large Scale Integration) technology. An example of the design of a processor that is a useful building block for highly parallel pipelined systems in the signal/image processing domains is given.
Two Fundamental Issues in Multiprocessing.
1987-10-01
Structural Model of a Multiprocessor 6 Figure 5: Operational Model of a Multiprocessor 7 Figure 6: The von Neumann Processor (from Gajski and Peir [201) 10...Computer Society, June, 1983. 20. Gajski , D. D. & J-K. Peir. "Essential Issues in Multiprocessor Systems". Computer 18, 6 (June 1985), 9-27. 21. Gurd
The Galley Parallel File System
NASA Technical Reports Server (NTRS)
Nieuwejaar, Nils; Kotz, David
1996-01-01
Most current multiprocessor file systems are designed to use multiple disks in parallel, using the high aggregate bandwidth to meet the growing I/0 requirements of parallel scientific applications. Many multiprocessor file systems provide applications with a conventional Unix-like interface, allowing the application to access multiple disks transparently. This interface conceals the parallelism within the file system, increasing the ease of programmability, but making it difficult or impossible for sophisticated programmers and libraries to use knowledge about their I/O needs to exploit that parallelism. In addition to providing an insufficient interface, most current multiprocessor file systems are optimized for a different workload than they are being asked to support. We introduce Galley, a new parallel file system that is intended to efficiently support realistic scientific multiprocessor workloads. We discuss Galley's file structure and application interface, as well as the performance advantages offered by that interface.
Multi-GPU and multi-CPU accelerated FDTD scheme for vibroacoustic applications
NASA Astrophysics Data System (ADS)
Francés, J.; Otero, B.; Bleda, S.; Gallego, S.; Neipp, C.; Márquez, A.; Beléndez, A.
2015-06-01
The Finite-Difference Time-Domain (FDTD) method is applied to the analysis of vibroacoustic problems and to study the propagation of longitudinal and transversal waves in a stratified media. The potential of the scheme and the relevance of each acceleration strategy for massively computations in FDTD are demonstrated in this work. In this paper, we propose two new specific implementations of the bi-dimensional scheme of the FDTD method using multi-CPU and multi-GPU, respectively. In the first implementation, an open source message passing interface (OMPI) has been included in order to massively exploit the resources of a biprocessor station with two Intel Xeon processors. Moreover, regarding CPU code version, the streaming SIMD extensions (SSE) and also the advanced vectorial extensions (AVX) have been included with shared memory approaches that take advantage of the multi-core platforms. On the other hand, the second implementation called the multi-GPU code version is based on Peer-to-Peer communications available in CUDA on two GPUs (NVIDIA GTX 670). Subsequently, this paper presents an accurate analysis of the influence of the different code versions including shared memory approaches, vector instructions and multi-processors (both CPU and GPU) and compares them in order to delimit the degree of improvement of using distributed solutions based on multi-CPU and multi-GPU. The performance of both approaches was analysed and it has been demonstrated that the addition of shared memory schemes to CPU computing improves substantially the performance of vector instructions enlarging the simulation sizes that use efficiently the cache memory of CPUs. In this case GPU computing is slightly twice times faster than the fine tuned CPU version in both cases one and two nodes. However, for massively computations explicit vector instructions do not worth it since the memory bandwidth is the limiting factor and the performance tends to be the same than the sequential version with auto-vectorisation and also shared memory approach. In this scenario GPU computing is the best option since it provides a homogeneous behaviour. More specifically, the speedup of GPU computing achieves an upper limit of 12 for both one and two GPUs, whereas the performance reaches peak values of 80 GFlops and 146 GFlops for the performance for one GPU and two GPUs respectively. Finally, the method is applied to an earth crust profile in order to demonstrate the potential of our approach and the necessity of applying acceleration strategies in these type of applications.
Spaceborne VHSIC multiprocessor system for AI applications
NASA Technical Reports Server (NTRS)
Lum, Henry, Jr.; Shrobe, Howard E.; Aspinall, John G.
1988-01-01
A multiprocessor system, under design for space-station applications, makes use of the latest generation symbolic processor and packaging technology. The result will be a compact, space-qualified system two to three orders of magnitude more powerful than present-day symbolic processing systems.
Mapping of H.264 decoding on a multiprocessor architecture
NASA Astrophysics Data System (ADS)
van der Tol, Erik B.; Jaspers, Egbert G.; Gelderblom, Rob H.
2003-05-01
Due to the increasing significance of development costs in the competitive domain of high-volume consumer electronics, generic solutions are required to enable reuse of the design effort and to increase the potential market volume. As a result from this, Systems-on-Chip (SoCs) contain a growing amount of fully programmable media processing devices as opposed to application-specific systems, which offered the most attractive solutions due to a high performance density. The following motivates this trend. First, SoCs are increasingly dominated by their communication infrastructure and embedded memory, thereby making the cost of the functional units less significant. Moreover, the continuously growing design costs require generic solutions that can be applied over a broad product range. Hence, powerful programmable SoCs are becoming increasingly attractive. However, to enable power-efficient designs, that are also scalable over the advancing VLSI technology, parallelism should be fully exploited. Both task-level and instruction-level parallelism can be provided by means of e.g. a VLIW multiprocessor architecture. To provide the above-mentioned scalability, we propose to partition the data over the processors, instead of traditional functional partitioning. An advantage of this approach is the inherent locality of data, which is extremely important for communication-efficient software implementations. Consequently, a software implementation is discussed, enabling e.g. SD resolution H.264 decoding with a two-processor architecture, whereas High-Definition (HD) decoding can be achieved with an eight-processor system, executing the same software. Experimental results show that the data communication considerably reduces up to 65% directly improving the overall performance. Apart from considerable improvement in memory bandwidth, this novel concept of partitioning offers a natural approach for optimally balancing the load of all processors, thereby further improving the overall speedup.
NASA Astrophysics Data System (ADS)
Lohn, Stefan B.; Dong, Xin; Carminati, Federico
2012-12-01
Chip-Multiprocessors are going to support massive parallelism by many additional physical and logical cores. Improving performance can no longer be obtained by increasing clock-frequency because the technical limits are almost reached. Instead, parallel execution must be used to gain performance. Resources like main memory, the cache hierarchy, bandwidth of the memory bus or links between cores and sockets are not going to be improved as fast. Hence, parallelism can only result into performance gains if the memory usage is optimized and the communication between threads is minimized. Besides concurrent programming has become a domain for experts. Implementing multi-threading is error prone and labor-intensive. A full reimplementation of the whole AliRoot source-code is unaffordable. This paper describes the effort to evaluate the adaption of AliRoot to the needs of multi-threading and to provide the capability of parallel processing by using a semi-automatic source-to-source transformation to address the problems as described before and to provide a straight-forward way of parallelization with almost no interference between threads. This makes the approach simple and reduces the required manual changes in the code. In a first step, unconditional thread-safety will be introduced to bring the original sequential and thread unaware source-code into the position of utilizing multi-threading. Afterwards further investigations have to be performed to point out candidates of classes that are useful to share amongst threads. Then in a second step, the transformation has to change the code to share these classes and finally to verify if there are anymore invalid interferences between threads.
Simplifying and speeding the management of intra-node cache coherence
Blumrich, Matthias A [Ridgefield, CT; Chen, Dong [Croton on Hudson, NY; Coteus, Paul W [Yorktown Heights, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Phillip [Cortlandt Manor, NY; Hoenicke, Dirk [Ossining, NY; Ohmacht, Martin [Yorktown Heights, NY
2012-04-17
A method and apparatus for managing coherence between two processors of a two processor node of a multi-processor computer system. Generally the present invention relates to a software algorithm that simplifies and significantly speeds the management of cache coherence in a message passing parallel computer, and to hardware apparatus that assists this cache coherence algorithm. The software algorithm uses the opening and closing of put/get windows to coordinate the activated required to achieve cache coherence. The hardware apparatus may be an extension to the hardware address decode, that creates, in the physical memory address space of the node, an area of virtual memory that (a) does not actually exist, and (b) is therefore able to respond instantly to read and write requests from the processing elements.
Managing coherence via put/get windows
DOE Office of Scientific and Technical Information (OSTI.GOV)
Blumrich, Matthias A; Chen, Dong; Coteus, Paul W
A method and apparatus for managing coherence between two processors of a two processor node of a multi-processor computer system. Generally the present invention relates to a software algorithm that simplifies and significantly speeds the management of cache coherence in a message passing parallel computer, and to hardware apparatus that assists this cache coherence algorithm. The software algorithm uses the opening and closing of put/get windows to coordinate the activated required to achieve cache coherence. The hardware apparatus may be an extension to the hardware address decode, that creates, in the physical memory address space of the node, an areamore » of virtual memory that (a) does not actually exist, and (b) is therefore able to respond instantly to read and write requests from the processing elements.« less
Communication Studies of DMP and SMP Machines
NASA Technical Reports Server (NTRS)
Sohn, Andrew; Biswas, Rupak; Chancellor, Marisa K. (Technical Monitor)
1997-01-01
Understanding the interplay between machines and problems is key to obtaining high performance on parallel machines. This paper investigates the interplay between programming paradigms and communication capabilities of parallel machines. In particular, we explicate the communication capabilities of the IBM SP-2 distributed-memory multiprocessor and the SGI PowerCHALLENGEarray symmetric multiprocessor. Two benchmark problems of bitonic sorting and Fast Fourier Transform are selected for experiments. Communication-efficient algorithms are developed to exploit the overlapping capabilities of the machines. Programs are written in Message-Passing Interface for portability and identical codes are used for both machines. Various data sizes and message sizes are used to test the machines' communication capabilities. Experimental results indicate that the communication performance of the multiprocessors are consistent with the size of messages. The SP-2 is sensitive to message size but yields a much higher communication overlapping because of the communication co-processor. The PowerCHALLENGEarray is not highly sensitive to message size and yields a low communication overlapping. Bitonic sorting yields lower performance compared to FFT due to a smaller computation-to-communication ratio.
NASA Technical Reports Server (NTRS)
Stehle, Roy H.; Ogier, Richard G.
1993-01-01
Alternatives for realizing a packet-based network switch for use on a frequency division multiple access/time division multiplexed (FDMA/TDM) geostationary communication satellite were investigated. Each of the eight downlink beams supports eight directed dwells. The design needed to accommodate multicast packets with very low probability of loss due to contention. Three switch architectures were designed and analyzed. An output-queued, shared bus system yielded a functionally simple system, utilizing a first-in, first-out (FIFO) memory per downlink dwell, but at the expense of a large total memory requirement. A shared memory architecture offered the most efficiency in memory requirements, requiring about half the memory of the shared bus design. The processing requirement for the shared-memory system adds system complexity that may offset the benefits of the smaller memory. An alternative design using a shared memory buffer per downlink beam decreases circuit complexity through a distributed design, and requires at most 1000 packets of memory more than the completely shared memory design. Modifications to the basic packet switch designs were proposed to accommodate circuit-switched traffic, which must be served on a periodic basis with minimal delay. Methods for dynamically controlling the downlink dwell lengths were developed and analyzed. These methods adapt quickly to changing traffic demands, and do not add significant complexity or cost to the satellite and ground station designs. Methods for reducing the memory requirement by not requiring the satellite to store full packets were also proposed and analyzed. In addition, optimal packet and dwell lengths were computed as functions of memory size for the three switch architectures.
NASA Technical Reports Server (NTRS)
Clement, Bradley J.; Estlin, Tara A.; Bornstein, Benjamin J.
2013-01-01
The Mobile Thread Task Manager (MTTM) is being applied to parallelizing existing flight software to understand the benefits and to develop new techniques and architectural concepts for adapting software to multicore architectures. It allocates and load-balances tasks for a group of threads that migrate across processors to improve cache performance. In order to balance-load across threads, the MTTM augments a basic map-reduce strategy to draw jobs from a global queue. In a multicore processor, memory may be "homed" to the cache of a specific processor and must be accessed from that processor. The MTTB architecture wraps access to data with thread management to move threads to the home processor for that data so that the computation follows the data in an attempt to avoid L2 cache misses. Cache homing is also handled by a memory manager that translates identifiers to processor IDs where the data will be homed (according to rules defined by the user). The user can also specify the number of threads and processors separately, which is important for tuning performance for different patterns of computation and memory access. MTTM efficiently processes tasks in parallel on a multiprocessor computer. It also provides an interface to make it easier to adapt existing software to a multiprocessor environment.
NASA Technical Reports Server (NTRS)
Clune, E.; Segall, Z.; Siewiorek, D.
1984-01-01
A program of experiments has been conducted at NASA-Langley to test the fault-free performance of a Fault-Tolerant Multiprocessor (FTMP) avionics system for next-generation aircraft. Baseline measurements of an operating FTMP system were obtained with respect to the following parameters: instruction execution time, frame size, and the variation of clock ticks. The mechanisms of frame stretching were also investigated. The experimental results are summarized in a table. Areas of interest for future tests are identified, with emphasis given to the implementation of a synthetic workload generation mechanism on FTMP.
Low Latency Messages on Distributed Memory Multiprocessors
Rosing, Matt; Saltz, Joel
1995-01-01
This article describes many of the issues in developing an efficient interface for communication on distributed memory machines. Although the hardware component of message latency is less than 1 ws on many distributed memory machines, the software latency associated with sending and receiving typed messages is on the order of 50 μs. The reason for this imbalance is that the software interface does not match the hardware. By changing the interface to match the hardware more closely, applications with fine grained communication can be put on these machines. This article describes several tests performed and many of the issues involvedmore » in supporting low latency messages on distributed memory machines.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sofronov, I.D.; Voronin, B.L.; Butnev, O.I.
1997-12-31
The aim of the work performed is to develop a 3D parallel program for numerical calculation of gas dynamics problem with heat conductivity on distributed memory computational systems (CS), satisfying the condition of numerical result independence from the number of processors involved. Two basically different approaches to the structure of massive parallel computations have been developed. The first approach uses the 3D data matrix decomposition reconstructed at temporal cycle and is a development of parallelization algorithms for multiprocessor CS with shareable memory. The second approach is based on using a 3D data matrix decomposition not reconstructed during a temporal cycle.more » The program was developed on 8-processor CS MP-3 made in VNIIEF and was adapted to a massive parallel CS Meiko-2 in LLNL by joint efforts of VNIIEF and LLNL staffs. A large number of numerical experiments has been carried out with different number of processors up to 256 and the efficiency of parallelization has been evaluated in dependence on processor number and their parameters.« less
Reproducibility in a multiprocessor system
Bellofatto, Ralph A; Chen, Dong; Coteus, Paul W; Eisley, Noel A; Gara, Alan; Gooding, Thomas M; Haring, Rudolf A; Heidelberger, Philip; Kopcsay, Gerard V; Liebsch, Thomas A; Ohmacht, Martin; Reed, Don D; Senger, Robert M; Steinmacher-Burow, Burkhard; Sugawara, Yutaka
2013-11-26
Fixing a problem is usually greatly aided if the problem is reproducible. To ensure reproducibility of a multiprocessor system, the following aspects are proposed; a deterministic system start state, a single system clock, phase alignment of clocks in the system, system-wide synchronization events, reproducible execution of system components, deterministic chip interfaces, zero-impact communication with the system, precise stop of the system and a scan of the system state.
Low latency messages on distributed memory multiprocessors
NASA Technical Reports Server (NTRS)
Rosing, Matthew; Saltz, Joel
1993-01-01
Many of the issues in developing an efficient interface for communication on distributed memory machines are described and a portable interface is proposed. Although the hardware component of message latency is less than one microsecond on many distributed memory machines, the software latency associated with sending and receiving typed messages is on the order of 50 microseconds. The reason for this imbalance is that the software interface does not match the hardware. By changing the interface to match the hardware more closely, applications with fine grained communication can be put on these machines. Based on several tests that were run on the iPSC/860, an interface that will better match current distributed memory machines is proposed. The model used in the proposed interface consists of a computation processor and a communication processor on each node. Communication between these processors and other nodes in the system is done through a buffered network. Information that is transmitted is either data or procedures to be executed on the remote processor. The dual processor system is better suited for efficiently handling asynchronous communications compared to a single processor system. The ability to send data or procedure is very flexible for minimizing message latency, based on the type of communication being performed. The test performed and the proposed interface are described.
Modelling parallel programs and multiprocessor architectures with AXE
NASA Technical Reports Server (NTRS)
Yan, Jerry C.; Fineman, Charles E.
1991-01-01
AXE, An Experimental Environment for Parallel Systems, was designed to model and simulate for parallel systems at the process level. It provides an integrated environment for specifying computation models, multiprocessor architectures, data collection, and performance visualization. AXE is being used at NASA-Ames for developing resource management strategies, parallel problem formulation, multiprocessor architectures, and operating system issues related to the High Performance Computing and Communications Program. AXE's simple, structured user-interface enables the user to model parallel programs and machines precisely and efficiently. Its quick turn-around time keeps the user interested and productive. AXE models multicomputers. The user may easily modify various architectural parameters including the number of sites, connection topologies, and overhead for operating system activities. Parallel computations in AXE are represented as collections of autonomous computing objects known as players. Their use and behavior is described. Performance data of the multiprocessor model can be observed on a color screen. These include CPU and message routing bottlenecks, and the dynamic status of the software.
HEP - A semaphore-synchronized multiprocessor with central control. [Heterogeneous Element Processor
NASA Technical Reports Server (NTRS)
Gilliland, M. C.; Smith, B. J.; Calvert, W.
1976-01-01
The paper describes the design concept of the Heterogeneous Element Processor (HEP), a system tailored to the special needs of scientific simulation. In order to achieve high-speed computation required by simulation, HEP features a hierarchy of processes executing in parallel on a number of processors, with synchronization being largely accomplished by hardware. A full-empty-reserve scheme of synchronization is realized by zero-one-valued hardware semaphores. A typical system has, besides the control computer and the scheduler, an algebraic module, a memory module, a first-in first-out (FIFO) module, an integrator module, and an I/O module. The architecture of the scheduler and the algebraic module is examined in detail.
Software Mechanisms for Multiprocessor TLB Consistency
1989-12-01
thanks. Raj V aswani implemented the DASH message-passing system. Ramesh Govindan implemented part of the DASH virtual memory system. G . Scott...1 Latency (ma) 5 ---·-r··-·r···r··-T·-·-r··-·r····r····-r··-·r··- . g ~~~R=r 18 -----1·--·-r··-T·----1·--··r·---1·----1·--··r·---T·----1 16...model development. Synchronizing TLBs is similar to updating replicated data in a distributed environment. Lee and Garcia-Molina both used an M/ G /1
High Performance, Dependable Multiprocessor
NASA Technical Reports Server (NTRS)
Ramos, Jeremy; Samson, John R.; Troxel, Ian; Subramaniyan, Rajagopal; Jacobs, Adam; Greco, James; Cieslewski, Grzegorz; Curreri, John; Fischer, Michael; Grobelny, Eric;
2006-01-01
With the ever increasing demand for higher bandwidth and processing capacity of today's space exploration, space science, and defense missions, the ability to efficiently apply commercial-off-the-shelf (COTS) processors for on-board computing is now a critical need. In response to this need, NASA's New Millennium Program office has commissioned the development of Dependable Multiprocessor (DM) technology for use in payload and robotic missions. The Dependable Multiprocessor technology is a COTS-based, power efficient, high performance, highly dependable, fault tolerant cluster computer. To date, Honeywell has successfully demonstrated a TRL4 prototype of the Dependable Multiprocessor [I], and is now working on the development of a TRLS prototype. For the present effort Honeywell has teamed up with the University of Florida's High-performance Computing and Simulation (HCS) Lab, and together the team has demonstrated major elements of the Dependable Multiprocessor TRLS system.
Shared Semantics and the Use of Organizational Memories for E-Mail Communications.
ERIC Educational Resources Information Center
Schwartz, David G.
1998-01-01
Examines the use of shared semantics information to link concepts in an organizational memory to e-mail communications. Presents a framework for determining shared semantics based on organizational and personal user profiles. Illustrates how shared semantics are used by the HyperMail system to help link organizational memories (OM) content to…
Multiprocessor programming environment
DOE Office of Scientific and Technical Information (OSTI.GOV)
Smith, M.B.; Fornaro, R.
Programming tools and techniques have been well developed for traditional uniprocessor computer systems. The focus of this research project is on the development of a programming environment for a high speed real time heterogeneous multiprocessor system, with special emphasis on languages and compilers. The new tools and techniques will allow a smooth transition for programmers with experience only on single processor systems.
NASA Astrophysics Data System (ADS)
Lai, Siyan; Xu, Ying; Shao, Bo; Guo, Menghan; Lin, Xiaola
2017-04-01
In this paper we study on Monte Carlo method for solving systems of linear algebraic equations (SLAE) based on shared memory. Former research demostrated that GPU can effectively speed up the computations of this issue. Our purpose is to optimize Monte Carlo method simulation on GPUmemoryachritecture specifically. Random numbers are organized to storein shared memory, which aims to accelerate the parallel algorithm. Bank conflicts can be avoided by our Collaborative Thread Arrays(CTA)scheme. The results of experiments show that the shared memory based strategy can speed up the computaions over than 3X at most.
An efficient parallel-processing method for transposing large matrices in place.
Portnoff, M R
1999-01-01
We have developed an efficient algorithm for transposing large matrices in place. The algorithm is efficient because data are accessed either sequentially in blocks or randomly within blocks small enough to fit in cache, and because the same indexing calculations are shared among identical procedures operating on independent subsets of the data. This inherent parallelism makes the method well suited for a multiprocessor computing environment. The algorithm is easy to implement because the same two procedures are applied to the data in various groupings to carry out the complete transpose operation. Using only a single processor, we have demonstrated nearly an order of magnitude increase in speed over the previously published algorithm by Gate and Twigg for transposing a large rectangular matrix in place. With multiple processors operating in parallel, the processing speed increases almost linearly with the number of processors. A simplified version of the algorithm for square matrices is presented as well as an extension for matrices large enough to require virtual memory.
OSCAR API for Real-Time Low-Power Multicores and Its Performance on Multicores and SMP Servers
NASA Astrophysics Data System (ADS)
Kimura, Keiji; Mase, Masayoshi; Mikami, Hiroki; Miyamoto, Takamichi; Shirako, Jun; Kasahara, Hironori
OSCAR (Optimally Scheduled Advanced Multiprocessor) API has been designed for real-time embedded low-power multicores to generate parallel programs for various multicores from different vendors by using the OSCAR parallelizing compiler. The OSCAR API has been developed by Waseda University in collaboration with Fujitsu Laboratory, Hitachi, NEC, Panasonic, Renesas Technology, and Toshiba in an METI/NEDO project entitled "Multicore Technology for Realtime Consumer Electronics." By using the OSCAR API as an interface between the OSCAR compiler and backend compilers, the OSCAR compiler enables hierarchical multigrain parallel processing with memory optimization under capacity restriction for cache memory, local memory, distributed shared memory, and on-chip/off-chip shared memory; data transfer using a DMA controller; and power reduction control using DVFS (Dynamic Voltage and Frequency Scaling), clock gating, and power gating for various embedded multicores. In addition, a parallelized program automatically generated by the OSCAR compiler with OSCAR API can be compiled by the ordinary OpenMP compilers since the OSCAR API is designed on a subset of the OpenMP. This paper describes the OSCAR API and its compatibility with the OSCAR compiler by showing code examples. Performance evaluations of the OSCAR compiler and the OSCAR API are carried out using an IBM Power5+ workstation, an IBM Power6 high-end SMP server, and a newly developed consumer electronics multicore chip RP2 by Renesas, Hitachi and Waseda. From the results of scalability evaluation, it is found that on an average, the OSCAR compiler with the OSCAR API can exploit 5.8 times speedup over the sequential execution on the Power5+ workstation with eight cores and 2.9 times speedup on RP2 with four cores, respectively. In addition, the OSCAR compiler can accelerate an IBM XL Fortran compiler up to 3.3 times on the Power6 SMP server. Due to low-power optimization on RP2, the OSCAR compiler with the OSCAR API achieves a maximum power reduction of 84% in the real-time execution mode.
DOE Office of Scientific and Technical Information (OSTI.GOV)
De Supinski, B.; Caliga, D.
2017-09-28
The primary objective of this project was to develop memory optimization technology to efficiently deliver data to, and distribute data within, the SRC-6's Field Programmable Gate Array- ("FPGA") based Multi-Adaptive Processors (MAPs). The hardware/software approach was to explore efficient MAP configurations and generate the compiler technology to exploit those configurations. This memory accessing technology represents an important step towards making reconfigurable symmetric multi-processor (SMP) architectures that will be a costeffective solution for large-scale scientific computing.
Parallel program debugging with flowback analysis
DOE Office of Scientific and Technical Information (OSTI.GOV)
Choi, Jongdeok.
1989-01-01
This thesis describes the design and implementation of an integrated debugging system for parallel programs running on shared memory multi-processors. The goal of the debugging system is to present to the programmer a graphical view of the dynamic program dependences while keeping the execution-time overhead low. The author first describes the use of flowback analysis to provide information on causal relationship between events in a programs' execution without re-executing the program for debugging. Execution time overhead is kept low by recording only a small amount of trace during a program's execution. He uses semantic analysis and a technique called incrementalmore » tracing to keep the time and space overhead low. As part of the semantic analysis, he uses a static program dependence graph structure that reduces the amount of work done at compile time and takes advantage of the dynamic information produced during execution time. The cornerstone of the incremental tracing concept is to generate a coarse trace during execution and fill incrementally, during the interactive portion of the debugging session, the gap between the information gathered in the coarse trace and the information needed to do the flowback analysis using the coarse trace. Then, he describes how to extend the flowback analysis to parallel programs. The flowback analysis can span process boundaries; i.e., the most recent modification to a shared variable might be traced to a different process than the one that contains the current reference. The static and dynamic program dependence graphs of the individual processes are tied together with synchronization and data dependence information to form complete graphs that represent the entire program.« less
NASA Technical Reports Server (NTRS)
Farhat, Charbel; Lesoinne, Michel
1993-01-01
Most of the recently proposed computational methods for solving partial differential equations on multiprocessor architectures stem from the 'divide and conquer' paradigm and involve some form of domain decomposition. For those methods which also require grids of points or patches of elements, it is often necessary to explicitly partition the underlying mesh, especially when working with local memory parallel processors. In this paper, a family of cost-effective algorithms for the automatic partitioning of arbitrary two- and three-dimensional finite element and finite difference meshes is presented and discussed in view of a domain decomposed solution procedure and parallel processing. The influence of the algorithmic aspects of a solution method (implicit/explicit computations), and the architectural specifics of a multiprocessor (SIMD/MIMD, startup/transmission time), on the design of a mesh partitioning algorithm are discussed. The impact of the partitioning strategy on load balancing, operation count, operator conditioning, rate of convergence and processor mapping is also addressed. Finally, the proposed mesh decomposition algorithms are demonstrated with realistic examples of finite element, finite volume, and finite difference meshes associated with the parallel solution of solid and fluid mechanics problems on the iPSC/2 and iPSC/860 multiprocessors.
Monitoring of computing resource use of active software releases at ATLAS
NASA Astrophysics Data System (ADS)
Limosani, Antonio; ATLAS Collaboration
2017-10-01
The LHC is the world’s most powerful particle accelerator, colliding protons at centre of mass energy of 13 TeV. As the energy and frequency of collisions has grown in the search for new physics, so too has demand for computing resources needed for event reconstruction. We will report on the evolution of resource usage in terms of CPU and RAM in key ATLAS offline reconstruction workflows at the TierO at CERN and on the WLCG. Monitoring of workflows is achieved using the ATLAS PerfMon package, which is the standard ATLAS performance monitoring system running inside Athena jobs. Systematic daily monitoring has recently been expanded to include all workflows beginning at Monte Carlo generation through to end-user physics analysis, beyond that of event reconstruction. Moreover, the move to a multiprocessor mode in production jobs has facilitated the use of tools, such as “MemoryMonitor”, to measure the memory shared across processors in jobs. Resource consumption is broken down into software domains and displayed in plots generated using Python visualization libraries and collected into pre-formatted auto-generated Web pages, which allow the ATLAS developer community to track the performance of their algorithms. This information is however preferentially filtered to domain leaders and developers through the use of JIRA and via reports given at ATLAS software meetings. Finally, we take a glimpse of the future by reporting on the expected CPU and RAM usage in benchmark workflows associated with the High Luminosity LHC and anticipate the ways performance monitoring will evolve to understand and benchmark future workflows.
Method and apparatus for efficiently tracking queue entries relative to a timestamp
Blumrich, Matthias A.; Chen, Dong; Gara, Alan G.; Giampapa, Mark E.; Heidelberger, Philip; Ohmacht, Martin; Salapura, Velentina; Vranas, Pavlos
2014-06-17
An apparatus and method for tracking coherence event signals transmitted in a multiprocessor system. The apparatus comprises a coherence logic unit, each unit having a plurality of queue structures with each queue structure associated with a respective sender of event signals transmitted in the system. A timing circuit associated with a queue structure controls enqueuing and dequeuing of received coherence event signals, and, a counter tracks a number of coherence event signals remaining enqueued in the queue structure and dequeued since receipt of a timestamp signal. A counter mechanism generates an output signal indicating that all of the coherence event signals present in the queue structure at the time of receipt of the timestamp signal have been dequeued. In one embodiment, the timestamp signal is asserted at the start of a memory synchronization operation and, the output signal indicates that all coherence events present when the timestamp signal was asserted have completed. This signal can then be used as part of the completion condition for the memory synchronization operation.
S-Band POSIX Device Drivers for RTEMS
NASA Technical Reports Server (NTRS)
Lux, James P.; Lang, Minh; Peters, Kenneth J.; Taylor, Gregory H.
2011-01-01
This is a set of POSIX device driver level abstractions in the RTEMS RTOS (Real-Time Executive for Multiprocessor Systems real-time operating system) to SBand radio hardware devices that have been instantiated in an FPGA (field-programmable gate array). These include A/D (analog-to-digital) sample capture, D/A (digital-to-analog) sample playback, PLL (phase-locked-loop) tuning, and PWM (pulse-width-modulation)-controlled gain. This software interfaces to Sband radio hardware in an attached Xilinx Virtex-2 FPGA. It uses plug-and-play device discovery to map memory to device IDs. Instead of interacting with hardware devices directly, using direct-memory mapped access at the application level, this driver provides an application programming interface (API) offering that easily uses standard POSIX function calls. This simplifies application programming, enables portability, and offers an additional level of protection to the hardware. There are three separate device drivers included in this package: sband_device (ADC capture and DAC playback), pll_device (RF front end PLL tuning), and pwm_device (RF front end AGC control).
Static Scheduler for Hard Real-Time Tasks on Multiprocessor Systems
1992-09-01
Foundation of Computer Science, 1980 . [SIM83] Simons, B., "Multiprocessor Scheduling of Unit-Time Jobs with Arbitrary Release Times and Deadlines", SIAM...Research Office Attn: Dr. David Hislop P. O. Box 12211 Research Triangle Park, NC 27709-2211 31. Persistent Data Systems 75 W. Chapel Ridge Road Attn: Dr
A comparison of multiprocessor scheduling methods for iterative data flow architectures
NASA Technical Reports Server (NTRS)
Storch, Matthew
1993-01-01
A comparative study is made between the Algorithm to Architecture Mapping Model (ATAMM) and three other related multiprocessing models from the published literature. The primary focus of all four models is the non-preemptive scheduling of large-grain iterative data flow graphs as required in real-time systems, control applications, signal processing, and pipelined computations. Important characteristics of the models such as injection control, dynamic assignment, multiple node instantiations, static optimum unfolding, range-chart guided scheduling, and mathematical optimization are identified. The models from the literature are compared with the ATAMM for performance, scheduling methods, memory requirements, and complexity of scheduling and design procedures.
A new taxonomy for distributed computer systems based upon operating system structure
NASA Technical Reports Server (NTRS)
Foudriat, E. C.
1985-01-01
Characteristics of the resource structure found in the operating system are considered as a mechanism for classifying distributed computer systems. Since the operating system resources, themselves, are too diversified to provide a consistent classification, the structure upon which resources are built and shared are examined. The location and control character of this indivisibility provides the taxonomy for separating uniprocessors, computer networks, network computers (fully distributed processing systems or decentralized computers) and algorithm and/or data control multiprocessors. The taxonomy is important because it divides machines into a classification that is relevant or important to the client and not the hardware architect. It also defines the character of the kernel O/S structure needed for future computer systems. What constitutes an operating system for a fully distributed processor is discussed in detail.
2015-05-01
LLC and DRAM banks. For each µB task and isolation configuration, we ran experiments with all 256 possible LLC area sizes (given by 1 to 16 ways and 1...isolation on multicoore platforms. In RTAS ’14. [29] H. Yun, G. Yao, R. Pellizzoni, M. Caccamo, and L. Sha . Memory access control in multiprocessor
Realtime multiprocessor for mobile ad hoc networks
NASA Astrophysics Data System (ADS)
Jungeblut, T.; Grünewald, M.; Porrmann, M.; Rückert, U.
2008-05-01
This paper introduces a real-time Multiprocessor System-On-Chip (MPSoC) for low power wireless applications. The multiprocessor is based on eight 32bit RISC processors that are connected via an Network-On-Chip (NoC). The NoC follows a novel approach with guaranteed bandwidth to the application that meets hard realtime requirements. At a clock frequency of 100 MHz the total power consumption of the MPSoC that has been fabricated in 180 nm UMC standard cell technology is 772 mW.
A Multiprocessor Operating System Simulator
NASA Technical Reports Server (NTRS)
Johnston, Gary M.; Campbell, Roy H.
1988-01-01
This paper describes a multiprocessor operating system simulator that was developed by the authors in the Fall semester of 1987. The simulator was built in response to the need to provide students with an environment in which to build and test operating system concepts as part of the coursework of a third-year undergraduate operating systems course. Written in C++, the simulator uses the co-routine style task package that is distributed with the AT&T C++ Translator to provide a hierarchy of classes that represents a broad range of operating system software and hardware components. The class hierarchy closely follows that of the 'Choices' family of operating systems for loosely- and tightly-coupled multiprocessors. During an operating system course, these classes are refined and specialized by students in homework assignments to facilitate experimentation with different aspects of operating system design and policy decisions. The current implementation runs on the IBM RT PC under 4.3bsd UNIX.
The Minerva Multi-Microprocessor.
A multiprocessor system is described which is an experiment in low cost, extensible, multiprocessor architectures. Global issues such as inclusion of a central bus, design of the bus arbiter, and methods of interrupt handling are considered. The system initially includes two processor types, based on microprocessors, and these are discussed. Methods for reducing processor demand for the central bus are described.
Operating system for a real-time multiprocessor propulsion system simulator
NASA Technical Reports Server (NTRS)
Cole, G. L.
1984-01-01
The success of the Real Time Multiprocessor Operating System (RTMPOS) in the development and evaluation of experimental hardware and software systems for real time interactive simulation of air breathing propulsion systems was evaluated. The Real Time Multiprocessor Operating System (RTMPOS) provides the user with a versatile, interactive means for loading, running, debugging and obtaining results from a multiprocessor based simulator. A front end processor (FEP) serves as the simulator controller and interface between the user and the simulator. These functions are facilitated by the RTMPOS which resides on the FEP. The RTMPOS acts in conjunction with the FEP's manufacturer supplied disk operating system that provides typical utilities like an assembler, linkage editor, text editor, file handling services, etc. Once a simulation is formulated, the RTMPOS provides for engineering level, run time operations such as loading, modifying and specifying computation flow of programs, simulator mode control, data handling and run time monitoring. Run time monitoring is a powerful feature of RTMPOS that allows the user to record all actions taken during a simulation session and to receive advisories from the simulator via the FEP. The RTMPOS is programmed mainly in PASCAL along with some assembly language routines. The RTMPOS software is easily modified to be applicable to hardware from different manufacturers.
Multiprocessor smalltalk: Implementation, performance, and analysis
DOE Office of Scientific and Technical Information (OSTI.GOV)
Pallas, J.I.
1990-01-01
Multiprocessor Smalltalk demonstrates the value of object-oriented programming on a multiprocessor. Its implementation and analysis shed light on three areas: concurrent programming in an object oriented language without special extensions, implementation techniques for adapting to multiprocessors, and performance factors in the resulting system. Adding parallelism to Smalltalk code is easy, because programs already use control abstractions like iterators. Smalltalk's basic control and concurrency primitives (lambda expressions, processes and semaphores) can be used to build parallel control abstractions, including parallel iterators, parallel objects, atomic objects, and futures. Language extensions for concurrency are not required. This implementation demonstrates that it is possiblemore » to build an efficient parallel object-oriented programming system and illustrates techniques for doing so. Three modification tools-serialization, replication, and reorganization-adapted the Berkeley Smalltalk interpreter to the Firefly multiprocessor. Multiprocessor Smalltalk's performance shows that the combination of multiprocessing and object-oriented programming can be effective: speedups (relative to the original serial version) exceed 2.0 for five processors on all the benchmarks; the median efficiency is 48%. Analysis shows both where performance is lost and how to improve and generalize the experimental results. Changes in the interpreter to support concurrency add at most 12% overhead; better access to per-process variables could eliminate much of that. Changes in the user code to express concurrency add as much as 70% overhead; this overhead could be reduced to 54% if blocks (lambda expressions) were reentrant. Performance is also lost when the program cannot keep all five processors busy.« less
Comparison of two paradigms for distributed shared memory
DOE Office of Scientific and Technical Information (OSTI.GOV)
Levelt, W.G.; Kaashoek, M.F.; Bal, H.E.
1990-08-01
The paper compares two paradigms for Distributed Shared Memory on loosely coupled computing systems: the shared data-object model as used in Orca, a programming language specially designed for loosely coupled computing systems and the Shared Virtual Memory model. For both paradigms the authors have implemented two systems, one using only point-to-point messages, the other using broadcasting as well. They briefly describe these two paradigms and their implementations. Then they compare their performance on four applications: the traveling salesman problem, alpha-beta search, matrix multiplication and the all pairs shortest paths problem. The measurements show that both paradigms can be used efficientlymore » for programming large-grain parallel applications. Significant speedups were obtained on all applications. The unstructured Shared Virtual Memory paradigm achieves the best absolute performance, although this is largely due to the preliminary nature of the Orca compiler used. The structured shared data-object model achieves the highest speedups and is much easier to program and to debug.« less
Modeling and measurement of fault-tolerant multiprocessors
NASA Technical Reports Server (NTRS)
Shin, K. G.; Woodbury, M. H.; Lee, Y. H.
1985-01-01
The workload effects on computer performance are addressed first for a highly reliable unibus multiprocessor used in real-time control. As an approach to studing these effects, a modified Stochastic Petri Net (SPN) is used to describe the synchronous operation of the multiprocessor system. From this model the vital components affecting performance can be determined. However, because of the complexity in solving the modified SPN, a simpler model, i.e., a closed priority queuing network, is constructed that represents the same critical aspects. The use of this model for a specific application requires the partitioning of the workload into job classes. It is shown that the steady state solution of the queuing model directly produces useful results. The use of this model in evaluating an existing system, the Fault Tolerant Multiprocessor (FTMP) at the NASA AIRLAB, is outlined with some experimental results. Also addressed is the technique of measuring fault latency, an important microscopic system parameter. Most related works have assumed no or a negligible fault latency and then performed approximate analyses. To eliminate this deficiency, a new methodology for indirectly measuring fault latency is presented.
Optical interconnection networks for high-performance computing systems
NASA Astrophysics Data System (ADS)
Biberman, Aleksandr; Bergman, Keren
2012-04-01
Enabled by silicon photonic technology, optical interconnection networks have the potential to be a key disruptive technology in computing and communication industries. The enduring pursuit of performance gains in computing, combined with stringent power constraints, has fostered the ever-growing computational parallelism associated with chip multiprocessors, memory systems, high-performance computing systems and data centers. Sustaining these parallelism growths introduces unique challenges for on- and off-chip communications, shifting the focus toward novel and fundamentally different communication approaches. Chip-scale photonic interconnection networks, enabled by high-performance silicon photonic devices, offer unprecedented bandwidth scalability with reduced power consumption. We demonstrate that the silicon photonic platforms have already produced all the high-performance photonic devices required to realize these types of networks. Through extensive empirical characterization in much of our work, we demonstrate such feasibility of waveguides, modulators, switches and photodetectors. We also demonstrate systems that simultaneously combine many functionalities to achieve more complex building blocks. We propose novel silicon photonic devices, subsystems, network topologies and architectures to enable unprecedented performance of these photonic interconnection networks. Furthermore, the advantages of photonic interconnection networks extend far beyond the chip, offering advanced communication environments for memory systems, high-performance computing systems, and data centers.
Fast neural net simulation with a DSP processor array.
Muller, U A; Gunzinger, A; Guggenbuhl, W
1995-01-01
This paper describes the implementation of a fast neural net simulator on a novel parallel distributed-memory computer. A 60-processor system, named MUSIC (multiprocessor system with intelligent communication), is operational and runs the backpropagation algorithm at a speed of 330 million connection updates per second (continuous weight update) using 32-b floating-point precision. This is equal to 1.4 Gflops sustained performance. The complete system with 3.8 Gflops peak performance consumes less than 800 W of electrical power and fits into a 19-in rack. While reaching the speed of modern supercomputers, MUSIC still can be used as a personal desktop computer at a researcher's own disposal. In neural net simulation, this gives a computing performance to a single user which was unthinkable before. The system's real-time interfaces make it especially useful for embedded applications.
Apparatus for multiprocessor-based control of a multiagent robot
NASA Technical Reports Server (NTRS)
Peters, II, Richard Alan (Inventor)
2009-01-01
An architecture for robot intelligence enables a robot to learn new behaviors and create new behavior sequences autonomously and interact with a dynamically changing environment. Sensory information is mapped onto a Sensory Ego-Sphere (SES) that rapidly identifies important changes in the environment and functions much like short term memory. Behaviors are stored in a DBAM that creates an active map from the robot's current state to a goal state and functions much like long term memory. A dream state converts recent activities stored in the SES and creates or modifies behaviors in the DBAM.
Portable parallel stochastic optimization for the design of aeropropulsion components
NASA Technical Reports Server (NTRS)
Sues, Robert H.; Rhodes, G. S.
1994-01-01
This report presents the results of Phase 1 research to develop a methodology for performing large-scale Multi-disciplinary Stochastic Optimization (MSO) for the design of aerospace systems ranging from aeropropulsion components to complete aircraft configurations. The current research recognizes that such design optimization problems are computationally expensive, and require the use of either massively parallel or multiple-processor computers. The methodology also recognizes that many operational and performance parameters are uncertain, and that uncertainty must be considered explicitly to achieve optimum performance and cost. The objective of this Phase 1 research was to initialize the development of an MSO methodology that is portable to a wide variety of hardware platforms, while achieving efficient, large-scale parallelism when multiple processors are available. The first effort in the project was a literature review of available computer hardware, as well as review of portable, parallel programming environments. The first effort was to implement the MSO methodology for a problem using the portable parallel programming language, Parallel Virtual Machine (PVM). The third and final effort was to demonstrate the example on a variety of computers, including a distributed-memory multiprocessor, a distributed-memory network of workstations, and a single-processor workstation. Results indicate the MSO methodology can be well-applied towards large-scale aerospace design problems. Nearly perfect linear speedup was demonstrated for computation of optimization sensitivity coefficients on both a 128-node distributed-memory multiprocessor (the Intel iPSC/860) and a network of workstations (speedups of almost 19 times achieved for 20 workstations). Very high parallel efficiencies (75 percent for 31 processors and 60 percent for 50 processors) were also achieved for computation of aerodynamic influence coefficients on the Intel. Finally, the multi-level parallelization strategy that will be needed for large-scale MSO problems was demonstrated to be highly efficient. The same parallel code instructions were used on both platforms, demonstrating portability. There are many applications for which MSO can be applied, including NASA's High-Speed-Civil Transport, and advanced propulsion systems. The use of MSO will reduce design and development time and testing costs dramatically.
A Real-Time Linux for Multicore Platforms
2013-12-20
under ARO support) to obtain a fully-functional OS for supporting real-time workloads on multicore platforms. This system, called LITMUS -RT...to be specified as plugin components. LITMUS -RT is open-source software (available at The views, opinions and/or findings contained in this report... LITMUS -RT (LInux Testbed for MUltiprocessor Scheduling in Real-Time systems), allows different multiprocessor real-time scheduling and
A divide and conquer approach to the nonsymmetric eigenvalue problem
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jessup, E.R.
1991-01-01
Serial computation combined with high communication costs on distributed-memory multiprocessors make parallel implementations of the QR method for the nonsymmetric eigenvalue problem inefficient. This paper introduces an alternative algorithm for the nonsymmetric tridiagonal eigenvalue problem based on rank two tearing and updating of the matrix. The parallelism of this divide and conquer approach stems from independent solution of the updating problems. 11 refs.
Dataflow computing approach in high-speed digital simulation
NASA Technical Reports Server (NTRS)
Ercegovac, M. D.; Karplus, W. J.
1984-01-01
New computational tools and methodologies for the digital simulation of continuous systems were explored. Programmability, and cost effective performance in multiprocessor organizations for real time simulation was investigated. Approach is based on functional style languages and data flow computing principles, which allow for the natural representation of parallelism in algorithms and provides a suitable basis for the design of cost effective high performance distributed systems. The objectives of this research are to: (1) perform comparative evaluation of several existing data flow languages and develop an experimental data flow language suitable for real time simulation using multiprocessor systems; (2) investigate the main issues that arise in the architecture and organization of data flow multiprocessors for real time simulation; and (3) develop and apply performance evaluation models in typical applications.
Supporting shared data structures on distributed memory architectures
NASA Technical Reports Server (NTRS)
Koelbel, Charles; Mehrotra, Piyush; Vanrosendale, John
1990-01-01
Programming nonshared memory systems is more difficult than programming shared memory systems, since there is no support for shared data structures. Current programming languages for distributed memory architectures force the user to decompose all data structures into separate pieces, with each piece owned by one of the processors in the machine, and with all communication explicitly specified by low-level message-passing primitives. A new programming environment is presented for distributed memory architectures, providing a global name space and allowing direct access to remote parts of data values. The analysis and program transformations required to implement this environment are described, and the efficiency of the resulting code on the NCUBE/7 and IPSC/2 hypercubes are described.
Safe and Efficient Support for Embeded Multi-Processors in ADA
NASA Astrophysics Data System (ADS)
Ruiz, Jose F.
2010-08-01
New software demands increasing processing power, and multi-processor platforms are spreading as the answer to achieve the required performance. Embedded real-time systems are also subject to this trend, but in the case of real-time mission-critical systems, the properties of reliability, predictability and analyzability are also paramount. The Ada 2005 language defined a subset of its tasking model, the Ravenscar profile, that provides the basis for the implementation of deterministic and time analyzable applications on top of a streamlined run-time system. This Ravenscar tasking profile, originally designed for single processors, has proven remarkably useful for modelling verifiable real-time single-processor systems. This paper proposes a simple extension to the Ravenscar profile to support multi-processor systems using a fully partitioned approach. The implementation of this scheme is simple, and it can be used to develop applications amenable to schedulability analysis.
System and method for programmable bank selection for banked memory subsystems
Blumrich, Matthias A.; Chen, Dong; Gara, Alan G.; Giampapa, Mark E.; Hoenicke, Dirk; Ohmacht, Martin; Salapura, Valentina; Sugavanam, Krishnan
2010-09-07
A programmable memory system and method for enabling one or more processor devices access to shared memory in a computing environment, the shared memory including one or more memory storage structures having addressable locations for storing data. The system comprises: one or more first logic devices associated with a respective one or more processor devices, each first logic device for receiving physical memory address signals and programmable for generating a respective memory storage structure select signal upon receipt of pre-determined address bit values at selected physical memory address bit locations; and, a second logic device responsive to each of the respective select signal for generating an address signal used for selecting a memory storage structure for processor access. The system thus enables each processor device of a computing environment memory storage access distributed across the one or more memory storage structures.
Distributed simulation using a real-time shared memory network
NASA Technical Reports Server (NTRS)
Simon, Donald L.; Mattern, Duane L.; Wong, Edmond; Musgrave, Jeffrey L.
1993-01-01
The Advanced Control Technology Branch of the NASA Lewis Research Center performs research in the area of advanced digital controls for aeronautic and space propulsion systems. This work requires the real-time implementation of both control software and complex dynamical models of the propulsion system. We are implementing these systems in a distributed, multi-vendor computer environment. Therefore, a need exists for real-time communication and synchronization between the distributed multi-vendor computers. A shared memory network is a potential solution which offers several advantages over other real-time communication approaches. A candidate shared memory network was tested for basic performance. The shared memory network was then used to implement a distributed simulation of a ramjet engine. The accuracy and execution time of the distributed simulation was measured and compared to the performance of the non-partitioned simulation. The ease of partitioning the simulation, the minimal time required to develop for communication between the processors and the resulting execution time all indicate that the shared memory network is a real-time communication technique worthy of serious consideration.
Numerical methods on some structured matrix algebra problems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jessup, E.R.
1996-06-01
This proposal concerned the design, analysis, and implementation of serial and parallel algorithms for certain structured matrix algebra problems. It emphasized large order problems and so focused on methods that can be implemented efficiently on distributed-memory MIMD multiprocessors. Such machines supply the computing power and extensive memory demanded by the large order problems. We proposed to examine three classes of matrix algebra problems: the symmetric and nonsymmetric eigenvalue problems (especially the tridiagonal cases) and the solution of linear systems with specially structured coefficient matrices. As all of these are of practical interest, a major goal of this work was tomore » translate our research in linear algebra into useful tools for use by the computational scientists interested in these and related applications. Thus, in addition to software specific to the linear algebra problems, we proposed to produce a programming paradigm and library to aid in the design and implementation of programs for distributed-memory MIMD computers. We now report on our progress on each of the problems and on the programming tools.« less
A CAMAC-VME-Macintosh data acquisition system for nuclear experiments
NASA Astrophysics Data System (ADS)
Anzalone, A.; Giustolisi, F.
1989-10-01
A multiprocessor system for data acquisition and analysis in low-energy nuclear physics has been realized. The system is built around CAMAC, the VMEbus, and the Macintosh PC. Multiprocessor software has been developed, using RTF, MACsys, and CERN cross-software. The execution of several programs that run on several VME CPUs and on an external PC is coordinated by a mailbox protocol. No operating system is used on the VME CPUs. The hardware, software, and system performance are described.
Transactive memory systems scale for couples: development and validation
Hewitt, Lauren Y.; Roberts, Lynne D.
2015-01-01
People in romantic relationships can develop shared memory systems by pooling their cognitive resources, allowing each person access to more information but with less cognitive effort. Research examining such memory systems in romantic couples largely focuses on remembering word lists or performing lab-based tasks, but these types of activities do not capture the processes underlying couples’ transactive memory systems, and may not be representative of the ways in which romantic couples use their shared memory systems in everyday life. We adapted an existing measure of transactive memory systems for use with romantic couples (TMSS-C), and conducted an initial validation study. In total, 397 participants who each identified as being a member of a romantic relationship of at least 3 months duration completed the study. The data provided a good fit to the anticipated three-factor structure of the components of couples’ transactive memory systems (specialization, credibility and coordination), and there was reasonable evidence of both convergent and divergent validity, as well as strong evidence of test–retest reliability across a 2-week period. The TMSS-C provides a valuable tool that can quickly and easily capture the underlying components of romantic couples’ transactive memory systems. It has potential to help us better understand this intriguing feature of romantic relationships, and how shared memory systems might be associated with other important features of romantic relationships. PMID:25999873
An effective write policy for software coherence schemes
NASA Technical Reports Server (NTRS)
Chen, Yung-Chin; Veidenbaum, Alexander V.
1992-01-01
The authors study the write behavior and evaluate the performance of various write strategies and buffering techniques for a MIN-based multiprocessor system using the simple software coherence scheme. Hit ratios, memory latencies, total execution time, and total write traffic are used as the performance indices. The write-through write-allocate no-fetch cache using a write-back write buffer is shown to have a better performance than both write-through and write-back caches. This type of write buffer is effective in reducing the volume as well as bursts of write traffic. On average, the use of a write-back cache reduces by 60 percent the total write traffic generated by a write-through cache.
Bibliography On Multiprocessors And Distributed Processing
NASA Technical Reports Server (NTRS)
Miya, Eugene N.
1988-01-01
Multiprocessor and Distributed Processing Bibliography package consists of large machine-readable bibliographic data base, which in addition to usual keyword searches, used for producing citations, indexes, and cross-references. Data base contains UNIX(R) "refer" -formatted ASCII data and implemented on any computer running under UNIX(R) operating system. Easily convertible to other operating systems. Requires approximately one megabyte of secondary storage. Bibliography compiled in 1985.
A multiprocessor operating system simulator
DOE Office of Scientific and Technical Information (OSTI.GOV)
Johnston, G.M.; Campbell, R.H.
1988-01-01
This paper describes a multiprocessor operating system simulator that was developed by the authors in the Fall of 1987. The simulator was built in response to the need to provide students with an environment in which to build and test operating system concepts as part of the coursework of a third-year undergraduate operating systems course. Written in C++, the simulator uses the co-routine style task package that is distributed with the AT and T C++ Translator to provide a hierarchy of classes that represents a broad range of operating system software and hardware components. The class hierarchy closely follows thatmore » of the Choices family of operating systems for loosely and tightly coupled multiprocessors. During an operating system course, these classes are refined and specialized by students in homework assignments to facilitate experimentation with different aspects of operating system design and policy decisions. The current implementation runs on the IBM RT PC under 4.3bsd UNIX.« less
The MSFC UNIVAC 1108 EXEC 8 simulation model
NASA Technical Reports Server (NTRS)
Williams, T. G.; Richards, F. M.; Weatherbee, J. E.; Paul, L. K.
1972-01-01
A model is presented which simulates the MSFC Univac 1108 multiprocessor system. The hardware/operating system is described to enable a good statistical measurement of the system behavior. The performance of the 1108 is evaluated by performing twenty-four different experiments designed to locate system bottlenecks and also to test the sensitivity of system throughput with respect to perturbation of the various Exec 8 scheduling algorithms. The model is implemented in the general purpose system simulation language and the techniques described can be used to assist in the design, development, and evaluation of multiprocessor systems.
Memory Network For Distributed Data Processors
NASA Technical Reports Server (NTRS)
Bolen, David; Jensen, Dean; Millard, ED; Robinson, Dave; Scanlon, George
1992-01-01
Universal Memory Network (UMN) is modular, digital data-communication system enabling computers with differing bus architectures to share 32-bit-wide data between locations up to 3 km apart with less than one millisecond of latency. Makes it possible to design sophisticated real-time and near-real-time data-processing systems without data-transfer "bottlenecks". This enterprise network permits transmission of volume of data equivalent to an encyclopedia each second. Facilities benefiting from Universal Memory Network include telemetry stations, simulation facilities, power-plants, and large laboratories or any facility sharing very large volumes of data. Main hub of UMN is reflection center including smaller hubs called Shared Memory Interfaces.
Dynamic file-access characteristics of a production parallel scientific workload
NASA Technical Reports Server (NTRS)
Kotz, David; Nieuwejaar, Nils
1994-01-01
Multiprocessors have permitted astounding increases in computational performance, but many cannot meet the intense I/O requirements of some scientific applications. An important component of any solution to this I/O bottleneck is a parallel file system that can provide high-bandwidth access to tremendous amounts of data in parallel to hundreds or thousands of processors. Most successful systems are based on a solid understanding of the expected workload, but thus far there have been no comprehensive workload characterizations of multiprocessor file systems. This paper presents the results of a three week tracing study in which all file-related activity on a massively parallel computer was recorded. Our instrumentation differs from previous efforts in that it collects information about every I/O request and about the mix of jobs running in a production environment. We also present the results of a trace-driven caching simulation and recommendations for designers of multiprocessor file systems.
Visual monitoring of autonomous life sciences experimentation
NASA Technical Reports Server (NTRS)
Blank, G. E.; Martin, W. N.
1987-01-01
The design and implementation of a computerized visual monitoring system to aid in the monitoring and control of life sciences experiments on board a space station was investigated. A likely multiprocessor design was chosen, a plausible life science experiment with which to work was defined, the theoretical issues involved in the programming of a visual monitoring system for the experiment was considered on the multiprocessor, a system for monitoring the experiment was designed, and simulations of such a system was implemented on a network of Apollo workstations.
CELLFS: TAKING THE "DMA" OUT OF CELL PROGRAMMING
DOE Office of Scientific and Technical Information (OSTI.GOV)
IONKOV, LATCHESAR A.; MIRTCHOVSKI, ANDREY A.; NYRHINEN, AKI M.
In this paper we present a new programming model for the Cell BE architecture of scalar multiprocessors. They call this programming model CellFS. CellFS aims at simplifying the task of managing I/O between the local store of the processing units and main memory. The CellFS support library provides the means for transferring data via simple file I/O operations between the PPU and the SPU.
Real-Time Multiprocessor Programming Language (RTMPL) user's manual
NASA Technical Reports Server (NTRS)
Arpasi, D. J.
1985-01-01
A real-time multiprocessor programming language (RTMPL) has been developed to provide for high-order programming of real-time simulations on systems of distributed computers. RTMPL is a structured, engineering-oriented language. The RTMPL utility supports a variety of multiprocessor configurations and types by generating assembly language programs according to user-specified targeting information. Many programming functions are assumed by the utility (e.g., data transfer and scaling) to reduce the programming chore. This manual describes RTMPL from a user's viewpoint. Source generation, applications, utility operation, and utility output are detailed. An example simulation is generated to illustrate many RTMPL features.
A Measurement and Simulation Based Methodology for Cache Performance Modeling and Tuning
NASA Technical Reports Server (NTRS)
Waheed, Abdul; Yan, Jerry; Saini, Subhash (Technical Monitor)
1998-01-01
We present a cache performance modeling methodology that facilitates the tuning of uniprocessor cache performance for applications executing on shared memory multiprocessors by accurately predicting the effects of source code level modifications. Measurements on a single processor are initially used for identifying parts of code where cache utilization improvements may significantly impact the overall performance. Cache simulation based on trace-driven techniques can be carried out without gathering detailed address traces. Minimal runtime information for modeling cache performance of a selected code block includes: base virtual addresses of arrays, virtual addresses of variables, and loop bounds for that code block. Rest of the information is obtained from the source code. We show that the cache performance predictions are as reliable as those obtained through trace-driven simulations. This technique is particularly helpful to the exploration of various "what-if' scenarios regarding the cache performance impact for alternative code structures. We explain and validate this methodology using a simple matrix-matrix multiplication program. We then apply this methodology to predict and tune the cache performance of two realistic scientific applications taken from the Computational Fluid Dynamics (CFD) domain.
Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster
NASA Technical Reports Server (NTRS)
Jost, Gabriele; Jin, Haoqiang; anMey, Dieter; Hatay, Ferhat F.
2003-01-01
With the advent of parallel hardware and software technologies users are faced with the challenge to choose a programming paradigm best suited for the underlying computer architecture. With the current trend in parallel computer architectures towards clusters of shared memory symmetric multi-processors (SMP), parallel programming techniques have evolved to support parallelism beyond a single level. Which programming paradigm is the best will depend on the nature of the given problem, the hardware architecture, and the available software. In this study we will compare different programming paradigms for the parallelization of a selected benchmark application on a cluster of SMP nodes. We compare the timings of different implementations of the same CFD benchmark application employing the same numerical algorithm on a cluster of Sun Fire SMP nodes. The rest of the paper is structured as follows: In section 2 we briefly discuss the programming models under consideration. We describe our compute platform in section 3. The different implementations of our benchmark code are described in section 4 and the performance results are presented in section 5. We conclude our study in section 6.
NASA Astrophysics Data System (ADS)
Georgiev, K.; Zlatev, Z.
2010-11-01
The Danish Eulerian Model (DEM) is an Eulerian model for studying the transport of air pollutants on large scale. Originally, the model was developed at the National Environmental Research Institute of Denmark. The model computational domain covers Europe and some neighbour parts belong to the Atlantic Ocean, Asia and Africa. If DEM model is to be applied by using fine grids, then its discretization leads to a huge computational problem. This implies that such a model as DEM must be run only on high-performance computer architectures. The implementation and tuning of such a complex large-scale model on each different computer is a non-trivial task. Here, some comparison results of running of this model on different kind of vector (CRAY C92A, Fujitsu, etc.), parallel computers with distributed memory (IBM SP, CRAY T3E, Beowulf clusters, Macintosh G4 clusters, etc.), parallel computers with shared memory (SGI Origin, SUN, etc.) and parallel computers with two levels of parallelism (IBM SMP, IBM BlueGene/P, clusters of multiprocessor nodes, etc.) will be presented. The main idea in the parallel version of DEM is domain partitioning approach. Discussions according to the effective use of the cache and hierarchical memories of the modern computers as well as the performance, speed-ups and efficiency achieved will be done. The parallel code of DEM, created by using MPI standard library, appears to be highly portable and shows good efficiency and scalability on different kind of vector and parallel computers. Some important applications of the computer model output are presented in short.
Fault tree models for fault tolerant hypercube multiprocessors
NASA Technical Reports Server (NTRS)
Boyd, Mark A.; Tuazon, Jezus O.
1991-01-01
Three candidate fault tolerant hypercube architectures are modeled, their reliability analyses are compared, and the resulting implications of these methods of incorporating fault tolerance into hypercube multiprocessors are discussed. In the course of performing the reliability analyses, the use of HARP and fault trees in modeling sequence dependent system behaviors is demonstrated.
Backend Control Processor for a Multi-Processor Relational Database Computer System.
1984-12-01
SCHOOL OF ENGI. UNCRSIFID MPONTIFF DEC 84 AFXT/GCS/ENG/84D-22 F/O 9/2 L ommhhhhmhhml mhhhommhhhhhm i-2 8 -- U0. 11111= Q. 2 111.8IIII- 1111111..6...THESIS Presented to the Faculty of the School of Engineering of the Air Force Institute of Technology Air University In Partial Fulfillment of the...development of a Backend Multi-Processor Relational Database Computer System. This thesis addresses a single component of this system, the Backend Control
NASA Technical Reports Server (NTRS)
Pordes, Ruth (Editor)
1989-01-01
Papers on real-time computer applications in nuclear, particle, and plasma physics are presented, covering topics such as expert systems tactics in testing FASTBUS segment interconnect modules, trigger control in a high energy physcis experiment, the FASTBUS read-out system for the Aleph time projection chamber, a multiprocessor data acquisition systems, DAQ software architecture for Aleph, a VME multiprocessor system for plasma control at the JT-60 upgrade, and a multiasking, multisinked, multiprocessor data acquisition front end. Other topics include real-time data reduction using a microVAX processor, a transputer based coprocessor for VEDAS, simulation of a macropipelined multi-CPU event processor for use in FASTBUS, a distributed VME control system for the LISA superconducting Linac, a distributed system for laboratory process automation, and a distributed system for laboratory process automation. Additional topics include a structure macro assembler for the event handler, a data acquisition and control system for Thomson scattering on ATF, remote procedure execution software for distributed systems, and a PC-based graphic display real-time particle beam uniformity.
Reducing Response Time Bounds for DAG-Based Task Systems on Heterogeneous Multicore Platforms
2016-01-01
synchronous parallel tasks on multicore platforms. In 25th ECRTS, 2013. [10] U. Devi. Soft Real - Time Scheduling on Multiprocessors. PhD thesis...report, Washington University in St Louis, 2014. [18] C. Liu and J. Anderson. Supporting soft real - time DAG-based sys- tems on multiprocessors with...analysis for DAG-based real - time task systems im- plemented on heterogeneous multicore platforms. The spe- cific analysis problem that is considered was
Study of Thread Level Parallelism in a Video Encoding Application for Chip Multiprocessor Design
NASA Astrophysics Data System (ADS)
Debes, Eric; Kaine, Greg
2002-11-01
In media applications there is a high level of available thread level parallelism (TLP). In this paper we study the intra TLP in a video encoder. We show that a well-distributed highly optimized encoder running on a symmetric multiprocessor (SMP) system can run 3.2 faster on a 4-way SMP machine than on a single processor. The multithreaded encoder running on an SMP system is then used to understand the requirements of a chip multiprocessor (CMP) architecture, which is one possible architectural direction to better exploit TLP. In the framework of this study, we use a software approach to evaluate the dataflow between processors for the video encoder running on an SMP system. An estimation of the dataflow is done with L2 cache miss event counters using Intel® VTuneTM performance analyzer. The experimental measurements are compared to theoretical results.
Measuring Transactiving Memory Systems Using Network Analysis
ERIC Educational Resources Information Center
King, Kylie Goodell
2017-01-01
Transactive memory systems (TMSs) describe the structures and processes that teams use to share information, work together, and accomplish shared goals. First introduced over three decades ago, TMSs have been measured in a variety of ways. This dissertation proposes the use of network analysis in measuring TMS. This is accomplished by describing…
The Refinement-Tree Partition for Parallel Solution of Partial Differential Equations
Mitchell, William F.
1998-01-01
Dynamic load balancing is considered in the context of adaptive multilevel methods for partial differential equations on distributed memory multiprocessors. An approach that periodically repartitions the grid is taken. The important properties of a partitioning algorithm are presented and discussed in this context. A partitioning algorithm based on the refinement tree of the adaptive grid is presented and analyzed in terms of these properties. Theoretical and numerical results are given. PMID:28009355
The Refinement-Tree Partition for Parallel Solution of Partial Differential Equations.
Mitchell, William F
1998-01-01
Dynamic load balancing is considered in the context of adaptive multilevel methods for partial differential equations on distributed memory multiprocessors. An approach that periodically repartitions the grid is taken. The important properties of a partitioning algorithm are presented and discussed in this context. A partitioning algorithm based on the refinement tree of the adaptive grid is presented and analyzed in terms of these properties. Theoretical and numerical results are given.
DARPA Status Report - November 1988
1988-11-01
style used in the applic4#ons reference to that block was by processor j. where j It. We was influenced by it. MACH is a multiprocessor operating S call...it can be order they occurred. However. the exact time at which the treated specially in memory management , and so most of the reference wa, made is...on cache consistency performance, sophisti- peak can be explained as clinging references that occur when cated cache management schemes that take
Fast adaptive composite grid methods on distributed parallel architectures
NASA Technical Reports Server (NTRS)
Lemke, Max; Quinlan, Daniel
1992-01-01
The fast adaptive composite (FAC) grid method is compared with the adaptive composite method (AFAC) under variety of conditions including vectorization and parallelization. Results are given for distributed memory multiprocessor architectures (SUPRENUM, Intel iPSC/2 and iPSC/860). It is shown that the good performance of AFAC and its superiority over FAC in a parallel environment is a property of the algorithm and not dependent on peculiarities of any machine.
Ultra-fast fluence optimization for beam angle selection algorithms
NASA Astrophysics Data System (ADS)
Bangert, M.; Ziegenhein, P.; Oelfke, U.
2014-03-01
Beam angle selection (BAS) including fluence optimization (FO) is among the most extensive computational tasks in radiotherapy. Precomputed dose influence data (DID) of all considered beam orientations (up to 100 GB for complex cases) has to be handled in the main memory and repeated FOs are required for different beam ensembles. In this paper, the authors describe concepts accelerating FO for BAS algorithms using off-the-shelf multiprocessor workstations. The FO runtime is not dominated by the arithmetic load of the CPUs but by the transportation of DID from the RAM to the CPUs. On multiprocessor workstations, however, the speed of data transportation from the main memory to the CPUs is non-uniform across the RAM; every CPU has a dedicated memory location (node) with minimum access time. We apply a thread node binding strategy to ensure that CPUs only access DID from their preferred node. Ideal load balancing for arbitrary beam ensembles is guaranteed by distributing the DID of every candidate beam equally to all nodes. Furthermore we use a custom sorting scheme of the DID to minimize the overall data transportation. The framework is implemented on an AMD Opteron workstation. One FO iteration comprising dose, objective function, and gradient calculation takes between 0.010 s (9 beams, skull, 0.23 GB DID) and 0.070 s (9 beams, abdomen, 1.50 GB DID). Our overall FO time is < 1 s for small cases, larger cases take ~ 4 s. BAS runs including FOs for 1000 different beam ensembles take ~ 15-70 min, depending on the treatment site. This enables an efficient clinical evaluation of different BAS algorithms.
A multiprocessor airborne lidar data system
NASA Technical Reports Server (NTRS)
Wright, C. W.; Bailey, S. A.; Heath, G. E.; Piazza, C. R.
1988-01-01
A new multiprocessor data acquisition system was developed for the existing Airborne Oceanographic Lidar (AOL). This implementation simultaneously utilizes five single board 68010 microcomputers, the UNIX system V operating system, and the real time executive VRTX. The original data acquisition system was implemented on a Hewlett Packard HP 21-MX 16 bit minicomputer using a multi-tasking real time operating system and a mixture of assembly and FORTRAN languages. The present collection of data sources produce data at widely varied rates and require varied amounts of burdensome real time processing and formatting. It was decided to replace the aging HP 21-MX minicomputer with a multiprocessor system. A new and flexible recording format was devised and implemented to accommodate the constantly changing sensor configuration. A central feature of this data system is the minimization of non-remote sensing bus traffic. Therefore, it is highly desirable that each micro be capable of functioning as much as possible on-card or via private peripherals. The bus is used primarily for the transfer of remote sensing data to or from the buffer queue.
SAHAYOG: A Testbed for Load Sharing under Failure,
1987-07-01
messages, shared memory and semaphores . To communicate using messages, processes create message queues using system-provided prim- itives. The message...The size of the memory that is to be shared is decided by the process when it makes a request for memory allocation. The semaphore option of IPC can be...used to prevent two or more concurrent processes from executing their critical sections at the same time. Semaphores must be used when the processes
System and method for memory allocation in a multiclass memory system
Loh, Gabriel; Meswani, Mitesh; Ignatowski, Michael; Nutter, Mark
2016-06-28
A system for memory allocation in a multiclass memory system includes a processor coupleable to a plurality of memories sharing a unified memory address space, and a library store to store a library of software functions. The processor identifies a type of a data structure in response to a memory allocation function call to the library for allocating memory to the data structure. Using the library, the processor allocates portions of the data structure among multiple memories of the multiclass memory system based on the type of the data structure.
High Performance Programming Using Explicit Shared Memory Model on Cray T3D1
NASA Technical Reports Server (NTRS)
Simon, Horst D.; Saini, Subhash; Grassi, Charles
1994-01-01
The Cray T3D system is the first-phase system in Cray Research, Inc.'s (CRI) three-phase massively parallel processing (MPP) program. This system features a heterogeneous architecture that closely couples DEC's Alpha microprocessors and CRI's parallel-vector technology, i.e., the Cray Y-MP and Cray C90. An overview of the Cray T3D hardware and available programming models is presented. Under Cray Research adaptive Fortran (CRAFT) model four programming methods (data parallel, work sharing, message-passing using PVM, and explicit shared memory model) are available to the users. However, at this time data parallel and work sharing programming models are not available to the user community. The differences between standard PVM and CRI's PVM are highlighted with performance measurements such as latencies and communication bandwidths. We have found that the performance of neither standard PVM nor CRI s PVM exploits the hardware capabilities of the T3D. The reasons for the bad performance of PVM as a native message-passing library are presented. This is illustrated by the performance of NAS Parallel Benchmarks (NPB) programmed in explicit shared memory model on Cray T3D. In general, the performance of standard PVM is about 4 to 5 times less than obtained by using explicit shared memory model. This degradation in performance is also seen on CM-5 where the performance of applications using native message-passing library CMMD on CM-5 is also about 4 to 5 times less than using data parallel methods. The issues involved (such as barriers, synchronization, invalidating data cache, aligning data cache etc.) while programming in explicit shared memory model are discussed. Comparative performance of NPB using explicit shared memory programming model on the Cray T3D and other highly parallel systems such as the TMC CM-5, Intel Paragon, Cray C90, IBM-SP1, etc. is presented.
Optical backplane interconnect switch for data processors and computers
NASA Technical Reports Server (NTRS)
Hendricks, Herbert D.; Benz, Harry F.; Hammer, Jacob M.
1989-01-01
An optoelectronic integrated device design is reported which can be used to implement an all-optical backplane interconnect switch. The switch is sized to accommodate an array of processors and memories suitable for direct replacement into the basic avionic multiprocessor backplane. The optical backplane interconnect switch is also suitable for direct replacement of the PI bus traffic switch and at the same time, suitable for supporting pipelining of the processor and memory. The 32 bidirectional switchable interconnects are configured with broadcast capability for controls, reconfiguration, and messages. The approach described here can handle a serial interconnection of data processors or a line-to-link interconnection of data processors. An optical fiber demonstration of this approach is presented.
Enabling Next-Generation Multicore Platforms in Embedded Applications
2014-04-01
mapping to sets 129 − 256 ) to the second page in memory, color 2 (sets 257 − 384) to the third page, and so on. Then, after the 32nd page, all 212 sets...the Real-Time Nested Locking Protocol (RNLP) [56], a recently developed multiprocessor real-time locking protocol that optimally supports the...RELEASE; DISTRIBUTION UNLIMITED 15 In general, the problems of optimally assigning tasks to processors and colors to tasks are both NP-hard in the
Domain Decomposition: A Bridge between Nature and Parallel Computers
1992-09-01
B., "Domain Decomposition Algorithms for Indefinite Elliptic Problems," S"IAM Journal of S; cientific and Statistical (’omputing, Vol. 13, 1992, pp...AD-A256 575 NASA Contractor Report 189709 ICASE Report No. 92-44 ICASE DOMAIN DECOMPOSITION: A BRIDGE BETWEEN NATURE AND PARALLEL COMPUTERS DTIC dE...effectively implemented on dis- tributed memory multiprocessors. In 1990 (as reported in Ref. 38 using the tile algo- rithm), a 103,201-unknown 2D elliptic
Luckey, Chance John; Bhattacharya, Deepta; Goldrath, Ananda W.; Weissman, Irving L.; Benoist, Christophe; Mathis, Diane
2006-01-01
The only cells of the hematopoietic system that undergo self-renewal for the lifetime of the organism are long-term hematopoietic stem cells and memory T and B cells. To determine whether there is a shared transcriptional program among these self-renewing populations, we first compared the gene-expression profiles of naïve, effector and memory CD8+ T cells with those of long-term hematopoietic stem cells, short-term hematopoietic stem cells, and lineage-committed progenitors. Transcripts augmented in memory CD8+ T cells relative to naïve and effector T cells were selectively enriched in long-term hematopoietic stem cells and were progressively lost in their short-term and lineage-committed counterparts. Furthermore, transcripts selectively decreased in memory CD8+ T cells were selectively down-regulated in long-term hematopoietic stem cells and progressively increased with differentiation. To confirm that this pattern was a general property of immunologic memory, we turned to independently generated gene expression profiles of memory, naïve, germinal center, and plasma B cells. Once again, memory-enriched and -depleted transcripts were also appropriately augmented and diminished in long-term hematopoietic stem cells, and their expression correlated with progressive loss of self-renewal function. Thus, there appears to be a common signature of both up- and down-regulated transcripts shared between memory T cells, memory B cells, and long-term hematopoietic stem cells. This signature was not consistently enriched in neural or embryonic stem cell populations and, therefore, appears to be restricted to the hematopoeitic system. These observations provide evidence that the shared phenotype of self-renewal in the hematopoietic system is linked at the molecular level. PMID:16492737
A shared resource between declarative memory and motor memory.
Keisler, Aysha; Shadmehr, Reza
2010-11-03
The neural systems that support motor adaptation in humans are thought to be distinct from those that support the declarative system. Yet, during motor adaptation changes in motor commands are supported by a fast adaptive process that has important properties (rapid learning, fast decay) that are usually associated with the declarative system. The fast process can be contrasted to a slow adaptive process that also supports motor memory, but learns gradually and shows resistance to forgetting. Here we show that after people stop performing a motor task, the fast motor memory can be disrupted by a task that engages declarative memory, but the slow motor memory is immune from this interference. Furthermore, we find that the fast/declarative component plays a major role in the consolidation of the slow motor memory. Because of the competitive nature of declarative and nondeclarative memory during consolidation, impairment of the fast/declarative component leads to improvements in the slow/nondeclarative component. Therefore, the fast process that supports formation of motor memory is not only neurally distinct from the slow process, but it shares critical resources with the declarative memory system.
A shared resource between declarative memory and motor memory
Keisler, Aysha; Shadmehr, Reza
2010-01-01
The neural systems that support motor adaptation in humans are thought to be distinct from those that support the declarative system. Yet, during motor adaptation changes in motor commands are supported by a fast adaptive process that has important properties (rapid learning, fast decay) that are usually associated with the declarative system. The fast process can be contrasted to a slow adaptive process that also supports motor memory, but learns gradually and shows resistance to forgetting. Here we show that after people stop performing a motor task, the fast motor memory can be disrupted by a task that engages declarative memory, but the slow motor memory is immune from this interference. Furthermore, we find that the fast/declarative component plays a major role in the consolidation of the slow motor memory. Because of the competitive nature of declarative and non-declarative memory during consolidation, impairment of the fast/declarative component leads to improvements in the slow/non-declarative component. Therefore, the fast process that supports formation of motor memory is not only neurally distinct from the slow process, but it shares critical resources with the declarative memory system. PMID:21048140
Energy-efficient fault tolerance in multiprocessor real-time systems
NASA Astrophysics Data System (ADS)
Guo, Yifeng
The recent progress in the multiprocessor/multicore systems has important implications for real-time system design and operation. From vehicle navigation to space applications as well as industrial control systems, the trend is to deploy multiple processors in real-time systems: systems with 4 -- 8 processors are common, and it is expected that many-core systems with dozens of processing cores will be available in near future. For such systems, in addition to general temporal requirement common for all real-time systems, two additional operational objectives are seen as critical: energy efficiency and fault tolerance. An intriguing dimension of the problem is that energy efficiency and fault tolerance are typically conflicting objectives, due to the fact that tolerating faults (e.g., permanent/transient) often requires extra resources with high energy consumption potential. In this dissertation, various techniques for energy-efficient fault tolerance in multiprocessor real-time systems have been investigated. First, the Reliability-Aware Power Management (RAPM) framework, which can preserve the system reliability with respect to transient faults when Dynamic Voltage Scaling (DVS) is applied for energy savings, is extended to support parallel real-time applications with precedence constraints. Next, the traditional Standby-Sparing (SS) technique for dual processor systems, which takes both transient and permanent faults into consideration while saving energy, is generalized to support multiprocessor systems with arbitrary number of identical processors. Observing the inefficient usage of slack time in the SS technique, a Preference-Oriented Scheduling Framework is designed to address the problem where tasks are given preferences for being executed as soon as possible (ASAP) or as late as possible (ALAP). A preference-oriented earliest deadline (POED) scheduler is proposed and its application in multiprocessor systems for energy-efficient fault tolerance is investigated, where tasks' main copies are executed ASAP while backup copies ALAP to reduce the overlapped execution of main and backup copies of the same task and thus reduce energy consumption. All proposed techniques are evaluated through extensive simulations and compared with other state-of-the-art approaches. The simulation results confirm that the proposed schemes can preserve the system reliability while still achieving substantial energy savings. Finally, for both SS and POED based Energy-Efficient Fault-Tolerant (EEFT) schemes, a series of recovery strategies are designed when more than one (transient and permanent) faults need to be tolerated.
Optical memories in digital computing
NASA Technical Reports Server (NTRS)
Alford, C. O.; Gaylord, T. K.
1979-01-01
High capacity optical memories with relatively-high data-transfer rate and multiport simultaneous access capability may serve as basis for new computer architectures. Several computer structures that might profitably use memories are: a) simultaneous record-access system, b) simultaneously-shared memory computer system, and c) parallel digital processing structure.
NASA Astrophysics Data System (ADS)
Urfianto, Mohammad Zalfany; Isshiki, Tsuyoshi; Khan, Arif Ullah; Li, Dongju; Kunieda, Hiroaki
This paper presentss a Multiprocessor System-on-Chips (MPSoC) architecture used as an execution platform for the new C-language based MPSoC design framework we are currently developing. The MPSoC architecture is based on an existing SoC platform with a commercial RISC core acting as the host CPU. We extend the existing SoC with a multiprocessor-array block that is used as the main engine to run parallel applications modeled in our design framework. Utilizing several optimizations provided by our compiler, an efficient inter-communication between processing elements with minimum overhead is implemented. A host-interface is designed to integrate the existing RISC core to the multiprocessor-array. The experimental results show that an efficacious integration is achieved, proving that the designed communication module can be used to efficiently incorporate off-the-shelf processors as a processing element for MPSoC architectures designed using our framework.
NASA Technical Reports Server (NTRS)
Russo, Vincent; Johnston, Gary; Campbell, Roy
1988-01-01
The programming of the interrupt handling mechanisms, process switching primitives, scheduling mechanism, and synchronization primitives of an operating system for a multiprocessor require both efficient code in order to support the needs of high- performance or real-time applications and careful organization to facilitate maintenance. Although many advantages have been claimed for object-oriented class hierarchical languages and their corresponding design methodologies, the application of these techniques to the design of the primitives within an operating system has not been widely demonstrated. To investigate the role of class hierarchical design in systems programming, the authors have constructed the Choices multiprocessor operating system architecture the C++ programming language. During the implementation, it was found that many operating system design concerns can be represented advantageously using a class hierarchical approach, including: the separation of mechanism and policy; the organization of an operating system into layers, each of which represents an abstract machine; and the notions of process and exception management. In this paper, we discuss an implementation of the low-level primitives of this system and outline the strategy by which we developed our solution.
Crosetto, D.B.
1996-12-31
The present device provides for a dynamically configurable communication network having a multi-processor parallel processing system having a serial communication network and a high speed parallel communication network. The serial communication network is used to disseminate commands from a master processor to a plurality of slave processors to effect communication protocol, to control transmission of high density data among nodes and to monitor each slave processor`s status. The high speed parallel processing network is used to effect the transmission of high density data among nodes in the parallel processing system. Each node comprises a transputer, a digital signal processor, a parallel transfer controller, and two three-port memory devices. A communication switch within each node connects it to a fast parallel hardware channel through which all high density data arrives or leaves the node. 6 figs.
Crosetto, Dario B.
1996-01-01
The present device provides for a dynamically configurable communication network having a multi-processor parallel processing system having a serial communication network and a high speed parallel communication network. The serial communication network is used to disseminate commands from a master processor (100) to a plurality of slave processors (200) to effect communication protocol, to control transmission of high density data among nodes and to monitor each slave processor's status. The high speed parallel processing network is used to effect the transmission of high density data among nodes in the parallel processing system. Each node comprises a transputer (104), a digital signal processor (114), a parallel transfer controller (106), and two three-port memory devices. A communication switch (108) within each node (100) connects it to a fast parallel hardware channel (70) through which all high density data arrives or leaves the node.
A parallel solver for huge dense linear systems
NASA Astrophysics Data System (ADS)
Badia, J. M.; Movilla, J. L.; Climente, J. I.; Castillo, M.; Marqués, M.; Mayo, R.; Quintana-Ortí, E. S.; Planelles, J.
2011-11-01
HDSS (Huge Dense Linear System Solver) is a Fortran Application Programming Interface (API) to facilitate the parallel solution of very large dense systems to scientists and engineers. The API makes use of parallelism to yield an efficient solution of the systems on a wide range of parallel platforms, from clusters of processors to massively parallel multiprocessors. It exploits out-of-core strategies to leverage the secondary memory in order to solve huge linear systems O(100.000). The API is based on the parallel linear algebra library PLAPACK, and on its Out-Of-Core (OOC) extension POOCLAPACK. Both PLAPACK and POOCLAPACK use the Message Passing Interface (MPI) as the communication layer and BLAS to perform the local matrix operations. The API provides a friendly interface to the users, hiding almost all the technical aspects related to the parallel execution of the code and the use of the secondary memory to solve the systems. In particular, the API can automatically select the best way to store and solve the systems, depending of the dimension of the system, the number of processes and the main memory of the platform. Experimental results on several parallel platforms report high performance, reaching more than 1 TFLOP with 64 cores to solve a system with more than 200 000 equations and more than 10 000 right-hand side vectors. New version program summaryProgram title: Huge Dense System Solver (HDSS) Catalogue identifier: AEHU_v1_1 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEHU_v1_1.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 87 062 No. of bytes in distributed program, including test data, etc.: 1 069 110 Distribution format: tar.gz Programming language: Fortran90, C Computer: Parallel architectures: multiprocessors, computer clusters Operating system: Linux/Unix Has the code been vectorized or parallelized?: Yes, includes MPI primitives. RAM: Tested for up to 190 GB Classification: 6.5 External routines: MPI ( http://www.mpi-forum.org/), BLAS ( http://www.netlib.org/blas/), PLAPACK ( http://www.cs.utexas.edu/~plapack/), POOCLAPACK ( ftp://ftp.cs.utexas.edu/pub/rvdg/PLAPACK/pooclapack.ps) (code for PLAPACK and POOCLAPACK is included in the distribution). Catalogue identifier of previous version: AEHU_v1_0 Journal reference of previous version: Comput. Phys. Comm. 182 (2011) 533 Does the new version supersede the previous version?: Yes Nature of problem: Huge scale dense systems of linear equations, Ax=B, beyond standard LAPACK capabilities. Solution method: The linear systems are solved by means of parallelized routines based on the LU factorization, using efficient secondary storage algorithms when the available main memory is insufficient. Reasons for new version: In many applications we need to guarantee a high accuracy in the solution of very large linear systems and we can do it by using double-precision arithmetic. Summary of revisions: Version 1.1 Can be used to solve linear systems using double-precision arithmetic. New version of the initialization routine. The user can choose the kind of arithmetic and the values of several parameters of the environment. Running time: About 5 hours to solve a system with more than 200 000 equations and more than 10 000 right-hand side vectors using double-precision arithmetic on an eight-node commodity cluster with a total of 64 Intel cores.
Hierarchically Parallelized Constrained Nonlinear Solvers with Automated Substructuring
NASA Technical Reports Server (NTRS)
Padovan, Joe; Kwang, Abel
1994-01-01
This paper develops a parallelizable multilevel multiple constrained nonlinear equation solver. The substructuring process is automated to yield appropriately balanced partitioning of each succeeding level. Due to the generality of the procedure,_sequential, as well as partially and fully parallel environments can be handled. This includes both single and multiprocessor assignment per individual partition. Several benchmark examples are presented. These illustrate the robustness of the procedure as well as its capability to yield significant reductions in memory utilization and calculational effort due both to updating and inversion.
FTMP - A highly reliable Fault-Tolerant Multiprocessor for aircraft
NASA Technical Reports Server (NTRS)
Hopkins, A. L., Jr.; Smith, T. B., III; Lala, J. H.
1978-01-01
The FTMP (Fault-Tolerant Multiprocessor) is a complex multiprocessor computer that employs a form of redundancy related to systems considered by Mathur (1971), in which each major module can substitute for any other module of the same type. Despite the conceptual simplicity of the redundancy form, the implementation has many intricacies owing partly to the low target failure rate, and partly to the difficulty of eliminating single-fault vulnerability. An extensive analysis of the computer through the use of such modeling techniques as Markov processes and combinatorial mathematics shows that for random hard faults the computer can meet its requirements. It is also shown that the maintenance scheduled at intervals of 200 hr or more can be adequate most of the time.
Centrally managed unified shared virtual address space
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wilkes, John
Systems, apparatuses, and methods for managing a unified shared virtual address space. A host may execute system software and manage a plurality of nodes coupled to the host. The host may send work tasks to the nodes, and for each node, the host may externally manage the node's view of the system's virtual address space. Each node may have a central processing unit (CPU) style memory management unit (MMU) with an internal translation lookaside buffer (TLB). In one embodiment, the host may be coupled to a given node via an input/output memory management unit (IOMMU) interface, where the IOMMU frontendmore » interface shares the TLB with the given node's MMU. In another embodiment, the host may control the given node's view of virtual address space via memory-mapped control registers.« less
NASA Astrophysics Data System (ADS)
Chase, Patrick; Vondran, Gary
2011-01-01
Tetrahedral interpolation is commonly used to implement continuous color space conversions from sparse 3D and 4D lookup tables. We investigate the implementation and optimization of tetrahedral interpolation algorithms for GPUs, and compare to the best known CPU implementations as well as to a well known GPU-based trilinear implementation. We show that a 500 NVIDIA GTX-580 GPU is 3x faster than a 1000 Intel Core i7 980X CPU for 3D interpolation, and 9x faster for 4D interpolation. Performance-relevant GPU attributes are explored including thread scheduling, local memory characteristics, global memory hierarchy, and cache behaviors. We consider existing tetrahedral interpolation algorithms and tune based on the structure and branching capabilities of current GPUs. Global memory performance is improved by reordering and expanding the lookup table to ensure optimal access behaviors. Per multiprocessor local memory is exploited to implement optimally coalesced global memory accesses, and local memory addressing is optimized to minimize bank conflicts. We explore the impacts of lookup table density upon computation and memory access costs. Also presented are CPU-based 3D and 4D interpolators, using SSE vector operations that are faster than any previously published solution.
Compiler analysis for irregular problems in FORTRAN D
NASA Technical Reports Server (NTRS)
Vonhanxleden, Reinhard; Kennedy, Ken; Koelbel, Charles; Das, Raja; Saltz, Joel
1992-01-01
We developed a dataflow framework which provides a basis for rigorously defining strategies to make use of runtime preprocessing methods for distributed memory multiprocessors. In many programs, several loops access the same off-processor memory locations. Our runtime support gives us a mechanism for tracking and reusing copies of off-processor data. A key aspect of our compiler analysis strategy is to determine when it is safe to reuse copies of off-processor data. Another crucial function of the compiler analysis is to identify situations which allow runtime preprocessing overheads to be amortized. This dataflow analysis will make it possible to effectively use the results of interprocedural analysis in our efforts to reduce interprocessor communication and the need for runtime preprocessing.
LU Factorization with Partial Pivoting for a Multi-CPU, Multi-GPU Shared Memory System
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kurzak, Jakub; Luszczek, Pitior; Faverge, Mathieu
2012-03-01
LU factorization with partial pivoting is a canonical numerical procedure and the main component of the High Performance LINPACK benchmark. This article presents an implementation of the algorithm for a hybrid, shared memory, system with standard CPU cores and GPU accelerators. Performance in excess of one TeraFLOPS is achieved using four AMD Magny Cours CPUs and four NVIDIA Fermi GPUs.
Rapid solution of large-scale systems of equations
NASA Technical Reports Server (NTRS)
Storaasli, Olaf O.
1994-01-01
The analysis and design of complex aerospace structures requires the rapid solution of large systems of linear and nonlinear equations, eigenvalue extraction for buckling, vibration and flutter modes, structural optimization and design sensitivity calculation. Computers with multiple processors and vector capabilities can offer substantial computational advantages over traditional scalar computer for these analyses. These computers fall into two categories: shared memory computers and distributed memory computers. This presentation covers general-purpose, highly efficient algorithms for generation/assembly or element matrices, solution of systems of linear and nonlinear equations, eigenvalue and design sensitivity analysis and optimization. All algorithms are coded in FORTRAN for shared memory computers and many are adapted to distributed memory computers. The capability and numerical performance of these algorithms will be addressed.
Communication Strategies for Shared-Bus Embedded Multiprocessors
2005-09-01
target architecture [10]. We utilize the task execution model in [11], where each task vi in the task graph G = (V,E) is associated with three possible...predictability is therefore an interesting and important direction for further study. REFERENCES [1] T. Kogel, M. Doerper, A. Wieferink, R. Leupers, G ...Proceedings of Real-Time Technology and Applications Symposium, 1995, pp. 164–173. [11] S. Hua, G . Qu, and S. Bhattacharyya, “Energy reduction technique
A measurement-based performability model for a multiprocessor system
NASA Technical Reports Server (NTRS)
Ilsueh, M. C.; Iyer, Ravi K.; Trivedi, K. S.
1987-01-01
A measurement-based performability model based on real error-data collected on a multiprocessor system is described. Model development from the raw errror-data to the estimation of cumulative reward is described. Both normal and failure behavior of the system are characterized. The measured data show that the holding times in key operational and failure states are not simple exponential and that semi-Markov process is necessary to model the system behavior. A reward function, based on the service rate and the error rate in each state, is then defined in order to estimate the performability of the system and to depict the cost of different failure types and recovery procedures.
Scalable parallel communications
NASA Technical Reports Server (NTRS)
Maly, K.; Khanna, S.; Overstreet, C. M.; Mukkamala, R.; Zubair, M.; Sekhar, Y. S.; Foudriat, E. C.
1992-01-01
Coarse-grain parallelism in networking (that is, the use of multiple protocol processors running replicated software sending over several physical channels) can be used to provide gigabit communications for a single application. Since parallel network performance is highly dependent on real issues such as hardware properties (e.g., memory speeds and cache hit rates), operating system overhead (e.g., interrupt handling), and protocol performance (e.g., effect of timeouts), we have performed detailed simulations studies of both a bus-based multiprocessor workstation node (based on the Sun Galaxy MP multiprocessor) and a distributed-memory parallel computer node (based on the Touchstone DELTA) to evaluate the behavior of coarse-grain parallelism. Our results indicate: (1) coarse-grain parallelism can deliver multiple 100 Mbps with currently available hardware platforms and existing networking protocols (such as Transmission Control Protocol/Internet Protocol (TCP/IP) and parallel Fiber Distributed Data Interface (FDDI) rings); (2) scale-up is near linear in n, the number of protocol processors, and channels (for small n and up to a few hundred Mbps); and (3) since these results are based on existing hardware without specialized devices (except perhaps for some simple modifications of the FDDI boards), this is a low cost solution to providing multiple 100 Mbps on current machines. In addition, from both the performance analysis and the properties of these architectures, we conclude: (1) multiple processors providing identical services and the use of space division multiplexing for the physical channels can provide better reliability than monolithic approaches (it also provides graceful degradation and low-cost load balancing); (2) coarse-grain parallelism supports running several transport protocols in parallel to provide different types of service (for example, one TCP handles small messages for many users, other TCP's running in parallel provide high bandwidth service to a single application); and (3) coarse grain parallelism will be able to incorporate many future improvements from related work (e.g., reduced data movement, fast TCP, fine-grain parallelism) also with near linear speed-ups.
Performance Analysis of Multilevel Parallel Applications on Shared Memory Architectures
NASA Technical Reports Server (NTRS)
Biegel, Bryan A. (Technical Monitor); Jost, G.; Jin, H.; Labarta J.; Gimenez, J.; Caubet, J.
2003-01-01
Parallel programming paradigms include process level parallelism, thread level parallelization, and multilevel parallelism. This viewgraph presentation describes a detailed performance analysis of these paradigms for Shared Memory Architecture (SMA). This analysis uses the Paraver Performance Analysis System. The presentation includes diagrams of a flow of useful computations.
Parallelization of KENO-Va Monte Carlo code
NASA Astrophysics Data System (ADS)
Ramón, Javier; Peña, Jorge
1995-07-01
KENO-Va is a code integrated within the SCALE system developed by Oak Ridge that solves the transport equation through the Monte Carlo Method. It is being used at the Consejo de Seguridad Nuclear (CSN) to perform criticality calculations for fuel storage pools and shipping casks. Two parallel versions of the code: one for shared memory machines and other for distributed memory systems using the message-passing interface PVM have been generated. In both versions the neutrons of each generation are tracked in parallel. In order to preserve the reproducibility of the results in both versions, advanced seeds for random numbers were used. The CONVEX C3440 with four processors and shared memory at CSN was used to implement the shared memory version. A FDDI network of 6 HP9000/735 was employed to implement the message-passing version using proprietary PVM. The speedup obtained was 3.6 in both cases.
Advanced data management system architectures testbed
NASA Technical Reports Server (NTRS)
Grant, Terry
1990-01-01
The objective of the Architecture and Tools Testbed is to provide a working, experimental focus to the evolving automation applications for the Space Station Freedom data management system. Emphasis is on defining and refining real-world applications including the following: the validation of user needs; understanding system requirements and capabilities; and extending capabilities. The approach is to provide an open, distributed system of high performance workstations representing both the standard data processors and networks and advanced RISC-based processors and multiprocessor systems. The system provides a base from which to develop and evaluate new performance and risk management concepts and for sharing the results. Participants are given a common view of requirements and capability via: remote login to the testbed; standard, natural user interfaces to simulations and emulations; special attention to user manuals for all software tools; and E-mail communication. The testbed elements which instantiate the approach are briefly described including the workstations, the software simulation and monitoring tools, and performance and fault tolerance experiments.
NASA Technical Reports Server (NTRS)
Padilla, Peter A.
1991-01-01
An investigation was made in AIRLAB of the fault handling performance of the Fault Tolerant MultiProcessor (FTMP). Fault handling errors detected during fault injection experiments were characterized. In these fault injection experiments, the FTMP disabled a working unit instead of the faulted unit once in every 500 faults, on the average. System design weaknesses allow active faults to exercise a part of the fault management software that handles Byzantine or lying faults. Byzantine faults behave such that the faulted unit points to a working unit as the source of errors. The design's problems involve: (1) the design and interface between the simplex error detection hardware and the error processing software, (2) the functional capabilities of the FTMP system bus, and (3) the communication requirements of a multiprocessor architecture. These weak areas in the FTMP's design increase the probability that, for any hardware fault, a good line replacement unit (LRU) is mistakenly disabled by the fault management software.
NASA Technical Reports Server (NTRS)
Kasahara, Hironori; Honda, Hiroki; Narita, Seinosuke
1989-01-01
Parallel processing of real-time dynamic systems simulation on a multiprocessor system named OSCAR is presented. In the simulation of dynamic systems, generally, the same calculation are repeated every time step. However, we cannot apply to Do-all or the Do-across techniques for parallel processing of the simulation since there exist data dependencies from the end of an iteration to the beginning of the next iteration and furthermore data-input and data-output are required every sampling time period. Therefore, parallelism inside the calculation required for a single time step, or a large basic block which consists of arithmetic assignment statements, must be used. In the proposed method, near fine grain tasks, each of which consists of one or more floating point operations, are generated to extract the parallelism from the calculation and assigned to processors by using optimal static scheduling at compile time in order to reduce large run time overhead caused by the use of near fine grain tasks. The practicality of the scheme is demonstrated on OSCAR (Optimally SCheduled Advanced multiprocessoR) which has been developed to extract advantageous features of static scheduling algorithms to the maximum extent.
1990-06-01
RAM and ROM output enable signals. Figure C.7 shows the logic for the interrupt priority level (IPLO* through IPL2 *) and the interrupt acknowledge...IACK681* signal is sent to the DUART when a level one interrupt acknowledge is output by the CPU. The logic for the IACK681* and the IPLO* through IPL2 ...signals are actually implemented with an EPLD. Listing D.4 in Appendix D presents the Abel description of the IACK681* and IPLO* through IPL2
Programmable architecture for pixel level processing tasks in lightweight strapdown IR seekers
NASA Astrophysics Data System (ADS)
Coates, James L.
1993-06-01
Typical processing tasks associated with missile IR seeker applications are described, and a straw man suite of algorithms is presented. A fully programmable multiprocessor architecture is realized on a multimedia video processor (MVP) developed by Texas Instruments. The MVP combines the elements of RISC, floating point, advanced DSPs, graphics processors, display and acquisition control, RAM, and external memory. Front end pixel level tasks typical of missile interceptor applications, operating on 256 x 256 sensor imagery, can be processed at frame rates exceeding 100 Hz in a single MVP chip.
Automatic Generation of Directive-Based Parallel Programs for Shared Memory Parallel Systems
NASA Technical Reports Server (NTRS)
Jin, Hao-Qiang; Yan, Jerry; Frumkin, Michael
2000-01-01
The shared-memory programming model is a very effective way to achieve parallelism on shared memory parallel computers. As great progress was made in hardware and software technologies, performance of parallel programs with compiler directives has demonstrated large improvement. The introduction of OpenMP directives, the industrial standard for shared-memory programming, has minimized the issue of portability. Due to its ease of programming and its good performance, the technique has become very popular. In this study, we have extended CAPTools, a computer-aided parallelization toolkit, to automatically generate directive-based, OpenMP, parallel programs. We outline techniques used in the implementation of the tool and present test results on the NAS parallel benchmarks and ARC3D, a CFD application. This work demonstrates the great potential of using computer-aided tools to quickly port parallel programs and also achieve good performance.
State-of-the-art review of computational fluid dynamics modeling for fluid-solids systems
NASA Astrophysics Data System (ADS)
Lyczkowski, R. W.; Bouillard, J. X.; Ding, J.; Chang, S. L.; Burge, S. W.
1994-05-01
As the result of 15 years of research (50 staff years of effort) Argonne National Laboratory (ANL), through its involvement in fluidized-bed combustion, magnetohydrodynamics, and a variety of environmental programs, has produced extensive computational fluid dynamics (CFD) software and models to predict the multiphase hydrodynamic and reactive behavior of fluid-solids motions and interactions in complex fluidized-bed reactors (FBR's) and slurry systems. This has resulted in the FLUFIX, IRF, and SLUFIX computer programs. These programs are based on fluid-solids hydrodynamic models and can predict information important to the designer of atmospheric or pressurized bubbling and circulating FBR, fluid catalytic cracking (FCC) and slurry units to guarantee optimum efficiency with minimum release of pollutants into the environment. This latter issue will become of paramount importance with the enactment of the Clean Air Act Amendment (CAAA) of 1995. Solids motion is also the key to understanding erosion processes. Erosion rates in FBR's and pneumatic and slurry components are computed by ANL's EROSION code to predict the potential metal wastage of FBR walls, intervals, feed distributors, and cyclones. Only the FLUFIX and IRF codes will be reviewed in the paper together with highlights of the validations because of length limitations. It is envisioned that one day, these codes with user-friendly pre- and post-processor software and tailored for massively parallel multiprocessor shared memory computational platforms will be used by industry and researchers to assist in reducing and/or eliminating the environmental and economic barriers which limit full consideration of coal, shale, and biomass as energy sources; to retain energy security; and to remediate waste and ecological problems.
Iterative algorithms for tridiagonal matrices on a WSI-multiprocessor
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gajski, D.D.; Sameh, A.H.; Wisniewski, J.A.
1982-01-01
With the rapid advances in semiconductor technology, the construction of Wafer Scale Integration (WSI)-multiprocessors consisting of a large number of processors is now feasible. We illustrate the implementation of some basic linear algebra algorithms on such multiprocessors.
NASA Technical Reports Server (NTRS)
Sohn, Andrew; Biswas, Rupak; Simon, Horst D.
1996-01-01
The computational requirements for an adaptive solution of unsteady problems change as the simulation progresses. This causes workload imbalance among processors on a parallel machine which, in turn, requires significant data movement at runtime. We present a new dynamic load-balancing framework, called JOVE, that balances the workload across all processors with a global view. Whenever the computational mesh is adapted, JOVE is activated to eliminate the load imbalance. JOVE has been implemented on an IBM SP2 distributed-memory machine in MPI for portability. Experimental results for two model meshes demonstrate that mesh adaption with load balancing gives more than a sixfold improvement over one without load balancing. We also show that JOVE gives a 24-fold speedup on 64 processors compared to sequential execution.
The potential of multi-port optical memories in digital computing
NASA Technical Reports Server (NTRS)
Alford, C. O.; Gaylord, T. K.
1975-01-01
A high-capacity memory with a relatively high data transfer rate and multi-port simultaneous access capability may serve as the basis for new computer architectures. The implementation of a multi-port optical memory is discussed. Several computer structures are presented that might profitably use such a memory. These structures include (1) a simultaneous record access system, (2) a simultaneously shared memory computer system, and (3) a parallel digital processing structure.
Performance Evaluation of Parallel Algorithms and Architectures in Concurrent Multiprocessor Systems
1988-09-01
HEP and Other Parallel processors, Report no. ANL-83-97, Argonne National Laboratory, Argonne, Ill. 1983. [19] Davidson, G . S. A Practical Paradigm for...IEEE Comp. Soc., 1986. [241 Peir, Jih-kwon, and D. Gajski , "CAMP: A Programming Aide For Multiprocessors," Proc. 1986 ICPP, IEEE Comp. Soc., pp475...482. [251 Pfister, G . F., and V. A. Norton, "Hot Spot Contention and Combining in Multistage Interconnection Networks,"IEEE Trans. Comp., C-34, Oct
Austin, John R
2003-10-01
Previous research on transactive memory has found a positive relationship between transactive memory system development and group performance in single project laboratory and ad hoc groups. Closely related research on shared mental models and expertise recognition supports these findings. In this study, the author examined the relationship between transactive memory systems and performance in mature, continuing groups. A group's transactive memory system, measured as a combination of knowledge stock, knowledge specialization, transactive memory consensus, and transactive memory accuracy, is positively related to group goal performance, external group evaluations, and internal group evaluations. The positive relationship with group performance was found to hold for both task and external relationship transactive memory systems.
ERIC Educational Resources Information Center
Schweppe, Judith; Rummer, Ralf
2007-01-01
The general idea of language-based accounts of short-term memory is that retention of linguistic materials is based on representations within the language processing system. In the present sentence recall study, we address the question whether the assumption of shared representations holds for morphosyntactic information (here: grammatical gender…
Probabilistic evaluation of on-line checks in fault-tolerant multiprocessor systems
NASA Technical Reports Server (NTRS)
Nair, V. S. S.; Hoskote, Yatin V.; Abraham, Jacob A.
1992-01-01
The analysis of fault-tolerant multiprocessor systems that use concurrent error detection (CED) schemes is much more difficult than the analysis of conventional fault-tolerant architectures. Various analytical techniques have been proposed to evaluate CED schemes deterministically. However, these approaches are based on worst-case assumptions related to the failure of system components. Often, the evaluation results do not reflect the actual fault tolerance capabilities of the system. A probabilistic approach to evaluate the fault detecting and locating capabilities of on-line checks in a system is developed. The various probabilities associated with the checking schemes are identified and used in the framework of the matrix-based model. Based on these probabilistic matrices, estimates for the fault tolerance capabilities of various systems are derived analytically.
Automated quantitative muscle biopsy analysis system
NASA Technical Reports Server (NTRS)
Castleman, Kenneth R. (Inventor)
1980-01-01
An automated system to aid the diagnosis of neuromuscular diseases by producing fiber size histograms utilizing histochemically stained muscle biopsy tissue. Televised images of the microscopic fibers are processed electronically by a multi-microprocessor computer, which isolates, measures, and classifies the fibers and displays the fiber size distribution. The architecture of the multi-microprocessor computer, which is iterated to any required degree of complexity, features a series of individual microprocessors P.sub.n each receiving data from a shared memory M.sub.n-1 and outputing processed data to a separate shared memory M.sub.n+1 under control of a program stored in dedicated memory M.sub.n.
Virtual memory support for distributed computing environments using a shared data object model
NASA Astrophysics Data System (ADS)
Huang, F.; Bacon, J.; Mapp, G.
1995-12-01
Conventional storage management systems provide one interface for accessing memory segments and another for accessing secondary storage objects. This hinders application programming and affects overall system performance due to mandatory data copying and user/kernel boundary crossings, which in the microkernel case may involve context switches. Memory-mapping techniques may be used to provide programmers with a unified view of the storage system. This paper extends such techniques to support a shared data object model for distributed computing environments in which good support for coherence and synchronization is essential. The approach is based on a microkernel, typed memory objects, and integrated coherence control. A microkernel architecture is used to support multiple coherence protocols and the addition of new protocols. Memory objects are typed and applications can choose the most suitable protocols for different types of object to avoid protocol mismatch. Low-level coherence control is integrated with high-level concurrency control so that the number of messages required to maintain memory coherence is reduced and system-wide synchronization is realized without severely impacting the system performance. These features together contribute a novel approach to the support for flexible coherence under application control.
Performance Analysis of Multilevel Parallel Applications on Shared Memory Architectures
NASA Technical Reports Server (NTRS)
Jost, Gabriele; Jin, Haoqiang; Labarta, Jesus; Gimenez, Judit; Caubet, Jordi; Biegel, Bryan A. (Technical Monitor)
2002-01-01
In this paper we describe how to apply powerful performance analysis techniques to understand the behavior of multilevel parallel applications. We use the Paraver/OMPItrace performance analysis system for our study. This system consists of two major components: The OMPItrace dynamic instrumentation mechanism, which allows the tracing of processes and threads and the Paraver graphical user interface for inspection and analyses of the generated traces. We describe how to use the system to conduct a detailed comparative study of a benchmark code implemented in five different programming paradigms applicable for shared memory
Multiprocessor Neural Network in Healthcare.
Godó, Zoltán Attila; Kiss, Gábor; Kocsis, Dénes
2015-01-01
A possible way of creating a multiprocessor artificial neural network is by the use of microcontrollers. The RISC processors' high performance and the large number of I/O ports mean they are greatly suitable for creating such a system. During our research, we wanted to see if it is possible to efficiently create interaction between the artifical neural network and the natural nervous system. To achieve as much analogy to the living nervous system as possible, we created a frequency-modulated analog connection between the units. Our system is connected to the living nervous system through 128 microelectrodes. Two-way communication is provided through A/D transformation, which is even capable of testing psychopharmacons. The microcontroller-based analog artificial neural network can play a great role in medical singal processing, such as ECG, EEG etc.
Characterizing parallel file-access patterns on a large-scale multiprocessor
NASA Technical Reports Server (NTRS)
Purakayastha, A.; Ellis, Carla; Kotz, David; Nieuwejaar, Nils; Best, Michael L.
1995-01-01
High-performance parallel file systems are needed to satisfy tremendous I/O requirements of parallel scientific applications. The design of such high-performance parallel file systems depends on a comprehensive understanding of the expected workload, but so far there have been very few usage studies of multiprocessor file systems. This paper is part of the CHARISMA project, which intends to fill this void by measuring real file-system workloads on various production parallel machines. In particular, we present results from the CM-5 at the National Center for Supercomputing Applications. Our results are unique because we collect information about nearly every individual I/O request from the mix of jobs running on the machine. Analysis of the traces leads to various recommendations for parallel file-system design.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Dritz, K.W.; Boyle, J.M.
This paper addresses the problem of measuring and analyzing the performance of fine-grained parallel programs running on shared-memory multiprocessors. Such processors use locking (either directly in the application program, or indirectly in a subroutine library or the operating system) to serialize accesses to global variables. Given sufficiently high rates of locking, the chief factor preventing linear speedup (besides lack of adequate inherent parallelism in the application) is lock contention - the blocking of processes that are trying to acquire a lock currently held by another process. We show how a high-resolution, low-overhead clock may be used to measure both lockmore » contention and lack of parallel work. Several ways of presenting the results are covered, culminating in a method for calculating, in a single multiprocessing run, both the speedup actually achieved and the speedup lost to contention for each lock and to lack of parallel work. The speedup losses are reported in the same units, ''processor-equivalents,'' as the speedup achieved. Both are obtained without having to perform the usual one-process comparison run. We chronicle also a variety of experiments motivated by actual results obtained with our measurement method. The insights into program performance that we gained from these experiments helped us to refine the parts of our programs concerned with communication and synchronization. Ultimately these improvements reduced lock contention to a negligible amount and yielded nearly linear speedup in applications not limited by lack of parallel work. We describe two generally applicable strategies (''code motion out of critical regions'' and ''critical-region fissioning'') for reducing lock contention and one (''lock/variable fusion'') applicable only on certain architectures.« less
A Family of ACO Routing Protocols for Mobile Ad Hoc Networks.
Rupérez Cañas, Delfín; Sandoval Orozco, Ana Lucila; García Villalba, Luis Javier; Kim, Tai-Hoon
2017-05-22
In this work, an ACO routing protocol for mobile ad hoc networks based on AntHocNet is specified. As its predecessor, this new protocol, called AntOR, is hybrid in the sense that it contains elements from both reactive and proactive routing. Specifically, it combines a reactive route setup process with a proactive route maintenance and improvement process. Key aspects of the AntOR protocol are the disjoint-link and disjoint-node routes, separation between the regular pheromone and the virtual pheromone in the diffusion process and the exploration of routes, taking into consideration the number of hops in the best routes. In this work, a family of ACO routing protocols based on AntOR is also specified. These protocols are based on protocol successive refinements. In this work, we also present a parallelized version of AntOR that we call PAntOR. Using programming multiprocessor architectures based on the shared memory protocol, PAntOR allows running tasks in parallel using threads. This parallelization is applicable in the route setup phase, route local repair process and link failure notification. In addition, a variant of PAntOR that consists of having more than one interface, which we call PAntOR-MI (PAntOR-Multiple Interface), is specified. This approach parallelizes the sending of broadcast messages by interface through threads.
Howe, Piers D. L.
2017-01-01
To understand how the visual system represents multiple moving objects and how those representations contribute to tracking, it is essential that we understand how the processes of attention and working memory interact. In the work described here we present an investigation of that interaction via a series of tracking and working memory dual-task experiments. Previously, it has been argued that tracking is resistant to disruption by a concurrent working memory task and that any apparent disruption is in fact due to observers making a response to the working memory task, rather than due to competition for shared resources. Contrary to this, in our experiments we find that when task order and response order confounds are avoided, all participants show a similar decrease in both tracking and working memory performance. However, if task and response order confounds are not adequately controlled for we find substantial individual differences, which could explain the previous conflicting reports on this topic. Our results provide clear evidence that tracking and working memory tasks share processing resources. PMID:28410383
Lapierre, Mark D; Cropper, Simon J; Howe, Piers D L
2017-01-01
To understand how the visual system represents multiple moving objects and how those representations contribute to tracking, it is essential that we understand how the processes of attention and working memory interact. In the work described here we present an investigation of that interaction via a series of tracking and working memory dual-task experiments. Previously, it has been argued that tracking is resistant to disruption by a concurrent working memory task and that any apparent disruption is in fact due to observers making a response to the working memory task, rather than due to competition for shared resources. Contrary to this, in our experiments we find that when task order and response order confounds are avoided, all participants show a similar decrease in both tracking and working memory performance. However, if task and response order confounds are not adequately controlled for we find substantial individual differences, which could explain the previous conflicting reports on this topic. Our results provide clear evidence that tracking and working memory tasks share processing resources.
NASA Technical Reports Server (NTRS)
Huynh, Loc C.; Duval, R. W.
1986-01-01
The use of Redundant Asynchronous Multiprocessor System to achieve ultrareliable Fault Tolerant Control Systems shows great promise. The development has been hampered by the inability to determine whether differences in the outputs of redundant CPU's are due to failures or to accrued error built up by slight differences in CPU clock intervals. This study derives an analytical dynamic model of the difference between redundant CPU's due to differences in their clock intervals and uses this model with on-line parameter identification to idenitify the differences in the clock intervals. The ability of this methodology to accurately track errors due to asynchronisity generate an error signal with the effect of asynchronisity removed and this signal may be used to detect and isolate actual system failures.
Meeting the memory challenges of brain-scale network simulation.
Kunkel, Susanne; Potjans, Tobias C; Eppler, Jochen M; Plesser, Hans Ekkehard; Morrison, Abigail; Diesmann, Markus
2011-01-01
The development of high-performance simulation software is crucial for studying the brain connectome. Using connectome data to generate neurocomputational models requires software capable of coping with models on a variety of scales: from the microscale, investigating plasticity, and dynamics of circuits in local networks, to the macroscale, investigating the interactions between distinct brain regions. Prior to any serious dynamical investigation, the first task of network simulations is to check the consistency of data integrated in the connectome and constrain ranges for yet unknown parameters. Thanks to distributed computing techniques, it is possible today to routinely simulate local cortical networks of around 10(5) neurons with up to 10(9) synapses on clusters and multi-processor shared-memory machines. However, brain-scale networks are orders of magnitude larger than such local networks, in terms of numbers of neurons and synapses as well as in terms of computational load. Such networks have been investigated in individual studies, but the underlying simulation technologies have neither been described in sufficient detail to be reproducible nor made publicly available. Here, we discover that as the network model sizes approach the regime of meso- and macroscale simulations, memory consumption on individual compute nodes becomes a critical bottleneck. This is especially relevant on modern supercomputers such as the Blue Gene/P architecture where the available working memory per CPU core is rather limited. We develop a simple linear model to analyze the memory consumption of the constituent components of neuronal simulators as a function of network size and the number of cores used. This approach has multiple benefits. The model enables identification of key contributing components to memory saturation and prediction of the effects of potential improvements to code before any implementation takes place. As a consequence, development cycles can be shorter and less expensive. Applying the model to our freely available Neural Simulation Tool (NEST), we identify the software components dominant at different scales, and develop general strategies for reducing the memory consumption, in particular by using data structures that exploit the sparseness of the local representation of the network. We show that these adaptations enable our simulation software to scale up to the order of 10,000 processors and beyond. As memory consumption issues are likely to be relevant for any software dealing with complex connectome data on such architectures, our approach and our findings should be useful for researchers developing novel neuroinformatics solutions to the challenges posed by the connectome project.
A language comparison for scientific computing on MIMD architectures
NASA Technical Reports Server (NTRS)
Jones, Mark T.; Patrick, Merrell L.; Voigt, Robert G.
1989-01-01
Choleski's method for solving banded symmetric, positive definite systems is implemented on a multiprocessor computer using three FORTRAN based parallel programming languages, the Force, PISCES and Concurrent FORTRAN. The capabilities of the language for expressing parallelism and their user friendliness are discussed, including readability of the code, debugging assistance offered, and expressiveness of the languages. The performance of the different implementations is compared. It is argued that PISCES, using the Force for medium-grained parallelism, is the appropriate choice for programming Choleski's method on the multiprocessor computer, Flex/32.
Komarov, Ivan; D'Souza, Roshan M
2012-01-01
The Gillespie Stochastic Simulation Algorithm (GSSA) and its variants are cornerstone techniques to simulate reaction kinetics in situations where the concentration of the reactant is too low to allow deterministic techniques such as differential equations. The inherent limitations of the GSSA include the time required for executing a single run and the need for multiple runs for parameter sweep exercises due to the stochastic nature of the simulation. Even very efficient variants of GSSA are prohibitively expensive to compute and perform parameter sweeps. Here we present a novel variant of the exact GSSA that is amenable to acceleration by using graphics processing units (GPUs). We parallelize the execution of a single realization across threads in a warp (fine-grained parallelism). A warp is a collection of threads that are executed synchronously on a single multi-processor. Warps executing in parallel on different multi-processors (coarse-grained parallelism) simultaneously generate multiple trajectories. Novel data-structures and algorithms reduce memory traffic, which is the bottleneck in computing the GSSA. Our benchmarks show an 8×-120× performance gain over various state-of-the-art serial algorithms when simulating different types of models.
Recent Progress on the Parallel Implementation of Moving-Body Overset Grid Schemes
NASA Technical Reports Server (NTRS)
Wissink, Andrew; Allen, Edwin (Technical Monitor)
1998-01-01
Viscous calculations about geometrically complex bodies in which there is relative motion between component parts is one of the most computationally demanding problems facing CFD researchers today. This presentation documents results from the first two years of a CHSSI-funded effort within the U.S. Army AFDD to develop scalable dynamic overset grid methods for unsteady viscous calculations with moving-body problems. The first pan of the presentation will focus on results from OVERFLOW-D1, a parallelized moving-body overset grid scheme that employs traditional Chimera methodology. The two processes that dominate the cost of such problems are the flow solution on each component and the intergrid connectivity solution. Parallel implementations of the OVERFLOW flow solver and DCF3D connectivity software are coupled with a proposed two-part static-dynamic load balancing scheme and tested on the IBM SP and Cray T3E multi-processors. The second part of the presentation will cover some recent results from OVERFLOW-D2, a new flow solver that employs Cartesian grids with various levels of refinement, facilitating solution adaption. A study of the parallel performance of the scheme on large distributed- memory multiprocessor computer architectures will be reported.
Runtime support for parallelizing data mining algorithms
NASA Astrophysics Data System (ADS)
Jin, Ruoming; Agrawal, Gagan
2002-03-01
With recent technological advances, shared memory parallel machines have become more scalable, and offer large main memories and high bus bandwidths. They are emerging as good platforms for data warehousing and data mining. In this paper, we focus on shared memory parallelization of data mining algorithms. We have developed a series of techniques for parallelization of data mining algorithms, including full replication, full locking, fixed locking, optimized full locking, and cache-sensitive locking. Unlike previous work on shared memory parallelization of specific data mining algorithms, all of our techniques apply to a large number of common data mining algorithms. In addition, we propose a reduction-object based interface for specifying a data mining algorithm. We show how our runtime system can apply any of the technique we have developed starting from a common specification of the algorithm.
Vascular system modeling in parallel environment - distributed and shared memory approaches
Jurczuk, Krzysztof; Kretowski, Marek; Bezy-Wendling, Johanne
2011-01-01
The paper presents two approaches in parallel modeling of vascular system development in internal organs. In the first approach, new parts of tissue are distributed among processors and each processor is responsible for perfusing its assigned parts of tissue to all vascular trees. Communication between processors is accomplished by passing messages and therefore this algorithm is perfectly suited for distributed memory architectures. The second approach is designed for shared memory machines. It parallelizes the perfusion process during which individual processing units perform calculations concerning different vascular trees. The experimental results, performed on a computing cluster and multi-core machines, show that both algorithms provide a significant speedup. PMID:21550891
Multiprocessor computer overset grid method and apparatus
Barnette, Daniel W.; Ober, Curtis C.
2003-01-01
A multiprocessor computer overset grid method and apparatus comprises associating points in each overset grid with processors and using mapped interpolation transformations to communicate intermediate values between processors assigned base and target points of the interpolation transformations. The method allows a multiprocessor computer to operate with effective load balance on overset grid applications.
DMA shared byte counters in a parallel computer
Chen, Dong; Gara, Alan G.; Heidelberger, Philip; Vranas, Pavlos
2010-04-06
A parallel computer system is constructed as a network of interconnected compute nodes. Each of the compute nodes includes at least one processor, a memory and a DMA engine. The DMA engine includes a processor interface for interfacing with the at least one processor, DMA logic, a memory interface for interfacing with the memory, a DMA network interface for interfacing with the network, injection and reception byte counters, injection and reception FIFO metadata, and status registers and control registers. The injection FIFOs maintain memory locations of the injection FIFO metadata memory locations including its current head and tail, and the reception FIFOs maintain the reception FIFO metadata memory locations including its current head and tail. The injection byte counters and reception byte counters may be shared between messages.
Combining Distributed and Shared Memory Models: Approach and Evolution of the Global Arrays Toolkit
DOE Office of Scientific and Technical Information (OSTI.GOV)
Nieplocha, Jarek; Harrison, Robert J.; Kumar, Mukul
2002-07-29
Both shared memory and distributed memory models have advantages and shortcomings. Shared memory model is much easier to use but it ignores data locality/placement. Given the hierarchical nature of the memory subsystems in the modern computers this characteristic might have a negative impact on performance and scalability. Various techniques, such as code restructuring to increase data reuse and introducing blocking in data accesses, can address the problem and yield performance competitive with message passing[Singh], however at the cost of compromising the ease of use feature. Distributed memory models such as message passing or one-sided communication offer performance and scalability butmore » they compromise the ease-of-use. In this context, the message-passing model is sometimes referred to as?assembly programming for the scientific computing?. The Global Arrays toolkit[GA1, GA2] attempts to offer the best features of both models. It implements a shared-memory programming model in which data locality is managed explicitly by the programmer. This management is achieved by explicit calls to functions that transfer data between a global address space (a distributed array) and local storage. In this respect, the GA model has similarities to the distributed shared-memory models that provide an explicit acquire/release protocol. However, the GA model acknowledges that remote data is slower to access than local data and allows data locality to be explicitly specified and hence managed. The GA model exposes to the programmer the hierarchical memory of modern high-performance computer systems, and by recognizing the communication overhead for remote data transfer, it promotes data reuse and locality of reference. This paper describes the characteristics of the Global Arrays programming model, capabilities of the toolkit, and discusses its evolution.« less
System architecture for asynchronous multi-processor robotic control system
NASA Technical Reports Server (NTRS)
Steele, Robert D.; Long, Mark; Backes, Paul
1993-01-01
The architecture for the Modular Telerobot Task Execution System (MOTES) as implemented in the Supervisory Telerobotics (STELER) Laboratory is described. MOTES is the software component of the remote site of a local-remote telerobotic system which is being developed for NASA for space applications, in particular Space Station Freedom applications. The system is being developed to provide control and supervised autonomous control to support both space based operation and ground-remote control with time delay. The local-remote architecture places task planning responsibilities at the local site and task execution responsibilities at the remote site. This separation allows the remote site to be designed to optimize task execution capability within a limited computational environment such as is expected in flight systems. The local site task planning system could be placed on the ground where few computational limitations are expected. MOTES is written in the Ada programming language for a multiprocessor environment.
Symbiosis of executive and selective attention in working memory
Vandierendonck, André
2014-01-01
The notion of working memory (WM) was introduced to account for the usage of short-term memory resources by other cognitive tasks such as reasoning, mental arithmetic, language comprehension, and many others. This collaboration between memory and other cognitive tasks can only be achieved by a dedicated WM system that controls task coordination. To that end, WM models include executive control. Nevertheless, other attention control systems may be involved in coordination of memory and cognitive tasks calling on memory resources. The present paper briefly reviews the evidence concerning the role of selective attention in WM activities. A model is proposed in which selective attention control is directly linked to the executive control part of the WM system. The model assumes that apart from storage of declarative information, the system also includes an executive WM module that represents the current task set. Control processes are automatically triggered when particular conditions in these modules are met. As each task set represents the parameter settings and the actions needed to achieve the task goal, it will depend on the specific settings and actions whether selective attention control will have to be shared among the active tasks. Only when such sharing is required, task performance will be affected by the capacity limits of the control system involved. PMID:25152723
Symbiosis of executive and selective attention in working memory.
Vandierendonck, André
2014-01-01
The notion of working memory (WM) was introduced to account for the usage of short-term memory resources by other cognitive tasks such as reasoning, mental arithmetic, language comprehension, and many others. This collaboration between memory and other cognitive tasks can only be achieved by a dedicated WM system that controls task coordination. To that end, WM models include executive control. Nevertheless, other attention control systems may be involved in coordination of memory and cognitive tasks calling on memory resources. The present paper briefly reviews the evidence concerning the role of selective attention in WM activities. A model is proposed in which selective attention control is directly linked to the executive control part of the WM system. The model assumes that apart from storage of declarative information, the system also includes an executive WM module that represents the current task set. Control processes are automatically triggered when particular conditions in these modules are met. As each task set represents the parameter settings and the actions needed to achieve the task goal, it will depend on the specific settings and actions whether selective attention control will have to be shared among the active tasks. Only when such sharing is required, task performance will be affected by the capacity limits of the control system involved.
NASA Technical Reports Server (NTRS)
Shalkhauser, Mary JO; Quintana, Jorge A.; Soni, Nitin J.
1994-01-01
The NASA Lewis Research Center is developing a multichannel communication signal processing satellite (MCSPS) system which will provide low data rate, direct to user, commercial communications services. The focus of current space segment developments is a flexible, high-throughput, fault tolerant onboard information switching processor. This information switching processor (ISP) is a destination-directed packet switch which performs both space and time switching to route user information among numerous user ground terminals. Through both industry study contracts and in-house investigations, several packet switching architectures were examined. A contention-free approach, the shared memory per beam architecture, was selected for implementation. The shared memory per beam architecture, fault tolerance insertion, implementation, and demonstration plans are described.
Instrumentation, performance visualization, and debugging tools for multiprocessors
NASA Technical Reports Server (NTRS)
Yan, Jerry C.; Fineman, Charles E.; Hontalas, Philip J.
1991-01-01
The need for computing power has forced a migration from serial computation on a single processor to parallel processing on multiprocessor architectures. However, without effective means to monitor (and visualize) program execution, debugging, and tuning parallel programs becomes intractably difficult as program complexity increases with the number of processors. Research on performance evaluation tools for multiprocessors is being carried out at ARC. Besides investigating new techniques for instrumenting, monitoring, and presenting the state of parallel program execution in a coherent and user-friendly manner, prototypes of software tools are being incorporated into the run-time environments of various hardware testbeds to evaluate their impact on user productivity. Our current tool set, the Ames Instrumentation Systems (AIMS), incorporates features from various software systems developed in academia and industry. The execution of FORTRAN programs on the Intel iPSC/860 can be automatically instrumented and monitored. Performance data collected in this manner can be displayed graphically on workstations supporting X-Windows. We have successfully compared various parallel algorithms for computational fluid dynamics (CFD) applications in collaboration with scientists from the Numerical Aerodynamic Simulation Systems Division. By performing these comparisons, we show that performance monitors and debuggers such as AIMS are practical and can illuminate the complex dynamics that occur within parallel programs.
Advanced Development of Certified OS Kernels
2015-06-01
It provides an infrastructure to map a physical page into multiple processes’ page maps in different address spaces. Their ownership mechanism ensures...of their shared memory infrastructure . Trap module The trap module specifies the behaviors of exception handlers and mCertiKOS system calls. In...layers), 1 pm for the shared memory infrastructure (3 layers), 3.5 pm for the thread management (10 layers), 1 pm for the process management (4 layers
Performing an allreduce operation using shared memory
Archer, Charles J [Rochester, MN; Dozsa, Gabor [Ardsley, NY; Ratterman, Joseph D [Rochester, MN; Smith, Brian E [Rochester, MN
2012-04-17
Methods, apparatus, and products are disclosed for performing an allreduce operation using shared memory that include: receiving, by at least one of a plurality of processing cores on a compute node, an instruction to perform an allreduce operation; establishing, by the core that received the instruction, a job status object for specifying a plurality of shared memory allreduce work units, the plurality of shared memory allreduce work units together performing the allreduce operation on the compute node; determining, by an available core on the compute node, a next shared memory allreduce work unit in the job status object; and performing, by that available core on the compute node, that next shared memory allreduce work unit.
Performing an allreduce operation using shared memory
Archer, Charles J; Dozsa, Gabor; Ratterman, Joseph D; Smith, Brian E
2014-06-10
Methods, apparatus, and products are disclosed for performing an allreduce operation using shared memory that include: receiving, by at least one of a plurality of processing cores on a compute node, an instruction to perform an allreduce operation; establishing, by the core that received the instruction, a job status object for specifying a plurality of shared memory allreduce work units, the plurality of shared memory allreduce work units together performing the allreduce operation on the compute node; determining, by an available core on the compute node, a next shared memory allreduce work unit in the job status object; and performing, by that available core on the compute node, that next shared memory allreduce work unit.
Multitasking kernel for the C and Fortran programming languages
DOE Office of Scientific and Technical Information (OSTI.GOV)
Brooks, E.D. III
1984-09-01
A multitasking kernel for the C and Fortran programming languages which runs on the Unix operating system is presented. The kernel provides a multitasking environment which serves two purposes. The first is to provide an efficient portable environment for the coding, debugging and execution of production multiprocessor programs. The second is to provide a means of evaluating the performance of a multitasking program on model multiprocessors. The performance evaluation features require no changes in the source code of the application and are implemented as a set of compile and run time options in the kernel.
Implementation of Parallel Dynamic Simulation on Shared-Memory vs. Distributed-Memory Environments
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jin, Shuangshuang; Chen, Yousu; Wu, Di
2015-12-09
Power system dynamic simulation computes the system response to a sequence of large disturbance, such as sudden changes in generation or load, or a network short circuit followed by protective branch switching operation. It consists of a large set of differential and algebraic equations, which is computational intensive and challenging to solve using single-processor based dynamic simulation solution. High-performance computing (HPC) based parallel computing is a very promising technology to speed up the computation and facilitate the simulation process. This paper presents two different parallel implementations of power grid dynamic simulation using Open Multi-processing (OpenMP) on shared-memory platform, and Messagemore » Passing Interface (MPI) on distributed-memory clusters, respectively. The difference of the parallel simulation algorithms and architectures of the two HPC technologies are illustrated, and their performances for running parallel dynamic simulation are compared and demonstrated.« less
Dynamic Load Balancing for Adaptive Computations on Distributed-Memory Machines
NASA Technical Reports Server (NTRS)
1999-01-01
Dynamic load balancing is central to adaptive mesh-based computations on large-scale parallel computers. The principal investigator has investigated various issues on the dynamic load balancing problem under NASA JOVE and JAG rants. The major accomplishments of the project are two graph partitioning algorithms and a load balancing framework. The S-HARP dynamic graph partitioner is known to be the fastest among the known dynamic graph partitioners to date. It can partition a graph of over 100,000 vertices in 0.25 seconds on a 64- processor Cray T3E distributed-memory multiprocessor while maintaining the scalability of over 16-fold speedup. Other known and widely used dynamic graph partitioners take over a second or two while giving low scalability of a few fold speedup on 64 processors. These results have been published in journals and peer-reviewed flagship conferences.
DOE Office of Scientific and Technical Information (OSTI.GOV)
G.A. Pope; K. Sephernoori; D.C. McKinney
1996-03-15
This report describes the application of distributed-memory parallel programming techniques to a compositional simulator called UTCHEM. The University of Texas Chemical Flooding reservoir simulator (UTCHEM) is a general-purpose vectorized chemical flooding simulator that models the transport of chemical species in three-dimensional, multiphase flow through permeable media. The parallel version of UTCHEM addresses solving large-scale problems by reducing the amount of time that is required to obtain the solution as well as providing a flexible and portable programming environment. In this work, the original parallel version of UTCHEM was modified and ported to CRAY T3D and CRAY T3E, distributed-memory, multiprocessor computersmore » using CRAY-PVM as the interprocessor communication library. Also, the data communication routines were modified such that the portability of the original code across different computer architectures was mad possible.« less
Cache write generate for parallel image processing on shared memory architectures.
Wittenbrink, C M; Somani, A K; Chen, C H
1996-01-01
We investigate cache write generate, our cache mode invention. We demonstrate that for parallel image processing applications, the new mode improves main memory bandwidth, CPU efficiency, cache hits, and cache latency. We use register level simulations validated by the UW-Proteus system. Many memory, cache, and processor configurations are evaluated.
Reducing Interprocessor Dependence in Recoverable Distributed Shared Memory
NASA Technical Reports Server (NTRS)
Janssens, Bob; Fuchs, W. Kent
1994-01-01
Checkpointing techniques in parallel systems use dependency tracking and/or message logging to ensure that a system rolls back to a consistent state. Traditional dependency tracking in distributed shared memory (DSM) systems is expensive because of high communication frequency. In this paper we show that, if designed correctly, a DSM system only needs to consider dependencies due to the transfer of blocks of data, resulting in reduced dependency tracking overhead and reduced potential for rollback propagation. We develop an ownership timestamp scheme to tolerate the loss of block state information and develop a passive server model of execution where interactions between processors are considered atomic. With our scheme, dependencies are significantly reduced compared to the traditional message-passing model.
Oyarzún, Javiera P; Morís, Joaquín; Luque, David; de Diego-Balaguer, Ruth; Fuentemilla, Lluís
2017-08-09
System memory consolidation is conceptualized as an active process whereby newly encoded memory representations are strengthened through selective memory reactivation during sleep. However, our learning experience is highly overlapping in content (i.e., shares common elements), and memories of these events are organized in an intricate network of overlapping associated events. It remains to be explored whether and how selective memory reactivation during sleep has an impact on these overlapping memories acquired during awake time. Here, we test in a group of adult women and men the prediction that selective memory reactivation during sleep entails the reactivation of associated events and that this may lead the brain to adaptively regulate whether these associated memories are strengthened or pruned from memory networks on the basis of their relative associative strength with the shared element. Our findings demonstrate the existence of efficient regulatory neural mechanisms governing how complex memory networks are shaped during sleep as a function of their associative memory strength. SIGNIFICANCE STATEMENT Numerous studies have demonstrated that system memory consolidation is an active, selective, and sleep-dependent process in which only subsets of new memories become stabilized through their reactivation. However, the learning experience is highly overlapping in content and thus events are encoded in an intricate network of related memories. It remains to be explored whether and how memory reactivation has an impact on overlapping memories acquired during awake time. Here, we show that sleep memory reactivation promotes strengthening and weakening of overlapping memories based on their associative memory strength. These results suggest the existence of an efficient regulatory neural mechanism that avoids the formation of cluttered memory representation of multiple events and promotes stabilization of complex memory networks. Copyright © 2017 the authors 0270-6474/17/377748-11$15.00/0.
Cooperative Data Sharing: Simple Support for Clusters of SMP Nodes
NASA Technical Reports Server (NTRS)
DiNucci, David C.; Balley, David H. (Technical Monitor)
1997-01-01
Libraries like PVM and MPI send typed messages to allow for heterogeneous cluster computing. Lower-level libraries, such as GAM, provide more efficient access to communication by removing the need to copy messages between the interface and user space in some cases. still lower-level interfaces, such as UNET, get right down to the hardware level to provide maximum performance. However, these are all still interfaces for passing messages from one process to another, and have limited utility in a shared-memory environment, due primarily to the fact that message passing is just another term for copying. This drawback is made more pertinent by today's hybrid architectures (e.g. clusters of SMPs), where it is difficult to know beforehand whether two communicating processes will share memory. As a result, even portable language tools (like HPF compilers) must either map all interprocess communication, into message passing with the accompanying performance degradation in shared memory environments, or they must check each communication at run-time and implement the shared-memory case separately for efficiency. Cooperative Data Sharing (CDS) is a single user-level API which abstracts all communication between processes into the sharing and access coordination of memory regions, in a model which might be described as "distributed shared messages" or "large-grain distributed shared memory". As a result, the user programs to a simple latency-tolerant abstract communication specification which can be mapped efficiently to either a shared-memory or message-passing based run-time system, depending upon the available architecture. Unlike some distributed shared memory interfaces, the user still has complete control over the assignment of data to processors, the forwarding of data to its next likely destination, and the queuing of data until it is needed, so even the relatively high latency present in clusters can be accomodated. CDS does not require special use of an MMU, which can add overhead to some DSM systems, and does not require an SPMD programming model. unlike some message-passing interfaces, CDS allows the user to implement efficient demand-driven applications where processes must "fight" over data, and does not perform copying if processes share memory and do not attempt concurrent writes. CDS also supports heterogeneous computing, dynamic process creation, handlers, and a very simple thread-arbitration mechanism. Additional support for array subsections is currently being considered. The CDS1 API, which forms the kernel of CDS, is built primarily upon only 2 communication primitives, one process initiation primitive, and some data translation (and marshalling) routines, memory allocation routines, and priority control routines. The entire current collection of 28 routines provides enough functionality to implement most (or all) of MPI 1 and 2, which has a much larger interface consisting of hundreds of routines. still, the API is small enough to consider integrating into standard os interfaces for handling inter-process communication in a network-independent way. This approach would also help to solve many of the problems plaguing other higher-level standards such as MPI and PVM which must, in some cases, "play OS" to adequately address progress and process control issues. The CDS2 API, a higher level of interface roughly equivalent in functionality to MPI and to be built entirely upon CDS1, is still being designed. It is intended to add support for the equivalent of communicators, reduction and other collective operations, process topologies, additional support for process creation, and some automatic memory management. CDS2 will not exactly match MPI, because the copy-free semantics of communication from CDS1 will be supported. CDS2 application programs will be free to carefully also use CDS1. CDS1 has been implemented on networks of workstations running unmodified Unix-based operating systems, using UDP/IP and vendor-supplied high- performance locks. Although its inter-node performance is currently unimpressive due to rudimentary implementation technique, it even now outperforms highly-optimized MPI implementation on intra-node communication due to its support for non-copy communication. The similarity of the CDS1 architecture to that of other projects such as UNET and TRAP suggests that the inter-node performance can be increased significantly to surpass MPI or PVM, and it may be possible to migrate some of its functionality to communication controllers.
Programming distributed memory architectures using Kali
NASA Technical Reports Server (NTRS)
Mehrotra, Piyush; Vanrosendale, John
1990-01-01
Programming nonshared memory systems is more difficult than programming shared memory systems, in part because of the relatively low level of current programming environments for such machines. A new programming environment is presented, Kali, which provides a global name space and allows direct access to remote data values. In order to retain efficiency, Kali provides a system on annotations, allowing the user to control those aspects of the program critical to performance, such as data distribution and load balancing. The primitives and constructs provided by the language is described, and some of the issues raised in translating a Kali program for execution on distributed memory systems are also discussed.
Multiple-User, Multitasking, Virtual-Memory Computer System
NASA Technical Reports Server (NTRS)
Generazio, Edward R.; Roth, Don J.; Stang, David B.
1993-01-01
Computer system designed and programmed to serve multiple users in research laboratory. Provides for computer control and monitoring of laboratory instruments, acquisition and anlaysis of data from those instruments, and interaction with users via remote terminals. System provides fast access to shared central processing units and associated large (from megabytes to gigabytes) memories. Underlying concept of system also applicable to monitoring and control of industrial processes.
Efficient Numeric and Geometric Computations using Heterogeneous Shared Memory Architectures
2017-10-04
Report: Efficient Numeric and Geometric Computations using Heterogeneous Shared Memory Architectures The views, opinions and/or findings contained in this...Chapel Hill Title: Efficient Numeric and Geometric Computations using Heterogeneous Shared Memory Architectures Report Term: 0-Other Email: dm...algorithms for scientific and geometric computing by exploiting the power and performance efficiency of heterogeneous shared memory architectures . These
Closed-form solutions of performability. [modeling of a degradable buffer/multiprocessor system
NASA Technical Reports Server (NTRS)
Meyer, J. F.
1981-01-01
Methods which yield closed form performability solutions for continuous valued variables are developed. The models are similar to those employed in performance modeling (i.e., Markovian queueing models) but are extended so as to account for variations in structure due to faults. In particular, the modeling of a degradable buffer/multiprocessor system is considered whose performance Y is the (normalized) average throughput rate realized during a bounded interval of time. To avoid known difficulties associated with exact transient solutions, an approximate decomposition of the model is employed permitting certain submodels to be solved in equilibrium. These solutions are then incorporated in a model with fewer transient states and by solving the latter, a closed form solution of the system's performability is obtained. In conclusion, some applications of this solution are discussed and illustrated, including an example of design optimization.
Brandon, Nicole R; Beike, Denise R; Cole, Holly E
2017-07-01
Autobiographical memories (AMs) can be used to create and maintain closeness with others [Alea, N., & Bluck, S. (2003). Why are you telling me that? A conceptual model of the social function of autobiographical memory. Memory, 11(2), 165-178]. However, the differential effects of memory specificity are not well established. Two studies with 148 participants tested whether the order in which autobiographical knowledge (AK) and specific episodic AM (EAM) are shared affects feelings of closeness. Participants read two memories hypothetically shared by each of four strangers. The strangers first shared either AK or an EAM, and then shared either AK or an EAM. Participants were randomly assigned to read either positive or negative AMs from the strangers. Findings suggest that people feel closer to those who share positive AMs in the same way they construct memories: starting with general and moving to specific.
NASA Technical Reports Server (NTRS)
Botts, Michael E.; Phillips, Ron J.; Parker, John V.; Wright, Patrick D.
1992-01-01
Five scientists at MSFC/ESAD have EOS SCF investigator status. Each SCF has unique tasks which require the establishment of a computing facility dedicated to accomplishing those tasks. A SCF Working Group was established at ESAD with the charter of defining the computing requirements of the individual SCFs and recommending options for meeting these requirements. The primary goal of the working group was to determine which computing needs can be satisfied using either shared resources or separate but compatible resources, and which needs require unique individual resources. The requirements investigated included CPU-intensive vector and scalar processing, visualization, data storage, connectivity, and I/O peripherals. A review of computer industry directions and a market survey of computing hardware provided information regarding important industry standards and candidate computing platforms. It was determined that the total SCF computing requirements might be most effectively met using a hierarchy consisting of shared and individual resources. This hierarchy is composed of five major system types: (1) a supercomputer class vector processor; (2) a high-end scalar multiprocessor workstation; (3) a file server; (4) a few medium- to high-end visualization workstations; and (5) several low- to medium-range personal graphics workstations. Specific recommendations for meeting the needs of each of these types are presented.
Reder, Lynne M.; Park, Heekyeong; Kieffaber, Paul D.
2009-01-01
There is a popular hypothesis that performance on implicit and explicit memory tasks reflects 2 distinct memory systems. Explicit memory is said to store those experiences that can be consciously recollected, and implicit memory is said to store experiences and affect subsequent behavior but to be unavailable to conscious awareness. Although this division based on awareness is a useful taxonomy for memory tasks, the authors review the evidence that the unconscious character of implicit memory does not necessitate that it be treated as a separate system of human memory. They also argue that some implicit and explicit memory tasks share the same memory representations and that the important distinction is whether the task (implicit or explicit) requires the formation of a new association. The authors review and critique dissociations from the behavioral, amnesia, and neuroimaging literatures that have been advanced in support of separate explicit and implicit memory systems by highlighting contradictory evidence and by illustrating how the data can be accounted for using a simple computational memory model that assumes the same memory representation for those disparate tasks. PMID:19210052
Ordering of guarded and unguarded stores for no-sync I/O
Gara, Alan; Ohmacht, Martin
2013-06-25
A parallel computing system processes at least one store instruction. A first processor core issues a store instruction. A first queue, associated with the first processor core, stores the store instruction. A second queue, associated with a first local cache memory device of the first processor core, stores the store instruction. The first processor core updates first data in the first local cache memory device according to the store instruction. The third queue, associated with at least one shared cache memory device, stores the store instruction. The first processor core invalidates second data, associated with the store instruction, in the at least one shared cache memory. The first processor core invalidates third data, associated with the store instruction, in other local cache memory devices of other processor cores. The first processor core flushing only the first queue.
Parallel solution of the symmetric tridiagonal eigenproblem. Research report
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jessup, E.R.
1989-10-01
This thesis discusses methods for computing all eigenvalues and eigenvectors of a symmetric tridiagonal matrix on a distributed-memory Multiple Instruction, Multiple Data multiprocessor. Only those techniques having the potential for both high numerical accuracy and significant large-grained parallelism are investigated. These include the QL method or Cuppen's divide and conquer method based on rank-one updating to compute both eigenvalues and eigenvectors, bisection to determine eigenvalues and inverse iteration to compute eigenvectors. To begin, the methods are compared with respect to computation time, communication time, parallel speed up, and accuracy. Experiments on an IPSC hypercube multiprocessor reveal that Cuppen's method ismore » the most accurate approach, but bisection with inverse iteration is the fastest and most parallel. Because the accuracy of the latter combination is determined by the quality of the computed eigenvectors, the factors influencing the accuracy of inverse iteration are examined. This includes, in part, statistical analysis of the effect of a starting vector with random components. These results are used to develop an implementation of inverse iteration producing eigenvectors with lower residual error and better orthogonality than those generated by the EISPACK routine TINVIT. This thesis concludes with adaptions of methods for the symmetric tridiagonal eigenproblem to the related problem of computing the singular value decomposition (SVD) of a bidiagonal matrix.« less
Parallel solution of the symmetric tridiagonal eigenproblem
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jessup, E.R.
1989-01-01
This thesis discusses methods for computing all eigenvalues and eigenvectors of a symmetric tridiagonal matrix on a distributed memory MIMD multiprocessor. Only those techniques having the potential for both high numerical accuracy and significant large-grained parallelism are investigated. These include the QL method or Cuppen's divide and conquer method based on rank-one updating to compute both eigenvalues and eigenvectors, bisection to determine eigenvalues, and inverse iteration to compute eigenvectors. To begin, the methods are compared with respect to computation time, communication time, parallel speedup, and accuracy. Experiments on an iPSC hyper-cube multiprocessor reveal that Cuppen's method is the most accuratemore » approach, but bisection with inverse iteration is the fastest and most parallel. Because the accuracy of the latter combination is determined by the quality of the computed eigenvectors, the factors influencing the accuracy of inverse iteration are examined. This includes, in part, statistical analysis of the effects of a starting vector with random components. These results are used to develop an implementation of inverse iteration producing eigenvectors with lower residual error and better orthogonality than those generated by the EISPACK routine TINVIT. This thesis concludes with adaptations of methods for the symmetric tridiagonal eigenproblem to the related problem of computing the singular value decomposition (SVD) of a bidiagonal matrix.« less
Three-Dimensional Nacelle Aeroacoustics Code With Application to Impedance Education
NASA Technical Reports Server (NTRS)
Watson, Willie R.
2000-01-01
A three-dimensional nacelle acoustics code that accounts for uniform mean flow and variable surface impedance liners is developed. The code is linked to a commercial version of the NASA-developed General Purpose Solver (for solution of linear systems of equations) in order to obtain the capability to study high frequency waves that may require millions of grid points for resolution. Detailed, single-processor statistics for the performance of the solver in rigid and soft-wall ducts are presented. Over the range of frequencies of current interest in nacelle liner research, noise attenuation levels predicted from the code were in excellent agreement with those predicted from mode theory. The equation solver is memory efficient, requiring only a small fraction of the memory available on modern computers. As an application, the code is combined with an optimization algorithm and used to reduce the impedance spectrum of a ceramic liner. The primary problem with using the code to perform optimization studies at frequencies above I1kHz is the excessive CPU time (a major portion of which is matrix assembly). The research recommends that research be directed toward development of a rapid sparse assembler and exploitation of the multiprocessor capability of the solver to further reduce CPU time.
Synchronizing Data-Bus Messages
NASA Technical Reports Server (NTRS)
Harris, L. H.
1985-01-01
Adapter allows communications among as many as 30 data processors without central bus controller. Adapter improves reliability of multiprocessor system by eliminating point of failure that causes entire system to fail. Scheme prevents data collisions and eliminates nonessential polling, thereby reducing power consumption.
A Family of ACO Routing Protocols for Mobile Ad Hoc Networks
Rupérez Cañas, Delfín; Sandoval Orozco, Ana Lucila; García Villalba, Luis Javier; Kim, Tai-hoon
2017-01-01
In this work, an ACO routing protocol for mobile ad hoc networks based on AntHocNet is specified. As its predecessor, this new protocol, called AntOR, is hybrid in the sense that it contains elements from both reactive and proactive routing. Specifically, it combines a reactive route setup process with a proactive route maintenance and improvement process. Key aspects of the AntOR protocol are the disjoint-link and disjoint-node routes, separation between the regular pheromone and the virtual pheromone in the diffusion process and the exploration of routes, taking into consideration the number of hops in the best routes. In this work, a family of ACO routing protocols based on AntOR is also specified. These protocols are based on protocol successive refinements. In this work, we also present a parallelized version of AntOR that we call PAntOR. Using programming multiprocessor architectures based on the shared memory protocol, PAntOR allows running tasks in parallel using threads. This parallelization is applicable in the route setup phase, route local repair process and link failure notification. In addition, a variant of PAntOR that consists of having more than one interface, which we call PAntOR-MI (PAntOR-Multiple Interface), is specified. This approach parallelizes the sending of broadcast messages by interface through threads. PMID:28531159
Himawari Support In The CSPP-GEO Direct Broadcast Package
NASA Astrophysics Data System (ADS)
Cureton, G. P.; Martin, G.
2016-12-01
The Cooperative Institute for Meteorological Satellite Studies (CIMSS) has a long history of supporting the Direct Broadcast (DB) community for various sensors, recently with the International MODIS/AIRS Processing Package (IMAPP) for the NASA EOS polar orbiters Terra and Aqua, and the Community Satellite Processing Package (CSPP) for the NOAA polar orbiter Suomi-NPP. CSPP has been significant in encouraging the early usage of Suomi-NPP data by US and international weather agencies, and it is hoped that a new package, CSPP-GEO, will similarly encourage usage of DB data from GOES-R, Himawari, and other geostationary satellites. The support of Himawari-8 provides several challenges for the CSPP-GEO-Geocat package, which generally revolve around the greatly increased data rate associated with the subsatellite point footprint approaching 1km. CSPP-GEO-Geocat takes advantage of python shared-memory multiprocessor support to divide Himawari data into managable pieces, which are then farmed out to indvidual cores for processing by the underlying geocat code. The resulting product segments are then stitched together to make the final product NetCDF4 files. CSPP-GEO-Geocat will support high-data-rate HRIT input, as well as the reduced resolution HimwariCast direct broadcast data stream. Products supported by CSPP-GEO-Geocat include the level-1 reflective and emissive bands, as well as level-2 products like cloud mask, cloud type, optical depth and particle size, cloud top temperature and pressure.
Shared Values as Anchors of a Learning Community: A Case Study in Information Systems Design
ERIC Educational Resources Information Center
Giordano, Daniela
2004-01-01
This paper examines the role in both individual and organizational learning of the system of values sustained by a community undertaking a design task. The discussion is based on the results of a longitudinal study of a community of novice information system designers supported by a Web-based shared design memory which allows reuse of design…
A microeconomic scheduler for parallel computers
NASA Technical Reports Server (NTRS)
Stoica, Ion; Abdel-Wahab, Hussein; Pothen, Alex
1995-01-01
We describe a scheduler based on the microeconomic paradigm for scheduling on-line a set of parallel jobs in a multiprocessor system. In addition to the classical objectives of increasing the system throughput and reducing the response time, we consider fairness in allocating system resources among the users, and providing the user with control over the relative performances of his jobs. We associate with every user a savings account in which he receives money at a constant rate. When a user wants to run a job, he creates an expense account for that job to which he transfers money from his savings account. The job uses the funds in its expense account to obtain the system resources it needs for execution. The share of the system resources allocated to the user is directly related to the rate at which the user receives money; the rate at which the user transfers money into a job expense account controls the job's performance. We prove that starvation is not possible in our model. Simulation results show that our scheduler improves both system and user performances in comparison with two different variable partitioning policies. It is also shown to be effective in guaranteeing fairness and providing control over the performance of jobs.
Shared prefetching to reduce execution skew in multi-threaded systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Eichenberger, Alexandre E; Gunnels, John A
Mechanisms are provided for optimizing code to perform prefetching of data into a shared memory of a computing device that is shared by a plurality of threads that execute on the computing device. A memory stream of a portion of code that is shared by the plurality of threads is identified. A set of prefetch instructions is distributed across the plurality of threads. Prefetch instructions are inserted into the instruction sequences of the plurality of threads such that each instruction sequence has a separate sub-portion of the set of prefetch instructions, thereby generating optimized code. Executable code is generated basedmore » on the optimized code and stored in a storage device. The executable code, when executed, performs the prefetches associated with the distributed set of prefetch instructions in a shared manner across the plurality of threads.« less
Fault recovery characteristics of the fault tolerant multi-processor
NASA Technical Reports Server (NTRS)
Padilla, Peter A.
1990-01-01
The fault handling performance of the fault tolerant multiprocessor (FTMP) was investigated. Fault handling errors detected during fault injection experiments were characterized. In these fault injection experiments, the FTMP disabled a working unit instead of the faulted unit once every 500 faults, on the average. System design weaknesses allow active faults to exercise a part of the fault management software that handles byzantine or lying faults. It is pointed out that these weak areas in the FTMP's design increase the probability that, for any hardware fault, a good LRU (line replaceable unit) is mistakenly disabled by the fault management software. It is concluded that fault injection can help detect and analyze the behavior of a system in the ultra-reliable regime. Although fault injection testing cannot be exhaustive, it has been demonstrated that it provides a unique capability to unmask problems and to characterize the behavior of a fault-tolerant system.
Optimized Infrastructure for the Earth System Prediction Capability
2013-09-30
for referencing memory between its native coupling datatype (MCT Attribute Vectors) and ESMF Arrays. This will reduce the copies required and will...introduced ability within CESM to share memory between ESMF and MCT datatypes makes using both tools together much easier. Using both is appealing
NASA Technical Reports Server (NTRS)
Jensen, E. Douglas
1988-01-01
Alpha is a new kind of operating system that is unique in two highly significant ways. First, it is decentralized transparently providing reliable resource management across physically dispersed nodes, so that distributed applications programming can be done largely as though it were centralized. And second, it provides comprehensive, high technology support for real-time system integration and operation, an application area which consists predominately of aperiodic activities having critical time constraints such as deadlines. Alpha is extremely adaptable so that it can be easily optimized for a wide range of problem-specific functionality, performance, and cost. Alpha is the first systems effort of the Archons Project, and the prototype was created at Carnegie-Mellon University directly on modified Sun multiprocessor workstation hardware. It has been demonstrated with a real-time C(sup 2) application. Continuing research is leading to a series of enhanced follow-ons to Alpha; these are portable but initially hosted on Concurrent's MASSCOMP line of multiprocessor products.
NASA Astrophysics Data System (ADS)
Zou, Liang; Fu, Zhuang; Zhao, YanZheng; Yang, JunYan
2010-07-01
This paper proposes a kind of pipelined electric circuit architecture implemented in FPGA, a very large scale integrated circuit (VLSI), which efficiently deals with the real time non-uniformity correction (NUC) algorithm for infrared focal plane arrays (IRFPA). Dual Nios II soft-core processors and a DSP with a 64+ core together constitute this image system. Each processor undertakes own systematic task, coordinating its work with each other's. The system on programmable chip (SOPC) in FPGA works steadily under the global clock frequency of 96Mhz. Adequate time allowance makes FPGA perform NUC image pre-processing algorithm with ease, which has offered favorable guarantee for the work of post image processing in DSP. And at the meantime, this paper presents a hardware (HW) and software (SW) co-design in FPGA. Thus, this systematic architecture yields an image processing system with multiprocessor, and a smart solution to the satisfaction with the performance of the system.
Fault-free behavior of reliable multiprocessor systems: FTMP experiments in AIRLAB
NASA Technical Reports Server (NTRS)
Clune, E.; Segall, Z.; Siewiorek, D.
1985-01-01
This report describes a set of experiments which were implemented on the Fault tolerant Multi-Processor (FTMP) at NASA/Langley's AIRLAB facility. These experiments are part of an effort to formulate and evaluate validation methodologies for fault-tolerant computers. This report deals with the measurement of single parameters (baselines) of a fault free system. The initial set of baseline experiments lead to the following conclusions: (1) The system clock is constant and independent of workload in the tested cases; (2) the instruction execution times are constant; (3) the R4 frame size is 40mS with some variation; (4) the frame stretching mechanism has some flaws in its implementation that allow the possibility of an infinite stretching of frame duration. Future experiments are planned. Some will broaden the results of these initial experiments. Others will measure the system more dynamically. The implementation of a synthetic workload generation mechanism for FTMP is planned to enhance the experimental environment of the system.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hanham, R.; Vogt, W.G.; Mickle, M.H.
1986-01-01
This book presents the papers given at a conference on computerized simulation. Topics considered at the conference included expert systems, modeling in electric power systems, power systems operating strategies, energy analysis, a linear programming approach to optimum load shedding in transmission systems, econometrics, simulation in natural gas engineering, solar energy studies, artificial intelligence, vision systems, hydrology, multiprocessors, and flow models.
Mnemonic convergence in social networks: The emergent properties of cognition at a collective level.
Coman, Alin; Momennejad, Ida; Drach, Rae D; Geana, Andra
2016-07-19
The development of shared memories, beliefs, and norms is a fundamental characteristic of human communities. These emergent outcomes are thought to occur owing to a dynamic system of information sharing and memory updating, which fundamentally depends on communication. Here we report results on the formation of collective memories in laboratory-created communities. We manipulated conversational network structure in a series of real-time, computer-mediated interactions in fourteen 10-member communities. The results show that mnemonic convergence, measured as the degree of overlap among community members' memories, is influenced by both individual-level information-processing phenomena and by the conversational social network structure created during conversational recall. By studying laboratory-created social networks, we show how large-scale social phenomena (i.e., collective memory) can emerge out of microlevel local dynamics (i.e., mnemonic reinforcement and suppression effects). The social-interactionist approach proposed herein points to optimal strategies for spreading information in social networks and provides a framework for measuring and forging collective memories in communities of individuals.
Power/Performance Trade-offs of Small Batched LU Based Solvers on GPUs
DOE Office of Scientific and Technical Information (OSTI.GOV)
Villa, Oreste; Fatica, Massimiliano; Gawande, Nitin A.
In this paper we propose and analyze a set of batched linear solvers for small matrices on Graphic Processing Units (GPUs), evaluating the various alternatives depending on the size of the systems to solve. We discuss three different solutions that operate with different level of parallelization and GPU features. The first, exploiting the CUBLAS library, manages matrices of size up to 32x32 and employs Warp level (one matrix, one Warp) parallelism and shared memory. The second works at Thread-block level parallelism (one matrix, one Thread-block), still exploiting shared memory but managing matrices up to 76x76. The third is Thread levelmore » parallel (one matrix, one thread) and can reach sizes up to 128x128, but it does not exploit shared memory and only relies on the high memory bandwidth of the GPU. The first and second solution only support partial pivoting, the third one easily supports partial and full pivoting, making it attractive to problems that require greater numerical stability. We analyze the trade-offs in terms of performance and power consumption as function of the size of the linear systems that are simultaneously solved. We execute the three implementations on a Tesla M2090 (Fermi) and on a Tesla K20 (Kepler).« less
A macro-micro robot for precise force applications
NASA Technical Reports Server (NTRS)
Marzwell, Neville I.; Wang, Yulun
1993-01-01
This paper describes an 8 degree-of-freedom macro-micro robot capable of performing tasks which require accurate force control. Applications such as polishing, finishing, grinding, deburring, and cleaning are a few examples of tasks which need this capability. Currently these tasks are either performed manually or with dedicated machinery because of the lack of a flexible and cost effective tool, such as a programmable force-controlled robot. The basic design and control of the macro-micro robot is described in this paper. A modular high-performance multiprocessor control system was designed to provide sufficient compute power for executing advanced control methods. An 8 degree of freedom macro-micro mechanism was constructed to enable accurate tip forces. Control algorithms based on the impedance control method were derived, coded, and load balanced for maximum execution speed on the multiprocessor system.
ATAMM enhancement and multiprocessor performance evaluation
NASA Technical Reports Server (NTRS)
Stoughton, John W.; Mielke, Roland R.; Som, Sukhamoy; Obando, Rodrigo; Malekpour, Mahyar R.; Jones, Robert L., III; Mandala, Brij Mohan V.
1991-01-01
ATAMM (Algorithm To Architecture Mapping Model) enhancement and multiprocessor performance evaluation is discussed. The following topics are included: the ATAMM model; ATAMM enhancement; ADM (Advanced Development Model) implementation of ATAMM; and ATAMM support tools.
Partitioning and packing mathematical simulation models for calculation on parallel computers
NASA Technical Reports Server (NTRS)
Arpasi, D. J.; Milner, E. J.
1986-01-01
The development of multiprocessor simulations from a serial set of ordinary differential equations describing a physical system is described. Degrees of parallelism (i.e., coupling between the equations) and their impact on parallel processing are discussed. The problem of identifying computational parallelism within sets of closely coupled equations that require the exchange of current values of variables is described. A technique is presented for identifying this parallelism and for partitioning the equations for parallel solution on a multiprocessor. An algorithm which packs the equations into a minimum number of processors is also described. The results of the packing algorithm when applied to a turbojet engine model are presented in terms of processor utilization.
NASA Technical Reports Server (NTRS)
Smith, T. B., III; Lala, J. H.
1984-01-01
The FTMP architecture is a high reliability computer concept modeled after a homogeneous multiprocessor architecture. Elements of the FTMP are operated in tight synchronism with one another and hardware fault-detection and fault-masking is provided which is transparent to the software. Operating system design and user software design is thus greatly simplified. Performance of the FTMP is also comparable to that of a simplex equivalent due to the efficiency of fault handling hardware. The FTMP project constructed an engineering module of the FTMP, programmed the machine and extensively tested the architecture through fault injection and other stress testing. This testing confirmed the soundness of the FTMP concepts.
Performance Evaluation and Modeling Techniques for Parallel Processors. Ph.D. Thesis
NASA Technical Reports Server (NTRS)
Dimpsey, Robert Tod
1992-01-01
In practice, the performance evaluation of supercomputers is still substantially driven by singlepoint estimates of metrics (e.g., MFLOPS) obtained by running characteristic benchmarks or workloads. With the rapid increase in the use of time-shared multiprogramming in these systems, such measurements are clearly inadequate. This is because multiprogramming and system overhead, as well as other degradations in performance due to time varying characteristics of workloads, are not taken into account. In multiprogrammed environments, multiple jobs and users can dramatically increase the amount of system overhead and degrade the performance of the machine. Performance techniques, such as benchmarking, which characterize performance on a dedicated machine ignore this major component of true computer performance. Due to the complexity of analysis, there has been little work done in analyzing, modeling, and predicting the performance of applications in multiprogrammed environments. This is especially true for parallel processors, where the costs and benefits of multi-user workloads are exacerbated. While some may claim that the issue of multiprogramming is not a viable one in the supercomputer market, experience shows otherwise. Even in recent massively parallel machines, multiprogramming is a key component. It has even been claimed that a partial cause of the demise of the CM2 was the fact that it did not efficiently support time-sharing. In the same paper, Gordon Bell postulates that, multicomputers will evolve to multiprocessors in order to support efficient multiprogramming. Therefore, it is clear that parallel processors of the future will be required to offer the user a time-shared environment with reasonable response times for the applications. In this type of environment, the most important performance metric is the completion of response time of a given application. However, there are a few evaluation efforts addressing this issue.
Efficient accesses of data structures using processing near memory
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jayasena, Nuwan S.; Zhang, Dong Ping; Diez, Paula Aguilera
Systems, apparatuses, and methods for implementing efficient queues and other data structures. A queue may be shared among multiple processors and/or threads without using explicit software atomic instructions to coordinate access to the queue. System software may allocate an atomic queue and corresponding queue metadata in system memory and return, to the requesting thread, a handle referencing the queue metadata. Any number of threads may utilize the handle for accessing the atomic queue. The logic for ensuring the atomicity of accesses to the atomic queue may reside in a management unit in the memory controller coupled to the memory wheremore » the atomic queue is allocated.« less
The Automated Instrumentation and Monitoring System (AIMS) reference manual
NASA Technical Reports Server (NTRS)
Yan, Jerry; Hontalas, Philip; Listgarten, Sherry
1993-01-01
Whether a researcher is designing the 'next parallel programming paradigm,' another 'scalable multiprocessor' or investigating resource allocation algorithms for multiprocessors, a facility that enables parallel program execution to be captured and displayed is invaluable. Careful analysis of execution traces can help computer designers and software architects to uncover system behavior and to take advantage of specific application characteristics and hardware features. A software tool kit that facilitates performance evaluation of parallel applications on multiprocessors is described. The Automated Instrumentation and Monitoring System (AIMS) has four major software components: a source code instrumentor which automatically inserts active event recorders into the program's source code before compilation; a run time performance-monitoring library, which collects performance data; a trace file animation and analysis tool kit which reconstructs program execution from the trace file; and a trace post-processor which compensate for data collection overhead. Besides being used as prototype for developing new techniques for instrumenting, monitoring, and visualizing parallel program execution, AIMS is also being incorporated into the run-time environments of various hardware test beds to evaluate their impact on user productivity. Currently, AIMS instrumentors accept FORTRAN and C parallel programs written for Intel's NX operating system on the iPSC family of multi computers. A run-time performance-monitoring library for the iPSC/860 is included in this release. We plan to release monitors for other platforms (such as PVM and TMC's CM-5) in the near future. Performance data collected can be graphically displayed on workstations (e.g. Sun Sparc and SGI) supporting X-Windows (in particular, Xl IR5, Motif 1.1.3).
Load Balancing Unstructured Adaptive Grids for CFD Problems
NASA Technical Reports Server (NTRS)
Biswas, Rupak; Oliker, Leonid
1996-01-01
Mesh adaption is a powerful tool for efficient unstructured-grid computations but causes load imbalance among processors on a parallel machine. A dynamic load balancing method is presented that balances the workload across all processors with a global view. After each parallel tetrahedral mesh adaption, the method first determines if the new mesh is sufficiently unbalanced to warrant a repartitioning. If so, the adapted mesh is repartitioned, with new partitions assigned to processors so that the redistribution cost is minimized. The new partitions are accepted only if the remapping cost is compensated by the improved load balance. Results indicate that this strategy is effective for large-scale scientific computations on distributed-memory multiprocessors.
Distributed shared memory for roaming large volumes.
Castanié, Laurent; Mion, Christophe; Cavin, Xavier; Lévy, Bruno
2006-01-01
We present a cluster-based volume rendering system for roaming very large volumes. This system allows to move a gigabyte-sized probe inside a total volume of several tens or hundreds of gigabytes in real-time. While the size of the probe is limited by the total amount of texture memory on the cluster, the size of the total data set has no theoretical limit. The cluster is used as a distributed graphics processing unit that both aggregates graphics power and graphics memory. A hardware-accelerated volume renderer runs in parallel on the cluster nodes and the final image compositing is implemented using a pipelined sort-last rendering algorithm. Meanwhile, volume bricking and volume paging allow efficient data caching. On each rendering node, a distributed hierarchical cache system implements a global software-based distributed shared memory on the cluster. In case of a cache miss, this system first checks page residency on the other cluster nodes instead of directly accessing local disks. Using two Gigabit Ethernet network interfaces per node, we accelerate data fetching by a factor of 4 compared to directly accessing local disks. The system also implements asynchronous disk access and texture loading, which makes it possible to overlap data loading, volume slicing and rendering for optimal volume roaming.
MULTI: a shared memory approach to cooperative molecular modeling.
Darden, T; Johnson, P; Smith, H
1991-03-01
A general purpose molecular modeling system, MULTI, based on the UNIX shared memory and semaphore facilities for interprocess communication is described. In addition to the normal querying or monitoring of geometric data, MULTI also provides processes for manipulating conformations, and for displaying peptide or nucleic acid ribbons, Connolly surfaces, close nonbonded contacts, crystal-symmetry related images, least-squares superpositions, and so forth. This paper outlines the basic techniques used in MULTI to ensure cooperation among these specialized processes, and then describes how they can work together to provide a flexible modeling environment.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jones, J.P.; Bangs, A.L.; Butler, P.L.
Hetero Helix is a programming environment which simulates shared memory on a heterogeneous network of distributed-memory computers. The machines in the network may vary with respect to their native operating systems and internal representation of numbers. Hetero Helix presents a simple programming model to developers, and also considers the needs of designers, system integrators, and maintainers. The key software technology underlying Hetero Helix is the use of a compiler'' which analyzes the data structures in shared memory and automatically generates code which translates data representations from the format native to each machine into a common format, and vice versa. Themore » design of Hetero Helix was motivated in particular by the requirements of robotics applications. Hetero Helix has been used successfully in an integration effort involving 27 CPUs in a heterogeneous network and a body of software totaling roughly 100,00 lines of code. 25 refs., 6 figs.« less
ACCESS: A Communicating and Cooperating Expert Systems System.
1988-01-31
therefore more quickly accepted by programmers. This is in part due to the already familiar concepts of multi-processing environments (e.g. semaphores ...Di68] and monitors [Br75]) which can be viewed as a special case of synchronized shared memory models [Di6S]. Heterogeneous systems however, are by...locality of nodes is not possible and frequent access of memory is required. Synchronization of processes also suffers from a loss of efficiency in
Performance and evaluation of real-time multicomputer control systems
NASA Technical Reports Server (NTRS)
Shin, K. G.
1985-01-01
Three experiments on fault tolerant multiprocessors (FTMP) were begun. They are: (1) measurement of fault latency in FTMP; (2) validation and analysis of FTMP synchronization protocols; and investigation of error propagation in FTMP.
Algorithms and architecture for multiprocessor based circuit simulation
DOE Office of Scientific and Technical Information (OSTI.GOV)
Deutsch, J.T.
Accurate electrical simulation is critical to the design of high performance integrated circuits. Logic simulators can verify function and give first-order timing information. Switch level simulators are more effective at dealing with charge sharing than standard logic simulators, but cannot provide accurate timing information or discover DC problems. Delay estimation techniques and cell level simulation can be used in constrained design methods, but must be tuned for each application, and circuit simulation must still be used to generate the cell models. None of these methods has the guaranteed accuracy that many circuit designers desire, and none can provide detailed waveformmore » information. Detailed electrical-level simulation can predict circuit performance if devices and parasitics are modeled accurately. However, the computational requirements of conventional circuit simulators make it impractical to simulate current large circuits. In this dissertation, the implementation of Iterated Timing Analysis (ITA), a relaxation-based technique for accurate circuit simulation, on a special-purpose multiprocessor is presented. The ITA method is an SOR-Newton, relaxation-based method which uses event-driven analysis and selective trace to exploit the temporal sparsity of the electrical network. Because event-driven selective trace techniques are employed, this algorithm lends itself to implementation on a data-driven computer.« less
Cache coherency without line exclusivity in MP systems having store-in caches
DOE Office of Scientific and Technical Information (OSTI.GOV)
Pomerene, J.H.; Puzak, T.R.; Rechtschaffen, R.N.
1983-11-01
By modifying the function of the storage control unit, a multiprocessor (MP) system having store-in caches is enabled to operate with the same versatility as an MP system having store-through caches, thereby eliminating the requirement for line exclusivity and greatly reducing the occurrence of cross-interrogates.
A Comparison of Three Programming Models for Adaptive Applications
NASA Technical Reports Server (NTRS)
Shan, Hong-Zhang; Singh, Jaswinder Pal; Oliker, Leonid; Biswa, Rupak; Kwak, Dochan (Technical Monitor)
2000-01-01
We study the performance and programming effort for two major classes of adaptive applications under three leading parallel programming models. We find that all three models can achieve scalable performance on the state-of-the-art multiprocessor machines. The basic parallel algorithms needed for different programming models to deliver their best performance are similar, but the implementations differ greatly, far beyond the fact of using explicit messages versus implicit loads/stores. Compared with MPI and SHMEM, CC-SAS (cache-coherent shared address space) provides substantial ease of programming at the conceptual and program orchestration level, which often leads to the performance gain. However it may also suffer from the poor spatial locality of physically distributed shared data on large number of processors. Our CC-SAS implementation of the PARMETIS partitioner itself runs faster than in the other two programming models, and generates more balanced result for our application.
Effects of motor congruence on visual working memory.
Quak, Michel; Pecher, Diane; Zeelenberg, Rene
2014-10-01
Grounded-cognition theories suggest that memory shares processing resources with perception and action. The motor system could be used to help memorize visual objects. In two experiments, we tested the hypothesis that people use motor affordances to maintain object representations in working memory. Participants performed a working memory task on photographs of manipulable and nonmanipulable objects. The manipulable objects were objects that required either a precision grip (i.e., small items) or a power grip (i.e., large items) to use. A concurrent motor task that could be congruent or incongruent with the manipulable objects caused no difference in working memory performance relative to nonmanipulable objects. Moreover, the precision- or power-grip motor task did not affect memory performance on small and large items differently. These findings suggest that the motor system plays no part in visual working memory.
Engineering study for the functional design of a multiprocessor system
NASA Technical Reports Server (NTRS)
Miller, J. S.; Vandever, W. H.; Stanten, S. F.; Avakian, A. E.; Kosmala, A. L.
1972-01-01
The results are presented of a study to generate a functional system design of a multiprocessing computer system capable of satisfying the computational requirements of a space station. These data management system requirements were specified to include: (1) real time control, (2) data processing and storage, (3) data retrieval, and (4) remote terminal servicing.
Process for predicting structural performance of mechanical systems
Gardner, David R.; Hendrickson, Bruce A.; Plimpton, Steven J.; Attaway, Stephen W.; Heinstein, Martin W.; Vaughan, Courtenay T.
1998-01-01
A process for predicting the structural performance of a mechanical system represents the mechanical system by a plurality of surface elements. The surface elements are grouped according to their location in the volume occupied by the mechanical system so that contacts between surface elements can be efficiently located. The process is well suited for efficient practice on multiprocessor computers.
Scalable Multiprocessor for High-Speed Computing in Space
NASA Technical Reports Server (NTRS)
Lux, James; Lang, Minh; Nishimoto, Kouji; Clark, Douglas; Stosic, Dorothy; Bachmann, Alex; Wilkinson, William; Steffke, Richard
2004-01-01
A report discusses the continuing development of a scalable multiprocessor computing system for hard real-time applications aboard a spacecraft. "Hard realtime applications" signifies applications, like real-time radar signal processing, in which the data to be processed are generated at "hundreds" of pulses per second, each pulse "requiring" millions of arithmetic operations. In these applications, the digital processors must be tightly integrated with analog instrumentation (e.g., radar equipment), and data input/output must be synchronized with analog instrumentation, controlled to within fractions of a microsecond. The scalable multiprocessor is a cluster of identical commercial-off-the-shelf generic DSP (digital-signal-processing) computers plus generic interface circuits, including analog-to-digital converters, all controlled by software. The processors are computers interconnected by high-speed serial links. Performance can be increased by adding hardware modules and correspondingly modifying the software. Work is distributed among the processors in a parallel or pipeline fashion by means of a flexible master/slave control and timing scheme. Each processor operates under its own local clock; synchronization is achieved by broadcasting master time signals to all the processors, which compute offsets between the master clock and their local clocks.
Merlin - Massively parallel heterogeneous computing
NASA Technical Reports Server (NTRS)
Wittie, Larry; Maples, Creve
1989-01-01
Hardware and software for Merlin, a new kind of massively parallel computing system, are described. Eight computers are linked as a 300-MIPS prototype to develop system software for a larger Merlin network with 16 to 64 nodes, totaling 600 to 3000 MIPS. These working prototypes help refine a mapped reflective memory technique that offers a new, very general way of linking many types of computer to form supercomputers. Processors share data selectively and rapidly on a word-by-word basis. Fast firmware virtual circuits are reconfigured to match topological needs of individual application programs. Merlin's low-latency memory-sharing interfaces solve many problems in the design of high-performance computing systems. The Merlin prototypes are intended to run parallel programs for scientific applications and to determine hardware and software needs for a future Teraflops Merlin network.
The Family in Us: Family History, Family Identity and Self-Reproductive Adaptive Behavior.
Ferring, Dieter
2017-06-01
This contribution is an essay about the notion of family identity reflecting shared significant experiences within a family system originating a set of signs used in social communication within and between families. Significant experiences are considered as experiences of events that have an immediate impact on the adaptation of the family in a given socio-ecological and cultural context at a given historical time. It is assumed that family history is stored in a shared "family memory" holding both implicit and explicit knowledge and exerting an influence on the behavior of each family member. This is described as transgenerational family memory being constituted of a system of meaningful signs. The crucial dimension underlying the logic of this essay are the ideas of adaptation as well as self-reproduction of systems.
ARTS III/Parallel Processor Design Study
DOT National Transportation Integrated Search
1975-04-01
It was the purpose of this design study to investigate the feasibility, suitability, and cost-effectiveness of augmenting the ARTS III failsafe/failsoft multiprocessor system with a form of parallel processor to accomodate a large growth in air traff...
NASA Technical Reports Server (NTRS)
Byrne, F.
1981-01-01
Time-shared interface speeds data processing in distributed computer network. Two-level high-speed scanning approach routes information to buffer, portion of which is reserved for series of "first-in, first-out" memory stacks. Buffer address structure and memory are protected from noise or failed components by error correcting code. System is applicable to any computer or processing language.
A communication-avoiding, hybrid-parallel, rank-revealing orthogonalization method.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hoemmen, Mark
2010-11-01
Orthogonalization consumes much of the run time of many iterative methods for solving sparse linear systems and eigenvalue problems. Commonly used algorithms, such as variants of Gram-Schmidt or Householder QR, have performance dominated by communication. Here, 'communication' includes both data movement between the CPU and memory, and messages between processors in parallel. Our Tall Skinny QR (TSQR) family of algorithms requires asymptotically fewer messages between processors and data movement between CPU and memory than typical orthogonalization methods, yet achieves the same accuracy as Householder QR factorization. Furthermore, in block orthogonalizations, TSQR is faster and more accurate than existing approaches formore » orthogonalizing the vectors within each block ('normalization'). TSQR's rank-revealing capability also makes it useful for detecting deflation in block iterative methods, for which existing approaches sacrifice performance, accuracy, or both. We have implemented a version of TSQR that exploits both distributed-memory and shared-memory parallelism, and supports real and complex arithmetic. Our implementation is optimized for the case of orthogonalizing a small number (5-20) of very long vectors. The shared-memory parallel component uses Intel's Threading Building Blocks, though its modular design supports other shared-memory programming models as well, including computation on the GPU. Our implementation achieves speedups of 2 times or more over competing orthogonalizations. It is available now in the development branch of the Trilinos software package, and will be included in the 10.8 release.« less
An enhanced Ada run-time system for real-time embedded processors
NASA Technical Reports Server (NTRS)
Sims, J. T.
1991-01-01
An enhanced Ada run-time system has been developed to support real-time embedded processor applications. The primary focus of this development effort has been on the tasking system and the memory management facilities of the run-time system. The tasking system has been extended to support efficient and precise periodic task execution as required for control applications. Event-driven task execution providing a means of task-asynchronous control and communication among Ada tasks is supported in this system. Inter-task control is even provided among tasks distributed on separate physical processors. The memory management system has been enhanced to provide object allocation and protected access support for memory shared between disjoint processors, each of which is executing a distinct Ada program.
Proceedings of the international conference on cybernetics and societ
DOE Office of Scientific and Technical Information (OSTI.GOV)
Not Available
1985-01-01
This book presents the papers given at a conference on artificial intelligence, expert systems and knowledge bases. Topics considered at the conference included automating expert system development, modeling expert systems, causal maps, data covariances, robot vision, image processing, multiprocessors, parallel processing, VLSI structures, man-machine systems, human factors engineering, cognitive decision analysis, natural language, computerized control systems, and cybernetics.
Techniques for video compression
NASA Technical Reports Server (NTRS)
Wu, Chwan-Hwa
1995-01-01
In this report, we present our study on multiprocessor implementation of a MPEG2 encoding algorithm. First, we compare two approaches to implementing video standards, VLSI technology and multiprocessor processing, in terms of design complexity, applications, and cost. Then we evaluate the functional modules of MPEG2 encoding process in terms of their computation time. Two crucial modules are identified based on this evaluation. Then we present our experimental study on the multiprocessor implementation of the two crucial modules. Data partitioning is used for job assignment. Experimental results show that high speedup ratio and good scalability can be achieved by using this kind of job assignment strategy.
Multitasking runtime systems for the Cedar Multiprocessor
DOE Office of Scientific and Technical Information (OSTI.GOV)
Guzzi, M.D.
1986-07-01
The programming of a MIMD machine is more complex than for SISD and SIMD machines. The multiple computational resources of the machine must be made available to the programming language compiler and to the programmer so that multitasking programs may be written. This thesis will explore the additional complexity of programming a MIMD machine, the Cedar Multiprocessor specifically, and the multitasking runtime system necessary to provide multitasking resources to the user. First, the problem will be well defined: the Cedar machine, its operating system, the programming language, and multitasking concepts will be described. Second, a solution to the problem, calledmore » macrotasking, will be proposed. This solution provides multitasking facilities to the programmer at a very coarse level with many visible machine dependencies. Third, an alternate solution, called microtasking, will be proposed. This solution provides multitasking facilities of a much finer grain. This solution does not depend so rigidly on the specific architecture of the machine. Finally, the two solutions will be compared for effectiveness. 12 refs., 16 figs.« less
Mnemonic convergence in social networks: The emergent properties of cognition at a collective level
Coman, Alin; Momennejad, Ida; Drach, Rae D.; Geana, Andra
2016-01-01
The development of shared memories, beliefs, and norms is a fundamental characteristic of human communities. These emergent outcomes are thought to occur owing to a dynamic system of information sharing and memory updating, which fundamentally depends on communication. Here we report results on the formation of collective memories in laboratory-created communities. We manipulated conversational network structure in a series of real-time, computer-mediated interactions in fourteen 10-member communities. The results show that mnemonic convergence, measured as the degree of overlap among community members’ memories, is influenced by both individual-level information-processing phenomena and by the conversational social network structure created during conversational recall. By studying laboratory-created social networks, we show how large-scale social phenomena (i.e., collective memory) can emerge out of microlevel local dynamics (i.e., mnemonic reinforcement and suppression effects). The social-interactionist approach proposed herein points to optimal strategies for spreading information in social networks and provides a framework for measuring and forging collective memories in communities of individuals. PMID:27357678
CaLRS: A Critical-Aware Shared LLC Request Scheduling Algorithm on GPGPU
Ma, Jianliang; Meng, Jinglei; Chen, Tianzhou; Wu, Minghui
2015-01-01
Ultra high thread-level parallelism in modern GPUs usually introduces numerous memory requests simultaneously. So there are always plenty of memory requests waiting at each bank of the shared LLC (L2 in this paper) and global memory. For global memory, various schedulers have already been developed to adjust the request sequence. But we find few work has ever focused on the service sequence on the shared LLC. We measured that a big number of GPU applications always queue at LLC bank for services, which provide opportunity to optimize the service order on LLC. Through adjusting the GPU memory request service order, we can improve the schedulability of SM. So we proposed a critical-aware shared LLC request scheduling algorithm (CaLRS) in this paper. The priority representative of memory request is critical for CaLRS. We use the number of memory requests that originate from the same warp but have not been serviced when they arrive at the shared LLC bank to represent the criticality of each warp. Experiments show that the proposed scheme can boost the SM schedulability effectively by promoting the scheduling priority of the memory requests with high criticality and improves the performance of GPU indirectly. PMID:25729772
NASA Technical Reports Server (NTRS)
Schilling, D. L.; Oh, S. J.; Thau, F.
1975-01-01
Developments in communications systems, computer systems, and power distribution systems for the space shuttle are described. The use of high speed delta modulation for bit rate compression in the transmission of television signals is discussed. Simultaneous Multiprocessor Organization, an approach to computer organization, is presented. Methods of computer simulation and automatic malfunction detection for the shuttle power distribution system are also described.
Computing Maximum Cardinality Matchings in Parallel on Bipartite Graphs via Tree-Grafting
DOE Office of Scientific and Technical Information (OSTI.GOV)
Azad, Ariful; Buluc, Aydn; Pothen, Alex
It is difficult to obtain high performance when computing matchings on parallel processors because matching algorithms explicitly or implicitly search for paths in the graph, and when these paths become long, there is little concurrency. In spite of this limitation, we present a new algorithm and its shared-memory parallelization that achieves good performance and scalability in computing maximum cardinality matchings in bipartite graphs. This algorithm searches for augmenting paths via specialized breadth-first searches (BFS) from multiple source vertices, hence creating more parallelism than single source algorithms. Algorithms that employ multiple-source searches cannot discard a search tree once no augmenting pathmore » is discovered from the tree, unlike algorithms that rely on single-source searches. We describe a novel tree-grafting method that eliminates most of the redundant edge traversals resulting from this property of multiple-source searches. We also employ the recent direction-optimizing BFS algorithm as a subroutine to discover augmenting paths faster. Our algorithm compares favorably with the current best algorithms in terms of the number of edges traversed, the average augmenting path length, and the number of iterations. Here, we provide a proof of correctness for our algorithm. Our NUMA-aware implementation is scalable to 80 threads of an Intel multiprocessor and to 240 threads on an Intel Knights Corner coprocessor. On average, our parallel algorithm runs an order of magnitude faster than the fastest algorithms available. The performance improvement is more significant on graphs with small matching number.« less
Computing Maximum Cardinality Matchings in Parallel on Bipartite Graphs via Tree-Grafting
Azad, Ariful; Buluc, Aydn; Pothen, Alex
2016-03-24
It is difficult to obtain high performance when computing matchings on parallel processors because matching algorithms explicitly or implicitly search for paths in the graph, and when these paths become long, there is little concurrency. In spite of this limitation, we present a new algorithm and its shared-memory parallelization that achieves good performance and scalability in computing maximum cardinality matchings in bipartite graphs. This algorithm searches for augmenting paths via specialized breadth-first searches (BFS) from multiple source vertices, hence creating more parallelism than single source algorithms. Algorithms that employ multiple-source searches cannot discard a search tree once no augmenting pathmore » is discovered from the tree, unlike algorithms that rely on single-source searches. We describe a novel tree-grafting method that eliminates most of the redundant edge traversals resulting from this property of multiple-source searches. We also employ the recent direction-optimizing BFS algorithm as a subroutine to discover augmenting paths faster. Our algorithm compares favorably with the current best algorithms in terms of the number of edges traversed, the average augmenting path length, and the number of iterations. Here, we provide a proof of correctness for our algorithm. Our NUMA-aware implementation is scalable to 80 threads of an Intel multiprocessor and to 240 threads on an Intel Knights Corner coprocessor. On average, our parallel algorithm runs an order of magnitude faster than the fastest algorithms available. The performance improvement is more significant on graphs with small matching number.« less
Scheduling for energy and reliability management on multiprocessor real-time systems
NASA Astrophysics Data System (ADS)
Qi, Xuan
Scheduling algorithms for multiprocessor real-time systems have been studied for years with many well-recognized algorithms proposed. However, it is still an evolving research area and many problems remain open due to their intrinsic complexities. With the emergence of multicore processors, it is necessary to re-investigate the scheduling problems and design/develop efficient algorithms for better system utilization, low scheduling overhead, high energy efficiency, and better system reliability. Focusing cluster schedulings with optimal global schedulers, we study the utilization bound and scheduling overhead for a class of cluster-optimal schedulers. Then, taking energy/power consumption into consideration, we developed energy-efficient scheduling algorithms for real-time systems, especially for the proliferating embedded systems with limited energy budget. As the commonly deployed energy-saving technique (e.g. dynamic voltage frequency scaling (DVFS)) will significantly affect system reliability, we study schedulers that have intelligent mechanisms to recuperate system reliability to satisfy the quality assurance requirements. Extensive simulation is conducted to evaluate the performance of the proposed algorithms on reduction of scheduling overhead, energy saving, and reliability improvement. The simulation results show that the proposed reliability-aware power management schemes could preserve the system reliability while still achieving substantial energy saving.
A Performance Prediction Model for a Fault-Tolerant Computer During Recovery and Restoration
NASA Technical Reports Server (NTRS)
Obando, Rodrigo A.; Stoughton, John W.
1995-01-01
The modeling and design of a fault-tolerant multiprocessor system is addressed. Of interest is the behavior of the system during recovery and restoration after a fault has occurred. The multiprocessor systems are based on the Algorithm to Architecture Mapping Model (ATAMM) and the fault considered is the death of a processor. The developed model is useful in the determination of performance bounds of the system during recovery and restoration. The performance bounds include time to recover from the fault, time to restore the system, and determination of any permanent delay in the input to output latency after the system has regained steady state. Implementation of an ATAMM based computer was developed for a four-processor generic VHSIC spaceborne computer (GVSC) as the target system. A simulation of the GVSC was also written on the code used in the ATAMM Multicomputer Operating System (AMOS). The simulation is used to verify the new model for tracking the propagation of the delay through the system and predicting the behavior of the transient state of recovery and restoration. The model is shown to accurately predict the transient behavior of an ATAMM based multicomputer during recovery and restoration.
Time Constraints and Resource Sharing in Adults' Working Memory Spans
ERIC Educational Resources Information Center
Barrouillet, Pierre; Bernardin, Sophie; Camos, Valerie
2004-01-01
This article presents a new model that accounts for working memory spans in adults, the time-based resource-sharing model. The model assumes that both components (i.e., processing and maintenance) of the main working memory tasks require attention and that memory traces decay as soon as attention is switched away. Because memory retrievals are…
Multilevel Summation of Electrostatic Potentials Using Graphics Processing Units*
Hardy, David J.; Stone, John E.; Schulten, Klaus
2009-01-01
Physical and engineering practicalities involved in microprocessor design have resulted in flat performance growth for traditional single-core microprocessors. The urgent need for continuing increases in the performance of scientific applications requires the use of many-core processors and accelerators such as graphics processing units (GPUs). This paper discusses GPU acceleration of the multilevel summation method for computing electrostatic potentials and forces for a system of charged atoms, which is a problem of paramount importance in biomolecular modeling applications. We present and test a new GPU algorithm for the long-range part of the potentials that computes a cutoff pair potential between lattice points, essentially convolving a fixed 3-D lattice of “weights” over all sub-cubes of a much larger lattice. The implementation exploits the different memory subsystems provided on the GPU to stream optimally sized data sets through the multiprocessors. We demonstrate for the full multilevel summation calculation speedups of up to 26 using a single GPU and 46 using multiple GPUs, enabling the computation of a high-resolution map of the electrostatic potential for a system of 1.5 million atoms in under 12 seconds. PMID:20161132
Vera, Javier
2018-01-01
What is the influence of short-term memory enhancement on the emergence of grammatical agreement systems in multi-agent language games? Agreement systems suppose that at least two words share some features with each other, such as gender, number, or case. Previous work, within the multi-agent language-game framework, has recently proposed models stressing the hypothesis that the emergence of a grammatical agreement system arises from the minimization of semantic ambiguity. On the other hand, neurobiological evidence argues for the hypothesis that language evolution has mainly related to an increasing of short-term memory capacity, which has allowed the online manipulation of words and meanings participating particularly in grammatical agreement systems. Here, the main aim is to propose a multi-agent language game for the emergence of a grammatical agreement system, under measurable long-range relations depending on the short-term memory capacity. Computer simulations, based on a parameter that measures the amount of short-term memory capacity, suggest that agreement marker systems arise in a population of agents equipped at least with a critical short-term memory capacity.
NASA Technical Reports Server (NTRS)
1972-01-01
The design is reported of an advanced modular computer system designated the Automatically Reconfigurable Modular Multiprocessor System, which anticipates requirements for higher computing capacity and reliability for future spaceborne computers. Subjects discussed include: an overview of the architecture, mission analysis, synchronous and nonsynchronous scheduling control, reliability, and data transmission.
Wang, Qi; Lee, Dasom; Hou, Yubo
2017-07-01
Internet technology provides a new means of recalling and sharing personal memories in the digital age. What is the mnemonic consequence of posting personal memories online? Theories of transactive memory and autobiographical memory would make contrasting predictions. In the present study, college students completed a daily diary for a week, listing at the end of each day all the events that happened to them on that day. They also reported whether they posted any of the events online. Participants received a surprise memory test after the completion of the diary recording and then another test a week later. At both tests, events posted online were significantly more likely than those not posted online to be recalled. It appears that sharing memories online may provide unique opportunities for rehearsal and meaning-making that facilitate memory retention.
Process for predicting structural performance of mechanical systems
Gardner, D.R.; Hendrickson, B.A.; Plimpton, S.J.; Attaway, S.W.; Heinstein, M.W.; Vaughan, C.T.
1998-05-19
A process for predicting the structural performance of a mechanical system represents the mechanical system by a plurality of surface elements. The surface elements are grouped according to their location in the volume occupied by the mechanical system so that contacts between surface elements can be efficiently located. The process is well suited for efficient practice on multiprocessor computers. 12 figs.
Support for Debugging Automatically Parallelized Programs
NASA Technical Reports Server (NTRS)
Hood, Robert; Jost, Gabriele; Biegel, Bryan (Technical Monitor)
2001-01-01
This viewgraph presentation provides information on the technical aspects of debugging computer code that has been automatically converted for use in a parallel computing system. Shared memory parallelization and distributed memory parallelization entail separate and distinct challenges for a debugging program. A prototype system has been developed which integrates various tools for the debugging of automatically parallelized programs including the CAPTools Database which provides variable definition information across subroutines as well as array distribution information.
Klein, Stanley B
2013-01-01
Episodic memory often is conceptualized as a uniquely human system of long-term memory that makes available knowledge accompanied by the temporal and spatial context in which that knowledge was acquired. Retrieval from episodic memory entails a form of first-person subjectivity called autonoetic consciousness that provides a sense that a recollection was something that took place in the experiencer's personal past. In this paper I expand on this definition of episodic memory. Specifically, I suggest that (1) the core features assumed unique to episodic memory are shared by semantic memory, (2) episodic memory cannot be fully understood unless one appreciates that episodic recollection requires the coordinated function of a number of distinct, yet interacting, "enabling" systems. Although these systems-ownership, self, subjective temporality, and agency-are not traditionally viewed as memorial in nature, each is necessary for episodic recollection and jointly they may be sufficient, and (3) the type of subjective awareness provided by episodic recollection (autonoetic) is relational rather than intrinsic-i.e., it can be lost in certain patient populations, thus rendering episodic memory content indistinguishable from the content of semantic long-term memory.
The Impact of Storage on Processing: How Is Information Maintained in Working Memory?
ERIC Educational Resources Information Center
Vergauwe, Evie; Camos, Valérie; Barrouillet, Pierre
2014-01-01
Working memory is typically defined as a system devoted to the simultaneous maintenance and processing of information. However, the interplay between these 2 functions is still a matter of debate in the literature, with views ranging from complete independence to complete dependence. The time-based resource-sharing model assumes that a central…
Android Protection Mechanism: A Signed Code Security Mechanism for Smartphone Applications
2011-03-01
status registers, exceptions, endian support, unaligned access support, synchronization primitives , the Jazelle Extension, and saturated integer...supports comprehensive non-blocking shared-memory synchronization primitives that scale for multiple-processor system designs. This is an improvement... synchronization . Memory semaphores can be loaded and altered without interruption because the load and store operations are atomic. Processor
Expert Systems on Multiprocessor Architectures. Phase 1
1988-08-01
great rate) as early experience indicates what alternative aspect of system operation should have been monitored in any given completed run. The... system operation should have been monitored in any given completed run. The design goals that emerged then were (1) that the simulation system should...ORGANIZATION 6b OFFICE SYMBOL 7a. NAME OF MONITORING ORGANIZATION Stanford University (If applicable) Knowledge Systems Laboratory Rome Air Development
Nearest Neighbor Searching in Binary Search Trees: Simulation of a Multiprocessor System.
ERIC Educational Resources Information Center
Stewart, Mark; Willett, Peter
1987-01-01
Describes the simulation of a nearest neighbor searching algorithm for document retrieval using a pool of microprocessors. Three techniques are described which allow parallel searching of a binary search tree as well as a PASCAL-based system, PASSIM, which can simulate these techniques. Fifty-six references are provided. (Author/LRW)
Data Traffic Reduction Schemes for Cholesky Factorization on Asynchronous Multiprocessor Systems
1989-06-01
Engineering NASA Langley Research Center Hampton, Virginia 23665-5225 Operated by the Universities Space Research Association DTIC ELECTE NASA jAUG 23...Hampton, VA 23665. ti- 1. Introduction Consider the problem of solving a system of linear equations Ax=b where .4 is an n x n symmetric, positive
Partitioning in parallel processing of production systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Oflazer, K.
1987-01-01
This thesis presents research on certain issues related to parallel processing of production systems. It first presents a parallel production system interpreter that has been implemented on a four-processor multiprocessor. This parallel interpreter is based on Forgy's OPS5 interpreter and exploits production-level parallelism in production systems. Runs on the multiprocessor system indicate that it is possible to obtain speed-up of around 1.7 in the match computation for certain production systems when productions are split into three sets that are processed in parallel. The next issue addressed is that of partitioning a set of rules to processors in a parallel interpretermore » with production-level parallelism, and the extent of additional improvement in performance. The partitioning problem is formulated and an algorithm for approximate solutions is presented. The thesis next presents a parallel processing scheme for OPS5 production systems that allows some redundancy in the match computation. This redundancy enables the processing of a production to be divided into units of medium granularity each of which can be processed in parallel. Subsequently, a parallel processor architecture for implementing the parallel processing algorithm is presented.« less
A Distributed Operating System for BMD Applications.
1982-01-01
Defense) applications executing on distributed hardware with local and shared memories. The objective was to develop real - time operating system functions...make the Basic Real - Time Operating System , and the set of new EPL language primitives that provide BMD application processes with efficient mechanisms
A revised limbic system model for memory, emotion and behaviour.
Catani, Marco; Dell'acqua, Flavio; Thiebaut de Schotten, Michel
2013-09-01
Emotion, memories and behaviour emerge from the coordinated activities of regions connected by the limbic system. Here, we propose an update of the limbic model based on the seminal work of Papez, Yakovlev and MacLean. In the revised model we identify three distinct but partially overlapping networks: (i) the Hippocampal-diencephalic and parahippocampal-retrosplenial network dedicated to memory and spatial orientation; (ii) The temporo-amygdala-orbitofrontal network for the integration of visceral sensation and emotion with semantic memory and behaviour; (iii) the default-mode network involved in autobiographical memories and introspective self-directed thinking. The three networks share cortical nodes that are emerging as principal hubs in connectomic analysis. This revised network model of the limbic system reconciles recent functional imaging findings with anatomical accounts of clinical disorders commonly associated with limbic pathology. Copyright © 2013 Elsevier Ltd. All rights reserved.
NMF-mGPU: non-negative matrix factorization on multi-GPU systems.
Mejía-Roa, Edgardo; Tabas-Madrid, Daniel; Setoain, Javier; García, Carlos; Tirado, Francisco; Pascual-Montano, Alberto
2015-02-13
In the last few years, the Non-negative Matrix Factorization ( NMF ) technique has gained a great interest among the Bioinformatics community, since it is able to extract interpretable parts from high-dimensional datasets. However, the computing time required to process large data matrices may become impractical, even for a parallel application running on a multiprocessors cluster. In this paper, we present NMF-mGPU, an efficient and easy-to-use implementation of the NMF algorithm that takes advantage of the high computing performance delivered by Graphics-Processing Units ( GPUs ). Driven by the ever-growing demands from the video-games industry, graphics cards usually provided in PCs and laptops have evolved from simple graphics-drawing platforms into high-performance programmable systems that can be used as coprocessors for linear-algebra operations. However, these devices may have a limited amount of on-board memory, which is not considered by other NMF implementations on GPU. NMF-mGPU is based on CUDA ( Compute Unified Device Architecture ), the NVIDIA's framework for GPU computing. On devices with low memory available, large input matrices are blockwise transferred from the system's main memory to the GPU's memory, and processed accordingly. In addition, NMF-mGPU has been explicitly optimized for the different CUDA architectures. Finally, platforms with multiple GPUs can be synchronized through MPI ( Message Passing Interface ). In a four-GPU system, this implementation is about 120 times faster than a single conventional processor, and more than four times faster than a single GPU device (i.e., a super-linear speedup). Applications of GPUs in Bioinformatics are getting more and more attention due to their outstanding performance when compared to traditional processors. In addition, their relatively low price represents a highly cost-effective alternative to conventional clusters. In life sciences, this results in an excellent opportunity to facilitate the daily work of bioinformaticians that are trying to extract biological meaning out of hundreds of gigabytes of experimental information. NMF-mGPU can be used "out of the box" by researchers with little or no expertise in GPU programming in a variety of platforms, such as PCs, laptops, or high-end GPU clusters. NMF-mGPU is freely available at https://github.com/bioinfo-cnb/bionmf-gpu .
Initial Performance Results on IBM POWER6
NASA Technical Reports Server (NTRS)
Saini, Subbash; Talcott, Dale; Jespersen, Dennis; Djomehri, Jahed; Jin, Haoqiang; Mehrotra, Piysuh
2008-01-01
The POWER5+ processor has a faster memory bus than that of the previous generation POWER5 processor (533 MHz vs. 400 MHz), but the measured per-core memory bandwidth of the latter is better than that of the former (5.7 GB/s vs. 4.3 GB/s). The reason for this is that in the POWER5+, the two cores on the chip share the L2 cache, L3 cache and memory bus. The memory controller is also on the chip and is shared by the two cores. This serializes the path to memory. For consistently good performance on a wide range of applications, the performance of the processor, the memory subsystem, and the interconnects (both latency and bandwidth) should be balanced. Recognizing this, IBM has designed the Power6 processor so as to avoid the bottlenecks due to the L2 cache, memory controller and buffer chips of the POWER5+. Unlike the POWER5+, each core in the POWER6 has its own L2 cache (4 MB - double that of the Power5+), memory controller and buffer chips. Each core in the POWER6 runs at 4.7 GHz instead of 1.9 GHz in POWER5+. In this paper, we evaluate the performance of a dual-core Power6 based IBM p6-570 system, and we compare its performance with that of a dual-core Power5+ based IBM p575+ system. In this evaluation, we have used the High- Performance Computing Challenge (HPCC) benchmarks, NAS Parallel Benchmarks (NPB), and four real-world applications--three from computational fluid dynamics and one from climate modeling.
Polymorphous Computing Architectures
2007-12-12
provide a multiprocessor implementation. In this work, we introduce the Atomos transactional programming language, which is the first to include...implicit transactions, strong atomicity, and a scalable multiprocessor implementation [47]. Atomos is derived from Java, but replaces its synchronization...and conditional waiting constructs with transactional alternatives. The Atomos conditional waiting proposal is tailored to allow efficient
Specification and Verification of Secure Concurrent and Distributed Software Systems
1992-02-01
primitive search strategies work for operating systems that contain relatively few operations . As the number of operations increases, so does the the...others have granted him access to, etc . The burden of security falls on the operating system , although appropriate hardware support can minimize the...Guttag, J. Horning, and R. Levin. Synchronization primitives for a multiprocessor: a formal specification. Symposium on Operating System Principles
NASA Technical Reports Server (NTRS)
Burleigh, Scott C.
2011-01-01
Sptrace is a general-purpose space utilization tracing system that is conceptually similar to the commercial Purify product used to detect leaks and other memory usage errors. It is designed to monitor space utilization in any sort of heap, i.e., a region of data storage on some device (nominally memory; possibly shared and possibly persistent) with a flat address space. This software can trace usage of shared and/or non-volatile storage in addition to private RAM (random access memory). Sptrace is implemented as a set of C function calls that are invoked from within the software that is being examined. The function calls fall into two broad classes: (1) functions that are embedded within the heap management software [e.g., JPL's SDR (Simple Data Recorder) and PSM (Personal Space Management) systems] to enable heap usage analysis by populating a virtual time-sequenced log of usage activity, and (2) reporting functions that are embedded within the application program whose behavior is suspect. For ease of use, these functions may be wrapped privately inside public functions offered by the heap management software. Sptrace can be used for VxWorks or RTEMS realtime systems as easily as for Linux or OS/X systems.
Comparing the OpenMP, MPI, and Hybrid Programming Paradigm on an SMP Cluster
NASA Technical Reports Server (NTRS)
Jost, Gabriele; Jin, Hao-Qiang; anMey, Dieter; Hatay, Ferhat F.
2003-01-01
Clusters of SMP (Symmetric Multi-Processors) nodes provide support for a wide range of parallel programming paradigms. The shared address space within each node is suitable for OpenMP parallelization. Message passing can be employed within and across the nodes of a cluster. Multiple levels of parallelism can be achieved by combining message passing and OpenMP parallelization. Which programming paradigm is the best will depend on the nature of the given problem, the hardware components of the cluster, the network, and the available software. In this study we compare the performance of different implementations of the same CFD benchmark application, using the same numerical algorithm but employing different programming paradigms.
Ensuring correct rollback recovery in distributed shared memory systems
NASA Technical Reports Server (NTRS)
Janssens, Bob; Fuchs, W. Kent
1995-01-01
Distributed shared memory (DSM) implemented on a cluster of workstations is an increasingly attractive platform for executing parallel scientific applications. Checkpointing and rollback techniques can be used in such a system to allow the computation to progress in spite of the temporary failure of one or more processing nodes. This paper presents the design of an independent checkpointing method for DSM that takes advantage of DSM's specific properties to reduce error-free and rollback overhead. The scheme reduces the dependencies that need to be considered for correct rollback to those resulting from transfers of pages. Furthermore, in-transit messages can be recovered without the use of logging. We extend the scheme to a DSM implementation using lazy release consistency, where the frequency of dependencies is further reduced.
ERIC Educational Resources Information Center
Vergauwe, Evie; Barrouillet, Pierre; Camos, Valerie
2009-01-01
Examinations of interference between visual and spatial materials in working memory have suggested domain- and process-based fractionations of visuo-spatial working memory. The present study examined the role of central time-based resource sharing in visuo-spatial working memory and assessed its role in obtained interference patterns. Visual and…
Personal semantics: at the crossroads of semantic and episodic memory.
Renoult, Louis; Davidson, Patrick S R; Palombo, Daniela J; Moscovitch, Morris; Levine, Brian
2012-11-01
Declarative memory is usually described as consisting of two systems: semantic and episodic memory. Between these two poles, however, may lie a third entity: personal semantics (PS). PS concerns knowledge of one's past. Although typically assumed to be an aspect of semantic memory, it is essentially absent from existing models of knowledge. Furthermore, like episodic memory (EM), PS is idiosyncratically personal (i.e., not culturally-shared). We show that, depending on how it is operationalized, the neural correlates of PS can look more similar to semantic memory, more similar to EM, or dissimilar to both. We consider three different perspectives to better integrate PS into existing models of declarative memory and suggest experimental strategies for disentangling PS from semantic and episodic memory. Copyright © 2012 Elsevier Ltd. All rights reserved.
Camos, Valérie; Barrouillet, Pierre
2014-01-01
Working memory is the structure devoted to the maintenance of information at short term during concurrent processing activities. In this respect, the question regarding the nature of the mechanisms and systems fulfilling this maintenance function is of particular importance and has received various responses in the recent past. In the time-based resource-sharing (TBRS) model, we suggest that only two systems sustain the maintenance of information at the short term, counteracting the deleterious effect of temporal decay and interference. A non-attentional mechanism of verbal rehearsal, similar to the one described by Baddeley in the phonological loop model, uses language processes to reactivate phonological memory traces. Besides this domain-specific mechanism, an executive loop allows the reconstruction of memory traces through an attention-based mechanism of refreshing. The present paper reviews evidence of the involvement of these two independent systems in the maintenance of verbal memory items. PMID:25426049
Early Experiences Writing Performance Portable OpenMP 4 Codes
DOE Office of Scientific and Technical Information (OSTI.GOV)
Joubert, Wayne; Hernandez, Oscar R
In this paper, we evaluate the recently available directives in OpenMP 4 to parallelize a computational kernel using both the traditional shared memory approach and the newer accelerator targeting capabilities. In addition, we explore various transformations that attempt to increase application performance portability, and examine the expressiveness and performance implications of using these approaches. For example, we want to understand if the target map directives in OpenMP 4 improve data locality when mapped to a shared memory system, as opposed to the traditional first touch policy approach in traditional OpenMP. To that end, we use recent Cray and Intel compilersmore » to measure the performance variations of a simple application kernel when executed on the OLCF s Titan supercomputer with NVIDIA GPUs and the Beacon system with Intel Xeon Phi accelerators attached. To better understand these trade-offs, we compare our results from traditional OpenMP shared memory implementations to the newer accelerator programming model when it is used to target both the CPU and an attached heterogeneous device. We believe the results and lessons learned as presented in this paper will be useful to the larger user community by providing guidelines that can assist programmers in the development of performance portable code.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
McGoldrick, P.R.
1980-12-11
The Interprocess Communications System (IPCS) was written to provide a virtual machine upon which the Supervisory Control and Diagnostic System (SCDS) for the Mirror Fusion Test Facility (MFTF) could be built. The hardware upon which the IPCS runs consists of nine minicomputers sharing some common memory.
NASA Technical Reports Server (NTRS)
Harper, Richard E.; Butler, Bryan P.
1990-01-01
The Draper fault-tolerant processor with fault-tolerant shared memory (FTP/FTSM), which is designed to allow application tasks to continue execution during the memory alignment process, is described. Processor performance is not affected by memory alignment. In addition, the FTP/FTSM incorporates a hardware scrubber device to perform the memory alignment quickly during unused memory access cycles. The FTP/FTSM architecture is described, followed by an estimate of the time required for channel reintegration.
Wang, Qi
2006-01-01
The relations of maternal reminiscing style and child self-concept to children's shared and independent autobiographical memories were examined in a sample of 189 three-year-olds and their mothers from Chinese families in China, first-generation Chinese immigrant families in the United States, and European American families. Mothers shared memories with their children and completed questionnaires; children recounted autobiographical events and described themselves with a researcher. Independent of culture, gender, child age, and language skills, maternal elaborations and evaluations were associated with children's shared memory reports, and maternal evaluations and child agentic self-focus were associated with children's independent memory reports. Maternal style and child self-concept further mediated cultural influences on children's memory. The findings provide insight into the social-cultural construction of autobiographical memory.
2016-03-01
in-vitro decision to incubate a startup, Lexumo [7], which is developing a commercial Software as a Service ( SaaS ) vulnerability assessment...LTS Label Transition System MUSE Mining and Understanding Software Enclaves RTEMS Real-Time Executive for Multi-processor Systems SaaS Software ...as a Service SSA Static Single Assignment SWE Software Epistemology UD/DU Def-Use/Use-Def Chains (Dataflow Graph)
Parallel processing and expert systems
NASA Technical Reports Server (NTRS)
Lau, Sonie; Yan, Jerry C.
1991-01-01
Whether it be monitoring the thermal subsystem of Space Station Freedom, or controlling the navigation of the autonomous rover on Mars, NASA missions in the 1990s cannot enjoy an increased level of autonomy without the efficient implementation of expert systems. Merely increasing the computational speed of uniprocessors may not be able to guarantee that real-time demands are met for larger systems. Speedup via parallel processing must be pursued alongside the optimization of sequential implementations. Prototypes of parallel expert systems have been built at universities and industrial laboratories in the U.S. and Japan. The state-of-the-art research in progress related to parallel execution of expert systems is surveyed. The survey discusses multiprocessors for expert systems, parallel languages for symbolic computations, and mapping expert systems to multiprocessors. Results to date indicate that the parallelism achieved for these systems is small. The main reasons are (1) the body of knowledge applicable in any given situation and the amount of computation executed by each rule firing are small, (2) dividing the problem solving process into relatively independent partitions is difficult, and (3) implementation decisions that enable expert systems to be incrementally refined hamper compile-time optimization. In order to obtain greater speedups, data parallelism and application parallelism must be exploited.
Parallel Computation of the Regional Ocean Modeling System (ROMS)
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wang, P; Song, Y T; Chao, Y
2005-04-05
The Regional Ocean Modeling System (ROMS) is a regional ocean general circulation modeling system solving the free surface, hydrostatic, primitive equations over varying topography. It is free software distributed world-wide for studying both complex coastal ocean problems and the basin-to-global scale ocean circulation. The original ROMS code could only be run on shared-memory systems. With the increasing need to simulate larger model domains with finer resolutions and on a variety of computer platforms, there is a need in the ocean-modeling community to have a ROMS code that can be run on any parallel computer ranging from 10 to hundreds ofmore » processors. Recently, we have explored parallelization for ROMS using the MPI programming model. In this paper, an efficient parallelization strategy for such a large-scale scientific software package, based on an existing shared-memory computing model, is presented. In addition, scientific applications and data-performance issues on a couple of SGI systems, including Columbia, the world's third-fastest supercomputer, are discussed.« less
NASA Astrophysics Data System (ADS)
Devaraj, Rajesh; Sarkar, Arnab; Biswas, Santosh
2015-11-01
In the article 'Supervisory control for fault-tolerant scheduling of real-time multiprocessor systems with aperiodic tasks', Park and Cho presented a systematic way of computing a largest fault-tolerant and schedulable language that provides information on whether the scheduler (i.e., supervisor) should accept or reject a newly arrived aperiodic task. The computation of such a language is mainly dependent on the task execution model presented in their paper. However, the task execution model is unable to capture the situation when the fault of a processor occurs even before the task has arrived. Consequently, a task execution model that does not capture this fact may possibly be assigned for execution on a faulty processor. This problem has been illustrated with an appropriate example. Then, the task execution model of Park and Cho has been modified to strengthen the requirement that none of the tasks are assigned for execution on a faulty processor.
Interference from mere thinking: mental rehearsal temporarily disrupts recall of motor memory.
Yin, Cong; Wei, Kunlin
2014-08-01
Interference between successively learned tasks is widely investigated to study motor memory. However, how simultaneously learned motor memories interact with each other has been rarely studied despite its prevalence in daily life. Assuming that motor memory shares common neural mechanisms with declarative memory system, we made unintuitive predictions that mental rehearsal, as opposed to further practice, of one motor memory will temporarily impair the recall of another simultaneously learned memory. Subjects simultaneously learned two sensorimotor tasks, i.e., visuomotor rotation and gain. They retrieved one memory by either practice or mental rehearsal and then had their memory evaluated. We found that mental rehearsal, instead of execution, impaired the recall of unretrieved memory. This impairment was content-independent, i.e., retrieving either gain or rotation impaired the other memory. Hence, conscious recollection of one motor memory interferes with the recall of another memory. This is analogous to retrieval-induced forgetting in declarative memory, suggesting a common neural process across memory systems. Our findings indicate that motor imagery is sufficient to induce interference between motor memories. Mental rehearsal, currently widely regarded as beneficial for motor performance, negatively affects memory recall when it is exercised for a subset of memorized items. Copyright © 2014 the American Physiological Society.
Hypercluster Parallel Processor
NASA Technical Reports Server (NTRS)
Blech, Richard A.; Cole, Gary L.; Milner, Edward J.; Quealy, Angela
1992-01-01
Hypercluster computer system includes multiple digital processors, operation of which coordinated through specialized software. Configurable according to various parallel-computing architectures of shared-memory or distributed-memory class, including scalar computer, vector computer, reduced-instruction-set computer, and complex-instruction-set computer. Designed as flexible, relatively inexpensive system that provides single programming and operating environment within which one can investigate effects of various parallel-computing architectures and combinations on performance in solution of complicated problems like those of three-dimensional flows in turbomachines. Hypercluster software and architectural concepts are in public domain.
Design and control of a macro-micro robot for precise force applications
NASA Technical Reports Server (NTRS)
Wang, Yulun; Mangaser, Amante; Laby, Keith; Jordan, Steve; Wilson, Jeff
1993-01-01
Creating a robot which can delicately interact with its environment has been the goal of much research. Primarily two difficulties have made this goal hard to attain. The execution of control strategies which enable precise force manipulations are difficult to implement in real time because such algorithms have been too computationally complex for available controllers. Also, a robot mechanism which can quickly and precisely execute a force command is difficult to design. Actuation joints must be sufficiently stiff, frictionless, and lightweight so that desired torques can be accurately applied. This paper describes a robotic system which is capable of delicate manipulations. A modular high-performance multiprocessor control system was designed to provide sufficient compute power for executing advanced control methods. An 8 degree of freedom macro-micro mechanism was constructed to enable accurate tip forces. Control algorithms based on the impedance control method were derived, coded, and load balanced for maximum execution speed on the multiprocessor system. Delicate force tasks such as polishing, finishing, cleaning, and deburring, are the target applications of the robot.
Performance Evaluation of Remote Memory Access (RMA) Programming on Shared Memory Parallel Computers
NASA Technical Reports Server (NTRS)
Jin, Hao-Qiang; Jost, Gabriele; Biegel, Bryan A. (Technical Monitor)
2002-01-01
The purpose of this study is to evaluate the feasibility of remote memory access (RMA) programming on shared memory parallel computers. We discuss different RMA based implementations of selected CFD application benchmark kernels and compare them to corresponding message passing based codes. For the message-passing implementation we use MPI point-to-point and global communication routines. For the RMA based approach we consider two different libraries supporting this programming model. One is a shared memory parallelization library (SMPlib) developed at NASA Ames, the other is the MPI-2 extensions to the MPI Standard. We give timing comparisons for the different implementation strategies and discuss the performance.
NASA Technical Reports Server (NTRS)
Mejzak, R. S.
1980-01-01
The distributed processing concept is defined in terms of control primitives, variables, and structures and their use in performing a decomposed discrete Fourier transform (DET) application function. The design assumes interprocessor communications to be anonymous. In this scheme, all processors can access an entire common database by employing control primitives. Access to selected areas within the common database is random, enforced by a hardware lock, and determined by task and subtask pointers. This enables the number of processors to be varied in the configuration without any modifications to the control structure. Decompositional elements of the DFT application function in terms of tasks and subtasks are also described. The experimental hardware configuration consists of IMSAI 8080 chassis which are independent, 8 bit microcomputer units. These chassis are linked together to form a multiple processing system by means of a shared memory facility. This facility consists of hardware which provides a bus structure to enable up to six microcomputers to be interconnected. It provides polling and arbitration logic so that only one processor has access to shared memory at any one time.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Buntinas, D.; Mercier, G.; Gropp, W.
2007-09-01
This paper presents the implementation of MPICH2 over the Nemesis communication subsystem and the evaluation of its shared-memory performance. We describe design issues as well as some of the optimization techniques we employed. We conducted a performance evaluation over shared memory using microbenchmarks. The evaluation shows that MPICH2 Nemesis has very low communication overhead, making it suitable for smaller-grained applications.
Parallel reduced-instruction-set-computer architecture for real-time symbolic pattern matching
NASA Astrophysics Data System (ADS)
Parson, Dale E.
1991-03-01
This report discusses ongoing work on a parallel reduced-instruction- set-computer (RISC) architecture for automatic production matching. The PRIOPS compiler takes advantage of the memoryless character of automatic processing by translating a program's collection of automatic production tests into an equivalent combinational circuit-a digital circuit without memory, whose outputs are immediate functions of its inputs. The circuit provides a highly parallel, fine-grain model of automatic matching. The compiler then maps the combinational circuit onto RISC hardware. The heart of the processor is an array of comparators capable of testing production conditions in parallel, Each comparator attaches to private memory that contains virtual circuit nodes-records of the current state of nodes and busses in the combinational circuit. All comparator memories hold identical information, allowing simultaneous update for a single changing circuit node and simultaneous retrieval of different circuit nodes by different comparators. Along with the comparator-based logic unit is a sequencer that determines the current combination of production-derived comparisons to try, based on the combined success and failure of previous combinations of comparisons. The memoryless nature of automatic matching allows the compiler to designate invariant memory addresses for virtual circuit nodes, and to generate the most effective sequences of comparison test combinations. The result is maximal utilization of parallel hardware, indicating speed increases and scalability beyond that found for course-grain, multiprocessor approaches to concurrent Rete matching. Future work will consider application of this RISC architecture to the standard (controlled) Rete algorithm, where search through memory dominates portions of matching.
Laboratory for Computer Science Progress Report 19, 1 July 1981-30 June 1982.
1984-05-01
Multiprocessor Architectures 202 4. TRIX Operating System 209 5. VLSI Tools 212 ’SYSTEMATIC PROGRAM DEVELOPMENT, 221 1. Introduction 222 2. Specification...exploring distributed operating systems and the architecture of single-user powerful computers that are interconnected by communication networks. The...to now. In particular, we expect to experiment with languages, operating systems , and applications that establish the feasibility of distributed
Vergauwe, Evie; Hartstra, Egbert; Barrouillet, Pierre; Brass, Marcel
2015-07-15
Working memory is often defined in cognitive psychology as a system devoted to the simultaneous processing and maintenance of information. In line with the time-based resource-sharing model of working memory (TBRS; Barrouillet and Camos, 2015; Barrouillet et al., 2004), there is accumulating evidence that, when memory items have to be maintained while performing a concurrent activity, memory performance depends on the cognitive load of this activity, independently of the domain involved. The present study used fMRI to identify regions in the brain that are sensitive to variations in cognitive load in a domain-general way. More precisely, we aimed at identifying brain areas that activate during maintenance of memory items as a direct function of the cognitive load induced by both verbal and spatial concurrent tasks. Results show that the right IFJ and bilateral SPL/IPS are the only areas showing an increased involvement as cognitive load increases and do so in a domain general manner. When correlating the fMRI signal with the approximated cognitive load as defined by the TBRS model, it was shown that the main focus of the cognitive load-related activation is located in the right IFJ. The present findings indicate that the IFJ makes domain-general contributions to time-based resource-sharing in working memory and allowed us to generate the novel hypothesis by which the IFJ might be the neural basis for the process of rapid switching. We argue that the IFJ might be a crucial part of a central attentional bottleneck in the brain because of its inability to upload more than one task rule at once. Copyright © 2015 Elsevier Inc. All rights reserved.
Scaling Irregular Applications through Data Aggregation and Software Multithreading
DOE Office of Scientific and Technical Information (OSTI.GOV)
Morari, Alessandro; Tumeo, Antonino; Chavarría-Miranda, Daniel
Bioinformatics, data analytics, semantic databases, knowledge discovery are emerging high performance application areas that exploit dynamic, linked data structures such as graphs, unbalanced trees or unstructured grids. These data structures usually are very large, requiring significantly more memory than available on single shared memory systems. Additionally, these data structures are difficult to partition on distributed memory systems. They also present poor spatial and temporal locality, thus generating unpredictable memory and network accesses. The Partitioned Global Address Space (PGAS) programming model seems suitable for these applications, because it allows using a shared memory abstraction across distributed-memory clusters. However, current PGAS languagesmore » and libraries are built to target regular remote data accesses and block transfers. Furthermore, they usually rely on the Single Program Multiple Data (SPMD) parallel control model, which is not well suited to the fine grained, dynamic and unbalanced parallelism of irregular applications. In this paper we present {\\bf GMT} (Global Memory and Threading library), a custom runtime library that enables efficient execution of irregular applications on commodity clusters. GMT integrates a PGAS data substrate with simple fork/join parallelism and provides automatic load balancing on a per node basis. It implements multi-level aggregation and lightweight multithreading to maximize memory and network bandwidth with fine-grained data accesses and tolerate long data access latencies. A key innovation in the GMT runtime is its thread specialization (workers, helpers and communication threads) that realize the overall functionality. We compare our approach with other PGAS models, such as UPC running using GASNet, and hand-optimized MPI code on a set of typical large-scale irregular applications, demonstrating speedups of an order of magnitude.« less
NASA Tech Briefs, October 2004
NASA Technical Reports Server (NTRS)
2004-01-01
Topics include: Relative-Motion Sensors and Actuators for Two Optical Tables; Improved Position Sensor for Feedback Control of Levitation; Compact Tactile Sensors for Robot Fingers; Improved Ion-Channel Biosensors; Suspended-Patch Antenna With Inverted, EM-Coupled Feed; System Would Predictively Preempt Traffic Lights for Emergency Vehicles; Optical Position Encoders for High or Low Temperatures; Inter-Valence-Subband/Conduction-Band-Transport IR Detectors; Additional Drive Circuitry for Piezoelectric Screw Motors; Software for Use with Optoelectronic Measuring Tool; Coordinating Shared Activities; Software Reduces Radio-Interference Effects in Radar Data; Using Iron to Treat Chlorohydrocarbon-Contaminated Soil; Thermally Insulating, Kinematic Tensioned-Fiber Suspension; Back Actuators for Segmented Mirrors and Other Applications; Mechanism for Self-Reacted Friction Stir Welding; Lightweight Exoskeletons with Controllable Actuators; Miniature Robotic Submarine for Exploring Harsh Environments; Electron-Spin Filters Based on the Rashba Effect; Diffusion-Cooled Tantalum Hot-Electron Bolometer Mixers; Tunable Optical True-Time Delay Devices Would Exploit EIT; Fast Query-Optimized Kernel-Machine Classification; Indentured Parts List Maintenance and Part Assembly Capture Tool - IMPACT; An Architecture for Controlling Multiple Robots; Progress in Fabrication of Rocket Combustion Chambers by VPS; CHEM-Based Self-Deploying Spacecraft Radar Antennas; Scalable Multiprocessor for High-Speed Computing in Space; and Simple Systems for Detecting Spacecraft Meteoroid Punctures.
Working memory resources are shared across sensory modalities.
Salmela, V R; Moisala, M; Alho, K
2014-10-01
A common assumption in the working memory literature is that the visual and auditory modalities have separate and independent memory stores. Recent evidence on visual working memory has suggested that resources are shared between representations, and that the precision of representations sets the limit for memory performance. We tested whether memory resources are also shared across sensory modalities. Memory precision for two visual (spatial frequency and orientation) and two auditory (pitch and tone duration) features was measured separately for each feature and for all possible feature combinations. Thus, only the memory load was varied, from one to four features, while keeping the stimuli similar. In Experiment 1, two gratings and two tones-both containing two varying features-were presented simultaneously. In Experiment 2, two gratings and two tones-each containing only one varying feature-were presented sequentially. The memory precision (delayed discrimination threshold) for a single feature was close to the perceptual threshold. However, as the number of features to be remembered was increased, the discrimination thresholds increased more than twofold. Importantly, the decrease in memory precision did not depend on the modality of the other feature(s), or on whether the features were in the same or in separate objects. Hence, simultaneously storing one visual and one auditory feature had an effect on memory precision equal to those of simultaneously storing two visual or two auditory features. The results show that working memory is limited by the precision of the stored representations, and that working memory can be described as a resource pool that is shared across modalities.
Singer, Jefferson A; Blagov, Pavel; Berry, Meredith; Oost, Kathryn M
2013-12-01
An integrative model of narrative identity builds on a dual memory system that draws on episodic memory and a long-term self to generate autobiographical memories. Autobiographical memories related to critical goals in a lifetime period lead to life-story memories, which in turn become self-defining memories when linked to an individual's enduring concerns. Self-defining memories that share repetitive emotion-outcome sequences yield narrative scripts, abstracted templates that filter cognitive-affective processing. The life story is the individual's overarching narrative that provides unity and purpose over the life course. Healthy narrative identity combines memory specificity with adaptive meaning-making to achieve insight and well-being, as demonstrated through a literature review of personality and clinical research, as well as new findings from our own research program. A clinical case study drawing on this narrative identity model is also presented with implications for treatment and research. © 2012 Wiley Periodicals, Inc.
NASA Astrophysics Data System (ADS)
Liu, Jiping; Kang, Xiaochen; Dong, Chun; Xu, Shenghua
2017-12-01
Surface area estimation is a widely used tool for resource evaluation in the physical world. When processing large scale spatial data, the input/output (I/O) can easily become the bottleneck in parallelizing the algorithm due to the limited physical memory resources and the very slow disk transfer rate. In this paper, we proposed a stream tilling approach to surface area estimation that first decomposed a spatial data set into tiles with topological expansions. With these tiles, the one-to-one mapping relationship between the input and the computing process was broken. Then, we realized a streaming framework towards the scheduling of the I/O processes and computing units. Herein, each computing unit encapsulated a same copy of the estimation algorithm, and multiple asynchronous computing units could work individually in parallel. Finally, the performed experiment demonstrated that our stream tilling estimation can efficiently alleviate the heavy pressures from the I/O-bound work, and the measured speedup after being optimized have greatly outperformed the directly parallel versions in shared memory systems with multi-core processors.
The role of the hippocampus in navigation is memory
2017-01-01
There is considerable research on the neurobiological mechanisms within the hippocampal system that support spatial navigation. In this article I review the literature on navigational strategies in humans and animals, observations on hippocampal function in navigation, and studies of hippocampal neural activity in animals and humans performing different navigational tasks and tests of memory. Whereas the hippocampus is essential to spatial navigation via a cognitive map, its role derives from the relational organization and flexibility of cognitive maps and not from a selective role in the spatial domain. Correspondingly, hippocampal networks map multiple navigational strategies, as well as other spatial and nonspatial memories and knowledge domains that share an emphasis on relational organization. These observations suggest that the hippocampal system is not dedicated to spatial cognition and navigation, but organizes experiences in memory, for which spatial mapping and navigation are both a metaphor for and a prominent application of relational memory organization. PMID:28148640
[Involvement of aquaporin-4 in synaptic plasticity, learning and memory].
Wu, Xin; Gao, Jian-Feng
2017-06-25
Aquaporin-4 (AQP-4) is the predominant water channel in the central nervous system (CNS) and primarily expressed in astrocytes. Astrocytes have been generally believed to play important roles in regulating synaptic plasticity and information processing. However, the role of AQP-4 in regulating synaptic plasticity, learning and memory, cognitive function is only beginning to be investigated. It is well known that synaptic plasticity is the prime candidate for mediating of learning and memory. Long term potentiation (LTP) and long term depression (LTD) are two forms of synaptic plasticity, and they share some but not all the properties and mechanisms. Hippocampus is a part of limbic system that is particularly important in regulation of learning and memory. This article is to review some research progresses of the function of AQP-4 in synaptic plasticity, learning and memory, and propose the possible role of AQP-4 as a new target in the treatment of cognitive dysfunction.
Memory access in shared virtual memory
DOE Office of Scientific and Technical Information (OSTI.GOV)
Berrendorf, R.
1992-01-01
Shared virtual memory (SVM) is a virtual memory layer with a single address space on top of a distributed real memory on parallel computers. We examine the behavior and performance of SVM running a parallel program with medium-grained, loop-level parallelism on top of it. A simulator for the underlying parallel architecture can be used to examine the behavior of SVM more deeply. The influence of several parameters, such as the number of processors, page size, cold or warm start, and restricted page replication, is studied.
Memory access in shared virtual memory
DOE Office of Scientific and Technical Information (OSTI.GOV)
Berrendorf, R.
1992-09-01
Shared virtual memory (SVM) is a virtual memory layer with a single address space on top of a distributed real memory on parallel computers. We examine the behavior and performance of SVM running a parallel program with medium-grained, loop-level parallelism on top of it. A simulator for the underlying parallel architecture can be used to examine the behavior of SVM more deeply. The influence of several parameters, such as the number of processors, page size, cold or warm start, and restricted page replication, is studied.
FORCE2: A state-of-the-art two-phase code for hydrodynamic calculations
NASA Astrophysics Data System (ADS)
Ding, Jianmin; Lyczkowski, R. W.; Burge, S. W.
1993-02-01
A three-dimensional computer code for two-phase flow named FORCE2 has been developed by Babcock and Wilcox (B & W) in close collaboration with Argonne National Laboratory (ANL). FORCE2 is capable of both transient as well as steady-state simulations. This Cartesian coordinates computer program is a finite control volume, industrial grade and quality embodiment of the pilot-scale FLUFIX/MOD2 code and contains features such as three-dimensional blockages, volume and surface porosities to account for various obstructions in the flow field, and distributed resistance modeling to account for pressure drops caused by baffles, distributor plates and large tube banks. Recently computed results demonstrated the significance of and necessity for three-dimensional models of hydrodynamics and erosion. This paper describes the process whereby ANL's pilot-scale FLUFIX/MOD2 models and numerics were implemented into FORCE2. A description of the quality control to assess the accuracy of the new code and the validation using some of the measured data from Illinois Institute of Technology (UT) and the University of Illinois at Urbana-Champaign (UIUC) are given. It is envisioned that one day, FORCE2 with additional modules such as radiation heat transfer, combustion kinetics and multi-solids together with user-friendly pre- and post-processor software and tailored for massively parallel multiprocessor shared memory computational platforms will be used by industry and researchers to assist in reducing and/or eliminating the environmental and economic barriers which limit full consideration of coal, shale and biomass as energy sources, to retain energy security, and to remediate waste and ecological problems.
Experience with a UNIX based batch computing facility for H1
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gerhards, R.; Kruener-Marquis, U.; Szkutnik, Z.
1994-12-31
A UNIX based batch computing facility for the H1 experiment at DESY is described. The ultimate goal is to replace the DESY IBM mainframe by a multiprocessor SGI Challenge series computer, using the UNIX operating system, for most of the computing tasks in H1.
Model Checking, Abstraction, and Compositional Verification
1993-07-01
the ( alois connections used by Bensalrnu el al. [6], and also has some relation to Kurshan’s automata homonuor- phisms [62]. (Actually. we can impose a...multiprocessor simulation model. ACM Transactions on Computer Systems, 4(4):273-298, November 1986. [41 D. L. Beatty, R. E. Bryant, and C.-J. Seger
Developing Software to Use Parallel Processing Effectively
1988-10-01
Experience, Vol 15(6), June 1985, p53 Gajski85 Gajski , Daniel D. and Jih-Kwon Peir, "Essential Issues in Multiprocessor Systems", IEEE Computer, June...Treleaven (eds.), Springer-Verlag, pp. 213-225 (June 1987). Kuck83 David Kuck, Duncan Lawrie, Ron Cytron, Ahmed Sameh and Daniel Gajski , The Architecture and
Working Memory Span Development: A Time-Based Resource-Sharing Model Account
ERIC Educational Resources Information Center
Barrouillet, Pierre; Gavens, Nathalie; Vergauwe, Evie; Gaillard, Vinciane; Camos, Valerie
2009-01-01
The time-based resource-sharing model (P. Barrouillet, S. Bernardin, & V. Camos, 2004) assumes that during complex working memory span tasks, attention is frequently and surreptitiously switched from processing to reactivate decaying memory traces before their complete loss. Three experiments involving children from 5 to 14 years of age…
Direct access inter-process shared memory
Brightwell, Ronald B; Pedretti, Kevin; Hudson, Trammell B
2013-10-22
A technique for directly sharing physical memory between processes executing on processor cores is described. The technique includes loading a plurality of processes into the physical memory for execution on a corresponding plurality of processor cores sharing the physical memory. An address space is mapped to each of the processes by populating a first entry in a top level virtual address table for each of the processes. The address space of each of the processes is cross-mapped into each of the processes by populating one or more subsequent entries of the top level virtual address table with the first entry in the top level virtual address table from other processes.
Grouping and binding in visual short-term memory.
Quinlan, Philip T; Cohen, Dale J
2012-09-01
Findings of 2 experiments are reported that challenge the current understanding of visual short-term memory (VSTM). In both experiments, a single study display, containing 6 colored shapes, was presented briefly and then probed with a single colored shape. At stake is how VSTM retains a record of different objects that share common features: In the 1st experiment, 2 study items sometimes shared a common feature (either a shape or a color). The data revealed a color sharing effect, in which memory was much better for items that shared a common color than for items that did not. The 2nd experiment showed that the size of the color sharing effect depended on whether a single pair of items shared a common color or whether 2 pairs of items were so defined-memory for all items improved when 2 color groups were presented. In explaining performance, an account is advanced in which items compete for a fixed number of slots, but then memory recall for any given stored item is prone to error. A critical assumption is that items that share a common color are stored together in a slot as a chunk. The evidence provides further support for the idea that principles of perceptual organization may determine the manner in which items are stored in VSTM. PsycINFO Database Record (c) 2012 APA, all rights reserved.
Advanced Numerical Techniques of Performance Evaluation. Volume 2
1990-06-01
multiprocessor environment. This factor is determined by the overhead of the primitives available in the system ( semaphore , monitor , or message... semaphore , monitor , or message passing primitives ) and U the programming ability of the user who implements the simulation. " t,: the sequential...Warp Operating System . i Pro" lftevcnth ACM Symposum on Operating Systems Princlplcs, pages 77 9:3, Auslin, TX, Nov wicr 1987. ACM. [121 D.R. Jefferson
A Study of Shared-Memory Mutual Exclusion Protocols Using CADP
NASA Astrophysics Data System (ADS)
Mateescu, Radu; Serwe, Wendelin
Mutual exclusion protocols are an essential building block of concurrent systems: indeed, such a protocol is required whenever a shared resource has to be protected against concurrent non-atomic accesses. Hence, many variants of mutual exclusion protocols exist in the shared-memory setting, such as Peterson's or Dekker's well-known protocols. Although the functional correctness of these protocols has been studied extensively, relatively little attention has been paid to their non-functional aspects, such as their performance in the long run. In this paper, we report on experiments with the performance evaluation of mutual exclusion protocols using Interactive Markov Chains. Steady-state analysis provides an additional criterion for comparing protocols, which complements the verification of their functional properties. We also carefully re-examined the functional properties, whose accurate formulation as temporal logic formulas in the action-based setting turns out to be quite involved.
Handling debugger breakpoints in a shared instruction system
Gooding, Thomas Michael; Shok, Richard Michael
2014-01-21
A debugger debugs processes that execute shared instructions so that a breakpoint set for one process will not cause a breakpoint to occur in the other processes. A breakpoint is set by recording the original instruction at the desired location and writing a trap instruction to the shared instructions at that location. When a process encounters the breakpoint, the process passes control to the debugger for breakpoint processing if the breakpoint was set at that location for that process. If the trap was not set at that location for that process, the cacheline containing the trap is copied to a small scratchpad memory, and the virtual memory mappings are changed to translate the virtual address of the cacheline to the scratchpad. The original instruction is then written to replace the trap instruction in the scratchpad, so that process can execute the instructions in the scatchpad thereby avoiding the trap instruction.
Brown, Thackery I.; Stern, Chantal E.
2014-01-01
Many life experiences share information with other memories. In order to make decisions based on overlapping memories, we need to distinguish between experiences to determine the appropriate behavior for the current situation. Previous work suggests that the medial temporal lobe (MTL) and medial caudate interact to support the retrieval of overlapping navigational memories in different contexts. The present study used functional magnetic resonance imaging (fMRI) in humans to test the prediction that the MTL and medial caudate play complementary roles in learning novel mazes that cross paths with, and must be distinguished from, previously learned routes. During fMRI scanning, participants navigated virtual routes that were well learned from prior training while also learning new mazes. Critically, some routes learned during scanning shared hallways with those learned during pre-scan training. Overlap between mazes required participants to use contextual cues to select between alternative behaviors. Results demonstrated parahippocampal cortex activity specific for novel spatial cues that distinguish between overlapping routes. The hippocampus and medial caudate were active for learning overlapping spatial memories, and increased their activity for previously learned routes when they became context dependent. Our findings provide novel evidence that the MTL and medial caudate play complementary roles in the learning, updating, and execution of context-dependent navigational behaviors. PMID:23448868
Benchmarking GNU Radio Kernels and Multi-Processor Scheduling
2013-01-14
AMD E350 APU , comparable to Atom • ARM Cortex A8 running on a Gumstix Overo on an Ettus USRP E110 The general testing procedure consists of • Build...Intel Atom, and the AMD E350 APU . 3.2 Multi-Processor Scheduling Figure 1: GFLOPs per second through an FFT array on an Intel i7. Example output from
Considerations for Multiprocessor Topologies
NASA Technical Reports Server (NTRS)
Byrd, Gregory T.; Delagi, Bruce A.
1987-01-01
Choosing a multiprocessor interconnection topology may depend on high-level considerations, such as the intended application domain and the expected number of processors. It certainly depends on low-level implementation details, such as packaging and communications protocols. The authors first use rough measures of cost and performance to characterize several topologies. They then examine how implementation details can affect the realizable performance of a topology.
Shared reality in interpersonal relationships.
Andersen, Susan M; Przybylinski, Elizabeth
2017-11-24
Close relationships afford us opportunities to create and maintain meaning systems as shared perceptions of ourselves and the world. Establishing a sense of mutual understanding allows for creating and maintaining lasting social bonds, and as such, is important in human relations. In a related vein, it has long been known that knowledge of significant others in one's life is stored in memory and evoked with new persons-in the social-cognitive process of 'transference'-imbuing new encounters with significance and leading to predictable cognitive, evaluative, motivational, and behavioral consequences, as well as shifts in the self and self-regulation, depending on the particular significant other evoked. In these pages, we briefly review the literature on meaning as interpersonally defined and then selectively review research on transference in interpersonal perception. Based on this, we then highlight a recent series of studies focused on shared meaning systems in transference. The highlighted studies show that values and beliefs that develop in close relationships (as shared reality) are linked in memory to significant-other knowledge, and thus, are indirectly activated (made accessible) when cues in a new person implicitly activate that significant-other knowledge (in transference), with these shared beliefs then actively pursued with the new person and even protected against threat. This also confers a sense of mutual understanding, and all told, serves both relational and epistemic functions. In concluding, we consider as well the relevance of co-construction of shared reality n such processes. Copyright © 2017 Elsevier Ltd. All rights reserved.
Shared memories reveal shared structure in neural activity across individuals
Chen, J.; Leong, Y.C.; Honey, C.J.; Yong, C.H.; Norman, K.A.; Hasson, U.
2016-01-01
Our lives revolve around sharing experiences and memories with others. When different people recount the same events, how similar are their underlying neural representations? Participants viewed a fifty-minute movie, then verbally described the events during functional MRI, producing unguided detailed descriptions lasting up to forty minutes. As each person spoke, event-specific spatial patterns were reinstated in default-network, medial-temporal, and high-level visual areas. Individual event patterns were both highly discriminable from one another and similar between people, suggesting consistent spatial organization. In many high-order areas, patterns were more similar between people recalling the same event than between recall and perception, indicating systematic reshaping of percept into memory. These results reveal the existence of a common spatial organization for memories in high-level cortical areas, where encoded information is largely abstracted beyond sensory constraints; and that neural patterns during perception are altered systematically across people into shared memory representations for real-life events. PMID:27918531
Method of up-front load balancing for local memory parallel processors
NASA Technical Reports Server (NTRS)
Baffes, Paul Thomas (Inventor)
1990-01-01
In a parallel processing computer system with multiple processing units and shared memory, a method is disclosed for uniformly balancing the aggregate computational load in, and utilizing minimal memory by, a network having identical computations to be executed at each connection therein. Read-only and read-write memory are subdivided into a plurality of process sets, which function like artificial processing units. Said plurality of process sets is iteratively merged and reduced to the number of processing units without exceeding the balance load. Said merger is based upon the value of a partition threshold, which is a measure of the memory utilization. The turnaround time and memory savings of the instant method are functions of the number of processing units available and the number of partitions into which the memory is subdivided. Typical results of the preferred embodiment yielded memory savings of from sixty to seventy five percent.
Strategies for concurrent processing of complex algorithms in data driven architectures
NASA Technical Reports Server (NTRS)
Stoughton, John W.; Mielke, Roland R.
1987-01-01
The results of ongoing research directed at developing a graph theoretical model for describing data and control flow associated with the execution of large grained algorithms in a spatial distributed computer environment is presented. This model is identified by the acronym ATAMM (Algorithm/Architecture Mapping Model). The purpose of such a model is to provide a basis for establishing rules for relating an algorithm to its execution in a multiprocessor environment. Specifications derived from the model lead directly to the description of a data flow architecture which is a consequence of the inherent behavior of the data and control flow described by the model. The purpose of the ATAMM based architecture is to optimize computational concurrency in the multiprocessor environment and to provide an analytical basis for performance evaluation. The ATAMM model and architecture specifications are demonstrated on a prototype system for concept validation.
NASA Technical Reports Server (NTRS)
Noor, Ahmed K.; Peters, Jeanne M.
1989-01-01
A computational procedure is presented for the nonlinear dynamic analysis of unsymmetric structures on vector multiprocessor systems. The procedure is based on a novel hierarchical partitioning strategy in which the response of the unsymmetric and antisymmetric response vectors (modes), each obtained by using only a fraction of the degrees of freedom of the original finite element model. The three key elements of the procedure which result in high degree of concurrency throughout the solution process are: (1) mixed (or primitive variable) formulation with independent shape functions for the different fields; (2) operator splitting or restructuring of the discrete equations at each time step to delineate the symmetric and antisymmetric vectors constituting the response; and (3) two level iterative process for generating the response of the structure. An assessment is made of the effectiveness of the procedure on the CRAY X-MP/4 computers.
NASA Technical Reports Server (NTRS)
Gomez, Fernando
1989-01-01
It is shown how certain kinds of domain independent expert systems based on classification problem-solving methods can be constructed directly from natural language descriptions by a human expert. The expert knowledge is not translated into production rules. Rather, it is mapped into conceptual structures which are integrated into long-term memory (LTM). The resulting system is one in which problem-solving, retrieval and memory organization are integrated processes. In other words, the same algorithm and knowledge representation structures are shared by these processes. As a result of this, the system can answer questions, solve problems or reorganize LTM.
Choi, Hae-Yoon; Kensinger, Elizabeth A; Rajaram, Suparna
2017-09-01
Social transmission of memory and its consequence on collective memory have generated enduring interdisciplinary interest because of their widespread significance in interpersonal, sociocultural, and political arenas. We tested the influence of 3 key factors-emotional salience of information, group structure, and information distribution-on mnemonic transmission, social contagion, and collective memory. Participants individually studied emotionally salient (negative or positive) and nonemotional (neutral) picture-word pairs that were completely shared, partially shared, or unshared within participant triads, and then completed 3 consecutive recalls in 1 of 3 conditions: individual-individual-individual (control), collaborative-collaborative (identical group; insular structure)-individual, and collaborative-collaborative (reconfigured group; diverse structure)-individual. Collaboration enhanced negative memories especially in insular group structure and especially for shared information, and promoted collective forgetting of positive memories. Diverse group structure reduced this negativity effect. Unequally distributed information led to social contagion that creates false memories; diverse structure propagated a greater variety of false memories whereas insular structure promoted confidence in false recognition and false collective memory. A simultaneous assessment of network structure, information distribution, and emotional valence breaks new ground to specify how network structure shapes the spread of negative memories and false memories, and the emergence of collective memory. (PsycINFO Database Record (c) 2017 APA, all rights reserved).
Effects of cacheing on multitasking efficiency and programming strategy on an ELXSI 6400
DOE Office of Scientific and Technical Information (OSTI.GOV)
Montry, G.R.; Benner, R.E.
1985-12-01
The impact of a cache/shared memory architecture, and, in particular, the cache coherency problem, upon concurrent algorithm and program development is discussed. In this context, a simple set of programming strategies are proposed which streamline code development and improve code performance when multitasking in a cache/shared memory or distributed memory environment.
NASA Technical Reports Server (NTRS)
Mavriplis, D. J.; Das, Raja; Saltz, Joel; Vermeland, R. E.
1992-01-01
An efficient three dimensional unstructured Euler solver is parallelized on a Cray Y-MP C90 shared memory computer and on an Intel Touchstone Delta distributed memory computer. This paper relates the experiences gained and describes the software tools and hardware used in this study. Performance comparisons between two differing architectures are made.
Kuwajima, Mariko; Sawaguchi, Toshiyuki
2010-10-01
General fluid intelligence (gF) is a major component of intellect in both adults and children. Whereas its neural substrates have been studied relatively thoroughly in adults, those are poorly understood in children, particularly preschoolers. Here, we hypothesized that gF and visuospatial working memory share a common neural system within the lateral prefrontal cortex (LPFC) during the preschool years (4-6 years). At the behavioral level, we found that gF positively and significantly correlated with abilities (especially accuracy) in visuospatial working memory. Optical topography revealed that the LPFC of preschoolers was activated and deactivated during the visuospatial working memory task and the gF task. We found that the spatio-temporal features of neural activity in the LPFC were similar for both the visuospatial working memory task and the gF task. Further, 2 months of training for the visuospatial working memory task significantly increased gF in the preschoolers. These findings suggest that a common neural system in the LPFC is recruited to improve the visuospatial working memory and gF in preschoolers. Efficient recruitment of this neural system may be important for good performance in these functions in preschoolers, and behavioral training using this system would help to increase gF at these ages.
Static analysis of the hull plate using the finite element method
NASA Astrophysics Data System (ADS)
Ion, A.
2015-11-01
This paper aims at presenting the static analysis for two levels of a container ship's construction as follows: the first level is at the girder / hull plate and the second level is conducted at the entire strength hull of the vessel. This article will describe the work for the static analysis of a hull plate. We shall use the software package ANSYS Mechanical 14.5. The program is run on a computer with four Intel Xeon X5260 CPU processors at 3.33 GHz, 32 GB memory installed. In terms of software, the shared memory parallel version of ANSYS refers to running ANSYS across multiple cores on a SMP system. The distributed memory parallel version of ANSYS (Distributed ANSYS) refers to running ANSYS across multiple processors on SMP systems or DMP systems.
USSR Report: Cybernetics, Computers and Automation Technology. No. 69.
1983-05-06
computers in multiprocessor and multistation design , control and scientific research automation systems. The results of comparing the efficiency of...Podvizhnaya, Scientific Research Institute of Control Computers, Severodonetsk] [Text] The most significant change in the design of the SM-2M compared to...UPRAVLYAYUSHCHIYE SISTEMY I MASHINY, Nov-Dec 82) 95 APPLICATIONS Kiev Automated Control System, Design Features and Prospects for Development (V. A
Image Understanding and Intelligent Parallel Systems
1991-05-09
a common user interface for the interactive , graphical manipulation of those histories, and...Circuits and Systems, August 1987. Yap, S.-K. and M.L. Scott, "PenGuin: A language for reactive graphical user interface programming," to appear, INTERACT , Cambridge, United Kingdom, 1990. 74 ...of up to a factor of 100 over single-workstation implementations. User interfaces to large multiprocessor computers are a difficult issue addressed
Analysis of Photonic Networks for a Chip Multiprocessor Using Scientific Applications
2009-05-01
Analysis of Photonic Networks for a Chip Multiprocessor Using Scientific Applications Gilbert Hendry†, Shoaib Kamil‡?, Aleksandr Biberman†, Johnnie...electronic networks -on-chip warrants investigating real application traces on functionally compa- rable photonic and electronic network designs. We... network can achieve 75× improvement in energy ef- ficiency for synthetic benchmarks and up to 37× improve- ment for real scientific applications
Method for wiring allocation and switch configuration in a multiprocessor environment
Aridor, Yariv [Zichron Ya'akov, IL; Domany, Tamar [Kiryat Tivon, IL; Frachtenberg, Eitan [Jerusalem, IL; Gal, Yoav [Haifa, IL; Shmueli, Edi [Haifa, IL; Stockmeyer, legal representative, Robert E.; Stockmeyer, Larry Joseph [San Jose, CA
2008-07-15
A method for wiring allocation and switch configuration in a multiprocessor computer, the method including employing depth-first tree traversal to determine a plurality of paths among a plurality of processing elements allocated to a job along a plurality of switches and wires in a plurality of D-lines, and selecting one of the paths in accordance with at least one selection criterion.
Generic Software for Emulating Multiprocessor Architectures.
1985-05-01
RD-A157 662 GENERIC SOFTWARE FOR EMULATING MULTIPROCESSOR 1/2 AlRCHITECTURES(J) MASSACHUSETTS INST OF TECH CAMBRIDGE U LRS LAB FOR COMPUTER SCIENCE R...AREA & WORK UNIT NUMBERS MIT Laboratory for Computer Science 545 Technology Square Cambridge, MA 02139 ____________ I I. CONTROLLING OFFICE NAME AND...aide If neceeasy end Identify by block number) Computer architecture, emulation, simulation, dataf low 20. ABSTRACT (Continue an reverse slde It
Investigation into the Use of Texturing for Real-Time Computer Animation.
1987-12-01
produce a rough polygon surface [7]. Research in the area of real time texturing has also been conducted. Using a specially designed multi-processor system ...Oka, Tsutsui, Ohba, Kurauchi and Tago have introduced real-time manipulation of texture mapped surfaces [8]. Using multi- processors, systems will...a call to the system function defpattern(n,size,mask) short n,size; short *mask, which takes as input an index to a system table of patterns, a
NASA Technical Reports Server (NTRS)
Abramson, N.
1974-01-01
The Aloha system was studied and developed and extended to advanced forms of computer communications networks. Theoretical and simulation studies of Aloha type radio channels for use in packet switched communications networks were performed. Improved versions of the Aloha communications techniques and their extensions were tested experimentally. A packet radio repeater suitable for use with the Aloha system operational network was developed. General studies of the organization of multiprocessor systems centered on the development of the BCC 500 computer were concluded.
Schapiro, Anna C; McDevitt, Elizabeth A; Chen, Lang; Norman, Kenneth A; Mednick, Sara C; Rogers, Timothy T
2017-11-01
Semantic memory encompasses knowledge about both the properties that typify concepts (e.g. robins, like all birds, have wings) as well as the properties that individuate conceptually related items (e.g. robins, in particular, have red breasts). We investigate the impact of sleep on new semantic learning using a property inference task in which both kinds of information are initially acquired equally well. Participants learned about three categories of novel objects possessing some properties that were shared among category exemplars and others that were unique to an exemplar, with exposure frequency varying across categories. In Experiment 1, memory for shared properties improved and memory for unique properties was preserved across a night of sleep, while memory for both feature types declined over a day awake. In Experiment 2, memory for shared properties improved across a nap, but only for the lower-frequency category, suggesting a prioritization of weakly learned information early in a sleep period. The increase was significantly correlated with amount of REM, but was also observed in participants who did not enter REM, suggesting involvement of both REM and NREM sleep. The results provide the first evidence that sleep improves memory for the shared structure of object categories, while simultaneously preserving object-unique information.