Conditional load and store in a shared memory
Blumrich, Matthias A; Ohmacht, Martin
2015-02-03
A method, system and computer program product for implementing load-reserve and store-conditional instructions in a multi-processor computing system. The computing system includes a multitude of processor units and a shared memory cache, and each of the processor units has access to the memory cache. In one embodiment, the method comprises providing the memory cache with a series of reservation registers, and storing in these registers addresses reserved in the memory cache for the processor units as a result of issuing load-reserve requests. In this embodiment, when one of the processor units makes a request to store data in the memory cache using a store-conditional request, the reservation registers are checked to determine if an address in the memory cache is reserved for that processor unit. If an address in the memory cache is reserved for that processor, the data are stored at this address.
Ordering of guarded and unguarded stores for no-sync I/O
Gara, Alan; Ohmacht, Martin
2013-06-25
A parallel computing system processes at least one store instruction. A first processor core issues a store instruction. A first queue, associated with the first processor core, stores the store instruction. A second queue, associated with a first local cache memory device of the first processor core, stores the store instruction. The first processor core updates first data in the first local cache memory device according to the store instruction. The third queue, associated with at least one shared cache memory device, stores the store instruction. The first processor core invalidates second data, associated with the store instruction, in the at least one shared cache memory. The first processor core invalidates third data, associated with the store instruction, in other local cache memory devices of other processor cores. The first processor core flushing only the first queue.
NASA Technical Reports Server (NTRS)
Harper, Richard E.; Butler, Bryan P.
1990-01-01
The Draper fault-tolerant processor with fault-tolerant shared memory (FTP/FTSM), which is designed to allow application tasks to continue execution during the memory alignment process, is described. Processor performance is not affected by memory alignment. In addition, the FTP/FTSM incorporates a hardware scrubber device to perform the memory alignment quickly during unused memory access cycles. The FTP/FTSM architecture is described, followed by an estimate of the time required for channel reintegration.
Multiprocessor shared-memory information exchange
DOE Office of Scientific and Technical Information (OSTI.GOV)
Santoline, L.L.; Bowers, M.D.; Crew, A.W.
1989-02-01
In distributed microprocessor-based instrumentation and control systems, the inter-and intra-subsystem communication requirements ultimately form the basis for the overall system architecture. This paper describes a software protocol which addresses the intra-subsystem communications problem. Specifically the protocol allows for multiple processors to exchange information via a shared-memory interface. The authors primary goal is to provide a reliable means for information to be exchanged between central application processor boards (masters) and dedicated function processor boards (slaves) in a single computer chassis. The resultant Multiprocessor Shared-Memory Information Exchange (MSMIE) protocol, a standard master-slave shared-memory interface suitable for use in nuclear safety systems, ismore » designed to pass unidirectional buffers of information between the processors while providing a minimum, deterministic cycle time for this data exchange.« less
System and method for programmable bank selection for banked memory subsystems
Blumrich, Matthias A.; Chen, Dong; Gara, Alan G.; Giampapa, Mark E.; Hoenicke, Dirk; Ohmacht, Martin; Salapura, Valentina; Sugavanam, Krishnan
2010-09-07
A programmable memory system and method for enabling one or more processor devices access to shared memory in a computing environment, the shared memory including one or more memory storage structures having addressable locations for storing data. The system comprises: one or more first logic devices associated with a respective one or more processor devices, each first logic device for receiving physical memory address signals and programmable for generating a respective memory storage structure select signal upon receipt of pre-determined address bit values at selected physical memory address bit locations; and, a second logic device responsive to each of the respective select signal for generating an address signal used for selecting a memory storage structure for processor access. The system thus enables each processor device of a computing environment memory storage access distributed across the one or more memory storage structures.
Method for prefetching non-contiguous data structures
Blumrich, Matthias A [Ridgefield, CT; Chen, Dong [Croton On Hudson, NY; Coteus, Paul W [Yorktown Heights, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Philip [Cortlandt Manor, NY; Hoenicke, Dirk [Ossining, NY; Ohmacht, Martin [Brewster, NY; Steinmacher-Burow, Burkhard D [Mount Kisco, NY; Takken, Todd E [Mount Kisco, NY; Vranas, Pavlos M [Bedford Hills, NY
2009-05-05
A low latency memory system access is provided in association with a weakly-ordered multiprocessor system. Each processor in the multiprocessor shares resources, and each shared resource has an associated lock within a locking device that provides support for synchronization between the multiple processors in the multiprocessor and the orderly sharing of the resources. A processor only has permission to access a resource when it owns the lock associated with that resource, and an attempt by a processor to own a lock requires only a single load operation, rather than a traditional atomic load followed by store, such that the processor only performs a read operation and the hardware locking device performs a subsequent write operation rather than the processor. A simple perfecting for non-contiguous data structures is also disclosed. A memory line is redefined so that in addition to the normal physical memory data, every line includes a pointer that is large enough to point to any other line in the memory, wherein the pointers to determine which memory line to prefect rather than some other predictive algorithm. This enables hardware to effectively prefect memory access patterns that are non-contiguous, but repetitive.
Low latency memory access and synchronization
DOE Office of Scientific and Technical Information (OSTI.GOV)
Blumrich, Matthias A.; Chen, Dong; Coteus, Paul W.
A low latency memory system access is provided in association with a weakly-ordered multiprocessor system. Each processor in the multiprocessor shares resources, and each shared resource has an associated lock within a locking device that provides support for synchronization between the multiple processors in the multiprocessor and the orderly sharing of the resources. A processor only has permission to access a resource when it owns the lock associated with that resource, and an attempt by a processor to own a lock requires only a single load operation, rather than a traditional atomic load followed by store, such that the processormore » only performs a read operation and the hardware locking device performs a subsequent write operation rather than the processor. A simple prefetching for non-contiguous data structures is also disclosed. A memory line is redefined so that in addition to the normal physical memory data, every line includes a pointer that is large enough to point to any other line in the memory, wherein the pointers to determine which memory line to prefetch rather than some other predictive algorithm. This enables hardware to effectively prefetch memory access patterns that are non-contiguous, but repetitive.« less
Low latency memory access and synchronization
DOE Office of Scientific and Technical Information (OSTI.GOV)
Blumrich, Matthias A.; Chen, Dong; Coteus, Paul W.
A low latency memory system access is provided in association with a weakly-ordered multiprocessor system. Bach processor in the multiprocessor shares resources, and each shared resource has an associated lock within a locking device that provides support for synchronization between the multiple processors in the multiprocessor and the orderly sharing of the resources. A processor only has permission to access a resource when it owns the lock associated with that resource, and an attempt by a processor to own a lock requires only a single load operation, rather than a traditional atomic load followed by store, such that the processormore » only performs a read operation and the hardware locking device performs a subsequent write operation rather than the processor. A simple prefetching for non-contiguous data structures is also disclosed. A memory line is redefined so that in addition to the normal physical memory data, every line includes a pointer that is large enough to point to any other line in the memory, wherein the pointers to determine which memory line to prefetch rather than some other predictive algorithm. This enables hardware to effectively prefetch memory access patterns that are non-contiguous, but repetitive.« less
On nonlinear finite element analysis in single-, multi- and parallel-processors
NASA Technical Reports Server (NTRS)
Utku, S.; Melosh, R.; Islam, M.; Salama, M.
1982-01-01
Numerical solution of nonlinear equilibrium problems of structures by means of Newton-Raphson type iterations is reviewed. Each step of the iteration is shown to correspond to the solution of a linear problem, therefore the feasibility of the finite element method for nonlinear analysis is established. Organization and flow of data for various types of digital computers, such as single-processor/single-level memory, single-processor/two-level-memory, vector-processor/two-level-memory, and parallel-processors, with and without sub-structuring (i.e. partitioning) are given. The effect of the relative costs of computation, memory and data transfer on substructuring is shown. The idea of assigning comparable size substructures to parallel processors is exploited. Under Cholesky type factorization schemes, the efficiency of parallel processing is shown to decrease due to the occasional shared data, just as that due to the shared facilities.
Hypercluster - Parallel processing for computational mechanics
NASA Technical Reports Server (NTRS)
Blech, Richard A.
1988-01-01
An account is given of the development status, performance capabilities and implications for further development of NASA-Lewis' testbed 'hypercluster' parallel computer network, in which multiple processors communicate through a shared memory. Processors have local as well as shared memory; the hypercluster is expanded in the same manner as the hypercube, with processor clusters replacing the normal single processor node. The NASA-Lewis machine has three nodes with a vector personality and one node with a scalar personality. Each of the vector nodes uses four board-level vector processors, while the scalar node uses four general-purpose microcomputer boards.
System and method for memory allocation in a multiclass memory system
Loh, Gabriel; Meswani, Mitesh; Ignatowski, Michael; Nutter, Mark
2016-06-28
A system for memory allocation in a multiclass memory system includes a processor coupleable to a plurality of memories sharing a unified memory address space, and a library store to store a library of software functions. The processor identifies a type of a data structure in response to a memory allocation function call to the library for allocating memory to the data structure. Using the library, the processor allocates portions of the data structure among multiple memories of the multiclass memory system based on the type of the data structure.
DMA shared byte counters in a parallel computer
Chen, Dong; Gara, Alan G.; Heidelberger, Philip; Vranas, Pavlos
2010-04-06
A parallel computer system is constructed as a network of interconnected compute nodes. Each of the compute nodes includes at least one processor, a memory and a DMA engine. The DMA engine includes a processor interface for interfacing with the at least one processor, DMA logic, a memory interface for interfacing with the memory, a DMA network interface for interfacing with the network, injection and reception byte counters, injection and reception FIFO metadata, and status registers and control registers. The injection FIFOs maintain memory locations of the injection FIFO metadata memory locations including its current head and tail, and the reception FIFOs maintain the reception FIFO metadata memory locations including its current head and tail. The injection byte counters and reception byte counters may be shared between messages.
Direct access inter-process shared memory
Brightwell, Ronald B; Pedretti, Kevin; Hudson, Trammell B
2013-10-22
A technique for directly sharing physical memory between processes executing on processor cores is described. The technique includes loading a plurality of processes into the physical memory for execution on a corresponding plurality of processor cores sharing the physical memory. An address space is mapped to each of the processes by populating a first entry in a top level virtual address table for each of the processes. The address space of each of the processes is cross-mapped into each of the processes by populating one or more subsequent entries of the top level virtual address table with the first entry in the top level virtual address table from other processes.
Vascular system modeling in parallel environment - distributed and shared memory approaches
Jurczuk, Krzysztof; Kretowski, Marek; Bezy-Wendling, Johanne
2011-01-01
The paper presents two approaches in parallel modeling of vascular system development in internal organs. In the first approach, new parts of tissue are distributed among processors and each processor is responsible for perfusing its assigned parts of tissue to all vascular trees. Communication between processors is accomplished by passing messages and therefore this algorithm is perfectly suited for distributed memory architectures. The second approach is designed for shared memory machines. It parallelizes the perfusion process during which individual processing units perform calculations concerning different vascular trees. The experimental results, performed on a computing cluster and multi-core machines, show that both algorithms provide a significant speedup. PMID:21550891
Reader set encoding for directory of shared cache memory in multiprocessor system
Ahn, Dnaiel; Ceze, Luis H.; Gara, Alan; Ohmacht, Martin; Xiaotong, Zhuang
2014-06-10
In a parallel processing system with speculative execution, conflict checking occurs in a directory lookup of a cache memory that is shared by all processors. In each case, the same physical memory address will map to the same set of that cache, no matter which processor originated that access. The directory includes a dynamic reader set encoding, indicating what speculative threads have read a particular line. This reader set encoding is used in conflict checking. A bitset encoding is used to specify particular threads that have read the line.
Architectures for reasoning in parallel
NASA Technical Reports Server (NTRS)
Hall, Lawrence O.
1989-01-01
The research conducted has dealt with rule-based expert systems. The algorithms that may lead to effective parallelization of them were investigated. Both the forward and backward chained control paradigms were investigated in the course of this work. The best computer architecture for the developed and investigated algorithms has been researched. Two experimental vehicles were developed to facilitate this research. They are Backpac, a parallel backward chained rule-based reasoning system and Datapac, a parallel forward chained rule-based reasoning system. Both systems have been written in Multilisp, a version of Lisp which contains the parallel construct, future. Applying the future function to a function causes the function to become a task parallel to the spawning task. Additionally, Backpac and Datapac have been run on several disparate parallel processors. The machines are an Encore Multimax with 10 processors, the Concert Multiprocessor with 64 processors, and a 32 processor BBN GP1000. Both the Concert and the GP1000 are switch-based machines. The Multimax has all its processors hung off a common bus. All are shared memory machines, but have different schemes for sharing the memory and different locales for the shared memory. The main results of the investigations come from experiments on the 10 processor Encore and the Concert with partitions of 32 or less processors. Additionally, experiments have been run with a stripped down version of EMYCIN.
Applications considerations in the system design of highly concurrent multiprocessors
NASA Technical Reports Server (NTRS)
Lundstrom, Stephen F.
1987-01-01
A flow model processor approach to parallel processing is described, using very-high-performance individual processors, high-speed circuit switched interconnection networks, and a high-speed synchronization capability to minimize the effect of the inherently serial portions of applications on performance. Design studies related to the determination of the number of processors, the memory organization, and the structure of the networks used to interconnect the processor and memory resources are discussed. Simulations indicate that applications centered on the large shared data memory should be able to sustain over 500 million floating point operations per second.
Parallel discrete event simulation: A shared memory approach
NASA Technical Reports Server (NTRS)
Reed, Daniel A.; Malony, Allen D.; Mccredie, Bradley D.
1987-01-01
With traditional event list techniques, evaluating a detailed discrete event simulation model can often require hours or even days of computation time. Parallel simulation mimics the interacting servers and queues of a real system by assigning each simulated entity to a processor. By eliminating the event list and maintaining only sufficient synchronization to insure causality, parallel simulation can potentially provide speedups that are linear in the number of processors. A set of shared memory experiments is presented using the Chandy-Misra distributed simulation algorithm to simulate networks of queues. Parameters include queueing network topology and routing probabilities, number of processors, and assignment of network nodes to processors. These experiments show that Chandy-Misra distributed simulation is a questionable alternative to sequential simulation of most queueing network models.
Parallel discrete event simulation using shared memory
NASA Technical Reports Server (NTRS)
Reed, Daniel A.; Malony, Allen D.; Mccredie, Bradley D.
1988-01-01
With traditional event-list techniques, evaluating a detailed discrete-event simulation-model can often require hours or even days of computation time. By eliminating the event list and maintaining only sufficient synchronization to ensure causality, parallel simulation can potentially provide speedups that are linear in the numbers of processors. A set of shared-memory experiments, using the Chandy-Misra distributed-simulation algorithm, to simulate networks of queues is presented. Parameters of the study include queueing network topology and routing probabilities, number of processors, and assignment of network nodes to processors. These experiments show that Chandy-Misra distributed simulation is a questionable alternative to sequential-simulation of most queueing network models.
A message passing kernel for the hypercluster parallel processing test bed
NASA Technical Reports Server (NTRS)
Blech, Richard A.; Quealy, Angela; Cole, Gary L.
1989-01-01
A Message-Passing Kernel (MPK) for the Hypercluster parallel-processing test bed is described. The Hypercluster is being developed at the NASA Lewis Research Center to support investigations of parallel algorithms and architectures for computational fluid and structural mechanics applications. The Hypercluster resembles the hypercube architecture except that each node consists of multiple processors communicating through shared memory. The MPK efficiently routes information through the Hypercluster, using a message-passing protocol when necessary and faster shared-memory communication whenever possible. The MPK also interfaces all of the processors with the Hypercluster operating system (HYCLOPS), which runs on a Front-End Processor (FEP). This approach distributes many of the I/O tasks to the Hypercluster processors and eliminates the need for a separate I/O support program on the FEP.
NASA Technical Reports Server (NTRS)
Shalkhauser, Mary JO; Quintana, Jorge A.; Soni, Nitin J.
1994-01-01
The NASA Lewis Research Center is developing a multichannel communication signal processing satellite (MCSPS) system which will provide low data rate, direct to user, commercial communications services. The focus of current space segment developments is a flexible, high-throughput, fault tolerant onboard information switching processor. This information switching processor (ISP) is a destination-directed packet switch which performs both space and time switching to route user information among numerous user ground terminals. Through both industry study contracts and in-house investigations, several packet switching architectures were examined. A contention-free approach, the shared memory per beam architecture, was selected for implementation. The shared memory per beam architecture, fault tolerance insertion, implementation, and demonstration plans are described.
Error recovery in shared memory multiprocessors using private caches
NASA Technical Reports Server (NTRS)
Wu, Kun-Lung; Fuchs, W. Kent; Patel, Janak H.
1990-01-01
The problem of recovering from processor transient faults in shared memory multiprocesses systems is examined. A user-transparent checkpointing and recovery scheme using private caches is presented. Processes can recover from errors due to faulty processors by restarting from the checkpointed computation state. Implementation techniques using checkpoint identifiers and recovery stacks are examined as a means of reducing performance degradation in processor utilization during normal execution. This cache-based checkpointing technique prevents rollback propagation, provides rapid recovery, and can be integrated into standard cache coherence protocols. An analytical model is used to estimate the relative performance of the scheme during normal execution. Extensions to take error latency into account are presented.
Android Protection Mechanism: A Signed Code Security Mechanism for Smartphone Applications
2011-03-01
status registers, exceptions, endian support, unaligned access support, synchronization primitives , the Jazelle Extension, and saturated integer...supports comprehensive non-blocking shared-memory synchronization primitives that scale for multiple-processor system designs. This is an improvement... synchronization . Memory semaphores can be loaded and altered without interruption because the load and store operations are atomic. Processor
Solutions and debugging for data consistency in multiprocessors with noncoherent caches
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bernstein, D.; Mendelson, B.; Breternitz, M. Jr.
1995-02-01
We analyze two important problems that arise in shared-memory multiprocessor systems. The stale data problem involves ensuring that data items in local memory of individual processors are current, independent of writes done by other processors. False sharing occurs when two processors have copies of the same shared data block but update different portions of the block. The false sharing problem involves guaranteeing that subsequent writes are properly combined. In modern architectures these problems are usually solved in hardware, by exploiting mechanisms for hardware controlled cache consistency. This leads to more expensive and nonscalable designs. Therefore, we are concentrating on softwaremore » methods for ensuring cache consistency that would allow for affordable and scalable multiprocessing systems. Unfortunately, providing software control is nontrivial, both for the compiler writer and for the application programmer. For this reason we are developing a debugging environment that will facilitate the development of compiler-based techniques and will help the programmer to tune his or her application using explicit cache management mechanisms. We extend the notion of a race condition for IBM Shared Memory System POWER/4, taking into consideration its noncoherent caches, and propose techniques for detection of false sharing problems. Identification of the stale data problem is discussed as well, and solutions are suggested.« less
Parallelising a molecular dynamics algorithm on a multi-processor workstation
NASA Astrophysics Data System (ADS)
Müller-Plathe, Florian
1990-12-01
The Verlet neighbour-list algorithm is parallelised for a multi-processor Hewlett-Packard/Apollo DN10000 workstation. The implementation makes use of memory shared between the processors. It is a genuine master-slave approach by which most of the computational tasks are kept in the master process and the slaves are only called to do part of the nonbonded forces calculation. The implementation features elements of both fine-grain and coarse-grain parallelism. Apart from three calls to library routines, two of which are standard UNIX calls, and two machine-specific language extensions, the whole code is written in standard Fortran 77. Hence, it may be expected that this parallelisation concept can be transfered in parts or as a whole to other multi-processor shared-memory computers. The parallel code is routinely used in production work.
Cache-based error recovery for shared memory multiprocessor systems
NASA Technical Reports Server (NTRS)
Wu, Kun-Lung; Fuchs, W. Kent; Patel, Janak H.
1989-01-01
A multiprocessor cache-based checkpointing and recovery scheme for of recovering from transient processor errors in a shared-memory multiprocessor with private caches is presented. New implementation techniques that use checkpoint identifiers and recovery stacks to reduce performance degradation in processor utilization during normal execution are examined. This cache-based checkpointing technique prevents rollback propagation, provides for rapid recovery, and can be integrated into standard cache coherence protocols. An analytical model is used to estimate the relative performance of the scheme during normal execution. Extensions that take error latency into account are presented.
A High Performance VLSI Computer Architecture For Computer Graphics
NASA Astrophysics Data System (ADS)
Chin, Chi-Yuan; Lin, Wen-Tai
1988-10-01
A VLSI computer architecture, consisting of multiple processors, is presented in this paper to satisfy the modern computer graphics demands, e.g. high resolution, realistic animation, real-time display etc.. All processors share a global memory which are partitioned into multiple banks. Through a crossbar network, data from one memory bank can be broadcasted to many processors. Processors are physically interconnected through a hyper-crossbar network (a crossbar-like network). By programming the network, the topology of communication links among processors can be reconfigurated to satisfy specific dataflows of different applications. Each processor consists of a controller, arithmetic operators, local memory, a local crossbar network, and I/O ports to communicate with other processors, memory banks, and a system controller. Operations in each processor are characterized into two modes, i.e. object domain and space domain, to fully utilize the data-independency characteristics of graphics processing. Special graphics features such as 3D-to-2D conversion, shadow generation, texturing, and reflection, can be easily handled. With the current high density interconnection (MI) technology, it is feasible to implement a 64-processor system to achieve 2.5 billion operations per second, a performance needed in most advanced graphics applications.
C-MOS array design techniques: SUMC multiprocessor system study
NASA Technical Reports Server (NTRS)
Clapp, W. A.; Helbig, W. A.; Merriam, A. S.
1972-01-01
The current capabilities of LSI techniques for speed and reliability, plus the possibilities of assembling large configurations of LSI logic and storage elements, have demanded the study of multiprocessors and multiprocessing techniques, problems, and potentialities. Evaluated are three previous systems studies for a space ultrareliable modular computer multiprocessing system, and a new multiprocessing system is proposed that is flexibly configured with up to four central processors, four 1/0 processors, and 16 main memory units, plus auxiliary memory and peripheral devices. This multiprocessor system features a multilevel interrupt, qualified S/360 compatibility for ground-based generation of programs, virtual memory management of a storage hierarchy through 1/0 processors, and multiport access to multiple and shared memory units.
An enhanced Ada run-time system for real-time embedded processors
NASA Technical Reports Server (NTRS)
Sims, J. T.
1991-01-01
An enhanced Ada run-time system has been developed to support real-time embedded processor applications. The primary focus of this development effort has been on the tasking system and the memory management facilities of the run-time system. The tasking system has been extended to support efficient and precise periodic task execution as required for control applications. Event-driven task execution providing a means of task-asynchronous control and communication among Ada tasks is supported in this system. Inter-task control is even provided among tasks distributed on separate physical processors. The memory management system has been enhanced to provide object allocation and protected access support for memory shared between disjoint processors, each of which is executing a distinct Ada program.
An Adaptive Insertion and Promotion Policy for Partitioned Shared Caches
NASA Astrophysics Data System (ADS)
Mahrom, Norfadila; Liebelt, Michael; Raof, Rafikha Aliana A.; Daud, Shuhaizar; Hafizah Ghazali, Nur
2018-03-01
Cache replacement policies in chip multiprocessors (CMP) have been investigated extensively and proven able to enhance shared cache management. However, competition among multiple processors executing different threads that require simultaneous access to a shared memory may cause cache contention and memory coherence problems on the chip. These issues also exist due to some drawbacks of the commonly used Least Recently Used (LRU) policy employed in multiprocessor systems, which are because of the cache lines residing in the cache longer than required. In image processing analysis of for example extra pulmonary tuberculosis (TB), an accurate diagnosis for tissue specimen is required. Therefore, a fast and reliable shared memory management system to execute algorithms for processing vast amount of specimen image is needed. In this paper, the effects of the cache replacement policy in a partitioned shared cache are investigated. The goal is to quantify whether better performance can be achieved by using less complex replacement strategies. This paper proposes a Middle Insertion 2 Positions Promotion (MI2PP) policy to eliminate cache misses that could adversely affect the access patterns and the throughput of the processors in the system. The policy employs a static predefined insertion point, near distance promotion, and the concept of ownership in the eviction policy to effectively improve cache thrashing and to avoid resource stealing among the processors.
NASA Technical Reports Server (NTRS)
Fatoohi, Rod; Saini, Subbash; Ciotti, Robert
2006-01-01
We study the performance of inter-process communication on four high-speed multiprocessor systems using a set of communication benchmarks. The goal is to identify certain limiting factors and bottlenecks with the interconnect of these systems as well as to compare these interconnects. We measured network bandwidth using different number of communicating processors and communication patterns, such as point-to-point communication, collective communication, and dense communication patterns. The four platforms are: a 512-processor SGI Altix 3700 BX2 shared-memory machine with 3.2 GB/s links; a 64-processor (single-streaming) Cray XI shared-memory machine with 32 1.6 GB/s links; a 128-processor Cray Opteron cluster using a Myrinet network; and a 1280-node Dell PowerEdge cluster with an InfiniBand network. Our, results show the impact of the network bandwidth and topology on the overall performance of each interconnect.
NASA Technical Reports Server (NTRS)
Mejzak, R. S.
1980-01-01
The distributed processing concept is defined in terms of control primitives, variables, and structures and their use in performing a decomposed discrete Fourier transform (DET) application function. The design assumes interprocessor communications to be anonymous. In this scheme, all processors can access an entire common database by employing control primitives. Access to selected areas within the common database is random, enforced by a hardware lock, and determined by task and subtask pointers. This enables the number of processors to be varied in the configuration without any modifications to the control structure. Decompositional elements of the DFT application function in terms of tasks and subtasks are also described. The experimental hardware configuration consists of IMSAI 8080 chassis which are independent, 8 bit microcomputer units. These chassis are linked together to form a multiple processing system by means of a shared memory facility. This facility consists of hardware which provides a bus structure to enable up to six microcomputers to be interconnected. It provides polling and arbitration logic so that only one processor has access to shared memory at any one time.
Shared versus distributed memory multiprocessors
NASA Technical Reports Server (NTRS)
Jordan, Harry F.
1991-01-01
The question of whether multiprocessors should have shared or distributed memory has attracted a great deal of attention. Some researchers argue strongly for building distributed memory machines, while others argue just as strongly for programming shared memory multiprocessors. A great deal of research is underway on both types of parallel systems. Special emphasis is placed on systems with a very large number of processors for computation intensive tasks and considers research and implementation trends. It appears that the two types of systems will likely converge to a common form for large scale multiprocessors.
Initial Performance Results on IBM POWER6
NASA Technical Reports Server (NTRS)
Saini, Subbash; Talcott, Dale; Jespersen, Dennis; Djomehri, Jahed; Jin, Haoqiang; Mehrotra, Piysuh
2008-01-01
The POWER5+ processor has a faster memory bus than that of the previous generation POWER5 processor (533 MHz vs. 400 MHz), but the measured per-core memory bandwidth of the latter is better than that of the former (5.7 GB/s vs. 4.3 GB/s). The reason for this is that in the POWER5+, the two cores on the chip share the L2 cache, L3 cache and memory bus. The memory controller is also on the chip and is shared by the two cores. This serializes the path to memory. For consistently good performance on a wide range of applications, the performance of the processor, the memory subsystem, and the interconnects (both latency and bandwidth) should be balanced. Recognizing this, IBM has designed the Power6 processor so as to avoid the bottlenecks due to the L2 cache, memory controller and buffer chips of the POWER5+. Unlike the POWER5+, each core in the POWER6 has its own L2 cache (4 MB - double that of the Power5+), memory controller and buffer chips. Each core in the POWER6 runs at 4.7 GHz instead of 1.9 GHz in POWER5+. In this paper, we evaluate the performance of a dual-core Power6 based IBM p6-570 system, and we compare its performance with that of a dual-core Power5+ based IBM p575+ system. In this evaluation, we have used the High- Performance Computing Challenge (HPCC) benchmarks, NAS Parallel Benchmarks (NPB), and four real-world applications--three from computational fluid dynamics and one from climate modeling.
Memory access in shared virtual memory
DOE Office of Scientific and Technical Information (OSTI.GOV)
Berrendorf, R.
1992-01-01
Shared virtual memory (SVM) is a virtual memory layer with a single address space on top of a distributed real memory on parallel computers. We examine the behavior and performance of SVM running a parallel program with medium-grained, loop-level parallelism on top of it. A simulator for the underlying parallel architecture can be used to examine the behavior of SVM more deeply. The influence of several parameters, such as the number of processors, page size, cold or warm start, and restricted page replication, is studied.
Memory access in shared virtual memory
DOE Office of Scientific and Technical Information (OSTI.GOV)
Berrendorf, R.
1992-09-01
Shared virtual memory (SVM) is a virtual memory layer with a single address space on top of a distributed real memory on parallel computers. We examine the behavior and performance of SVM running a parallel program with medium-grained, loop-level parallelism on top of it. A simulator for the underlying parallel architecture can be used to examine the behavior of SVM more deeply. The influence of several parameters, such as the number of processors, page size, cold or warm start, and restricted page replication, is studied.
Methodology for fast detection of false sharing in threaded scientific codes
Chung, I-Hsin; Cong, Guojing; Murata, Hiroki; Negishi, Yasushi; Wen, Hui-Fang
2014-11-25
A profiling tool identifies a code region with a false sharing potential. A static analysis tool classifies variables and arrays in the identified code region. A mapping detection library correlates memory access instructions in the identified code region with variables and arrays in the identified code region while a processor is running the identified code region. The mapping detection library identifies one or more instructions at risk, in the identified code region, which are subject to an analysis by a false sharing detection library. A false sharing detection library performs a run-time analysis of the one or more instructions at risk while the processor is re-running the identified code region. The false sharing detection library determines, based on the performed run-time analysis, whether two different portions of the cache memory line are accessed by the generated binary code.
NASA Astrophysics Data System (ADS)
Nishiura, Daisuke; Furuichi, Mikito; Sakaguchi, Hide
2015-09-01
The computational performance of a smoothed particle hydrodynamics (SPH) simulation is investigated for three types of current shared-memory parallel computer devices: many integrated core (MIC) processors, graphics processing units (GPUs), and multi-core CPUs. We are especially interested in efficient shared-memory allocation methods for each chipset, because the efficient data access patterns differ between compute unified device architecture (CUDA) programming for GPUs and OpenMP programming for MIC processors and multi-core CPUs. We first introduce several parallel implementation techniques for the SPH code, and then examine these on our target computer architectures to determine the most effective algorithms for each processor unit. In addition, we evaluate the effective computing performance and power efficiency of the SPH simulation on each architecture, as these are critical metrics for overall performance in a multi-device environment. In our benchmark test, the GPU is found to produce the best arithmetic performance as a standalone device unit, and gives the most efficient power consumption. The multi-core CPU obtains the most effective computing performance. The computational speed of the MIC processor on Xeon Phi approached that of two Xeon CPUs. This indicates that using MICs is an attractive choice for existing SPH codes on multi-core CPUs parallelized by OpenMP, as it gains computational acceleration without the need for significant changes to the source code.
Parallel ALLSPD-3D: Speeding Up Combustor Analysis Via Parallel Processing
NASA Technical Reports Server (NTRS)
Fricker, David M.
1997-01-01
The ALLSPD-3D Computational Fluid Dynamics code for reacting flow simulation was run on a set of benchmark test cases to determine its parallel efficiency. These test cases included non-reacting and reacting flow simulations with varying numbers of processors. Also, the tests explored the effects of scaling the simulation with the number of processors in addition to distributing a constant size problem over an increasing number of processors. The test cases were run on a cluster of IBM RS/6000 Model 590 workstations with ethernet and ATM networking plus a shared memory SGI Power Challenge L workstation. The results indicate that the network capabilities significantly influence the parallel efficiency, i.e., a shared memory machine is fastest and ATM networking provides acceptable performance. The limitations of ethernet greatly hamper the rapid calculation of flows using ALLSPD-3D.
PANDA: A distributed multiprocessor operating system
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chubb, P.
1989-01-01
PANDA is a design for a distributed multiprocessor and an operating system. PANDA is designed to allow easy expansion of both hardware and software. As such, the PANDA kernel provides only message passing and memory and process management. The other features needed for the system (device drivers, secondary storage management, etc.) are provided as replaceable user tasks. The thesis presents PANDA's design and implementation, both hardware and software. PANDA uses multiple 68010 processors sharing memory on a VME bus, each such node potentially connected to others via a high speed network. The machine is completely homogeneous: there are no differencesmore » between processors that are detectable by programs running on the machine. A single two-processor node has been constructed. Each processor contains memory management circuits designed to allow processors to share page tables safely. PANDA presents a programmers' model similar to the hardware model: a job is divided into multiple tasks, each having its own address space. Within each task, multiple processes share code and data. Tasks can send messages to each other, and set up virtual circuits between themselves. Peripheral devices such as disc drives are represented within PANDA by tasks. PANDA divides secondary storage into volumes, each volume being accessed by a volume access task, or VAT. All knowledge about the way that data is stored on a disc is kept in its volume's VAT. The design is such that PANDA should provide a useful testbed for file systems and device drivers, as these can be installed without recompiling PANDA itself, and without rebooting the machine.« less
Expert Systems on Multiprocessor Architectures. Volume 2. Technical Reports
1991-06-01
Report RC 12936 (#58037). IBM T. J. Wartson Reiearch Center. July 1987. Alan Jay Smith. Cache memories. Coniputing Sitrry., 1.1(3): I.3-5:30...basic-shared is an instrument for ashared memory design. The components panels are processor- qload-scrolling-bar-panel, memory-qload-scrolling-bar-panel
Supporting shared data structures on distributed memory architectures
NASA Technical Reports Server (NTRS)
Koelbel, Charles; Mehrotra, Piyush; Vanrosendale, John
1990-01-01
Programming nonshared memory systems is more difficult than programming shared memory systems, since there is no support for shared data structures. Current programming languages for distributed memory architectures force the user to decompose all data structures into separate pieces, with each piece owned by one of the processors in the machine, and with all communication explicitly specified by low-level message-passing primitives. A new programming environment is presented for distributed memory architectures, providing a global name space and allowing direct access to remote parts of data values. The analysis and program transformations required to implement this environment are described, and the efficiency of the resulting code on the NCUBE/7 and IPSC/2 hypercubes are described.
Memory Network For Distributed Data Processors
NASA Technical Reports Server (NTRS)
Bolen, David; Jensen, Dean; Millard, ED; Robinson, Dave; Scanlon, George
1992-01-01
Universal Memory Network (UMN) is modular, digital data-communication system enabling computers with differing bus architectures to share 32-bit-wide data between locations up to 3 km apart with less than one millisecond of latency. Makes it possible to design sophisticated real-time and near-real-time data-processing systems without data-transfer "bottlenecks". This enterprise network permits transmission of volume of data equivalent to an encyclopedia each second. Facilities benefiting from Universal Memory Network include telemetry stations, simulation facilities, power-plants, and large laboratories or any facility sharing very large volumes of data. Main hub of UMN is reflection center including smaller hubs called Shared Memory Interfaces.
High-performance computing — an overview
NASA Astrophysics Data System (ADS)
Marksteiner, Peter
1996-08-01
An overview of high-performance computing (HPC) is given. Different types of computer architectures used in HPC are discussed: vector supercomputers, high-performance RISC processors, various parallel computers like symmetric multiprocessors, workstation clusters, massively parallel processors. Software tools and programming techniques used in HPC are reviewed: vectorizing compilers, optimization and vector tuning, optimization for RISC processors; parallel programming techniques like shared-memory parallelism, message passing and data parallelism; and numerical libraries.
Hypercluster Parallel Processor
NASA Technical Reports Server (NTRS)
Blech, Richard A.; Cole, Gary L.; Milner, Edward J.; Quealy, Angela
1992-01-01
Hypercluster computer system includes multiple digital processors, operation of which coordinated through specialized software. Configurable according to various parallel-computing architectures of shared-memory or distributed-memory class, including scalar computer, vector computer, reduced-instruction-set computer, and complex-instruction-set computer. Designed as flexible, relatively inexpensive system that provides single programming and operating environment within which one can investigate effects of various parallel-computing architectures and combinations on performance in solution of complicated problems like those of three-dimensional flows in turbomachines. Hypercluster software and architectural concepts are in public domain.
Cache write generate for parallel image processing on shared memory architectures.
Wittenbrink, C M; Somani, A K; Chen, C H
1996-01-01
We investigate cache write generate, our cache mode invention. We demonstrate that for parallel image processing applications, the new mode improves main memory bandwidth, CPU efficiency, cache hits, and cache latency. We use register level simulations validated by the UW-Proteus system. Many memory, cache, and processor configurations are evaluated.
Parallel processing approach to transform-based image coding
NASA Astrophysics Data System (ADS)
Normile, James O.; Wright, Dan; Chu, Ken; Yeh, Chia L.
1991-06-01
This paper describes a flexible parallel processing architecture designed for use in real time video processing. The system consists of floating point DSP processors connected to each other via fast serial links, each processor has access to a globally shared memory. A multiple bus architecture in combination with a dual ported memory allows communication with a host control processor. The system has been applied to prototyping of video compression and decompression algorithms. The decomposition of transform based algorithms for decompression into a form suitable for parallel processing is described. A technique for automatic load balancing among the processors is developed and discussed, results ar presented with image statistics and data rates. Finally techniques for accelerating the system throughput are analyzed and results from the application of one such modification described.
Asynchronous Communication Scheme For Hypercube Computer
NASA Technical Reports Server (NTRS)
Madan, Herb S.
1988-01-01
Scheme devised for asynchronous-message communication system for Mark III hypercube concurrent-processor network. Network consists of up to 1,024 processing elements connected electrically as though were at corners of 10-dimensional cube. Each node contains two Motorola 68020 processors along with Motorola 68881 floating-point processor utilizing up to 4 megabytes of shared dynamic random-access memory. Scheme intended to support applications requiring passage of both polled or solicited and unsolicited messages.
Tactical Operations Analysis Support Facility.
1981-05-01
Punch/Reader 2 DMC-11AR DDCMP Micro Processor 2 DMC-11DA Network Link Line Unit 2 DL-11E Async Serial Line Interface 4 Intel IN-1670 448K Words MOS Memory...86 5.3 VIRTUAL PROCESSORS - VAX-11/750 ........................... 89 5.4 A RELATIONAL DATA MANAGEMENT SYSTEM - ORACLE...Central Processing Unit (CPU) is a 16 bit processor for high-speed, real time applications, and for large multi-user, multi- task, time shared
A cache-aided multiprocessor rollback recovery scheme
NASA Technical Reports Server (NTRS)
Wu, Kun-Lung; Fuchs, W. Kent
1989-01-01
This paper demonstrates how previous uniprocessor cache-aided recovery schemes can be applied to multiprocessor architectures, for recovering from transient processor failures, utilizing private caches and a global shared memory. As with cache-aided uniprocessor recovery, the multiprocessor cache-aided recovery scheme of this paper can be easily integrated into standard bus-based snoopy cache coherence protocols. A consistent shared memory state is maintained without the necessity of global check-pointing.
Importance of balanced architectures in the design of high-performance imaging systems
NASA Astrophysics Data System (ADS)
Sgro, Joseph A.; Stanton, Paul C.
1999-03-01
Imaging systems employed in demanding military and industrial applications, such as automatic target recognition and computer vision, typically require real-time high-performance computing resources. While high- performances computing systems have traditionally relied on proprietary architectures and custom components, recent advances in high performance general-purpose microprocessor technology have produced an abundance of low cost components suitable for use in high-performance computing systems. A common pitfall in the design of high performance imaging system, particularly systems employing scalable multiprocessor architectures, is the failure to balance computational and memory bandwidth. The performance of standard cluster designs, for example, in which several processors share a common memory bus, is typically constrained by memory bandwidth. The symptom characteristic of this problem is failure to the performance of the system to scale as more processors are added. The problem becomes exacerbated if I/O and memory functions share the same bus. The recent introduction of microprocessors with large internal caches and high performance external memory interfaces makes it practical to design high performance imaging system with balanced computational and memory bandwidth. Real word examples of such designs will be presented, along with a discussion of adapting algorithm design to best utilize available memory bandwidth.
Multi-processor including data flow accelerator module
Davidson, George S.; Pierce, Paul E.
1990-01-01
An accelerator module for a data flow computer includes an intelligent memory. The module is added to a multiprocessor arrangement and uses a shared tagged memory architecture in the data flow computer. The intelligent memory module assigns locations for holding data values in correspondence with arcs leading to a node in a data dependency graph. Each primitive computation is associated with a corresponding memory cell, including a number of slots for operands needed to execute a primitive computation, a primitive identifying pointer, and linking slots for distributing the result of the cell computation to other cells requiring that result as an operand. Circuitry is provided for utilizing tag bits to determine automatically when all operands required by a processor are available and for scheduling the primitive for execution in a queue. Each memory cell of the module may be associated with any of the primitives, and the particular primitive to be executed by the processor associated with the cell is identified by providing an index, such as the cell number for the primitive, to the primitive lookup table of starting addresses. The module thus serves to perform functions previously performed by a number of sections of data flow architectures and coexists with conventional shared memory therein. A multiprocessing system including the module operates in a hybrid mode, wherein the same processing modules are used to perform some processing in a sequential mode, under immediate control of an operating system, while performing other processing in a data flow mode.
Fault-Tolerant, Real-Time, Multi-Core Computer System
NASA Technical Reports Server (NTRS)
Gostelow, Kim P.
2012-01-01
A document discusses a fault-tolerant, self-aware, low-power, multi-core computer for space missions with thousands of simple cores, achieving speed through concurrency. The proposed machine decides how to achieve concurrency in real time, rather than depending on programmers. The driving features of the system are simple hardware that is modular in the extreme, with no shared memory, and software with significant runtime reorganizing capability. The document describes a mechanism for moving ongoing computations and data that is based on a functional model of execution. Because there is no shared memory, the processor connects to its neighbors through a high-speed data link. Messages are sent to a neighbor switch, which in turn forwards that message on to its neighbor until reaching the intended destination. Except for the neighbor connections, processors are isolated and independent of each other. The processors on the periphery also connect chip-to-chip, thus building up a large processor net. There is no particular topology to the larger net, as a function at each processor allows it to forward a message in the correct direction. Some chip-to-chip connections are not necessarily nearest neighbors, providing short cuts for some of the longer physical distances. The peripheral processors also provide the connections to sensors, actuators, radios, science instruments, and other devices with which the computer system interacts.
Data traffic reduction schemes for sparse Cholesky factorizations
NASA Technical Reports Server (NTRS)
Naik, Vijay K.; Patrick, Merrell L.
1988-01-01
Load distribution schemes are presented which minimize the total data traffic in the Cholesky factorization of dense and sparse, symmetric, positive definite matrices on multiprocessor systems with local and shared memory. The total data traffic in factoring an n x n sparse, symmetric, positive definite matrix representing an n-vertex regular 2-D grid graph using n (sup alpha), alpha is equal to or less than 1, processors are shown to be O(n(sup 1 + alpha/2)). It is O(n(sup 3/2)), when n (sup alpha), alpha is equal to or greater than 1, processors are used. Under the conditions of uniform load distribution, these results are shown to be asymptotically optimal. The schemes allow efficient use of up to O(n) processors before the total data traffic reaches the maximum value of O(n(sup 3/2)). The partitioning employed within the scheme, allows a better utilization of the data accessed from shared memory than those of previously published methods.
A communication-avoiding, hybrid-parallel, rank-revealing orthogonalization method.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Hoemmen, Mark
2010-11-01
Orthogonalization consumes much of the run time of many iterative methods for solving sparse linear systems and eigenvalue problems. Commonly used algorithms, such as variants of Gram-Schmidt or Householder QR, have performance dominated by communication. Here, 'communication' includes both data movement between the CPU and memory, and messages between processors in parallel. Our Tall Skinny QR (TSQR) family of algorithms requires asymptotically fewer messages between processors and data movement between CPU and memory than typical orthogonalization methods, yet achieves the same accuracy as Householder QR factorization. Furthermore, in block orthogonalizations, TSQR is faster and more accurate than existing approaches formore » orthogonalizing the vectors within each block ('normalization'). TSQR's rank-revealing capability also makes it useful for detecting deflation in block iterative methods, for which existing approaches sacrifice performance, accuracy, or both. We have implemented a version of TSQR that exploits both distributed-memory and shared-memory parallelism, and supports real and complex arithmetic. Our implementation is optimized for the case of orthogonalizing a small number (5-20) of very long vectors. The shared-memory parallel component uses Intel's Threading Building Blocks, though its modular design supports other shared-memory programming models as well, including computation on the GPU. Our implementation achieves speedups of 2 times or more over competing orthogonalizations. It is available now in the development branch of the Trilinos software package, and will be included in the 10.8 release.« less
NASA Astrophysics Data System (ADS)
Dave, Gaurav P.; Sureshkumar, N.; Blessy Trencia Lincy, S. S.
2017-11-01
Current trend in processor manufacturing focuses on multi-core architectures rather than increasing the clock speed for performance improvement. Graphic processors have become as commodity hardware for providing fast co-processing in computer systems. Developments in IoT, social networking web applications, big data created huge demand for data processing activities and such kind of throughput intensive applications inherently contains data level parallelism which is more suited for SIMD architecture based GPU. This paper reviews the architectural aspects of multi/many core processors and graphics processors. Different case studies are taken to compare performance of throughput computing applications using shared memory programming in OpenMP and CUDA API based programming.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Werner, N.E.; Van Matre, S.W.
1985-05-01
This manual describes the CRI Subroutine Library and Utility Package. The CRI library provides Cray multitasking functionality on the four-processor shared memory VAX 11/780-4. Additional functionality has been added for more flexibility. A discussion of the library, utilities, error messages, and example programs is provided.
Performance and Application of Parallel OVERFLOW Codes on Distributed and Shared Memory Platforms
NASA Technical Reports Server (NTRS)
Djomehri, M. Jahed; Rizk, Yehia M.
1999-01-01
The presentation discusses recent studies on the performance of the two parallel versions of the aerodynamics CFD code, OVERFLOW_MPI and _MLP. Developed at NASA Ames, the serial version, OVERFLOW, is a multidimensional Navier-Stokes flow solver based on overset (Chimera) grid technology. The code has recently been parallelized in two ways. One is based on the explicit message-passing interface (MPI) across processors and uses the _MPI communication package. This approach is primarily suited for distributed memory systems and workstation clusters. The second, termed the multi-level parallel (MLP) method, is simple and uses shared memory for all communications. The _MLP code is suitable on distributed-shared memory systems. For both methods, the message passing takes place across the processors or processes at the advancement of each time step. This procedure is, in effect, the Chimera boundary conditions update, which is done in an explicit "Jacobi" style. In contrast, the update in the serial code is done in more of the "Gauss-Sidel" fashion. The programming efforts for the _MPI code is more complicated than for the _MLP code; the former requires modification of the outer and some inner shells of the serial code, whereas the latter focuses only on the outer shell of the code. The _MPI version offers a great deal of flexibility in distributing grid zones across a specified number of processors in order to achieve load balancing. The approach is capable of partitioning zones across multiple processors or sending each zone and/or cluster of several zones into a single processor. The message passing across the processors consists of Chimera boundary and/or an overlap of "halo" boundary points for each partitioned zone. The MLP version is a new coarse-grain parallel concept at the zonal and intra-zonal levels. A grouping strategy is used to distribute zones into several groups forming sub-processes which will run in parallel. The total volume of grid points in each group are approximately balanced. A proper number of threads are initially allocated to each group, and in subsequent iterations during the run-time, the number of threads are adjusted to achieve load balancing across the processes. Each process exploits the multitasking directives already established in Overflow.
Contention Modeling for Multithreaded Distributed Shared Memory Machines: The Cray XMT
DOE Office of Scientific and Technical Information (OSTI.GOV)
Secchi, Simone; Tumeo, Antonino; Villa, Oreste
Distributed Shared Memory (DSM) machines are a wide class of multi-processor computing systems where a large virtually-shared address space is mapped on a network of physically distributed memories. High memory latency and network contention are two of the main factors that limit performance scaling of such architectures. Modern high-performance computing DSM systems have evolved toward exploitation of massive hardware multi-threading and fine-grained memory hashing to tolerate irregular latencies, avoid network hot-spots and enable high scaling. In order to model the performance of such large-scale machines, parallel simulation has been proved to be a promising approach to achieve good accuracy inmore » reasonable times. One of the most critical factors in solving the simulation speed-accuracy trade-off is network modeling. The Cray XMT is a massively multi-threaded supercomputing architecture that belongs to the DSM class, since it implements a globally-shared address space abstraction on top of a physically distributed memory substrate. In this paper, we discuss the development of a contention-aware network model intended to be integrated in a full-system XMT simulator. We start by measuring the effects of network contention in a 128-processor XMT machine and then investigate the trade-off that exists between simulation accuracy and speed, by comparing three network models which operate at different levels of accuracy. The comparison and model validation is performed by executing a string-matching algorithm on the full-system simulator and on the XMT, using three datasets that generate noticeably different contention patterns.« less
Parallel computing for probabilistic fatigue analysis
NASA Technical Reports Server (NTRS)
Sues, Robert H.; Lua, Yuan J.; Smith, Mark D.
1993-01-01
This paper presents the results of Phase I research to investigate the most effective parallel processing software strategies and hardware configurations for probabilistic structural analysis. We investigate the efficiency of both shared and distributed-memory architectures via a probabilistic fatigue life analysis problem. We also present a parallel programming approach, the virtual shared-memory paradigm, that is applicable across both types of hardware. Using this approach, problems can be solved on a variety of parallel configurations, including networks of single or multiprocessor workstations. We conclude that it is possible to effectively parallelize probabilistic fatigue analysis codes; however, special strategies will be needed to achieve large-scale parallelism to keep large number of processors busy and to treat problems with the large memory requirements encountered in practice. We also conclude that distributed-memory architecture is preferable to shared-memory for achieving large scale parallelism; however, in the future, the currently emerging hybrid-memory architectures will likely be optimal.
Visualization Co-Processing of a CFD Simulation
NASA Technical Reports Server (NTRS)
Vaziri, Arsi
1999-01-01
OVERFLOW, a widely used CFD simulation code, is combined with a visualization system, pV3, to experiment with an environment for simulation/visualization co-processing on a SGI Origin 2000 computer(O2K) system. The shared memory version of the solver is used with the O2K 'pfa' preprocessor invoked to automatically discover parallelism in the source code. No other explicit parallelism is enabled. In order to study the scaling and performance of the visualization co-processing system, sample runs are made with different processor groups in the range of 1 to 254 processors. The data exchange between the visualization system and the simulation system is rapid enough for user interactivity when the problem size is small. This shared memory version of OVERFLOW, with minimal parallelization, does not scale well to an increasing number of available processors. The visualization task takes about 18 to 30% of the total processing time and does not appear to be a major contributor to the poor scaling. Improper load balancing and inter-processor communication overhead are contributors to this poor performance. Work is in progress which is aimed at obtaining improved parallel performance of the solver and removing the limitations of serial data transfer to pV3 by examining various parallelization/communication strategies, including the use of the explicit message passing.
A pervasive parallel framework for visualization: final report for FWP 10-014707
DOE Office of Scientific and Technical Information (OSTI.GOV)
Moreland, Kenneth D.
2014-01-01
We are on the threshold of a transformative change in the basic architecture of highperformance computing. The use of accelerator processors, characterized by large core counts, shared but asymmetrical memory, and heavy thread loading, is quickly becoming the norm in high performance computing. These accelerators represent significant challenges in updating our existing base of software. An intrinsic problem with this transition is a fundamental programming shift from message passing processes to much more fine thread scheduling with memory sharing. Another problem is the lack of stability in accelerator implementation; processor and compiler technology is currently changing rapidly. This report documentsmore » the results of our three-year ASCR project to address these challenges. Our project includes the development of the Dax toolkit, which contains the beginnings of new algorithms for a new generation of computers and the underlying infrastructure to rapidly prototype and build further algorithms as necessary.« less
Distributed simulation using a real-time shared memory network
NASA Technical Reports Server (NTRS)
Simon, Donald L.; Mattern, Duane L.; Wong, Edmond; Musgrave, Jeffrey L.
1993-01-01
The Advanced Control Technology Branch of the NASA Lewis Research Center performs research in the area of advanced digital controls for aeronautic and space propulsion systems. This work requires the real-time implementation of both control software and complex dynamical models of the propulsion system. We are implementing these systems in a distributed, multi-vendor computer environment. Therefore, a need exists for real-time communication and synchronization between the distributed multi-vendor computers. A shared memory network is a potential solution which offers several advantages over other real-time communication approaches. A candidate shared memory network was tested for basic performance. The shared memory network was then used to implement a distributed simulation of a ramjet engine. The accuracy and execution time of the distributed simulation was measured and compared to the performance of the non-partitioned simulation. The ease of partitioning the simulation, the minimal time required to develop for communication between the processors and the resulting execution time all indicate that the shared memory network is a real-time communication technique worthy of serious consideration.
NASA Technical Reports Server (NTRS)
Chow, Edward T.; Schatzel, Donald V.; Whitaker, William D.; Sterling, Thomas
2008-01-01
A Spaceborne Processor Array in Multifunctional Structure (SPAMS) can lower the total mass of the electronic and structural overhead of spacecraft, resulting in reduced launch costs, while increasing the science return through dynamic onboard computing. SPAMS integrates the multifunctional structure (MFS) and the Gilgamesh Memory, Intelligence, and Network Device (MIND) multi-core in-memory computer architecture into a single-system super-architecture. This transforms every inch of a spacecraft into a sharable, interconnected, smart computing element to increase computing performance while simultaneously reducing mass. The MIND in-memory architecture provides a foundation for high-performance, low-power, and fault-tolerant computing. The MIND chip has an internal structure that includes memory, processing, and communication functionality. The Gilgamesh is a scalable system comprising multiple MIND chips interconnected to operate as a single, tightly coupled, parallel computer. The array of MIND components shares a global, virtual name space for program variables and tasks that are allocated at run time to the distributed physical memory and processing resources. Individual processor- memory nodes can be activated or powered down at run time to provide active power management and to configure around faults. A SPAMS system is comprised of a distributed Gilgamesh array built into MFS, interfaces into instrument and communication subsystems, a mass storage interface, and a radiation-hardened flight computer.
Static analysis of the hull plate using the finite element method
NASA Astrophysics Data System (ADS)
Ion, A.
2015-11-01
This paper aims at presenting the static analysis for two levels of a container ship's construction as follows: the first level is at the girder / hull plate and the second level is conducted at the entire strength hull of the vessel. This article will describe the work for the static analysis of a hull plate. We shall use the software package ANSYS Mechanical 14.5. The program is run on a computer with four Intel Xeon X5260 CPU processors at 3.33 GHz, 32 GB memory installed. In terms of software, the shared memory parallel version of ANSYS refers to running ANSYS across multiple cores on a SMP system. The distributed memory parallel version of ANSYS (Distributed ANSYS) refers to running ANSYS across multiple processors on SMP systems or DMP systems.
VLSI 'smart' I/O module development
NASA Astrophysics Data System (ADS)
Kirk, Dan
The developmental history, design, and operation of the MIL-STD-1553A/B discrete and serial module (DSM) for the U.S. Navy AN/AYK-14(V) avionics computer are described and illustrated with diagrams. The ongoing preplanned product improvement for the AN/AYK-14(V) includes five dual-redundant MIL-STD-1553 channels based on DSMs. The DSM is a front-end processor for transferring data to and from a common memory, sharing memory with a host processor to provide improved 'smart' input/output performance. Each DSM comprises three hardware sections: three VLSI-6000 semicustomized CMOS arrays, memory units to support the arrays, and buffers and resynchronization circuits. The DSM hardware module design, VLSI-6000 design tools, controlware and test software, and checkout procedures (using a hardware simulator) are characterized in detail.
Rapid solution of large-scale systems of equations
NASA Technical Reports Server (NTRS)
Storaasli, Olaf O.
1994-01-01
The analysis and design of complex aerospace structures requires the rapid solution of large systems of linear and nonlinear equations, eigenvalue extraction for buckling, vibration and flutter modes, structural optimization and design sensitivity calculation. Computers with multiple processors and vector capabilities can offer substantial computational advantages over traditional scalar computer for these analyses. These computers fall into two categories: shared memory computers and distributed memory computers. This presentation covers general-purpose, highly efficient algorithms for generation/assembly or element matrices, solution of systems of linear and nonlinear equations, eigenvalue and design sensitivity analysis and optimization. All algorithms are coded in FORTRAN for shared memory computers and many are adapted to distributed memory computers. The capability and numerical performance of these algorithms will be addressed.
Parallelization of KENO-Va Monte Carlo code
NASA Astrophysics Data System (ADS)
Ramón, Javier; Peña, Jorge
1995-07-01
KENO-Va is a code integrated within the SCALE system developed by Oak Ridge that solves the transport equation through the Monte Carlo Method. It is being used at the Consejo de Seguridad Nuclear (CSN) to perform criticality calculations for fuel storage pools and shipping casks. Two parallel versions of the code: one for shared memory machines and other for distributed memory systems using the message-passing interface PVM have been generated. In both versions the neutrons of each generation are tracked in parallel. In order to preserve the reproducibility of the results in both versions, advanced seeds for random numbers were used. The CONVEX C3440 with four processors and shared memory at CSN was used to implement the shared memory version. A FDDI network of 6 HP9000/735 was employed to implement the message-passing version using proprietary PVM. The speedup obtained was 3.6 in both cases.
A general model for memory interference in a multiprocessor system with memory hierarchy
NASA Technical Reports Server (NTRS)
Taha, Badie A.; Standley, Hilda M.
1989-01-01
The problem of memory interference in a multiprocessor system with a hierarchy of shared buses and memories is addressed. The behavior of the processors is represented by a sequence of memory requests with each followed by a determined amount of processing time. A statistical queuing network model for determining the extent of memory interference in multiprocessor systems with clusters of memory hierarchies is presented. The performance of the system is measured by the expected number of busy memory clusters. The results of the analytic model are compared with simulation results, and the correlation between them is found to be very high.
OpenMP Performance on the Columbia Supercomputer
NASA Technical Reports Server (NTRS)
Haoqiang, Jin; Hood, Robert
2005-01-01
This presentation discusses Columbia World Class Supercomputer which is one of the world's fastest supercomputers providing 61 TFLOPs (10/20/04). Conceived, designed, built, and deployed in just 120 days. A 20-node supercomputer built on proven 512-processor nodes. The largest SGI system in the world with over 10,000 Intel Itanium 2 processors and provides the largest node size incorporating commodity parts (512) and the largest shared-memory environment (2048) with 88% efficiency tops the scalar systems on the Top500 list.
Parallel Programming Paradigms
1987-07-01
Unclassified IS.. DECLASSIFICATIONIOOWNGRADIN G 16. DISTRIBUTION STATEMENT (of this Report) Distribution of this report is unlimited. 17...8416878 and by the Office of Naval Research Contracts No. N00014-86-K-0264 and No. N00014-85- K-0328. 8 ?~~ O . G 1 49 II Parallel Programming Paradigms...processors -. "to fetch from the same memory cell (list head) and thus seems to favor a shared memory - g implementation [37). In this dissertation, we
Multiprocessing on supercomputers for computational aerodynamics
NASA Technical Reports Server (NTRS)
Yarrow, Maurice; Mehta, Unmeel B.
1990-01-01
Very little use is made of multiple processors available on current supercomputers (computers with a theoretical peak performance capability equal to 100 MFLOPs or more) in computational aerodynamics to significantly improve turnaround time. The productivity of a computer user is directly related to this turnaround time. In a time-sharing environment, the improvement in this speed is achieved when multiple processors are used efficiently to execute an algorithm. The concept of multiple instructions and multiple data (MIMD) through multi-tasking is applied via a strategy which requires relatively minor modifications to an existing code for a single processor. Essentially, this approach maps the available memory to multiple processors, exploiting the C-FORTRAN-Unix interface. The existing single processor code is mapped without the need for developing a new algorithm. The procedure for building a code utilizing this approach is automated with the Unix stream editor. As a demonstration of this approach, a Multiple Processor Multiple Grid (MPMG) code is developed. It is capable of using nine processors, and can be easily extended to a larger number of processors. This code solves the three-dimensional, Reynolds averaged, thin-layer and slender-layer Navier-Stokes equations with an implicit, approximately factored and diagonalized method. The solver is applied to generic oblique-wing aircraft problem on a four processor Cray-2 computer. A tricubic interpolation scheme is developed to increase the accuracy of coupling of overlapped grids. For the oblique-wing aircraft problem, a speedup of two in elapsed (turnaround) time is observed in a saturated time-sharing environment.
Fast and Accurate Simulation of the Cray XMT Multithreaded Supercomputer
DOE Office of Scientific and Technical Information (OSTI.GOV)
Villa, Oreste; Tumeo, Antonino; Secchi, Simone
Irregular applications, such as data mining and analysis or graph-based computations, show unpredictable memory/network access patterns and control structures. Highly multithreaded architectures with large processor counts, like the Cray MTA-1, MTA-2 and XMT, appear to address their requirements better than commodity clusters. However, the research on highly multithreaded systems is currently limited by the lack of adequate architectural simulation infrastructures due to issues such as size of the machines, memory footprint, simulation speed, accuracy and customization. At the same time, Shared-memory MultiProcessors (SMPs) with multi-core processors have become an attractive platform to simulate large scale machines. In this paper, wemore » introduce a cycle-level simulator of the highly multithreaded Cray XMT supercomputer. The simulator runs unmodified XMT applications. We discuss how we tackled the challenges posed by its development, detailing the techniques introduced to make the simulation as fast as possible while maintaining a high accuracy. By mapping XMT processors (ThreadStorm with 128 hardware threads) to host computing cores, the simulation speed remains constant as the number of simulated processors increases, up to the number of available host cores. The simulator supports zero-overhead switching among different accuracy levels at run-time and includes a network model that takes into account contention. On a modern 48-core SMP host, our infrastructure simulates a large set of irregular applications 500 to 2000 times slower than real time when compared to a 128-processor XMT, while remaining within 10\\% of accuracy. Emulation is only from 25 to 200 times slower than real time.« less
Implementing Shared Memory Parallelism in MCBEND
NASA Astrophysics Data System (ADS)
Bird, Adam; Long, David; Dobson, Geoff
2017-09-01
MCBEND is a general purpose radiation transport Monte Carlo code from AMEC Foster Wheelers's ANSWERS® Software Service. MCBEND is well established in the UK shielding community for radiation shielding and dosimetry assessments. The existing MCBEND parallel capability effectively involves running the same calculation on many processors. This works very well except when the memory requirements of a model restrict the number of instances of a calculation that will fit on a machine. To more effectively utilise parallel hardware OpenMP has been used to implement shared memory parallelism in MCBEND. This paper describes the reasoning behind the choice of OpenMP, notes some of the challenges of multi-threading an established code such as MCBEND and assesses the performance of the parallel method implemented in MCBEND.
An Evaluation of Architectural Platforms for Parallel Navier-Stokes Computations
NASA Technical Reports Server (NTRS)
Jayasimha, D. N.; Hayder, M. E.; Pillay, S. K.
1996-01-01
We study the computational, communication, and scalability characteristics of a computational fluid dynamics application, which solves the time accurate flow field of a jet using the compressible Navier-Stokes equations, on a variety of parallel architecture platforms. The platforms chosen for this study are a cluster of workstations (the LACE experimental testbed at NASA Lewis), a shared memory multiprocessor (the Cray YMP), and distributed memory multiprocessors with different topologies - the IBM SP and the Cray T3D. We investigate the impact of various networks connecting the cluster of workstations on the performance of the application and the overheads induced by popular message passing libraries used for parallelization. The work also highlights the importance of matching the memory bandwidth to the processor speed for good single processor performance. By studying the performance of an application on a variety of architectures, we are able to point out the strengths and weaknesses of each of the example computing platforms.
Parallelizing Navier-Stokes Computations on a Variety of Architectural Platforms
NASA Technical Reports Server (NTRS)
Jayasimha, D. N.; Hayder, M. E.; Pillay, S. K.
1997-01-01
We study the computational, communication, and scalability characteristics of a Computational Fluid Dynamics application, which solves the time accurate flow field of a jet using the compressible Navier-Stokes equations, on a variety of parallel architectural platforms. The platforms chosen for this study are a cluster of workstations (the LACE experimental testbed at NASA Lewis), a shared memory multiprocessor (the Cray YMP), distributed memory multiprocessors with different topologies-the IBM SP and the Cray T3D. We investigate the impact of various networks, connecting the cluster of workstations, on the performance of the application and the overheads induced by popular message passing libraries used for parallelization. The work also highlights the importance of matching the memory bandwidth to the processor speed for good single processor performance. By studying the performance of an application on a variety of architectures, we are able to point out the strengths and weaknesses of each of the example computing platforms.
Vienna FORTRAN: A FORTRAN language extension for distributed memory multiprocessors
NASA Technical Reports Server (NTRS)
Chapman, Barbara; Mehrotra, Piyush; Zima, Hans
1991-01-01
Exploiting the performance potential of distributed memory machines requires a careful distribution of data across the processors. Vienna FORTRAN is a language extension of FORTRAN which provides the user with a wide range of facilities for such mapping of data structures. However, programs in Vienna FORTRAN are written using global data references. Thus, the user has the advantage of a shared memory programming paradigm while explicitly controlling the placement of data. The basic features of Vienna FORTRAN are presented along with a set of examples illustrating the use of these features.
Parallel algorithms for boundary value problems
NASA Technical Reports Server (NTRS)
Lin, Avi
1990-01-01
A general approach to solve boundary value problems numerically in a parallel environment is discussed. The basic algorithm consists of two steps: the local step where all the P available processors work in parallel, and the global step where one processor solves a tridiagonal linear system of the order P. The main advantages of this approach are two fold. First, this suggested approach is very flexible, especially in the local step and thus the algorithm can be used with any number of processors and with any of the SIMD or MIMD machines. Secondly, the communication complexity is very small and thus can be used as easily with shared memory machines. Several examples for using this strategy are discussed.
NASA Technical Reports Server (NTRS)
Phyne, J. R.; Nelson, M. D.
1975-01-01
The design and implementation of hardware and software systems involved in using a 40,000 bit/second communication line as the connecting link between an IMLAC PDS 1-D display computer and a Univac 1108 computer system were described. The IMLAC consists of two independent processors sharing a common memory. The display processor generates the deflection and beam control currents as it interprets a program contained in the memory; the minicomputer has a general instruction set and is responsible for starting and stopping the display processor and for communicating with the outside world through the keyboard, teletype, light pen, and communication line. The processing time associated with each data byte was minimized by designing the input and output processes as finite state machines which automatically sequence from each state to the next. Several tests of the communication link and the IMLAC software were made using a special low capacity computer grade cable between the IMLAC and the Univac.
High speed quantitative digital microscopy
NASA Technical Reports Server (NTRS)
Castleman, K. R.; Price, K. H.; Eskenazi, R.; Ovadya, M. M.; Navon, M. A.
1984-01-01
Modern digital image processing hardware makes possible quantitative analysis of microscope images at high speed. This paper describes an application to automatic screening for cervical cancer. The system uses twelve MC6809 microprocessors arranged in a pipeline multiprocessor configuration. Each processor executes one part of the algorithm on each cell image as it passes through the pipeline. Each processor communicates with its upstream and downstream neighbors via shared two-port memory. Thus no time is devoted to input-output operations as such. This configuration is expected to be at least ten times faster than previous systems.
Multitasking OS manages a team of processors
DOE Office of Scientific and Technical Information (OSTI.GOV)
Ripps, D.L.
1983-07-21
MTOS-68k is a real-time multitasking operating system designed for the popular MC68000 microprocessors. It aproaches task coordination and synchronization in a fashion that matches uniquely the structural simplicity and regularity of the 68000 instruction set. Since in many 68000 applications the speed and power of one CPU are not enough, MTOS-68k has been designed to support multiple processors, as well as multiple tasks. Typically, the devices are tightly coupled single-board computers, that is they share a backplane and parts of global memory.
The Mark III Hypercube-Ensemble Computers
NASA Technical Reports Server (NTRS)
Peterson, John C.; Tuazon, Jesus O.; Lieberman, Don; Pniel, Moshe
1988-01-01
Mark III Hypercube concept applied in development of series of increasingly powerful computers. Processor of each node of Mark III Hypercube ensemble is specialized computer containing three subprocessors and shared main memory. Solves problem quickly by simultaneously processing part of problem at each such node and passing combined results to host computer. Disciplines benefitting from speed and memory capacity include astrophysics, geophysics, chemistry, weather, high-energy physics, applied mechanics, image processing, oil exploration, aircraft design, and microcircuit design.
Multiple core computer processor with globally-accessible local memories
DOE Office of Scientific and Technical Information (OSTI.GOV)
Shalf, John; Donofrio, David; Oliker, Leonid
A multi-core computer processor including a plurality of processor cores interconnected in a Network-on-Chip (NoC) architecture, a plurality of caches, each of the plurality of caches being associated with one and only one of the plurality of processor cores, and a plurality of memories, each of the plurality of memories being associated with a different set of at least one of the plurality of processor cores and each of the plurality of memories being configured to be visible in a global memory address space such that the plurality of memories are visible to two or more of the plurality ofmore » processor cores.« less
An architecture for real-time vision processing
NASA Technical Reports Server (NTRS)
Chien, Chiun-Hong
1994-01-01
To study the feasibility of developing an architecture for real time vision processing, a task queue server and parallel algorithms for two vision operations were designed and implemented on an i860-based Mercury Computing System 860VS array processor. The proposed architecture treats each vision function as a task or set of tasks which may be recursively divided into subtasks and processed by multiple processors coordinated by a task queue server accessible by all processors. Each idle processor subsequently fetches a task and associated data from the task queue server for processing and posts the result to shared memory for later use. Load balancing can be carried out within the processing system without the requirement for a centralized controller. The author concludes that real time vision processing cannot be achieved without both sequential and parallel vision algorithms and a good parallel vision architecture.
Multiprocessing on supercomputers for computational aerodynamics
NASA Technical Reports Server (NTRS)
Yarrow, Maurice; Mehta, Unmeel B.
1991-01-01
Little use is made of multiple processors available on current supercomputers (computers with a theoretical peak performance capability equal to 100 MFLOPS or more) to improve turnaround time in computational aerodynamics. The productivity of a computer user is directly related to this turnaround time. In a time-sharing environment, such improvement in this speed is achieved when multiple processors are used efficiently to execute an algorithm. The concept of multiple instructions and multiple data (MIMD) is applied through multitasking via a strategy that requires relatively minor modifications to an existing code for a single processor. This approach maps the available memory to multiple processors, exploiting the C-Fortran-Unix interface. The existing code is mapped without the need for developing a new algorithm. The procedure for building a code utilizing this approach is automated with the Unix stream editor.
Implementation of Parallel Dynamic Simulation on Shared-Memory vs. Distributed-Memory Environments
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jin, Shuangshuang; Chen, Yousu; Wu, Di
2015-12-09
Power system dynamic simulation computes the system response to a sequence of large disturbance, such as sudden changes in generation or load, or a network short circuit followed by protective branch switching operation. It consists of a large set of differential and algebraic equations, which is computational intensive and challenging to solve using single-processor based dynamic simulation solution. High-performance computing (HPC) based parallel computing is a very promising technology to speed up the computation and facilitate the simulation process. This paper presents two different parallel implementations of power grid dynamic simulation using Open Multi-processing (OpenMP) on shared-memory platform, and Messagemore » Passing Interface (MPI) on distributed-memory clusters, respectively. The difference of the parallel simulation algorithms and architectures of the two HPC technologies are illustrated, and their performances for running parallel dynamic simulation are compared and demonstrated.« less
Array processor architecture connection network
NASA Technical Reports Server (NTRS)
Barnes, George H. (Inventor); Lundstrom, Stephen F. (Inventor); Shafer, Philip E. (Inventor)
1982-01-01
A connection network is disclosed for use between a parallel array of processors and a parallel array of memory modules for establishing non-conflicting data communications paths between requested memory modules and requesting processors. The connection network includes a plurality of switching elements interposed between the processor array and the memory modules array in an Omega networking architecture. Each switching element includes a first and a second processor side port, a first and a second memory module side port, and control logic circuitry for providing data connections between the first and second processor ports and the first and second memory module ports. The control logic circuitry includes strobe logic for examining data arriving at the first and the second processor ports to indicate when the data arriving is requesting data from a requesting processor to a requested memory module. Further, connection circuitry is associated with the strobe logic for examining requesting data arriving at the first and the second processor ports for providing a data connection therefrom to the first and the second memory module ports in response thereto when the data connection so provided does not conflict with a pre-established data connection currently in use.
The Tera Multithreaded Architecture and Unstructured Meshes
NASA Technical Reports Server (NTRS)
Bokhari, Shahid H.; Mavriplis, Dimitri J.
1998-01-01
The Tera Multithreaded Architecture (MTA) is a new parallel supercomputer currently being installed at San Diego Supercomputing Center (SDSC). This machine has an architecture quite different from contemporary parallel machines. The computational processor is a custom design and the machine uses hardware to support very fine grained multithreading. The main memory is shared, hardware randomized and flat. These features make the machine highly suited to the execution of unstructured mesh problems, which are difficult to parallelize on other architectures. We report the results of a study carried out during July-August 1998 to evaluate the execution of EUL3D, a code that solves the Euler equations on an unstructured mesh, on the 2 processor Tera MTA at SDSC. Our investigation shows that parallelization of an unstructured code is extremely easy on the Tera. We were able to get an existing parallel code (designed for a shared memory machine), running on the Tera by changing only the compiler directives. Furthermore, a serial version of this code was compiled to run in parallel on the Tera by judicious use of directives to invoke the "full/empty" tag bits of the machine to obtain synchronization. This version achieves 212 and 406 Mflop/s on one and two processors respectively, and requires no attention to partitioning or placement of data issues that would be of paramount importance in other parallel architectures.
MPF: A portable message passing facility for shared memory multiprocessors
NASA Technical Reports Server (NTRS)
Malony, Allen D.; Reed, Daniel A.; Mcguire, Patrick J.
1987-01-01
The design, implementation, and performance evaluation of a message passing facility (MPF) for shared memory multiprocessors are presented. The MPF is based on a message passing model conceptually similar to conversations. Participants (parallel processors) can enter or leave a conversation at any time. The message passing primitives for this model are implemented as a portable library of C function calls. The MPF is currently operational on a Sequent Balance 21000, and several parallel applications were developed and tested. Several simple benchmark programs are presented to establish interprocess communication performance for common patterns of interprocess communication. Finally, performance figures are presented for two parallel applications, linear systems solution, and iterative solution of partial differential equations.
Reducing Interprocessor Dependence in Recoverable Distributed Shared Memory
NASA Technical Reports Server (NTRS)
Janssens, Bob; Fuchs, W. Kent
1994-01-01
Checkpointing techniques in parallel systems use dependency tracking and/or message logging to ensure that a system rolls back to a consistent state. Traditional dependency tracking in distributed shared memory (DSM) systems is expensive because of high communication frequency. In this paper we show that, if designed correctly, a DSM system only needs to consider dependencies due to the transfer of blocks of data, resulting in reduced dependency tracking overhead and reduced potential for rollback propagation. We develop an ownership timestamp scheme to tolerate the loss of block state information and develop a passive server model of execution where interactions between processors are considered atomic. With our scheme, dependencies are significantly reduced compared to the traditional message-passing model.
NASA Astrophysics Data System (ADS)
Fang, Juan; Hao, Xiaoting; Fan, Qingwen; Chang, Zeqing; Song, Shuying
2017-05-01
In the Heterogeneous multi-core architecture, CPU and GPU processor are integrated on the same chip, which poses a new challenge to the last-level cache management. In this architecture, the CPU application and the GPU application execute concurrently, accessing the last-level cache. CPU and GPU have different memory access characteristics, so that they have differences in the sensitivity of last-level cache (LLC) capacity. For many CPU applications, a reduced share of the LLC could lead to significant performance degradation. On the contrary, GPU applications can tolerate increase in memory access latency when there is sufficient thread-level parallelism. Taking into account the GPU program memory latency tolerance characteristics, this paper presents a method that let GPU applications can access to memory directly, leaving lots of LLC space for CPU applications, in improving the performance of CPU applications and does not affect the performance of GPU applications. When the CPU application is cache sensitive, and the GPU application is insensitive to the cache, the overall performance of the system is improved significantly.
A simple modern correctness condition for a space-based high-performance multiprocessor
NASA Technical Reports Server (NTRS)
Probst, David K.; Li, Hon F.
1992-01-01
A number of U.S. national programs, including space-based detection of ballistic missile launches, envisage putting significant computing power into space. Given sufficient progress in low-power VLSI, multichip-module packaging and liquid-cooling technologies, we will see design of high-performance multiprocessors for individual satellites. In very high speed implementations, performance depends critically on tolerating large latencies in interprocessor communication; without latency tolerance, performance is limited by the vastly differing time scales in processor and data-memory modules, including interconnect times. The modern approach to tolerating remote-communication cost in scalable, shared-memory multiprocessors is to use a multithreaded architecture, and alter the semantics of shared memory slightly, at the price of forcing the programmer either to reason about program correctness in a relaxed consistency model or to agree to program in a constrained style. The literature on multiprocessor correctness conditions has become increasingly complex, and sometimes confusing, which may hinder its practical application. We propose a simple modern correctness condition for a high-performance, shared-memory multiprocessor; the correctness condition is based on a simple interface between the multiprocessor architecture and a high-performance, shared-memory multiprocessor; the correctness condition is based on a simple interface between the multiprocessor architecture and the parallel programming system.
Parallel Gaussian elimination of a block tridiagonal matrix using multiple microcomputers
NASA Technical Reports Server (NTRS)
Blech, Richard A.
1989-01-01
The solution of a block tridiagonal matrix using parallel processing is demonstrated. The multiprocessor system on which results were obtained and the software environment used to program that system are described. Theoretical partitioning and resource allocation for the Gaussian elimination method used to solve the matrix are discussed. The results obtained from running 1, 2 and 3 processor versions of the block tridiagonal solver are presented. The PASCAL source code for these solvers is given in the appendix, and may be transportable to other shared memory parallel processors provided that the synchronization outlines are reproduced on the target system.
A Parallel Rendering Algorithm for MIMD Architectures
NASA Technical Reports Server (NTRS)
Crockett, Thomas W.; Orloff, Tobias
1991-01-01
Applications such as animation and scientific visualization demand high performance rendering of complex three dimensional scenes. To deliver the necessary rendering rates, highly parallel hardware architectures are required. The challenge is then to design algorithms and software which effectively use the hardware parallelism. A rendering algorithm targeted to distributed memory MIMD architectures is described. For maximum performance, the algorithm exploits both object-level and pixel-level parallelism. The behavior of the algorithm is examined both analytically and experimentally. Its performance for large numbers of processors is found to be limited primarily by communication overheads. An experimental implementation for the Intel iPSC/860 shows increasing performance from 1 to 128 processors across a wide range of scene complexities. It is shown that minimal modifications to the algorithm will adapt it for use on shared memory architectures as well.
General-purpose interface bus for multiuser, multitasking computer system
NASA Technical Reports Server (NTRS)
Generazio, Edward R.; Roth, Don J.; Stang, David B.
1990-01-01
The architecture of a multiuser, multitasking, virtual-memory computer system intended for the use by a medium-size research group is described. There are three central processing units (CPU) in the configuration, each with 16 MB memory, and two 474 MB hard disks attached. CPU 1 is designed for data analysis and contains an array processor for fast-Fourier transformations. In addition, CPU 1 shares display images viewed with the image processor. CPU 2 is designed for image analysis and display. CPU 3 is designed for data acquisition and contains 8 GPIB channels and an analog-to-digital conversion input/output interface with 16 channels. Up to 9 users can access the third CPU simultaneously for data acquisition. Focus is placed on the optimization of hardware interfaces and software, facilitating instrument control, data acquisition, and processing.
Compiler-directed cache management in multiprocessors
NASA Technical Reports Server (NTRS)
Cheong, Hoichi; Veidenbaum, Alexander V.
1990-01-01
The necessity of finding alternatives to hardware-based cache coherence strategies for large-scale multiprocessor systems is discussed. Three different software-based strategies sharing the same goals and general approach are presented. They consist of a simple invalidation approach, a fast selective invalidation scheme, and a version control scheme. The strategies are suitable for shared-memory multiprocessor systems with interconnection networks and a large number of processors. Results of trace-driven simulations conducted on numerical benchmark routines to compare the performance of the three schemes are presented.
1987-12-01
Synchronization and Data Passing Mechanism ........ 50 4. System Shut Down .................................................................. 51 5...high performance, fault tolerance, and extensibility. These features are attained by synchronizing and coordinating the dis- tributed multicomputer... synchronizing all processors in the network. In a multitransputer network, processes that communicate with each other do so synchronously . This makes
Debugging Fortran on a shared memory machine
DOE Office of Scientific and Technical Information (OSTI.GOV)
Allen, T.R.; Padua, D.A.
1987-01-01
Debugging on a parallel processor is more difficult than debugging on a serial machine because errors in a parallel program may introduce nondeterminism. The approach to parallel debugging presented here attempts to reduce the problem of debugging on a parallel machine to that of debugging on a serial machine by automatically detecting nondeterminism. 20 refs., 6 figs.
Optical Associative Processors For Visual Perception"
NASA Astrophysics Data System (ADS)
Casasent, David; Telfer, Brian
1988-05-01
We consider various associative processor modifications required to allow these systems to be used for visual perception, scene analysis, and object recognition. For these applications, decisions on the class of the objects present in the input image are required and thus heteroassociative memories are necessary (rather than the autoassociative memories that have been given most attention). We analyze the performance of both associative processors and note that there is considerable difference between heteroassociative and autoassociative memories. We describe associative processors suitable for realizing functions such as: distortion invariance (using linear discriminant function memory synthesis techniques), noise and image processing performance (using autoassociative memories in cascade with with a heteroassociative processor and with a finite number of autoassociative memory iterations employed), shift invariance (achieved through the use of associative processors operating on feature space data), and the analysis of multiple objects in high noise (which is achieved using associative processing of the output from symbolic correlators). We detail and provide initial demonstrations of the use of associative processors operating on iconic, feature space and symbolic data, as well as adaptive associative processors.
G-cueing microcontroller (a microprocessor application in simulators)
NASA Technical Reports Server (NTRS)
Horattas, C. G.
1980-01-01
A g cueing microcontroller is described which consists of a tandem pair of microprocessors, dedicated to the task of simulating pilot sensed cues caused by gravity effects. This task includes execution of a g cueing model which drives actuators that alter the configuration of the pilot's seat. The g cueing microcontroller receives acceleration commands from the aerodynamics model in the main computer and creates the stimuli that produce physical acceleration effects of the aircraft seat on the pilots anatomy. One of the two microprocessors is a fixed instruction processor that performs all control and interface functions. The other, a specially designed bipolar bit slice microprocessor, is a microprogrammable processor dedicated to all arithmetic operations. The two processors communicate with each other by a shared memory. The g cueing microcontroller contains its own dedicated I/O conversion modules for interface with the seat actuators and controls, and a DMA controller for interfacing with the simulation computer. Any application which can be microcoded within the available memory, the available real time and the available I/O channels, could be implemented in the same controller.
Efficient accesses of data structures using processing near memory
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jayasena, Nuwan S.; Zhang, Dong Ping; Diez, Paula Aguilera
Systems, apparatuses, and methods for implementing efficient queues and other data structures. A queue may be shared among multiple processors and/or threads without using explicit software atomic instructions to coordinate access to the queue. System software may allocate an atomic queue and corresponding queue metadata in system memory and return, to the requesting thread, a handle referencing the queue metadata. Any number of threads may utilize the handle for accessing the atomic queue. The logic for ensuring the atomicity of accesses to the atomic queue may reside in a management unit in the memory controller coupled to the memory wheremore » the atomic queue is allocated.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wasserman, H.J.
1996-02-01
The second generation of the Digital Equipment Corp. (DEC) DECchip Alpha AXP microprocessor is referred to as the 21164. From the viewpoint of numerically-intensive computing, the primary difference between it and its predecessor, the 21064, is that the 21164 has twice the multiply/add throughput per clock period (CP), a maximum of two floating point operations (FLOPS) per CP vs. one for 21064. The AlphaServer 8400 is a shared-memory multiprocessor server system that can accommodate up to 12 CPUs and up to 14 GB of memory. In this report we will compare single processor performance of the 8400 system with thatmore » of the International Business Machines Corp. (IBM) RISC System/6000 POWER-2 microprocessor running at 66 MHz, the Silicon Graphics, Inc. (SGI) MIPS R8000 microprocessor running at 75 MHz, and the Cray Research, Inc. CRAY J90. The performance comparison is based on a set of Fortran benchmark codes that represent a portion of the Los Alamos National Laboratory supercomputer workload. The advantage of using these codes, is that the codes also span a wide range of computational characteristics, such as vectorizability, problem size, and memory access pattern. The primary disadvantage of using them is that detailed, quantitative analysis of performance behavior of all codes on all machines is difficult. One important addition to the benchmark set appears for the first time in this report. Whereas the older version was written for a vector processor, the newer version is more optimized for microprocessor architectures. Therefore, we have for the first time, an opportunity to measure performance on a single application using implementations that expose the respective strengths of vector and superscalar architecture. All results in this report are from single processors. A subsequent article will explore shared-memory multiprocessing performance of the 8400 system.« less
Dynamic programming on a shared-memory multiprocessor
NASA Technical Reports Server (NTRS)
Edmonds, Phil; Chu, Eleanor; George, Alan
1993-01-01
Three new algorithms for solving dynamic programming problems on a shared-memory parallel computer are described. All three algorithms attempt to balance work load, while keeping synchronization cost low. In particular, for a multiprocessor having p processors, an analysis of the best algorithm shows that the arithmetic cost is O(n-cubed/6p) and that the synchronization cost is O(absolute value of log sub C n) if p much less than n, where C = (2p-1)/(2p + 1) and n is the size of the problem. The low synchronization cost is important for machines where synchronization is expensive. Analysis and experiments show that the best algorithm is effective in balancing the work load and producing high efficiency.
Performances of multiprocessor multidisk architectures for continuous media storage
NASA Astrophysics Data System (ADS)
Gennart, Benoit A.; Messerli, Vincent; Hersch, Roger D.
1996-03-01
Multimedia interfaces increase the need for large image databases, capable of storing and reading streams of data with strict synchronicity and isochronicity requirements. In order to fulfill these requirements, we consider a parallel image server architecture which relies on arrays of intelligent disk nodes, each disk node being composed of one processor and one or more disks. This contribution analyzes through bottleneck performance evaluation and simulation the behavior of two multi-processor multi-disk architectures: a point-to-point architecture and a shared-bus architecture similar to current multiprocessor workstation architectures. We compare the two architectures on the basis of two multimedia algorithms: the compute-bound frame resizing by resampling and the data-bound disk-to-client stream transfer. The results suggest that the shared bus is a potential bottleneck despite its very high hardware throughput (400Mbytes/s) and that an architecture with addressable local memories located closely to their respective processors could partially remove this bottleneck. The point- to-point architecture is scalable and able to sustain high throughputs for simultaneous compute- bound and data-bound operations.
The force on the flex: Global parallelism and portability
NASA Technical Reports Server (NTRS)
Jordan, H. F.
1986-01-01
A parallel programming methodology, called the force, supports the construction of programs to be executed in parallel by an unspecified, but potentially large, number of processes. The methodology was originally developed on a pipelined, shared memory multiprocessor, the Denelcor HEP, and embodies the primitive operations of the force in a set of macros which expand into multiprocessor Fortran code. A small set of primitives is sufficient to write large parallel programs, and the system has been used to produce 10,000 line programs in computational fluid dynamics. The level of complexity of the force primitives is intermediate. It is high enough to mask detailed architectural differences between multiprocessors but low enough to give the user control over performance. The system is being ported to a medium scale multiprocessor, the Flex/32, which is a 20 processor system with a mixture of shared and local memory. Memory organization and the type of processor synchronization supported by the hardware on the two machines lead to some differences in efficient implementations of the force primitives, but the user interface remains the same. An initial implementation was done by retargeting the macros to Flexible Computer Corporation's ConCurrent C language. Subsequently, the macros were caused to directly produce the system calls which form the basis for ConCurrent C. The implementation of the Fortran based system is in step with Flexible Computer Corporations's implementation of a Fortran system in the parallel environment.
NASA Technical Reports Server (NTRS)
Chapman, Barbara; Mehrotra, Piyush; Zima, Hans
1992-01-01
Exploiting the full performance potential of distributed memory machines requires a careful distribution of data across the processors. Vienna Fortran is a language extension of Fortran which provides the user with a wide range of facilities for such mapping of data structures. In contrast to current programming practice, programs in Vienna Fortran are written using global data references. Thus, the user has the advantages of a shared memory programming paradigm while explicitly controlling the data distribution. In this paper, we present the language features of Vienna Fortran for FORTRAN 77, together with examples illustrating the use of these features.
Data traffic reduction schemes for Cholesky factorization on asynchronous multiprocessor systems
NASA Technical Reports Server (NTRS)
Naik, Vijay K.; Patrick, Merrell L.
1989-01-01
Communication requirements of Cholesky factorization of dense and sparse symmetric, positive definite matrices are analyzed. The communication requirement is characterized by the data traffic generated on multiprocessor systems with local and shared memory. Lower bound proofs are given to show that when the load is uniformly distributed the data traffic associated with factoring an n x n dense matrix using n to the alpha power (alpha less than or equal 2) processors is omega(n to the 2 + alpha/2 power). For n x n sparse matrices representing a square root of n x square root of n regular grid graph the data traffic is shown to be omega(n to the 1 + alpha/2 power), alpha less than or equal 1. Partitioning schemes that are variations of block assignment scheme are described and it is shown that the data traffic generated by these schemes are asymptotically optimal. The schemes allow efficient use of up to O(n to the 2nd power) processors in the dense case and up to O(n) processors in the sparse case before the total data traffic reaches the maximum value of O(n to the 3rd power) and O(n to the 3/2 power), respectively. It is shown that the block based partitioning schemes allow a better utilization of the data accessed from shared memory and thus reduce the data traffic than those based on column-wise wrap around assignment schemes.
Peregrine System Configuration | High-Performance Computing | NREL
nodes and storage are connected by a high speed InfiniBand network. Compute nodes are diskless with an directories are mounted on all nodes, along with a file system dedicated to shared projects. A brief processors with 64 GB of memory. All nodes are connected to the high speed Infiniband network and and a
NASA Astrophysics Data System (ADS)
Goedecker, Stefan; Boulet, Mireille; Deutsch, Thierry
2003-08-01
Three-dimensional Fast Fourier Transforms (FFTs) are the main computational task in plane wave electronic structure calculations. Obtaining a high performance on a large numbers of processors is non-trivial on the latest generation of parallel computers that consist of nodes made up of a shared memory multiprocessors. A non-dogmatic method for obtaining high performance for such 3-dim FFTs in a combined MPI/OpenMP programming paradigm will be presented. Exploiting the peculiarities of plane wave electronic structure calculations, speedups of up to 160 and speeds of up to 130 Gflops were obtained on 256 processors.
Is random access memory random?
NASA Technical Reports Server (NTRS)
Denning, P. J.
1986-01-01
Most software is contructed on the assumption that the programs and data are stored in random access memory (RAM). Physical limitations on the relative speeds of processor and memory elements lead to a variety of memory organizations that match processor addressing rate with memory service rate. These include interleaved and cached memory. A very high fraction of a processor's address requests can be satified from the cache without reference to the main memory. The cache requests information from main memory in blocks that can be transferred at the full memory speed. Programmers who organize algorithms for locality can realize the highest performance from these computers.
Method of up-front load balancing for local memory parallel processors
NASA Technical Reports Server (NTRS)
Baffes, Paul Thomas (Inventor)
1990-01-01
In a parallel processing computer system with multiple processing units and shared memory, a method is disclosed for uniformly balancing the aggregate computational load in, and utilizing minimal memory by, a network having identical computations to be executed at each connection therein. Read-only and read-write memory are subdivided into a plurality of process sets, which function like artificial processing units. Said plurality of process sets is iteratively merged and reduced to the number of processing units without exceeding the balance load. Said merger is based upon the value of a partition threshold, which is a measure of the memory utilization. The turnaround time and memory savings of the instant method are functions of the number of processing units available and the number of partitions into which the memory is subdivided. Typical results of the preferred embodiment yielded memory savings of from sixty to seventy five percent.
Avoiding and tolerating latency in large-scale next-generation shared-memory multiprocessors
NASA Technical Reports Server (NTRS)
Probst, David K.
1993-01-01
A scalable solution to the memory-latency problem is necessary to prevent the large latencies of synchronization and memory operations inherent in large-scale shared-memory multiprocessors from reducing high performance. We distinguish latency avoidance and latency tolerance. Latency is avoided when data is brought to nearby locales for future reference. Latency is tolerated when references are overlapped with other computation. Latency-avoiding locales include: processor registers, data caches used temporally, and nearby memory modules. Tolerating communication latency requires parallelism, allowing the overlap of communication and computation. Latency-tolerating techniques include: vector pipelining, data caches used spatially, prefetching in various forms, and multithreading in various forms. Relaxing the consistency model permits increased use of avoidance and tolerance techniques. Each model is a mapping from the program text to sets of partial orders on program operations; it is a convention about which temporal precedences among program operations are necessary. Information about temporal locality and parallelism constrains the use of avoidance and tolerance techniques. Suitable architectural primitives and compiler technology are required to exploit the increased freedom to reorder and overlap operations in relaxed models.
Direct match data flow memory for data driven computing
Davidson, George S.; Grafe, Victor Gerald
1997-01-01
A data flow computer and method of computing is disclosed which utilizes a data driven processor node architecture. The apparatus in a preferred embodiment includes a plurality of First-In-First-Out (FIFO) registers, a plurality of related data flow memories, and a processor. The processor makes the necessary calculations and includes a control unit to generate signals to enable the appropriate FIFO register receiving the result. In a particular embodiment, there are three FIFO registers per node: an input FIFO register to receive input information form an outside source and provide it to the data flow memories; an output FIFO register to provide output information from the processor to an outside recipient; and an internal FIFO register to provide information from the processor back to the data flow memories. The data flow memories are comprised of four commonly addressed memories. A parameter memory holds the A and B parameters used in the calculations; an opcode memory holds the instruction; a target memory holds the output address; and a tag memory contains status bits for each parameter. One status bit indicates whether the corresponding parameter is in the parameter memory and one status bit to indicate whether the stored information in the corresponding data parameter is to be reused. The tag memory outputs a "fire" signal (signal R VALID) when all of the necessary information has been stored in the data flow memories, and thus when the instruction is ready to be fired to the processor.
Direct match data flow memory for data driven computing
Davidson, G.S.; Grafe, V.G.
1997-10-07
A data flow computer and method of computing is disclosed which utilizes a data driven processor node architecture. The apparatus in a preferred embodiment includes a plurality of First-In-First-Out (FIFO) registers, a plurality of related data flow memories, and a processor. The processor makes the necessary calculations and includes a control unit to generate signals to enable the appropriate FIFO register receiving the result. In a particular embodiment, there are three FIFO registers per node: an input FIFO register to receive input information form an outside source and provide it to the data flow memories; an output FIFO register to provide output information from the processor to an outside recipient; and an internal FIFO register to provide information from the processor back to the data flow memories. The data flow memories are comprised of four commonly addressed memories. A parameter memory holds the A and B parameters used in the calculations; an opcode memory holds the instruction; a target memory holds the output address; and a tag memory contains status bits for each parameter. One status bit indicates whether the corresponding parameter is in the parameter memory and one status bit to indicate whether the stored information in the corresponding data parameter is to be reused. The tag memory outputs a ``fire`` signal (signal R VALID) when all of the necessary information has been stored in the data flow memories, and thus when the instruction is ready to be fired to the processor. 11 figs.
NASA Technical Reports Server (NTRS)
Kikuchi, Hideaki; Kalia, Rajiv K.; Nakano, Aiichiro; Vashishta, Priya; Shimojo, Fuyuki; Saini, Subhash
2003-01-01
Scalability of a low-cost, Intel Xeon-based, multi-Teraflop Linux cluster is tested for two high-end scientific applications: Classical atomistic simulation based on the molecular dynamics method and quantum mechanical calculation based on the density functional theory. These scalable parallel applications use space-time multiresolution algorithms and feature computational-space decomposition, wavelet-based adaptive load balancing, and spacefilling-curve-based data compression for scalable I/O. Comparative performance tests are performed on a 1,024-processor Linux cluster and a conventional higher-end parallel supercomputer, 1,184-processor IBM SP4. The results show that the performance of the Linux cluster is comparable to that of the SP4. We also study various effects, such as the sharing of memory and L2 cache among processors, on the performance.
Direct match data flow machine apparatus and process for data driven computing
Davidson, G.S.; Grafe, V.G.
1997-08-12
A data flow computer and method of computing are disclosed which utilizes a data driven processor node architecture. The apparatus in a preferred embodiment includes a plurality of First-In-First-Out (FIFO) registers, a plurality of related data flow memories, and a processor. The processor makes the necessary calculations and includes a control unit to generate signals to enable the appropriate FIFO register receiving the result. In a particular embodiment, there are three FIFO registers per node: an input FIFO register to receive input information form an outside source and provide it to the data flow memories; an output FIFO register to provide output information from the processor to an outside recipient; and an internal FIFO register to provide information from the processor back to the data flow memories. The data flow memories are comprised of four commonly addressed memories. A parameter memory holds the A and B parameters used in the calculations; an opcode memory holds the instruction; a target memory holds the output address; and a tag memory contains status bits for each parameter. One status bit indicates whether the corresponding parameter is in the parameter memory and one status but to indicate whether the stored information in the corresponding data parameter is to be reused. The tag memory outputs a ``fire`` signal (signal R VALID) when all of the necessary information has been stored in the data flow memories, and thus when the instruction is ready to be fired to the processor. 11 figs.
Data flow machine for data driven computing
Davidson, G.S.; Grafe, V.G.
1988-07-22
A data flow computer and method of computing is disclosed which utilizes a data driven processor node architecture. The apparatus in a preferred embodiment includes a plurality of First-In-First-Out (FIFO) registers, a plurality of related data flow memories, and a processor. The processor makes the necessary calculations and includes a control unit to generate signals to enable the appropriate FIFO register receiving the result. In a particular embodiment, there are three FIFO registers per node: an input FIFO register to receive input information from an outside source and provide it to the data flow memories; an output FIFO register to provide output information from the processor to an outside recipient; and an internal FIFO register to provide information from the processor back to the data flow memories. The data flow memories are comprised of four commonly addressed memories. A parameter memory holds the A and B parameters used in the calculations; an opcode memory holds the instruction; a target memory holds the output address; and a tag memory contains status bits for each parameter. One status bit indicates whether the corresponding parameter is in the parameter memory and one status bit to indicate whether the stored information in the corresponding data parameter is to be reused. The tag memory outputs a ''fire'' signal (signal R VALID) when all of the necessary information has been stored in the data flow memories, and thus when the instruction is ready to be fired to the processor. 11 figs.
Data flow machine for data driven computing
Davidson, George S.; Grafe, Victor G.
1995-01-01
A data flow computer which of computing is disclosed which utilizes a data driven processor node architecture. The apparatus in a preferred embodiment includes a plurality of First-In-First-Out (FIFO) registers, a plurality of related data flow memories, and a processor. The processor makes the necessary calculations and includes a control unit to generate signals to enable the appropriate FIFO register receiving the result. In a particular embodiment, there are three FIFO registers per node: an input FIFO register to receive input information form an outside source and provide it to the data flow memories; an output FIFO register to provide output information from the processor to an outside recipient; and an internal FIFO register to provide information from the processor back to the data flow memories. The data flow memories are comprised of four commonly addressed memories. A parameter memory holds the A and B parameters used in the calculations; an opcode memory holds the instruction; a target memory holds the output address; and a tag memory contains status bits for each parameter. One status bit indicates whether the corresponding parameter is in the parameter memory and one status but to indicate whether the stored information in the corresponding data parameter is to be reused. The tag memory outputs a "fire" signal (signal R VALID) when all of the necessary information has been stored in the data flow memories, and thus when the instruction is ready to be fired to the processor.
Direct match data flow machine apparatus and process for data driven computing
Davidson, George S.; Grafe, Victor Gerald
1997-01-01
A data flow computer and method of computing is disclosed which utilizes a data driven processor node architecture. The apparatus in a preferred embodiment includes a plurality of First-In-First-Out (FIFO) registers, a plurality of related data flow memories, and a processor. The processor makes the necessary calculations and includes a control unit to generate signals to enable the appropriate FIFO register receiving the result. In a particular embodiment, there are three FIFO registers per node: an input FIFO register to receive input information form an outside source and provide it to the data flow memories; an output FIFO register to provide output information from the processor to an outside recipient; and an internal FIFO register to provide information from the processor back to the data flow memories. The data flow memories are comprised of four commonly addressed memories. A parameter memory holds the A and B parameters used in the calculations; an opcode memory holds the instruction; a target memory holds the output address; and a tag memory contains status bits for each parameter. One status bit indicates whether the corresponding parameter is in the parameter memory and one status but to indicate whether the stored information in the corresponding data parameter is to be reused. The tag memory outputs a "fire" signal (signal R VALID) when all of the necessary information has been stored in the data flow memories, and thus when the instruction is ready to be fired to the processor.
Distributed Systems Technology Survey.
1987-03-01
and prolocols. 2. Hardware Technology Ecnomic factor we a majo reonm for the prolierat of dlstbted systoe. Processors, memory, an magne tc ndoptical...destined messages and pertorn the a pro te forwarding. There gImsno agreement that a ightweight process mechanism is essential to support com- monly used...Xerox PARC environment [311. Shared file servers, discussed below, are essential to the success of such a scheme. 11. ecurlity A distributed
NASA Technical Reports Server (NTRS)
OKeefe, Matthew (Editor); Kerr, Christopher L. (Editor)
1998-01-01
This report contains the abstracts and technical papers from the Second International Workshop on Software Engineering and Code Design in Parallel Meteorological and Oceanographic Applications, held June 15-18, 1998, in Scottsdale, Arizona. The purpose of the workshop is to bring together software developers in meteorology and oceanography to discuss software engineering and code design issues for parallel architectures, including Massively Parallel Processors (MPP's), Parallel Vector Processors (PVP's), Symmetric Multi-Processors (SMP's), Distributed Shared Memory (DSM) multi-processors, and clusters. Issues to be discussed include: (1) code architectures for current parallel models, including basic data structures, storage allocation, variable naming conventions, coding rules and styles, i/o and pre/post-processing of data; (2) designing modular code; (3) load balancing and domain decomposition; (4) techniques that exploit parallelism efficiently yet hide the machine-related details from the programmer; (5) tools for making the programmer more productive; and (6) the proliferation of programming models (F--, OpenMP, MPI, and HPF).
Merlin - Massively parallel heterogeneous computing
NASA Technical Reports Server (NTRS)
Wittie, Larry; Maples, Creve
1989-01-01
Hardware and software for Merlin, a new kind of massively parallel computing system, are described. Eight computers are linked as a 300-MIPS prototype to develop system software for a larger Merlin network with 16 to 64 nodes, totaling 600 to 3000 MIPS. These working prototypes help refine a mapped reflective memory technique that offers a new, very general way of linking many types of computer to form supercomputers. Processors share data selectively and rapidly on a word-by-word basis. Fast firmware virtual circuits are reconfigured to match topological needs of individual application programs. Merlin's low-latency memory-sharing interfaces solve many problems in the design of high-performance computing systems. The Merlin prototypes are intended to run parallel programs for scientific applications and to determine hardware and software needs for a future Teraflops Merlin network.
A Parallel Saturation Algorithm on Shared Memory Architectures
NASA Technical Reports Server (NTRS)
Ezekiel, Jonathan; Siminiceanu
2007-01-01
Symbolic state-space generators are notoriously hard to parallelize. However, the Saturation algorithm implemented in the SMART verification tool differs from other sequential symbolic state-space generators in that it exploits the locality of ring events in asynchronous system models. This paper explores whether event locality can be utilized to efficiently parallelize Saturation on shared-memory architectures. Conceptually, we propose to parallelize the ring of events within a decision diagram node, which is technically realized via a thread pool. We discuss the challenges involved in our parallel design and conduct experimental studies on its prototypical implementation. On a dual-processor dual core PC, our studies show speed-ups for several example models, e.g., of up to 50% for a Kanban model, when compared to running our algorithm only on a single core.
Method and apparatus for high speed data acquisition and processing
Ferron, J.R.
1997-02-11
A method and apparatus are disclosed for high speed digital data acquisition. The apparatus includes one or more multiplexers for receiving multiple channels of digital data at a low data rate and asserting a multiplexed data stream at a high data rate, and one or more FIFO memories for receiving data from the multiplexers and asserting the data to a real time processor. Preferably, the invention includes two multiplexers, two FIFO memories, and a 64-bit bus connecting the FIFO memories with the processor. Each multiplexer receives four channels of 14-bit digital data at a rate of up to 5 MHz per channel, and outputs a data stream to one of the FIFO memories at a rate of 20 MHz. The FIFO memories assert output data in parallel to the 64-bit bus, thus transferring 14-bit data values to the processor at a combined rate of 40 MHz. The real time processor is preferably a floating-point processor which processes 32-bit floating-point words. A set of mask bits is prestored in each 32-bit storage location of the processor memory into which a 14-bit data value is to be written. After data transfer from the FIFO memories, mask bits are concatenated with each stored 14-bit data value to define a valid 32-bit floating-point word. Preferably, a user can select any of several modes for starting and stopping direct memory transfers of data from the FIFO memories to memory within the real time processor, by setting the content of a control and status register. 15 figs.
Method and apparatus for high speed data acquisition and processing
Ferron, John R.
1997-01-01
A method and apparatus for high speed digital data acquisition. The apparatus includes one or more multiplexers for receiving multiple channels of digital data at a low data rate and asserting a multiplexed data stream at a high data rate, and one or more FIFO memories for receiving data from the multiplexers and asserting the data to a real time processor. Preferably, the invention includes two multiplexers, two FIFO memories, and a 64-bit bus connecting the FIFO memories with the processor. Each multiplexer receives four channels of 14-bit digital data at a rate of up to 5 MHz per channel, and outputs a data stream to one of the FIFO memories at a rate of 20 MHz. The FIFO memories assert output data in parallel to the 64-bit bus, thus transferring 14-bit data values to the processor at a combined rate of 40 MHz. The real time processor is preferably a floating-point processor which processes 32-bit floating-point words. A set of mask bits is prestored in each 32-bit storage location of the processor memory into which a 14-bit data value is to be written. After data transfer from the FIFO memories, mask bits are concatenated with each stored 14-bit data value to define a valid 32-bit floating-point word. Preferably, a user can select any of several modes for starting and stopping direct memory transfers of data from the FIFO memories to memory within the real time processor, by setting the content of a control and status register.
NASA Astrophysics Data System (ADS)
Handhika, T.; Bustamam, A.; Ernastuti, Kerami, D.
2017-07-01
Multi-thread programming using OpenMP on the shared-memory architecture with hyperthreading technology allows the resource to be accessed by multiple processors simultaneously. Each processor can execute more than one thread for a certain period of time. However, its speedup depends on the ability of the processor to execute threads in limited quantities, especially the sequential algorithm which contains a nested loop. The number of the outer loop iterations is greater than the maximum number of threads that can be executed by a processor. The thread distribution technique that had been found previously only be applied by the high-level programmer. This paper generates a parallelization procedure for low-level programmer in dealing with 2-level nested loop problems with the maximum number of threads that can be executed by a processor is smaller than the number of the outer loop iterations. Data preprocessing which is related to the number of the outer loop and the inner loop iterations, the computational time required to execute each iteration and the maximum number of threads that can be executed by a processor are used as a strategy to determine which parallel region that will produce optimal speedup.
NASA Astrophysics Data System (ADS)
Calafiura, Paolo; Leggett, Charles; Seuster, Rolf; Tsulaia, Vakhtang; Van Gemmeren, Peter
2015-12-01
AthenaMP is a multi-process version of the ATLAS reconstruction, simulation and data analysis framework Athena. By leveraging Linux fork and copy-on-write mechanisms, it allows for sharing of memory pages between event processors running on the same compute node with little to no change in the application code. Originally targeted to optimize the memory footprint of reconstruction jobs, AthenaMP has demonstrated that it can reduce the memory usage of certain configurations of ATLAS production jobs by a factor of 2. AthenaMP has also evolved to become the parallel event-processing core of the recently developed ATLAS infrastructure for fine-grained event processing (Event Service) which allows the running of AthenaMP inside massively parallel distributed applications on hundreds of compute nodes simultaneously. We present the architecture of AthenaMP, various strategies implemented by AthenaMP for scheduling workload to worker processes (for example: Shared Event Queue and Shared Distributor of Event Tokens) and the usage of AthenaMP in the diversity of ATLAS event processing workloads on various computing resources: Grid, opportunistic resources and HPC.
NASA Technical Reports Server (NTRS)
Janetzke, David C.; Murthy, Durbha V.
1991-01-01
Aeroelastic analysis is multi-disciplinary and computationally expensive. Hence, it can greatly benefit from parallel processing. As part of an effort to develop an aeroelastic capability on a distributed memory transputer network, a parallel algorithm for the computation of aerodynamic influence coefficients is implemented on a network of 32 transputers. The aerodynamic influence coefficients are calculated using a 3-D unsteady aerodynamic model and a parallel discretization. Efficiencies up to 85 percent were demonstrated using 32 processors. The effect of subtask ordering, problem size, and network topology are presented. A comparison to results on a shared memory computer indicates that higher speedup is achieved on the distributed memory system.
Active non-volatile memory post-processing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Kannan, Sudarsun; Milojicic, Dejan S.; Talwar, Vanish
A computing node includes an active Non-Volatile Random Access Memory (NVRAM) component which includes memory and a sub-processor component. The memory is to store data chunks received from a processor core, the data chunks comprising metadata indicating a type of post-processing to be performed on data within the data chunks. The sub-processor component is to perform post-processing of said data chunks based on said metadata.
On the impact of communication complexity in the design of parallel numerical algorithms
NASA Technical Reports Server (NTRS)
Gannon, D.; Vanrosendale, J.
1984-01-01
This paper describes two models of the cost of data movement in parallel numerical algorithms. One model is a generalization of an approach due to Hockney, and is suitable for shared memory multiprocessors where each processor has vector capabilities. The other model is applicable to highly parallel nonshared memory MIMD systems. In the second model, algorithm performance is characterized in terms of the communication network design. Techniques used in VLSI complexity theory are also brought in, and algorithm independent upper bounds on system performance are derived for several problems that are important to scientific computation.
Implementations of BLAST for parallel computers.
Jülich, A
1995-02-01
The BLAST sequence comparison programs have been ported to a variety of parallel computers-the shared memory machine Cray Y-MP 8/864 and the distributed memory architectures Intel iPSC/860 and nCUBE. Additionally, the programs were ported to run on workstation clusters. We explain the parallelization techniques and consider the pros and cons of these methods. The BLAST programs are very well suited for parallelization for a moderate number of processors. We illustrate our results using the program blastp as an example. As input data for blastp, a 799 residue protein query sequence and the protein database PIR were used.
On the impact of communication complexity on the design of parallel numerical algorithms
NASA Technical Reports Server (NTRS)
Gannon, D. B.; Van Rosendale, J.
1984-01-01
This paper describes two models of the cost of data movement in parallel numerical alorithms. One model is a generalization of an approach due to Hockney, and is suitable for shared memory multiprocessors where each processor has vector capabilities. The other model is applicable to highly parallel nonshared memory MIMD systems. In this second model, algorithm performance is characterized in terms of the communication network design. Techniques used in VLSI complexity theory are also brought in, and algorithm-independent upper bounds on system performance are derived for several problems that are important to scientific computation.
Arranging computer architectures to create higher-performance controllers
NASA Technical Reports Server (NTRS)
Jacklin, Stephen A.
1988-01-01
Techniques for integrating microprocessors, array processors, and other intelligent devices in control systems are reviewed, with an emphasis on the (re)arrangement of components to form distributed or parallel processing systems. Consideration is given to the selection of the host microprocessor, increasing the power and/or memory capacity of the host, multitasking software for the host, array processors to reduce computation time, the allocation of real-time and non-real-time events to different computer subsystems, intelligent devices to share the computational burden for real-time events, and intelligent interfaces to increase communication speeds. The case of a helicopter vibration-suppression and stabilization controller is analyzed as an example, and significant improvements in computation and throughput rates are demonstrated.
NASA Technical Reports Server (NTRS)
Barnes, George H. (Inventor); Lundstrom, Stephen F. (Inventor); Shafer, Philip E. (Inventor)
1983-01-01
A high speed parallel array data processing architecture fashioned under a computational envelope approach includes a data base memory for secondary storage of programs and data, and a plurality of memory modules interconnected to a plurality of processing modules by a connection network of the Omega gender. Programs and data are fed from the data base memory to the plurality of memory modules and from hence the programs are fed through the connection network to the array of processors (one copy of each program for each processor). Execution of the programs occur with the processors operating normally quite independently of each other in a multiprocessing fashion. For data dependent operations and other suitable operations, all processors are instructed to finish one given task or program branch before all are instructed to proceed in parallel processing fashion on the next instruction. Even when functioning in the parallel processing mode however, the processors are not locked-step but execute their own copy of the program individually unless or until another overall processor array synchronization instruction is issued.
Distributed memory compiler methods for irregular problems: Data copy reuse and runtime partitioning
NASA Technical Reports Server (NTRS)
Das, Raja; Ponnusamy, Ravi; Saltz, Joel; Mavriplis, Dimitri
1991-01-01
Outlined here are two methods which we believe will play an important role in any distributed memory compiler able to handle sparse and unstructured problems. We describe how to link runtime partitioners to distributed memory compilers. In our scheme, programmers can implicitly specify how data and loop iterations are to be distributed between processors. This insulates users from having to deal explicitly with potentially complex algorithms that carry out work and data partitioning. We also describe a viable mechanism for tracking and reusing copies of off-processor data. In many programs, several loops access the same off-processor memory locations. As long as it can be verified that the values assigned to off-processor memory locations remain unmodified, we show that we can effectively reuse stored off-processor data. We present experimental data from a 3-D unstructured Euler solver run on iPSC/860 to demonstrate the usefulness of our methods.
High order parallel numerical schemes for solving incompressible flows
NASA Technical Reports Server (NTRS)
Lin, Avi; Milner, Edward J.; Liou, May-Fun; Belch, Richard A.
1992-01-01
The use of parallel computers for numerically solving flow fields has gained much importance in recent years. This paper introduces a new high order numerical scheme for computational fluid dynamics (CFD) specifically designed for parallel computational environments. A distributed MIMD system gives the flexibility of treating different elements of the governing equations with totally different numerical schemes in different regions of the flow field. The parallel decomposition of the governing operator to be solved is the primary parallel split. The primary parallel split was studied using a hypercube like architecture having clusters of shared memory processors at each node. The approach is demonstrated using examples of simple steady state incompressible flows. Future studies should investigate the secondary split because, depending on the numerical scheme that each of the processors applies and the nature of the flow in the specific subdomain, it may be possible for a processor to seek better, or higher order, schemes for its particular subcase.
Three-Dimensional High-Lift Analysis Using a Parallel Unstructured Multigrid Solver
NASA Technical Reports Server (NTRS)
Mavriplis, Dimitri J.
1998-01-01
A directional implicit unstructured agglomeration multigrid solver is ported to shared and distributed memory massively parallel machines using the explicit domain-decomposition and message-passing approach. Because the algorithm operates on local implicit lines in the unstructured mesh, special care is required in partitioning the problem for parallel computing. A weighted partitioning strategy is described which avoids breaking the implicit lines across processor boundaries, while incurring minimal additional communication overhead. Good scalability is demonstrated on a 128 processor SGI Origin 2000 machine and on a 512 processor CRAY T3E machine for reasonably fine grids. The feasibility of performing large-scale unstructured grid calculations with the parallel multigrid algorithm is demonstrated by computing the flow over a partial-span flap wing high-lift geometry on a highly resolved grid of 13.5 million points in approximately 4 hours of wall clock time on the CRAY T3E.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chin, George; Marquez, Andres; Choudhury, Sutanay
2012-09-01
Triadic analysis encompasses a useful set of graph mining methods that is centered on the concept of a triad, which is a subgraph of three nodes and the configuration of directed edges across the nodes. Such methods are often applied in the social sciences as well as many other diverse fields. Triadic methods commonly operate on a triad census that counts the number of triads of every possible edge configuration in a graph. Like other graph algorithms, triadic census algorithms do not scale well when graphs reach tens of millions to billions of nodes. To enable the triadic analysis ofmore » large-scale graphs, we developed and optimized a triad census algorithm to efficiently execute on shared memory architectures. We will retrace the development and evolution of a parallel triad census algorithm. Over the course of several versions, we continually adapted the code’s data structures and program logic to expose more opportunities to exploit parallelism on shared memory that would translate into improved computational performance. We will recall the critical steps and modifications that occurred during code development and optimization. Furthermore, we will compare the performances of triad census algorithm versions on three specific systems: Cray XMT, HP Superdome, and AMD multi-core NUMA machine. These three systems have shared memory architectures but with markedly different hardware capabilities to manage parallelism.« less
Towards Scalable 1024 Processor Shared Memory Systems
NASA Technical Reports Server (NTRS)
Ciotti, Robert B.; Thigpen, William W. (Technical Monitor)
2001-01-01
Over the past 3 years, NASA Ames has been involved in a cooperative effort with SGI to develop the largest single system image systems available. Currently a 1024 Origin3OOO is under development, with first boot expected later in the summer of 2001. This paper discusses some early results with a 512p Origin3OOO system and some arcane IRIX system calls that can dramatically improve scaling performance.
Efficient Approximation Algorithms for Weighted $b$-Matching
DOE Office of Scientific and Technical Information (OSTI.GOV)
Khan, Arif; Pothen, Alex; Mostofa Ali Patwary, Md.
2016-01-01
We describe a half-approximation algorithm, b-Suitor, for computing a b-Matching of maximum weight in a graph with weights on the edges. b-Matching is a generalization of the well-known Matching problem in graphs, where the objective is to choose a subset of M edges in the graph such that at most a specified number b(v) of edges in M are incident on each vertex v. Subject to this restriction we maximize the sum of the weights of the edges in M. We prove that the b-Suitor algorithm computes the same b-Matching as the one obtained by the greedy algorithm for themore » problem. We implement the algorithm on serial and shared-memory parallel processors, and compare its performance against a collection of approximation algorithms that have been proposed for the Matching problem. Our results show that the b-Suitor algorithm outperforms the Greedy and Locally Dominant edge algorithms by one to two orders of magnitude on a serial processor. The b-Suitor algorithm has a high degree of concurrency, and it scales well up to 240 threads on a shared memory multiprocessor. The b-Suitor algorithm outperforms the Locally Dominant edge algorithm by a factor of fourteen on 16 cores of an Intel Xeon multiprocessor.« less
Multi-Core Processor Memory Contention Benchmark Analysis Case Study
NASA Technical Reports Server (NTRS)
Simon, Tyler; McGalliard, James
2009-01-01
Multi-core processors dominate current mainframe, server, and high performance computing (HPC) systems. This paper provides synthetic kernel and natural benchmark results from an HPC system at the NASA Goddard Space Flight Center that illustrate the performance impacts of multi-core (dual- and quad-core) vs. single core processor systems. Analysis of processor design, application source code, and synthetic and natural test results all indicate that multi-core processors can suffer from significant memory subsystem contention compared to similar single-core processors.
Near-memory data reorganization engine
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gokhale, Maya; Lloyd, G. Scott
A memory subsystem package is provided that has processing logic for data reorganization within the memory subsystem package. The processing logic is adapted to reorganize data stored within the memory subsystem package. In some embodiments, the memory subsystem package includes memory units, a memory interconnect, and a data reorganization engine ("DRE"). The data reorganization engine includes a stream interconnect and DRE units including a control processor and a load-store unit. The control processor is adapted to execute instructions to control a data reorganization. The load-store unit is adapted to process data move commands received from the control processor via themore » stream interconnect for loading data from a load memory address of a memory unit and storing data to a store memory address of a memory unit.« less
A compositional reservoir simulator on distributed memory parallel computers
DOE Office of Scientific and Technical Information (OSTI.GOV)
Rame, M.; Delshad, M.
1995-12-31
This paper presents the application of distributed memory parallel computes to field scale reservoir simulations using a parallel version of UTCHEM, The University of Texas Chemical Flooding Simulator. The model is a general purpose highly vectorized chemical compositional simulator that can simulate a wide range of displacement processes at both field and laboratory scales. The original simulator was modified to run on both distributed memory parallel machines (Intel iPSC/960 and Delta, Connection Machine 5, Kendall Square 1 and 2, and CRAY T3D) and a cluster of workstations. A domain decomposition approach has been taken towards parallelization of the code. Amore » portion of the discrete reservoir model is assigned to each processor by a set-up routine that attempts a data layout as even as possible from the load-balance standpoint. Each of these subdomains is extended so that data can be shared between adjacent processors for stencil computation. The added routines that make parallel execution possible are written in a modular fashion that makes the porting to new parallel platforms straight forward. Results of the distributed memory computing performance of Parallel simulator are presented for field scale applications such as tracer flood and polymer flood. A comparison of the wall-clock times for same problems on a vector supercomputer is also presented.« less
A Tensor Product Formulation of Strassen's Matrix Multiplication Algorithm with Memory Reduction
Kumar, B.; Huang, C. -H.; Sadayappan, P.; ...
1995-01-01
In this article, we present a program generation strategy of Strassen's matrix multiplication algorithm using a programming methodology based on tensor product formulas. In this methodology, block recursive programs such as the fast Fourier Transforms and Strassen's matrix multiplication algorithm are expressed as algebraic formulas involving tensor products and other matrix operations. Such formulas can be systematically translated to high-performance parallel/vector codes for various architectures. In this article, we present a nonrecursive implementation of Strassen's algorithm for shared memory vector processors such as the Cray Y-MP. A previous implementation of Strassen's algorithm synthesized from tensor product formulas required working storagemore » of size O(7 n ) for multiplying 2 n × 2 n matrices. We present a modified formulation in which the working storage requirement is reduced to O(4 n ). The modified formulation exhibits sufficient parallelism for efficient implementation on a shared memory multiprocessor. Performance results on a Cray Y-MP8/64 are presented.« less
A fault-tolerant information processing concept for space vehicles.
NASA Technical Reports Server (NTRS)
Hopkins, A. L., Jr.
1971-01-01
A distributed fault-tolerant information processing system is proposed, comprising a central multiprocessor, dedicated local processors, and multiplexed input-output buses connecting them together. The processors in the multiprocessor are duplicated for error detection, which is felt to be less expensive than using coded redundancy of comparable effectiveness. Error recovery is made possible by a triplicated scratchpad memory in each processor. The main multiprocessor memory uses replicated memory for error detection and correction. Local processors use any of three conventional redundancy techniques: voting, duplex pairs with backup, and duplex pairs in independent subsystems.
Multi-GPU and multi-CPU accelerated FDTD scheme for vibroacoustic applications
NASA Astrophysics Data System (ADS)
Francés, J.; Otero, B.; Bleda, S.; Gallego, S.; Neipp, C.; Márquez, A.; Beléndez, A.
2015-06-01
The Finite-Difference Time-Domain (FDTD) method is applied to the analysis of vibroacoustic problems and to study the propagation of longitudinal and transversal waves in a stratified media. The potential of the scheme and the relevance of each acceleration strategy for massively computations in FDTD are demonstrated in this work. In this paper, we propose two new specific implementations of the bi-dimensional scheme of the FDTD method using multi-CPU and multi-GPU, respectively. In the first implementation, an open source message passing interface (OMPI) has been included in order to massively exploit the resources of a biprocessor station with two Intel Xeon processors. Moreover, regarding CPU code version, the streaming SIMD extensions (SSE) and also the advanced vectorial extensions (AVX) have been included with shared memory approaches that take advantage of the multi-core platforms. On the other hand, the second implementation called the multi-GPU code version is based on Peer-to-Peer communications available in CUDA on two GPUs (NVIDIA GTX 670). Subsequently, this paper presents an accurate analysis of the influence of the different code versions including shared memory approaches, vector instructions and multi-processors (both CPU and GPU) and compares them in order to delimit the degree of improvement of using distributed solutions based on multi-CPU and multi-GPU. The performance of both approaches was analysed and it has been demonstrated that the addition of shared memory schemes to CPU computing improves substantially the performance of vector instructions enlarging the simulation sizes that use efficiently the cache memory of CPUs. In this case GPU computing is slightly twice times faster than the fine tuned CPU version in both cases one and two nodes. However, for massively computations explicit vector instructions do not worth it since the memory bandwidth is the limiting factor and the performance tends to be the same than the sequential version with auto-vectorisation and also shared memory approach. In this scenario GPU computing is the best option since it provides a homogeneous behaviour. More specifically, the speedup of GPU computing achieves an upper limit of 12 for both one and two GPUs, whereas the performance reaches peak values of 80 GFlops and 146 GFlops for the performance for one GPU and two GPUs respectively. Finally, the method is applied to an earth crust profile in order to demonstrate the potential of our approach and the necessity of applying acceleration strategies in these type of applications.
Shared performance monitor in a multiprocessor system
Chiu, George; Gara, Alan G; Salapura, Valentina
2014-12-02
A performance monitoring unit (PMU) and method for monitoring performance of events occurring in a multiprocessor system. The multiprocessor system comprises a plurality of processor devices units, each processor device for generating signals representing occurrences of events in the processor device, and, a single shared counter resource for performance monitoring. The performance monitor unit is shared by all processor cores in the multiprocessor system. The PMU is further programmed to monitor event signals issued from non-processor devices.
Cooperative system and method using mobile robots for testing a cooperative search controller
Byrne, Raymond H.; Harrington, John J.; Eskridge, Steven E.; Hurtado, John E.
2002-01-01
A test system for testing a controller provides a way to use large numbers of miniature mobile robots to test a cooperative search controller in a test area, where each mobile robot has a sensor, a communication device, a processor, and a memory. A method of using a test system provides a way for testing a cooperative search controller using multiple robots sharing information and communicating over a communication network.
Shared performance monitor in a multiprocessor system
Chiu, George; Gara, Alan G.; Salapura, Valentina
2012-07-24
A performance monitoring unit (PMU) and method for monitoring performance of events occurring in a multiprocessor system. The multiprocessor system comprises a plurality of processor devices units, each processor device for generating signals representing occurrences of events in the processor device, and, a single shared counter resource for performance monitoring. The performance monitor unit is shared by all processor cores in the multiprocessor system. The PMU comprises: a plurality of performance counters each for counting signals representing occurrences of events from one or more the plurality of processor units in the multiprocessor system; and, a plurality of input devices for receiving the event signals from one or more processor devices of the plurality of processor units, the plurality of input devices programmable to select event signals for receipt by one or more of the plurality of performance counters for counting, wherein the PMU is shared between multiple processing units, or within a group of processors in the multiprocessing system. The PMU is further programmed to monitor event signals issued from non-processor devices.
A parallel algorithm for multi-level logic synthesis using the transduction method. M.S. Thesis
NASA Technical Reports Server (NTRS)
Lim, Chieng-Fai
1991-01-01
The Transduction Method has been shown to be a powerful tool in the optimization of multilevel networks. Many tools such as the SYLON synthesis system (X90), (CM89), (LM90) have been developed based on this method. A parallel implementation is presented of SYLON-XTRANS (XM89) on an eight processor Encore Multimax shared memory multiprocessor. It minimizes multilevel networks consisting of simple gates through parallel pruning, gate substitution, gate merging, generalized gate substitution, and gate input reduction. This implementation, called Parallel TRANSduction (PTRANS), also uses partitioning to break large circuits up and performs inter- and intra-partition dynamic load balancing. With this, good speedups and high processor efficiencies are achievable without sacrificing the resulting circuit quality.
A Tutorial on Parallel and Concurrent Programming in Haskell
NASA Astrophysics Data System (ADS)
Peyton Jones, Simon; Singh, Satnam
This practical tutorial introduces the features available in Haskell for writing parallel and concurrent programs. We first describe how to write semi-explicit parallel programs by using annotations to express opportunities for parallelism and to help control the granularity of parallelism for effective execution on modern operating systems and processors. We then describe the mechanisms provided by Haskell for writing explicitly parallel programs with a focus on the use of software transactional memory to help share information between threads. Finally, we show how nested data parallelism can be used to write deterministically parallel programs which allows programmers to use rich data types in data parallel programs which are automatically transformed into flat data parallel versions for efficient execution on multi-core processors.
Teleoperated position control of a PUMA robot
NASA Technical Reports Server (NTRS)
Austin, Edmund; Fong, Chung P.
1987-01-01
A laboratory distributed computer control teleoperator system is developed to support NASA's future space telerobotic operation. This teleoperator system uses a universal force-reflecting hand controller in the local iste as the operator's input device. In the remote site, a PUMA controller recieves the Cartesian position commands and implements PID control laws to position the PUMA robot. The local site uses two microprocessors while the remote site uses three. The processors communicate with each other through shared memory. The PUMA robot controller was interfaced through custom made electronics to bypass VAL. The development status of this teleoperator system is reported. The execution time of each processor is analyzed, and the overall system throughput rate is reported. Methods to improve the efficiency and performance are discussed.
Key Technologies of Phone Storage Forensics Based on ARM Architecture
NASA Astrophysics Data System (ADS)
Zhang, Jianghan; Che, Shengbing
2018-03-01
Smart phones are mainly running Android, IOS and Windows Phone three mobile platform operating systems. The android smart phone has the best market shares and its processor chips are almost ARM software architecture. The chips memory address mapping mechanism of ARM software architecture is different with x86 software architecture. To forensics to android mart phone, we need to understand three key technologies: memory data acquisition, the conversion mechanism from virtual address to the physical address, and find the system’s key data. This article presents a viable solution which does not rely on the operating system API for a complete solution to these three issues.
Development of a Dynamic Time Sharing Scheduled Environment Final Report CRADA No. TC-824-94E
DOE Office of Scientific and Technical Information (OSTI.GOV)
Jette, M.; Caliga, D.
Massively parallel computers, such as the Cray T3D, have historically supported resource sharing solely with space sharing. In that method, multiple problems are solved by executing them on distinct processors. This project developed a dynamic time- and space-sharing scheduler to achieve greater interactivity and throughput than could be achieved with space-sharing alone. CRI and LLNL worked together on the design, testing, and review aspects of this project. There were separate software deliverables. CFU implemented a general purpose scheduling system as per the design specifications. LLNL ported the local gang scheduler software to the LLNL Cray T3D. In this approach, processorsmore » are allocated simultaneously to aU components of a parallel program (in a “gang”). Program execution is preempted as needed to provide for interactivity. Programs are also reIocated to different processors as needed to efficiently pack the computer’s torus of processors. In phase one, CRI developed an interface specification after discussions with LLNL for systemlevel software supporting a time- and space-sharing environment on the LLNL T3D. The two parties also discussed interface specifications for external control tools (such as scheduling policy tools, system administration tools) and applications programs. CRI assumed responsibility for the writing and implementation of all the necessary system software in this phase. In phase two, CRI implemented job-rolling on the Cray T3D, a mechanism for preempting a program, saving its state to disk, and later restoring its state to memory for continued execution. LLNL ported its gang scheduler to the LLNL T3D utilizing the CRI interface implemented in phases one and two. During phase three, the functionality and effectiveness of the LLNL gang scheduler was assessed to provide input to CRI time- and space-sharing, efforts. CRI will utilize this information in the development of general schedulers suitable for other sites and future architectures.« less
NASA Technical Reports Server (NTRS)
Schwab, Andrew J. (Inventor); Aylor, James (Inventor); Hitchcock, Charles Young (Inventor); Wulf, William A. (Inventor); McKee, Sally A. (Inventor); Moyer, Stephen A. (Inventor); Klenke, Robert (Inventor)
2000-01-01
A data processing system is disclosed which comprises a data processor and memory control device for controlling the access of information from the memory. The memory control device includes temporary storage and decision ability for determining what order to execute the memory accesses. The compiler detects the requirements of the data processor and selects the data to stream to the memory control device which determines a memory access order. The order in which to access said information is selected based on the location of information stored in the memory. The information is repeatedly accessed from memory and stored in the temporary storage until all streamed information is accessed. The information is stored until required by the data processor. The selection of the order in which to access information maximizes bandwidth and decreases the retrieval time.
7 CFR 1435.310 - Sharing processors' allocations with producers.
Code of Federal Regulations, 2011 CFR
2011-01-01
... CREDIT CORPORATION, DEPARTMENT OF AGRICULTURE LOANS, PURCHASES, AND OTHER OPERATIONS SUGAR PROGRAM Flexible Marketing Allotments For Sugar § 1435.310 Sharing processors' allocations with producers. (a) Every sugar beet and sugarcane processor must provide CCC a certification that: (1) The processor...
7 CFR 1435.310 - Sharing processors' allocations with producers.
Code of Federal Regulations, 2010 CFR
2010-01-01
... CREDIT CORPORATION, DEPARTMENT OF AGRICULTURE LOANS, PURCHASES, AND OTHER OPERATIONS SUGAR PROGRAM Flexible Marketing Allotments For Sugar § 1435.310 Sharing processors' allocations with producers. (a) Every sugar beet and sugarcane processor must provide CCC a certification that: (1) The processor...
7 CFR 1435.310 - Sharing processors' allocations with producers.
Code of Federal Regulations, 2012 CFR
2012-01-01
... CREDIT CORPORATION, DEPARTMENT OF AGRICULTURE LOANS, PURCHASES, AND OTHER OPERATIONS SUGAR PROGRAM Flexible Marketing Allotments For Sugar § 1435.310 Sharing processors' allocations with producers. (a) Every sugar beet and sugarcane processor must provide CCC a certification that: (1) The processor...
7 CFR 1435.310 - Sharing processors' allocations with producers.
Code of Federal Regulations, 2014 CFR
2014-01-01
... CREDIT CORPORATION, DEPARTMENT OF AGRICULTURE LOANS, PURCHASES, AND OTHER OPERATIONS SUGAR PROGRAM Flexible Marketing Allotments For Sugar § 1435.310 Sharing processors' allocations with producers. (a) Every sugar beet and sugarcane processor must provide CCC a certification that: (1) The processor...
7 CFR 1435.310 - Sharing processors' allocations with producers.
Code of Federal Regulations, 2013 CFR
2013-01-01
... CREDIT CORPORATION, DEPARTMENT OF AGRICULTURE LOANS, PURCHASES, AND OTHER OPERATIONS SUGAR PROGRAM Flexible Marketing Allotments For Sugar § 1435.310 Sharing processors' allocations with producers. (a) Every sugar beet and sugarcane processor must provide CCC a certification that: (1) The processor...
Software Coherence in Multiprocessor Memory Systems. Ph.D. Thesis
NASA Technical Reports Server (NTRS)
Bolosky, William Joseph
1993-01-01
Processors are becoming faster and multiprocessor memory interconnection systems are not keeping up. Therefore, it is necessary to have threads and the memory they access as near one another as possible. Typically, this involves putting memory or caches with the processors, which gives rise to the problem of coherence: if one processor writes an address, any other processor reading that address must see the new value. This coherence can be maintained by the hardware or with software intervention. Systems of both types have been built in the past; the hardware-based systems tended to outperform the software ones. However, the ratio of processor to interconnect speed is now so high that the extra overhead of the software systems may no longer be significant. This issue is explored both by implementing a software maintained system and by introducing and using the technique of offline optimal analysis of memory reference traces. It finds that in properly built systems, software maintained coherence can perform comparably to or even better than hardware maintained coherence. The architectural features necessary for efficient software coherence to be profitable include a small page size, a fast trap mechanism, and the ability to execute instructions while remote memory references are outstanding.
Interpreter composition issues in the formal verification of a processor-memory module
NASA Technical Reports Server (NTRS)
Fura, David A.; Cohen, Gerald C.
1994-01-01
This report describes interpreter composition techniques suitable for the formal specification and verification of a processor-memory module using the HOL theorem proving system. The processor-memory module is a multichip subsystem within a fault-tolerant embedded system under development within the Boeing Defense and Space Group. Modeling and verification methods were developed that permit provably secure composition at the transaction-level of specification, significantly reducing the complexity of the hierarchical verification of the system.
Software-Controlled Caches in the VMP Multiprocessor
1986-03-01
programming system level that Processors is tuned for the VMP design. In this vein, we are interested in exploring how far the software support can go to ...handled in software, analogously to the handling agement of the shared program state is familiar and of virtual memory page faults. Hardware support for...ensure good behavior, as opposed to how Each cache miss results in bus traffic. Table 2 pro- vides the bus cost for the "average" cache miss. Fig
Thread Migration in the Presence of Pointers
NASA Technical Reports Server (NTRS)
Cronk, David; Haines, Matthew; Mehrotra, Piyush
1996-01-01
Dynamic migration of lightweight threads supports both data locality and load balancing. However, migrating threads that contain pointers referencing data in both the stack and heap remains an open problem. In this paper we describe a technique by which threads with pointers referencing both stack and non-shared heap data can be migrated such that the pointers remain valid after migration. As a result, threads containing pointers can now be migrated between processors in a homogeneous distributed memory environment.
Shehzad, Danish; Bozkuş, Zeki
2016-01-01
Increase in complexity of neuronal network models escalated the efforts to make NEURON simulation environment efficient. The computational neuroscientists divided the equations into subnets amongst multiple processors for achieving better hardware performance. On parallel machines for neuronal networks, interprocessor spikes exchange consumes large section of overall simulation time. In NEURON for communication between processors Message Passing Interface (MPI) is used. MPI_Allgather collective is exercised for spikes exchange after each interval across distributed memory systems. The increase in number of processors though results in achieving concurrency and better performance but it inversely affects MPI_Allgather which increases communication time between processors. This necessitates improving communication methodology to decrease the spikes exchange time over distributed memory systems. This work has improved MPI_Allgather method using Remote Memory Access (RMA) by moving two-sided communication to one-sided communication, and use of recursive doubling mechanism facilitates achieving efficient communication between the processors in precise steps. This approach enhanced communication concurrency and has improved overall runtime making NEURON more efficient for simulation of large neuronal network models.
Bozkuş, Zeki
2016-01-01
Increase in complexity of neuronal network models escalated the efforts to make NEURON simulation environment efficient. The computational neuroscientists divided the equations into subnets amongst multiple processors for achieving better hardware performance. On parallel machines for neuronal networks, interprocessor spikes exchange consumes large section of overall simulation time. In NEURON for communication between processors Message Passing Interface (MPI) is used. MPI_Allgather collective is exercised for spikes exchange after each interval across distributed memory systems. The increase in number of processors though results in achieving concurrency and better performance but it inversely affects MPI_Allgather which increases communication time between processors. This necessitates improving communication methodology to decrease the spikes exchange time over distributed memory systems. This work has improved MPI_Allgather method using Remote Memory Access (RMA) by moving two-sided communication to one-sided communication, and use of recursive doubling mechanism facilitates achieving efficient communication between the processors in precise steps. This approach enhanced communication concurrency and has improved overall runtime making NEURON more efficient for simulation of large neuronal network models. PMID:27413363
Stateless and stateful implementations of faithful execution
Pierson, Lyndon G; Witzke, Edward L; Tarman, Thomas D; Robertson, Perry J; Eldridge, John M; Campbell, Philip L
2014-12-16
A faithful execution system includes system memory, a target processor, and protection engine. The system memory stores a ciphertext including value fields and integrity fields. The value fields each include an encrypted executable instruction and the integrity fields each include an encrypted integrity value for determining whether a corresponding one of the value fields has been modified. The target processor executes plaintext instructions decoded from the ciphertext while the protection engine is coupled between the system memory and the target processor. The protection engine includes logic to retrieve the ciphertext from the system memory, decrypt the value fields into the plaintext instructions, perform an integrity check based on the integrity fields to determine whether any of the corresponding value fields have been modified, and provide the plaintext instructions to the target processor for execution.
CoNNeCT Baseband Processor Module
NASA Technical Reports Server (NTRS)
Yamamoto, Clifford K; Jedrey, Thomas C.; Gutrich, Daniel G.; Goodpasture, Richard L.
2011-01-01
A document describes the CoNNeCT Baseband Processor Module (BPM) based on an updated processor, memory technology, and field-programmable gate arrays (FPGAs). The BPM was developed from a requirement to provide sufficient computing power and memory storage to conduct experiments for a Software Defined Radio (SDR) to be implemented. The flight SDR uses the AT697 SPARC processor with on-chip data and instruction cache. The non-volatile memory has been increased from a 20-Mbit EEPROM (electrically erasable programmable read only memory) to a 4-Gbit Flash, managed by the RTAX2000 Housekeeper, allowing more programs and FPGA bit-files to be stored. The volatile memory has been increased from a 20-Mbit SRAM (static random access memory) to a 1.25-Gbit SDRAM (synchronous dynamic random access memory), providing additional memory space for more complex operating systems and programs to be executed on the SPARC. All memory is EDAC (error detection and correction) protected, while the SPARC processor implements fault protection via TMR (triple modular redundancy) architecture. Further capability over prior BPM designs includes the addition of a second FPGA to implement features beyond the resources of a single FPGA. Both FPGAs are implemented with Xilinx Virtex-II and are interconnected by a 96-bit bus to facilitate data exchange. Dedicated 1.25- Gbit SDRAMs are wired to each Xilinx FPGA to accommodate high rate data buffering for SDR applications as well as independent SpaceWire interfaces. The RTAX2000 manages scrub and configuration of each Xilinx.
NASA Astrophysics Data System (ADS)
Leggett, C.; Binet, S.; Jackson, K.; Levinthal, D.; Tatarkhanov, M.; Yao, Y.
2011-12-01
Thermal limitations have forced CPU manufacturers to shift from simply increasing clock speeds to improve processor performance, to producing chip designs with multi- and many-core architectures. Further the cores themselves can run multiple threads as a zero overhead context switch allowing low level resource sharing (Intel Hyperthreading). To maximize bandwidth and minimize memory latency, memory access has become non uniform (NUMA). As manufacturers add more cores to each chip, a careful understanding of the underlying architecture is required in order to fully utilize the available resources. We present AthenaMP and the Atlas event loop manager, the driver of the simulation and reconstruction engines, which have been rewritten to make use of multiple cores, by means of event based parallelism, and final stage I/O synchronization. However, initial studies on 8 andl6 core Intel architectures have shown marked non-linearities as parallel process counts increase, with as much as 30% reductions in event throughput in some scenarios. Since the Intel Nehalem architecture (both Gainestown and Westmere) will be the most common choice for the next round of hardware procurements, an understanding of these scaling issues is essential. Using hardware based event counters and Intel's Performance Tuning Utility, we have studied the performance bottlenecks at the hardware level, and discovered optimization schemes to maximize processor throughput. We have also produced optimization mechanisms, common to all large experiments, that address the extreme nature of today's HEP code, which due to it's size, places huge burdens on the memory infrastructure of today's processors.
An efficient parallel-processing method for transposing large matrices in place.
Portnoff, M R
1999-01-01
We have developed an efficient algorithm for transposing large matrices in place. The algorithm is efficient because data are accessed either sequentially in blocks or randomly within blocks small enough to fit in cache, and because the same indexing calculations are shared among identical procedures operating on independent subsets of the data. This inherent parallelism makes the method well suited for a multiprocessor computing environment. The algorithm is easy to implement because the same two procedures are applied to the data in various groupings to carry out the complete transpose operation. Using only a single processor, we have demonstrated nearly an order of magnitude increase in speed over the previously published algorithm by Gate and Twigg for transposing a large rectangular matrix in place. With multiple processors operating in parallel, the processing speed increases almost linearly with the number of processors. A simplified version of the algorithm for square matrices is presented as well as an extension for matrices large enough to require virtual memory.
NASA Technical Reports Server (NTRS)
Jost, Gabriele; Labarta, Jesus; Gimenez, Judit
2004-01-01
With the current trend in parallel computer architectures towards clusters of shared memory symmetric multi-processors, parallel programming techniques have evolved that support parallelism beyond a single level. When comparing the performance of applications based on different programming paradigms, it is important to differentiate between the influence of the programming model itself and other factors, such as implementation specific behavior of the operating system (OS) or architectural issues. Rewriting-a large scientific application in order to employ a new programming paradigms is usually a time consuming and error prone task. Before embarking on such an endeavor it is important to determine that there is really a gain that would not be possible with the current implementation. A detailed performance analysis is crucial to clarify these issues. The multilevel programming paradigms considered in this study are hybrid MPI/OpenMP, MLP, and nested OpenMP. The hybrid MPI/OpenMP approach is based on using MPI [7] for the coarse grained parallelization and OpenMP [9] for fine grained loop level parallelism. The MPI programming paradigm assumes a private address space for each process. Data is transferred by explicitly exchanging messages via calls to the MPI library. This model was originally designed for distributed memory architectures but is also suitable for shared memory systems. The second paradigm under consideration is MLP which was developed by Taft. The approach is similar to MPi/OpenMP, using a mix of coarse grain process level parallelization and loop level OpenMP parallelization. As it is the case with MPI, a private address space is assumed for each process. The MLP approach was developed for ccNUMA architectures and explicitly takes advantage of the availability of shared memory. A shared memory arena which is accessible by all processes is required. Communication is done by reading from and writing to the shared memory.
NASA Astrophysics Data System (ADS)
Liu, Jiping; Kang, Xiaochen; Dong, Chun; Xu, Shenghua
2017-12-01
Surface area estimation is a widely used tool for resource evaluation in the physical world. When processing large scale spatial data, the input/output (I/O) can easily become the bottleneck in parallelizing the algorithm due to the limited physical memory resources and the very slow disk transfer rate. In this paper, we proposed a stream tilling approach to surface area estimation that first decomposed a spatial data set into tiles with topological expansions. With these tiles, the one-to-one mapping relationship between the input and the computing process was broken. Then, we realized a streaming framework towards the scheduling of the I/O processes and computing units. Herein, each computing unit encapsulated a same copy of the estimation algorithm, and multiple asynchronous computing units could work individually in parallel. Finally, the performed experiment demonstrated that our stream tilling estimation can efficiently alleviate the heavy pressures from the I/O-bound work, and the measured speedup after being optimized have greatly outperformed the directly parallel versions in shared memory systems with multi-core processors.
Optimization of image processing algorithms on mobile platforms
NASA Astrophysics Data System (ADS)
Poudel, Pramod; Shirvaikar, Mukul
2011-03-01
This work presents a technique to optimize popular image processing algorithms on mobile platforms such as cell phones, net-books and personal digital assistants (PDAs). The increasing demand for video applications like context-aware computing on mobile embedded systems requires the use of computationally intensive image processing algorithms. The system engineer has a mandate to optimize them so as to meet real-time deadlines. A methodology to take advantage of the asymmetric dual-core processor, which includes an ARM and a DSP core supported by shared memory, is presented with implementation details. The target platform chosen is the popular OMAP 3530 processor for embedded media systems. It has an asymmetric dual-core architecture with an ARM Cortex-A8 and a TMS320C64x Digital Signal Processor (DSP). The development platform was the BeagleBoard with 256 MB of NAND RAM and 256 MB SDRAM memory. The basic image correlation algorithm is chosen for benchmarking as it finds widespread application for various template matching tasks such as face-recognition. The basic algorithm prototypes conform to OpenCV, a popular computer vision library. OpenCV algorithms can be easily ported to the ARM core which runs a popular operating system such as Linux or Windows CE. However, the DSP is architecturally more efficient at handling DFT algorithms. The algorithms are tested on a variety of images and performance results are presented measuring the speedup obtained due to dual-core implementation. A major advantage of this approach is that it allows the ARM processor to perform important real-time tasks, while the DSP addresses performance-hungry algorithms.
Multiple Microcomputer Control Algorithm.
1979-09-01
discrete and semaphore supervisor calls can be used with tasks in separate processors, in which case they are maintained in shared memory. Operations on ...the source or destination operand specifier of each mode in most cases . However, four of the 16 general register addressing modes and one of the 8 pro...instruction time is based on the specified usage factors and the best cast, and worst case execution times for the instruc- 1I 5 1NAVTRAEQZJ1PCrN M’.V7~j
NASA Technical Reports Server (NTRS)
Smith, T. B., Jr.; Lala, J. H.
1983-01-01
The basic organization of the fault tolerant multiprocessor, (FTMP) is that of a general purpose homogeneous multiprocessor. Three processors operate on a shared system (memory and I/O) bus. Replication and tight synchronization of all elements and hardware voting is employed to detect and correct any single fault. Reconfiguration is then employed to repair a fault. Multiple faults may be tolerated as a sequence of single faults with repair between fault occurrences.
Parallelization of a Monte Carlo particle transport simulation code
NASA Astrophysics Data System (ADS)
Hadjidoukas, P.; Bousis, C.; Emfietzoglou, D.
2010-05-01
We have developed a high performance version of the Monte Carlo particle transport simulation code MC4. The original application code, developed in Visual Basic for Applications (VBA) for Microsoft Excel, was first rewritten in the C programming language for improving code portability. Several pseudo-random number generators have been also integrated and studied. The new MC4 version was then parallelized for shared and distributed-memory multiprocessor systems using the Message Passing Interface. Two parallel pseudo-random number generator libraries (SPRNG and DCMT) have been seamlessly integrated. The performance speedup of parallel MC4 has been studied on a variety of parallel computing architectures including an Intel Xeon server with 4 dual-core processors, a Sun cluster consisting of 16 nodes of 2 dual-core AMD Opteron processors and a 200 dual-processor HP cluster. For large problem size, which is limited only by the physical memory of the multiprocessor server, the speedup results are almost linear on all systems. We have validated the parallel implementation against the serial VBA and C implementations using the same random number generator. Our experimental results on the transport and energy loss of electrons in a water medium show that the serial and parallel codes are equivalent in accuracy. The present improvements allow for studying of higher particle energies with the use of more accurate physical models, and improve statistics as more particles tracks can be simulated in low response time.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sancho Pitarch, Jose Carlos; Kerbyson, Darren; Lang, Mike
Increasing the core-count on current and future processors is posing critical challenges to the memory subsystem to efficiently handle concurrent memory requests. The current trend to cope with this challenge is to increase the number of memory channels available to the processor's memory controller. In this paper we investigate the effectiveness of this approach on the performance of parallel scientific applications. Specifically, we explore the trade-off between employing multiple memory channels per memory controller and the use of multiple memory controllers. Experiments conducted on two current state-of-the-art multicore processors, a 6-core AMD Istanbul and a 4-core Intel Nehalem-EP, for amore » wide range of production applications shows that there is a diminishing return when increasing the number of memory channels per memory controller. In addition, we show that this performance degradation can be efficiently addressed by increasing the ratio of memory controllers to channels while keeping the number of memory channels constant. Significant performance improvements can be achieved in this scheme, up to 28%, in the case of using two memory controllers with each with one channel compared with one controller with two memory channels.« less
NASA Technical Reports Server (NTRS)
Ierotheou, C.; Johnson, S.; Leggett, P.; Cross, M.; Evans, E.; Jin, Hao-Qiang; Frumkin, M.; Yan, J.; Biegel, Bryan (Technical Monitor)
2001-01-01
The shared-memory programming model is a very effective way to achieve parallelism on shared memory parallel computers. Historically, the lack of a programming standard for using directives and the rather limited performance due to scalability have affected the take-up of this programming model approach. Significant progress has been made in hardware and software technologies, as a result the performance of parallel programs with compiler directives has also made improvements. The introduction of an industrial standard for shared-memory programming with directives, OpenMP, has also addressed the issue of portability. In this study, we have extended the computer aided parallelization toolkit (developed at the University of Greenwich), to automatically generate OpenMP based parallel programs with nominal user assistance. We outline the way in which loop types are categorized and how efficient OpenMP directives can be defined and placed using the in-depth interprocedural analysis that is carried out by the toolkit. We also discuss the application of the toolkit on the NAS Parallel Benchmarks and a number of real-world application codes. This work not only demonstrates the great potential of using the toolkit to quickly parallelize serial programs but also the good performance achievable on up to 300 processors for hybrid message passing and directive-based parallelizations.
SKIRT: Hybrid parallelization of radiative transfer simulations
NASA Astrophysics Data System (ADS)
Verstocken, S.; Van De Putte, D.; Camps, P.; Baes, M.
2017-07-01
We describe the design, implementation and performance of the new hybrid parallelization scheme in our Monte Carlo radiative transfer code SKIRT, which has been used extensively for modelling the continuum radiation of dusty astrophysical systems including late-type galaxies and dusty tori. The hybrid scheme combines distributed memory parallelization, using the standard Message Passing Interface (MPI) to communicate between processes, and shared memory parallelization, providing multiple execution threads within each process to avoid duplication of data structures. The synchronization between multiple threads is accomplished through atomic operations without high-level locking (also called lock-free programming). This improves the scaling behaviour of the code and substantially simplifies the implementation of the hybrid scheme. The result is an extremely flexible solution that adjusts to the number of available nodes, processors and memory, and consequently performs well on a wide variety of computing architectures.
Parallel machine architecture for production rule systems
Allen, Jr., John D.; Butler, Philip L.
1989-01-01
A parallel processing system for production rule programs utilizes a host processor for storing production rule right hand sides (RHS) and a plurality of rule processors for storing left hand sides (LHS). The rule processors operate in parallel in the recognize phase of the system recognize -Act Cycle to match their respective LHS's against a stored list of working memory elements (WME) in order to find a self consistent set of WME's. The list of WME is dynamically varied during the Act phase of the system in which the host executes or fires rule RHS's for those rules for which a self-consistent set has been found by the rule processors. The host transmits instructions for creating or deleting working memory elements as dictated by the rule firings until the rule processors are unable to find any further self-consistent working memory element sets at which time the production rule system is halted.
Testing and operating a multiprocessor chip with processor redundancy
Bellofatto, Ralph E; Douskey, Steven M; Haring, Rudolf A; McManus, Moyra K; Ohmacht, Martin; Schmunkamp, Dietmar; Sugavanam, Krishnan; Weatherford, Bryan J
2014-10-21
A system and method for improving the yield rate of a multiprocessor semiconductor chip that includes primary processor cores and one or more redundant processor cores. A first tester conducts a first test on one or more processor cores, and encodes results of the first test in an on-chip non-volatile memory. A second tester conducts a second test on the processor cores, and encodes results of the second test in an external non-volatile storage device. An override bit of a multiplexer is set if a processor core fails the second test. In response to the override bit, the multiplexer selects a physical-to-logical mapping of processor IDs according to one of: the encoded results in the memory device or the encoded results in the external storage device. On-chip logic configures the processor cores according to the selected physical-to-logical mapping.
Performance prediction: A case study using a multi-ring KSR-1 machine
NASA Technical Reports Server (NTRS)
Sun, Xian-He; Zhu, Jianping
1995-01-01
While computers with tens of thousands of processors have successfully delivered high performance power for solving some of the so-called 'grand-challenge' applications, the notion of scalability is becoming an important metric in the evaluation of parallel machine architectures and algorithms. In this study, the prediction of scalability and its application are carefully investigated. A simple formula is presented to show the relation between scalability, single processor computing power, and degradation of parallelism. A case study is conducted on a multi-ring KSR1 shared virtual memory machine. Experimental and theoretical results show that the influence of topology variation of an architecture is predictable. Therefore, the performance of an algorithm on a sophisticated, heirarchical architecture can be predicted and the best algorithm-machine combination can be selected for a given application.
Low latency messages on distributed memory multiprocessors
NASA Technical Reports Server (NTRS)
Rosing, Matthew; Saltz, Joel
1993-01-01
Many of the issues in developing an efficient interface for communication on distributed memory machines are described and a portable interface is proposed. Although the hardware component of message latency is less than one microsecond on many distributed memory machines, the software latency associated with sending and receiving typed messages is on the order of 50 microseconds. The reason for this imbalance is that the software interface does not match the hardware. By changing the interface to match the hardware more closely, applications with fine grained communication can be put on these machines. Based on several tests that were run on the iPSC/860, an interface that will better match current distributed memory machines is proposed. The model used in the proposed interface consists of a computation processor and a communication processor on each node. Communication between these processors and other nodes in the system is done through a buffered network. Information that is transmitted is either data or procedures to be executed on the remote processor. The dual processor system is better suited for efficiently handling asynchronous communications compared to a single processor system. The ability to send data or procedure is very flexible for minimizing message latency, based on the type of communication being performed. The test performed and the proposed interface are described.
Multiprocessor switch with selective pairing
Gara, Alan; Gschwind, Michael K; Salapura, Valentina
2014-03-11
System, method and computer program product for a multiprocessing system to offer selective pairing of processor cores for increased processing reliability. A selective pairing facility is provided that selectively connects, i.e., pairs, multiple microprocessor or processor cores to provide one highly reliable thread (or thread group). Each paired microprocessor or processor cores that provide one highly reliable thread for high-reliability connect with a system components such as a memory "nest" (or memory hierarchy), an optional system controller, and optional interrupt controller, optional I/O or peripheral devices, etc. The memory nest is attached to a selective pairing facility via a switch or a bus
Parallel Computation of the Regional Ocean Modeling System (ROMS)
DOE Office of Scientific and Technical Information (OSTI.GOV)
Wang, P; Song, Y T; Chao, Y
2005-04-05
The Regional Ocean Modeling System (ROMS) is a regional ocean general circulation modeling system solving the free surface, hydrostatic, primitive equations over varying topography. It is free software distributed world-wide for studying both complex coastal ocean problems and the basin-to-global scale ocean circulation. The original ROMS code could only be run on shared-memory systems. With the increasing need to simulate larger model domains with finer resolutions and on a variety of computer platforms, there is a need in the ocean-modeling community to have a ROMS code that can be run on any parallel computer ranging from 10 to hundreds ofmore » processors. Recently, we have explored parallelization for ROMS using the MPI programming model. In this paper, an efficient parallelization strategy for such a large-scale scientific software package, based on an existing shared-memory computing model, is presented. In addition, scientific applications and data-performance issues on a couple of SGI systems, including Columbia, the world's third-fastest supercomputer, are discussed.« less
Optical backplane interconnect switch for data processors and computers
NASA Technical Reports Server (NTRS)
Hendricks, Herbert D.; Benz, Harry F.; Hammer, Jacob M.
1989-01-01
An optoelectronic integrated device design is reported which can be used to implement an all-optical backplane interconnect switch. The switch is sized to accommodate an array of processors and memories suitable for direct replacement into the basic avionic multiprocessor backplane. The optical backplane interconnect switch is also suitable for direct replacement of the PI bus traffic switch and at the same time, suitable for supporting pipelining of the processor and memory. The 32 bidirectional switchable interconnects are configured with broadcast capability for controls, reconfiguration, and messages. The approach described here can handle a serial interconnection of data processors or a line-to-link interconnection of data processors. An optical fiber demonstration of this approach is presented.
Scalable Motion Estimation Processor Core for Multimedia System-on-Chip Applications
NASA Astrophysics Data System (ADS)
Lai, Yeong-Kang; Hsieh, Tian-En; Chen, Lien-Fei
2007-04-01
In this paper, we describe a high-throughput and scalable motion estimation processor architecture for multimedia system-on-chip applications. The number of processing elements (PEs) is scalable according to the variable algorithm parameters and the performance required for different applications. Using the PE rings efficiently and an intelligent memory-interleaving organization, the efficiency of the architecture can be increased. Moreover, using efficient on-chip memories and a data management technique can effectively decrease the power consumption and memory bandwidth. Techniques for reducing the number of interconnections and external memory accesses are also presented. Our results demonstrate that the proposed scalable PE-ringed architecture is a flexible and high-performance processor core in multimedia system-on-chip applications.
Efficient Parallel Algorithms on Restartable Fail-Stop Processors
1991-01-01
resource (memory), and ( 3 ) that processors, memory and their interconnection must be The model of parallel computation known as the Par- perfectly...setting), arid ure an(I restart errors. We describe these arguments if] [AAtPS 871 (in a deterministic setting). Fault-tolerance Section 3 . of...grannmarity at the processor level --- for recent work on where Al is the nmber of failures during this step’s gate granilarities see [All 90, Pip 85
Multiprocessor architecture: Synthesis and evaluation
NASA Technical Reports Server (NTRS)
Standley, Hilda M.
1990-01-01
Multiprocessor computed architecture evaluation for structural computations is the focus of the research effort described. Results obtained are expected to lead to more efficient use of existing architectures and to suggest designs for new, application specific, architectures. The brief descriptions given outline a number of related efforts directed toward this purpose. The difficulty is analyzing an existing architecture or in designing a new computer architecture lies in the fact that the performance of a particular architecture, within the context of a given application, is determined by a number of factors. These include, but are not limited to, the efficiency of the computation algorithm, the programming language and support environment, the quality of the program written in the programming language, the multiplicity of the processing elements, the characteristics of the individual processing elements, the interconnection network connecting processors and non-local memories, and the shared memory organization covering the spectrum from no shared memory (all local memory) to one global access memory. These performance determiners may be loosely classified as being software or hardware related. This distinction is not clear or even appropriate in many cases. The effect of the choice of algorithm is ignored by assuming that the algorithm is specified as given. Effort directed toward the removal of the effect of the programming language and program resulted in the design of a high-level parallel programming language. Two characteristics of the fundamental structure of the architecture (memory organization and interconnection network) are examined.
Hybrid Parallelism for Volume Rendering on Large-, Multi-, and Many-Core Systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Howison, Mark; Bethel, E. Wes; Childs, Hank
2012-01-01
With the computing industry trending towards multi- and many-core processors, we study how a standard visualization algorithm, ray-casting volume rendering, can benefit from a hybrid parallelism approach. Hybrid parallelism provides the best of both worlds: using distributed-memory parallelism across a large numbers of nodes increases available FLOPs and memory, while exploiting shared-memory parallelism among the cores within each node ensures that each node performs its portion of the larger calculation as efficiently as possible. We demonstrate results from weak and strong scaling studies, at levels of concurrency ranging up to 216,000, and with datasets as large as 12.2 trillion cells.more » The greatest benefit from hybrid parallelism lies in the communication portion of the algorithm, the dominant cost at higher levels of concurrency. We show that reducing the number of participants with a hybrid approach significantly improves performance.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
Feo, J.T.
1993-10-01
This report contain papers on: Programmability and performance issues; The case of an iterative partial differential equation solver; Implementing the kernal of the Australian Region Weather Prediction Model in Sisal; Even and quarter-even prime length symmetric FFTs and their Sisal Implementations; Top-down thread generation for Sisal; Overlapping communications and computations on NUMA architechtures; Compiling technique based on dataflow analysis for funtional programming language Valid; Copy elimination for true multidimensional arrays in Sisal 2.0; Increasing parallelism for an optimization that reduces copying in IF2 graphs; Caching in on Sisal; Cache performance of Sisal Vs. FORTRAN; FFT algorithms on a shared-memory multiprocessor;more » A parallel implementation of nonnumeric search problems in Sisal; Computer vision algorithms in Sisal; Compilation of Sisal for a high-performance data driven vector processor; Sisal on distributed memory machines; A virtual shared addressing system for distributed memory Sisal; Developing a high-performance FFT algorithm in Sisal for a vector supercomputer; Implementation issues for IF2 on a static data-flow architechture; and Systematic control of parallelism in array-based data-flow computation. Selected papers have been indexed separately for inclusion in the Energy Science and Technology Database.« less
State recovery and lockstep execution restart in a system with multiprocessor pairing
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gara, Alan; Gschwind, Michael K; Salapura, Valentina
System, method and computer program product for a multiprocessing system to offer selective pairing of processor cores for increased processing reliability. A selective pairing facility is provided that selectively connects, i.e., pairs, multiple microprocessor or processor cores to provide one highly reliable thread (or thread group). Each paired microprocessor or processor cores that provide one highly reliable thread for high-reliability connect with a system components such as a memory "nest" (or memory hierarchy), an optional system controller, and optional interrupt controller, optional I/O or peripheral devices, etc. The memory nest is attached to a selective pairing facility via a switchmore » or a bus. Each selectively paired processor core is includes a transactional execution facility, whereing the system is configured to enable processor rollback to a previous state and reinitialize lockstep execution in order to recover from an incorrect execution when an incorrect execution has been detected by the selective pairing facility.« less
NASA Astrophysics Data System (ADS)
Akil, Mohamed
2017-05-01
The real-time processing is getting more and more important in many image processing applications. Image segmentation is one of the most fundamental tasks image analysis. As a consequence, many different approaches for image segmentation have been proposed. The watershed transform is a well-known image segmentation tool. The watershed transform is a very data intensive task. To achieve acceleration and obtain real-time processing of watershed algorithms, parallel architectures and programming models for multicore computing have been developed. This paper focuses on the survey of the approaches for parallel implementation of sequential watershed algorithms on multicore general purpose CPUs: homogeneous multicore processor with shared memory. To achieve an efficient parallel implementation, it's necessary to explore different strategies (parallelization/distribution/distributed scheduling) combined with different acceleration and optimization techniques to enhance parallelism. In this paper, we give a comparison of various parallelization of sequential watershed algorithms on shared memory multicore architecture. We analyze the performance measurements of each parallel implementation and the impact of the different sources of overhead on the performance of the parallel implementations. In this comparison study, we also discuss the advantages and disadvantages of the parallel programming models. Thus, we compare the OpenMP (an application programming interface for multi-Processing) with Ptheads (POSIX Threads) to illustrate the impact of each parallel programming model on the performance of the parallel implementations.
Method and apparatus for managing access to a memory
DOE Office of Scientific and Technical Information (OSTI.GOV)
DeBenedictis, Erik
A method and apparatus for managing access to a memory of a computing system. A controller transforms a plurality of operations that represent a computing job into an operational memory layout that reduces a size of a selected portion of the memory that needs to be accessed to perform the computing job. The controller stores the operational memory layout in a plurality of memory cells within the selected portion of the memory. The controller controls a sequence by which a processor in the computing system accesses the memory to perform the computing job using the operational memory layout. The operationalmore » memory layout reduces an amount of energy consumed by the processor to perform the computing job.« less
SALT: The Simulator for the Analysis of LWP Timing
NASA Technical Reports Server (NTRS)
Springer, Paul L.; Rodrigues, Arun; Brockman, Jay
2006-01-01
With the emergence of new processor architectures that are highly multithreaded, and support features such as full/empty memory semantics and split-phase memory transactions, the need for a processor simulator to handle these features becomes apparent. This paper describes such a simulator, called SALT.
Communications systems and methods for subsea processors
Gutierrez, Jose; Pereira, Luis
2016-04-26
A subsea processor may be located near the seabed of a drilling site and used to coordinate operations of underwater drilling components. The subsea processor may be enclosed in a single interchangeable unit that fits a receptor on an underwater drilling component, such as a blow-out preventer (BOP). The subsea processor may issue commands to control the BOP and receive measurements from sensors located throughout the BOP. A shared communications bus may interconnect the subsea processor and underwater components and the subsea processor and a surface or onshore network. The shared communications bus may be operated according to a time division multiple access (TDMA) scheme.
NASA Technical Reports Server (NTRS)
Clement, Bradley J.; Estlin, Tara A.; Bornstein, Benjamin J.
2013-01-01
The Mobile Thread Task Manager (MTTM) is being applied to parallelizing existing flight software to understand the benefits and to develop new techniques and architectural concepts for adapting software to multicore architectures. It allocates and load-balances tasks for a group of threads that migrate across processors to improve cache performance. In order to balance-load across threads, the MTTM augments a basic map-reduce strategy to draw jobs from a global queue. In a multicore processor, memory may be "homed" to the cache of a specific processor and must be accessed from that processor. The MTTB architecture wraps access to data with thread management to move threads to the home processor for that data so that the computation follows the data in an attempt to avoid L2 cache misses. Cache homing is also handled by a memory manager that translates identifiers to processor IDs where the data will be homed (according to rules defined by the user). The user can also specify the number of threads and processors separately, which is important for tuning performance for different patterns of computation and memory access. MTTM efficiently processes tasks in parallel on a multiprocessor computer. It also provides an interface to make it easier to adapt existing software to a multiprocessor environment.
Fast 2D FWI on a multi and many-cores workstation.
NASA Astrophysics Data System (ADS)
Thierry, Philippe; Donno, Daniela; Noble, Mark
2014-05-01
Following the introduction of x86 co-processors (Xeon Phi) and the performance increase of standard 2-socket workstations using the latest 12 cores E5-v2 x86-64 CPU, we present here a MPI + OpenMP implementation of an acoustic 2D FWI (full waveform inversion) code which simultaneously runs on the CPUs and on the co-processors installed in a workstation. The main advantage of running a 2D FWI on a workstation is to be able to quickly evaluate new features such as more complicated wave equations, new cost functions, finite-difference stencils or boundary conditions. Since the co-processor is made of 61 in-order x86 cores, each of them having up to 4 threads, this many-core can be seen as a shared memory SMP (symmetric multiprocessing) machine with its own IP address. Depending on the vendor, a single workstation can handle several co-processors making the workstation as a personal cluster under the desk. The original Fortran 90 CPU version of the 2D FWI code is just recompiled to get a Xeon Phi x86 binary. This multi and many-core configuration uses standard compilers and associated MPI as well as math libraries under Linux; therefore, the cost of code development remains constant, while improving computation time. We choose to implement the code with the so-called symmetric mode to fully use the capacity of the workstation, but we also evaluate the scalability of the code in native mode (i.e running only on the co-processor) thanks to the Linux ssh and NFS capabilities. Usual care of optimization and SIMD vectorization is used to ensure optimal performances, and to analyze the application performances and bottlenecks on both platforms. The 2D FWI implementation uses finite-difference time-domain forward modeling and a quasi-Newton (with L-BFGS algorithm) optimization scheme for the model parameters update. Parallelization is achieved through standard MPI shot gathers distribution and OpenMP for domain decomposition within the co-processor. Taking advantage of the 16 GB of memory available on the co-processor we are able to keep wavefields in memory to achieve the gradient computation by cross-correlation of forward and back-propagated wavefields needed by our time-domain FWI scheme, without heavy traffic on the i/o subsystem and PCIe bus. In this presentation we will also review some simple methodologies to determine performance expectation compared to real performances in order to get optimization effort estimation before starting any huge modification or rewriting of research codes. The key message is the ease of use and development of this hybrid configuration to reach not the absolute peak performance value but the optimal one that ensures the best balance between geophysical and computer developments.
On-board landmark navigation and attitude reference parallel processor system
NASA Technical Reports Server (NTRS)
Gilbert, L. E.; Mahajan, D. T.
1978-01-01
An approach to autonomous navigation and attitude reference for earth observing spacecraft is described along with the landmark identification technique based on a sequential similarity detection algorithm (SSDA). Laboratory experiments undertaken to determine if better than one pixel accuracy in registration can be achieved consistent with onboard processor timing and capacity constraints are included. The SSDA is implemented using a multi-microprocessor system including synchronization logic and chip library. The data is processed in parallel stages, effectively reducing the time to match the small known image within a larger image as seen by the onboard image system. Shared memory is incorporated in the system to help communicate intermediate results among microprocessors. The functions include finding mean values and summation of absolute differences over the image search area. The hardware is a low power, compact unit suitable to onboard application with the flexibility to provide for different parameters depending upon the environment.
Description and Simulation of a Fast Packet Switch Architecture for Communication Satellites
NASA Technical Reports Server (NTRS)
Quintana, Jorge A.; Lizanich, Paul J.
1995-01-01
The NASA Lewis Research Center has been developing the architecture for a multichannel communications signal processing satellite (MCSPS) as part of a flexible, low-cost meshed-VSAT (very small aperture terminal) network. The MCSPS architecture is based on a multifrequency, time-division-multiple-access (MF-TDMA) uplink and a time-division multiplex (TDM) downlink. There are eight uplink MF-TDMA beams, and eight downlink TDM beams, with eight downlink dwells per beam. The information-switching processor, which decodes, stores, and transmits each packet of user data to the appropriate downlink dwell onboard the satellite, has been fully described by using VHSIC (Very High Speed Integrated-Circuit) Hardware Description Language (VHDL). This VHDL code, which was developed in-house to simulate the information switching processor, showed that the architecture is both feasible and viable. This paper describes a shared-memory-per-beam architecture, its VHDL implementation, and the simulation efforts.
Automation of Data Traffic Control on DSM Architecture
NASA Technical Reports Server (NTRS)
Frumkin, Michael; Jin, Hao-Qiang; Yan, Jerry
2001-01-01
The design of distributed shared memory (DSM) computers liberates users from the duty to distribute data across processors and allows for the incremental development of parallel programs using, for example, OpenMP or Java threads. DSM architecture greatly simplifies the development of parallel programs having good performance on a few processors. However, to achieve a good program scalability on DSM computers requires that the user understand data flow in the application and use various techniques to avoid data traffic congestions. In this paper we discuss a number of such techniques, including data blocking, data placement, data transposition and page size control and evaluate their efficiency on the NAS (NASA Advanced Supercomputing) Parallel Benchmarks. We also present a tool which automates the detection of constructs causing data congestions in Fortran array oriented codes and advises the user on code transformations for improving data traffic in the application.
Parallel programming with Easy Java Simulations
NASA Astrophysics Data System (ADS)
Esquembre, F.; Christian, W.; Belloni, M.
2018-01-01
Nearly all of today's processors are multicore, and ideally programming and algorithm development utilizing the entire processor should be introduced early in the computational physics curriculum. Parallel programming is often not introduced because it requires a new programming environment and uses constructs that are unfamiliar to many teachers. We describe how we decrease the barrier to parallel programming by using a java-based programming environment to treat problems in the usual undergraduate curriculum. We use the easy java simulations programming and authoring tool to create the program's graphical user interface together with objects based on those developed by Kaminsky [Building Parallel Programs (Course Technology, Boston, 2010)] to handle common parallel programming tasks. Shared-memory parallel implementations of physics problems, such as time evolution of the Schrödinger equation, are available as source code and as ready-to-run programs from the AAPT-ComPADRE digital library.
High speed optical object recognition processor with massive holographic memory
NASA Technical Reports Server (NTRS)
Chao, T.; Zhou, H.; Reyes, G.
2002-01-01
Real-time object recognition using a compact grayscale optical correlator will be introduced. A holographic memory module for storing a large bank of optimum correlation filters, to accommodate the large data throughput rate needed for many real-world applications, has also been developed. System architecture of the optical processor and the holographic memory will be presented. Application examples of this object recognition technology will also be demonstrated.
Pedretti, Kevin
2008-11-18
A compute processor allocator architecture for allocating compute processors to run applications in a multiple processor computing apparatus is distributed among a subset of processors within the computing apparatus. Each processor of the subset includes a compute processor allocator. The compute processor allocators can share a common database of information pertinent to compute processor allocation. A communication path permits retrieval of information from the database independently of the compute processor allocators.
Memory-Intensive Benchmarks: IRAM vs. Cache-Based Machines
NASA Technical Reports Server (NTRS)
Biswas, Rupak; Gaeke, Brian R.; Husbands, Parry; Li, Xiaoye S.; Oliker, Leonid; Yelick, Katherine A.; Biegel, Bryan (Technical Monitor)
2002-01-01
The increasing gap between processor and memory performance has lead to new architectural models for memory-intensive applications. In this paper, we explore the performance of a set of memory-intensive benchmarks and use them to compare the performance of conventional cache-based microprocessors to a mixed logic and DRAM processor called VIRAM. The benchmarks are based on problem statements, rather than specific implementations, and in each case we explore the fundamental hardware requirements of the problem, as well as alternative algorithms and data structures that can help expose fine-grained parallelism or simplify memory access patterns. The benchmarks are characterized by their memory access patterns, their basic control structures, and the ratio of computation to memory operation.
X-LUNA: Extending Free/Open Source Real Time Executive for On-Board Space Applications
NASA Astrophysics Data System (ADS)
Braga, P.; Henriques, L.; Zulianello, M.
2008-08-01
In this paper we present xLuna, a system based on the RTEMS [1] Real-Time Operating System that is able to run on demand a GNU/Linux Operating System [2] as RTEMS' lowest priority task. Linux runs in user-mode and in a different memory partition. This allows running Hard Real-Time tasks and Linux applications on the same system sharing the Hardware resources while keeping a safe isolation and the Real-Time characteristics of RTEMS. Communication between both Systems is possible through a loose coupled mechanism based on message queues. Currently only SPARC LEON2 processor with Memory Management Unit (MMU) is supported. The advantage in having two isolated systems is that non critical components are quickly developed or simply ported reducing time-to-market and budget.
FPGA-Based, Self-Checking, Fault-Tolerant Computers
NASA Technical Reports Server (NTRS)
Some, Raphael; Rennels, David
2004-01-01
A proposed computer architecture would exploit the capabilities of commercially available field-programmable gate arrays (FPGAs) to enable computers to detect and recover from bit errors. The main purpose of the proposed architecture is to enable fault-tolerant computing in the presence of single-event upsets (SEUs). [An SEU is a spurious bit flip (also called a soft error) caused by a single impact of ionizing radiation.] The architecture would also enable recovery from some soft errors caused by electrical transients and, to some extent, from intermittent and permanent (hard) errors caused by aging of electronic components. A typical FPGA of the current generation contains one or more complete processor cores, memories, and highspeed serial input/output (I/O) channels, making it possible to shrink a board-level processor node to a single integrated-circuit chip. Custom, highly efficient microcontrollers, general-purpose computers, custom I/O processors, and signal processors can be rapidly and efficiently implemented by use of FPGAs. Unfortunately, FPGAs are susceptible to SEUs. Prior efforts to mitigate the effects of SEUs have yielded solutions that degrade performance of the system and require support from external hardware and software. In comparison with other fault-tolerant- computing architectures (e.g., triple modular redundancy), the proposed architecture could be implemented with less circuitry and lower power demand. Moreover, the fault-tolerant computing functions would require only minimal support from circuitry outside the central processing units (CPUs) of computers, would not require any software support, and would be largely transparent to software and to other computer hardware. There would be two types of modules: a self-checking processor module and a memory system (see figure). The self-checking processor module would be implemented on a single FPGA and would be capable of detecting its own internal errors. It would contain two CPUs executing identical programs in lock step, with comparison of their outputs to detect errors. It would also contain various cache local memory circuits, communication circuits, and configurable special-purpose processors that would use self-checking checkers. (The basic principle of the self-checking checker method is to utilize logic circuitry that generates error signals whenever there is an error in either the checker or the circuit being checked.) The memory system would comprise a main memory and a hardware-controlled check-pointing system (CPS) based on a buffer memory denoted the recovery cache. The main memory would contain random-access memory (RAM) chips and FPGAs that would, in addition to everything else, implement double-error-detecting and single-error-correcting memory functions to enable recovery from single-bit errors.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Nelson, Andrew F.; Wetzstein, M.; Naab, T.
2009-10-01
We continue our presentation of VINE. In this paper, we begin with a description of relevant architectural properties of the serial and shared memory parallel computers on which VINE is intended to run, and describe their influences on the design of the code itself. We continue with a detailed description of a number of optimizations made to the layout of the particle data in memory and to our implementation of a binary tree used to access that data for use in gravitational force calculations and searches for smoothed particle hydrodynamics (SPH) neighbor particles. We describe the modifications to the codemore » necessary to obtain forces efficiently from special purpose 'GRAPE' hardware, the interfaces required to allow transparent substitution of those forces in the code instead of those obtained from the tree, and the modifications necessary to use both tree and GRAPE together as a fused GRAPE/tree combination. We conclude with an extensive series of performance tests, which demonstrate that the code can be run efficiently and without modification in serial on small workstations or in parallel using the OpenMP compiler directives on large-scale, shared memory parallel machines. We analyze the effects of the code optimizations and estimate that they improve its overall performance by more than an order of magnitude over that obtained by many other tree codes. Scaled parallel performance of the gravity and SPH calculations, together the most costly components of most simulations, is nearly linear up to at least 120 processors on moderate sized test problems using the Origin 3000 architecture, and to the maximum machine sizes available to us on several other architectures. At similar accuracy, performance of VINE, used in GRAPE-tree mode, is approximately a factor 2 slower than that of VINE, used in host-only mode. Further optimizations of the GRAPE/host communications could improve the speed by as much as a factor of 3, but have not yet been implemented in VINE. Finally, we find that although parallel performance on small problems may reach a plateau beyond which more processors bring no additional speedup, performance never decreases, a factor important for running large simulations on many processors with individual time steps, where only a small fraction of the total particles require updates at any given moment.« less
Multiprocessor architectural study
NASA Technical Reports Server (NTRS)
Kosmala, A. L.; Stanten, S. F.; Vandever, W. H.
1972-01-01
An architectural design study was made of a multiprocessor computing system intended to meet functional and performance specifications appropriate to a manned space station application. Intermetrics, previous experience, and accumulated knowledge of the multiprocessor field is used to generate a baseline philosophy for the design of a future SUMC* multiprocessor. Interrupts are defined and the crucial questions of interrupt structure, such as processor selection and response time, are discussed. Memory hierarchy and performance is discussed extensively with particular attention to the design approach which utilizes a cache memory associated with each processor. The ability of an individual processor to approach its theoretical maximum performance is then analyzed in terms of a hit ratio. Memory management is envisioned as a virtual memory system implemented either through segmentation or paging. Addressing is discussed in terms of various register design adopted by current computers and those of advanced design.
NASA Technical Reports Server (NTRS)
Saini, Subash; Bailey, David; Chancellor, Marisa K. (Technical Monitor)
1997-01-01
High Performance Fortran (HPF), the high-level language for parallel Fortran programming, is based on Fortran 90. HALF was defined by an informal standards committee known as the High Performance Fortran Forum (HPFF) in 1993, and modeled on TMC's CM Fortran language. Several HPF features have since been incorporated into the draft ANSI/ISO Fortran 95, the next formal revision of the Fortran standard. HPF allows users to write a single parallel program that can execute on a serial machine, a shared-memory parallel machine, or a distributed-memory parallel machine. HPF eliminates the complex, error-prone task of explicitly specifying how, where, and when to pass messages between processors on distributed-memory machines, or when to synchronize processors on shared-memory machines. HPF is designed in a way that allows the programmer to code an application at a high level, and then selectively optimize portions of the code by dropping into message-passing or calling tuned library routines as 'extrinsics'. Compilers supporting High Performance Fortran features first appeared in late 1994 and early 1995 from Applied Parallel Research (APR) Digital Equipment Corporation, and The Portland Group (PGI). IBM introduced an HPF compiler for the IBM RS/6000 SP/2 in April of 1996. Over the past two years, these implementations have shown steady improvement in terms of both features and performance. The performance of various hardware/ programming model (HPF and MPI (message passing interface)) combinations will be compared, based on latest NAS (NASA Advanced Supercomputing) Parallel Benchmark (NPB) results, thus providing a cross-machine and cross-model comparison. Specifically, HPF based NPB results will be compared with MPI based NPB results to provide perspective on performance currently obtainable using HPF versus MPI or versus hand-tuned implementations such as those supplied by the hardware vendors. In addition we would also present NPB (Version 1.0) performance results for the following systems: DEC Alpha Server 8400 5/440, Fujitsu VPP Series (VX, VPP300, and VPP700), HP/Convex Exemplar SPP2000, IBM RS/6000 SP P2SC node (120 MHz) NEC SX-4/32, SGI/CRAY T3E, SGI Origin2000.
Toshiba TDF-500 High Resolution Viewing And Analysis System
NASA Astrophysics Data System (ADS)
Roberts, Barry; Kakegawa, M.; Nishikawa, M.; Oikawa, D.
1988-06-01
A high resolution, operator interactive, medical viewing and analysis system has been developed by Toshiba and Bio-Imaging Research. This system provides many advanced features including high resolution displays, a very large image memory and advanced image processing capability. In particular, the system provides CRT frame buffers capable of update in one frame period, an array processor capable of image processing at operator interactive speeds, and a memory system capable of updating multiple frame buffers at frame rates whilst supporting multiple array processors. The display system provides 1024 x 1536 display resolution at 40Hz frame and 80Hz field rates. In particular, the ability to provide whole or partial update of the screen at the scanning rate is a key feature. This allows multiple viewports or windows in the display buffer with both fixed and cine capability. To support image processing features such as windowing, pan, zoom, minification, filtering, ROI analysis, multiplanar and 3D reconstruction, a high performance CPU is integrated into the system. This CPU is an array processor capable of up to 400 million instructions per second. To support the multiple viewer and array processors' instantaneous high memory bandwidth requirement, an ultra fast memory system is used. This memory system has a bandwidth capability of 400MB/sec and a total capacity of 256MB. This bandwidth is more than adequate to support several high resolution CRT's and also the fast processing unit. This fully integrated approach allows effective real time image processing. The integrated design of viewing system, memory system and array processor are key to the imaging system. It is the intention to describe the architecture of the image system in this paper.
Dynamic overset grid communication on distributed memory parallel processors
NASA Technical Reports Server (NTRS)
Barszcz, Eric; Weeratunga, Sisira K.; Meakin, Robert L.
1993-01-01
A parallel distributed memory implementation of intergrid communication for dynamic overset grids is presented. Included are discussions of various options considered during development. Results are presented comparing an Intel iPSC/860 to a single processor Cray Y-MP. Results for grids in relative motion show the iPSC/860 implementation to be faster than the Cray implementation.
A Parallel Cartesian Approach for External Aerodynamics of Vehicles with Complex Geometry
NASA Technical Reports Server (NTRS)
Aftosmis, M. J.; Berger, M. J.; Adomavicius, G.
2001-01-01
This workshop paper presents the current status in the development of a new approach for the solution of the Euler equations on Cartesian meshes with embedded boundaries in three dimensions on distributed and shared memory architectures. The approach uses adaptively refined Cartesian hexahedra to fill the computational domain. Where these cells intersect the geometry, they are cut by the boundary into arbitrarily shaped polyhedra which receive special treatment by the solver. The presentation documents a newly developed multilevel upwind solver based on a flexible domain-decomposition strategy. One novel aspect of the work is its use of space-filling curves (SFC) for memory efficient on-the-fly parallelization, dynamic re-partitioning and automatic coarse mesh generation. Within each subdomain the approach employs a variety reordering techniques so that relevant data are on the same page in memory permitting high-performance on cache-based processors. Details of the on-the-fly SFC based partitioning are presented as are construction rules for the automatic coarse mesh generation. After describing the approach, the paper uses model problems and 3- D configurations to both verify and validate the solver. The model problems demonstrate that second-order accuracy is maintained despite the presence of the irregular cut-cells in the mesh. In addition, it examines both parallel efficiency and convergence behavior. These investigations demonstrate a parallel speed-up in excess of 28 on 32 processors of an SGI Origin 2000 system and confirm that mesh partitioning has no effect on convergence behavior.
Processor register error correction management
Bose, Pradip; Cher, Chen-Yong; Gupta, Meeta S.
2016-12-27
Processor register protection management is disclosed. In embodiments, a method of processor register protection management can include determining a sensitive logical register for executable code generated by a compiler, generating an error-correction table identifying the sensitive logical register, and storing the error-correction table in a memory accessible by a processor. The processor can be configured to generate a duplicate register of the sensitive logical register identified by the error-correction table.
PCI-based WILDFIRE reconfigurable computing engines
NASA Astrophysics Data System (ADS)
Fross, Bradley K.; Donaldson, Robert L.; Palmer, Douglas J.
1996-10-01
WILDFORCE is the first PCI-based custom reconfigurable computer that is based on the Splash 2 technology transferred from the National Security Agency and the Institute for Defense Analyses, Supercomputing Research Center (SRC). The WILDFORCE architecture has many of the features of the WILDFIRE computer, such as field- programmable gate array (FPGA) based processing elements, linear array and crossbar interconnection, and high- performance memory and I/O subsystems. New features introduced in the PCI-based WILDFIRE systems include memory/processor options that can be added to any processing element. These options include static and dynamic memory, digital signal processors (DSPs), FPGAs, and microprocessors. In addition to memory/processor options, many different application specific connectors can be used to extend the I/O capabilities of the system, including systolic I/O, camera input and video display output. This paper also discusses how this new PCI-based reconfigurable computing engine is used for rapid-prototyping, real-time video processing and other DSP applications.
Using Coarrays to Parallelize Legacy Fortran Applications: Strategy and Case Study
Radhakrishnan, Hari; Rouson, Damian W. I.; Morris, Karla; ...
2015-01-01
This paper summarizes a strategy for parallelizing a legacy Fortran 77 program using the object-oriented (OO) and coarray features that entered Fortran in the 2003 and 2008 standards, respectively. OO programming (OOP) facilitates the construction of an extensible suite of model-verification and performance tests that drive the development. Coarray parallel programming facilitates a rapid evolution from a serial application to a parallel application capable of running on multicore processors and many-core accelerators in shared and distributed memory. We delineate 17 code modernization steps used to refactor and parallelize the program and study the resulting performance. Our initial studies were donemore » using the Intel Fortran compiler on a 32-core shared memory server. Scaling behavior was very poor, and profile analysis using TAU showed that the bottleneck in the performance was due to our implementation of a collective, sequential summation procedure. We were able to improve the scalability and achieve nearly linear speedup by replacing the sequential summation with a parallel, binary tree algorithm. We also tested the Cray compiler, which provides its own collective summation procedure. Intel provides no collective reductions. With Cray, the program shows linear speedup even in distributed-memory execution. We anticipate similar results with other compilers once they support the new collective procedures proposed for Fortran 2015.« less
HeinzelCluster: accelerated reconstruction for FORE and OSEM3D.
Vollmar, S; Michel, C; Treffert, J T; Newport, D F; Casey, M; Knöss, C; Wienhard, K; Liu, X; Defrise, M; Heiss, W D
2002-08-07
Using iterative three-dimensional (3D) reconstruction techniques for reconstruction of positron emission tomography (PET) is not feasible on most single-processor machines due to the excessive computing time needed, especially so for the large sinogram sizes of our high-resolution research tomograph (HRRT). In our first approach to speed up reconstruction time we transform the 3D scan into the format of a two-dimensional (2D) scan with sinograms that can be reconstructed independently using Fourier rebinning (FORE) and a fast 2D reconstruction method. On our dedicated reconstruction cluster (seven four-processor systems, Intel PIII@700 MHz, switched fast ethernet and Myrinet, Windows NT Server), we process these 2D sinograms in parallel. We have achieved a speedup > 23 using 26 processors and also compared results for different communication methods (RPC, Syngo, Myrinet GM). The other approach is to parallelize OSEM3D (implementation of C Michel), which has produced the best results for HRRT data so far and is more suitable for an adequate treatment of the sinogram gaps that result from the detector geometry of the HRRT. We have implemented two levels of parallelization for four dedicated cluster (a shared memory fine-grain level on each node utilizing all four processors and a coarse-grain level allowing for 15 nodes) reducing the time for one core iteration from over 7 h to about 35 min.
Xyce parallel electronic simulator users guide, version 6.1
DOE Office of Scientific and Technical Information (OSTI.GOV)
Keiter, Eric R; Mei, Ting; Russo, Thomas V.
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas; Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). This includes support for most popular parallel and serial computers; A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows one to developmore » new types of analysis without requiring the implementation of analysis-specific device models; Device models that are specifically tailored to meet Sandia's needs, including some radiationaware devices (for Sandia users only); and Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase-a message passing parallel implementation-which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows.« less
Xyce parallel electronic simulator users' guide, Version 6.0.1.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Keiter, Eric R; Mei, Ting; Russo, Thomas V.
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). This includes support for most popular parallel and serial computers. A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows one to developmore » new types of analysis without requiring the implementation of analysis-specific device models. Device models that are specifically tailored to meet Sandias needs, including some radiationaware devices (for Sandia users only). Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase a message passing parallel implementation which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows.« less
Xyce parallel electronic simulator users guide, version 6.0.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Keiter, Eric R; Mei, Ting; Russo, Thomas V.
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). This includes support for most popular parallel and serial computers. A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows one to developmore » new types of analysis without requiring the implementation of analysis-specific device models. Device models that are specifically tailored to meet Sandias needs, including some radiationaware devices (for Sandia users only). Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase a message passing parallel implementation which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows.« less
Support for Diagnosis of Custom Computer Hardware
NASA Technical Reports Server (NTRS)
Molock, Dwaine S.
2008-01-01
The Coldfire SDN Diagnostics software is a flexible means of exercising, testing, and debugging custom computer hardware. The software is a set of routines that, collectively, serve as a common software interface through which one can gain access to various parts of the hardware under test and/or cause the hardware to perform various functions. The routines can be used to construct tests to exercise, and verify the operation of, various processors and hardware interfaces. More specifically, the software can be used to gain access to memory, to execute timer delays, to configure interrupts, and configure processor cache, floating-point, and direct-memory-access units. The software is designed to be used on diverse NASA projects, and can be customized for use with different processors and interfaces. The routines are supported, regardless of the architecture of a processor that one seeks to diagnose. The present version of the software is configured for Coldfire processors on the Subsystem Data Node processor boards of the Solar Dynamics Observatory. There is also support for the software with respect to Mongoose V, RAD750, and PPC405 processors or their equivalents.
Integrating Cache Performance Modeling and Tuning Support in Parallelization Tools
NASA Technical Reports Server (NTRS)
Waheed, Abdul; Yan, Jerry; Saini, Subhash (Technical Monitor)
1998-01-01
With the resurgence of distributed shared memory (DSM) systems based on cache-coherent Non Uniform Memory Access (ccNUMA) architectures and increasing disparity between memory and processors speeds, data locality overheads are becoming the greatest bottlenecks in the way of realizing potential high performance of these systems. While parallelization tools and compilers facilitate the users in porting their sequential applications to a DSM system, a lot of time and effort is needed to tune the memory performance of these applications to achieve reasonable speedup. In this paper, we show that integrating cache performance modeling and tuning support within a parallelization environment can alleviate this problem. The Cache Performance Modeling and Prediction Tool (CPMP), employs trace-driven simulation techniques without the overhead of generating and managing detailed address traces. CPMP predicts the cache performance impact of source code level "what-if" modifications in a program to assist a user in the tuning process. CPMP is built on top of a customized version of the Computer Aided Parallelization Tools (CAPTools) environment. Finally, we demonstrate how CPMP can be applied to tune a real Computational Fluid Dynamics (CFD) application.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Donofrio, David
A method and apparatus for performing stencil computations efficiently are disclosed. In one embodiment, a processor receives an offset, and in response, retrieves a value from a memory via a single instruction, where the retrieving comprises: identifying, based on the offset, one of a plurality of registers of the processor; loading an address stored in the identified register; and retrieving from the memory the value at the address.
Dynamically programmable cache
NASA Astrophysics Data System (ADS)
Nakkar, Mouna; Harding, John A.; Schwartz, David A.; Franzon, Paul D.; Conte, Thomas
1998-10-01
Reconfigurable machines have recently been used as co- processors to accelerate the execution of certain algorithms or program subroutines. The problems with the above approach include high reconfiguration time and limited partial reconfiguration. By far the most critical problems are: (1) the small on-chip memory which results in slower execution time, and (2) small FPGA areas that cannot implement large subroutines. Dynamically Programmable Cache (DPC) is a novel architecture for embedded processors which offers solutions to the above problems. To solve memory access problems, DPC processors merge reconfigurable arrays with the data cache at various cache levels to create a multi-level reconfigurable machines. As a result DPC machines have both higher data accessibility and FPGA memory bandwidth. To solve the limited FPGA resource problem, DPC processors implemented multi-context switching (Virtualization) concept. Virtualization allows implementation of large subroutines with fewer FPGA cells. Additionally, DPC processors can parallelize the execution of several operations resulting in faster execution time. In this paper, the speedup improvement for DPC machines are shown to be 5X faster than an Altera FLEX10K FPGA chip and 2X faster than a Sun Ultral SPARC station for two different algorithms (convolution and motion estimation).
Systems and methods for performing wireless financial transactions
DOE Office of Scientific and Technical Information (OSTI.GOV)
McCown, Steven Harvey
2012-07-03
A secure computing module (SCM) is configured for connection with a host device. The SCM includes a processor for performing secure processing operations, a host interface for coupling the processor to the host device, and a memory connected to the processor wherein the processor logically isolates at least some of the memory from access by the host device. The SCM also includes a proximate-field wireless communicator connected to the processor to communicate with another SCM associated with another host device. The SCM generates a secure digital signature for a financial transaction package and communicates the package and the signature tomore » the other SCM using the proximate-field wireless communicator. Financial transactions are performed from person to person using the secure digital signature of each person's SCM and possibly message encryption. The digital signatures and transaction details are communicated to appropriate financial organizations to authenticate the transaction parties and complete the transaction.« less
Geospace simulations on the Cell BE processor
NASA Astrophysics Data System (ADS)
Germaschewski, K.; Raeder, J.; Larson, D.
2008-12-01
OpenGGCM (Open Geospace General circulation Model) is an established numerical code that simulates the Earth's space environment. The most computing intensive part is the MHD (magnetohydrodynamics) solver that models the plasma surrounding Earth and its interaction with Earth's magnetic field and the solar wind flowing in from the sun. Like other global magnetosphere codes, OpenGGCM's realism is limited by computational constraints on grid resolution. We investigate porting of the MHD solver to the Cell BE architecture, a novel inhomogeneous multicore architecture capable of up to 230 GFlops per processor. Realizing this high performance on the Cell processor is a programming challenge, though. We implemented the MHD solver using a multi-level parallel approach: On the coarsest level, the problem is distributed to processors based upon the usual domain decomposition approach. Then, on each processor, the problem is divided into 3D columns, each of which is handled by the memory limited SPEs (synergistic processing elements) slice by slice. Finally, SIMD instructions are used to fully exploit the vector/SIMD FPUs in each SPE. Memory management needs to be handled explicitly by the code, using DMA to move data from main memory to the per-SPE local store and vice versa. We obtained excellent performance numbers, a speed-up of a factor of 25 compared to just using the main processor, while still keeping the numerical implementation details of the code maintainable.
NASA Astrophysics Data System (ADS)
Pleros, Nikos; Maniotis, Pavlos; Alexoudi, Theonitsa; Fitsios, Dimitris; Vagionas, Christos; Papaioannou, Sotiris; Vyrsokinos, K.; Kanellos, George T.
2014-03-01
The processor-memory performance gap, commonly referred to as "Memory Wall" problem, owes to the speed mismatch between processor and electronic RAM clock frequencies, forcing current Chip Multiprocessor (CMP) configurations to consume more than 50% of the chip real-estate for caching purposes. In this article, we present our recent work spanning from Si-based integrated optical RAM cell architectures up to complete optical cache memory architectures for Chip Multiprocessor configurations. Moreover, we discuss on e/o router subsystems with up to Tb/s routing capacity for cache interconnection purposes within CMP configurations, currently pursued within the FP7 PhoxTrot project.
Hybrid quantum processors: molecular ensembles as quantum memory for solid state circuits.
Rabl, P; DeMille, D; Doyle, J M; Lukin, M D; Schoelkopf, R J; Zoller, P
2006-07-21
We investigate a hybrid quantum circuit where ensembles of cold polar molecules serve as long-lived quantum memories and optical interfaces for solid state quantum processors. The quantum memory realized by collective spin states (ensemble qubit) is coupled to a high-Q stripline cavity via microwave Raman processes. We show that, for convenient trap-surface distances of a few microm, strong coupling between the cavity and ensemble qubit can be achieved. We discuss basic quantum information protocols, including a swap from the cavity photon bus to the molecular quantum memory, and a deterministic two qubit gate. Finally, we investigate coherence properties of molecular ensemble quantum bits.
Load Balancing Strategies for Multiphase Flows on Structured Grids
NASA Astrophysics Data System (ADS)
Olshefski, Kristopher; Owkes, Mark
2017-11-01
The computation time required to perform large simulations of complex systems is currently one of the leading bottlenecks of computational research. Parallelization allows multiple processing cores to perform calculations simultaneously and reduces computational times. However, load imbalances between processors waste computing resources as processors wait for others to complete imbalanced tasks. In multiphase flows, these imbalances arise due to the additional computational effort required at the gas-liquid interface. However, many current load balancing schemes are only designed for unstructured grid applications. The purpose of this research is to develop a load balancing strategy while maintaining the simplicity of a structured grid. Several approaches are investigated including brute force oversubscription, node oversubscription through Message Passing Interface (MPI) commands, and shared memory load balancing using OpenMP. Each of these strategies are tested with a simple one-dimensional model prior to implementation into the three-dimensional NGA code. Current results show load balancing will reduce computational time by at least 30%.
Concurrent computation of attribute filters on shared memory parallel machines.
Wilkinson, Michael H F; Gao, Hui; Hesselink, Wim H; Jonker, Jan-Eppo; Meijster, Arnold
2008-10-01
Morphological attribute filters have not previously been parallelized, mainly because they are both global and non-separable. We propose a parallel algorithm that achieves efficient parallelism for a large class of attribute filters, including attribute openings, closings, thinnings and thickenings, based on Salembier's Max-Trees and Min-trees. The image or volume is first partitioned in multiple slices. We then compute the Max-trees of each slice using any sequential Max-Tree algorithm. Subsequently, the Max-trees of the slices can be merged to obtain the Max-tree of the image. A C-implementation yielded good speed-ups on both a 16-processor MIPS 14000 parallel machine, and a dual-core Opteron-based machine. It is shown that the speed-up of the parallel algorithm is a direct measure of the gain with respect to the sequential algorithm used. Furthermore, the concurrent algorithm shows a speed gain of up to 72 percent on a single-core processor, due to reduced cache thrashing.
Reconfigurable pipelined processor
DOE Office of Scientific and Technical Information (OSTI.GOV)
Saccardi, R.J.
1989-09-19
This patent describes a reconfigurable pipelined processor for processing data. It comprises: a plurality of memory devices for storing bits of data; a plurality of arithmetic units for performing arithmetic functions with the data; cross bar means for connecting the memory devices with the arithmetic units for transferring data therebetween; at least one counter connected with the cross bar means for providing a source of addresses to the memory devices; at least one variable tick delay device connected with each of the memory devices and arithmetic units; and means for providing control bits to the variable tick delay device formore » variably controlling the input and output operations thereof to selectively delay the memory devices and arithmetic units to align the data for processing in a selected sequence.« less
Multiprogramming performance degradation - Case study on a shared memory multiprocessor
NASA Technical Reports Server (NTRS)
Dimpsey, R. T.; Iyer, R. K.
1989-01-01
The performance degradation due to multiprogramming overhead is quantified for a parallel-processing machine. Measurements of real workloads were taken, and it was found that there is a moderate correlation between the completion time of a program and the amount of system overhead measured during program execution. Experiments in controlled environments were then conducted to calculate a lower bound on the performance degradation of parallel jobs caused by multiprogramming overhead. The results show that the multiprogramming overhead of parallel jobs consumes at least 4 percent of the processor time. When two or more serial jobs are introduced into the system, this amount increases to 5.3 percent
A mechanism for efficient debugging of parallel programs
DOE Office of Scientific and Technical Information (OSTI.GOV)
Miller, B.P.; Choi, J.D.
1988-01-01
This paper addresses the design and implementation of an integrated debugging system for parallel programs running on shared memory multi-processors (SMMP). The authors describe the use of flowback analysis to provide information on causal relationships between events in a program's execution without re-executing the program for debugging. The authors introduce a mechanism called incremental tracing that, by using semantic analyses of the debugged program, makes the flowback analysis practical with only a small amount of trace generated during execution. The extend flowback analysis to apply to parallel programs and describe a method to detect race conditions in the interactions ofmore » the co-operating processes.« less
Kalman filter tracking on parallel architectures
NASA Astrophysics Data System (ADS)
Cerati, G.; Elmer, P.; Krutelyov, S.; Lantz, S.; Lefebvre, M.; McDermott, K.; Riley, D.; Tadel, M.; Wittich, P.; Wurthwein, F.; Yagil, A.
2017-10-01
We report on the progress of our studies towards a Kalman filter track reconstruction algorithm with optimal performance on manycore architectures. The combinatorial structure of these algorithms is not immediately compatible with an efficient SIMD (or SIMT) implementation; the challenge for us is to recast the existing software so it can readily generate hundreds of shared-memory threads that exploit the underlying instruction set of modern processors. We show how the data and associated tasks can be organized in a way that is conducive to both multithreading and vectorization. We demonstrate very good performance on Intel Xeon and Xeon Phi architectures, as well as promising first results on Nvidia GPUs.
The Spectral Element Method for Geophysical Flows
NASA Astrophysics Data System (ADS)
Taylor, Mark
1998-11-01
We will describe SEAM, a Spectral Element Atmospheric Model. SEAM solves the 3D primitive equations used in climate modeling and medium range forecasting. SEAM uses a spectral element discretization for the surface of the globe and finite differences in the vertical direction. The model is spectrally accurate, as demonstrated by a variety of test cases. It is well suited for modern distributed-shared memory computers, sustaining over 24 GFLOPS on a 240 processor HP Exemplar. This performance has allowed us to run several interesting simulations in full spherical geometry at high resolution (over 22 million grid points).
NASA Astrophysics Data System (ADS)
Erez, Mattan; Dally, William J.
Stream processors, like other multi core architectures partition their functional units and storage into multiple processing elements. In contrast to typical architectures, which contain symmetric general-purpose cores and a cache hierarchy, stream processors have a significantly leaner design. Stream processors are specifically designed for the stream execution model, in which applications have large amounts of explicit parallel computation, structured and predictable control, and memory accesses that can be performed at a coarse granularity. Applications in the streaming model are expressed in a gather-compute-scatter form, yielding programs with explicit control over transferring data to and from on-chip memory. Relying on these characteristics, which are common to many media processing and scientific computing applications, stream architectures redefine the boundary between software and hardware responsibilities with software bearing much of the complexity required to manage concurrency, locality, and latency tolerance. Thus, stream processors have minimal control consisting of fetching medium- and coarse-grained instructions and executing them directly on the many ALUs. Moreover, the on-chip storage hierarchy of stream processors is under explicit software control, as is all communication, eliminating the need for complex reactive hardware mechanisms.
Assessing Server Fault Tolerance and Disaster Recovery Implementation in Thin Client Architectures
2007-09-01
server • Windows 2003 server Processor AMD Geode GX Memory 512MB Flash/256MB DDR RAM I/O/Peripheral Support • VGA-type video output (DB-15...2000 Advanced Server Processor AMD Geode NX 1500 Memory • 256MB or 512MB or 1GB DDR SDRAM • 1GB or 512MB Flash I/O/Peripheral Support • SiS741 GX
2004-07-01
steadily for the past fifteen years, while memory latency and bandwidth have improved much more slowly. For example, Intel processor clock rates38 have... processor and memory performance) all greatly restrict the ability to achieve high levels of performance for science, engineering, and national...sub-nuclear distances. Guide experiments to identify transition from quantum chromodynamics to quark -gluon plasma. Accelerator Physics Accurate
A floating-point/multiple-precision processor for airborne applications
NASA Technical Reports Server (NTRS)
Yee, R.
1982-01-01
A compact input output (I/O) numerical processor capable of performing floating-point, multiple precision and other arithmetic functions at execution times which are at least 100 times faster than comparable software emulation is described. The I/O device is a microcomputer system containing a 16 bit microprocessor, a numerical coprocessor with eight 80 bit registers running at a 5 MHz clock rate, 18K random access memory (RAM) and 16K electrically programmable read only memory (EPROM). The processor acts as an intelligent slave to the host computer and can be programmed in high order languages such as FORTRAN and PL/M-86.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sofronov, I.D.; Voronin, B.L.; Butnev, O.I.
1997-12-31
The aim of the work performed is to develop a 3D parallel program for numerical calculation of gas dynamics problem with heat conductivity on distributed memory computational systems (CS), satisfying the condition of numerical result independence from the number of processors involved. Two basically different approaches to the structure of massive parallel computations have been developed. The first approach uses the 3D data matrix decomposition reconstructed at temporal cycle and is a development of parallelization algorithms for multiprocessor CS with shareable memory. The second approach is based on using a 3D data matrix decomposition not reconstructed during a temporal cycle.more » The program was developed on 8-processor CS MP-3 made in VNIIEF and was adapted to a massive parallel CS Meiko-2 in LLNL by joint efforts of VNIIEF and LLNL staffs. A large number of numerical experiments has been carried out with different number of processors up to 256 and the efficiency of parallelization has been evaluated in dependence on processor number and their parameters.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
Sitaraman, Hariswaran; Grout, Ray W
This work investigates novel algorithm designs and optimization techniques for restructuring chemistry integrators in zero and multidimensional combustion solvers, which can then be effectively used on the emerging generation of Intel's Many Integrated Core/Xeon Phi processors. These processors offer increased computing performance via large number of lightweight cores at relatively lower clock speeds compared to traditional processors (e.g. Intel Sandybridge/Ivybridge) used in current supercomputers. This style of processor can be productively used for chemistry integrators that form a costly part of computational combustion codes, in spite of their relatively lower clock speeds. Performance commensurate with traditional processors is achieved heremore » through the combination of careful memory layout, exposing multiple levels of fine grain parallelism and through extensive use of vendor supported libraries (Cilk Plus and Math Kernel Libraries). Important optimization techniques for efficient memory usage and vectorization have been identified and quantified. These optimizations resulted in a factor of ~ 3 speed-up using Intel 2013 compiler and ~ 1.5 using Intel 2017 compiler for large chemical mechanisms compared to the unoptimized version on the Intel Xeon Phi. The strategies, especially with respect to memory usage and vectorization, should also be beneficial for general purpose computational fluid dynamics codes.« less
A Framework for Parallel Unstructured Grid Generation for Complex Aerodynamic Simulations
NASA Technical Reports Server (NTRS)
Zagaris, George; Pirzadeh, Shahyar Z.; Chrisochoides, Nikos
2009-01-01
A framework for parallel unstructured grid generation targeting both shared memory multi-processors and distributed memory architectures is presented. The two fundamental building-blocks of the framework consist of: (1) the Advancing-Partition (AP) method used for domain decomposition and (2) the Advancing Front (AF) method used for mesh generation. Starting from the surface mesh of the computational domain, the AP method is applied recursively to generate a set of sub-domains. Next, the sub-domains are meshed in parallel using the AF method. The recursive nature of domain decomposition naturally maps to a divide-and-conquer algorithm which exhibits inherent parallelism. For the parallel implementation, the Master/Worker pattern is employed to dynamically balance the varying workloads of each task on the set of available CPUs. Performance results by this approach are presented and discussed in detail as well as future work and improvements.
LANDSAT-D flight segment operations manual. Appendix B: OBC software operations
NASA Technical Reports Server (NTRS)
Talipsky, R.
1981-01-01
The LANDSAT 4 satellite contains two NASA standard spacecraft computers and 65,536 words of memory. Onboard computer software is divided into flight executive and applications processors. Both applications processors and the flight executive use one or more of 67 system tables to obtain variables, constants, and software flags. Output from the software for monitoring operation is via 49 OBC telemetry reports subcommutated in the spacecraft telemetry. Information is provided about the flight software as it is used to control the various spacecraft operations and interpret operational OBC telemetry. Processor function descriptions, processor operation, software constraints, processor system tables, processor telemetry, and processor flow charts are presented.
High-performance ultra-low power VLSI analog processor for data compression
NASA Technical Reports Server (NTRS)
Tawel, Raoul (Inventor)
1996-01-01
An apparatus for data compression employing a parallel analog processor. The apparatus includes an array of processor cells with N columns and M rows wherein the processor cells have an input device, memory device, and processor device. The input device is used for inputting a series of input vectors. Each input vector is simultaneously input into each column of the array of processor cells in a pre-determined sequential order. An input vector is made up of M components, ones of which are input into ones of M processor cells making up a column of the array. The memory device is used for providing ones of M components of a codebook vector to ones of the processor cells making up a column of the array. A different codebook vector is provided to each of the N columns of the array. The processor device is used for simultaneously comparing the components of each input vector to corresponding components of each codebook vector, and for outputting a signal representative of the closeness between the compared vector components. A combination device is used to combine the signal output from each processor cell in each column of the array and to output a combined signal. A closeness determination device is then used for determining which codebook vector is closest to an input vector from the combined signals, and for outputting a codebook vector index indicating which of the N codebook vectors was the closest to each input vector input into the array.
Realization of a single image haze removal system based on DaVinci DM6467T processor
NASA Astrophysics Data System (ADS)
Liu, Zhuang
2014-10-01
Video monitoring system (VMS) has been extensively applied in domains of target recognition, traffic management, remote sensing, auto navigation and national defence. However the VMS has a strong dependence on the weather, for instance, in foggy weather, the quality of images received by the VMS are distinct degraded and the effective range of VMS is also decreased. All in all, the VMS performs terribly in bad weather. Thus the research of fog degraded images enhancement has very high theoretical and practical application value. A design scheme of a fog degraded images enhancement system based on the TI DaVinci processor is presented in this paper. The main function of the referred system is to extract and digital cameras capture images and execute image enhancement processing to obtain a clear image. The processor used in this system is the dual core TI DaVinci DM6467T - ARM@500MHz+DSP@1GH. A MontaVista Linux operating system is running on the ARM subsystem which handles I/O and application processing. The DSP handles signal processing and the results are available to the ARM subsystem in shared memory.The system benefits from the DaVinci processor so that, with lower power cost and smaller volume, it provides the equivalent image processing capability of a X86 computer. The outcome shows that the system in this paper can process images at 25 frames per second on D1 resolution.
Comparison of Origin 2000 and Origin 3000 Using NAS Parallel Benchmarks
NASA Technical Reports Server (NTRS)
Turney, Raymond D.
2001-01-01
This report describes results of benchmark tests on the Origin 3000 system currently being installed at the NASA Ames National Advanced Supercomputing facility. This machine will ultimately contain 1024 R14K processors. The first part of the system, installed in November, 2000 and named mendel, is an Origin 3000 with 128 R12K processors. For comparison purposes, the tests were also run on lomax, an Origin 2000 with R12K processors. The BT, LU, and SP application benchmarks in the NAS Parallel Benchmark Suite and the kernel benchmark FT were chosen to determine system performance and measure the impact of changes on the machine as it evolves. Having been written to measure performance on Computational Fluid Dynamics applications, these benchmarks are assumed appropriate to represent the NAS workload. Since the NAS runs both message passing (MPI) and shared-memory, compiler directive type codes, both MPI and OpenMP versions of the benchmarks were used. The MPI versions used were the latest official release of the NAS Parallel Benchmarks, version 2.3. The OpenMP versiqns used were PBN3b2, a beta version that is in the process of being released. NPB 2.3 and PBN 3b2 are technically different benchmarks, and NPB results are not directly comparable to PBN results.
1990-08-01
LCTVs) ..................... 17 2.14 JOINT FOURIER TRANSFORM PROCESSOR .................. 18 2.15 HOLOGRAPHIC ASSOCIATIVE MEMORY USING A MICRO ...RADC-TR-90-256 Final Technical Report August1990 AD-A227 163 HYBRID OPTICAL PROCESSOR Dove Electronics, Inc. J.F. Dove, F.T .S. Yu, C. Eldering...ANM SUSUE & FUNDING NUMBERS C - F19628-87-C-0086 HYBRID OPTICAL PROCESSOR PE - 61102F PR - 2305 &AUThNOA TA - J7 J.F. Dove, F.T.S. Yu, C. Eldering WU
Shared Memory Parallelism for 3D Cartesian Discrete Ordinates Solver
NASA Astrophysics Data System (ADS)
Moustafa, Salli; Dutka-Malen, Ivan; Plagne, Laurent; Ponçot, Angélique; Ramet, Pierre
2014-06-01
This paper describes the design and the performance of DOMINO, a 3D Cartesian SN solver that implements two nested levels of parallelism (multicore+SIMD) on shared memory computation nodes. DOMINO is written in C++, a multi-paradigm programming language that enables the use of powerful and generic parallel programming tools such as Intel TBB and Eigen. These two libraries allow us to combine multi-thread parallelism with vector operations in an efficient and yet portable way. As a result, DOMINO can exploit the full power of modern multi-core processors and is able to tackle very large simulations, that usually require large HPC clusters, using a single computing node. For example, DOMINO solves a 3D full core PWR eigenvalue problem involving 26 energy groups, 288 angular directions (S16), 46 × 106 spatial cells and 1 × 1012 DoFs within 11 hours on a single 32-core SMP node. This represents a sustained performance of 235 GFlops and 40:74% of the SMP node peak performance for the DOMINO sweep implementation. The very high Flops/Watt ratio of DOMINO makes it a very interesting building block for a future many-nodes nuclear simulation tool.
NASA Astrophysics Data System (ADS)
Leamy, Michael J.; Springer, Adam C.
In this research we report parallel implementation of a Cellular Automata-based simulation tool for computing elastodynamic response on complex, two-dimensional domains. Elastodynamic simulation using Cellular Automata (CA) has recently been presented as an alternative, inherently object-oriented technique for accurately and efficiently computing linear and nonlinear wave propagation in arbitrarily-shaped geometries. The local, autonomous nature of the method should lead to straight-forward and efficient parallelization. We address this notion on symmetric multiprocessor (SMP) hardware using a Java-based object-oriented CA code implementing triangular state machines (i.e., automata) and the MPI bindings written in Java (MPJ Express). We use MPJ Express to reconfigure our existing CA code to distribute a domain's automata to cores present on a dual quad-core shared-memory system (eight total processors). We note that this message passing parallelization strategy is directly applicable to computer clustered computing, which will be the focus of follow-on research. Results on the shared memory platform indicate nearly-ideal, linear speed-up. We conclude that the CA-based elastodynamic simulator is easily configured to run in parallel, and yields excellent speed-up on SMP hardware.
Mass Memory Storage Devices for AN/SLQ-32(V).
1985-06-01
tactical programs and libraries into the AN/UYK-19 computer , the RP-16 microprocessor, and other peripheral processors (e.g., ADLS and Band 1) will be...software must be loaded into computer memory from the 4-track magnetic tape cartridges (MTCs) on which the programs are stored. Program load begins...software. Future computer programs , which will reside in peripheral processors, include the Automated Decoy Launching System (ADLS) and Band 1. As
Multinode reconfigurable pipeline computer
NASA Technical Reports Server (NTRS)
Nosenchuck, Daniel M. (Inventor); Littman, Michael G. (Inventor)
1989-01-01
A multinode parallel-processing computer is made up of a plurality of innerconnected, large capacity nodes each including a reconfigurable pipeline of functional units such as Integer Arithmetic Logic Processors, Floating Point Arithmetic Processors, Special Purpose Processors, etc. The reconfigurable pipeline of each node is connected to a multiplane memory by a Memory-ALU switch NETwork (MASNET). The reconfigurable pipeline includes three (3) basic substructures formed from functional units which have been found to be sufficient to perform the bulk of all calculations. The MASNET controls the flow of signals from the memory planes to the reconfigurable pipeline and vice versa. the nodes are connectable together by an internode data router (hyperspace router) so as to form a hypercube configuration. The capability of the nodes to conditionally configure the pipeline at each tick of the clock, without requiring a pipeline flush, permits many powerful algorithms to be implemented directly.
Xyce Parallel Electronic Simulator Users' Guide Version 6.8
DOE Office of Scientific and Technical Information (OSTI.GOV)
Keiter, Eric R.; Aadithya, Karthik Venkatraman; Mei, Ting
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been de- signed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: Capability to solve extremely large circuit problems by supporting large-scale parallel com- puting platforms (up to thousands of processors). This includes support for most popular parallel and serial computers. A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows onemore » to develop new types of analysis without requiring the implementation of analysis-specific device models. Device models that are specifically tailored to meet Sandia's needs, including some radiation- aware devices (for Sandia users only). Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase$-$ a message passing parallel implementation $-$ which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows.« less
NASA Astrophysics Data System (ADS)
Haron, Adib; Mahdzair, Fazren; Luqman, Anas; Osman, Nazmie; Junid, Syed Abdul Mutalib Al
2018-03-01
One of the most significant constraints of Von Neumann architecture is the limited bandwidth between memory and processor. The cost to move data back and forth between memory and processor is considerably higher than the computation in the processor itself. This architecture significantly impacts the Big Data and data-intensive application such as DNA analysis comparison which spend most of the processing time to move data. Recently, the in-memory processing concept was proposed, which is based on the capability to perform the logic operation on the physical memory structure using a crossbar topology and non-volatile resistive-switching memristor technology. This paper proposes a scheme to map digital equality comparator circuit on memristive memory crossbar array. The 2-bit, 4-bit, 8-bit, 16-bit, 32-bit, and 64-bit of equality comparator circuit are mapped on memristive memory crossbar array by using material implication logic in a sequential and parallel method. The simulation results show that, for the 64-bit word size, the parallel mapping exhibits 2.8× better performance in total execution time than sequential mapping but has a trade-off in terms of energy consumption and area utilization. Meanwhile, the total crossbar area can be reduced by 1.2× for sequential mapping and 1.5× for parallel mapping both by using the overlapping technique.
Partitioning problems in parallel, pipelined and distributed computing
NASA Technical Reports Server (NTRS)
Bokhari, S.
1985-01-01
The problem of optimally assigning the modules of a parallel program over the processors of a multiple computer system is addressed. A Sum-Bottleneck path algorithm is developed that permits the efficient solution of many variants of this problem under some constraints on the structure of the partitions. In particular, the following problems are solved optimally for a single-host, multiple satellite system: partitioning multiple chain structured parallel programs, multiple arbitrarily structured serial programs and single tree structured parallel programs. In addition, the problems of partitioning chain structured parallel programs across chain connected systems and across shared memory (or shared bus) systems are also solved under certain constraints. All solutions for parallel programs are equally applicable to pipelined programs. These results extend prior research in this area by explicitly taking concurrency into account and permit the efficient utilization of multiple computer architectures for a wide range of problems of practical interest.
Novel memory architecture for video signal processor
NASA Astrophysics Data System (ADS)
Hung, Jen-Sheng; Lin, Chia-Hsing; Jen, Chein-Wei
1993-11-01
An on-chip memory architecture for video signal processor (VSP) is proposed. This memory structure is a two-level design for the different data locality in video applications. The upper level--Memory A provides enough storage capacity to reduce the impact on the limitation of chip I/O bandwidth, and the lower level--Memory B provides enough data parallelism and flexibility to meet the requirements of multiple reconfigurable pipeline function units in a single VSP chip. The needed memory size is decided by the memory usage analysis for video algorithms and the number of function units. Both levels of memory adopted a dual-port memory scheme to sustain the simultaneous read and write operations. Especially, Memory B uses multiple one-read-one-write memory banks to emulate the real multiport memory. Therefore, one can change the configuration of Memory B to several sets of memories with variable read/write ports by adjusting the bus switches. Then the numbers of read ports and write ports in proposed memory can meet requirement of data flow patterns in different video coding algorithms. We have finished the design of a prototype memory design using 1.2- micrometers SPDM SRAM technology and will fabricated it through TSMC, in Taiwan.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Bender, Michael A.; Berry, Jonathan W.; Hammond, Simon D.
A challenge in computer architecture is that processors often cannot be fed data from DRAM as fast as CPUs can consume it. Therefore, many applications are memory-bandwidth bound. With this motivation and the realization that traditional architectures (with all DRAM reachable only via bus) are insufficient to feed groups of modern processing units, vendors have introduced a variety of non-DDR 3D memory technologies (Hybrid Memory Cube (HMC),Wide I/O 2, High Bandwidth Memory (HBM)). These offer higher bandwidth and lower power by stacking DRAM chips on the processor or nearby on a silicon interposer. We will call these solutions “near-memory,” andmore » if user-addressable, “scratchpad.” High-performance systems on the market now offer two levels of main memory: near-memory on package and traditional DRAM further away. In the near term we expect the latencies near-memory and DRAM to be similar. Here, it is natural to think of near-memory as another module on the DRAM level of the memory hierarchy. Vendors are expected to offer modes in which the near memory is used as cache, but we believe that this will be inefficient.« less
Accelerating molecular dynamic simulation on the cell processor and Playstation 3.
Luttmann, Edgar; Ensign, Daniel L; Vaidyanathan, Vishal; Houston, Mike; Rimon, Noam; Øland, Jeppe; Jayachandran, Guha; Friedrichs, Mark; Pande, Vijay S
2009-01-30
Implementation of molecular dynamics (MD) calculations on novel architectures will vastly increase its power to calculate the physical properties of complex systems. Herein, we detail algorithmic advances developed to accelerate MD simulations on the Cell processor, a commodity processor found in PlayStation 3 (PS3). In particular, we discuss issues regarding memory access versus computation and the types of calculations which are best suited for streaming processors such as the Cell, focusing on implicit solvation models. We conclude with a comparison of improved performance on the PS3's Cell processor over more traditional processors. (c) 2008 Wiley Periodicals, Inc.
A Survey of Techniques for Modeling and Improving Reliability of Computing Systems
Mittal, Sparsh; Vetter, Jeffrey S.
2015-04-24
Recent trends of aggressive technology scaling have greatly exacerbated the occurrences and impact of faults in computing systems. This has made `reliability' a first-order design constraint. To address the challenges of reliability, several techniques have been proposed. In this study, we provide a survey of architectural techniques for improving resilience of computing systems. We especially focus on techniques proposed for microarchitectural components, such as processor registers, functional units, cache and main memory etc. In addition, we discuss techniques proposed for non-volatile memory, GPUs and 3D-stacked processors. To underscore the similarities and differences of the techniques, we classify them based onmore » their key characteristics. We also review the metrics proposed to quantify vulnerability of processor structures. Finally, we believe that this survey will help researchers, system-architects and processor designers in gaining insights into the techniques for improving reliability of computing systems.« less
A GaAs vector processor based on parallel RISC microprocessors
NASA Astrophysics Data System (ADS)
Misko, Tim A.; Rasset, Terry L.
A vector processor architecture based on the development of a 32-bit microprocessor using gallium arsenide (GaAs) technology has been developed. The McDonnell Douglas vector processor (MVP) will be fabricated completely from GaAs digital integrated circuits. The MVP architecture includes a vector memory of 1 megabyte, a parallel bus architecture with eight processing elements connected in parallel, and a control processor. The processing elements consist of a reduced instruction set CPU (RISC) with four floating-point coprocessor units and necessary memory interface functions. This architecture has been simulated for several benchmark programs including complex fast Fourier transform (FFT), complex inner product, trigonometric functions, and sort-merge routine. The results of this study indicate that the MVP can process a 1024-point complex FFT at a speed of 112 microsec (389 megaflops) while consuming approximately 618 W of power in a volume of approximately 0.1 ft-cubed.
A Survey of Techniques for Modeling and Improving Reliability of Computing Systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mittal, Sparsh; Vetter, Jeffrey S.
Recent trends of aggressive technology scaling have greatly exacerbated the occurrences and impact of faults in computing systems. This has made `reliability' a first-order design constraint. To address the challenges of reliability, several techniques have been proposed. In this study, we provide a survey of architectural techniques for improving resilience of computing systems. We especially focus on techniques proposed for microarchitectural components, such as processor registers, functional units, cache and main memory etc. In addition, we discuss techniques proposed for non-volatile memory, GPUs and 3D-stacked processors. To underscore the similarities and differences of the techniques, we classify them based onmore » their key characteristics. We also review the metrics proposed to quantify vulnerability of processor structures. Finally, we believe that this survey will help researchers, system-architects and processor designers in gaining insights into the techniques for improving reliability of computing systems.« less
Performance analysis and kernel size study of the Lynx real-time operating system
NASA Technical Reports Server (NTRS)
Liu, Yuan-Kwei; Gibson, James S.; Fernquist, Alan R.
1993-01-01
This paper analyzes the Lynx real-time operating system (LynxOS), which has been selected as the operating system for the Space Station Freedom Data Management System (DMS). The features of LynxOS are compared to other Unix-based operating system (OS). The tools for measuring the performance of LynxOS, which include a high-speed digital timer/counter board, a device driver program, and an application program, are analyzed. The timings for interrupt response, process creation and deletion, threads, semaphores, shared memory, and signals are measured. The memory size of the DMS Embedded Data Processor (EDP) is limited. Besides, virtual memory is not suitable for real-time applications because page swap timing may not be deterministic. Therefore, the DMS software, including LynxOS, has to fit in the main memory of an EDP. To reduce the LynxOS kernel size, the following steps are taken: analyzing the factors that influence the kernel size; identifying the modules of LynxOS that may not be needed in an EDP; adjusting the system parameters of LynxOS; reconfiguring the device drivers used in the LynxOS; and analyzing the symbol table. The reductions in kernel disk size, kernel memory size and total kernel size reduction from each step mentioned above are listed and analyzed.
Magnetic Bubble Memories for Data Collection in Sounding Rockets,
1982-01-29
generate interest in bubbles as a mass storage device for micro - processor based equipment, manufacturers have come up with a variety of diversified...absence of a bubble represents a Ŕ". With diameters on the order of I to 5 micro -meters, these bubbles are so small that extremely tiny chips can hold...methods of transfer: polled I/O, interrupt driven I/O, and direct memory access (DMA). The first two methods require tho host processor be involved
Li, Jian; Bloch, Pavel; Xu, Jing; Sarunic, Marinko V; Shannon, Lesley
2011-05-01
Fourier domain optical coherence tomography (FD-OCT) provides faster line rates, better resolution, and higher sensitivity for noninvasive, in vivo biomedical imaging compared to traditional time domain OCT (TD-OCT). However, because the signal processing for FD-OCT is computationally intensive, real-time FD-OCT applications demand powerful computing platforms to deliver acceptable performance. Graphics processing units (GPUs) have been used as coprocessors to accelerate FD-OCT by leveraging their relatively simple programming model to exploit thread-level parallelism. Unfortunately, GPUs do not "share" memory with their host processors, requiring additional data transfers between the GPU and CPU. In this paper, we implement a complete FD-OCT accelerator on a consumer grade GPU/CPU platform. Our data acquisition system uses spectrometer-based detection and a dual-arm interferometer topology with numerical dispersion compensation for retinal imaging. We demonstrate that the maximum line rate is dictated by the memory transfer time and not the processing time due to the GPU platform's memory model. Finally, we discuss how the performance trends of GPU-based accelerators compare to the expected future requirements of FD-OCT data rates.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mittal, Sparsh; Zhang, Zhao
With each CMOS technology generation, leakage energy consumption has been dramatically increasing and hence, managing leakage power consumption of large last-level caches (LLCs) has become a critical issue in modern processor design. In this paper, we present EnCache, a novel software-based technique which uses dynamic profiling-based cache reconfiguration for saving cache leakage energy. EnCache uses a simple hardware component called profiling cache, which dynamically predicts energy efficiency of an application for 32 possible cache configurations. Using these estimates, system software reconfigures the cache to the most energy efficient configuration. EnCache uses dynamic cache reconfiguration and hence, it does not requiremore » offline profiling or tuning the parameter for each application. Furthermore, EnCache optimizes directly for the overall memory subsystem (LLC and main memory) energy efficiency instead of the LLC energy efficiency alone. The experiments performed with an x86-64 simulator and workloads from SPEC2006 suite confirm that EnCache provides larger energy saving than a conventional energy saving scheme. For single core and dual-core system configurations, the average savings in memory subsystem energy over a shared baseline configuration are 30.0% and 27.3%, respectively.« less
A portable approach for PIC on emerging architectures
NASA Astrophysics Data System (ADS)
Decyk, Viktor
2016-03-01
A portable approach for designing Particle-in-Cell (PIC) algorithms on emerging exascale computers, is based on the recognition that 3 distinct programming paradigms are needed. They are: low level vector (SIMD) processing, middle level shared memory parallel programing, and high level distributed memory programming. In addition, there is a memory hierarchy associated with each level. Such algorithms can be initially developed using vectorizing compilers, OpenMP, and MPI. This is the approach recommended by Intel for the Phi processor. These algorithms can then be translated and possibly specialized to other programming models and languages, as needed. For example, the vector processing and shared memory programming might be done with CUDA instead of vectorizing compilers and OpenMP, but generally the algorithm itself is not greatly changed. The UCLA PICKSC web site at http://www.idre.ucla.edu/ contains example open source skeleton codes (mini-apps) illustrating each of these three programming models, individually and in combination. Fortran2003 now supports abstract data types, and design patterns can be used to support a variety of implementations within the same code base. Fortran2003 also supports interoperability with C so that implementations in C languages are also easy to use. Finally, main codes can be translated into dynamic environments such as Python, while still taking advantage of high performing compiled languages. Parallel languages are still evolving with interesting developments in co-Array Fortran, UPC, and OpenACC, among others, and these can also be supported within the same software architecture. Work supported by NSF and DOE Grants.
System Level RBDO for Military Ground Vehicles using High Performance Computing
2008-01-01
platform. Only the analyses that required more than 24 processors were conducted on the Onyx 350 due to the limited number of processors on the...optimization constraints varied. The queues set the number of processors and number of finite element code licenses available to the analyses. sgi ONYX ...3900: unix 24 MIPS R16000 PROCESSORS 4 IR2 GRAPHICS PIPES 4 IR3 GRAPHICS PIPES 24 GBYTES MEMORY 36 GBYTES LOCAL DISK SPACE sgi ONYX 350: unix 32 MIPS
Cooperative Data Sharing: Simple Support for Clusters of SMP Nodes
NASA Technical Reports Server (NTRS)
DiNucci, David C.; Balley, David H. (Technical Monitor)
1997-01-01
Libraries like PVM and MPI send typed messages to allow for heterogeneous cluster computing. Lower-level libraries, such as GAM, provide more efficient access to communication by removing the need to copy messages between the interface and user space in some cases. still lower-level interfaces, such as UNET, get right down to the hardware level to provide maximum performance. However, these are all still interfaces for passing messages from one process to another, and have limited utility in a shared-memory environment, due primarily to the fact that message passing is just another term for copying. This drawback is made more pertinent by today's hybrid architectures (e.g. clusters of SMPs), where it is difficult to know beforehand whether two communicating processes will share memory. As a result, even portable language tools (like HPF compilers) must either map all interprocess communication, into message passing with the accompanying performance degradation in shared memory environments, or they must check each communication at run-time and implement the shared-memory case separately for efficiency. Cooperative Data Sharing (CDS) is a single user-level API which abstracts all communication between processes into the sharing and access coordination of memory regions, in a model which might be described as "distributed shared messages" or "large-grain distributed shared memory". As a result, the user programs to a simple latency-tolerant abstract communication specification which can be mapped efficiently to either a shared-memory or message-passing based run-time system, depending upon the available architecture. Unlike some distributed shared memory interfaces, the user still has complete control over the assignment of data to processors, the forwarding of data to its next likely destination, and the queuing of data until it is needed, so even the relatively high latency present in clusters can be accomodated. CDS does not require special use of an MMU, which can add overhead to some DSM systems, and does not require an SPMD programming model. unlike some message-passing interfaces, CDS allows the user to implement efficient demand-driven applications where processes must "fight" over data, and does not perform copying if processes share memory and do not attempt concurrent writes. CDS also supports heterogeneous computing, dynamic process creation, handlers, and a very simple thread-arbitration mechanism. Additional support for array subsections is currently being considered. The CDS1 API, which forms the kernel of CDS, is built primarily upon only 2 communication primitives, one process initiation primitive, and some data translation (and marshalling) routines, memory allocation routines, and priority control routines. The entire current collection of 28 routines provides enough functionality to implement most (or all) of MPI 1 and 2, which has a much larger interface consisting of hundreds of routines. still, the API is small enough to consider integrating into standard os interfaces for handling inter-process communication in a network-independent way. This approach would also help to solve many of the problems plaguing other higher-level standards such as MPI and PVM which must, in some cases, "play OS" to adequately address progress and process control issues. The CDS2 API, a higher level of interface roughly equivalent in functionality to MPI and to be built entirely upon CDS1, is still being designed. It is intended to add support for the equivalent of communicators, reduction and other collective operations, process topologies, additional support for process creation, and some automatic memory management. CDS2 will not exactly match MPI, because the copy-free semantics of communication from CDS1 will be supported. CDS2 application programs will be free to carefully also use CDS1. CDS1 has been implemented on networks of workstations running unmodified Unix-based operating systems, using UDP/IP and vendor-supplied high- performance locks. Although its inter-node performance is currently unimpressive due to rudimentary implementation technique, it even now outperforms highly-optimized MPI implementation on intra-node communication due to its support for non-copy communication. The similarity of the CDS1 architecture to that of other projects such as UNET and TRAP suggests that the inter-node performance can be increased significantly to surpass MPI or PVM, and it may be possible to migrate some of its functionality to communication controllers.
Template Based Low Data Rate Speech Encoder
1993-09-30
Nasality Distinguishes In/ from d/ 95.6 96.9 1m/ from /b/, etc. Sustention Distinguishes /f/ from /p/, $7.5 88.3 ibi from N/, Al from /0 8. etc. Sibilation...processor performs mainly Processor Workstation input/output (I/O) operations. The dynamic random access memory (DRAM) has 16 million bytes of...storage capacity. To execute the 800-b/s voice algorithm, the following amount of memory is needed: 5 MB for tables, 1.5 MB for it "program, and 30 KB for
Two-level main memory co-design: Multi-threaded algorithmic primitives, analysis, and simulation
Bender, Michael A.; Berry, Jonathan W.; Hammond, Simon D.; ...
2017-01-03
A challenge in computer architecture is that processors often cannot be fed data from DRAM as fast as CPUs can consume it. Therefore, many applications are memory-bandwidth bound. With this motivation and the realization that traditional architectures (with all DRAM reachable only via bus) are insufficient to feed groups of modern processing units, vendors have introduced a variety of non-DDR 3D memory technologies (Hybrid Memory Cube (HMC),Wide I/O 2, High Bandwidth Memory (HBM)). These offer higher bandwidth and lower power by stacking DRAM chips on the processor or nearby on a silicon interposer. We will call these solutions “near-memory,” andmore » if user-addressable, “scratchpad.” High-performance systems on the market now offer two levels of main memory: near-memory on package and traditional DRAM further away. In the near term we expect the latencies near-memory and DRAM to be similar. Here, it is natural to think of near-memory as another module on the DRAM level of the memory hierarchy. Vendors are expected to offer modes in which the near memory is used as cache, but we believe that this will be inefficient.« less
NASA Technical Reports Server (NTRS)
Perkinson, J. A.
1974-01-01
The application of associative memory processor equipment to conventional host processors type systems is discussed. Efforts were made to demonstrate how such application relieves the task burden of conventional systems, and enhance system speed and efficiency. Data cover comparative theoretical performance analysis, demonstration of expanded growth capabilities, and demonstrations of actual hardware in simulated environment.
Managing coherence via put/get windows
Blumrich, Matthias A [Ridgefield, CT; Chen, Dong [Croton on Hudson, NY; Coteus, Paul W [Yorktown Heights, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Philip [Cortlandt Manor, NY; Hoenicke, Dirk [Ossining, NY; Ohmacht, Martin [Yorktown Heights, NY
2011-01-11
A method and apparatus for managing coherence between two processors of a two processor node of a multi-processor computer system. Generally the present invention relates to a software algorithm that simplifies and significantly speeds the management of cache coherence in a message passing parallel computer, and to hardware apparatus that assists this cache coherence algorithm. The software algorithm uses the opening and closing of put/get windows to coordinate the activated required to achieve cache coherence. The hardware apparatus may be an extension to the hardware address decode, that creates, in the physical memory address space of the node, an area of virtual memory that (a) does not actually exist, and (b) is therefore able to respond instantly to read and write requests from the processing elements.
Managing coherence via put/get windows
Blumrich, Matthias A [Ridgefield, CT; Chen, Dong [Croton on Hudson, NY; Coteus, Paul W [Yorktown Heights, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Philip [Cortlandt Manor, NY; Hoenicke, Dirk [Ossining, NY; Ohmacht, Martin [Yorktown Heights, NY
2012-02-21
A method and apparatus for managing coherence between two processors of a two processor node of a multi-processor computer system. Generally the present invention relates to a software algorithm that simplifies and significantly speeds the management of cache coherence in a message passing parallel computer, and to hardware apparatus that assists this cache coherence algorithm. The software algorithm uses the opening and closing of put/get windows to coordinate the activated required to achieve cache coherence. The hardware apparatus may be an extension to the hardware address decode, that creates, in the physical memory address space of the node, an area of virtual memory that (a) does not actually exist, and (b) is therefore able to respond instantly to read and write requests from the processing elements.
Compiler analysis for irregular problems in FORTRAN D
NASA Technical Reports Server (NTRS)
Vonhanxleden, Reinhard; Kennedy, Ken; Koelbel, Charles; Das, Raja; Saltz, Joel
1992-01-01
We developed a dataflow framework which provides a basis for rigorously defining strategies to make use of runtime preprocessing methods for distributed memory multiprocessors. In many programs, several loops access the same off-processor memory locations. Our runtime support gives us a mechanism for tracking and reusing copies of off-processor data. A key aspect of our compiler analysis strategy is to determine when it is safe to reuse copies of off-processor data. Another crucial function of the compiler analysis is to identify situations which allow runtime preprocessing overheads to be amortized. This dataflow analysis will make it possible to effectively use the results of interprocedural analysis in our efforts to reduce interprocessor communication and the need for runtime preprocessing.
New computing systems and their impact on structural analysis and design
NASA Technical Reports Server (NTRS)
Noor, Ahmed K.
1989-01-01
A review is given of the recent advances in computer technology that are likely to impact structural analysis and design. The computational needs for future structures technology are described. The characteristics of new and projected computing systems are summarized. Advances in programming environments, numerical algorithms, and computational strategies for new computing systems are reviewed, and a novel partitioning strategy is outlined for maximizing the degree of parallelism. The strategy is designed for computers with a shared memory and a small number of powerful processors (or a small number of clusters of medium-range processors). It is based on approximating the response of the structure by a combination of symmetric and antisymmetric response vectors, each obtained using a fraction of the degrees of freedom of the original finite element model. The strategy was implemented on the CRAY X-MP/4 and the Alliant FX/8 computers. For nonlinear dynamic problems on the CRAY X-MP with four CPUs, it resulted in an order of magnitude reduction in total analysis time, compared with the direct analysis on a single-CPU CRAY X-MP machine.
Optical Interconnections for VLSI Computational Systems Using Computer-Generated Holography.
NASA Astrophysics Data System (ADS)
Feldman, Michael Robert
Optical interconnects for VLSI computational systems using computer generated holograms are evaluated in theory and experiment. It is shown that by replacing particular electronic connections with free-space optical communication paths, connection of devices on a single chip or wafer and between chips or modules can be improved. Optical and electrical interconnects are compared in terms of power dissipation, communication bandwidth, and connection density. Conditions are determined for which optical interconnects are advantageous. Based on this analysis, it is shown that by applying computer generated holographic optical interconnects to wafer scale fine grain parallel processing systems, dramatic increases in system performance can be expected. Some new interconnection networks, designed to take full advantage of optical interconnect technology, have been developed. Experimental Computer Generated Holograms (CGH's) have been designed, fabricated and subsequently tested in prototype optical interconnected computational systems. Several new CGH encoding methods have been developed to provide efficient high performance CGH's. One CGH was used to decrease the access time of a 1 kilobit CMOS RAM chip. Another was produced to implement the inter-processor communication paths in a shared memory SIMD parallel processor array.
Protect sensitive data with lightweight memory encryption
NASA Astrophysics Data System (ADS)
Zhou, Hongwei; Yuan, Jinhui; Xiao, Rui; Zhang, Kai; Sun, Jingyao
2018-04-01
Since current commercial processor is not able to deal with the data in the cipher text, the sensitive data have to be exposed in the memory. It leaves a window for the adversary. To protect the sensitive data, a direct idea is to encrypt the data when the processor does not access them. On the observation, we have developed a lightweight memory encryption, called LeMe, to protect the sensitive data in the application. LeMe marks the sensitive data in the memory with the page table entry, and encrypts the data in their free time. LeMe is built on the Linux with a 3.17.6 kernel, and provides four user interfaces as dynamic link library. Our evaluations show LeMe is effective to protect the sensitive data and incurs an acceptable performance overhead.
NASA Astrophysics Data System (ADS)
González, Diego; Botella, Guillermo; García, Carlos; Prieto, Manuel; Tirado, Francisco
2013-12-01
This contribution focuses on the optimization of matching-based motion estimation algorithms widely used for video coding standards using an Altera custom instruction-based paradigm and a combination of synchronous dynamic random access memory (SDRAM) with on-chip memory in Nios II processors. A complete profile of the algorithms is achieved before the optimization, which locates code leaks, and afterward, creates a custom instruction set, which is then added to the specific design, enhancing the original system. As well, every possible memory combination between on-chip memory and SDRAM has been tested to achieve the best performance. The final throughput of the complete designs are shown. This manuscript outlines a low-cost system, mapped using very large scale integration technology, which accelerates software algorithms by converting them into custom hardware logic blocks and showing the best combination between on-chip memory and SDRAM for the Nios II processor.
Message Passing and Shared Address Space Parallelism on an SMP Cluster
NASA Technical Reports Server (NTRS)
Shan, Hongzhang; Singh, Jaswinder P.; Oliker, Leonid; Biswas, Rupak; Biegel, Bryan (Technical Monitor)
2002-01-01
Currently, message passing (MP) and shared address space (SAS) are the two leading parallel programming paradigms. MP has been standardized with MPI, and is the more common and mature approach; however, code development can be extremely difficult, especially for irregularly structured computations. SAS offers substantial ease of programming, but may suffer from performance limitations due to poor spatial locality and high protocol overhead. In this paper, we compare the performance of and the programming effort required for six applications under both programming models on a 32-processor PC-SMP cluster, a platform that is becoming increasingly attractive for high-end scientific computing. Our application suite consists of codes that typically do not exhibit scalable performance under shared-memory programming due to their high communication-to-computation ratios and/or complex communication patterns. Results indicate that SAS can achieve about half the parallel efficiency of MPI for most of our applications, while being competitive for the others. A hybrid MPI+SAS strategy shows only a small performance advantage over pure MPI in some cases. Finally, improved implementations of two MPI collective operations on PC-SMP clusters are presented.
Optimization of the Multi-Spectral Euclidean Distance Calculation for FPGA-based Spaceborne Systems
NASA Technical Reports Server (NTRS)
Cristo, Alejandro; Fisher, Kevin; Perez, Rosa M.; Martinez, Pablo; Gualtieri, Anthony J.
2012-01-01
Due to the high quantity of operations that spaceborne processing systems must carry out in space, new methodologies and techniques are being presented as good alternatives in order to free the main processor from work and improve the overall performance. These include the development of ancillary dedicated hardware circuits that carry out the more redundant and computationally expensive operations in a faster way, leaving the main processor free to carry out other tasks while waiting for the result. One of these devices is SpaceCube, a FPGA-based system designed by NASA. The opportunity to use FPGA reconfigurable architectures in space allows not only the optimization of the mission operations with hardware-level solutions, but also the ability to create new and improved versions of the circuits, including error corrections, once the satellite is already in orbit. In this work, we propose the optimization of a common operation in remote sensing: the Multi-Spectral Euclidean Distance calculation. For that, two different hardware architectures have been designed and implemented in a Xilinx Virtex-5 FPGA, the same model of FPGAs used by SpaceCube. Previous results have shown that the communications between the embedded processor and the circuit create a bottleneck that affects the overall performance in a negative way. In order to avoid this, advanced methods including memory sharing, Native Port Interface (NPI) connections and Data Burst Transfers have been used.
Geospace simulations using modern accelerator processor technology
NASA Astrophysics Data System (ADS)
Germaschewski, K.; Raeder, J.; Larson, D. J.
2009-12-01
OpenGGCM (Open Geospace General Circulation Model) is a well-established numerical code simulating the Earth's space environment. The most computing intensive part is the MHD (magnetohydrodynamics) solver that models the plasma surrounding Earth and its interaction with Earth's magnetic field and the solar wind flowing in from the sun. Like other global magnetosphere codes, OpenGGCM's realism is currently limited by computational constraints on grid resolution. OpenGGCM has been ported to make use of the added computational powerof modern accelerator based processor architectures, in particular the Cell processor. The Cell architecture is a novel inhomogeneous multicore architecture capable of achieving up to 230 GFLops on a single chip. The University of New Hampshire recently acquired a PowerXCell 8i based computing cluster, and here we will report initial performance results of OpenGGCM. Realizing the high theoretical performance of the Cell processor is a programming challenge, though. We implemented the MHD solver using a multi-level parallelization approach: On the coarsest level, the problem is distributed to processors based upon the usual domain decomposition approach. Then, on each processor, the problem is divided into 3D columns, each of which is handled by the memory limited SPEs (synergistic processing elements) slice by slice. Finally, SIMD instructions are used to fully exploit the SIMD FPUs in each SPE. Memory management needs to be handled explicitly by the code, using DMA to move data from main memory to the per-SPE local store and vice versa. We use a modern technique, automatic code generation, which shields the application programmer from having to deal with all of the implementation details just described, keeping the code much more easily maintainable. Our preliminary results indicate excellent performance, a speed-up of a factor of 30 compared to the unoptimized version.
JPRS Report, Science & Technology, China, High-Performance Computer Systems
1992-10-28
microprocessor array The microprocessor array in the AP85 system is com- posed of 16 completely identical array element micro - processors . Each array element...microprocessors and capable of host machine reading and writing. The memory capacity of the array element micro - processors as a whole can be expanded...transmission functions to carry out data transmission from array element micro - processor to array element microprocessor, from array element
Chatterjee, Siddhartha [Yorktown Heights, NY; Gunnels, John A [Brewster, NY
2011-11-08
A method and structure of distributing elements of an array of data in a computer memory to a specific processor of a multi-dimensional mesh of parallel processors includes designating a distribution of elements of at least a portion of the array to be executed by specific processors in the multi-dimensional mesh of parallel processors. The pattern of the designating includes a cyclical repetitive pattern of the parallel processor mesh, as modified to have a skew in at least one dimension so that both a row of data in the array and a column of data in the array map to respective contiguous groupings of the processors such that a dimension of the contiguous groupings is greater than one.
NASA Astrophysics Data System (ADS)
Hayakawa, Hitoshi; Ogawa, Makoto; Shibata, Tadashi
2005-04-01
A very large scale integrated circuit (VLSI) architecture for a multiple-instruction-stream multiple-data-stream (MIMD) associative processor has been proposed. The processor employs an architecture that enables seamless switching from associative operations to arithmetic operations. The MIMD element is convertible to a regular central processing unit (CPU) while maintaining its high performance as an associative processor. Therefore, the MIMD associative processor can perform not only on-chip perception, i.e., searching for the vector most similar to an input vector throughout the on-chip cache memory, but also arithmetic and logic operations similar to those in ordinary CPUs, both simultaneously in parallel processing. Three key technologies have been developed to generate the MIMD element: associative-operation-and-arithmetic-operation switchable calculation units, a versatile register control scheme within the MIMD element for flexible operations, and a short instruction set for minimizing the memory size for program storage. Key circuit blocks were designed and fabricated using 0.18 μm complementary metal-oxide-semiconductor (CMOS) technology. As a result, the full-featured MIMD element is estimated to be 3 mm2, showing the feasibility of an 8-parallel-MIMD-element associative processor in a single chip of 5 mm× 5 mm.
Architecture of security management unit for safe hosting of multiple agents
NASA Astrophysics Data System (ADS)
Gilmont, Tanguy; Legat, Jean-Didier; Quisquater, Jean-Jacques
1999-04-01
In such growing areas as remote applications in large public networks, electronic commerce, digital signature, intellectual property and copyright protection, and even operating system extensibility, the hardware security level offered by existing processors is insufficient. They lack protection mechanisms that prevent the user from tampering critical data owned by those applications. Some devices make exception, but have not enough processing power nor enough memory to stand up to such applications (e.g. smart cards). This paper proposes an architecture of secure processor, in which the classical memory management unit is extended into a new security management unit. It allows ciphered code execution and ciphered data processing. An internal permanent memory can store cipher keys and critical data for several client agents simultaneously. The ordinary supervisor privilege scheme is replaced by a privilege inheritance mechanism that is more suited to operating system extensibility. The result is a secure processor that has hardware support for extensible multitask operating systems, and can be used for both general applications and critical applications needing strong protection. The security management unit and the internal permanent memory can be added to an existing CPU core without loss of performance, and do not require it to be modified.
Moradi, Saber; Qiao, Ning; Stefanini, Fabio; Indiveri, Giacomo
2018-02-01
Neuromorphic computing systems comprise networks of neurons that use asynchronous events for both computation and communication. This type of representation offers several advantages in terms of bandwidth and power consumption in neuromorphic electronic systems. However, managing the traffic of asynchronous events in large scale systems is a daunting task, both in terms of circuit complexity and memory requirements. Here, we present a novel routing methodology that employs both hierarchical and mesh routing strategies and combines heterogeneous memory structures for minimizing both memory requirements and latency, while maximizing programming flexibility to support a wide range of event-based neural network architectures, through parameter configuration. We validated the proposed scheme in a prototype multicore neuromorphic processor chip that employs hybrid analog/digital circuits for emulating synapse and neuron dynamics together with asynchronous digital circuits for managing the address-event traffic. We present a theoretical analysis of the proposed connectivity scheme, describe the methods and circuits used to implement such scheme, and characterize the prototype chip. Finally, we demonstrate the use of the neuromorphic processor with a convolutional neural network for the real-time classification of visual symbols being flashed to a dynamic vision sensor (DVS) at high speed.
Real-Time Data Processing in the muon system of the D0 detector.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Neeti Parashar et al.
2001-07-03
This paper presents a real-time application of the 16-bit fixed point Digital Signal Processors (DSPs), in the Muon System of the D0 detector located at the Fermilab Tevatron, presently the world's highest-energy hadron collider. As part of the Upgrade for a run beginning in the year 2000, the system is required to process data at an input event rate of 10 KHz without incurring significant deadtime in readout. The ADSP21csp01 processor has high I/O bandwidth, single cycle instruction execution and fast task switching support to provide efficient multisignal processing. The processor's internal memory consists of 4K words of Program Memorymore » and 4K words of Data Memory. In addition there is an external memory of 32K words for general event buffering and 16K words of Dual port Memory for input data queuing. This DSP fulfills the requirement of the Muon subdetector systems for data readout. All error handling, buffering, formatting and transferring of the data to the various trigger levels of the data acquisition system is done in software. The algorithms developed for the system complete these tasks in about 20 {micro}s per event.« less
Code of Federal Regulations, 2010 CFR
2010-10-01
... 50 Wildlife and Fisheries 9 2010-10-01 2010-10-01 false Crab Quota Share (QS), Processor QS (PQS... established based on the regional designations determined on August 1, 2005. QS or PQS issued after this date... information is true, correct, and complete to the best of his/her knowledge and belief. If the application is...
A high performance linear equation solver on the VPP500 parallel supercomputer
DOE Office of Scientific and Technical Information (OSTI.GOV)
Nakanishi, Makoto; Ina, Hiroshi; Miura, Kenichi
1994-12-31
This paper describes the implementation of two high performance linear equation solvers developed for the Fujitsu VPP500, a distributed memory parallel supercomputer system. The solvers take advantage of the key architectural features of VPP500--(1) scalability for an arbitrary number of processors up to 222 processors, (2) flexible data transfer among processors provided by a crossbar interconnection network, (3) vector processing capability on each processor, and (4) overlapped computation and transfer. The general linear equation solver based on the blocked LU decomposition method achieves 120.0 GFLOPS performance with 100 processors in the LIN-PACK Highly Parallel Computing benchmark.
A FAST ITERATIVE METHOD FOR SOLVING THE EIKONAL EQUATION ON TETRAHEDRAL DOMAINS
Fu, Zhisong; Kirby, Robert M.; Whitaker, Ross T.
2014-01-01
Generating numerical solutions to the eikonal equation and its many variations has a broad range of applications in both the natural and computational sciences. Efficient solvers on cutting-edge, parallel architectures require new algorithms that may not be theoretically optimal, but that are designed to allow asynchronous solution updates and have limited memory access patterns. This paper presents a parallel algorithm for solving the eikonal equation on fully unstructured tetrahedral meshes. The method is appropriate for the type of fine-grained parallelism found on modern massively-SIMD architectures such as graphics processors and takes into account the particular constraints and capabilities of these computing platforms. This work builds on previous work for solving these equations on triangle meshes; in this paper we adapt and extend previous two-dimensional strategies to accommodate three-dimensional, unstructured, tetrahedralized domains. These new developments include a local update strategy with data compaction for tetrahedral meshes that provides solutions on both serial and parallel architectures, with a generalization to inhomogeneous, anisotropic speed functions. We also propose two new update schemes, specialized to mitigate the natural data increase observed when moving to three dimensions, and the data structures necessary for efficiently mapping data to parallel SIMD processors in a way that maintains computational density. Finally, we present descriptions of the implementations for a single CPU, as well as multicore CPUs with shared memory and SIMD architectures, with comparative results against state-of-the-art eikonal solvers. PMID:25221418
High-performance sparse matrix-matrix products on Intel KNL and multicore architectures
DOE Office of Scientific and Technical Information (OSTI.GOV)
Nagasaka, Y; Matsuoka, S; Azad, A
Sparse matrix-matrix multiplication (SpGEMM) is a computational primitive that is widely used in areas ranging from traditional numerical applications to recent big data analysis and machine learning. Although many SpGEMM algorithms have been proposed, hardware specific optimizations for multi- and many-core processors are lacking and a detailed analysis of their performance under various use cases and matrices is not available. We firstly identify and mitigate multiple bottlenecks with memory management and thread scheduling on Intel Xeon Phi (Knights Landing or KNL). Specifically targeting multi- and many-core processors, we develop a hash-table-based algorithm and optimize a heap-based shared-memory SpGEMM algorithm. Wemore » examine their performance together with other publicly available codes. Different from the literature, our evaluation also includes use cases that are representative of real graph algorithms, such as multi-source breadth-first search or triangle counting. Our hash-table and heap-based algorithms are showing significant speedups from libraries in the majority of the cases while different algorithms dominate the other scenarios with different matrix size, sparsity, compression factor and operation type. We wrap up in-depth evaluation results and make a recipe to give the best SpGEMM algorithm for target scenario. A critical finding is that hash-table-based SpGEMM gets a significant performance boost if the nonzeros are not required to be sorted within each row of the output matrix.« less
Xyce™ Parallel Electronic Simulator Users' Guide, Version 6.5.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Keiter, Eric R.; Aadithya, Karthik V.; Mei, Ting
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). This includes support for most popular parallel and serial computers. A differential-algebraic-equation (DAE) formulation, which better isolates the device model package from solver algorithms. This allows one to developmore » new types of analysis without requiring the implementation of analysis-specific device models. Device models that are specifically tailored to meet Sandia's needs, including some radiation- aware devices (for Sandia users only). Object-oriented code design and implementation using modern coding practices. Xyce is a parallel code in the most general sense of the phrase -- a message passing parallel implementation -- which allows it to run efficiently a wide range of computing platforms. These include serial, shared-memory and distributed-memory parallel platforms. Attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows. The information herein is subject to change without notice. Copyright © 2002-2016 Sandia Corporation. All rights reserved.« less
NASA Astrophysics Data System (ADS)
Onizawa, Naoya; Tamakoshi, Akira; Hanyu, Takahiro
2017-08-01
In this paper, reinitialization-free nonvolatile computer systems are designed and evaluated for energy-harvesting Internet of things (IoT) applications. In energy-harvesting applications, as power supplies generated from renewable power sources cause frequent power failures, data processed need to be backed up when power failures occur. Unless data are safely backed up before power supplies diminish, reinitialization processes are required when power supplies are recovered, which results in low energy efficiencies and slow operations. Using nonvolatile devices in processors and memories can realize a faster backup than a conventional volatile computer system, leading to a higher energy efficiency. To evaluate the energy efficiency upon frequent power failures, typical computer systems including processors and memories are designed using 90 nm CMOS or CMOS/magnetic tunnel junction (MTJ) technologies. Nonvolatile ARM Cortex-M0 processors with 4 kB MRAMs are evaluated using a typical computing benchmark program, Dhrystone, which shows a few order-of-magnitude reductions in energy in comparison with a volatile processor with SRAM.
NASA Technical Reports Server (NTRS)
Sohn, Andrew; Biswas, Rupak; Simon, Horst D.
1996-01-01
The computational requirements for an adaptive solution of unsteady problems change as the simulation progresses. This causes workload imbalance among processors on a parallel machine which, in turn, requires significant data movement at runtime. We present a new dynamic load-balancing framework, called JOVE, that balances the workload across all processors with a global view. Whenever the computational mesh is adapted, JOVE is activated to eliminate the load imbalance. JOVE has been implemented on an IBM SP2 distributed-memory machine in MPI for portability. Experimental results for two model meshes demonstrate that mesh adaption with load balancing gives more than a sixfold improvement over one without load balancing. We also show that JOVE gives a 24-fold speedup on 64 processors compared to sequential execution.
Phase space simulation of collisionless stellar systems on the massively parallel processor
NASA Technical Reports Server (NTRS)
White, Richard L.
1987-01-01
A numerical technique for solving the collisionless Boltzmann equation describing the time evolution of a self gravitating fluid in phase space was implemented on the Massively Parallel Processor (MPP). The code performs calculations for a two dimensional phase space grid (with one space and one velocity dimension). Some results from calculations are presented. The execution speed of the code is comparable to the speed of a single processor of a Cray-XMP. Advantages and disadvantages of the MPP architecture for this type of problem are discussed. The nearest neighbor connectivity of the MPP array does not pose a significant obstacle. Future MPP-like machines should have much more local memory and easier access to staging memory and disks in order to be effective for this type of problem.
Simplifying and speeding the management of intra-node cache coherence
Blumrich, Matthias A [Ridgefield, CT; Chen, Dong [Croton on Hudson, NY; Coteus, Paul W [Yorktown Heights, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Phillip [Cortlandt Manor, NY; Hoenicke, Dirk [Ossining, NY; Ohmacht, Martin [Yorktown Heights, NY
2012-04-17
A method and apparatus for managing coherence between two processors of a two processor node of a multi-processor computer system. Generally the present invention relates to a software algorithm that simplifies and significantly speeds the management of cache coherence in a message passing parallel computer, and to hardware apparatus that assists this cache coherence algorithm. The software algorithm uses the opening and closing of put/get windows to coordinate the activated required to achieve cache coherence. The hardware apparatus may be an extension to the hardware address decode, that creates, in the physical memory address space of the node, an area of virtual memory that (a) does not actually exist, and (b) is therefore able to respond instantly to read and write requests from the processing elements.
Managing coherence via put/get windows
DOE Office of Scientific and Technical Information (OSTI.GOV)
Blumrich, Matthias A; Chen, Dong; Coteus, Paul W
A method and apparatus for managing coherence between two processors of a two processor node of a multi-processor computer system. Generally the present invention relates to a software algorithm that simplifies and significantly speeds the management of cache coherence in a message passing parallel computer, and to hardware apparatus that assists this cache coherence algorithm. The software algorithm uses the opening and closing of put/get windows to coordinate the activated required to achieve cache coherence. The hardware apparatus may be an extension to the hardware address decode, that creates, in the physical memory address space of the node, an areamore » of virtual memory that (a) does not actually exist, and (b) is therefore able to respond instantly to read and write requests from the processing elements.« less
Neurovision processor for designing intelligent sensors
NASA Astrophysics Data System (ADS)
Gupta, Madan M.; Knopf, George K.
1992-03-01
A programmable multi-task neuro-vision processor, called the Positive-Negative (PN) neural processor, is proposed as a plausible hardware mechanism for constructing robust multi-task vision sensors. The computational operations performed by the PN neural processor are loosely based on the neural activity fields exhibited by certain nervous tissue layers situated in the brain. The neuro-vision processor can be programmed to generate diverse dynamic behavior that may be used for spatio-temporal stabilization (STS), short-term visual memory (STVM), spatio-temporal filtering (STF) and pulse frequency modulation (PFM). A multi- functional vision sensor that performs a variety of information processing operations on time- varying two-dimensional sensory images can be constructed from a parallel and hierarchical structure of numerous individually programmed PN neural processors.
Computationally Efficient Modeling and Simulation of Large Scale Systems
NASA Technical Reports Server (NTRS)
Jain, Jitesh (Inventor); Koh, Cheng-Kok (Inventor); Balakrishnan, Vankataramanan (Inventor); Cauley, Stephen F (Inventor); Li, Hong (Inventor)
2014-01-01
A system for simulating operation of a VLSI interconnect structure having capacitive and inductive coupling between nodes thereof, including a processor, and a memory, the processor configured to perform obtaining a matrix X and a matrix Y containing different combinations of passive circuit element values for the interconnect structure, the element values for each matrix including inductance L and inverse capacitance P, obtaining an adjacency matrix A associated with the interconnect structure, storing the matrices X, Y, and A in the memory, and performing numerical integration to solve first and second equations.
DOE Office of Scientific and Technical Information (OSTI.GOV)
De Supinski, B.; Caliga, D.
2017-09-28
The primary objective of this project was to develop memory optimization technology to efficiently deliver data to, and distribute data within, the SRC-6's Field Programmable Gate Array- ("FPGA") based Multi-Adaptive Processors (MAPs). The hardware/software approach was to explore efficient MAP configurations and generate the compiler technology to exploit those configurations. This memory accessing technology represents an important step towards making reconfigurable symmetric multi-processor (SMP) architectures that will be a costeffective solution for large-scale scientific computing.
Developing infrared array controller with software real time operating system
NASA Astrophysics Data System (ADS)
Sako, Shigeyuki; Miyata, Takashi; Nakamura, Tomohiko; Motohara, Kentaro; Uchimoto, Yuka Katsuno; Onaka, Takashi; Kataza, Hirokazu
2008-07-01
Real-time capabilities are required for a controller of a large format array to reduce a dead-time attributed by readout and data transfer. The real-time processing has been achieved by dedicated processors including DSP, CPLD, and FPGA devices. However, the dedicated processors have problems with memory resources, inflexibility, and high cost. Meanwhile, a recent PC has sufficient resources of CPUs and memories to control the infrared array and to process a large amount of frame data in real-time. In this study, we have developed an infrared array controller with a software real-time operating system (RTOS) instead of the dedicated processors. A Linux PC equipped with a RTAI extension and a dual-core CPU is used as a main computer, and one of the CPU cores is allocated to the real-time processing. A digital I/O board with DMA functions is used for an I/O interface. The signal-processing cores are integrated in the OS kernel as a real-time driver module, which is composed of two virtual devices of the clock processor and the frame processor tasks. The array controller with the RTOS realizes complicated operations easily, flexibly, and at a low cost.
Characterization of Stationary Distributions of Reflected Diffusions
2014-01-01
Reiman , M. I. (2003). Fluid and heavy traffic limits for a generalized processor sharing model. Ann. Appl. Probab., 13, 100-139. [37] Ramanan, K. and... Reiman , M. I. (2008). The heavy traffic limit of an unbalanced generalized processor sharing model. Ann. Appl. Probab., 18, 22-58. [38] Reed, J. and...Control and Computing. [39] Reiman , M. I. and Williams, R. J. (1988). A boundary property of semimartingale reflecting Brownian motions. Probab. Theor
Bermuda Triangle: a subsystem of the 168/E interfacing scheme used by Group B at SLAC
DOE Office of Scientific and Technical Information (OSTI.GOV)
Oxoby, G.J.; Levinson, L.J.; Trang, Q.H.
1979-12-01
The Bermuda Triangle system is a method of interfacing several 168/E microprocessors to a central system for control of the processors and overlaying their memories. The system is a three-way interface with I/O ports to a large buffer memory, a PDP11 Unibus and a bus to the 168/E processors. Data may be transferred bidirectionally between any two ports. Two Bermuda Triangles are used, one for the program memory and one for the data memory. The program buffer memory stores the overlay programs for the 168/E, and the data buffer memory, the incoming raw data, the data portion of the overlays,more » and the outgoing processed events. This buffering is necessary since the memories of 168/E microprocessors are small compared to the main program and the amount of data being processed. The link to the computer facility is via a Unibus to IBM channel interface. A PDP11/04 controls the data flow. 7 figures, 4 tables. (RWR)« less
Multiple channel data acquisition system
Crawley, H. Bert; Rosenberg, Eli I.; Meyer, W. Thomas; Gorbics, Mark S.; Thomas, William D.; McKay, Roy L.; Homer, Jr., John F.
1990-05-22
A multiple channel data acquisition system for the transfer of large amounts of data from a multiplicity of data channels has a plurality of modules which operate in parallel to convert analog signals to digital data and transfer that data to a communications host via a FASTBUS. Each module has a plurality of submodules which include a front end buffer (FEB) connected to input circuitry having an analog to digital converter with cache memory for each of a plurality of channels. The submodules are interfaced with the FASTBUS via a FASTBUS coupler which controls a module bus and a module memory. The system is triggered to effect rapid parallel data samplings which are stored to the cache memories. The cache memories are uploaded to the FEBs during which zero suppression occurs. The data in the FEBs is reformatted and compressed by a local processor during transfer to the module memory. The FASTBUS coupler is used by the communications host to upload the compressed and formatted data from the module memory. The local processor executes programs which are downloaded to the module memory through the FASTBUS coupler.
Multiple channel data acquisition system
Crawley, H.B.; Rosenberg, E.I.; Meyer, W.T.; Gorbics, M.S.; Thomas, W.D.; McKay, R.L.; Homer, J.F. Jr.
1990-05-22
A multiple channel data acquisition system for the transfer of large amounts of data from a multiplicity of data channels has a plurality of modules which operate in parallel to convert analog signals to digital data and transfer that data to a communications host via a FASTBUS. Each module has a plurality of submodules which include a front end buffer (FEB) connected to input circuitry having an analog to digital converter with cache memory for each of a plurality of channels. The submodules are interfaced with the FASTBUS via a FASTBUS coupler which controls a module bus and a module memory. The system is triggered to effect rapid parallel data samplings which are stored to the cache memories. The cache memories are uploaded to the FEBs during which zero suppression occurs. The data in the FEBs is reformatted and compressed by a local processor during transfer to the module memory. The FASTBUS coupler is used by the communications host to upload the compressed and formatted data from the module memory. The local processor executes programs which are downloaded to the module memory through the FASTBUS coupler. 25 figs.
Processors for wavelet analysis and synthesis: NIFS and TI-C80 MVP
NASA Astrophysics Data System (ADS)
Brooks, Geoffrey W.
1996-03-01
Two processors are considered for image quadrature mirror filtering (QMF). The neuromorphic infrared focal-plane sensor (NIFS) is an existing prototype analog processor offering high speed spatio-temporal Gaussian filtering, which could be used for the QMF low- pass function, and difference of Gaussian filtering, which could be used for the QMF high- pass function. Although not designed specifically for wavelet analysis, the biologically- inspired system accomplishes the most computationally intensive part of QMF processing. The Texas Instruments (TI) TMS320C80 Multimedia Video Processor (MVP) is a 32-bit RISC master processor with four advanced digital signal processors (DSPs) on a single chip. Algorithm partitioning, memory management and other issues are considered for optimal performance. This paper presents these considerations with simulated results leading to processor implementation of high-speed QMF analysis and synthesis.
NASA Astrophysics Data System (ADS)
Bikbov, I. S.; Zuikov, V. A.; Popov, I. I.; Popova, G. L.; Samartsev, V. V.
1995-10-01
An analysis is made of the results of an investigation of the physical principles underlying the operation of an associative optical memory and of processors utilising the photon (optical) echo phenomenon. The feasibility of constructing such optical memories is considered.
Concept of a programmable maintenance processor applicable to multiprocessing systems
NASA Technical Reports Server (NTRS)
Glover, Richard D.
1988-01-01
A programmable maintenance processor concept applicable to multiprocessing systems has been developed at the NASA Ames Research Center's Dryden Flight Research Facility. This stand-alone-processor is intended to provide support for system and application software testing as well as hardware diagnostics. An initial machanization has been incorporated into the extended aircraft interrogation and display system (XAIDS) which is multiprocessing general-purpose ground support equipment. The XAIDS maintenance processor has independent terminal and printer interfaces and a dedicated magnetic bubble memory that stores system test sequences entered from the terminal. This report describes the hardware and software embodied in this processor and shows a typical application in the check-out of a new XAIDS.
Data acquisition using the 168/E. [CERN ISR
DOE Office of Scientific and Technical Information (OSTI.GOV)
Carroll, J.T.; Cittolin, S.; Demoulin, M.
1983-03-01
Event sizes and data rates at the CERN anti p p collider compose a formidable environment for a high level trigger. A system using three 168/E processors for experiment UA1 real-time event selection is described. With 168/E data memory expanded to 512K bytes, each processor holds a complete event allowing a FORTRAN trigger algorithm access to data from the entire detector. A smart CAMAC interface reads five Remus branches in parallel transferring one word to the target processor every 0.5 ..mu..s. The NORD host computer can simultaneously read an accepted event from another processor.
NASA Technical Reports Server (NTRS)
Bartram, Peter N.
1989-01-01
The current Life Sciences Laboratory Equipment (LSLE) microcomputer for life sciences experiment data acquisition is now obsolete. Among the weaknesses of the current microcomputer are small memory size, relatively slow analog data sampling rates, and the lack of a bulk data storage device. While life science investigators normally prefer data to be transmitted to Earth as it is taken, this is not always possible. No down-link exists for experiments performed in the Shuttle middeck region. One important aspect of a replacement microcomputer is provision for in-flight storage of experimental data. The Write Once, Read Many (WORM) optical disk was studied because of its high storage density, data integrity, and the availability of a space-qualified unit. In keeping with the goals for a replacement microcomputer based upon commercially available components and standard interfaces, the system studied includes a Small Computer System Interface (SCSI) for interfacing the WORM drive. The system itself is designed around the STD bus, using readily available boards. Configurations examined were: (1) master processor board and slave processor board with the SCSI interface; (2) master processor with SCSI interface; (3) master processor with SCSI and Direct Memory Access (DMA); (4) master processor controlling a separate STD bus SCSI board; and (5) master processor controlling a separate STD bus SCSI board with DMA.
Recall Performance for Content-Addressable Memory Using Adiabatic Quantum Optimization
DOE Office of Scientific and Technical Information (OSTI.GOV)
Imam, Neena; Humble, Travis S.; McCaskey, Alex
A content-addressable memory (CAM) stores key-value associations such that the key is recalled by providing its associated value. While CAM recall is traditionally performed using recurrent neural network models, we show how to solve this problem using adiabatic quantum optimization. Our approach maps the recurrent neural network to a commercially available quantum processing unit by taking advantage of the common underlying Ising spin model. We then assess the accuracy of the quantum processor to store key-value associations by quantifying recall performance against an ensemble of problem sets. We observe that different learning rules from the neural network community influence recallmore » accuracy but performance appears to be limited by potential noise in the processor. The strong connection established between quantum processors and neural network problems supports the growing intersection of these two ideas.« less
Face classification using electronic synapses
NASA Astrophysics Data System (ADS)
Yao, Peng; Wu, Huaqiang; Gao, Bin; Eryilmaz, Sukru Burc; Huang, Xueyao; Zhang, Wenqiang; Zhang, Qingtian; Deng, Ning; Shi, Luping; Wong, H.-S. Philip; Qian, He
2017-05-01
Conventional hardware platforms consume huge amount of energy for cognitive learning due to the data movement between the processor and the off-chip memory. Brain-inspired device technologies using analogue weight storage allow to complete cognitive tasks more efficiently. Here we present an analogue non-volatile resistive memory (an electronic synapse) with foundry friendly materials. The device shows bidirectional continuous weight modulation behaviour. Grey-scale face classification is experimentally demonstrated using an integrated 1024-cell array with parallel online training. The energy consumption within the analogue synapses for each iteration is 1,000 × (20 ×) lower compared to an implementation using Intel Xeon Phi processor with off-chip memory (with hypothetical on-chip digital resistive random access memory). The accuracy on test sets is close to the result using a central processing unit. These experimental results consolidate the feasibility of analogue synaptic array and pave the way toward building an energy efficient and large-scale neuromorphic system.
Face classification using electronic synapses.
Yao, Peng; Wu, Huaqiang; Gao, Bin; Eryilmaz, Sukru Burc; Huang, Xueyao; Zhang, Wenqiang; Zhang, Qingtian; Deng, Ning; Shi, Luping; Wong, H-S Philip; Qian, He
2017-05-12
Conventional hardware platforms consume huge amount of energy for cognitive learning due to the data movement between the processor and the off-chip memory. Brain-inspired device technologies using analogue weight storage allow to complete cognitive tasks more efficiently. Here we present an analogue non-volatile resistive memory (an electronic synapse) with foundry friendly materials. The device shows bidirectional continuous weight modulation behaviour. Grey-scale face classification is experimentally demonstrated using an integrated 1024-cell array with parallel online training. The energy consumption within the analogue synapses for each iteration is 1,000 × (20 ×) lower compared to an implementation using Intel Xeon Phi processor with off-chip memory (with hypothetical on-chip digital resistive random access memory). The accuracy on test sets is close to the result using a central processing unit. These experimental results consolidate the feasibility of analogue synaptic array and pave the way toward building an energy efficient and large-scale neuromorphic system.
NASA Astrophysics Data System (ADS)
Li, Jiaqiang; Choutko, Vitaly; Xiao, Liyi
2018-03-01
Based on the collection of error data from the Alpha Magnetic Spectrometer (AMS) Digital Signal Processors (DSP), on-orbit Single Event Upsets (SEUs) of the DSP program memory are analyzed. The daily error distribution and time intervals between errors are calculated to evaluate the reliability of the system. The particle density distribution of International Space Station (ISS) orbit is presented and the effects from the South Atlantic Anomaly (SAA) and the geomagnetic poles are analyzed. The impact of solar events on the DSP program memory is carried out combining data analysis and Monte Carlo simulation (MC). From the analysis and simulation results, it is concluded that the area corresponding to the SAA is the main source of errors on the ISS orbit. Solar events can also cause errors on DSP program memory, but the effect depends on the on-orbit particle density.
Meeting the memory challenges of brain-scale network simulation.
Kunkel, Susanne; Potjans, Tobias C; Eppler, Jochen M; Plesser, Hans Ekkehard; Morrison, Abigail; Diesmann, Markus
2011-01-01
The development of high-performance simulation software is crucial for studying the brain connectome. Using connectome data to generate neurocomputational models requires software capable of coping with models on a variety of scales: from the microscale, investigating plasticity, and dynamics of circuits in local networks, to the macroscale, investigating the interactions between distinct brain regions. Prior to any serious dynamical investigation, the first task of network simulations is to check the consistency of data integrated in the connectome and constrain ranges for yet unknown parameters. Thanks to distributed computing techniques, it is possible today to routinely simulate local cortical networks of around 10(5) neurons with up to 10(9) synapses on clusters and multi-processor shared-memory machines. However, brain-scale networks are orders of magnitude larger than such local networks, in terms of numbers of neurons and synapses as well as in terms of computational load. Such networks have been investigated in individual studies, but the underlying simulation technologies have neither been described in sufficient detail to be reproducible nor made publicly available. Here, we discover that as the network model sizes approach the regime of meso- and macroscale simulations, memory consumption on individual compute nodes becomes a critical bottleneck. This is especially relevant on modern supercomputers such as the Blue Gene/P architecture where the available working memory per CPU core is rather limited. We develop a simple linear model to analyze the memory consumption of the constituent components of neuronal simulators as a function of network size and the number of cores used. This approach has multiple benefits. The model enables identification of key contributing components to memory saturation and prediction of the effects of potential improvements to code before any implementation takes place. As a consequence, development cycles can be shorter and less expensive. Applying the model to our freely available Neural Simulation Tool (NEST), we identify the software components dominant at different scales, and develop general strategies for reducing the memory consumption, in particular by using data structures that exploit the sparseness of the local representation of the network. We show that these adaptations enable our simulation software to scale up to the order of 10,000 processors and beyond. As memory consumption issues are likely to be relevant for any software dealing with complex connectome data on such architectures, our approach and our findings should be useful for researchers developing novel neuroinformatics solutions to the challenges posed by the connectome project.
Content addressable memory project
NASA Technical Reports Server (NTRS)
Hall, J. Storrs; Levy, Saul; Smith, Donald E.; Miyake, Keith M.
1992-01-01
A parameterized version of the tree processor was designed and tested (by simulation). The leaf processor design is 90 percent complete. We expect to complete and test a combination of tree and leaf cell designs in the next period. Work is proceeding on algorithms for the computer aided manufacturing (CAM), and once the design is complete we will begin simulating algorithms for large problems. The following topics are covered: (1) the practical implementation of content addressable memory; (2) design of a LEAF cell for the Rutgers CAM architecture; (3) a circuit design tool user's manual; and (4) design and analysis of efficient hierarchical interconnection networks.
Hardware Acceleration of Sparse Cognitive Algorithms
2016-05-01
Processor in Memory (PiM) extensions and a 65 nm ASIC version. They were compared against a 28 nm GPU baseline using the KTH video action recognition...30 Table 17. Memory Requirement of Proposed ASIC...for improvement of performance per unit of power for customized implementations of the Sparsey and Numenta Hierarchical Temporal Memory (HTM
Fast, Massively Parallel Data Processors
NASA Technical Reports Server (NTRS)
Heaton, Robert A.; Blevins, Donald W.; Davis, ED
1994-01-01
Proposed fast, massively parallel data processor contains 8x16 array of processing elements with efficient interconnection scheme and options for flexible local control. Processing elements communicate with each other on "X" interconnection grid with external memory via high-capacity input/output bus. This approach to conditional operation nearly doubles speed of various arithmetic operations.
Introduction to Parallel Computing
1992-05-01
Instruction Stream, Multiple Data Stream Machines .................... 19 2.4 Networks of M achines...independent memory units and connecting them to the processors by an interconnection network . Many different interconnection schemes have been considered, and...connected to the same processor at the same time. Crossbar switching networks are still too expensive to be practical for connecting large numbers of
A VME-based software trigger system using UNIX processors
NASA Astrophysics Data System (ADS)
Atmur, Robert; Connor, David F.; Molzon, William
1997-02-01
We have constructed a distributed computing platform with eight processors to assemble and filter data from digitization crates. The filtered data were transported to a tape-writing UNIX computer via ethernet. Each processor ran a UNIX operating system and was installed in its own VME crate. Each VME crate contained dual-port memories which interfaced with the digitizers. Using standard hardware and software (VME and UNIX) allows us to select from a wide variety of non-proprietary products and makes upgrades simpler, if they are necessary.
Single bus star connected reluctance drive and method
Fahimi, Babak; Shamsi, Pourya
2016-05-10
A system and methods for operating a switched reluctance machine includes a controller, an inverter connected to the controller and to the switched reluctance machine, a hysteresis control connected to the controller and to the inverter, a set of sensors connected to the switched reluctance machine and to the controller, the switched reluctance machine further including a set of phases the controller further comprising a processor and a memory connected to the processor, wherein the processor programmed to execute a control process and a generation process.
Benchmarking NWP Kernels on Multi- and Many-core Processors
NASA Astrophysics Data System (ADS)
Michalakes, J.; Vachharajani, M.
2008-12-01
Increased computing power for weather, climate, and atmospheric science has provided direct benefits for defense, agriculture, the economy, the environment, and public welfare and convenience. Today, very large clusters with many thousands of processors are allowing scientists to move forward with simulations of unprecedented size. But time-critical applications such as real-time forecasting or climate prediction need strong scaling: faster nodes and processors, not more of them. Moreover, the need for good cost- performance has never been greater, both in terms of performance per watt and per dollar. For these reasons, the new generations of multi- and many-core processors being mass produced for commercial IT and "graphical computing" (video games) are being scrutinized for their ability to exploit the abundant fine- grain parallelism in atmospheric models. We present results of our work to date identifying key computational kernels within the dynamics and physics of a large community NWP model, the Weather Research and Forecast (WRF) model. We benchmark and optimize these kernels on several different multi- and many-core processors. The goals are to (1) characterize and model performance of the kernels in terms of computational intensity, data parallelism, memory bandwidth pressure, memory footprint, etc. (2) enumerate and classify effective strategies for coding and optimizing for these new processors, (3) assess difficulties and opportunities for tool or higher-level language support, and (4) establish a continuing set of kernel benchmarks that can be used to measure and compare effectiveness of current and future designs of multi- and many-core processors for weather and climate applications.
Extending Automatic Parallelization to Optimize High-Level Abstractions for Multicore
DOE Office of Scientific and Technical Information (OSTI.GOV)
Liao, C; Quinlan, D J; Willcock, J J
2008-12-12
Automatic introduction of OpenMP for sequential applications has attracted significant attention recently because of the proliferation of multicore processors and the simplicity of using OpenMP to express parallelism for shared-memory systems. However, most previous research has only focused on C and Fortran applications operating on primitive data types. C++ applications using high-level abstractions, such as STL containers and complex user-defined types, are largely ignored due to the lack of research compilers that are readily able to recognize high-level object-oriented abstractions and leverage their associated semantics. In this paper, we automatically parallelize C++ applications using ROSE, a multiple-language source-to-source compiler infrastructuremore » which preserves the high-level abstractions and gives us access to their semantics. Several representative parallelization candidate kernels are used to explore semantic-aware parallelization strategies for high-level abstractions, combined with extended compiler analyses. Those kernels include an array-base computation loop, a loop with task-level parallelism, and a domain-specific tree traversal. Our work extends the applicability of automatic parallelization to modern applications using high-level abstractions and exposes more opportunities to take advantage of multicore processors.« less
Parallel approach to incorporating face image information into dialogue processing
NASA Astrophysics Data System (ADS)
Ren, Fuji
2000-10-01
There are many kinds of so-called irregular expressions in natural dialogues. Even if the content of a conversation is the same in words, different meanings can be interpreted by a person's feeling or face expression. To have a good understanding of dialogues, it is required in a flexible dialogue processing system to infer the speaker's view properly. However, it is difficult to obtain the meaning of the speaker's sentences in various scenes using traditional methods. In this paper, a new approach for dialogue processing that incorporates information from the speaker's face is presented. We first divide conversation statements into several simple tasks. Second, we process each simple task using an independent processor. Third, we employ some speaker's face information to estimate the view of the speakers to solve ambiguities in dialogues. The approach presented in this paper can work efficiently, because independent processors run in parallel, writing partial results to a shared memory, incorporating partial results at appropriate points, and complementing each other. A parallel algorithm and a method for employing the face information in a dialogue machine translation will be discussed, and some results will be included in this paper.
Massively Multithreaded Maxflow for Image Segmentation on the Cray XMT-2
Bokhari, Shahid H.; Çatalyürek, Ümit V.; Gurcan, Metin N.
2014-01-01
SUMMARY Image segmentation is a very important step in the computerized analysis of digital images. The maxflow mincut approach has been successfully used to obtain minimum energy segmentations of images in many fields. Classical algorithms for maxflow in networks do not directly lend themselves to efficient parallel implementations on contemporary parallel processors. We present the results of an implementation of Goldberg-Tarjan preflow-push algorithm on the Cray XMT-2 massively multithreaded supercomputer. This machine has hardware support for 128 threads in each physical processor, a uniformly accessible shared memory of up to 4 TB and hardware synchronization for each 64 bit word. It is thus well-suited to the parallelization of graph theoretic algorithms, such as preflow-push. We describe the implementation of the preflow-push code on the XMT-2 and present the results of timing experiments on a series of synthetically generated as well as real images. Our results indicate very good performance on large images and pave the way for practical applications of this machine architecture for image analysis in a production setting. The largest images we have run are 320002 pixels in size, which are well beyond the largest previously reported in the literature. PMID:25598745
Optics Program Modified for Multithreaded Parallel Computing
NASA Technical Reports Server (NTRS)
Lou, John; Bedding, Dave; Basinger, Scott
2006-01-01
A powerful high-performance computer program for simulating and analyzing adaptive and controlled optical systems has been developed by modifying the serial version of the Modeling and Analysis for Controlled Optical Systems (MACOS) program to impart capabilities for multithreaded parallel processing on computing systems ranging from supercomputers down to Symmetric Multiprocessing (SMP) personal computers. The modifications included the incorporation of OpenMP, a portable and widely supported application interface software, that can be used to explicitly add multithreaded parallelism to an application program under a shared-memory programming model. OpenMP was applied to parallelize ray-tracing calculations, one of the major computing components in MACOS. Multithreading is also used in the diffraction propagation of light in MACOS based on pthreads [POSIX Thread, (where "POSIX" signifies a portable operating system for UNIX)]. In tests of the parallelized version of MACOS, the speedup in ray-tracing calculations was found to be linear, or proportional to the number of processors, while the speedup in diffraction calculations ranged from 50 to 60 percent, depending on the type and number of processors. The parallelized version of MACOS is portable, and, to the user, its interface is basically the same as that of the original serial version of MACOS.
Dynamic Load-Balancing for Distributed Heterogeneous Computing of Parallel CFD Problems
NASA Technical Reports Server (NTRS)
Ecer, A.; Chien, Y. P.; Boenisch, T.; Akay, H. U.
2000-01-01
The developed methodology is aimed at improving the efficiency of executing block-structured algorithms on parallel, distributed, heterogeneous computers. The basic approach of these algorithms is to divide the flow domain into many sub- domains called blocks, and solve the governing equations over these blocks. Dynamic load balancing problem is defined as the efficient distribution of the blocks among the available processors over a period of several hours of computations. In environments with computers of different architecture, operating systems, CPU speed, memory size, load, and network speed, balancing the loads and managing the communication between processors becomes crucial. Load balancing software tools for mutually dependent parallel processes have been created to efficiently utilize an advanced computation environment and algorithms. These tools are dynamic in nature because of the chances in the computer environment during execution time. More recently, these tools were extended to a second operating system: NT. In this paper, the problems associated with this application will be discussed. Also, the developed algorithms were combined with the load sharing capability of LSF to efficiently utilize workstation clusters for parallel computing. Finally, results will be presented on running a NASA based code ADPAC to demonstrate the developed tools for dynamic load balancing.
Fang, Wai-Chi; Huang, Kuan-Ju; Chou, Chia-Ching; Chang, Jui-Chung; Cauwenberghs, Gert; Jung, Tzyy-Ping
2014-01-01
This is a proposal for an efficient very-large-scale integration (VLSI) design, 16-channel on-line recursive independent component analysis (ORICA) processor ASIC for real-time EEG system, implemented with TSMC 40 nm CMOS technology. ORICA is appropriate to be used in real-time EEG system to separate artifacts because of its highly efficient and real-time process features. The proposed ORICA processor is composed of an ORICA processing unit and a singular value decomposition (SVD) processing unit. Compared with previous work [1], this proposed ORICA processor has enhanced effectiveness and reduced hardware complexity by utilizing a deeper pipeline architecture, shared arithmetic processing unit, and shared registers. The 16-channel random signals which contain 8-channel super-Gaussian and 8-channel sub-Gaussian components are used to analyze the dependence of the source components, and the average correlation coefficient is 0.95452 between the original source signals and extracted ORICA signals. Finally, the proposed ORICA processor ASIC is implemented with TSMC 40 nm CMOS technology, and it consumes 15.72 mW at 100 MHz operating frequency.
Evaluating local indirect addressing in SIMD proc essors
NASA Technical Reports Server (NTRS)
Middleton, David; Tomboulian, Sherryl
1989-01-01
In the design of parallel computers, there exists a tradeoff between the number and power of individual processors. The single instruction stream, multiple data stream (SIMD) model of parallel computers lies at one extreme of the resulting spectrum. The available hardware resources are devoted to creating the largest possible number of processors, and consequently each individual processor must use the fewest possible resources. Disagreement exists as to whether SIMD processors should be able to generate addresses individually into their local data memory, or all processors should access the same address. The tradeoff is examined between the increased capability and the reduced number of processors that occurs in this single instruction stream, multiple, locally addressed, data (SIMLAD) model. Factors are assembled that affect this design choice, and the SIMLAD model is compared with the bare SIMD and the MIMD models.
Interactive high-resolution isosurface ray casting on multicore processors.
Wang, Qin; JaJa, Joseph
2008-01-01
We present a new method for the interactive rendering of isosurfaces using ray casting on multi-core processors. This method consists of a combination of an object-order traversal that coarsely identifies possible candidate 3D data blocks for each small set of contiguous pixels, and an isosurface ray casting strategy tailored for the resulting limited-size lists of candidate 3D data blocks. While static screen partitioning is widely used in the literature, our scheme performs dynamic allocation of groups of ray casting tasks to ensure almost equal loads among the different threads running on multi-cores while maintaining spatial locality. We also make careful use of memory management environment commonly present in multi-core processors. We test our system on a two-processor Clovertown platform, each consisting of a Quad-Core 1.86-GHz Intel Xeon Processor, for a number of widely different benchmarks. The detailed experimental results show that our system is efficient and scalable, and achieves high cache performance and excellent load balancing, resulting in an overall performance that is superior to any of the previous algorithms. In fact, we achieve an interactive isosurface rendering on a 1024(2) screen for all the datasets tested up to the maximum size of the main memory of our platform.
Store-operate-coherence-on-value
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chen, Dong; Heidelberger, Philip; Kumar, Sameer
A system, method and computer program product for performing various store-operate instructions in a parallel computing environment that includes a plurality of processors and at least one cache memory device. A queue in the system receives, from a processor, a store-operate instruction that specifies under which condition a cache coherence operation is to be invoked. A hardware unit in the system runs the received store-operate instruction. The hardware unit evaluates whether a result of the running the received store-operate instruction satisfies the condition. The hardware unit invokes a cache coherence operation on a cache memory address associated with the receivedmore » store-operate instruction if the result satisfies the condition. Otherwise, the hardware unit does not invoke the cache coherence operation on the cache memory device.« less
Processor-in-memory-and-storage architecture
DOE Office of Scientific and Technical Information (OSTI.GOV)
DeBenedictis, Erik
A method and apparatus for performing reliable general-purpose computing. Each sub-core of a plurality of sub-cores of a processor core processes a same instruction at a same time. A code analyzer receives a plurality of residues that represents a code word corresponding to the same instruction and an indication of whether the code word is a memory address code or a data code from the plurality of sub-cores. The code analyzer determines whether the plurality of residues are consistent or inconsistent. The code analyzer and the plurality of sub-cores perform a set of operations based on whether the code wordmore » is a memory address code or a data code and a determination of whether the plurality of residues are consistent or inconsistent.« less
Ethernet-Enabled Power and Communication Module for Embedded Processors
NASA Technical Reports Server (NTRS)
Perotti, Jose; Oostdyk, Rebecca
2010-01-01
The power and communications module is a printed circuit board (PCB) that has the capability of providing power to an embedded processor and converting Ethernet packets into serial data to transfer to the processor. The purpose of the new design is to address the shortcomings of previous designs, including limited bandwidth and program memory, lack of control over packet processing, and lack of support for timing synchronization. The new design of the module creates a robust serial-to-Ethernet conversion that is powered using the existing Ethernet cable. This innovation has a small form factor that allows it to power processors and transducers with minimal space requirements.
A new parallel-vector finite element analysis software on distributed-memory computers
NASA Technical Reports Server (NTRS)
Qin, Jiangning; Nguyen, Duc T.
1993-01-01
A new parallel-vector finite element analysis software package MPFEA (Massively Parallel-vector Finite Element Analysis) is developed for large-scale structural analysis on massively parallel computers with distributed-memory. MPFEA is designed for parallel generation and assembly of the global finite element stiffness matrices as well as parallel solution of the simultaneous linear equations, since these are often the major time-consuming parts of a finite element analysis. Block-skyline storage scheme along with vector-unrolling techniques are used to enhance the vector performance. Communications among processors are carried out concurrently with arithmetic operations to reduce the total execution time. Numerical results on the Intel iPSC/860 computers (such as the Intel Gamma with 128 processors and the Intel Touchstone Delta with 512 processors) are presented, including an aircraft structure and some very large truss structures, to demonstrate the efficiency and accuracy of MPFEA.
Radio-nuclide mixture identification using medium energy resolution detectors
Nelson, Karl Einar
2013-09-17
According to one embodiment, a method for identifying radio-nuclides includes receiving spectral data, extracting a feature set from the spectral data comparable to a plurality of templates in a template library, and using a branch and bound method to determine a probable template match based on the feature set and templates in the template library. In another embodiment, a device for identifying unknown radio-nuclides includes a processor, a multi-channel analyzer, and a memory operatively coupled to the processor, the memory having computer readable code stored thereon. The computer readable code is configured, when executed by the processor, to receive spectral data, to extract a feature set from the spectral data comparable to a plurality of templates in a template library, and to use a branch and bound method to determine a probable template match based on the feature set and templates in the template library.
VLSI processors for signal detection in SETI
NASA Technical Reports Server (NTRS)
Duluk, J. F.; Linscott, I. R.; Peterson, A. M.; Burr, J.; Ekroot, B.; Twicken, J.
1989-01-01
The objective of the Search for Extraterrestrial Intelligence (SETI) is to locate an artificially created signal coming from a distant star. This is done in two steps: (1) spectral analysis of an incoming radio frequency band, and (2) pattern detection for narrow-band signals. Both steps are computationally expensive and require the development of specially designed computer architectures. To reduce the size and cost of the SETI signal detection machine, two custom VLSI chips are under development. The first chip, the SETI DSP Engine, is used in the spectrum analyzer and is specially designed to compute Discrete Fourier Transforms (DFTs). It is a high-speed arithmetic processor that has two adders, one multiplier-accumulator, and three four-port memories. The second chip is a new type of Content-Addressable Memory. It is the heart of an associative processor that is used for pattern detection. Both chips incorporate many innovative circuits and architectural features.
VLSI processors for signal detection in SETI.
Duluk, J F; Linscott, I R; Peterson, A M; Burr, J; Ekroot, B; Twicken, J
1989-01-01
The objective of the Search for Extraterrestrial Intelligence (SETI) is to locate an artificially created signal coming from a distant star. This is done in two steps: (1) spectral analysis of an incoming radio frequency band, and (2) pattern detection for narrow-band signals. Both steps are computationally expensive and require the development of specially designed computer architectures. To reduce the size and cost of the SETI signal detection machine, two custom VLSI chips are under development. The first chip, the SETI DSP Engine, is used in the spectrum analyzer and is specially designed to compute Discrete Fourier Transforms (DFTs). It is a high-speed arithmetic processor that has two adders, one multiplier-accumulator, and three four-port memories. The second chip is a new type of Content-Addressable Memory. It is the heart of an associative processor that is used for pattern detection. Both chips incorporate many innovative circuits and architectural features.
QI2S - Quick Image Interpretation System
NASA Astrophysics Data System (ADS)
Naghmouchi, Jamin; Aviely, Peleg; Ginosar, Ran; Ober, Giovanna; Bischoff, Ole; Nadler, Ron; Guiser, David; Citroen, Meira; Freddi, Riccardo; Berekovic, Mladen
2015-09-01
The evolution of the Earth Observation mission will be driven by many factors, and the deveploment of new processing paradigms to facilitate data downlink, handling and storage will be a key factor. Next generation EO satellites will generate a great amount of data at a very high data rate, both radar and optical. Real-time onboard processing can be the solution to reduce data downlink and management on ground. Radiometric, geometric, and atmospheric corrections of EO data as well as material/object detection in addition to the well-known needs for image compression and signal processing can be performed directly on board and the aim of QI2S project is to demonstrate this. QI2S, a concept prototype system for novel onboard image processing and image interpretation which has been designed, developed and validated in the framework of an EU FP7 project, targets these needs and makes a significant step towards exceeding current roadmaps of leading space agencies for future payload processors. The QI2S system features multiple chip components of the RC64, a novel rad-hard 64-core signal processing chip, which targets DSP performance of 75 GMACs (16bit), 150 GOPS and 38 single precision GFLOPS while dissipating less than 10 Watts. It integrates advanced DSP cores with a multibank shared memory and a hardware scheduler, also supporting DDR2/3 memory and twelve 3.125 Gbps full duplex high-speed serial links using SpaceFibre and other protocols. The processor is being developed within the European FP7 Framework Program and will be qualified to the highest space standards.
System and method for cognitive processing for data fusion
NASA Technical Reports Server (NTRS)
Duong, Tuan A. (Inventor); Duong, Vu A. (Inventor)
2012-01-01
A system and method for cognitive processing of sensor data. A processor array receiving analog sensor data and having programmable interconnects, multiplication weights, and filters provides for adaptive learning in real-time. A static random access memory contains the programmable data for the processor array and the stored data is modified to provide for adaptive learning.
NASA Technical Reports Server (NTRS)
Arenstorf, Norbert S.; Jordan, Harry F.
1987-01-01
A barrier is a method for synchronizing a large number of concurrent computer processes. After considering some basic synchronization mechanisms, a collection of barrier algorithms with either linear or logarithmic depth are presented. A graphical model is described that profiles the execution of the barriers and other parallel programming constructs. This model shows how the interaction between the barrier algorithms and the work that they synchronize can impact their performance. One result is that logarithmic tree structured barriers show good performance when synchronizing fixed length work, while linear self-scheduled barriers show better performance when synchronizing fixed length work with an imbedded critical section. The linear barriers are better able to exploit the process skew associated with critical sections. Timing experiments, performed on an eighteen processor Flex/32 shared memory multiprocessor, that support these conclusions are detailed.
Evaluation of the Cedar memory system: Configuration of 16 by 16
NASA Technical Reports Server (NTRS)
Gallivan, K.; Jalby, W.; Wijshoff, H.
1991-01-01
Some basic results on the performance of the Cedar multiprocessor system are presented. Empirical results on the 16 processor 16 memory bank system configuration, which show the behavior of the Cedar system under different modes of operation are presented.
Ohmacht, Martin
2017-08-15
In a multiprocessor system, a central memory synchronization module coordinates memory synchronization requests responsive to memory access requests in flight, a generation counter, and a reclaim pointer. The central module communicates via point-to-point communication. The module includes a global OR reduce tree for each memory access requesting device, for detecting memory access requests in flight. An interface unit is implemented associated with each processor requesting synchronization. The interface unit includes multiple generation completion detectors. The generation count and reclaim pointer do not pass one another.
Ohmacht, Martin
2014-09-09
In a multiprocessor system, a central memory synchronization module coordinates memory synchronization requests responsive to memory access requests in flight, a generation counter, and a reclaim pointer. The central module communicates via point-to-point communication. The module includes a global OR reduce tree for each memory access requesting device, for detecting memory access requests in flight. An interface unit is implemented associated with each processor requesting synchronization. The interface unit includes multiple generation completion detectors. The generation count and reclaim pointer do not pass one another.
Life sciences flight experiments microcomputer
NASA Technical Reports Server (NTRS)
Bartram, Peter N.
1987-01-01
A promising microcomputer configuration for the Spacelab Life Sciences Lab. Equipment inventory consists of multiple processors. One processor's use is reserved, with additional processors dedicated to real time input and output operations. A simple form of such a configuration, with a processor board for analog to digital conversion and another processor board for digital to analog conversion, was studied. The system used digital parallel data lines between the boards, operating independently of the system bus. Good performance of individual components was demonstrated: the analog to digital converter was at over 10,000 samples per second. The combination of the data transfer between boards with the input or output functions on each board slowed performance, with a maximum throughput of 2800 to 2900 analog samples per second. Any of several techniques, such as use of the system bus for data transfer or the addition of direct memory access hardware to the processor boards, should give significantly improved performance.
Crosetto, D.B.
1996-12-31
The present device provides for a dynamically configurable communication network having a multi-processor parallel processing system having a serial communication network and a high speed parallel communication network. The serial communication network is used to disseminate commands from a master processor to a plurality of slave processors to effect communication protocol, to control transmission of high density data among nodes and to monitor each slave processor`s status. The high speed parallel processing network is used to effect the transmission of high density data among nodes in the parallel processing system. Each node comprises a transputer, a digital signal processor, a parallel transfer controller, and two three-port memory devices. A communication switch within each node connects it to a fast parallel hardware channel through which all high density data arrives or leaves the node. 6 figs.
Novel processor architecture for onboard infrared sensors
NASA Astrophysics Data System (ADS)
Hihara, Hiroki; Iwasaki, Akira; Tamagawa, Nobuo; Kuribayashi, Mitsunobu; Hashimoto, Masanori; Mitsuyama, Yukio; Ochi, Hiroyuki; Onodera, Hidetoshi; Kanbara, Hiroyuki; Wakabayashi, Kazutoshi; Tada, Munehiro
2016-09-01
Infrared sensor system is a major concern for inter-planetary missions that investigate the nature and the formation processes of planets and asteroids. The infrared sensor system requires signal preprocessing functions that compensate for the intensity of infrared image sensors to get high quality data and high compression ratio through the limited capacity of transmission channels towards ground stations. For those implementations, combinations of Field Programmable Gate Arrays (FPGAs) and microprocessors are employed by AKATSUKI, the Venus Climate Orbiter, and HAYABUSA2, the asteroid probe. On the other hand, much smaller size and lower power consumption are demanded for future missions to accommodate more sensors. To fulfill this future demand, we developed a novel processor architecture which consists of reconfigurable cluster cores and programmable-logic cells with complementary atom switches. The complementary atom switches enable hardware programming without configuration memories, and thus soft-error on logic circuit connection is completely eliminated. This is a noteworthy advantage for space applications which cannot be found in conventional re-writable FPGAs. Almost one-tenth of lower power consumption is expected compared to conventional re-writable FPGAs because of the elimination of configuration memories. The proposed processor architecture can be reconfigured by behavioral synthesis with higher level language specification. Consequently, compensation functions are implemented in a single chip without accommodating program memories, which is accompanied with conventional microprocessors, while maintaining the comparable performance. This enables us to embed a processor element on each infrared signal detector output channel.
The mathematical theory of signal processing and compression-designs
NASA Astrophysics Data System (ADS)
Feria, Erlan H.
2006-05-01
The mathematical theory of signal processing, named processor coding, will be shown to inherently arise as the computational time dual of Shannon's mathematical theory of communication which is also known as source coding. Source coding is concerned with signal source memory space compression while processor coding deals with signal processor computational time compression. Their combination is named compression-designs and referred as Conde in short. A compelling and pedagogically appealing diagram will be discussed highlighting Conde's remarkable successful application to real-world knowledge-aided (KA) airborne moving target indicator (AMTI) radar.
Two Unipolar Terminal-Attractor-Based Associative Memories
NASA Technical Reports Server (NTRS)
Liu, Hua-Kuang; Wu, Chwan-Hwa
1995-01-01
Two unipolar mathematical models of electronic neural network functioning as terminal-attractor-based associative memory (TABAM) developed. Models comprise sets of equations describing interactions between time-varying inputs and outputs of neural-network memory, regarded as dynamical system. Simplifies design and operation of optoelectronic processor to implement TABAM performing associative recall of images. TABAM concept described in "Optoelectronic Terminal-Attractor-Based Associative Memory" (NPO-18790). Experimental optoelectronic apparatus that performed associative recall of binary images described in "Optoelectronic Inner-Product Neural Associative Memory" (NPO-18491).
Autonomous Telemetry Collection for Single-Processor Small Satellites
NASA Technical Reports Server (NTRS)
Speer, Dave
2003-01-01
For the Space Technology 5 mission, which is being developed under NASA's New Millennium Program, a single spacecraft processor will be required to do on-board real-time computations and operations associated with attitude control, up-link and down-link communications, science data processing, solid-state recorder management, power switching and battery charge management, experiment data collection, health and status data collection, etc. Much of the health and status information is in analog form, and each of the analog signals must be routed to the input of an analog-to-digital converter, converted to digital form, and then stored in memory. If the micro-operations of the analog data collection process are implemented in software, the processor may use up a lot of time either waiting for the analog signal to settle, waiting for the analog-to-digital conversion to complete, or servicing a large number of high frequency interrupts. In order to off-load a very busy processor, the collection and digitization of all analog spacecraft health and status data will be done autonomously by a field-programmable gate array that can configure the analog signal chain, control the analog-to-digital converter, and store the converted data in memory.
NASA Technical Reports Server (NTRS)
Fura, David A.; Windley, Phillip J.; Cohen, Gerald C.
1993-01-01
This technical report contains the HOL listings of the specification of the design and major portions of the requirements for a commercially developed processor interface unit (or PIU). The PIU is an interface chip performing memory interface, bus interface, and additional support services for a commercial microprocessor within a fault-tolerant computer system. This system, the Fault-Tolerant Embedded Processor (FTEP), is targeted towards applications in avionics and space requiring extremely high levels of mission reliability, extended maintenance-free operation, or both. This report contains the actual HOL listings of the PIU specification as it currently exists. Section two of this report contains general-purpose HOL theories that support the PIU specification. These theories include definitions for the hardware components used in the PIU, our implementation of bit words, and our implementation of temporal logic. Section three contains the HOL listings for the PIU design specification. Aside from the PIU internal bus (I-Bus), this specification is complete. Section four contains the HOL listings for a major portion of the PIU requirements specification. Specifically, it contains most of the definition for the PIU behavior associated with memory accesses initiated by the local processor.
NASA Technical Reports Server (NTRS)
Szabo, Carl M., Jr.; Duncan, Adam R.; Label, Kenneth A.
2017-01-01
Testing of an Intel 14nm desktop processor was conducted under proton irradiation. We share lessons learned, demonstrating that complex devices beget further complex challenges requiring practical and theoretical investigative expertise to solve.
NASA Astrophysics Data System (ADS)
Weigand, R.
Two new processor devices have been developed for the use on board of spacecrafts. An 8-bit 8032-microcontroller targets typical controlling applications in instruments and sub-systems, or could be used as a main processor on small satellites, whereas the LEON 32-bit SPARC processor can be used for high performance controlling and data processing tasks. The ADV80S32 is fully compliant to the Intel 80x1 architecture and instruction set, extended by additional peripherals, 512 bytes on-chip RAM and a bootstrap PROM, which allows downloading the application software using the CCSDS PacketWire pro- tocol. The memory controller provides a de-multiplexed address/data bus, and allows to access up to 16 MB data and 8 MB program RAM. The peripherals have been de- signed for the specific needs of a spacecraft, such as serial interfaces compatible to RS232, PacketWire and TTC-B-01, counters/timers for extended duration and a CRC calculation unit accelerating the CCSDS TM/TC protocol. The 0.5 um Atmel manu- facturing technology (MG2RT) provides latch-up and total dose immunity; SEU fault immunity is implemented by using SEU hardened Flip-Flops and EDAC protection of internal and external memories. The maximum clock frequency of 20 MHz allows a processing power of 3 MIPS. Engineering samples are available. For SW develop- ment, various SW packages for the 8051 architecture are on the market. The LEON processor implements a 32-bit SPARC V8 architecture, including all the multiply and divide instructions, complemented by a floating-point unit (FPU). It includes several standard peripherals, such as timers/watchdog, interrupt controller, UARTs, parallel I/Os and a memory controller, allowing to use 8, 16 and 32 bit PROM, SRAM or memory mapped I/O. With on-chip separate instruction and data caches, almost one instruction per clock cycle can be reached in some applications. A 33-MHz 32-bit PCI master/target interface and a PCI arbiter allow operating the device in a plug-in card (for SW development on PC etc.), or to consider using it as a PCI master controller in an on-board system. Advanced SEU fault tolerance is in- troduced by design, using triple modular redundancy (TMR) flip-flops for all registers and EDAC protection for all memories. The device will be manufactured in a radia- tion hard Atmel 0.25 um technology, targeting 100 MHz processor clock frequency. The non fault-tolerant LEON processor VHDL model is available as free source code, and the SPARC architecture is a well-known industry standard. Therefore, know-how, software tools and operating systems are widely available.
Multitask neurovision processor with extensive feedback and feedforward connections
NASA Astrophysics Data System (ADS)
Gupta, Madan M.; Knopf, George K.
1991-11-01
A multi-task neuro-vision parameter which performs a variety of information processing operations associated with the early stages of biological vision is presented. The network architecture of this neuro-vision processor, called the positive-negative (PN) neural processor, is loosely based on the neural activity fields exhibited by thalamic and cortical nervous tissue layers. The computational operation performed by the processor arises from the strength of the recurrent feedback among the numerous positive and negative neural computing units. By adjusting the feedback connections it is possible to generate diverse dynamic behavior that may be used for short-term visual memory (STVM), spatio-temporal filtering (STF), and pulse frequency modulation (PFM). The information attributes that are to be processes may be regulated by modifying the feedforward connections from the signal space to the neural processor.
Multi-processing on supercomputers for computational aerodynamics
NASA Technical Reports Server (NTRS)
Yarrow, Maurice; Mehta, Unmeel B.
1990-01-01
The MIMD concept is applied, through multitasking, with relatively minor modifications to an existing code for a single processor. This approach maps the available memory to multiple processors, exploiting the C-FORTRAN-Unix interface. An existing single processor algorithm is mapped without the need for developing a new algorithm. The procedure of designing a code utilizing this approach is automated with the Unix stream editor. A Multiple Processor Multiple Grid (MPMG) code is developed as a demonstration of this approach. This code solves the three-dimensional, Reynolds-averaged, thin-layer and slender-layer Navier-Stokes equations with an implicit, approximately factored and diagonalized method. This solver is applied to a generic, oblique-wing aircraft problem on a four-processor computer using one process for data management and nonparallel computations and three processes for pseudotime advance on three different grid systems.
Enabling Graph Appliance for Genome Assembly
DOE Office of Scientific and Technical Information (OSTI.GOV)
Singh, Rina; Graves, Jeffrey A; Lee, Sangkeun
2015-01-01
In recent years, there has been a huge growth in the amount of genomic data available as reads generated from various genome sequencers. The number of reads generated can be huge, ranging from hundreds to billions of nucleotide, each varying in size. Assembling such large amounts of data is one of the challenging computational problems for both biomedical and data scientists. Most of the genome assemblers developed have used de Bruijn graph techniques. A de Bruijn graph represents a collection of read sequences by billions of vertices and edges, which require large amounts of memory and computational power to storemore » and process. This is the major drawback to de Bruijn graph assembly. Massively parallel, multi-threaded, shared memory systems can be leveraged to overcome some of these issues. The objective of our research is to investigate the feasibility and scalability issues of de Bruijn graph assembly on Cray s Urika-GD system; Urika-GD is a high performance graph appliance with a large shared memory and massively multithreaded custom processor designed for executing SPARQL queries over large-scale RDF data sets. However, to the best of our knowledge, there is no research on representing a de Bruijn graph as an RDF graph or finding Eulerian paths in RDF graphs using SPARQL for potential genome discovery. In this paper, we address the issues involved in representing a de Bruin graphs as RDF graphs and propose an iterative querying approach for finding Eulerian paths in large RDF graphs. We evaluate the performance of our implementation on real world ebola genome datasets and illustrate how genome assembly can be accomplished with Urika-GD using iterative SPARQL queries.« less
Development of 3-Year Roadmap to Transform the Discipline of Systems Engineering
2010-03-31
quickly humans could physically construct them. Indeed, magnetic core memory was entirely constructed by human hands until it was superseded by...For their mainframe computers, IBM develops the applications, operating system, computer hardware and microprocessors (off the shelf standard memory ...processor developers work on potential computational and memory pipelines to support the required performance capabilities and use the available transistors
A sparse matrix algorithm on the Boolean vector machine
NASA Technical Reports Server (NTRS)
Wagner, Robert A.; Patrick, Merrell L.
1988-01-01
VLSI technology is being used to implement a prototype Boolean Vector Machine (BVM), which is a large network of very small processors with equally small memories that operate in SIMD mode; these use bit-serial arithmetic, and communicate via cube-connected cycles network. The BVM's bit-serial arithmetic and the small memories of individual processors are noted to compromise the system's effectiveness in large numerical problem applications. Attention is presently given to the implementation of a basic matrix-vector iteration algorithm for space matrices of the BVM, in order to generate over 1 billion useful floating-point operations/sec for this iteration algorithm. The algorithm is expressed in a novel language designated 'BVM'.
Event parallelism: Distributed memory parallel computing for high energy physics experiments
NASA Astrophysics Data System (ADS)
Nash, Thomas
1989-12-01
This paper describes the present and expected future development of distributed memory parallel computers for high energy physics experiments. It covers the use of event parallel microprocessor farms, particularly at Fermilab, including both ACP multiprocessors and farms of MicroVAXES. These systems have proven very cost effective in the past. A case is made for moving to the more open environment of UNIX and RISC processors. The 2nd Generation ACP Multiprocessor System, which is based on powerful RISC system, is described. Given the promise of still more extraordinary increases in processor performance, a new emphasis on point to point, rather than bussed, communication will be required. Developments in this direction are described.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Smith, Tyler Barratt; Urrea, Jorge Mario
2012-06-01
The aim of the Authenticating Cache architecture is to ensure that machine instructions in a Read Only Memory (ROM) are legitimate from the time the ROM image is signed (immediately after compilation) to the time they are placed in the cache for the processor to consume. The proposed architecture allows the detection of ROM image modifications during distribution or when it is loaded into memory. It also ensures that modified instructions will not execute in the processor-as the cache will not be loaded with a page that fails an integrity check. The authenticity of the instruction stream can also bemore » verified in this architecture. The combination of integrity and authenticity assurance greatly improves the security profile of a system.« less
Compression of CCD raw images for digital still cameras
NASA Astrophysics Data System (ADS)
Sriram, Parthasarathy; Sudharsanan, Subramania
2005-03-01
Lossless compression of raw CCD images captured using color filter arrays has several benefits. The benefits include improved storage capacity, reduced memory bandwidth, and lower power consumption for digital still camera processors. The paper discusses the benefits in detail and proposes the use of a computationally efficient block adaptive scheme for lossless compression. Experimental results are provided that indicate that the scheme performs well for CCD raw images attaining compression factors of more than two. The block adaptive method also compares favorably with JPEG-LS. A discussion is provided indicating how the proposed lossless coding scheme can be incorporated into digital still camera processors enabling lower memory bandwidth and storage requirements.
Vector generator scan converter
Moore, James M.; Leighton, James F.
1990-01-01
High printing speeds for graphics data are achieved with a laser printer by transmitting compressed graphics data from a main processor over an I/O (input/output) channel to a vector generator scan converter which reconstructs a full graphics image for input to the laser printer through a raster data input port. The vector generator scan converter includes a microprocessor with associated microcode memory containing a microcode instruction set, a working memory for storing compressed data, vector generator hardward for drawing a full graphic image from vector parameters calculated by the microprocessor, image buffer memory for storing the reconstructed graphics image and an output scanner for reading the graphics image data and inputting the data to the printer. The vector generator scan converter eliminates the bottleneck created by the I/O channel for transmitting graphics data from the main processor to the laser printer, and increases printer speed up to thirty fold.
Vector generator scan converter
Moore, J.M.; Leighton, J.F.
1988-02-05
High printing speeds for graphics data are achieved with a laser printer by transmitting compressed graphics data from a main processor over an I/O channel to a vector generator scan converter which reconstructs a full graphics image for input to the laser printer through a raster data input port. The vector generator scan converter includes a microprocessor with associated microcode memory containing a microcode instruction set, a working memory for storing compressed data, vector generator hardware for drawing a full graphic image from vector parameters calculated by the microprocessor, image buffer memory for storing the reconstructed graphics image and an output scanner for reading the graphics image data and inputting the data to the printer. The vector generator scan converter eliminates the bottleneck created by the I/O channel for transmitting graphics data from the main processor to the laser printer, and increases printer speed up to thirty fold. 7 figs.
CDL description of the CDC 6600 stunt box
NASA Technical Reports Server (NTRS)
Hertzog, J. B.
1971-01-01
The CDC 6600 central memory control (stunt box) is described utilizing CDL (Computer Design Language), block diagrams, and text. The stunt box is a clearing house for all central memory references from the 6600 central and peripheral processors. Since memory requests can be issued simultaneously, the stunt box must be capable of assigning priorities to requests, of labeling requests so that the data will be distributed correctly, and of remembering rejected addresses due to memory conflicts.
Organization Theory and Memory for Prose: A Review of the Literature
ERIC Educational Resources Information Center
Shimmerlik, Susan M.
1978-01-01
Organization theory emphasizes groupings of items on the basis of a variety of characteristics, and the role of the learner as an active processor or encoder of information. Research on organization theory as it is applied to memory and recall of prose is reviewed here. (BW)
VOP memory management in MPEG-4
NASA Astrophysics Data System (ADS)
Vaithianathan, Karthikeyan; Panchanathan, Sethuraman
2001-03-01
MPEG-4 is a multimedia standard that requires Video Object Planes (VOPs). Generation of VOPs for any kind of video sequence is still a challenging problem that largely remains unsolved. Nevertheless, if this problem is treated by imposing certain constraints, solutions for specific application domains can be found. MPEG-4 applications in mobile devices is one such domain where the opposite goals namely low power and high throughput are required to be met. Efficient memory management plays a major role in reducing the power consumption. Specifically, efficient memory management for VOPs is difficult because the lifetimes of these objects vary and these life times may be overlapping. Varying life times of the objects requires dynamic memory management where memory fragmentation is a key problem that needs to be addressed. In general, memory management systems address this problem by following a combination of strategy, policy and mechanism. For MPEG4 based mobile devices that lack instruction processors, a hardware based memory management solution is necessary. In MPEG4 based mobile devices that have a RISC processor, using a Real time operating system (RTOS) for this memory management task is not expected to be efficient because the strategies and policies used by the ROTS is often tuned for handling memory segments of smaller sizes compared to object sizes. Hence, a memory management scheme specifically tuned for VOPs is important. In this paper, different strategies, policies and mechanisms for memory management are considered and an efficient combination is proposed for the case of VOP memory management along with a hardware architecture, which can handle the proposed combination.
Organization of brain tissue - Is the brain a noisy processor.
NASA Technical Reports Server (NTRS)
Adey, W. R.
1972-01-01
This paper presents some thoughts on functional organization in cerebral tissue. 'Spontaneous' wave and unit firing are considered as essential phenomena in the handling of information. Various models are discussed which have been suggested to describe the pseudorandom behavior of brain cells, leading to a view of the brain as an information processor and its role in learning, memory, remembering and forgetting.
NASA Astrophysics Data System (ADS)
Fang, Y.; Hou, J.; Engel, D.; Lin, G.; Yin, J.; Han, B.; Fang, Z.; Fountoulakis, V.
2011-12-01
In this study, we introduce an uncertainty quantification (UQ) software framework for carbon sequestration, with the focus of studying being the effect of spatial heterogeneity of reservoir properties on CO2 migration. We use a sequential Gaussian method (SGSIM) to generate realizations of permeability fields with various spatial statistical attributes. To deal with the computational difficulties, we integrate the following ideas/approaches: 1) firstly, we use three different sampling approaches (probabilistic collocation, quasi-Monte Carlo, and adaptive sampling approaches) to reduce the required forward calculations while trying to explore the parameter space and quantify the input uncertainty; 2) secondly, we use eSTOMP as the forward modeling simulator. eSTOMP is implemented using the Global Arrays toolkit (GA) that is based on one-sided inter-processor communication and supports a shared memory programming style on distributed memory platforms. It provides highly-scalable performance. It uses a data model to partition most of the large scale data structures into a relatively small number of distinct classes. The lower level simulator infrastructure (e.g. meshing support, associated data structures, and data mapping to processors) is separated from the higher level physics and chemistry algorithmic routines using a grid component interface; and 3) besides the faster model and more efficient algorithms to speed up the forward calculation, we built an adaptive system infrastructure to select the best possible data transfer mechanisms, to optimally allocate system resources to improve performance, and to integrate software packages and data for composing carbon sequestration simulation, computation, analysis, estimation and visualization. We will demonstrate the framework with a given CO2 injection scenario in a heterogeneous sandstone reservoir.
Set processing in a network environment. [data bases and magnetic disks and tapes
NASA Technical Reports Server (NTRS)
Hardgrave, W. T.
1975-01-01
A combination of a local network, a mass storage system, and an autonomous set processor serving as a data/storage management machine is described. Its characteristics include: content-accessible data bases usable from all connected devices; efficient storage/access of large data bases; simple and direct programming with data manipulation and storage management handled by the set processor; simple data base design and entry from source representation to set processor representation with no predefinition necessary; capability available for user sort/order specification; significant reduction in tape/disk pack storage and mounts; flexible environment that allows upgrading hardware/software configuration without causing major interruptions in service; minimal traffic on data communications network; and improved central memory usage on large processors.
A distributed fault-tolerant signal processor /FTSP/
NASA Astrophysics Data System (ADS)
Bonneau, R. J.; Evett, R. C.; Young, M. J.
1980-01-01
A digital fault-tolerant signal processor (FTSP), an example of a self-repairing programmable system is analyzed. The design configuration is discussed in terms of fault tolerance, system-level fault detection, isolation and common memory. Special attention is given to the FDIR (fault detection isolation and reconfiguration) logic, noting that the reconfiguration decisions are based on configuration, summary status, end-around tests, and north marker/synchro data. Several mechanisms of fault detection are described which initiate reconfiguration at different levels. It is concluded that the reliability of a signal processor can be significantly enhanced by the use of fault-tolerant techniques.
Method for simultaneous overlapped communications between neighboring processors in a multiple
Benner, Robert E.; Gustafson, John L.; Montry, Gary R.
1991-01-01
A parallel computing system and method having improved performance where a program is concurrently run on a plurality of nodes for reducing total processing time, each node having a processor, a memory, and a predetermined number of communication channels connected to the node and independently connected directly to other nodes. The present invention improves performance of performance of the parallel computing system by providing a system which can provide efficient communication between the processors and between the system and input and output devices. A method is also disclosed which can locate defective nodes with the computing system.
HyperForest: A high performance multi-processor architecture for real-time intelligent systems
DOE Office of Scientific and Technical Information (OSTI.GOV)
Garcia, P. Jr.; Rebeil, J.P.; Pollard, H.
1997-04-01
Intelligent Systems are characterized by the intensive use of computer power. The computer revolution of the last few years is what has made possible the development of the first generation of Intelligent Systems. Software for second generation Intelligent Systems will be more complex and will require more powerful computing engines in order to meet real-time constraints imposed by new robots, sensors, and applications. A multiprocessor architecture was developed that merges the advantages of message-passing and shared-memory structures: expendability and real-time compliance. The HyperForest architecture will provide an expandable real-time computing platform for computationally intensive Intelligent Systems and open the doorsmore » for the application of these systems to more complex tasks in environmental restoration and cleanup projects, flexible manufacturing systems, and DOE`s own production and disassembly activities.« less
Unobtrusive Software and System Health Management with R2U2 on a Parallel MIMD Coprocessor
NASA Technical Reports Server (NTRS)
Schumann, Johann; Moosbrugger, Patrick
2017-01-01
Dynamic monitoring of software and system health of a complex cyber-physical system requires observers that continuously monitor variables of the embedded software in order to detect anomalies and reason about root causes. There exists a variety of techniques for code instrumentation, but instrumentation might change runtime behavior and could require costly software re-certification. In this paper, we present R2U2E, a novel realization of our real-time, Realizable, Responsive, and Unobtrusive Unit (R2U2). The R2U2E observers are executed in parallel on a dedicated 16-core EPIPHANY co-processor, thereby avoiding additional computational overhead to the system under observation. A DMA-based shared memory access architecture allows R2U2E to operate without any code instrumentation or program interference.
Principles for problem aggregation and assignment in medium scale multiprocessors
NASA Technical Reports Server (NTRS)
Nicol, David M.; Saltz, Joel H.
1987-01-01
One of the most important issues in parallel processing is the mapping of workload to processors. This paper considers a large class of problems having a high degree of potential fine grained parallelism, and execution requirements that are either not predictable, or are too costly to predict. The main issues in mapping such a problem onto medium scale multiprocessors are those of aggregation and assignment. We study a method of parameterized aggregation that makes few assumptions about the workload. The mapping of aggregate units of work onto processors is uniform, and exploits locality of workload intensity to balance the unknown workload. In general, a finer aggregate granularity leads to a better balance at the price of increased communication/synchronization costs; the aggregation parameters can be adjusted to find a reasonable granularity. The effectiveness of this scheme is demonstrated on three model problems: an adaptive one-dimensional fluid dynamics problem with message passing, a sparse triangular linear system solver on both a shared memory and a message-passing machine, and a two-dimensional time-driven battlefield simulation employing message passing. Using the model problems, the tradeoffs are studied between balanced workload and the communication/synchronization costs. Finally, an analytical model is used to explain why the method balances workload and minimizes the variance in system behavior.
Progress Towards a Rad-Hydro Code for Modern Computing Architectures LA-UR-10-02825
NASA Astrophysics Data System (ADS)
Wohlbier, J. G.; Lowrie, R. B.; Bergen, B.; Calef, M.
2010-11-01
We are entering an era of high performance computing where data movement is the overwhelming bottleneck to scalable performance, as opposed to the speed of floating-point operations per processor. All multi-core hardware paradigms, whether heterogeneous or homogeneous, be it the Cell processor, GPGPU, or multi-core x86, share this common trait. In multi-physics applications such as inertial confinement fusion or astrophysics, one may be solving multi-material hydrodynamics with tabular equation of state data lookups, radiation transport, nuclear reactions, and charged particle transport in a single time cycle. The algorithms are intensely data dependent, e.g., EOS, opacity, nuclear data, and multi-core hardware memory restrictions are forcing code developers to rethink code and algorithm design. For the past two years LANL has been funding a small effort referred to as Multi-Physics on Multi-Core to explore ideas for code design as pertaining to inertial confinement fusion and astrophysics applications. The near term goals of this project are to have a multi-material radiation hydrodynamics capability, with tabular equation of state lookups, on cartesian and curvilinear block structured meshes. In the longer term we plan to add fully implicit multi-group radiation diffusion and material heat conduction, and block structured AMR. We will report on our progress to date.
Potential of minicomputer/array-processor system for nonlinear finite-element analysis
NASA Technical Reports Server (NTRS)
Strohkorb, G. A.; Noor, A. K.
1983-01-01
The potential of using a minicomputer/array-processor system for the efficient solution of large-scale, nonlinear, finite-element problems is studied. A Prime 750 is used as the host computer, and a software simulator residing on the Prime is employed to assess the performance of the Floating Point Systems AP-120B array processor. Major hardware characteristics of the system such as virtual memory and parallel and pipeline processing are reviewed, and the interplay between various hardware components is examined. Effective use of the minicomputer/array-processor system for nonlinear analysis requires the following: (1) proper selection of the computational procedure and the capability to vectorize the numerical algorithms; (2) reduction of input-output operations; and (3) overlapping host and array-processor operations. A detailed discussion is given of techniques to accomplish each of these tasks. Two benchmark problems with 1715 and 3230 degrees of freedom, respectively, are selected to measure the anticipated gain in speed obtained by using the proposed algorithms on the array processor.
Design of RISC Processor Using VHDL and Cadence
NASA Astrophysics Data System (ADS)
Moslehpour, Saeid; Puliroju, Chandrasekhar; Abu-Aisheh, Akram
The project deals about development of a basic RISC processor. The processor is designed with basic architecture consisting of internal modules like clock generator, memory, program counter, instruction register, accumulator, arithmetic and logic unit and decoder. This processor is mainly used for simple general purpose like arithmetic operations and which can be further developed for general purpose processor by increasing the size of the instruction register. The processor is designed in VHDL by using Xilinx 8.1i version. The present project also serves as an application of the knowledge gained from past studies of the PSPICE program. The study will show how PSPICE can be used to simplify massive complex circuits designed in VHDL Synthesis. The purpose of the project is to explore the designed RISC model piece by piece, examine and understand the Input/ Output pins, and to show how the VHDL synthesis code can be converted to a simplified PSPICE model. The project will also serve as a collection of various research materials about the pieces of the circuit.
Fault tolerant, radiation hard, high performance digital signal processor
NASA Technical Reports Server (NTRS)
Holmann, Edgar; Linscott, Ivan R.; Maurer, Michael J.; Tyler, G. L.; Libby, Vibeke
1990-01-01
An architecture has been developed for a high-performance VLSI digital signal processor that is highly reliable, fault-tolerant, and radiation-hard. The signal processor, part of a spacecraft receiver designed to support uplink radio science experiments at the outer planets, organizes the connections between redundant arithmetic resources, register files, and memory through a shuffle exchange communication network. The configuration of the network and the state of the processor resources are all under microprogram control, which both maps the resources according to algorithmic needs and reconfigures the processing should a failure occur. In addition, the microprogram is reloadable through the uplink to accommodate changes in the science objectives throughout the course of the mission. The processor will be implemented with silicon compiler tools, and its design will be verified through silicon compilation simulation at all levels from the resources to full functionality. By blending reconfiguration with redundancy the processor implementation is fault-tolerant and reliable, and possesses the long expected lifetime needed for a spacecraft mission to the outer planets.
Solving very large, sparse linear systems on mesh-connected parallel computers
NASA Technical Reports Server (NTRS)
Opsahl, Torstein; Reif, John
1987-01-01
The implementation of Pan and Reif's Parallel Nested Dissection (PND) algorithm on mesh connected parallel computers is described. This is the first known algorithm that allows very large, sparse linear systems of equations to be solved efficiently in polylog time using a small number of processors. How the processor bound of PND can be matched to the number of processors available on a given parallel computer by slowing down the algorithm by constant factors is described. Also, for the important class of problems where G(A) is a grid graph, a unique memory mapping that reduces the inter-processor communication requirements of PND to those that can be executed on mesh connected parallel machines is detailed. A description of an implementation on the Goodyear Massively Parallel Processor (MPP), located at Goddard is given. Also, a detailed discussion of data mappings and performance issues is given.
Parallel volume ray-casting for unstructured-grid data on distributed-memory architectures
NASA Technical Reports Server (NTRS)
Ma, Kwan-Liu
1995-01-01
As computing technology continues to advance, computational modeling of scientific and engineering problems produces data of increasing complexity: large in size and unstructured in shape. Volume visualization of such data is a challenging problem. This paper proposes a distributed parallel solution that makes ray-casting volume rendering of unstructured-grid data practical. Both the data and the rendering process are distributed among processors. At each processor, ray-casting of local data is performed independent of the other processors. The global image composing processes, which require inter-processor communication, are overlapped with the local ray-casting processes to achieve maximum parallel efficiency. This algorithm differs from previous ones in four ways: it is completely distributed, less view-dependent, reasonably scalable, and flexible. Without using dynamic load balancing, test results on the Intel Paragon using from two to 128 processors show, on average, about 60% parallel efficiency.
Crosetto, Dario B.
1996-01-01
The present device provides for a dynamically configurable communication network having a multi-processor parallel processing system having a serial communication network and a high speed parallel communication network. The serial communication network is used to disseminate commands from a master processor (100) to a plurality of slave processors (200) to effect communication protocol, to control transmission of high density data among nodes and to monitor each slave processor's status. The high speed parallel processing network is used to effect the transmission of high density data among nodes in the parallel processing system. Each node comprises a transputer (104), a digital signal processor (114), a parallel transfer controller (106), and two three-port memory devices. A communication switch (108) within each node (100) connects it to a fast parallel hardware channel (70) through which all high density data arrives or leaves the node.
Bhanot, Gyan [Princeton, NJ; Blumrich, Matthias A [Ridgefield, CT; Chen, Dong [Croton On Hudson, NY; Coteus, Paul W [Yorktown Heights, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Philip [Cortlandt Manor, NY; Steinmacher-Burow, Burkhard D [Mount Kisco, NY; Takken, Todd E [Mount Kisco, NY; Vranas, Pavlos M [Bedford Hills, NY
2009-09-08
Class network routing is implemented in a network such as a computer network comprising a plurality of parallel compute processors at nodes thereof. Class network routing allows a compute processor to broadcast a message to a range (one or more) of other compute processors in the computer network, such as processors in a column or a row. Normally this type of operation requires a separate message to be sent to each processor. With class network routing pursuant to the invention, a single message is sufficient, which generally reduces the total number of messages in the network as well as the latency to do a broadcast. Class network routing is also applied to dense matrix inversion algorithms on distributed memory parallel supercomputers with hardware class function (multicast) capability. This is achieved by exploiting the fact that the communication patterns of dense matrix inversion can be served by hardware class functions, which results in faster execution times.
Study on fault-tolerant processors for advanced launch system
NASA Technical Reports Server (NTRS)
Shin, Kang G.; Liu, Jyh-Charn
1990-01-01
Issues related to the reliability of a redundant system with large main memory are addressed. The Fault-Tolerant Processor (FTP) for the Advanced Launch System (ALS) is used as a basis for the presentation. When the system is free of latent faults, the probability of system crash due to multiple channel faults is shown to be insignificant even when voting on the outputs of computing channels is infrequent. Using channel error maskers (CEMs) is shown to improve reliability more effectively than increasing redundancy or the number of channels for applications with long mission times. Even without using a voter, most memory errors can be immediately corrected by those CEMs implemented with conventional coding techniques. In addition to their ability to enhance system reliability, CEMs (with a very low hardware overhead) can be used to dramatically reduce not only the need of memory realignment, but also the time required to realign channel memories in case, albeit rare, such a need arises. Using CEMs, two different schemes were developed to solve the memory realignment problem. In both schemes, most errors are corrected by CEMs, and the remaining errors are masked by a voter.
Local wavelet transform: a cost-efficient custom processor for space image compression
NASA Astrophysics Data System (ADS)
Masschelein, Bart; Bormans, Jan G.; Lafruit, Gauthier
2002-11-01
Thanks to its intrinsic scalability features, the wavelet transform has become increasingly popular as decorrelator in image compression applications. Throuhgput, memory requirements and complexity are important parameters when developing hardware image compression modules. An implementation of the classical, global wavelet transform requires large memory sizes and implies a large latency between the availability of the input image and the production of minimal data entities for entropy coding. Image tiling methods, as proposed by JPEG2000, reduce the memory sizes and the latency, but inevitably introduce image artefacts. The Local Wavelet Transform (LWT), presented in this paper, is a low-complexity wavelet transform architecture using a block-based processing that results in the same transformed images as those obtained by the global wavelet transform. The architecture minimizes the processing latency with a limited amount of memory. Moreover, as the LWT is an instruction-based custom processor, it can be programmed for specific tasks, such as push-broom processing of infinite-length satelite images. The features of the LWT makes it appropriate for use in space image compression, where high throughput, low memory sizes, low complexity, low power and push-broom processing are important requirements.
AFTOMS Technology Issues and Alternatives Report
1989-12-01
color , resolu- power requirements, physi- tion; memory , processor speed; cal and weather rugged- IAN interfaces, etc,) f,: these ness. display...Telephone and Telegraph 3 CD-I Compact Disk - Interactive CD-ROM Compact Disk-Read Only Memory CGM Computer Graphics Metafile CNWDI Critical Nuclear...Database Management System RFP Request For Proposal 3 RFS Remote File System ROM Read Only Memory 3 S SA-ALC San Antonio Air Logistics Center 3 SAC
Efficiently modeling neural networks on massively parallel computers
NASA Technical Reports Server (NTRS)
Farber, Robert M.
1993-01-01
Neural networks are a very useful tool for analyzing and modeling complex real world systems. Applying neural network simulations to real world problems generally involves large amounts of data and massive amounts of computation. To efficiently handle the computational requirements of large problems, we have implemented at Los Alamos a highly efficient neural network compiler for serial computers, vector computers, vector parallel computers, and fine grain SIMD computers such as the CM-2 connection machine. This paper describes the mapping used by the compiler to implement feed-forward backpropagation neural networks for a SIMD (Single Instruction Multiple Data) architecture parallel computer. Thinking Machines Corporation has benchmarked our code at 1.3 billion interconnects per second (approximately 3 gigaflops) on a 64,000 processor CM-2 connection machine (Singer 1990). This mapping is applicable to other SIMD computers and can be implemented on MIMD computers such as the CM-5 connection machine. Our mapping has virtually no communications overhead with the exception of the communications required for a global summation across the processors (which has a sub-linear runtime growth on the order of O(log(number of processors)). We can efficiently model very large neural networks which have many neurons and interconnects and our mapping can extend to arbitrarily large networks (within memory limitations) by merging the memory space of separate processors with fast adjacent processor interprocessor communications. This paper will consider the simulation of only feed forward neural network although this method is extendable to recurrent networks.
Development of Improved Modeling and Analysis Techniques for Dynamics of Shell Structures
1991-07-24
Engineering Sciences and Center for Space Structures and Control University of Colorado,Campus Box 429 Boulder, Colorado 80309 Accesion :or -.... ... i...system architecture ; third, to implement a decomposi- tion/mapping procedure that matches as far as possible the layout of the processors to the...element computations. In particular. we address issues that are related to the processor memory size. to the SIMD architecture and to the fast
Computer-Aided Fabrication of Integrated Circuits
1989-09-30
baseline CMOS process. One result of this effort was the identification of several residual bugs in the PATRAN graphics processor . The vendor promises...virtual memory. The internal Nubus architecture uses a 32-bit LISP processor running at 10 megahertz (100 ns clock period). An ethernet controller is...For different patterns, we need different masks for the photo step, and for dif- ferent micro -structures of the wafers, we need different etching
Energy consumption estimation of an OMAP-based Android operating system
NASA Astrophysics Data System (ADS)
González, Gabriel; Juárez, Eduardo; Castro, Juan José; Sanz, César
2011-05-01
System-level energy optimization of battery-powered multimedia embedded systems has recently become a design goal. The poor operational time of multimedia terminals makes computationally demanding applications impractical in real scenarios. For instance, the so-called smart-phones are currently unable to remain in operation longer than several hours. The OMAP3530 processor basically consists of two processing cores, a General Purpose Processor (GPP) and a Digital Signal Processor (DSP). The former, an ARM Cortex-A8 processor, is aimed to run a generic Operating System (OS) while the latter, a DSP core based on the C64x+, has architecture optimized for video processing. The BeagleBoard, a commercial prototyping board based on the OMAP processor, has been used to test the Android Operating System and measure its performance. The board has 128 MB of SDRAM external memory, 256 MB of Flash external memory and several interfaces. Note that the clock frequency of the ARM and DSP OMAP cores is 600 MHz and 430 MHz, respectively. This paper describes the energy consumption estimation of the processes and multimedia applications of an Android v1.6 (Donut) OS on the OMAP3530-Based BeagleBoard. In addition, tools to communicate the two processing cores have been employed. A test-bench to profile the OS resource usage has been developed. As far as the energy estimates concern, the OMAP processor energy consumption model provided by the manufacturer has been used. The model is basically divided in two energy components. The former, the baseline core energy, describes the energy consumption that is independent of any chip activity. The latter, the module active energy, describes the energy consumed by the active modules depending on resource usage.
Programmable fuzzy associative memory processor
NASA Astrophysics Data System (ADS)
Shao, Lan; Liu, Liren; Li, Guoqiang
1996-02-01
An optical system based on the method of spatial area-coding and multiple image scheme is proposed for fuzzy associative memory processing. Fuzzy maximum operation is accomplished by a ferroelectric liquid crystal PROM instead of a computer-based approach. A relative subsethood is introduced here to be used as a criterion for the recall evaluation.
Respecting Relations: Memory Access and Antecedent Retrieval in Incremental Sentence Processing
ERIC Educational Resources Information Center
Kush, Dave W.
2013-01-01
This dissertation uses the processing of anaphoric relations to probe how linguistic information is encoded in and retrieved from memory during real-time sentence comprehension. More specifically, the dissertation attempts to resolve a tension between the demands of a linguistic processor implemented in a general-purpose cognitive architecture and…
Peregrine System | High-Performance Computing | NREL
) and longer-term (/projects) storage. These file systems are mounted on all nodes. Peregrine has three -2670 Xeon processors and 64 GB of memory. In addition to mounting the /home, /nopt, /projects and # cores/node Memory/node Peak (DP) performance per node 88 Intel Xeon E5-2670 "Sandy Bridge" 8
The Effect of NUMA Tunings on CPU Performance
NASA Astrophysics Data System (ADS)
Hollowell, Christopher; Caramarcu, Costin; Strecker-Kellogg, William; Wong, Antonio; Zaytsev, Alexandr
2015-12-01
Non-Uniform Memory Access (NUMA) is a memory architecture for symmetric multiprocessing (SMP) systems where each processor is directly connected to separate memory. Indirect access to other CPU's (remote) RAM is still possible, but such requests are slower as they must also pass through that memory's controlling CPU. In concert with a NUMA-aware operating system, the NUMA hardware architecture can help eliminate the memory performance reductions generally seen in SMP systems when multiple processors simultaneously attempt to access memory. The x86 CPU architecture has supported NUMA for a number of years. Modern operating systems such as Linux support NUMA-aware scheduling, where the OS attempts to schedule a process to the CPU directly attached to the majority of its RAM. In Linux, it is possible to further manually tune the NUMA subsystem using the numactl utility. With the release of Red Hat Enterprise Linux (RHEL) 6.3, the numad daemon became available in this distribution. This daemon monitors a system's NUMA topology and utilization, and automatically makes adjustments to optimize locality. As the number of cores in x86 servers continues to grow, efficient NUMA mappings of processes to CPUs/memory will become increasingly important. This paper gives a brief overview of NUMA, and discusses the effects of manual tunings and numad on the performance of the HEPSPEC06 benchmark, and ATLAS software.
Farabet, Clément; Paz, Rafael; Pérez-Carrasco, Jose; Zamarreño-Ramos, Carlos; Linares-Barranco, Alejandro; LeCun, Yann; Culurciello, Eugenio; Serrano-Gotarredona, Teresa; Linares-Barranco, Bernabe
2012-01-01
Most scene segmentation and categorization architectures for the extraction of features in images and patches make exhaustive use of 2D convolution operations for template matching, template search, and denoising. Convolutional Neural Networks (ConvNets) are one example of such architectures that can implement general-purpose bio-inspired vision systems. In standard digital computers 2D convolutions are usually expensive in terms of resource consumption and impose severe limitations for efficient real-time applications. Nevertheless, neuro-cortex inspired solutions, like dedicated Frame-Based or Frame-Free Spiking ConvNet Convolution Processors, are advancing real-time visual processing. These two approaches share the neural inspiration, but each of them solves the problem in different ways. Frame-Based ConvNets process frame by frame video information in a very robust and fast way that requires to use and share the available hardware resources (such as: multipliers, adders). Hardware resources are fixed- and time-multiplexed by fetching data in and out. Thus memory bandwidth and size is important for good performance. On the other hand, spike-based convolution processors are a frame-free alternative that is able to perform convolution of a spike-based source of visual information with very low latency, which makes ideal for very high-speed applications. However, hardware resources need to be available all the time and cannot be time-multiplexed. Thus, hardware should be modular, reconfigurable, and expansible. Hardware implementations in both VLSI custom integrated circuits (digital and analog) and FPGA have been already used to demonstrate the performance of these systems. In this paper we present a comparison study of these two neuro-inspired solutions. A brief description of both systems is presented and also discussions about their differences, pros and cons. PMID:22518097
Farabet, Clément; Paz, Rafael; Pérez-Carrasco, Jose; Zamarreño-Ramos, Carlos; Linares-Barranco, Alejandro; Lecun, Yann; Culurciello, Eugenio; Serrano-Gotarredona, Teresa; Linares-Barranco, Bernabe
2012-01-01
Most scene segmentation and categorization architectures for the extraction of features in images and patches make exhaustive use of 2D convolution operations for template matching, template search, and denoising. Convolutional Neural Networks (ConvNets) are one example of such architectures that can implement general-purpose bio-inspired vision systems. In standard digital computers 2D convolutions are usually expensive in terms of resource consumption and impose severe limitations for efficient real-time applications. Nevertheless, neuro-cortex inspired solutions, like dedicated Frame-Based or Frame-Free Spiking ConvNet Convolution Processors, are advancing real-time visual processing. These two approaches share the neural inspiration, but each of them solves the problem in different ways. Frame-Based ConvNets process frame by frame video information in a very robust and fast way that requires to use and share the available hardware resources (such as: multipliers, adders). Hardware resources are fixed- and time-multiplexed by fetching data in and out. Thus memory bandwidth and size is important for good performance. On the other hand, spike-based convolution processors are a frame-free alternative that is able to perform convolution of a spike-based source of visual information with very low latency, which makes ideal for very high-speed applications. However, hardware resources need to be available all the time and cannot be time-multiplexed. Thus, hardware should be modular, reconfigurable, and expansible. Hardware implementations in both VLSI custom integrated circuits (digital and analog) and FPGA have been already used to demonstrate the performance of these systems. In this paper we present a comparison study of these two neuro-inspired solutions. A brief description of both systems is presented and also discussions about their differences, pros and cons.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Nash, T.; Atac, R.; Cook, A.
1989-03-06
The ACPMAPS multipocessor is a highly cost effective, local memory parallel computer with a hypercube or compound hypercube architecture. Communication requires the attention of only the two communicating nodes. The design is aimed at floating point intensive, grid like problems, particularly those with extreme computing requirements. The processing nodes of the system are single board array processors, each with a peak power of 20 Mflops, supported by 8 Mbytes of data and 2 Mbytes of instruction memory. The system currently being assembled has a peak power of 5 Gflops. The nodes are based on the Weitek XL Chip set. Themore » system delivers performance at approximately $300/Mflop. 8 refs., 4 figs.« less
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chen, Chao; Pouransari, Hadi; Rajamanickam, Sivasankaran
We present a parallel hierarchical solver for general sparse linear systems on distributed-memory machines. For large-scale problems, this fully algebraic algorithm is faster and more memory-efficient than sparse direct solvers because it exploits the low-rank structure of fill-in blocks. Depending on the accuracy of low-rank approximations, the hierarchical solver can be used either as a direct solver or as a preconditioner. The parallel algorithm is based on data decomposition and requires only local communication for updating boundary data on every processor. Moreover, the computation-to-communication ratio of the parallel algorithm is approximately the volume-to-surface-area ratio of the subdomain owned by everymore » processor. We also provide various numerical results to demonstrate the versatility and scalability of the parallel algorithm.« less
Production Level CFD Code Acceleration for Hybrid Many-Core Architectures
NASA Technical Reports Server (NTRS)
Duffy, Austen C.; Hammond, Dana P.; Nielsen, Eric J.
2012-01-01
In this work, a novel graphics processing unit (GPU) distributed sharing model for hybrid many-core architectures is introduced and employed in the acceleration of a production-level computational fluid dynamics (CFD) code. The latest generation graphics hardware allows multiple processor cores to simultaneously share a single GPU through concurrent kernel execution. This feature has allowed the NASA FUN3D code to be accelerated in parallel with up to four processor cores sharing a single GPU. For codes to scale and fully use resources on these and the next generation machines, codes will need to employ some type of GPU sharing model, as presented in this work. Findings include the effects of GPU sharing on overall performance. A discussion of the inherent challenges that parallel unstructured CFD codes face in accelerator-based computing environments is included, with considerations for future generation architectures. This work was completed by the author in August 2010, and reflects the analysis and results of the time.
MULTI-CORE AND OPTICAL PROCESSOR RELATED APPLICATIONS RESEARCH AT OAK RIDGE NATIONAL LABORATORY
DOE Office of Scientific and Technical Information (OSTI.GOV)
Barhen, Jacob; Kerekes, Ryan A; ST Charles, Jesse Lee
2008-01-01
High-speed parallelization of common tasks holds great promise as a low-risk approach to achieving the significant increases in signal processing and computational performance required for next generation innovations in reconfigurable radio systems. Researchers at the Oak Ridge National Laboratory have been working on exploiting the parallelization offered by this emerging technology and applying it to a variety of problems. This paper will highlight recent experience with four different parallel processors applied to signal processing tasks that are directly relevant to signal processing required for SDR/CR waveforms. The first is the EnLight Optical Core Processor applied to matched filter (MF) correlationmore » processing via fast Fourier transform (FFT) of broadband Dopplersensitive waveforms (DSW) using active sonar arrays for target tracking. The second is the IBM CELL Broadband Engine applied to 2-D discrete Fourier transform (DFT) kernel for image processing and frequency domain processing. And the third is the NVIDIA graphical processor applied to document feature clustering. EnLight Optical Core Processor. Optical processing is inherently capable of high-parallelism that can be translated to very high performance, low power dissipation computing. The EnLight 256 is a small form factor signal processing chip (5x5 cm2) with a digital optical core that is being developed by an Israeli startup company. As part of its evaluation of foreign technology, ORNL's Center for Engineering Science Advanced Research (CESAR) had access to a precursor EnLight 64 Alpha hardware for a preliminary assessment of capabilities in terms of large Fourier transforms for matched filter banks and on applications related to Doppler-sensitive waveforms. This processor is optimized for array operations, which it performs in fixed-point arithmetic at the rate of 16 TeraOPS at 8-bit precision. This is approximately 1000 times faster than the fastest DSP available today. The optical core performs the matrix-vector multiplications, where the nominal matrix size is 256x256. The system clock is 125MHz. At each clock cycle, 128K multiply-and-add operations per second (OPS) are carried out, which yields a peak performance of 16 TeraOPS. IBM Cell Broadband Engine. The Cell processor is the extraordinary resulting product of 5 years of sustained, intensive R&D collaboration (involving over $400M investment) between IBM, Sony, and Toshiba. Its architecture comprises one multithreaded 64-bit PowerPC processor element (PPE) with VMX capabilities and two levels of globally coherent cache, and 8 synergistic processor elements (SPEs). Each SPE consists of a processor (SPU) designed for streaming workloads, local memory, and a globally coherent direct memory access (DMA) engine. Computations are performed in 128-bit wide single instruction multiple data streams (SIMD). An integrated high-bandwidth element interconnect bus (EIB) connects the nine processors and their ports to external memory and to system I/O. The Applied Software Engineering Research (ASER) Group at the ORNL is applying the Cell to a variety of text and image analysis applications. Research on Cell-equipped PlayStation3 (PS3) consoles has led to the development of a correlation-based image recognition engine that enables a single PS3 to process images at more than 10X the speed of state-of-the-art single-core processors. NVIDIA Graphics Processing Units. The ASER group is also employing the latest NVIDIA graphical processing units (GPUs) to accelerate clustering of thousands of text documents using recently developed clustering algorithms such as document flocking and affinity propagation.« less
Optimal processor assignment for pipeline computations
NASA Technical Reports Server (NTRS)
Nicol, David M.; Simha, Rahul; Choudhury, Alok N.; Narahari, Bhagirath
1991-01-01
The availability of large scale multitasked parallel architectures introduces the following processor assignment problem for pipelined computations. Given a set of tasks and their precedence constraints, along with their experimentally determined individual responses times for different processor sizes, find an assignment of processor to tasks. Two objectives are of interest: minimal response given a throughput requirement, and maximal throughput given a response time requirement. These assignment problems differ considerably from the classical mapping problem in which several tasks share a processor; instead, it is assumed that a large number of processors are to be assigned to a relatively small number of tasks. Efficient assignment algorithms were developed for different classes of task structures. For a p processor system and a series parallel precedence graph with n constituent tasks, an O(np2) algorithm is provided that finds the optimal assignment for the response time optimization problem; it was found that the assignment optimizing the constrained throughput in O(np2log p) time. Special cases of linear, independent, and tree graphs are also considered.
Performing an allreduce operation using shared memory
Archer, Charles J [Rochester, MN; Dozsa, Gabor [Ardsley, NY; Ratterman, Joseph D [Rochester, MN; Smith, Brian E [Rochester, MN
2012-04-17
Methods, apparatus, and products are disclosed for performing an allreduce operation using shared memory that include: receiving, by at least one of a plurality of processing cores on a compute node, an instruction to perform an allreduce operation; establishing, by the core that received the instruction, a job status object for specifying a plurality of shared memory allreduce work units, the plurality of shared memory allreduce work units together performing the allreduce operation on the compute node; determining, by an available core on the compute node, a next shared memory allreduce work unit in the job status object; and performing, by that available core on the compute node, that next shared memory allreduce work unit.
Performing an allreduce operation using shared memory
Archer, Charles J; Dozsa, Gabor; Ratterman, Joseph D; Smith, Brian E
2014-06-10
Methods, apparatus, and products are disclosed for performing an allreduce operation using shared memory that include: receiving, by at least one of a plurality of processing cores on a compute node, an instruction to perform an allreduce operation; establishing, by the core that received the instruction, a job status object for specifying a plurality of shared memory allreduce work units, the plurality of shared memory allreduce work units together performing the allreduce operation on the compute node; determining, by an available core on the compute node, a next shared memory allreduce work unit in the job status object; and performing, by that available core on the compute node, that next shared memory allreduce work unit.
Second Report of the Multirate Processor (MRP) for Digital Voice Communications.
1982-09-30
machine are: * two arithmetic logic units (ALUs)-one for data processing, and the other for address generation, * two memorys -6144 words (70 bits per word...of program memory , and 6094 words (16 bits per word) of data memory , q * input/output through modem and teletype, -15 .9 S-;. KANG AND FRANSEN Table...provides a measure of intelligibility and allows one to evaluate the discriminability of six distinctive features: voicing, nasality, sustention
DOE Office of Scientific and Technical Information (OSTI.GOV)
Zhang, Yao; Balaprakash, Prasanna; Meng, Jiayuan
We present Raexplore, a performance modeling framework for architecture exploration. Raexplore enables rapid, automated, and systematic search of architecture design space by combining hardware counter-based performance characterization and analytical performance modeling. We demonstrate Raexplore for two recent manycore processors IBM Blue- Gene/Q compute chip and Intel Xeon Phi, targeting a set of scientific applications. Our framework is able to capture complex interactions between architectural components including instruction pipeline, cache, and memory, and to achieve a 3–22% error for same-architecture and cross-architecture performance predictions. Furthermore, we apply our framework to assess the two processors, and discover and evaluate a list ofmore » architectural scaling options for future processor designs.« less
FPGA-Based Reconfigurable Processor for Ultrafast Interlaced Ultrasound and Photoacoustic Imaging
Alqasemi, Umar; Li, Hai; Aguirre, Andrés; Zhu, Quing
2016-01-01
In this paper, we report, to the best of our knowledge, a unique field-programmable gate array (FPGA)-based reconfigurable processor for real-time interlaced co-registered ultrasound and photoacoustic imaging and its application in imaging tumor dynamic response. The FPGA is used to control, acquire, store, delay-and-sum, and transfer the data for real-time co-registered imaging. The FPGA controls the ultrasound transmission and ultrasound and photoacoustic data acquisition process of a customized 16-channel module that contains all of the necessary analog and digital circuits. The 16-channel module is one of multiple modules plugged into a motherboard; their beamformed outputs are made available for a digital signal processor (DSP) to access using an external memory interface (EMIF). The FPGA performs a key role through ultrafast reconfiguration and adaptation of its structure to allow real-time switching between the two imaging modes, including transmission control, laser synchronization, internal memory structure, beamforming, and EMIF structure and memory size. It performs another role by parallel accessing of internal memories and multi-thread processing to reduce the transfer of data and the processing load on the DSP. Furthermore, because the laser will be pulsing even during ultrasound pulse-echo acquisition, the FPGA ensures that the laser pulses are far enough from the pulse-echo acquisitions by appropriate time-division multiplexing (TDM). A co-registered ultrasound and photoacoustic imaging system consisting of four FPGA modules (64-channels) is constructed, and its performance is demonstrated using phantom targets and in vivo mouse tumor models. PMID:22828830
FPGA-based reconfigurable processor for ultrafast interlaced ultrasound and photoacoustic imaging.
Alqasemi, Umar; Li, Hai; Aguirre, Andrés; Zhu, Quing
2012-07-01
In this paper, we report, to the best of our knowledge, a unique field-programmable gate array (FPGA)-based reconfigurable processor for real-time interlaced co-registered ultrasound and photoacoustic imaging and its application in imaging tumor dynamic response. The FPGA is used to control, acquire, store, delay-and-sum, and transfer the data for real-time co-registered imaging. The FPGA controls the ultrasound transmission and ultrasound and photoacoustic data acquisition process of a customized 16-channel module that contains all of the necessary analog and digital circuits. The 16-channel module is one of multiple modules plugged into a motherboard; their beamformed outputs are made available for a digital signal processor (DSP) to access using an external memory interface (EMIF). The FPGA performs a key role through ultrafast reconfiguration and adaptation of its structure to allow real-time switching between the two imaging modes, including transmission control, laser synchronization, internal memory structure, beamforming, and EMIF structure and memory size. It performs another role by parallel accessing of internal memories and multi-thread processing to reduce the transfer of data and the processing load on the DSP. Furthermore, because the laser will be pulsing even during ultrasound pulse-echo acquisition, the FPGA ensures that the laser pulses are far enough from the pulse-echo acquisitions by appropriate time-division multiplexing (TDM). A co-registered ultrasound and photoacoustic imaging system consisting of four FPGA modules (64-channels) is constructed, and its performance is demonstrated using phantom targets and in vivo mouse tumor models.
A Parallel Workload Model and its Implications for Processor Allocation
1996-11-01
with SEV or AVG, both of which can tolerate c = 0.4 { 0.6 before their performance deteriorates signi cantly. On the other hand, Setia [10] has...Sanjeev. K Setia . The interaction between memory allocation and adaptive partitioning in message-passing multicomputers. In IPPS Workshop on Job...Scheduling Strategies for Parallel Processing, pages 89{99, 1995. [11] Sanjeev K. Setia and Satish K. Tripathi. An analysis of several processor
Analog Delta-Back-Propagation Neural-Network Circuitry
NASA Technical Reports Server (NTRS)
Eberhart, Silvio
1990-01-01
Changes in synapse weights due to circuit drifts suppressed. Proposed fully parallel analog version of electronic neural-network processor based on delta-back-propagation algorithm. Processor able to "learn" when provided with suitable combinations of inputs and enforced outputs. Includes programmable resistive memory elements (corresponding to synapses), conductances (synapse weights) adjusted during learning. Buffer amplifiers, summing circuits, and sample-and-hold circuits arranged in layers of electronic neurons in accordance with delta-back-propagation algorithm.
Two-dimensional systolic-array architecture for pixel-level vision tasks
NASA Astrophysics Data System (ADS)
Vijverberg, Julien A.; de With, Peter H. N.
2010-05-01
This paper presents ongoing work on the design of a two-dimensional (2D) systolic array for image processing. This component is designed to operate on a multi-processor system-on-chip. In contrast with other 2D systolic-array architectures and many other hardware accelerators, we investigate the applicability of executing multiple tasks in a time-interleaved fashion on the Systolic Array (SA). This leads to a lower external memory bandwidth and better load balancing of the tasks on the different processing tiles. To enable the interleaving of tasks, we add a shadow-state register for fast task switching. To reduce the number of accesses to the external memory, we propose to share the communication assist between consecutive tasks. A preliminary, non-functional version of the SA has been synthesized for an XV4S25 FPGA device and yields a maximum clock frequency of 150 MHz requiring 1,447 slices and 5 memory blocks. Mapping tasks from video content-analysis applications from literature on the SA yields reductions in the execution time of 1-2 orders of magnitude compared to the software implementation. We conclude that the choice for an SA architecture is useful, but a scaled version of the SA featuring less logic with fewer processing and pipeline stages yielding a lower clock frequency, would be sufficient for a video analysis system-on-chip.
Bayer image parallel decoding based on GPU
NASA Astrophysics Data System (ADS)
Hu, Rihui; Xu, Zhiyong; Wei, Yuxing; Sun, Shaohua
2012-11-01
In the photoelectrical tracking system, Bayer image is decompressed in traditional method, which is CPU-based. However, it is too slow when the images become large, for example, 2K×2K×16bit. In order to accelerate the Bayer image decoding, this paper introduces a parallel speedup method for NVIDA's Graphics Processor Unit (GPU) which supports CUDA architecture. The decoding procedure can be divided into three parts: the first is serial part, the second is task-parallelism part, and the last is data-parallelism part including inverse quantization, inverse discrete wavelet transform (IDWT) as well as image post-processing part. For reducing the execution time, the task-parallelism part is optimized by OpenMP techniques. The data-parallelism part could advance its efficiency through executing on the GPU as CUDA parallel program. The optimization techniques include instruction optimization, shared memory access optimization, the access memory coalesced optimization and texture memory optimization. In particular, it can significantly speed up the IDWT by rewriting the 2D (Tow-dimensional) serial IDWT into 1D parallel IDWT. Through experimenting with 1K×1K×16bit Bayer image, data-parallelism part is 10 more times faster than CPU-based implementation. Finally, a CPU+GPU heterogeneous decompression system was designed. The experimental result shows that it could achieve 3 to 5 times speed increase compared to the CPU serial method.
Implementing Molecular Dynamics for Hybrid High Performance Computers - 1. Short Range Forces
DOE Office of Scientific and Technical Information (OSTI.GOV)
Brown, W Michael; Wang, Peng; Plimpton, Steven J
The use of accelerators such as general-purpose graphics processing units (GPGPUs) have become popular in scientific computing applications due to their low cost, impressive floating-point capabilities, high memory bandwidth, and low electrical power requirements. Hybrid high performance computers, machines with more than one type of floating-point processor, are now becoming more prevalent due to these advantages. In this work, we discuss several important issues in porting a large molecular dynamics code for use on parallel hybrid machines - 1) choosing a hybrid parallel decomposition that works on central processing units (CPUs) with distributed memory and accelerator cores with shared memory,more » 2) minimizing the amount of code that must be ported for efficient acceleration, 3) utilizing the available processing power from both many-core CPUs and accelerators, and 4) choosing a programming model for acceleration. We present our solution to each of these issues for short-range force calculation in the molecular dynamics package LAMMPS. We describe algorithms for efficient short range force calculation on hybrid high performance machines. We describe a new approach for dynamic load balancing of work between CPU and accelerator cores. We describe the Geryon library that allows a single code to compile with both CUDA and OpenCL for use on a variety of accelerators. Finally, we present results on a parallel test cluster containing 32 Fermi GPGPUs and 180 CPU cores.« less
Provably unbounded memory advantage in stochastic simulation using quantum mechanics
NASA Astrophysics Data System (ADS)
Garner, Andrew J. P.; Liu, Qing; Thompson, Jayne; Vedral, Vlatko; Gu, mile
2017-10-01
Simulating the stochastic evolution of real quantities on a digital computer requires a trade-off between the precision to which these quantities are approximated, and the memory required to store them. The statistical accuracy of the simulation is thus generally limited by the internal memory available to the simulator. Here, using tools from computational mechanics, we show that quantum processors with a fixed finite memory can simulate stochastic processes of real variables to arbitrarily high precision. This demonstrates a provable, unbounded memory advantage that a quantum simulator can exhibit over its best possible classical counterpart.
Adaptive packet switch with an optical core (demonstrator)
NASA Astrophysics Data System (ADS)
Abdo, Ahmad; Bishtein, Vadim; Clark, Stewart A.; Dicorato, Pino; Lu, David T.; Paredes, Sofia A.; Taebi, Sareh; Hall, Trevor J.
2004-11-01
A three-stage opto-electronic packet switch architecture is described consisting of a reconfigurable optical centre stage surrounded by two electronic buffering stages partitioned into sectors to ease memory contention. A Flexible Bandwidth Provision (FBP) algorithm, implemented on a soft-core processor, is used to change the configuration of the input sectors and optical centre stage to set up internal paths that will provide variable bandwidth to serve the traffic. The switch is modeled by a bipartite graph built from a service matrix, which is a function of the arriving traffic. The bipartite graph is decomposed by solving an edge-colouring problem and the resulting permutations are used to configure the switch. Simulation results show that this architecture exhibits a dramatic reduction of complexity and increased potential for scalability, at the price of only a modest spatial speed-up k, 1
Architectural Techniques For Managing Non-volatile Caches
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mittal, Sparsh
As chip power dissipation becomes a critical challenge in scaling processor performance, computer architects are forced to fundamentally rethink the design of modern processors and hence, the chip-design industry is now at a major inflection point in its hardware roadmap. The high leakage power and low density of SRAM poses serious obstacles in its use for designing large on-chip caches and for this reason, researchers are exploring non-volatile memory (NVM) devices, such as spin torque transfer RAM, phase change RAM and resistive RAM. However, since NVMs are not strictly superior to SRAM, effective architectural techniques are required for making themmore » a universal memory solution. This book discusses techniques for designing processor caches using NVM devices. It presents algorithms and architectures for improving their energy efficiency, performance and lifetime. It also provides both qualitative and quantitative evaluation to help the reader gain insights and motivate them to explore further. This book will be highly useful for beginners as well as veterans in computer architecture, chip designers, product managers and technical marketing professionals.« less
Dewaraja, Yuni K; Ljungberg, Michael; Majumdar, Amitava; Bose, Abhijit; Koral, Kenneth F
2002-02-01
This paper reports the implementation of the SIMIND Monte Carlo code on an IBM SP2 distributed memory parallel computer. Basic aspects of running Monte Carlo particle transport calculations on parallel architectures are described. Our parallelization is based on equally partitioning photons among the processors and uses the Message Passing Interface (MPI) library for interprocessor communication and the Scalable Parallel Random Number Generator (SPRNG) to generate uncorrelated random number streams. These parallelization techniques are also applicable to other distributed memory architectures. A linear increase in computing speed with the number of processors is demonstrated for up to 32 processors. This speed-up is especially significant in Single Photon Emission Computed Tomography (SPECT) simulations involving higher energy photon emitters, where explicit modeling of the phantom and collimator is required. For (131)I, the accuracy of the parallel code is demonstrated by comparing simulated and experimental SPECT images from a heart/thorax phantom. Clinically realistic SPECT simulations using the voxel-man phantom are carried out to assess scatter and attenuation correction.
Research in Parallel Algorithms and Software for Computational Aerosciences
NASA Technical Reports Server (NTRS)
Domel, Neal D.
1996-01-01
Phase I is complete for the development of a Computational Fluid Dynamics parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.
NASA Astrophysics Data System (ADS)
Fellman, Ronald D.; Kaneshiro, Ronald T.; Konstantinides, Konstantinos
1990-03-01
The authors present the design and evaluation of an architecture for a monolithic, programmable, floating-point digital signal processor (DSP) for instrumentation applications. An investigation of the most commonly used algorithms in instrumentation led to a design that satisfies the requirements for high computational and I/O (input/output) throughput. In the arithmetic unit, a 16- x 16-bit multiplier and a 32-bit accumulator provide the capability for single-cycle multiply/accumulate operations, and three format adjusters automatically adjust the data format for increased accuracy and dynamic range. An on-chip I/O unit is capable of handling data block transfers through a direct memory access port and real-time data streams through a pair of parallel I/O ports. I/O operations and program execution are performed in parallel. In addition, the processor includes two data memories with independent addressing units, a microsequencer with instruction RAM, and multiplexers for internal data redirection. The authors also present the structure and implementation of a design environment suitable for the algorithmic, behavioral, and timing simulation of a complete DSP system. Various benchmarking results are reported.
Research in Parallel Algorithms and Software for Computational Aerosciences
NASA Technical Reports Server (NTRS)
Domel, Neal D.
1996-01-01
Phase 1 is complete for the development of a computational fluid dynamics CFD) parallel code with automatic grid generation and adaptation for the Euler analysis of flow over complex geometries. SPLITFLOW, an unstructured Cartesian grid code developed at Lockheed Martin Tactical Aircraft Systems, has been modified for a distributed memory/massively parallel computing environment. The parallel code is operational on an SGI network, Cray J90 and C90 vector machines, SGI Power Challenge, and Cray T3D and IBM SP2 massively parallel machines. Parallel Virtual Machine (PVM) is the message passing protocol for portability to various architectures. A domain decomposition technique was developed which enforces dynamic load balancing to improve solution speed and memory requirements. A host/node algorithm distributes the tasks. The solver parallelizes very well, and scales with the number of processors. Partially parallelized and non-parallelized tasks consume most of the wall clock time in a very fine grain environment. Timing comparisons on a Cray C90 demonstrate that Parallel SPLITFLOW runs 2.4 times faster on 8 processors than its non-parallel counterpart autotasked over 8 processors.
GaAs Supercomputing: Architecture, Language, And Algorithms For Image Processing
NASA Astrophysics Data System (ADS)
Johl, John T.; Baker, Nick C.
1988-10-01
The application of high-speed GaAs processors in a parallel system matches the demanding computational requirements of image processing. The architecture of the McDonnell Douglas Astronautics Company (MDAC) vector processor is described along with the algorithms and language translator. Most image and signal processing algorithms can utilize parallel processing and show a significant performance improvement over sequential versions. The parallelization performed by this system is within each vector instruction. Since each vector has many elements, each requiring some computation, useful concurrent arithmetic operations can easily be performed. Balancing the memory bandwidth with the computation rate of the processors is an important design consideration for high efficiency and utilization. The architecture features a bus-based execution unit consisting of four to eight 32-bit GaAs RISC microprocessors running at a 200 MHz clock rate for a peak performance of 1.6 BOPS. The execution unit is connected to a vector memory with three buses capable of transferring two input words and one output word every 10 nsec. The address generators inside the vector memory perform different vector addressing modes and feed the data to the execution unit. The functions discussed in this paper include basic MATRIX OPERATIONS, 2-D SPATIAL CONVOLUTION, HISTOGRAM, and FFT. For each of these algorithms, assembly language programs were run on a behavioral model of the system to obtain performance figures.
Parallel implementation of an adaptive and parameter-free N-body integrator
NASA Astrophysics Data System (ADS)
Pruett, C. David; Ingham, William H.; Herman, Ralph D.
2011-05-01
Previously, Pruett et al. (2003) [3] described an N-body integrator of arbitrarily high order M with an asymptotic operation count of O(MN). The algorithm's structure lends itself readily to data parallelization, which we document and demonstrate here in the integration of point-mass systems subject to Newtonian gravitation. High order is shown to benefit parallel efficiency. The resulting N-body integrator is robust, parameter-free, highly accurate, and adaptive in both time-step and order. Moreover, it exhibits linear speedup on distributed parallel processors, provided that each processor is assigned at least a handful of bodies. Program summaryProgram title: PNB.f90 Catalogue identifier: AEIK_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEIK_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: Standard CPC license, http://cpc.cs.qub.ac.uk/licence/licence.html No. of lines in distributed program, including test data, etc.: 3052 No. of bytes in distributed program, including test data, etc.: 68 600 Distribution format: tar.gz Programming language: Fortran 90 and OpenMPI Computer: All shared or distributed memory parallel processors Operating system: Unix/Linux Has the code been vectorized or parallelized?: The code has been parallelized but has not been explicitly vectorized. RAM: Dependent upon N Classification: 4.3, 4.12, 6.5 Nature of problem: High accuracy numerical evaluation of trajectories of N point masses each subject to Newtonian gravitation. Solution method: Parallel and adaptive extrapolation in time via power series of arbitrary degree. Running time: 5.1 s for the demo program supplied with the package.
Parallel Computation of the Jacobian Matrix for Nonlinear Equation Solvers Using MATLAB
NASA Technical Reports Server (NTRS)
Rose, Geoffrey K.; Nguyen, Duc T.; Newman, Brett A.
2017-01-01
Demonstrating speedup for parallel code on a multicore shared memory PC can be challenging in MATLAB due to underlying parallel operations that are often opaque to the user. This can limit potential for improvement of serial code even for the so-called embarrassingly parallel applications. One such application is the computation of the Jacobian matrix inherent to most nonlinear equation solvers. Computation of this matrix represents the primary bottleneck in nonlinear solver speed such that commercial finite element (FE) and multi-body-dynamic (MBD) codes attempt to minimize computations. A timing study using MATLAB's Parallel Computing Toolbox was performed for numerical computation of the Jacobian. Several approaches for implementing parallel code were investigated while only the single program multiple data (spmd) method using composite objects provided positive results. Parallel code speedup is demonstrated but the goal of linear speedup through the addition of processors was not achieved due to PC architecture.
Pierce, Paul E.
1986-01-01
A hardware processor is disclosed which in the described embodiment is a memory mapped multiplier processor that can operate in parallel with a 16 bit microcomputer. The multiplier processor decodes the address bus to receive specific instructions so that in one access it can write and automatically perform single or double precision multiplication involving a number written to it with or without addition or subtraction with a previously stored number. It can also, on a single read command automatically round and scale a previously stored number. The multiplier processor includes two concatenated 16 bit multiplier registers, two 16 bit concatenated 16 bit multipliers, and four 16 bit product registers connected to an internal 16 bit data bus. A high level address decoder determines when the multiplier processor is being addressed and first and second low level address decoders generate control signals. In addition, certain low order address lines are used to carry uncoded control signals. First and second control circuits coupled to the decoders generate further control signals and generate a plurality of clocking pulse trains in response to the decoded and address control signals.
Pierce, P.E.
A hardware processor is disclosed which in the described embodiment is a memory mapped multiplier processor that can operate in parallel with a 16 bit microcomputer. The multiplier processor decodes the address bus to receive specific instructions so that in one access it can write and automatically perform single or double precision multiplication involving a number written to it with or without addition or subtraction with a previously stored number. It can also, on a single read command automatically round and scale a previously stored number. The multiplier processor includes two concatenated 16 bit multiplier registers, two 16 bit concatenated 16 bit multipliers, and four 16 bit product registers connected to an internal 16 bit data bus. A high level address decoder determines when the multiplier processor is being addressed and first and second low level address decoders generate control signals. In addition, certain low order address lines are used to carry uncoded control signals. First and second control circuits coupled to the decoders generate further control signals and generate a plurality of clocking pulse trains in response to the decoded and address control signals.
Efficient Numeric and Geometric Computations using Heterogeneous Shared Memory Architectures
2017-10-04
Report: Efficient Numeric and Geometric Computations using Heterogeneous Shared Memory Architectures The views, opinions and/or findings contained in this...Chapel Hill Title: Efficient Numeric and Geometric Computations using Heterogeneous Shared Memory Architectures Report Term: 0-Other Email: dm...algorithms for scientific and geometric computing by exploiting the power and performance efficiency of heterogeneous shared memory architectures . These
A Comparison Between the PLM and the MC68020 as Prolog Processors
1988-01-01
Continnt &OII P0111ter CP Memory X6_ofset(MP) A11ument Register 6 A6 Memory X7_ofset(MP) A11ument Register 7 A7 Memory X6_ofaet(MP) Tempor&ry Register 6...get_vuia.ble_Y iaput. Permeunt nria.ble Yi &Dd &rgumeat ~Jl8ler XJ output: fuDctioD move the content of Xj iato Yi get_va.na.ble_Y: move.! Xi.·4
Brandon, Nicole R; Beike, Denise R; Cole, Holly E
2017-07-01
Autobiographical memories (AMs) can be used to create and maintain closeness with others [Alea, N., & Bluck, S. (2003). Why are you telling me that? A conceptual model of the social function of autobiographical memory. Memory, 11(2), 165-178]. However, the differential effects of memory specificity are not well established. Two studies with 148 participants tested whether the order in which autobiographical knowledge (AK) and specific episodic AM (EAM) are shared affects feelings of closeness. Participants read two memories hypothetically shared by each of four strangers. The strangers first shared either AK or an EAM, and then shared either AK or an EAM. Participants were randomly assigned to read either positive or negative AMs from the strangers. Findings suggest that people feel closer to those who share positive AMs in the same way they construct memories: starting with general and moving to specific.
System on a chip with MPEG-4 capability
NASA Astrophysics Data System (ADS)
Yassa, Fathy; Schonfeld, Dan
2002-12-01
Current products supporting video communication applications rely on existing computer architectures. RISC processors have been used successfully in numerous applications over several decades. DSP processors have become ubiquitous in signal processing and communication applications. Real-time applications such as speech processing in cellular telephony rely extensively on the computational power of these processors. Video processors designed to implement the computationally intensive codec operations have also been used to address the high demands of video communication applications (e.g., cable set-top boxes and DVDs). This paper presents an overview of a system-on-chip (SOC) architecture used for real-time video in wireless communication applications. The SOC specifications answer to the system requirements imposed by the application environment. A CAM-based video processor is used to accelerate data intensive video compression tasks such as motion estimations and filtering. Other components are dedicated to system level data processing and audio processing. A rich set of I/Os allows the SOC to communicate with other system components such as baseband and memory subsystems.
FAST: framework for heterogeneous medical image computing and visualization.
Smistad, Erik; Bozorgi, Mohammadmehdi; Lindseth, Frank
2015-11-01
Computer systems are becoming increasingly heterogeneous in the sense that they consist of different processors, such as multi-core CPUs and graphic processing units. As the amount of medical image data increases, it is crucial to exploit the computational power of these processors. However, this is currently difficult due to several factors, such as driver errors, processor differences, and the need for low-level memory handling. This paper presents a novel FrAmework for heterogeneouS medical image compuTing and visualization (FAST). The framework aims to make it easier to simultaneously process and visualize medical images efficiently on heterogeneous systems. FAST uses common image processing programming paradigms and hides the details of memory handling from the user, while enabling the use of all processors and cores on a system. The framework is open-source, cross-platform and available online. Code examples and performance measurements are presented to show the simplicity and efficiency of FAST. The results are compared to the insight toolkit (ITK) and the visualization toolkit (VTK) and show that the presented framework is faster with up to 20 times speedup on several common medical imaging algorithms. FAST enables efficient medical image computing and visualization on heterogeneous systems. Code examples and performance evaluations have demonstrated that the toolkit is both easy to use and performs better than existing frameworks, such as ITK and VTK.
Instruction-level performance modeling and characterization of multimedia applications
DOE Office of Scientific and Technical Information (OSTI.GOV)
Luo, Y.; Cameron, K.W.
1999-06-01
One of the challenges for characterizing and modeling realistic multimedia applications is the lack of access to source codes. On-chip performance counters effectively resolve this problem by monitoring run-time behaviors at the instruction-level. This paper presents a novel technique of characterizing and modeling workloads at the instruction level for realistic multimedia applications using hardware performance counters. A variety of instruction counts are collected from some multimedia applications, such as RealPlayer, GSM Vocoder, MPEG encoder/decoder, and speech synthesizer. These instruction counts can be used to form a set of abstract characteristic parameters directly related to a processor`s architectural features. Based onmore » microprocessor architectural constraints and these calculated abstract parameters, the architectural performance bottleneck for a specific application can be estimated. Meanwhile, the bottleneck estimation can provide suggestions about viable architectural/functional improvement for certain workloads. The biggest advantage of this new characterization technique is a better understanding of processor utilization efficiency and architectural bottleneck for each application. This technique also provides predictive insight of future architectural enhancements and their affect on current codes. In this paper the authors also attempt to model architectural effect on processor utilization without memory influence. They derive formulas for calculating CPI{sub 0}, CPI without memory effect, and they quantify utilization of architectural parameters. These equations are architecturally diagnostic and predictive in nature. Results provide promise in code characterization, and empirical/analytical modeling.« less
Load Balancing Unstructured Adaptive Grids for CFD Problems
NASA Technical Reports Server (NTRS)
Biswas, Rupak; Oliker, Leonid
1996-01-01
Mesh adaption is a powerful tool for efficient unstructured-grid computations but causes load imbalance among processors on a parallel machine. A dynamic load balancing method is presented that balances the workload across all processors with a global view. After each parallel tetrahedral mesh adaption, the method first determines if the new mesh is sufficiently unbalanced to warrant a repartitioning. If so, the adapted mesh is repartitioned, with new partitions assigned to processors so that the redistribution cost is minimized. The new partitions are accepted only if the remapping cost is compensated by the improved load balance. Results indicate that this strategy is effective for large-scale scientific computations on distributed-memory multiprocessors.
HEP - A semaphore-synchronized multiprocessor with central control. [Heterogeneous Element Processor
NASA Technical Reports Server (NTRS)
Gilliland, M. C.; Smith, B. J.; Calvert, W.
1976-01-01
The paper describes the design concept of the Heterogeneous Element Processor (HEP), a system tailored to the special needs of scientific simulation. In order to achieve high-speed computation required by simulation, HEP features a hierarchy of processes executing in parallel on a number of processors, with synchronization being largely accomplished by hardware. A full-empty-reserve scheme of synchronization is realized by zero-one-valued hardware semaphores. A typical system has, besides the control computer and the scheduler, an algebraic module, a memory module, a first-in first-out (FIFO) module, an integrator module, and an I/O module. The architecture of the scheduler and the algebraic module is examined in detail.
Systems and methods for reconfiguring input devices
NASA Technical Reports Server (NTRS)
Lancaster, Jeff (Inventor); De Mers, Robert E. (Inventor)
2012-01-01
A system includes an input device having first and second input members configured to be activated by a user. The input device is configured to generate activation signals associated with activation of the first and second input members, and each of the first and second input members are associated with an input function. A processor is coupled to the input device and configured to receive the activation signals. A memory coupled to the processor, and includes a reconfiguration module configured to store the input functions assigned to the first and second input members and, upon execution of the processor, to reconfigure the input functions assigned to the input members when the first input member is inoperable.
Ssip-a processor interconnection simulator
DOE Office of Scientific and Technical Information (OSTI.GOV)
Navaux, P.; Weber, R.; Prezzi, J.
1982-01-01
Recent growing interest in multiple processor architectures has given rise to the study of procesor-memory interconnections for the determination of better architectures. This paper concerns the development of the SSIP-sistema simulador de interconexao de processadores (processor interconnection simulating system) which allows the evaluation of different interconnection structures comparing its performance in order to provide parameters which would help the designer to define an architcture. A wide spectrum of systems may be evaluated, and their behaviour observed due to the features incorporated into the simulator program. The system modelling and the simulator program implementation are described. Some results that can bemore » obtained are shown, along with the discussion of their usefulness. 12 references.« less
Parallel solution of high-order numerical schemes for solving incompressible flows
NASA Technical Reports Server (NTRS)
Milner, Edward J.; Lin, Avi; Liou, May-Fun; Blech, Richard A.
1993-01-01
A new parallel numerical scheme for solving incompressible steady-state flows is presented. The algorithm uses a finite-difference approach to solving the Navier-Stokes equations. The algorithms are scalable and expandable. They may be used with only two processors or with as many processors as are available. The code is general and expandable. Any size grid may be used. Four processors of the NASA LeRC Hypercluster were used to solve for steady-state flow in a driven square cavity. The Hypercluster was configured in a distributed-memory, hypercube-like architecture. By using a 50-by-50 finite-difference solution grid, an efficiency of 74 percent (a speedup of 2.96) was obtained.
MILC Code Performance on High End CPU and GPU Supercomputer Clusters
NASA Astrophysics Data System (ADS)
DeTar, Carleton; Gottlieb, Steven; Li, Ruizi; Toussaint, Doug
2018-03-01
With recent developments in parallel supercomputing architecture, many core, multi-core, and GPU processors are now commonplace, resulting in more levels of parallelism, memory hierarchy, and programming complexity. It has been necessary to adapt the MILC code to these new processors starting with NVIDIA GPUs, and more recently, the Intel Xeon Phi processors. We report on our efforts to port and optimize our code for the Intel Knights Landing architecture. We consider performance of the MILC code with MPI and OpenMP, and optimizations with QOPQDP and QPhiX. For the latter approach, we concentrate on the staggered conjugate gradient and gauge force. We also consider performance on recent NVIDIA GPUs using the QUDA library.
Implementing Molecular Dynamics on Hybrid High Performance Computers - Three-Body Potentials
DOE Office of Scientific and Technical Information (OSTI.GOV)
Brown, W Michael; Yamada, Masako
The use of coprocessors or accelerators such as graphics processing units (GPUs) has become popular in scientific computing applications due to their low cost, impressive floating-point capabilities, high memory bandwidth, and low electrical power re- quirements. Hybrid high-performance computers, defined as machines with nodes containing more than one type of floating-point processor (e.g. CPU and GPU), are now becoming more prevalent due to these advantages. Although there has been extensive research into methods to efficiently use accelerators to improve the performance of molecular dynamics (MD) employing pairwise potential energy models, little is reported in the literature for models that includemore » many-body effects. 3-body terms are required for many popular potentials such as MEAM, Tersoff, REBO, AIREBO, Stillinger-Weber, Bond-Order Potentials, and others. Because the per-atom simulation times are much higher for models incorporating 3-body terms, there is a clear need for efficient algo- rithms usable on hybrid high performance computers. Here, we report a shared-memory force-decomposition for 3-body potentials that avoids memory conflicts to allow for a deterministic code with substantial performance improvements on hybrid machines. We describe modifications necessary for use in distributed memory MD codes and show results for the simulation of water with Stillinger-Weber on the hybrid Titan supercomputer. We compare performance of the 3-body model to the SPC/E water model when using accelerators. Finally, we demonstrate that our approach can attain a speedup of 5.1 with acceleration on Titan for production simulations to study water droplet freezing on a surface.« less
Portable parallel stochastic optimization for the design of aeropropulsion components
NASA Technical Reports Server (NTRS)
Sues, Robert H.; Rhodes, G. S.
1994-01-01
This report presents the results of Phase 1 research to develop a methodology for performing large-scale Multi-disciplinary Stochastic Optimization (MSO) for the design of aerospace systems ranging from aeropropulsion components to complete aircraft configurations. The current research recognizes that such design optimization problems are computationally expensive, and require the use of either massively parallel or multiple-processor computers. The methodology also recognizes that many operational and performance parameters are uncertain, and that uncertainty must be considered explicitly to achieve optimum performance and cost. The objective of this Phase 1 research was to initialize the development of an MSO methodology that is portable to a wide variety of hardware platforms, while achieving efficient, large-scale parallelism when multiple processors are available. The first effort in the project was a literature review of available computer hardware, as well as review of portable, parallel programming environments. The first effort was to implement the MSO methodology for a problem using the portable parallel programming language, Parallel Virtual Machine (PVM). The third and final effort was to demonstrate the example on a variety of computers, including a distributed-memory multiprocessor, a distributed-memory network of workstations, and a single-processor workstation. Results indicate the MSO methodology can be well-applied towards large-scale aerospace design problems. Nearly perfect linear speedup was demonstrated for computation of optimization sensitivity coefficients on both a 128-node distributed-memory multiprocessor (the Intel iPSC/860) and a network of workstations (speedups of almost 19 times achieved for 20 workstations). Very high parallel efficiencies (75 percent for 31 processors and 60 percent for 50 processors) were also achieved for computation of aerodynamic influence coefficients on the Intel. Finally, the multi-level parallelization strategy that will be needed for large-scale MSO problems was demonstrated to be highly efficient. The same parallel code instructions were used on both platforms, demonstrating portability. There are many applications for which MSO can be applied, including NASA's High-Speed-Civil Transport, and advanced propulsion systems. The use of MSO will reduce design and development time and testing costs dramatically.
Thread mapping using system-level model for shared memory multicores
NASA Astrophysics Data System (ADS)
Mitra, Reshmi
Exploring thread-to-core mapping options for a parallel application on a multicore architecture is computationally very expensive. For the same algorithm, the mapping strategy (MS) with the best response time may change with data size and thread counts. The primary challenge is to design a fast, accurate and automatic framework for exploring these MSs for large data-intensive applications. This is to ensure that the users can explore the design space within reasonable machine hours, without thorough understanding on how the code interacts with the platform. Response time is related to the cycles per instructions retired (CPI), taking into account both active and sleep states of the pipeline. This work establishes a hybrid approach, based on Markov Chain Model (MCM) and Model Tree (MT) for system-level steady state CPI prediction. It is designed for shared memory multicore processors with coarse-grained multithreading. The thread status is represented by the MCM states. The program characteristics are modeled as the transition probabilities, representing the system moving between active and suspended thread states. The MT model extrapolates these probabilities for the actual application size (AS) from the smaller AS performance. This aspect of the framework, along with, the use of mathematical expressions for the actual AS performance information, results in a tremendous reduction in the CPI prediction time. The framework is validated using an electromagnetics application. The average performance prediction error for steady state CPI results with 12 different MSs is less than 1%. The total run time of model is of the order of minutes, whereas the actual application execution time is in terms of days.
Microlaser-based compact optical neuro-processors (Invited Paper)
NASA Astrophysics Data System (ADS)
Paek, Eung Gi; Chan, Winston K.; Zah, Chung-En; Cheung, Kwok-wai; Curtis, L.; Chang-Hasnain, Constance J.
1992-10-01
This paper reviews the recent progress in the development of holographic neural networks using surface-emitting laser diode arrays (SELDAs). Since the previous work on ultrafast holographic memory readout system and a robust incoherent correlator, progress has been made in several areas: the use of an array of monolithic `neurons' to reconstruct holographic memories; two-dimensional (2-D) wavelength-division multiplexing (WDM) for image transmission through a single-mode fiber; and finally, an associative memory using time- division multiplexing (TDM). Experimental demonstrations on these are presented.
7 CFR 1435.315 - Adjustments to proportionate shares.
Code of Federal Regulations, 2010 CFR
2010-01-01
... CORPORATION, DEPARTMENT OF AGRICULTURE LOANS, PURCHASES, AND OTHER OPERATIONS SUGAR PROGRAM Flexible Marketing Allotments For Sugar § 1435.315 Adjustments to proportionate shares. Whenever CCC determines that, because of... sufficient to enable state processors to produce sufficient sugar to meet the State's cane sugar allotment...
7 CFR 1435.315 - Adjustments to proportionate shares.
Code of Federal Regulations, 2011 CFR
2011-01-01
... CORPORATION, DEPARTMENT OF AGRICULTURE LOANS, PURCHASES, AND OTHER OPERATIONS SUGAR PROGRAM Flexible Marketing Allotments For Sugar § 1435.315 Adjustments to proportionate shares. Whenever CCC determines that, because of... sufficient to enable state processors to produce sufficient sugar to meet the State's cane sugar allotment...
7 CFR 1435.315 - Adjustments to proportionate shares.
Code of Federal Regulations, 2014 CFR
2014-01-01
... CORPORATION, DEPARTMENT OF AGRICULTURE LOANS, PURCHASES, AND OTHER OPERATIONS SUGAR PROGRAM Flexible Marketing Allotments For Sugar § 1435.315 Adjustments to proportionate shares. Whenever CCC determines that, because of... sufficient to enable state processors to produce sufficient sugar to meet the State's cane sugar allotment...
7 CFR 1435.315 - Adjustments to proportionate shares.
Code of Federal Regulations, 2012 CFR
2012-01-01
... CORPORATION, DEPARTMENT OF AGRICULTURE LOANS, PURCHASES, AND OTHER OPERATIONS SUGAR PROGRAM Flexible Marketing Allotments For Sugar § 1435.315 Adjustments to proportionate shares. Whenever CCC determines that, because of... sufficient to enable state processors to produce sufficient sugar to meet the State's cane sugar allotment...
7 CFR 1435.315 - Adjustments to proportionate shares.
Code of Federal Regulations, 2013 CFR
2013-01-01
... CORPORATION, DEPARTMENT OF AGRICULTURE LOANS, PURCHASES, AND OTHER OPERATIONS SUGAR PROGRAM Flexible Marketing Allotments For Sugar § 1435.315 Adjustments to proportionate shares. Whenever CCC determines that, because of... sufficient to enable state processors to produce sufficient sugar to meet the State's cane sugar allotment...
Processing-in-Memory Enabled Graphics Processors for 3D Rendering
DOE Office of Scientific and Technical Information (OSTI.GOV)
Xie, Chenhao; Song, Shuaiwen; Wang, Jing
2017-02-06
The performance of 3D rendering of Graphics Processing Unit that convents 3D vector stream into 2D frame with 3D image effects significantly impact users’ gaming experience on modern computer systems. Due to the high texture throughput in 3D rendering, main memory bandwidth becomes a critical obstacle for improving the overall rendering performance. 3D stacked memory systems such as Hybrid Memory Cube (HMC) provide opportunities to significantly overcome the memory wall by directly connecting logic controllers to DRAM dies. Based on the observation that texel fetches significantly impact off-chip memory traffic, we propose two architectural designs to enable Processing-In-Memory based GPUmore » for efficient 3D rendering.« less
Implementation and Performance Analysis of Parallel Assignment Algorithms on a Hypercube Computer.
1987-12-01
coupled pro- cessors because of the degree of interaction between processors imposed by the global memory [HwB84]. Another sub-class of MIMD... interaction between the individual processors [MuA87]. Many of the commercial MIMD computers available today are loosely coupled [HwB84]. 2.1.3 The Hypercube...Alpha-beta is a method usually employed in the solution of two-person zero-sum games like chess and checkers [Qui87]. The ha sic approach of the alpha
NASA Technical Reports Server (NTRS)
Swift, Gary M.; Allen, Gregory S.; Farmanesh, Farhad; George, Jeffrey; Petrick, David J.; Chayab, Fayez
2006-01-01
Shown in this presentation are recent results for the upset susceptibility of the various types of memory elements in the embedded PowerPC405 in the Xilinx V2P40 FPGA. For critical flight designs where configuration upsets are mitigated effectively through appropriate design triplication and configuration scrubbing, these upsets of processor elements can dominate the system error rate. Data from irradiations with both protons and heavy ions are given and compared using available models.
Global Load Balancing with Parallel Mesh Adaption on Distributed-Memory Systems
NASA Technical Reports Server (NTRS)
Biswas, Rupak; Oliker, Leonid; Sohn, Andrew
1996-01-01
Dynamic mesh adaption on unstructured grids is a powerful tool for efficiently computing unsteady problems to resolve solution features of interest. Unfortunately, this causes load imbalance among processors on a parallel machine. This paper describes the parallel implementation of a tetrahedral mesh adaption scheme and a new global load balancing method. A heuristic remapping algorithm is presented that assigns partitions to processors such that the redistribution cost is minimized. Results indicate that the parallel performance of the mesh adaption code depends on the nature of the adaption region and show a 35.5X speedup on 64 processors of an SP2 when 35% of the mesh is randomly adapted. For large-scale scientific computations, our load balancing strategy gives almost a sixfold reduction in solver execution times over non-balanced loads. Furthermore, our heuristic remapper yields processor assignments that are less than 3% off the optimal solutions but requires only 1% of the computational time.
A high-speed digital signal processor for atmospheric radar, part 7.3A
NASA Technical Reports Server (NTRS)
Brosnahan, J. W.; Woodard, D. M.
1984-01-01
The Model SP-320 device is a monolithic realization of a complex general purpose signal processor, incorporating such features as a 32-bit ALU, a 16-bit x 16-bit combinatorial multiplier, and a 16-bit barrel shifter. The SP-320 is designed to operate as a slave processor to a host general purpose computer in applications such as coherent integration of a radar return signal in multiple ranges, or dedicated FFT processing. Presently available is an I/O module conforming to the Intel Multichannel interface standard; other I/O modules will be designed to meet specific user requirements. The main processor board includes input and output FIFO (First In First Out) memories, both with depths of 4096 W, to permit asynchronous operation between the source of data and the host computer. This design permits burst data rates in excess of 5 MW/s.
Discrete sensitivity derivatives of the Navier-Stokes equations with a parallel Krylov solver
NASA Technical Reports Server (NTRS)
Ajmani, Kumud; Taylor, Arthur C., III
1994-01-01
This paper solves an 'incremental' form of the sensitivity equations derived by differentiating the discretized thin-layer Navier Stokes equations with respect to certain design variables of interest. The equations are solved with a parallel, preconditioned Generalized Minimal RESidual (GMRES) solver on a distributed-memory architecture. The 'serial' sensitivity analysis code is parallelized by using the Single Program Multiple Data (SPMD) programming model, domain decomposition techniques, and message-passing tools. Sensitivity derivatives are computed for low and high Reynolds number flows over a NACA 1406 airfoil on a 32-processor Intel Hypercube, and found to be identical to those computed on a single-processor Cray Y-MP. It is estimated that the parallel sensitivity analysis code has to be run on 40-50 processors of the Intel Hypercube in order to match the single-processor processing time of a Cray Y-MP.
Center for Technology for Advanced Scientific Componet Software (TASCS)
DOE Office of Scientific and Technical Information (OSTI.GOV)
Govindaraju, Madhusudhan
Advanced Scientific Computing Research Computer Science FY 2010Report Center for Technology for Advanced Scientific Component Software: Distributed CCA State University of New York, Binghamton, NY, 13902 Summary The overall objective of Binghamton's involvement is to work on enhancements of the CCA environment, motivated by the applications and research initiatives discussed in the proposal. This year we are working on re-focusing our design and development efforts to develop proof-of-concept implementations that have the potential to significantly impact scientific components. We worked on developing parallel implementations for non-hydrostatic code and worked on a model coupling interface for biogeochemical computations coded in MATLAB.more » We also worked on the design and implementation modules that will be required for the emerging MapReduce model to be effective for scientific applications. Finally, we focused on optimizing the processing of scientific datasets on multi-core processors. Research Details We worked on the following research projects that we are working on applying to CCA-based scientific applications. 1. Non-Hydrostatic Hydrodynamics: Non-static hydrodynamics are significantly more accurate at modeling internal waves that may be important in lake ecosystems. Non-hydrostatic codes, however, are significantly more computationally expensive, often prohibitively so. We have worked with Chin Wu at the University of Wisconsin to parallelize non-hydrostatic code. We have obtained a speed up of about 26 times maximum. Although this is significant progress, we hope to improve the performance further, such that it becomes a practical alternative to hydrostatic codes. 2. Model-coupling for water-based ecosystems: To answer pressing questions about water resources requires that physical models (hydrodynamics) be coupled with biological and chemical models. Most hydrodynamics codes are written in Fortran, however, while most ecologists work in MATLAB. This disconnect creates a great barrier. To address this, we are working on a model coupling interface that will allow biogeochemical computations written in MATLAB to couple with Fortran codes. This will greatly improve the productivity of ecosystem scientists. 2. Low overhead and Elastic MapReduce Implementation Optimized for Memory and CPU-Intensive Applications: Since its inception, MapReduce has frequently been associated with Hadoop and large-scale datasets. Its deployment at Amazon in the cloud, and its applications at Yahoo! for large-scale distributed document indexing and database building, among other tasks, have thrust MapReduce to the forefront of the data processing application domain. The applicability of the paradigm however extends far beyond its use with data intensive applications and diskbased systems, and can also be brought to bear in processing small but CPU intensive distributed applications. MapReduce however carries its own burdens. Through experiments using Hadoop in the context of diverse applications, we uncovered latencies and delay conditions potentially inhibiting the expected performance of a parallel execution in CPU-intensive applications. Furthermore, as it currently stands, MapReduce is favored for data-centric applications, and as such tends to be solely applied to disk-based applications. The paradigm, falls short in bringing its novelty to diskless systems dedicated to in-memory applications, and compute intensive programs processing much smaller data, but requiring intensive computations. In this project, we focused both on the performance of processing large-scale hierarchical data in distributed scientific applications, as well as the processing of smaller but demanding input sizes primarily used in diskless, and memory resident I/O systems. We designed LEMO-MR [1], a Low overhead, elastic, configurable for in- memory applications, and on-demand fault tolerance, an optimized implementation of MapReduce, for both on disk and in memory applications. We conducted experiments to identify not only the necessary components of this model, but also trade offs and factors to be considered. We have initial results to show the efficacy of our implementation in terms of potential speedup that can be achieved for representative data sets used by cloud applications. We have quantified the performance gains exhibited by our MapReduce implementation over Apache Hadoop in a compute intensive environment. 3. Cache Performance Optimization for Processing XML and HDF-based Application Data on Multi-core Processors: It is important to design and develop scientific middleware libraries to harness the opportunities presented by emerging multi-core processors. Implementations of scientific middleware and applications that do not adapt to the programming paradigm when executing on emerging processors can severely impact the overall performance. In this project, we focused on the utilization of the L2 cache, which is a critical shared resource on chip multiprocessors (CMP). The access pattern of the shared L2 cache, which is dependent on how the application schedules and assigns processing work to each thread, can either enhance or hurt the ability to hide memory latency on a multi-core processor. Therefore, while processing scientific datasets such as HDF5, it is essential to conduct fine-grained analysis of cache utilization, to inform scheduling decisions in multi-threaded programming. In this project, using the TAU toolkit for performance feedback from dual- and quad-core machines, we conducted performance analysis and recommendations on how processing threads can be scheduled on multi-core nodes to enhance the performance of a class of scientific applications that requires processing of HDF5 data. In particular, we quantified the gains associated with the use of the adaptations we have made to the Cache-Affinity and Balanced-Set scheduling algorithms to improve L2 cache performance, and hence the overall application execution time [2]. References: 1. Zacharia Fadika, Madhusudhan Govindaraju, ``MapReduce Implementation for Memory-Based and Processing Intensive Applications'', accepted in 2nd IEEE International Conference on Cloud Computing Technology and Science, Indianapolis, USA, Nov 30 - Dec 3, 2010. 2. Rajdeep Bhowmik, Madhusudhan Govindaraju, ``Cache Performance Optimization for Processing XML-based Application Data on Multi-core Processors'', in proceedings of The 10th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, May 17-20, 2010, Melbourne, Victoria, Australia. Contact Information: Madhusudhan Govindaraju Binghamton University State University of New York (SUNY) mgovinda@cs.binghamton.edu Phone: 607-777-4904« less
MODA A Framework for Memory Centric Performance Characterization
DOE Office of Scientific and Technical Information (OSTI.GOV)
Shrestha, Sunil; Su, Chun-Yi; White, Amanda M.
2012-06-29
In the age of massive parallelism, the focus of performance analysis has switched from the processor and related structures to the memory and I/O resources. Adapting to this new reality, a performance analysis tool has to provide a way to analyze resource usage to pinpoint existing and potential problems in a given application. This paper provides an overview of the Memory Observant Data Analysis (MODA) tool, a memory-centric tool first implemented on the Cray XMT supercomputer. Throughout the paper, MODA's capabilities have been showcased with experiments done on matrix multiply and Graph-500 application codes.
On the Floating Point Performance of the i860 Microprocessor
NASA Technical Reports Server (NTRS)
Lee, King; Kutler, Paul (Technical Monitor)
1997-01-01
The i860 microprocessor is a pipelined processor that can deliver two double precision floating point results every clock. It is being used in the Touchstone project to develop a teraflop computer by the year 2000. With such high computational capabilities it was expected that memory bandwidth would limit performance on many kernels. Measured performance of three kernels showed performance is less than what memory bandwidth limitations would predict. This paper develops a model that explains the discrepancy in terms of memory latencies and points to some problems involved in moving data from memory to the arithmetic pipelines.
Shared Semantics and the Use of Organizational Memories for E-Mail Communications.
ERIC Educational Resources Information Center
Schwartz, David G.
1998-01-01
Examines the use of shared semantics information to link concepts in an organizational memory to e-mail communications. Presents a framework for determining shared semantics based on organizational and personal user profiles. Illustrates how shared semantics are used by the HyperMail system to help link organizational memories (OM) content to…
Development of the SEASIS instrument for SEDSAT
NASA Technical Reports Server (NTRS)
Maier, Mark W.
1996-01-01
Two SEASIS experiment objectives are key: take images that allow three axis attitude determination and take multi-spectral images of the earth. During the tether mission it is also desirable to capture images for the recoiling tether from the endmass perspective (which has never been observed). SEASIS must store all its imagery taken during the tether mission until the earth downlink can be established. SEASIS determines attitude with a panoramic camera and performs earth observation with a telephoto lens camera. Camera video is digitized, compressed, and stored in solid state memory. These objectives are addressed through the following architectural choices: (1) A camera system using a Panoramic Annular Lens (PAL). This lens has a 360 deg. azimuthal field of view by a +45 degree vertical field measured from a plan normal to the lens boresight axis. It has been shown in Mr. Mark Steadham's UAH M.S. thesis that his camera can determine three axis attitude anytime the earth and one other recognizable celestial object (for example, the sun) is in the field of view. This will be essentially all the time during tether deployment. (2) A second camera system using telephoto lens and filter wheel. The camera is a black and white standard video camera. The filters are chosen to cover the visible spectral bands of remote sensing interest. (3) A processor and mass memory arrangement linked to the cameras. Video signals from the cameras are digitized, compressed in the processor, and stored in a large static RAM bank. The processor is a multi-chip module consisting of a T800 Transputer and three Zoran floating point Digital Signal Processors. This processor module was supplied under ARPA contract by the Space Computer Corporation to demonstrate its use in space.
Fast Fourier Transform Co-Processor (FFTC)- Towards Embedded GFLOPs
NASA Astrophysics Data System (ADS)
Kuehl, Christopher; Liebstueckel, Uwe; Tejerina, Isaac; Uemminghaus, Michael; Wite, Felix; Kolb, Michael; Suess, Martin; Weigand, Roland
2012-08-01
Many signal processing applications and algorithms perform their operations on the data in the transform domain to gain efficiency. The Fourier Transform Co- Processor has been developed with the aim to offload General Purpose Processors from performing these transformations and therefore to boast the overall performance of a processing module. The IP of the commercial PowerFFT processor has been selected and adapted to meet the constraints of the space environment.In frame of the ESA activity “Fast Fourier Transform DSP Co-processor (FFTC)” (ESTEC/Contract No. 15314/07/NL/LvH/ma) the objectives were the following:Production of prototypes of a space qualified version of the commercial PowerFFT chip called FFTC based on the PowerFFT IP.The development of a stand-alone FFTC Accelerator Board (FTAB) based on the FFTC including the Controller FPGA and SpaceWire Interfaces to verify the FFTC function and performance.The FFTC chip performs its calculations with floating point precision. Stand alone it is capable computing FFTs of up to 1K complex samples in length in only 10μsec. This corresponds to an equivalent processing performance of 4.7 GFlops. In this mode the maximum sustained data throughput reaches 6.4Gbit/s. When connected to up to 4 EDAC protected SDRAM memory banks the FFTC can perform long FFTs with up to 1M complex samples in length or multidimensional FFT- based processing tasks.A Controller FPGA on the FTAB takes care of the SDRAM addressing. The instructions commanded via the Controller FPGA are used to set up the data flow and generate the memory addresses.The presentation will give and overview on the project, including the results of the validation of the FFTC ASIC prototypes.
Fast Fourier Transform Co-processor (FFTC), towards embedded GFLOPs
NASA Astrophysics Data System (ADS)
Kuehl, Christopher; Liebstueckel, Uwe; Tejerina, Isaac; Uemminghaus, Michael; Witte, Felix; Kolb, Michael; Suess, Martin; Weigand, Roland; Kopp, Nicholas
2012-10-01
Many signal processing applications and algorithms perform their operations on the data in the transform domain to gain efficiency. The Fourier Transform Co-Processor has been developed with the aim to offload General Purpose Processors from performing these transformations and therefore to boast the overall performance of a processing module. The IP of the commercial PowerFFT processor has been selected and adapted to meet the constraints of the space environment. In frame of the ESA activity "Fast Fourier Transform DSP Co-processor (FFTC)" (ESTEC/Contract No. 15314/07/NL/LvH/ma) the objectives were the following: • Production of prototypes of a space qualified version of the commercial PowerFFT chip called FFTC based on the PowerFFT IP. • The development of a stand-alone FFTC Accelerator Board (FTAB) based on the FFTC including the Controller FPGA and SpaceWire Interfaces to verify the FFTC function and performance. The FFTC chip performs its calculations with floating point precision. Stand alone it is capable computing FFTs of up to 1K complex samples in length in only 10μsec. This corresponds to an equivalent processing performance of 4.7 GFlops. In this mode the maximum sustained data throughput reaches 6.4Gbit/s. When connected to up to 4 EDAC protected SDRAM memory banks the FFTC can perform long FFTs with up to 1M complex samples in length or multidimensional FFT-based processing tasks. A Controller FPGA on the FTAB takes care of the SDRAM addressing. The instructions commanded via the Controller FPGA are used to set up the data flow and generate the memory addresses. The paper will give an overview on the project, including the results of the validation of the FFTC ASIC prototypes.
Towards the formal specification of the requirements and design of a processor interface unit
NASA Technical Reports Server (NTRS)
Fura, David A.; Windley, Phillip J.; Cohen, Gerald C.
1993-01-01
Work to formally specify the requirements and design of a Processor Interface Unit (PIU), a single-chip subsystem providing memory interface, bus interface, and additional support services for a commercial microprocessor within a fault-tolerant computer system, is described. This system, the Fault-Tolerant Embedded Processor (FTEP), is targeted towards applications in avionics and space requiring extremely high levels of mission reliability, extended maintenance free operation, or both. The approaches that were developed for modeling the PIU requirements and for composition of the PIU subcomponents at high levels of abstraction are described. These approaches were used to specify and verify a nontrivial subset of the PIU behavior. The PIU specification in Higher Order Logic (HOL) is documented in a companion NASA contractor report entitled 'Towards the Formal Specification of the Requirements and Design of a Processor Interfacs Unit - HOL Listings.' The subsequent verification approach and HOL listings are documented in NASA contractor report entitled 'Towards the Formal Verification of the Requirements and Design of a Processor Interface Unit' and NASA contractor report entitled 'Towards the Formal Verification of the Requirements and Design of a Processor Interface Unit - HOL Listings.'
Vertical Launch System Loadout Planner
2015-03-01
United States Navy USS United States’ Ship VBA Visual Basic for Applications VLP VLS Loadout Planner VLS Vertical Launch System...with 32 gigabytes of random access memory and eight processors, General Algebraic Modeling System (GAMS) CPLEX version 24 (GAMS, 2015) solves this...problem in ten minutes to an integer tolerance of 10%. The GAMS interpreter and CPLEX solver require 75 Megabytes of random access memory for this
Fast Pixel Buffer For Processing With Lookup Tables
NASA Technical Reports Server (NTRS)
Fisher, Timothy E.
1992-01-01
Proposed scheme for buffering data on intensities of picture elements (pixels) of image increases rate or processing beyond that attainable when data read, one pixel at time, from main image memory. Scheme applied in design of specialized image-processing circuitry. Intended to optimize performance of processor in which electronic equivalent of address-lookup table used to address those pixels in main image memory required for processing.
Early MIMD experience on the CRAY X-MP
NASA Astrophysics Data System (ADS)
Rhoades, Clifford E.; Stevens, K. G.
1985-07-01
This paper describes some early experience with converting four physics simulation programs to the CRAY X-MP, a current Multiple Instruction, Multiple Data (MIMD) computer consisting of two processors each with an architecture similar to that of the CRAY-1. As a multi-processor, the CRAY X-MP together with the high speed Solid-state Storage Device (SSD) in an ideal machine upon which to study MIMD algorithms for solving the equations of mathematical physics because it is fast enough to run real problems. The computer programs used in this study are all FORTRAN versions of original production codes. They range in sophistication from a one-dimensional numerical simulation of collisionless plasma to a two-dimensional hydrodynamics code with heat flow to a couple of three-dimensional fluid dynamics codes with varying degrees of viscous modeling. Early research with a dual processor configuration has shown speed-ups ranging from 1.55 to 1.98. It has been observed that a few simple extensions to FORTRAN allow a typical programmer to achieve a remarkable level of efficiency. These extensions involve the concept of memory local to a concurrent subprogram and memory common to all concurrent subprograms.
A parallel implementation of an off-lattice individual-based model of multicellular populations
NASA Astrophysics Data System (ADS)
Harvey, Daniel G.; Fletcher, Alexander G.; Osborne, James M.; Pitt-Francis, Joe
2015-07-01
As computational models of multicellular populations include ever more detailed descriptions of biophysical and biochemical processes, the computational cost of simulating such models limits their ability to generate novel scientific hypotheses and testable predictions. While developments in microchip technology continue to increase the power of individual processors, parallel computing offers an immediate increase in available processing power. To make full use of parallel computing technology, it is necessary to develop specialised algorithms. To this end, we present a parallel algorithm for a class of off-lattice individual-based models of multicellular populations. The algorithm divides the spatial domain between computing processes and comprises communication routines that ensure the model is correctly simulated on multiple processors. The parallel algorithm is shown to accurately reproduce the results of a deterministic simulation performed using a pre-existing serial implementation. We test the scaling of computation time, memory use and load balancing as more processes are used to simulate a cell population of fixed size. We find approximate linear scaling of both speed-up and memory consumption on up to 32 processor cores. Dynamic load balancing is shown to provide speed-up for non-regular spatial distributions of cells in the case of a growing population.
On the relationship between parallel computation and graph embedding
DOE Office of Scientific and Technical Information (OSTI.GOV)
Gupta, A.K.
1989-01-01
The problem of efficiently simulating an algorithm designed for an n-processor parallel machine G on an m-processor parallel machine H with n > m arises when parallel algorithms designed for an ideal size machine are simulated on existing machines which are of a fixed size. The author studies this problem when every processor of H takes over the function of a number of processors in G, and he phrases the simulation problem as a graph embedding problem. New embeddings presented address relevant issues arising from the parallel computation environment. The main focus centers around embedding complete binary trees into smaller-sizedmore » binary trees, butterflies, and hypercubes. He also considers simultaneous embeddings of r source machines into a single hypercube. Constant factors play a crucial role in his embeddings since they are not only important in practice but also lead to interesting theoretical problems. All of his embeddings minimize dilation and load, which are the conventional cost measures in graph embeddings and determine the maximum amount of time required to simulate one step of G on H. His embeddings also optimize a new cost measure called ({alpha},{beta})-utilization which characterizes how evenly the processors of H are used by the processors of G. Ideally, the utilization should be balanced (i.e., every processor of H simulates at most (n/m) processors of G) and the ({alpha},{beta})-utilization measures how far off from a balanced utilization the embedding is. He presents embeddings for the situation when some processors of G have different capabilities (e.g. memory or I/O) than others and the processors with different capabilities are to be distributed uniformly among the processors of H. Placing such conditions on an embedding results in an increase in some of the cost measures.« less
HTMT-class Latency Tolerant Parallel Architecture for Petaflops Scale Computation
NASA Technical Reports Server (NTRS)
Sterling, Thomas; Bergman, Larry
2000-01-01
Computational Aero Sciences and other numeric intensive computation disciplines demand computing throughputs substantially greater than the Teraflops scale systems only now becoming available. The related fields of fluids, structures, thermal, combustion, and dynamic controls are among the interdisciplinary areas that in combination with sufficient resolution and advanced adaptive techniques may force performance requirements towards Petaflops. This will be especially true for compute intensive models such as Navier-Stokes are or when such system models are only part of a larger design optimization computation involving many design points. Yet recent experience with conventional MPP configurations comprising commodity processing and memory components has shown that larger scale frequently results in higher programming difficulty and lower system efficiency. While important advances in system software and algorithms techniques have had some impact on efficiency and programmability for certain classes of problems, in general it is unlikely that software alone will resolve the challenges to higher scalability. As in the past, future generations of high-end computers may require a combination of hardware architecture and system software advances to enable efficient operation at a Petaflops level. The NASA led HTMT project has engaged the talents of a broad interdisciplinary team to develop a new strategy in high-end system architecture to deliver petaflops scale computing in the 2004/5 timeframe. The Hybrid-Technology, MultiThreaded parallel computer architecture incorporates several advanced technologies in combination with an innovative dynamic adaptive scheduling mechanism to provide unprecedented performance and efficiency within practical constraints of cost, complexity, and power consumption. The emerging superconductor Rapid Single Flux Quantum electronics can operate at 100 GHz (the record is 770 GHz) and one percent of the power required by convention semiconductor logic. Wave Division Multiplexing optical communications can approach a peak per fiber bandwidth of 1 Tbps and the new Data Vortex network topology employing this technology can connect tens of thousands of ports providing a bi-section bandwidth on the order of a Petabyte per second with latencies well below 100 nanoseconds, even under heavy loads. Processor-in-Memory (PIM) technology combines logic and memory on the same chip exposing the internal bandwidth of the memory row buffers at low latency. And holographic storage photorefractive storage technologies provide high-density memory with access a thousand times faster than conventional disk technologies. Together these technologies enable a new class of shared memory system architecture with a peak performance in the range of a Petaflops but size and power requirements comparable to today's largest Teraflops scale systems. To achieve high-sustained performance, HTMT combines an advanced multithreading processor architecture with a memory-driven coarse-grained latency management strategy called "percolation", yielding high efficiency while reducing the much of the parallel programming burden. This paper will present the basic system architecture characteristics made possible through this series of advanced technologies and then give a detailed description of the new percolation approach to runtime latency management.
Federal Register 2010, 2011, 2012, 2013, 2014
2011-01-19
... submitted on or before March 21, 2011. ADDRESSES: Direct all written comments to Diana Hynek, Departmental... fisheries. Program components include quota share allocation, processor quota share allocation, individual... Binding Arbitration process, and fee collection. II. Method of Collection Responses are mailed, except the...
Formal design specification of a Processor Interface Unit
NASA Technical Reports Server (NTRS)
Fura, David A.; Windley, Phillip J.; Cohen, Gerald C.
1992-01-01
This report describes work to formally specify the requirements and design of a processor interface unit (PIU), a single-chip subsystem providing memory-interface bus-interface, and additional support services for a commercial microprocessor within a fault-tolerant computer system. This system, the Fault-Tolerant Embedded Processor (FTEP), is targeted towards applications in avionics and space requiring extremely high levels of mission reliability, extended maintenance-free operation, or both. The need for high-quality design assurance in such applications is an undisputed fact, given the disastrous consequences that even a single design flaw can produce. Thus, the further development and application of formal methods to fault-tolerant systems is of critical importance as these systems see increasing use in modern society.
Boyle, Peter A.; Christ, Norman H.; Gara, Alan; Mawhinney, Robert D.; Ohmacht, Martin; Sugavanam, Krishnan
2012-12-11
A prefetch system improves a performance of a parallel computing system. The parallel computing system includes a plurality of computing nodes. A computing node includes at least one processor and at least one memory device. The prefetch system includes at least one stream prefetch engine and at least one list prefetch engine. The prefetch system operates those engines simultaneously. After the at least one processor issues a command, the prefetch system passes the command to a stream prefetch engine and a list prefetch engine. The prefetch system operates the stream prefetch engine and the list prefetch engine to prefetch data to be needed in subsequent clock cycles in the processor in response to the passed command.
The role of neuroimaging in the discovery of processing stages. A review.
Mulder, G; Wijers, A A; Lange, J J; Buijink, B M; Mulder, L J; Willemsen, A T; Paans, A M
1995-11-01
In this contribution we show how neuroimaging methods can augment behavioural methods to discover processing stages. Event Related Brain Potentials (ERPs), Brain Electrical Source Analysis (BESA) and regional changes in cerebral blood flow (rCBF) do not necessarily require behavioural responses. With the aid of rCBF we are able to discover several cortical and subcortical brain systems (processors) active in selective attention and memory search tasks. BESA describes cortical activity with high temporal resolution in terms of a limited number of neural generators within these brain systems. The combination of behavioural methods and neuroimaging provides a picture of the functional architecture of the brain. The review is organized around three processors: the Visual, Cognitive and Manual Motor Processors.
Autonomic Closure for Turbulent Flows Using Approximate Bayesian Computation
NASA Astrophysics Data System (ADS)
Doronina, Olga; Christopher, Jason; Hamlington, Peter; Dahm, Werner
2017-11-01
Autonomic closure is a new technique for achieving fully adaptive and physically accurate closure of coarse-grained turbulent flow governing equations, such as those solved in large eddy simulations (LES). Although autonomic closure has been shown in recent a priori tests to more accurately represent unclosed terms than do dynamic versions of traditional LES models, the computational cost of the approach makes it challenging to implement for simulations of practical turbulent flows at realistically high Reynolds numbers. The optimization step used in the approach introduces large matrices that must be inverted and is highly memory intensive. In order to reduce memory requirements, here we propose to use approximate Bayesian computation (ABC) in place of the optimization step, thereby yielding a computationally-efficient implementation of autonomic closure that trades memory-intensive for processor-intensive computations. The latter challenge can be overcome as co-processors such as general purpose graphical processing units become increasingly available on current generation petascale and exascale supercomputers. In this work, we outline the formulation of ABC-enabled autonomic closure and present initial results demonstrating the accuracy and computational cost of the approach.
[Development of a video image system for wireless capsule endoscopes based on DSP].
Yang, Li; Peng, Chenglin; Wu, Huafeng; Zhao, Dechun; Zhang, Jinhua
2008-02-01
A video image recorder to record video picture for wireless capsule endoscopes was designed. TMS320C6211 DSP of Texas Instruments Inc. is the core processor of this system. Images are periodically acquired from Composite Video Broadcast Signal (CVBS) source and scaled by video decoder (SAA7114H). Video data is transported from high speed buffer First-in First-out (FIFO) to Digital Signal Processor (DSP) under the control of Complex Programmable Logic Device (CPLD). This paper adopts JPEG algorithm for image coding, and the compressed data in DSP was stored to Compact Flash (CF) card. TMS320C6211 DSP is mainly used for image compression and data transporting. Fast Discrete Cosine Transform (DCT) algorithm and fast coefficient quantization algorithm are used to accelerate operation speed of DSP and decrease the executing code. At the same time, proper address is assigned for each memory, which has different speed;the memory structure is also optimized. In addition, this system uses plenty of Extended Direct Memory Access (EDMA) to transport and process image data, which results in stable and high performance.
CoNNeCT Baseband Processor Module Boot Code SoftWare (BCSW)
NASA Technical Reports Server (NTRS)
Yamamoto, Clifford K.; Orozco, David S.; Byrne, D. J.; Allen, Steven J.; Sahasrabudhe, Adit; Lang, Minh
2012-01-01
This software provides essential startup and initialization routines for the CoNNeCT baseband processor module (BPM) hardware upon power-up. A command and data handling (C&DH) interface is provided via 1553 and diagnostic serial interfaces to invoke operational, reconfiguration, and test commands within the code. The BCSW has features unique to the hardware it is responsible for managing. In this case, the CoNNeCT BPM is configured with an updated CPU (Atmel AT697 SPARC processor) and a unique set of memory and I/O peripherals that require customized software to operate. These features include configuration of new AT697 registers, interfacing to a new HouseKeeper with a flash controller interface, a new dual Xilinx configuration/scrub interface, and an updated 1553 remote terminal (RT) core. The BCSW is intended to provide a "safe" mode for the BPM when initially powered on or when an unexpected trap occurs, causing the processor to reset. The BCSW allows the 1553 bus controller in the spacecraft or payload controller to operate the BPM over 1553 to upload code; upload Xilinx bit files; perform rudimentary tests; read, write, and copy the non-volatile flash memory; and configure the Xilinx interface. Commands also exist over 1553 to cause the CPU to jump or call a specified address to begin execution of user-supplied code. This may be in the form of a real-time operating system, test routine, or specific application code to run on the BPM.
Scalable Domain Decomposed Monte Carlo Particle Transport
NASA Astrophysics Data System (ADS)
O'Brien, Matthew Joseph
In this dissertation, we present the parallel algorithms necessary to run domain decomposed Monte Carlo particle transport on large numbers of processors (millions of processors). Previous algorithms were not scalable, and the parallel overhead became more computationally costly than the numerical simulation. The main algorithms we consider are: • Domain decomposition of constructive solid geometry: enables extremely large calculations in which the background geometry is too large to fit in the memory of a single computational node. • Load Balancing: keeps the workload per processor as even as possible so the calculation runs efficiently. • Global Particle Find: if particles are on the wrong processor, globally resolve their locations to the correct processor based on particle coordinate and background domain. • Visualizing constructive solid geometry, sourcing particles, deciding that particle streaming communication is completed and spatial redecomposition. These algorithms are some of the most important parallel algorithms required for domain decomposed Monte Carlo particle transport. We demonstrate that our previous algorithms were not scalable, prove that our new algorithms are scalable, and run some of the algorithms up to 2 million MPI processes on the Sequoia supercomputer.
Multi-mode sensor processing on a dynamically reconfigurable massively parallel processor array
NASA Astrophysics Data System (ADS)
Chen, Paul; Butts, Mike; Budlong, Brad; Wasson, Paul
2008-04-01
This paper introduces a novel computing architecture that can be reconfigured in real time to adapt on demand to multi-mode sensor platforms' dynamic computational and functional requirements. This 1 teraOPS reconfigurable Massively Parallel Processor Array (MPPA) has 336 32-bit processors. The programmable 32-bit communication fabric provides streamlined inter-processor connections with deterministically high performance. Software programmability, scalability, ease of use, and fast reconfiguration time (ranging from microseconds to milliseconds) are the most significant advantages over FPGAs and DSPs. This paper introduces the MPPA architecture, its programming model, and methods of reconfigurability. An MPPA platform for reconfigurable computing is based on a structural object programming model. Objects are software programs running concurrently on hundreds of 32-bit RISC processors and memories. They exchange data and control through a network of self-synchronizing channels. A common application design pattern on this platform, called a work farm, is a parallel set of worker objects, with one input and one output stream. Statically configured work farms with homogeneous and heterogeneous sets of workers have been used in video compression and decompression, network processing, and graphics applications.
Xyce parallel electronic simulator : users' guide.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Mei, Ting; Rankin, Eric Lamont; Thornquist, Heidi K.
2011-05-01
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: (1) Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). Note that this includes support for most popular parallel and serial computers; (2) Improved performance for all numerical kernels (e.g., time integrator, nonlinear and linear solvers) through state-of-the-artmore » algorithms and novel techniques. (3) Device models which are specifically tailored to meet Sandia's needs, including some radiation-aware devices (for Sandia users only); and (4) Object-oriented code design and implementation using modern coding practices that ensure that the Xyce Parallel Electronic Simulator will be maintainable and extensible far into the future. Xyce is a parallel code in the most general sense of the phrase - a message passing parallel implementation - which allows it to run efficiently on the widest possible number of computing platforms. These include serial, shared-memory and distributed-memory parallel as well as heterogeneous platforms. Careful attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows. The development of Xyce provides a platform for computational research and development aimed specifically at the needs of the Laboratory. With Xyce, Sandia has an 'in-house' capability with which both new electrical (e.g., device model development) and algorithmic (e.g., faster time-integration methods, parallel solver algorithms) research and development can be performed. As a result, Xyce is a unique electrical simulation capability, designed to meet the unique needs of the laboratory.« less
NASA Technical Reports Server (NTRS)
Connelly, L. C.
1977-01-01
The mission planning processor is a user oriented tool for consumables management and is part of the total consumables subsystem management concept. The approach to be used in developing a working model of the mission planning processor is documented. The approach includes top-down design, structured programming techniques, and application of NASA approved software development standards. This development approach: (1) promotes cost effective software development, (2) enhances the quality and reliability of the working model, (3) encourages the sharing of the working model through a standard approach, and (4) promotes portability of the working model to other computer systems.
Performance and scalability evaluation of "Big Memory" on Blue Gene Linux.
DOE Office of Scientific and Technical Information (OSTI.GOV)
Yoshii, K.; Iskra, K.; Naik, H.
2011-05-01
We address memory performance issues observed in Blue Gene Linux and discuss the design and implementation of 'Big Memory' - an alternative, transparent memory space introduced to eliminate the memory performance issues. We evaluate the performance of Big Memory using custom memory benchmarks, NAS Parallel Benchmarks, and the Parallel Ocean Program, at a scale of up to 4,096 nodes. We find that Big Memory successfully resolves the performance issues normally encountered in Blue Gene Linux. For the ocean simulation program, we even find that Linux with Big Memory provides better scalability than does the lightweight compute node kernel designed solelymore » for high-performance applications. Originally intended exclusively for compute node tasks, our new memory subsystem dramatically improves the performance of certain I/O node applications as well. We demonstrate this performance using the central processor of the LOw Frequency ARray radio telescope as an example.« less
High Performance Computing Assets for Ocean Acoustics Research
2016-11-18
independently on processing units with access to a typically available amount of memory, say 16 or 32 gigabytes. Our models require each processor to...allow results to be obtained with limited amounts of memory available to individual processing units (with no time frame for successful completion...put into use. One file server computer to store simulation output has also been purchased. The first workstation has 28 CPU cores, dual- thread , (56
A Heterogeneous Multiprocessor Graphics System Using Processor-Enhanced Memories
1989-02-01
frames per second, font generation directly from conic spline descriptions, and rapid calculation of radiosity form factors. The hardware consists of...generality for rendering curved surfaces, volume data, objects dcscri id with Constructive Solid Geometry, for rendering scenes using the radiosity ...f.aces and for computing a spherical radiosity lighting model (see Section 7.6). Custom Memory Chips \\ 208 bits x 128 pixels - Renderer Board ix p o a
Schedulers with load-store queue awareness
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chen, Tong; Eichenberger, Alexandre E.; Jacob, Arpith C.
2017-02-07
In one embodiment, a computer-implemented method includes tracking a size of a load-store queue (LSQ) during compile time of a program. The size of the LSQ is time-varying and indicates how many memory access instructions of the program are on the LSQ. The method further includes scheduling, by a computer processor, a plurality of memory access instructions of the program based on the size of the LSQ.
Schedulers with load-store queue awareness
DOE Office of Scientific and Technical Information (OSTI.GOV)
Chen, Tong; Eichenberger, Alexandre E.; Jacob, Arpith C.
2017-01-24
In one embodiment, a computer-implemented method includes tracking a size of a load-store queue (LSQ) during compile time of a program. The size of the LSQ is time-varying and indicates how many memory access instructions of the program are on the LSQ. The method further includes scheduling, by a computer processor, a plurality of memory access instructions of the program based on the size of the LSQ.
Implementation Of The Configurable Fault Tolerant System Experiment On NPSAT 1
2016-03-01
REPORT TYPE AND DATES COVERED Master’s thesis 4. TITLE AND SUBTITLE IMPLEMENTATION OF THE CONFIGURABLE FAULT TOLERANT SYSTEM EXPERIMENT ON NPSAT...open-source microprocessor without interlocked pipeline stages (MIPS) based processor softcore, a cached memory structure capable of accessing double...data rate type three and secure digital card memories, an interface to the main satellite bus, and XILINX’s soft error mitigation softcore. The
NASA Astrophysics Data System (ADS)
Jaber, Khalid Mohammad; Alia, Osama Moh'd.; Shuaib, Mohammed Mahmod
2018-03-01
Finding the optimal parameters that can reproduce experimental data (such as the velocity-density relation and the specific flow rate) is a very important component of the validation and calibration of microscopic crowd dynamic models. Heavy computational demand during parameter search is a known limitation that exists in a previously developed model known as the Harmony Search-Based Social Force Model (HS-SFM). In this paper, a parallel-based mechanism is proposed to reduce the computational time and memory resource utilisation required to find these parameters. More specifically, two MATLAB-based multicore techniques (parfor and create independent jobs) using shared memory are developed by taking advantage of the multithreading capabilities of parallel computing, resulting in a new framework called the Parallel Harmony Search-Based Social Force Model (P-HS-SFM). The experimental results show that the parfor-based P-HS-SFM achieved a better computational time of about 26 h, an efficiency improvement of ? 54% and a speedup factor of 2.196 times in comparison with the HS-SFM sequential processor. The performance of the P-HS-SFM using the create independent jobs approach is also comparable to parfor with a computational time of 26.8 h, an efficiency improvement of about 30% and a speedup of 2.137 times.
NASA Technical Reports Server (NTRS)
Plesea, Lucian
2006-01-01
A computer program automatically builds large, full-resolution mosaics of multispectral images of Earth landmasses from images acquired by Landsat 7, complete with matching of colors and blending between adjacent scenes. While the code has been used extensively for Landsat, it could also be used for other data sources. A single mosaic of as many as 8,000 scenes, represented by more than 5 terabytes of data and the largest set produced in this work, demonstrated what the code could do to provide global coverage. The program first statistically analyzes input images to determine areas of coverage and data-value distributions. It then transforms the input images from their original universal transverse Mercator coordinates to other geographical coordinates, with scaling. It applies a first-order polynomial brightness correction to each band in each scene. It uses a data-mask image for selecting data and blending of input scenes. Under control by a user, the program can be made to operate on small parts of the output image space, with check-point and restart capabilities. The program runs on SGI IRIX computers. It is capable of parallel processing using shared-memory code, large memories, and tens of central processing units. It can retrieve input data and store output data at locations remote from the processors on which it is executed.
Green Secure Processors: Towards Power-Efficient Secure Processor Design
NASA Astrophysics Data System (ADS)
Chhabra, Siddhartha; Solihin, Yan
With the increasing wealth of digital information stored on computer systems today, security issues have become increasingly important. In addition to attacks targeting the software stack of a system, hardware attacks have become equally likely. Researchers have proposed Secure Processor Architectures which utilize hardware mechanisms for memory encryption and integrity verification to protect the confidentiality and integrity of data and computation, even from sophisticated hardware attacks. While there have been many works addressing performance and other system level issues in secure processor design, power issues have largely been ignored. In this paper, we first analyze the sources of power (energy) increase in different secure processor architectures. We then present a power analysis of various secure processor architectures in terms of their increase in power consumption over a base system with no protection and then provide recommendations for designs that offer the best balance between performance and power without compromising security. We extend our study to the embedded domain as well. We also outline the design of a novel hybrid cryptographic engine that can be used to minimize the power consumption for a secure processor. We believe that if secure processors are to be adopted in future systems (general purpose or embedded), it is critically important that power issues are considered in addition to performance and other system level issues. To the best of our knowledge, this is the first work to examine the power implications of providing hardware mechanisms for security.
Testing New Programming Paradigms with NAS Parallel Benchmarks
NASA Technical Reports Server (NTRS)
Jin, H.; Frumkin, M.; Schultz, M.; Yan, J.
2000-01-01
Over the past decade, high performance computing has evolved rapidly, not only in hardware architectures but also with increasing complexity of real applications. Technologies have been developing to aim at scaling up to thousands of processors on both distributed and shared memory systems. Development of parallel programs on these computers is always a challenging task. Today, writing parallel programs with message passing (e.g. MPI) is the most popular way of achieving scalability and high performance. However, writing message passing programs is difficult and error prone. Recent years new effort has been made in defining new parallel programming paradigms. The best examples are: HPF (based on data parallelism) and OpenMP (based on shared memory parallelism). Both provide simple and clear extensions to sequential programs, thus greatly simplify the tedious tasks encountered in writing message passing programs. HPF is independent of memory hierarchy, however, due to the immaturity of compiler technology its performance is still questionable. Although use of parallel compiler directives is not new, OpenMP offers a portable solution in the shared-memory domain. Another important development involves the tremendous progress in the internet and its associated technology. Although still in its infancy, Java promisses portability in a heterogeneous environment and offers possibility to "compile once and run anywhere." In light of testing these new technologies, we implemented new parallel versions of the NAS Parallel Benchmarks (NPBs) with HPF and OpenMP directives, and extended the work with Java and Java-threads. The purpose of this study is to examine the effectiveness of alternative programming paradigms. NPBs consist of five kernels and three simulated applications that mimic the computation and data movement of large scale computational fluid dynamics (CFD) applications. We started with the serial version included in NPB2.3. Optimization of memory and cache usage was applied to several benchmarks, noticeably BT and SP, resulting in better sequential performance. In order to overcome the lack of an HPF performance model and guide the development of the HPF codes, we employed an empirical performance model for several primitives found in the benchmarks. We encountered a few limitations of HPF, such as lack of supporting the "REDISTRIBUTION" directive and no easy way to handle irregular computation. The parallelization with OpenMP directives was done at the outer-most loop level to achieve the largest granularity. The performance of six HPF and OpenMP benchmarks is compared with their MPI counterparts for the Class-A problem size in the figure in next page. These results were obtained on an SGI Origin2000 (195MHz) with MIPSpro-f77 compiler 7.2.1 for OpenMP and MPI codes and PGI pghpf-2.4.3 compiler with MPI interface for HPF programs.
Applications Performance on NAS Intel Paragon XP/S - 15#
NASA Technical Reports Server (NTRS)
Saini, Subhash; Simon, Horst D.; Copper, D. M. (Technical Monitor)
1994-01-01
The Numerical Aerodynamic Simulation (NAS) Systems Division received an Intel Touchstone Sigma prototype model Paragon XP/S- 15 in February, 1993. The i860 XP microprocessor with an integrated floating point unit and operating in dual -instruction mode gives peak performance of 75 million floating point operations (NIFLOPS) per second for 64 bit floating point arithmetic. It is used in the Paragon XP/S-15 which has been installed at NAS, NASA Ames Research Center. The NAS Paragon has 208 nodes and its peak performance is 15.6 GFLOPS. Here, we will report on early experience using the Paragon XP/S- 15. We have tested its performance using both kernels and applications of interest to NAS. We have measured the performance of BLAS 1, 2 and 3 both assembly-coded and Fortran coded on NAS Paragon XP/S- 15. Furthermore, we have investigated the performance of a single node one-dimensional FFT, a distributed two-dimensional FFT and a distributed three-dimensional FFT Finally, we measured the performance of NAS Parallel Benchmarks (NPB) on the Paragon and compare it with the performance obtained on other highly parallel machines, such as CM-5, CRAY T3D, IBM SP I, etc. In particular, we investigated the following issues, which can strongly affect the performance of the Paragon: a. Impact of the operating system: Intel currently uses as a default an operating system OSF/1 AD from the Open Software Foundation. The paging of Open Software Foundation (OSF) server at 22 MB to make more memory available for the application degrades the performance. We found that when the limit of 26 NIB per node out of 32 MB available is reached, the application is paged out of main memory using virtual memory. When the application starts paging, the performance is considerably reduced. We found that dynamic memory allocation can help applications performance under certain circumstances. b. Impact of data cache on the i860/XP: We measured the performance of the BLAS both assembly coded and Fortran coded. We found that the measured performance of assembly-coded BLAS is much less than what memory bandwidth limitation would predict. The influence of data cache on different sizes of vectors is also investigated using one-dimensional FFTs. c. Impact of processor layout: There are several different ways processors can be laid out within the two-dimensional grid of processors on the Paragon. We have used the FFT example to investigate performance differences based on processors layout.
Distributed Processor/Memory Architectures Design Program
1975-02-01
Event Scheduling Plo 31 Globat LAl Message Input Event Sicheduling Fhou ..... ............... 106 32 It tc Iata Representation...298 138 GEX LEX Scheduling Phlmophy ....... ...................... 300 139 Executive Comirol Herarchy... Scheduler Subroutine lnterrelatiomhips . ..... ................. 312 145 Task Scheduler Message Scatuer. . ...... ....................... 315 146
NASA Technical Reports Server (NTRS)
Stehle, Roy H.; Ogier, Richard G.
1993-01-01
Alternatives for realizing a packet-based network switch for use on a frequency division multiple access/time division multiplexed (FDMA/TDM) geostationary communication satellite were investigated. Each of the eight downlink beams supports eight directed dwells. The design needed to accommodate multicast packets with very low probability of loss due to contention. Three switch architectures were designed and analyzed. An output-queued, shared bus system yielded a functionally simple system, utilizing a first-in, first-out (FIFO) memory per downlink dwell, but at the expense of a large total memory requirement. A shared memory architecture offered the most efficiency in memory requirements, requiring about half the memory of the shared bus design. The processing requirement for the shared-memory system adds system complexity that may offset the benefits of the smaller memory. An alternative design using a shared memory buffer per downlink beam decreases circuit complexity through a distributed design, and requires at most 1000 packets of memory more than the completely shared memory design. Modifications to the basic packet switch designs were proposed to accommodate circuit-switched traffic, which must be served on a periodic basis with minimal delay. Methods for dynamically controlling the downlink dwell lengths were developed and analyzed. These methods adapt quickly to changing traffic demands, and do not add significant complexity or cost to the satellite and ground station designs. Methods for reducing the memory requirement by not requiring the satellite to store full packets were also proposed and analyzed. In addition, optimal packet and dwell lengths were computed as functions of memory size for the three switch architectures.
System, methods and apparatus for program optimization for multi-threaded processor architectures
Bastoul, Cedric; Lethin, Richard A; Leung, Allen K; Meister, Benoit J; Szilagyi, Peter; Vasilache, Nicolas T; Wohlford, David E
2015-01-06
Methods, apparatus and computer software product for source code optimization are provided. In an exemplary embodiment, a first custom computing apparatus is used to optimize the execution of source code on a second computing apparatus. In this embodiment, the first custom computing apparatus contains a memory, a storage medium and at least one processor with at least one multi-stage execution unit. The second computing apparatus contains at least two multi-stage execution units that allow for parallel execution of tasks. The first custom computing apparatus optimizes the code for parallelism, locality of operations and contiguity of memory accesses on the second computing apparatus. This Abstract is provided for the sole purpose of complying with the Abstract requirement rules. This Abstract is submitted with the explicit understanding that it will not be used to interpret or to limit the scope or the meaning of the claims.
Computer systems and methods for the query and visualization multidimensional databases
Stolte, Chris; Tang, Diane L.; Hanrahan, Patrick
2017-04-25
A method of generating a data visualization is performed at a computer having a display, one or more processors, and memory. The memory stores one or more programs for execution by the one or more processors. The process receives user specification of a plurality of characteristics of a data visualization. The data visualization is based on data from a multidimensional database. The characteristics specify at least x-position and y-position of data marks corresponding to tuples of data retrieved from the database. The process generates a data visualization according to the specified plurality of characteristics. The data visualization has an x-axis defined based on data for one or more first fields from the database that specify x-position of the data marks and the data visualization has a y-axis defined based on data for one or more second fields from the database that specify y-position of the data marks.
Dynamic Load Balancing for Adaptive Computations on Distributed-Memory Machines
NASA Technical Reports Server (NTRS)
1999-01-01
Dynamic load balancing is central to adaptive mesh-based computations on large-scale parallel computers. The principal investigator has investigated various issues on the dynamic load balancing problem under NASA JOVE and JAG rants. The major accomplishments of the project are two graph partitioning algorithms and a load balancing framework. The S-HARP dynamic graph partitioner is known to be the fastest among the known dynamic graph partitioners to date. It can partition a graph of over 100,000 vertices in 0.25 seconds on a 64- processor Cray T3E distributed-memory multiprocessor while maintaining the scalability of over 16-fold speedup. Other known and widely used dynamic graph partitioners take over a second or two while giving low scalability of a few fold speedup on 64 processors. These results have been published in journals and peer-reviewed flagship conferences.
Gilgamesh: A Multithreaded Processor-In-Memory Architecture for Petaflops Computing
NASA Technical Reports Server (NTRS)
Sterling, T. L.; Zima, H. P.
2002-01-01
Processor-in-Memory (PIM) architectures avoid the von Neumann bottleneck in conventional machines by integrating high-density DRAM and CMOS logic on the same chip. Parallel systems based on this new technology are expected to provide higher scalability, adaptability, robustness, fault tolerance and lower power consumption than current MPPs or commodity clusters. In this paper we describe the design of Gilgamesh, a PIM-based massively parallel architecture, and elements of its execution model. Gilgamesh extends existing PIM capabilities by incorporating advanced mechanisms for virtualizing tasks and data and providing adaptive resource management for load balancing and latency tolerance. The Gilgamesh execution model is based on macroservers, a middleware layer which supports object-based runtime management of data and threads allowing explicit and dynamic control of locality and load balancing. The paper concludes with a discussion of related research activities and an outlook to future work.
Staging memory for massively parallel processor
NASA Technical Reports Server (NTRS)
Batcher, Kenneth E. (Inventor)
1988-01-01
The invention herein relates to a computer organization capable of rapidly processing extremely large volumes of data. A staging memory is provided having a main stager portion consisting of a large number of memory banks which are accessed in parallel to receive, store, and transfer data words simultaneous with each other. Substager portions interconnect with the main stager portion to match input and output data formats with the data format of the main stager portion. An address generator is coded for accessing the data banks for receiving or transferring the appropriate words. Input and output permutation networks arrange the lineal order of data into and out of the memory banks.
A microcomputer interface for a digital audio processor-based data recording system.
Croxton, T L; Stump, S J; Armstrong, W M
1987-10-01
An inexpensive interface is described that performs direct transfer of digitized data from the digital audio processor and video cassette recorder based data acquisition system designed by Bezanilla (1985, Biophys. J., 47:437-441) to an IBM PC/XT microcomputer. The FORTRAN callable software that drives this interface is capable of controlling the video cassette recorder and starting data collection immediately after recognition of a segment of previously collected data. This permits piecewise analysis of long intervals of data that would otherwise exceed the memory capability of the microcomputer.
Wolinski, Christophe Czeslaw [Los Alamos, NM; Gokhale, Maya B [Los Alamos, NM; McCabe, Kevin Peter [Los Alamos, NM
2011-01-18
Fabric-based computing systems and methods are disclosed. A fabric-based computing system can include a polymorphous computing fabric that can be customized on a per application basis and a host processor in communication with said polymorphous computing fabric. The polymorphous computing fabric includes a cellular architecture that can be highly parameterized to enable a customized synthesis of fabric instances for a variety of enhanced application performances thereof. A global memory concept can also be included that provides the host processor random access to all variables and instructions associated with the polymorphous computing fabric.
Programmable architecture for pixel level processing tasks in lightweight strapdown IR seekers
NASA Astrophysics Data System (ADS)
Coates, James L.
1993-06-01
Typical processing tasks associated with missile IR seeker applications are described, and a straw man suite of algorithms is presented. A fully programmable multiprocessor architecture is realized on a multimedia video processor (MVP) developed by Texas Instruments. The MVP combines the elements of RISC, floating point, advanced DSPs, graphics processors, display and acquisition control, RAM, and external memory. Front end pixel level tasks typical of missile interceptor applications, operating on 256 x 256 sensor imagery, can be processed at frame rates exceeding 100 Hz in a single MVP chip.
Computing NLTE Opacities -- Node Level Parallel Calculation
DOE Office of Scientific and Technical Information (OSTI.GOV)
Holladay, Daniel
Presentation. The goal: to produce a robust library capable of computing reasonably accurate opacities inline with the assumption of LTE relaxed (non-LTE). Near term: demonstrate acceleration of non-LTE opacity computation. Far term (if funded): connect to application codes with in-line capability and compute opacities. Study science problems. Use efficient algorithms that expose many levels of parallelism and utilize good memory access patterns for use on advanced architectures. Portability to multiple types of hardware including multicore processors, manycore processors such as KNL, GPUs, etc. Easily coupled to radiation hydrodynamics and thermal radiative transfer codes.
A microcomputer interface for a digital audio processor-based data recording system.
Croxton, T L; Stump, S J; Armstrong, W M
1987-01-01
An inexpensive interface is described that performs direct transfer of digitized data from the digital audio processor and video cassette recorder based data acquisition system designed by Bezanilla (1985, Biophys. J., 47:437-441) to an IBM PC/XT microcomputer. The FORTRAN callable software that drives this interface is capable of controlling the video cassette recorder and starting data collection immediately after recognition of a segment of previously collected data. This permits piecewise analysis of long intervals of data that would otherwise exceed the memory capability of the microcomputer. PMID:3676444
Enabling MPEG-2 video playback in embedded systems through improved data cache efficiency
NASA Astrophysics Data System (ADS)
Soderquist, Peter; Leeser, Miriam E.
1999-01-01
Digital video decoding, enabled by the MPEG-2 Video standard, is an important future application for embedded systems, particularly PDAs and other information appliances. Many such system require portability and wireless communication capabilities, and thus face severe limitations in size and power consumption. This places a premium on integration and efficiency, and favors software solutions for video functionality over specialized hardware. The processors in most embedded system currently lack the computational power needed to perform video decoding, but a related and equally important problem is the required data bandwidth, and the need to cost-effectively insure adequate data supply. MPEG data sets are very large, and generate significant amounts of excess memory traffic for standard data caches, up to 100 times the amount required for decoding. Meanwhile, cost and power limitations restrict cache sizes in embedded systems. Some systems, including many media processors, eliminate caches in favor of memories under direct, painstaking software control in the manner of digital signal processors. Yet MPEG data has locality which caches can exploit if properly optimized, providing fast, flexible, and automatic data supply. We propose a set of enhancements which target the specific needs of the heterogeneous types within the MPEG decoder working set. These optimizations significantly improve the efficiency of small caches, reducing cache-memory traffic by almost 70 percent, and can make an enhanced 4 KB cache perform better than a standard 1 MB cache. This performance improvement can enable high-resolution, full frame rate video playback in cheaper, smaller system than woudl otherwise be possible.
A Wearable Healthcare System With a 13.7 μA Noise Tolerant ECG Processor.
Izumi, Shintaro; Yamashita, Ken; Nakano, Masanao; Kawaguchi, Hiroshi; Kimura, Hiromitsu; Marumoto, Kyoji; Fuchikami, Takaaki; Fujimori, Yoshikazu; Nakajima, Hiroshi; Shiga, Toshikazu; Yoshimoto, Masahiko
2015-10-01
To prevent lifestyle diseases, wearable bio-signal monitoring systems for daily life monitoring have attracted attention. Wearable systems have strict size and weight constraints, which impose significant limitations of the battery capacity and the signal-to-noise ratio of bio-signals. This report describes an electrocardiograph (ECG) processor for use with a wearable healthcare system. It comprises an analog front end, a 12-bit ADC, a robust Instantaneous Heart Rate (IHR) monitor, a 32-bit Cortex-M0 core, and 64 Kbyte Ferroelectric Random Access Memory (FeRAM). The IHR monitor uses a short-term autocorrelation (STAC) algorithm to improve the heart-rate detection accuracy despite its use in noisy conditions. The ECG processor chip consumes 13.7 μA for heart rate logging application.
Computing an operating parameter of a unified power flow controller
Wilson, David G.; Robinett, III, Rush D.
2017-12-26
A Unified Power Flow Controller described herein comprises a sensor that outputs at least one sensed condition, a processor that receives the at least one sensed condition, a memory that comprises control logic that is executable by the processor; and power electronics that comprise power storage, wherein the processor causes the power electronics to selectively cause the power storage to act as one of a power generator or a load based at least in part upon the at least one sensed condition output by the sensor and the control logic, and wherein at least one operating parameter of the power electronics is designed to facilitate maximal transmittal of electrical power generated at a variable power generation system to a grid system while meeting power constraints set forth by the electrical power grid.
Multi-level Hierarchical Poly Tree computer architectures
NASA Technical Reports Server (NTRS)
Padovan, Joe; Gute, Doug
1990-01-01
Based on the concept of hierarchical substructuring, this paper develops an optimal multi-level Hierarchical Poly Tree (HPT) parallel computer architecture scheme which is applicable to the solution of finite element and difference simulations. Emphasis is given to minimizing computational effort, in-core/out-of-core memory requirements, and the data transfer between processors. In addition, a simplified communications network that reduces the number of I/O channels between processors is presented. HPT configurations that yield optimal superlinearities are also demonstrated. Moreover, to generalize the scope of applicability, special attention is given to developing: (1) multi-level reduction trees which provide an orderly/optimal procedure by which model densification/simplification can be achieved, as well as (2) methodologies enabling processor grading that yields architectures with varying types of multi-level granularity.