limited processor sharing: Topics by Science.gov

Sample records for limited processor sharing

Characterization of Stationary Distributions of Reflected Diffusions

DTIC Science & Technology

2014-01-01

Reiman, M. I. (2003). Fluid and heavy traffic limits for a generalized processor sharing model. Ann. Appl. Probab., 13, 100-139. [37] Ramanan, K. and... Reiman, M. I. (2008). The heavy traffic limit of an unbalanced generalized processor sharing model. Ann. Appl. Probab., 18, 22-58. [38] Reed, J. and... Control and Computing. [39] Reiman, M. I. and Williams, R. J. (1988). A boundary property of semimartingale reflecting Brownian motions. Probab. Theor
Interconnect Performance Evaluation of SGI Altix 3700 BX2, Cray X1, Cray Opteron Cluster, and Dell PowerEdge

NASA Technical Reports Server (NTRS)

Fatoohi, Rod; Saini, Subhash; Ciotti, Robert

2006-01-01

We study the performance of inter-process communication on four high-speed multiprocessor systems using a set of communication benchmarks. The goal is to identify limiting factors and bottlenecks in the interconnects of these systems and to compare the interconnects. We measured network bandwidth using different numbers of communicating processors and different communication patterns, such as point-to-point communication, collective communication, and dense communication patterns. The four platforms are: a 512-processor SGI Altix 3700 BX2 shared-memory machine with 3.2 GB/s links; a 64-processor (single-streaming) Cray X1 shared-memory machine with 32 1.6 GB/s links; a 128-processor Cray Opteron cluster using a Myrinet network; and a 1280-node Dell PowerEdge cluster with an InfiniBand network. Our results show the impact of the network bandwidth and topology on the overall performance of each interconnect.
Shared performance monitor in a multiprocessor system

DOEpatents

Chiu, George; Gara, Alan G; Salapura, Valentina

2014-12-02

A performance monitoring unit (PMU) and method for monitoring performance of events occurring in a multiprocessor system. The multiprocessor system comprises a plurality of processor device units, each processor device for generating signals representing occurrences of events in the processor device, and a single shared counter resource for performance monitoring. The performance monitor unit is shared by all processor cores in the multiprocessor system. The PMU is further programmed to monitor event signals issued from non-processor devices.
Shared performance monitor in a multiprocessor system

DOEpatents

Chiu, George; Gara, Alan G.; Salapura, Valentina

2012-07-24

A performance monitoring unit (PMU) and method for monitoring performance of events occurring in a multiprocessor system. The multiprocessor system comprises a plurality of processor device units, each processor device for generating signals representing occurrences of events in the processor device, and a single shared counter resource for performance monitoring. The performance monitor unit is shared by all processor cores in the multiprocessor system. The PMU comprises: a plurality of performance counters, each for counting signals representing occurrences of events from one or more of the plurality of processor units in the multiprocessor system; and a plurality of input devices for receiving the event signals from one or more processor devices of the plurality of processor units, the plurality of input devices programmable to select event signals for receipt by one or more of the plurality of performance counters for counting, wherein the PMU is shared between multiple processing units, or within a group of processors in the multiprocessing system. The PMU is further programmed to monitor event signals issued from non-processor devices.
Parallel ALLSPD-3D: Speeding Up Combustor Analysis Via Parallel Processing

NASA Technical Reports Server (NTRS)

Fricker, David M.

1997-01-01

The ALLSPD-3D Computational Fluid Dynamics code for reacting flow simulation was run on a set of benchmark test cases to determine its parallel efficiency. These test cases included non-reacting and reacting flow simulations with varying numbers of processors. Also, the tests explored the effects of scaling the simulation with the number of processors in addition to distributing a constant size problem over an increasing number of processors. The test cases were run on a cluster of IBM RS/6000 Model 590 workstations with ethernet and ATM networking plus a shared memory SGI Power Challenge L workstation. The results indicate that the network capabilities significantly influence the parallel efficiency, i.e., a shared memory machine is fastest and ATM networking provides acceptable performance. The limitations of ethernet greatly hamper the rapid calculation of flows using ALLSPD-3D.
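The two kinds of test described in this abstract (a constant-size problem spread over more processors, and a problem scaled with the processor count) correspond to the usual strong-scaling and weak-scaling efficiency measures. A minimal sketch of both, assuming the standard textbook definitions; the function names and the timings in the demo are hypothetical and are not taken from the report:

```python
def speedup(t_serial, t_parallel):
    """Ratio of single-processor run time to parallel run time."""
    return t_serial / t_parallel


def parallel_efficiency(t_serial, t_parallel, n_procs):
    """Strong scaling: fixed problem size, speedup divided by processor count."""
    return speedup(t_serial, t_parallel) / n_procs


def scaled_efficiency(t_one_proc, t_n_procs):
    """Weak scaling: the problem grows with the processor count,
    so the ideal run time stays constant and efficiency is a time ratio."""
    return t_one_proc / t_n_procs


if __name__ == "__main__":
    # Hypothetical timings in seconds, for illustration only.
    print(parallel_efficiency(100.0, 25.0, 8))
    print(scaled_efficiency(10.0, 12.5))
```

An efficiency near 1.0 means the added processors are being used well; slow interconnects typically show up as efficiency falling as the processor count grows.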
7 CFR 1435.310 - Sharing processors' allocations with producers.

Code of Federal Regulations, 2011 CFR

2011-01-01

... CREDIT CORPORATION, DEPARTMENT OF AGRICULTURE LOANS, PURCHASES, AND OTHER OPERATIONS SUGAR PROGRAM Flexible Marketing Allotments For Sugar § 1435.310 Sharing processors' allocations with producers. (a) Every sugar beet and sugarcane processor must provide CCC a certification that: (1) The processor...
7 CFR 1435.310 - Sharing processors' allocations with producers.

Code of Federal Regulations, 2010 CFR

2010-01-01

... CREDIT CORPORATION, DEPARTMENT OF AGRICULTURE LOANS, PURCHASES, AND OTHER OPERATIONS SUGAR PROGRAM Flexible Marketing Allotments For Sugar § 1435.310 Sharing processors' allocations with producers. (a) Every sugar beet and sugarcane processor must provide CCC a certification that: (1) The processor...
7 CFR 1435.310 - Sharing processors' allocations with producers.

Code of Federal Regulations, 2012 CFR

2012-01-01

... CREDIT CORPORATION, DEPARTMENT OF AGRICULTURE LOANS, PURCHASES, AND OTHER OPERATIONS SUGAR PROGRAM Flexible Marketing Allotments For Sugar § 1435.310 Sharing processors' allocations with producers. (a) Every sugar beet and sugarcane processor must provide CCC a certification that: (1) The processor...
7 CFR 1435.310 - Sharing processors' allocations with producers.

Code of Federal Regulations, 2014 CFR

2014-01-01

... CREDIT CORPORATION, DEPARTMENT OF AGRICULTURE LOANS, PURCHASES, AND OTHER OPERATIONS SUGAR PROGRAM Flexible Marketing Allotments For Sugar § 1435.310 Sharing processors' allocations with producers. (a) Every sugar beet and sugarcane processor must provide CCC a certification that: (1) The processor...
7 CFR 1435.310 - Sharing processors' allocations with producers.

Code of Federal Regulations, 2013 CFR

2013-01-01

... CREDIT CORPORATION, DEPARTMENT OF AGRICULTURE LOANS, PURCHASES, AND OTHER OPERATIONS SUGAR PROGRAM Flexible Marketing Allotments For Sugar § 1435.310 Sharing processors' allocations with producers. (a) Every sugar beet and sugarcane processor must provide CCC a certification that: (1) The processor...
Communications systems and methods for subsea processors

DOEpatents

Gutierrez, Jose; Pereira, Luis

2016-04-26

A subsea processor may be located near the seabed of a drilling site and used to coordinate operations of underwater drilling components. The subsea processor may be enclosed in a single interchangeable unit that fits a receptor on an underwater drilling component, such as a blow-out preventer (BOP). The subsea processor may issue commands to control the BOP and receive measurements from sensors located throughout the BOP. A shared communications bus may interconnect the subsea processor and underwater components and the subsea processor and a surface or onshore network. The shared communications bus may be operated according to a time division multiple access (TDMA) scheme.
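The time division multiple access (TDMA) scheme mentioned in this abstract can be illustrated with a fixed round-robin slot table. This is a minimal sketch under assumptions that are not in the record: equal-length slots, a static node order, and hypothetical node names and slot length.

```python
def tdma_owner(time_us, slot_us, nodes):
    """Return the node allowed to transmit on the shared bus at time_us."""
    slot_index = (time_us // slot_us) % len(nodes)
    return nodes[slot_index]


def next_slot_start(time_us, slot_us, nodes, node):
    """Return the start time of the first slot owned by `node`
    that begins at or after time_us."""
    frame_us = slot_us * len(nodes)          # one full round of slots
    offset = nodes.index(node) * slot_us     # node's position inside a frame
    start = (time_us // frame_us) * frame_us + offset
    if start < time_us:                      # slot in this frame already began
        start += frame_us
    return start


if __name__ == "__main__":
    # Hypothetical bus participants and a hypothetical 100 microsecond slot.
    nodes = ["subsea_processor", "bop_sensor", "surface_link"]
    for t in (0, 150, 250, 300):
        print(t, tdma_owner(t, 100, nodes))
```

Because each node transmits only in its own slot, no arbitration or collision handling is needed on the bus, at the cost of idle slots when a node has nothing to send.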
Multiprocessor shared-memory information exchange

DOE Office of Scientific and Technical Information (OSTI.GOV)

Santoline, L.L.; Bowers, M.D.; Crew, A.W.

1989-02-01

In distributed microprocessor-based instrumentation and control systems, the inter- and intra-subsystem communication requirements ultimately form the basis for the overall system architecture. This paper describes a software protocol which addresses the intra-subsystem communications problem. Specifically, the protocol allows multiple processors to exchange information via a shared-memory interface. The authors' primary goal is to provide a reliable means for information to be exchanged between central application processor boards (masters) and dedicated function processor boards (slaves) in a single computer chassis. The resultant Multiprocessor Shared-Memory Information Exchange (MSMIE) protocol, a standard master-slave shared-memory interface suitable for use in nuclear safety systems, is designed to pass unidirectional buffers of information between the processors while providing a minimum, deterministic cycle time for this data exchange.
Using SDI-12 with ST microelectronics MCU's

DOE Office of Scientific and Technical Information (OSTI.GOV)

Saari, Alexandra; Hinzey, Shawn Adrian; Frigo, Janette Rose

2015-09-03

ST Microelectronics microcontrollers and processors are readily available, capable and economical processors. Unfortunately, they lack a broad user base like similar offerings from Texas Instruments, Atmel, or Microchip. All of these devices could be useful in economical devices for remote sensing applications used with environmental sensing. With the increased need for environmental studies, and limited budgets, flexibility in hardware is very important. To that end, and in an effort to increase open support of ST devices, I am sharing my team's experience in interfacing a common environmental sensor communication protocol (SDI-12) with ST devices.
Distributed processor allocation for launching applications in a massively connected processors complex

DOEpatents

Pedretti, Kevin

2008-11-18

A compute processor allocator architecture for allocating compute processors to run applications in a multiple processor computing apparatus is distributed among a subset of processors within the computing apparatus. Each processor of the subset includes a compute processor allocator. The compute processor allocators can share a common database of information pertinent to compute processor allocation. A communication path permits retrieval of information from the database independently of the compute processor allocators.
Hypercluster - Parallel processing for computational mechanics

NASA Technical Reports Server (NTRS)

Blech, Richard A.

1988-01-01

An account is given of the development status, performance capabilities and implications for further development of NASA-Lewis' testbed 'hypercluster' parallel computer network, in which multiple processors communicate through a shared memory. Processors have local as well as shared memory; the hypercluster is expanded in the same manner as the hypercube, with processor clusters replacing the normal single processor node. The NASA-Lewis machine has three nodes with a vector personality and one node with a scalar personality. Each of the vector nodes uses four board-level vector processors, while the scalar node uses four general-purpose microcomputer boards.
Ordering of guarded and unguarded stores for no-sync I/O

DOEpatents

Gara, Alan; Ohmacht, Martin

2013-06-25

A parallel computing system processes at least one store instruction. A first processor core issues a store instruction. A first queue, associated with the first processor core, stores the store instruction. A second queue, associated with a first local cache memory device of the first processor core, stores the store instruction. The first processor core updates first data in the first local cache memory device according to the store instruction. A third queue, associated with at least one shared cache memory device, stores the store instruction. The first processor core invalidates second data, associated with the store instruction, in the at least one shared cache memory. The first processor core invalidates third data, associated with the store instruction, in other local cache memory devices of other processor cores. The first processor core flushes only the first queue.
Visualization Co-Processing of a CFD Simulation

NASA Technical Reports Server (NTRS)

Vaziri, Arsi

1999-01-01

OVERFLOW, a widely used CFD simulation code, is combined with a visualization system, pV3, to experiment with an environment for simulation/visualization co-processing on an SGI Origin 2000 (O2K) computer system. The shared memory version of the solver is used with the O2K 'pfa' preprocessor invoked to automatically discover parallelism in the source code. No other explicit parallelism is enabled. In order to study the scaling and performance of the visualization co-processing system, sample runs are made with different processor groups in the range of 1 to 254 processors. The data exchange between the visualization system and the simulation system is rapid enough for user interactivity when the problem size is small. This shared memory version of OVERFLOW, with minimal parallelization, does not scale well to an increasing number of available processors. The visualization task takes about 18 to 30% of the total processing time and does not appear to be a major contributor to the poor scaling. Improper load balancing and inter-processor communication overhead are contributors to this poor performance. Work is in progress aimed at obtaining improved parallel performance of the solver and removing the limitations of serial data transfer to pV3 by examining various parallelization/communication strategies, including the use of explicit message passing.
Conditional load and store in a shared memory

DOEpatents

Blumrich, Matthias A; Ohmacht, Martin

2015-02-03

A method, system and computer program product for implementing load-reserve and store-conditional instructions in a multi-processor computing system. The computing system includes a multitude of processor units and a shared memory cache, and each of the processor units has access to the memory cache. In one embodiment, the method comprises providing the memory cache with a series of reservation registers, and storing in these registers addresses reserved in the memory cache for the processor units as a result of issuing load-reserve requests. In this embodiment, when one of the processor units makes a request to store data in the memory cache using a store-conditional request, the reservation registers are checked to determine if an address in the memory cache is reserved for that processor unit. If an address in the memory cache is reserved for that processor, the data are stored at this address.
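The reservation-register check described in this abstract can be modeled in a few lines. This is a minimal software sketch, not the patented hardware: the class and method names are hypothetical, there is one reservation register per processor unit, and the rule that a successful store-conditional clears every reservation on that address is an assumption borrowed from typical load-reserve/store-conditional semantics rather than from the abstract.

```python
class SharedCache:
    """Toy model of a shared cache with per-unit reservation registers."""

    def __init__(self, n_units):
        self.data = {}                        # address -> value
        self.reservations = [None] * n_units  # one register per processor unit

    def load_reserve(self, unit, addr):
        """Record a reservation on addr for this unit and return the value."""
        self.reservations[unit] = addr
        return self.data.get(addr, 0)

    def store_conditional(self, unit, addr, value):
        """Store only if this unit still holds a reservation on addr."""
        if self.reservations[unit] != addr:
            return False
        self.data[addr] = value
        # Assumption: a successful store invalidates all reservations on addr.
        for u, reserved in enumerate(self.reservations):
            if reserved == addr:
                self.reservations[u] = None
        return True


def atomic_increment(cache, unit, addr):
    """Retry loop that builds an atomic increment from the two primitives."""
    while True:
        value = cache.load_reserve(unit, addr)
        if cache.store_conditional(unit, addr, value + 1):
            return value + 1
```

If two units reserve the same address and both attempt a store-conditional, only the first succeeds; the second finds its reservation gone and must reload and retry, which is what makes the read-modify-write sequence appear atomic.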
Data preprocessing for determining outer/inner parallelization in the nested loop problem using OpenMP

NASA Astrophysics Data System (ADS)

Handhika, T.; Bustamam, A.; Ernastuti; Kerami, D.

2017-07-01

Multi-thread programming using OpenMP on a shared-memory architecture with hyperthreading technology allows a resource to be accessed by multiple processors simultaneously. Each processor can execute more than one thread for a certain period of time. However, the speedup depends on the ability of the processor to execute a limited number of threads, especially for a sequential algorithm that contains a nested loop in which the number of outer loop iterations is greater than the maximum number of threads that can be executed by a processor. The thread distribution technique found previously could only be applied by high-level programmers. This paper presents a parallelization procedure for low-level programmers dealing with 2-level nested loop problems in which the maximum number of threads that can be executed by a processor is smaller than the number of outer loop iterations. Data preprocessing, which uses the number of outer loop and inner loop iterations, the computational time required to execute each iteration, and the maximum number of threads that can be executed by a processor, is used as a strategy to determine which parallel region will produce the optimal speedup.
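The decision this abstract describes (parallelize the outer loop or the inner loop, given iteration counts, per-iteration time and a thread limit) can be illustrated with a deliberately simple cost model. This is a sketch under strong assumptions that are not taken from the paper: every iteration costs the same, iterations are dealt out in equal rounds of at most `max_threads`, and thread start-up and synchronization overheads are ignored. All names are hypothetical.

```python
import math


def estimated_time(n_outer, n_inner, t_iter, max_threads, region):
    """Estimated run time of a 2-level nested loop when one level is parallel."""
    if region == "outer":
        # Outer iterations run in rounds; each one executes the full inner loop.
        return math.ceil(n_outer / max_threads) * n_inner * t_iter
    if region == "inner":
        # Outer loop stays serial; each inner loop runs in rounds.
        return n_outer * math.ceil(n_inner / max_threads) * t_iter
    raise ValueError("region must be 'outer' or 'inner'")


def choose_region(n_outer, n_inner, t_iter, max_threads):
    """Pick the parallel region with the smaller estimated time."""
    outer = estimated_time(n_outer, n_inner, t_iter, max_threads, "outer")
    inner = estimated_time(n_outer, n_inner, t_iter, max_threads, "inner")
    return "outer" if outer <= inner else "inner"


if __name__ == "__main__":
    # 5 outer x 8 inner iterations on 4 threads: the outer loop needs two
    # rounds with one thread in the second round, so the inner loop wins.
    print(choose_region(5, 8, 1.0, 4))
```

The point of the model is that when the outer iteration count is not a multiple of the thread limit, the last round leaves threads idle, and parallelizing the inner loop can give the better speedup.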
Method for prefetching non-contiguous data structures

DOEpatents

Blumrich, Matthias A [Ridgefield, CT; Chen, Dong [Croton On Hudson, NY; Coteus, Paul W [Yorktown Heights, NY; Gara, Alan G [Mount Kisco, NY; Giampapa, Mark E [Irvington, NY; Heidelberger, Philip [Cortlandt Manor, NY; Hoenicke, Dirk [Ossining, NY; Ohmacht, Martin [Brewster, NY; Steinmacher-Burow, Burkhard D [Mount Kisco, NY; Takken, Todd E [Mount Kisco, NY; Vranas, Pavlos M [Bedford Hills, NY

2009-05-05

A low latency memory system access is provided in association with a weakly-ordered multiprocessor system. Each processor in the multiprocessor shares resources, and each shared resource has an associated lock within a locking device that provides support for synchronization between the multiple processors in the multiprocessor and the orderly sharing of the resources. A processor only has permission to access a resource when it owns the lock associated with that resource, and an attempt by a processor to own a lock requires only a single load operation, rather than a traditional atomic load followed by store, such that the processor only performs a read operation and the hardware locking device performs a subsequent write operation rather than the processor. A simple prefetching for non-contiguous data structures is also disclosed. A memory line is redefined so that in addition to the normal physical memory data, every line includes a pointer that is large enough to point to any other line in the memory, wherein the pointers determine which memory line to prefetch, rather than some other predictive algorithm. This enables hardware to effectively prefetch memory access patterns that are non-contiguous, but repetitive.
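The pointer-directed prefetching in this abstract can be illustrated with a small simulation in which every memory line carries a pointer to the line to fetch next. The abstract does not say how the pointers are set, so the learning rule used here (on a miss, point the previously accessed line at the line that was actually requested) is an assumption, as are the class and attribute names.

```python
class PointerPrefetcher:
    """Toy model: each line stores a pointer naming the line to prefetch next."""

    def __init__(self, n_lines):
        self.next_ptr = [None] * n_lines  # per-line prefetch pointer
        self.prefetched = None            # line currently held by the prefetcher
        self.last = None                  # previously accessed line
        self.hits = 0
        self.misses = 0

    def access(self, line):
        if line == self.prefetched:
            self.hits += 1
        else:
            self.misses += 1
            if self.last is not None:
                # Assumed learning rule: remember what followed the last line.
                self.next_ptr[self.last] = line
        # Follow this line's pointer to decide what to prefetch next.
        self.prefetched = self.next_ptr[line]
        self.last = line


if __name__ == "__main__":
    prefetcher = PointerPrefetcher(16)
    for line in [0, 7, 3, 12] * 3:  # non-contiguous but repetitive pattern
        prefetcher.access(line)
    print(prefetcher.hits, prefetcher.misses)
```

A stride-based predictor would not capture the pattern 0, 7, 3, 12, but once the pointers have been set during the first pass, every later access in the repeated pattern is already prefetched.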

On nonlinear finite element analysis in single-, multi- and parallel-processors

NASA Technical Reports Server (NTRS)

Utku, S.; Melosh, R.; Islam, M.; Salama, M.

1982-01-01

Numerical solution of nonlinear equilibrium problems of structures by means of Newton-Raphson type iterations is reviewed. Each step of the iteration is shown to correspond to the solution of a linear problem; therefore, the feasibility of the finite element method for nonlinear analysis is established. Organization and flow of data for various types of digital computers, such as single-processor/single-level memory, single-processor/two-level-memory, vector-processor/two-level-memory, and parallel-processors, with and without sub-structuring (i.e. partitioning) are given. The effect of the relative costs of computation, memory and data transfer on substructuring is shown. The idea of assigning comparable size substructures to parallel processors is exploited. Under Cholesky type factorization schemes, the efficiency of parallel processing is shown to decrease due to the occasional shared data, just as it does due to the shared facilities.
Rapid recovery from transient faults in the fault-tolerant processor with fault-tolerant shared memory

NASA Technical Reports Server (NTRS)

Harper, Richard E.; Butler, Bryan P.

1990-01-01

The Draper fault-tolerant processor with fault-tolerant shared memory (FTP/FTSM), which is designed to allow application tasks to continue execution during the memory alignment process, is described. Processor performance is not affected by memory alignment. In addition, the FTP/FTSM incorporates a hardware scrubber device to perform the memory alignment quickly during unused memory access cycles. The FTP/FTSM architecture is described, followed by an estimate of the time required for channel reintegration.
Architectures for reasoning in parallel

NASA Technical Reports Server (NTRS)

Hall, Lawrence O.

1989-01-01

The research conducted has dealt with rule-based expert systems. The algorithms that may lead to effective parallelization of them were investigated. Both the forward and backward chained control paradigms were investigated in the course of this work. The best computer architecture for the developed and investigated algorithms has been researched. Two experimental vehicles were developed to facilitate this research. They are Backpac, a parallel backward chained rule-based reasoning system, and Datapac, a parallel forward chained rule-based reasoning system. Both systems have been written in Multilisp, a version of Lisp which contains the parallel construct, future. Applying the future function to a function causes the function to become a task parallel to the spawning task. Additionally, Backpac and Datapac have been run on several disparate parallel processors. The machines are an Encore Multimax with 10 processors, the Concert Multiprocessor with 64 processors, and a 32-processor BBN GP1000. Both the Concert and the GP1000 are switch-based machines. The Multimax has all its processors hung off a common bus. All are shared memory machines, but have different schemes for sharing the memory and different locales for the shared memory. The main results of the investigations come from experiments on the 10-processor Encore and the Concert with partitions of 32 or fewer processors. Additionally, experiments have been run with a stripped down version of EMYCIN.
Parallel discrete event simulation: A shared memory approach

NASA Technical Reports Server (NTRS)

Reed, Daniel A.; Malony, Allen D.; Mccredie, Bradley D.

1987-01-01

With traditional event list techniques, evaluating a detailed discrete event simulation model can often require hours or even days of computation time. Parallel simulation mimics the interacting servers and queues of a real system by assigning each simulated entity to a processor. By eliminating the event list and maintaining only sufficient synchronization to ensure causality, parallel simulation can potentially provide speedups that are linear in the number of processors. A set of shared memory experiments is presented using the Chandy-Misra distributed simulation algorithm to simulate networks of queues. Parameters include queueing network topology and routing probabilities, number of processors, and assignment of network nodes to processors. These experiments show that Chandy-Misra distributed simulation is a questionable alternative to sequential simulation of most queueing network models.
Low latency memory access and synchronization

DOE Office of Scientific and Technical Information (OSTI.GOV)

Blumrich, Matthias A.; Chen, Dong; Coteus, Paul W.

A low latency memory system access is provided in association with a weakly-ordered multiprocessor system. Each processor in the multiprocessor shares resources, and each shared resource has an associated lock within a locking device that provides support for synchronization between the multiple processors in the multiprocessor and the orderly sharing of the resources. A processor only has permission to access a resource when it owns the lock associated with that resource, and an attempt by a processor to own a lock requires only a single load operation, rather than a traditional atomic load followed by store, such that the processor only performs a read operation and the hardware locking device performs a subsequent write operation rather than the processor. A simple prefetching for non-contiguous data structures is also disclosed. A memory line is redefined so that in addition to the normal physical memory data, every line includes a pointer that is large enough to point to any other line in the memory, wherein the pointers determine which memory line to prefetch, rather than some other predictive algorithm. This enables hardware to effectively prefetch memory access patterns that are non-contiguous, but repetitive.
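The single-load lock acquisition described in this abstract can be modeled in software as a device whose read operation has a side effect: the caller only reads, and the device itself marks the lock as owned. This is a minimal sketch with hypothetical names; the `threading.Lock` stands in for the atomicity that the hardware locking device would provide, and the `release` operation is an assumption, since the abstract does not describe how a lock is given up.

```python
import threading


class LockDevice:
    """Toy model of a hardware locking device with read-to-acquire locks."""

    def __init__(self, n_locks):
        self._owned = [False] * n_locks
        self._guard = threading.Lock()  # models the device's internal atomicity

    def load(self, lock_id):
        """A single load: returns True if the caller now owns the lock.

        The processor issues only this read; the follow-up write that
        marks the lock as taken is performed by the device.
        """
        with self._guard:
            if self._owned[lock_id]:
                return False
            self._owned[lock_id] = True
            return True

    def release(self, lock_id):
        """Assumed release operation, not described in the abstract."""
        with self._guard:
            self._owned[lock_id] = False


if __name__ == "__main__":
    device = LockDevice(4)
    print(device.load(2))  # first reader acquires the lock
    print(device.load(2))  # second reader is refused
```

Compared with an atomic load followed by a store, the processor makes one trip to the device instead of two, which is where the latency saving comes from.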
A message passing kernel for the hypercluster parallel processing test bed

NASA Technical Reports Server (NTRS)

Blech, Richard A.; Quealy, Angela; Cole, Gary L.

1989-01-01

A Message-Passing Kernel (MPK) for the Hypercluster parallel-processing test bed is described. The Hypercluster is being developed at the NASA Lewis Research Center to support investigations of parallel algorithms and architectures for computational fluid and structural mechanics applications. The Hypercluster resembles the hypercube architecture except that each node consists of multiple processors communicating through shared memory. The MPK efficiently routes information through the Hypercluster, using a message-passing protocol when necessary and faster shared-memory communication whenever possible. The MPK also interfaces all of the processors with the Hypercluster operating system (HYCLOPS), which runs on a Front-End Processor (FEP). This approach distributes many of the I/O tasks to the Hypercluster processors and eliminates the need for a separate I/O support program on the FEP.
50 CFR 680.40 - Crab Quota Share (QS), Processor QS (PQS), Individual Fishing Quota (IFQ), and Individual...

Code of Federal Regulations, 2010 CFR

2010-10-01

... 50 Wildlife and Fisheries 9 2010-10-01 2010-10-01 false Crab Quota Share (QS), Processor QS (PQS... established based on the regional designations determined on August 1, 2005. QS or PQS issued after this date... information is true, correct, and complete to the best of his/her knowledge and belief. If the application is...
Parallel discrete event simulation using shared memory

NASA Technical Reports Server (NTRS)

Reed, Daniel A.; Malony, Allen D.; Mccredie, Bradley D.

1988-01-01

With traditional event-list techniques, evaluating a detailed discrete-event simulation model can often require hours or even days of computation time. By eliminating the event list and maintaining only sufficient synchronization to ensure causality, parallel simulation can potentially provide speedups that are linear in the number of processors. A set of shared-memory experiments that use the Chandy-Misra distributed-simulation algorithm to simulate networks of queues is presented. Parameters of the study include queueing network topology and routing probabilities, number of processors, and assignment of network nodes to processors. These experiments show that Chandy-Misra distributed simulation is a questionable alternative to sequential simulation of most queueing network models.
Parallelising a molecular dynamics algorithm on a multi-processor workstation

NASA Astrophysics Data System (ADS)

Müller-Plathe, Florian

1990-12-01

The Verlet neighbour-list algorithm is parallelised for a multi-processor Hewlett-Packard/Apollo DN10000 workstation. The implementation makes use of memory shared between the processors. It is a genuine master-slave approach by which most of the computational tasks are kept in the master process and the slaves are only called to do part of the nonbonded forces calculation. The implementation features elements of both fine-grain and coarse-grain parallelism. Apart from three calls to library routines, two of which are standard UNIX calls, and two machine-specific language extensions, the whole code is written in standard Fortran 77. Hence, it may be expected that this parallelisation concept can be transferred in parts or as a whole to other multi-processor shared-memory computers. The parallel code is routinely used in production work.
Multiprocessing on supercomputers for computational aerodynamics

NASA Technical Reports Server (NTRS)

Yarrow, Maurice; Mehta, Unmeel B.

1990-01-01

Very little use is made of multiple processors available on current supercomputers (computers with a theoretical peak performance capability equal to 100 MFLOPs or more) in computational aerodynamics to significantly improve turnaround time. The productivity of a computer user is directly related to this turnaround time. In a time-sharing environment, the improvement in this speed is achieved when multiple processors are used efficiently to execute an algorithm. The concept of multiple instructions and multiple data (MIMD) through multi-tasking is applied via a strategy which requires relatively minor modifications to an existing code for a single processor. Essentially, this approach maps the available memory to multiple processors, exploiting the C-FORTRAN-Unix interface. The existing single processor code is mapped without the need for developing a new algorithm. The procedure for building a code utilizing this approach is automated with the Unix stream editor. As a demonstration of this approach, a Multiple Processor Multiple Grid (MPMG) code is developed. It is capable of using nine processors, and can be easily extended to a larger number of processors. This code solves the three-dimensional, Reynolds averaged, thin-layer and slender-layer Navier-Stokes equations with an implicit, approximately factored and diagonalized method. The solver is applied to a generic oblique-wing aircraft problem on a four-processor Cray-2 computer. A tricubic interpolation scheme is developed to increase the accuracy of coupling of overlapped grids. For the oblique-wing aircraft problem, a speedup of two in elapsed (turnaround) time is observed in a saturated time-sharing environment.
Vascular system modeling in parallel environment - distributed and shared memory approaches

PubMed Central

Jurczuk, Krzysztof; Kretowski, Marek; Bezy-Wendling, Johanne

2011-01-01

The paper presents two approaches in parallel modeling of vascular system development in internal organs. In the first approach, new parts of tissue are distributed among processors and each processor is responsible for perfusing its assigned parts of tissue to all vascular trees. Communication between processors is accomplished by passing messages and therefore this algorithm is perfectly suited for distributed memory architectures. The second approach is designed for shared memory machines. It parallelizes the perfusion process during which individual processing units perform calculations concerning different vascular trees. The experimental results, performed on a computing cluster and multi-core machines, show that both algorithms provide a significant speedup. PMID:21550891
Error recovery in shared memory multiprocessors using private caches

NASA Technical Reports Server (NTRS)

Wu, Kun-Lung; Fuchs, W. Kent; Patel, Janak H.

1990-01-01

The problem of recovering from processor transient faults in shared memory multiprocessor systems is examined. A user-transparent checkpointing and recovery scheme using private caches is presented. Processes can recover from errors due to faulty processors by restarting from the checkpointed computation state. Implementation techniques using checkpoint identifiers and recovery stacks are examined as a means of reducing performance degradation in processor utilization during normal execution. This cache-based checkpointing technique prevents rollback propagation, provides rapid recovery, and can be integrated into standard cache coherence protocols. An analytical model is used to estimate the relative performance of the scheme during normal execution. Extensions to take error latency into account are presented.
System and method for programmable bank selection for banked memory subsystems

DOEpatents

Blumrich, Matthias A.; Chen, Dong; Gara, Alan G.; Giampapa, Mark E.; Hoenicke, Dirk; Ohmacht, Martin; Salapura, Valentina; Sugavanam, Krishnan

2010-09-07

A programmable memory system and method for enabling one or more processor devices access to shared memory in a computing environment, the shared memory including one or more memory storage structures having addressable locations for storing data. The system comprises: one or more first logic devices associated with a respective one or more processor devices, each first logic device for receiving physical memory address signals and programmable for generating a respective memory storage structure select signal upon receipt of pre-determined address bit values at selected physical memory address bit locations; and, a second logic device responsive to each of the respective select signal for generating an address signal used for selecting a memory storage structure for processor access. The system thus enables each processor device of a computing environment memory storage access distributed across the one or more memory storage structures.
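The select-signal idea in this abstract can be illustrated with a minimal sketch. It assumes a hypothetical helper, `select_bank`, that gathers a programmable set of physical-address bit positions into a bank index. The function name, the bit positions, and the four-bank layout are illustrative assumptions and are not taken from the patent.

```python
def select_bank(phys_addr: int, bit_positions: list[int]) -> int:
    """Build a bank index from the chosen physical-address bits.

    bit_positions[0] supplies the least significant bit of the index.
    """
    bank = 0
    for i, pos in enumerate(bit_positions):
        bank |= ((phys_addr >> pos) & 1) << i
    return bank


# Hypothetical configuration: four banks selected by address bits 7 and 8,
# so consecutive 128-byte blocks rotate across the banks.
BANK_BITS = [7, 8]

if __name__ == "__main__":
    for addr in (0x000, 0x080, 0x100, 0x180, 0x200):
        print(hex(addr), "-> bank", select_bank(addr, BANK_BITS))
```

Changing `BANK_BITS` changes how addresses spread across the storage structures, which is the sense in which the selection is programmable in this sketch.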
Solutions and debugging for data consistency in multiprocessors with noncoherent caches

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bernstein, D.; Mendelson, B.; Breternitz, M. Jr.

1995-02-01

We analyze two important problems that arise in shared-memory multiprocessor systems. The stale data problem involves ensuring that data items in local memory of individual processors are current, independent of writes done by other processors. False sharing occurs when two processors have copies of the same shared data block but update different portions of the block. The false sharing problem involves guaranteeing that subsequent writes are properly combined. In modern architectures these problems are usually solved in hardware, by exploiting mechanisms for hardware controlled cache consistency. This leads to more expensive and nonscalable designs. Therefore, we are concentrating on software methods for ensuring cache consistency that would allow for affordable and scalable multiprocessing systems. Unfortunately, providing software control is nontrivial, both for the compiler writer and for the application programmer. For this reason we are developing a debugging environment that will facilitate the development of compiler-based techniques and will help the programmer to tune his or her application using explicit cache management mechanisms. We extend the notion of a race condition for IBM Shared Memory System POWER/4, taking into consideration its noncoherent caches, and propose techniques for detection of false sharing problems. Identification of the stale data problem is discussed as well, and solutions are suggested.
Performance evaluation of throughput computing workloads using multi-core processors and graphics processors

NASA Astrophysics Data System (ADS)

Dave, Gaurav P.; Sureshkumar, N.; Blessy Trencia Lincy, S. S.

2017-11-01

The current trend in processor manufacturing focuses on multi-core architectures rather than on increasing the clock speed to improve performance. Graphics processors have become commodity hardware that provides fast co-processing in computer systems. Developments in IoT, social networking web applications, and big data have created a huge demand for data processing, and such throughput-intensive applications inherently contain data-level parallelism, which is well suited to SIMD-based GPUs. This paper reviews the architectural aspects of multi/many-core processors and graphics processors. Different case studies are used to compare the performance of throughput computing applications using shared memory programming in OpenMP and CUDA API based programming.
An efficient ASIC implementation of 16-channel on-line recursive ICA processor for real-time EEG system.

PubMed

Fang, Wai-Chi; Huang, Kuan-Ju; Chou, Chia-Ching; Chang, Jui-Chung; Cauwenberghs, Gert; Jung, Tzyy-Ping

2014-01-01

This is a proposal for an efficient very-large-scale integration (VLSI) design, 16-channel on-line recursive independent component analysis (ORICA) processor ASIC for real-time EEG system, implemented with TSMC 40 nm CMOS technology. ORICA is appropriate to be used in real-time EEG system to separate artifacts because of its highly efficient and real-time process features. The proposed ORICA processor is composed of an ORICA processing unit and a singular value decomposition (SVD) processing unit. Compared with previous work [1], this proposed ORICA processor has enhanced effectiveness and reduced hardware complexity by utilizing a deeper pipeline architecture, shared arithmetic processing unit, and shared registers. The 16-channel random signals which contain 8-channel super-Gaussian and 8-channel sub-Gaussian components are used to analyze the dependence of the source components, and the average correlation coefficient is 0.95452 between the original source signals and extracted ORICA signals. Finally, the proposed ORICA processor ASIC is implemented with TSMC 40 nm CMOS technology, and it consumes 15.72 mW at 100 MHz operating frequency.
Direct access inter-process shared memory

DOEpatents

Brightwell, Ronald B; Pedretti, Kevin; Hudson, Trammell B

2013-10-22

A technique for directly sharing physical memory between processes executing on processor cores is described. The technique includes loading a plurality of processes into the physical memory for execution on a corresponding plurality of processor cores sharing the physical memory. An address space is mapped to each of the processes by populating a first entry in a top level virtual address table for each of the processes. The address space of each of the processes is cross-mapped into each of the processes by populating one or more subsequent entries of the top level virtual address table with the first entry in the top level virtual address table from other processes.
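The cross-mapping step can be pictured with a toy model. This is a sketch only: the top-level virtual address table is modelled as a Python list of labels, `build_top_level_tables` is a hypothetical name, and the order in which the other processes' entries are appended is an assumption the abstract does not specify.

```python
def build_top_level_tables(num_processes: int) -> list[list[str]]:
    """Return one toy top-level table per process.

    Entry 0 of each table stands for the process's own address space.
    The remaining entries are copies of entry 0 from every other process,
    so each process can reach the others' memory directly.
    """
    own_entry = [f"space_of_P{p}" for p in range(num_processes)]
    tables = []
    for p in range(num_processes):
        table = [own_entry[p]]                       # first entry: own space
        table += [own_entry[q]                       # subsequent entries:
                  for q in range(num_processes)      # first entries of the
                  if q != p]                         # other processes
        tables.append(table)
    return tables


if __name__ == "__main__":
    for p, table in enumerate(build_top_level_tables(3)):
        print(f"P{p}:", table)
```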
Fast and Accurate Simulation of the Cray XMT Multithreaded Supercomputer

DOE Office of Scientific and Technical Information (OSTI.GOV)

Villa, Oreste; Tumeo, Antonino; Secchi, Simone

Irregular applications, such as data mining and analysis or graph-based computations, show unpredictable memory/network access patterns and control structures. Highly multithreaded architectures with large processor counts, like the Cray MTA-1, MTA-2 and XMT, appear to address their requirements better than commodity clusters. However, the research on highly multithreaded systems is currently limited by the lack of adequate architectural simulation infrastructures due to issues such as size of the machines, memory footprint, simulation speed, accuracy and customization. At the same time, Shared-memory MultiProcessors (SMPs) with multi-core processors have become an attractive platform to simulate large scale machines. In this paper, we introduce a cycle-level simulator of the highly multithreaded Cray XMT supercomputer. The simulator runs unmodified XMT applications. We discuss how we tackled the challenges posed by its development, detailing the techniques introduced to make the simulation as fast as possible while maintaining a high accuracy. By mapping XMT processors (ThreadStorm with 128 hardware threads) to host computing cores, the simulation speed remains constant as the number of simulated processors increases, up to the number of available host cores. The simulator supports zero-overhead switching among different accuracy levels at run-time and includes a network model that takes into account contention. On a modern 48-core SMP host, our infrastructure simulates a large set of irregular applications 500 to 2000 times slower than real time when compared to a 128-processor XMT, while remaining within 10% of accuracy. Emulation is only from 25 to 200 times slower than real time.
High-performance computing — an overview

NASA Astrophysics Data System (ADS)

Marksteiner, Peter

1996-08-01

An overview of high-performance computing (HPC) is given. Different types of computer architectures used in HPC are discussed: vector supercomputers, high-performance RISC processors, various parallel computers like symmetric multiprocessors, workstation clusters, massively parallel processors. Software tools and programming techniques used in HPC are reviewed: vectorizing compilers, optimization and vector tuning, optimization for RISC processors; parallel programming techniques like shared-memory parallelism, message passing and data parallelism; and numerical libraries.

A Parallel Rendering Algorithm for MIMD Architectures

NASA Technical Reports Server (NTRS)

Crockett, Thomas W.; Orloff, Tobias

1991-01-01

Applications such as animation and scientific visualization demand high performance rendering of complex three dimensional scenes. To deliver the necessary rendering rates, highly parallel hardware architectures are required. The challenge is then to design algorithms and software which effectively use the hardware parallelism. A rendering algorithm targeted to distributed memory MIMD architectures is described. For maximum performance, the algorithm exploits both object-level and pixel-level parallelism. The behavior of the algorithm is examined both analytically and experimentally. Its performance for large numbers of processors is found to be limited primarily by communication overheads. An experimental implementation for the Intel iPSC/860 shows increasing performance from 1 to 128 processors across a wide range of scene complexities. It is shown that minimal modifications to the algorithm will adapt it for use on shared memory architectures as well.
An Adaptive Insertion and Promotion Policy for Partitioned Shared Caches

NASA Astrophysics Data System (ADS)

Mahrom, Norfadila; Liebelt, Michael; Raof, Rafikha Aliana A.; Daud, Shuhaizar; Hafizah Ghazali, Nur

2018-03-01

Cache replacement policies in chip multiprocessors (CMP) have been investigated extensively and proven able to enhance shared cache management. However, competition among multiple processors executing different threads that require simultaneous access to a shared memory may cause cache contention and memory coherence problems on the chip. These issues are aggravated by drawbacks of the commonly used Least Recently Used (LRU) policy in multiprocessor systems, under which cache lines reside in the cache longer than required. Image processing analysis, for example of extra-pulmonary tuberculosis (TB), requires an accurate diagnosis of tissue specimens, so a fast and reliable shared memory management system is needed to execute algorithms that process vast numbers of specimen images. In this paper, the effects of the cache replacement policy in a partitioned shared cache are investigated. The goal is to quantify whether better performance can be achieved by using less complex replacement strategies. This paper proposes a Middle Insertion 2 Positions Promotion (MI2PP) policy to eliminate cache misses that could adversely affect the access patterns and the throughput of the processors in the system. The policy employs a static predefined insertion point, near distance promotion, and the concept of ownership in the eviction policy to reduce cache thrashing and to avoid resource stealing among the processors.
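A minimal sketch of middle insertion with two-position promotion, for a single cache set, is given here. It is an illustration under stated assumptions, not the authors' implementation: the class name `MiddleInsertSet` is hypothetical, the insertion point is assumed to be the middle of the recency stack, eviction is taken from the LRU end, and the ownership-aware eviction mentioned in the abstract is not modelled.

```python
class MiddleInsertSet:
    """One cache set kept as a recency stack (index 0 = MRU, last = LRU)."""

    def __init__(self, ways: int, promote_by: int = 2):
        self.ways = ways
        self.promote_by = promote_by
        self.stack: list[int] = []

    def access(self, tag: int) -> bool:
        """Touch a line; return True on a hit and False on a miss."""
        if tag in self.stack:
            pos = self.stack.index(tag)
            self.stack.pop(pos)
            # Near-distance promotion: move up by a fixed number of positions
            # instead of jumping straight to the MRU end as plain LRU would.
            self.stack.insert(max(0, pos - self.promote_by), tag)
            return True
        if len(self.stack) == self.ways:
            self.stack.pop()  # set is full: evict the line at the LRU end
        # Static insertion point: the middle of the stack (assumed).
        self.stack.insert(min(self.ways // 2, len(self.stack)), tag)
        return False


if __name__ == "__main__":
    cache_set = MiddleInsertSet(ways=4)
    for tag in (1, 2, 3, 4, 3, 5):
        outcome = "hit" if cache_set.access(tag) else "miss"
        print(tag, outcome, cache_set.stack)
```

Compared with plain LRU, a line that is touched only once never reaches the MRU end in this sketch, so it leaves the set sooner.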
Asynchronous Communication Scheme For Hypercube Computer

NASA Technical Reports Server (NTRS)

Madan, Herb S.

1988-01-01

Scheme devised for asynchronous-message communication system for Mark III hypercube concurrent-processor network. Network consists of up to 1,024 processing elements connected electrically as though they were at corners of 10-dimensional cube. Each node contains two Motorola 68020 processors along with Motorola 68881 floating-point processor utilizing up to 4 megabytes of shared dynamic random-access memory. Scheme intended to support applications requiring passage of both polled or solicited and unsolicited messages.
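The 10-dimensional cube wiring can be made concrete with a short sketch. It assumes the usual convention, not stated in the abstract, that nodes carry binary labels and that two nodes are linked when their labels differ in exactly one bit. Both function names are illustrative.

```python
def hypercube_neighbors(node: int, dim: int = 10) -> list[int]:
    """Labels of the nodes directly linked to `node` in a dim-cube."""
    return [node ^ (1 << d) for d in range(dim)]


def hop_distance(a: int, b: int) -> int:
    """Minimum number of links between two nodes (Hamming distance)."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    print("nodes in a 10-cube:", 2 ** 10)
    print("neighbors of node 0:", hypercube_neighbors(0))
    print("hops from 0 to 1023:", hop_distance(0, 1023))
```

With ten dimensions this gives 1,024 corners, each with ten links, and no two nodes are more than ten hops apart.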
Tactical Operations Analysis Support Facility.

DTIC Science & Technology

1981-05-01

Punch/Reader 2 DMC-11AR DDCMP Micro Processor 2 DMC-11DA Network Link Line Unit 2 DL-11E Async Serial Line Interface 4 Intel IN-1670 448K Words MOS Memory...86 5.3 VIRTUAL PROCESSORS - VAX-11/750 ........................... 89 5.4 A RELATIONAL DATA MANAGEMENT SYSTEM - ORACLE...Central Processing Unit (CPU) is a 16 bit processor for high-speed, real time applications, and for large multi-user, multi- task, time shared
DMA shared byte counters in a parallel computer

DOEpatents

Chen, Dong; Gara, Alan G.; Heidelberger, Philip; Vranas, Pavlos

2010-04-06

A parallel computer system is constructed as a network of interconnected compute nodes. Each of the compute nodes includes at least one processor, a memory and a DMA engine. The DMA engine includes a processor interface for interfacing with the at least one processor, DMA logic, a memory interface for interfacing with the memory, a DMA network interface for interfacing with the network, injection and reception byte counters, injection and reception FIFO metadata, and status registers and control registers. The injection FIFOs maintain memory locations of the injection FIFO metadata memory locations including its current head and tail, and the reception FIFOs maintain the reception FIFO metadata memory locations including its current head and tail. The injection byte counters and reception byte counters may be shared between messages.
Methodology for fast detection of false sharing in threaded scientific codes

DOEpatents

Chung, I-Hsin; Cong, Guojing; Murata, Hiroki; Negishi, Yasushi; Wen, Hui-Fang

2014-11-25

A profiling tool identifies a code region with a false sharing potential. A static analysis tool classifies variables and arrays in the identified code region. A mapping detection library correlates memory access instructions in the identified code region with variables and arrays in the identified code region while a processor is running the identified code region. The mapping detection library identifies one or more instructions at risk, in the identified code region, which are subject to an analysis by a false sharing detection library. A false sharing detection library performs a run-time analysis of the one or more instructions at risk while the processor is re-running the identified code region. The false sharing detection library determines, based on the performed run-time analysis, whether two different portions of the cache memory line are accessed by the generated binary code.
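The final check described above, whether different portions of one cache line are touched, can be sketched on a recorded access trace. This is a simplified illustration, not the patented tool: `find_false_sharing`, the 64-byte line size, and the trace format are all assumptions, and a line is flagged only when at least two threads touch it, at least one access is a write, and the threads' byte offsets never overlap.

```python
from collections import defaultdict

LINE_SIZE = 64  # assumed cache-line size in bytes


def find_false_sharing(accesses):
    """Return sorted cache-line numbers that look falsely shared.

    accesses: iterable of (thread_id, byte_address, is_write) tuples.
    """
    offsets = defaultdict(lambda: defaultdict(set))  # line -> thread -> offsets
    written = set()                                  # lines with any write
    for thread_id, addr, is_write in accesses:
        line = addr // LINE_SIZE
        offsets[line][thread_id].add(addr % LINE_SIZE)
        if is_write:
            written.add(line)

    suspects = []
    for line, by_thread in offsets.items():
        if line not in written or len(by_thread) < 2:
            continue
        sets = list(by_thread.values())
        # False sharing: the threads use disjoint portions of the same line.
        if all(a.isdisjoint(b)
               for i, a in enumerate(sets) for b in sets[i + 1:]):
            suspects.append(line)
    return sorted(suspects)


if __name__ == "__main__":
    trace = [
        (0, 0, True), (1, 8, True),       # line 0: different offsets
        (0, 128, True), (1, 128, True),   # line 2: same offset, true sharing
        (0, 256, False), (1, 264, False), # line 4: read-only
    ]
    print(find_false_sharing(trace))
```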
Testing the Tester: Lessons Learned During the Testing of a State-of-the-Art Commercial 14nm Processor Under Proton Irradiation

NASA Technical Reports Server (NTRS)

Szabo, Carl M., Jr.; Duncan, Adam R.; Label, Kenneth A.

2017-01-01

Testing of an Intel 14nm desktop processor was conducted under proton irradiation. We share lessons learned, demonstrating that complex devices beget further complex challenges requiring practical and theoretical investigative expertise to solve.
Applications considerations in the system design of highly concurrent multiprocessors

NASA Technical Reports Server (NTRS)

Lundstrom, Stephen F.

1987-01-01

A flow model processor approach to parallel processing is described, using very-high-performance individual processors, high-speed circuit switched interconnection networks, and a high-speed synchronization capability to minimize the effect of the inherently serial portions of applications on performance. Design studies related to the determination of the number of processors, the memory organization, and the structure of the networks used to interconnect the processor and memory resources are discussed. Simulations indicate that applications centered on the large shared data memory should be able to sustain over 500 million floating point operations per second.
Reader set encoding for directory of shared cache memory in multiprocessor system

DOEpatents

Ahn, Daniel; Ceze, Luis H.; Gara, Alan; Ohmacht, Martin; Xiaotong, Zhuang

2014-06-10

In a parallel processing system with speculative execution, conflict checking occurs in a directory lookup of a cache memory that is shared by all processors. In each case, the same physical memory address will map to the same set of that cache, no matter which processor originated that access. The directory includes a dynamic reader set encoding, indicating what speculative threads have read a particular line. This reader set encoding is used in conflict checking. A bitset encoding is used to specify particular threads that have read the line.
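A bitset reader set can be sketched in a few lines. The sketch assumes one directory entry per cache line, with bit i set when speculative thread i has read the line. `DirectoryEntry` and the conflict rule used here (a write conflicts with recorded reads by other threads) are simplifying assumptions. The abstract says only that the reader set is used in conflict checking.

```python
class DirectoryEntry:
    """Toy directory entry holding a bitset of speculative readers."""

    def __init__(self) -> None:
        self.readers = 0  # bit i set => speculative thread i read the line

    def record_read(self, thread_id: int) -> None:
        self.readers |= 1 << thread_id

    def conflicting_readers(self, writer_id: int) -> list[int]:
        """Threads other than the writer that have read this line."""
        others = self.readers & ~(1 << writer_id)
        return [i for i in range(others.bit_length()) if (others >> i) & 1]


if __name__ == "__main__":
    entry = DirectoryEntry()
    entry.record_read(1)
    entry.record_read(3)
    print(bin(entry.readers), entry.conflicting_readers(writer_id=3))
```

One machine word per line is enough for as many threads as the word has bits, which keeps the lookup cheap in this model.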
Cache-based error recovery for shared memory multiprocessor systems

NASA Technical Reports Server (NTRS)

Wu, Kun-Lung; Fuchs, W. Kent; Patel, Janak H.

1989-01-01

A multiprocessor cache-based checkpointing and recovery scheme for recovering from transient processor errors in a shared-memory multiprocessor with private caches is presented. New implementation techniques that use checkpoint identifiers and recovery stacks to reduce performance degradation in processor utilization during normal execution are examined. This cache-based checkpointing technique prevents rollback propagation, provides for rapid recovery, and can be integrated into standard cache coherence protocols. An analytical model is used to estimate the relative performance of the scheme during normal execution. Extensions that take error latency into account are presented.
Fault-Tolerant, Real-Time, Multi-Core Computer System

NASA Technical Reports Server (NTRS)

Gostelow, Kim P.

2012-01-01

A document discusses a fault-tolerant, self-aware, low-power, multi-core computer for space missions with thousands of simple cores, achieving speed through concurrency. The proposed machine decides how to achieve concurrency in real time, rather than depending on programmers. The driving features of the system are simple hardware that is modular in the extreme, with no shared memory, and software with significant runtime reorganizing capability. The document describes a mechanism for moving ongoing computations and data that is based on a functional model of execution. Because there is no shared memory, the processor connects to its neighbors through a high-speed data link. Messages are sent to a neighbor switch, which in turn forwards that message on to its neighbor until reaching the intended destination. Except for the neighbor connections, processors are isolated and independent of each other. The processors on the periphery also connect chip-to-chip, thus building up a large processor net. There is no particular topology to the larger net, as a function at each processor allows it to forward a message in the correct direction. Some chip-to-chip connections are not necessarily nearest neighbors, providing short cuts for some of the longer physical distances. The peripheral processors also provide the connections to sensors, actuators, radios, science instruments, and other devices with which the computer system interacts.
PANDA: A distributed multiprocessor operating system

DOE Office of Scientific and Technical Information (OSTI.GOV)

Chubb, P.

1989-01-01

PANDA is a design for a distributed multiprocessor and an operating system. PANDA is designed to allow easy expansion of both hardware and software. As such, the PANDA kernel provides only message passing and memory and process management. The other features needed for the system (device drivers, secondary storage management, etc.) are provided as replaceable user tasks. The thesis presents PANDA's design and implementation, both hardware and software. PANDA uses multiple 68010 processors sharing memory on a VME bus, each such node potentially connected to others via a high speed network. The machine is completely homogeneous: there are no differences between processors that are detectable by programs running on the machine. A single two-processor node has been constructed. Each processor contains memory management circuits designed to allow processors to share page tables safely. PANDA presents a programmers' model similar to the hardware model: a job is divided into multiple tasks, each having its own address space. Within each task, multiple processes share code and data. Tasks can send messages to each other, and set up virtual circuits between themselves. Peripheral devices such as disc drives are represented within PANDA by tasks. PANDA divides secondary storage into volumes, each volume being accessed by a volume access task, or VAT. All knowledge about the way that data is stored on a disc is kept in its volume's VAT. The design is such that PANDA should provide a useful testbed for file systems and device drivers, as these can be installed without recompiling PANDA itself, and without rebooting the machine.
Contention Modeling for Multithreaded Distributed Shared Memory Machines: The Cray XMT

DOE Office of Scientific and Technical Information (OSTI.GOV)

Secchi, Simone; Tumeo, Antonino; Villa, Oreste

Distributed Shared Memory (DSM) machines are a wide class of multi-processor computing systems where a large virtually-shared address space is mapped on a network of physically distributed memories. High memory latency and network contention are two of the main factors that limit performance scaling of such architectures. Modern high-performance computing DSM systems have evolved toward exploitation of massive hardware multi-threading and fine-grained memory hashing to tolerate irregular latencies, avoid network hot-spots and enable high scaling. In order to model the performance of such large-scale machines, parallel simulation has been proved to be a promising approach to achieve good accuracy in reasonable times. One of the most critical factors in solving the simulation speed-accuracy trade-off is network modeling. The Cray XMT is a massively multi-threaded supercomputing architecture that belongs to the DSM class, since it implements a globally-shared address space abstraction on top of a physically distributed memory substrate. In this paper, we discuss the development of a contention-aware network model intended to be integrated in a full-system XMT simulator. We start by measuring the effects of network contention in a 128-processor XMT machine and then investigate the trade-off that exists between simulation accuracy and speed, by comparing three network models which operate at different levels of accuracy. The comparison and model validation is performed by executing a string-matching algorithm on the full-system simulator and on the XMT, using three datasets that generate noticeably different contention patterns.
An experimental distributed microprocessor implementation with a shared memory communications and control medium

NASA Technical Reports Server (NTRS)

Mejzak, R. S.

1980-01-01

The distributed processing concept is defined in terms of control primitives, variables, and structures and their use in performing a decomposed discrete Fourier transform (DFT) application function. The design assumes interprocessor communications to be anonymous. In this scheme, all processors can access an entire common database by employing control primitives. Access to selected areas within the common database is random, enforced by a hardware lock, and determined by task and subtask pointers. This enables the number of processors to be varied in the configuration without any modifications to the control structure. Decompositional elements of the DFT application function in terms of tasks and subtasks are also described. The experimental hardware configuration consists of IMSAI 8080 chassis which are independent, 8 bit microcomputer units. These chassis are linked together to form a multiple processing system by means of a shared memory facility. This facility consists of hardware which provides a bus structure to enable up to six microcomputers to be interconnected. It provides polling and arbitration logic so that only one processor has access to shared memory at any one time.
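The control scheme described above, in which anonymous workers pull subtasks through a locked shared pointer, can be sketched with threads. This is an illustration under assumptions, not the reported implementation: a software `threading.Lock` stands in for the hardware lock, each subtask is taken to be one output bin of the transform, and `parallel_dft` is a hypothetical name.

```python
import cmath
import threading


def parallel_dft(samples, num_workers):
    """Compute a DFT with workers that claim output bins from shared state."""
    n = len(samples)
    result = [0j] * n          # shared "common database" for the outputs
    next_bin = 0               # shared subtask pointer
    lock = threading.Lock()    # stands in for the hardware lock

    def worker():
        nonlocal next_bin
        while True:
            with lock:                 # only one worker updates the pointer
                k = next_bin
                if k >= n:
                    return             # no subtasks left
                next_bin += 1
            # Subtask: one output bin, computed outside the lock.
            result[k] = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                            for t, x in enumerate(samples))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return result


if __name__ == "__main__":
    print(parallel_dft([1, 1, 1, 1], num_workers=2))
```

Because workers never address each other and only follow the shared pointer, `num_workers` can change without touching the control logic, which mirrors the property claimed in the abstract.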
Fault tolerant onboard packet switch architecture for communication satellites: Shared memory per beam approach

NASA Technical Reports Server (NTRS)

Shalkhauser, Mary Jo; Quintana, Jorge A.; Soni, Nitin J.

1994-01-01

The NASA Lewis Research Center is developing a multichannel communication signal processing satellite (MCSPS) system which will provide low data rate, direct to user, commercial communications services. The focus of current space segment developments is a flexible, high-throughput, fault tolerant onboard information switching processor. This information switching processor (ISP) is a destination-directed packet switch which performs both space and time switching to route user information among numerous user ground terminals. Through both industry study contracts and in-house investigations, several packet switching architectures were examined. A contention-free approach, the shared memory per beam architecture, was selected for implementation. The shared memory per beam architecture, fault tolerance insertion, implementation, and demonstration plans are described.
Computational performance of a smoothed particle hydrodynamics simulation for shared-memory parallel computing

NASA Astrophysics Data System (ADS)

Nishiura, Daisuke; Furuichi, Mikito; Sakaguchi, Hide

2015-09-01

The computational performance of a smoothed particle hydrodynamics (SPH) simulation is investigated for three types of current shared-memory parallel computer devices: many integrated core (MIC) processors, graphics processing units (GPUs), and multi-core CPUs. We are especially interested in efficient shared-memory allocation methods for each chipset, because the efficient data access patterns differ between compute unified device architecture (CUDA) programming for GPUs and OpenMP programming for MIC processors and multi-core CPUs. We first introduce several parallel implementation techniques for the SPH code, and then examine these on our target computer architectures to determine the most effective algorithms for each processor unit. In addition, we evaluate the effective computing performance and power efficiency of the SPH simulation on each architecture, as these are critical metrics for overall performance in a multi-device environment. In our benchmark test, the GPU is found to produce the best arithmetic performance as a standalone device unit, and gives the most efficient power consumption. The multi-core CPU obtains the most effective computing performance. The computational speed of the MIC processor on Xeon Phi approached that of two Xeon CPUs. This indicates that using MICs is an attractive choice for existing SPH codes on multi-core CPUs parallelized by OpenMP, as it gains computational acceleration without the need for significant changes to the source code.
Production Level CFD Code Acceleration for Hybrid Many-Core Architectures

NASA Technical Reports Server (NTRS)

Duffy, Austen C.; Hammond, Dana P.; Nielsen, Eric J.

2012-01-01

In this work, a novel graphics processing unit (GPU) distributed sharing model for hybrid many-core architectures is introduced and employed in the acceleration of a production-level computational fluid dynamics (CFD) code. The latest generation graphics hardware allows multiple processor cores to simultaneously share a single GPU through concurrent kernel execution. This feature has allowed the NASA FUN3D code to be accelerated in parallel with up to four processor cores sharing a single GPU. For codes to scale and fully use resources on these and the next generation machines, codes will need to employ some type of GPU sharing model, as presented in this work. Findings include the effects of GPU sharing on overall performance. A discussion of the inherent challenges that parallel unstructured CFD codes face in accelerator-based computing environments is included, with considerations for future generation architectures. This work was completed by the author in August 2010, and reflects the analysis and results of the time.
Optimal processor assignment for pipeline computations

NASA Technical Reports Server (NTRS)

Nicol, David M.; Simha, Rahul; Choudhury, Alok N.; Narahari, Bhagirath

1991-01-01

The availability of large-scale multitasked parallel architectures introduces the following processor assignment problem for pipelined computations. Given a set of tasks and their precedence constraints, along with their experimentally determined individual response times for different processor sizes, find an assignment of processors to tasks. Two objectives are of interest: minimal response time given a throughput requirement, and maximal throughput given a response time requirement. These assignment problems differ considerably from the classical mapping problem in which several tasks share a processor; instead, it is assumed that a large number of processors are to be assigned to a relatively small number of tasks. Efficient assignment algorithms were developed for different classes of task structures. For a p-processor system and a series-parallel precedence graph with n constituent tasks, an O(np^2) algorithm is provided that finds the optimal assignment for the response time optimization problem; the assignment optimizing the constrained throughput is found in O(np^2 log p) time. Special cases of linear, independent, and tree graphs are also considered.
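For the special case of a linear chain of tasks, the response time problem can be sketched as a dynamic program over tasks and processor budgets. This is a simplified illustration, not the series-parallel algorithm of the paper. It assumes that the throughput requirement translates into a cap on every stage's time, that the response time of the chain is the sum of stage times, and that `times[i][q-1]` holds the measured time of task i on q processors. All names are hypothetical.

```python
import math


def min_response_time(times, p, max_stage_time):
    """Minimum summed stage time for a linear pipeline on at most p processors.

    Returns math.inf when no assignment meets the stage-time cap.
    """
    # best[b]: minimum total time for the tasks placed so far
    # using at most b processors (no tasks placed yet costs nothing).
    best = [0.0] * (p + 1)
    for task_times in times:
        new = [math.inf] * (p + 1)
        for budget in range(1, p + 1):
            for q in range(1, min(budget, len(task_times)) + 1):
                t = task_times[q - 1]
                if t <= max_stage_time:        # throughput requirement
                    new[budget] = min(new[budget], best[budget - q] + t)
        best = new
    return best[p]


if __name__ == "__main__":
    # Hypothetical measurements: two tasks, each timed on 1, 2 and 3 processors.
    measured = [[8, 5, 4], [6, 3, 2]]
    print(min_response_time(measured, p=4, max_stage_time=5))
```

The two nested loops over `budget` and `q` run for each of the n tasks, so this sketch costs O(np^2), the same bound the abstract gives for the general response time problem.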
Android Protection Mechanism: A Signed Code Security Mechanism for Smartphone Applications

DTIC Science & Technology

2011-03-01

status registers, exceptions, endian support, unaligned access support, synchronization primitives , the Jazelle Extension, and saturated integer...supports comprehensive non-blocking shared-memory synchronization primitives that scale for multiple-processor system designs. This is an improvement... synchronization . Memory semaphores can be loaded and altered without interruption because the load and store operations are atomic. Processor
System and method for memory allocation in a multiclass memory system

DOEpatents

Loh, Gabriel; Meswani, Mitesh; Ignatowski, Michael; Nutter, Mark

2016-06-28

A system for memory allocation in a multiclass memory system includes a processor coupleable to a plurality of memories sharing a unified memory address space, and a library store to store a library of software functions. The processor identifies a type of a data structure in response to a memory allocation function call to the library for allocating memory to the data structure. Using the library, the processor allocates portions of the data structure among multiple memories of the multiclass memory system based on the type of the data structure.

A High Performance VLSI Computer Architecture For Computer Graphics

NASA Astrophysics Data System (ADS)

Chin, Chi-Yuan; Lin, Wen-Tai

1988-10-01

A VLSI computer architecture consisting of multiple processors is presented in this paper to satisfy modern computer graphics demands, e.g., high resolution, realistic animation, and real-time display. All processors share a global memory which is partitioned into multiple banks. Through a crossbar network, data from one memory bank can be broadcast to many processors. Processors are physically interconnected through a hyper-crossbar network (a crossbar-like network). By programming the network, the topology of communication links among processors can be reconfigured to satisfy the specific dataflows of different applications. Each processor consists of a controller, arithmetic operators, local memory, a local crossbar network, and I/O ports to communicate with other processors, memory banks, and a system controller. Operations in each processor are classified into two modes, i.e., object domain and space domain, to fully utilize the data-independency characteristics of graphics processing. Special graphics features such as 3D-to-2D conversion, shadow generation, texturing, and reflection can be easily handled. With current high density interconnection (HDI) technology, it is feasible to implement a 64-processor system to achieve 2.5 billion operations per second, a performance needed in most advanced graphics applications.
Parallel processing on the Livermore VAX 11/780-4 parallel processor system with compatibility to Cray Research, Inc. (CRI) multitasking. Version 1

DOE Office of Scientific and Technical Information (OSTI.GOV)

Werner, N.E.; Van Matre, S.W.

1985-05-01

This manual describes the CRI Subroutine Library and Utility Package. The CRI library provides Cray multitasking functionality on the four-processor shared memory VAX 11/780-4. Additional functionality has been added for more flexibility. A discussion of the library, utilities, error messages, and example programs is provided.
Limit characteristics of digital optoelectronic processor

NASA Astrophysics Data System (ADS)

Kolobrodov, V. G.; Tymchik, G. S.; Kolobrodov, M. S.

2018-01-01

In this article, the limiting characteristics of a digital optoelectronic processor (DOEP) are explored. The limits are defined by diffraction effects and by the matrix structure of the devices for input and output of optical signals. The purpose of the present research is to optimize the parameters of the processor's components. The developed physical and mathematical model of the DOEP made it possible to establish the limit characteristics of the processor, which are restricted by diffraction effects and the array structure of the input and output equipment, and to optimize the parameters of the processor's components. The diameter of the entrance pupil of the Fourier lens is determined by the size of the SLM and the pixel size of the modulator. To determine the spectral resolution, the authors propose using the concept of an optimum phase, in which the resolved diffraction maxima coincide with the pixel centers of the radiation detector.
An enhanced Ada run-time system for real-time embedded processors

NASA Technical Reports Server (NTRS)

Sims, J. T.

1991-01-01

An enhanced Ada run-time system has been developed to support real-time embedded processor applications. The primary focus of this development effort has been on the tasking system and the memory management facilities of the run-time system. The tasking system has been extended to support efficient and precise periodic task execution as required for control applications. Event-driven task execution providing a means of task-asynchronous control and communication among Ada tasks is supported in this system. Inter-task control is even provided among tasks distributed on separate physical processors. The memory management system has been enhanced to provide object allocation and protected access support for memory shared between disjoint processors, each of which is executing a distinct Ada program.
A fully reconfigurable photonic integrated signal processor

NASA Astrophysics Data System (ADS)

Liu, Weilin; Li, Ming; Guzzon, Robert S.; Norberg, Erik J.; Parker, John S.; Lu, Mingzhi; Coldren, Larry A.; Yao, Jianping

2016-03-01

Photonic signal processing has been considered a solution to overcome the inherent electronic speed limitations. Over the past few years, an impressive range of photonic integrated signal processors have been proposed, but they usually offer limited reconfigurability, a feature highly needed for the implementation of large-scale general-purpose photonic signal processors. Here, we report and experimentally demonstrate a fully reconfigurable photonic integrated signal processor based on an InP-InGaAsP material system. The proposed photonic signal processor is capable of performing reconfigurable signal processing functions including temporal integration, temporal differentiation and Hilbert transformation. The reconfigurability is achieved by controlling the injection currents to the active components of the signal processor. Our demonstration suggests great potential for chip-scale fully programmable all-optical signal processing.
7 CFR 1435.315 - Adjustments to proportionate shares.

Code of Federal Regulations, 2010 CFR

2010-01-01

... CORPORATION, DEPARTMENT OF AGRICULTURE LOANS, PURCHASES, AND OTHER OPERATIONS SUGAR PROGRAM Flexible Marketing Allotments For Sugar § 1435.315 Adjustments to proportionate shares. Whenever CCC determines that, because of... sufficient to enable state processors to produce sufficient sugar to meet the State's cane sugar allotment...
7 CFR 1435.315 - Adjustments to proportionate shares.

Code of Federal Regulations, 2011 CFR

2011-01-01

... CORPORATION, DEPARTMENT OF AGRICULTURE LOANS, PURCHASES, AND OTHER OPERATIONS SUGAR PROGRAM Flexible Marketing Allotments For Sugar § 1435.315 Adjustments to proportionate shares. Whenever CCC determines that, because of... sufficient to enable state processors to produce sufficient sugar to meet the State's cane sugar allotment...
7 CFR 1435.315 - Adjustments to proportionate shares.

Code of Federal Regulations, 2014 CFR

2014-01-01

... CORPORATION, DEPARTMENT OF AGRICULTURE LOANS, PURCHASES, AND OTHER OPERATIONS SUGAR PROGRAM Flexible Marketing Allotments For Sugar § 1435.315 Adjustments to proportionate shares. Whenever CCC determines that, because of... sufficient to enable state processors to produce sufficient sugar to meet the State's cane sugar allotment...
7 CFR 1435.315 - Adjustments to proportionate shares.

Code of Federal Regulations, 2012 CFR

2012-01-01

... CORPORATION, DEPARTMENT OF AGRICULTURE LOANS, PURCHASES, AND OTHER OPERATIONS SUGAR PROGRAM Flexible Marketing Allotments For Sugar § 1435.315 Adjustments to proportionate shares. Whenever CCC determines that, because of... sufficient to enable state processors to produce sufficient sugar to meet the State's cane sugar allotment...
7 CFR 1435.315 - Adjustments to proportionate shares.

Code of Federal Regulations, 2013 CFR

2013-01-01

... CORPORATION, DEPARTMENT OF AGRICULTURE LOANS, PURCHASES, AND OTHER OPERATIONS SUGAR PROGRAM Flexible Marketing Allotments For Sugar § 1435.315 Adjustments to proportionate shares. Whenever CCC determines that, because of... sufficient to enable state processors to produce sufficient sugar to meet the State's cane sugar allotment...
Data traffic reduction schemes for sparse Cholesky factorizations

NASA Technical Reports Server (NTRS)

Naik, Vijay K.; Patrick, Merrell L.

1988-01-01

Load distribution schemes are presented which minimize the total data traffic in the Cholesky factorization of dense and sparse, symmetric, positive definite matrices on multiprocessor systems with local and shared memory. The total data traffic in factoring an $n \times n$ sparse, symmetric, positive definite matrix representing an $n$-vertex regular 2-D grid graph using $n^{\alpha}$ processors, $\alpha \le 1$, is shown to be $O(n^{1+\alpha/2})$. It is $O(n^{3/2})$ when $n^{\alpha}$ processors, $\alpha \ge 1$, are used. Under the conditions of uniform load distribution, these results are shown to be asymptotically optimal. The schemes allow efficient use of up to $O(n)$ processors before the total data traffic reaches the maximum value of $O(n^{3/2})$. The partitioning employed within the scheme allows better utilization of the data accessed from shared memory than previously published methods.
Development of a Dynamic Time Sharing Scheduled Environment Final Report CRADA No. TC-824-94E

DOE Office of Scientific and Technical Information (OSTI.GOV)

Jette, M.; Caliga, D.

Massively parallel computers, such as the Cray T3D, have historically supported resource sharing solely with space sharing. In that method, multiple problems are solved by executing them on distinct processors. This project developed a dynamic time- and space-sharing scheduler to achieve greater interactivity and throughput than could be achieved with space-sharing alone. CRI and LLNL worked together on the design, testing, and review aspects of this project. There were separate software deliverables. CRI implemented a general purpose scheduling system as per the design specifications. LLNL ported the local gang scheduler software to the LLNL Cray T3D. In this approach, processors are allocated simultaneously to all components of a parallel program (in a “gang”). Program execution is preempted as needed to provide for interactivity. Programs are also relocated to different processors as needed to efficiently pack the computer’s torus of processors. In phase one, CRI developed an interface specification after discussions with LLNL for system-level software supporting a time- and space-sharing environment on the LLNL T3D. The two parties also discussed interface specifications for external control tools (such as scheduling policy tools, system administration tools) and applications programs. CRI assumed responsibility for the writing and implementation of all the necessary system software in this phase. In phase two, CRI implemented job-rolling on the Cray T3D, a mechanism for preempting a program, saving its state to disk, and later restoring its state to memory for continued execution. LLNL ported its gang scheduler to the LLNL T3D utilizing the CRI interface implemented in phases one and two. During phase three, the functionality and effectiveness of the LLNL gang scheduler was assessed to provide input to CRI time- and space-sharing efforts. CRI will utilize this information in the development of general schedulers suitable for other sites and future architectures.
Parallel processing approach to transform-based image coding

NASA Astrophysics Data System (ADS)

Normile, James O.; Wright, Dan; Chu, Ken; Yeh, Chia L.

1991-06-01

This paper describes a flexible parallel processing architecture designed for use in real time video processing. The system consists of floating point DSP processors connected to each other via fast serial links; each processor has access to a globally shared memory. A multiple bus architecture in combination with a dual ported memory allows communication with a host control processor. The system has been applied to prototyping of video compression and decompression algorithms. The decomposition of transform based algorithms for decompression into a form suitable for parallel processing is described. A technique for automatic load balancing among the processors is developed and discussed, and results are presented with image statistics and data rates. Finally, techniques for accelerating the system throughput are analyzed, and results from the application of one such modification are described.
76 FR 3090 - Proposed Information Collection; Comment Request; Alaska Region; Bering Sea and Aleutian Islands...

Federal Register 2010, 2011, 2012, 2013, 2014

2011-01-19

... submitted on or before March 21, 2011. ADDRESSES: Direct all written comments to Diana Hynek, Departmental... fisheries. Program components include quota share allocation, processor quota share allocation, individual... Binding Arbitration process, and fee collection. II. Method of Collection Responses are mailed, except the...
Conditions for space invariance in optical data processors used with coherent or noncoherent light.

PubMed

Arsenault, H R

1972-10-01

The conditions for space invariance in coherent and noncoherent optical processors are considered. All linear optical processors are shown to belong to one of two types. The conditions for space invariance are more stringent for noncoherent processors than for coherent processors, so that a system that is linear in coherent light may be nonlinear in noncoherent light. However, any processor that is linear in noncoherent light is also linear in the coherent limit.
Formulation of consumables management models. Development approach for the mission planning processor working model

NASA Technical Reports Server (NTRS)

Connelly, L. C.

1977-01-01

The mission planning processor is a user oriented tool for consumables management and is part of the total consumables subsystem management concept. The approach to be used in developing a working model of the mission planning processor is documented. The approach includes top-down design, structured programming techniques, and application of NASA approved software development standards. This development approach: (1) promotes cost effective software development, (2) enhances the quality and reliability of the working model, (3) encourages the sharing of the working model through a standard approach, and (4) promotes portability of the working model to other computer systems.
Automatic film processors' quality control test in Greek military hospitals.

PubMed

Lymberis, C; Efstathopoulos, E P; Manetou, A; Poudridis, G

1993-04-01

The two major military radiology installations (Athens, Greece) using a total of 15 automatic film processors were assessed using the 21-step-wedge method. The results of quality control in all these processors are presented. The parameters measured under actual working conditions were base and fog, contrast and speed. Base and fog as well as speed displayed large variations with average values generally higher than acceptable, whilst contrast displayed greater stability. Developer temperature was measured daily during the test and was found to be outside the film manufacturers' recommended limits in nine of the 15 processors. In only one processor did film passing time vary from day to day, and this was due to maloperation. The developer pH test was not part of the daily monitoring service; it was performed every 5 days for each film processor, and the values were found to be in the range 9-12. Ten of the 15 processors presented pH values outside the limits specified by the film manufacturers.
A high-accuracy optical linear algebra processor for finite element applications

NASA Technical Reports Server (NTRS)

Casasent, D.; Taylor, B. K.

1984-01-01

Optical linear processors are computationally efficient computers for solving matrix-matrix and matrix-vector oriented problems. Optical system errors limit their dynamic range to 30-40 dB, which limits their accuracy to 9-12 bits. Large problems, such as the finite element problem in structural mechanics (with tens or hundreds of thousands of variables) which can exploit the speed of optical processors, require the 32 bit accuracy obtainable from digital machines. To obtain this required 32 bit accuracy with an optical processor, the data can be digitally encoded, thereby reducing the dynamic range requirements of the optical system (i.e., decreasing the effect of optical errors on the data) while providing increased accuracy. This report describes a new digitally encoded optical linear algebra processor architecture for solving finite element and banded matrix-vector problems. A linear static plate bending case study is described which quantifies the processor requirements. Multiplication by digital convolution is explained, and the digitally encoded optical processor architecture is advanced.
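As a rough illustration of the "multiplication by digital convolution" idea named in the abstract above: the product of two numbers can be formed by convolving their digit sequences and then weighting each output position by the matching power of the base. The Python sketch below assumes binary encoding and non-negative integers; the function names are hypothetical, and it does not model the optical architecture described in the report.

```python
def to_bits(value, width):
    """Return the binary digits of a non-negative integer, least significant first."""
    return [(value >> i) & 1 for i in range(width)]


def convolve(a, b):
    """Discrete linear convolution of two digit sequences."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def multiply_by_convolution(x, y, width=8):
    """Multiply two non-negative integers (each below 2**width) via bit convolution."""
    mixed = convolve(to_bits(x, width), to_bits(y, width))
    # Entries of `mixed` may exceed 1 (a mixed-radix result);
    # weighting position k by 2**k resolves the carries.
    return sum(digit << k for k, digit in enumerate(mixed))
```

The intermediate sequence holds small integer counts rather than single bits, which is why a carry-resolving step is still needed after the convolution.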
An architecture for real-time vision processing

NASA Technical Reports Server (NTRS)

Chien, Chiun-Hong

1994-01-01

To study the feasibility of developing an architecture for real time vision processing, a task queue server and parallel algorithms for two vision operations were designed and implemented on an i860-based Mercury Computing System 860VS array processor. The proposed architecture treats each vision function as a task or set of tasks which may be recursively divided into subtasks and processed by multiple processors coordinated by a task queue server accessible by all processors. Each idle processor subsequently fetches a task and associated data from the task queue server for processing and posts the result to shared memory for later use. Load balancing can be carried out within the processing system without the requirement for a centralized controller. The author concludes that real time vision processing cannot be achieved without both sequential and parallel vision algorithms and a good parallel vision architecture.
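The task-queue pattern described in the abstract above (idle processors fetch tasks from a shared queue and post results to shared memory) can be sketched in Python with threads standing in for processors. This is only an illustrative sketch under stated assumptions: the name `run_task_queue` and the `(task_id, func, arg)` task format are hypothetical, a dictionary stands in for shared memory, and recursive subdivision of tasks is not modeled.

```python
import queue
import threading


def run_task_queue(tasks, num_workers=4):
    """Process (task_id, func, arg) tuples with workers pulling from a shared queue."""
    task_queue = queue.Queue()
    results = {}                      # stands in for the shared memory
    results_lock = threading.Lock()

    # All tasks are enqueued before any worker starts.
    for task in tasks:
        task_queue.put(task)

    def worker():
        while True:
            try:
                task_id, func, arg = task_queue.get_nowait()
            except queue.Empty:
                return                # no work left; this worker stops
            value = func(arg)
            with results_lock:
                results[task_id] = value

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because each worker takes a new task only when it finishes the previous one, faster workers naturally process more tasks, which gives load balancing without a central controller assigning work.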
Multiprocessing on supercomputers for computational aerodynamics

NASA Technical Reports Server (NTRS)

Yarrow, Maurice; Mehta, Unmeel B.

1991-01-01

Little use is made of multiple processors available on current supercomputers (computers with a theoretical peak performance capability equal to 100 MFLOPS or more) to improve turnaround time in computational aerodynamics. The productivity of a computer user is directly related to this turnaround time. In a time-sharing environment, such improvement in this speed is achieved when multiple processors are used efficiently to execute an algorithm. The concept of multiple instructions and multiple data (MIMD) is applied through multitasking via a strategy that requires relatively minor modifications to an existing code for a single processor. This approach maps the available memory to multiple processors, exploiting the C-Fortran-Unix interface. The existing code is mapped without the need for developing a new algorithm. The procedure for building a code utilizing this approach is automated with the Unix stream editor.

Performances of multiprocessor multidisk architectures for continuous media storage

NASA Astrophysics Data System (ADS)

Gennart, Benoit A.; Messerli, Vincent; Hersch, Roger D.

1996-03-01

Multimedia interfaces increase the need for large image databases, capable of storing and reading streams of data with strict synchronicity and isochronicity requirements. In order to fulfill these requirements, we consider a parallel image server architecture which relies on arrays of intelligent disk nodes, each disk node being composed of one processor and one or more disks. This contribution analyzes through bottleneck performance evaluation and simulation the behavior of two multi-processor multi-disk architectures: a point-to-point architecture and a shared-bus architecture similar to current multiprocessor workstation architectures. We compare the two architectures on the basis of two multimedia algorithms: the compute-bound frame resizing by resampling and the data-bound disk-to-client stream transfer. The results suggest that the shared bus is a potential bottleneck despite its very high hardware throughput (400 Mbytes/s) and that an architecture with addressable local memories located closely to their respective processors could partially remove this bottleneck. The point-to-point architecture is scalable and able to sustain high throughputs for simultaneous compute-bound and data-bound operations.
Performance and Application of Parallel OVERFLOW Codes on Distributed and Shared Memory Platforms

NASA Technical Reports Server (NTRS)

Djomehri, M. Jahed; Rizk, Yehia M.

1999-01-01

The presentation discusses recent studies on the performance of the two parallel versions of the aerodynamics CFD code, OVERFLOW_MPI and _MLP. Developed at NASA Ames, the serial version, OVERFLOW, is a multidimensional Navier-Stokes flow solver based on overset (Chimera) grid technology. The code has recently been parallelized in two ways. One is based on the explicit message-passing interface (MPI) across processors and uses the _MPI communication package. This approach is primarily suited for distributed memory systems and workstation clusters. The second, termed the multi-level parallel (MLP) method, is simple and uses shared memory for all communications. The _MLP code is suitable for distributed-shared memory systems. For both methods, the message passing takes place across the processors or processes at the advancement of each time step. This procedure is, in effect, the Chimera boundary conditions update, which is done in an explicit "Jacobi" style. In contrast, the update in the serial code is done in more of a "Gauss-Seidel" fashion. The programming effort for the _MPI code is more involved than for the _MLP code; the former requires modification of the outer and some inner shells of the serial code, whereas the latter focuses only on the outer shell of the code. The _MPI version offers a great deal of flexibility in distributing grid zones across a specified number of processors in order to achieve load balancing. The approach is capable of partitioning zones across multiple processors or sending each zone and/or cluster of several zones into a single processor. The message passing across the processors consists of Chimera boundary and/or an overlap of "halo" boundary points for each partitioned zone. The MLP version is a new coarse-grain parallel concept at the zonal and intra-zonal levels. A grouping strategy is used to distribute zones into several groups forming sub-processes which will run in parallel. 
The total volume of grid points in each group is approximately balanced. A proper number of threads is initially allocated to each group, and in subsequent iterations during run time, the number of threads is adjusted to achieve load balancing across the processes. Each process exploits the multitasking directives already established in OVERFLOW.
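The distinction the abstract draws between a "Jacobi" style update and a "Gauss-Seidel" style update can be shown on a small linear system. The Python sketch below is a generic textbook illustration, not code from OVERFLOW; the function names and the dense list-of-lists matrix format are assumptions. In the Jacobi sweep every component is computed from the previous iterate, so all components can be updated independently (as in the parallel boundary update). In the Gauss-Seidel sweep each component immediately uses the values updated earlier in the same sweep.

```python
def jacobi_sweep(A, b, x):
    """One Jacobi sweep: every component is computed from the previous iterate."""
    n = len(b)
    new_x = [0.0] * n
    for i in range(n):
        s = sum(A[i][j] * x[j] for j in range(n) if j != i)
        new_x[i] = (b[i] - s) / A[i][i]
    return new_x


def gauss_seidel_sweep(A, b, x):
    """One Gauss-Seidel sweep: values updated earlier in the sweep are used at once."""
    n = len(b)
    x = list(x)  # copy so the caller's iterate is not modified
    for i in range(n):
        s = sum(A[i][j] * x[j] for j in range(n) if j != i)
        x[i] = (b[i] - s) / A[i][i]
    return x
```

For a diagonally dominant matrix both sweeps converge when repeated; the Jacobi form trades somewhat slower convergence for updates that do not depend on each other within a sweep.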
Parallelization of a Monte Carlo particle transport simulation code

NASA Astrophysics Data System (ADS)

Hadjidoukas, P.; Bousis, C.; Emfietzoglou, D.

2010-05-01

We have developed a high performance version of the Monte Carlo particle transport simulation code MC4. The original application code, developed in Visual Basic for Applications (VBA) for Microsoft Excel, was first rewritten in the C programming language for improving code portability. Several pseudo-random number generators have also been integrated and studied. The new MC4 version was then parallelized for shared and distributed-memory multiprocessor systems using the Message Passing Interface. Two parallel pseudo-random number generator libraries (SPRNG and DCMT) have been seamlessly integrated. The performance speedup of parallel MC4 has been studied on a variety of parallel computing architectures including an Intel Xeon server with 4 dual-core processors, a Sun cluster consisting of 16 nodes of 2 dual-core AMD Opteron processors and a 200 dual-processor HP cluster. For large problem size, which is limited only by the physical memory of the multiprocessor server, the speedup results are almost linear on all systems. We have validated the parallel implementation against the serial VBA and C implementations using the same random number generator. Our experimental results on the transport and energy loss of electrons in a water medium show that the serial and parallel codes are equivalent in accuracy. The present improvements allow for the study of higher particle energies with the use of more accurate physical models, and improve statistics as more particle tracks can be simulated in low response time.
Integrated Payload Data Handling Systems Using Software Partitioning

NASA Astrophysics Data System (ADS)

Taylor, Alun; Hann, Mark; Wishart, Alex

2015-09-01

An integrated Payload Data Handling System (I-PDHS) is one in which multiple instruments share a central payload processor for their on-board data processing tasks. This offers a number of advantages over the conventional decentralised architecture. Savings in payload mass and power can be realised because the total processing resource is matched to the requirements, as opposed to the decentralised architecture where the processing resource is in effect the sum of all the applications. Overall development cost can be reduced using a common processor. At individual instrument level the potential benefits include a standardised application development environment, and the opportunity to run the instrument data handling application on a fully redundant and more powerful processing platform [1]. This paper describes a joint program by SCISYS UK Limited, Airbus Defence and Space, Imperial College London and RAL Space to implement a realistic demonstration of an I-PDHS using engineering models of flight instruments (a magnetometer and camera) and a laboratory demonstrator of a central payload processor which is functionally representative of a flight design. The objective is to raise the Technology Readiness Level of the centralised data processing technique by addressing the key areas of task partitioning to prevent fault propagation and the use of a common development process for the instrument applications. The project is supported by a UK Space Agency grant awarded under the National Space Technology Program SpaceCITI scheme.
WATERLOOP V2/64: A highly parallel machine for numerical computation

NASA Astrophysics Data System (ADS)

Ostlund, Neil S.

1985-07-01

Current technological trends suggest that the high performance scientific machines of the future are very likely to consist of a large number (greater than 1024) of processors connected and communicating with each other in some as yet undetermined manner. Such an assembly of processors should behave as a single machine in obtaining numerical solutions to scientific problems. However, the appropriate way of organizing both the hardware and software of such an assembly of processors is an unsolved and active area of research. It is particularly important to minimize the organizational overhead of interprocessor communication, global synchronization, and contention for shared resources if the performance of a large number (n) of processors is to be anything like the desirable n times the performance of a single processor. In many situations, adding a processor actually decreases the performance of the overall system since the extra organizational overhead is larger than the extra processing power added. The systolic loop architecture is a new multiple processor architecture which attempts to solve the problem of how to organize a large number of asynchronous processors into an effective computational system while minimizing the organizational overhead. This paper gives a brief overview of the basic systolic loop architecture, systolic loop algorithms for numerical computation, and a 64-processor implementation of the architecture, WATERLOOP V2/64, that is being used as a testbed for exploring the hardware, software, and algorithmic aspects of the architecture.
Parallel algorithms for boundary value problems

NASA Technical Reports Server (NTRS)

Lin, Avi

1990-01-01

A general approach to solving boundary value problems numerically in a parallel environment is discussed. The basic algorithm consists of two steps: the local step, where all the P available processors work in parallel, and the global step, where one processor solves a tridiagonal linear system of order P. The main advantages of this approach are twofold. First, the suggested approach is very flexible, especially in the local step, and thus the algorithm can be used with any number of processors and with any of the SIMD or MIMD machines. Second, the communication complexity is very small, and thus the algorithm can be used as easily with shared memory machines. Several examples of using this strategy are discussed.
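The global step described in the abstract above requires one processor to solve a tridiagonal linear system of order P. The abstract does not say which solver is used; a common serial choice is the Thomas algorithm, sketched below in Python as an assumption. The function name and the argument layout are hypothetical, and no pivoting is performed, so the sketch assumes a well-conditioned (for example, diagonally dominant) system.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system with the Thomas algorithm.

    lower[i] multiplies x[i-1] in row i (lower[0] is unused),
    upper[i] multiplies x[i+1] in row i (upper[-1] is unused).
    """
    n = len(diag)
    c = [0.0] * n   # modified upper diagonal
    d = [0.0] * n   # modified right-hand side

    # Forward elimination.
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom

    # Back substitution.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x
```

The cost grows linearly with the order of the system, so for a system of order P the serial global step stays small compared with the local work done in parallel.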
C-MOS array design techniques: SUMC multiprocessor system study

NASA Technical Reports Server (NTRS)

Clapp, W. A.; Helbig, W. A.; Merriam, A. S.

1972-01-01

The current capabilities of LSI techniques for speed and reliability, plus the possibilities of assembling large configurations of LSI logic and storage elements, have demanded the study of multiprocessors and multiprocessing techniques, problems, and potentialities. Evaluated are three previous systems studies for a space ultrareliable modular computer multiprocessing system, and a new multiprocessing system is proposed that is flexibly configured with up to four central processors, four I/O processors, and 16 main memory units, plus auxiliary memory and peripheral devices. This multiprocessor system features a multilevel interrupt, qualified S/360 compatibility for ground-based generation of programs, virtual memory management of a storage hierarchy through I/O processors, and multiport access to multiple and shared memory units.
Methods for synchronizing a countdown routine of a timer key and electronic device

DOEpatents

Condit, Reston A.; Daniels, Michael A.; Clemens, Gregory P.; Tomberlin, Eric S.; Johnson, Joel A.

2015-06-02

A timer key relating to monitoring a countdown time of a countdown routine of an electronic device is disclosed. The timer key comprises a processor configured to respond to a countdown time associated with operation of the electronic device, a display operably coupled with the processor, and a housing configured to house at least the processor. The housing has an associated structure configured to engage with the electronic device to share the countdown time between the electronic device and the timer key. The processor is configured to begin a countdown routine based at least in part on the countdown time, wherein the countdown routine is at least substantially synchronized with a countdown routine of the electronic device when the timer key is removed from the electronic device. A system and method for synchronizing countdown routines of a timer key and an electronic device are also disclosed.
Apparatus, system, and method for synchronizing a timer key

DOEpatents

Condit, Reston A; Daniels, Michael A; Clemens, Gregory P; Tomberlin, Eric S; Johnson, Joel A

2014-04-22

A timer key relating to monitoring a countdown time of a countdown routine of an electronic device is disclosed. The timer key comprises a processor configured to respond to a countdown time associated with operation of the electronic device, a display operably coupled with the processor, and a housing configured to house at least the processor. The housing has an associated structure configured to engage with the electronic device to share the countdown time between the electronic device and the timer key. The processor is configured to begin a countdown routine based at least in part on the countdown time, wherein the countdown routine is at least substantially synchronized with a countdown routine of the electronic device when the timer key is removed from the electronic device. A system and method for synchronizing countdown routines of a timer key and an electronic device are also disclosed.
Design of a real-time wind turbine simulator using a custom parallel architecture

NASA Technical Reports Server (NTRS)

Hoffman, John A.; Gluck, R.; Sridhar, S.

1995-01-01

The design of a new parallel-processing digital simulator is described. The new simulator has been developed specifically for analysis of wind energy systems in real time. The new processor has been named the Wind Energy System Time-domain simulator, version 3 (WEST-3). Like previous WEST versions, WEST-3 performs many computations in parallel. The modules in WEST-3 are pure digital processors, however. These digital processors can be programmed individually and operated in concert to achieve real-time simulation of wind turbine systems. Because of this programmability, WEST-3 is much more flexible and general than its two predecessors. The design features of WEST-3 are described to show how the system produces high-speed solutions of nonlinear time-domain equations. WEST-3 has two very fast Computational Units (CU's) that use minicomputer technology plus special architectural features that make them many times faster than a microcomputer. These CU's are needed to perform the complex computations associated with the wind turbine rotor system in real time. The parallel architecture of the CU causes several tasks to be done in each cycle, including an IO operation and the combination of a multiply, add, and store. The WEST-3 simulator can be expanded at any time for additional computational power. This is possible because the CU's are interfaced to each other and to other portions of the simulation using special serial buses. These buses can be 'patched' together in essentially any configuration (in a manner very similar to the programming methods used in analog computation) to balance the input/output requirements. CU's can be added in any number to share a given computational load. This flexible bus feature is very different from many other parallel processors, which usually have a throughput limit because of rigid bus architecture.
Message Passing and Shared Address Space Parallelism on an SMP Cluster

NASA Technical Reports Server (NTRS)

Shan, Hongzhang; Singh, Jaswinder P.; Oliker, Leonid; Biswas, Rupak; Biegel, Bryan (Technical Monitor)

2002-01-01

Currently, message passing (MP) and shared address space (SAS) are the two leading parallel programming paradigms. MP has been standardized with MPI, and is the more common and mature approach; however, code development can be extremely difficult, especially for irregularly structured computations. SAS offers substantial ease of programming, but may suffer from performance limitations due to poor spatial locality and high protocol overhead. In this paper, we compare the performance of and the programming effort required for six applications under both programming models on a 32-processor PC-SMP cluster, a platform that is becoming increasingly attractive for high-end scientific computing. Our application suite consists of codes that typically do not exhibit scalable performance under shared-memory programming due to their high communication-to-computation ratios and/or complex communication patterns. Results indicate that SAS can achieve about half the parallel efficiency of MPI for most of our applications, while being competitive for the others. A hybrid MPI+SAS strategy shows only a small performance advantage over pure MPI in some cases. Finally, improved implementations of two MPI collective operations on PC-SMP clusters are presented.
OpenMP Performance on the Columbia Supercomputer

NASA Technical Reports Server (NTRS)

Haoqiang, Jin; Hood, Robert

2005-01-01

This presentation discusses the Columbia World Class Supercomputer, which is one of the world's fastest supercomputers, providing 61 TFLOPs (10/20/04). It was conceived, designed, built, and deployed in just 120 days. It is a 20-node supercomputer built on proven 512-processor nodes. It is the largest SGI system in the world, with over 10,000 Intel Itanium 2 processors, and provides the largest node size incorporating commodity parts (512) and the largest shared-memory environment (2048); with 88% efficiency, it tops the scalar systems on the Top500 list.
Hypercluster Parallel Processor

NASA Technical Reports Server (NTRS)

Blech, Richard A.; Cole, Gary L.; Milner, Edward J.; Quealy, Angela

1992-01-01

Hypercluster computer system includes multiple digital processors, operation of which coordinated through specialized software. Configurable according to various parallel-computing architectures of shared-memory or distributed-memory class, including scalar computer, vector computer, reduced-instruction-set computer, and complex-instruction-set computer. Designed as flexible, relatively inexpensive system that provides single programming and operating environment within which one can investigate effects of various parallel-computing architectures and combinations on performance in solution of complicated problems like those of three-dimensional flows in turbomachines. Hypercluster software and architectural concepts are in public domain.
40 CFR 747.195 - Triethanolamine salt of a substituted organic acid.

Code of Federal Regulations, 2010 CFR

2010-07-01

..., commerce, importer, impurity, Inventory, manufacturer, person, process, processor, and small quantities... control of the processor. (ii) Distribution in commerce is limited to purposes of export. (iii) The processor or distributor may not use the substance except in small quantities solely for research and...
Improved Remapping Processor For Digital Imagery

NASA Technical Reports Server (NTRS)

Fisher, Timothy E.

1991-01-01

Proposed digital image processor improved version of Programmable Remapper, which performs geometric and radiometric transformations on digital images. Features include overlapping and variably sized preimages. Overcomes some of limitations of image-warping circuit boards implementing only those geometric transformations expressible in terms of polynomials of limited order. Also overcomes limitations of existing Programmable Remapper and made to perform transformations at video rate.
Call Admission Control on Single Node Networks under Output Rate-Controlled Generalized Processor Sharing (ORC-GPS) Scheduler

NASA Astrophysics Data System (ADS)

Hanada, Masaki; Nakazato, Hidenori; Watanabe, Hitoshi

Multimedia applications such as music or video streaming, video teleconferencing and IP telephony are flourishing in packet-switched networks. Applications that generate such real-time data can have very diverse quality-of-service (QoS) requirements. In order to guarantee diverse QoS requirements, the combined use of a packet scheduling algorithm based on Generalized Processor Sharing (GPS) and leaky bucket traffic regulator is the most successful QoS mechanism. GPS can provide a minimum guaranteed service rate for each session and tight delay bounds for leaky bucket constrained sessions. However, the delay bounds for leaky bucket constrained sessions under GPS are unnecessarily large because each session is served according to its associated constant weight until the session buffer is empty. In order to solve this problem, a scheduling policy called Output Rate-Controlled Generalized Processor Sharing (ORC-GPS) was proposed in [17]. ORC-GPS is a rate-based scheduling like GPS, and controls the service rate in order to lower the delay bounds for leaky bucket constrained sessions. In this paper, we propose a call admission control (CAC) algorithm for ORC-GPS, for leaky-bucket constrained sessions with deterministic delay requirements. This CAC algorithm for ORC-GPS determines the optimal values of parameters of ORC-GPS from the deterministic delay requirements of the sessions. In numerical experiments, we compare the CAC algorithm for ORC-GPS with one for GPS in terms of schedulable region and computational complexity.
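The guaranteed-rate property of plain GPS mentioned in the abstract can be illustrated with a short calculation. This is a minimal sketch, assuming the standard GPS result that a session with weight phi_i on a link of capacity C is guaranteed the rate g_i = C * phi_i / sum(phi), and that a session constrained by a leaky bucket (sigma_i, rho_i) with g_i >= rho_i has delay at most sigma_i / g_i. It does not model ORC-GPS or the proposed CAC algorithm; the function names and numbers are hypothetical.

```python
def gps_guaranteed_rates(weights, capacity):
    """Minimum service rate of each session under GPS (same unit as capacity)."""
    total = sum(weights)
    return [capacity * w / total for w in weights]


def gps_delay_bound(sigma, rho, rate):
    """Worst-case delay for a (sigma, rho) leaky-bucket session served at
    a guaranteed rate of at least `rate`. Assumes rate >= rho."""
    if rate < rho:
        raise ValueError("guaranteed rate below sustained rate: no finite bound")
    return sigma / rate


# Hypothetical example: three sessions share a 10 Mbit/s link.
rates = gps_guaranteed_rates([1, 1, 2], 10e6)
# Session 0 may burst up to 50 kbit and sustains 2 Mbit/s.
bound = gps_delay_bound(sigma=50_000, rho=2e6, rate=rates[0])
print(rates, bound)
```

An admission test in this simplified setting would accept a new session only if every session's bound stays within its delay requirement after the weights are adjusted.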
Implementation of kernels on the Maestro processor

NASA Astrophysics Data System (ADS)

Suh, Jinwoo; Kang, D. I. D.; Crago, S. P.

Currently, most microprocessors use multiple cores to increase performance while limiting power usage. Some processors use not just a few cores, but tens of cores or even 100 cores. One such many-core microprocessor is the Maestro processor, which is based on Tilera's TILE64 processor. The Maestro chip is a 49-core, general-purpose, radiation-hardened processor designed for space applications. The Maestro processor, unlike the TILE64, has a floating point unit (FPU) in each core for improved floating point performance. The Maestro processor runs at 342 MHz clock frequency. On the Maestro processor, we implemented several widely used kernels: matrix multiplication, vector add, FIR filter, and FFT. We measured and analyzed the performance of these kernels. The achieved performance was up to 5.7 GFLOPS, and the speedup compared to single tile was up to 49 using 49 tiles.
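The figures quoted in the abstract can be turned into two derived metrics. This is a rough sketch using only the numbers stated above (49 tiles, 342 MHz, up to 5.7 GFLOPS, speedup up to 49); the abstract does not say that the peak rate and the peak speedup come from the same kernel, so the results are indicative only. The function names are hypothetical.

```python
def parallel_efficiency(speedup, workers):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / workers


def flops_per_cycle_per_core(gflops, cores, clock_mhz):
    """Average floating-point operations completed per core per clock cycle."""
    return (gflops * 1e9) / (cores * clock_mhz * 1e6)


# Values taken from the abstract.
print(parallel_efficiency(49, 49))             # best reported case: linear scaling
print(flops_per_cycle_per_core(5.7, 49, 342))  # roughly 0.34 FLOP per cycle per core
```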
High speed quantitative digital microscopy

NASA Technical Reports Server (NTRS)

Castleman, K. R.; Price, K. H.; Eskenazi, R.; Ovadya, M. M.; Navon, M. A.

1984-01-01

Modern digital image processing hardware makes possible quantitative analysis of microscope images at high speed. This paper describes an application to automatic screening for cervical cancer. The system uses twelve MC6809 microprocessors arranged in a pipeline multiprocessor configuration. Each processor executes one part of the algorithm on each cell image as it passes through the pipeline. Each processor communicates with its upstream and downstream neighbors via shared two-port memory. Thus no time is devoted to input-output operations as such. This configuration is expected to be at least ten times faster than previous systems.
Multitasking OS manages a team of processors

DOE Office of Scientific and Technical Information (OSTI.GOV)

Ripps, D.L.

1983-07-21

MTOS-68k is a real-time multitasking operating system designed for the popular MC68000 microprocessors. It approaches task coordination and synchronization in a fashion that matches uniquely the structural simplicity and regularity of the 68000 instruction set. Since in many 68000 applications the speed and power of one CPU are not enough, MTOS-68k has been designed to support multiple processors, as well as multiple tasks. Typically, the devices are tightly coupled single-board computers, that is, they share a backplane and parts of global memory.
Second International Workshop on Software Engineering and Code Design in Parallel Meteorological and Oceanographic Applications

NASA Technical Reports Server (NTRS)

OKeefe, Matthew (Editor); Kerr, Christopher L. (Editor)

1998-01-01

This report contains the abstracts and technical papers from the Second International Workshop on Software Engineering and Code Design in Parallel Meteorological and Oceanographic Applications, held June 15-18, 1998, in Scottsdale, Arizona. The purpose of the workshop is to bring together software developers in meteorology and oceanography to discuss software engineering and code design issues for parallel architectures, including Massively Parallel Processors (MPP's), Parallel Vector Processors (PVP's), Symmetric Multi-Processors (SMP's), Distributed Shared Memory (DSM) multi-processors, and clusters. Issues to be discussed include: (1) code architectures for current parallel models, including basic data structures, storage allocation, variable naming conventions, coding rules and styles, i/o and pre/post-processing of data; (2) designing modular code; (3) load balancing and domain decomposition; (4) techniques that exploit parallelism efficiently yet hide the machine-related details from the programmer; (5) tools for making the programmer more productive; and (6) the proliferation of programming models (F--, OpenMP, MPI, and HPF).

Resource and Performance Evaluations of Fixed Point QRD-RLS Systolic Array through FPGA Implementation

NASA Astrophysics Data System (ADS)

Yokoyama, Yoshiaki; Kim, Minseok; Arai, Hiroyuki

At present, when using space-time processing techniques with multiple antennas for mobile radio communication, real-time weight adaptation is necessary. Due to the progress of integrated circuit technology, dedicated processor implementation with ASIC or FPGA can be employed to implement various wireless applications. This paper presents a resource and performance evaluation of the QRD-RLS systolic array processor based on fixed-point CORDIC algorithm with FPGA. In this paper, to save hardware resources, we propose the shared architecture of a complex CORDIC processor. The required precision of internal calculation, the circuit area for the number of antenna elements and wordlength, and the processing speed will be evaluated. The resource estimation provides a possible processor configuration with a current FPGA on the market. Computer simulations assuming a fading channel will show a fast convergence property with a finite number of training symbols. The proposed architecture has also been implemented and its operation was verified by beamforming evaluation through a radio propagation experiment.
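The shift-and-add idea behind a fixed-point CORDIC processor can be sketched as follows. This is an illustrative real-valued vectoring-mode CORDIC using integer arithmetic, assuming x > 0; it is not the complex CORDIC architecture or the word lengths evaluated in the paper, and the function name and default parameters are hypothetical.

```python
import math


def cordic_vectoring(x, y, iterations=16, frac_bits=16):
    """Rotate (x, y) onto the x-axis with shifts and adds only.

    Returns (magnitude, angle_in_radians). Assumes x > 0.
    """
    scale = 1 << frac_bits
    xi = int(round(x * scale))
    yi = int(round(y * scale))
    # Fixed-point table of elementary rotation angles atan(2^-i).
    atan_table = [int(round(math.atan(2.0 ** -i) * scale)) for i in range(iterations)]
    z = 0
    for i in range(iterations):
        if yi >= 0:
            # Rotate clockwise to push y towards zero.
            xi, yi = xi + (yi >> i), yi - (xi >> i)
            z += atan_table[i]
        else:
            # Rotate counter-clockwise.
            xi, yi = xi - (yi >> i), yi + (xi >> i)
            z -= atan_table[i]
    # Each micro-rotation stretches the vector; divide out the accumulated gain.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    return xi / scale / gain, z / scale


magnitude, angle = cordic_vectoring(3.0, 4.0)
print(magnitude, angle)  # close to 5 and atan2(4, 3)
```

In a QRD-RLS array, a vectoring step of this kind would compute the rotation that annihilates an element, and the same rotation would then be applied to the rest of the row.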
A High-Throughput Processor for Flight Control Research Using Small UAVs

NASA Technical Reports Server (NTRS)

Klenke, Robert H.; Sleeman, W. C., IV; Motter, Mark A.

2006-01-01

There are numerous autopilot systems that are commercially available for small (<100 lbs) UAVs. However, they all share several key disadvantages for conducting aerodynamic research, chief amongst which is the fact that most utilize older, slower, 8- or 16-bit microcontroller technologies. This paper describes the development and testing of a flight control system (FCS) for small UAVs based on a modern, high throughput, embedded processor. In addition, this FCS platform contains user-configurable hardware resources in the form of a Field Programmable Gate Array (FPGA) that can be used to implement custom, application-specific hardware. This hardware can be used to off-load routine tasks such as sensor data collection, from the FCS processor thereby further increasing the computational throughput of the system.
Parallel algorithms for quantum chemistry. I. Integral transformations on a hypercube multiprocessor

DOE Office of Scientific and Technical Information (OSTI.GOV)

Whiteside, R.A.; Binkley, J.S.; Colvin, M.E.

1987-02-15

For many years it has been recognized that fundamental physical constraints such as the speed of light will limit the ultimate speed of single processor computers to less than about three billion floating point operations per second (3 GFLOPS). This limitation is becoming increasingly restrictive as commercially available machines are now within an order of magnitude of this asymptotic limit. A natural way to avoid this limit is to harness together many processors to work on a single computational problem. In principle, these parallel processing computers have speeds limited only by the number of processors one chooses to acquire. The usefulness of potentially unlimited processing speed to a computationally intensive field such as quantum chemistry is obvious. If these methods are to be applied to significantly larger chemical systems, parallel schemes will have to be employed. For this reason we have developed distributed-memory algorithms for a number of standard quantum chemical methods. We are currently implementing these on a 32 processor Intel hypercube. In this paper we present our algorithm and benchmark results for one of the bottleneck steps in quantum chemical calculations: the four index integral transformation.
A simple modern correctness condition for a space-based high-performance multiprocessor

NASA Technical Reports Server (NTRS)

Probst, David K.; Li, Hon F.

1992-01-01

A number of U.S. national programs, including space-based detection of ballistic missile launches, envisage putting significant computing power into space. Given sufficient progress in low-power VLSI, multichip-module packaging and liquid-cooling technologies, we will see design of high-performance multiprocessors for individual satellites. In very high speed implementations, performance depends critically on tolerating large latencies in interprocessor communication; without latency tolerance, performance is limited by the vastly differing time scales in processor and data-memory modules, including interconnect times. The modern approach to tolerating remote-communication cost in scalable, shared-memory multiprocessors is to use a multithreaded architecture, and alter the semantics of shared memory slightly, at the price of forcing the programmer either to reason about program correctness in a relaxed consistency model or to agree to program in a constrained style. The literature on multiprocessor correctness conditions has become increasingly complex, and sometimes confusing, which may hinder its practical application. We propose a simple modern correctness condition for a high-performance, shared-memory multiprocessor; the correctness condition is based on a simple interface between the multiprocessor architecture and the parallel programming system.
76 FR 52147 - Fisheries of the Exclusive Economic Zone Off Alaska; Groundfish of the Gulf of Alaska; Amendment 88

Federal Register 2010, 2011, 2012, 2013, 2014

2011-08-19

... Pilot Program and the proposed Rockfish Program are a type of a limited access privilege program (LAPP... Central GOA fishermen, shoreside processors, catcher/processors, and communities by (1) providing greater... the ability to choose when to fish, (3) providing greater stability for processors by spreading...
75 FR 42337 - Fisheries of the Exclusive Economic Zone Off Alaska; Pacific Ocean Perch for Catcher/Processors...

Federal Register 2010, 2011, 2012, 2013, 2014

2010-07-21

.... 0910131362-0087-02] RIN 0648-XX71 Fisheries of the Exclusive Economic Zone Off Alaska; Pacific Ocean Perch... directed fishing for Pacific ocean perch by catcher/processors participating in the rockfish limited access... exceeding the 2010 total allowable catch (TAC) of Pacific ocean perch allocated to catcher/processors...
76 FR 43934 - Fisheries of the Exclusive Economic Zone Off Alaska; Pacific Ocean Perch for Catcher/Processors...

Federal Register 2010, 2011, 2012, 2013, 2014

2011-07-22

.... 101126522-0640-02] RIN 0648-XA587 Fisheries of the Exclusive Economic Zone Off Alaska; Pacific Ocean Perch... directed fishing for Pacific ocean perch by catcher/processors participating in the rockfish limited access... exceeding the 2011 total allowable catch (TAC) of Pacific ocean perch allocated to catcher/processors...
A pervasive parallel framework for visualization: final report for FWP 10-014707

DOE Office of Scientific and Technical Information (OSTI.GOV)

Moreland, Kenneth D.

2014-01-01

We are on the threshold of a transformative change in the basic architecture of high-performance computing. The use of accelerator processors, characterized by large core counts, shared but asymmetrical memory, and heavy thread loading, is quickly becoming the norm in high performance computing. These accelerators represent significant challenges in updating our existing base of software. An intrinsic problem with this transition is a fundamental programming shift from message passing processes to much finer thread scheduling with memory sharing. Another problem is the lack of stability in accelerator implementation; processor and compiler technology is currently changing rapidly. This report documents the results of our three-year ASCR project to address these challenges. Our project includes the development of the Dax toolkit, which contains the beginnings of new algorithms for a new generation of computers and the underlying infrastructure to rapidly prototype and build further algorithms as necessary.
System Level RBDO for Military Ground Vehicles using High Performance Computing

DTIC Science & Technology

2008-01-01

platform. Only the analyses that required more than 24 processors were conducted on the Onyx 350 due to the limited number of processors on the...optimization constraints varied. The queues set the number of processors and number of finite element code licenses available to the analyses. sgi ONYX ...3900: unix 24 MIPS R16000 PROCESSORS 4 IR2 GRAPHICS PIPES 4 IR3 GRAPHICS PIPES 24 GBYTES MEMORY 36 GBYTES LOCAL DISK SPACE sgi ONYX 350: unix 32 MIPS
Asynchronous broadcast for ordered delivery between compute nodes in a parallel computing system where packet header space is limited

DOE Office of Scientific and Technical Information (OSTI.GOV)

Kumar, Sameer

Disclosed is a mechanism on receiving processors in a parallel computing system for providing order to data packets received from a broadcast call and to distinguish data packets received at nodes from several incoming asynchronous broadcast messages where header space is limited. In the present invention, processors at lower leafs of a tree do not need to obtain a broadcast message by directly accessing the data in a root processor's buffer. Instead, each subsequent intermediate node's rank id information is squeezed into the software header of packet headers. In turn, the entire broadcast message is not transferred from the root processor to each processor in a communicator but instead is replicated on several intermediate nodes which then replicate the message to nodes in lower leafs. Hence, the intermediate compute nodes become "virtual root compute nodes" for the purpose of replicating the broadcast message to lower levels of a tree.
Unclassified Information Sharing and Coordination in Security, Stabilization, Transition and Reconstruction Efforts

DTIC Science & Technology

2008-03-01

is implemented using the Drupal (2007) content management system (CMS) and many of the baseline information sharing and collaboration tools have...been contributed through the Drupal open source community. Drupal is a very modular open source software written in PHP hypertext processor...needed to suit the particular problem domain. While other frameworks have the potential to provide similar advantages (“Ruby,” 2007), Drupal was
A cache-aided multiprocessor rollback recovery scheme

NASA Technical Reports Server (NTRS)

Wu, Kun-Lung; Fuchs, W. Kent

1989-01-01

This paper demonstrates how previous uniprocessor cache-aided recovery schemes can be applied to multiprocessor architectures, for recovering from transient processor failures, utilizing private caches and a global shared memory. As with cache-aided uniprocessor recovery, the multiprocessor cache-aided recovery scheme of this paper can be easily integrated into standard bus-based snoopy cache coherence protocols. A consistent shared memory state is maintained without the necessity of global check-pointing.
Bristol Ridge: A 28-nm x86 Performance-Enhanced Microprocessor Through System Power Management

DOE Office of Scientific and Technical Information (OSTI.GOV)

Sundaram, Sriram; Grenat, Aaron; Naffziger, Samuel

Power management techniques can be effective at extracting more performance and energy efficiency out of mature systems on chip (SoCs). For instance, the peak performance of microprocessors is often limited by worst case technology (Vmax), infrastructure (thermal/electrical), and microprocessor usage assumptions. Performance/watt of microprocessors also typically suffers from guard bands associated with the test and binning processes as well as worst case aging/lifetime degradation. Similarly, on multicore processors, shared voltage rails tend to limit the peak performance achievable in low thread count workloads. In this paper, we describe five power management techniques that maximize the per-part performance under the before-mentioned constraints. Using these techniques, we demonstrate a net performance increase of up to 15% depending on the application and TDP of the SoC, implemented on 'Bristol Ridge,' a 28-nm CMOS, dual-core x86 accelerated processing unit.
Designing minimal space telerobotics systems for maximum performance

NASA Technical Reports Server (NTRS)

Backes, Paul G.; Long, Mark K.; Steele, Robert D.

1992-01-01

The design of the remote site of a local-remote telerobot control system is described which addresses the constraints of limited computational power available at the remote site control system while providing a large range of control capabilities. The Modular Telerobot Task Execution System (MOTES) provides supervised autonomous control, shared control and teleoperation for a redundant manipulator. The system is capable of nominal task execution as well as monitoring and reflex motion. The MOTES system is minimized while providing a large capability by limiting its functionality to only that which is necessary at the remote site and by utilizing a unified multi-sensor based impedance control scheme. A command interpreter similar to one used on robotic spacecraft is used to interpret commands received from the local site. The system is written in Ada and runs in a VME environment on 68020 processors and initially controls a Robotics Research K1207 7 degree of freedom manipulator.
Initial Performance Results on IBM POWER6

NASA Technical Reports Server (NTRS)

Saini, Subhash; Talcott, Dale; Jespersen, Dennis; Djomehri, Jahed; Jin, Haoqiang; Mehrotra, Piyush

2008-01-01

The POWER5+ processor has a faster memory bus than that of the previous generation POWER5 processor (533 MHz vs. 400 MHz), but the measured per-core memory bandwidth of the latter is better than that of the former (5.7 GB/s vs. 4.3 GB/s). The reason for this is that in the POWER5+, the two cores on the chip share the L2 cache, L3 cache and memory bus. The memory controller is also on the chip and is shared by the two cores. This serializes the path to memory. For consistently good performance on a wide range of applications, the performance of the processor, the memory subsystem, and the interconnects (both latency and bandwidth) should be balanced. Recognizing this, IBM has designed the Power6 processor so as to avoid the bottlenecks due to the L2 cache, memory controller and buffer chips of the POWER5+. Unlike the POWER5+, each core in the POWER6 has its own L2 cache (4 MB - double that of the Power5+), memory controller and buffer chips. Each core in the POWER6 runs at 4.7 GHz instead of 1.9 GHz in POWER5+. In this paper, we evaluate the performance of a dual-core Power6 based IBM p6-570 system, and we compare its performance with that of a dual-core Power5+ based IBM p575+ system. In this evaluation, we have used the High-Performance Computing Challenge (HPCC) benchmarks, NAS Parallel Benchmarks (NPB), and four real-world applications--three from computational fluid dynamics and one from climate modeling.
Shared versus distributed memory multiprocessors

NASA Technical Reports Server (NTRS)

Jordan, Harry F.

1991-01-01

The question of whether multiprocessors should have shared or distributed memory has attracted a great deal of attention. Some researchers argue strongly for building distributed memory machines, while others argue just as strongly for programming shared memory multiprocessors. A great deal of research is underway on both types of parallel systems. Special emphasis is placed on systems with a very large number of processors for computation intensive tasks and considers research and implementation trends. It appears that the two types of systems will likely converge to a common form for large scale multiprocessors.
Multi-GPU and multi-CPU accelerated FDTD scheme for vibroacoustic applications

NASA Astrophysics Data System (ADS)

Francés, J.; Otero, B.; Bleda, S.; Gallego, S.; Neipp, C.; Márquez, A.; Beléndez, A.

2015-06-01

The Finite-Difference Time-Domain (FDTD) method is applied to the analysis of vibroacoustic problems and to the study of the propagation of longitudinal and transversal waves in stratified media. This work demonstrates the potential of the scheme and the relevance of each acceleration strategy for massive FDTD computations. We propose two new implementations of the two-dimensional FDTD scheme using multi-CPU and multi-GPU, respectively. In the first implementation, an open-source message passing interface (OMPI) is used to fully exploit the resources of a dual-processor station with two Intel Xeon processors. In addition, the CPU code version includes the Streaming SIMD Extensions (SSE) and the Advanced Vector Extensions (AVX), together with shared-memory approaches that take advantage of multi-core platforms. The second implementation, the multi-GPU code version, is based on the peer-to-peer communications available in CUDA on two GPUs (NVIDIA GTX 670). The paper then analyses the influence of the different code versions, including shared-memory approaches, vector instructions and multi-processors (both CPU and GPU), and compares them in order to delimit the degree of improvement obtained from distributed solutions based on multi-CPU and multi-GPU. The performance of both approaches was analysed, and it is shown that adding shared-memory schemes to CPU computing substantially improves the performance of vector instructions, enlarging the simulation sizes that use the CPU cache memory efficiently. In this case, GPU computing is about twice as fast as the fine-tuned CPU version for both one and two nodes. However, for massive computations explicit vector instructions are not worthwhile, since memory bandwidth is the limiting factor and the performance tends to be the same as that of the sequential version with auto-vectorisation and the shared-memory approach. In this scenario GPU computing is the best option since it provides homogeneous behaviour. More specifically, the speedup of GPU computing reaches an upper limit of 12 for both one and two GPUs, whereas the performance reaches peak values of 80 GFlops and 146 GFlops for one GPU and two GPUs, respectively. Finally, the method is applied to an earth crust profile in order to demonstrate the potential of the approach and the need for acceleration strategies in this type of application.
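The kind of stencil update that such codes vectorise and distribute can be sketched in one dimension. This is a minimal illustration, assuming a 1D acoustic staggered-grid leapfrog scheme with rigid walls; it is not the paper's two-dimensional vibroacoustic formulation, and the function name, material constants and grid sizes are hypothetical.

```python
def fdtd_1d(steps, n=200, c=340.0, rho=1.2, dx=0.01, courant=0.5):
    """Advance a 1D pressure/velocity staggered grid by `steps` time steps."""
    dt = courant * dx / c          # time step from the Courant number
    kappa = rho * c * c            # bulk modulus
    p = [0.0] * n                  # pressure at cell centres
    v = [0.0] * (n + 1)            # velocity at cell faces; v[0] = v[n] = 0 (rigid walls)
    p[n // 2] = 1.0                # initial pressure pulse in the middle
    for _ in range(steps):
        # Velocity update from the pressure gradient (interior faces only).
        for i in range(1, n):
            v[i] -= dt / (rho * dx) * (p[i] - p[i - 1])
        # Pressure update from the velocity divergence.
        for i in range(n):
            p[i] -= kappa * dt / dx * (v[i + 1] - v[i])
    return p


pressure = fdtd_1d(steps=100)
print(max(pressure), min(pressure))
```

Each inner loop touches only neighbouring cells, which is why the method maps naturally onto SIMD lanes, shared-memory threads and GPU threads, and why memory bandwidth rather than arithmetic tends to limit large runs.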
Scalability of a Low-Cost Multi-Teraflop Linux Cluster for High-End Classical Atomistic and Quantum Mechanical Simulations

NASA Technical Reports Server (NTRS)

Kikuchi, Hideaki; Kalia, Rajiv K.; Nakano, Aiichiro; Vashishta, Priya; Shimojo, Fuyuki; Saini, Subhash

2003-01-01

Scalability of a low-cost, Intel Xeon-based, multi-Teraflop Linux cluster is tested for two high-end scientific applications: Classical atomistic simulation based on the molecular dynamics method and quantum mechanical calculation based on the density functional theory. These scalable parallel applications use space-time multiresolution algorithms and feature computational-space decomposition, wavelet-based adaptive load balancing, and spacefilling-curve-based data compression for scalable I/O. Comparative performance tests are performed on a 1,024-processor Linux cluster and a conventional higher-end parallel supercomputer, 1,184-processor IBM SP4. The results show that the performance of the Linux cluster is comparable to that of the SP4. We also study various effects, such as the sharing of memory and L2 cache among processors, on the performance.
Practical use of a word processor in a histopathology laboratory.

PubMed Central

Briggs, J C; Ibrahim, N B; Mackintosh, I; Norris, D

1982-01-01

Some of the facilities available with a commercially purchased word processing program, linked to a DEC PDP 11/23 computer are described, together with an account of the practical histopathological use. The system is based on a share of the computer with a Clinical Chemistry Department. Development was time-consuming and required the constant availability of the Department of Physics. However, once working, considerable saving in secretarial time has resulted and a number of projects have been started which would not have been contemplated without the use of the word processor and its linked computer. PMID:7068906
Reconfigurable tree architectures using subtree oriented fault tolerance

NASA Technical Reports Server (NTRS)

Lowrie, Matthew B.

1987-01-01

An approach to the design of reconfigurable tree architecture is presented in which spare processors are allocated at the leaves. The approach is unique in that spares are associated with subtrees and sharing of spares between these subtrees can occur. The Subtree Oriented Fault Tolerance (SOFT) approach is more reliable than previous approaches capable of tolerating link and switch failures for both single chip and multichip tree implementations while reducing redundancy in terms of both spare processors and links. VLSI layout is O(n) for binary trees and is directly extensible to N-ary trees and fault tolerance through performance degradation.

Parallel Gaussian elimination of a block tridiagonal matrix using multiple microcomputers

NASA Technical Reports Server (NTRS)

Blech, Richard A.

1989-01-01

The solution of a block tridiagonal matrix using parallel processing is demonstrated. The multiprocessor system on which results were obtained and the software environment used to program that system are described. Theoretical partitioning and resource allocation for the Gaussian elimination method used to solve the matrix are discussed. The results obtained from running 1, 2 and 3 processor versions of the block tridiagonal solver are presented. The PASCAL source code for these solvers is given in the appendix, and may be transportable to other shared memory parallel processors provided that the synchronization outlines are reproduced on the target system.
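The elimination pattern for a tridiagonal system can be sketched in its scalar, sequential form. This is a simplified illustration (the Thomas algorithm without pivoting), not the report's block, multi-processor PASCAL solver; in the block case each scalar division would become a small matrix solve. The function name and argument layout are hypothetical.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system by forward elimination and back substitution.

    Row i reads: lower[i]*x[i-1] + diag[i]*x[i] + upper[i]*x[i+1] = rhs[i].
    lower[0] and upper[-1] are ignored. No pivoting is performed, so the
    matrix should be diagonally dominant or otherwise well conditioned.
    """
    n = len(diag)
    c = [0.0] * n   # modified upper diagonal
    d = [0.0] * n   # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


# Small example: the exact solution is [1, 1, 1].
print(solve_tridiagonal([0.0, -1.0, -1.0], [2.0, 2.0, 2.0], [-1.0, -1.0, 0.0], [1.0, 0.0, 1.0]))
```

The forward sweep carries a dependency from row i-1 to row i, which is why a parallel version needs the explicit partitioning and synchronization the abstract refers to.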
Processor Would Find Best Paths On Map

NASA Technical Reports Server (NTRS)

Eberhardt, Silvio P.

1990-01-01

Proposed very-large-scale integrated (VLSI) circuit image-data processor finds path of least cost from specified origin to any destination on map. Cost of traversal assigned to each picture element of map. Path of least cost from originating picture element to every other picture element computed as path that preserves as much as possible of signal transmitted by originating picture element. Dedicated microprocessor at each picture element stores cost of traversal and performs its share of computations of paths of least cost. Least-cost-path problem occurs in research, military maneuvers, and in planning routes of vehicles.
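The least-cost-path problem described in this record can be illustrated in software. The sketch below is not the proposed VLSI method (which assigns a dedicated microprocessor to each picture element); it swaps in the conventional Dijkstra algorithm on a grid, assuming non-negative per-pixel traversal costs and 4-connected movement. The function name and the sample map are hypothetical.

```python
import heapq
from typing import List, Tuple

def least_cost_from(costs: List[List[float]], origin: Tuple[int, int]) -> List[List[float]]:
    """Return the least total traversal cost from origin to every pixel.

    Assumptions (not from the abstract): costs are non-negative, moves are
    4-connected, and the cost of a path is the sum of the costs of the
    pixels it visits, including the origin.
    """
    rows, cols = len(costs), len(costs[0])
    best = [[float("inf")] * cols for _ in range(rows)]
    r0, c0 = origin
    best[r0][c0] = costs[r0][c0]
    heap = [(best[r0][c0], r0, c0)]
    while heap:
        d, r, c = heapq.heappop(heap)
        if d > best[r][c]:
            continue  # stale queue entry; a cheaper path was already found
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + costs[nr][nc]
                if nd < best[nr][nc]:
                    best[nr][nc] = nd
                    heapq.heappush(heap, (nd, nr, nc))
    return best

# Hypothetical 3x3 map: the expensive middle-row pixels force a detour.
sample_map = [[1, 1, 1],
              [9, 9, 1],
              [1, 1, 1]]
result = least_cost_from(sample_map, (0, 0))
print(result[2][0])
```

A serial priority queue is used here only for clarity; the per-pixel hardware in the record would perform the equivalent relaxation at every picture element concurrently.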
Parallelizing ATLAS Reconstruction and Simulation: Issues and Optimization Solutions for Scaling on Multi- and Many-CPU Platforms

NASA Astrophysics Data System (ADS)

Leggett, C.; Binet, S.; Jackson, K.; Levinthal, D.; Tatarkhanov, M.; Yao, Y.

2011-12-01

Thermal limitations have forced CPU manufacturers to shift from simply increasing clock speeds to improve processor performance, to producing chip designs with multi- and many-core architectures. Further, the cores themselves can run multiple threads as a zero overhead context switch allowing low level resource sharing (Intel Hyperthreading). To maximize bandwidth and minimize memory latency, memory access has become non uniform (NUMA). As manufacturers add more cores to each chip, a careful understanding of the underlying architecture is required in order to fully utilize the available resources. We present AthenaMP and the Atlas event loop manager, the driver of the simulation and reconstruction engines, which have been rewritten to make use of multiple cores, by means of event based parallelism, and final stage I/O synchronization. However, initial studies on 8 and 16 core Intel architectures have shown marked non-linearities as parallel process counts increase, with as much as 30% reductions in event throughput in some scenarios. Since the Intel Nehalem architecture (both Gainestown and Westmere) will be the most common choice for the next round of hardware procurements, an understanding of these scaling issues is essential. Using hardware based event counters and Intel's Performance Tuning Utility, we have studied the performance bottlenecks at the hardware level, and discovered optimization schemes to maximize processor throughput. We have also produced optimization mechanisms, common to all large experiments, that address the extreme nature of today's HEP code, which due to its size, places huge burdens on the memory infrastructure of today's processors.
Performance Evaluation and Modeling Techniques for Parallel Processors. Ph.D. Thesis

NASA Technical Reports Server (NTRS)

Dimpsey, Robert Tod

1992-01-01

In practice, the performance evaluation of supercomputers is still substantially driven by single-point estimates of metrics (e.g., MFLOPS) obtained by running characteristic benchmarks or workloads. With the rapid increase in the use of time-shared multiprogramming in these systems, such measurements are clearly inadequate. This is because multiprogramming and system overhead, as well as other degradations in performance due to time varying characteristics of workloads, are not taken into account. In multiprogrammed environments, multiple jobs and users can dramatically increase the amount of system overhead and degrade the performance of the machine. Performance techniques, such as benchmarking, which characterize performance on a dedicated machine ignore this major component of true computer performance. Due to the complexity of analysis, there has been little work done in analyzing, modeling, and predicting the performance of applications in multiprogrammed environments. This is especially true for parallel processors, where the costs and benefits of multi-user workloads are exacerbated. While some may claim that the issue of multiprogramming is not a viable one in the supercomputer market, experience shows otherwise. Even in recent massively parallel machines, multiprogramming is a key component. It has even been claimed that a partial cause of the demise of the CM2 was the fact that it did not efficiently support time-sharing. In the same paper, Gordon Bell postulates that multicomputers will evolve to multiprocessors in order to support efficient multiprogramming. Therefore, it is clear that parallel processors of the future will be required to offer the user a time-shared environment with reasonable response times for the applications. In this type of environment, the most important performance metric is the completion or response time of a given application. However, there are few evaluation efforts addressing this issue.
High order parallel numerical schemes for solving incompressible flows

NASA Technical Reports Server (NTRS)

Lin, Avi; Milner, Edward J.; Liou, May-Fun; Blech, Richard A.

1992-01-01

The use of parallel computers for numerically solving flow fields has gained much importance in recent years. This paper introduces a new high order numerical scheme for computational fluid dynamics (CFD) specifically designed for parallel computational environments. A distributed MIMD system gives the flexibility of treating different elements of the governing equations with totally different numerical schemes in different regions of the flow field. The parallel decomposition of the governing operator to be solved is the primary parallel split. The primary parallel split was studied using a hypercube like architecture having clusters of shared memory processors at each node. The approach is demonstrated using examples of simple steady state incompressible flows. Future studies should investigate the secondary split because, depending on the numerical scheme that each of the processors applies and the nature of the flow in the specific subdomain, it may be possible for a processor to seek better, or higher order, schemes for its particular subcase.
Efficient Sorting on the Tilera Manycore Architecture

DOE Office of Scientific and Technical Information (OSTI.GOV)

Morari, Alessandro; Tumeo, Antonino; Villa, Oreste

We present an efficient implementation of the radix sort algorithm for the Tilera TILEPro64 processor. The TILEPro64 is one of the first successful commercial manycore processors. It is composed of 64 tiles interconnected through multiple fast Networks-on-chip and features a fully coherent, shared distributed cache. The architecture has a large degree of flexibility, and allows various optimization strategies. We describe how we mapped the algorithm to this architecture. We present an in-depth analysis of the optimizations for each phase of the algorithm with respect to the processor’s sustained performance. We discuss the overall throughput reached by our radix sort implementation (up to 132 MK/s) and show that it provides comparable or better performance-per-watt with respect to state-of-the-art implementations on x86 processors and graphic processing units.
Three-Dimensional High-Lift Analysis Using a Parallel Unstructured Multigrid Solver

NASA Technical Reports Server (NTRS)

Mavriplis, Dimitri J.

1998-01-01

A directional implicit unstructured agglomeration multigrid solver is ported to shared and distributed memory massively parallel machines using the explicit domain-decomposition and message-passing approach. Because the algorithm operates on local implicit lines in the unstructured mesh, special care is required in partitioning the problem for parallel computing. A weighted partitioning strategy is described which avoids breaking the implicit lines across processor boundaries, while incurring minimal additional communication overhead. Good scalability is demonstrated on a 128 processor SGI Origin 2000 machine and on a 512 processor CRAY T3E machine for reasonably fine grids. The feasibility of performing large-scale unstructured grid calculations with the parallel multigrid algorithm is demonstrated by computing the flow over a partial-span flap wing high-lift geometry on a highly resolved grid of 13.5 million points in approximately 4 hours of wall clock time on the CRAY T3E.
Cache Sharing and Isolation Tradeoffs in Multicore Mixed-Criticality Systems

DTIC Science & Technology

2015-05-01

of lockdown registers, to provide way-based partitioning. These alternatives are illustrated in Fig. 1 with respect to a quad-core ARM Cortex A9...presented a cache-partitioning scheme that allows multiple tasks to share the same cache partition on a single processor (as we do for Level-A and...sets and determined the fraction that were schedulable on our target hardware platform, the quad-core ARM Cortex A9 machine mentioned earlier, the LLC
Binarized cross-approximate entropy in crowdsensing environment.

PubMed

Skoric, Tamara; Mohamoud, Omer; Milovanovic, Branislav; Japundzic-Zigon, Nina; Bajic, Dragana

2017-01-01

Personalised monitoring in health applications has been recognised as part of the mobile crowdsensing concept, where subjects equipped with sensors extract information and share it for personal or common benefit. Limited transmission resources impose the use of local analyses methodology, but this approach is incompatible with analytical tools that require stationary and artefact-free data. This paper proposes a computationally efficient binarised cross-approximate entropy, referred to as (X)BinEn, for unsupervised cardiovascular signal processing in environments where energy and processor resources are limited. The proposed method is a descendant of the cross-approximate entropy ((X)ApEn). It operates on binary, differentially encoded data series split into m-sized vectors. The Hamming distance is used as a distance measure, while a search for similarities is performed on the vector sets. The procedure is tested on rats under shaker and restraint stress, and compared to the existing (X)ApEn results. The number of processing operations is reduced. (X)BinEn captures entropy changes in a similar manner to (X)ApEn. The coding coarseness yields an adverse effect of reduced sensitivity, but it attenuates parameter inconsistency and binary bias. A special case of (X)BinEn is equivalent to Shannon's entropy. A binary conditional entropy for m = 1 vectors is embedded into the (X)BinEn procedure. (X)BinEn can be applied to a single time series as an auto-entropy method, or to a pair of time series, as a cross-entropy method. Its low processing requirements make it suitable for mobile, battery operated, self-attached sensing devices, with limited power and processor resources.
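The preprocessing steps named in this record (binary differential encoding, splitting into m-sized vectors, Hamming distance) can be sketched as follows. This is only an illustration of those three steps, not the (X)BinEn estimator itself. The encoding rule (1 for an increase, 0 otherwise), the use of overlapping windows, and all function names are assumptions; the abstract does not specify them.

```python
from typing import List, Sequence, Tuple

def binary_differential_encode(series: Sequence[float]) -> List[int]:
    """Assumed rule: emit 1 where the series increases, 0 otherwise."""
    return [1 if b > a else 0 for a, b in zip(series, series[1:])]

def split_into_vectors(bits: Sequence[int], m: int) -> List[Tuple[int, ...]]:
    """Assumed layout: all overlapping windows of length m."""
    return [tuple(bits[i:i + m]) for i in range(len(bits) - m + 1)]

def hamming_distance(u: Sequence[int], v: Sequence[int]) -> int:
    """Number of positions at which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(u, v))

# Hypothetical sample series, for illustration only.
samples = [3, 5, 4, 4, 7, 6]
bits = binary_differential_encode(samples)
vectors = split_into_vectors(bits, m=2)
print(bits, vectors, hamming_distance(vectors[0], vectors[2]))
```

Because the vectors are binary, the distance is an integer count of differing bits, which is consistent with the record's claim of a reduced number of processing operations.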
Comparison between Frame-Constrained Fix-Pixel-Value and Frame-Free Spiking-Dynamic-Pixel ConvNets for Visual Processing

PubMed Central

Farabet, Clément; Paz, Rafael; Pérez-Carrasco, Jose; Zamarreño-Ramos, Carlos; Linares-Barranco, Alejandro; LeCun, Yann; Culurciello, Eugenio; Serrano-Gotarredona, Teresa; Linares-Barranco, Bernabe

2012-01-01

Most scene segmentation and categorization architectures for the extraction of features in images and patches make exhaustive use of 2D convolution operations for template matching, template search, and denoising. Convolutional Neural Networks (ConvNets) are one example of such architectures that can implement general-purpose bio-inspired vision systems. In standard digital computers 2D convolutions are usually expensive in terms of resource consumption and impose severe limitations for efficient real-time applications. Nevertheless, neuro-cortex inspired solutions, like dedicated Frame-Based or Frame-Free Spiking ConvNet Convolution Processors, are advancing real-time visual processing. These two approaches share the neural inspiration, but each of them solves the problem in different ways. Frame-Based ConvNets process frame by frame video information in a very robust and fast way that requires using and sharing the available hardware resources (such as multipliers and adders). Hardware resources are fixed and time-multiplexed by fetching data in and out. Thus memory bandwidth and size are important for good performance. On the other hand, spike-based convolution processors are a frame-free alternative that is able to perform convolution of a spike-based source of visual information with very low latency, which makes them ideal for very high-speed applications. However, hardware resources need to be available all the time and cannot be time-multiplexed. Thus, hardware should be modular, reconfigurable, and expansible. Hardware implementations in both VLSI custom integrated circuits (digital and analog) and FPGA have been already used to demonstrate the performance of these systems. In this paper we present a comparison study of these two neuro-inspired solutions. A brief description of both systems is presented and also discussions about their differences, pros and cons. PMID:22518097
A Methodology for Distributing the Corporate Database.

ERIC Educational Resources Information Center

McFadden, Fred R.

The trend to distributed processing is being fueled by numerous forces, including advances in technology, corporate downsizing, increasing user sophistication, and acquisitions and mergers. Increasingly, the trend in corporate information systems (IS) departments is toward sharing resources over a network of multiple types of processors, operating…
The Tera Multithreaded Architecture and Unstructured Meshes

NASA Technical Reports Server (NTRS)

Bokhari, Shahid H.; Mavriplis, Dimitri J.

1998-01-01

The Tera Multithreaded Architecture (MTA) is a new parallel supercomputer currently being installed at San Diego Supercomputing Center (SDSC). This machine has an architecture quite different from contemporary parallel machines. The computational processor is a custom design and the machine uses hardware to support very fine grained multithreading. The main memory is shared, hardware randomized and flat. These features make the machine highly suited to the execution of unstructured mesh problems, which are difficult to parallelize on other architectures. We report the results of a study carried out during July-August 1998 to evaluate the execution of EUL3D, a code that solves the Euler equations on an unstructured mesh, on the 2 processor Tera MTA at SDSC. Our investigation shows that parallelization of an unstructured code is extremely easy on the Tera. We were able to get an existing parallel code (designed for a shared memory machine), running on the Tera by changing only the compiler directives. Furthermore, a serial version of this code was compiled to run in parallel on the Tera by judicious use of directives to invoke the "full/empty" tag bits of the machine to obtain synchronization. This version achieves 212 and 406 Mflop/s on one and two processors respectively, and requires no attention to partitioning or placement of data issues that would be of paramount importance in other parallel architectures.
Rational calculation accuracy in acousto-optical matrix-vector processor

NASA Astrophysics Data System (ADS)

Oparin, V. V.; Tigin, Dmitry V.

1994-01-01

The high speed of parallel computations for a comparatively small-size processor and acceptable power consumption makes the usage of acousto-optic matrix-vector multiplier (AOMVM) attractive for processing of large amounts of information in real time. The limited accuracy of computations is an essential disadvantage of such a processor. The reduced accuracy requirements allow for considerable simplification of the AOMVM architecture and the reduction of the demands on its components.
Considerations for Future Climate Data Stewardship

NASA Astrophysics Data System (ADS)

Halem, M.; Nguyen, P. T.; Chapman, D. R.

2009-12-01

In this talk, we will describe the lessons learned based on processing and generating a decade of gridded AIRS and MODIS IR sounding data. We describe the challenges faced in accessing and sharing very large data sets, maintaining data provenance under evolving technologies, obtaining access to legacy calibration data and the permanent preservation of Earth science data records for on demand services. These lessons suggest a new approach to data stewardship will be required for the next decade of hyper spectral instruments combined with cloud resolving models. It will not be sufficient for stewards of future data centers to just provide the public with access to archived data; our experience indicates that data needs to reside close to computers with ultra large disc farms and tens of thousands of processors to deliver complex services on demand over very high speed networks much like the offerings of search engines today. Over the first decade of the 21st century, petabyte data records were acquired from the AIRS instrument on Aqua and the MODIS instrument on Aqua and Terra. NOAA data centers also maintain petabytes of operational IR sounders collected over the past four decades. The UMBC Multicore Computational Center (MC2) developed a Service Oriented Atmospheric Radiance gridding system (SOAR) to allow users to select IR sounding instruments from multiple archives and choose space-time-spectral periods of Level 1B data to download, grid, visualize and analyze on demand. Providing this service requires high data rate bandwidth access to the on line disks at Goddard. After 10 years, cost effective disk storage technology finally caught up with the MODIS data volume making it possible for Level 1B MODIS data to be available on line. However, 10Ge fiber optic networks to access large volumes of data are still not available from GSFC to serve the broader community. Data transfer rates are well below 10MB/s limiting their usefulness for climate studies.
During this decade, processor performance hit a power wall leading computer vendors to design multicore processor chips. High performance computer systems obtained petaflop performance by clustering tens of thousands of multicore processor chips. Thus, power consumption and autonomic recovery from processor and disc failures have become major cost and technical considerations for future data archives. To address these new architecture requirements, a transparent parallel programming paradigm, the Hadoop MapReduce cloud computing system, became available as an open software system. In addition, the Hadoop File System manages the distribution of data to these processors and backs up the processing in the event of any processor or disc failure. However, to employ this paradigm, the data needs to be stored on the computer system. We conclude this talk with a climate data preservation approach that addresses the scalability crisis to exabyte data requirements for the next decade based on projections of processor, disc data density and bandwidth doubling rates.
An efficient 3-dim FFT for plane wave electronic structure calculations on massively parallel machines composed of multiprocessor nodes

NASA Astrophysics Data System (ADS)

Goedecker, Stefan; Boulet, Mireille; Deutsch, Thierry

2003-08-01

Three-dimensional Fast Fourier Transforms (FFTs) are the main computational task in plane wave electronic structure calculations. Obtaining high performance on a large number of processors is non-trivial on the latest generation of parallel computers that consist of nodes made up of shared memory multiprocessors. A non-dogmatic method for obtaining high performance for such 3-dim FFTs in a combined MPI/OpenMP programming paradigm will be presented. Exploiting the peculiarities of plane wave electronic structure calculations, speedups of up to 160 and speeds of up to 130 Gflops were obtained on 256 processors.
An MPI-IO interface to HPSS

NASA Technical Reports Server (NTRS)

Jones, Terry; Mark, Richard; Martin, Jeanne; May, John; Pierce, Elsie; Stanberry, Linda

1996-01-01

This paper describes an implementation of the proposed MPI-IO (Message Passing Interface - Input/Output) standard for parallel I/O. Our system uses third-party transfer to move data over an external network between the processors where it is used and the I/O devices where it resides. Data travels directly from source to destination, without the need for shuffling it among processors or funneling it through a central node. Our distributed server model lets multiple compute nodes share the burden of coordinating data transfers. The system is built on the High Performance Storage System (HPSS), and a prototype version runs on a Meiko CS-2 parallel computer.
Memory access in shared virtual memory

DOE Office of Scientific and Technical Information (OSTI.GOV)

Berrendorf, R.

1992-01-01

Shared virtual memory (SVM) is a virtual memory layer with a single address space on top of a distributed real memory on parallel computers. We examine the behavior and performance of SVM running a parallel program with medium-grained, loop-level parallelism on top of it. A simulator for the underlying parallel architecture can be used to examine the behavior of SVM more deeply. The influence of several parameters, such as the number of processors, page size, cold or warm start, and restricted page replication, is studied.
Compiler-directed cache management in multiprocessors

NASA Technical Reports Server (NTRS)

Cheong, Hoichi; Veidenbaum, Alexander V.

1990-01-01

The necessity of finding alternatives to hardware-based cache coherence strategies for large-scale multiprocessor systems is discussed. Three different software-based strategies sharing the same goals and general approach are presented. They consist of a simple invalidation approach, a fast selective invalidation scheme, and a version control scheme. The strategies are suitable for shared-memory multiprocessor systems with interconnection networks and a large number of processors. Results of trace-driven simulations conducted on numerical benchmark routines to compare the performance of the three schemes are presented.

A Stream Tilling Approach to Surface Area Estimation for Large Scale Spatial Data in a Shared Memory System

NASA Astrophysics Data System (ADS)

Liu, Jiping; Kang, Xiaochen; Dong, Chun; Xu, Shenghua

2017-12-01

Surface area estimation is a widely used tool for resource evaluation in the physical world. When processing large scale spatial data, the input/output (I/O) can easily become the bottleneck in parallelizing the algorithm due to the limited physical memory resources and the very slow disk transfer rate. In this paper, we proposed a stream tilling approach to surface area estimation that first decomposed a spatial data set into tiles with topological expansions. With these tiles, the one-to-one mapping relationship between the input and the computing process was broken. Then, we realized a streaming framework for scheduling the I/O processes and computing units. Herein, each computing unit encapsulated an identical copy of the estimation algorithm, and multiple asynchronous computing units could work individually in parallel. Finally, the experiment demonstrated that our stream tilling estimation can efficiently alleviate the heavy pressure from the I/O-bound work, and that the measured speedup after optimization greatly exceeded that of the directly parallel versions in shared memory systems with multi-core processors.
Dynamically programmable cache

NASA Astrophysics Data System (ADS)

Nakkar, Mouna; Harding, John A.; Schwartz, David A.; Franzon, Paul D.; Conte, Thomas

1998-10-01

Reconfigurable machines have recently been used as co-processors to accelerate the execution of certain algorithms or program subroutines. The problems with the above approach include high reconfiguration time and limited partial reconfiguration. By far the most critical problems are: (1) the small on-chip memory which results in slower execution time, and (2) small FPGA areas that cannot implement large subroutines. Dynamically Programmable Cache (DPC) is a novel architecture for embedded processors which offers solutions to the above problems. To solve memory access problems, DPC processors merge reconfigurable arrays with the data cache at various cache levels to create multi-level reconfigurable machines. As a result DPC machines have both higher data accessibility and FPGA memory bandwidth. To solve the limited FPGA resource problem, DPC processors implement a multi-context switching (virtualization) concept. Virtualization allows implementation of large subroutines with fewer FPGA cells. Additionally, DPC processors can parallelize the execution of several operations resulting in faster execution time. In this paper, DPC machines are shown to be 5X faster than an Altera FLEX10K FPGA chip and 2X faster than a Sun Ultra1 SPARC station for two different algorithms (convolution and motion estimation).
Geospace simulations on the Cell BE processor

NASA Astrophysics Data System (ADS)

Germaschewski, K.; Raeder, J.; Larson, D.

2008-12-01

OpenGGCM (Open Geospace General circulation Model) is an established numerical code that simulates the Earth's space environment. The most computing intensive part is the MHD (magnetohydrodynamics) solver that models the plasma surrounding Earth and its interaction with Earth's magnetic field and the solar wind flowing in from the sun. Like other global magnetosphere codes, OpenGGCM's realism is limited by computational constraints on grid resolution. We investigate porting of the MHD solver to the Cell BE architecture, a novel inhomogeneous multicore architecture capable of up to 230 GFlops per processor. Realizing this high performance on the Cell processor is a programming challenge, though. We implemented the MHD solver using a multi-level parallel approach: On the coarsest level, the problem is distributed to processors based upon the usual domain decomposition approach. Then, on each processor, the problem is divided into 3D columns, each of which is handled by the memory limited SPEs (synergistic processing elements) slice by slice. Finally, SIMD instructions are used to fully exploit the vector/SIMD FPUs in each SPE. Memory management needs to be handled explicitly by the code, using DMA to move data from main memory to the per-SPE local store and vice versa. We obtained excellent performance numbers, a speed-up of a factor of 25 compared to just using the main processor, while still keeping the numerical implementation details of the code maintainable.
Ethernet-Enabled Power and Communication Module for Embedded Processors

NASA Technical Reports Server (NTRS)

Perotti, Jose; Oostdyk, Rebecca

2010-01-01

The power and communications module is a printed circuit board (PCB) that has the capability of providing power to an embedded processor and converting Ethernet packets into serial data to transfer to the processor. The purpose of the new design is to address the shortcomings of previous designs, including limited bandwidth and program memory, lack of control over packet processing, and lack of support for timing synchronization. The new design of the module creates a robust serial-to-Ethernet conversion that is powered using the existing Ethernet cable. This innovation has a small form factor that allows it to power processors and transducers with minimal space requirements.
Time-partitioning simulation models for calculation on parallel computers

NASA Technical Reports Server (NTRS)

Milner, Edward J.; Blech, Richard A.; Chima, Rodrick V.

1987-01-01

A technique allowing time-staggered solution of partial differential equations is presented in this report. Using this technique, called time-partitioning, simulation execution speedup is proportional to the number of processors used because all processors operate simultaneously, with each updating of the solution grid at a different time point. The technique is limited by neither the number of processors available nor by the dimension of the solution grid. Time-partitioning was used to obtain the flow pattern through a cascade of airfoils, modeled by the Euler partial differential equations. An execution speedup factor of 1.77 was achieved using a two processor Cray X-MP/24 computer.
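The reported figure can be put in context with a one-line parallel efficiency calculation (speedup divided by processor count). The helper below is a hypothetical illustration, not part of the report; only the values 1.77 and 2 come from the abstract.

```python
def parallel_efficiency(speedup: float, processors: int) -> float:
    """Fraction of ideal linear speedup actually achieved."""
    if processors < 1:
        raise ValueError("processors must be at least 1")
    return speedup / processors

# Figures quoted in the abstract: speedup 1.77 on a two-processor machine.
efficiency = parallel_efficiency(1.77, 2)
print(f"efficiency = {efficiency:.3f}")
```

An efficiency of about 0.885 means the two-processor run came within roughly 12% of the ideal speedup of 2 implied by the claim that speedup is proportional to the number of processors.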
The Automatic Parallelisation of Scientific Application Codes Using a Computer Aided Parallelisation Toolkit

NASA Technical Reports Server (NTRS)

Ierotheou, C.; Johnson, S.; Leggett, P.; Cross, M.; Evans, E.; Jin, Hao-Qiang; Frumkin, M.; Yan, J.; Biegel, Bryan (Technical Monitor)

2001-01-01

The shared-memory programming model is a very effective way to achieve parallelism on shared memory parallel computers. Historically, the lack of a programming standard for using directives and the rather limited performance due to scalability have affected the take-up of this programming model approach. Significant progress has been made in hardware and software technologies; as a result, the performance of parallel programs with compiler directives has also improved. The introduction of an industrial standard for shared-memory programming with directives, OpenMP, has also addressed the issue of portability. In this study, we have extended the computer aided parallelization toolkit (developed at the University of Greenwich), to automatically generate OpenMP based parallel programs with nominal user assistance. We outline the way in which loop types are categorized and how efficient OpenMP directives can be defined and placed using the in-depth interprocedural analysis that is carried out by the toolkit. We also discuss the application of the toolkit on the NAS Parallel Benchmarks and a number of real-world application codes. This work not only demonstrates the great potential of using the toolkit to quickly parallelize serial programs but also the good performance achievable on up to 300 processors for hybrid message passing and directive-based parallelizations.
A Course on Reconfigurable Processors

ERIC Educational Resources Information Center

Shoufan, Abdulhadi; Huss, Sorin A.

2010-01-01

Reconfigurable computing is an established field in computer science. Teaching this field to computer science students demands special attention due to limited student experience in electronics and digital system design. This article presents a compact course on reconfigurable processors, which was offered at the Technische Universitat Darmstadt,…
Neuromorphic Computing: A Post-Moore's Law Complementary Architecture

DOE Office of Scientific and Technical Information (OSTI.GOV)

Schuman, Catherine D; Birdwell, John Douglas; Dean, Mark

2016-01-01

We describe our approach to post-Moore's law computing with three neuromorphic computing models that share a RISC philosophy, featuring simple components combined with a flexible and programmable structure. We envision these to be leveraged as co-processors, or as data filters to provide in situ data analysis in supercomputing environments.
Expert Systems on Multiprocessor Architectures. Volume 2. Technical Reports

DTIC Science & Technology

1991-06-01

Report RC 12936 (#58037). IBM T. J. Watson Research Center. July 1987. Alan Jay Smith. Cache memories. Computing Surveys, 14(3):473-530...basic-shared is an instrument for a shared memory design. The components panels are processor-qload-scrolling-bar-panel, memory-qload-scrolling-bar-panel
Copyright in the Age of Photocopiers, Word Processors, and the Internet

ERIC Educational Resources Information Center

Shaw, Marjorie Hodges; Shaw, Brian B.

2003-01-01

Widespread digital infringement of the copyrighted material now has made security firms, night-vision goggles, and metal detectors common in movie previews. The current national controversy over peer-to-peer file sharing of music highlights the difficult questions facing colleges and universities as they grapple with dramatic technological…
Memory Network For Distributed Data Processors

NASA Technical Reports Server (NTRS)

Bolen, David; Jensen, Dean; Millard, ED; Robinson, Dave; Scanlon, George

1992-01-01

Universal Memory Network (UMN) is modular, digital data-communication system enabling computers with differing bus architectures to share 32-bit-wide data between locations up to 3 km apart with less than one millisecond of latency. Makes it possible to design sophisticated real-time and near-real-time data-processing systems without data-transfer "bottlenecks". This enterprise network permits transmission of volume of data equivalent to an encyclopedia each second. Facilities benefiting from Universal Memory Network include telemetry stations, simulation facilities, power-plants, and large laboratories or any facility sharing very large volumes of data. Main hub of UMN is reflection center including smaller hubs called Shared Memory Interfaces.
Comparison of the CENTRM resonance processor to the NITAWL resonance processor in SCALE

DOE Office of Scientific and Technical Information (OSTI.GOV)

Hollenbach, D.F.; Petrie, L.M.

1998-01-01

This report compares the NITAWL and CENTRM resonance processors in the SCALE code system. The cases examined consist of the International OECD/NEA Criticality Working Group Benchmark 20 problem. These cases represent fuel pellets partially dissolved in a borated solution. The assumptions inherent to the Nordheim Integral Treatment, used in NITAWL, are not valid for these problems. CENTRM resolves this limitation by explicitly calculating a problem dependent point flux from point cross sections, which is then used to create group cross sections.
Noise limitations in optical linear algebra processors.

PubMed

Batsell, S G; Jong, T L; Walkup, J F; Krile, T F

1990-05-10

A general statistical noise model is presented for optical linear algebra processors. A statistical analysis which includes device noise, the multiplication process, and the addition operation is undertaken. We focus on those processes which are architecturally independent. Finally, experimental results which verify the analytical predictions are also presented.
The Jet Propulsion Laboratory shared control architecture and implementation

NASA Technical Reports Server (NTRS)

Backes, Paul G.; Hayati, Samad

1990-01-01

A hardware and software environment for shared control of telerobot task execution has been implemented. Modes of task execution range from fully teleoperated to fully autonomous as well as shared where hand controller inputs from the human operator are mixed with autonomous system inputs in real time. The objective of the shared control environment is to aid the telerobot operator during task execution by merging real-time operator control from hand controllers with autonomous control to simplify task execution for the operator. The operator is the principal command source and can assign as much autonomy for a task as desired. The shared control hardware environment consists of two PUMA 560 robots, two 6-axis force reflecting hand controllers, Universal Motor Controllers for each of the robots and hand controllers, a SUN4 computer, and VME chassis containing 68020 processors and input/output boards. The operator interface for shared control, the User Macro Interface (UMI), is a menu driven interface to design a task and assign the levels of teleoperated and autonomous control. The operator also sets up the system monitor which checks safety limits during task execution. Cartesian-space degrees of freedom for teleoperated and/or autonomous control inputs are selected within UMI as well as the weightings for the teleoperation and autonomous inputs. These are then used during task execution to determine the mix of teleoperation and autonomous inputs. Some of the autonomous control primitives available to the user are Joint-Guarded-Move, Cartesian-Guarded-Move, Move-To-Touch, Pin-Insertion/Removal, Door/Crank-Turn, Bolt-Turn, and Slide. The operator can execute a task using pure teleoperation or mix control execution from the autonomous primitives with teleoperated inputs. Presently the shared control environment supports single arm task execution. Work is presently underway to provide the shared control environment for dual arm control. Teleoperation during shared control is only Cartesian space control and no force-reflection is provided. Force-reflecting teleoperation and joint space operator inputs are planned extensions to the environment.
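The abstract above says that per-axis weightings determine the mix of teleoperated and autonomous inputs, but it does not give the mixing rule. The sketch below assumes a simple per-axis linear blend; the function name `blend_inputs` and the weighted-sum form are illustrative assumptions, not the JPL implementation.

```python
def blend_inputs(teleop, autonomous, w_teleop, w_auto):
    """Mix operator and autonomous commands one Cartesian axis at a time.

    Assumption: the mix is a weighted sum per degree of freedom.
    The source only states that weightings determine the mix.
    """
    if not (len(teleop) == len(autonomous) == len(w_teleop) == len(w_auto)):
        raise ValueError("all inputs must have one entry per degree of freedom")
    return [
        wt * t + wa * a
        for t, a, wt, wa in zip(teleop, autonomous, w_teleop, w_auto)
    ]


# Two axes for brevity: the first is shared equally, the second is fully autonomous.
command = blend_inputs([2.0, 0.0], [4.0, 1.0], [0.5, 0.0], [0.5, 1.0])
```

Setting a teleoperation weight to zero on an axis hands that axis entirely to the autonomous primitive, which matches the abstract's range from fully teleoperated to fully autonomous.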
40 CFR 747.115 - Mixed mono and diamides of an organic acid.

Code of Federal Regulations, 2010 CFR

2010-07-01

... warning statement shall be no smaller than six point type. All required label text shall be of sufficient..., commerce, importer, impurity, Inventory, manufacturer, person, process, processor, and small quantities... control of the processor. (ii) Distribution in commerce is limited to purposes of export. (iii) The...
Arranging computer architectures to create higher-performance controllers

NASA Technical Reports Server (NTRS)

Jacklin, Stephen A.

1988-01-01

Techniques for integrating microprocessors, array processors, and other intelligent devices in control systems are reviewed, with an emphasis on the (re)arrangement of components to form distributed or parallel processing systems. Consideration is given to the selection of the host microprocessor, increasing the power and/or memory capacity of the host, multitasking software for the host, array processors to reduce computation time, the allocation of real-time and non-real-time events to different computer subsystems, intelligent devices to share the computational burden for real-time events, and intelligent interfaces to increase communication speeds. The case of a helicopter vibration-suppression and stabilization controller is analyzed as an example, and significant improvements in computation and throughput rates are demonstrated.
Multi-processor including data flow accelerator module

DOEpatents

Davidson, George S.; Pierce, Paul E.

1990-01-01

An accelerator module for a data flow computer includes an intelligent memory. The module is added to a multiprocessor arrangement and uses a shared tagged memory architecture in the data flow computer. The intelligent memory module assigns locations for holding data values in correspondence with arcs leading to a node in a data dependency graph. Each primitive computation is associated with a corresponding memory cell, including a number of slots for operands needed to execute a primitive computation, a primitive identifying pointer, and linking slots for distributing the result of the cell computation to other cells requiring that result as an operand. Circuitry is provided for utilizing tag bits to determine automatically when all operands required by a processor are available and for scheduling the primitive for execution in a queue. Each memory cell of the module may be associated with any of the primitives, and the particular primitive to be executed by the processor associated with the cell is identified by providing an index, such as the cell number for the primitive, to the primitive lookup table of starting addresses. The module thus serves to perform functions previously performed by a number of sections of data flow architectures and coexists with conventional shared memory therein. A multiprocessing system including the module operates in a hybrid mode, wherein the same processing modules are used to perform some processing in a sequential mode, under immediate control of an operating system, while performing other processing in a data flow mode.
Data traffic reduction schemes for Cholesky factorization on asynchronous multiprocessor systems

NASA Technical Reports Server (NTRS)

Naik, Vijay K.; Patrick, Merrell L.

1989-01-01

Communication requirements of Cholesky factorization of dense and sparse symmetric, positive definite matrices are analyzed. The communication requirement is characterized by the data traffic generated on multiprocessor systems with local and shared memory. Lower bound proofs are given to show that when the load is uniformly distributed the data traffic associated with factoring an n x n dense matrix using n to the alpha power (alpha less than or equal to 2) processors is omega(n to the 2 + alpha/2 power). For n x n sparse matrices representing a square root of n x square root of n regular grid graph the data traffic is shown to be omega(n to the 1 + alpha/2 power), alpha less than or equal to 1. Partitioning schemes that are variations of block assignment scheme are described and it is shown that the data traffic generated by these schemes is asymptotically optimal. The schemes allow efficient use of up to O(n to the 2nd power) processors in the dense case and up to O(n) processors in the sparse case before the total data traffic reaches the maximum value of O(n to the 3rd power) and O(n to the 3/2 power), respectively. It is shown that the block based partitioning schemes allow a better utilization of the data accessed from shared memory and thus reduce the data traffic more than those based on column-wise wrap around assignment schemes.
Design and implementation of a medium speed communications interface and protocol for a low cost, refreshed display computer

NASA Technical Reports Server (NTRS)

Phyne, J. R.; Nelson, M. D.

1975-01-01

The design and implementation of hardware and software systems involved in using a 40,000 bit/second communication line as the connecting link between an IMLAC PDS 1-D display computer and a Univac 1108 computer system were described. The IMLAC consists of two independent processors sharing a common memory. The display processor generates the deflection and beam control currents as it interprets a program contained in the memory; the minicomputer has a general instruction set and is responsible for starting and stopping the display processor and for communicating with the outside world through the keyboard, teletype, light pen, and communication line. The processing time associated with each data byte was minimized by designing the input and output processes as finite state machines which automatically sequence from each state to the next. Several tests of the communication link and the IMLAC software were made using a special low capacity computer grade cable between the IMLAC and the Univac.
An efficient parallel-processing method for transposing large matrices in place.

PubMed

Portnoff, M R

1999-01-01

We have developed an efficient algorithm for transposing large matrices in place. The algorithm is efficient because data are accessed either sequentially in blocks or randomly within blocks small enough to fit in cache, and because the same indexing calculations are shared among identical procedures operating on independent subsets of the data. This inherent parallelism makes the method well suited for a multiprocessor computing environment. The algorithm is easy to implement because the same two procedures are applied to the data in various groupings to carry out the complete transpose operation. Using only a single processor, we have demonstrated nearly an order of magnitude increase in speed over the previously published algorithm by Cate and Twigg for transposing a large rectangular matrix in place. With multiple processors operating in parallel, the processing speed increases almost linearly with the number of processors. A simplified version of the algorithm for square matrices is presented as well as an extension for matrices large enough to require virtual memory.
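For orientation, the sketch below shows the textbook cycle-following approach to transposing a rectangular matrix in place. It is not the blocked, cache-aware algorithm described in the abstract; the function name and the use of a `visited` list (which costs extra memory and is kept only for clarity) are assumptions of this illustration.

```python
def transpose_in_place(a, rows, cols):
    """Transpose a rows x cols row-major matrix stored in the flat list `a`.

    The element at flat index i moves to (i * rows) mod (rows * cols - 1);
    the first and last elements never move. Each permutation cycle is
    walked once, carrying one value at a time.
    """
    n = rows * cols
    if n != len(a):
        raise ValueError("list length must equal rows * cols")
    if n < 2:
        return
    last = n - 1
    visited = [False] * n  # bookkeeping only; a stricter in-place variant avoids it
    for start in range(1, last):
        if visited[start]:
            continue
        value = a[start]
        i = start
        while True:
            dest = (i * rows) % last
            # Drop the carried value at its destination and pick up what was there.
            a[dest], value = value, a[dest]
            visited[dest] = True
            i = dest
            if i == start:
                break


m = [1, 2, 3,
     4, 5, 6]
transpose_in_place(m, 2, 3)  # m now holds the 3 x 2 transpose
```

The access pattern here jumps across the whole array, which is the cache-unfriendly behavior that blocked methods such as the one in the abstract are designed to avoid.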

Implementing the PM Programming Language using MPI and OpenMP - a New Tool for Programming Geophysical Models on Parallel Systems

NASA Astrophysics Data System (ADS)

Bellerby, Tim

2015-04-01

PM (Parallel Models) is a new parallel programming language specifically designed for writing environmental and geophysical models. The language is intended to enable implementers to concentrate on the science behind the model rather than the details of running on parallel hardware. At the same time PM leaves the programmer in control - all parallelisation is explicit and the parallel structure of any given program may be deduced directly from the code. This paper describes a PM implementation based on the Message Passing Interface (MPI) and Open Multi-Processing (OpenMP) standards, looking at issues involved with translating the PM parallelisation model to MPI/OpenMP protocols and considering performance in terms of the competing factors of finer-grained parallelisation and increased communication overhead. In order to maximise portability, the implementation stays within the MPI 1.3 standard as much as possible, with MPI-2 MPI-IO file handling the only significant exception. Moreover, it does not assume a thread-safe implementation of MPI. PM adopts a two-tier abstract representation of parallel hardware. A PM processor is a conceptual unit capable of efficiently executing a set of language tasks, with a complete parallel system consisting of an abstract N-dimensional array of such processors. PM processors may map to single cores executing tasks using cooperative multi-tasking, to multiple cores or even to separate processing nodes, efficiently sharing tasks using algorithms such as work stealing. While tasks may move between hardware elements within a PM processor, they may not move between processors without specific programmer intervention. Tasks are assigned to processors using a nested parallelism approach, building on ideas from Reyes et al. (2009). The main program owns all available processors. 
When the program enters a parallel statement then either processors are divided out among the newly generated tasks (number of new tasks < number of processors) or tasks are divided out among the available processors (number of tasks > number of processors). Nested parallel statements may further subdivide the processor set owned by a given task. Tasks or processors are distributed evenly by default, but uneven distributions are possible under programmer control. It is also possible to explicitly enable child tasks to migrate within the processor set owned by their parent task, reducing load unbalancing at the potential cost of increased inter-processor message traffic. PM incorporates some programming structures from the earlier MIST language presented at a previous EGU General Assembly, while adopting a significantly different underlying parallelisation model and type system. PM code is available at www.pm-lang.org under an unrestrictive MIT license. Reference Ruymán Reyes, Antonio J. Dorta, Francisco Almeida, Francisco de Sande, 2009. Automatic Hybrid MPI+OpenMP Code Generation with llc, Recent Advances in Parallel Virtual Machine and Message Passing Interface, Lecture Notes in Computer Science Volume 5759, 185-195
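The default even distribution described above can be sketched as follows. This is an illustration in Python, not PM code; the function names, the contiguous chunking, and the handling of the case where tasks and processors are equal in number (treated here as one processor per task) are assumptions.

```python
def even_chunks(items, parts):
    """Split items into `parts` contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for k in range(parts):
        size = base + (1 if k < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def distribute(processors, tasks):
    """Return (mode, mapping) for one parallel statement.

    Fewer tasks than processors: each task owns a subset of processors.
    More tasks than processors: each processor receives a subset of tasks.
    """
    if not processors or not tasks:
        raise ValueError("need at least one processor and one task")
    if len(tasks) <= len(processors):
        return "processors_per_task", dict(zip(tasks, even_chunks(processors, len(tasks))))
    return "tasks_per_processor", dict(zip(processors, even_chunks(tasks, len(processors))))


mode_a, map_a = distribute(list(range(8)), ["a", "b", "c"])
mode_b, map_b = distribute([0, 1], ["t0", "t1", "t2", "t3", "t4"])
```

A nested parallel statement would call `distribute` again on the processor subset owned by the enclosing task, which mirrors the subdivision the abstract describes.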
Zonal methods for the parallel execution of range-limited N-body simulations

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bowers, Kevin J.; Dror, Ron O.; Shaw, David E.

2007-01-20

Particle simulations in fields ranging from biochemistry to astrophysics require the evaluation of interactions between all pairs of particles separated by less than some fixed interaction radius. The applicability of such simulations is often limited by the time required for calculation, but the use of massive parallelism to accelerate these computations is typically limited by inter-processor communication requirements. Recently, Snir [M. Snir, A note on N-body computations with cutoffs, Theor. Comput. Syst. 37 (2004) 295-318] and Shaw [D.E. Shaw, A fast, scalable method for the parallel evaluation of distance-limited pairwise particle interactions, J. Comput. Chem. 26 (2005) 1318-1328] independently introduced two distinct methods that offer asymptotic reductions in the amount of data transferred between processors. In the present paper, we show that these schemes represent special cases of a more general class of methods, and introduce several new algorithms in this class that offer practical advantages over all previously described methods for a wide range of problem parameters. We also show that several of these algorithms approach an approximate lower bound on inter-processor data transfer.
AN Integrated Bibliographic Information System: Concept and Application for Resource Sharing in Special Libraries

DTIC Science & Technology

1987-05-01

workload (beyond that of say an equivalent academic or corporate technical library) for the Defense Department libraries. Figure 9 illustrates the range...summer. The hardware configuration for the system is as follows: Digital Equipment Corporation VAX 11/750 central processor with 6 megabytes of real
50 CFR 680.40 - Crab Quota Share (QS), Processor QS (PQS), Individual Fishing Quota (IFQ), and Individual...

Code of Federal Regulations, 2012 CFR

2012-10-01

... exclude any deadloss, test fishing, fishing conducted under an experimental, exploratory, or scientific..., education, exploratory, or experimental permit, or under the Western Alaska CDQ Program. (iv) Documentation... information is true, correct, and complete to the best of his/her knowledge and belief. If the application is...
50 CFR 680.40 - Crab Quota Share (QS), Processor QS (PQS), Individual Fishing Quota (IFQ), and Individual...

Code of Federal Regulations, 2014 CFR

2014-10-01

... exclude any deadloss, test fishing, fishing conducted under an experimental, exploratory, or scientific..., education, exploratory, or experimental permit, or under the Western Alaska CDQ Program. (iv) Documentation... information is true, correct, and complete to the best of his/her knowledge and belief. If the application is...
50 CFR 680.40 - Crab Quota Share (QS), Processor QS (PQS), Individual Fishing Quota (IFQ), and Individual...

Code of Federal Regulations, 2011 CFR

2011-10-01

... exclude any deadloss, test fishing, fishing conducted under an experimental, exploratory, or scientific..., education, exploratory, or experimental permit, or under the Western Alaska CDQ Program. (iv) Documentation... information is true, correct, and complete to the best of his/her knowledge and belief. If the application is...
50 CFR 680.40 - Crab Quota Share (QS), Processor QS (PQS), Individual Fishing Quota (IFQ), and Individual...

Code of Federal Regulations, 2013 CFR

2013-10-01

... exclude any deadloss, test fishing, fishing conducted under an experimental, exploratory, or scientific..., education, exploratory, or experimental permit, or under the Western Alaska CDQ Program. (iv) Documentation... information is true, correct, and complete to the best of his/her knowledge and belief. If the application is...
Sharing Writing on an Electronic Network.

ERIC Educational Resources Information Center

Schwartz, Jeffrey

A writing exchange project at Bread Loaf School of English at Middlebury College in Vermont, funded by Apple Education Foundation and McDonnell Douglas, examined what happened when high school students use word processors and a modem to write to distant audiences. In the first exchange, students interviewed each other in pairs and wrote short…
Cache write generate for parallel image processing on shared memory architectures.

PubMed

Wittenbrink, C M; Somani, A K; Chen, C H

1996-01-01

We investigate cache write generate, our cache mode invention. We demonstrate that for parallel image processing applications, the new mode improves main memory bandwidth, CPU efficiency, cache hits, and cache latency. We use register level simulations validated by the UW-Proteus system. Many memory, cache, and processor configurations are evaluated.
Challenging prior evidence for a shared syntactic processor for language and music.

PubMed

Perruchet, Pierre; Poulin-Charronnat, Bénédicte

2013-04-01

A theoretical landmark in the growing literature comparing language and music is the shared syntactic integration resource hypothesis (SSIRH; e.g., Patel, 2008), which posits that the successful processing of linguistic and musical materials relies, at least partially, on the mastery of a common syntactic processor. Supporting the SSIRH, Slevc, Rosenberg, and Patel (Psychonomic Bulletin & Review 16(2):374-381, 2009) recently reported data showing enhanced syntactic garden path effects when the sentences were paired with syntactically unexpected chords, whereas the musical manipulation had no reliable effect on the processing of semantic violations. The present experiment replicated Slevc et al.'s (2009) procedure, except that syntactic garden paths were replaced with semantic garden paths. We observed the very same interactive pattern of results. These findings suggest that the element underpinning interactions is the garden path configuration, rather than the implication of an alleged syntactic module. We suggest that a different amount of attentional resources is recruited to process each type of linguistic manipulations, hence modulating the resources left available for the processing of music and, consequently, the effects of musical violations.
Effect of processor temperature on film dosimetry

DOE Office of Scientific and Technical Information (OSTI.GOV)

Srivastava, Shiv P.; Das, Indra J., E-mail: idas@iupui.edu

2012-07-01

Optical density (OD) of a radiographic film plays an important role in radiation dosimetry, which depends on various parameters, including beam energy, depth, field size, film batch, dose, dose rate, air film interface, postexposure processing time, and temperature of the processor. Most of these parameters have been studied for Kodak XV and extended dose range (EDR) films used in radiation oncology. There is very limited information on processor temperature, which is investigated in this study. Multiple XV and EDR films were exposed in the reference condition (d_max, 10 × 10 cm², 100 cm) to a given dose. An automatic film processor (X-Omat 5000) was used for processing films. The temperature of the processor was adjusted manually with increasing temperature. At each temperature, a set of films was processed to evaluate OD at a given dose. For both films, OD is a linear function of processor temperature in the range of 29.4-40.6 °C (85-105 °F) for various dose ranges. The changes in processor temperature are directly related to the dose by a quadratic function. A simple linear equation is provided for the changes in OD vs. processor temperature, which could be used for correcting dose in radiation dosimetry when film is used.
Supporting shared data structures on distributed memory architectures

NASA Technical Reports Server (NTRS)

Koelbel, Charles; Mehrotra, Piyush; Vanrosendale, John

1990-01-01

Programming nonshared memory systems is more difficult than programming shared memory systems, since there is no support for shared data structures. Current programming languages for distributed memory architectures force the user to decompose all data structures into separate pieces, with each piece owned by one of the processors in the machine, and with all communication explicitly specified by low-level message-passing primitives. A new programming environment is presented for distributed memory architectures, providing a global name space and allowing direct access to remote parts of data values. The analysis and program transformations required to implement this environment are described, and the efficiency of the resulting code on the NCUBE/7 and iPSC/2 hypercubes is reported.
Using R in Taverna: RShell v1.2

PubMed Central

Wassink, Ingo; Rauwerda, Han; Neerincx, Pieter BT; Vet, Paul E van der; Breit, Timo M; Leunissen, Jack AM; Nijholt, Anton

2009-01-01

Background: R is the statistical language commonly used by many life scientists in (omics) data analysis. At the same time, these complex analyses benefit from a workflow approach, such as used by the open source workflow management system Taverna. However, Taverna had limited support for R, because it supported just a few data types and only a single output. Also, there was no support for graphical output and persistent sessions. Altogether this made using R in Taverna impractical. Findings: We have developed an R plugin for Taverna: RShell, which provides R functionality within workflows designed in Taverna. In order to fully support the R language, our RShell plugin directly uses the R interpreter. The RShell plugin consists of a Taverna processor for R scripts and an RShell Session Manager that communicates with the R server. We made the RShell processor highly configurable allowing the user to define multiple inputs and outputs. Also, various data types are supported, such as strings, numeric data and images. To limit data transport between multiple RShell processors, the RShell plugin also supports persistent sessions. Here, we will describe the architecture of RShell and the new features that are introduced in version 1.2, i.e.: i) Support for R up to and including R version 2.9; ii) Support for persistent sessions to limit data transfer; iii) Support for vector graphics output through PDF; iv) Syntax highlighting of the R code; v) Improved usability through fewer port types. Our new RShell processor is backwards compatible with workflows that use older versions of the RShell processor. We demonstrate the value of the RShell processor by a use-case workflow that maps oligonucleotide probes designed with DNA sequence information from Vega onto the Ensembl genome assembly. Conclusion: Our RShell plugin enables Taverna users to employ R scripts within their workflows in a highly configurable way. PMID:19607662
A communication-avoiding, hybrid-parallel, rank-revealing orthogonalization method.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Hoemmen, Mark

2010-11-01

Orthogonalization consumes much of the run time of many iterative methods for solving sparse linear systems and eigenvalue problems. Commonly used algorithms, such as variants of Gram-Schmidt or Householder QR, have performance dominated by communication. Here, 'communication' includes both data movement between the CPU and memory, and messages between processors in parallel. Our Tall Skinny QR (TSQR) family of algorithms requires asymptotically fewer messages between processors and data movement between CPU and memory than typical orthogonalization methods, yet achieves the same accuracy as Householder QR factorization. Furthermore, in block orthogonalizations, TSQR is faster and more accurate than existing approaches for orthogonalizing the vectors within each block ('normalization'). TSQR's rank-revealing capability also makes it useful for detecting deflation in block iterative methods, for which existing approaches sacrifice performance, accuracy, or both. We have implemented a version of TSQR that exploits both distributed-memory and shared-memory parallelism, and supports real and complex arithmetic. Our implementation is optimized for the case of orthogonalizing a small number (5-20) of very long vectors. The shared-memory parallel component uses Intel's Threading Building Blocks, though its modular design supports other shared-memory programming models as well, including computation on the GPU. Our implementation achieves speedups of 2 times or more over competing orthogonalizations. It is available now in the development branch of the Trilinos software package, and will be included in the 10.8 release.
A parallel algorithm for multi-level logic synthesis using the transduction method. M.S. Thesis

NASA Technical Reports Server (NTRS)

Lim, Chieng-Fai

1991-01-01

The Transduction Method has been shown to be a powerful tool in the optimization of multilevel networks. Many tools such as the SYLON synthesis system (X90), (CM89), (LM90) have been developed based on this method. A parallel implementation is presented of SYLON-XTRANS (XM89) on an eight processor Encore Multimax shared memory multiprocessor. It minimizes multilevel networks consisting of simple gates through parallel pruning, gate substitution, gate merging, generalized gate substitution, and gate input reduction. This implementation, called Parallel TRANSduction (PTRANS), also uses partitioning to break large circuits up and performs inter- and intra-partition dynamic load balancing. With this, good speedups and high processor efficiencies are achievable without sacrificing the resulting circuit quality.
A Tutorial on Parallel and Concurrent Programming in Haskell

NASA Astrophysics Data System (ADS)

Peyton Jones, Simon; Singh, Satnam

This practical tutorial introduces the features available in Haskell for writing parallel and concurrent programs. We first describe how to write semi-explicit parallel programs by using annotations to express opportunities for parallelism and to help control the granularity of parallelism for effective execution on modern operating systems and processors. We then describe the mechanisms provided by Haskell for writing explicitly parallel programs with a focus on the use of software transactional memory to help share information between threads. Finally, we show how nested data parallelism can be used to write deterministically parallel programs which allows programmers to use rich data types in data parallel programs which are automatically transformed into flat data parallel versions for efficient execution on multi-core processors.
3D environment modeling and location tracking using off-the-shelf components

NASA Astrophysics Data System (ADS)

Luke, Robert H.

2016-05-01

The remarkable popularity of smartphones over the past decade has led to a technological race for dominance in market share. This has resulted in a flood of new processors and sensors that are inexpensive, low power and high performance. These sensors include accelerometers, gyroscope, barometers and most importantly cameras. This sensor suite, coupled with multicore processors, allows a new community of researchers to build small, high performance platforms for low cost. This paper describes a system using off-the-shelf components to perform position tracking as well as environment modeling. The system relies on tracking using stereo vision and inertial navigation to determine movement of the system as well as create a model of the environment sensed by the system.
Teleoperated position control of a PUMA robot

NASA Technical Reports Server (NTRS)

Austin, Edmund; Fong, Chung P.

1987-01-01

A laboratory distributed computer control teleoperator system is developed to support NASA's future space telerobotic operation. This teleoperator system uses a universal force-reflecting hand controller in the local site as the operator's input device. In the remote site, a PUMA controller receives the Cartesian position commands and implements PID control laws to position the PUMA robot. The local site uses two microprocessors while the remote site uses three. The processors communicate with each other through shared memory. The PUMA robot controller was interfaced through custom made electronics to bypass VAL. The development status of this teleoperator system is reported. The execution time of each processor is analyzed, and the overall system throughput rate is reported. Methods to improve the efficiency and performance are discussed.
Competitive Parallel Processing For Compression Of Data

NASA Technical Reports Server (NTRS)

Diner, Daniel B.; Fender, Antony R. H.

1990-01-01

Momentarily-best compression algorithm selected. Proposed competitive-parallel-processing system compresses data for transmission in channel of limited bandwidth. Likely application for compression lies in high-resolution, stereoscopic color-television broadcasting. Data from information-rich source like color-television camera compressed by several processors, each operating with different algorithm. Referee processor selects momentarily-best compressed output.
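The referee idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: the abstract does not name the competing algorithms, so three Python standard-library compressors stand in for them, "best" is assumed to mean smallest output, and the candidates run one after another here rather than on separate processors. The names `COMPRESSORS`, `referee` and `compress_stream` are hypothetical.

```python
import bz2
import lzma
import zlib

# Hypothetical stand-ins for the "several processors, each operating with
# different algorithm"; the abstract does not say which algorithms compete.
COMPRESSORS = {
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}


def referee(block: bytes):
    """Compress one block with every candidate and keep the smallest output."""
    results = {name: fn(block) for name, fn in COMPRESSORS.items()}
    best = min(results, key=lambda name: len(results[name]))
    return best, results[best]


def compress_stream(blocks):
    """Yield (algorithm name, compressed bytes) for each block in turn."""
    for block in blocks:
        yield referee(block)
```

Because the winner is chosen per block, the selected algorithm can change from one block to the next, which is one reading of "momentarily-best". A receiver would need the algorithm name alongside each block in order to decompress it.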
Cooperative use of advanced scanning technology for low-volume hardwood processors

Treesearch

Luis G. Occeña; Timothy J. Rayner; Daniel L. Schmoldt; A. Lynn Abbott

2001-01-01

Of the several hundreds of hardwood lumber sawmills across the country, the majority are small- to medium-sized facilities operated as small businesses in rural communities. Trends of increased log costs and limited availability are forcing wood processors to become more efficient in their operations. Still, small mills are less able to adopt new, more efficient...

75 FR 13024 - Pacific Halibut Fisheries; Catch Sharing Plan

Federal Register 2010, 2011, 2012, 2013, 2014

2010-03-18

... system for guided charter vessels (75 FR 554) was also established January 5, 2010, for Areas 2C and 3A... resulting catch of which is sold or bartered; or is intended to be sold or bartered, other than (i) sport... fish processor; (t) ``VMS transmitter'' means a NMFS-approved vessel monitoring system transmitter that...
Design, Implementation, and Evaluation of a Virtual Shared Memory System in a Multi-Transputer Network.

DTIC Science & Technology

1987-12-01

...high performance, fault tolerance, and extensibility. These features are attained by synchronizing and coordinating the distributed multicomputer... synchronizing all processors in the network. In a multitransputer network, processes that communicate with each other do so synchronously. This makes
77 FR 44216 - Fisheries of the Exclusive Economic Zone Off Alaska; Bering Sea and Aleutian Islands Crab...

Federal Register 2010, 2011, 2012, 2013, 2014

2012-07-27

... of a zero (0) percent fee for cost recovery under the Bering Sea and Aleutian Islands Crab... Program includes a cost recovery provision to collect fees to recover the actual costs directly related to... processing sectors to each pay half the cost recovery fees. Catcher/processor quota share holders are...
High Performance Active Database Management on a Shared-Nothing Parallel Processor

DTIC Science & Technology

1998-05-01

either stored or virtual. A stored node is like a materialized view. It actually contains the specified tuples. A virtual node is like a real view...
Gear Up Your Research Guides with the Emerging OPML Codes

ERIC Educational Resources Information Center

Wilcox, Kimberley

2006-01-01

Outline Processor Markup Language (OPML) is an emerging format that allows for the creation of customized research packages to push to patrons. It is a way to gather collections of Web resources (links, RSS feeds, multimedia files), organize them as outlines, and publish them in a format that others can share and even subscribe to. In this…
Debugging Fortran on a shared memory machine

DOE Office of Scientific and Technical Information (OSTI.GOV)

Allen, T.R.; Padua, D.A.

1987-01-01

Debugging on a parallel processor is more difficult than debugging on a serial machine because errors in a parallel program may introduce nondeterminism. The approach to parallel debugging presented here attempts to reduce the problem of debugging on a parallel machine to that of debugging on a serial machine by automatically detecting nondeterminism. 20 refs., 6 figs.
A multi-satellite orbit determination problem in a parallel processing environment

NASA Technical Reports Server (NTRS)

Deakyne, M. S.; Anderle, R. J.

1988-01-01

The Engineering Orbit Analysis Unit at GE Valley Forge used an Intel Hypercube Parallel Processor to investigate the performance and gain experience of parallel processors with a multi-satellite orbit determination problem. A general study was selected in which major blocks of computation for the multi-satellite orbit computations were used as units to be assigned to the various processors on the Hypercube. Problems encountered or successes achieved in addressing the orbit determination problem would be more likely to be transferable to other parallel processors. The prime objective was to study the algorithm to allow processing of observations later in time than those employed in the state update. Expertise in ephemeris determination was exploited in addressing these problems and the facility used to bring a realism to the study which would highlight the problems which may not otherwise be anticipated. Secondary objectives were to gain experience of a non-trivial problem in a parallel processor environment, to explore the necessary interplay of serial and parallel sections of the algorithm in terms of timing studies, to explore the granularity (coarse vs. fine grain) to discover the granularity limit above which there would be a risk of starvation where the majority of nodes would be idle or under the limit where the overhead associated with splitting the problem may require more work and communication time than is useful.
Is random access memory random?

NASA Technical Reports Server (NTRS)

Denning, P. J.

1986-01-01

Most software is constructed on the assumption that the programs and data are stored in random access memory (RAM). Physical limitations on the relative speeds of processor and memory elements lead to a variety of memory organizations that match processor addressing rate with memory service rate. These include interleaved and cached memory. A very high fraction of a processor's address requests can be satisfied from the cache without reference to the main memory. The cache requests information from main memory in blocks that can be transferred at the full memory speed. Programmers who organize algorithms for locality can realize the highest performance from these computers.
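The effect of locality on cache hit rate can be illustrated with a toy simulation. The sketch below assumes a direct-mapped cache; the line count, block size and access patterns are illustrative choices, not figures from the article, and `hit_rate` is a hypothetical helper.

```python
def hit_rate(addresses, num_lines=64, block_size=16):
    """Simulate a direct-mapped cache and return the fraction of hits."""
    tags = [None] * num_lines          # memory block currently held by each line
    hits = 0
    total = 0
    for addr in addresses:
        block = addr // block_size     # memory block containing this address
        line = block % num_lines       # the single cache line that block maps to
        if tags[line] == block:
            hits += 1
        else:
            tags[line] = block         # miss: the whole block is fetched
        total += 1
    return hits / total if total else 0.0


n = 4096
sequential = range(n)                      # consecutive addresses
strided = [i * 1024 for i in range(n)]     # every access lands in a new block

print(hit_rate(sequential))  # 0.9375: one miss per 16-address block
print(hit_rate(strided))     # 0.0: each access evicts the previous block
```

With these assumed parameters the sequential walk misses once per block and hits on the other 15 addresses, while the strided walk maps every access to the same cache line with a different block, so nothing is ever reused.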
Parallelization of a Fully-Distributed Hydrologic Model using Sub-basin Partitioning

NASA Astrophysics Data System (ADS)

Vivoni, E. R.; Mniszewski, S.; Fasel, P.; Springer, E.; Ivanov, V. Y.; Bras, R. L.

2005-12-01

A primary obstacle towards advances in watershed simulations has been the limited computational capacity available to most models. The growing trend of model complexity, data availability and physical representation has not been matched by adequate developments in computational efficiency. This situation has created a serious bottleneck which limits existing distributed hydrologic models to small domains and short simulations. In this study, we present novel developments in the parallelization of a fully-distributed hydrologic model. Our work is based on the TIN-based Real-time Integrated Basin Simulator (tRIBS), which provides continuous hydrologic simulation using a multiple resolution representation of complex terrain based on a triangulated irregular network (TIN). While the use of TINs reduces computational demand, the sequential version of the model is currently limited over large basins (>10,000 km²) and long simulation periods (>1 year). To address this, a parallel MPI-based version of the tRIBS model has been implemented and tested using high performance computing resources at Los Alamos National Laboratory. Our approach utilizes domain decomposition based on sub-basin partitioning of the watershed. A stream reach graph based on the channel network structure is used to guide the sub-basin partitioning. Individual sub-basins or sub-graphs of sub-basins are assigned to separate processors to carry out internal hydrologic computations (e.g. rainfall-runoff transformation). Routed streamflow from each sub-basin forms the major hydrologic data exchange along the stream reach graph. Individual sub-basins also share subsurface hydrologic fluxes across adjacent boundaries. We demonstrate how the sub-basin partitioning provides computational feasibility and efficiency for a set of test watersheds in northeastern Oklahoma. We compare the performance of the sequential and parallelized versions to highlight the efficiency gained as the number of processors increases. We also discuss how the coupled use of TINs and parallel processing can lead to feasible long-term simulations in regional watersheds while preserving basin properties at high-resolution.
An investigation of potential applications of OP-SAPS: Operational sampled analog processors

NASA Technical Reports Server (NTRS)

Parrish, E. A.; Mcvey, E. S.

1976-01-01

The impact of charge-coupled device (CCD) processors on future instrumentation was investigated. The CCD devices studied process sampled analog data and are referred to as OP-SAPS - operational sampled analog processors. Preliminary studies into various architectural configurations for systems composed of OP-SAPS show that they have potential in such diverse applications as pattern recognition and automatic control. It appears probable that OP-SAPS may be used to construct computing structures which can serve as special peripherals to large-scale computer complexes used in real time flight simulation. The research was limited to the following benchmark programs: (1) face recognition, (2) voice command and control, (3) terrain classification, and (4) terrain identification. A small amount of effort was spent on examining a method by which OP-SAPS may be used to decrease the limiting ground sampling distance encountered in remote sensing from satellites.
A comparison of five methods for monitoring the precision of automated x-ray film processors.

PubMed

Nickoloff, E L; Leo, F; Reese, M

1978-11-01

Five different methods for preparing sensitometric strips used to monitor the precision of automated film processors are compared. A method for determining the sensitivity of each system to processor variations is presented; the observed statistical variability is multiplied by the system response to temperature or chemical changes. Pre-exposed sensitometric strips required the use of accurate densitometers and stringent control limits to be effective. X-ray exposed sensitometric strips demonstrated large variations in the x-ray output (2 omega approximately equal to 8.0%) over a period of one month. Some light sensitometers were capable of detecting +/- 1.0 degrees F (+/- 0.6 degrees C) variations in developer temperature in the processor and/or about 10.0 ml of chemical contamination in the processor. Nevertheless, even the light sensitometers were susceptible to problems, e.g. film emulsion selection, line voltage variations, and latent image fading. Advantages and disadvantages of the various sensitometric methods are discussed.
Programmable optical processor chips: toward photonic RF filters with DSP-level flexibility and MHz-band selectivity

NASA Astrophysics Data System (ADS)

Xie, Yiwei; Geng, Zihan; Zhuang, Leimeng; Burla, Maurizio; Taddei, Caterina; Hoekman, Marcel; Leinse, Arne; Roeloffzen, Chris G. H.; Boller, Klaus-J.; Lowery, Arthur J.

2017-12-01

Integrated optical signal processors have been identified as a powerful engine for optical processing of microwave signals. They enable wideband and stable signal processing operations on miniaturized chips with ultimate control precision. As a promising application, such processors enable photonic implementations of reconfigurable radio frequency (RF) filters with wide design flexibility, large bandwidth, and high-frequency selectivity. This is a key technology for photonic-assisted RF front ends that opens a path to overcoming the bandwidth limitation of current digital electronics. Here, the recent progress of integrated optical signal processors for implementing such RF filters is reviewed. We highlight the use of a low-loss, high-index-contrast stoichiometric silicon nitride waveguide which promises to serve as a practical material platform for realizing high-performance optical signal processors and points toward photonic RF filters with digital signal processing (DSP)-level flexibility, hundreds-GHz bandwidth, MHz-band frequency selectivity, and full system integration on a chip scale.
Science and Applications Space Platform (SASP) End-to-End Data System Study

NASA Technical Reports Server (NTRS)

Crawford, P. R.; Kasulka, L. H.

1981-01-01

The capability of present technology and the Tracking and Data Relay Satellite System (TDRSS) to accommodate Science and Applications Space Platforms (SASP) payload user's requirements, maximum service to the user through optimization of the SASP Onboard Command and Data Management System, and the ability and availability of new technology to accommodate the evolution of SASP payloads were assessed. Key technology items identified to accommodate payloads on a SASP were onboard storage devices, multiplexers, and onboard data processors. The primary driver is the limited access to TDRSS for single access channels due to sharing with all the low Earth orbit spacecraft plus shuttle. Advantages of onboard data processing include long term storage of processed data until TDRSS is accessible, thus reducing the loss of data, eliminating large data processing tasks at the ground stations, and providing a more timely access to the data.
Parallel Computation of the Jacobian Matrix for Nonlinear Equation Solvers Using MATLAB

NASA Technical Reports Server (NTRS)

Rose, Geoffrey K.; Nguyen, Duc T.; Newman, Brett A.

2017-01-01

Demonstrating speedup for parallel code on a multicore shared memory PC can be challenging in MATLAB due to underlying parallel operations that are often opaque to the user. This can limit potential for improvement of serial code even for the so-called embarrassingly parallel applications. One such application is the computation of the Jacobian matrix inherent to most nonlinear equation solvers. Computation of this matrix represents the primary bottleneck in nonlinear solver speed such that commercial finite element (FE) and multi-body-dynamic (MBD) codes attempt to minimize computations. A timing study using MATLAB's Parallel Computing Toolbox was performed for numerical computation of the Jacobian. Several approaches for implementing parallel code were investigated while only the single program multiple data (spmd) method using composite objects provided positive results. Parallel code speedup is demonstrated but the goal of linear speedup through the addition of processors was not achieved due to PC architecture.
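The reason the Jacobian counts as embarrassingly parallel is that each column can be estimated independently of the others. The sketch below shows that decomposition in Python rather than with MATLAB's spmd construct, so it is not the authors' implementation. It assumes forward differences with a fixed step; `residual` is a hypothetical two-equation test system, and the function names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor


def jacobian_column(f, x, j, h=1e-6):
    """Forward-difference estimate of column j of the Jacobian of f at x."""
    xp = list(x)
    xp[j] += h                         # perturb only the j-th unknown
    f0, f1 = f(x), f(xp)
    return [(b - a) / h for a, b in zip(f0, f1)]


def jacobian(f, x, workers=4):
    """Build the Jacobian column by column; each column is an independent task."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        cols = list(pool.map(lambda j: jacobian_column(f, x, j), range(len(x))))
    # cols[j][i] holds dF_i/dx_j; transpose so that rows index equations.
    return [list(row) for row in zip(*cols)]


def residual(x):
    """Hypothetical test system: F1 = x0^2 + x1, F2 = 3*x0*x1."""
    return [x[0] ** 2 + x[1], 3.0 * x[0] * x[1]]


J = jacobian(residual, [1.0, 2.0])     # analytic value: [[2, 1], [6, 3]]
```

Threads are used here only to show the task structure. In CPython the global interpreter lock generally prevents pure-Python, CPU-bound threads from running faster than serial code, which loosely echoes the abstract's point that speedup on a shared-memory PC is not guaranteed by the decomposition alone.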
G-cueing microcontroller (a microprocessor application in simulators)

NASA Technical Reports Server (NTRS)

Horattas, C. G.

1980-01-01

A g cueing microcontroller is described which consists of a tandem pair of microprocessors, dedicated to the task of simulating pilot sensed cues caused by gravity effects. This task includes execution of a g cueing model which drives actuators that alter the configuration of the pilot's seat. The g cueing microcontroller receives acceleration commands from the aerodynamics model in the main computer and creates the stimuli that produce physical acceleration effects of the aircraft seat on the pilot's anatomy. One of the two microprocessors is a fixed instruction processor that performs all control and interface functions. The other, a specially designed bipolar bit slice microprocessor, is a microprogrammable processor dedicated to all arithmetic operations. The two processors communicate with each other by a shared memory. The g cueing microcontroller contains its own dedicated I/O conversion modules for interface with the seat actuators and controls, and a DMA controller for interfacing with the simulation computer. Any application which can be microcoded within the available memory, the available real time and the available I/O channels, could be implemented in the same controller.
The science of computing - Parallel computation

NASA Technical Reports Server (NTRS)

Denning, P. J.

1985-01-01

Although parallel computation architectures have been known for computers since the 1920s, it was only in the 1970s that microelectronic components technologies advanced to the point where it became feasible to incorporate multiple processors in one machine. Concomitantly, the development of algorithms for parallel processing also lagged due to hardware limitations. The speed of computing with solid-state chips is limited by gate switching delays. The physical limit implies that a 1 Gflop operational speed is the maximum for sequential processors. A computer recently introduced features a 'hypercube' architecture with 128 processors connected in networks at 5, 6 or 7 points per grid, depending on the design choice. Its computing speed rivals that of supercomputers, but at a fraction of the cost. The added speed with less hardware is due to parallel processing, which utilizes algorithms representing different parts of an equation that can be broken into simpler statements and processed simultaneously. Present, highly developed computer languages like FORTRAN, PASCAL, COBOL, etc., rely on sequential instructions. Thus, increased emphasis will now be directed at parallel processing algorithms to exploit the new architectures.
Geospace simulations using modern accelerator processor technology

NASA Astrophysics Data System (ADS)

Germaschewski, K.; Raeder, J.; Larson, D. J.

2009-12-01

OpenGGCM (Open Geospace General Circulation Model) is a well-established numerical code simulating the Earth's space environment. The most computing intensive part is the MHD (magnetohydrodynamics) solver that models the plasma surrounding Earth and its interaction with Earth's magnetic field and the solar wind flowing in from the sun. Like other global magnetosphere codes, OpenGGCM's realism is currently limited by computational constraints on grid resolution. OpenGGCM has been ported to make use of the added computational power of modern accelerator based processor architectures, in particular the Cell processor. The Cell architecture is a novel inhomogeneous multicore architecture capable of achieving up to 230 GFLOPS on a single chip. The University of New Hampshire recently acquired a PowerXCell 8i based computing cluster, and here we will report initial performance results of OpenGGCM. Realizing the high theoretical performance of the Cell processor is a programming challenge, though. We implemented the MHD solver using a multi-level parallelization approach: On the coarsest level, the problem is distributed to processors based upon the usual domain decomposition approach. Then, on each processor, the problem is divided into 3D columns, each of which is handled by the memory limited SPEs (synergistic processing elements) slice by slice. Finally, SIMD instructions are used to fully exploit the SIMD FPUs in each SPE. Memory management needs to be handled explicitly by the code, using DMA to move data from main memory to the per-SPE local store and vice versa. We use a modern technique, automatic code generation, which shields the application programmer from having to deal with all of the implementation details just described, keeping the code much more easily maintainable. Our preliminary results indicate excellent performance, a speed-up of a factor of 30 compared to the unoptimized version.
77 FR 38013 - Fisheries of the Exclusive Economic Zone Off Alaska; Groundfish of the Gulf of Alaska; Amendment...

Federal Register 2010, 2011, 2012, 2013, 2014

2012-06-26

... participants in the entry level trawl fishery may qualify for quota share (QS) under the Central Gulf of Alaska... landings to an entry level processor in 2007, 2008, or 2009. This clarification is administrative in nature and does not change the distribution of rockfish QS to entry level trawl participants. DATES...
76 FR 35781 - Fisheries of the Exclusive Economic Zone Off Alaska; Bering Sea and Aleutian Islands Crab...

Federal Register 2010, 2011, 2012, 2013, 2014

2011-06-20

... operational costs. NMFS also issued processor quota share (PQS) under the Program. Each year, PQS yields an... requirements. The RIR/FRFA prepared for this action describes the costs and benefits of Amendment 37 (see... person or company that holds in excess of 20 percent of the West-designated WAG QS; (2) any person or...
Peregrine System Configuration | High-Performance Computing | NREL

Science.gov Websites

nodes and storage are connected by a high speed InfiniBand network. Compute nodes are diskless with an directories are mounted on all nodes, along with a file system dedicated to shared projects. A brief processors with 64 GB of memory. All nodes are connected to the high speed Infiniband network and and a

Why K-12 IT Managers and Administrators Are Embracing the Intel-Based Mac

ERIC Educational Resources Information Center

Technology & Learning, 2007

2007-01-01

Over the past year, Apple has dramatically increased its share of the school computer marketplace--especially in the category of notebook computers. A recent study conducted by Grunwald Associates and Rockman et al. reports that one of the major reasons for this growth is Apple's introduction of the Intel processor to the entire line of Mac…
Importance of balanced architectures in the design of high-performance imaging systems

NASA Astrophysics Data System (ADS)

Sgro, Joseph A.; Stanton, Paul C.

1999-03-01

Imaging systems employed in demanding military and industrial applications, such as automatic target recognition and computer vision, typically require real-time high-performance computing resources. While high-performance computing systems have traditionally relied on proprietary architectures and custom components, recent advances in high performance general-purpose microprocessor technology have produced an abundance of low cost components suitable for use in high-performance computing systems. A common pitfall in the design of high performance imaging systems, particularly systems employing scalable multiprocessor architectures, is the failure to balance computational and memory bandwidth. The performance of standard cluster designs, for example, in which several processors share a common memory bus, is typically constrained by memory bandwidth. The characteristic symptom of this problem is the failure of system performance to scale as more processors are added. The problem becomes exacerbated if I/O and memory functions share the same bus. The recent introduction of microprocessors with large internal caches and high performance external memory interfaces makes it practical to design high performance imaging systems with balanced computational and memory bandwidth. Real-world examples of such designs will be presented, along with a discussion of adapting algorithm design to best utilize available memory bandwidth.
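The scaling failure described above can be made concrete with a simple bound: throughput is the smaller of the aggregate compute rate and the rate at which the shared bus can feed the processors. This is a roofline-style model brought in for illustration, not a method from the paper, and every number below is an assumption. `attainable_gflops` is a hypothetical helper.

```python
def attainable_gflops(processors, peak_per_proc, bus_gbytes_per_s, flops_per_byte):
    """Throughput is capped by either total compute or the shared memory bus."""
    compute_bound = processors * peak_per_proc
    memory_bound = bus_gbytes_per_s * flops_per_byte
    return min(compute_bound, memory_bound)


# Illustrative numbers only: 1 GFLOP/s per processor, one 2 GB/s bus shared
# by all processors, and an algorithm doing 2 operations per byte moved.
for p in (1, 2, 4, 8, 16):
    print(p, attainable_gflops(p, 1.0, 2.0, 2.0))
```

Under these assumed figures the bound grows linearly up to four processors and then stays flat at 4 GFLOP/s, which is the "fails to scale as more processors are added" symptom. Raising the operations-per-byte figure, for example by reusing cached data, lifts the plateau, which is one way to read the abstract's closing remark about adapting algorithm design to the available memory bandwidth.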
Static analysis of the hull plate using the finite element method

NASA Astrophysics Data System (ADS)

Ion, A.

2015-11-01

This paper aims at presenting the static analysis for two levels of a container ship's construction as follows: the first level is at the girder / hull plate and the second level is conducted at the entire strength hull of the vessel. This article will describe the work for the static analysis of a hull plate. We shall use the software package ANSYS Mechanical 14.5. The program is run on a computer with four Intel Xeon X5260 CPU processors at 3.33 GHz, 32 GB memory installed. In terms of software, the shared memory parallel version of ANSYS refers to running ANSYS across multiple cores on a SMP system. The distributed memory parallel version of ANSYS (Distributed ANSYS) refers to running ANSYS across multiple processors on SMP systems or DMP systems.
VLSI 'smart' I/O module development

NASA Astrophysics Data System (ADS)

Kirk, Dan

The developmental history, design, and operation of the MIL-STD-1553A/B discrete and serial module (DSM) for the U.S. Navy AN/AYK-14(V) avionics computer are described and illustrated with diagrams. The ongoing preplanned product improvement for the AN/AYK-14(V) includes five dual-redundant MIL-STD-1553 channels based on DSMs. The DSM is a front-end processor for transferring data to and from a common memory, sharing memory with a host processor to provide improved 'smart' input/output performance. Each DSM comprises three hardware sections: three VLSI-6000 semicustomized CMOS arrays, memory units to support the arrays, and buffers and resynchronization circuits. The DSM hardware module design, VLSI-6000 design tools, controlware and test software, and checkout procedures (using a hardware simulator) are characterized in detail.
HeinzelCluster: accelerated reconstruction for FORE and OSEM3D.

PubMed

Vollmar, S; Michel, C; Treffert, J T; Newport, D F; Casey, M; Knöss, C; Wienhard, K; Liu, X; Defrise, M; Heiss, W D

2002-08-07

Using iterative three-dimensional (3D) reconstruction techniques for reconstruction of positron emission tomography (PET) is not feasible on most single-processor machines due to the excessive computing time needed, especially so for the large sinogram sizes of our high-resolution research tomograph (HRRT). In our first approach to speed up reconstruction time we transform the 3D scan into the format of a two-dimensional (2D) scan with sinograms that can be reconstructed independently using Fourier rebinning (FORE) and a fast 2D reconstruction method. On our dedicated reconstruction cluster (seven four-processor systems, Intel PIII@700 MHz, switched fast ethernet and Myrinet, Windows NT Server), we process these 2D sinograms in parallel. We have achieved a speedup > 23 using 26 processors and also compared results for different communication methods (RPC, Syngo, Myrinet GM). The other approach is to parallelize OSEM3D (implementation of C Michel), which has produced the best results for HRRT data so far and is more suitable for an adequate treatment of the sinogram gaps that result from the detector geometry of the HRRT. We have implemented two levels of parallelization for our dedicated cluster (a shared memory fine-grain level on each node utilizing all four processors and a coarse-grain level allowing for 15 nodes) reducing the time for one core iteration from over 7 h to about 35 min.
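The two performance figures quoted above can be turned into rough bounds with a line of arithmetic each. The sketch below uses the standard definition of parallel efficiency (speedup divided by processor count) and treats "> 23", "over 7 h" and "about 35 min" as the exact values 23, 7 h and 35 min, so both results are approximate lower bounds rather than reported measurements.

```python
def efficiency(speedup, processors):
    """Parallel efficiency: fraction of ideal linear speedup achieved."""
    return speedup / processors


# FORE + 2D reconstruction: speedup > 23 on 26 processors.
fore_efficiency = efficiency(23, 26)

# OSEM3D: one core iteration drops from over 7 h to about 35 min.
osem_reduction = (7 * 60) / 35

print(round(fore_efficiency, 3))  # 0.885
print(osem_reduction)             # 12.0
```

So the FORE path runs at roughly 88% efficiency or better, and the OSEM3D iteration time falls by a factor of about 12. The abstract does not state what configuration the 7 h baseline was measured on, so no efficiency is derived for OSEM3D here.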
A Parallel Algorithm for Contact in a Finite Element Hydrocode

DOE Office of Scientific and Technical Information (OSTI.GOV)

Pierce, Timothy G.

A parallel algorithm is developed for contact/impact of multiple three dimensional bodies undergoing large deformation. As time progresses the relative positions of contact between the multiple bodies change as collision and sliding occur. The parallel algorithm is capable of tracking these changes and enforcing an impenetrability constraint and momentum transfer across the surfaces in contact. Portions of the various surfaces of the bodies are assigned to the processors of a distributed-memory parallel machine in an arbitrary fashion, known as the primary decomposition. A secondary, dynamic decomposition is utilized to bring opposing sections of the contacting surfaces together on the same processors, so that opposing forces may be balanced and the resultant deformation of the bodies calculated. The secondary decomposition is accomplished and updated using only local communication with a limited subset of neighbor processors. Each processor represents both a domain of the primary decomposition and a domain of the secondary, or contact, decomposition. Thus each processor has four sets of neighbor processors: (a) those processors which represent regions adjacent to it in the primary decomposition, (b) those processors which represent regions adjacent to it in the contact decomposition, (c) those processors which send it the data from which it constructs its contact domain, and (d) those processors to which it sends its primary domain data, from which they construct their contact domains. The latter three of these neighbor sets change dynamically as the simulation progresses. By constraining all communication to these sets of neighbors, all global communication, with its attendant nonscalable performance, is avoided. A set of tests is provided to measure the degree of scalability achieved by this algorithm on up to 1024 processors. Issues related to the operating system of the test platform which lead to some degradation of the results are analyzed. This algorithm has been implemented as the contact capability of the ALE3D multiphysics code, and is currently in production use.
New Dimensions in Microarchitecture Harnessing 3D Integration Technologies (BRIEFING CHARTS)

DTIC Science & Technology

2007-03-06

Quad Core Bandwidth and Latency Boundaries General Purpose Processor Loads Latency limited Bandwidth limited Processor load trade-off between I...delay No= number of ckts at 1V do= ckt delay at 1V From “3D Integration” Special Topic Sessionl W. Haensch, ISSCC ‘07, 2/07 11 DARPA MTS March 6, 2007
A FAST ITERATIVE METHOD FOR SOLVING THE EIKONAL EQUATION ON TETRAHEDRAL DOMAINS

PubMed Central

Fu, Zhisong; Kirby, Robert M.; Whitaker, Ross T.

2014-01-01

Generating numerical solutions to the eikonal equation and its many variations has a broad range of applications in both the natural and computational sciences. Efficient solvers on cutting-edge, parallel architectures require new algorithms that may not be theoretically optimal, but that are designed to allow asynchronous solution updates and have limited memory access patterns. This paper presents a parallel algorithm for solving the eikonal equation on fully unstructured tetrahedral meshes. The method is appropriate for the type of fine-grained parallelism found on modern massively-SIMD architectures such as graphics processors and takes into account the particular constraints and capabilities of these computing platforms. This work builds on previous work for solving these equations on triangle meshes; in this paper we adapt and extend previous two-dimensional strategies to accommodate three-dimensional, unstructured, tetrahedralized domains. These new developments include a local update strategy with data compaction for tetrahedral meshes that provides solutions on both serial and parallel architectures, with a generalization to inhomogeneous, anisotropic speed functions. We also propose two new update schemes, specialized to mitigate the natural data increase observed when moving to three dimensions, and the data structures necessary for efficiently mapping data to parallel SIMD processors in a way that maintains computational density. Finally, we present descriptions of the implementations for a single CPU, as well as multicore CPUs with shared memory and SIMD architectures, with comparative results against state-of-the-art eikonal solvers. PMID:25221418
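The kind of local update that iterative eikonal solvers repeat can be shown on a much simpler setting than the paper's. The sketch below is not the tetrahedral method described above: it assumes a uniform 2D grid, a constant isotropic speed, a standard first-order upwind update, and plain full-grid sweeps repeated until nothing changes, where the paper's approach is designed for asynchronous updates on unstructured meshes. `solve_eikonal` is a hypothetical name.

```python
import math


def solve_eikonal(n, source, h=1.0, speed=1.0, tol=1e-12):
    """Sweep a first-order upwind update over an n x n grid until it settles."""
    INF = float("inf")
    u = [[INF] * n for _ in range(n)]
    si, sj = source
    u[si][sj] = 0.0
    step = h / speed                       # travel time across one grid cell

    def local_update(i, j):
        # Smallest neighbour value along each axis (the upwind direction).
        a = min(u[i - 1][j] if i > 0 else INF, u[i + 1][j] if i < n - 1 else INF)
        b = min(u[i][j - 1] if j > 0 else INF, u[i][j + 1] if j < n - 1 else INF)
        if min(a, b) == INF:               # no neighbour has been reached yet
            return INF
        if abs(a - b) >= step:             # the front arrives along one axis only
            return min(a, b) + step
        # Solve (t - a)^2 + (t - b)^2 = step^2 for the larger root t.
        return 0.5 * (a + b + math.sqrt(2.0 * step * step - (a - b) ** 2))

    changed = True
    while changed:
        changed = False
        for i in range(n):
            for j in range(n):
                if (i, j) == (si, sj):
                    continue
                new = local_update(i, j)
                if new < u[i][j] - tol:    # accept only strict improvements
                    u[i][j] = new
                    changed = True
    return u


u = solve_eikonal(5, (0, 0))
```

Values only ever decrease, so the sweeps terminate once a full pass produces no improvement. The parallel methods in the abstract differ mainly in which nodes they update and when, which is where the limited memory access patterns and data compaction mentioned there come in.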
Evaluation of pH monitoring as a method of processor control.

PubMed

Stears, J G; Gray, J E; Winkler, N T

1979-01-01

Sensitometry and pH values of the developer solution were compared in controlled over-replenishment, developer depletion, fixer contamination experiments, and on a daily quality control basis. The purpose of these comparisons was to evaluate the potential of pH monitoring as a method of processor control, or a supplement to sensitometry as a method of quality control. Reasonable correlation was found between pH values and film density in two of the three experiments but little or no correlation was found in the third experiment and on a day-to-day basis. The conclusion drawn from these comparisons is that pH monitoring has several limitations which render it unsuitable as a method of daily processor quality control as either a primary or supplementary technique. Sensitometry takes into account all the variables encountered in film processing and is the clear method of choice for processor quality control.
Scalable architecture for a room temperature solid-state quantum information processor.

PubMed

Yao, N Y; Jiang, L; Gorshkov, A V; Maurer, P C; Giedke, G; Cirac, J I; Lukin, M D

2012-04-24

The realization of a scalable quantum information processor has emerged over the past decade as one of the central challenges at the interface of fundamental science and engineering. Here we propose and analyse an architecture for a scalable, solid-state quantum information processor capable of operating at room temperature. Our approach is based on recent experimental advances involving nitrogen-vacancy colour centres in diamond. In particular, we demonstrate that the multiple challenges associated with operation at ambient temperature, individual addressing at the nanoscale, strong qubit coupling, robustness against disorder and low decoherence rates can be simultaneously achieved under realistic, experimentally relevant conditions. The architecture uses a novel approach to quantum information transfer and includes a hierarchy of control at successive length scales. Moreover, it alleviates the stringent constraints currently limiting the realization of scalable quantum processors and will provide fundamental insights into the physics of non-equilibrium many-body quantum systems.
Secure Embedded System Design Methodologies for Military Cryptographic Systems

DTIC Science & Technology

2016-03-31

Fault-Tree Analysis (FTA); Built-In Self-Test (BIST) Introduction Secure access-control systems restrict operations to authorized users via methods...failures in the individual software/processor elements, the question of exactly how unlikely is difficult to answer. Fault-Tree Analysis (FTA) has a...Collins of Sandia National Laboratories for years of sharing his extensive knowledge of Fail-Safe Design Assurance and Fault-Tree Analysis
Operating System Support for Shared Hardware Data Structures

DTIC Science & Technology

2013-01-31

Carbon [73] uses hardware queues to improve fine-grained multitasking for Recognition, Mining, and Synthesis. Compared to software approaches...web transaction processing, data mining, and multimedia. Early work in database processors [114, 96, 79, 111] reduce the costs of relational database...assignment can be solved statically or dynamically. Static assignment determines offline which data structures are assigned to use HWDS resources and at
Digital Collaboration Tools in the Military: Their Historical and Current Status

DTIC Science & Technology

2006-02-16

Writer = online word processor that edits, stores and shares your documents from anywhere. February 16, 2006 31 Recent “Disruptive” Technologies Cell...Webcasts Wikis February 16, 2006 32 Now Consider: Disruptive Technologies (1997) becomes Disruptive Innovations in 2003. Military Transformation: Drivers...from http://www.sims.berkeley.edu/how-much-info-2003 Schneiderman, R. (2005). Preparing for the Disruptive Technologies of Tomorrow. http
Distributed Systems Technology Survey.

DTIC Science & Technology

1987-03-01

and protocols. 2. Hardware Technology Economic factors are a major reason for the proliferation of distributed systems. Processors, memory, and magnetic and optical...destined messages and perform the appropriate forwarding. There gImsno agreement that a lightweight process mechanism is essential to support commonly used...Xerox PARC environment [31]. Shared file servers, discussed below, are essential to the success of such a scheme. 11. Security A distributed
Design of an integrated fuel processor for residential PEMFCs applications

NASA Astrophysics Data System (ADS)

Seo, Yu Taek; Seo, Dong Joo; Jeong, Jin Hyeok; Yoon, Wang Lai

KIER has been developing a novel fuel processing system to provide hydrogen rich gas to a residential PEMFC system. For the effective design of a compact hydrogen production system, the unit processes for steam reforming and water gas shift, together with a steam generator and internal heat exchangers, are thermally and physically integrated into a single packaged hardware system. The newly designed fuel processor (prototype II) showed a thermal efficiency of 78% on an HHV basis with methane conversion of 89%. The preferential oxidation unit, with two-stage cascade reactors, reduces the CO concentration to below 10 ppm without complicated temperature control hardware; this is the prerequisite CO limit for the PEMFC stack. After the initial performance of the fuel processor was achieved, partial load operation was carried out to test the performance and reliability of the fuel processor at various loads. The stability of the fuel processor was also demonstrated for three successive days with a stable composition of product gas and thermal efficiency. The CO concentration remained below 10 ppm during the test period, confirming the stable performance of the two-stage PrOx reactors.
Ion propulsion cost effectivity

NASA Technical Reports Server (NTRS)

Zafran, S.; Biess, J. J.

1978-01-01

Ion propulsion modules employing 8-cm thrusters and 30-cm thrusters were studied for Multimission Modular Spacecraft (MMS) applications. Recurring and nonrecurring cost elements were generated for these modules. As a result, ion propulsion cost drivers were identified to be Shuttle charges, solar array, power processing, and thruster costs. Cost effective design approaches included short length module configurations, array power sharing, operation at reduced thruster input power, simplified power processing units, and power processor output switching. The MMS mission model employed indicated that nonrecurring costs have to be shared with other programs unless the mission model grows. Extended performance missions exhibited the greatest benefits when compared with monopropellant hydrazine propulsion.
Processor Capacity Reserves for Multimedia Operating Systems

DTIC Science & Technology

1993-05-01

Stefan Savage, and Hideyuki Tokuda May 1993 CMU-CS-93-157 School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 Abstract Multimedia...and provide feedback so that the estimate can be adjusted if necessary. For non-periodic activities that are to be limited by a processor percentage...comments and suggestions: Brian Bershad, Ragunathan Rajkumar, and the members of the ART group and Mach group at CMU. 13 References [1] D. P
General-purpose interface bus for multiuser, multitasking computer system

NASA Technical Reports Server (NTRS)

Generazio, Edward R.; Roth, Don J.; Stang, David B.

1990-01-01

The architecture of a multiuser, multitasking, virtual-memory computer system intended for use by a medium-size research group is described. There are three central processing units (CPU) in the configuration, each with 16 MB memory, and two 474 MB hard disks attached. CPU 1 is designed for data analysis and contains an array processor for fast-Fourier transformations. In addition, CPU 1 shares display images viewed with the image processor. CPU 2 is designed for image analysis and display. CPU 3 is designed for data acquisition and contains 8 GPIB channels and an analog-to-digital conversion input/output interface with 16 channels. Up to 9 users can access the third CPU simultaneously for data acquisition. Focus is placed on the optimization of hardware interfaces and software, facilitating instrument control, data acquisition, and processing.
Performance prediction: A case study using a multi-ring KSR-1 machine

NASA Technical Reports Server (NTRS)

Sun, Xian-He; Zhu, Jianping

1995-01-01

While computers with tens of thousands of processors have successfully delivered high performance power for solving some of the so-called 'grand-challenge' applications, the notion of scalability is becoming an important metric in the evaluation of parallel machine architectures and algorithms. In this study, the prediction of scalability and its application are carefully investigated. A simple formula is presented to show the relation between scalability, single processor computing power, and degradation of parallelism. A case study is conducted on a multi-ring KSR1 shared virtual memory machine. Experimental and theoretical results show that the influence of topology variation of an architecture is predictable. Therefore, the performance of an algorithm on a sophisticated, hierarchical architecture can be predicted and the best algorithm-machine combination can be selected for a given application.
40 CFR 432.97 - Effluent limitations attainable by the application of the best control technology for...

Code of Federal Regulations, 2010 CFR

2010-07-01

... of Environment ENVIRONMENTAL PROTECTION AGENCY (CONTINUED) EFFLUENT GUIDELINES AND STANDARDS MEAT AND POULTRY PRODUCTS POINT SOURCE CATEGORY Canned Meats Processors § 432.97 Effluent limitations attainable by...

40 CFR 432.92 - Effluent limitations attainable by the application of the best practicable control technology...

Code of Federal Regulations, 2010 CFR

2010-07-01

... Protection of Environment ENVIRONMENTAL PROTECTION AGENCY (CONTINUED) EFFLUENT GUIDELINES AND STANDARDS MEAT AND POULTRY PRODUCTS POINT SOURCE CATEGORY Canned Meats Processors § 432.92 Effluent limitations...
40 CFR 432.77 - Effluent limitations attainable by the application of the best control technology for...

Code of Federal Regulations, 2010 CFR

2010-07-01

... of Environment ENVIRONMENTAL PROTECTION AGENCY (CONTINUED) EFFLUENT GUIDELINES AND STANDARDS MEAT AND POULTRY PRODUCTS POINT SOURCE CATEGORY Sausage and Luncheon Meats Processors § 432.77 Effluent limitations...
40 CFR 432.97 - Effluent limitations attainable by the application of the best control technology for...

Code of Federal Regulations, 2011 CFR

2011-07-01

... of Environment ENVIRONMENTAL PROTECTION AGENCY (CONTINUED) EFFLUENT GUIDELINES AND STANDARDS MEAT AND POULTRY PRODUCTS POINT SOURCE CATEGORY Canned Meats Processors § 432.97 Effluent limitations attainable by...
40 CFR 432.77 - Effluent limitations attainable by the application of the best control technology for...

Code of Federal Regulations, 2011 CFR

2011-07-01

... of Environment ENVIRONMENTAL PROTECTION AGENCY (CONTINUED) EFFLUENT GUIDELINES AND STANDARDS MEAT AND POULTRY PRODUCTS POINT SOURCE CATEGORY Sausage and Luncheon Meats Processors § 432.77 Effluent limitations...
Realization of a single image haze removal system based on DaVinci DM6467T processor

NASA Astrophysics Data System (ADS)

Liu, Zhuang

2014-10-01

Video monitoring systems (VMS) have been extensively applied in target recognition, traffic management, remote sensing, auto navigation and national defence. However, a VMS depends strongly on the weather: in fog, for instance, the quality of the images it receives is distinctly degraded and its effective range is reduced. Overall, the VMS performs poorly in bad weather. Research on the enhancement of fog-degraded images therefore has high theoretical and practical value. This paper presents a design scheme for a fog-degraded image enhancement system based on the TI DaVinci processor. The main function of the system is to take images captured by digital cameras and execute image enhancement processing to obtain a clear image. The processor used in this system is the dual core TI DaVinci DM6467T - ARM@500MHz+DSP@1GHz. A MontaVista Linux operating system runs on the ARM subsystem, which handles I/O and application processing. The DSP handles signal processing, and the results are available to the ARM subsystem in shared memory. Thanks to the DaVinci processor, the system provides image processing capability equivalent to that of an x86 computer at lower power cost and smaller volume. The results show that the system can process images at 25 frames per second at D1 resolution.
AFOSR BRI: Co-Design of Hardware/Software for Predicting MAV Aerodynamics

DTIC Science & Technology

2016-09-27

DOCUMENTATION PAGE Form Approved OMB No. 0704-0188 1. REPORT DATE (DD-MM-YYYY) 2. REPORT TYPE 4. TITLE AND SUBTITLE 5a. CONTRACT NUMBER 6. AUTHOR(S) 7...703-588-8494 AFOSR BRI While Moore’s Law theoretically doubles processor performance every 24 months, much of the realizable performance remains...past efforts to develop such CFD codes on accelerated processors showed limited success, our hardware/software co-design approach created malleable
The control data "GIRAFFE" system for interactive graphic finite element analysis

NASA Technical Reports Server (NTRS)

Park, S.; Brandon, D. M., Jr.

1975-01-01

The Graphical Interface for Finite Elements (GIRAFFE) general purpose interactive graphics application package was described. This system may be used as a pre/post processor for structural analysis computer programs. It facilitates the operations of creating, editing, or reviewing all the structural input/output data on a graphics terminal in a time-sharing mode of operation. An application program for a simple three-dimensional plate problem was illustrated.
Information Extraction Using Controlled English to Support Knowledge-Sharing and Decision-Making

DTIC Science & Technology

2012-06-01

or language variants. CE-based information extraction will greatly facilitate the processes in the cognitive and social domains that enable forces...terminology or language variants. CE-based information extraction will greatly facilitate the processes in the cognitive and social domains that...processor is run to turn the atomic CE into a more “stylistically felicitous” CE, using techniques such as: aggregating all information about an entity
Towards Scalable 1024 Processor Shared Memory Systems

NASA Technical Reports Server (NTRS)

Ciotti, Robert B.; Thigpen, William W. (Technical Monitor)

2001-01-01

Over the past 3 years, NASA Ames has been involved in a cooperative effort with SGI to develop the largest single system image systems available. Currently a 1024-processor Origin 3000 is under development, with first boot expected later in the summer of 2001. This paper discusses some early results with a 512p Origin 3000 system and some arcane IRIX system calls that can dramatically improve scaling performance.
Parallel Programming Paradigms

DTIC Science & Technology

1987-07-01

Unclassified 15a. DECLASSIFICATION/DOWNGRADING 16. DISTRIBUTION STATEMENT (of this Report) Distribution of this report is unlimited. 17...8416878 and by the Office of Naval Research Contracts No. N00014-86-K-0264 and No. N00014-85-K-0328. Parallel Programming Paradigms...processors to fetch from the same memory cell (list head) and thus seems to favor a shared memory implementation [37]. In this dissertation, we
Efficient Approximation Algorithms for Weighted $b$-Matching

DOE Office of Scientific and Technical Information (OSTI.GOV)

Khan, Arif; Pothen, Alex; Mostofa Ali Patwary, Md.

2016-01-01

We describe a half-approximation algorithm, b-Suitor, for computing a b-Matching of maximum weight in a graph with weights on the edges. b-Matching is a generalization of the well-known Matching problem in graphs, where the objective is to choose a subset of M edges in the graph such that at most a specified number b(v) of edges in M are incident on each vertex v. Subject to this restriction we maximize the sum of the weights of the edges in M. We prove that the b-Suitor algorithm computes the same b-Matching as the one obtained by the greedy algorithm for the problem. We implement the algorithm on serial and shared-memory parallel processors, and compare its performance against a collection of approximation algorithms that have been proposed for the Matching problem. Our results show that the b-Suitor algorithm outperforms the Greedy and Locally Dominant edge algorithms by one to two orders of magnitude on a serial processor. The b-Suitor algorithm has a high degree of concurrency, and it scales well up to 240 threads on a shared memory multiprocessor. The b-Suitor algorithm outperforms the Locally Dominant edge algorithm by a factor of fourteen on 16 cores of an Intel Xeon multiprocessor.
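Since the abstract states that b-Suitor returns the same b-Matching as the greedy algorithm, the greedy baseline is a convenient way to see what is being computed. The sketch below shows only that serial greedy baseline, not b-Suitor itself; the function name `greedy_b_matching` and the input layout (a list of `(u, v, weight)` tuples plus a capacity dictionary `b`) are assumptions made for illustration.

```python
def greedy_b_matching(edges, b):
    """Greedy b-Matching: scan edges by decreasing weight, keep an edge
    whenever both endpoints still have spare capacity.

    edges: iterable of (u, v, weight) tuples
    b:     dict mapping each vertex v to its capacity b(v)
    """
    residual = dict(b)  # remaining capacity per vertex
    matching = []
    # Heaviest edges first; ties are broken by endpoint names so that the
    # result is deterministic.
    for u, v, w in sorted(edges, key=lambda e: (-e[2], e[0], e[1])):
        if u != v and residual[u] > 0 and residual[v] > 0:
            matching.append((u, v, w))
            residual[u] -= 1
            residual[v] -= 1
    return matching


# Small 4-cycle example: with b(v) = 1 this is ordinary weighted matching.
example_edges = [("a", "b", 5), ("b", "c", 4), ("c", "d", 3), ("a", "d", 1)]
unit_caps = {v: 1 for v in "abcd"}
print(greedy_b_matching(example_edges, unit_caps))
```

Raising every capacity to 2 in this example lets all four edges into the matching, which shows how b(v) relaxes the usual one-edge-per-vertex constraint.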
Development of compact fuel processor for 2 kW class residential PEMFCs

NASA Astrophysics Data System (ADS)

Seo, Yu Taek; Seo, Dong Joo; Jeong, Jin Hyeok; Yoon, Wang Lai

Korea Institute of Energy Research (KIER) has been developing a novel fuel processing system to provide hydrogen rich gas to a residential polymer electrolyte membrane fuel cell (PEMFC) cogeneration system. For the effective design of a compact hydrogen production system, the unit processes of steam reforming, high and low temperature water gas shift, steam generator and internal heat exchangers are thermally and physically integrated into a packaged hardware system. Several prototypes are under development, and the prototype I fuel processor showed a thermal efficiency of 73% on an HHV basis with methane conversion of 81%. The recently tested prototype II showed improved performance, with a thermal efficiency of 76% and methane conversion of 83%. In both prototypes, two-stage PrOx reactors reduce the CO concentration to less than 10 ppm, which is the prerequisite CO limit of the product gas for the PEMFC stack. After the initial performance of the prototype I fuel processor was confirmed, it was coupled with a PEMFC single cell to test durability, and the fuel processor operated successfully for 3 days without any failure of the fuel cell voltage. The prototype II fuel processor also showed stable performance during the durability test.
The role of neuroimaging in the discovery of processing stages. A review.

PubMed

Mulder, G; Wijers, A A; Lange, J J; Buijink, B M; Mulder, L J; Willemsen, A T; Paans, A M

1995-11-01

In this contribution we show how neuroimaging methods can augment behavioural methods to discover processing stages. Event Related Brain Potentials (ERPs), Brain Electrical Source Analysis (BESA) and regional changes in cerebral blood flow (rCBF) do not necessarily require behavioural responses. With the aid of rCBF we are able to discover several cortical and subcortical brain systems (processors) active in selective attention and memory search tasks. BESA describes cortical activity with high temporal resolution in terms of a limited number of neural generators within these brain systems. The combination of behavioural methods and neuroimaging provides a picture of the functional architecture of the brain. The review is organized around three processors: the Visual, Cognitive and Manual Motor Processors.
Fully digital routing logic for single-photon avalanche diode arrays in highly efficient time-resolved imaging

NASA Astrophysics Data System (ADS)

Cominelli, Alessandro; Acconcia, Giulia; Ghioni, Massimo; Rech, Ivan

2018-03-01

Time-correlated single-photon counting (TCSPC) is a powerful optical technique, which permits recording fast luminous signals with picosecond precision. Unfortunately, given its repetitive nature, TCSPC is recognized as a relatively slow technique, especially when a large time-resolved image has to be recorded. In recent years, there has been a fast trend toward the development of TCSPC imagers. Unfortunately, present systems still suffer from a trade-off between number of channels and performance. Even worse, the overall measurement speed is still limited well below the saturation of the transfer bandwidth toward the external processor. We present a routing algorithm that enables a smart connection between a 32×32 detector array and five shared high-performance converters able to provide an overall conversion rate up to 10 Gbit/s. The proposed solution exploits a fully digital logic circuit distributed in a tree structure to limit the number and length of interconnections, which is a major issue in densely integrated circuits. The behavior of the logic has been validated by means of a field-programmable gate array, while a fully integrated prototype has been designed in 180-nm technology and analyzed by means of postlayout simulations.
Expedition Seven CDR Malenkenko performs IFM on Condensate Water Processor

NASA Image and Video Library

2003-07-03

ISS007-E-09229 (3 July 2003) --- Cosmonaut Yuri I. Malenchenko, Expedition 7 mission commander, performs scheduled in-flight maintenance (IFM) on the condensate water processor (SRV-K2M) by removing and replacing its BKO multifiltration/purification column unit, which has reached its service life limit (450 liters min.). The old unit will be discarded on Progress. The IFM took place in the Zvezda Service Module on the International Space Station (ISS). Malenchenko represents Rosaviakosmos.
Expedition Seven CDR Malenkenko performs IFM on Condensate Water Processor

NASA Image and Video Library

2003-07-03

ISS007-E-09231 (3 July 2003) --- Cosmonaut Yuri I. Malenchenko, Expedition 7 mission commander, performs scheduled in-flight maintenance (IFM) on the condensate water processor (SRV-K2M) by removing and replacing its BKO multifiltration/purification column unit, which has reached its service life limit (450 liters min.). The old unit will be discarded on Progress. The IFM took place in the Zvezda Service Module on the International Space Station (ISS). Malenchenko represents Rosaviakosmos.
Thermal Hotspots in CPU Die and It's Future Architecture

NASA Astrophysics Data System (ADS)

Wang, Jian; Hu, Fu-Yuan

Owing to increasing core frequency and chip integration and the limited die dimensions, power densities in CPU chips have been increasing rapidly. The high on-chip temperatures that result from these power densities threaten the processor's performance and the chip's reliability. This paper analyzes the thermal hotspots in the die and their properties. A new architecture for the function units in the die, a distributed hot-units architecture, is suggested to cope with the problem of high power densities in future processor chips.
MPF: A portable message passing facility for shared memory multiprocessors

NASA Technical Reports Server (NTRS)

Malony, Allen D.; Reed, Daniel A.; Mcguire, Patrick J.

1987-01-01

The design, implementation, and performance evaluation of a message passing facility (MPF) for shared memory multiprocessors are presented. The MPF is based on a message passing model conceptually similar to conversations. Participants (parallel processors) can enter or leave a conversation at any time. The message passing primitives for this model are implemented as a portable library of C function calls. The MPF is currently operational on a Sequent Balance 21000, and several parallel applications were developed and tested. Several simple benchmark programs are presented to establish interprocess communication performance for common patterns of interprocess communication. Finally, performance figures are presented for two parallel applications, linear systems solution, and iterative solution of partial differential equations.
Reducing Interprocessor Dependence in Recoverable Distributed Shared Memory

NASA Technical Reports Server (NTRS)

Janssens, Bob; Fuchs, W. Kent

1994-01-01

Checkpointing techniques in parallel systems use dependency tracking and/or message logging to ensure that a system rolls back to a consistent state. Traditional dependency tracking in distributed shared memory (DSM) systems is expensive because of high communication frequency. In this paper we show that, if designed correctly, a DSM system only needs to consider dependencies due to the transfer of blocks of data, resulting in reduced dependency tracking overhead and reduced potential for rollback propagation. We develop an ownership timestamp scheme to tolerate the loss of block state information and develop a passive server model of execution where interactions between processors are considered atomic. With our scheme, dependencies are significantly reduced compared to the traditional message-passing model.
Power processor for a 20CM ion thruster

NASA Technical Reports Server (NTRS)

Biess, J. J.; Schoenfeld, A. D.; Cohen, E.

1973-01-01

A power processor breadboard for the JPL 20CM Ion Engine was designed, fabricated, and tested to determine compliance with the electrical specification. The power processor breadboard used the silicon-controlled rectifier (SCR) series resonant inverter as the basic power stage to process all the power to the ion engine. The breadboard power processor was integrated with the JPL 20CM ion engine and complete testing was performed. The integration tests were performed without any silicon-controlled rectifier failure. This demonstrated the ruggedness of the series resonant inverter in protecting the switching elements during arcing in the ion engine. A method of fault clearing the ion engine and returning back to normal operation without elaborate sequencing and timing control logic was evolved. In this method, the main vaporizer was turned off and the discharge current limit was reduced when an overload existed on the screen/accelerator supply. After the high voltage returned to normal, both the main vaporizer and the discharge were returned to normal.

Real-Time Spatio-Temporal Twice Whitening for MIMO Energy Detector

DOE Office of Scientific and Technical Information (OSTI.GOV)

Humble, Travis S; Mitra, Pramita; Barhen, Jacob

2010-01-01

While many techniques exist for local spectrum sensing of a primary user, each represents a computationally demanding task to secondary user receivers. In software-defined radio, computational complexity lengthens the time for a cognitive radio to recognize changes in the transmission environment. This complexity is even more significant for spatially multiplexed receivers, e.g., in SIMO and MIMO, where the spatio-temporal data sets grow in size with the number of antennae. Limits on power and space for the processor hardware further constrain SDR performance. In this report, we discuss improvements in spatio-temporal twice whitening (STTW) for real-time local spectrum sensing by demonstrating a form of STTW well suited for MIMO environments. We implement STTW on the Coherent Logix hx3100 processor, a multicore processor intended for low-power, high-throughput software-defined signal processing. These results demonstrate how coupling the novel capabilities of emerging multicore processors with algorithmic advances can enable real-time, software-defined processing of large spatio-temporal data sets.
Video image processor on the Spacelab 2 Solar Optical Universal Polarimeter /SL2 SOUP/

NASA Technical Reports Server (NTRS)

Lindgren, R. W.; Tarbell, T. D.

1981-01-01

The SOUP instrument is designed to obtain diffraction-limited digital images of the sun with high photometric accuracy. The Video Processor originated from the requirement to provide onboard real-time image processing, both to reduce the telemetry rate and to provide meaningful video displays of scientific data to the payload crew. This original concept has evolved into a versatile digital processing system with a multitude of other uses in the SOUP program. The central element in the Video Processor design is a 16-bit central processing unit based on 2900 family bipolar bit-slice devices. All arithmetic, logical and I/O operations are under control of microprograms, stored in programmable read-only memory and initiated by commands from the LSI-11. Several functions of the Video Processor are described, including interface to the High Rate Multiplexer downlink, cosmetic and scientific data processing, scan conversion for crew displays, focus and exposure testing, and use as ground support equipment.
On-board landmark navigation and attitude reference parallel processor system

NASA Technical Reports Server (NTRS)

Gilbert, L. E.; Mahajan, D. T.

1978-01-01

An approach to autonomous navigation and attitude reference for earth observing spacecraft is described along with the landmark identification technique based on a sequential similarity detection algorithm (SSDA). Laboratory experiments undertaken to determine if better than one pixel accuracy in registration can be achieved consistent with onboard processor timing and capacity constraints are included. The SSDA is implemented using a multi-microprocessor system including synchronization logic and chip library. The data is processed in parallel stages, effectively reducing the time to match the small known image within a larger image as seen by the onboard image system. Shared memory is incorporated in the system to help communicate intermediate results among microprocessors. The functions include finding mean values and summation of absolute differences over the image search area. The hardware is a low power, compact unit suitable to onboard application with the flexibility to provide for different parameters depending upon the environment.
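The core of a sequential similarity detection search, summing absolute differences and abandoning a candidate position early, can be sketched in a few lines. This is a serial illustration under stated assumptions, not the multi-microprocessor implementation described above: images are plain nested lists, the abandonment threshold is simply the best error found so far (the report's actual threshold rule is not given here), and the mean-value step the abstract mentions is omitted. The name `ssda_match` is hypothetical.

```python
def ssda_match(image, template):
    """Locate `template` inside `image` by summed absolute differences,
    abandoning a candidate as soon as its running error can no longer win."""
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    best_pos, best_err = None, float("inf")
    for r in range(ih - th + 1):
        for c in range(iw - tw + 1):
            err = 0
            abandoned = False
            for y in range(th):
                for x in range(tw):
                    err += abs(image[r + y][c + x] - template[y][x])
                # Check the running error after each template row.
                if err >= best_err:
                    abandoned = True
                    break
            if not abandoned:
                best_pos, best_err = (r, c), err
    return best_pos, best_err


# A 5 x 5 ramp image and a 2 x 2 template cut from row 2, column 1.
ramp = [[5 * r + c for c in range(5)] for r in range(5)]
patch = [row[1:3] for row in ramp[2:4]]
print(ssda_match(ramp, patch))
```

Each candidate offset is independent of the others, which is what makes the search area easy to split across several processors that exchange only their current best error.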
Description and Simulation of a Fast Packet Switch Architecture for Communication Satellites

NASA Technical Reports Server (NTRS)

Quintana, Jorge A.; Lizanich, Paul J.

1995-01-01

The NASA Lewis Research Center has been developing the architecture for a multichannel communications signal processing satellite (MCSPS) as part of a flexible, low-cost meshed-VSAT (very small aperture terminal) network. The MCSPS architecture is based on a multifrequency, time-division-multiple-access (MF-TDMA) uplink and a time-division multiplex (TDM) downlink. There are eight uplink MF-TDMA beams, and eight downlink TDM beams, with eight downlink dwells per beam. The information-switching processor, which decodes, stores, and transmits each packet of user data to the appropriate downlink dwell onboard the satellite, has been fully described by using VHSIC (Very High Speed Integrated-Circuit) Hardware Description Language (VHDL). This VHDL code, which was developed in-house to simulate the information switching processor, showed that the architecture is both feasible and viable. This paper describes a shared-memory-per-beam architecture, its VHDL implementation, and the simulation efforts.
Computer-aided design/computer-aided manufacturing skull base drill.

PubMed

Couldwell, William T; MacDonald, Joel D; Thomas, Charles L; Hansen, Bradley C; Lapalikar, Aniruddha; Thakkar, Bharat; Balaji, Alagar K

2017-05-01

The authors have developed a simple device for computer-aided design/computer-aided manufacturing (CAD-CAM) that uses an image-guided system to define a cutting tool path that is shared with a surgical machining system for drilling bone. Information from 2D images (obtained via CT and MRI) is transmitted to a processor that produces a 3D image. The processor generates code defining an optimized cutting tool path, which is sent to a surgical machining system that can drill the desired portion of bone. This tool has applications for bone removal in both cranial and spine neurosurgical approaches. Such applications have the potential to reduce surgical time and associated complications such as infection or blood loss. The device enables rapid removal of bone within 1 mm of vital structures. The validity of such a machining tool is exemplified in the rapid (< 3 minutes machining time) and accurate removal of bone for transtemporal (for example, translabyrinthine) approaches.
Efficient quantum walk on a quantum processor

PubMed Central

Qiang, Xiaogang; Loke, Thomas; Montanaro, Ashley; Aungskunsiri, Kanin; Zhou, Xiaoqi; O'Brien, Jeremy L.; Wang, Jingbo B.; Matthews, Jonathan C. F.

2016-01-01

The random walk formalism is used across a wide range of applications, from modelling share prices to predicting population genetics. Likewise, quantum walks have shown much potential as a framework for developing new quantum algorithms. Here we present explicit efficient quantum circuits for implementing continuous-time quantum walks on the circulant class of graphs. These circuits allow us to sample from the output probability distributions of quantum walks on circulant graphs efficiently. We also show that solving the same sampling problem for arbitrary circulant quantum circuits is intractable for a classical computer, assuming conjectures from computational complexity theory. This is a new link between continuous-time quantum walks and computational complexity theory and it indicates a family of tasks that could ultimately demonstrate quantum supremacy over classical computers. As a proof of principle, we experimentally implement the proposed quantum circuit on an example circulant graph using a two-qubit photonics quantum processor. PMID:27146471
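As a purely classical illustration of the continuous-time walk described above, the following minimal sketch uses the standard linear-algebra fact that every circulant matrix is diagonalised by the discrete Fourier transform. It is not the authors' quantum circuit; the function name `circulant_walk_probabilities` and the choice of the 4-cycle as the example graph are assumptions made here for illustration.

```python
import cmath
import math


def circulant_walk_probabilities(first_col, t):
    """Vertex probabilities of a continuous-time quantum walk U(t) = exp(-i A t)
    started at vertex 0, where A is the symmetric circulant adjacency matrix
    whose first column is first_col (hypothetical helper, illustration only)."""
    n = len(first_col)
    omega = cmath.exp(2j * math.pi / n)
    # Eigenvalues of a circulant matrix are the DFT of its first column;
    # they are real because the graph is assumed undirected (A symmetric).
    eigenvalues = [
        sum(first_col[l] * omega ** (-l * k) for l in range(n)).real
        for k in range(n)
    ]
    probabilities = []
    for m in range(n):
        # Amplitude at vertex m: inverse DFT of the phases exp(-i * lambda_k * t).
        amplitude = sum(
            cmath.exp(-1j * eigenvalues[k] * t) * omega ** (m * k)
            for k in range(n)
        ) / n
        probabilities.append(abs(amplitude) ** 2)
    return probabilities


# Cycle graph on 4 vertices: vertex 0 is adjacent to vertices 1 and 3.
probs = circulant_walk_probabilities([0, 1, 0, 1], t=math.pi / 2)
print([round(p, 6) for p in probs])
```

For the 4-cycle at t = pi/2 the probability concentrates on vertex 2, the vertex opposite the start. The classical cost of this direct evaluation grows with the number of vertices, which is the regime where the abstract argues a quantum processor could help.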
Automation of Data Traffic Control on DSM Architecture

NASA Technical Reports Server (NTRS)

Frumkin, Michael; Jin, Hao-Qiang; Yan, Jerry

2001-01-01

The design of distributed shared memory (DSM) computers liberates users from the duty to distribute data across processors and allows for the incremental development of parallel programs using, for example, OpenMP or Java threads. DSM architecture greatly simplifies the development of parallel programs having good performance on a few processors. However, achieving good program scalability on DSM computers requires that the user understand data flow in the application and use various techniques to avoid data traffic congestion. In this paper we discuss a number of such techniques, including data blocking, data placement, data transposition and page size control, and evaluate their efficiency on the NAS (NASA Advanced Supercomputing) Parallel Benchmarks. We also present a tool which automates the detection of constructs causing data congestion in Fortran array-oriented codes and advises the user on code transformations for improving data traffic in the application.
Parallel programming with Easy Java Simulations

NASA Astrophysics Data System (ADS)

Esquembre, F.; Christian, W.; Belloni, M.

2018-01-01

Nearly all of today's processors are multicore, and ideally programming and algorithm development utilizing the entire processor should be introduced early in the computational physics curriculum. Parallel programming is often not introduced because it requires a new programming environment and uses constructs that are unfamiliar to many teachers. We describe how we decrease the barrier to parallel programming by using a Java-based programming environment to treat problems in the usual undergraduate curriculum. We use the Easy Java Simulations programming and authoring tool to create the program's graphical user interface together with objects based on those developed by Kaminsky [Building Parallel Programs (Course Technology, Boston, 2010)] to handle common parallel programming tasks. Shared-memory parallel implementations of physics problems, such as time evolution of the Schrödinger equation, are available as source code and as ready-to-run programs from the AAPT-ComPADRE digital library.
An Evaluation of Architectural Platforms for Parallel Navier-Stokes Computations

NASA Technical Reports Server (NTRS)

Jayasimha, D. N.; Hayder, M. E.; Pillay, S. K.

1996-01-01

We study the computational, communication, and scalability characteristics of a computational fluid dynamics application, which solves the time accurate flow field of a jet using the compressible Navier-Stokes equations, on a variety of parallel architecture platforms. The platforms chosen for this study are a cluster of workstations (the LACE experimental testbed at NASA Lewis), a shared memory multiprocessor (the Cray YMP), and distributed memory multiprocessors with different topologies - the IBM SP and the Cray T3D. We investigate the impact of various networks connecting the cluster of workstations on the performance of the application and the overheads induced by popular message passing libraries used for parallelization. The work also highlights the importance of matching the memory bandwidth to the processor speed for good single processor performance. By studying the performance of an application on a variety of architectures, we are able to point out the strengths and weaknesses of each of the example computing platforms.
Parallelizing Navier-Stokes Computations on a Variety of Architectural Platforms

NASA Technical Reports Server (NTRS)

Jayasimha, D. N.; Hayder, M. E.; Pillay, S. K.

1997-01-01

We study the computational, communication, and scalability characteristics of a Computational Fluid Dynamics application, which solves the time accurate flow field of a jet using the compressible Navier-Stokes equations, on a variety of parallel architectural platforms. The platforms chosen for this study are a cluster of workstations (the LACE experimental testbed at NASA Lewis), a shared memory multiprocessor (the Cray YMP), and distributed memory multiprocessors with different topologies - the IBM SP and the Cray T3D. We investigate the impact of various networks connecting the cluster of workstations on the performance of the application and the overheads induced by popular message passing libraries used for parallelization. The work also highlights the importance of matching the memory bandwidth to the processor speed for good single processor performance. By studying the performance of an application on a variety of architectures, we are able to point out the strengths and weaknesses of each of the example computing platforms.
Distributed simulation using a real-time shared memory network

NASA Technical Reports Server (NTRS)

Simon, Donald L.; Mattern, Duane L.; Wong, Edmond; Musgrave, Jeffrey L.

1993-01-01

The Advanced Control Technology Branch of the NASA Lewis Research Center performs research in the area of advanced digital controls for aeronautic and space propulsion systems. This work requires the real-time implementation of both control software and complex dynamical models of the propulsion system. We are implementing these systems in a distributed, multi-vendor computer environment. Therefore, a need exists for real-time communication and synchronization between the distributed multi-vendor computers. A shared memory network is a potential solution which offers several advantages over other real-time communication approaches. A candidate shared memory network was tested for basic performance. The shared memory network was then used to implement a distributed simulation of a ramjet engine. The accuracy and execution time of the distributed simulation were measured and compared to the performance of the non-partitioned simulation. The ease of partitioning the simulation, the minimal development time required for communication between the processors, and the resulting execution time all indicate that the shared memory network is a real-time communication technique worthy of serious consideration.
The PALM-3000 high-order adaptive optics system for Palomar Observatory

NASA Astrophysics Data System (ADS)

Bouchez, Antonin H.; Dekany, Richard G.; Angione, John R.; Baranec, Christoph; Britton, Matthew C.; Bui, Khanh; Burruss, Rick S.; Cromer, John L.; Guiwits, Stephen R.; Henning, John R.; Hickey, Jeff; McKenna, Daniel L.; Moore, Anna M.; Roberts, Jennifer E.; Trinh, Thang Q.; Troy, Mitchell; Truong, Tuan N.; Velur, Viswa

2008-07-01

Deployed as a multi-user shared facility on the 5.1 meter Hale Telescope at Palomar Observatory, the PALM-3000 high-order upgrade to the successful Palomar Adaptive Optics System will deliver extreme AO correction in the near-infrared, and diffraction-limited images down to visible wavelengths, using both natural and sodium laser guide stars. Wavefront control will be provided by two deformable mirrors, a 3368 active actuator woofer and 349 active actuator tweeter, controlled at up to 3 kHz using an innovative wavefront processor based on a cluster of 17 graphics processing units. A Shack-Hartmann wavefront sensor with selectable pupil sampling will provide high-order wavefront sensing, while an infrared tip/tilt sensor and visible truth wavefront sensor will provide low-order LGS control. Four back-end instruments are planned at first light: the PHARO near-infrared camera/spectrograph, the SWIFT visible light integral field spectrograph, Project 1640, a near-infrared coronagraphic integral field spectrograph, and 888Cam, a high-resolution visible light imager.
Parallel Discrete Molecular Dynamics Simulation With Speculation and In-Order Commitment

PubMed Central

Khan, Md. Ashfaquzzaman; Herbordt, Martin C.

2011-01-01

Discrete molecular dynamics simulation (DMD) uses simplified and discretized models enabling simulations to advance by event rather than by timestep. DMD is an instance of discrete event simulation and so is difficult to scale: even in this multi-core era, all reported DMD codes are serial. In this paper we discuss the inherent difficulties of scaling DMD and present our method of parallelizing DMD through event-based decomposition. Our method is microarchitecture inspired: speculative processing of events exposes parallelism, while in-order commitment ensures correctness. We analyze the potential of this parallelization method for shared-memory multiprocessors. Achieving scalability required extensive experimentation with scheduling and synchronization methods to mitigate serialization. The speed-up achieved for a variety of system sizes and complexities is nearly 6× on an 8-core and over 9× on a 12-core processor. We present and verify analytical models that account for the achieved performance as a function of available concurrency and architectural limitations. PMID:21822327
Parallel Discrete Molecular Dynamics Simulation With Speculation and In-Order Commitment.

PubMed

Khan, Md Ashfaquzzaman; Herbordt, Martin C

2011-07-20

Discrete molecular dynamics simulation (DMD) uses simplified and discretized models enabling simulations to advance by event rather than by timestep. DMD is an instance of discrete event simulation and so is difficult to scale: even in this multi-core era, all reported DMD codes are serial. In this paper we discuss the inherent difficulties of scaling DMD and present our method of parallelizing DMD through event-based decomposition. Our method is microarchitecture inspired: speculative processing of events exposes parallelism, while in-order commitment ensures correctness. We analyze the potential of this parallelization method for shared-memory multiprocessors. Achieving scalability required extensive experimentation with scheduling and synchronization methods to mitigate serialization. The speed-up achieved for a variety of system sizes and complexities is nearly 6× on an 8-core and over 9× on a 12-core processor. We present and verify analytical models that account for the achieved performance as a function of available concurrency and architectural limitations.
Identification of Air Force Emerging Technologies and Militarily Significant Emerging Technologies.

DTIC Science & Technology

1985-08-31

taking an integrated approach to avionics and EU, the various sensors and receivers on the aircraft can time-share the use of common signal processors...functions mentioned above has required, in addition to a separate sensor or antenna, a totally independent electronics suite. Many of the advanced...Classification A3. IMAGING SENSOR AUTOPROCESSOR The Air Force has contracted with Rockwell International and Honeywell in this work. Rockwell’s work is
Cooperative system and method using mobile robots for testing a cooperative search controller

DOEpatents

Byrne, Raymond H.; Harrington, John J.; Eskridge, Steven E.; Hurtado, John E.

2002-01-01

A test system for testing a controller provides a way to use large numbers of miniature mobile robots to test a cooperative search controller in a test area, where each mobile robot has a sensor, a communication device, a processor, and a memory. A method of using a test system provides a way for testing a cooperative search controller using multiple robots sharing information and communicating over a communication network.
Smart photonic networks and computer security for image data

NASA Astrophysics Data System (ADS)

Campello, Jorge; Gill, John T.; Morf, Martin; Flynn, Michael J.

1998-02-01

Work reported here is part of a larger project on 'Smart Photonic Networks and Computer Security for Image Data', studying the interactions of coding and security, switching architecture simulations, and basic technologies. Coding and security: coding methods that are appropriate for data security in data fusion networks were investigated. These networks have several characteristics that distinguish them from other currently employed networks, such as Ethernet LANs or the Internet. The most significant characteristics are very high maximum data rates; predominance of image data; narrowcasting - transmission of data from one source to a designated set of receivers; data fusion - combining related data from several sources; simple sensor nodes with limited buffering. These characteristics affect both the lower level network design and the higher level coding methods. Data security encompasses privacy, integrity, reliability, and availability. Privacy, integrity, and reliability can be provided through encryption and coding for error detection and correction. Availability is primarily a network issue; network nodes must be protected against failure or routed around in the case of failure. One of the more promising techniques is the use of 'secret sharing'. We consider this method as a special case of our new space-time code diversity based algorithms for secure communication. These algorithms enable us to exploit parallelism and scalable multiplexing schemes to build photonic network architectures. A number of very high-speed switching and routing architectures and their relationships with very high performance processor architectures were studied. Indications are that routers for very high speed photonic networks can be designed using the very robust and distributed TCP/IP protocol, if suitable processor architecture support is available.
DOE Office of Scientific and Technical Information (OSTI.GOV)

Dritz, K.W.; Boyle, J.M.

This paper addresses the problem of measuring and analyzing the performance of fine-grained parallel programs running on shared-memory multiprocessors. Such processors use locking (either directly in the application program, or indirectly in a subroutine library or the operating system) to serialize accesses to global variables. Given sufficiently high rates of locking, the chief factor preventing linear speedup (besides lack of adequate inherent parallelism in the application) is lock contention - the blocking of processes that are trying to acquire a lock currently held by another process. We show how a high-resolution, low-overhead clock may be used to measure both lock contention and lack of parallel work. Several ways of presenting the results are covered, culminating in a method for calculating, in a single multiprocessing run, both the speedup actually achieved and the speedup lost to contention for each lock and to lack of parallel work. The speedup losses are reported in the same units, "processor-equivalents," as the speedup achieved. Both are obtained without having to perform the usual one-process comparison run. We chronicle also a variety of experiments motivated by actual results obtained with our measurement method. The insights into program performance that we gained from these experiments helped us to refine the parts of our programs concerned with communication and synchronization. Ultimately these improvements reduced lock contention to a negligible amount and yielded nearly linear speedup in applications not limited by lack of parallel work. We describe two generally applicable strategies ("code motion out of critical regions" and "critical-region fissioning") for reducing lock contention and one ("lock/variable fusion") applicable only on certain architectures.
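The measurement idea in this abstract, timing how long each process blocks on a lock with a high-resolution clock, can be sketched as follows. This is a minimal illustration and not the authors' instrumentation: the class `TimedLock`, the function `measure_contention`, and the simplified definition of lost "processor-equivalents" as total blocked time divided by wall-clock time are all assumptions made here, and `time.sleep` stands in for real work inside the critical region.

```python
import threading
import time


class TimedLock:
    """A lock wrapper that accumulates the time threads spend blocked on it
    (hypothetical helper, illustration only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self.wait_seconds = 0.0
        self.acquisitions = 0

    def __enter__(self):
        start = time.perf_counter()
        self._lock.acquire()
        waited = time.perf_counter() - start
        # The counters are updated while the lock is held, so only one
        # thread at a time can modify them.
        self.wait_seconds += waited
        self.acquisitions += 1
        return self

    def __exit__(self, exc_type, exc, tb):
        self._lock.release()
        return False


def measure_contention(workers=4, iterations=20, hold_seconds=0.001):
    """Run `workers` threads that repeatedly enter one critical region and
    return (acquisitions, elapsed seconds, processor-equivalents lost)."""
    lock = TimedLock()

    def work():
        for _ in range(iterations):
            with lock:
                time.sleep(hold_seconds)  # stand-in for the critical region

    threads = [threading.Thread(target=work) for _ in range(workers)]
    start = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    elapsed = time.perf_counter() - start
    # Assumed simplification: blocked time summed over all threads, divided
    # by the wall-clock time of the run.
    lost = lock.wait_seconds / elapsed
    return lock.acquisitions, elapsed, lost


acquisitions, elapsed, lost = measure_contention()
print(f"{acquisitions} acquisitions, {lost:.2f} processor-equivalents lost to contention")
```

Shortening the critical region (smaller `hold_seconds`) or splitting one lock into several should reduce the reported loss, which mirrors the "code motion out of critical regions" and "critical-region fissioning" strategies named above.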
Spectral Structure Of Phase-Induced Intensity Noise In Recirculating Delay Lines

NASA Astrophysics Data System (ADS)

Tur, M.; Moslehi, B.; Bowers, J. E.; Newton, S. A.; Jackson, K. P.; Goodman, J. W.; Cutler, C. C.; Shaw, H. J.

1983-09-01

The dynamic range of fiber optic signal processors driven by relatively incoherent multimode semiconductor lasers is shown to be severely limited by laser phase-induced noise. It is experimentally demonstrated that while the noise power spectrum of differential length fiber filters is approximately flat, processors with recirculating loops exhibit noise with a periodically structured power spectrum with notches at zero frequency as well as at all other multiples of 1/(loop delay). The experimental results are augmented by a theoretical analysis.
Implementing Legacy-C Algorithms in FPGA Co-Processors for Performance Accelerated Smart Payloads

NASA Technical Reports Server (NTRS)

Pingree, Paula J.; Scharenbroich, Lucas J.; Werne, Thomas A.; Hartzell, Christine

2008-01-01

Accurate, on-board classification of instrument data is used to increase science return by autonomously identifying regions of interest for priority transmission or generating summary products to conserve transmission bandwidth. Due to on-board processing constraints, such classification has been limited to using the simplest functions on a small subset of the full instrument data. FPGA co-processor designs for SVM classifiers will lead to significant improvement in on-board classification capability and accuracy.

Parallel computing for probabilistic fatigue analysis

NASA Technical Reports Server (NTRS)

Sues, Robert H.; Lua, Yuan J.; Smith, Mark D.

1993-01-01

This paper presents the results of Phase I research to investigate the most effective parallel processing software strategies and hardware configurations for probabilistic structural analysis. We investigate the efficiency of both shared and distributed-memory architectures via a probabilistic fatigue life analysis problem. We also present a parallel programming approach, the virtual shared-memory paradigm, that is applicable across both types of hardware. Using this approach, problems can be solved on a variety of parallel configurations, including networks of single or multiprocessor workstations. We conclude that it is possible to effectively parallelize probabilistic fatigue analysis codes; however, special strategies will be needed to achieve large-scale parallelism, to keep a large number of processors busy, and to treat problems with the large memory requirements encountered in practice. We also conclude that distributed-memory architecture is preferable to shared-memory for achieving large-scale parallelism; however, in the future, the currently emerging hybrid-memory architectures will likely be optimal.
Parallelization of KENO-Va Monte Carlo code

NASA Astrophysics Data System (ADS)

Ramón, Javier; Peña, Jorge

1995-07-01

KENO-Va is a code integrated within the SCALE system developed by Oak Ridge that solves the transport equation through the Monte Carlo Method. It is being used at the Consejo de Seguridad Nuclear (CSN) to perform criticality calculations for fuel storage pools and shipping casks. Two parallel versions of the code have been generated: one for shared memory machines and another for distributed memory systems using the message-passing interface PVM. In both versions the neutrons of each generation are tracked in parallel. In order to preserve the reproducibility of the results in both versions, advanced seeds for random numbers were used. The CONVEX C3440 with four processors and shared memory at CSN was used to implement the shared memory version. An FDDI network of six HP9000/735 workstations was employed to implement the message-passing version using proprietary PVM. The speedup obtained was 3.6 in both cases.
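The reproducibility requirement mentioned above can be illustrated with a small sketch in which every neutron history draws from its own random stream, so the tally does not depend on how histories are divided among workers. The abstract does not describe how its "advanced seeds" are constructed, so this example substitutes a simpler technique, seeding each history from its generation and index. The toy physics (absorption with a fixed probability per collision) and the names `track_history` and `run_generation` are assumptions for illustration only.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def track_history(base_seed, generation, index, absorb_prob=0.3):
    """Follow one toy neutron history and return its number of collisions.
    The random stream depends only on (base_seed, generation, index)."""
    rng = random.Random(f"{base_seed}:{generation}:{index}")
    collisions = 1
    while rng.random() >= absorb_prob:  # neutron scatters and collides again
        collisions += 1
    return collisions


def run_generation(base_seed, generation, n_histories, workers):
    """Track all histories of one generation in parallel and return the
    total collision count (an integer tally, so summation order is irrelevant)."""

    def chunk(worker):
        # Worker w handles histories w, w + workers, w + 2 * workers, ...
        return sum(
            track_history(base_seed, generation, i)
            for i in range(worker, n_histories, workers)
        )

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(chunk, range(workers)))


serial = run_generation(base_seed=42, generation=0, n_histories=1000, workers=1)
parallel = run_generation(base_seed=42, generation=0, n_histories=1000, workers=4)
print(serial == parallel)
```

Because each history owns its stream, the serial and four-worker runs produce identical tallies. A production code would more likely use a generator with a skip-ahead facility so that the streams are provably non-overlapping.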
Optimization of image processing algorithms on mobile platforms

NASA Astrophysics Data System (ADS)

Poudel, Pramod; Shirvaikar, Mukul

2011-03-01

This work presents a technique to optimize popular image processing algorithms on mobile platforms such as cell phones, net-books and personal digital assistants (PDAs). The increasing demand for video applications like context-aware computing on mobile embedded systems requires the use of computationally intensive image processing algorithms. The system engineer has a mandate to optimize them so as to meet real-time deadlines. A methodology to take advantage of the asymmetric dual-core processor, which includes an ARM and a DSP core supported by shared memory, is presented with implementation details. The target platform chosen is the popular OMAP 3530 processor for embedded media systems. It has an asymmetric dual-core architecture with an ARM Cortex-A8 and a TMS320C64x Digital Signal Processor (DSP). The development platform was the BeagleBoard with 256 MB of NAND RAM and 256 MB SDRAM memory. The basic image correlation algorithm is chosen for benchmarking as it finds widespread application for various template matching tasks such as face-recognition. The basic algorithm prototypes conform to OpenCV, a popular computer vision library. OpenCV algorithms can be easily ported to the ARM core which runs a popular operating system such as Linux or Windows CE. However, the DSP is architecturally more efficient at handling DFT algorithms. The algorithms are tested on a variety of images and performance results are presented measuring the speedup obtained due to dual-core implementation. A major advantage of this approach is that it allows the ARM processor to perform important real-time tasks, while the DSP addresses performance-hungry algorithms.
A Wearable Healthcare System With a 13.7 μA Noise Tolerant ECG Processor.

PubMed

Izumi, Shintaro; Yamashita, Ken; Nakano, Masanao; Kawaguchi, Hiroshi; Kimura, Hiromitsu; Marumoto, Kyoji; Fuchikami, Takaaki; Fujimori, Yoshikazu; Nakajima, Hiroshi; Shiga, Toshikazu; Yoshimoto, Masahiko

2015-10-01

To prevent lifestyle diseases, wearable bio-signal monitoring systems for daily life monitoring have attracted attention. Wearable systems have strict size and weight constraints, which impose significant limitations of the battery capacity and the signal-to-noise ratio of bio-signals. This report describes an electrocardiograph (ECG) processor for use with a wearable healthcare system. It comprises an analog front end, a 12-bit ADC, a robust Instantaneous Heart Rate (IHR) monitor, a 32-bit Cortex-M0 core, and 64 Kbyte Ferroelectric Random Access Memory (FeRAM). The IHR monitor uses a short-term autocorrelation (STAC) algorithm to improve the heart-rate detection accuracy despite its use in noisy conditions. The ECG processor chip consumes 13.7 μA for heart rate logging application.
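The idea of estimating heart rate from the autocorrelation of a short signal window can be sketched as below. This is a generic autocorrelation peak search, not the STAC algorithm as implemented on the authors' chip; the function name `estimate_heart_rate`, the 40 to 200 beats-per-minute search range, the 100 Hz sampling rate, and the synthetic test signal are all assumptions made for illustration.

```python
import random


def estimate_heart_rate(signal, fs, min_bpm=40, max_bpm=200):
    """Return the heart rate (beats per minute) whose beat-to-beat lag
    maximises the autocorrelation of one short window of samples."""
    min_lag = int(fs * 60 / max_bpm)  # shortest plausible beat interval
    max_lag = int(fs * 60 / min_bpm)  # longest plausible beat interval
    mean = sum(signal) / len(signal)
    centered = [s - mean for s in signal]
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(
            centered[i] * centered[i + lag] for i in range(len(centered) - lag)
        )
        if score > best_score:
            best_lag, best_score = lag, score
    return 60.0 * fs / best_lag


# Synthetic 4-second window sampled at 100 Hz: one unit spike every 0.8 s
# (75 beats per minute) plus small uniform noise from a fixed seed.
fs = 100
noise = random.Random(0)
window = [
    (1.0 if i % 80 == 0 else 0.0) + noise.uniform(-0.05, 0.05)
    for i in range(4 * fs)
]
print(estimate_heart_rate(window, fs))
```

Restricting the search to physiologically plausible lags keeps the estimate from locking onto multiples of the true beat interval. A noise-tolerant design would also need to handle motion artifacts, which this sketch does not attempt.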
Acousto-optic time- and space-integrating spotlight-mode SAR processor

NASA Astrophysics Data System (ADS)

Haney, Michael W.; Levy, James J.; Michael, Robert R., Jr.

1993-09-01

The technical approach and recent experimental results for the acousto-optic time- and space- integrating real-time SAR image formation processor program are reported. The concept overcomes the size and power consumption limitations of electronic approaches by using compact, rugged, and low-power analog optical signal processing techniques for the most computationally taxing portions of the SAR imaging problem. Flexibility and performance are maintained by the use of digital electronics for the critical low-complexity filter generation and output image processing functions. The results include a demonstration of the processor's ability to perform high-resolution spotlight-mode SAR imaging by simultaneously compensating for range migration and range/azimuth coupling in the analog optical domain, thereby avoiding a highly power-consuming digital interpolation or reformatting operation usually required in all-electronic approaches.
Solving the Cauchy-Riemann equations on parallel computers

NASA Technical Reports Server (NTRS)

Fatoohi, Raad A.; Grosch, Chester E.

1987-01-01

Discussed is the implementation of a single algorithm on three parallel-vector computers. The algorithm is a relaxation scheme for the solution of the Cauchy-Riemann equations, a set of coupled first-order partial differential equations. The computers were chosen so as to encompass a variety of architectures. They are: the MPP, an SIMD machine with 16K bit-serial processors; FLEX/32, an MIMD machine with 20 processors; and CRAY/2, an MIMD machine with four vector processors. The machine architectures are briefly described. The implementation of the algorithm is discussed in relation to these architectures and measures of the performance on each machine are given. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Conclusions are presented.
Enhanced quasi-static particle-in-cell simulation of electron cloud instabilities in circular accelerators

NASA Astrophysics Data System (ADS)

Feng, Bing

Electron cloud instabilities have been observed in many circular accelerators around the world and have raised concerns for future accelerators and possible upgrades. In this thesis, the electron cloud instabilities are studied with the quasi-static particle-in-cell (PIC) code QuickPIC. Modeling in three dimensions the long timescale propagation of beam in electron clouds in circular accelerators requires faster and more efficient simulation codes. Thousands of processors are easily available for parallel computations. However, it is not straightforward to increase the effective speed of the simulation by running the same problem size on an increasing number of processors because there is a limit to domain size in the decomposition of the two-dimensional part of the code. A pipelining algorithm applied on the fully parallelized particle-in-cell code QuickPIC is implemented to overcome this limit. The pipelining algorithm uses multiple groups of processors and optimizes the job allocation on the processors in parallel computing. With this novel algorithm, it is possible to use on the order of 10^2 processors, and to expand the scale and the speed of the simulation with QuickPIC by a similar factor. In addition to the efficiency improvement with the pipelining algorithm, the fidelity of QuickPIC is enhanced by adding two physics models, the beam space charge effect and the dispersion effect. Simulation of two specific circular machines is performed with the enhanced QuickPIC. First, the proposed upgrade to the Fermilab Main Injector is studied with an eye upon guiding the design of the upgrade and code validation. Moderate emittance growth is observed for the upgrade, which increases the bunch population by a factor of 5. But the simulation also shows that increasing the beam energy from 8 GeV to 20 GeV or above can effectively limit the emittance growth.
Then the enhanced QuickPIC is used to simulate the electron cloud effect on electron beam in the Cornell Energy Recovery Linac (ERL) due to extremely small emittance and high peak currents anticipated in the machine. A tune shift is discovered from the simulation; however, emittance growth of the electron beam in electron cloud is not observed for ERL parameters.
Spaceborne Processor Array

NASA Technical Reports Server (NTRS)

Chow, Edward T.; Schatzel, Donald V.; Whitaker, William D.; Sterling, Thomas

2008-01-01

A Spaceborne Processor Array in Multifunctional Structure (SPAMS) can lower the total mass of the electronic and structural overhead of spacecraft, resulting in reduced launch costs, while increasing the science return through dynamic onboard computing. SPAMS integrates the multifunctional structure (MFS) and the Gilgamesh Memory, Intelligence, and Network Device (MIND) multi-core in-memory computer architecture into a single-system super-architecture. This transforms every inch of a spacecraft into a sharable, interconnected, smart computing element to increase computing performance while simultaneously reducing mass. The MIND in-memory architecture provides a foundation for high-performance, low-power, and fault-tolerant computing. The MIND chip has an internal structure that includes memory, processing, and communication functionality. The Gilgamesh is a scalable system comprising multiple MIND chips interconnected to operate as a single, tightly coupled, parallel computer. The array of MIND components shares a global, virtual name space for program variables and tasks that are allocated at run time to the distributed physical memory and processing resources. Individual processor-memory nodes can be activated or powered down at run time to provide active power management and to configure around faults. A SPAMS system is comprised of a distributed Gilgamesh array built into MFS, interfaces into instrument and communication subsystems, a mass storage interface, and a radiation-hardened flight computer.
Comparison of Origin 2000 and Origin 3000 Using NAS Parallel Benchmarks

NASA Technical Reports Server (NTRS)

Turney, Raymond D.

2001-01-01

This report describes results of benchmark tests on the Origin 3000 system currently being installed at the NASA Ames National Advanced Supercomputing facility. This machine will ultimately contain 1024 R14K processors. The first part of the system, installed in November, 2000 and named mendel, is an Origin 3000 with 128 R12K processors. For comparison purposes, the tests were also run on lomax, an Origin 2000 with R12K processors. The BT, LU, and SP application benchmarks in the NAS Parallel Benchmark Suite and the kernel benchmark FT were chosen to determine system performance and measure the impact of changes on the machine as it evolves. Having been written to measure performance on Computational Fluid Dynamics applications, these benchmarks are assumed appropriate to represent the NAS workload. Since the NAS runs both message passing (MPI) and shared-memory, compiler directive type codes, both MPI and OpenMP versions of the benchmarks were used. The MPI versions used were the latest official release of the NAS Parallel Benchmarks, version 2.3. The OpenMP versions used were PBN 3b2, a beta version that is in the process of being released. NPB 2.3 and PBN 3b2 are technically different benchmarks, and NPB results are not directly comparable to PBN results.
MSTor: A program for calculating partition functions, free energies, enthalpies, entropies, and heat capacities of complex molecules including torsional anharmonicity

NASA Astrophysics Data System (ADS)

Zheng, Jingjing; Mielke, Steven L.; Clarkson, Kenneth L.; Truhlar, Donald G.

2012-08-01

We present a Fortran program package, MSTor, which calculates partition functions and thermodynamic functions of complex molecules involving multiple torsional motions by the recently proposed MS-T method. This method interpolates between the local harmonic approximation in the low-temperature limit, and the limit of free internal rotation of all torsions at high temperature. The program can also carry out calculations in the multiple-structure local harmonic approximation. The program package also includes six utility codes that can be used as stand-alone programs to calculate reduced moment of inertia matrices by the method of Kilpatrick and Pitzer, to generate conformational structures, to calculate, either analytically or by Monte Carlo sampling, volumes for torsional subdomains defined by Voronoi tessellation of the conformational subspace, to generate template input files, and to calculate one-dimensional torsional partition functions using the torsional eigenvalue summation method.
Catalogue identifier: AEMF_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEMF_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html
No. of lines in distributed program, including test data, etc.: 77 434
No. of bytes in distributed program, including test data, etc.: 3 264 737
Distribution format: tar.gz
Programming language: Fortran 90, C, and Perl
Computer: Itasca (HP Linux cluster, each node has two-socket, quad-core 2.8 GHz Intel Xeon X5560 “Nehalem EP” processors), Calhoun (SGI Altix XE 1300 cluster, each node containing two quad-core 2.66 GHz Intel Xeon “Clovertown”-class processors sharing 16 GB of main memory), Koronis (Altix UV 1000 server with 190 6-core Intel Xeon X7542 “Westmere” processors at 2.66 GHz), Elmo (Sun Fire X4600 Linux cluster with AMD Opteron cores), and Mac Pro (two 2.8 GHz Quad-core Intel Xeon processors)
Operating system: Linux/Unix/Mac OS
RAM: 2 Mbytes
Classification: 16.3, 16.12, 23
Nature of problem: Calculation of the partition functions and thermodynamic functions (standard-state energy, enthalpy, entropy, and free energy as functions of temperatures) of complex molecules involving multiple torsional motions.
Solution method: The multi-structural approximation with torsional anharmonicity (MS-T). The program also provides results for the multi-structural local harmonic approximation [1].
Restrictions: There is no limit on the number of torsions that can be included in either the Voronoi calculation or the full MS-T calculation. In practice, the range of problems that can be addressed with the present method consists of all multi-torsional problems for which one can afford to calculate all the conformations and their frequencies.
Unusual features: The method can be applied to transition states as well as stable molecules. The program package also includes the hull program for the calculation of Voronoi volumes and six utility codes that can be used as stand-alone programs to calculate reduced moment-of-inertia matrices by the method of Kilpatrick and Pitzer, to generate conformational structures, to calculate, either analytically or by Monte Carlo sampling, volumes for torsional subdomains defined by Voronoi tessellation of the conformational subspace, to generate template input files, and to calculate one-dimensional torsional partition functions using the torsional eigenvalue summation method.
Additional comments: The program package includes a manual, installation script, and input and output files for a test suite.
Running time: There are 24 test runs. The running time of the test runs on a single processor of the Itasca computer is less than 2 seconds.
References: [1] J. Zheng, T. Yu, E. Papajak, I.M. Alecu, S.L. Mielke, D.G. Truhlar, Practical methods for including torsional anharmonicity in thermochemical calculations of complex molecules: The internal-coordinate multi-structural approximation, Phys. Chem. Chem. Phys. 13 (2011) 10885-10907.
Benchmark tests on the digital equipment corporation Alpha AXP 21164-based AlphaServer 8400, including a comparison of optimized vector and superscalar processing

DOE Office of Scientific and Technical Information (OSTI.GOV)

Wasserman, H.J.

1996-02-01

The second generation of the Digital Equipment Corp. (DEC) DECchip Alpha AXP microprocessor is referred to as the 21164. From the viewpoint of numerically-intensive computing, the primary difference between it and its predecessor, the 21064, is that the 21164 has twice the multiply/add throughput per clock period (CP), a maximum of two floating point operations (FLOPS) per CP vs. one for the 21064. The AlphaServer 8400 is a shared-memory multiprocessor server system that can accommodate up to 12 CPUs and up to 14 GB of memory. In this report we will compare single processor performance of the 8400 system with that of the International Business Machines Corp. (IBM) RISC System/6000 POWER-2 microprocessor running at 66 MHz, the Silicon Graphics, Inc. (SGI) MIPS R8000 microprocessor running at 75 MHz, and the Cray Research, Inc. CRAY J90. The performance comparison is based on a set of Fortran benchmark codes that represent a portion of the Los Alamos National Laboratory supercomputer workload. The advantage of using these codes is that they span a wide range of computational characteristics, such as vectorizability, problem size, and memory access pattern. The primary disadvantage of using them is that detailed, quantitative analysis of performance behavior of all codes on all machines is difficult. One important addition to the benchmark set appears for the first time in this report. Whereas the older version was written for a vector processor, the newer version is more optimized for microprocessor architectures. Therefore, we have, for the first time, an opportunity to measure performance on a single application using implementations that expose the respective strengths of vector and superscalar architecture. All results in this report are from single processors. A subsequent article will explore shared-memory multiprocessing performance of the 8400 system.
Satellite on-board real-time SAR processor prototype

NASA Astrophysics Data System (ADS)

Bergeron, Alain; Doucet, Michel; Harnisch, Bernd; Suess, Martin; Marchese, Linda; Bourqui, Pascal; Desnoyers, Nicholas; Legros, Mathieu; Guillot, Ludovic; Mercier, Luc; Châteauneuf, François

2017-11-01

A Compact Real-Time Optronic SAR Processor has been successfully developed and tested up to a Technology Readiness Level of 4 (TRL4), the breadboard validation in a laboratory environment. SAR, or Synthetic Aperture Radar, is an active system allowing day and night imaging independent of the cloud coverage of the planet. The SAR raw data is a set of complex data for range and azimuth, which cannot be compressed. Specifically, for planetary missions and unmanned aerial vehicle (UAV) systems with limited communication data rates this is a clear disadvantage. SAR images are typically processed electronically applying dedicated Fourier transformations. This, however, can also be performed optically in real-time. Originally the first SAR images were optically processed. The optical Fourier processor architecture provides inherent parallel computing capabilities allowing real-time SAR data processing and thus the ability for compression and strongly reduced communication bandwidth requirements for the satellite. SAR signal return data are in general complex data. Both amplitude and phase must be combined optically in the SAR processor for each range and azimuth pixel. Amplitude and phase are generated by dedicated spatial light modulators and superimposed by an optical relay set-up. The spatial light modulators display the full complex raw data information over a two-dimensional format, one for the azimuth and one for the range. Since the entire signal history is displayed at once, the processor operates in parallel yielding real-time performances, i.e. without resulting bottleneck. Processing of both azimuth and range information is performed in a single pass. This paper focuses on the onboard capabilities of the compact optical SAR processor prototype that allows in-orbit processing of SAR images. Examples of processed ENVISAT ASAR images are presented. 
Various SAR processor parameters such as processing capabilities, image quality (point target analysis), weight and size are reviewed.
Feasibility of through-time spiral generalized autocalibrating partial parallel acquisition for low latency accelerated real-time MRI of speech.

PubMed

Lingala, Sajan Goud; Zhu, Yinghua; Lim, Yongwan; Toutios, Asterios; Ji, Yunhua; Lo, Wei-Ching; Seiberlich, Nicole; Narayanan, Shrikanth; Nayak, Krishna S

2017-12-01

To evaluate the feasibility of through-time spiral generalized autocalibrating partial parallel acquisition (GRAPPA) for low-latency accelerated real-time MRI of speech. Through-time spiral GRAPPA (spiral GRAPPA), a fast linear reconstruction method, is applied to spiral (k-t) data acquired from an eight-channel custom upper-airway coil. Fully sampled data were retrospectively down-sampled to evaluate spiral GRAPPA at undersampling factors R = 2 to 6. Pseudo-golden-angle spiral acquisitions were used for prospective studies. Three subjects were imaged while performing a range of speech tasks that involved rapid articulator movements, including fluent speech and beat-boxing. Spiral GRAPPA was compared with view sharing, and a parallel imaging and compressed sensing (PI-CS) method. Spiral GRAPPA captured spatiotemporal dynamics of vocal tract articulators at undersampling factors ≤4. Spiral GRAPPA at 18 ms/frame and 2.4 mm²/pixel outperformed view sharing in depicting rapidly moving articulators. Spiral GRAPPA and PI-CS provided equivalent temporal fidelity. Reconstruction latency per frame was 14 ms for view sharing and 116 ms for spiral GRAPPA, using a single processor. Spiral GRAPPA kept up with the MRI data rate of 18 ms/frame with eight processors. PI-CS required 17 minutes to reconstruct 5 seconds of dynamic data. Spiral GRAPPA enabled 4-fold accelerated real-time MRI of speech with a low reconstruction latency. This approach is applicable to a wide range of speech RT-MRI experiments that benefit from real-time feedback while visualizing rapid articulator movement. Magn Reson Med 78:2275-2282, 2017. © 2017 International Society for Magnetic Resonance in Medicine.
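The latency figures reported in this abstract can be cross-checked with a little arithmetic. The sketch below is illustrative only: it assumes frames can be reconstructed independently and that throughput scales ideally with processor count (an assumption, not a claim made by the study), and the function names are hypothetical.

```python
import math

def min_processors(latency_ms: float, frame_interval_ms: float) -> int:
    """Smallest processor count whose combined throughput matches the frame rate,
    assuming independent frames and ideal scaling (an assumption)."""
    return math.ceil(latency_ms / frame_interval_ms)

def slowdown_vs_real_time(recon_seconds: float, data_seconds: float) -> float:
    """Seconds of reconstruction needed per second of acquired data."""
    return recon_seconds / data_seconds

# Figures quoted in the abstract: 116 ms/frame latency, 18 ms/frame data rate.
grappa_processors = min_processors(116.0, 18.0)
# PI-CS: 17 minutes to reconstruct 5 seconds of data.
pics_slowdown = slowdown_vs_real_time(17 * 60, 5.0)

print(grappa_processors, pics_slowdown)
```

Under these assumptions the estimate is seven processors, which is consistent with the eight processors reported as sufficient, while PI-CS runs roughly two hundred times slower than real time.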
Assessment of mammographic film processor performance in a hospital and mobile screening unit.

PubMed

Murray, J G; Dowsett, D J; Laird, O; Ennis, J T

1992-12-01

In contrast to the majority of mammographic breast screening programmes, film processing at this centre occurs on site in both hospital and mobile trailer units. Initial (1989) quality control (QC) sensitometric tests revealed a large variation in film processor performance in the mobile unit. The clinical significance of these variations was assessed and acceptance limits for processor performance determined. Abnormal mammograms were used as reference material and copied using high definition 35 mm film over a range of exposure settings. The copies were then matched with QC film density variation from the mobile unit. All films were subsequently ranked for spatial and contrast resolution. Optimal values for processing time of 2 min (equivalent to film transit time 3 min and developer time 46 s) and temperature of 36 degrees C were obtained. The widespread anomaly of reporting film transit time as processing time is highlighted. Use of mammogram copies as a means of measuring the influence of film processor variation is advocated. Careful monitoring of the mobile unit film processor performance has produced stable quality comparable with the hospital based unit. The advantages of on site film processing are outlined. The addition of a sensitometric step wedge to all mammography film stock as a means of assessing image quality is recommended.
Interactive high-resolution isosurface ray casting on multicore processors.

PubMed

Wang, Qin; JaJa, Joseph

2008-01-01

We present a new method for the interactive rendering of isosurfaces using ray casting on multi-core processors. This method consists of a combination of an object-order traversal that coarsely identifies possible candidate 3D data blocks for each small set of contiguous pixels, and an isosurface ray casting strategy tailored for the resulting limited-size lists of candidate 3D data blocks. While static screen partitioning is widely used in the literature, our scheme performs dynamic allocation of groups of ray casting tasks to ensure almost equal loads among the different threads running on multi-cores while maintaining spatial locality. We also make careful use of the memory management environment commonly present in multi-core processors. We test our system on a two-processor Clovertown platform, each consisting of a Quad-Core 1.86-GHz Intel Xeon Processor, for a number of widely different benchmarks. The detailed experimental results show that our system is efficient and scalable, and achieves high cache performance and excellent load balancing, resulting in an overall performance that is superior to any of the previous algorithms. In fact, we achieve an interactive isosurface rendering on a 1024² screen for all the datasets tested up to the maximum size of the main memory of our platform.
Real-time trajectory optimization on parallel processors

NASA Technical Reports Server (NTRS)

Psiaki, Mark L.

1993-01-01

A parallel algorithm has been developed for rapidly solving trajectory optimization problems. The goal of the work has been to develop an algorithm that is suitable to do real-time, on-line optimal guidance through repeated solution of a trajectory optimization problem. The algorithm has been developed on an INTEL iPSC/860 message passing parallel processor. It uses a zero-order-hold discretization of a continuous-time problem and solves the resulting nonlinear programming problem using a custom-designed augmented Lagrangian nonlinear programming algorithm. The algorithm achieves parallelism of function, derivative, and search direction calculations through the principle of domain decomposition applied along the time axis. It has been encoded and tested on 3 example problems, the Goddard problem, the acceleration-limited, planar minimum-time to the origin problem, and a National Aerospace Plane minimum-fuel ascent guidance problem. Execution times as fast as 118 sec of wall clock time have been achieved for a 128-stage Goddard problem solved on 32 processors. A 32-stage minimum-time problem has been solved in 151 sec on 32 processors. A 32-stage National Aerospace Plane problem required 2 hours when solved on 32 processors. A speed-up factor of 7.2 has been achieved by using 32-nodes instead of 1-node to solve a 64-stage Goddard problem.
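The speed-up quoted at the end of this abstract (7.2 on 32 nodes) can be put in context with two standard derived quantities. The sketch below is an illustration, not part of the original report: it computes parallel efficiency and the Karp-Flatt experimentally determined serial fraction, and the function names are hypothetical.

```python
def parallel_efficiency(speedup: float, processors: int) -> float:
    """Fraction of ideal linear speed-up actually achieved."""
    return speedup / processors

def karp_flatt_serial_fraction(speedup: float, processors: int) -> float:
    """Karp-Flatt metric: e = (1/S - 1/p) / (1 - 1/p)."""
    return (1.0 / speedup - 1.0 / processors) / (1.0 - 1.0 / processors)

# Figures quoted in the abstract: speed-up of 7.2 on 32 nodes (64-stage Goddard problem).
efficiency = parallel_efficiency(7.2, 32)
serial_fraction = karp_flatt_serial_fraction(7.2, 32)

print(f"efficiency = {efficiency:.3f}, serial fraction = {serial_fraction:.3f}")
```

An efficiency of about 0.23 and an effective serial fraction of about 0.11 suggest that a noticeable share of the work (or of the communication overhead) does not scale with node count for a problem of this size.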
Fuel processing for PEM fuel cells: transport and kinetic issues of system design

NASA Astrophysics Data System (ADS)

Zalc, J. M.; Löffler, D. G.

In light of the distribution and storage issues associated with hydrogen, efficient on-board fuel processing will be a significant factor in the implementation of PEM fuel cells for automotive applications. Here, we apply basic chemical engineering principles to gain insight into the factors that limit performance in each component of a fuel processor. A system consisting of a plate reactor steam reformer, water-gas shift unit, and preferential oxidation reactor is used as a case study. It is found that for a steam reformer based on catalyst-coated foils, mass transfer from the bulk gas to the catalyst surface is the limiting process. The water-gas shift reactor is expected to be the largest component of the fuel processor and is limited by intrinsic catalyst activity, while a successful preferential oxidation unit depends on strict temperature control in order to minimize parasitic hydrogen oxidation. This stepwise approach of sequentially eliminating rate-limiting processes can be used to identify possible means of performance enhancement in a broad range of applications.
Software-Controlled Caches in the VMP Multiprocessor

DTIC Science & Technology

1986-03-01

programming system level that Processors is tuned for the VMP design. In this vein, we are interested in exploring how far the software support can go to ...handled in software, analogously to the handling agement of the shared program state is familiar and of virtual memory page faults. Hardware support for...ensure good behavior, as opposed to how Each cache miss results in bus traffic. Table 2 pro- vides the bus cost for the "average" cache miss. Fig
Thread Migration in the Presence of Pointers

NASA Technical Reports Server (NTRS)

Cronk, David; Haines, Matthew; Mehrotra, Piyush

1996-01-01

Dynamic migration of lightweight threads supports both data locality and load balancing. However, migrating threads that contain pointers referencing data in both the stack and heap remains an open problem. In this paper we describe a technique by which threads with pointers referencing both stack and non-shared heap data can be migrated such that the pointers remain valid after migration. As a result, threads containing pointers can now be migrated between processors in a homogeneous distributed memory environment.
The Mark III Hypercube-Ensemble Computers

NASA Technical Reports Server (NTRS)

Peterson, John C.; Tuazon, Jesus O.; Lieberman, Don; Pniel, Moshe

1988-01-01

Mark III Hypercube concept applied in development of series of increasingly powerful computers. Processor of each node of Mark III Hypercube ensemble is specialized computer containing three subprocessors and shared main memory. Solves problem quickly by simultaneously processing part of problem at each such node and passing combined results to host computer. Disciplines benefitting from speed and memory capacity include astrophysics, geophysics, chemistry, weather, high-energy physics, applied mechanics, image processing, oil exploration, aircraft design, and microcircuit design.

The force on the flex: Global parallelism and portability

NASA Technical Reports Server (NTRS)

Jordan, H. F.

1986-01-01

A parallel programming methodology, called the force, supports the construction of programs to be executed in parallel by an unspecified, but potentially large, number of processes. The methodology was originally developed on a pipelined, shared memory multiprocessor, the Denelcor HEP, and embodies the primitive operations of the force in a set of macros which expand into multiprocessor Fortran code. A small set of primitives is sufficient to write large parallel programs, and the system has been used to produce 10,000 line programs in computational fluid dynamics. The level of complexity of the force primitives is intermediate. It is high enough to mask detailed architectural differences between multiprocessors but low enough to give the user control over performance. The system is being ported to a medium scale multiprocessor, the Flex/32, which is a 20 processor system with a mixture of shared and local memory. Memory organization and the type of processor synchronization supported by the hardware on the two machines lead to some differences in efficient implementations of the force primitives, but the user interface remains the same. An initial implementation was done by retargeting the macros to Flexible Computer Corporation's ConCurrent C language. Subsequently, the macros were caused to directly produce the system calls which form the basis for ConCurrent C. The implementation of the Fortran based system is in step with Flexible Computer Corporation's implementation of a Fortran system in the parallel environment.
Implementing Shared Memory Parallelism in MCBEND

NASA Astrophysics Data System (ADS)

Bird, Adam; Long, David; Dobson, Geoff

2017-09-01

MCBEND is a general purpose radiation transport Monte Carlo code from AMEC Foster Wheeler's ANSWERS® Software Service. MCBEND is well established in the UK shielding community for radiation shielding and dosimetry assessments. The existing MCBEND parallel capability effectively involves running the same calculation on many processors. This works very well except when the memory requirements of a model restrict the number of instances of a calculation that will fit on a machine. To more effectively utilise parallel hardware, OpenMP has been used to implement shared memory parallelism in MCBEND. This paper describes the reasoning behind the choice of OpenMP, notes some of the challenges of multi-threading an established code such as MCBEND and assesses the performance of the parallel method implemented in MCBEND.
Rapid solution of large-scale systems of equations

NASA Technical Reports Server (NTRS)

Storaasli, Olaf O.

1994-01-01

The analysis and design of complex aerospace structures requires the rapid solution of large systems of linear and nonlinear equations, eigenvalue extraction for buckling, vibration and flutter modes, structural optimization and design sensitivity calculation. Computers with multiple processors and vector capabilities can offer substantial computational advantages over traditional scalar computers for these analyses. These computers fall into two categories: shared memory computers and distributed memory computers. This presentation covers general-purpose, highly efficient algorithms for generation/assembly of element matrices, solution of systems of linear and nonlinear equations, eigenvalue and design sensitivity analysis and optimization. All algorithms are coded in FORTRAN for shared memory computers and many are adapted to distributed memory computers. The capability and numerical performance of these algorithms will be addressed.
Dynamic programming on a shared-memory multiprocessor

NASA Technical Reports Server (NTRS)

Edmonds, Phil; Chu, Eleanor; George, Alan

1993-01-01

Three new algorithms for solving dynamic programming problems on a shared-memory parallel computer are described. All three algorithms attempt to balance work load, while keeping synchronization cost low. In particular, for a multiprocessor having $p$ processors, an analysis of the best algorithm shows that the arithmetic cost is $O(n^3/6p)$ and that the synchronization cost is $O(|\log_C n|)$ if $p \ll n$, where $C = (2p-1)/(2p+1)$ and $n$ is the size of the problem. The low synchronization cost is important for machines where synchronization is expensive. Analysis and experiments show that the best algorithm is effective in balancing the work load and producing high efficiency.
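To get a feel for the two cost expressions in this abstract, the sketch below evaluates the leading terms for sample values of n and p. It is only an illustration: the asymptotic bounds hide constant factors, the sample values are arbitrary, and the function names are hypothetical.

```python
import math

def arithmetic_term(n: int, p: int) -> float:
    """Leading arithmetic term n^3 / (6p) from the abstract."""
    return n ** 3 / (6 * p)

def synchronization_term(n: int, p: int) -> float:
    """|log_C n| with C = (2p - 1) / (2p + 1), as defined in the abstract."""
    c = (2 * p - 1) / (2 * p + 1)
    return abs(math.log(n) / math.log(c))

# Arbitrary sample sizes chosen so that p is much smaller than n.
for p in (2, 8, 32):
    n = 1000
    print(p, arithmetic_term(n, p), round(synchronization_term(n, p), 1))
```

Because ln C is close to -1/p when p is large, the synchronization term behaves roughly like p ln n, so it grows slowly with problem size while the arithmetic term grows as the cube of n.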
Noise Analysis of Spatial Phase coding in analog Acoustooptic Processors

NASA Technical Reports Server (NTRS)

Gary, Charles K.; Lum, Henry, Jr. (Technical Monitor)

1994-01-01

Optical beams can carry information in their amplitude and phase; however, optical analog numerical calculators such as an optical matrix processor use incoherent light to achieve linear operation. Thus, the phase information is lost and only the magnitude can be used. This limits such processors to the representation of positive real numbers. Many systems have been devised to overcome this deficit through the use of digital number representations, but they all operate at a greatly reduced efficiency in contrast to analog systems. The most widely accepted method to achieve sign coding in analog optical systems has been the use of an offset for the zero level. Unfortunately, this results in increased noise sensitivity for small numbers. In this paper, we examine the use of spatially coherent sign coding in acoustooptical processors, a method first developed for digital calculations by D. V. Tigin. This coding technique uses spatial coherence for the representation of signed numbers, while temporal incoherence allows for linear analog processing of the optical information. We show how spatial phase coding reduces noise sensitivity for signed analog calculations.
29. Perimeter acquisition radar building room #318, data processing system ...

Library of Congress Historic Buildings Survey, Historic Engineering Record, Historic Landscapes Survey

29. Perimeter acquisition radar building room #318, data processing system area; data processor maintenance and operations center, showing data processing consoles - Stanley R. Mickelsen Safeguard Complex, Perimeter Acquisition Radar Building, Limited Access Area, between Limited Access Patrol Road & Service Road A, Nekoma, Cavalier County, ND
Limited Area Coverage/High Resolution Picture Transmission (LAC/HRPT) data vegetative index calculation processor user's manual

NASA Technical Reports Server (NTRS)

Obrien, S. O. (Principal Investigator)

1980-01-01

The program, LACVIN, calculates vegetative index numbers on limited area coverage/high resolution picture transmission data for selected IJ grid sections. The IJ grid sections were previously extracted from the full resolution data tapes and stored on disk files.
Job-mix modeling and system analysis of an aerospace multiprocessor.

NASA Technical Reports Server (NTRS)

Mallach, E. G.

1972-01-01

An aerospace guidance computer organization, consisting of multiple processors and memory units attached to a central time-multiplexed data bus, is described. A job mix for this type of computer is obtained by analysis of Apollo mission programs. Multiprocessor performance is then analyzed using: 1) queuing theory, under certain 'limiting case' assumptions; 2) Markov process methods; and 3) system simulation. Results of the analyses indicate: 1) Markov process analysis is a useful and efficient predictor of simulation results; 2) efficient job execution is not seriously impaired even when the system is so overloaded that new jobs are inordinately delayed in starting; 3) job scheduling is significant in determining system performance; and 4) a system having many slow processors may or may not perform better than a system of equal power having few fast processors, but will not perform significantly worse.
Performance Models for Split-execution Computing Systems

DOE Office of Scientific and Technical Information (OSTI.GOV)

Humble, Travis S; McCaskey, Alex; Schrock, Jonathan

Split-execution computing leverages the capabilities of multiple computational models to solve problems, but splitting program execution across different computational models incurs costs associated with the translation between domains. We analyze the performance of a split-execution computing system developed from conventional and quantum processing units (QPUs) by using behavioral models that track resource usage. We focus on asymmetric processing models built using conventional CPUs and a family of special-purpose QPUs that employ quantum computing principles. Our performance models account for the translation of a classical optimization problem into the physical representation required by the quantum processor while also accounting for hardware limitations and conventional processor speed and memory. We conclude that the bottleneck in this split-execution computing system lies at the quantum-classical interface and that the primary time cost is independent of quantum processor behavior.
Recall Performance for Content-Addressable Memory Using Adiabatic Quantum Optimization

DOE Office of Scientific and Technical Information (OSTI.GOV)

Imam, Neena; Humble, Travis S.; McCaskey, Alex

A content-addressable memory (CAM) stores key-value associations such that the key is recalled by providing its associated value. While CAM recall is traditionally performed using recurrent neural network models, we show how to solve this problem using adiabatic quantum optimization. Our approach maps the recurrent neural network to a commercially available quantum processing unit by taking advantage of the common underlying Ising spin model. We then assess the accuracy of the quantum processor to store key-value associations by quantifying recall performance against an ensemble of problem sets. We observe that different learning rules from the neural network community influence recall accuracy but performance appears to be limited by potential noise in the processor. The strong connection established between quantum processors and neural network problems supports the growing intersection of these two ideas.
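For readers unfamiliar with the classical baseline mentioned in this abstract, the sketch below shows recall in a small Hopfield-style recurrent network with a Hebbian learning rule, together with the Ising-type energy it minimizes. This is an assumption-laden illustration: the choice of a Hopfield network, the Hebbian rule, the stored patterns, and all function names are hypothetical, and the quantum mapping described by the authors is not shown.

```python
def train_hebbian(patterns):
    """Hebbian rule: W[i][j] = sum over patterns of s_i * s_j, with zero diagonal."""
    n = len(patterns[0])
    w = [[0] * n for _ in range(n)]
    for s in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += s[i] * s[j]
    return w

def energy(w, s):
    """Ising-type energy E = -1/2 * sum_ij W_ij s_i s_j."""
    n = len(s)
    return -0.5 * sum(w[i][j] * s[i] * s[j] for i in range(n) for j in range(n))

def recall(w, probe, sweeps=5):
    """Asynchronous updates: each spin aligns with its local field."""
    s = list(probe)
    for _ in range(sweeps):
        for i in range(len(s)):
            field = sum(w[i][j] * s[j] for j in range(len(s)))
            if field != 0:
                s[i] = 1 if field > 0 else -1
    return s

# Two orthogonal +/-1 patterns (hypothetical data).
stored = [
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, -1, 1, -1, 1, -1, 1, -1],
]
weights = train_hebbian(stored)

# Corrupt one spin of the first pattern and let the network relax.
probe = list(stored[0])
probe[0] = -probe[0]
recovered = recall(weights, probe)

print(recovered == stored[0], energy(weights, probe), energy(weights, recovered))
```

The corrupted probe relaxes back to the stored pattern, and the energy decreases along the way; finding such low-energy spin configurations is the kind of problem an Ising-model optimizer addresses.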
Reproducibility of Mammography Units, Film Processing and Quality Imaging

NASA Astrophysics Data System (ADS)

Gaona, Enrique

2003-09-01

The purpose of this study was to carry out an exploratory survey of the problems of quality control in mammography units and film processor units as a diagnosis of the current situation of mammography facilities. Measurements of reproducibility, optical density, optical difference and gamma index are included. Breast cancer is the most frequently diagnosed cancer and is the second leading cause of cancer death among women in the Mexican Republic. Mammography is a radiographic examination specially designed for detecting breast pathology. We found that the problems of reproducibility of AEC are smaller than the problems of the processor units, because almost all processors fall outside the acceptable variation limits, which can affect mammographic image quality and the dose to the breast. Only four mammography units meet the minimum score established by ACR and FDA for the phantom image.
Energy Efficient Real-Time Scheduling Using DPM on Mobile Sensors with a Uniform Multi-Cores

PubMed Central

Kim, Youngmin; Lee, Chan-Gun

2017-01-01

In wireless sensor networks (WSNs), sensor nodes are deployed for collecting and analyzing data. These nodes use limited-energy batteries for easy deployment and low cost. The use of limited-energy batteries is closely related to the lifetime of the sensor nodes. Efficient energy management is important for extending the lifetime of the sensor nodes. Most efforts to improve power efficiency in tiny sensor nodes have focused mainly on reducing the power consumed during data transmission. However, the recent emergence of sensor nodes equipped with multi-cores requires attention to the problem of reducing power consumption in multi-cores. In this paper, we propose an energy-efficient scheduling method for sensor nodes supporting uniform multi-cores. We extend the proposed T-Ler plane based scheduling for global optimal scheduling of uniform multi-cores and multi-processors to enable power management using dynamic power management. In the proposed approach, processor selection for scheduling and a mapping method between the tasks and processors are proposed to efficiently utilize dynamic power management. Experiments show the effectiveness of the proposed approach compared to other existing methods. PMID:29240695
Queueing models for token and slotted ring networks. Thesis

NASA Technical Reports Server (NTRS)

Peden, Jeffery H.

1990-01-01

Currently the end-to-end delay characteristics of very high speed local area networks are not well understood. The transmission speed of computer networks is increasing, and local area networks especially are finding increasing use in real time systems. Ring network operation is generally well understood for both token rings and slotted rings. There is, however, a severe lack of queueing models for high layer operation. There are several factors which contribute to the processing delay of a packet, as opposed to the transmission delay, e.g., packet priority, its length, the user load, the processor load, the use of priority preemption, the use of preemption at packet reception, the number of processors, the number of protocol processing layers, the speed of each processor, and queue length limitations. Currently existing medium access queueing models are extended by adding modeling techniques which will handle exhaustive limited service both with and without priority traffic, and modeling capabilities are extended into the upper layers of the OSI model. Some of the models are parameterized solution methods, since it is shown that certain models do not exist as parameterized solutions, but rather as solution methods.
9 CFR 590.100 - Specific exemptions.

Code of Federal Regulations, 2010 CFR

2010-01-01

... inspection of processing operations in section 5(a) of the Act: Provided, That the conditions for exemption... limited to bakeries, restaurants, and other food processors, without continuous inspection, of certain...
Design distributed simulation platform for vehicle management system

NASA Astrophysics Data System (ADS)

Wen, Zhaodong; Wang, Zhanlin; Qiu, Lihua

2006-11-01

Next-generation military aircraft require a high-performance airborne management system. General modules, data integration, a high-speed data bus, and so on are needed to share and manage the information of the subsystems efficiently. The subsystems include the flight control system, propulsion system, hydraulic power system, environmental control system, fuel management system, electrical power system, and so on. The unattached or mixed architecture is replaced by an integrated architecture, meaning that the whole airborne system is managed as one system. The physical devices are therefore distributed, but the system information is integrated and shared. The processing functions of each subsystem are integrated (including general process modules and dynamic reconfiguration); furthermore, the sensors and the signal processing functions are shared. This also provides a foundation for power sharing. A distributed vehicle management system built on a 1553B bus and distributed processors can provide a validation platform for research on integrated management of airborne systems. This paper establishes the Vehicle Management System (VMS) simulation platform, discusses the software and hardware configuration, and analyzes the communication and fault-tolerance methods.
Cache Sharing and Isolation Tradeoffs in Multicore Mixed-Criticality Systems

DTIC Science & Technology

2015-05-01

form of lockdown registers, to provide way-based partitioning. These alternatives are illustrated in Fig. 1 with respect to a quad-core ARM Cortex A9... processor (as we do for Level-A and -B tasks), but they did not consider MC systems. Altmeyer et al. [1] considered uniprocessor scheduling on a system with a...framework. We randomly generated task sets and determined the fraction that were schedulable on our target hardware platform, the quad-core ARM Cortex A9
Broca's area: a supramodal hierarchical processor?

PubMed

Tettamanti, Marco; Weniger, Dorothea

2006-05-01

Despite the presence of shared characteristics across the different domains modulating Broca's area activity (e.g., structural analogies, as between language and music, or representational homologies, as between action execution and action observation), the question of what exactly the common denominator of such diverse brain functions is, with respect to the function of Broca's area, remains largely a debated issue. Here, we suggest that an important computational role of Broca's area may be to process hierarchical structures in a wide range of functional domains.
Multiple Microcomputer Control Algorithm.

DTIC Science & Technology

1979-09-01

discrete and semaphore supervisor calls can be used with tasks in separate processors, in which case they are maintained in shared memory. Operations on ...the source or destination operand specifier of each mode in most cases. However, four of the 16 general register addressing modes and one of the 8 pro...instruction time is based on the specified usage factors and the best case and worst case execution times for the instruc...
EndNote 7.0.

PubMed

Eapen, Bell Raj

2006-01-01

EndNote is a useful software for online literature search and efficient bibliography management. It helps to format the bibliography according to the citation style of each journal. EndNote stores references in a library file, which can be shared with others. It can connect to online resources like PubMed and retrieve search results as per the search criteria. It can also effortlessly integrate with popular word processors like MS Word. The Indian Journal of Dermatology, Venereology and Leprology website has a provision to import references to EndNote.
Development and evaluation of a fault-tolerant multiprocessor (FTMP) computer. Volume 1: FTMP principles of operation

NASA Technical Reports Server (NTRS)

Smith, T. B., Jr.; Lala, J. H.

1983-01-01

The basic organization of the fault-tolerant multiprocessor (FTMP) is that of a general purpose homogeneous multiprocessor. Three processors operate on a shared system (memory and I/O) bus. Replication and tight synchronization of all elements, together with hardware voting, are employed to detect and correct any single fault. Reconfiguration is then employed to repair a fault. Multiple faults may be tolerated as a sequence of single faults with repair between fault occurrences.

GPU-based Parallel Application Design for Emerging Mobile Devices

NASA Astrophysics Data System (ADS)

Gupta, Kshitij

A revolution is underway in the computing world that is causing a fundamental paradigm shift in device capabilities and form-factor, with a move from well-established legacy desktop/laptop computers to mobile devices in varying sizes and shapes. Amongst all the tasks these devices must support, graphics has emerged as the 'killer app' for providing a fluid user interface and high-fidelity game rendering, effectively making the graphics processor (GPU) one of the key components in (present and future) mobile systems. By utilizing the GPU as a general-purpose parallel processor, this dissertation explores the GPU computing design space from an applications standpoint, in the mobile context, by focusing on key challenges presented by these devices---limited compute, memory bandwidth, and stringent power consumption requirements---while improving the overall application efficiency of the increasingly important speech recognition workload for mobile user interaction. We broadly partition trends in GPU computing into four major categories. We analyze hardware and programming model limitations in current-generation GPUs and detail an alternate programming style called Persistent Threads, identify four use case patterns, and propose minimal modifications that would be required for extending native support. We show how by manually extracting data locality and altering the speech recognition pipeline, we are able to achieve significant savings in memory bandwidth while simultaneously reducing the compute burden on GPU-like parallel processors. As we foresee GPU computing to evolve from its current 'co-processor' model into an independent 'applications processor' that is capable of executing complex work independently, we create an alternate application framework that enables the GPU to handle all control-flow dependencies autonomously at run-time while minimizing host involvement to just issuing commands, that facilitates an efficient application implementation. 
Finally, as compute and communication capabilities of mobile devices improve, we analyze energy implications of processing speech recognition locally (on-chip) and offloading it to servers (in-cloud).
On-board computational efficiency in real time UAV embedded terrain reconstruction

NASA Astrophysics Data System (ADS)

Partsinevelos, Panagiotis; Agadakos, Ioannis; Athanasiou, Vasilis; Papaefstathiou, Ioannis; Mertikas, Stylianos; Kyritsis, Sarantis; Tripolitsiotis, Achilles; Zervos, Panagiotis

2014-05-01

In the last few years, there has been a surge of applications for object recognition, interpretation and mapping using unmanned aerial vehicles (UAV). Specifications in constructing those UAVs are highly diverse with contradictory characteristics including cost-efficiency, carrying weight, flight time, mapping precision, real time processing capabilities, etc. In this work, a hexacopter UAV is employed for near real time terrain mapping. The main challenge addressed is to retain a low cost flying platform with real time processing capabilities. The UAV weight limitation, which affects the overall flight time, makes the selection of the on-board processing components particularly critical. On the other hand, surface reconstruction, as a computationally demanding task, calls for a high-performance processing unit on board. To merge these two contradicting aspects along with customized development, a System on a Chip (SoC) integrated circuit is proposed as a low-power, low-cost processor, which natively supports camera sensors and positioning and navigation systems. Modern SoCs, such as Omap3530 or Zynq, are classified as heterogeneous devices and provide a versatile platform, allowing access to both general purpose processors, such as the ARM11, as well as specialized processors, such as a digital signal processor and floating field-programmable gate array. A UAV equipped with the proposed embedded processors allows on-board terrain reconstruction using stereo vision in near real time. Furthermore, according to the frame rate required, additional image processing may concurrently take place, such as image rectification and object detection. Lastly, the onboard positioning and navigation (e.g., GNSS) chip may further improve the quality of the generated map. The resulting terrain maps are compared to ground truth geodetic measurements in order to assess the accuracy limitations of the overall process. It is shown that with our proposed novel system, there is much potential in computational efficiency on board and in optimized time constraints.
Design and Analysis of Self-Adapted Task Scheduling Strategies in Wireless Sensor Networks

PubMed Central

Guo, Wenzhong; Xiong, Naixue; Chao, Han-Chieh; Hussain, Sajid; Chen, Guolong

2011-01-01

In a wireless sensor network (WSN), the usage of resources is usually highly related to the execution of tasks which consume a certain amount of computing and communication bandwidth. Parallel processing among sensors is a promising solution to provide the demanded computation capacity in WSNs. Task allocation and scheduling is a typical problem in the area of high performance computing. Although task allocation and scheduling in wired processor networks has been well studied in the past, their counterparts for WSNs remain largely unexplored. Existing traditional high performance computing solutions cannot be directly implemented in WSNs due to the limitations of WSNs such as limited resource availability and the shared communication medium. In this paper, a self-adapted task scheduling strategy for WSNs is presented. First, a multi-agent-based architecture for WSNs is proposed and a mathematical model of dynamic alliance is constructed for the task allocation problem. Then an effective discrete particle swarm optimization (PSO) algorithm for the dynamic alliance (DPSO-DA) with a well-designed particle position code and fitness function is proposed. A mutation operator which can effectively improve the algorithm’s global search ability and population diversity is also introduced in this algorithm. Finally, the simulation results show that the proposed solution can achieve significantly better performance than other algorithms. PMID:22163971
Data processing techniques used with MST radars: A review

NASA Technical Reports Server (NTRS)

Rastogi, P. K.

1983-01-01

The data processing methods used in high power radar probing of the middle atmosphere are examined. The radar acts as a spatial filter on the small scale refractivity fluctuations in the medium. The characteristics of the received signals are related to the statistical properties of these fluctuations. A functional outline of the components of a radar system is given. Most computation intensive tasks are carried out by the processor. The processor computes a statistical function of the received signals, simultaneously for a large number of ranges. The slow fading of atmospheric signals is used to reduce the data input rate to the processor by coherent integration. The inherent range resolution of the radar experiments can be improved significantly with the use of pseudonoise phase codes to modulate the transmitted pulses and a corresponding decoding operation on the received signals. Commutability of the decoding and coherent integration operations is used to obtain a significant reduction in computations. The limitations of the processors are outlined. At the next level of data reduction, the measured function is parameterized by a few spectral moments that can be related to physical processes in the medium. The problems encountered in estimating the spectral moments in the presence of strong ground clutter, external interference, and noise are discussed. The graphical and statistical analysis of the inferred parameters are outlined. The requirements for special purpose processors for MST radars are discussed.
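The commutability mentioned in the abstract above follows from both operations being linear and acting along different axes (pulse index versus range gate). The sketch below is only an illustration of that property, not the processing chain of the review: the 5-element ±1 code, the array sizes, and the function names are all assumptions chosen for the example.

```python
import random

def coherent_integrate(pulses):
    """Sum complex samples across pulses, one range gate at a time."""
    n_gates = len(pulses[0])
    return [sum(p[g] for p in pulses) for g in range(n_gates)]

def decode(samples, code):
    """Correlate one range profile with a +/-1 phase code."""
    n_out = len(samples) - len(code) + 1
    return [sum(code[k] * samples[g + k] for k in range(len(code)))
            for g in range(n_out)]

random.seed(0)
code = [1, 1, 1, -1, 1]  # arbitrary +/-1 code, for illustration only
# 16 hypothetical pulses, each with 32 complex range-gate samples.
pulses = [[complex(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(32)]
          for _ in range(16)]

# Order A: decode every pulse, then integrate (16 decoding passes).
decode_then_sum = coherent_integrate([decode(p, code) for p in pulses])
# Order B: integrate first, then decode once (1 decoding pass).
sum_then_decode = decode(coherent_integrate(pulses), code)

max_diff = max(abs(a - b) for a, b in zip(decode_then_sum, sum_then_decode))
decodes_saved = len(pulses) - 1
```

Both orders give the same profile up to floating-point rounding, so integrating first replaces one decoding pass per pulse with a single pass per integration period.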
Load Balancing Strategies for Multiphase Flows on Structured Grids

NASA Astrophysics Data System (ADS)

Olshefski, Kristopher; Owkes, Mark

2017-11-01

The computation time required to perform large simulations of complex systems is currently one of the leading bottlenecks of computational research. Parallelization allows multiple processing cores to perform calculations simultaneously and reduces computational times. However, load imbalances between processors waste computing resources as processors wait for others to complete imbalanced tasks. In multiphase flows, these imbalances arise due to the additional computational effort required at the gas-liquid interface. However, many current load balancing schemes are only designed for unstructured grid applications. The purpose of this research is to develop a load balancing strategy while maintaining the simplicity of a structured grid. Several approaches are investigated including brute force oversubscription, node oversubscription through Message Passing Interface (MPI) commands, and shared memory load balancing using OpenMP. Each of these strategies is tested with a simple one-dimensional model prior to implementation into the three-dimensional NGA code. Current results show load balancing will reduce computational time by at least 30%.
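A toy one-dimensional model can show why oversubscription helps when the extra cost is concentrated near an interface. The sketch below is an assumption-laden illustration, not the authors' model or the NGA code: the cost field, the 10x interface cost, and the round-robin dealing of blocks are all hypothetical.

```python
def imbalance(loads):
    """Ratio of the busiest processor's load to the mean load (1.0 is perfect)."""
    return max(loads) / (sum(loads) / len(loads))

def block_loads(cost, n_proc, blocks_per_proc=1):
    """Split cells into equal contiguous blocks and deal them to processors round-robin."""
    n_blocks = n_proc * blocks_per_proc
    size = len(cost) // n_blocks
    loads = [0.0] * n_proc
    for b in range(n_blocks):
        loads[b % n_proc] += sum(cost[b * size:(b + 1) * size])
    return loads

# Hypothetical cost field: 64 cells, with cells 24..31 on the gas-liquid
# interface and assumed to cost 10x as much as the others.
cost = [10.0 if 24 <= i < 32 else 1.0 for i in range(64)]

# One block per processor: a single processor owns the whole interface.
static = imbalance(block_loads(cost, n_proc=4))
# Four blocks per processor: the interface is spread over two processors.
oversubscribed = imbalance(block_loads(cost, n_proc=4, blocks_per_proc=4))
```

In this toy setup the oversubscribed split has a lower imbalance ratio than the static one, at the price of more, smaller blocks to manage and communicate.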
Advanced data management system architectures testbed

NASA Technical Reports Server (NTRS)

Grant, Terry

1990-01-01

The objective of the Architecture and Tools Testbed is to provide a working, experimental focus to the evolving automation applications for the Space Station Freedom data management system. Emphasis is on defining and refining real-world applications including the following: the validation of user needs; understanding system requirements and capabilities; and extending capabilities. The approach is to provide an open, distributed system of high performance workstations representing both the standard data processors and networks and advanced RISC-based processors and multiprocessor systems. The system provides a base from which to develop and evaluate new performance and risk management concepts and for sharing the results. Participants are given a common view of requirements and capability via: remote login to the testbed; standard, natural user interfaces to simulations and emulations; special attention to user manuals for all software tools; and E-mail communication. The testbed elements which instantiate the approach are briefly described including the workstations, the software simulation and monitoring tools, and performance and fault tolerance experiments.
Concurrent computation of attribute filters on shared memory parallel machines.

PubMed

Wilkinson, Michael H F; Gao, Hui; Hesselink, Wim H; Jonker, Jan-Eppo; Meijster, Arnold

2008-10-01

Morphological attribute filters have not previously been parallelized, mainly because they are both global and non-separable. We propose a parallel algorithm that achieves efficient parallelism for a large class of attribute filters, including attribute openings, closings, thinnings and thickenings, based on Salembier's Max-Trees and Min-trees. The image or volume is first partitioned in multiple slices. We then compute the Max-trees of each slice using any sequential Max-Tree algorithm. Subsequently, the Max-trees of the slices can be merged to obtain the Max-tree of the image. A C-implementation yielded good speed-ups on both a 16-processor MIPS 14000 parallel machine, and a dual-core Opteron-based machine. It is shown that the speed-up of the parallel algorithm is a direct measure of the gain with respect to the sequential algorithm used. Furthermore, the concurrent algorithm shows a speed gain of up to 72 percent on a single-core processor, due to reduced cache thrashing.
Optimization of the Multi-Spectral Euclidean Distance Calculation for FPGA-based Spaceborne Systems

NASA Technical Reports Server (NTRS)

Cristo, Alejandro; Fisher, Kevin; Perez, Rosa M.; Martinez, Pablo; Gualtieri, Anthony J.

2012-01-01

Due to the high quantity of operations that spaceborne processing systems must carry out in space, new methodologies and techniques are being presented as good alternatives in order to free the main processor from work and improve the overall performance. These include the development of ancillary dedicated hardware circuits that carry out the more redundant and computationally expensive operations in a faster way, leaving the main processor free to carry out other tasks while waiting for the result. One of these devices is SpaceCube, an FPGA-based system designed by NASA. The opportunity to use FPGA reconfigurable architectures in space allows not only the optimization of the mission operations with hardware-level solutions, but also the ability to create new and improved versions of the circuits, including error corrections, once the satellite is already in orbit. In this work, we propose the optimization of a common operation in remote sensing: the Multi-Spectral Euclidean Distance calculation. For that, two different hardware architectures have been designed and implemented in a Xilinx Virtex-5 FPGA, the same model of FPGAs used by SpaceCube. Previous results have shown that the communications between the embedded processor and the circuit create a bottleneck that affects the overall performance in a negative way. In order to avoid this, advanced methods including memory sharing, Native Port Interface (NPI) connections and Data Burst Transfers have been used.
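For readers unfamiliar with the operation being accelerated, the multi-spectral Euclidean distance between two pixels is the ordinary Euclidean distance taken over their spectral bands. The following is a minimal software sketch of that definition only; it says nothing about the two hardware architectures in the paper, and the function names are hypothetical.

```python
import math

def squared_spectral_distance(pixel_a, pixel_b):
    """Sum of squared band-wise differences between two pixels."""
    if len(pixel_a) != len(pixel_b):
        raise ValueError("pixels must have the same number of bands")
    return sum((a - b) ** 2 for a, b in zip(pixel_a, pixel_b))

def spectral_distance(pixel_a, pixel_b):
    """Euclidean distance between two multi-spectral pixels."""
    return math.sqrt(squared_spectral_distance(pixel_a, pixel_b))

# Two hypothetical 3-band pixels.
d = spectral_distance([0, 0, 0], [3, 4, 0])
```

When distances are only compared with each other, the squared form gives the same ordering and avoids the square root, which is a common simplification in software and may or may not match what the hardware designs do.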
Merlin - Massively parallel heterogeneous computing

NASA Technical Reports Server (NTRS)

Wittie, Larry; Maples, Creve

1989-01-01

Hardware and software for Merlin, a new kind of massively parallel computing system, are described. Eight computers are linked as a 300-MIPS prototype to develop system software for a larger Merlin network with 16 to 64 nodes, totaling 600 to 3000 MIPS. These working prototypes help refine a mapped reflective memory technique that offers a new, very general way of linking many types of computer to form supercomputers. Processors share data selectively and rapidly on a word-by-word basis. Fast firmware virtual circuits are reconfigured to match topological needs of individual application programs. Merlin's low-latency memory-sharing interfaces solve many problems in the design of high-performance computing systems. The Merlin prototypes are intended to run parallel programs for scientific applications and to determine hardware and software needs for a future Teraflops Merlin network.
A Parallel Saturation Algorithm on Shared Memory Architectures

NASA Technical Reports Server (NTRS)

Ezekiel, Jonathan; Siminiceanu

2007-01-01

Symbolic state-space generators are notoriously hard to parallelize. However, the Saturation algorithm implemented in the SMART verification tool differs from other sequential symbolic state-space generators in that it exploits the locality of firing events in asynchronous system models. This paper explores whether event locality can be utilized to efficiently parallelize Saturation on shared-memory architectures. Conceptually, we propose to parallelize the firing of events within a decision diagram node, which is technically realized via a thread pool. We discuss the challenges involved in our parallel design and conduct experimental studies on its prototypical implementation. On a dual-processor dual core PC, our studies show speed-ups for several example models, e.g., of up to 50% for a Kanban model, when compared to running our algorithm only on a single core.
Computation of Molecular Spectra on a Quantum Processor with an Error-Resilient Algorithm

DOE PAGES

Colless, J. I.; Ramasesh, V. V.; Dahlen, D.; ...

2018-02-12

Harnessing the full power of nascent quantum processors requires the efficient management of a limited number of quantum bits with finite coherent lifetimes. Hybrid algorithms, such as the variational quantum eigensolver (VQE), leverage classical resources to reduce the required number of quantum gates. Experimental demonstrations of VQE have resulted in calculation of Hamiltonian ground states, and a new theoretical approach based on a quantum subspace expansion (QSE) has outlined a procedure for determining excited states that are central to dynamical processes. Here, we use a superconducting-qubit-based processor to apply the QSE approach to the H2 molecule, extracting both ground and excited states without the need for auxiliary qubits or additional minimization. Further, we show that this extended protocol can mitigate the effects of incoherent errors, potentially enabling larger-scale quantum simulations without the need for complex error-correction techniques.
Computation of Molecular Spectra on a Quantum Processor with an Error-Resilient Algorithm

NASA Astrophysics Data System (ADS)

Colless, J. I.; Ramasesh, V. V.; Dahlen, D.; Blok, M. S.; Kimchi-Schwartz, M. E.; McClean, J. R.; Carter, J.; de Jong, W. A.; Siddiqi, I.

2018-02-01

Harnessing the full power of nascent quantum processors requires the efficient management of a limited number of quantum bits with finite coherent lifetimes. Hybrid algorithms, such as the variational quantum eigensolver (VQE), leverage classical resources to reduce the required number of quantum gates. Experimental demonstrations of VQE have resulted in calculation of Hamiltonian ground states, and a new theoretical approach based on a quantum subspace expansion (QSE) has outlined a procedure for determining excited states that are central to dynamical processes. We use a superconducting-qubit-based processor to apply the QSE approach to the H2 molecule, extracting both ground and excited states without the need for auxiliary qubits or additional minimization. Further, we show that this extended protocol can mitigate the effects of incoherent errors, potentially enabling larger-scale quantum simulations without the need for complex error-correction techniques.
Dense and Sparse Matrix Operations on the Cell Processor

DOE Office of Scientific and Technical Information (OSTI.GOV)

Williams, Samuel W.; Shalf, John; Oliker, Leonid

2005-05-01

The slowing pace of commodity microprocessor performance improvements combined with ever-increasing chip power demands has become of utmost concern to computational scientists. Therefore, the high performance computing community is examining alternative architectures that address the limitations of modern superscalar designs. In this work, we examine STI's forthcoming Cell processor: a novel, low-power architecture that combines a PowerPC core with eight independent SIMD processing units coupled with a software-controlled memory to offer high FLOP/s/Watt. Since neither Cell hardware nor cycle-accurate simulators are currently publicly available, we develop an analytic framework to predict Cell performance on dense and sparse matrix operations, using a variety of algorithmic approaches. Results demonstrate Cell's potential to deliver more than an order of magnitude better GFLOP/s per watt performance, when compared with the Intel Itanium2 and Cray X1 processors.
A programmable power processor for high power space applications

NASA Technical Reports Server (NTRS)

Lanier, J. R., Jr.; Graves, J. R.; Kapustka, R. E.; Bush, J. R., Jr.

1982-01-01

A Programmable Power Processor (P3) has been developed for application in future large space power systems. The P3 is capable of operation over a wide range of input voltage (26 to 375 Vdc) and output voltage (24 to 180 Vdc). The peak output power capability is 18 kW (180 V at 100 A). The output characteristics of the P3 can be programmed to any voltage and/or current level within the limits of the processor and may be controlled as a function of internal or external parameters. Seven breadboard P3s and one 'flight-type' engineering model P3 have been built and tested both individually and in electrical power systems. The programmable feature allows the P3 to be used in a variety of applications by changing the output characteristics. Test results, including efficiency at various input/output combinations, transient response, and output impedance, are presented.
Computation of Molecular Spectra on a Quantum Processor with an Error-Resilient Algorithm

DOE Office of Scientific and Technical Information (OSTI.GOV)

Colless, J. I.; Ramasesh, V. V.; Dahlen, D.

Harnessing the full power of nascent quantum processors requires the efficient management of a limited number of quantum bits with finite coherent lifetimes. Hybrid algorithms, such as the variational quantum eigensolver (VQE), leverage classical resources to reduce the required number of quantum gates. Experimental demonstrations of VQE have resulted in calculation of Hamiltonian ground states, and a new theoretical approach based on a quantum subspace expansion (QSE) has outlined a procedure for determining excited states that are central to dynamical processes. Here, we use a superconducting-qubit-based processor to apply the QSE approach to the H2 molecule, extracting both ground and excited states without the need for auxiliary qubits or additional minimization. Further, we show that this extended protocol can mitigate the effects of incoherent errors, potentially enabling larger-scale quantum simulations without the need for complex error-correction techniques.
Initial Flight Test of the Production Support Flight Control Computers at NASA Dryden Flight Research Center

NASA Technical Reports Server (NTRS)

Carter, John; Stephenson, Mark

1999-01-01

The NASA Dryden Flight Research Center has completed the initial flight test of a modified set of F/A-18 flight control computers that gives the aircraft a research control law capability. The production support flight control computers (PSFCC) provide an increased capability for flight research in the control law, handling qualities, and flight systems areas. The PSFCC feature a research flight control processor that is "piggybacked" onto the baseline F/A-18 flight control system. This research processor allows for pilot selection of research control law operation in flight. To validate flight operation, a replication of a standard F/A-18 control law was programmed into the research processor and flight-tested over a limited envelope. This paper provides a brief description of the system, summarizes the initial flight test of the PSFCC, and describes future experiments for the PSFCC.
Implementation of an ADI method on parallel computers

NASA Technical Reports Server (NTRS)

Fatoohi, Raad A.; Grosch, Chester E.

1987-01-01

The implementation of an ADI method for solving the diffusion equation on three parallel/vector computers is discussed. The computers were chosen so as to encompass a variety of architectures. They are: the MPP, an SIMD machine with 16K bit-serial processors; FLEX/32, an MIMD machine with 20 processors; and CRAY/2, an MIMD machine with four vector processors. The Gaussian elimination algorithm is used to solve a set of tridiagonal systems on the FLEX/32 and CRAY/2 while the cyclic elimination algorithm is used to solve these systems on the MPP. The implementation of the method is discussed in relation to these architectures and measures of the performance on each machine are given. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Finally, conclusions are presented.
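Each ADI half-step reduces to a set of independent tridiagonal systems, one per grid line, which is why the tridiagonal solver dominates the implementation. The sketch below shows Gaussian elimination specialized to a single tridiagonal system (often called the Thomas algorithm), without pivoting. It is a generic textbook form, not the authors' code; the coefficient value `r` and the test vector are assumptions for the example.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Solve a tridiagonal system by Gaussian elimination without pivoting.

    lower[i] multiplies x[i-1] (lower[0] is unused);
    upper[i] multiplies x[i+1] (upper[-1] is unused).
    Assumes the matrix is diagonally dominant so no pivot becomes zero.
    """
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    # Forward elimination: remove the sub-diagonal row by row.
    for i in range(1, n):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / denom
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / denom
    # Back substitution.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

# Implicit diffusion row: -r*x[i-1] + (1 + 2r)*x[i] - r*x[i+1] = rhs[i].
r, n = 0.5, 4
lower = [0.0] + [-r] * (n - 1)
diag = [1 + 2 * r] * n
upper = [-r] * (n - 1) + [0.0]
x_true = [1.0, 2.0, 3.0, 4.0]
rhs = [(lower[i] * x_true[i - 1] if i > 0 else 0.0)
       + diag[i] * x_true[i]
       + (upper[i] * x_true[i + 1] if i < n - 1 else 0.0)
       for i in range(n)]
x = solve_tridiagonal(lower, diag, upper, rhs)
```

The forward and backward sweeps are sequential within one system, so on MIMD or vector machines the parallelism comes from solving many systems at once; a massively parallel SIMD machine instead favors an algorithm such as cyclic elimination, as the abstract notes.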
Implementation of an ADI method on parallel computers

NASA Technical Reports Server (NTRS)

Fatoohi, Raad A.; Grosch, Chester E.

1987-01-01

In this paper the implementation of an ADI method for solving the diffusion equation on three parallel/vector computers is discussed. The computers were chosen so as to encompass a variety of architectures. They are the MPP, an SIMD machine with 16K bit-serial processors; Flex/32, an MIMD machine with 20 processors; and Cray/2, an MIMD machine with four vector processors. The Gaussian elimination algorithm is used to solve a set of tridiagonal systems on the Flex/32 and Cray/2 while the cyclic elimination algorithm is used to solve these systems on the MPP. The implementation of the method is discussed in relation to these architectures and measures of the performance on each machine are given. Simple performance models are used to describe the performance. These models highlight the bottlenecks and limiting factors for this algorithm on these architectures. Finally, conclusions are presented.
A microprocessor-based one dimensional optical data processor for spatial frequency analysis

NASA Technical Reports Server (NTRS)

Collier, R. L.; Ballard, G. S.

1982-01-01

A high degree of accuracy was obtained in measuring the spatial frequency spectrum of known samples using an optical data processor based on a microprocessor, which reliably collected intensity versus angle data. Stray light control, system alignment, and angle measurement problems were addressed and solved. The capabilities of the instrument were extended by the addition of appropriate optics to allow the use of different wavelengths of laser radiation and by increasing the travel limits of the rotating arm to ±160 degrees. The acquisition, storage, and plotting of data by the computer permits the researcher a free hand in data manipulation such as subtracting background scattering from a diffraction pattern. Tests conducted to verify the operation of the processor using a 25 mm diameter pinhole, a 39.37 line pairs per mm series of multiple slits, and a microscope slide coated with 1.091 mm diameter polystyrene latex spheres are described.
50 CFR 679.64 - Harvesting sideboard limits in other fisheries.

Code of Federal Regulations, 2010 CFR

2010-10-01

.../processors be calculated? Except for Aleutian Islands pollock and BSAI Pacific cod, the Regional... group in which a TAC is specified for an area or subarea of the BSAI as follows: (i) Aleutian Islands Pacific ocean perch. (A) The Aleutian Islands Pacific ocean perch harvest limit will be equal to the 1996...

50 CFR 679.64 - Harvesting sideboard limits in other fisheries.

Code of Federal Regulations, 2013 CFR

2013-10-01

... Pacific ocean perch. (A) The Aleutian Islands Pacific ocean perch harvest limit will be equal to the 1996 through 1997 aggregate retained catch of Aleutian Islands Pacific ocean perch by catcher/processors listed... sum of the Aleutian Islands Pacific ocean perch catch in 1996 and 1997 multiplied by the remainder of...
50 CFR 679.64 - Harvesting sideboard limits in other fisheries.

Code of Federal Regulations, 2014 CFR

2014-10-01

... area or subarea of the BSAI as follows: (i) Aleutian Islands Pacific ocean perch. (A) The Aleutian Islands Pacific ocean perch harvest limit will be equal to the 1996 through 1997 aggregate retained catch of Aleutian Islands Pacific ocean perch by catcher/processors listed in Sections 208(e)(1) through...
50 CFR 679.64 - Harvesting sideboard limits in other fisheries.

Code of Federal Regulations, 2012 CFR

2012-10-01

... Pacific ocean perch. (A) The Aleutian Islands Pacific ocean perch harvest limit will be equal to the 1996 through 1997 aggregate retained catch of Aleutian Islands Pacific ocean perch by catcher/processors listed... sum of the Aleutian Islands Pacific ocean perch catch in 1996 and 1997 multiplied by the remainder of...
The Ocean Colour Climate Change Initiative: II. Spatial and Temporal Homogeneity of Satellite Data Retrieval Due to Systematic Effects in Atmospheric Correction Processors

NASA Technical Reports Server (NTRS)

Muller, Dagmar; Krasemann, Hajo; Brewin, Robert J. W.; Brockmann, Carsten; Deschamps, Pierre-Yves; Fomferra, Norman; Franz, Bryan A.; Grant, Mike G.; Groom, Steve B.; Melin, Frederic;

2015-01-01

The established procedure to assess the quality of atmospheric correction processors and their underlying algorithms is the comparison of satellite data products with related in-situ measurements. Although this approach addresses the accuracy of derived geophysical properties in a straightforward fashion, it is also limited in its ability to catch systematic sensor and processor dependent behaviour of satellite products along the scan-line, which might impair the usefulness of the data in spatial analyses. The Ocean Colour Climate Change Initiative (OC-CCI) aims to create an ocean colour dataset on a global scale to meet the demands of the ecosystem modelling community. The need for products with increasing spatial and temporal resolution that also show as few systematic and random errors as possible is increasing. Due to cloud cover, even temporal means can be influenced by along-scanline artefacts if the observations are not balanced and effects cannot be cancelled out mutually. These effects can arise from a multitude of results which are not easily separated, if at all. Among the sources of artefacts, there are some sensor-specific calibration issues which should lead to similar responses in all processors, as well as processor-specific features which correspond with the individual choices in the algorithms. A set of methods is proposed and applied to MERIS data over two regions of interest in the North Atlantic and the South Pacific Gyre. The normalised water leaving reflectance products of four atmospheric correction processors, which have also been evaluated in match-up analysis, are analysed in order to find and interpret systematic effects across track. These results are summed up with a semi-objective ranking and are used as a complement to the match-up analysis in the decision for the best Atmospheric Correction (AC) processor. Although the need for discussion remains concerning the absolutes by which to judge an AC processor, this example demonstrates clearly that relying on the match-up analysis alone can lead to misjudgement.

MPI Enhancements in John the Ripper

NASA Astrophysics Data System (ADS)

Sykes, Edward R.; Lin, Michael; Skoczen, Wesley

2010-11-01

John the Ripper (JtR) is an open source software package commonly used by system administrators to enforce password policy. JtR is designed to attack (i.e., crack) passwords encrypted in a wide variety of commonly used formats. While parallel implementations of JtR exist, there are several limitations to them. This research reports on two distinct algorithms that enhance this password cracking tool using the Message Passing Interface. The first algorithm is a novel approach that uses numerous processors to crack one password by using an innovative approach to workload distribution. In this algorithm the candidate password is distributed to all participating processors and the word list is divided based on probability so that each processor has the same likelihood of cracking the password while eliminating overlapping operations. The second algorithm developed in this research involves dividing the passwords within a password file equally amongst available processors while ensuring load-balanced and fault-tolerant behavior. This paper describes John the Ripper, the design of these two algorithms and preliminary results. Given the same amount of time, the original JtR can crack 29 passwords, whereas our algorithms 1 and 2 can crack an additional 35 and 45 passwords respectively.
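The first algorithm above divides the word list by probability so that each processor has the same chance of finding the password and no candidate is tried twice. The abstract does not say how the split is computed, so the following is only a minimal sketch under stated assumptions: the function name `split_wordlist`, the greedy least-loaded assignment, and the probabilities in the example are all hypothetical and are not taken from JtR or from the paper.

```python
import heapq

def split_wordlist(words_with_prob, n_procs):
    """Assign every (word, probability) pair to exactly one processor,
    greedily balancing the total probability mass per processor.
    Hypothetical helper; not part of John the Ripper."""
    # Min-heap of (accumulated probability, processor rank).
    heap = [(0.0, rank) for rank in range(n_procs)]
    heapq.heapify(heap)
    shares = [[] for _ in range(n_procs)]
    # Place the most likely words first, always on the least-loaded processor.
    for word, prob in sorted(words_with_prob, key=lambda wp: wp[1], reverse=True):
        load, rank = heapq.heappop(heap)
        shares[rank].append(word)
        heapq.heappush(heap, (load + prob, rank))
    return shares

# Made-up candidates and probabilities, for illustration only.
candidates = [("123456", 0.4), ("password", 0.3), ("qwerty", 0.2), ("letmein", 0.1)]
shares = split_wordlist(candidates, 2)
```

In this toy case each of the two processors ends up with a total probability of about 0.5, and every word appears in exactly one share, which is the property the abstract describes.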
Dynamic load balance scheme for the DSMC algorithm

DOE Office of Scientific and Technical Information (OSTI.GOV)

Li, Jin; Geng, Xiangren; Jiang, Dingwu

The direct simulation Monte Carlo (DSMC) algorithm, devised by Bird, has been used for a wide range of rarefied flow problems over the past 40 years. While the DSMC is suitable for parallel implementation on powerful multi-processor architectures, it also introduces a large load imbalance across the processor array, even for small examples. The load imposed on a processor by a DSMC calculation is determined to a large extent by the total number of simulator particles on it. Since most flows are impulsively started with an initial distribution of particles that is quite different from the steady state, the total number of simulator particles will change dramatically. The load balance based upon an initial distribution of particles will break down as the steady state of flow is reached. The load imbalance and huge computational cost of DSMC have limited its application to rarefied or simple transitional flows. In this paper, by taking advantage of METIS, a software package for partitioning unstructured graphs, and taking the total number of simulator particles in each cell as weight information, repartitioning based upon the principle that each processor handles approximately the same total number of simulator particles has been achieved. The computation must pause several times to renew the total number of simulator particles in each processor and repartition the whole domain again. Thus the load balance across the processor array holds for the duration of the computation, and the parallel efficiency can be improved effectively. The benchmark solution of a cylinder submerged in hypersonic flow has been simulated numerically. In addition, hypersonic flow past a complex wing-body configuration has also been simulated. The results show that, for both cases, the computational time can be reduced by about 50%.
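The repartitioning principle in this abstract (each processor should hold roughly the same number of simulator particles) can be illustrated with a much simpler stand-in than METIS. The sketch below assumes a one-dimensional row of cells and cuts it into contiguous chunks by cumulative particle count; it is not graph partitioning and not the authors' code, and the names `repartition` and `loads` as well as the particle counts are hypothetical.

```python
def repartition(particles_per_cell, n_procs):
    """Return the owning processor of each cell so that contiguous chunks of
    the 1D cell row carry roughly equal particle counts.
    Simplified stand-in for a weighted graph partitioner such as METIS."""
    total = sum(particles_per_cell)
    if total == 0:
        return [0] * len(particles_per_cell)
    owners = []
    seen = 0
    for count in particles_per_cell:
        # Use the cell's midpoint in the cumulative particle count as its key.
        midpoint = seen + count / 2
        owners.append(min(n_procs - 1, int(midpoint * n_procs / total)))
        seen += count
    return owners

def loads(particles_per_cell, owners, n_procs):
    """Total particle count held by each processor."""
    result = [0] * n_procs
    for count, owner in zip(particles_per_cell, owners):
        result[owner] += count
    return result

# Made-up particle counts: uniform at start-up, skewed near steady state.
initial = [10, 10, 10, 10, 10, 10, 10, 10]
steady = [40, 20, 10, 10, 5, 5, 5, 5]

old_owners = repartition(initial, 2)        # partition chosen at start-up
stale = loads(steady, old_owners, 2)        # imbalance if never updated
new_owners = repartition(steady, 2)         # repartition after a pause
balanced = loads(steady, new_owners, 2)
```

With these toy numbers the stale partition leaves one processor with 80 particles and the other with 20, while repartitioning on the current counts gives 40 and 60. A real partitioner would also account for communication between neighbouring cells, which this sketch ignores.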
40 CFR 432.72 - Effluent limitations attainable by the application of the best practicable control technology...

Code of Federal Regulations, 2010 CFR

2010-07-01

... Protection of Environment ENVIRONMENTAL PROTECTION AGENCY (CONTINUED) EFFLUENT GUIDELINES AND STANDARDS MEAT AND POULTRY PRODUCTS POINT SOURCE CATEGORY Sausage and Luncheon Meats Processors § 432.72 Effluent...
Human Reliability and Ship Stability

DTIC Science & Technology

2003-07-04

models such as Miller (1957) and Broadbent (1959) is the idea of human beings as limited capacity information processors with constraints on...15 4.2.2 Outline of Some Key models ...23 TABLE 11: GENERIC ERROR MODELING SYSTEM
Multiprocessor architecture: Synthesis and evaluation

NASA Technical Reports Server (NTRS)

Standley, Hilda M.

1990-01-01

Multiprocessor computer architecture evaluation for structural computations is the focus of the research effort described. Results obtained are expected to lead to more efficient use of existing architectures and to suggest designs for new, application specific, architectures. The brief descriptions given outline a number of related efforts directed toward this purpose. The difficulty in analyzing an existing architecture or in designing a new computer architecture lies in the fact that the performance of a particular architecture, within the context of a given application, is determined by a number of factors. These include, but are not limited to, the efficiency of the computation algorithm, the programming language and support environment, the quality of the program written in the programming language, the multiplicity of the processing elements, the characteristics of the individual processing elements, the interconnection network connecting processors and non-local memories, and the shared memory organization covering the spectrum from no shared memory (all local memory) to one global access memory. These performance determiners may be loosely classified as being software or hardware related. This distinction is not clear or even appropriate in many cases. The effect of the choice of algorithm is ignored by assuming that the algorithm is specified as given. Effort directed toward the removal of the effect of the programming language and program resulted in the design of a high-level parallel programming language. Two characteristics of the fundamental structure of the architecture (memory organization and interconnection network) are examined.
Status report of the end-to-end ASKAP software system: towards early science operations

NASA Astrophysics Data System (ADS)

Guzman, Juan Carlos; Chapman, Jessica; Marquarding, Malte; Whiting, Matthew

2016-08-01

The Australian SKA Pathfinder (ASKAP) is a novel centimetre radio synthesis telescope currently in the commissioning phase and located in the midwest region of Western Australia. It comprises 36 x 12 m diameter reflector antennas, each equipped with state-of-the-art and award-winning Phased Array Feed (PAF) technology. The PAFs provide a wide, 30 square degree field-of-view by forming up to 36 separate dual-polarisation beams at once. This results in a high data rate: 70 TB of correlated visibilities in an 8-hour observation, requiring custom-written, high-performance software running in dedicated High Performance Computing (HPC) facilities. The first six antennas equipped with first-generation PAF technology (Mark I), named the Boolardy Engineering Test Array (BETA), have been in use since 2014 as a platform to test PAF calibration and imaging techniques, and along the way it has been producing some great science results. Commissioning of the ASKAP Array Release 1, that is, the first six antennas with second-generation PAFs (Mark II), is currently under way. An integral part of the instrument is the Central Processor platform hosted at the Pawsey Supercomputing Centre in Perth, which executes custom-written software pipelines designed specifically to meet the ASKAP imaging requirements of wide field of view and high dynamic range. There are three key hardware components of the Central Processor: the ingest nodes (16 x node cluster), the fast temporary storage (1 PB Lustre file system) and the processing supercomputer (200 TFlop system). This High-Performance Computing (HPC) platform is managed and supported by the Pawsey support team. Due to the limited amount of data generated by BETA and the first ASKAP Array Release, the Central Processor platform has been running in a more "traditional" or user-interactive mode. But this is about to change: integration and verification of the online ingest pipeline starts in early 2016, which is required to support the full 300 MHz bandwidth for Array Release 1, followed by the deployment of the real-time data processing components. In addition to the Central Processor, the first production release of the CSIRO ASKAP Science Data Archive (CASDA) has also been deployed in one of the Pawsey Supercomputing Centre facilities and is integrated into the end-to-end ASKAP data flow system. This paper describes the current status of the "end-to-end" data flow software system from preparing observations to data acquisition, processing and archiving, and the challenges of integrating an HPC facility as a key part of the instrument. It also shares some lessons learned since the start of integration activities and the challenges ahead in preparation for the start of the Early Science program.
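The quoted figures (70 TB per 8-hour observation, 1 PB of temporary storage) imply a sustained ingest rate that is easy to work out. The short calculation below is a back-of-the-envelope sketch that assumes decimal units (1 TB = 10^12 bytes, 1 PB = 1000 TB) and a constant rate over the whole observation; neither assumption is stated in the abstract.

```python
TB = 10**12                      # assumed decimal terabyte, in bytes
volume_bytes = 70 * TB           # correlated visibilities per observation
duration_s = 8 * 3600            # 8-hour observation, in seconds

# Average sustained rate in GB/s if the data arrive at a constant pace.
rate_gb_per_s = volume_bytes / duration_s / 10**9

# Number of full observations that fit in the 1 PB temporary store.
observations_in_store = (1000 * TB) // volume_bytes

print(f"{rate_gb_per_s:.2f} GB/s, {observations_in_store} observations per PB")
```

Under these assumptions the average rate is a little over 2.4 GB/s, and the temporary store holds about 14 full observations before data must be processed or moved on.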
Efficiently modeling neural networks on massively parallel computers

NASA Technical Reports Server (NTRS)

Farber, Robert M.

1993-01-01

Neural networks are a very useful tool for analyzing and modeling complex real world systems. Applying neural network simulations to real world problems generally involves large amounts of data and massive amounts of computation. To efficiently handle the computational requirements of large problems, we have implemented at Los Alamos a highly efficient neural network compiler for serial computers, vector computers, vector parallel computers, and fine grain SIMD computers such as the CM-2 connection machine. This paper describes the mapping used by the compiler to implement feed-forward backpropagation neural networks for a SIMD (Single Instruction Multiple Data) architecture parallel computer. Thinking Machines Corporation has benchmarked our code at 1.3 billion interconnects per second (approximately 3 gigaflops) on a 64,000 processor CM-2 connection machine (Singer 1990). This mapping is applicable to other SIMD computers and can be implemented on MIMD computers such as the CM-5 connection machine. Our mapping has virtually no communications overhead with the exception of the communications required for a global summation across the processors (which has sub-linear runtime growth on the order of O(log(number of processors))). We can efficiently model very large neural networks which have many neurons and interconnects, and our mapping can extend to arbitrarily large networks (within memory limitations) by merging the memory space of separate processors with fast adjacent processor interprocessor communications. This paper considers the simulation of feed-forward neural networks only, although the method is extendable to recurrent networks.
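The logarithmic cost of the global summation mentioned above comes from combining partial sums pairwise, so that the number of rounds grows with log2 of the processor count rather than with the count itself. The sketch below simulates this on a single machine; it is a generic tree reduction written for illustration, not the CM-2 mapping from the paper, and the name `tree_sum` is hypothetical.

```python
def tree_sum(values):
    """Sum one value per simulated processor by pairwise combination.
    Returns (total, rounds), where every addition inside one round could
    run at the same time on separate processors."""
    vals = list(values)
    n = len(vals)
    if n == 0:
        return 0, 0
    rounds = 0
    stride = 1
    while stride < n:
        # Processor i absorbs the partial sum held by processor i + stride.
        for i in range(0, n - stride, 2 * stride):
            vals[i] += vals[i + stride]
        stride *= 2
        rounds += 1
    return vals[0], rounds

total, rounds = tree_sum(range(1, 9))        # 8 simulated processors
_, rounds_64 = tree_sum([1] * 64)            # 64 simulated processors
```

Eight values are summed in 3 rounds and 64 values in 6, so doubling the processor count adds only one more round of communication.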
Formulation of detailed consumables management models for the development (preoperational) period of advanced space transportation system. Volume 3: Study of constraints/limitations for STS consumables management

NASA Technical Reports Server (NTRS)

Newman, C. M.

1976-01-01

The constraints and limitations for STS Consumables Management are studied. Variables imposing constraints on the consumables related subsystems are identified, and a method determining constraint violations with the simplified consumables model in the Mission Planning Processor is presented.
Joint Experiment on Scalable Parallel Processors (JESPP) Parallel Data Management

DTIC Science & Technology

2006-05-01

management and analysis tool, called Simulation Data Grid ( SDG ). The design principles driving the design of SDG are: 1) minimize network communication...or SDG . In this report, an initial prototype implementation of this system is described. This project follows on earlier research, primarily...distributed logging system had some 2 limitations. These limitations will be described in this report, and how the SDG addresses these limitations. 3.0
Fast 2D FWI on a multi and many-cores workstation.

NASA Astrophysics Data System (ADS)

Thierry, Philippe; Donno, Daniela; Noble, Mark

2014-05-01

Following the introduction of x86 co-processors (Xeon Phi) and the performance increase of standard 2-socket workstations using the latest 12-core E5-v2 x86-64 CPUs, we present here an MPI + OpenMP implementation of an acoustic 2D FWI (full waveform inversion) code which runs simultaneously on the CPUs and on the co-processors installed in a workstation. The main advantage of running a 2D FWI on a workstation is the ability to quickly evaluate new features such as more complicated wave equations, new cost functions, finite-difference stencils or boundary conditions. Since the co-processor is made of 61 in-order x86 cores, each of them having up to 4 threads, this many-core device can be seen as a shared memory SMP (symmetric multiprocessing) machine with its own IP address. Depending on the vendor, a single workstation can handle several co-processors, turning the workstation into a personal cluster under the desk. The original Fortran 90 CPU version of the 2D FWI code is just recompiled to get a Xeon Phi x86 binary. This multi- and many-core configuration uses standard compilers and associated MPI as well as math libraries under Linux; therefore, the cost of code development remains constant, while computation time improves. We choose to implement the code in the so-called symmetric mode to fully use the capacity of the workstation, but we also evaluate the scalability of the code in native mode (i.e., running only on the co-processor) thanks to the Linux ssh and NFS capabilities. The usual care in optimization and SIMD vectorization is taken to ensure optimal performance, and to analyze the application performance and bottlenecks on both platforms. The 2D FWI implementation uses finite-difference time-domain forward modeling and a quasi-Newton (L-BFGS) optimization scheme for the model parameter update. Parallelization is achieved through standard MPI distribution of shot gathers and OpenMP for domain decomposition within the co-processor. Taking advantage of the 16 GB of memory available on the co-processor, we are able to keep wavefields in memory to achieve the gradient computation by cross-correlation of forward and back-propagated wavefields needed by our time-domain FWI scheme, without heavy traffic on the I/O subsystem and PCIe bus. In this presentation we will also review some simple methodologies to compare expected performance with measured performance in order to estimate the optimization effort before starting any major modification or rewriting of research codes. The key message is the ease of use and development of this hybrid configuration to reach not the absolute peak performance value but the optimal one that ensures the best balance between geophysical and computer developments.
Vienna FORTRAN: A FORTRAN language extension for distributed memory multiprocessors

NASA Technical Reports Server (NTRS)

Chapman, Barbara; Mehrotra, Piyush; Zima, Hans

1991-01-01

Exploiting the performance potential of distributed memory machines requires a careful distribution of data across the processors. Vienna FORTRAN is a language extension of FORTRAN which provides the user with a wide range of facilities for such mapping of data structures. However, programs in Vienna FORTRAN are written using global data references. Thus, the user has the advantage of a shared memory programming paradigm while explicitly controlling the placement of data. The basic features of Vienna FORTRAN are presented along with a set of examples illustrating the use of these features.
Programming in Vienna Fortran

NASA Technical Reports Server (NTRS)

Chapman, Barbara; Mehrotra, Piyush; Zima, Hans

1992-01-01

Exploiting the full performance potential of distributed memory machines requires a careful distribution of data across the processors. Vienna Fortran is a language extension of Fortran which provides the user with a wide range of facilities for such mapping of data structures. In contrast to current programming practice, programs in Vienna Fortran are written using global data references. Thus, the user has the advantages of a shared memory programming paradigm while explicitly controlling the data distribution. In this paper, we present the language features of Vienna Fortran for FORTRAN 77, together with examples illustrating the use of these features.
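The idea of writing programs with global data references while explicitly controlling where the data live can be illustrated with a block distribution, one common way of mapping an array onto processors. The sketch below is plain Python, not Vienna Fortran syntax, and the helper names `block_owner` and `local_indices` are hypothetical; it only shows how a global index can be translated to an owning processor once a distribution has been chosen.

```python
def block_size(n_elements, n_procs):
    """Elements per processor for a block distribution (ceiling division)."""
    return -(-n_elements // n_procs)

def block_owner(global_index, n_elements, n_procs):
    """Processor that owns a given global array index."""
    return global_index // block_size(n_elements, n_procs)

def local_indices(rank, n_elements, n_procs):
    """Global indices stored on processor `rank` (may be empty)."""
    start = rank * block_size(n_elements, n_procs)
    return range(start, min(start + block_size(n_elements, n_procs), n_elements))

# A 10-element array distributed over 4 processors.
owners = [block_owner(i, 10, 4) for i in range(10)]
last_block = list(local_indices(3, 10, 4))
```

The program can keep addressing element `i` of the whole array, while a mapping like this decides which processor actually holds it; here the last processor receives only the single leftover element.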
Minimum energy information fusion in sensor networks

DOE Office of Scientific and Technical Information (OSTI.GOV)

Chapline, G

1999-05-11

In this paper we consider how to organize the sharing of information in a distributed network of sensors and data processors so as to provide explanations for sensor readings with minimal expenditure of energy. We point out that the Minimum Description Length principle provides an approach to information fusion that is more naturally suited to energy minimization than traditional Bayesian approaches. In addition we show that for networks consisting of a large number of identical sensors Kohonen self-organization provides an exact solution to the problem of combining the sensor outputs into minimal description length explanations.
Proceedings of the Seminar on the DoD Computer Security Initiative Program, National Bureau of Standards, Gaithersburg, Maryland, July 17-18, 1979.

DTIC Science & Technology

1979-01-01

specifications have been prepared for a DoD communications processor on an IBM minicomputer, a minicomputer time sharing system for the DEC PDP-11 and...the Honeywell Level 6, a virtual machine monitor for the IBM 370, and Multics [10] for the Honeywell Level 68. MECHANISMS FOR KERNEL IMPLEMENTATION...THEOREMS...PROOF EVIDENCE...KVM/370 FORMAL DESIGN PROCESS...MODULAR DECOMPOSITION...
New Media Analysis: The Effects of Peer Influence and Personality Characteristics Through the Stages of Trial, Adoption, and Continued Use of Video Sharing Websites

DTIC Science & Technology

2011-03-01

Unfortunately, my family deserves more credit than I could possibly say here, but I must try… Mom, thanks for always supporting my dreams and believing in me...example, the use of a computer word-processor to type a lengthy document may facilitate the trial of a system if the alternative is to handwrite the...technologies are inherently a voluntary form of technological communications; therefore, it is conceivable to say that individuals are more likely to be
Wide Area Recovery and Resiliency Program (WARRP) Knowledge Enhancement Events: Agricultural Waste Disposal Workshop After Action Report

DTIC Science & Technology

2012-07-17

production of milk. Weld produces 57 percent of the milk in Colorado and has become the 17th largest dairy county in the U.S. in cow numbers (almost...engaged in the plan; everyone from the milk producer to the milk processor. 6 “In the event of an outbreak, everyone in this room would have a role...slaughter. Dr. McCarl illustrated the magnitude of the carcass disposal problem, sharing how the problem would be 9 cows wide and stretch the length

Development of Universal Controller Architecture for SiC Based Power Electronic Building Blocks

DTIC Science & Technology

2017-10-30

time control and control network routing and the other for non-real time instrumentation and monitoring. The two subsystems are isolated and share...directly to the processor without any software intervention. We use a non-real time 1 Gb/s Ethernet interface for monitoring and control of the module...NOTC1 802.1w Spanning tree Prot. 76.96 184.0 107.04 Multiple point Private Line l NOTC1 203.2 382.3 179.1 N/A Non applicable 1 No traffic control at
Marionette

DOE Office of Scientific and Technical Information (OSTI.GOV)

Sullivan, M.; Anderson, D.P.

1988-01-01

Marionette is a system for distributed parallel programming in an environment of networked heterogeneous computer systems. It is based on a master/slave model. The master process can invoke worker operations (asynchronous remote procedure calls to single slaves) and context operations (updates to the state of all slaves). The master and slaves also interact through shared data structures that can be modified only by the master. The master and slave processes are programmed in a sequential language. The Marionette runtime system manages slave process creation, propagates shared data structures to slaves as needed, queues and dispatches worker and context operations, and manages recovery from slave processor failures. The Marionette system also includes tools for automated compilation of program binaries for multiple architectures, and for distributing binaries to remote file systems. A UNIX-based implementation of Marionette is described.
50 CFR 679.5 - Recordkeeping and reporting (R&R).

Code of Federal Regulations, 2014 CFR

2014-10-01

... CV trw CP lgl/pot CP trw MS Submit to ... Time limit (1) White X X X X X Must retain, permanently...: CP = catcher/processor; CV = catcher vessel; lgl = longline; trw = trawl; MS = mothership. (2...
50 CFR 679.5 - Recordkeeping and reporting (R&R).

Code of Federal Regulations, 2013 CFR

2013-10-01

... CV trw CP lgl/pot CP trw MS Submit to ... Time limit (1) White X X X X X Must retain, permanently...: CP = catcher/processor; CV = catcher vessel; lgl = longline; trw = trawl; MS = mothership. (2...
Scalable ion-photon quantum interface based on integrated diffractive mirrors

NASA Astrophysics Data System (ADS)

Ghadimi, Moji; Blūms, Valdis; Norton, Benjamin G.; Fisher, Paul M.; Connell, Steven C.; Amini, Jason M.; Volin, Curtis; Hayden, Harley; Pai, Chien-Shing; Kielpinski, David; Lobino, Mirko; Streed, Erik W.

2017-12-01

Quantum networking links quantum processors through remote entanglement for distributed quantum information processing and secure long-range communication. Trapped ions are a leading quantum information processing platform, having demonstrated universal small-scale processors and roadmaps for large-scale implementation. Overall rates of ion-photon entanglement generation, essential for remote trapped ion entanglement, are limited by coupling efficiency into single mode fibers and scaling to many ions. Here, we show a microfabricated trap with integrated diffractive mirrors that couples 4.1(6)% of the fluorescence from a 174Yb+ ion into a single mode fiber, nearly triple the demonstrated bulk optics efficiency. The integrated optic collects 5.8(8)% of the π transition fluorescence, images the ion with sub-wavelength resolution, and couples 71(5)% of the collected light into the fiber. Our technology is suitable for entangling multiple ions in parallel and overcomes mode quality limitations of existing integrated optical interconnects.
Neural-network dedicated processor for solving competitive assignment problems

NASA Technical Reports Server (NTRS)

Eberhardt, Silvio P. (Inventor)

1993-01-01

A neural-network processor for solving first-order competitive assignment problems consists of a matrix of N x M processing units (PUs), each of which corresponds to the pairing of a first number of elements of (R sub i) with a second number of elements (C sub j), wherein limits of the first number are programmed in row control superneurons, and limits of the second number are programmed in column superneurons as MIN and MAX values. The cost (weight) W sub ij of the pairings is programmed separately into each PU. For each row and column of PUs, a dedicated constraint superneuron ensures that the number of active neurons within the associated row or column falls within a specified range. Annealing is provided by gradually increasing the PU gain for each row and column or increasing positive feedback to each PU, the latter being effective to increase hysteresis of each PU or by combining both of these techniques.
Vibrational Analysis of Engine Components Using Neural-Net Processing and Electronic Holography

NASA Technical Reports Server (NTRS)

Decker, Arthur J.; Fite, E. Brian; Mehmed, Oral; Thorp, Scott A.

1997-01-01

The use of computational-model trained artificial neural networks to acquire damage specific information from electronic holograms is discussed. A neural network is trained to transform two time-average holograms into a pattern related to the bending-induced-strain distribution of the vibrating component. The bending distribution is very sensitive to component damage unlike the characteristic fringe pattern or the displacement amplitude distribution. The neural network processor is fast for real-time visualization of damage. The two-hologram limit makes the processor more robust to speckle pattern decorrelation. Undamaged and cracked cantilever plates serve as effective objects for testing the combination of electronic holography and neural-net processing. The requirements are discussed for using finite-element-model trained neural networks for field inspections of engine components. The paper specifically discusses neural-network fringe pattern analysis in the presence of the laser speckle effect and the performances of two limiting cases of the neural-net architecture.
Vibrational Analysis of Engine Components Using Neural-Net Processing and Electronic Holography

NASA Technical Reports Server (NTRS)

Decker, Arthur J.; Fite, E. Brian; Mehmed, Oral; Thorp, Scott A.

1998-01-01

The use of computational-model trained artificial neural networks to acquire damage specific information from electronic holograms is discussed. A neural network is trained to transform two time-average holograms into a pattern related to the bending-induced-strain distribution of the vibrating component. The bending distribution is very sensitive to component damage unlike the characteristic fringe pattern or the displacement amplitude distribution. The neural network processor is fast for real-time visualization of damage. The two-hologram limit makes the processor more robust to speckle pattern decorrelation. Undamaged and cracked cantilever plates serve as effective objects for testing the combination of electronic holography and neural-net processing. The requirements are discussed for using finite-element-model trained neural networks for field inspections of engine components. The paper specifically discusses neural-network fringe pattern analysis in the presence of the laser speckle effect and the performances of two limiting cases of the neural-net architecture.
Does the Intel Xeon Phi processor fit HEP workloads?

NASA Astrophysics Data System (ADS)

Nowak, A.; Bitzes, G.; Dotti, A.; Lazzaro, A.; Jarp, S.; Szostek, P.; Valsan, L.; Botezatu, M.; Leduc, J.

2014-06-01

This paper summarizes the five years of CERN openlab's efforts focused on the Intel Xeon Phi co-processor, from the time of its inception to public release. We consider the architecture of the device vis a vis the characteristics of HEP software and identify key opportunities for HEP processing, as well as scaling limitations. We report on improvements and speedups linked to parallelization and vectorization on benchmarks involving software frameworks such as Geant4 and ROOT. Finally, we extrapolate current software and hardware trends and project them onto accelerators of the future, with the specifics of offline and online HEP processing in mind.
Diesel fuel to dc power: Navy & Marine Corps Applications

DOE Office of Scientific and Technical Information (OSTI.GOV)

Bloomfield, D.P.

1996-12-31

During the past year Analytic Power has tested fuel cell stacks and diesel fuel processors for US Navy and Marine Corps applications. The units are 10 kW demonstration power plants. The USN power plant was built to demonstrate the feasibility of diesel fueled PEM fuel cell power plants for 250 kW and 2.5 MW shipboard power systems. We designed and tested a ten cell, 1 kW USMC substack and fuel processor. The complete 10 kW prototype power plant, which has application to both power and hydrogen generation, is now under construction. The USN and USMC fuel cell stacks have been tested on both actual and simulated reformate. Analytic Power has accumulated operating experience with autothermal reforming based fuel processors operating on sulfur bearing diesel fuel, jet fuel, propane and natural gas. We have also completed the design and fabrication of an advanced regenerative ATR for the USMC. One of the significant problems with small fuel processors is heat loss, which limits their ability to operate with the high steam-to-carbon ratios required for coke free high efficiency operation. The new USMC unit specifically addresses these heat transfer issues. The advances in the mill programs have been incorporated into Analytic Power's commercial units which are now under test.
Using Modern Design Tools for Digital Avionics Development

NASA Technical Reports Server (NTRS)

Hyde, David W.; Lakin, David R., II; Asquith, Thomas E.

2000-01-01

Shrinking development time and increased complexity of new avionics force the designer to use modern tools and methods during hardware development. Engineers at the Marshall Space Flight Center have successfully upgraded their design flow and used it to develop a Mongoose V based radiation tolerant processor board for the International Space Station's Water Recovery System. The design flow, based on hardware description languages, simulation, synthesis, hardware models, and full functional software model libraries, allowed designers to fully simulate the processor board from reset through initialization before any boards were built. The fidelity of a digital simulation is limited to the accuracy of the models used and how realistically the designer drives the circuit's inputs during simulation. By using the actual silicon during simulation, device modeling errors are reduced. Numerous design flaws were discovered early in the design phase when they could be easily fixed. The use of hardware models and actual MIPS software loaded into full functional memory models also provided checkout of the software development environment. This paper will describe the design flow used to develop the processor board and give examples of errors that were found using the tools. An overview of the processor board firmware will also be covered.
Monte Carlo dose calculation using a cell processor based PlayStation 3 system

NASA Astrophysics Data System (ADS)

Chow, James C. L.; Lam, Phil; Jaffray, David A.

2012-02-01

This study investigates the performance of the EGSnrc computer code coupled with Cell-based hardware in Monte Carlo simulation of radiation dose in radiotherapy. Performance evaluations of two processor-intensive functions, namely HOWNEAR and RANMAR_GET, in the EGSnrc code were carried out based on the 20-80 rule (Pareto principle). The execution speeds of the two functions were measured with the profiler gprof, which reports the number of executions and the total time spent in each function. A testing architecture designed for the Cell processor was implemented in the evaluation using a PlayStation3 (PS3) system. The evaluation results show that the algorithms examined are readily parallelizable on the Cell platform, provided that an architectural change of the EGSnrc was made. However, as the EGSnrc performance was limited by the PowerPC Processing Element in the PS3, a PC coupled with graphics processing units or GPGPU may provide a more viable avenue for acceleration.
78 FR 16334 - Self-Regulatory Organizations; BATS Exchange, Inc.; Notice of Filing and Immediate Effectiveness...

Federal Register 2010, 2011, 2012, 2013, 2014

2013-03-14

... (May 31, 2012), 77 FR 33498 (June 6, 2012) (the ``Limit Up-Limit Down Release''). The text of the... proposed rule change and discussed any comments it received on the proposed rule change. The text of these... calculated by the Processors.\\13\\ When the National Best Bid (Offer) is below (above) the Lower (Upper) Price...
18. Perimeter acquisition radar building room #105, deionizers (filter tanks) ...

Library of Congress Historic Buildings Survey, Historic Engineering Record, Historic Landscapes Survey

18. Perimeter acquisition radar building room #105, deionizers (filter tanks) for data processor cooling and ice backup; sign reads: Deionizer units provide high-purity water by removal of oxygen, and organic and mineral content from water - Stanley R. Mickelsen Safeguard Complex, Perimeter Acquisition Radar Building, Limited Access Area, between Limited Access Patrol Road & Service Road A, Nekoma, Cavalier County, ND
48 CFR 52.219-18 - Notification of Competition Limited to Eligible 8(a) Concerns.

Code of Federal Regulations, 2010 CFR

2010-10-01

... conformance with the Business Activity Targets set forth in its approved business plan or any remedial action... business manufacturers or processors in the Federal market in accordance with 19.502-2(c), delete...
50 CFR 679.5 - Recordkeeping and reporting (R&R).

Code of Federal Regulations, 2012 CFR

2012-10-01

... ... Logsheets found in these logbooks CV lgl/pot CV trw CP lgl/pot CP trw MS Submit to ... Time limit (1) White... vessel's catch is off-loaded Note: CP = catcher/processor; CV = catcher vessel; lgl = longline; trw...
50 CFR 679.5 - Recordkeeping and reporting (R&R).

Code of Federal Regulations, 2010 CFR

2010-10-01

... ... Logsheets found in these logbooks CV lgl/pot CV trw CP lgl/pot CP trw MS Submit to ... Time limit (1) White... vessel's catch is off-loaded Note: CP = catcher/processor; CV = catcher vessel; lgl = longline; trw...
50 CFR 679.5 - Recordkeeping and reporting (R&R).

Code of Federal Regulations, 2011 CFR

2011-10-01

... ... Logsheets found in these logbooks CV lgl/pot CV trw CP lgl/pot CP trw MS Submit to ... Time limit (1) White... vessel's catch is off-loaded Note: CP = catcher/processor; CV = catcher vessel; lgl = longline; trw...
Smart Power Supply for Battery-Powered Systems

NASA Technical Reports Server (NTRS)

Krasowski, Michael J.; Greer, Lawrence; Prokop, Norman F.; Flatico, Joseph M.

2010-01-01

A power supply for battery-powered systems has been designed with an embedded controller that is capable of monitoring and maintaining batteries, charging hardware, while maintaining output power. The power supply is primarily designed for rovers and other remote science and engineering vehicles, but it can be used in any battery alone, or battery and charging source applications. The supply can function autonomously, or can be connected to a host processor through a serial communications link. It can be programmed a priori or on the fly to return current and voltage readings to a host. It has two output power busses: a constant 24-V direct current nominal bus, and a programmable bus for output from approximately 24 up to approximately 50 V. The programmable bus voltage level, and its output power limit, can be changed on the fly as well. The power supply also offers options to reduce the programmable bus to 24 V when the set power limit is reached, limiting output power in the case of a system fault detected in the system. The smart power supply is based on an embedded 8051-type single-chip microcontroller. This choice was made in that a credible progression to flight (radiation hard, high reliability) can be assumed as many 8051 processors or gate arrays capable of accepting 8051-type core presently exist and will continue to do so for some time. To solve the problem of centralized control, this innovation moves an embedded microcontroller to the power supply and assigns it the task of overseeing the operation and charging of the power supply assets. This embedded processor is connected to the application central processor via a serial data link such that the central processor can request updates of various parameters within the supply, such as battery current, bus voltage, remaining power in battery estimations, etc. This supply has a direct connection to the battery bus for common (quiescent) power application. 
Because components from multiple vendors may have differing power needs, this supply also has a secondary power bus, which can be programmed a priori or on-the-fly to boost the primary battery voltage level from 24 to 50 V to accommodate various loads as they are brought on line. Through voltage and current monitoring, the device can also shield the charging source from overloads, keep it within safe operating modes, and can meter available power to the application and maintain safe operations.
Improving the performance of heterogeneous multi-core processors by modifying the cache coherence protocol

NASA Astrophysics Data System (ADS)

Fang, Juan; Hao, Xiaoting; Fan, Qingwen; Chang, Zeqing; Song, Shuying

2017-05-01

In the heterogeneous multi-core architecture, CPU and GPU processors are integrated on the same chip, which poses a new challenge for last-level cache management. In this architecture, CPU and GPU applications execute concurrently and both access the last-level cache (LLC). CPUs and GPUs have different memory access characteristics, so they differ in their sensitivity to LLC capacity. For many CPU applications, a reduced share of the LLC can lead to significant performance degradation. By contrast, GPU applications can tolerate increased memory access latency when there is sufficient thread-level parallelism. Taking this latency tolerance into account, this paper presents a method that lets GPU applications access memory directly, leaving more LLC space for CPU applications; this improves the performance of CPU applications without affecting the performance of GPU applications. When the CPU application is cache sensitive and the GPU application is insensitive to the cache, the overall performance of the system improves significantly.
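The effect described in this abstract can be illustrated with a toy model. The sketch below is not the authors' protocol: it assumes a fully associative LRU cache, a synthetic trace in which the CPU reuses a small working set while the GPU streams through fresh addresses, and hypothetical names (`LRUCache`, `cpu_hit_rate`, `make_trace`) chosen only for illustration.

```python
from collections import OrderedDict


class LRUCache:
    """Toy fully associative last-level cache with LRU replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def access(self, addr):
        """Return True on a hit; on a miss, insert the line and evict the LRU one."""
        if addr in self.lines:
            self.lines.move_to_end(addr)
            return True
        self.lines[addr] = True
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)
        return False


def make_trace(rounds=50, cpu_set=8, gpu_burst=16):
    """Synthetic trace (an assumption): CPU reuses cpu_set lines, GPU streams."""
    trace = []
    gpu_addr = 10_000
    for _ in range(rounds):
        trace.extend(("cpu", a) for a in range(cpu_set))
        for _ in range(gpu_burst):
            trace.append(("gpu", gpu_addr))
            gpu_addr += 1
    return trace


def cpu_hit_rate(trace, capacity, gpu_bypass):
    """CPU hit rate in the LLC, with or without GPU requests bypassing it."""
    llc = LRUCache(capacity)
    hits = total = 0
    for source, addr in trace:
        if source == "gpu" and gpu_bypass:
            continue  # GPU request goes straight to memory, LLC untouched
        hit = llc.access(addr)
        if source == "cpu":
            total += 1
            hits += hit
    return hits / total


if __name__ == "__main__":
    trace = make_trace()
    print("shared LLC :", cpu_hit_rate(trace, 16, gpu_bypass=False))
    print("GPU bypass :", cpu_hit_rate(trace, 16, gpu_bypass=True))
```

In this contrived trace the streaming GPU traffic evicts the whole CPU working set every round, so bypassing restores nearly all CPU hits; real workloads would show a smaller and less clear-cut difference.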

New computing systems and their impact on structural analysis and design

NASA Technical Reports Server (NTRS)

Noor, Ahmed K.

1989-01-01

A review is given of the recent advances in computer technology that are likely to impact structural analysis and design. The computational needs for future structures technology are described. The characteristics of new and projected computing systems are summarized. Advances in programming environments, numerical algorithms, and computational strategies for new computing systems are reviewed, and a novel partitioning strategy is outlined for maximizing the degree of parallelism. The strategy is designed for computers with a shared memory and a small number of powerful processors (or a small number of clusters of medium-range processors). It is based on approximating the response of the structure by a combination of symmetric and antisymmetric response vectors, each obtained using a fraction of the degrees of freedom of the original finite element model. The strategy was implemented on the CRAY X-MP/4 and the Alliant FX/8 computers. For nonlinear dynamic problems on the CRAY X-MP with four CPUs, it resulted in an order of magnitude reduction in total analysis time, compared with the direct analysis on a single-CPU CRAY X-MP machine.
Mass storage at NSA

NASA Technical Reports Server (NTRS)

Shields, Michael F.

1993-01-01

The need to manage large amounts of data on robotically controlled devices has been critical to the mission of this Agency for many years. In many respects this Agency has helped pioneer, with their industry counterparts, the development of a number of products long before these systems became commercially available. Numerous attempts have been made to field both robotically controlled tape and optical disk technology and systems to satisfy our tertiary storage needs. Custom developed products were architected, designed, and developed without vendor partners over the past two decades to field workable systems to handle our ever-increasing storage requirements. Many of the attendees of this symposium are familiar with some of the older products, such as the Braegen Automated Tape Libraries (ATL's), the IBM 3850, and the Ampex TeraStore, just to name a few. In addition, we embarked on an in-house development of a shared disk input/output support processor to manage our ever-increasing tape storage needs. For all intents and purposes, this system was a file server by current definitions which used CDC Cyber computers as the control processors. It served us well and was just recently removed from production usage.
Optical Interconnections for VLSI Computational Systems Using Computer-Generated Holography.

NASA Astrophysics Data System (ADS)

Feldman, Michael Robert

Optical interconnects for VLSI computational systems using computer generated holograms are evaluated in theory and experiment. It is shown that by replacing particular electronic connections with free-space optical communication paths, connection of devices on a single chip or wafer and between chips or modules can be improved. Optical and electrical interconnects are compared in terms of power dissipation, communication bandwidth, and connection density. Conditions are determined for which optical interconnects are advantageous. Based on this analysis, it is shown that by applying computer generated holographic optical interconnects to wafer scale fine grain parallel processing systems, dramatic increases in system performance can be expected. Some new interconnection networks, designed to take full advantage of optical interconnect technology, have been developed. Experimental Computer Generated Holograms (CGH's) have been designed, fabricated and subsequently tested in prototype optical interconnected computational systems. Several new CGH encoding methods have been developed to provide efficient high performance CGH's. One CGH was used to decrease the access time of a 1 kilobit CMOS RAM chip. Another was produced to implement the inter-processor communication paths in a shared memory SIMD parallel processor array.
77 FR 59852 - Fisheries of the Exclusive Economic Zone Off Alaska; Bering Sea and Aleutian Islands Management...

Federal Register 2010, 2011, 2012, 2013, 2014

2012-10-01

...NMFS publishes regulations to implement Amendment 97 to the Fishery Management Plan for Groundfish of the Bering Sea and Aleutian Islands Management Area (FMP). Amendment 97 allows the owner of a trawl catcher/processor vessel authorized to participate in the Amendment 80 catch share program to replace that vessel with a vessel that meets certain requirements. This action establishes the regulatory process for replacement of vessels in the Amendment 80 fleet and the requirements for Amendment 80 replacement vessels, such as a limit on the overall length of a replacement vessel, a prohibition on the use of an AFA vessel as a replacement vessel, measures to prevent a replaced vessel from participating in Federal groundfish fisheries off Alaska that are not Amendment 80 fisheries, and measures that extend specific catch limits (known as Amendment 80 sideboards) to a replacement vessel. This action is necessary to promote safety-at-sea by allowing Amendment 80 vessel owners to replace their vessels for any reason at any time and by requiring replacement vessels to meet certain U.S. Coast Guard vessel safety standards, and to improve the retention and utilization of groundfish catch by these vessels by facilitating an increase in the processing capabilities of the fleet. This action is intended to promote the goals and objectives of the Magnuson-Stevens Fishery Conservation and Management Act, the FMP, and other applicable laws.
Hierarchical algorithms for modeling the ocean on hierarchical architectures

NASA Astrophysics Data System (ADS)

Hill, C. N.

2012-12-01

This presentation will describe an approach to using accelerator/co-processor technology that maps hierarchical, multi-scale modeling techniques to an underlying hierarchical hardware architecture. The focus of this work is on making effective use of both CPU and accelerator/co-processor parts of a system, for large scale ocean modeling. In the work, a lower resolution basin scale ocean model is locally coupled to multiple, "embedded", limited area higher resolution sub-models. The higher resolution models execute on co-processor/accelerator hardware and do not interact directly with other sub-models. The lower resolution basin scale model executes on the system CPU(s). The result is a multi-scale algorithm that aligns with hardware designs in the co-processor/accelerator space. We demonstrate this approach being used to substitute explicit process models for standard parameterizations. Code for our sub-models is implemented through a generic abstraction layer, so that we can target multiple accelerator architectures with different programming environments. We will present two application and implementation examples. One uses the CUDA programming environment and targets GPU hardware. This example employs a simple non-hydrostatic two dimensional sub-model to represent vertical motion more accurately. The second example uses a highly threaded three-dimensional model at high resolution. This targets a MIC/Xeon Phi like environment and uses sub-models as a way to explicitly compute sub-mesoscale terms. In both cases the accelerator/co-processor capability provides extra compute cycles that allow improved model fidelity for little or no extra wall-clock time cost.
FPGA-based distributed computing microarchitecture for complex physical dynamics investigation.

PubMed

Borgese, Gianluca; Pace, Calogero; Pantano, Pietro; Bilotta, Eleonora

2013-09-01

In this paper, we present a distributed computing system, called DCMARK, aimed at solving partial differential equations at the basis of many investigation fields, such as solid state physics, nuclear physics, and plasma physics. This distributed architecture is based on the cellular neural network paradigm, which allows us to divide the differential equation system solving into many parallel integration operations to be executed by a custom multiprocessor system. We push the number of processors to the limit of one processor for each equation. In order to test the present idea, we choose to implement DCMARK on a single FPGA, designing the single processor in order to minimize its hardware requirements and to obtain a large number of easily interconnected processors. This approach is particularly suited to study the properties of 1-, 2- and 3-D locally interconnected dynamical systems. In order to test the computing platform, we implement a 200-cell Korteweg-de Vries (KdV) equation solver and perform a comparison between simulations conducted on a high performance PC and on our system. Since our distributed architecture takes a constant computing time to solve the equation system, independently of the number of dynamical elements (cells) of the CNN array, it allows us to reduce the elaboration time more than other similar systems in the literature. To ensure a high level of reconfigurability, we design a compact system on programmable chip managed by a softcore processor, which controls the fast data/control communication between our system and a PC Host. An intuitive graphical user interface allows us to change the calculation parameters and plot the results.
The limits of sharing: an ethical analysis of the arguments for and against the sharing of databases and material banks.

PubMed

Smith, Elise

2011-11-01

In this article, I study the challenges that make database and material bank sharing difficult for many researchers. I assert that if sharing is prima facie ethical (a view that I will defend), then any practices that limit sharing require justification. I argue that: 1) data and material sharing is ethical for many stakeholders; 2) there are, however, certain reasonable limits to sharing; and 3) the rationale and validity of arguments for any limitations to sharing must be made transparent. I conclude by providing general recommendations for how to ethically share databases and material banks.
Running ATLAS workloads within massively parallel distributed applications using Athena Multi-Process framework (AthenaMP)

NASA Astrophysics Data System (ADS)

Calafiura, Paolo; Leggett, Charles; Seuster, Rolf; Tsulaia, Vakhtang; Van Gemmeren, Peter

2015-12-01

AthenaMP is a multi-process version of the ATLAS reconstruction, simulation and data analysis framework Athena. By leveraging Linux fork and copy-on-write mechanisms, it allows for sharing of memory pages between event processors running on the same compute node with little to no change in the application code. Originally targeted to optimize the memory footprint of reconstruction jobs, AthenaMP has demonstrated that it can reduce the memory usage of certain configurations of ATLAS production jobs by a factor of 2. AthenaMP has also evolved to become the parallel event-processing core of the recently developed ATLAS infrastructure for fine-grained event processing (Event Service) which allows the running of AthenaMP inside massively parallel distributed applications on hundreds of compute nodes simultaneously. We present the architecture of AthenaMP, various strategies implemented by AthenaMP for scheduling workload to worker processes (for example: Shared Event Queue and Shared Distributor of Event Tokens) and the usage of AthenaMP in the diversity of ATLAS event processing workloads on various computing resources: Grid, opportunistic resources and HPC.
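The fork and copy-on-write mechanism mentioned in this abstract can be sketched in a few lines. This is only an illustration on a POSIX system, not AthenaMP code: the `geometry` table stands in for large read-only data loaded before forking, and `process_events` with its strided event assignment is a hypothetical scheme, not one of the AthenaMP scheduling strategies.

```python
import os


def process_events(events, n_workers, geometry):
    """Fork n_workers children; each handles every n_workers-th event.

    `geometry` is built once in the parent. After fork the children read it
    through shared copy-on-write pages, so it is not duplicated in physical
    memory as long as nobody writes to it (POSIX only).
    """
    children = []
    for rank in range(n_workers):
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:  # child process
            status = 1
            try:
                os.close(read_fd)
                my_events = events[rank::n_workers]
                total = sum(geometry[e % len(geometry)] for e in my_events)
                with os.fdopen(write_fd, "w") as out:
                    out.write(str(total))
                status = 0
            finally:
                os._exit(status)  # never return into the parent's code path
        os.close(write_fd)
        children.append((pid, read_fd))

    results = []
    for pid, read_fd in children:
        with os.fdopen(read_fd) as inp:
            results.append(int(inp.read()))
        os.waitpid(pid, 0)
    return results


if __name__ == "__main__":
    table = list(range(1000))  # stand-in for large read-only data
    print(process_events(list(range(10)), 2, table))
```

Note that CPython's reference counting writes to object headers, so page sharing in Python is less effective than in a C++ framework; the sketch shows the control flow rather than the memory savings.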
A Comparison of Three Programming Models for Adaptive Applications

NASA Technical Reports Server (NTRS)

Shan, Hong-Zhang; Singh, Jaswinder Pal; Oliker, Leonid; Biswas, Rupak; Kwak, Dochan (Technical Monitor)

2000-01-01

We study the performance and programming effort for two major classes of adaptive applications under three leading parallel programming models. We find that all three models can achieve scalable performance on state-of-the-art multiprocessor machines. The basic parallel algorithms needed for different programming models to deliver their best performance are similar, but the implementations differ greatly, far beyond the fact of using explicit messages versus implicit loads/stores. Compared with MPI and SHMEM, CC-SAS (cache-coherent shared address space) provides substantial ease of programming at the conceptual and program orchestration level, which often leads to a performance gain. However, it may also suffer from the poor spatial locality of physically distributed shared data on a large number of processors. Our CC-SAS implementation of the PARMETIS partitioner itself runs faster than in the other two programming models, and generates a more balanced result for our application.
Parallel, Asynchronous Executive (PAX): System concepts, facilities, and architecture

NASA Technical Reports Server (NTRS)

Jones, W. H.

1983-01-01

The Parallel, Asynchronous Executive (PAX) is a software operating system simulation that allows many computers to work on a single problem at the same time. PAX is currently implemented on a UNIVAC 1100/42 computer system. Independent UNIVAC runstreams are used to simulate independent computers. Data are shared among independent UNIVAC runstreams through shared mass-storage files. PAX has achieved the following: (1) applied several computing processes simultaneously to a single, logically unified problem; (2) resolved most parallel processor conflicts by careful work assignment; (3) resolved by means of worker requests to PAX all conflicts not resolved by work assignment; (4) provided fault isolation and recovery mechanisms to meet the problems of an actual parallel, asynchronous processing machine. Additionally, one real-life problem has been constructed for the PAX environment. This is CASPER, a collection of aerodynamic and structural dynamic problem simulation routines. CASPER is not discussed in this report except to provide examples of parallel-processing techniques.
Partitioning problems in parallel, pipelined and distributed computing

NASA Technical Reports Server (NTRS)

Bokhari, S.

1985-01-01

The problem of optimally assigning the modules of a parallel program over the processors of a multiple computer system is addressed. A Sum-Bottleneck path algorithm is developed that permits the efficient solution of many variants of this problem under some constraints on the structure of the partitions. In particular, the following problems are solved optimally for a single-host, multiple satellite system: partitioning multiple chain structured parallel programs, multiple arbitrarily structured serial programs and single tree structured parallel programs. In addition, the problems of partitioning chain structured parallel programs across chain connected systems and across shared memory (or shared bus) systems are also solved under certain constraints. All solutions for parallel programs are equally applicable to pipelined programs. These results extend prior research in this area by explicitly taking concurrency into account and permit the efficient utilization of multiple computer architectures for a wide range of problems of practical interest.
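To make the chain-partitioning problem concrete, here is a minimal sketch of a simplified variant. It is not the Sum-Bottleneck path algorithm of the abstract: it assumes only computation weights (no communication costs), identical processors, and contiguous blocks, and it uses a plain dynamic program to minimize the load of the most heavily loaded processor. The function name `partition_chain` is hypothetical.

```python
from itertools import accumulate


def partition_chain(weights, n_procs):
    """Split a chain of module weights into at most n_procs contiguous blocks,
    minimizing the largest block sum (the bottleneck).

    Returns (bottleneck, blocks). Runs in O(n_procs * n^2) time.
    """
    n = len(weights)
    prefix = [0, *accumulate(weights)]
    inf = float("inf")
    # best[k][i]: smallest bottleneck for the first i modules on k processors
    best = [[inf] * (n + 1) for _ in range(n_procs + 1)]
    cut = [[0] * (n + 1) for _ in range(n_procs + 1)]
    best[0][0] = 0
    for k in range(1, n_procs + 1):
        for i in range(n + 1):
            for j in range(i + 1):
                # processor k takes modules j..i-1, the rest go to k-1 processors
                cand = max(best[k - 1][j], prefix[i] - prefix[j])
                if cand < best[k][i]:
                    best[k][i] = cand
                    cut[k][i] = j
    # Walk the recorded cut points backwards to recover the blocks.
    blocks = []
    i = n
    for k in range(n_procs, 0, -1):
        j = cut[k][i]
        blocks.append(weights[j:i])
        i = j
    blocks.reverse()
    return best[n_procs][n], blocks


if __name__ == "__main__":
    print(partition_chain([4, 2, 7, 1, 3, 5], 3))
```

A block may come back empty when fewer processors suffice; a version that accounts for communication between adjacent modules would add a cost term at each cut.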
Optimizing CMS build infrastructure via Apache Mesos

NASA Astrophysics Data System (ADS)

Abdurachmanov, David; Degano, Alessandro; Elmer, Peter; Eulisse, Giulio; Mendez, David; Muzaffar, Shahzad

2015-12-01

The Offline Software of the CMS Experiment at the Large Hadron Collider (LHC) at CERN consists of 6M lines of in-house code, developed over a decade by nearly 1000 physicists, as well as a comparable amount of general use open-source code. A critical ingredient to the success of the construction and early operation of the WLCG was the convergence, around the year 2000, on the use of a homogeneous environment of commodity x86-64 processors and Linux. Apache Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. It can run Hadoop, Jenkins, Spark, Aurora, and other applications on a dynamically shared pool of nodes. We present how we migrated our continuous integration system to schedule jobs on a relatively small Apache Mesos enabled cluster and how this resulted in better resource usage, higher peak performance and lower latency thanks to the dynamic scheduling capabilities of Mesos.
Implementation of Parallel Dynamic Simulation on Shared-Memory vs. Distributed-Memory Environments

DOE Office of Scientific and Technical Information (OSTI.GOV)

Jin, Shuangshuang; Chen, Yousu; Wu, Di

2015-12-09

Power system dynamic simulation computes the system response to a sequence of large disturbances, such as sudden changes in generation or load, or a network short circuit followed by protective branch switching operation. It consists of a large set of differential and algebraic equations, which is computationally intensive and challenging to solve using a single-processor based dynamic simulation solution. High-performance computing (HPC) based parallel computing is a very promising technology to speed up the computation and facilitate the simulation process. This paper presents two different parallel implementations of power grid dynamic simulation using Open Multi-processing (OpenMP) on a shared-memory platform, and Message Passing Interface (MPI) on distributed-memory clusters, respectively. The differences between the parallel simulation algorithms and architectures of the two HPC technologies are illustrated, and their performances for running parallel dynamic simulation are compared and demonstrated.
Trust-Management, Intrusion-Tolerance, Accountability, and Reconstitution Architecture (TIARA)

DTIC Science & Technology

2009-12-01

Tainting, tagged, metadata, architecture, hardware, processor, microkernel, zero-kernel, co-design 16. SECURITY CLASSIFICATION OF: 17. LIMITATION OF... microkernels (e.g., [27]) embraced the idea that it was beneficial to reduce the kernel, separating out services as separate processes isolated from...limited adoption. More recently Tanenbaum [72] notes the security virtues of microkernels and suggests the modern importance of security makes it
Efficiency of static core turn-off in a system-on-a-chip with variation

DOEpatents

Cher, Chen-Yong; Coteus, Paul W; Gara, Alan; Kursun, Eren; Paulsen, David P; Schuelke, Brian A; Sheets, II, John E; Tian, Shurong

2013-10-29

A processor-implemented method for improving efficiency of a static core turn-off in a multi-core processor with variation, the method comprising: conducting via a simulation a turn-off analysis of the multi-core processor at the multi-core processor's design stage, wherein the turn-off analysis of the multi-core processor at the multi-core processor's design stage includes a first output corresponding to a first multi-core processor core to turn off; conducting a turn-off analysis of the multi-core processor at the multi-core processor's testing stage, wherein the turn-off analysis of the multi-core processor at the multi-core processor's testing stage includes a second output corresponding to a second multi-core processor core to turn off; comparing the first output and the second output to determine if the first output is referring to the same core to turn off as the second output; outputting a third output corresponding to the first multi-core processor core if the first output and the second output are both referring to the same core to turn off.
Contaminant Permeation in the Ionomer-Membrane Water Processor (IWP) System

NASA Technical Reports Server (NTRS)

Kelsey, Laura K.; Finger, Barry W.; Pasadilla, Patrick; Perry, Jay

2016-01-01

The Ionomer-membrane Water Processor (IWP) is a patented membrane-distillation based urine brine water recovery system. The unique properties of the IWP membrane pair limit contaminant permeation from the brine to the recovered water and purge gas. A paper study was conducted to predict volatile trace contaminant permeation in the IWP system. Testing of a large-scale IWP Engineering Development Unit (EDU) with urine brine pretreated with the International Space Station (ISS) pretreatment formulation was then conducted to collect air and water samples for quality analysis. Distillate water quality and purge air GC-MS results are presented and compared to predictions, along with implications for the IWP brine processing system.
PIFEX: An advanced programmable pipelined-image processor

NASA Technical Reports Server (NTRS)

Gennery, D. B.; Wilcox, B.

1985-01-01

PIFEX is a pipelined-image processor being built in the JPL Robotics Lab. It will operate on digitized raster-scanned images (at 60 frames per second for images up to about 300 by 400 and at lesser rates for larger images), performing a variety of operations simultaneously under program control. It thus is a powerful, flexible tool for image processing and low-level computer vision. It also has applications in other two-dimensional problems such as route planning for obstacle avoidance and the numerical solution of two-dimensional partial differential equations (although its low numerical precision limits its use in the latter field). The concept and design of PIFEX are described herein, and some examples of its use are given.
An efficient optical architecture for sparsely connected neural networks

NASA Technical Reports Server (NTRS)

Hine, Butler P., III; Downie, John D.; Reid, Max B.

1990-01-01

An architecture for a general-purpose optical neural network processor is presented in which the interconnections and weights are formed by directing coherent beams holographically, thereby making use of the space-bandwidth products of the recording medium for sparsely interconnected networks more efficiently than the commonly used vector-matrix multiplier, since all of the hologram area is in use. An investigation is made of the use of computer-generated holograms recorded on such updatable media as thermoplastic materials, in order to define the interconnections and weights of a neural network processor; attention is given to limits on interconnection densities, diffraction efficiencies, and weighting accuracies possible with such an updatable thin film holographic device.
Estimating water flow through a hillslope using the massively parallel processor

NASA Technical Reports Server (NTRS)

Devaney, Judy E.; Camillo, P. J.; Gurney, R. J.

1988-01-01

A new two-dimensional model of water flow in a hillslope has been implemented on the Massively Parallel Processor at the Goddard Space Flight Center. Flow in the soil both in the saturated and unsaturated zones, evaporation and overland flow are all modelled, and the rainfall rates are allowed to vary spatially. Previous models of this type had always been very limited computationally. This model takes less than a minute to model all the components of the hillslope water flow for a day. The model can now be used in sensitivity studies to specify which measurements should be taken and how accurate they should be to describe such flows for environmental studies.
General linear codes for fault-tolerant matrix operations on processor arrays

NASA Technical Reports Server (NTRS)

Nair, V. S. S.; Abraham, J. A.

1988-01-01

Various checksum codes have been suggested for fault-tolerant matrix computations on processor arrays. Use of these codes is limited due to potential roundoff and overflow errors. Numerical errors may also be misconstrued as errors due to physical faults in the system. Here, a set of linear codes is identified which can be used for fault-tolerant matrix operations such as matrix addition, multiplication, transposition, and LU-decomposition, with minimum numerical error. Encoding schemes are given for some of the example codes which fall under the general set of codes. With the help of experiments, a rule of thumb for the selection of a particular code for a given application is derived.
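For readers unfamiliar with checksum-based fault tolerance, the sketch below shows the plain sum-checksum scheme for matrix multiplication that such work builds on. It is not the general linear codes of this abstract. Integer matrices are used so that the roundoff problem the abstract raises does not arise, and all function names are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two lists-of-lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def with_column_checksum(a):
    """Append a row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append to each row of b the sum of that row."""
    return [row + [sum(row)] for row in b]


def inconsistent(c_full):
    """Return (bad_rows, bad_cols): data rows/columns whose sums disagree
    with the stored checksums of a full-checksum matrix."""
    n_rows = len(c_full) - 1
    n_cols = len(c_full[0]) - 1
    bad_rows = [i for i in range(n_rows)
                if sum(c_full[i][:n_cols]) != c_full[i][n_cols]]
    bad_cols = [j for j in range(n_cols)
                if sum(c_full[i][j] for i in range(n_rows)) != c_full[n_rows][j]]
    return bad_rows, bad_cols


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    # Multiplying the encoded operands yields the product with both checksums.
    c_full = matmul(with_column_checksum(a), with_row_checksum(b))
    print(inconsistent(c_full))   # no fault
    c_full[1][0] += 1             # inject a single faulty element
    print(inconsistent(c_full))   # the bad row and column locate the fault
```

A single faulty data element shows up as exactly one inconsistent row and one inconsistent column, whose intersection locates it. With floating-point data the equality tests would need a tolerance, which is where roundoff can be mistaken for a fault.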

Recovery Act - CAREER: Sustainable Silicon -- Energy-Efficient VLSI Interconnect for Extreme-Scale Computing

DOE Office of Scientific and Technical Information (OSTI.GOV)

Chiang, Patrick

2014-01-31

The research goal of this CAREER proposal is to develop energy-efficient, VLSI interconnect circuits and systems that will facilitate future massively-parallel, high-performance computing. Extreme-scale computing will exhibit massive parallelism on multiple vertical levels, from thousands of computational units on a single processor to thousands of processors in a single data center. Unfortunately, the energy required to communicate between these units at every level (on chip, off-chip, off-rack) will be the critical limitation to energy efficiency. Therefore, the PI's career goal is to become a leading researcher in the design of energy-efficient VLSI interconnect for future computing systems.
78 FR 40696 - Proposed Information Collection; Comment Request; Alaska Crab Cost Recovery

Federal Register 2010, 2011, 2012, 2013, 2014

2013-07-08

... Collection; Comment Request; Alaska Crab Cost Recovery AGENCY: National Oceanic and Atmospheric..., a limited access system that allocates BSAI Crab resources among harvesters, processors, and coastal communities. The intent of the Alaska Crab Cost Recovery is to [[Page 40697
Old PCs: Upgrade or Abandon?

ERIC Educational Resources Information Center

Perez, Ernest

1997-01-01

Examines the practical realities of upgrading Intel personal computers in libraries, considering budgets and technical personnel availability. Highlights include adding RAM; putting in faster processor chips, including clock multipliers; new hard disks; CD-ROM speed; motherboards and interface cards; cost limits and economic factors; and…
Calibrating thermal behavior of electronics

DOEpatents

Chainer, Timothy J.; Parida, Pritish R.; Schultz, Mark D.

2017-07-11

A method includes determining a relationship between indirect thermal data for a processor and a measured temperature associated with the processor, during a calibration process, obtaining the indirect thermal data for the processor during actual operation of the processor, and determining an actual significant temperature associated with the processor during the actual operation using the indirect thermal data for the processor during actual operation of the processor and the relationship.
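The two phases named in this abstract, calibration and then estimation during operation, can be sketched as follows. The abstract does not say what form the relationship takes; a straight-line least-squares fit is assumed here purely for illustration, and the sample readings and function names are hypothetical.

```python
def fit_line(proxy, measured):
    """Calibration phase: least-squares line measured ~ slope * proxy + intercept."""
    n = len(proxy)
    mean_x = sum(proxy) / n
    mean_y = sum(measured) / n
    sxx = sum((x - mean_x) ** 2 for x in proxy)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(proxy, measured))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_temperature(proxy_value, slope, intercept):
    """Operation phase: map an indirect reading to an estimated temperature."""
    return slope * proxy_value + intercept


if __name__ == "__main__":
    # Hypothetical calibration data: indirect readings vs. measured temperature (C).
    slope, intercept = fit_line([10, 20, 30, 40], [45, 55, 65, 75])
    print(estimate_temperature(25, slope, intercept))
```

If the true relationship were nonlinear, the same two-phase structure would apply with a different fitted model in place of the line.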
Calibrating thermal behavior of electronics

DOEpatents

Chainer, Timothy J.; Parida, Pritish R.; Schultz, Mark D.

2016-05-31

A method includes determining a relationship between indirect thermal data for a processor and a measured temperature associated with the processor, during a calibration process, obtaining the indirect thermal data for the processor during actual operation of the processor, and determining an actual significant temperature associated with the processor during the actual operation using the indirect thermal data for the processor during actual operation of the processor and the relationship.
Calibrating thermal behavior of electronics

DOEpatents

Chainer, Timothy J.; Parida, Pritish R.; Schultz, Mark D.

2017-01-03

A method includes determining a relationship between indirect thermal data for a processor and a measured temperature associated with the processor, during a calibration process, obtaining the indirect thermal data for the processor during actual operation of the processor, and determining an actual significant temperature associated with the processor during the actual operation using the indirect thermal data for the processor during actual operation of the processor and the relationship.
On the impact of communication complexity in the design of parallel numerical algorithms

NASA Technical Reports Server (NTRS)

Gannon, D.; Vanrosendale, J.

1984-01-01

This paper describes two models of the cost of data movement in parallel numerical algorithms. One model is a generalization of an approach due to Hockney, and is suitable for shared memory multiprocessors where each processor has vector capabilities. The other model is applicable to highly parallel nonshared memory MIMD systems. In the second model, algorithm performance is characterized in terms of the communication network design. Techniques used in VLSI complexity theory are also brought in, and algorithm independent upper bounds on system performance are derived for several problems that are important to scientific computation.
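For orientation, the basic two-parameter model usually attributed to Hockney can be sketched as follows. It expresses the time for an operation on n elements as (n + n_half) / r_inf, where r_inf is the asymptotic rate and n_half is the length at which half of that rate is reached. The parameter values are invented for illustration, and the paper's own generalization of this approach is not reproduced here.

```python
def hockney_time(n, r_inf, n_half):
    """Time to process n elements under the two-parameter model."""
    return (n + n_half) / r_inf

def hockney_rate(n, r_inf, n_half):
    """Achieved rate for n elements; approaches r_inf as n grows."""
    return n / hockney_time(n, r_inf, n_half)

# Hypothetical parameters: peak rate of 100 elements per time unit,
# half-performance length of 50 elements.
R_INF, N_HALF = 100.0, 50.0
for n in (10, 50, 1000):
    print(n, round(hockney_rate(n, R_INF, N_HALF), 2))
```

At n equal to n_half the achieved rate is exactly half of r_inf, which is what makes n_half a convenient measure of start-up overhead.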
Implementations of BLAST for parallel computers.

PubMed

Jülich, A

1995-02-01

The BLAST sequence comparison programs have been ported to a variety of parallel computers: the shared memory machine Cray Y-MP 8/864 and the distributed memory architectures Intel iPSC/860 and nCUBE. Additionally, the programs were ported to run on workstation clusters. We explain the parallelization techniques and consider the pros and cons of these methods. The BLAST programs are very well suited for parallelization for a moderate number of processors. We illustrate our results using the program blastp as an example. As input data for blastp, a 799 residue protein query sequence and the protein database PIR were used.
On the impact of communication complexity on the design of parallel numerical algorithms

NASA Technical Reports Server (NTRS)

Gannon, D. B.; Van Rosendale, J.

1984-01-01

This paper describes two models of the cost of data movement in parallel numerical algorithms. One model is a generalization of an approach due to Hockney, and is suitable for shared memory multiprocessors where each processor has vector capabilities. The other model is applicable to highly parallel nonshared memory MIMD systems. In this second model, algorithm performance is characterized in terms of the communication network design. Techniques used in VLSI complexity theory are also brought in, and algorithm-independent upper bounds on system performance are derived for several problems that are important to scientific computation.
Electroacoustic dewatering of food and other suspensions

DOE Office of Scientific and Technical Information (OSTI.GOV)

Kim, B.C.; Zelinski, M.S.; Criner, C.L.

1989-05-31

The food processing industry is a large user of energy for evaporative drying due to limited effectiveness of conventional mechanical dewatering machines. Battelle's Electroacoustic Dewatering (EAD) process improves the performance of mechanical dewatering machines by superimposing electric and ultrasonic fields. A two phase development program to demonstrate the benefits of EAD was carried out in cooperation with the food processing industry, the National Food Processors Association (NFPA) and two equipment vendors. In Phase I, laboratory scale studies were carried out on a variety of food suspensions. The process was scaled up to small commercial scale in Phase II. The technical feasibility of EAD for a variety of food materials, without adversely affecting the food properties, was successfully demonstrated during this phase, which is the subject of this report. Two Process Research Units (PRUs) were designed and built through joint efforts between Battelle and two equipment vendors. A 0.5-meter wide belt press was tested on apple mash, corn fiber, and corn gluten at sites provided by two food processors. A high speed citrus juice finisher (a hybrid form of screw press and centrifuge) was tested on orange pulp. These tests were carried out jointly by Battelle, equipment vendors, NFPA, and food processors. The apple and citrus juice products were analyzed by food processors and NFPA. 26 figs., 30 tabs.
Effective Vectorization with OpenMP 4.5

DOE Office of Scientific and Technical Information (OSTI.GOV)

Huber, Joseph N.; Hernandez, Oscar R.; Lopez, Matthew Graham

This paper describes how the Single Instruction Multiple Data (SIMD) model and its extensions in OpenMP work, and how these are implemented in different compilers. Modern processors are highly parallel computational machines which often include multiple processors capable of executing several instructions in parallel. Understanding SIMD and executing instructions in parallel allows the processor to achieve higher performance without increasing the power required to run it. SIMD instructions can significantly reduce the runtime of code by executing a single operation on large groups of data. The SIMD model is so integral to the processor's potential performance that, if SIMD is not utilized, less than half of the processor is ever actually used. Unfortunately, using SIMD instructions is a challenge in higher level languages because most programming languages do not have a way to describe them. Most compilers are capable of vectorizing code by using the SIMD instructions, but there are many code features important for SIMD vectorization that the compiler cannot determine at compile time. OpenMP attempts to solve this by extending the C++/C and Fortran programming languages with compiler directives that express SIMD parallelism. OpenMP is used to pass hints to the compiler about the code to be executed in SIMD. This is a key resource for making optimized code, but it does not change whether or not the code can use SIMD operations. However, in many cases critical functions are limited by a poor understanding of how SIMD instructions are actually implemented, as SIMD can be implemented through vector instructions or simultaneous multi-threading (SMT). We have found that it is often the case that code cannot be vectorized, or is vectorized poorly, because the programmer does not have sufficient knowledge of how SIMD instructions work.
50 CFR Table 40 to Part 679 - BSAI Halibut PSC Sideboard Limits for AFA Catcher/Processors and AFA Catcher Vessels

Code of Federal Regulations, 2010 CFR

2010-10-01

... CONSERVATION AND MANAGEMENT, NATIONAL OCEANIC AND ATMOSPHERIC ADMINISTRATION, DEPARTMENT OF COMMERCE (CONTINUED) FISHERIES OF THE EXCLUSIVE ECONOMIC ZONE OFF ALASKA Pt. 679, Table 40 Table 40 to Part 679—BSAI Halibut PSC...
Ways to estimate speeds for the purposes of air quality conformity analyses.

DOT National Transportation Integrated Search

2002-01-01

A speed post-processor refers to equations or lookup tables that can determine vehicle speeds on a particular roadway link using only the limited information available in a long-range planning model. An estimated link speed is usually based on volume...
The resolution of identity and chain of spheres approximations for the LPNO-CCSD singles Fock term

NASA Astrophysics Data System (ADS)

Izsák, Róbert; Hansen, Andreas; Neese, Frank

2012-10-01

In the present work, the RIJCOSX approximation, developed earlier for accelerating the SCF procedure, is applied to one of the limiting factors of LPNO-CCSD calculations: the evaluation of the singles Fock term. It turns out that the introduction of RIJCOSX in the evaluation of the closed shell LPNO-CCSD singles Fock term causes errors below the microhartree limit. If the proposed procedure is also combined with RIJCOSX in SCF, then a somewhat larger error occurs, but reaction energy errors will still remain negligible. The speedup for the singles Fock term only is about 9-10 fold for the largest basis set applied. For the case of Penicillin using the def2-QZVPP basis set, a single point energy evaluation takes 2 days 16 h on a single processor, leading to a total speedup of 2.6 as compared to a fully analytic calculation. Using eight processors, the same calculation takes only 14 h.
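The timings quoted above imply a parallel speedup and efficiency that the abstract does not state explicitly. The short calculation below derives them from the two reported wall times; note that this is distinct from the factor of 2.6, which compares against a fully analytic calculation rather than against the single-processor run.

```python
serial_hours = 2 * 24 + 16   # 2 days 16 h on a single processor
parallel_hours = 14          # same calculation on eight processors
processors = 8

speedup = serial_hours / parallel_hours
efficiency = speedup / processors
print(f"speedup = {speedup:.2f}, efficiency = {efficiency:.0%}")
```

With these figures the eight-processor run is roughly 4.6 times faster than the serial one, corresponding to a parallel efficiency of about 57%.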
Temporal processing asymmetries between the cerebral hemispheres: evidence and implications.

PubMed

Nicholls, M E

1996-07-01

This paper reviews a large body of research which has investigated the capacities of the cerebral hemispheres to process temporal information. This research includes clinical, non-clinical, and electrophysiological experimentation. On the whole, the research supports the notion of a left hemisphere advantage for temporal resolution. The existence of such an asymmetry demonstrates that cerebral lateralisation is not limited to the higher-order functions such as language. The capacity for the resolution of fine temporal events appears to play an important role in other left hemisphere functions which require a rapid sequential processor. The functions that are facilitated by such a processor include verbal, textual, and fine movement skills. The co-development of these functions with an efficient temporal processor can be accounted for with reference to a number of evolutionary scenarios. Physiological evidence favours a temporal processing mechanism located within the left temporal cortex. The function of this mechanism may be described in terms of intermittency or travelling moment models of temporal processing. The travelling moment model provides the most plausible account of the asymmetry.
Detection of Instrumental Drifts in the PEP II LER BPM System

DOE Office of Scientific and Technical Information (OSTI.GOV)

Wittmer, W.; Fisher, A.S.; Martin, D.J.

2007-11-07

During the last PEP-II run a major goal was to bring the Low-Energy Ring optics as close as possible to the design. A large number of BPMs exhibited sudden artificial jumps that interfered with this effort. The source of the majority of these jumps had been traced to the filter-isolator boxes (FIBs) near the BPM buttons. A systematic approach to find and repair the failing units had been developed and implemented. Despite this effort, the instrumental orbit jumps never completely disappeared. To trace the source of this behavior a test setup, using a spare Bergoz MX-BPM processor (kindly provided by SPEAR III at SSRL), was connected in parallel to various PEP-II BPM processors. In the course of these measurements a slow instrumental orbit drift was found which was clearly not induced by a moving positron beam. Based on the size of the system and the limited time before PEP-II closes in Oct. 2008, an accelerator improvement project was initiated to install BERGOZ BPM-MX processors close to all sextupoles.
Unified Compact ECC-AES Co-Processor with Group-Key Support for IoT Devices in Wireless Sensor Networks

PubMed Central

Castillo, Encarnación; López-Ramos, Juan A.; Morales, Diego P.

2018-01-01

Security is a critical challenge for the effective expansion of all new emerging applications in the Internet of Things paradigm. Therefore, it is necessary to define and implement different mechanisms for guaranteeing security and privacy of data interchanged within the multiple wireless sensor networks being part of the Internet of Things. However, in this context, low power and low area are required, limiting the resources available for security and thus hindering the implementation of adequate security protocols. Group keys can save resources and communications bandwidth, but should be combined with public key cryptography to be really secure. In this paper, a compact and unified co-processor for enabling Elliptic Curve Cryptography alongside the Advanced Encryption Standard with low area requirements and Group-Key support is presented. The designed co-processor allows securing wireless sensor networks independently of the communications protocols used. With an area occupancy of only 2101 LUTs over Spartan 6 devices from Xilinx, it requires 15% less area while achieving nearly 490% better performance when compared to cryptoprocessors with similar features in the literature. PMID:29337921
Multiphase complete exchange on Paragon, SP2 and CS-2

NASA Technical Reports Server (NTRS)

Bokhari, Shahid H.

1995-01-01

The overhead of interprocessor communication is a major factor in limiting the performance of parallel computer systems. The complete exchange is the severest communication pattern in that it requires each processor to send a distinct message to every other processor. This pattern is at the heart of many important parallel applications. On hypercubes, multiphase complete exchange has been developed and shown to provide optimal performance over varying message sizes. Most commercial multicomputer systems do not have a hypercube interconnect. However, they use special purpose hardware and dedicated communication processors to achieve very high performance communication and can be made to emulate the hypercube quite well. Multiphase complete exchange has been implemented on three contemporary parallel architectures: the Intel Paragon, IBM SP2 and Meiko CS-2. The essential features of these machines are described and their basic interprocessor communication overheads are discussed. The performance of multiphase complete exchange is evaluated on each machine. It is shown that the theoretical ideas developed for hypercubes are also applicable in practice to these machines and that multiphase complete exchange can lead to major savings in execution time over traditional solutions.
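To make the complete exchange pattern concrete, the sketch below simulates it for p processors using a simple pairing schedule in which processor i exchanges with partner i XOR k in step k. This is a commonly used schedule for hypercube-like machines and is shown only as an illustration of the pattern; it is not the multiphase algorithm evaluated in the paper, and the message contents are placeholders.

```python
def complete_exchange(p):
    """Simulate a complete exchange among p processors (p a power of two).

    In step k, processor i swaps one block with partner i XOR k, so after
    p - 1 steps every processor has received a distinct message from every
    other processor.
    """
    if p < 1 or p & (p - 1):
        raise ValueError("p must be a power of two")
    # outbox[i][j] is the distinct message processor i holds for processor j.
    outbox = [[f"msg {i}->{j}" for j in range(p)] for i in range(p)]
    inbox = [dict() for _ in range(p)]
    for k in range(1, p):
        for i in range(p):
            partner = i ^ k
            inbox[partner][i] = outbox[i][partner]
    return inbox

inbox = complete_exchange(8)
print(len(inbox[3]), inbox[3][5])
```

Each of the p - 1 steps pairs the processors off without conflicts, which is why the XOR schedule is a natural baseline against which more elaborate schemes are compared.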
Unified Compact ECC-AES Co-Processor with Group-Key Support for IoT Devices in Wireless Sensor Networks.

PubMed

Parrilla, Luis; Castillo, Encarnación; López-Ramos, Juan A; Álvarez-Bermejo, José A; García, Antonio; Morales, Diego P

2018-01-16

Security is a critical challenge for the effective expansion of all new emerging applications in the Internet of Things paradigm. Therefore, it is necessary to define and implement different mechanisms for guaranteeing security and privacy of data interchanged within the multiple wireless sensor networks being part of the Internet of Things. However, in this context, low power and low area are required, limiting the resources available for security and thus hindering the implementation of adequate security protocols. Group keys can save resources and communications bandwidth, but should be combined with public key cryptography to be really secure. In this paper, a compact and unified co-processor for enabling Elliptic Curve Cryptography alongside the Advanced Encryption Standard with low area requirements and Group-Key support is presented. The designed co-processor allows securing wireless sensor networks independently of the communications protocols used. With an area occupancy of only 2101 LUTs over Spartan 6 devices from Xilinx, it requires 15% less area while achieving nearly 490% better performance when compared to cryptoprocessors with similar features in the literature.
Methods and systems for providing reconfigurable and recoverable computing resources

NASA Technical Reports Server (NTRS)

Stange, Kent (Inventor); Hess, Richard (Inventor); Kelley, Gerald B (Inventor); Rogers, Randy (Inventor)

2010-01-01

A method for optimizing the use of digital computing resources to achieve reliability and availability of the computing resources is disclosed. The method comprises providing one or more processors with a recovery mechanism, the one or more processors executing one or more applications. A determination is made whether the one or more processors need to be reconfigured. A rapid recovery is employed to reconfigure the one or more processors when needed. A computing system that provides reconfigurable and recoverable computing resources is also disclosed. The system comprises one or more processors with a recovery mechanism, with the one or more processors configured to execute a first application, and an additional processor configured to execute a second application different than the first application. The additional processor is reconfigurable with rapid recovery such that the additional processor can execute the first application when one of the one or more processors fails.

FLY MPI-2: a parallel tree code for LSS

NASA Astrophysics Data System (ADS)

Becciani, U.; Comparato, M.; Antonuccio-Delogu, V.

2006-04-01

New version program summary. Program title: FLY 3.1. Catalogue identifier: ADSC_v2_0. Licensing provisions: yes. Program summary URL: http://cpc.cs.qub.ac.uk/summaries/ADSC_v2_0. Program obtainable from: CPC Program Library, Queen's University of Belfast, N. Ireland. No. of lines in distributed program, including test data, etc.: 158 172. No. of bytes in distributed program, including test data, etc.: 4 719 953. Distribution format: tar.gz. Programming language: Fortran 90, C. Computer: Beowulf cluster, PC, MPP systems. Operating system: Linux, Aix. RAM: 100M words. Catalogue identifier of previous version: ADSC_v1_0. Journal reference of previous version: Comput. Phys. Comm. 155 (2003) 159. Does the new version supersede the previous version?: yes. Nature of problem: FLY is a parallel collisionless N-body code for the calculation of the gravitational force. Solution method: FLY is based on the hierarchical oct-tree domain decomposition introduced by Barnes and Hut (1986). Reasons for the new version: The new version of FLY is implemented by using the MPI-2 standard: the distributed version 3.1 was developed by using the MPICH2 library on a PC Linux cluster. Today the FLY performance allows us to consider the FLY code among the most powerful parallel codes for tree N-body simulations. Another important new feature regards the availability of an interface with hydrodynamical Paramesh based codes. Simulations must follow a box large enough to accurately represent the power spectrum of fluctuations on very large scales so that we may hope to compare them meaningfully with real data. The number of particles then sets the mass resolution of the simulation, which we would like to make as fine as possible.
The idea to build an interface between two codes that have different and complementary cosmological tasks allows us to execute complex cosmological simulations with FLY, specialized for DM evolution, and a code specialized for hydrodynamical components that uses a Paramesh block structure. Summary of revisions: The parallel communication schema was totally changed. The new version adopts the MPICH2 library. Now FLY can be executed on all Unix systems having an MPI-2 standard library. The main data structure is declared in a module procedure of FLY (fly_h.F90 routine). FLY creates the MPI Window object for one-sided communication for all the shared arrays, with a call like the following: CALL MPI_WIN_CREATE(POS, SIZE, REAL8, MPI_INFO_NULL, MPI_COMM_WORLD, WIN_POS, IERR). The following main window objects are created: win_pos, win_vel, win_acc: particle positions, velocities and accelerations; win_pos_cell, win_mass_cell, win_quad, win_subp, win_grouping: cell positions, masses, quadrupole momenta, tree structure and grouping cells. Other windows are created for dynamic load balance and global counters. Restrictions: The program uses the leapfrog integrator schema, but this could be changed by the user. Unusual features: FLY uses the MPI-2 standard: the MPICH2 library on Linux systems was adopted. To run this version of FLY the working directory must be shared among all the processors that execute FLY. Additional comments: Full documentation for the program is included in the distribution in the form of a README file, a User Guide and a Reference manuscript. Running time: IBM Linux Cluster 1350, 512 nodes with 2 processors for each node and 2 GB RAM for each processor, at Cineca, was adopted to make performance tests. Processor type: Intel Xeon Pentium IV 3.0 GHz and 512 KB cache (128 nodes have Nocona processors). Internal Network: Myricom LAN Card "C" Version and "D" Version. Operating System: Linux SuSE SLES 8.
The code was compiled using the mpif90 compiler version 8.1 and with basic optimization options in order to have performances that could be useful compared with other generic clusters Processors
Rectangular Array Of Digital Processors For Planning Paths

NASA Technical Reports Server (NTRS)

Kemeny, Sabrina E.; Fossum, Eric R.; Nixon, Robert H.

1993-01-01

Prototype 24 x 25 rectangular array of asynchronous parallel digital processors rapidly finds best path across two-dimensional field, which could be patch of terrain traversed by robotic or military vehicle. Implemented as single-chip very-large-scale integrated circuit. Excepting processors on edges, each processor communicates with four nearest neighbors along paths representing travel to north, south, east, and west. Each processor contains delay generator in form of 8-bit ripple counter, preset to 1 of 256 possible values. Operation begins with choice of processor representing starting point. Transmits signals to nearest neighbor processors, which retransmits to other neighboring processors, and process repeats until signals propagated across entire field.
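The signal-propagation scheme described above can be mimicked in software. In the sketch below each grid cell holds a delay value, a cell retransmits to its four neighbors once its own delay has elapsed, and the earliest arrival time at each cell is tracked with a priority queue. Recording which neighbor delivered the first signal, and walking those links back to recover the path, is an assumption made for this illustration; the abstract does not say how the chip reads out the path. The 3 x 3 grid and its delays are hypothetical.

```python
import heapq

def propagate(delays, start):
    """Return earliest signal arrival time at every cell of a delay grid.

    delays[r][c] models the preset delay a cell adds before retransmitting
    to its north, south, east, and west neighbors.
    """
    rows, cols = len(delays), len(delays[0])
    arrival = {start: 0}
    came_from = {}
    heap = [(0, start)]
    while heap:
        t, (r, c) = heapq.heappop(heap)
        if t > arrival[(r, c)]:
            continue  # a faster signal already reached this cell
        fire = t + delays[r][c]  # cell retransmits after its own delay
        for dr, dc in ((-1, 0), (1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                if fire < arrival.get((nr, nc), float("inf")):
                    arrival[(nr, nc)] = fire
                    came_from[(nr, nc)] = (r, c)
                    heapq.heappush(heap, (fire, (nr, nc)))
    return arrival, came_from

def best_path(came_from, start, goal):
    """Walk back from goal to start along first-arrival links."""
    path = [goal]
    while path[-1] != start:
        path.append(came_from[path[-1]])
    return path[::-1]

# Hypothetical 3 x 3 terrain: large delays stand for costly cells.
grid = [[1, 9, 1],
        [1, 9, 1],
        [1, 1, 1]]
arrival, came_from = propagate(grid, (0, 0))
print(best_path(came_from, (0, 0), (0, 2)))
```

In this example the signal reaches the top-right cell by going around the high-delay column rather than through it, which is the behavior the hardware array exploits: the first signal to arrive has necessarily followed the cheapest route.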
VINE-A NUMERICAL CODE FOR SIMULATING ASTROPHYSICAL SYSTEMS USING PARTICLES. II. IMPLEMENTATION AND PERFORMANCE CHARACTERISTICS

DOE Office of Scientific and Technical Information (OSTI.GOV)

Nelson, Andrew F.; Wetzstein, M.; Naab, T.

2009-10-01

We continue our presentation of VINE. In this paper, we begin with a description of relevant architectural properties of the serial and shared memory parallel computers on which VINE is intended to run, and describe their influences on the design of the code itself. We continue with a detailed description of a number of optimizations made to the layout of the particle data in memory and to our implementation of a binary tree used to access that data for use in gravitational force calculations and searches for smoothed particle hydrodynamics (SPH) neighbor particles. We describe the modifications to the code necessary to obtain forces efficiently from special purpose 'GRAPE' hardware, the interfaces required to allow transparent substitution of those forces in the code instead of those obtained from the tree, and the modifications necessary to use both tree and GRAPE together as a fused GRAPE/tree combination. We conclude with an extensive series of performance tests, which demonstrate that the code can be run efficiently and without modification in serial on small workstations or in parallel using the OpenMP compiler directives on large-scale, shared memory parallel machines. We analyze the effects of the code optimizations and estimate that they improve its overall performance by more than an order of magnitude over that obtained by many other tree codes. Scaled parallel performance of the gravity and SPH calculations, together the most costly components of most simulations, is nearly linear up to at least 120 processors on moderate sized test problems using the Origin 3000 architecture, and to the maximum machine sizes available to us on several other architectures. At similar accuracy, performance of VINE, used in GRAPE-tree mode, is approximately a factor 2 slower than that of VINE, used in host-only mode. Further optimizations of the GRAPE/host communications could improve the speed by as much as a factor of 3, but have not yet been implemented in VINE.
Finally, we find that although parallel performance on small problems may reach a plateau beyond which more processors bring no additional speedup, performance never decreases, a factor important for running large simulations on many processors with individual time steps, where only a small fraction of the total particles require updates at any given moment.
The Differential Effect of Attentional Condition on Subsequent Vocabulary Development

ERIC Educational Resources Information Center

Mohammed, Halah Abdulelah; Majid, Norazman Abdul; Abdullah, Tina

2016-01-01

This study addressed the effect of attentional condition on subsequent vocabulary development from a different perspective, one that addressed several potential methodological issues of previous research based on the psycholinguistic notion of the second language learner as a limited capacity processor. The…
75 FR 9087 - Trade Adjustment Assistance for Farmers

Federal Register 2010, 2011, 2012, 2013, 2014

2010-03-01

... procedures by which producers of raw agricultural commodities can petition for certification, apply for... processors are eligible for program benefits. The purpose of TAA for Farmers is to assist producers of raw... specifically limits program benefits to producers of raw agricultural commodities. Length of Intensive Training...
50 CFR Table 41 to Part 679 - BSAI Crab PSC Sideboard Limits for AFA Catcher/Processors and AFA Catcher Vessels

Code of Federal Regulations, 2011 CFR

2011-10-01

... CONSERVATION AND MANAGEMENT, NATIONAL OCEANIC AND ATMOSPHERIC ADMINISTRATION, DEPARTMENT OF COMMERCE (CONTINUED... . . . Red king crab Zone 1 0.007 0.299 The PSC amount in number of animals available to trawl vessels in the...
50 CFR Table 41 to Part 679 - BSAI Crab PSC Sideboard Limits for AFA Catcher/Processors and AFA Catcher Vessels

Code of Federal Regulations, 2010 CFR

2010-10-01

... CONSERVATION AND MANAGEMENT, NATIONAL OCEANIC AND ATMOSPHERIC ADMINISTRATION, DEPARTMENT OF COMMERCE (CONTINUED... . . . Red king crab Zone 1 0.007 0.299 The PSC amount in number of animals available to trawl vessels in the...
50 CFR Table 41 to Part 679 - BSAI Crab PSC Sideboard Limits for AFA Catcher/Processors and AFA Catcher Vessels

Code of Federal Regulations, 2012 CFR

2012-10-01

... CONSERVATION AND MANAGEMENT, NATIONAL OCEANIC AND ATMOSPHERIC ADMINISTRATION, DEPARTMENT OF COMMERCE (CONTINUED... . . . Red king crab Zone 1 0.007 0.299 The PSC amount in number of animals available to trawl vessels in the...
50 CFR Table 41 to Part 679 - BSAI Crab PSC Sideboard Limits for AFA Catcher/Processors and AFA Catcher Vessels

Code of Federal Regulations, 2013 CFR

2013-10-01

... CONSERVATION AND MANAGEMENT, NATIONAL OCEANIC AND ATMOSPHERIC ADMINISTRATION, DEPARTMENT OF COMMERCE (CONTINUED... . . . Red king crab Zone 1 0.007 0.299 The PSC amount in number of animals available to trawl vessels in the...
50 CFR Table 41 to Part 679 - BSAI Crab PSC Sideboard Limits for AFA Catcher/Processors and AFA Catcher Vessels

Code of Federal Regulations, 2014 CFR

2014-10-01

... CONSERVATION AND MANAGEMENT, NATIONAL OCEANIC AND ATMOSPHERIC ADMINISTRATION, DEPARTMENT OF COMMERCE (CONTINUED... . . . Red king crab Zone 1 0.007 0.299 The PSC amount in number of animals available to trawl vessels in the...
Energy challenges in optical access and aggregation networks.

PubMed

Kilper, Daniel C; Rastegarfar, Houman

2016-03-06

Scalability is a critical issue for access and aggregation networks as they must support the growth in both the size of data capacity demands and the multiplicity of access points. The number of connected devices, the Internet of Things, is growing to the tens of billions. Prevailing communication paradigms are reaching physical limitations that make continued growth problematic. Challenges are emerging in electronic and optical systems and energy increasingly plays a central role. With the spectral efficiency of optical systems approaching the Shannon limit, increasing parallelism is required to support higher capacities. For electronic systems, as the density and speed increase, the total system energy, thermal density and energy per bit are moving into regimes that become impractical to support, for example requiring single-chip processor powers above the 100 W limit common today. We examine communication network scaling and energy use from the Internet core down to the computer processor core and consider implications for optical networks. Optical switching in data centres is identified as a potential model from which scalable access and aggregation networks for the future Internet, with the application of integrated photonic devices and intelligent hybrid networking, will emerge. © 2016 The Author(s).
Buffered coscheduling for parallel programming and enhanced fault tolerance

DOEpatents

Petrini, Fabrizio [Los Alamos, NM; Feng, Wu-chun [Los Alamos, NM

2006-01-31

A computer implemented method schedules processor jobs on a network of parallel machine processors or distributed system processors. Control information communications generated by each process performed by each processor during a defined time interval are accumulated in buffers, where adjacent time intervals are separated by strobe intervals for a global exchange of control information. A global exchange of the control information communications at the end of each defined time interval is performed during an intervening strobe interval so that each processor is informed by all of the other processors of the number of incoming jobs to be received by each processor in a subsequent time interval. The buffered coscheduling method of this invention also enhances the fault tolerance of a network of parallel machine processors or distributed system processors.
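The buffer-then-strobe idea can be sketched in a few lines. In the toy model below, each processor queues the destinations of the jobs it generated during a time interval, and the strobe step tallies how many incoming jobs every processor should expect in the next interval. The data layout and function name are hypothetical and stand in for whatever the patented method actually exchanges.

```python
from collections import Counter

def strobe_exchange(buffers):
    """Global exchange at a strobe: tally incoming counts per processor.

    buffers[src] lists the destination ranks that processor src queued
    during the preceding time interval.
    """
    incoming = Counter()
    for src, destinations in enumerate(buffers):
        for dst in destinations:
            incoming[dst] += 1
    return [incoming[rank] for rank in range(len(buffers))]

# Hypothetical interval on four processors.
buffers = [[1, 2], [2], [], [0, 2]]
print(strobe_exchange(buffers))
```

Because every processor learns its expected incoming load before the next interval starts, scheduling decisions can be made globally rather than message by message.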
Cache Energy Optimization Techniques For Modern Processors

DOE Office of Scientific and Technical Information (OSTI.GOV)

Mittal, Sparsh

2013-01-01

Modern multicore processors are employing large last-level caches; for example, Intel's E7-8800 processor uses a 24 MB L3 cache. Further, with each CMOS technology generation, leakage energy has been dramatically increasing and hence, leakage energy is expected to become a major source of energy dissipation, especially in last-level caches (LLCs). The conventional schemes of cache energy saving either aim at saving dynamic energy or are based on properties specific to first-level caches, and thus these schemes have limited utility for last-level caches. Further, several other techniques require offline profiling or per-application tuning and hence are not suitable for product systems. In this book, we present novel cache leakage energy saving schemes for single-core and multicore systems; desktop, QoS, real-time and server systems. Also, we present cache energy saving techniques for caches designed with both conventional SRAM devices and emerging non-volatile devices such as STT-RAM (spin-torque transfer RAM). We present software-controlled, hardware-assisted techniques which use dynamic cache reconfiguration to configure the cache to the most energy efficient configuration while keeping the performance loss bounded. To profile and test a large number of potential configurations, we utilize low-overhead, micro-architecture components, which can be easily integrated into modern processor chips. We adopt a system-wide approach to save energy to ensure that cache reconfiguration does not increase energy consumption of other components of the processor. We have compared our techniques with state-of-the-art techniques and have found that our techniques outperform them in terms of energy efficiency and other relevant metrics. The techniques presented in this book have important applications in improving energy-efficiency of higher-end embedded, desktop, QoS, real-time, server processors and multitasking systems.
This book is intended to be a valuable guide for both newcomers and veterans in the field of cache power management. It will help graduate students, CAD tool developers and designers in understanding the need of energy efficiency in modern computing systems. Further, it will be useful for researchers in gaining insights into algorithms and techniques for micro-architectural and system-level energy optimization using dynamic cache reconfiguration. We sincerely believe that the ``food for thought'' presented in this book will inspire the readers to develop even better ideas for designing ``green'' processors of tomorrow.« less
Flight design system level C requirements. Solid rocket booster and external tank impact prediction processors. [space transportation system]

NASA Technical Reports Server (NTRS)

Seale, R. H.

1979-01-01

The prediction of the SRB and ET impact areas requires six separate processors. The SRB impact prediction processor computes the impact areas and related trajectory data for each SRB element. Output from this processor is stored on a secure file accessible by the SRB impact plot processor, which generates the required plots. Similarly, the ET RTLS impact prediction processor and the ET RTLS impact plot processor generate the ET impact footprints for return-to-launch-site (RTLS) profiles. The ET nominal/AOA/ATO impact prediction processor and the ET nominal/AOA/ATO impact plot processor generate the ET impact footprints for non-RTLS profiles. The SRB and ET impact processors compute the size and shape of the impact footprints by tabular lookup in a stored footprint dispersion data base. The location of each footprint is determined by simulating a reference trajectory and computing the reference impact point location. To ensure consistency among all flight design system (FDS) users, much of the input required by these processors will be obtained from the FDS master data base.
On-chip programmable ultra-wideband microwave photonic phase shifter and true time delay unit.

PubMed

Burla, Maurizio; Cortés, Luis Romero; Li, Ming; Wang, Xu; Chrostowski, Lukas; Azaña, José

2014-11-01

We proposed and experimentally demonstrated an ultra-broadband on-chip microwave photonic processor that can operate both as RF phase shifter (PS) and true-time-delay (TTD) line, with continuous tuning. The processor is based on a silicon dual-phase-shifted waveguide Bragg grating (DPS-WBG) realized with a CMOS compatible process. We experimentally demonstrated the generation of delay up to 19.4 ps over 10 GHz instantaneous bandwidth and a phase shift of approximately 160° over the bandwidth 22-29 GHz. The available RF measurement setup ultimately limits the phase shifting demonstration as the device is capable of providing up to 300° phase shift for RF frequencies over a record bandwidth approaching 1 THz.
Compact time- and space-integrating SAR processor: design and development status

NASA Astrophysics Data System (ADS)

Haney, Michael W.; Levy, James J.; Christensen, Marc P.; Michael, Robert R., Jr.; Mock, Michael M.

1994-06-01

Progress toward a flight demonstration of the acousto-optic time- and space-integrating real-time SAR image formation processor program is reported. The concept overcomes the size and power consumption limitations of electronic approaches by using compact, rugged, and low-power analog optical signal processing techniques for the most computationally taxing portions of the SAR imaging problem. Flexibility and performance are maintained by the use of digital electronics for the critical low-complexity filter generation and output image processing functions. The results reported include tests of a laboratory version of the concept, a description of the compact optical design that will be implemented, and an overview of the electronic interface and controller modules of the flight-test system.
XVis: Visualization for the Extreme-Scale Scientific-Computation Ecosystem: Mid-year report FY17 Q2

DOE Office of Scientific and Technical Information (OSTI.GOV)

Moreland, Kenneth D.; Pugmire, David; Rogers, David

The XVis project brings together the key elements of research to enable scientific discovery at extreme scale. Scientific computing will no longer be purely about how fast computations can be performed. Energy constraints, processor changes, and I/O limitations necessitate significant changes in both the software applications used in scientific computation and the ways in which scientists use them. Components for modeling, simulation, analysis, and visualization must work together in a computational ecosystem, rather than working independently as they have in the past. This project provides the necessary research and infrastructure for scientific discovery in this new computational ecosystem by addressing four interlocking challenges: emerging processor technology, in situ integration, usability, and proxy analysis.
XVis: Visualization for the Extreme-Scale Scientific-Computation Ecosystem: Year-end report FY17.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Moreland, Kenneth D.; Pugmire, David; Rogers, David

The XVis project brings together the key elements of research to enable scientific discovery at extreme scale. Scientific computing will no longer be purely about how fast computations can be performed. Energy constraints, processor changes, and I/O limitations necessitate significant changes in both the software applications used in scientific computation and the ways in which scientists use them. Components for modeling, simulation, analysis, and visualization must work together in a computational ecosystem, rather than working independently as they have in the past. This project provides the necessary research and infrastructure for scientific discovery in this new computational ecosystem by addressing four interlocking challenges: emerging processor technology, in situ integration, usability, and proxy analysis.
XVis: Visualization for the Extreme-Scale Scientific-Computation Ecosystem. Mid-year report FY16 Q2

DOE Office of Scientific and Technical Information (OSTI.GOV)

Moreland, Kenneth D.; Sewell, Christopher; Childs, Hank

The XVis project brings together the key elements of research to enable scientific discovery at extreme scale. Scientific computing will no longer be purely about how fast computations can be performed. Energy constraints, processor changes, and I/O limitations necessitate significant changes in both the software applications used in scientific computation and the ways in which scientists use them. Components for modeling, simulation, analysis, and visualization must work together in a computational ecosystem, rather than working independently as they have in the past. This project provides the necessary research and infrastructure for scientific discovery in this new computational ecosystem by addressing four interlocking challenges: emerging processor technology, in situ integration, usability, and proxy analysis.
XVis: Visualization for the Extreme-Scale Scientific-Computation Ecosystem: Year-end report FY15 Q4.

DOE Office of Scientific and Technical Information (OSTI.GOV)

Moreland, Kenneth D.; Sewell, Christopher; Childs, Hank

The XVis project brings together the key elements of research to enable scientific discovery at extreme scale. Scientific computing will no longer be purely about how fast computations can be performed. Energy constraints, processor changes, and I/O limitations necessitate significant changes in both the software applications used in scientific computation and the ways in which scientists use them. Components for modeling, simulation, analysis, and visualization must work together in a computational ecosystem, rather than working independently as they have in the past. This project provides the necessary research and infrastructure for scientific discovery in this new computational ecosystem by addressing four interlocking challenges: emerging processor technology, in situ integration, usability, and proxy analysis.

Extending Automatic Parallelization to Optimize High-Level Abstractions for Multicore

DOE Office of Scientific and Technical Information (OSTI.GOV)

Liao, C; Quinlan, D J; Willcock, J J

2008-12-12

Automatic introduction of OpenMP for sequential applications has attracted significant attention recently because of the proliferation of multicore processors and the simplicity of using OpenMP to express parallelism for shared-memory systems. However, most previous research has only focused on C and Fortran applications operating on primitive data types. C++ applications using high-level abstractions, such as STL containers and complex user-defined types, are largely ignored due to the lack of research compilers that are readily able to recognize high-level object-oriented abstractions and leverage their associated semantics. In this paper, we automatically parallelize C++ applications using ROSE, a multiple-language source-to-source compiler infrastructure which preserves the high-level abstractions and gives us access to their semantics. Several representative parallelization candidate kernels are used to explore semantic-aware parallelization strategies for high-level abstractions, combined with extended compiler analyses. Those kernels include an array-based computation loop, a loop with task-level parallelism, and a domain-specific tree traversal. Our work extends the applicability of automatic parallelization to modern applications using high-level abstractions and exposes more opportunities to take advantage of multicore processors.
Parallel approach to incorporating face image information into dialogue processing

NASA Astrophysics Data System (ADS)

Ren, Fuji

2000-10-01

There are many kinds of so-called irregular expressions in natural dialogues. Even if the words of a conversation are the same, they can be interpreted differently depending on the speaker's feelings or facial expression. To understand dialogues well, a flexible dialogue processing system must infer the speaker's view properly. However, it is difficult to obtain the meaning of the speaker's sentences in various scenes using traditional methods. In this paper, a new approach for dialogue processing that incorporates information from the speaker's face is presented. We first divide conversation statements into several simple tasks. Second, we process each simple task using an independent processor. Third, we employ information from the speaker's face to estimate the view of the speakers and resolve ambiguities in dialogues. The approach presented in this paper can work efficiently, because independent processors run in parallel, writing partial results to a shared memory, incorporating partial results at appropriate points, and complementing each other. A parallel algorithm and a method for employing the face information in dialogue machine translation will be discussed, and some results will be included in this paper.
Advanced data management design for autonomous telerobotic systems in space using spaceborne symbolic processors

NASA Technical Reports Server (NTRS)

Goforth, Andre

1987-01-01

The use of computers in autonomous telerobots is reaching the point where advanced distributed processing concepts and techniques are needed to support the functioning of Space Station era telerobotic systems. Three major issues that have impact on the design of data management functions in a telerobot are covered. A design concept that incorporates an intelligent systems manager (ISM) running on a spaceborne symbolic processor (SSP) is also presented to address these issues. The first issue is the support of a system-wide control architecture or control philosophy. Salient features of two candidates are presented that impose constraints on data management design. The second issue is the role of data management in terms of system integration. This refers to providing shared or coordinated data processing and storage resources to a variety of telerobotic components such as vision, mechanical sensing, real-time coordinated multiple limb and end effector control, and planning and reasoning. The third issue is hardware that supports symbolic processing in conjunction with standard data I/O and numeric processing. An SSP that currently is seen to be technologically feasible and is being developed is described and used as a baseline in the design concept.
Massively Multithreaded Maxflow for Image Segmentation on the Cray XMT-2

PubMed Central

Bokhari, Shahid H.; Çatalyürek, Ümit V.; Gurcan, Metin N.

2014-01-01

SUMMARY Image segmentation is a very important step in the computerized analysis of digital images. The maxflow mincut approach has been successfully used to obtain minimum energy segmentations of images in many fields. Classical algorithms for maxflow in networks do not directly lend themselves to efficient parallel implementations on contemporary parallel processors. We present the results of an implementation of the Goldberg-Tarjan preflow-push algorithm on the Cray XMT-2 massively multithreaded supercomputer. This machine has hardware support for 128 threads in each physical processor, a uniformly accessible shared memory of up to 4 TB and hardware synchronization for each 64-bit word. It is thus well-suited to the parallelization of graph theoretic algorithms, such as preflow-push. We describe the implementation of the preflow-push code on the XMT-2 and present the results of timing experiments on a series of synthetically generated as well as real images. Our results indicate very good performance on large images and pave the way for practical applications of this machine architecture for image analysis in a production setting. The largest images we have run are 32000² pixels in size, which are well beyond the largest previously reported in the literature. PMID:25598745
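For readers unfamiliar with preflow-push, the following is a minimal sequential sketch of the general technique in Python. It is not the multithreaded XMT-2 code described in the abstract; the function name `max_flow_push_relabel`, the dense capacity-matrix input, and the FIFO selection rule are assumptions chosen for brevity.

```python
from collections import deque


def max_flow_push_relabel(capacity, source, sink):
    """Sequential FIFO preflow-push on a dense capacity matrix (illustrative only)."""
    n = len(capacity)
    flow = [[0] * n for _ in range(n)]
    height = [0] * n
    excess = [0] * n
    height[source] = n
    active = deque()

    # Initial preflow: saturate every edge leaving the source.
    for v in range(n):
        c = capacity[source][v]
        if c > 0:
            flow[source][v] = c
            flow[v][source] = -c
            excess[v] += c
            excess[source] -= c
            if v != sink:
                active.append(v)

    while active:
        u = active.popleft()
        # Discharge u: push along admissible edges, relabel when none exist.
        while excess[u] > 0:
            pushed = False
            for v in range(n):
                residual = capacity[u][v] - flow[u][v]
                if residual > 0 and height[u] == height[v] + 1:
                    delta = min(excess[u], residual)
                    flow[u][v] += delta
                    flow[v][u] -= delta
                    excess[u] -= delta
                    excess[v] += delta
                    # v just became active if its excess was zero before the push.
                    if v not in (source, sink) and excess[v] == delta:
                        active.append(v)
                    pushed = True
                    if excess[u] == 0:
                        break
            if not pushed:
                # Relabel: lift u just above its lowest residual neighbour.
                height[u] = 1 + min(
                    height[v] for v in range(n) if capacity[u][v] - flow[u][v] > 0
                )
    return excess[sink]
```

In a segmentation setting the pixels would be interior nodes and the source and sink would stand for the two labels; the per-node push and relabel steps are the operations a fine-grained parallel implementation would run concurrently.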
Optics Program Modified for Multithreaded Parallel Computing

NASA Technical Reports Server (NTRS)

Lou, John; Bedding, Dave; Basinger, Scott

2006-01-01

A powerful high-performance computer program for simulating and analyzing adaptive and controlled optical systems has been developed by modifying the serial version of the Modeling and Analysis for Controlled Optical Systems (MACOS) program to impart capabilities for multithreaded parallel processing on computing systems ranging from supercomputers down to Symmetric Multiprocessing (SMP) personal computers. The modifications included the incorporation of OpenMP, a portable and widely supported application interface software, that can be used to explicitly add multithreaded parallelism to an application program under a shared-memory programming model. OpenMP was applied to parallelize ray-tracing calculations, one of the major computing components in MACOS. Multithreading is also used in the diffraction propagation of light in MACOS based on pthreads [POSIX Thread, (where "POSIX" signifies a portable operating system for UNIX)]. In tests of the parallelized version of MACOS, the speedup in ray-tracing calculations was found to be linear, or proportional to the number of processors, while the speedup in diffraction calculations ranged from 50 to 60 percent, depending on the type and number of processors. The parallelized version of MACOS is portable, and, to the user, its interface is basically the same as that of the original serial version of MACOS.
Dynamic Load-Balancing for Distributed Heterogeneous Computing of Parallel CFD Problems

NASA Technical Reports Server (NTRS)

Ecer, A.; Chien, Y. P.; Boenisch, T.; Akay, H. U.

2000-01-01

The developed methodology is aimed at improving the efficiency of executing block-structured algorithms on parallel, distributed, heterogeneous computers. The basic approach of these algorithms is to divide the flow domain into many subdomains called blocks, and solve the governing equations over these blocks. The dynamic load-balancing problem is defined as the efficient distribution of the blocks among the available processors over a period of several hours of computations. In environments with computers of different architecture, operating systems, CPU speed, memory size, load, and network speed, balancing the loads and managing the communication between processors becomes crucial. Load-balancing software tools for mutually dependent parallel processes have been created to efficiently utilize an advanced computation environment and algorithms. These tools are dynamic in nature because of the changes in the computer environment during execution time. More recently, these tools were extended to a second operating system: NT. In this paper, the problems associated with this application will be discussed. Also, the developed algorithms were combined with the load-sharing capability of LSF to efficiently utilize workstation clusters for parallel computing. Finally, results will be presented on running ADPAC, a NASA-based code, to demonstrate the developed tools for dynamic load balancing.
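The block-distribution problem described above can be illustrated with a simple greedy heuristic. The abstract does not specify the authors' algorithm, so this Python sketch is an assumption: `assign_blocks`, the per-block cost values, and the relative processor speeds are hypothetical names and inputs, and communication cost is ignored.

```python
def assign_blocks(block_costs, speeds):
    """Greedily place blocks on heterogeneous processors (illustrative heuristic).

    block_costs: dict mapping block name -> estimated work units (hypothetical).
    speeds: list of relative processor speeds in work units per second (hypothetical).
    Returns (assignment, finish_times).
    """
    loads = [0.0] * len(speeds)
    assignment = {}
    # Place the most expensive blocks first, each on the processor
    # that would finish it earliest given its current load and speed.
    for block in sorted(block_costs, key=block_costs.get, reverse=True):
        cost = block_costs[block]
        best = min(range(len(speeds)), key=lambda p: (loads[p] + cost) / speeds[p])
        loads[best] += cost
        assignment[block] = best
    finish_times = [loads[p] / speeds[p] for p in range(len(speeds))]
    return assignment, finish_times
```

A dynamic scheme in the spirit of the abstract would re-run such an assignment whenever measured speeds or loads change during execution, rather than once at start-up.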
Proposed Political Federation of East African Countries: ’Benefit’ to Tanzania

DTIC Science & Technology

2010-03-01

minerals-gold, diamonds, tanzanite, coal, iron ores, nickels, natural gas, fertile agricultural and pasture land, forests and wildlife. In 2003...exports. The Musoma Dairy limited is a Tanzanian milk processor. The Musoma Dairy Limited complained that its exports had been denied entry into Kenya...by the Kenya Revenue Authority on the grounds that it failed the required qualifications of customs rules.60 The rules affecting the Musoma Dairy
Elementary and Middle School Children's Acceptance of Lower Calorie Flavored Milk as Measured by Milk Shipment and Participation in the National School Lunch Program

ERIC Educational Resources Information Center

Yon, Bethany A.; Johnson, Rachel K.

2014-01-01

Background: The United States Department of Agriculture's (USDA) new nutrition standards for school meals include sweeping changes that set upper limits on calories served and limit milk offerings to low fat or fat-free and, if flavored, only fat-free. Milk processors are lowering the calories in flavored milks. As changes to milk impact…
Electromagnetic versus electrical coupling of personal frequency modulation (FM) receivers to cochlear implant sound processors.

PubMed

Schafer, Erin C; Romine, Denise; Musgrave, Elizabeth; Momin, Sadaf; Huynh, Christy

2013-01-01

Previous research has suggested that electrically coupled frequency modulation (FM) systems substantially improved speech-recognition performance in noise in individuals with cochlear implants (CIs). However, there is limited evidence to support the use of electromagnetically coupled (neck loop) FM receivers with contemporary CI sound processors containing telecoils. The primary goal of this study was to compare speech-recognition performance in noise and subjective ratings of adolescents and adults using one of three contemporary CI sound processors coupled to electromagnetically and electrically coupled FM receivers from Oticon. A repeated-measures design was used to compare speech-recognition performance in noise and subjective ratings without and with the FM systems across three test sessions (Experiment 1) and to compare performance at different FM-gain settings (Experiment 2). Descriptive statistics were used in Experiment 3 to describe output differences measured through a CI sound processor. Experiment 1 included nine adolescents or adults with unilateral or bilateral Advanced Bionics Harmony (n = 3), Cochlear Nucleus 5 (n = 3), and MED-EL OPUS 2 (n = 3) CI sound processors. In Experiment 2, seven of the original nine participants were tested. In Experiment 3, electroacoustic output was measured from a Nucleus 5 sound processor when coupled to the electromagnetically coupled Oticon Arc neck loop and electrically coupled Oticon R2. In Experiment 1, participants completed a field trial with each FM receiver and three test sessions that included speech-recognition performance in noise and a subjective rating scale. In Experiment 2, participants were tested in three receiver-gain conditions. Results in both experiments were analyzed using repeated-measures analysis of variance. Experiment 3 involved electroacoustic-test measures to determine the monitor-earphone output of the CI alone and CI coupled to the two FM receivers. 
The results in Experiment 1 suggested that both FM receivers provided significantly better speech-recognition performance in noise than the CI alone; however, the electromagnetically coupled receiver provided significantly better speech-recognition performance in noise and better ratings in some situations than the electrically coupled receiver when set to the same gain. In Experiment 2, the primary analysis suggested significantly better speech-recognition performance in noise for the neck-loop versus electrically coupled receiver, but a second analysis, using the best performance across gain settings for each device, revealed no significant differences between the two FM receivers. Experiment 3 revealed monitor-earphone output differences in the Nucleus 5 sound processor for the two FM receivers when set to the +8 setting used in Experiment 1 but equal output when the electrically coupled device was set to a +16 gain setting and the electromagnetically coupled device was set to the +8 gain setting. Individuals with contemporary sound processors may show more favorable speech-recognition performance in noise with electromagnetically coupled FM systems (i.e., Oticon Arc), which is most likely related to the input processing and signal processing pathway within the CI sound processor for direct input versus telecoil input. Further research is warranted to replicate these findings with a larger sample size and to develop and validate a more objective approach to fitting FM systems to CI sound processors. American Academy of Audiology.
Chemistry and Biochemistry of Peanut Skins. Implications of Utilization

USDA-ARS's Scientific Manuscript database

Peanut shelling plants in the US produce thousands of tons of peanut skins each year. Currently, this material is considered a waste product with limited end uses and no real monetary value. Peanut skins were obtained from a regional peanut processor and subjected to several types of solvent ext...
Cactus: Writing an Article

ERIC Educational Resources Information Center

Hyde, Hartley; Spencer, Toby

2010-01-01

Some people became mathematics or science teachers by default. There was once such a limited range of subjects that students who could not write essays did mathematics and science. Computers changed that. Word processor software helped some people overcome huge spelling and grammar hurdles and made it easy to edit and manipulate text. Would-be…
Workload - An examination of the concept

NASA Technical Reports Server (NTRS)

Gopher, Daniel; Donchin, Emanuel

1986-01-01

The relations between task difficulty and workload and workload and performance are examined. The architecture and limitations of the central processor are discussed. Various procedures for measuring workload are described and evaluated. Consideration is given to normative and descriptive approaches; subjective, performance, and arousal measures; performance operating characteristics; and psychophysiological measures of workload.
The next generation of microbiological testing of poultry

USDA-ARS's Scientific Manuscript database

Microbiological testing of food products is a common practice of food processors to ensure compliance with food safety criteria. Sampling on its own is of limited value, but when applied regularly at different stages of the food chain, microbiology testing can be an integral part of a quality contr...
Limited Area Coverage/High Resolution Picture Transmission (LAC/HRPT) tape IJ grid pixel extraction processor user's manual

NASA Technical Reports Server (NTRS)

Obrien, S. O. (Principal Investigator)

1980-01-01

The program, LACREG, extracts all pixels that are contained in a specific IJ grid section. The pixels, along with a header record, are stored in a disk file defined by the user. The program will extract up to 99 IJ grid sections.
12 CFR 235.7 - Limitations on payment card restrictions.

Code of Federal Regulations, 2014 CFR

2014-01-01

... Section 235.7 Banks and Banking FEDERAL RESERVE SYSTEM (CONTINUED) BOARD OF GOVERNORS OF THE FEDERAL... restrictions. (a) Prohibition on network exclusivity—(1) In general. An issuer or payment card network shall not directly or through any agent, processor, or licensed member of a payment card network, by...
12 CFR 235.7 - Limitations on payment card restrictions.

Code of Federal Regulations, 2013 CFR

2013-01-01

... Section 235.7 Banks and Banking FEDERAL RESERVE SYSTEM (CONTINUED) BOARD OF GOVERNORS OF THE FEDERAL... restrictions. (a) Prohibition on network exclusivity—(1) In general. An issuer or payment card network shall not directly or through any agent, processor, or licensed member of a payment card network, by...
12 CFR 235.7 - Limitations on payment card restrictions.

Code of Federal Regulations, 2012 CFR

2012-01-01

... Section 235.7 Banks and Banking FEDERAL RESERVE SYSTEM (CONTINUED) BOARD OF GOVERNORS OF THE FEDERAL... restrictions. (a) Prohibition on network exclusivity—(1) In general. An issuer or payment card network shall not directly or through any agent, processor, or licensed member of a payment card network, by...
Network Interface Specification for the T1 Microprocessor

DTIC Science & Technology

1994-05-01

features data transfer directly to/from processor registers, hardware dispatch directly to Active Message handlers (along with limited context...Implementation Choices 9 3.1 Overview .................................... 9 3.2 Context ..................................... 10 3.3 Data Transfer...details of the data transfer functional units, interconnect structure, and network operation. Application Layer Communication Model Communication
21 CFR 123.7 - Corrective actions.

Code of Federal Regulations, 2010 CFR

2010-04-01

... 21 Food and Drugs 2 2010-04-01 2010-04-01 false Corrective actions. 123.7 Section 123.7 Food and... CONSUMPTION FISH AND FISHERY PRODUCTS General Provisions § 123.7 Corrective actions. (a) Whenever a deviation from a critical limit occurs, a processor shall take corrective action either by: (1) Following a...
Experiment in Onboard Synthetic Aperture Radar Data Processing

NASA Technical Reports Server (NTRS)

Holland, Matthew

2011-01-01

Single event upsets (SEUs) are a threat to any computing system running on hardware that has not been physically radiation hardened. In addition to mandating the use of performance-limited, hardened heritage equipment, prior techniques for dealing with the SEU problem often involved hardware-based error detection and correction (EDAC). With limited computing resources, software-based EDAC, or any more elaborate recovery methods, were often not feasible. Synthetic aperture radars (SARs), when operated in the space environment, are interesting due to their relevance to NASA's objectives, but problematic in the sense of producing prodigious amounts of raw data. Prior implementations of the SAR data processing algorithm have been too slow, too computationally intensive, and too demanding of application memory for onboard execution to be a realistic option when using the type of heritage processing technology described above. This standard C-language implementation of SAR data processing is distributed over many cores of a Tilera Multicore Processor, and employs novel Radiation Hardening by Software (RHBS) techniques designed to protect the component processes (one per core) and their shared application memory from the sort of SEUs expected in the space environment. The source code includes calls to Tilera APIs, and a specialized Tilera compiler is required to produce a Tilera executable. The compiled application reads input data describing the position and orientation of a radar platform, as well as its radar-burst data, over time and writes out processed data in a form that is useful for analysis of the radar observations.
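The abstract does not say which RHBS techniques are used. As a generic illustration of software-based protection against bit flips, the Python sketch below keeps three copies of each word and repairs them by bitwise majority vote. This is a common redundancy idea, not the method of the record above; `majority_vote` and `scrub` are hypothetical names.

```python
def majority_vote(a, b, c):
    """Bitwise majority of three redundant integer copies.

    Each result bit takes the value held by at least two of the three inputs,
    so a single flipped bit in any one copy is masked.
    """
    return (a & b) | (a & c) | (b & c)


def scrub(copies):
    """Return three repaired copies plus a flag saying whether a mismatch was seen.

    copies: sequence of exactly three integers that should be identical.
    """
    a, b, c = copies
    voted = majority_vote(a, b, c)
    corrupted = not (a == b == c)
    return [voted, voted, voted], corrupted
```

Periodically scrubbing shared memory in this way trades extra storage and CPU time for tolerance of isolated upsets; it cannot recover a bit that is flipped in two copies at once.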

Coding, testing and documentation of processors for the flight design system

NASA Technical Reports Server (NTRS)

1980-01-01

The general functional design and implementation of processors for a space flight design system are briefly described. Discussions of a basetime initialization processor; conic, analytical, and precision coasting flight processors; and an orbit lifetime processor are included. The functions of several utility routines are also discussed.
The computational structural mechanics testbed generic structural-element processor manual

NASA Technical Reports Server (NTRS)

Stanley, Gary M.; Nour-Omid, Shahram

1990-01-01

The usage and development of structural finite element processors based on the CSM Testbed's Generic Element Processor (GEP) template is documented. By convention, such processors have names of the form ESi, where i is an integer. This manual is therefore intended for both Testbed users who wish to invoke ES processors during the course of a structural analysis, and Testbed developers who wish to construct new element processors (or modify existing ones).
Highly parallel reconfigurable computer architecture for robotic computation having plural processor cells each having right and left ensembles of plural processors

NASA Technical Reports Server (NTRS)

Fijany, Amir (Inventor); Bejczy, Antal K. (Inventor)

1994-01-01

In a computer having a large number of single-instruction multiple data (SIMD) processors, each of the SIMD processors has two sets of three individual processor elements controlled by a master control unit and interconnected among a plurality of register file units where data is stored. The register files input and output data in synchronism with a minor cycle clock under control of two slave control units controlling the register file units connected to respective ones of the two sets of processor elements. Depending upon which ones of the register file units are enabled to store or transmit data during a particular minor clock cycle, the processor elements within an SIMD processor are connected in rings or in pipeline arrays, and may exchange data with the internal bus or with neighboring SIMD processors through interface units controlled by respective ones of the two slave control units.
System and method for representing and manipulating three-dimensional objects on massively parallel architectures

DOEpatents

Karasick, Michael S.; Strip, David R.

1996-01-01

A parallel computing system is described that comprises a plurality of uniquely labeled, parallel processors, each processor capable of modelling a three-dimensional object that includes a plurality of vertices, faces and edges. The system comprises a front-end processor for issuing a modelling command to the parallel processors, relating to a three-dimensional object. Each parallel processor, in response to the command and through the use of its own unique label, creates a directed-edge (d-edge) data structure that uniquely relates an edge of the three-dimensional object to one face of the object. Each d-edge data structure at least includes vertex descriptions of the edge and a description of the one face. As a result, each processor, in response to the modelling command, operates upon a small component of the model and generates results, in parallel with all other processors, without the need for processor-to-processor intercommunication.
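The directed-edge idea can be sketched as a small record type plus a function that derives one d-edge from a processor label alone. The Python below is a hedged illustration of that description, not the patented implementation; `DEdge`, `build_dedge`, and the face-list input format are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DEdge:
    """One directed edge: its two vertex descriptions and the single face it belongs to."""
    tail: tuple
    head: tuple
    face: int


def build_dedge(label, vertices, faces):
    """Compute the d-edge owned by the processor with this label.

    vertices: list of (x, y, z) tuples.
    faces: list of vertex-index tuples, each listed in a consistent winding order.
    The label is mapped to (face, position within face) without consulting
    any other processor's result.
    """
    offset = label
    for face_id, face in enumerate(faces):
        if offset < len(face):
            tail = vertices[face[offset]]
            head = vertices[face[(offset + 1) % len(face)]]
            return DEdge(tail, head, face_id)
        offset -= len(face)
    raise IndexError("label exceeds the number of directed edges")
```

For a tetrahedron with four triangular faces, labels 0 through 11 each yield a distinct d-edge, and every geometric edge appears twice with opposite direction, once for each adjacent face.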
Switch for serial or parallel communication networks

DOEpatents

Crosette, D.B.

1994-07-19

A communication switch apparatus and a method for use in a geographically extensive serial, parallel or hybrid communication network linking a multi-processor or parallel processing system has a very low software processing overhead in order to accommodate random bursts of high density data. Associated with each processor is a communication switch. A data source and a data destination, a sensor suite or robot for example, may also be associated with a switch. The configuration of the switches in the network is coordinated through a master processor node and depends on the operational phase of the multi-processor network: data acquisition, data processing, and data exchange. The master processor node passes information on the state to be assumed by each switch to the processor node associated with the switch. The processor node then operates a series of multi-state switches internal to each communication switch. The communication switch does not parse and interpret communication protocol and message routing information. During a data acquisition phase, the communication switch couples sensors producing data to the processor node associated with the switch, to a downlink destination on the communications network, or to both. It also may couple an uplink data source to its processor node. During the data exchange phase, the switch couples its processor node or an uplink data source to a downlink destination (which may include a processor node or a robot), or couples an uplink source to its processor node and its processor node to a downlink destination. 9 figs.
Switch for serial or parallel communication networks

DOEpatents

Crosette, Dario B.

1994-01-01

A communication switch apparatus and a method for use in a geographically extensive serial, parallel or hybrid communication network linking a multi-processor or parallel processing system has a very low software processing overhead in order to accommodate random bursts of high density data. Associated with each processor is a communication switch. A data source and a data destination, a sensor suite or robot for example, may also be associated with a switch. The configuration of the switches in the network is coordinated through a master processor node and depends on the operational phase of the multi-processor network: data acquisition, data processing, and data exchange. The master processor node passes information on the state to be assumed by each switch to the processor node associated with the switch. The processor node then operates a series of multi-state switches internal to each communication switch. The communication switch does not parse and interpret communication protocol and message routing information. During a data acquisition phase, the communication switch couples sensors producing data to the processor node associated with the switch, to a downlink destination on the communications network, or to both. It also may couple an uplink data source to its processor node. During the data exchange phase, the switch couples its processor node or an uplink data source to a downlink destination (which may include a processor node or a robot), or couples an uplink source to its processor node and its processor node to a downlink destination.
Parallel hyperbolic PDE simulation on clusters: Cell versus GPU

NASA Astrophysics Data System (ADS)

Rostrup, Scott; De Sterck, Hans

2010-12-01

Increasingly, high-performance computing is looking towards data-parallel computational devices to enhance computational performance. Two technologies that have received significant attention are IBM's Cell Processor and NVIDIA's CUDA programming model for graphics processing unit (GPU) computing. In this paper we investigate the acceleration of parallel hyperbolic partial differential equation simulation on structured grids with explicit time integration on clusters with Cell and GPU backends. The message passing interface (MPI) is used for communication between nodes at the coarsest level of parallelism. Optimizations of the simulation code at the several finer levels of parallelism that the data-parallel devices provide are described in terms of data layout, data flow and data-parallel instructions. Optimized Cell and GPU performance are compared with reference code performance on a single x86 central processing unit (CPU) core in single and double precision. We further compare the CPU, Cell and GPU platforms on a chip-to-chip basis, and compare performance on single cluster nodes with two CPUs, two Cell processors or two GPUs in a shared memory configuration (without MPI). We finally compare performance on clusters with 32 CPUs, 32 Cell processors, and 32 GPUs using MPI. Our GPU cluster results use NVIDIA Tesla GPUs with GT200 architecture, but some preliminary results on recently introduced NVIDIA GPUs with the next-generation Fermi architecture are also included. This paper provides computational scientists and engineers who are considering porting their codes to accelerator environments with insight into how structured grid based explicit algorithms can be optimized for clusters with Cell and GPU accelerators. It also provides insight into the speed-up that may be gained on current and future accelerator architectures for this class of applications. 
Program summary
Program title: SWsolver
Catalogue identifier: AEGY_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEGY_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: GPL v3
No. of lines in distributed program, including test data, etc.: 59 168
No. of bytes in distributed program, including test data, etc.: 453 409
Distribution format: tar.gz
Programming language: C, CUDA
Computer: Parallel Computing Clusters. Individual compute nodes may consist of x86 CPU, Cell processor, or x86 CPU with attached NVIDIA GPU accelerator.
Operating system: Linux
Has the code been vectorised or parallelized?: Yes. Tested on 1-128 x86 CPU cores, 1-32 Cell Processors, and 1-32 NVIDIA GPUs.
RAM: Tested on problems requiring up to 4 GB per compute node.
Classification: 12
External routines: MPI, CUDA, IBM Cell SDK
Nature of problem: MPI-parallel simulation of Shallow Water equations using high-resolution 2D hyperbolic equation solver on regular Cartesian grids for x86 CPU, Cell Processor, and NVIDIA GPU using CUDA.
Solution method: SWsolver provides 3 implementations of a high-resolution 2D Shallow Water equation solver on regular Cartesian grids, for CPU, Cell Processor, and NVIDIA GPU. Each implementation uses MPI to divide work across a parallel computing cluster.
Additional comments: Sub-program numdiff is used for the test run.
List-mode PET image reconstruction for motion correction using the Intel XEON PHI co-processor

NASA Astrophysics Data System (ADS)

Ryder, W. J.; Angelis, G. I.; Bashar, R.; Gillam, J. E.; Fulton, R.; Meikle, S.

2014-03-01

List-mode image reconstruction with motion correction is computationally expensive, as it requires projection of hundreds of millions of rays through a 3D array. To decrease reconstruction time it is possible to use symmetric multiprocessing computers or graphics processing units. The former can have high financial costs, while the latter can require refactoring of algorithms. The Xeon Phi is a new co-processor card with a Many Integrated Core architecture that can run 4 multiple-instruction, multiple-data threads per core, with each thread having a 512-bit single-instruction, multiple-data vector register. Thus, it is possible to run in the region of 220 threads simultaneously. The aim of this study was to investigate whether the Xeon Phi co-processor card is a viable alternative to an x86 Linux server for accelerating list-mode PET image reconstruction for motion correction. An existing list-mode image reconstruction algorithm with motion correction was ported to run on the Xeon Phi co-processor with the multi-threading implemented using pthreads. There were no differences between images reconstructed using the Phi co-processor card and images reconstructed using the same algorithm run on a Linux server. However, it was found that the reconstruction runtimes were 3 times greater for the Phi than the server. A new version of the image reconstruction algorithm was developed in C++ using OpenMP for multi-threading and the Phi runtimes decreased to 1.67 times that of the host Linux server. Data transfer from the host to co-processor card was found to be a rate-limiting step; this needs to be carefully considered in order to maximize runtime speeds. When considering the purchase price of a Linux workstation with Xeon Phi co-processor card and top of the range Linux server, the former is a cost-effective computation resource for list-mode image reconstruction.
A multi-Phi workstation could be a viable alternative to cluster computers at a lower cost for medical imaging applications.
Novel processor architecture for onboard infrared sensors

NASA Astrophysics Data System (ADS)

Hihara, Hiroki; Iwasaki, Akira; Tamagawa, Nobuo; Kuribayashi, Mitsunobu; Hashimoto, Masanori; Mitsuyama, Yukio; Ochi, Hiroyuki; Onodera, Hidetoshi; Kanbara, Hiroyuki; Wakabayashi, Kazutoshi; Tada, Munehiro

2016-09-01

Infrared sensor systems are a major concern for interplanetary missions that investigate the nature and formation processes of planets and asteroids. An infrared sensor system requires signal preprocessing functions that compensate for the intensity of infrared image sensors in order to obtain high-quality data and a high compression ratio over the limited capacity of transmission channels to ground stations. For those implementations, combinations of Field Programmable Gate Arrays (FPGAs) and microprocessors are employed by AKATSUKI, the Venus Climate Orbiter, and HAYABUSA2, the asteroid probe. On the other hand, much smaller size and lower power consumption are demanded for future missions to accommodate more sensors. To fulfill this future demand, we developed a novel processor architecture which consists of reconfigurable cluster cores and programmable-logic cells with complementary atom switches. The complementary atom switches enable hardware programming without configuration memories, and thus soft errors on logic circuit connections are completely eliminated. This is a noteworthy advantage for space applications which cannot be found in conventional re-writable FPGAs. Power consumption is expected to be almost one-tenth that of conventional re-writable FPGAs because of the elimination of configuration memories. The proposed processor architecture can be reconfigured by behavioral synthesis from a higher-level language specification. Consequently, compensation functions are implemented in a single chip without the program memories that accompany conventional microprocessors, while maintaining comparable performance. This enables us to embed a processor element on each infrared signal detector output channel.
Parallelized Kalman-Filter-Based Reconstruction of Particle Tracks on Many-Core Processors and GPUs

DOE Office of Scientific and Technical Information (OSTI.GOV)

Cerati, Giuseppe; Elmer, Peter; Krutelyov, Slava

2017-01-01

For over a decade now, physical and energy constraints have limited clock speed improvements in commodity microprocessors. Instead, chipmakers have been pushed into producing lower-power, multi-core processors such as Graphical Processing Units (GPU), ARM CPUs, and Intel MICs. Broad-based efforts from manufacturers and developers have been devoted to making these processors user-friendly enough to perform general computations. However, extracting performance from a larger number of cores, as well as specialized vector or SIMD units, requires special care in algorithm design and code optimization. One of the most computationally challenging problems in high-energy particle experiments is finding and fitting the charged-particle tracks during event reconstruction. This is expected to become by far the dominant problem at the High-Luminosity Large Hadron Collider (HL-LHC), for example. Today the most common track finding methods are those based on the Kalman filter. Experience with Kalman techniques on real tracking detector systems has shown that they are robust and provide high physics performance. This is why they are currently in use at the LHC, both in the trigger and offline. Previously we reported on the significant parallel speedups that resulted from our investigations to adapt Kalman filters to track fitting and track building on Intel Xeon and Xeon Phi. Here, we discuss our progress toward the understanding of these processors and the new developments to port the Kalman filter to NVIDIA GPUs.
Linear Spectral Analysis of Plume Emissions Using an Optical Matrix Processor

NASA Technical Reports Server (NTRS)

Gary, C. K.

1992-01-01

Plume spectrometry provides a means to monitor the health of a burning rocket engine, and optical matrix processors provide a means to analyze the plume spectra in real time. By observing the spectrum of the exhaust plume of a rocket engine, researchers have detected anomalous behavior of the engine and have even determined the failure of some equipment before it would normally have been noticed. The spectrum of the plume is analyzed by isolating information in the spectrum about the various materials present to estimate what materials are being burned in the engine. Scientists at the Marshall Space Flight Center (MSFC) have implemented a high resolution spectrometer to discriminate the spectral peaks of the many species present in the plume. Researchers at the Stennis Space Center Demonstration Testbed Facility (DTF) have implemented a high resolution spectrometer observing a 1200-lb. thrust engine. At this facility, known concentrations of contaminants can be introduced into the burn, allowing for the confirmation of diagnostic algorithms. While the high resolution of the measured spectra has allowed greatly increased insight into the functioning of the engine, the large data flows generated limit the ability to perform real-time processing. The use of an optical matrix processor and the linear analysis technique described below may allow for the detailed real-time analysis of the engine's health. A small optical matrix processor can perform the required mathematical analysis both quicker and with less energy than a large electronic computer dedicated to the same spectral analysis routine.
Feasibility of optically interconnected parallel processors using wavelength division multiplexing

DOE Office of Scientific and Technical Information (OSTI.GOV)

Deri, R.J.; De Groot, A.J.; Haigh, R.E.

1996-03-01

New national security demands require enhanced computing systems for nearly ab initio simulations of extremely complex systems and analyzing unprecedented quantities of remote sensing data. This computational performance is being sought using parallel processing systems, in which many less powerful processors are ganged together to achieve high aggregate performance. Such systems require increased capability to communicate information between individual processor and memory elements. As it is likely that the limited performance of today's electronic interconnects will prevent the system from achieving its ultimate performance, there is great interest in using fiber optic technology to improve interconnect communication. However, little information is available to quantify the requirements on fiber optical hardware technology for this application. Furthermore, we have sought to explore interconnect architectures that use the complete communication richness of the optical domain rather than using optics as a simple replacement for electronic interconnects. These considerations have led us to study the performance of a moderate size parallel processor with optical interconnects using multiple optical wavelengths. We quantify the bandwidth, latency, and concurrency requirements which allow a bus-type interconnect to achieve scalable computing performance using up to 256 nodes, each operating at GFLOP performance. Our key conclusion is that scalable performance, to approximately 150 GFLOPS, is achievable for several scientific codes using an optical bus with a small number of WDM channels (8 to 32), only one WDM channel received per node, and achievable optoelectronic bandwidth and latency requirements. 21 refs., 10 figs.
Parallelized Kalman-Filter-Based Reconstruction of Particle Tracks on Many-Core Processors and GPUs

NASA Astrophysics Data System (ADS)

Cerati, Giuseppe; Elmer, Peter; Krutelyov, Slava; Lantz, Steven; Lefebvre, Matthieu; Masciovecchio, Mario; McDermott, Kevin; Riley, Daniel; Tadel, Matevž; Wittich, Peter; Würthwein, Frank; Yagil, Avi

2017-08-01

For over a decade now, physical and energy constraints have limited clock speed improvements in commodity microprocessors. Instead, chipmakers have been pushed into producing lower-power, multi-core processors such as Graphical Processing Units (GPU), ARM CPUs, and Intel MICs. Broad-based efforts from manufacturers and developers have been devoted to making these processors user-friendly enough to perform general computations. However, extracting performance from a larger number of cores, as well as specialized vector or SIMD units, requires special care in algorithm design and code optimization. One of the most computationally challenging problems in high-energy particle experiments is finding and fitting the charged-particle tracks during event reconstruction. This is expected to become by far the dominant problem at the High-Luminosity Large Hadron Collider (HL-LHC), for example. Today the most common track finding methods are those based on the Kalman filter. Experience with Kalman techniques on real tracking detector systems has shown that they are robust and provide high physics performance. This is why they are currently in use at the LHC, both in the trigger and offline. Previously we reported on the significant parallel speedups that resulted from our investigations to adapt Kalman filters to track fitting and track building on Intel Xeon and Xeon Phi. Here, we discuss our progress toward the understanding of these processors and the new developments to port the Kalman filter to NVIDIA GPUs.
Key Technologies of Phone Storage Forensics Based on ARM Architecture

NASA Astrophysics Data System (ADS)

Zhang, Jianghan; Che, Shengbing

2018-03-01

Smartphones mainly run one of three mobile operating systems: Android, iOS, and Windows Phone. Android smartphones have the largest market share, and almost all of their processor chips use the ARM architecture. The memory address mapping mechanism of ARM chips differs from that of the x86 architecture. To perform forensics on an Android smartphone, three key technologies must be understood: memory data acquisition, the conversion mechanism from virtual addresses to physical addresses, and locating the system's key data. This article presents a viable solution to these three issues that does not rely on the operating system API.
Multiprogramming performance degradation - Case study on a shared memory multiprocessor

NASA Technical Reports Server (NTRS)

Dimpsey, R. T.; Iyer, R. K.

1989-01-01

The performance degradation due to multiprogramming overhead is quantified for a parallel-processing machine. Measurements of real workloads were taken, and it was found that there is a moderate correlation between the completion time of a program and the amount of system overhead measured during program execution. Experiments in controlled environments were then conducted to calculate a lower bound on the performance degradation of parallel jobs caused by multiprogramming overhead. The results show that the multiprogramming overhead of parallel jobs consumes at least 4 percent of the processor time. When two or more serial jobs are introduced into the system, this amount increases to 5.3 percent.
A mechanism for efficient debugging of parallel programs

DOE Office of Scientific and Technical Information (OSTI.GOV)

Miller, B.P.; Choi, J.D.

1988-01-01

This paper addresses the design and implementation of an integrated debugging system for parallel programs running on shared memory multi-processors (SMMP). The authors describe the use of flowback analysis to provide information on causal relationships between events in a program's execution without re-executing the program for debugging. The authors introduce a mechanism called incremental tracing that, by using semantic analyses of the debugged program, makes the flowback analysis practical with only a small amount of trace generated during execution. They extend flowback analysis to apply to parallel programs and describe a method to detect race conditions in the interactions of the co-operating processes.
Kalman filter tracking on parallel architectures

NASA Astrophysics Data System (ADS)

Cerati, G.; Elmer, P.; Krutelyov, S.; Lantz, S.; Lefebvre, M.; McDermott, K.; Riley, D.; Tadel, M.; Wittich, P.; Wurthwein, F.; Yagil, A.

2017-10-01

We report on the progress of our studies towards a Kalman filter track reconstruction algorithm with optimal performance on manycore architectures. The combinatorial structure of these algorithms is not immediately compatible with an efficient SIMD (or SIMT) implementation; the challenge for us is to recast the existing software so it can readily generate hundreds of shared-memory threads that exploit the underlying instruction set of modern processors. We show how the data and associated tasks can be organized in a way that is conducive to both multithreading and vectorization. We demonstrate very good performance on Intel Xeon and Xeon Phi architectures, as well as promising first results on Nvidia GPUs.
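The idea of organizing track data so that many tracks advance in lockstep can be illustrated with a deliberately simplified Python sketch. It uses a scalar (one-dimensional) Kalman measurement update and stores each quantity as one array across all tracks (a structure-of-arrays layout), so the same arithmetic is applied to every track in each step. This is an assumption-laden toy and not the authors' implementation; the function name `kalman_update_batch` and the sample numbers are hypothetical.

```python
from typing import List, Tuple

def kalman_update_batch(x: List[float],
                        p: List[float],
                        z: List[float],
                        r: float) -> Tuple[List[float], List[float]]:
    """Scalar Kalman measurement update applied to all tracks in lockstep.

    x: state estimate per track      p: state variance per track
    z: measurement (hit) per track   r: measurement variance (shared)
    """
    if not (len(x) == len(p) == len(z)):
        raise ValueError("per-track arrays must have equal length")
    # Each line below is one uniform operation over all tracks, which is
    # the shape of work that maps onto SIMD lanes or SIMT threads.
    gain = [p_i / (p_i + r) for p_i in p]
    x_new = [x_i + k_i * (z_i - x_i) for x_i, k_i, z_i in zip(x, gain, z)]
    p_new = [(1.0 - k_i) * p_i for k_i, p_i in zip(gain, p)]
    return x_new, p_new

# Two hypothetical tracks updated together.
states, variances = kalman_update_batch(
    x=[0.0, 2.0], p=[1.0, 1.0], z=[1.0, 4.0], r=1.0)
```

A real track fit uses multi-dimensional states and covariance matrices, and the combinatorial branching of track building is the hard part the abstract refers to; the sketch only shows why a per-quantity array layout keeps the arithmetic uniform across tracks.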
The Spectral Element Method for Geophysical Flows

NASA Astrophysics Data System (ADS)

Taylor, Mark

1998-11-01

We will describe SEAM, a Spectral Element Atmospheric Model. SEAM solves the 3D primitive equations used in climate modeling and medium range forecasting. SEAM uses a spectral element discretization for the surface of the globe and finite differences in the vertical direction. The model is spectrally accurate, as demonstrated by a variety of test cases. It is well suited for modern distributed-shared memory computers, sustaining over 24 GFLOPS on a 240 processor HP Exemplar. This performance has allowed us to run several interesting simulations in full spherical geometry at high resolution (over 22 million grid points).
A general model for memory interference in a multiprocessor system with memory hierarchy

NASA Technical Reports Server (NTRS)

Taha, Badie A.; Standley, Hilda M.

1989-01-01

The problem of memory interference in a multiprocessor system with a hierarchy of shared buses and memories is addressed. The behavior of the processors is represented by a sequence of memory requests with each followed by a determined amount of processing time. A statistical queuing network model for determining the extent of memory interference in multiprocessor systems with clusters of memory hierarchies is presented. The performance of the system is measured by the expected number of busy memory clusters. The results of the analytic model are compared with simulation results, and the correlation between them is found to be very high.
The role of graphics super-workstations in a supercomputing environment

NASA Technical Reports Server (NTRS)

Levin, E.

1989-01-01

A new class of very powerful workstations has recently become available which integrate near supercomputer computational performance with very powerful and high quality graphics capability. These graphics super-workstations are expected to play an increasingly important role in providing an enhanced environment for supercomputer users. Their potential uses include: off-loading the supercomputer (by serving as stand-alone processors, by post-processing of the output of supercomputer calculations, and by distributed or shared processing), scientific visualization (understanding of results, communication of results), and real-time interaction with the supercomputer (to steer an iterative computation, to abort a bad run, or to explore and develop new algorithms).

TTEthernet for Integrated Spacecraft Networks

NASA Technical Reports Server (NTRS)

Loveless, Andrew

2015-01-01

Aerospace projects have traditionally employed federated avionics architectures, in which each computer system is designed to perform one specific function (e.g. navigation). There are obvious downsides to this approach, including excessive weight (from so much computing hardware), and inefficient processor utilization (since modern processors are capable of performing multiple tasks). There has therefore been a push for integrated modular avionics (IMA), in which common computing platforms can be leveraged for different purposes. This consolidation of multiple vehicle functions to shared computing platforms can significantly reduce spacecraft cost, weight, and design complexity. However, the application of IMA principles introduces significant challenges, as the data network must accommodate traffic of mixed criticality and performance levels - potentially all related to the same shared computer hardware. Because individual network technologies are rarely so competent, the development of truly integrated network architectures often proves unreasonable. Several different types of networks are utilized - each suited to support a specific vehicle function. Critical functions are typically driven by precise timing loops, requiring networks with strict guarantees regarding message latency (i.e. determinism) and fault-tolerance. Alternatively, non-critical systems generally employ data networks prioritizing flexibility and high performance over reliable operation. Switched Ethernet has seen widespread success filling this role in terrestrial applications. Its high speed, flexibility, and the availability of inexpensive commercial off-the-shelf (COTS) components make it desirable for inclusion in spacecraft platforms. Basic Ethernet configurations have been incorporated into several preexisting aerospace projects, including both the Space Shuttle and International Space Station (ISS). 
However, classical switched Ethernet cannot provide the high level of network determinism required by real-time spacecraft applications. Even with modern advancements, the uncoordinated (i.e. event-driven) nature of Ethernet communication unavoidably leads to message contention within network switches. The arbitration process used to resolve such conflicts introduces variation in the time it takes for messages to be forwarded. TTEthernet introduces decentralized clock synchronization to switched Ethernet, enabling message transmission according to a time-triggered (TT) paradigm. A network planning tool is used to allocate each device a finite amount of time in which it may transmit a frame. Each time slot is repeated sequentially to form a periodic communication schedule that is then loaded onto each TTEthernet device (e.g. switches and end systems). Each network participant references the synchronized time in order to dispatch messages at predetermined instants. This schedule guarantees that no contention exists between time-triggered Ethernet frames in the network switches, therefore eliminating the need for arbitration (and the timing variation it causes). Besides time-triggered messaging, TTEthernet networks may provide two additional traffic classes to support communication of different criticality levels. In the rate-constrained (RC) traffic class, the frame payload size and rate of transmission along each communication channel are limited to predetermined maximums. The network switches can therefore be configured to accommodate the known worst-case traffic pattern, and buffer overflows can be eliminated. The best-effort (BE) traffic class behaves akin to classical Ethernet. No guarantees are provided regarding transmission latency or successful message delivery.
TTEthernet coordinates transmission of all three traffic classes over the same physical connections, therefore accommodating the full spectrum of traffic criticality levels required in IMA architectures. Common computing platforms (e.g. LRUs) can share networking resources in such a way that failures in non-critical systems (using BE or RC communication modes) cannot impact flight-critical functions (using TT communication). Furthermore, TTEthernet hardware (e.g. switches, cabling) can be shared by both TTEthernet and classical Ethernet traffic.
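The time-triggered scheme described in this abstract can be sketched in a few lines of Python: each device is given one fixed transmission slot, the slots are laid end to end to form a repeating cycle, and a device may dispatch a frame only when the synchronized time falls inside its own slot. The function names, device names, and microsecond slot lengths below are hypothetical and are not taken from TTEthernet itself or from any planning tool.

```python
from typing import Dict, Tuple

Slot = Tuple[int, int]  # (start, end) offsets within one cycle, in microseconds

def build_tt_schedule(slot_lengths_us: Dict[str, int]) -> Tuple[Dict[str, Slot], int]:
    """Give each device a back-to-back slot; return (slots, cycle length)."""
    slots: Dict[str, Slot] = {}
    start = 0
    for device, length in slot_lengths_us.items():
        if length <= 0:
            raise ValueError("slot length must be positive")
        slots[device] = (start, start + length)
        start += length
    return slots, start

def has_contention(slots: Dict[str, Slot]) -> bool:
    """True if any two slots overlap, i.e. two devices could transmit at once."""
    ordered = sorted(slots.values())
    return any(prev_end > next_start
               for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]))

def may_transmit(slots: Dict[str, Slot], cycle_us: int,
                 device: str, synchronized_time_us: int) -> bool:
    """A device dispatches only while the synchronized clock is in its slot."""
    start, end = slots[device]
    offset = synchronized_time_us % cycle_us  # the schedule repeats every cycle
    return start <= offset < end

# Hypothetical devices and slot lengths.
slots, cycle = build_tt_schedule({"nav": 100, "gnc": 50, "camera": 200})
```

With these sample values the cycle is 350 microseconds, and at synchronized time 470 microseconds (offset 120 in the second cycle) only the device named `gnc` is allowed to transmit. Because the slots never overlap, two time-triggered frames cannot meet in a switch, which is the contention-free property the abstract describes.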
Broadcasting collective operation contributions throughout a parallel computer

DOEpatents

Faraj, Ahmad [Rochester, MN]

2012-02-21

Methods, systems, and products are disclosed for broadcasting collective operation contributions throughout a parallel computer. The parallel computer includes a plurality of compute nodes connected together through a data communications network. Each compute node has a plurality of processors for use in collective parallel operations on the parallel computer. Broadcasting collective operation contributions throughout a parallel computer according to embodiments of the present invention includes: transmitting, by each processor on each compute node, that processor's collective operation contribution to the other processors on that compute node using intra-node communications; and transmitting on a designated network link, by each processor on each compute node according to a serial processor transmission sequence, that processor's collective operation contribution to the other processors on the other compute nodes using inter-node communications.
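The two transmission steps named in this abstract can be simulated with a short Python sketch. It models only who ends up holding which contribution and the order in which processors use the inter-node link; it does not model real network hardware. The function name `broadcast_contributions`, the choice of processor-index order as the serial transmission sequence, and the sample values are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Source = Tuple[int, int]  # (node index, processor index)

def broadcast_contributions(nodes: List[List[int]]):
    """Simulate the broadcast; nodes[n][p] is processor p's contribution on node n.

    Returns (received, link_log), where received[n][p] maps each source
    processor to the contribution that processor (n, p) now holds, and
    link_log records the order in which processors used the inter-node link.
    """
    # Every processor starts out holding only its own contribution.
    received: List[List[Dict[Source, int]]] = [
        [{(n, p): value} for p, value in enumerate(procs)]
        for n, procs in enumerate(nodes)
    ]

    # Step 1: intra-node communication, each processor to its node-mates.
    for n, procs in enumerate(nodes):
        for p, value in enumerate(procs):
            for q in range(len(procs)):
                if q != p:
                    received[n][q][(n, p)] = value

    # Step 2: inter-node communication, one processor at a time per node
    # (assumed serial sequence: ascending processor index).
    link_log: List[Source] = []
    for n, procs in enumerate(nodes):
        for p, value in enumerate(procs):
            link_log.append((n, p))
            for m, other in enumerate(nodes):
                if m != n:
                    for q in range(len(other)):
                        received[m][q][(n, p)] = value
    return received, link_log

# Two hypothetical compute nodes with two processors each.
received, link_log = broadcast_contributions([[1, 2], [3, 4]])
```

After both steps every processor holds all four contributions, so each can complete the collective operation (for example a sum) locally.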
LANDSAT-D flight segment operations manual. Appendix B: OBC software operations

NASA Technical Reports Server (NTRS)

Talipsky, R.

1981-01-01

The LANDSAT 4 satellite contains two NASA standard spacecraft computers and 65,536 words of memory. Onboard computer software is divided into flight executive and applications processors. Both applications processors and the flight executive use one or more of 67 system tables to obtain variables, constants, and software flags. Output from the software for monitoring operation is via 49 OBC telemetry reports subcommutated in the spacecraft telemetry. Information is provided about the flight software as it is used to control the various spacecraft operations and interpret operational OBC telemetry. Processor function descriptions, processor operation, software constraints, processor system tables, processor telemetry, and processor flow charts are presented.
Managing Power Heterogeneity

NASA Astrophysics Data System (ADS)

Pruhs, Kirk

A particularly important emergent technology is heterogeneous processors (or cores), which many computer architects believe will be the dominant architectural design in the future. The main advantage of a heterogeneous architecture, relative to an architecture of identical processors, is that it allows for the inclusion of processors whose design is specialized for particular types of jobs, and for jobs to be assigned to a processor best suited for that job. Most notably, it is envisioned that these heterogeneous architectures will consist of a small number of high-power high-performance processors for critical jobs, and a larger number of lower-power lower-performance processors for less critical jobs. Naturally, the lower-power processors would be more energy efficient in terms of the computation performed per unit of energy expended, and would generate less heat per unit of computation. For a given area and power budget, heterogeneous designs can give significantly better performance for standard workloads. Moreover, even processors that were designed to be homogeneous, are increasingly likely to be heterogeneous at run time: the dominant underlying cause is the increasing variability in the fabrication process as the feature size is scaled down (although run time faults will also play a role). Since manufacturing yields would be unacceptably low if every processor/core was required to be perfect, and since there would be significant performance loss from derating the entire chip to the functioning of the least functional processor (which is what would be required in order to attain processor homogeneity), some processor heterogeneity seems inevitable in chips with many processors/cores.
Parallel implementation of an adaptive and parameter-free N-body integrator

NASA Astrophysics Data System (ADS)

Pruett, C. David; Ingham, William H.; Herman, Ralph D.

2011-05-01

Previously, Pruett et al. (2003) [3] described an N-body integrator of arbitrarily high order M with an asymptotic operation count of O(MN). The algorithm's structure lends itself readily to data parallelization, which we document and demonstrate here in the integration of point-mass systems subject to Newtonian gravitation. High order is shown to benefit parallel efficiency. The resulting N-body integrator is robust, parameter-free, highly accurate, and adaptive in both time-step and order. Moreover, it exhibits linear speedup on distributed parallel processors, provided that each processor is assigned at least a handful of bodies.
Program summary
Program title: PNB.f90
Catalogue identifier: AEIK_v1_0
Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEIK_v1_0.html
Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland
Licensing provisions: Standard CPC license, http://cpc.cs.qub.ac.uk/licence/licence.html
No. of lines in distributed program, including test data, etc.: 3052
No. of bytes in distributed program, including test data, etc.: 68 600
Distribution format: tar.gz
Programming language: Fortran 90 and OpenMPI
Computer: All shared or distributed memory parallel processors
Operating system: Unix/Linux
Has the code been vectorized or parallelized?: The code has been parallelized but has not been explicitly vectorized.
RAM: Dependent upon N
Classification: 4.3, 4.12, 6.5
Nature of problem: High accuracy numerical evaluation of trajectories of N point masses each subject to Newtonian gravitation.
Solution method: Parallel and adaptive extrapolation in time via power series of arbitrary degree.
Running time: 5.1 s for the demo program supplied with the package.
Multi-Core Processor Memory Contention Benchmark Analysis Case Study

NASA Technical Reports Server (NTRS)

Simon, Tyler; McGalliard, James

2009-01-01

Multi-core processors dominate current mainframe, server, and high performance computing (HPC) systems. This paper provides synthetic kernel and natural benchmark results from an HPC system at the NASA Goddard Space Flight Center that illustrate the performance impacts of multi-core (dual- and quad-core) vs. single core processor systems. Analysis of processor design, application source code, and synthetic and natural test results all indicate that multi-core processors can suffer from significant memory subsystem contention compared to similar single-core processors.
Simulink/PARS Integration Support

DOE Office of Scientific and Technical Information (OSTI.GOV)

Vacaliuc, B.; Nakhaee, N.

2013-12-18

The state of the art for signal processor hardware has far out-paced the development tools for placing applications on that hardware. In addition, signal processors are available in a variety of architectures, each uniquely capable of handling specific types of signal processing efficiently. With these processors becoming smaller and demanding less power, it has become possible to group multiple processors, a heterogeneous set of processors, into single systems. Different portions of the desired problem set can be assigned to different processor types as appropriate. As software development tools do not keep pace with these processors, especially when multiple processors of different types are used, a method is needed to enable software code portability among multiple processors and multiple types of processors along with their respective software environments. Sundance DSP, Inc. has developed a software toolkit called “PARS”, whose objective is to provide a framework that uses suites of tools provided by different vendors, along with modeling tools and a real time operating system, to build an application that spans different processor types. The software language used to express the behavior of the system is a very high level modeling language, “Simulink”, a MathWorks product. ORNL has used this toolkit to effectively implement several deliverables. This CRADA describes this collaboration between ORNL and Sundance DSP, Inc.
SPECIAL ISSUE ON OPTICAL PROCESSING OF INFORMATION: Optoelectronic processors with scanning CCD photodetectors

NASA Astrophysics Data System (ADS)

Esepkina, N. A.; Lavrov, A. P.; Anan'ev, M. N.; Blagodarnyi, V. S.; Ivanov, S. I.; Mansyrev, M. I.; Molodyakov, S. A.

1995-10-01

Two new types of optoelectronic radio-signal processors were investigated. Charge-coupled device (CCD) photodetectors are used in these processors under continuous scanning conditions, i.e. in a time delay and storage mode. One of these processors is based on a CCD photodetector array with a reference-signal amplitude transparency and the other is an adaptive acousto-optical signal processor with linear frequency modulation. The processor with the transparency performs multichannel discrete—analogue convolution of an input signal with a corresponding kernel of the transformation determined by the transparency. If a light source is an array of light-emitting diodes of special (stripe) geometry, the optical stages of the processor can be made from optical fibre components and the whole processor then becomes a rigid 'sandwich' (a compact hybrid optoelectronic microcircuit). A report is given also of a study of a prototype processor with optical fibre components for the reception of signals from a system with antenna aperture synthesis, which forms a radio image of the Earth.
System and method for representing and manipulating three-dimensional objects on massively parallel architectures

DOEpatents

Karasick, M.S.; Strip, D.R.

1996-01-30

A parallel computing system is described that comprises a plurality of uniquely labeled, parallel processors, each processor capable of modeling a three-dimensional object that includes a plurality of vertices, faces and edges. The system comprises a front-end processor for issuing a modeling command to the parallel processors, relating to a three-dimensional object. Each parallel processor, in response to the command and through the use of its own unique label, creates a directed-edge (d-edge) data structure that uniquely relates an edge of the three-dimensional object to one face of the object. Each d-edge data structure at least includes vertex descriptions of the edge and a description of the one face. As a result, each processor, in response to the modeling command, operates upon a small component of the model and generates results, in parallel with all other processors, without the need for processor-to-processor intercommunication. 8 figs.
Scalable Triadic Analysis of Large-Scale Graphs: Multi-Core vs. Multi-Processor vs. Multi-Threaded Shared Memory Architectures

DOE Office of Scientific and Technical Information (OSTI.GOV)

Chin, George; Marquez, Andres; Choudhury, Sutanay

2012-09-01

Triadic analysis encompasses a useful set of graph mining methods that is centered on the concept of a triad, which is a subgraph of three nodes and the configuration of directed edges across the nodes. Such methods are often applied in the social sciences as well as many other diverse fields. Triadic methods commonly operate on a triad census that counts the number of triads of every possible edge configuration in a graph. Like other graph algorithms, triadic census algorithms do not scale well when graphs reach tens of millions to billions of nodes. To enable the triadic analysis of large-scale graphs, we developed and optimized a triad census algorithm to efficiently execute on shared memory architectures. We will retrace the development and evolution of a parallel triad census algorithm. Over the course of several versions, we continually adapted the code’s data structures and program logic to expose more opportunities to exploit parallelism on shared memory that would translate into improved computational performance. We will recall the critical steps and modifications that occurred during code development and optimization. Furthermore, we will compare the performances of triad census algorithm versions on three specific systems: Cray XMT, HP Superdome, and AMD multi-core NUMA machine. These three systems have shared memory architectures but with markedly different hardware capabilities to manage parallelism.
Parallel Computation of the Regional Ocean Modeling System (ROMS)

DOE Office of Scientific and Technical Information (OSTI.GOV)

Wang, P; Song, Y T; Chao, Y

2005-04-05

The Regional Ocean Modeling System (ROMS) is a regional ocean general circulation modeling system solving the free surface, hydrostatic, primitive equations over varying topography. It is free software distributed world-wide for studying both complex coastal ocean problems and the basin-to-global scale ocean circulation. The original ROMS code could only be run on shared-memory systems. With the increasing need to simulate larger model domains with finer resolutions and on a variety of computer platforms, there is a need in the ocean-modeling community to have a ROMS code that can be run on any parallel computer ranging from 10 to hundreds of processors. Recently, we have explored parallelization for ROMS using the MPI programming model. In this paper, an efficient parallelization strategy for such a large-scale scientific software package, based on an existing shared-memory computing model, is presented. In addition, scientific applications and data-performance issues on a couple of SGI systems, including Columbia, the world's third-fastest supercomputer, are discussed.
Optimizing CMS build infrastructure via Apache Mesos

DOE PAGES

Abdurachmanov, David; Degano, Alessandro; Elmer, Peter; ...

2015-12-23

The Offline Software of the CMS Experiment at the Large Hadron Collider (LHC) at CERN consists of 6M lines of in-house code, developed over a decade by nearly 1000 physicists, as well as a comparable amount of general use open-source code. A critical ingredient to the success of the construction and early operation of the WLCG was the convergence, around the year 2000, on the use of a homogeneous environment of commodity x86-64 processors and Linux. Apache Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. It can run Hadoop, Jenkins, Spark, Aurora, and other applications on a dynamically shared pool of nodes. Lastly, we present how we migrated our continuous integration system to schedule jobs on a relatively small Apache Mesos enabled cluster and how this resulted in better resource usage, higher peak performance and lower latency thanks to the dynamic scheduling capabilities of Mesos.
Optimizing CMS build infrastructure via Apache Mesos

DOE Office of Scientific and Technical Information (OSTI.GOV)

Abdurachmanov, David; Degano, Alessandro; Elmer, Peter

The Offline Software of the CMS Experiment at the Large Hadron Collider (LHC) at CERN consists of 6M lines of in-house code, developed over a decade by nearly 1000 physicists, as well as a comparable amount of general use open-source code. A critical ingredient to the success of the construction and early operation of the WLCG was the convergence, around the year 2000, on the use of a homogeneous environment of commodity x86-64 processors and Linux. Apache Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks. It can run Hadoop, Jenkins, Spark, Aurora, and other applications on a dynamically shared pool of nodes. Lastly, we present how we migrated our continuous integration system to schedule jobs on a relatively small Apache Mesos enabled cluster and how this resulted in better resource usage, higher peak performance and lower latency thanks to the dynamic scheduling capabilities of Mesos.
A Tensor Product Formulation of Strassen's Matrix Multiplication Algorithm with Memory Reduction

DOE PAGES

Kumar, B.; Huang, C. -H.; Sadayappan, P.; ...

1995-01-01

In this article, we present a program generation strategy of Strassen's matrix multiplication algorithm using a programming methodology based on tensor product formulas. In this methodology, block recursive programs such as the fast Fourier transform and Strassen's matrix multiplication algorithm are expressed as algebraic formulas involving tensor products and other matrix operations. Such formulas can be systematically translated to high-performance parallel/vector codes for various architectures. In this article, we present a nonrecursive implementation of Strassen's algorithm for shared memory vector processors such as the Cray Y-MP. A previous implementation of Strassen's algorithm synthesized from tensor product formulas required working storage of size O(7^n) for multiplying 2^n × 2^n matrices. We present a modified formulation in which the working storage requirement is reduced to O(4^n). The modified formulation exhibits sufficient parallelism for efficient implementation on a shared memory multiprocessor. Performance results on a Cray Y-MP8/64 are presented.
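For orientation, the seven-product recursion that underlies Strassen's algorithm can be sketched in a few lines of Python. This is the textbook recursive form for 2^n × 2^n matrices, not the nonrecursive tensor-product formulation described in the abstract, and the helper names (`_add`, `_sub`, `_split`, `strassen`) are chosen here purely for illustration.

```python
def _add(A, B):
    """Element-wise sum of two equally sized square matrices."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def _sub(A, B):
    """Element-wise difference of two equally sized square matrices."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def _split(M):
    """Split a square matrix of even size into four quadrants."""
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])


def strassen(A, B):
    """Multiply two 2^n x 2^n matrices using seven recursive products."""
    if len(A) == 1:
        return [[A[0][0] * B[0][0]]]
    a11, a12, a21, a22 = _split(A)
    b11, b12, b21, b22 = _split(B)
    # Seven block products replace the eight of the naive block algorithm.
    m1 = strassen(_add(a11, a22), _add(b11, b22))
    m2 = strassen(_add(a21, a22), b11)
    m3 = strassen(a11, _sub(b12, b22))
    m4 = strassen(a22, _sub(b21, b11))
    m5 = strassen(_add(a11, a12), b22)
    m6 = strassen(_sub(a21, a11), _add(b11, b12))
    m7 = strassen(_sub(a12, a22), _add(b21, b22))
    c11 = _add(_sub(_add(m1, m4), m5), m7)
    c12 = _add(m3, m5)
    c21 = _add(m2, m4)
    c22 = _add(_add(_sub(m1, m2), m3), m6)
    # Reassemble the four quadrants into one matrix.
    top = [l + r for l, r in zip(c11, c12)]
    bottom = [l + r for l, r in zip(c21, c22)]
    return top + bottom
```

Because each level spawns seven subproblems, n levels of recursion yield 7^n scalar products, which is where a storage bound of O(7^n) can arise if every intermediate product is kept; the input matrices themselves hold 4^n elements each.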
Peer-to-peer Cooperative Scheduling Architecture for National Grid Infrastructure

NASA Astrophysics Data System (ADS)

Matyska, Ludek; Ruda, Miroslav; Toth, Simon

For some ten years, the Czech National Grid Infrastructure MetaCentrum has used a single central PBSPro installation to schedule jobs across the country. This centralized approach keeps full track of all the clusters, providing support for jobs spanning several sites, an implementation of the fair-share policy, and better overall control of the grid environment. Despite steady progress in stability and resilience to intermittent, very short network failures, the growing number of sites and processors makes this architecture, with its single point of failure and scalability limits, obsolete. As a result, a new scheduling architecture is proposed, which relies on higher autonomy of clusters. It is based on a peer-to-peer network of semi-independent schedulers for each site or even cluster. Each scheduler accepts jobs for the whole infrastructure, cooperating with other schedulers on the implementation of global policies like central job accounting, fair-share, or submission of jobs across several sites. The scheduling system is integrated with the Magrathea system to support scheduling of virtual clusters, including the setup of their internal network, again eventually spanning several sites. On the other hand, each scheduler is local to one of several clusters and is able to directly control and submit jobs to them even if the connection to other scheduling peers is lost. In parallel with the change of the overall architecture, the scheduling system itself is being replaced. Instead of PBSPro, chosen originally for its declared support of large-scale distributed environments, the new scheduling architecture is based on the open-source Torque system. The implementation of and support for the most desired properties in PBSPro and Torque are discussed, and the necessary modifications to Torque to support the MetaCentrum scheduling architecture are presented.
PS3 CELL Development for Scientific Computation and Research

NASA Astrophysics Data System (ADS)

Christiansen, M.; Sevre, E.; Wang, S. M.; Yuen, D. A.; Liu, S.; Lyness, M. D.; Broten, M.

2007-12-01

The Cell processor is one of the most powerful processors on the market, and researchers in the earth sciences may find its parallel architecture to be very useful. A Cell processor, with 7 cores, can easily be obtained for experimentation by purchasing a PlayStation 3 (PS3) and installing Linux and the IBM SDK. Each core of the PS3 is capable of 25 GFLOPS, giving a potential limit of 150 GFLOPS when using all 6 SPUs (synergistic processing units) with vectorized algorithms. We have used the Cell's computational power to create a program which takes simulated tsunami datasets, parses them, and returns a colorized height field image using ray casting techniques. As expected, the time required to create an image is inversely proportional to the number of SPUs used. We believe that this trend will continue when multiple PS3s are chained using OpenMP functionality and are in the process of researching this. By using the Cell to visualize tsunami data, we have found that its greatest feature is its power. This fits well with the needs of the scientific community, where the limiting factor is time. Any algorithm, such as the heat equation, that can be subdivided into multiple parts can take advantage of the PS3 Cell's ability to split the computations across the 6 SPUs, reducing the required run time to one sixth. Further vectorization of the code can allow for 4 simultaneous floating point operations by using the SIMD (single instruction multiple data) capabilities of the SPU, increasing efficiency 24 times.
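The peak-rate and speedup figures quoted in this abstract follow from simple arithmetic, sketched here in Python. The function names are illustrative and not taken from the paper, and the model assumes perfectly divisible work with no communication or scheduling overhead, which real codes will not achieve.

```python
def peak_gflops(per_unit_gflops, units):
    """Aggregate peak rate when every unit runs at its own peak."""
    return per_unit_gflops * units


def ideal_speedup(units, simd_lanes=1):
    """Best-case speedup from task parallelism times SIMD width."""
    return units * simd_lanes


def ideal_runtime(serial_seconds, units, simd_lanes=1):
    """Best-case run time for perfectly divisible work."""
    return serial_seconds / ideal_speedup(units, simd_lanes)


# Figures from the abstract: 6 SPUs at 25 GFLOPS each, 4-wide SIMD.
print(peak_gflops(25, 6))      # aggregate peak across the SPUs
print(ideal_speedup(6, 4))     # combined SPU and SIMD factor
print(ideal_runtime(60.0, 6))  # a hypothetical 60 s serial job split over 6 SPUs
```

Under these assumptions, six SPUs cut the run time to one sixth of the serial time, and the additional factor of four from SIMD gives the 24-fold figure.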
Enabling MPEG-2 video playback in embedded systems through improved data cache efficiency

NASA Astrophysics Data System (ADS)

Soderquist, Peter; Leeser, Miriam E.

1999-01-01

Digital video decoding, enabled by the MPEG-2 Video standard, is an important future application for embedded systems, particularly PDAs and other information appliances. Many such systems require portability and wireless communication capabilities, and thus face severe limitations in size and power consumption. This places a premium on integration and efficiency, and favors software solutions for video functionality over specialized hardware. The processors in most embedded systems currently lack the computational power needed to perform video decoding, but a related and equally important problem is the required data bandwidth, and the need to cost-effectively ensure adequate data supply. MPEG data sets are very large, and generate significant amounts of excess memory traffic for standard data caches, up to 100 times the amount required for decoding. Meanwhile, cost and power limitations restrict cache sizes in embedded systems. Some systems, including many media processors, eliminate caches in favor of memories under direct, painstaking software control in the manner of digital signal processors. Yet MPEG data has locality which caches can exploit if properly optimized, providing fast, flexible, and automatic data supply. We propose a set of enhancements which target the specific needs of the heterogeneous types within the MPEG decoder working set. These optimizations significantly improve the efficiency of small caches, reducing cache-memory traffic by almost 70 percent, and can make an enhanced 4 KB cache perform better than a standard 1 MB cache. This performance improvement can enable high-resolution, full frame rate video playback in cheaper, smaller systems than would otherwise be possible.
Distributed micro-radar system for detection and tracking of low-profile, low-altitude targets

NASA Astrophysics Data System (ADS)

Gorwara, Ashok; Molchanov, Pavlo

2016-05-01

The proposed airborne surveillance radar system can detect, locate, track, and classify low-profile, low-altitude targets, from traditional fixed- and rotary-wing aircraft to non-traditional targets such as unmanned aircraft systems (drones) and even small projectiles. The distributed micro-radar system is the next step in the development of the passive monopulse direction finder proposed by Stephen E. Lipsky in the 1980s. To extend the high-frequency limit and provide high sensitivity over a broad band of frequencies, multiple angularly spaced directional antennas are coupled with front-end circuits and separately connected to a direction-finder processor by a digital interface. Integrating the antennas with the front-end circuits makes it possible to eliminate waveguide lines, which limit system bandwidth and create frequency-dependent phase errors. Digitizing the received signals close to the antennas allows the antennas to be loosely distributed and dramatically decreases the phase errors associated with waveguides. The accuracy of direction finding in the proposed micro-radar is then determined by the timing accuracy of the digital processor and the sampling frequency. Multi-band, multi-functional antennas can be distributed around the perimeter of an Unmanned Aircraft System (UAS) and connected to the processor by a digital interface, or can be distributed among a swarm or formation of mini/micro UAS and connected wirelessly. Expendable micro-radars can be distributed around the perimeter of a defended object to create a multi-static radar network. Low-profile, low-altitude, high-speed targets, such as small projectiles, create a Doppler shift in a narrow frequency band. This signal can be effectively filtered and detected with high probability. The proposed micro-radar can work in a passive, monostatic, or bistatic regime.
In vivo experiences with magnetic resonance imaging scans in Vibrant Soundbridge type 503 implantees.

PubMed

Todt, I; Mittmann, P; Ernst, A; Mutze, S; Rademacher, G

2018-05-01

To observe the effects of magnetic resonance imaging scans in Vibrant Soundbridge 503 implantees at 1.5T in vivo. In a prospective case study of five Vibrant Soundbridge 503 implantees, 1.5T magnetic resonance imaging scans were performed with and without a headband. The degree of pain was evaluated using a visual analogue scale. Scan-related pure tone audiogram and audio processor fitting changes were assessed. In all patients, magnetic resonance imaging scans were performed without any degree of pain or change in pure tone audiogram or audio processor fitting, even without a headband. In this series, 1.5T magnetic resonance imaging scans were performed with the Vibrant Soundbridge 503 without complications. Limitations persist in terms of magnetic artefacts.
Vector processing efficiency of plasma MHD codes by use of the FACOM 230-75 APU

NASA Astrophysics Data System (ADS)

Matsuura, T.; Tanaka, Y.; Naraoka, K.; Takizuka, T.; Tsunematsu, T.; Tokuda, S.; Azumi, M.; Kurita, G.; Takeda, T.

1982-06-01

In the framework of pipelined vector architecture, the efficiency of vector processing is assessed with respect to plasma MHD codes in nuclear fusion research. By using a vector processor, the FACOM 230-75 APU, the limit of the enhancement factor due to parallelism of current vector machines is examined for three numerical codes based on a fluid model. Reasonable speed-up factors of approximately 6, 6, and 4 relative to the highly optimized scalar versions are obtained for ERATO (linear stability code), AEOLUS-R1 (nonlinear stability code) and APOLLO (1-1/2D transport code), respectively. Problems of the pipelined vector processors are discussed from the viewpoint of restructuring, optimization and choice of algorithms. In conclusion, the important concept of "concurrency within pipelined parallelism" is emphasized.

Electrochemical sensing using voltage-current time differential

DOE Office of Scientific and Technical Information (OSTI.GOV)

Woo, Leta Yar-Li; Glass, Robert Scott; Fitzpatrick, Joseph Jay

2017-02-28

A device for signal processing. The device includes a signal generator, a signal detector, and a processor. The signal generator generates an original waveform. The signal detector detects an affected waveform. The processor is coupled to the signal detector. The processor receives the affected waveform from the signal detector. The processor also compares at least one portion of the affected waveform with the original waveform. The processor also determines a difference between the affected waveform and the original waveform. The processor also determines a value corresponding to a unique portion of the determined difference between the original and affected waveforms. The processor also outputs the determined value.
Accuracy requirements of optical linear algebra processors in adaptive optics imaging systems

NASA Technical Reports Server (NTRS)

Downie, John D.; Goodman, Joseph W.

1989-01-01

The accuracy requirements of optical processors in adaptive optics systems are determined by estimating the required accuracy in a general optical linear algebra processor (OLAP) that results in a smaller average residual aberration than that achieved with a conventional electronic digital processor with some specific computation speed. Special attention is given to an error analysis of a general OLAP with regard to the residual aberration that is created in an adaptive mirror system by the inaccuracies of the processor, and to the effect of computational speed of an electronic processor on the correction. Results are presented on the ability of an OLAP to compete with a digital processor in various situations.
12 CFR 332.12 - Limits on sharing account number information for marketing purposes.

Code of Federal Regulations, 2010 CFR

2010-01-01

... 12 Banks and Banking 4 2010-01-01 2010-01-01 false Limits on sharing account number information... REGULATIONS AND STATEMENTS OF GENERAL POLICY PRIVACY OF CONSUMER FINANCIAL INFORMATION Limits on Disclosures § 332.12 Limits on sharing account number information for marketing purposes. (a) General prohibition on...
12 CFR 216.12 - Limits on sharing account number information for marketing purposes.

Code of Federal Regulations, 2010 CFR

2010-01-01

... 12 Banks and Banking 2 2010-01-01 2010-01-01 false Limits on sharing account number information... GOVERNORS OF THE FEDERAL RESERVE SYSTEM PRIVACY OF CONSUMER FINANCIAL INFORMATION (REGULATION P) Limits on Disclosures § 216.12 Limits on sharing account number information for marketing purposes. (a) General...
12 CFR 40.12 - Limits on sharing account number information for marketing purposes.

Code of Federal Regulations, 2010 CFR

2010-01-01

... 12 Banks and Banking 1 2010-01-01 2010-01-01 false Limits on sharing account number information... OF THE TREASURY PRIVACY OF CONSUMER FINANCIAL INFORMATION Limits on Disclosures § 40.12 Limits on sharing account number information for marketing purposes. (a) General prohibition on disclosure of...
12 CFR 573.12 - Limits on sharing account number information for marketing purposes.

Code of Federal Regulations, 2010 CFR

2010-01-01

... 12 Banks and Banking 5 2010-01-01 2010-01-01 false Limits on sharing account number information..., DEPARTMENT OF THE TREASURY PRIVACY OF CONSUMER FINANCIAL INFORMATION Limits on Disclosures § 573.12 Limits on sharing account number information for marketing purposes. (a) General prohibition on disclosure of...
12 CFR 716.12 - Limits on sharing of account number information for marketing purposes.

Code of Federal Regulations, 2010 CFR

2010-01-01

... 12 Banks and Banking 6 2010-01-01 2010-01-01 false Limits on sharing of account number information... REGULATIONS AFFECTING CREDIT UNIONS PRIVACY OF CONSUMER FINANCIAL INFORMATION Limits on Disclosures § 716.12 Limits on sharing of account number information for marketing purposes. (a) General prohibition on...
16 CFR 313.12 - Limits on sharing account number information for marketing purposes.

Code of Federal Regulations, 2010 CFR

2010-01-01

... 16 Commercial Practices 1 2010-01-01 2010-01-01 false Limits on sharing account number information... REGULATIONS UNDER SPECIFIC ACTS OF CONGRESS PRIVACY OF CONSUMER FINANCIAL INFORMATION Limits on Disclosures § 313.12 Limits on sharing account number information for marketing purposes. (a) General prohibition on...
50 CFR 679.81 - Rockfish Program annual harvester and processor privileges.

Code of Federal Regulations, 2011 CFR

2011-10-01

... amount (MRA) limits—(1) Rockfish cooperative. A vessel assigned to a rockfish cooperative and fishing... this part. (6) Maximum retainable amounts (MRA). (i) The MRA for an incidental catch species for..., shortraker and rougheye rockfish are incidental catch species and are limited to an aggregate MRA of 2.0...
50 CFR 679.82 - Rockfish Program use caps and sideboard limits.

Code of Federal Regulations, 2013 CFR

2013-10-01

... not participate in directed fishing for arrowtooth flounder, deep-water flatfish, and rex sole in the GOA (or in waters adjacent to the GOA when arrowtooth flounder, deep-water flatfish, and rex sole... authority of all eligible LLP licenses in the catcher/processor sector. (ii) For the deep-water halibut PSC...
50 CFR 679.82 - Rockfish Program use caps and sideboard limits.

Code of Federal Regulations, 2014 CFR

2014-10-01

... not participate in directed fishing for arrowtooth flounder, deep-water flatfish, and rex sole in the GOA (or in waters adjacent to the GOA when arrowtooth flounder, deep-water flatfish, and rex sole... authority of all eligible LLP licenses in the catcher/processor sector. (ii) For the deep-water halibut PSC...
77 FR 20296 - Significant New Use Rules on Certain Chemical Substances

Federal Register 2010, 2011, 2012, 2013, 2014

2012-04-04

.... Potentially affected entities may include, but are not limited to: Manufacturers, importers, or processors of... regarding entities likely to be affected by this action. Other types of entities not listed in this unit... of manufacturing and processing of a chemical substance. The extent to which a use changes the type...
40 CFR 721.6070 - Alkyl phosphonate ammonium salts.

Code of Federal Regulations, 2010 CFR

2010-07-01

... salts (PMNs P-93-725 and P-93-726) are subject to reporting under this section for the significant new... water. Requirements as specified in § 721.90 (a)(4), (b)(4), and (c)(4) (where N = 400 ppb). (b...), and (k) are applicable to manufacturers, importers, and processors of this substance. (2) Limitations...
21 CFR 120.10 - Corrective actions.

Code of Federal Regulations, 2010 CFR

2010-04-01

... 21 Food and Drugs 2 2010-04-01 2010-04-01 false Corrective actions. 120.10 Section 120.10 Food and... actions. Whenever a deviation from a critical limit occurs, a processor shall take corrective action by... develop written corrective action plans, which become part of their HACCP plans in accordance with § 120.8...
SPORT: An Algorithm for Divisible Load Scheduling with Result Collection on Heterogeneous Systems

NASA Astrophysics Data System (ADS)

Ghatpande, Abhay; Nakazato, Hidenori; Beaumont, Olivier; Watanabe, Hiroshi

Divisible Load Theory (DLT) is an established mathematical framework to study Divisible Load Scheduling (DLS). However, traditional DLT does not address the scheduling of results back to source (i.e., result collection), nor does it comprehensively deal with system heterogeneity. In this paper, the DLSRCHETS (DLS with Result Collection on HET-erogeneous Systems) problem is addressed. The few papers to date that have dealt with DLSRCHETS proposed simplistic LIFO (Last In, First Out) and FIFO (First In, First Out) types of schedules as solutions to DLSRCHETS. In this paper, a new polynomial time heuristic algorithm, SPORT (System Parameters based Optimized Result Transfer), is proposed as a solution to the DLSRCHETS problem. With the help of simulations, it is shown that the performance of SPORT is significantly better than that of existing algorithms. The other major contributions of this paper include, for the first time ever, (a) the derivation of the condition to identify the presence of idle time in a FIFO schedule for two processors, (b) the identification of the limiting condition for the optimality of FIFO and LIFO schedules for two processors, and (c) the introduction of the concept of equivalent processor in DLS for heterogeneous systems with result collection.
A parallel implementation of an off-lattice individual-based model of multicellular populations

NASA Astrophysics Data System (ADS)

Harvey, Daniel G.; Fletcher, Alexander G.; Osborne, James M.; Pitt-Francis, Joe

2015-07-01

As computational models of multicellular populations include ever more detailed descriptions of biophysical and biochemical processes, the computational cost of simulating such models limits their ability to generate novel scientific hypotheses and testable predictions. While developments in microchip technology continue to increase the power of individual processors, parallel computing offers an immediate increase in available processing power. To make full use of parallel computing technology, it is necessary to develop specialised algorithms. To this end, we present a parallel algorithm for a class of off-lattice individual-based models of multicellular populations. The algorithm divides the spatial domain between computing processes and comprises communication routines that ensure the model is correctly simulated on multiple processors. The parallel algorithm is shown to accurately reproduce the results of a deterministic simulation performed using a pre-existing serial implementation. We test the scaling of computation time, memory use and load balancing as more processes are used to simulate a cell population of fixed size. We find approximate linear scaling of both speed-up and memory consumption on up to 32 processor cores. Dynamic load balancing is shown to provide speed-up for non-regular spatial distributions of cells in the case of a growing population.
Modeling heterogeneous processor scheduling for real time systems

NASA Technical Reports Server (NTRS)

Leathrum, J. F.; Mielke, R. R.; Stoughton, J. W.

1994-01-01

A new model is presented to describe dataflow algorithms implemented in a multiprocessing system. Called the resource/data flow graph (RDFG), the model explicitly represents cyclo-static processor schedules as circuits of processor arcs which reflect the order that processors execute graph nodes. The model also allows the guarantee of meeting hard real-time deadlines. When unfolded, the model identifies statically the processor schedule. The model therefore is useful for determining the throughput and latency of systems with heterogeneous processors. The applicability of the model is demonstrated using a space surveillance algorithm.
Parallel processor for real-time structural control

NASA Astrophysics Data System (ADS)

Tise, Bert L.

1993-07-01

A parallel processor that is optimized for real-time linear control has been developed. This modular system consists of A/D modules, D/A modules, and floating-point processor modules. The scalable processor uses up to 1,000 Motorola DSP96002 floating-point processors for a peak computational rate of 60 GFLOPS. Sampling rates up to 625 kHz are supported by this analog-in to analog-out controller. The high processing rate and parallel architecture make this processor suitable for computing state-space equations and other multiply/accumulate-intensive digital filters. Processor features include 14-bit conversion devices, low input-to-output latency, 240 Mbyte/s synchronous backplane bus, low-skew clock distribution circuit, VME connection to host computer, parallelizing code generator, and look-up tables for actuator linearization. This processor was designed primarily for experiments in structural control. The A/D modules sample sensors mounted on the structure and the floating-point processor modules compute the outputs using the programmed control equations. The outputs are sent through the D/A module to the power amps used to drive the structure's actuators. The host computer is a Sun workstation. An OpenWindows-based control panel is provided to facilitate data transfer to and from the processor, as well as to control the operating mode of the processor. A diagnostic mode is provided to allow stimulation of the structure and acquisition of the structural response via sensor inputs.
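To make the "state-space equations" and "multiply/accumulate-intensive" remarks concrete, the sketch below performs one discrete-time state-space update and counts its multiply/accumulate operations. The matrix names A, B, C, D and the helper functions are the conventional textbook form and are assumptions here, not taken from the record. The per-sample budget simply divides the quoted peak rate by the quoted maximum sampling rate, so it is an upper bound rather than a measured figure.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def mat_vec(M: Matrix, v: Vector) -> Vector:
    """Matrix-vector product; each output element is a chain of multiply/accumulates."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def state_space_step(A: Matrix, B: Matrix, C: Matrix, D: Matrix,
                     x: Vector, u: Vector) -> Tuple[Vector, Vector]:
    """One controller sample: y[k] = C x[k] + D u[k], x[k+1] = A x[k] + B u[k]."""
    y = [a + b for a, b in zip(mat_vec(C, x), mat_vec(D, u))]
    x_next = [a + b for a, b in zip(mat_vec(A, x), mat_vec(B, u))]
    return x_next, y


def mac_count(n_states: int, n_inputs: int, n_outputs: int) -> int:
    """Multiply/accumulate operations in one update with dense matrices."""
    return (n_states * n_states + n_states * n_inputs
            + n_outputs * n_states + n_outputs * n_inputs)


def flops_per_sample(peak_flops: float, sample_rate_hz: float) -> float:
    """Upper bound on floating-point operations available per sample period."""
    return peak_flops / sample_rate_hz
```

With the figures quoted in the abstract, `flops_per_sample(60e9, 625e3)` gives 96,000 operations per sample at peak, which can be compared against `mac_count` for a given controller size.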
Testing and operating a multiprocessor chip with processor redundancy

DOEpatents

Bellofatto, Ralph E; Douskey, Steven M; Haring, Rudolf A; McManus, Moyra K; Ohmacht, Martin; Schmunkamp, Dietmar; Sugavanam, Krishnan; Weatherford, Bryan J

2014-10-21

A system and method for improving the yield rate of a multiprocessor semiconductor chip that includes primary processor cores and one or more redundant processor cores. A first tester conducts a first test on one or more processor cores, and encodes results of the first test in an on-chip non-volatile memory. A second tester conducts a second test on the processor cores, and encodes results of the second test in an external non-volatile storage device. An override bit of a multiplexer is set if a processor core fails the second test. In response to the override bit, the multiplexer selects a physical-to-logical mapping of processor IDs according to one of: the encoded results in the memory device or the encoded results in the external storage device. On-chip logic configures the processor cores according to the selected physical-to-logical mapping.
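The selection logic described above can be sketched in software as follows. This is a hypothetical illustration only: the abstract does not say which result set the multiplexer selects when the override bit is set, so the sketch assumes the external (second-test) results take over, and it assumes logical IDs are assigned to passing cores in ascending physical order. All function names are invented for the example.

```python
from typing import Dict, List, Tuple


def select_results(on_chip: List[bool],
                   external: List[bool]) -> Tuple[List[bool], bool]:
    """Return (selected pass/fail results, override bit).

    Assumption: the override bit is set when any core fails the second test,
    and a set override bit selects the externally stored results.
    """
    override = not all(external)
    return (external if override else on_chip), override


def physical_to_logical(passed: List[bool], n_logical: int) -> Dict[int, int]:
    """Map passing physical core IDs to logical IDs 0..n_logical-1.

    Failed cores are skipped, so a redundant core fills the gap.
    """
    good = [phys for phys, ok in enumerate(passed) if ok]
    if len(good) < n_logical:
        raise ValueError("not enough working cores for the requested logical IDs")
    return {phys: logical for logical, phys in enumerate(good[:n_logical])}
```

For a chip with four primary cores and one redundant core, a failure of physical core 2 in the second test would leave logical IDs 0 to 3 mapped onto physical cores 0, 1, 3, and 4 under these assumptions.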
DOE Office of Scientific and Technical Information (OSTI.GOV)

Reed, D.A.; Grunwald, D.C.

The spectrum of parallel processor designs can be divided into three sections according to the number and complexity of the processors. At one end there are simple, bit-serial processors. Any one of these processors is of little value, but when it is coupled with many others, the aggregate computing power can be large. This approach to parallel processing can be likened to a colony of termites devouring a log. The most notable examples of this approach are the NASA/Goodyear Massively Parallel Processor, which has 16K one-bit processors, and the Thinking Machines Connection Machine, which has 64K one-bit processors. At the other end of the spectrum, a small number of processors, each built using the fastest available technology and the most sophisticated architecture, are combined. An example of this approach is the Cray X-MP. This type of parallel processing is akin to four woodmen attacking the log with chainsaws.