These are representative sample records from Science.gov related to your search topic.
For comprehensive and current results, perform a real-time search at Science.gov.
1

Design Space Exploration for Massively Parallel Processor Arrays  

Microsoft Academic Search

In this paper, we describe an approach for the optimization of dedicated co-processors that are implemented either in hardware (ASIC) or configware (FPGA). Such massively parallel co-processors are typically part of a heterogeneous hardware/software system. Each co-processor is a massively parallel system consisting of an array of processing elements (PEs). In order to decide whether to map a computational

Frank Hannig; Jürgen Teich

2001-01-01

2

Titanic: a VLSI based content addressable parallel array processor  

SciTech Connect

A design is presented for a content addressable parallel array processor (CAPAP) which is both practical and feasible. Its practicality stems from an extensive program of research into real applications of content addressability and parallelism. The feasibility of the design stems from development under a set of conservative engineering constraints tied to limitations of VLSI technology. 1 ref.

Weems, C.; Levitan, S.; Foster, C.

1982-01-01

3

Digital Parallel Processor Array for Optimum Path Planning  

NASA Technical Reports Server (NTRS)

The invention computes the optimum path across a terrain or topology represented by an array of parallel processor cells interconnected between neighboring cells by links extending along different directions to the neighboring cells. Such an array is preferably implemented as a high-speed integrated circuit. The computation of the optimum path is accomplished by, in each cell, receiving stimulus signals from neighboring cells along corresponding directions, determining and storing the identity of a direction along which the first stimulus signal is received, broadcasting a subsequent stimulus signal to the neighboring cells after a predetermined delay time, whereby stimulus signals propagate throughout the array from a starting one of the cells. After propagation of the stimulus signal throughout the array, a master processor traces back from a selected destination cell to the starting cell along an optimum path of the cells in accordance with the identity of the directions stored in each of the cells.

Kremeny, Sabrina E. (Inventor); Fossum, Eric R. (Inventor); Nixon, Robert H. (Inventor)

1996-01-01
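
The stimulus propagation and trace-back described in this record can be sketched in software. The following is a minimal illustration, not the patented circuit: it assumes a 4-connected grid, models the per-cell "predetermined delay time" as a terrain cost in a hypothetical `delay` table, and simulates the asynchronous hardware with an event queue. The names `propagate` and `trace_back` are invented for this sketch.

```python
import heapq

# Four neighbour links per cell; the 4-connected layout is an assumption of this sketch.
DIRECTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]


def propagate(delay, start):
    """Spread a stimulus from `start` across the grid.

    delay[r][c] is the time cell (r, c) waits before re-broadcasting.
    Returns (came_from, arrival): for each cell, the direction pointing back
    toward the sender of the first stimulus it received, and the arrival time.
    """
    rows, cols = len(delay), len(delay[0])
    came_from, arrival = {}, {}
    events = [(0, start, None)]  # (arrival time, cell, direction back to sender)
    while events:
        t, cell, back = heapq.heappop(events)
        if cell in arrival:
            continue  # only the first stimulus is stored; later ones are ignored
        arrival[cell] = t
        came_from[cell] = back
        r, c = cell
        for dr, dc in DIRECTIONS:
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in arrival:
                # The neighbour hears this cell after the cell's own delay.
                heapq.heappush(events, (t + delay[r][c], (nr, nc), (-dr, -dc)))
    return came_from, arrival


def trace_back(came_from, destination):
    """Follow stored directions from the destination back to the start (master processor step)."""
    path = [destination]
    while came_from[path[-1]] is not None:
        (r, c), (dr, dc) = path[-1], came_from[path[-1]]
        path.append((r + dr, c + dc))
    return path[::-1]
```

With a high-delay column in the middle of a 3 x 3 grid, the traced path goes around the slow cells rather than through them, which is the behaviour the abstract attributes to the array.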

4

Decentralized dynamic resource management support for massively parallel processor arrays  

Microsoft Academic Search

...ed resource management methodology for massively parallel processor arrays. It enables processing elements to autonomously explore resource availability in their neighborhood. To support resource exploration, we introduce specialized controllers, which can be attached to each of the processing elements. We propose different types of architectures for the exploration controller: fast FSM-based designs as well as flexible programmable controllers.

Vahid Lari; Andriy Narovlyanskyy; Frank Hannig; Jürgen Teich

2011-01-01

5

Seasat synthetic-aperture radar data reduction using parallel programmable array processors  

NASA Technical Reports Server (NTRS)

This paper presents a digital processing system that produces the Seasat synthetic-aperture radar (SAR) imagery. The system consists of a SEL 32/77 host minicomputer and three AP-120B array processors. The partitioning of the SAR processing functions and the design of software modules are described. The rationale for selecting the parallel array processor architecture and the methodology for developing the parallel processing scheme on this system are described. This system attains a Seasat SAR data reduction speed of 2.5 h per 25-m resolution 4-look and 100 km x 100 km image frame. A preliminary performance evaluation of this parallel processing system and potential future applications for remote sensing data reduction are described.

Wu, C.; Barkan, B.; Karplus, W. J.; Caswell, D.

1982-01-01

6

Array processor architecture  

NASA Technical Reports Server (NTRS)

A high speed parallel array data processing architecture fashioned under a computational envelope approach includes a data base memory for secondary storage of programs and data, and a plurality of memory modules interconnected to a plurality of processing modules by a connection network of the Omega gender. Programs and data are fed from the data base memory to the plurality of memory modules and from hence the programs are fed through the connection network to the array of processors (one copy of each program for each processor). Execution of the programs occurs with the processors operating normally quite independently of each other in a multiprocessing fashion. For data dependent operations and other suitable operations, all processors are instructed to finish one given task or program branch before all are instructed to proceed in parallel processing fashion on the next instruction. Even when functioning in the parallel processing mode however, the processors are not locked-step but execute their own copy of the program individually unless or until another overall processor array synchronization instruction is issued.

Barnes, George H. (Inventor); Lundstrom, Stephen F. (Inventor); Shafer, Philip E. (Inventor)

1983-01-01
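
The synchronization behaviour described in this record (processors run their own program copies independently until an array-wide synchronization instruction is issued) resembles a barrier. The sketch below is only an analogy using Python threads and `threading.Barrier`; the function name `run_phases` and the two-phase structure are assumptions made for illustration, not details from the patent.

```python
import threading


def run_phases(num_processors):
    """Run two program phases on `num_processors` threads, separated by a barrier."""
    barrier = threading.Barrier(num_processors)
    log, lock = [], threading.Lock()

    def processor(pid):
        # Phase 1: each processor executes its own copy of the program independently.
        with lock:
            log.append(("phase1", pid))
        # Array-wide synchronization: nobody continues until everyone has arrived.
        barrier.wait()
        # Phase 2: processors proceed in parallel again, still not in lock-step.
        with lock:
            log.append(("phase2", pid))

    threads = [threading.Thread(target=processor, args=(p,)) for p in range(num_processors)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log
```

The order of entries within a phase may vary from run to run, but every phase-1 entry is logged before any phase-2 entry.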

7

Massively parallel processor computer  

NASA Technical Reports Server (NTRS)

An apparatus for processing multidimensional data with strong spatial characteristics, such as raw image data, characterized by a large number of parallel data streams in an ordered array is described. It comprises a large number (e.g., 16,384 in a 128 x 128 array) of parallel processing elements operating simultaneously and independently on single bit slices of a corresponding array of incoming data streams under control of a single set of instructions. Each of the processing elements comprises a bidirectional data bus in communication with a register for storing single bit slices together with a random access memory unit and associated circuitry, including a binary counter/shift register device, for performing logical and arithmetical computations on the bit slices, and an I/O unit for interfacing the bidirectional data bus with the data stream source. The massively parallel processor architecture enables very high speed processing of large amounts of ordered parallel data, including spatial translation by shifting or sliding of bits vertically or horizontally to neighboring processing elements.

Fung, L. W. (inventor)

1983-01-01
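
To illustrate what operating "on single bit slices ... under control of a single set of instructions" can mean, here is a small software sketch of bit-serial addition across an array of processing elements, plus a one-step shift to a neighbouring element. It is a simplified model under stated assumptions: the nested loops stand in for elements that would work simultaneously, and the names `bit_serial_add` and `shift_east` are hypothetical.

```python
def bit_serial_add(a, b, width):
    """Add two equally shaped 2-D integer arrays one bit plane at a time (modulo 2**width)."""
    rows, cols = len(a), len(a[0])
    total = [[0] * cols for _ in range(rows)]
    carry = [[0] * cols for _ in range(rows)]
    for bit in range(width):  # one broadcast instruction per bit plane
        for r in range(rows):  # these loops model PEs that act in parallel
            for c in range(cols):
                x = (a[r][c] >> bit) & 1
                y = (b[r][c] >> bit) & 1
                s = x ^ y ^ carry[r][c]  # full-adder sum bit
                carry[r][c] = (x & y) | (carry[r][c] & (x ^ y))  # full-adder carry
                total[r][c] |= s << bit
    return total


def shift_east(plane, fill=0):
    """Move every value one element to the right; `fill` enters at the left edge."""
    return [[fill] + row[:-1] for row in plane]
```

For example, `bit_serial_add([[1, 2], [3, 4]], [[5, 6], [7, 8]], 8)` adds the arrays element by element using eight bit-plane steps regardless of the array size.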

8

Array processor architecture connection network  

NASA Technical Reports Server (NTRS)

A connection network is disclosed for use between a parallel array of processors and a parallel array of memory modules for establishing non-conflicting data communications paths between requested memory modules and requesting processors. The connection network includes a plurality of switching elements interposed between the processor array and the memory modules array in an Omega networking architecture. Each switching element includes a first and a second processor side port, a first and a second memory module side port, and control logic circuitry for providing data connections between the first and second processor ports and the first and second memory module ports. The control logic circuitry includes strobe logic for examining data arriving at the first and the second processor ports to indicate when the data arriving is requesting data from a requesting processor to a requested memory module. Further, connection circuitry is associated with the strobe logic for examining requesting data arriving at the first and the second processor ports for providing a data connection therefrom to the first and the second memory module ports in response thereto when the data connection so provided does not conflict with a pre-established data connection currently in use.

Barnes, George H. (Inventor); Lundstrom, Stephen F. (Inventor); Shafer, Philip E. (Inventor)

1982-01-01
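
As a rough illustration of how non-conflicting paths through an Omega network can be checked, the sketch below uses the textbook destination-tag routing scheme: each stage applies a perfect shuffle and then a 2 x 2 switch picks its upper or lower output from one bit of the destination address. This is a generic model, not the strobe and connection circuitry of the patent; `omega_route` and `conflicts` are names invented here.

```python
def omega_route(src, dst, n_bits):
    """Return the output link occupied after each stage on the way from src to dst."""
    size = 1 << n_bits
    pos, links = src, []
    for stage in range(n_bits):
        # Perfect shuffle: rotate the n-bit position left by one.
        pos = ((pos << 1) | (pos >> (n_bits - 1))) & (size - 1)
        # The switch output is chosen by the destination bit, most significant first.
        dst_bit = (dst >> (n_bits - 1 - stage)) & 1
        pos = (pos & ~1) | dst_bit
        links.append(pos)
    return links


def conflicts(request_a, request_b, n_bits):
    """Two (src, dst) requests conflict if they need the same link at the same stage."""
    path_a = omega_route(*request_a, n_bits)
    path_b = omega_route(*request_b, n_bits)
    return any(x == y for x, y in zip(path_a, path_b))
```

In an 8-port network (`n_bits=3`), the requests `(0, 0)` and `(4, 4)` can be connected together, while `(0, 0)` and `(4, 1)` compete for the same first-stage link, so the second would have to wait in a conflict-avoiding design.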

9

Spaceborne Processor Array  

NASA Technical Reports Server (NTRS)

A Spaceborne Processor Array in Multifunctional Structure (SPAMS) can lower the total mass of the electronic and structural overhead of spacecraft, resulting in reduced launch costs, while increasing the science return through dynamic onboard computing. SPAMS integrates the multifunctional structure (MFS) and the Gilgamesh Memory, Intelligence, and Network Device (MIND) multi-core in-memory computer architecture into a single-system super-architecture. This transforms every inch of a spacecraft into a sharable, interconnected, smart computing element to increase computing performance while simultaneously reducing mass. The MIND in-memory architecture provides a foundation for high-performance, low-power, and fault-tolerant computing. The MIND chip has an internal structure that includes memory, processing, and communication functionality. The Gilgamesh is a scalable system comprising multiple MIND chips interconnected to operate as a single, tightly coupled, parallel computer. The array of MIND components shares a global, virtual name space for program variables and tasks that are allocated at run time to the distributed physical memory and processing resources. Individual processor-memory nodes can be activated or powered down at run time to provide active power management and to configure around faults. A SPAMS system is comprised of a distributed Gilgamesh array built into MFS, interfaces into instrument and communication subsystems, a mass storage interface, and a radiation-hardened flight computer.

Chow, Edward T.; Schatzel, Donald V.; Whitaker, William D.; Sterling, Thomas

2008-01-01

10

Digital Control of the Parallel Interleaved Solar Array Regulator Using the Digital Signal Processor  

Microsoft Academic Search

In this paper, a digital control approach for the parallel interleaved solar array regulator (SAR) is proposed. The proposed control scheme achieves stable operation in the entire region of the solar array (SA). Additionally, by making the effective load characteristic of the SAR seen by the SA as a resistive load sink, the current sharing among the converter modules is

H. S. Bae; S. H. Park; J. H. Lee; B. H. Cho; S. S. Jang

2006-01-01

11

Array processors in chemistry  

SciTech Connect

The field of attached scientific processors ("array processors") is surveyed, and an attempt is made to indicate their present and possible future use in computational chemistry. The current commercial products from Floating Point Systems, Inc., Datawest Corporation, and CSP, Inc. are discussed.

Ostlund, N.S.

1980-01-01

12

Massively parallel processor  

NASA Technical Reports Server (NTRS)

A brief description is given of the Massively Parallel Processor (MPP). Major applications of the MPP are in the area of image processing (where the operands are often very small integers) from very high spatial resolution passive image sensors, signal processing of radar data, and numerical modeling simulations of climate. The system can be programmed in assembly language or a high level language. Information on background, status, architecture, programming, hardware reliability, applications, and the MPP's development as a national resource for parallel algorithm research is presented in outline form.

1985-01-01

13

Parallel Analog-to-Digital Image Processor  

NASA Technical Reports Server (NTRS)

Proposed integrated-circuit network of many identical units converts analog outputs of imaging arrays of x-ray or infrared detectors to digital outputs. Converter located near imaging detectors, within cryogenic detector package. Because converter output digital, lends itself well to multiplexing and to postprocessing for correction of gain and offset errors peculiar to each picture element and its sampling and conversion circuits. Analog-to-digital image processor is massively parallel system for processing data from array of photodetectors. System built as compact integrated circuit located near focal plane. Buffer amplifier for each picture element has different offset.

Lokerson, D. C.

1987-01-01

14

Adaptively Parallel Processor Allocation for Cilk Jobs  

E-print Network

The problem of allocating processor resources fairly and efficiently to parallel jobs has been studied extensively in the past. Most of this work, however, assumes that the instantaneous parallelism of the jobs is known ...

Sen, Siddhartha

15

Array Processor Has Power and Flexibility  

NASA Technical Reports Server (NTRS)

Proposed processor architecture would have flexibility of a multi-processor and computational power of a lockstep array. Using an efficient interconnection network, it accommodates a large number of individual processors and memory modules. Array architecture would be suitable for very large scientific simulation problems and other applications.

Barnes, G. H.; Lundstrom, S. F.; Shafer, P. E.

1982-01-01

16

Image processing using one-dimensional processor arrays  

Microsoft Academic Search

The first half of this paper presents the design rationale for CNAPS, a specialized one-dimensional (1-D) processor array developed by Adaptive Solutions Inc. In this context, we discuss the problem of Amdahl's law which severely constrains special-purpose architectures. We also discuss specific architectural decisions such as the kind of parallelism, the computational precision of the processors, on-chip versus off-chip processor

Dan W. Hammerstrom; Daniel P. Lulich

1996-01-01
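
The constraint from Amdahl's law mentioned in this record can be made concrete with the standard formula, speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can be parallelized and n is the number of processors. The sketch below is a generic illustration; the function name and the sample values are not taken from the paper.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Upper bound on speedup when only `parallel_fraction` of the work scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)
```

With 90% of the work parallelizable, ten processors give a speedup of only about 5.3, and no number of processors can push it past 10. This is why a special-purpose array that accelerates only part of an application is limited by the part it leaves untouched.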

17

The Use of a Microcomputer Based Array Processor for Real Time Laser Velocimeter Data Processing  

NASA Technical Reports Server (NTRS)

The application of an array processor to laser velocimeter data processing is presented. The hardware is described along with the method of parallel programming required by the array processor. A portion of the data processing program is described in detail. The increase in computational speed of a microcomputer equipped with an array processor is illustrated by comparative testing with a minicomputer.

Meyers, James F.

1990-01-01

18

Ultrafast Fourier-transform parallel processor  

SciTech Connect

A new, flexible, parallel-processing architecture is developed for a high-speed, high-precision Fourier transform processor. The processor is intended for use in 2-D signal processing including spatial filtering, matched filtering and image reconstruction from projections.

Greenberg, W.L.

1980-04-01

19

Exploring parallelism during processor design space exploration  

Microsoft Academic Search

To exploit the spatial parallelism within target applications, various processor architectures are proposed. However, to estimate the scope of parallelism from high-level application remains a daunting task. A re-targetable as well as efficient High-Level Language (HLL) compiler is needed for that purpose. Building such a compiler in the early phase of processor modeling is extremely difficult. This paper proposes efficient

A. Chattopadhyay; Y. Jia; D. Kammler; R. Leupers; G. Ascheid; H. Meyr

2010-01-01

20

Configurable Soft Processor Arrays Using the OpenFire Processor  

Microsoft Academic Search

Single-chip multiprocessor systems, while requiring significantly less design effort than custom hardware solutions, fall behind custom RTL in performance. In an effort to decrease this performance gap, the individual processors in an array can be tailored to their specific application. In this paper we present the OpenFire, a Xilinx MicroBlaze-compatible processor designed for configurable array research. A sample application is

Stephen Craven; Cameron Patterson; Peter Athanas

2005-01-01

21

Computing Flow Transition On Parallel Processors  

NASA Technical Reports Server (NTRS)

Parallel algorithm developed on multiple-microprocessor computer. Program initiated to develop computer codes capable of directly simulating and mathematically modeling transition process at Mach numbers ranging from subsonic to hypersonic. Parallel computers potentially offer reduction of processing time; processing time inversely proportional to number of available processors.

Bokhari, S.; Erlebacher, G.; Hussaini, M. Y.

1993-01-01

22

Optical Interferometric Parallel Data Processor  

NASA Technical Reports Server (NTRS)

Image data processed faster than in present electronic systems. Optical parallel-processing system effectively calculates two-dimensional Fourier transforms in time required by light to travel from plane 1 to plane 8. Coherence interferometer at plane 4 splits light into parts that form double image at plane 6 if projection screen placed there.

Breckinridge, J. B.

1987-01-01

23

Parallel processor programs in the Federal Government  

NASA Technical Reports Server (NTRS)

In 1982, a report dealing with the nation's research needs in high-speed computing called for increased access to supercomputing resources for the research community, research in computational mathematics, and increased research in the technology base needed for the next generation of supercomputers. Since that time a number of programs addressing future generations of computers, particularly parallel processors, have been started by U.S. government agencies. The present paper provides a description of the largest government programs in parallel processing. Established in fiscal year 1985 by the Institute for Defense Analyses for the National Security Agency, the Supercomputing Research Center will pursue research to advance the state of the art in supercomputing. Attention is also given to the DOE applied mathematical sciences research program, the NYU Ultracomputer project, the DARPA multiprocessor system architectures program, NSF research on multiprocessor systems, ONR activities in parallel computing, and NASA parallel processor projects.

Schneck, P. B.; Austin, D.; Squires, S. L.; Lehmann, J.; Mizell, D.; Wallgren, K.

1985-01-01

24

Fault-tolerant parallel processor  

SciTech Connect

This paper addresses issues central to the design and operation of an ultrareliable, Byzantine resilient parallel computer. Interprocessor connectivity requirements are met by treating connectivity as a resource that is shared among many processing elements, allowing flexibility in their configuration and reducing complexity. Redundant groups are synchronized solely by message transmissions and receptions, which also provide input data consistency and output voting. Reliability analysis results are presented that demonstrate the reduced failure probability of such a system. Performance analysis results are presented that quantify the temporal overhead involved in executing such fault-tolerance-specific operations. Empirical performance measurements of prototypes of the architecture are presented. 30 refs.

Harper, R.E.; Lala, J.H. (Charles Stark Draper Laboratory, Inc., Cambridge, MA (USA))

1991-06-01

25

Grundy: Parallel Processor Architecture Makes Programming Easy  

NASA Astrophysics Data System (ADS)

Grundy, an architecture for parallel processing, facilitates the use of high-level languages. In Grundy, several thousand simple processors are dispersed throughout the address space and the concept of machine state is replaced by an invocation frame, a data structure of local variables, program counter, and pointers to superprocesses (parents), subprocesses (children), and concurrent processes (siblings). Each instruction execution consists of five phases. An instruction is fetched, the instruction is decoded, the sources are fetched, the operation is performed, and the destination is written. This breakdown of operations is easily pipelinable. The instruction format of Grundy is completely orthogonal, so Grundy machine code consists of a set of register transfer control bits. The process state pointers are used to collect unused resources such as processors and memory. Joseph Mahon[1] found that as the degree of physical parallelism increases, throughput, including overhead, increases even if extra overhead is needed to split logical processes. As stack pointer, accumulators, and index registers facilitate using high-level languages on conventional computers, pointers to parents, children, and siblings simplify the use of a run-time operating system. The ability to ignore the physical structure of a large number of simple processors supports the use of structured programming. A very simple processor cell allows the replication of approximately 16 32-bit processors on a single Very Large Scale Integration chip. (2M lambda[2]) A bootstrapper and Input/Output channels can be hardwired (using ROM cells and pseudo-processor cells) into a 100 chip computer that is expected to have over 500 processors, 500K memory, and a network supporting up to 64 concurrent messages between 1000 nodes. These sizes are merely typical and not limits.

Meier, Robert J.

1985-12-01
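
The remark in this record that the five-phase instruction breakdown "is easily pipelinable" can be illustrated with an idealized schedule. The sketch assumes one cycle per phase and no stalls or hazards, which the abstract does not state; the phase names paraphrase the abstract, and `pipeline_schedule` and `total_cycles` are hypothetical helpers.

```python
# The five phases named in the abstract, in execution order.
PHASES = ("fetch", "decode", "fetch sources", "execute", "write destination")


def pipeline_schedule(num_instructions):
    """Map each cycle to the (instruction, phase) pairs active in an ideal pipeline."""
    schedule = {}
    for instr in range(num_instructions):
        for offset, phase in enumerate(PHASES):
            # Instruction i enters the pipeline at cycle i and advances one phase per cycle.
            schedule.setdefault(instr + offset, []).append((instr, phase))
    return schedule


def total_cycles(num_instructions, pipelined=True):
    """Cycles needed to finish all instructions, with or without overlap."""
    if num_instructions == 0:
        return 0
    if pipelined:
        return num_instructions + len(PHASES) - 1
    return num_instructions * len(PHASES)
```

Under these assumptions, ten instructions take 14 cycles when overlapped instead of 50 when run one phase at a time.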

26

Optical Array Processor: Laboratory Results  

NASA Astrophysics Data System (ADS)

A Space Integrating (SI) Optical Linear Algebra Processor (OLAP) is described and laboratory results on its performance in several practical engineering problems are presented. The applications include its use in the solution of a nonlinear matrix equation for optimal control and a parabolic Partial Differential Equation (PDE), the transient diffusion equation with two spatial variables. Frequency-multiplexed, analog and high accuracy non-base-two data encoding are used and discussed. A multi-processor OLAP architecture is described and partitioning and data flow issues are addressed.

Casasent, David; Jackson, James; Vaerewyck, Gerard

1987-01-01

27

Binocular Disparity Calculation on a Massively-Parallel Analog Vision Processor  

E-print Network

We studied neuromorphic models of binocular disparity processing and mapped them onto a vision chip containing a massively parallel analog processor array. Our goal was to make efficient use of the available hardware while ...

Mandal, Soumyajit

28

Multibus-based parallel processor for simulation  

NASA Technical Reports Server (NTRS)

A Multibus-based parallel processor simulation system is described. The system is intended to serve as a vehicle for gaining hands-on experience, testing system and application software, and evaluating parallel processor performance during development of a larger system based on the horizontal/vertical-bus interprocessor communication mechanism. The prototype system consists of up to seven Intel iSBC 86/12A single-board computers which serve as processing elements, a multiple transmission controller (MTC) designed to support system operation, and an Intel Model 225 Microcomputer Development System which serves as the user interface and input/output processor. All components are interconnected by a Multibus/IEEE 796 bus. An important characteristic of the system is that it provides a mechanism for a processing element to broadcast data to other selected processing elements. This parallel transfer capability is provided through the design of the MTC and a minor modification to the iSBC 86/12A board. The operation of the MTC, the basic hardware-level operation of the system, and pertinent details about the iSBC 86/12A and the Multibus are described.

Ogrady, E. P.; Wang, C.-H.

1983-01-01

29

Multiple-fold clustered processor mesh array  

NASA Technical Reports Server (NTRS)

The multiple-fold clustered processor mesh array is a triangular organization of clustered processing elements. This multiple-fold array maintains functional equivalence to the nearest neighbor mesh computer with uni-directional interprocessor communications, but with half the number of connection wires. In addition, the connectivity of the multiple-folded organization is superior to the standard square mesh due to the improved connectivity between the clustered processors. One of the primary application areas targeted is High Performance Architectures for image processing.

Pechanek, Gerald G.; Vassiliadis, Stamatis; Delgado, Jose G.

1993-01-01

30

Phased array antenna beamforming using optical processor  

NASA Technical Reports Server (NTRS)

The feasibility of optical processor based beamforming for microwave array antennas is investigated. The primary focus is on systems utilizing the 20/30 GHz communications band and a transmit configuration exclusively to serve this band. A mathematical model is developed for computation of candidate design configurations. The model is capable of determination of the necessary design parameters required for spatial aspects of the microwave 'footprint' (beam) formation. Computed example beams transmitted from geosynchronous orbit are presented to demonstrate network capabilities. The effect of the processor on the output microwave signal to noise quality at the antenna interface is also considered.

Anderson, L. P.; Boldissar, F.; Chang, D. C. D.

1991-01-01

31

The Massively Parallel Processor and its applications. [for environmental monitoring]  

NASA Technical Reports Server (NTRS)

A long-term experimental development program conducted at Goddard Space Flight Center to implement an ultrahigh-speed data processing system known as the Massively Parallel Processor (MPP) is described. The MPP is a single instruction multiple data stream computer designed to perform logical, integer, and floating point arithmetic operations on variable word length data. Information is presented on system architecture, the system configuration, the array unit architecture, individual processing units, and expected operating rates for several image processing applications (including the processing of Landsat data).

Strong, J. P.; Schaefer, D. H.; Fischer, J. R.; Wallgren, K. R.; Bracken, P. A.

1979-01-01

32

ALGORITHMIC PARTIAL ANALOG-TO-DIGITAL CONVERSION IN MIXED-SIGNAL ARRAY PROCESSORS  

E-print Network

ALGORITHMIC PARTIAL ANALOG-TO-DIGITAL CONVERSION IN MIXED-SIGNAL ARRAY PROCESSORS. Roman Genov ... -parallel and row-cumulative partial algorithmic analog-to-digital conversion on the array. 1. INTRODUCTION ... was demonstrated in [4]. Kerneltron II performs row-parallel delta-sigma analog-to-digital conversion combined

Cauwenberghs, Gert

33

Maximum likelihood identification using an array processor  

NASA Technical Reports Server (NTRS)

Maximum likelihood estimation (MLE) is a method used to calculate the parameters of a dynamic system. It can be applied to a large class of problems and has good statistical properties. The main disadvantage of the MLE method is the amount of computation required. This paper describes how the computation time can be reduced significantly by using an array processor. The estimation of the parameters of a dynamic model of the Space Station is used as an example to evaluate the method.

Sridhar, Banavar; Aubrun, Jean-Noel

1987-01-01

34

Scan line graphics generation on the massively parallel processor  

NASA Technical Reports Server (NTRS)

Described here is how researchers implemented a scan line graphics generation algorithm on the Massively Parallel Processor (MPP). Pixels are computed in parallel and their results are applied to the Z buffer in large groups. To perform pixel value calculations, facilitate load balancing across the processors and apply the results to the Z buffer efficiently in parallel requires special virtual routing (sort computation) techniques developed by the author especially for use on single-instruction multiple-data (SIMD) architectures.

Dorband, John E.

1988-01-01

35

Chemical network problems solved on NASA/Goddard's massively parallel processor computer  

NASA Technical Reports Server (NTRS)

The single instruction stream, multiple data stream Massively Parallel Processor (MPP) unit consists of 16,384 bit serial arithmetic processors configured as a 128 x 128 array whose speed can exceed that of current supercomputers (Cyber 205). The applicability of the MPP for solving reaction network problems is presented and discussed, including the mapping of the calculation to the architecture, and CPU timing comparisons.

Cho, Seog Y.; Carmichael, Gregory R.

1987-01-01

36

Massively Parallel MRI Detector Arrays  

PubMed Central

Originally proposed as a method to increase sensitivity by extending the locally high-sensitivity of small surface coil elements to larger areas, the term parallel imaging now includes the use of array coils to perform image encoding. This methodology has impacted clinical imaging to the point where many examinations are performed with an array comprising multiple smaller surface coil elements as the detector of the MR signal. This article reviews the theoretical and experimental basis for the trend towards higher channel counts relying on insights gained from modeling and experimental studies as well as the theoretical analysis of the so-called “ultimate” SNR and g-factor. We also review the methods for optimally combining array data and changes in RF methodology needed to construct massively parallel MRI detector arrays and show some examples of state-of-the-art for highly accelerated imaging with the resulting highly parallel arrays. PMID:23453758

Keil, Boris; Wald, Lawrence L

2013-01-01

37

Task and instruction scheduling in parallel multithreaded processors  

E-print Network

TASK AND INSTRUCTION SCHEDULING IN PARALLEL MULTITHREADED PROCESSORS A Thesis by AMITABH MISHRA Submitted to the Office of Graduate Studies of Texas A&M University in partial fulfillment of the requirements for the degree of MASTER... OF SCIENCE December 1996 Major Subject: Computer Science TASK AND INSTRUCTION SCHEDULING IN PARALLEL MULTITHREADED PROCESSORS A Thesis by AMITABH MISHRA Submitted to Texas A&M University in partial fulfillment of the requirements for the degree...

Mishra, Amitabh

2012-06-07

38

Processor-Minimum Scheduling of Real-Time Parallel Tasks  

NASA Astrophysics Data System (ADS)

We propose a polynomial-time algorithm for the scheduling of real-time parallel tasks on multicore processors. The proposed algorithm always finds a feasible schedule using the minimum number of processing cores, where tasks have properties of linear speedup, flexible preemption, arbitrary deadlines and arrivals, and parallelism bound. The time complexity of the proposed algorithm is O(M^3 · log N) for M tasks and N processors in the worst case.

Lee, Wan Yeon; Lee, Kyungwoo; Kim, Kyong Hoon; Ko, Young Woong

39

DFT algorithms for bit-serial GaAs array processor architectures  

NASA Technical Reports Server (NTRS)

Systems and Processes Engineering Corporation (SPEC) has developed an innovative array processor architecture for computing Fourier transforms and other commonly used signal processing algorithms. This architecture is designed to extract the highest possible array performance from state-of-the-art GaAs technology. SPEC's architectural design includes a high performance RISC processor implemented in GaAs, along with a Floating Point Coprocessor and a unique Array Communications Coprocessor, also implemented in GaAs technology. Together, these data processors represent the latest in technology, both from an architectural and implementation viewpoint. SPEC has examined numerous algorithms and parallel processing architectures to determine the optimum array processor architecture. SPEC has developed an array processor architecture with integral communications ability to provide maximum node connectivity. The Array Communications Coprocessor embeds communications operations directly in the core of the processor architecture. A Floating Point Coprocessor architecture has been defined that utilizes Bit-Serial arithmetic units, operating at very high frequency, to perform floating point operations. These Bit-Serial devices reduce the device integration level and complexity to a level compatible with state-of-the-art GaAs device technology.

Mcmillan, Gary B.

1988-01-01

40

DSC: Scheduling Parallel Tasks on an Unbounded Number of Processors  

Microsoft Academic Search

We present a low-complexity heuristic, named the dominant sequence clustering algorithm (DSC), for scheduling parallel tasks on an unbounded number of completely connected processors. The performance of DSC is, on average, comparable to, or even better than, that of other higher-complexity algorithms. We assume no task duplication and nonzero communication overhead between processors. Finding the optimum solution for arbitrary directed acyclic task graphs (DAGs) is NP-complete. DSC

Tao Yang; Apostolos Gerasoulis

1994-01-01

41

Increasing processor utilization during parallel computation rundown  

NASA Technical Reports Server (NTRS)

Some parallel processing environments provide for asynchronous execution and completion of general purpose parallel computations from a single computational phase. When all the computations from such a phase are complete, a new parallel computational phase is begun. Depending upon the granularity of the parallel computations to be performed, there may be a shortage of available work as a particular computational phase draws to a close (computational rundown). This can result in the waste of computing resources and the delay of the overall problem. In many practical instances, strict sequential ordering of phases of parallel computation is not totally required. In such cases, the beginning of one phase can be correctly computed before the end of a previous phase is completed. This allows additional work to be generated somewhat earlier to keep computing resources busy during each computational rundown. The conditions under which this can occur are identified and the frequency of occurrence of such overlapping in an actual parallel Navier-Stokes code is reported. A language construct is suggested and possible control strategies for the management of such computational phase overlapping are discussed.

Jones, W. H.

1986-01-01

42

The RAP: a ring array processor for layered network calculations  

Microsoft Academic Search

The authors have designed and implemented a ring array processor, RAP, for fast implementation of layered neural network algorithms. The RAP is a multi-DSP system targeted at continuous speech recognition using connectionist algorithms. Four boards, each with four Texas Instruments TMS320C30 DSPs, serve as an array processor for a 68020-based host running a real-time operating system. The overall system

N. Morgan; J. Beck; P. Kohn; J. Bilmes; E. Allman; J. Beer

1990-01-01

43

Parallel Association Rule Mining with Minimum Inter-Processor Communication  

E-print Network

Parallel Association Rule Mining with Minimum Inter-Processor Communication Mohammad El-Hajj Department of Computing Science University of Alberta Edmonton, AB, Canada mohammad@cs.ualberta.ca Osmar R. Zaïane Department of Computing Science University of Alberta Edmonton, AB, Canada zaiane

Zaiane, Osmar R.

44

A performance study of the hypercube parallel processor architecture  

Microsoft Academic Search

This paper investigates the relationship between workload characteristics and process speedup obtainable on a hypercube parallel processor architecture. There were two goals: first was to determine the functional relationship between workload characteristics and speedup, and second was to show how simulation could be used to model the concurrently executing process to allow estimation of such a relation. The hypercube implementation

C. A. Lamanna; W. H. Jr. Shaw

1991-01-01

45

MASA: a multithreaded processor architecture for parallel symbolic computing  

Microsoft Academic Search

MASA is a “first cut” at a processor architecture intended as a building block for a multiprocessor that can execute parallel Lisp programs efficiently. MASA features a tagged architecture, multiple contexts, fast trap handling, and a synchronization bit in every memory word. MASA's principal novelty is its use of multiple contexts both to support multithreaded execution—interleaved execution from separate instruction

Robert H. Halstead Jr.; Tetsuya Fujita

1988-01-01

46

Staging memory for massively parallel processor  

NASA Technical Reports Server (NTRS)

The invention herein relates to a computer organization capable of rapidly processing extremely large volumes of data. A staging memory is provided having a main stager portion consisting of a large number of memory banks which are accessed in parallel to receive, store, and transfer data words simultaneous with each other. Substager portions interconnect with the main stager portion to match input and output data formats with the data format of the main stager portion. An address generator is coded for accessing the data banks for receiving or transferring the appropriate words. Input and output permutation networks arrange the lineal order of data into and out of the memory banks.

Batcher, Kenneth E. (Inventor)

1988-01-01

47

A parallel particle-in-cell model for the massively parallel processor  

NASA Technical Reports Server (NTRS)

The availability of the nearest-neighbor communication-incorporating Massively Parallel Processor has prompted the development of a two-dimensional, particle-in-cell algorithm which loads particles in a cell randomly onto a row of processors, filling only half of them with particles. Due to the simplification of communications among processors achieved in a row by the vacant processors and the random-particle sequence, the algorithm efficiently sorts particles and performs gather/scatter procedures for collecting charge density according to their cells. The algorithm calculates electric fields at the cells by FFT.

Lin, C. S.; Thring, A. L.; Koga, J.; Seiler, E. J.

1990-01-01

48

Orbital Systolic Algorithms and Array Processors for Solution of the Algebraic Path Problem  

NASA Astrophysics Data System (ADS)

The algebraic path problem (APP) is a general framework which unifies several solution procedures for a number of well-known matrix and graph problems. In this paper, we present a new 3-dimensional (3-D) orbital algebraic path algorithm and corresponding 2-D toroidal array processors which solve the n × n APP in the theoretically minimal number of 3n time-steps. The coordinated time-space scheduling of the computing and data movement in this 3-D algorithm is based on the modular function which preserves the main technological advantages of systolic processing: simplicity, regularity, locality of communications, pipelining, etc. Our design of the 2-D systolic array processors is based on a classical 3-D→2-D space transformation. We have also shown how a data manipulation (copying and alignment) can be effectively implemented in these array processors in a massively-parallel fashion by using a matrix-matrix multiply-add operation.

Sedukhin, Stanislav G.; Miyazaki, Toshiaki; Kuroda, Kenichi
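
To illustrate what "unifying several matrix and graph problems" means in the APP abstract above, the following is a minimal sequential sketch, not the orbital systolic algorithm of the paper. It computes the closure of an n × n matrix over a semiring with a Floyd-Warshall-style pivot loop. The names `Semiring`, `algebraic_path`, `MIN_PLUS`, and `BOOLEAN` are hypothetical, and the `MIN_PLUS` closure assumes a graph without negative cycles.

```python
from typing import Callable, NamedTuple

class Semiring(NamedTuple):
    add: Callable    # "summary" operator, e.g. min or logical or
    mul: Callable    # "extension" operator, e.g. + or logical and
    zero: object     # identity element of add
    one: object      # identity element of mul
    star: Callable   # scalar closure a* = one (+) a (+) a(x)a (+) ...

def algebraic_path(a, s):
    """Return A* = I (+) A (+) A^2 (+) ... for an n x n matrix A over semiring s."""
    n = len(a)
    d = [list(row) for row in a]
    for k in range(n):
        pivot = s.star(d[k][k])
        # Every entry is updated from the previous iterate, so a new matrix is built.
        d = [[s.add(d[i][j], s.mul(s.mul(d[i][k], pivot), d[k][j]))
              for j in range(n)] for i in range(n)]
    for i in range(n):
        d[i][i] = s.add(s.one, d[i][i])   # add the identity (paths of length 0)
    return d

INF = float("inf")

# Shortest paths: (min, +). Assumption: no negative cycles, so a* = 0.
MIN_PLUS = Semiring(add=min, mul=lambda x, y: x + y,
                    zero=INF, one=0.0, star=lambda x: 0.0)

# Reflexive-transitive closure: (or, and).
BOOLEAN = Semiring(add=lambda x, y: x or y, mul=lambda x, y: x and y,
                   zero=False, one=True, star=lambda x: True)
```

Calling `algebraic_path(weights, MIN_PLUS)` on a weight matrix (with `INF` for missing edges) yields all-pairs shortest distances, and the same function with `BOOLEAN` yields reachability. Only the semiring changes, which is the sense in which the APP is a common framework.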

49

Potential of minicomputer/array-processor system for nonlinear finite-element analysis  

NASA Technical Reports Server (NTRS)

The potential of using a minicomputer/array-processor system for the efficient solution of large-scale, nonlinear, finite-element problems is studied. A Prime 750 is used as the host computer, and a software simulator residing on the Prime is employed to assess the performance of the Floating Point Systems AP-120B array processor. Major hardware characteristics of the system such as virtual memory and parallel and pipeline processing are reviewed, and the interplay between various hardware components is examined. Effective use of the minicomputer/array-processor system for nonlinear analysis requires the following: (1) proper selection of the computational procedure and the capability to vectorize the numerical algorithms; (2) reduction of input-output operations; and (3) overlapping host and array-processor operations. A detailed discussion is given of techniques to accomplish each of these tasks. Two benchmark problems with 1715 and 3230 degrees of freedom, respectively, are selected to measure the anticipated gain in speed obtained by using the proposed algorithms on the array processor.

Strohkorb, G. A.; Noor, A. K.

1983-01-01

50

Method and structure for skewed block-cyclic distribution of lower-dimensional data arrays in higher-dimensional processor grids  

DOEpatents

A method and structure of distributing elements of an array of data in a computer memory to a specific processor of a multi-dimensional mesh of parallel processors includes designating a distribution of elements of at least a portion of the array to be executed by specific processors in the multi-dimensional mesh of parallel processors. The pattern of the designating includes a cyclical repetitive pattern of the parallel processor mesh, as modified to have a skew in at least one dimension so that both a row of data in the array and a column of data in the array map to respective contiguous groupings of the processors such that a dimension of the contiguous groupings is greater than one.

Chatterjee, Siddhartha (Yorktown Heights, NY); Gunnels, John A. (Brewster, NY)

2011-11-08

51

The Architecture of the Butterfly Plus Parallel Processor Department of Computer Science  

E-print Network

CS-1988-6 The Architecture of the Butterfly Plus Parallel Processor David Kotz Department of the Butterfly Plus Parallel Processor David Kotz December 16, 1987 Abstract This paper investigates the architecture of the Butterfly Plus Parallel Processor, an MIMD shared-memory machine based on the Motorola MC

Kotz, David

52

A performance study of the hypercube parallel processor architecture  

SciTech Connect

This paper investigates the relationship between workload characteristics and process speedup obtainable on a hypercube parallel processor architecture. There were two goals: first was to determine the functional relationship between workload characteristics and speedup, and second was to show how simulation could be used to model the concurrently executing process to allow estimation of such a relation. The hypercube implementation used in this study was a packet-switched network with predetermined routing and a balanced computational workload. Three independent variables were controlled: total computational workload, number of processors and the message traffic load. A benchmark program was used to estimate the fundamental timing models and to validate a discrete event simulation. Results of this study are useful to software designers seeking to predict the degree of performance improvement attainable on a hypercube class machine. The methodology and results can be extended to other parallel processing architectures.

Lamanna, C.A. (Air Force Inst. of Tech., Wright-Patterson AFB, OH (United States)); Shaw, W.H. Jr. (Florida Inst. of Tech., Melbourne, FL (United States))

1991-03-01

53

Ring-array processor distribution topology for optical interconnects  

NASA Technical Reports Server (NTRS)

The existing linear and rectangular processor distribution topologies for optical interconnects, although promising in many respects, cannot solve problems such as clock skews, the lack of supporting elements for efficient optical implementation, etc. The use of a ring-array processor distribution topology, however, can overcome these problems. Here, a study of the ring-array topology is conducted with an aim of implementing various fast clock rate, high-performance, compact optical networks for digital electronic multiprocessor computers. Practical design issues are addressed. Some proof-of-principle experimental results are included.

Li, Yao; Ha, Berlin; Wang, Ting; Wang, Sunyu; Katz, A.; Lu, X. J.; Kanterakis, E.

1992-01-01

54

Feasibility of optically interconnected parallel processors using wavelength division multiplexing  

SciTech Connect

New national security demands require enhanced computing systems for nearly ab initio simulations of extremely complex systems and analyzing unprecedented quantities of remote sensing data. This computational performance is being sought using parallel processing systems, in which many less powerful processors are ganged together to achieve high aggregate performance. Such systems require increased capability to communicate information between individual processor and memory elements. As it is likely that the limited performance of today's electronic interconnects will prevent the system from achieving its ultimate performance, there is great interest in using fiber optic technology to improve interconnect communication. However, little information is available to quantify the requirements on fiber optical hardware technology for this application. Furthermore, we have sought to explore interconnect architectures that use the complete communication richness of the optical domain rather than using optics as a simple replacement for electronic interconnects. These considerations have led us to study the performance of a moderate size parallel processor with optical interconnects using multiple optical wavelengths. We quantify the bandwidth, latency, and concurrency requirements which allow a bus-type interconnect to achieve scalable computing performance using up to 256 nodes, each operating at GFLOP performance. Our key conclusion is that scalable performance, to approximately 150 GFLOPS, is achievable for several scientific codes using an optical bus with a small number of WDM channels (8 to 32), only one WDM channel received per node, and achievable optoelectronic bandwidth and latency requirements. 21 refs., 10 figs.

Deri, R.J.; De Groot, A.J.; Haigh, R.E.

1996-03-01

55

Automatic Processor Lower Bound Formulas for Array Computations  

Microsoft Academic Search

In the directed acyclic graph (dag) model of algorithms, consider the following problem for precedence-constrained multiprocessor schedules for array computations: Given a sequence of dags and linear schedules parameterized by , compute a lower bound on the number of processors required by the schedule as a function of . This problem is formulated so that the number of

Peter R. Cappello; Ömer Egecioglu

2002-01-01

56

Parallel information transfer in a multinode quantum information processor.  

PubMed

We describe a method for coupling disjoint quantum bits (qubits) in different local processing nodes of a distributed node quantum information processor. An effective channel for information transfer between nodes is obtained by moving the system into an interaction frame where all pairs of cross-node qubits are effectively coupled via an exchange interaction between actuator elements of each node. All control is achieved via actuator-only modulation, leading to fast implementations of a universal set of internode quantum gates. The method is expected to be nearly independent of actuator decoherence and may be made insensitive to experimental variations of system parameters by appropriate design of control sequences. We show, in particular, how the induced cross-node coupling channel may be used to swap the complete quantum states of the local processors in parallel. PMID:22540778

Borneman, T W; Granade, C E; Cory, D G

2012-04-01

57

Optimal mapping of irregular finite element domains to parallel processors  

NASA Technical Reports Server (NTRS)

A mapping of the solution domain of n finite elements into N subdomains, which may be processed in parallel by N processors, is optimal if the subdomain decomposition results in a well-balanced workload distribution among the processors. The problem is discussed in the context of irregular finite element domains as an important aspect of the efficient utilization of the capabilities of emerging multiprocessor computers. Finding the optimal mapping is an intractable combinatorial optimization problem, for which a satisfactory approximate solution is obtained here by analogy to a method used in statistical mechanics for simulating the annealing process in solids. The simulated annealing analogy and algorithm are described, and numerical results are given for mapping an irregular two-dimensional finite element domain containing a singularity onto the Hypercube computer.

Flower, J.; Otto, S.; Salama, M.

1987-01-01
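
The simulated-annealing mapping described in the abstract above can be sketched roughly as follows. This is a generic illustration under stated assumptions, not the authors' algorithm: the cost function (squared load imbalance plus the number of element adjacencies cut by processor boundaries), the geometric cooling schedule, and the names `mapping_cost` and `anneal_mapping` are all hypothetical choices.

```python
import math
import random

def mapping_cost(assign, edges, n_proc, cut_weight=1.0):
    """Assumed cost: squared load imbalance plus weighted count of cut adjacencies."""
    loads = [0] * n_proc
    for p in assign:
        loads[p] += 1
    mean = len(assign) / n_proc
    imbalance = sum((load - mean) ** 2 for load in loads)
    cut = sum(1 for u, v in edges if assign[u] != assign[v])
    return imbalance + cut_weight * cut

def anneal_mapping(n_elem, edges, n_proc, start=None,
                   steps=20000, t0=2.0, alpha=0.999, seed=0):
    """Return (best_assignment, best_cost) found by simulated annealing."""
    rng = random.Random(seed)
    assign = list(start) if start is not None else [
        rng.randrange(n_proc) for _ in range(n_elem)]
    current = mapping_cost(assign, edges, n_proc)
    best, best_cost = assign[:], current
    t = t0
    for _ in range(steps):
        elem = rng.randrange(n_elem)
        old, new = assign[elem], rng.randrange(n_proc)
        if new == old:
            continue  # no move proposed; temperature is left unchanged
        assign[elem] = new
        trial = mapping_cost(assign, edges, n_proc)
        delta = trial - current
        # Metropolis rule: always accept improvements, sometimes accept worse moves.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current = trial
            if current < best_cost:
                best, best_cost = assign[:], current
        else:
            assign[elem] = old  # reject the move
        t *= alpha  # geometric cooling (an assumed schedule)
    return best, best_cost
```

Here `edges` lists pairs of adjacent element indices. Recomputing the full cost at every step keeps the sketch short; a practical code would update the cost incrementally.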

58

Parallel Media Processors for the Billion-Transistor Era Jason Fritts, Zhao Wu, and Wayne Wolf  

E-print Network

Parallel Media Processors for the Billion-Transistor Era Jason Fritts, Zhao Wu, and Wayne Wolf Dept}@ee.princeton.edu Abstract This paper describes the challenges presented by single-chip parallel media processors (PMPs to realize the full potential of programmable media processors. This paper provides both a survey of research

Fritts, Jason

59

A taxonomy of reconfiguration techniques for fault-tolerant processor arrays  

SciTech Connect

The authors overview, characterize, and classify some typical reconfiguration schemes in light of a proposed taxonomy. This taxonomy can be used as a guide for future research in design and analysis of reconfiguration schemes. Studying how to evaluate fault-tolerant arrays and how to exploit application characteristics to achieve dependable computing are important complementary directions of research towards reliable processor-array design. A related research problem is that of functional reconfiguration, that is, learning how to configure the topology of a parallel system to implement a different function or run a different application. Important directions of research include how to apply or extend processor-array reconfiguration algorithms to other topologies and how to marry functional and fault-tolerance reconfiguration requirements and solutions. The Diogenes approach discussed in this article is a case where this goal is naturally achieved.

Chean, M. (Shell Development Co., Houston, TX (USA)); Fortes, J.A.B. (Purdue Univ., Lafayette, IN (USA))

1990-01-01

60

The language parallel Pascal and other aspects of the massively parallel processor  

NASA Technical Reports Server (NTRS)

A high level language for the Massively Parallel Processor (MPP) was designed. This language, called Parallel Pascal, is described in detail. A description of the language design, a description of the intermediate language, Parallel P-Code, and details for the MPP implementation are included. Formal descriptions of Parallel Pascal and Parallel P-Code are given. A compiler was developed which converts programs in Parallel Pascal into the intermediate Parallel P-Code language. The code generator to complete the compiler for the MPP is being developed independently. A Parallel Pascal to Pascal translator was also developed. The architecture design for a VLSI version of the MPP was completed with a description of fault tolerant interconnection networks. The memory arrangement aspects of the MPP are discussed and a survey of other high level languages is given.

Reeves, A. P.; Bruner, J. D.

1982-01-01

61

A CMOS Imager With a Programmable Bit-Serial Column-Parallel SIMD/MIMD Processor  

E-print Network

An imager with an integrated fully programmable bit-serial column-parallel processor is proposed to meet the demand for a compact and versatile system-on-imager chip for consumer applications. The on-imager processor is ...

Yamashita, Hirofumi

62

Particle simulation of plasmas on the massively parallel processor  

NASA Technical Reports Server (NTRS)

Particle simulations, in which collective phenomena in plasmas are studied by following the self consistent motions of many discrete particles, involve several highly repetitive sets of calculations that are readily adaptable to SIMD parallel processing. A fully electromagnetic, relativistic plasma simulation for the massively parallel processor is described. The particle motions are followed in 2 1/2 dimensions on a 128 x 128 grid, with periodic boundary conditions. The two dimensional simulation space is mapped directly onto the processor network; a Fast Fourier Transform is used to solve the field equations. Particle data are stored according to an Eulerian scheme, i.e., the information associated with each particle is moved from one local memory to another as the particle moves across the spatial grid. The method is applied to the study of the nonlinear development of the whistler instability in a magnetospheric plasma model, with an anisotropic electron temperature. The wave distribution function is included as a new diagnostic to allow simulation results to be compared with satellite observations.

Gledhill, I. M. A.; Storey, L. R. O.

1987-01-01

63

Phase space simulation of collisionless stellar systems on the massively parallel processor  

NASA Technical Reports Server (NTRS)

A numerical technique for solving the collisionless Boltzmann equation describing the time evolution of a self gravitating fluid in phase space was implemented on the Massively Parallel Processor (MPP). The code performs calculations for a two dimensional phase space grid (with one space and one velocity dimension). Some results from calculations are presented. The execution speed of the code is comparable to the speed of a single processor of a Cray-XMP. Advantages and disadvantages of the MPP architecture for this type of problem are discussed. The nearest neighbor connectivity of the MPP array does not pose a significant obstacle. Future MPP-like machines should have much more local memory and easier access to staging memory and disks in order to be effective for this type of problem.

White, Richard L.

1987-01-01

64

An informal introduction to program transformation and parallel processors  

SciTech Connect

In the summer of 1992, I had the opportunity to participate in a Faculty Research Program at Argonne National Laboratory. I worked under Dr. Jim Boyle on a project transforming code written in pure functional Lisp to Fortran code to run on distributed-memory parallel processors. To perform this project, I had to learn three things: the transformation system, the basics of distributed-memory parallel machines, and the Lisp programming language. Each of these topics in computer science was unfamiliar to me as a mathematician, but I found that they (especially parallel processing) are greatly impacting many fields of mathematics and science. Since most mathematicians have some exposure to computers, but certainly are not computer scientists, I felt it was appropriate to write a paper summarizing my introduction to these areas and how they can fit together. This paper is not meant to be a full explanation of the topics, but an informal introduction for the "mathematical layman." I place myself in that category as well, as my previous use of computers was as a classroom demonstration tool.

Hopkins, K.W. [Southwest Baptist Univ., Bolivar, MO (United States)]

1994-08-01

65

Prototype Focal-Plane-Array Optoelectronic Image Processor  

NASA Technical Reports Server (NTRS)

Prototype very-large-scale integrated (VLSI) planar array of optoelectronic processing elements combines speed of optical input and output with flexibility of reconfiguration (programmability) of electronic processing medium. Basic concept of processor described in "Optical-Input, Optical-Output Morphological Processor" (NPO-18174). Performs binary operations on binary (black and white) images. Each processing element corresponds to one picture element of image and located at that picture element. Includes input-plane photodetector in form of parasitic phototransistor part of processing circuit. Output of each processing circuit used to modulate one picture element in output-plane liquid-crystal display device. Intended to implement morphological processing algorithms that transform image into set of features suitable for high-level processing; e.g., recognition.

Fang, Wai-Chi; Shaw, Timothy; Yu, Jeffrey

1995-01-01
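
As a rough software analogue of the per-pixel morphological processing mentioned in the abstract above, the sketch below implements binary dilation and erosion, the two basic morphological operations. Each output pixel depends only on its own 3 × 3 neighborhood, which loosely mirrors the idea of one processing element per picture element. The 3 × 3 structuring element, the zero padding at the image border, and the function names are assumptions for illustration; the record does not specify them.

```python
def _neighborhood(img, r, c):
    """Values in the 3x3 window around (r, c); pixels outside the image count as 0."""
    h, w = len(img), len(img[0])
    return [img[rr][cc] if 0 <= rr < h and 0 <= cc < w else 0
            for rr in (r - 1, r, r + 1) for cc in (c - 1, c, c + 1)]

def dilate(img):
    """Binary dilation: a pixel becomes 1 if any pixel in its window is 1."""
    return [[int(any(_neighborhood(img, r, c))) for c in range(len(img[0]))]
            for r in range(len(img))]

def erode(img):
    """Binary erosion: a pixel stays 1 only if every pixel in its window is 1."""
    return [[int(all(_neighborhood(img, r, c))) for c in range(len(img[0]))]
            for r in range(len(img))]
```

Composing the two (erosion followed by dilation, or the reverse) gives the opening and closing operations commonly used to extract features before recognition.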

66

Design of a dataway processor for a parallel image signal processing system  

Microsoft Academic Search

Recently, demands for high-speed signal processing have been increasing especially in the field of image data compression, computer graphics, and medical imaging. To achieve sufficient power for real-time image processing, we have been developing parallel signal-processing systems. This paper describes a communication processor called 'dataway processor' designed for a new scalable parallel signal-processing system. The processor has six high-speed communication

Mitsuru Nomura; Tetsurou Fujii; Sadayasu Ono

1995-01-01

67

Solution of large linear systems of equations on the massively parallel processor  

NASA Technical Reports Server (NTRS)

The Massively Parallel Processor (MPP) was designed as a special machine for specific applications in image processing. As a parallel machine, with a large number of processors that can be reconfigured in different combinations, it is also applicable to other problems that require a large number of processors. The solution of linear systems of equations on the MPP is investigated. The solution times achieved are compared to those obtained with a serial machine and the performance of the MPP is discussed.

Ida, Nathan; Udawatta, Kapila

1987-01-01

68

Complexity Results for Permuting Data and Other Computations on Parallel Processors  

Microsoft Academic Search

For a wide class of problems, we obtain lower bounds for algorithms executed on certain parallel processors. These bounds show that for sufficiently large problems many known algorithms are optimal. The central result of the paper is the following sharper lower bound for permutation algorithms. Any permutation algorithm for N data items on a P processor parallel machine without shared

Allan Gottlieb; Clyde P. Kruskal

1984-01-01

69

Derivative constraints for broad-band element space antenna array processors  

Microsoft Academic Search

In this paper a class of linear constraints, also termed as derivative constraints, which is applicable to broad-band element space antenna array processors, is presented. The performance characteristics of the optimum processor with derivative constraints are demonstrated by computer studies involving two types of array geometries, namely linear and circular arrays. As a consequence of derivative constraints, the beam width

Meng Er; A. Cantoni

1983-01-01

70

ALGORITHMIC PARTIAL ANALOG-TO-DIGITAL CONVERSION IN MIXED-SIGNAL ARRAY PROCESSORS  

E-print Network

was demonstrated in [4]. Kernelrmn II performs row-parallel delta-sigma analog-to-digital conversion combined throughput for low-resolution inputs. The architecture employs algorithmic delta-sigma analog-to-digital-cumulative partial algorithmic. analog-to-digital conversion on the array. 1. INTRODUCTION An internally analog

Genov, Roman

71

Massively parallel processor networks with optical express channels  

DOEpatents

An optical method for separating and routing local and express channel data comprises interconnecting the nodes in a network with fiber optic cables. A single fiber optic cable carries both express channel traffic and local channel traffic, e.g., in a massively parallel processor (MPP) network. Express channel traffic is placed on, or filtered from, the fiber optic cable at a light frequency or a color different from that of the local channel traffic. The express channel traffic is thus placed on a light carrier that skips over the local intermediate nodes one-by-one by reflecting off of selective mirrors placed at each local node. The local-channel-traffic light carriers pass through the selective mirrors and are not reflected. A single fiber optic cable can thus be threaded throughout a three-dimensional matrix of nodes with the x,y,z directions of propagation encoded by the color of the respective light carriers for both local and express channel traffic. Thus frequency division multiple access is used to hierarchically separate the local and express channels to eliminate the bucket brigade latencies that would otherwise result if the express traffic had to hop between every local node to reach its ultimate destination. 3 figs.

Deri, R.J.; Brooks, E.D. III; Haigh, R.E.; DeGroot, A.J.

1999-08-24

72

Massively parallel processor networks with optical express channels  

DOEpatents

An optical method for separating and routing local and express channel data comprises interconnecting the nodes in a network with fiber optic cables. A single fiber optic cable carries both express channel traffic and local channel traffic, e.g., in a massively parallel processor (MPP) network. Express channel traffic is placed on, or filtered from, the fiber optic cable at a light frequency or a color different from that of the local channel traffic. The express channel traffic is thus placed on a light carrier that skips over the local intermediate nodes one-by-one by reflecting off of selective mirrors placed at each local node. The local-channel-traffic light carriers pass through the selective mirrors and are not reflected. A single fiber optic cable can thus be threaded throughout a three-dimensional matrix of nodes with the x,y,z directions of propagation encoded by the color of the respective light carriers for both local and express channel traffic. Thus frequency division multiple access is used to hierarchically separate the local and express channels to eliminate the bucket brigade latencies that would otherwise result if the express traffic had to hop between every local node to reach its ultimate destination.

Deri, Robert J. (Pleasanton, CA); Brooks, III, Eugene D. (Livermore, CA); Haigh, Ronald E. (Tracy, CA); DeGroot, Anthony J. (Castro Valley, CA)

1999-01-01

73

A 125 GOPS 583 mW Network-on-Chip Based Parallel Processor With Bio-Inspired Visual Attention Engine  

Microsoft Academic Search

A network-on-chip (NoC) based parallel processor is presented for bio-inspired real-time object recognition with visual attention algorithm. It contains an ARM10-compatible 32-bit main processor, 8 single-instruction multiple-data (SIMD) clusters with 8 processing elements in each cluster, a cellular neural network based visual attention engine (VAE), a matching accelerator, and a DMA-like external interface. The VAE with 2-D shift register array

Kwanho Kim; Seungjin Lee; Joo-Young Kim; Minsu Kim; Hoi-Jun Yoo

2009-01-01

74

On nonlinear finite element analysis in single-, multi- and parallel-processors  

NASA Technical Reports Server (NTRS)

Numerical solution of nonlinear equilibrium problems of structures by means of Newton-Raphson type iterations is reviewed. Each step of the iteration is shown to correspond to the solution of a linear problem, therefore the feasibility of the finite element method for nonlinear analysis is established. Organization and flow of data for various types of digital computers, such as single-processor/single-level memory, single-processor/two-level-memory, vector-processor/two-level-memory, and parallel-processors, with and without sub-structuring (i.e. partitioning) are given. The effect of the relative costs of computation, memory and data transfer on substructuring is shown. The idea of assigning comparable size substructures to parallel processors is exploited. Under Cholesky type factorization schemes, the efficiency of parallel processing is shown to decrease due to the occasional shared data, just as that due to the shared facilities.

Utku, S.; Melosh, R.; Islam, M.; Salama, M.

1982-01-01
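
The statement in the abstract above that each Newton-Raphson step corresponds to the solution of a linear problem can be made concrete with a small sketch. The two-degree-of-freedom model below (a linear stiffness matrix plus a cubic hardening term with coefficient `c`) is a hypothetical example chosen for illustration and is not taken from the paper. At each iteration the tangent stiffness K_T(u) is formed and the linear system K_T(u) Δu = -(f_int(u) - f_ext) is solved.

```python
def internal_force(u, c=0.5):
    """Assumed model: f_int(u) = K u + c u^3 (componentwise), K = [[2, -1], [-1, 2]]."""
    return [2.0 * u[0] - 1.0 * u[1] + c * u[0] ** 3,
            -1.0 * u[0] + 2.0 * u[1] + c * u[1] ** 3]

def tangent_stiffness(u, c=0.5):
    """Jacobian of internal_force with respect to u."""
    return [[2.0 + 3.0 * c * u[0] ** 2, -1.0],
            [-1.0, 2.0 + 3.0 * c * u[1] ** 2]]

def solve_2x2(a, b):
    """Solve a 2x2 linear system by Cramer's rule (enough for this sketch)."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [(b[0] * a[1][1] - a[0][1] * b[1]) / det,
            (a[0][0] * b[1] - a[1][0] * b[0]) / det]

def newton_raphson(f_ext, tol=1e-10, max_iter=50):
    """Return (u, iterations) such that internal_force(u) matches f_ext within tol."""
    u = [0.0, 0.0]
    for iteration in range(max_iter):
        residual = [fi - fe for fi, fe in zip(internal_force(u), f_ext)]
        if max(abs(r) for r in residual) < tol:
            return u, iteration
        # One linear problem per iteration: K_T(u) * du = -residual
        du = solve_2x2(tangent_stiffness(u), [-r for r in residual])
        u = [ui + dui for ui, dui in zip(u, du)]
    raise RuntimeError("Newton-Raphson did not converge")
```

In a finite element code the 2 × 2 solve would be replaced by a factorization of the assembled tangent stiffness matrix, which is the step that substructuring and parallel processing aim to speed up.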

75

A Mobile Robot with Onboard Parallel Processor and Large Workspace Arm  

Microsoft Academic Search

The MIT AI Lab's second mobile robot, MOBOT-2, has a number of unique design features. In this paper we describe two of them in detail. First, MOBOT-2 has an extremely cheap 32 processor distributed control system. The processor system, called BARNACLE, runs asynchronously with no central locus of control. Unlike almost all other parallel processors this one has

Rodney A. Brooks; Jon Connell; Anita Flynn

1986-01-01

76

Periodic Application of Concurrent Error Detection in Processor Array Architectures. PhD. Thesis -  

NASA Technical Reports Server (NTRS)

Processor arrays can provide an attractive architecture for some applications. Featuring modularity, regular interconnection and high parallelism, such arrays are well-suited for VLSI/WSI implementations, and applications with high computational requirements, such as real-time signal processing. Preserving the integrity of results can be of paramount importance for certain applications. In these cases, fault tolerance should be used to ensure reliable delivery of a system's service. One aspect of fault tolerance is the detection of errors caused by faults. Concurrent error detection (CED) techniques offer the advantage that transient and intermittent faults may be detected with greater probability than with off-line diagnostic tests. Applying time-redundant CED techniques can reduce hardware redundancy costs. However, most time-redundant CED techniques degrade a system's performance.

Chen, Paul Peichuan

1993-01-01

77

Smart-Pixel Array Processors Based on Optimal Cellular Neural Networks for Space Sensor Applications  

NASA Technical Reports Server (NTRS)

A smart-pixel cellular neural network (CNN) with hardware annealing capability, digitally programmable synaptic weights, and multisensor parallel interface has been under development for advanced space sensor applications. The smart-pixel CNN architecture is a programmable multi-dimensional array of optoelectronic neurons which are locally connected with their local neurons and associated active-pixel sensors. Integration of the neuroprocessor in each processor node of a scalable multiprocessor system offers orders-of-magnitude computing performance enhancements for on-board real-time intelligent multisensor processing and control tasks of advanced small satellites. The smart-pixel CNN operation theory, architecture, design and implementation, and system applications are investigated in detail. The VLSI (Very Large Scale Integration) implementation feasibility was illustrated by a prototype smart-pixel 5x5 neuroprocessor array chip of active dimensions 1380 micron x 746 micron in a 2-micron CMOS technology.

Fang, Wai-Chi; Sheu, Bing J.; Venus, Holger; Sandau, Rainer

1997-01-01

78

Design of a dataway processor for a parallel image signal processing system  

NASA Astrophysics Data System (ADS)

Recently, demands for high-speed signal processing have been increasing, especially in the fields of image data compression, computer graphics, and medical imaging. To achieve sufficient power for real-time image processing, we have been developing parallel signal-processing systems. This paper describes a communication processor called 'dataway processor' designed for a new scalable parallel signal-processing system. The processor has six high-speed communication links (Dataways), a data-packet routing controller, a RISC CORE, and a DMA controller. Each communication link operates at 8-bit parallel in a full duplex mode at 50 MHz. Moreover, data routing, DMA, and CORE operations are processed in parallel. Therefore, sufficient throughput is available for high-speed digital video signals. The processor is designed in a top-down fashion using a CAD system called 'PARTHENON.' The hardware is fabricated using 0.5-micrometer CMOS technology and amounts to about 200 K gates.

Nomura, Mitsuru; Fujii, Tetsuro; Ono, Sadayasu

1995-04-01

79

Locality Conscious Processor Allocation and Scheduling for Mixed Parallel Applications  

Microsoft Academic Search

Complex applications can often be viewed as a collection of coarse-grained data-parallel application components with precedence constraints. It has been shown that combining task and data parallelism (mixed parallelism) can be an effective execution paradigm for these applications. In this paper, we present an algorithm to compute the appropriate mix of task and data parallelism based on the scalability characteristics

Nagavijayalakshmi Vydyanathan; Sriram Krishnamoorthy; Gerald Sabin; Ümit V. Çatalyürek; Tahsin M. Kurç; P. Sadayappan; Joel H. Saltz

2006-01-01

80

Parallel/Series-Fed Microstrip Array Antenna  

NASA Technical Reports Server (NTRS)

Characteristics include low cross-polarization and high efficiency. Microstrip array antenna fabricated on two rectangular dielectric substrates. Produces fan-shaped beam polarized parallel to its short axis. Mounted conformally on outside surface of aircraft for use in synthetic-aperture radar. Other antennas of similar design mounted on roofs or sides of buildings, ships, or land vehicles for use in radar or communications.

Huang, John

1994-01-01

81

Image-algebra programming environment for a new fine-grained massively parallel processor  

Microsoft Academic Search

One of the major obstacles facing developers of parallel image processing applications is the lack of efficient programming environments. In this paper, we describe the environment currently under development for supporting image algebra operations on a fine grained, massively parallel processor, the PAL. A graphical design tool is described as are some issues that arise in retargeting a C++ library

Joseph N. Wilson; Robert D. Jackson; Patrick C. Coffield

1997-01-01

82

Using algebra for massively parallel processor design and utilization  

NASA Technical Reports Server (NTRS)

This paper summarizes the author's advances in the design of dense processor networks. Within is reported a collection of recent constructions of dense symmetric networks that provide the largest known values for the number of nodes that can be placed in a network of a given degree and diameter. The constructions are in the range of current potential engineering significance and are based on groups of automorphisms of finite-dimensional vector spaces.

Campbell, Lowell; Fellows, Michael R.

1990-01-01

83

Optimization of Routing and Reconfiguration Overhead in Programmable Processor Array Architectures  

Microsoft Academic Search

In this paper, we present a constraint programming-based approach for optimization of routing and reconfiguration overhead for a class of reconfigurable processor array architectures called weakly programmable. For a given set of different algorithms the execution of which is supposed to be switched upon request at run-time, we provide static solutions for optimal routing of data between processor elements as

Christophe Wolinski; Krzysztof Kuchcinski; Jürgen Teich; Frank Hannig

2008-01-01

84

Highly parallel reconfigurable computer architecture for robotic computation having plural processor cells each having right and left ensembles of plural processors  

NASA Technical Reports Server (NTRS)

In a computer having a large number of single-instruction multiple data (SIMD) processors, each of the SIMD processors has two sets of three individual processor elements controlled by a master control unit and interconnected among a plurality of register file units where data is stored. The register files input and output data in synchronism with a minor cycle clock under control of two slave control units controlling the register file units connected to respective ones of the two sets of processor elements. Depending upon which ones of the register file units are enabled to store or transmit data during a particular minor clock cycle, the processor elements within an SIMD processor are connected in rings or in pipeline arrays, and may exchange data with the internal bus or with neighboring SIMD processors through interface units controlled by respective ones of the two slave control units.

Fijany, Amir (inventor); Bejczy, Antal K. (inventor)

1994-01-01

85

Parallel processors and nonlinear structural dynamics algorithms and software  

NASA Technical Reports Server (NTRS)

A nonlinear structural dynamics finite element program was developed to run on a shared memory multiprocessor with pipeline processors. The program, WHAMS, was used as a framework for this work. The program employs explicit time integration and has the capability to handle both the nonlinear material behavior and large displacement response of 3-D structures. The elasto-plastic material model uses an isotropic strain hardening law which is input as a piecewise linear function. Geometric nonlinearities are handled by a corotational formulation in which a coordinate system is embedded at the integration point of each element. Currently, the program has an element library consisting of a beam element based on Euler-Bernoulli theory and triangular and quadrilateral plate elements based on Mindlin theory.

Belytschko, Ted

1989-01-01

86

Aligning parallel arrays to reduce communication  

NASA Technical Reports Server (NTRS)

Axis and stride alignment is an important optimization in compiling data-parallel programs for distributed-memory machines. We previously developed an optimal algorithm for aligning array expressions. Here, we examine alignment for more general program graphs. We show that optimal alignment is NP-complete in this setting, so we study heuristic methods. This paper makes two contributions. First, we show how local graph transformations can reduce the size of the problem significantly without changing the best solution. This allows more complex and effective heuristics to be used. Second, we give a heuristic that can explore the space of possible solutions in a number of ways. We show that some of these strategies can give better solutions than a simple greedy approach proposed earlier. Our algorithms have been implemented; we present experimental results showing their effect on the performance of some example programs running on the CM-5.

Sheffler, Thomas J.; Schreiber, Robert; Gilbert, John R.; Chatterjee, Siddhartha

1994-01-01

87

On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing  

E-print Network

On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing Mayank Daga, Ashwin M applications can be bottlenecked by data transfers between the CPU and GPU over PCIe. Emerging heterogeneous computing architectures that "fuse" the functionality of the CPU and GPU, e.g., AMD Fusion and Intel

Virginia Tech

88

Parallel processors and nonlinear structural dynamics algorithms and software  

NASA Technical Reports Server (NTRS)

The adaptation of a finite element program with explicit time integration to a massively parallel SIMD (single instruction multiple data) computer, the CONNECTION Machine is described. The adaptation required the development of a new algorithm, called the exchange algorithm, in which all nodal variables are allocated to the element with an exchange of nodal forces at each time step. The architectural and C* programming language features of the CONNECTION Machine are also summarized. Various alternate data structures and associated algorithms for nonlinear finite element analysis are discussed and compared. Results are presented which demonstrate that the CONNECTION Machine is capable of outperforming the CRAY XMP/14.

Belytschko, Ted; Gilbertsen, Noreen D.; Neal, Mark O.; Plaskacz, Edward J.

1989-01-01

89

A Standardized Test Case (STC) For Sensor Array Processor Evaluation  

Microsoft Academic Search

The use of oversimplified test data for evaluating new and promising space-time signal processing algorithms is common in the signal processing literature. A preoccupation with the case of two high SNR planewaves has promoted angular resolution as the primary metric of performance. Threshold detection and processor robustness to model error have become secondary. Accordingly, a realistic seven source standardized test

N. L. Owsley

1991-01-01

90

Impact of shipping Ball-Grid-Array Notebook processors in tape and reel on the PC supply chain  

E-print Network

Today, approximately 90% of Intel notebook processors are packaged in PGA (Pin Grid Array) and 10% are packaged in BGA (Ball Grid Array). Intel has recently made a decision to transform the notebook industry by creating a ...

Chuang, Pamela

2012-01-01

91

MEMS Microphone Array and Signal Processor for Realtime Object Detection  

NASA Astrophysics Data System (ADS)

We have developed an ultrasonic sound processing system for 3D imaging with 128 microelectromechanical systems (MEMS) microphones and a highly configurable field programmable gate array (FPGA). The system consists of a sensor array board, analog-to-digital converter (ADC) modules, and a processing board. The ultrasonic MEMS sensors are precisely aligned on a printed circuit board (PCB) to form a 16 × 8 planar grid with 112° × 50° viewing angles for wide-band signals with the center frequency at 40 kHz.

Maeda, Yasushige; Sugimoto, Masanori; Hashizume, Hiromichi

92

Software development on the High-Speed Systolic Array Processor (HISSAP): Lessons learned. Final report, Mar 88-Mar 91  

SciTech Connect

This report documents the lessons learned in programming the Naval Ocean Systems Center's (NOSC's) High-Speed Systolic Array Processor (HISSAP) testbed. The procedures used for code generation, along with the programming utilities provided in the software development environment, are discussed with regard to their impact on the efficient implementation of algorithms on a parallel processing system such as HISSAP. This information is intended for considerations pertaining to software-development environments in future Navy parallel processing systems. Many of HISSAP's software-development utilities played key roles in the implementation of two computationally intensive algorithms: the Multiple-Signal Classification algorithm (MUSIC) and a four-channel, narrowband, finite-impulse response (FIR) filter. The introduction of utilities not included with the HISSAP tools would undoubtedly have increased the speed and efficiency of software development.

Tirpak, F.M.

1991-06-01

93

Algorithm-Based Error Detection Of A Cholesky Factor Updating Systolic Array Using Cordic Processors  

NASA Astrophysics Data System (ADS)

Lincoln Laboratory has developed an architecture for a folded linear systolic array using fixed-point CORDIC processors, applicable to adaptive nulling for a radar sidelobe canceler. The algorithm implemented uses triangularization by Givens rotations to solve a least-squares problem in the voltage domain. In this paper, the implementation of an inexpensive algorithm-based error-detection scheme is proposed for this systolic array. Column average checksum encoding is intended to detect most errors caused by the failure of any single arithmetic unit. It retains or almost retains the 100% processor utilization of Lincoln Laboratory's novel design. For the case of 64 degrees of freedom, the increase in time complexity is only 3%. The increase in hardware is mainly two adders and two comparators per CORDIC processor. We believe that the small increase in cost will be amply offset by the improvement in system performance brought about by this error detection.

Chou, S. I.; Rader, Charles M.

1989-12-01
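The abstract above relies on triangularization by Givens rotations. As a rough illustration of that building block only, the sketch below reduces a small matrix to upper-triangular form with floating-point rotations. It is an assumption-laden stand-in: it does not model the fixed-point CORDIC arithmetic, the folded systolic layout, or the column average checksum scheme, and the function names are hypothetical.

```python
import math


def givens(a, b):
    """Return (c, s) so that the rotation [[c, s], [-s, c]] maps (a, b) to (r, 0)."""
    r = math.hypot(a, b)
    if r == 0.0:
        return 1.0, 0.0
    return a / r, b / r


def triangularize(matrix):
    """Zero every entry below the diagonal using row-pair Givens rotations."""
    rows = [[float(x) for x in row] for row in matrix]
    n_rows, n_cols = len(rows), len(rows[0])
    for col in range(n_cols):
        # Work upward from the bottom row, folding each entry into the row above.
        for row in range(n_rows - 1, col, -1):
            c, s = givens(rows[row - 1][col], rows[row][col])
            upper, lower = rows[row - 1], rows[row]
            for k in range(n_cols):
                up_val, lo_val = upper[k], lower[k]
                upper[k] = c * up_val + s * lo_val
                lower[k] = -s * up_val + c * lo_val
    return rows


R = triangularize([[3, 1], [4, 2], [0, 5]])
```

Because each rotation is orthogonal, column norms are preserved, which is one property a checksum-style consistency test could exploit; the specific encoding used in the paper is not reproduced here.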

94

Implementation of context independent code on a new array processor: The Super-65  

NASA Technical Reports Server (NTRS)

The feasibility of rewriting standard uniprocessor programs into code which contains no context-dependent branches is explored. Context independent code (CIC) would contain no branches that might require different processing elements to branch different ways. In order to investigate the possibilities and restrictions of CIC, several programs were recoded into CIC and a four-element array processor was built. This processor (the Super-65) consisted of three 6502 microprocessors and the Apple II microcomputer. The results obtained were somewhat dependent upon the specific architecture of the Super-65 but within bounds, the throughput of the array processor was found to increase linearly with the number of processing elements (PEs). The slope of throughput versus PEs is highly dependent on the program and varied from 0.33 to 1.00 for the sample programs.

Colbert, R. O.; Bowhill, S. A.

1981-01-01
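To make the idea of context independent code in the record above more concrete, here is a small sketch of one common way to remove a data-dependent branch: turn the comparison into a 0/1 mask and select the result arithmetically, so every processing element follows the same instruction path. The clamp operation and function names are hypothetical and are not taken from the report, which does not describe its recoding technique in this excerpt.

```python
def clamp_with_branch(x, limit):
    """Conventional version: control flow depends on the data value."""
    if x > limit:
        return limit
    return x


def clamp_branch_free(x, limit):
    """Same result, but the comparison yields data (0 or 1), not a branch."""
    mask = int(x > limit)
    return mask * limit + (1 - mask) * x


def run_lockstep(values, limit):
    """Apply the identical instruction sequence to every element."""
    return [clamp_branch_free(v, limit) for v in values]


clamped = run_lockstep([1, 5, 9], limit=4)
```

The branch-free form always performs both the multiply and the add, so it trades some extra arithmetic per element for a uniform instruction stream.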

95

Othello Solver based on a soft-core MIMD processor array  

Microsoft Academic Search

This report presents an Othello Solver based on a 32-bit original soft-core Multiple Instruction stream, Multiple Data stream (MIMD) processor array targeting a single field programmable gate array (FPGA), Cyclone II (EP2C70D896C6N), on a DE2 Development and Education Board (Altera Corp.). The solver can execute a move-checking operation, a disc flipping operation, a move selection operation, an evaluation operation, and

T. Mabuchi; T. Watanabe; R. Moriwaki; Y. Aoyama; A. Gundjalam; Y. Yamaji; H. Nakada; M. Watanabe

2010-01-01

96

High Linearity Voltage Response Parallel-Array Cell  

NASA Astrophysics Data System (ADS)

We studied in detail a cell consisting of two parallel SQUID arrays or two parallel superconducting interference filters (SQIFs) connected differentially with the goal of achieving highly linear voltage response to magnetic signal. In these different cell designs, we accounted for realistic values of coupling inductances, in contrast to the limiting case of vanishing inductances considered earlier. We found that a cell based on regular parallel SQUID arrays produces higher linearity as compared to the cell based on SQIFs. This high-linearity cell can be used for realizing Superconducting Quantum Arrays (SQA) capable of providing a broadband, highly-linear magnetic field-to-voltage transfer function and high dynamic range.

Kornev, V.; Kolotinskiy, N.; Skripka, V.; Sharafiev, A.; Soloviev, I.; Mukhanov, O.

2014-05-01

97

A survey of problems and preliminary results concerning parallel processing and parallel processors  

Microsoft Academic Search

After an introduction which discusses the significance of a trend to the design of parallel processing systems, the paper describes some of the results obtained to date in a project which aims to develop and evaluate a unified hardware-software parallel processing computing system and the techniques for its use.

M. Lehman

1966-01-01

98

Parallel collective resonances in arrays of gold nanorods.  

PubMed

In this work we discuss the excitation of parallel collective resonances in arrays of gold nanoparticles. Parallel collective resonances result from the coupling of the nanoparticles' localized surface plasmons with diffraction orders traveling in the direction parallel to the polarization vector. While they provide field enhancement and delocalization as the standard collective resonances, our results suggest that parallel resonances could exhibit greater tolerance to index asymmetry in the environment surrounding the arrays. The near- and far-field properties of these resonances are analyzed, both experimentally and numerically. PMID:24645987

Vitrey, Alan; Aigouy, Lionel; Prieto, Patricia; García-Martín, José Miguel; González, María U

2014-04-01

99

Sensor array processor evaluation with a standardized test case (STC)  

Microsoft Academic Search

A standard test case (STC) for the comparative evaluation of the plethora of modern space-time signal processing techniques is summarized. The STC is thought to be realistic in the context of superimposed multiple discrete and continuous ambient noise fields for a uniform linear array of sensors. Attention is given to the MUSIC direction finder (DF), the ESPRIT DF, the linear

Alain C. Barthelemy; Norman L. Owsley

1991-01-01

100

Run-time recognition of task parallelism within the P++ parallel array class library  

SciTech Connect

This paper explores the use of a run-time system to recognize task parallelism with a C++ array class library. Run-time systems currently support data parallelism in P++, FORTRAN 90 D, and High Performance FORTRAN. But data parallelism is insufficient for many applications, including adaptive mesh refinement. Without access to both data and task parallelism such applications exhibit several orders of magnitude more message passing and poor performance. In this work, a C++ array class library is used to implement deferred evaluation and run-time dependence for task parallelism recognition, to obtain task parallelism through a data flow interpretation of data parallel array statements. Performance results show that the analysis and optimizations are both efficient and practical, allowing us to consider more substantial optimizations.

Parsons, R.; Quinlan, D.

1993-11-01

101

Parallel implementation of RX anomaly detection on multi-core processors: impact of data partitioning strategies  

NASA Astrophysics Data System (ADS)

Anomaly detection is an important task for remotely sensed hyperspectral data exploitation. One of the most widely used and successful algorithms for anomaly detection in hyperspectral images is the Reed-Xiaoli (RX) algorithm. Despite its wide acceptance and high computational complexity when applied to real hyperspectral scenes, few documented parallel implementations of this algorithm exist, in particular for multi-core processors. The advantage of multi-core platforms over other specialized parallel architectures is that they are a low-power, inexpensive, widely available and well-known technology. A critical issue in the parallel implementation of RX is the sample covariance matrix calculation, which can be approached in global or local fashion. This aspect is crucial for the RX implementation since the consideration of a local or global strategy for the computation of the sample covariance matrix is expected to affect both the scalability of the parallel solution and the anomaly detection results. In this paper, we develop new parallel implementations of the RX in multi-core processors and specifically investigate the impact of different data partitioning strategies when parallelizing its computations. For this purpose, we consider both global and local data partitioning strategies in the spatial domain of the scene, and further analyze their scalability in different multi-core platforms. The numerical effectiveness of the considered solutions is evaluated using receiver operating characteristics (ROC) curves, analyzing their capacity to detect thermal hot spots (anomalies) in hyperspectral data collected by NASA's Airborne Visible Infra-Red Imaging Spectrometer system over the World Trade Center in New York, five days after the terrorist attacks of September 11th, 2001.

Molero, Jose M.; Garzón, Ester M.; García, Inmaculada; Plaza, Antonio

2011-11-01

102

Q-plates micro-arrays for parallel processing of the photon orbital angular momentum  

NASA Astrophysics Data System (ADS)

We report on the realization of electrically tunable micro-arrays of space-variant optically anisotropic optical vortex generators. Each individual light orbital angular momentum processor consists of a microscopic self-engineered nematic liquid crystal q-plate made of a nonsingular topological defect spontaneously formed under electric field. Both structural and optical characterizations of the obtained spin-orbit optical interface are analyzed. An analytical model is derived and results of simulations are compared with experimental data. The application potential in terms of parallel processing of the optical orbital angular momentum is quantitatively discussed.

Loussert, Charles; Kushnir, Kateryna; Brasselet, Etienne

2014-09-01

103

Parallel Contextual Hexagonal Array Grammars and Languages  

NASA Astrophysics Data System (ADS)

Hexagonal patterns are known to occur in the literature on picture processing and image analysis. Siromoney et al. constructed hexagonal array grammars for generating hexagonal arrays and hexagonal patterns. On the other hand, Marcus introduced a class of grammars called contextual grammars in contrast to Chomskian grammars that generate words by starting with an initial word and adding iteratively pairs of words called contexts associated to a set of words called selector to the words already obtained.

Thomas, D. G.; Begam, M. H.; David, N. G.

104

Efficient Interprocedural Array Dataflow Analysis for Automatic Program Parallelization  

E-print Network

parallelizing compiler. Our techniques are based on guarded array regions and the resulting tool runs faster, by one or two orders of magnitude, than other similarly powerful tools. Key words: Parallelizing compiler, but it can also support compiler techniques for memory performance enhancement and efficient message

Li, Zhiyuan

105

Parallel Agent-Based Simulations on Clusters of GPUs and Multi-Core Processors  

SciTech Connect

An effective latency-hiding mechanism is presented in the parallelization of agent-based model simulations (ABMS) with millions of agents. The mechanism is designed to accommodate the hierarchical organization as well as heterogeneity of current state-of-the-art parallel computing platforms. We use it to explore the computation vs. communication trade-off continuum available with the deep computational and memory hierarchies of extant platforms and present a novel analytical model of the tradeoff. We describe our implementation and report preliminary performance results on two distinct parallel platforms suitable for ABMS: CUDA threads on multiple, networked graphical processing units (GPUs), and pthreads on multi-core processors. Message Passing Interface (MPI) is used for inter-GPU as well as inter-socket communication on a cluster of multiple GPUs and multi-core processors. Results indicate the benefits of our latency-hiding scheme, delivering as much as over 100-fold improvement in runtime for certain benchmark ABMS application scenarios with several million agents. This speed improvement is obtained on our system that is already two to three orders of magnitude faster on one GPU than an equivalent CPU-based execution in a popular simulator in Java. Thus, the overall execution of our current work is over four orders of magnitude faster when executed on multiple GPUs.

Aaby, Brandon G [ORNL; Perumalla, Kalyan S [ORNL; Seal, Sudip K [ORNL

2010-01-01

106

Inventory estimation on the massively parallel processor. [from satellite based images  

NASA Technical Reports Server (NTRS)

This paper describes algorithms for efficiently computing inventory estimates from satellite based images. The algorithms incorporate a one dimensional feature extraction which optimizes the pairwise sum of Fisher distances. Biases are eliminated with a premultiplication by the inverse of the analytically derived error matrix. The technique is demonstrated with a numerical example using statistics obtained from an actual Landsat scene. Attention was given to implementation on the Massively Parallel Processor (MPP). A timing analysis demonstrates that the inventory estimation can be performed an order of magnitude faster on the MPP than on a conventional serial machine.

Argentiero, P. D.; Strong, J. P.; Koch, D. W.

1980-01-01

107

Block iterative restoration of astronomical images with the massively parallel processor  

NASA Technical Reports Server (NTRS)

A method is described for algebraic image restoration capable of treating astronomical images. For a typical 500 x 500 image, direct algebraic restoration would require the solution of a 250,000 x 250,000 linear system. The block iterative approach is used to reduce the problem to solving 4900 121 x 121 linear systems. The algorithm was implemented on the Goddard Massively Parallel Processor, which can solve a 121 x 121 system in approximately 0.06 seconds. Examples are shown of the results for various astronomical images.

Heap, Sara R.; Lindler, Don J.

1987-01-01
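The figures quoted in the abstract above can be checked with a few lines of arithmetic. The sketch below only restates those numbers; it assumes that the 4900 block systems are solved one after another and counts a single pass over them, neither of which the abstract states, and it says nothing about how the blocks are laid out on the image.

```python
# Figures taken from the abstract.
image_side = 500            # pixels per side of a typical image
block_unknowns = 121        # unknowns per block system
num_blocks = 4900           # number of block systems
seconds_per_block = 0.06    # approximate MPP solve time per block

# Direct algebraic restoration: one unknown per pixel.
direct_unknowns = image_side * image_side
direct_coefficients = direct_unknowns ** 2

# Block iterative approach: many small dense systems instead of one huge one.
block_coefficients = num_blocks * block_unknowns ** 2

# Assumption: blocks are solved sequentially, and this is one pass only.
seconds_per_pass = num_blocks * seconds_per_block

print(f"direct system: {direct_unknowns} x {direct_unknowns}")
print(f"coefficients, direct vs block: {direct_coefficients} vs {block_coefficients}")
print(f"time for one pass over all blocks: about {seconds_per_pass:.0f} s")
```

The 250,000 x 250,000 figure in the abstract follows directly from 500 x 500 unknowns; the coefficient counts indicate why the single dense system is impractical while the block systems are not.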

108

Evaluation of soft-core processors on a Xilinx Virtex-5 field programmable gate array.  

SciTech Connect

Node-based architecture (NBA) designs for future satellite projects hold the promise of decreasing system development time and costs, size, weight, and power and positioning the laboratory to address other emerging mission opportunities quickly. Reconfigurable field programmable gate array (FPGA)-based modules will comprise the core of several of the NBA nodes. Microprocessing capabilities will be necessary with varying degrees of mission-specific performance requirements on these nodes. To enable the flexibility of these reconfigurable nodes, it is advantageous to incorporate the microprocessor into the FPGA itself, either as a hard-core processor built into the FPGA or as a soft-core processor built out of FPGA elements. This document describes the evaluation of three reconfigurable FPGA-based soft-core processors for use in future NBA systems: the MicroBlaze (uB), the open-source Leon3, and the licensed Leon3. Two standard performance benchmark applications were developed for each processor. The first, Dhrystone, is a fixed-point operation metric. The second, Whetstone, is a floating-point operation metric. Several trials were run at varying code locations, loop counts, processor speeds, and cache configurations. FPGA resource utilization was recorded for each configuration.

Learn, Mark Walter

2011-04-01

109

Parallel Access of Out-Of-Core Dense Extendible Arrays  

SciTech Connect

Datasets used in scientific and engineering applications are often modeled as dense multi-dimensional arrays. For very large datasets, the corresponding array models are typically stored out-of-core as array files. The array elements are mapped onto linear consecutive locations that correspond to the linear ordering of the multi-dimensional indices. Two conventional mappings used are the row-major order and the column-major order of multi-dimensional arrays. Such conventional mappings of dense array files highly limit the performance of applications and the extendibility of the dataset. Firstly, an array file that is organized in, say, row-major order causes applications that subsequently access the data in column-major order to have abysmal performance. Secondly, any subsequent expansion of the array file is limited to only one dimension. Expansions of such out-of-core conventional arrays along arbitrary dimensions require storage reorganization that can be very expensive. We present a solution for storing out-of-core dense extendible arrays that resolves the two limitations. The method uses a mapping function F*(), together with information maintained in axial vectors, to compute the linear address of an extendible array element when passed its k-dimensional index. We also give the inverse function, F^-1*(), for deriving the k-dimensional index when given the linear address. We show how the mapping function, in combination with MPI-IO and a parallel file system, allows for the growth of the extendible array without reorganization and no significant performance degradation of applications accessing elements in any desired order. We give methods for reading and writing sub-arrays into and out of parallel applications that run on a cluster of workstations. The axial-vectors are replicated and maintained in each node that accesses sub-array elements.

Otoo, Ekow J; Rotem, Doron

2007-07-26
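The two conventional mappings criticized in the abstract above, and the reason they allow growth along only one dimension, can be sketched as follows. This illustrates only the standard row-major and column-major address computations and a row-major inverse; it is not the paper's F*() mapping or its axial vectors, which are not specified in this excerpt. The function names are hypothetical.

```python
def row_major_address(index, shape):
    """Linear address when the last index varies fastest."""
    address = 0
    for i, n in zip(index, shape):
        address = address * n + i
    return address


def column_major_address(index, shape):
    """Linear address when the first index varies fastest."""
    address = 0
    for i, n in zip(reversed(index), reversed(shape)):
        address = address * n + i
    return address


def row_major_index(address, shape):
    """Inverse of row_major_address: recover the k-dimensional index."""
    index = []
    for n in reversed(shape):
        address, i = divmod(address, n)
        index.append(i)
    return tuple(reversed(index))


element = (1, 2)
before = row_major_address(element, (3, 4))
grow_first = row_major_address(element, (5, 4))   # extend dimension 0
grow_last = row_major_address(element, (3, 6))    # extend dimension 1
```

In row-major order, extending the first dimension leaves every existing address unchanged (before equals grow_first), while extending the last dimension moves existing elements (grow_last differs), which is the reorganization cost the abstract refers to.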

110

Application of an array processor to the analysis of magnetic data for the Doublet III tokamak  

SciTech Connect

Discussed herein is a fast computational technique employing the Floating Point Systems AP-190L array processor to analyze magnetic data for the Doublet III tokamak, a fusion research device. Interpretation of the experimental data requires the repeated solution of a free-boundary nonlinear partial differential equation, which describes the magnetohydrodynamic (MHD) equilibrium of the plasma. For this particular application, we have found that the array processor is only 1.4 and 3.5 times slower than the CDC-7600 and CRAY computers, respectively. The overhead on the host DEC-10 computer was kept to a minimum by chaining the complete Poisson solver and free-boundary algorithm into one single-load module using the vector function chainer (VFC). A simple time-sharing scheme for using the MHD code is also discussed.

Wang, T.S.; Saito, M.T.

1980-08-01

111

Nested crossbar connection networks for optically interconnected processor arrays for vector-matrix multiplication.  

PubMed

A family of new interconnection networks, termed the nested crossbar, has been developed. These networks are particularly well suited to optical interconnects due to their high bisection width and high degree of space invariance. Algorithms have been developed for computing vector-matrix multiplication with nested crossbar connected processor arrays with time growth rates between O(1) and O(N^(1/2)). (N is the number of elements in the vector.) When these algorithms are implemented on holographic optically interconnected very large scale integrated processor arrays, the nested crossbar networks have area and time growth rates close to fundamental lower bounds. The nested crossbar networks also allow the use of a minimum number of transmitters and detectors. PMID:20562963

Feldman, M R; Guest, C C

1990-03-10

112

Animated computer graphics models of space and earth sciences data generated via the massively parallel processor  

NASA Technical Reports Server (NTRS)

The capability was developed of rapidly producing visual representations of large, complex, multi-dimensional space and earth sciences data sets via the implementation of computer graphics modeling techniques on the Massively Parallel Processor (MPP) by employing techniques recently developed for typically non-scientific applications. Such capabilities can provide a new and valuable tool for the understanding of complex scientific data, and a new application of parallel computing via the MPP. A prototype system with such capabilities was developed and integrated into the National Space Science Data Center's (NSSDC) Pilot Climate Data System (PCDS) data-independent environment for computer graphics data display to provide easy access to users. While developing these capabilities, several problems had to be solved independently of the actual use of the MPP, all of which are outlined.

Treinish, Lloyd A.; Gough, Michael L.; Wildenhain, W. David

1987-01-01

113

GPU Kernels as Data-Parallel Array Computations in Haskell  

E-print Network

We present a novel high-level parallel programming model aimed at graphics processing units (GPUs). We embed GPU kernels as data-parallel array computations in the purely functional language Haskell. GPU and CPU computations can be freely interleaved with the type system tracking the two different modes of computation. The embedded language of array computations is sufficiently limited that our system can automatically extract these computations and compile them to efficient GPU code. In this paper, we outline our approach and present the results of a few preliminary benchmarks.

Sean Lee; Vinod Grover; Manuel M. T. Chakravarty; Gabriele Keller

2009-01-01

114

GPU Kernels as Data-Parallel Array Computations  

E-print Network

We present a novel high-level parallel programming model for graphics processing units (GPUs). We embed GPU kernels as data-parallel array computations in the purely functional language Haskell. GPU and CPU computations can be freely interleaved with the type system tracking the two different modes of computation. The embedded language of array computations is sufficiently limited that our system can automatically isolate and extract these computations and compile them to efficient GPU code. In this paper, we outline our approach and present the results of a few preliminary benchmarks.

Sean Lee; Manuel M. T. Chakravarty; Vinod Grover; Gabriele Keller

115

Waveguide-fed parallel plate slot array antenna  

NASA Astrophysics Data System (ADS)

A novel planar antenna in which radiating slots are arrayed on one side of a square parallel plate waveguide and coupling slots occupy the other side is proposed. The antenna is excited via the coupling slots by a rectangular waveguide. In order to suppress unwanted reflections and to assure the purity of the transverse electromagnetic (TEM) traveling-wave mode in the parallel plate waveguide, all the slots are arrayed in pairs. An X-band model antenna was fabricated, and uniform aperture illumination was demonstrated with 48 percent antenna efficiency. These results demonstrate the feasibility of antennas of this type.

Hirokawa, Jiro; Ando, Makoto; Goto, Naohisa

1992-02-01

116

PARRAY: a unifying array representation for heterogeneous parallelism  

Microsoft Academic Search

This paper introduces a programming interface called PARRAY (or Parallelizing ARRAYs) that supports system-level succinct programming for heterogeneous parallel systems like GPU clusters. The current practice of software development requires combining several low-level libraries like Pthread, OpenMP, CUDA and MPI. Achieving productivity and portability is hard with different numbers and models of GPUs. PARRAY extends mainstream C programming with novel

Yifeng Chen; Xiang Cui; Hong Mei

2012-01-01

117

Optoelectronic implementation of a 256-channel sonar adaptive-array processor.  

PubMed

We present an optoelectronic implementation of an adaptive-array processor that is capable of performing beam forming and jammer nulling in signals of wide fractional bandwidth that are detected by an array of arbitrary topology. The optical system makes use of a two-dimensional scrolling spatial light modulator to represent an array of input signals in 256 tapped delay lines, two acousto-optic modulators for modulating the feedback error signal, and a photorefractive crystal for representing the adaptive weights as holographic gratings. Gradient-descent learning is used to dynamically adapt the holographic weights to optimally form multiple beams and to null out multiple interference sources, either in the near field or in the far field. Space-integration followed by differential heterodyne detection is used for generating the system's output. The processor is analyzed to show the effects of exponential weight decay on the optimum solution and on the convergence conditions. Several experimental results are presented that validate the system's capacity for broadband beam forming and jammer nulling for linear and circular arrays. PMID:15617279

Silveira, Paulo E X; Pati, Gour S; Wagner, Kelvin H

2004-12-10

118

Simulation study of a parallel processor with unbalanced loads. Master's thesis  

SciTech Connect

The purpose of this thesis was twofold: to estimate the impact of unbalanced computational loads on a parallel-processing architecture via Monte Carlo simulation; and second to investigate the impact of representing the dynamics of the parallel-processing problem via animated simulation. It is constrained to the hypercube architecture in which each node is connected in a predetermined topology and allowed to communicate to other nodes through calls to the operating system. Routing of messages through the network is fixed and specified within the operating system. Message-transmission preempts nodal processing causing internodal communications to complicate the concurrent operation of the network. Two independent variables are defined: 1) the degree of imbalance characterizes the nature or severity of the load imbalance, and 2) the degree of locality characterizes the node loadings with respect to node locations across the cube. A SLAM II simulation model of a generic 16 node hypercube was constructed in which each node processes a predetermined number of computational tasks and, following each task, sends a message to a single randomly chosen receiver node. An experiment was designed in which the independent variables, degree of imbalance and degree of locality were varied across two computation-to-IO ratios to determine their separate and interactive effects on the dependent variable, job speedup. ANOVA and regression techniques were used to estimate the relationship between load imbalance, locality, computation-to-IO ratio, and their interactions to job speedup. Results show that load imbalance severely impacts a parallel-processor's performance.

Moore, T.S.

1987-12-01
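
The hypercube setting of the thesis above can be made concrete with a small Python sketch. It assumes the usual binary labelling of hypercube nodes (two nodes are linked when their labels differ in one bit) and an idealised speedup bound that ignores communication entirely; the thesis's SLAM II model, its routing rules, and its measured results are not reproduced here, and all function names are hypothetical.

```python
def neighbours(node, dim=4):
    """Nodes directly linked to `node` in a 2**dim-node hypercube."""
    if not 0 <= node < (1 << dim):
        raise ValueError("node label out of range")
    # Flipping one bit of the label crosses one hypercube dimension.
    return [node ^ (1 << b) for b in range(dim)]


def hops(src, dst):
    """Minimum number of links between two nodes (Hamming distance of labels)."""
    return bin(src ^ dst).count("1")


def ideal_speedup(loads):
    """Upper bound on speedup when communication cost is ignored.

    The job finishes when the most heavily loaded node finishes, so the
    bound is total work divided by the largest per-node load.
    """
    if not loads or max(loads) <= 0:
        raise ValueError("loads must contain at least one positive value")
    return sum(loads) / max(loads)
```

With 16 equally loaded nodes the bound is 16, while giving one node five times the work of each of the others, for example `ideal_speedup([25] + [5] * 15)`, caps it at 4 before any message traffic is counted.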

119

Recursive array layouts and fast parallel matrix multiplication  

Microsoft Academic Search

Matrix multiplication is an important kernel in linear algebra algorithms, and the performance of both serial and parallel implementations is highly dependent on the memory system behavior. Unfortunately, due to false sharing and cache conflicts, traditional column-major or row-major array layouts incur high variability in memory system performance as matrix size varies. This paper in-

Siddhartha Chatterjee; Alvin R. Lebeck; Praveen K. Patnala; Mithuna Thottethodi

1999-01-01
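
As an illustration of what a recursive array layout can look like, the Python sketch below computes a Morton (Z-order) index by interleaving the bits of the row and column numbers. The abstract above is truncated before it names the layouts studied, so Morton order is used here only as one well-known recursive layout, not as the authors' stated choice; the function name `morton_index` and the `bits` parameter are hypothetical.

```python
def morton_index(row, col, bits=16):
    """Interleave the bits of (row, col) into a single Z-order offset."""
    limit = 1 << bits
    if not (0 <= row < limit and 0 <= col < limit):
        raise ValueError("row and col must fit in `bits` bits")
    z = 0
    for b in range(bits):
        z |= ((col >> b) & 1) << (2 * b)      # column bits go to even positions
        z |= ((row >> b) & 1) << (2 * b + 1)  # row bits go to odd positions
    return z
```

Under this layout every aligned 2x2 block (and, recursively, every aligned 4x4, 8x8, ... block) occupies a contiguous range of offsets, in contrast to row-major or column-major order, where the elements of a block are spread across rows or columns of the full matrix.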

120

Design Issues in Parallel Array Languages for Shared Memory  

E-print Network

Design Issues in Parallel Array Languages for Shared Memory James Brodman1 , Basilio B. Fraguela2 extensively with HTAs in distributed memory environments, only recently have we begun to consider their adaptation to shared memory environments such as those found in multicore systems. In this paper we review

Garzarán, María Jesús

121

High-performance parallel processors based on star-coupled wavelength division multiplexing optical interconnects  

DOEpatents

As the performance of individual elements within parallel processing systems increases, increased communication capability between distributed processor and memory elements is required. There is great interest in using fiber optics to improve interconnect communication beyond that attainable using electronic technology. Several groups have considered WDM, star-coupled optical interconnects. The invention uses a fiber optic transceiver to provide low latency, high bandwidth channels for such interconnects using a robust multimode fiber technology. Instruction-level simulation is used to quantify the bandwidth, latency, and concurrency required for such interconnects to scale to 256 nodes, each operating at 1 GFLOPS performance. Performance has been shown to scale to ≈100 GFLOPS for scientific application kernels using a small number of wavelengths (8 to 32), only one wavelength received per node, and achievable optoelectronic bandwidth and latency.

Deri, Robert J. (Pleasanton, CA); DeGroot, Anthony J. (Castro Valley, CA); Haigh, Ronald E. (Arvada, CO)

2002-01-01

122

Parallel Algorithms for DNA Probe Placement on Small Oligonucleotide Arrays  

E-print Network

Oligonucleotide arrays are used in a wide range of genomic analyses, such as gene expression profiling, comparative genomic hybridization, chromatin immunoprecipitation, SNP detection, etc. During fabrication, the sites of an oligonucleotide array are selectively exposed to light in order to activate oligonucleotides for further synthesis. Optical effects can cause unwanted illumination at masked sites that are adjacent to the sites intentionally exposed to light. This results in synthesis of unforeseen sequences in masked sites and compromises interpretation of experimental data. To reduce such uncertainty, one can exploit freedom in how probes are assigned to array sites. The border length minimization problem (BLMP) seeks a placement of probes that minimizes the sum of border lengths in all masks. In this paper, we propose two parallel algorithms for the BLMP. The proposed parallel algorithms have the local-search paradigm at their core, and are especially developed for the BLMP. The results reported show ...

Trinca, Dragos

2011-01-01

123

Massively Parallel Machines. BMC also has the potential to supply massive computational power. General use of BMC is to construct parallel machines where each processor's state is encoded  

E-print Network

strand. BMC can perform massively parallel computations by executing recombinant DNA operations that act on all the DNA molecules at the same time. These recombinant DNA operations may be performed to execute the state of about 10^18 processors, and since certain recombinant DNA operations can take many minutes

Reif, John H.

124

Mobile and replicated alignment of arrays in data-parallel programs  

NASA Technical Reports Server (NTRS)

When a data-parallel language like FORTRAN 90 is compiled for a distributed-memory machine, aggregate data objects (such as arrays) are distributed across the processor memories. The mapping determines the amount of residual communication needed to bring operands of parallel operations into alignment with each other. A common approach is to break the mapping into two stages: first, an alignment that maps all the objects to an abstract template, and then a distribution that maps the template to the processors. We solve two facets of the problem of finding alignments that reduce residual communication: we determine alignments that vary in loops, and objects that should have replicated alignments. We show that loop-dependent mobile alignment is sometimes necessary for optimum performance, and we provide algorithms with which a compiler can determine good mobile alignments for objects within do loops. We also identify situations in which replicated alignment is either required by the program itself (via spread operations) or can be used to improve performance. We propose an algorithm based on network flow that determines which objects to replicate so as to minimize the total amount of broadcast communication in replication. This work on mobile and replicated alignment extends our earlier work on determining static alignment.

Chatterjee, Siddhartha; Gilbert, John R.; Schreiber, Robert

1993-01-01

125

Parallel computation of optimized arrays for 2-D electrical imaging surveys  

NASA Astrophysics Data System (ADS)

Modern automatic multi-electrode survey instruments have made it possible to use non-traditional arrays to maximize the subsurface resolution from electrical imaging surveys. Previous studies have shown that one of the best methods for generating optimized arrays is to select the set of array configurations that maximizes the model resolution for a homogeneous earth model. The Sherman-Morrison Rank-1 update is used to calculate the change in the model resolution when a new array is added to a selected set of array configurations. This method had the disadvantage that it required several hours of computer time even for short 2-D survey lines. The algorithm was modified to calculate the change in the model resolution rather than the entire resolution matrix. This reduces the computer time and memory required as well as the computational round-off errors. The matrix-vector multiplications for a single add-on array were replaced with matrix-matrix multiplications for 28 add-on arrays to further reduce the computer time. The temporary variables were stored in the double-precision Single Instruction Multiple Data (SIMD) registers within the CPU to minimize computer memory access. A further reduction in the computer time is achieved by using the computer graphics card Graphics Processor Unit (GPU) as a highly parallel mathematical coprocessor. This makes it possible to carry out the calculations for 512 add-on arrays in parallel using the GPU. The changes reduce the computer time by more than two orders of magnitude. The algorithm used to generate an optimized data set adds a specified number of new array configurations after each iteration to the existing set. The resolution of the optimized data set can be increased by adding a smaller number of new array configurations after each iteration. 
Although this increases the computer time required to generate an optimized data set with the same number of data points, the new fast numerical routines have made this practical on commonly available microcomputers.

Loke, M. H.; Wilkinson, P. B.; Chambers, J. E.

2010-12-01
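
The Sherman-Morrison rank-1 update mentioned in the abstract above is the identity (A + u v^T)^-1 = A^-1 - (A^-1 u v^T A^-1) / (1 + v^T A^-1 u). A minimal pure-Python sketch follows; it shows only the generic identity on small dense matrices, not the authors' resolution-matrix code, SIMD routines, or GPU implementation, and the function name `sherman_morrison` is hypothetical.

```python
def sherman_morrison(a_inv, u, v):
    """Return the inverse of (A + u v^T), given A^-1 as a list of rows."""
    n = len(a_inv)
    # A^-1 u (column vector) and v^T A^-1 (row vector).
    a_inv_u = [sum(a_inv[i][k] * u[k] for k in range(n)) for i in range(n)]
    vt_a_inv = [sum(v[k] * a_inv[k][j] for k in range(n)) for j in range(n)]
    denom = 1.0 + sum(v[k] * a_inv_u[k] for k in range(n))
    if abs(denom) < 1e-12:
        raise ZeroDivisionError("update makes the matrix singular")
    # Subtract the scaled outer product; cost is O(n^2) instead of O(n^3).
    return [
        [a_inv[i][j] - a_inv_u[i] * vt_a_inv[j] / denom for j in range(n)]
        for i in range(n)
    ]
```

The update costs O(n^2) operations, against O(n^3) for inverting the modified matrix from scratch, which is why it suits a procedure that tests many candidate additions one at a time.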

126

Feasibility study for the implementation of NASTRAN on the ILLIAC 4 parallel processor  

NASA Technical Reports Server (NTRS)

The ILLIAC IV, a fourth generation multiprocessor using parallel processing hardware concepts, is operational at Moffett Field, California. Its capability to excel at matrix manipulation makes the ILLIAC well suited for performing structural analyses using the finite element displacement method. The feasibility of modifying the NASTRAN (NASA structural analysis) computer program to make effective use of the ILLIAC IV was investigated. The characteristics of the ILLIAC and the ARPANET, a telecommunications network which spans the continent making the ILLIAC accessible to nearly all major industrial centers in the United States, are summarized. Two distinct approaches are studied: retaining NASTRAN as it now operates on many of the host computers of the ARPANET to process the input and output while using the ILLIAC only for the major computational tasks, and installing NASTRAN to operate entirely in the ILLIAC environment. Though both alternatives offer similar and significant increases in computational speed over modern third generation processors, the full installation of NASTRAN on the ILLIAC is recommended. Specifications are presented for performing that task with manpower estimates and schedules to correspond.

Field, E. I.

1975-01-01

127

A Parallel and Concurrent Implementation of Lin Kernighan Heuristic (LKH-2) for Solving Traveling Salesman Problem for Multi-Core Processors using SPC 3 Programming Model  

Microsoft Academic Search

With the arrival of multi-cores, every processor now has built-in parallel computational power, which can be fully utilized only if the program in execution is written accordingly. This study is part of ongoing research on the design of a new parallel programming model for multi-core processors. In this paper we have presented a combined parallel and concurrent implementation

Muhammad Ali Ismail; Shahid H. Mirza; Talat Altaf

2011-01-01

128

Lumped-Element Planar Strip Array (LPSA) for Parallel MRI  

PubMed Central

The recently introduced planar strip array (PSA) can significantly reduce scan times in parallel MRI by enabling the utilization of a large number of RF strip detectors that are inherently decoupled, and are tuned by adjusting the strip length to integer multiples of a quarter-wavelength (λ/4) in the presence of a ground plane and dielectric substrate. In addition, the more explicit spatial information embedded in the phase of the signals from the strip array is advantageous (compared to loop arrays) for limiting aliasing artifacts in parallel MRI. However, losses in the detector as its natural resonance frequency approaches the Larmor frequency (where the wavelength is long at 1.5 T) may limit the signal-to-noise ratio (SNR) of the PSA. Moreover, the PSA’s inherent λ/4 structure severely limits our ability to adjust detector geometry to optimize the performance for a specific organ system, as is done with loop coils. In this study we replaced the dielectric substrate with discrete capacitors, which resulted in both SNR improvement and a tunable lumped-element PSA (LPSA) whose dimensions can be optimized within broad constraints, for a given region of interest (ROI) and MRI frequency. A detailed theoretical analysis of the LPSA is presented, including its equivalent circuit, electromagnetic fields, SNR, and g-factor maps for parallel MRI. Two different decoupling schemes for the LPSA are described. A four-element LPSA prototype was built to test the theory with quantitative measurements on images obtained with parallel and conventional acquisition schemes. PMID:14705058

Lee, Ray F.; Hardy, Christopher J.; Sodickson, Daniel K.; Bottomley, Paul A.

2007-01-01

129

General linear codes for fault-tolerant matrix operations on processor arrays  

NASA Technical Reports Server (NTRS)

Various checksum codes have been suggested for fault-tolerant matrix computations on processor arrays. Use of these codes is limited due to potential roundoff and overflow errors. Numerical errors may also be misconstrued as errors due to physical faults in the system. In this paper, a set of linear codes is identified which can be used for fault-tolerant matrix operations such as matrix addition, multiplication, transposition, and LU-decomposition, with minimum numerical error. Encoding schemes are given for some of the example codes which fall under the general set of codes. With the help of experiments, a rule of thumb for the selection of a particular code for a given application is derived.

Nair, V. S. S.; Abraham, J. A.

1988-01-01
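
The kind of checksum code the abstract above starts from can be sketched in Python for matrix multiplication. The sketch uses the plain sum-checksum scheme (an extra row of column sums on A, an extra column of row sums on B), which is an assumption chosen for illustration and not the general linear codes the paper proposes; the tolerance `tol` and all function names are hypothetical. The tolerance is where the roundoff concern raised in the abstract shows up: too tight, and numerical error is misread as a fault.

```python
def matmul(a, b):
    """Plain dense matrix product on lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def column_checksum(a):
    """Append a row holding the sum of each column of `a`."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]


def row_checksum(b):
    """Append to each row of `b` the sum of that row."""
    return [row + [sum(row)] for row in b]


def find_mismatches(c_full, tol=1e-9):
    """Return (rows, cols) of the data part whose checksums disagree."""
    n, m = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(n)
                if abs(sum(c_full[i][:m]) - c_full[i][m]) > tol]
    bad_cols = [j for j in range(m)
                if abs(sum(c_full[i][j] for i in range(n)) - c_full[n][j]) > tol]
    return bad_rows, bad_cols
```

Multiplying `column_checksum(a)` by `row_checksum(b)` yields the product with both a checksum row and a checksum column attached; a single corrupted element then shows up as exactly one inconsistent row and one inconsistent column, which locates it.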

130

Concurrency emulation and analysis of parallel applications for multi-processor system-on-chip co-design  

Microsoft Academic Search

This paper presents a novel technique for the modeling and the simulation of parallel applications for Multi-Processor Systems-on-Chip (MPSoCs). This technique consists of an application-transparent emulation of OS primitives, including task creation, scheduling, synchronization etc.; this emulation guarantees compatibility with any program compiled against the standard POSIX library, independently of the target OS. This methodology can be used

Giovanni Beltrame; Luca Fossati; Donatella Sciuto

2008-01-01

131

Investigations on the usefulness of the Massively Parallel Processor for study of electronic properties of atomic and condensed matter systems  

NASA Technical Reports Server (NTRS)

The usefulness of the Massively Parallel Processor (MPP) for investigation of electronic structures and hyperfine properties of atomic and condensed matter systems was explored. The major effort was directed towards the preparation of algorithms for parallelization of the computational procedure being used on serial computers for electronic structure calculations in condensed matter systems. Detailed descriptions of investigations and results are reported, including MPP adaptation of self-consistent charge extended Hueckel (SCCEH) procedure, MPP adaptation of the first-principles Hartree-Fock cluster procedure for electronic structures of large molecules and solid state systems, and MPP adaptation of the many-body procedure for atomic systems.

Das, T. P.

1988-01-01

132

Nanocavity crossbar arrays for parallel electrochemical sensing on a chip  

PubMed Central

We introduce a novel device for the mapping of redox-active compounds at high spatial resolution based on a crossbar electrode architecture. The sensor array is formed by two sets of 16 parallel band electrodes that are arranged perpendicular to each other on the wafer surface. At each intersection, the crossing bars are separated by a ca. 65 nm high nanocavity, which is stabilized by the surrounding passivation layer. During operation, perpendicular bar electrodes are biased to potentials above and below the redox potential of species under investigation, thus, enabling repeated subsequent reactions at the two electrodes. By this means, a redox cycling current is formed across the gap that can be measured externally. As the nanocavity devices feature a very high current amplification in redox cycling mode, individual sensing spots can be addressed in parallel, enabling high-throughput electrochemical imaging. This paper introduces the design of the device, discusses the fabrication process and demonstrates its capabilities in sequential and parallel data acquisition mode by using a hexacyanoferrate probe. PMID:25161846

Katelhon, Enno; Mayer, Dirk; Banzet, Marko; Offenhausser, Andreas

2014-01-01

133

Constant time algorithms for some geometric intersection problems on processor arrays with reconfigurable bus systems  

E-print Network

one b_i, which is equal to 1. Initially, d_i and b_i are stored in processor P_i. Then, we start counting the number of processors containing b_i = 1. The case in which at most one or two processors contain b_i = 1 is handled trivially. However, when... is independent of the distance between the communicating processors. Other steps involve a fixed number of computations within a processor, and this trivially takes constant time. Hence, the time complexity of this algorithm is O(1). The size of the PARBS...

Pathikonda, Chakrapani

2012-06-07

134

An Integrated Approach for Processor Allocation and Scheduling of Mixed-Parallel Applications  

Microsoft Academic Search

Computationally complex applications can often be viewed as a collection of coarse-grained data-parallel tasks with precedence constraints. Researchers have shown that combining task and data parallelism (mixed parallelism) can be an effective approach for executing these applications, as compared to pure task or data parallelism. In this paper, we present an approach to determine the appropriate mix of

Nagavijayalakshmi Vydyanathan; Sriram Krishnamoorthy; Gerald Sabin; Ümit V. Çatalyürek; Tahsin M. Kurç; P. Sadayappan; Joel H. Saltz

2006-01-01

135

Force generation by a parallel array of actin filaments  

E-print Network

We develop a model to describe the force generated by an array of well-separated parallel biofilaments, such as actin filaments. The filaments are assumed to only be coupled through mechanical contact with a movable barrier. We calculate the filament density distribution and the force-velocity relation with a mean-field approach combined with simulations. We identify two regimes: a non-condensed regime at low force in which filaments are spread out spatially, and a condensed regime at high force in which filaments accumulate near the barrier. We confirm that in this model, the stall force is equal to N times the stall force of a single filament. However, surprisingly, for large N, we find that the velocity approaches zero at forces significantly lower than the stall force.

Tsekouras, K; Mallick, K; Joanny, J -F

2011-01-01

136

Parallelization Issues of Domain Specific Question Answering System on Cell B.E. Processors  

NASA Astrophysics Data System (ADS)

A question answering system is an information retrieval application which allows users to directly obtain appropriate answers to a question. To deal with the explosive growth of information over the internet and the increased number of processing stages in answer retrieval, the time and processing hardware required by question answering systems have increased. The need for hardware is currently served by connecting thousands of computers in a cluster. But faster and less complex alternatives can be found in multi-core processors. This paper presents pioneering work by identifying major issues involved in porting a general question answering framework to a Cell processor and their possible solutions. The work is evaluated by porting the indexing algorithm of our biomedical question answering system, INDOC (Internet Doctor), to Cell processors.

Kumar, Tarun; Mittal, Ankush; Sondhi, Parikshit

137

A digital magnetic resonance imaging spectrometer using digital signal processor and field programmable gate array.  

PubMed

A digital spectrometer for low-field magnetic resonance imaging is described. A digital signal processor (DSP) is utilized as the pulse programmer on which a pulse sequence is executed as a subroutine. Field programmable gate array (FPGA) devices that are logically mapped into the external addressing space of the DSP work as auxiliary controllers of gradient control, radio frequency (rf) generation, and rf receiving separately. The pulse programmer triggers an event by setting the 32-bit control register of the corresponding FPGA, and then the FPGA automatically carries out the event function according to preset configurations in cooperation with other devices; accordingly, event control of the spectrometer is flexible and efficient. Digital techniques are in widespread use: gradient control is implemented in real time by an FPGA; rf source is constructed using direct digital synthesis technique, and rf receiver is constructed using digital quadrature detection technique. Well-designed performance is achieved, including 1 μs time resolution of the gradient waveform, 1 μs time resolution of the soft pulse, and 2 MHz signal receiving bandwidth. Both rf synthesis and rf digitalization operate at the same 60 MHz clock; therefore, the frequency range of transmitting and receiving is from DC to ~27 MHz. A majority of pulse sequences have been developed, and the imaging performance of the spectrometer has been validated through a large number of experiments. Furthermore, the spectrometer is also suitable for relaxation measurement in nuclear magnetic resonance field. PMID:23742570

Liang, Xiao; Binghe, Sun; Yueping, Ma; Ruyan, Zhao

2013-05-01
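
The direct digital synthesis technique named in the abstract above rests on a phase accumulator: on every clock tick a tuning word is added to an N-bit register, so the output frequency is f_out = word * f_clk / 2^N and must stay below f_clk / 2. The Python sketch below works that arithmetic for the 60 MHz clock quoted in the abstract. The 32-bit accumulator width is an assumption made for illustration (the abstract does not give it), and the function names are hypothetical.

```python
def tuning_word(f_out_hz, f_clk_hz=60e6, acc_bits=32):
    """Tuning word that makes a phase accumulator produce roughly f_out_hz."""
    if not 0 <= f_out_hz < f_clk_hz / 2:
        raise ValueError("output frequency must lie below half the clock rate")
    return round(f_out_hz * (1 << acc_bits) / f_clk_hz)


def output_frequency(word, f_clk_hz=60e6, acc_bits=32):
    """Frequency actually synthesised for a given tuning word."""
    return word * f_clk_hz / (1 << acc_bits)
```

With these assumed values a 15 MHz output needs a tuning word of 2^30, and one step of the tuning word changes the output by about 0.014 Hz. The half-clock limit of 30 MHz is consistent with the abstract's stated range of DC to ~27 MHz, which leaves some margin below that limit.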

138

A digital magnetic resonance imaging spectrometer using digital signal processor and field programmable gate array  

NASA Astrophysics Data System (ADS)

A digital spectrometer for low-field magnetic resonance imaging is described. A digital signal processor (DSP) is utilized as the pulse programmer on which a pulse sequence is executed as a subroutine. Field programmable gate array (FPGA) devices that are logically mapped into the external addressing space of the DSP work as auxiliary controllers of gradient control, radio frequency (rf) generation, and rf receiving separately. The pulse programmer triggers an event by setting the 32-bit control register of the corresponding FPGA, and then the FPGA automatically carries out the event function according to preset configurations in cooperation with other devices; accordingly, event control of the spectrometer is flexible and efficient. Digital techniques are in widespread use: gradient control is implemented in real time by an FPGA; rf source is constructed using direct digital synthesis technique, and rf receiver is constructed using digital quadrature detection technique. Well-designed performance is achieved, including 1 μs time resolution of the gradient waveform, 1 μs time resolution of the soft pulse, and 2 MHz signal receiving bandwidth. Both rf synthesis and rf digitalization operate at the same 60 MHz clock; therefore, the frequency range of transmitting and receiving is from DC to ~27 MHz. A majority of pulse sequences have been developed, and the imaging performance of the spectrometer has been validated through a large number of experiments. Furthermore, the spectrometer is also suitable for relaxation measurement in nuclear magnetic resonance field.

Liang, Xiao; Binghe, Sun; Yueping, Ma; Ruyan, Zhao

2013-05-01

139

Geoelectrical modeling of shallow structures using parallel and perpendicular arrays  

NASA Astrophysics Data System (ADS)

In this article we analyze the sensitivity of a geoelectrical modeling technique to image 2D shallow structures. Firstly, we extend a previously developed 2D method based on Rayleigh-Fourier expansions, in order to allow arbitrary locations for the electrodes and also 3D earth models. This method is an alternative to finite element and finite difference techniques and is especially suitable to model multilayered structures, with smooth irregular boundaries. Then, for simple 2D models we build up two synthetic pseudosections, one for electrode deployments parallel to a profile perpendicular to the strike, and other for deployments perpendicular to it. We analyze the advantages in using both pseudosections to model these structures. We also compare geoelectric results with the corresponding audiomagnetotelluric transverse electric and transverse magnetic responses. Finally, we perform a geoelectrical survey to image a shallow buried structure and show the goodness of the model fit obtained considering both pseudosections. For the examples studied here, we conclude that considering both pseudosections leads to a more accurate description of the structures. When a 2D anomaly is present, its effect on the perpendicular component is more focused, both in width and depth, than in the parallel component. Hence the perpendicular component helps to constrain the localization of the inhomogeneity. In addition, we find similarities between the geoelectric parallel and perpendicular responses and the corresponding audiomagnetotelluric transverse magnetic and transverse electric results, respectively. When inverting audiomagnetotelluric data using 2D codes, better resolution in the electrical imaging is obtained when both modes are considered; then it is expected that 2D imaging of geoelectric data including both arrays should lead to an optimization of the inversion process. 
Even more, if results of these inversions could be used in correlation with AMT results, it is clear that this kind of joint inversion should contribute to remove uncertainties allowing an improvement in the description of the actual structures.

Bonomo, Néstor; Osella, Ana; Martinelli, Patricia

2002-05-01

140

A parallel FPGA implementation for real-time 2D pixel clustering for the ATLAS Fast Tracker Processor  

NASA Astrophysics Data System (ADS)

The parallel 2D pixel clustering FPGA implementation used for the input system of the ATLAS Fast TracKer (FTK) processor is presented. The input system for the FTK processor will receive data from the Pixel and micro-strip detectors from inner ATLAS read out drivers (RODs) at full rate, for a total of 760 Gb/s, as sent by the RODs after level-1 triggers. Clustering serves two purposes: the first is to reduce the high rate of the received data before further processing, and the second is to determine the cluster centroid to obtain the best spatial measurement. For the pixel detectors the clustering is implemented by using a 2D-clustering algorithm that takes advantage of a moving window technique to minimize the logic required for cluster identification. The cluster detection window size can be adjusted for optimizing the cluster identification process. Additionally, the implementation can be parallelized by instantiating multiple cores to identify different clusters independently, thus exploiting more FPGA resources. This flexibility makes the implementation suitable for a variety of demanding image processing applications. The implementation is robust against bit errors in the input data stream and drops all data that cannot be identified. In the unlikely event of missing control words, the implementation will ensure stable data processing by inserting the missing control words in the data stream. The 2D pixel clustering implementation is developed and tested in both single flow and parallel versions. The first parallel version with 16 parallel cluster identification engines is presented. The input data from the RODs are received through S-Links and the processing units that follow the clustering implementation also require a single data stream; therefore data parallelizing (demultiplexing) and serializing (multiplexing) modules are introduced in order to accommodate the parallelized version and restore the data stream afterwards. 
The results of the first hardware tests of the single flow implementation on the custom FTK input mezzanine (IM) board are presented. We report on the integration of 16 parallel engines in the same FPGA and the resulting performances. The parallel 2D-clustering implementation has sufficient processing power to meet the specification for the Pixel layers of ATLAS, for up to 80 overlapping pp collisions that correspond to the maximum LHC luminosity planned until 2022.
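
As an illustration of the two purposes of clustering described above (data reduction and centroid determination), here is a minimal software sketch in Python. It is not the FPGA moving-window design from the record: it assumes hits arrive as (column, row) pairs, groups them by 8-connectivity with a plain flood fill, and uses an unweighted centroid. The function name `cluster_pixels` and the hit format are hypothetical.

```python
from collections import deque


def cluster_pixels(hits):
    """Group hit pixels into 8-connected clusters.

    hits: iterable of (column, row) integer pairs (assumed input format).
    Returns a sorted list of (centroid_col, centroid_row, size) tuples.
    """
    remaining = set(hits)
    clusters = []
    while remaining:
        # Start a new cluster from any hit that has not been assigned yet.
        seed = remaining.pop()
        queue = deque([seed])
        members = [seed]
        while queue:
            col, row = queue.popleft()
            # Visit the 3 x 3 neighbourhood around the current pixel.
            for dc in (-1, 0, 1):
                for dr in (-1, 0, 1):
                    neighbour = (col + dc, row + dr)
                    if neighbour in remaining:
                        remaining.remove(neighbour)
                        queue.append(neighbour)
                        members.append(neighbour)
        size = len(members)
        # Unweighted centroid (an assumption; no charge information is used).
        centroid_col = sum(c for c, _ in members) / size
        centroid_row = sum(r for _, r in members) / size
        clusters.append((centroid_col, centroid_row, size))
    return sorted(clusters)
```

Four hits forming one three-pixel cluster and one isolated pixel reduce to two output records, which is the kind of rate reduction the abstract refers to.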

Sotiropoulou, C. L.; Gkaitatzis, S.; Annovi, A.; Beretta, M.; Kordas, K.; Nikolaidis, S.; Petridou, C.; Volpi, G.

2014-10-01

141

An Integrated Approach to Locality-Conscious Processor Allocation and Scheduling of Mixed-Parallel Applications  

Microsoft Academic Search

Complex parallel applications can often be modeled as directed acyclic graphs of coarse-grained application tasks with dependences. These applications exhibit both task and data parallelism, and combining these two (also called mixed parallelism) has been shown to be an effective model for their execution. In this paper, we present an algorithm to compute the appropriate mix of task and data

Nagavijayalakshmi Vydyanathan; Sriram Krishnamoorthy; Gerald M. Sabin; Ümit V. Çatalyürek; Tahsin M. Kurç; Ponnuswamy Sadayappan; Joel H. Saltz

2009-01-01

142

High-performance low-power bit-level systolic array signal processor with low-threshold dynamic logic circuits  

Microsoft Academic Search

MIT Lincoln Laboratory has developed a scalable full-custom cell library for implementing bit-level systolic array signal processors. The cell library achieves high performance and low power consumption by using dynamic logic circuits with low-threshold voltage CMOS devices. The cell library is designed to implement signal processing functions such as finite impulse response (FIR) filter, infinite impulse response (IIR) filter, polyphase

William S. Song; Michael M. Vai; Huy T. Nguyen

2001-01-01

143

Design of Peak and Charge Current-mode Control for Parallel Module Solar Array Regulator System  

Microsoft Academic Search

In general, the current mode controller is used to share the current in the parallel operation of a power stage. However, the parallel operation of the solar array regulator system (SAR) should be carefully designed because of the nonlinear characteristics of the solar array (SA), such as a negative dynamic resistance (rs) and a SA V-I curve. In the post works, the

S. H. Park; H. S. Bae; J. H. Lee; B. H. Cho

2006-01-01

144

A 125GOPS 583mW Network-on-Chip Based Parallel Processor with Bio-inspired Visual-Attention Engine  

Microsoft Academic Search

A network-on-chip (NoC) is applied to achieve extensive communication bandwidth required for parallel computing. A 125 GOPS NoC-based parallel processor with a bio-inspired visual attention engine (VAE) exploits both data and object-level parallelism while dissipating 583 mW by packet-based power management. The use of more PEs, VAE, and low latency NoC enables higher performance and power efficiency over the previous

Kwanho Kim; Seungjin Lee; Joo-Young Kim; Minsu Kim; Donghyun Kim; Jeong-Ho Woo; Hoi-Jun Yoo

2008-01-01

145

Wideband aperture array using RF channelizers and massively parallel digital 2D IIR filterbank  

NASA Astrophysics Data System (ADS)

Wideband receive-mode beamforming applications in wireless location, electronically-scanned antennas for radar, RF sensing, microwave imaging and wireless communications require digital aperture arrays that offer a relatively constant far-field beam over several octaves of bandwidth. Several beamforming schemes including the well-known true time-delay and the phased array beamformers have been realized using either finite impulse response (FIR) or fast Fourier transform (FFT) digital filter-sum based techniques. These beamforming algorithms offer the desired selectivity at the cost of a high computational complexity and frequency-dependent far-field array patterns. A novel approach to receiver beamforming is the use of massively parallel 2-D infinite impulse response (IIR) fan filterbanks for the synthesis of relatively frequency independent RF beams at an order of magnitude lower multiplier complexity compared to FFT or FIR filter based conventional algorithms. The 2-D IIR filterbanks demand fast digital processing that can support several octaves of RF bandwidth, and fast analog-to-digital converters (ADCs) for RF-to-bits type direct conversion of wideband antenna element signals. Fast digital implementation platforms that can realize high-precision recursive filter structures necessary for real-time beamforming, at RF radio bandwidths, are also desired. We propose a novel technique that combines a passive RF channelizer, multichannel ADC technology, and single-phase massively parallel 2-D IIR digital fan filterbanks, realized at low complexity using FPGA and/or ASIC technology. There exists native support for a larger bandwidth than the maximum clock frequency of the digital implementation technology. We also strive to achieve More-than-Moore throughput by processing a wideband RF signal having content with N-fold (B = N Fclk/2) bandwidth compared to the maximum clock frequency Fclk Hz of the digital VLSI platform under consideration. 
Such increase in bandwidth is achieved without use of polyphase signal processing or time-interleaved ADC methods. That is, all digital processors operate at the same Fclk clock frequency without phasing, while wideband operation is achieved by sub-sampling of narrower sub-bands at the RF channelizer outputs.

Sengupta, Arindam; Madanayake, Arjuna; Gómez-García, Roberto; Engeberg, Erik D.

2014-05-01

146

Evaluation of the Leon3 soft-core processor within a Xilinx radiation-hardened field-programmable gate array.  

SciTech Connect

The purpose of this document is to summarize the work done to evaluate the performance of the Leon3 soft-core processor in a radiation environment while instantiated in a radiation-hardened static random-access memory based field-programmable gate array. This evaluation will look at the differences between two soft-core processors: the open-source Leon3 core and the fault-tolerant Leon3 core. Radiation testing of these two cores was conducted at the Texas A&M University Cyclotron facility and Lawrence Berkeley National Laboratory. The results of these tests are included within the report along with designs intended to improve the mitigation of the open-source Leon3. The test setup used for evaluating both versions of the Leon3 is also included within this document.

Learn, Mark Walter

2012-01-01

147

Parallel algorithms for the maxima problem using an N-cube processor configuration  

E-print Network

. However, during the solution of the three-dimensional maxima problem, it will 17 Processor X[i, 2] X[i, l] W(i) X's(i) 000 001 010 011 100 101 110 111 1 3 2 2 3 8 4 6 5 5 6 4 7 7 8 1 (a) initial T T T T T T T T problem...-coordinate values. It will once again be assumed that the coordinate values are unique (i.e. x(i, s) ≠ x(j, s) for all i ≠ j and 1 ≤ s ≤ k), that N = 2^n and that the points are in ascending order based on their xk-coordinate values. Pbs 35 and PRs...

Coffman, Sarah Wilson

2012-06-07

148

Low-power, real-time digital video stabilization using the HyperX parallel processor  

NASA Astrophysics Data System (ADS)

Coherent Logix has implemented a digital video stabilization algorithm for use in soldier systems and small unmanned air / ground vehicles that focuses on significantly reducing the size, weight, and power as compared to current implementations. The stabilization application was implemented on the HyperX architecture using a dataflow programming methodology and the ANSI C programming language. The initial implementation is capable of stabilizing an 800 x 600, 30 fps, full color video stream with a 53 ms frame latency using a single 100 DSP core HyperX hx3100™ processor running at less than 3 W power draw. By comparison, an Intel Core2 Duo processor running the same base algorithm on a 320 x 240, 15 fps stream consumes on the order of 18 W. The HyperX implementation is an overall 100x improvement in performance (processing bandwidth increase times power improvement) over the GPP based platform. In addition, the implementation only requires a minimal number of components to interface directly to the imaging sensor and helmet mounted display, or the same computing architecture can be used to generate software defined radio waveforms for communications links. In this application, the global motion due to the camera is measured using a feature based algorithm (11 x 11 Difference of Gaussian filter and Features from Accelerated Segment Test) and model fitting (Random Sample Consensus). Features are matched in consecutive frames and a control system determines the affine transform to apply to the captured frame that will remove or dampen the camera / platform motion on a frame-by-frame basis.

Hunt, Martin A.; Tong, Lin; Bindloss, Keith; Zhong, Shang; Lim, Steve; Schmid, Benjamin J.; Tidwell, J. D.; Willson, Paul D.

2011-06-01

149

Appendix E: Parallel Pascal development system  

NASA Technical Reports Server (NTRS)

The Parallel Pascal Development System enables Parallel Pascal programs to be developed and tested on a conventional computer. It consists of several system programs, including a Parallel Pascal to standard Pascal translator, and a library of Parallel Pascal subprograms. The library includes subprograms for using Parallel Pascal on a parallel system with a fixed degree of parallelism, such as the Massively Parallel Processor, to conveniently manipulate arrays which have dimensions larger than the hardware. Programs can be conveniently tested with small sized arrays on the conventional computer before attempting to run on a parallel system.

1985-01-01

150

Bounded budgeted parallel architecture versus control dominated architecture for hazard data-signal processor synthesis  

Microsoft Academic Search

Multimedia applications such as video and image processing are often characterized by a large number of data accesses (i.e. RAM accesses). In many digital signal-processing applications, the array access patterns are regular and periodic. In these cases, optimized Pipelined Memory Access Controllers can be generated. This technique is used to improve the pipeline access mode to RAM by creating specialized

Bertrand Le Gal; Emmanuel Casseau; Eric Martin

2005-01-01

151

Studying Thermal Management for Graphics-Processor Architectures Jeremy W. Sheaffer, Kevin Skadron, David P. Luebke  

E-print Network

adding not only performance but fundamentally new functionality. Graphics processors (GPUs) sport very high performance in their specialized domain; with massively parallel floating point arrays publicly available simulation infrastructure has hampered academic research in GPU architecture. The lack

Humphrey, Marty

152

Multimode power processor  

DOEpatents

In one embodiment, a power processor which operates in three modes: an inverter mode wherein power is delivered from a battery to an AC power grid or load; a battery charger mode wherein the battery is charged by a generator; and a parallel mode wherein the generator supplies power to the AC power grid or load in parallel with the battery. In the parallel mode, the system adapts to arbitrary non-linear loads. The power processor may operate on a per-phase basis wherein the load may be synthetically transferred from one phase to another by way of a bumpless transfer which causes no interruption of power to the load when transferring energy sources. Voltage transients and frequency transients delivered to the load when switching between the generator and battery sources are minimized, thereby providing an uninterruptible power supply. The power processor may be used as part of a hybrid electrical power source system which may contain, in one embodiment, a photovoltaic array, diesel engine, and battery power sources.

O'Sullivan, George A. (Pottersville, NJ); O'Sullivan, Joseph A. (St. Louis, MO)

1999-01-01

153

Multimode power processor  

DOEpatents

In one embodiment, a power processor which operates in three modes: an inverter mode wherein power is delivered from a battery to an AC power grid or load; a battery charger mode wherein the battery is charged by a generator; and a parallel mode wherein the generator supplies power to the AC power grid or load in parallel with the battery. In the parallel mode, the system adapts to arbitrary non-linear loads. The power processor may operate on a per-phase basis wherein the load may be synthetically transferred from one phase to another by way of a bumpless transfer which causes no interruption of power to the load when transferring energy sources. Voltage transients and frequency transients delivered to the load when switching between the generator and battery sources are minimized, thereby providing an uninterruptible power supply. The power processor may be used as part of a hybrid electrical power source system which may contain, in one embodiment, a photovoltaic array, diesel engine, and battery power sources. 31 figs.

O'Sullivan, G.A.; O'Sullivan, J.A.

1999-07-27

154

Markov chain Monte Carlo methods for family trees using a parallel processor  

Microsoft Academic Search

A 1024 CPU parallel computer is used to obtain simulated genotypes in the Tristan da Cunha pedigree using random local updating methods. A four-colour theorem is invoked to justify simultaneous updating. Multiple copies of the program are run simultaneously. These results are used to infer the source of the B allele of the ABO blood group that is present in

Russell Bradford; Alun Thomas

1996-01-01

155

Markov chain Monte Carlo methods for family trees using a parallel processor  

E-print Network

School of Mathematical Sciences, University of Bath, Bath BA2 7AY. Alun Thomas, Myriad Genetics, 390 Wakara Way, Salt Lake City, Utah 84108, USA. June 2, 1995. Abstract: A 1024 cpu parallel computer is used ... rjb@maths.bath.ac.uk. Corresponding author: alun@myriad.com. 1 Introduction: Calculating

Bradford, Russell

156

Retinal Parallel Processors: More than 100 Independent Microcircuits Operate within a Single Interneuron  

PubMed Central

SUMMARY Most neurons are highly polarized cells with branched dendrites that receive and integrate synaptic inputs and extensive axons that deliver action potential output to distant targets. By contrast, amacrine cells, a diverse class of inhibitory interneurons in the inner retina, collect input and distribute output within the same neuritic network. The extent to which most amacrine cells integrate synaptic information and distribute their output is poorly understood. Here, we show that single A17 amacrine cells provide reciprocal feedback inhibition to presynaptic bipolar cells via hundreds of independent microcircuits operating in parallel. The A17 uses specialized morphological features, biophysical properties, and synaptic mechanisms to isolate feedback microcircuits and maximize its capacity to handle many independent processes. This example of a neuron employing distributed parallel processing rather than spatial integration provides insights into how unconventional neuronal morphology and physiology can maximize network function while minimizing wiring cost. PMID:20346762

Grimes, William N.; Zhang, Jun; Graydon, Cole W.; Kachar, Bechara; Diamond, Jeffrey S.

2010-01-01

157

Sparsely Faceted Arrays: A Mechanism Supporting Parallel Allocation, Communication, and Garbage Collection  

E-print Network

Conventional parallel computer architectures do not provide support for non-uniformly distributed objects. In this thesis, I introduce sparsely faceted arrays (SFAs), a new low-level mechanism for naming regions of memory, ...

Brown, Jeremy Hanford

2002-06-01

158

Image fiber optic space-CDMA parallel transmission experiment using 8 × 8 VCSEL/PD arrays  

NASA Astrophysics Data System (ADS)

We experimentally demonstrate space-code-division multiple access (space-CDMA) based two-dimensional (2-D) parallel optical interconnections by using image fibers and 8 × 8 vertical-cavity surface-emitting laser (VCSEL)/photo diode (PD) arrays. Two spatially encoded four-bit (2 × 2) parallel optical signals were emitted from 2-D VCSEL arrays and transmitted through image fibers. The encoded signals were multiplexed by an image-fiber coupler and detected by a 2-D PD array on the receiver side. The receiver recovered the intended parallel signal by decoding the signal. The transmission speed was 64 Mbps/ch (total throughput: 512 Mbps). Bit-error-rate (BER) measurement with a laterally misaligned PD array showed the array had a misalignment tolerance of 25 µm for a BER performance of 10^-9.

Nakamura, Moriya; Kitayama, Ken-Ichi; Igasaki, Yasunori; Shamoto, Naoki; Kaneda, Keiji

2002-11-01

159

Acoustooptic linear algebra processors - Architectures, algorithms, and applications  

NASA Technical Reports Server (NTRS)

Architectures, algorithms, and applications for systolic processors are described with attention to the realization of parallel algorithms on various optical systolic array processors. Systolic processors for matrices with special structure and matrices of general structure, and the realization of matrix-vector, matrix-matrix, and triple-matrix products and such architectures are described. Parallel algorithms for direct and indirect solutions to systems of linear algebraic equations and their implementation on optical systolic processors are detailed with attention to the pipelining and flow of data and operations. Parallel algorithms and their optical realization for LU and QR matrix decomposition are specifically detailed. These represent the fundamental operations necessary in the implementation of least squares, eigenvalue, and SVD solutions. Specific applications (e.g., the solution of partial differential equations, adaptive noise cancellation, and optimal control) are described to typify the use of matrix processors in modern advanced signal processing.

Casasent, D.

1984-01-01

160

Parallel parsing on a one-way array of finite-state machines  

SciTech Connect

The authors show that a one-way two-dimensional iterative array of finite-state machines (2-DIA) can recognize and parse strings of any context-free language in linear time. What makes this result interesting and rather surprising is the fact that each processor of the array holds only a fixed amount of information (independent of the size of the input) and communicates with its neighbors in only one direction. This makes for a simple VLSI implementation. Although it is known that recognition can be done on a 2-DIA, previous parsing algorithms require the processors to have unbounded memory, even when the communication is two-way. They also consider the problem of finding approximate patterns in strings, the string-to-string correction problem, and the longest common subsequence problem, and show that they can be solved in linear time on a 2-DIA.
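
For reference, the longest common subsequence problem mentioned above can be stated through its standard sequential dynamic program. The Python sketch below is that textbook recurrence, not the authors' 2-DIA algorithm; the function name `lcs_length` is hypothetical. Each table cell depends only on its left, upper, and upper-left neighbours, which is one general reason such problems lend themselves to processor arrays with local communication.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b.

    Standard O(len(a) * len(b)) dynamic program that keeps only two rows.
    """
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the best subsequence that ends before both symbols.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise drop one symbol from either a or b.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]
```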

Chang, J.H.; Ibarra, O.H.; Palis, M.A.

1987-01-01

161

The TickerTAIP parallel RAID architecture  

Microsoft Academic Search

Traditional disk arrays have a centralized architecture, with a single controller through which all requests flow. Such a controller is a single point of failure, and its performance limits the maximum size that the array can grow to. We describe here TickerTAIP, a parallel architecture for disk arrays that distributes the controller functions across several loosely-coupled processors. The result is

Pei Cao; Swee Boon Lim; Shivakumar Venkataraman; John Wilkes

1993-01-01

162

Electrostatic quadrupole array for focusing parallel beams of charged particles  

DOEpatents

An array of electrostatic quadrupoles, capable of providing strong electrostatic focusing simultaneously on multiple beams, is easily fabricated from a single array element comprising a support rod and multiple electrodes spaced at intervals along the rod. The rods are secured to four terminals which are isolated by only four insulators. This structure requires bias voltage to be supplied to only two terminals and eliminates the need for individual electrode bias and insulators, as well as increases life by eliminating beam plating of insulators.

Brodowski, John (Smithtown, NY)

1982-11-23

163

Method of up-front load balancing for local memory parallel processors  

NASA Technical Reports Server (NTRS)

In a parallel processing computer system with multiple processing units and shared memory, a method is disclosed for uniformly balancing the aggregate computational load in, and utilizing minimal memory by, a network having identical computations to be executed at each connection therein. Read-only and read-write memory are subdivided into a plurality of process sets, which function like artificial processing units. Said plurality of process sets is iteratively merged and reduced to the number of processing units without exceeding the balance load. Said merger is based upon the value of a partition threshold, which is a measure of the memory utilization. The turnaround time and memory savings of the instant method are functions of the number of processing units available and the number of partitions into which the memory is subdivided. Typical results of the preferred embodiment yielded memory savings of from sixty to seventy five percent.

Baffes, Paul Thomas (inventor)

1990-01-01

164

Prism-coupled lenslet array for a super-parallel holographic correlator  

Microsoft Academic Search

We propose a design for a prism-coupled lenslet array (PCLLA) which is a key component in certain parallel-access holographic data processing systems. These systems, the super-parallel holographic optical correlator and the super-parallel holographic random access memory, have the potential to be useful in target tracking, identification, and optical information processing applications. We derive the phase transformation required by these architectures

J. T. Shen; A. Heifetz; G. S. Pati; M. S. Shahriar

165

Array combination for parallel imaging in Magnetic Resonance Imaging  

E-print Network

In Magnetic Resonance Imaging, the time required to generate an image is proportional to the number of steps used to encode the spatial information. In rapid imaging, an array of coil elements and receivers are used to reduce the number of encoding...

Spence, Dan Kenrick

2007-09-17

166

Resonator Fiber Optic Gyro with Bipolar Digital Serrodyne Scheme Using a Field-Programmable Gate Array-Based Digital Processor  

NASA Astrophysics Data System (ADS)

A field-programmable gate array-based digital processor is proposed and demonstrated experimentally for a resonator fiber optic gyro (R-FOG) with a bipolar digital serrodyne phase modulation scheme, which we previously proposed especially for R-FOG signal processing and its noise reduction. The processor has multiple functions. First, it suppresses both the fast- and slow-drift components in the difference between the laser frequency and the resonator's resonant frequency. The fast drift with a small amplitude is compensated for by a proportional controller with an oversampling function to reduce the quantization error, while the slow drift with a large amplitude is tracked using an up/down counter. Second, it automatically adjusts the amplitude of the waveform for bipolar digital serrodyne phase modulation for waves travelling both clockwise and counterclockwise in the resonator. Bipolar laser frequency alternation required to track the resonator's resonant frequency is ideally realized by adjusting the phase modulation amplitude. This automatic adjustment also realizes an additional function for reducing the gyro drift caused by backscattering in the fiber resonator, which was originally implemented in the shape of the waveform for bipolar digital serrodyne phase modulation. Third, the FPGA generates a gyro output with open-loop operation. The R-FOG performance is demonstrated to be improved by applying these three functions with the FPGA.

Wang, Xijing; He, Zuyuan; Hotate, Kazuo

2011-04-01

167

A fast adaptive convex hull algorithm on two-dimensional processor arrays with a reconfigurable BUS system  

NASA Technical Reports Server (NTRS)

A bus system that can change dynamically to suit computational needs is referred to as reconfigurable. We present a fast adaptive convex hull algorithm on a two-dimensional processor array with a reconfigurable bus system (2-D PARBS, for short). Specifically, we show that computing the convex hull of a planar set of n points takes O(log n/log m) time on a 2-D PARBS of size mn × n with 3 ≤ m ≤ n. Our result implies that the convex hull of n points in the plane can be computed in O(1) time on a 2-D PARBS of size n^1.5 × n.
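
The constant-time claim follows from the general bound: choosing m = n^0.5 gives an array of size n^1.5 × n and a running time of O(log n / log n^0.5) = O(2) = O(1). For readers who want a concrete statement of the problem being solved, the Python sketch below computes a planar convex hull sequentially with Andrew's monotone chain method. It is a swapped-in textbook algorithm running in O(n log n) time, not the PARBS algorithm of the record; the names `cross` and `convex_hull` are hypothetical.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a left turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return hull vertices in counter-clockwise order (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        # Pop points that would create a right turn or a collinear triple.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]
```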

Olariu, S.; Schwing, J.; Zhang, J.

1991-01-01

168

High-speed, automatic controller design considerations for integrating array processor, multi-microprocessor, and host computer system architectures  

NASA Technical Reports Server (NTRS)

Modern control systems must typically perform real-time identification and control, as well as coordinate a host of other activities related to user interaction, online graphics, and file management. This paper discusses five global design considerations which are useful to integrate array processor, multimicroprocessor, and host computer system architectures into versatile, high-speed controllers. Such controllers are capable of very high control throughput, and can maintain constant interaction with the nonreal-time or user environment. As an application example, the architecture of a high-speed, closed-loop controller used to actively control helicopter vibration is briefly discussed. Although this system has been designed for use as the controller for real-time rotorcraft dynamics and control studies in a wind tunnel environment, the controller architecture can generally be applied to a wide range of automatic control applications.

Jacklin, S. A.; Leyland, J. A.; Warmbrodt, W.

1985-01-01

169

High performance SPAD array detectors for parallel photon timing applications  

NASA Astrophysics Data System (ADS)

Over the past few years there has been a growing interest in monolithic arrays of single photon avalanche diodes (SPAD) for spatially resolved detection of faint ultrafast optical signals. SPADs implemented in planar technologies offer the typical advantages of microelectronic devices (small size, ruggedness, low voltage, low power, etc.). Furthermore, they have inherently higher photon detection efficiency than PMTs and are able to provide, besides sensitivities down to single photons, very high acquisition speeds. Although currently available silicon devices have reached remarkable performance, further improvements are needed to meet the requirements of the most demanding time-resolved techniques: it is necessary to address issues such as electrical crosstalk between adjacent pixels, high detection efficiency in the red spectral range, large area, and low dark counting rate. Moreover, as arrays with a high number of pixels are developed, it becomes increasingly important to develop the complete TCSPC electronics with picosecond resolution, creating a new family of detection systems for TCSPC applications. Recent advances in our research on single-photon time-resolved arrays are presented here.

Rech, I.; Cammi, C.; Crotti, M.; Gulinatti, A.; Maccagnani, P.; Ghioni, M.; Cova, S.

2011-10-01

170

High-performance ultra-low power VLSI analog processor for data compression  

NASA Technical Reports Server (NTRS)

An apparatus for data compression employing a parallel analog processor. The apparatus includes an array of processor cells with N columns and M rows wherein the processor cells have an input device, memory device, and processor device. The input device is used for inputting a series of input vectors. Each input vector is simultaneously input into each column of the array of processor cells in a pre-determined sequential order. An input vector is made up of M components, ones of which are input into ones of M processor cells making up a column of the array. The memory device is used for providing ones of M components of a codebook vector to ones of the processor cells making up a column of the array. A different codebook vector is provided to each of the N columns of the array. The processor device is used for simultaneously comparing the components of each input vector to corresponding components of each codebook vector, and for outputting a signal representative of the closeness between the compared vector components. A combination device is used to combine the signal output from each processor cell in each column of the array and to output a combined signal. A closeness determination device is then used for determining which codebook vector is closest to an input vector from the combined signals, and for outputting a codebook vector index indicating which of the N codebook vectors was the closest to each input vector input into the array.
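
The comparison described above amounts to vector quantization: each input vector is replaced by the index of its closest codebook vector. The Python sketch below shows that operation sequentially. It assumes squared Euclidean distance as the closeness measure, which the record does not specify, and the function names are hypothetical. In the apparatus the N codebook comparisons happen simultaneously, one per column; here they are a plain loop.

```python
def nearest_codebook_index(input_vector, codebook):
    """Index of the codebook vector closest to input_vector.

    Closeness is squared Euclidean distance (an assumption for illustration).
    """
    best_index = -1
    best_distance = float("inf")
    for index, code_vector in enumerate(codebook):
        # Per-component closeness, combined over the M components of a column.
        distance = sum((x - c) ** 2 for x, c in zip(input_vector, code_vector))
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index


def compress(vectors, codebook):
    """Replace each input vector by the index of its nearest codebook vector."""
    return [nearest_codebook_index(v, codebook) for v in vectors]
```

Compression comes from transmitting one small index per input vector instead of its M components.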

Tawel, Raoul (Inventor)

1996-01-01

171

An asynchronous communication protocol for internode connections in a scalable processor array  

Microsoft Academic Search

The authors describe an asynchronous communication protocol and an interface circuit which are used for internode communication in a reconfigurable DSP array. The communication protocol is derived from methods applied in digital communication, where the received data is synchronized to the local clock. Data is written into a FIFO that works as an elastic storage. For simple synchronization of the

Jacob Levison; Ichiro Kuroda

1993-01-01

172

Stream Processors  

NASA Astrophysics Data System (ADS)

Stream processors, like other multicore architectures, partition their functional units and storage into multiple processing elements. In contrast to typical architectures, which contain symmetric general-purpose cores and a cache hierarchy, stream processors have a significantly leaner design. Stream processors are specifically designed for the stream execution model, in which applications have large amounts of explicit parallel computation, structured and predictable control, and memory accesses that can be performed at a coarse granularity. Applications in the streaming model are expressed in a gather-compute-scatter form, yielding programs with explicit control over transferring data to and from on-chip memory. Relying on these characteristics, which are common to many media processing and scientific computing applications, stream architectures redefine the boundary between software and hardware responsibilities, with software bearing much of the complexity required to manage concurrency, locality, and latency tolerance. Thus, stream processors have minimal control, consisting of fetching medium- and coarse-grained instructions and executing them directly on the many ALUs. Moreover, the on-chip storage hierarchy of stream processors is under explicit software control, as is all communication, eliminating the need for complex reactive hardware mechanisms.

Erez, Mattan; Dally, William J.

173

Numerical Properties Of Algorithm-Based Fault-Tolerance For High Reliability Array Processors  

Microsoft Academic Search

Algorithm-Based Fault-Tolerance is a method of applying block error codes to array implementations of linear algebraic operations. When the underlying algebra is the real (or complex) number field, then numerical approximations to the reals cause certain fault-induced errors to be indistinguishable from roundoff errors, so that fault-induced errors may be either undetected or miscorrected. A worst case forward error analysis

William G. Bliss; Michael R. Lightner; Benjamin Friedlander

1988-01-01

174

Achieving supercomputer performance for neural net simulation with an array of digital signal processors  

SciTech Connect

Music, a DSP-based system with a parallel distributed-memory architecture, provides enormous computing power yet retains the flexibility of a general-purpose computer. Reaching a peak performance of 2.7 Gflops at a significantly lower cost, power consumption, and space requirement than conventional supercomputers, Music is well suited to computationally intensive applications such as neural network simulation. 12 refs., 9 figs., 2 tabs.

Muller, U.A.; Baumle, B.; Kohler, P.; Gunzinger, A.; Guggenbuhl, W. [Swiss Federal Inst. of Technology, Zurich (Switzerland)]

1992-10-01

175

Parallel programming models for a multi-processor SoC platform applied to high-speed traffic management  

Microsoft Academic Search

In this paper, we describe the MultiFlex multi-processor SoC programming environment, with focus on two programming models: a distributed system object component (DSOC) message passing model, and a symmetrical multi-processing (SMP) model using shared memory. The MultiFlex tools map these models onto the StepNP multi-processor SoC platform, while making use of hardware accelerators for message passing and task scheduling. We

Pierre G. Paulin; Chuck Pilkington; Michel Langevin; Essaid Bensoudane; Gabriela Nicolescu

2004-01-01

176

Upscaling and microstructural analysis of the flow-structure relation perpendicular to random, parallel fiber arrays  

E-print Network

, heat exchangers, (biological) filters and transport of ground water and pollutants (Bird et al., 2001 is calculated in the creeping flow regime for arrays of random, ideal, perfectly parallel fibers. On the micro of the mean gap width, which links the macro- and the micro-scales. Finally, we verify the validity of the

Luding, Stefan

177

Analysis of radiation by arrays of parallel vertical wire antennas over imperfect ground  

Microsoft Academic Search

Describes a computer program labelled RCVERT. This is a program for analyzing radiation from arrays of parallel vertical thin-wire antennas over the horizontal surface of an imperfectly conducting earth. The effects of the imperfectly conducting earth are accounted for approximately by using the method of reflection coefficients

T. Sarkar

1975-01-01

178

Parallel Beam Approximation for Calculation of Detection Efficiency of Crystals in PET Detector Arrays  

PubMed Central

In this work we propose a parallel beam approximation for the computation of the detection efficiency of crystals in a PET detector array. In this approximation the detection efficiency of a crystal is estimated using the distance between source and the crystal and the pre-calculated detection cross section of the crystal in a crystal array which is calculated for a uniform parallel beam of gammas. The pre-calculated detection cross sections for a few representative incident angles and gamma energies can be used to create a look-up table to be used in simulation studies or practical implementation of scatter or random correction algorithms. Utilizing the symmetries of the square crystal array, the pre-calculated look-up tables can be relatively small. The detection cross sections can be measured experimentally, calculated analytically or simulated using a Monte Carlo (MC) approach. In this work we used a MC simulation that takes into account the energy windowing, Compton scattering and factors in the “block effect”. The parallel beam approximation was validated by a separate MC simulation using point sources located at different positions around a crystal array. Experimentally measured detection efficiencies were compared with Monte Carlo simulated detection efficiencies. Results suggest that the parallel beam approximation provides an efficient and accurate way to compute the crystal detection efficiency, which can be used for estimation of random and scatter coincidences for PET data corrections.

Komarov, Sergey; Song, Tae Yong; Wu, Heyu; Tai, Yuan-Chuan

2014-01-01

179

Parallel systolic array implementation of multiuser detection for asynchronous DS/CDMA  

Microsoft Academic Search

In this paper, a parallel systolic array (PSA) is proposed to implement multiuser detection for asynchronous DS/CDMA systems. The conventional implementation of asynchronous multiuser detection is to set up a sliding received signal window (RSW) and apply multiuser detection in the RSW, window by window. As we know, in the two edges of an RSW, the interference produced

Ming Chen; Haifeng Wang

2001-01-01

180

A class of parallel algorithms for computation of the manipulator inertia matrix  

NASA Technical Reports Server (NTRS)

Parallel and parallel/pipeline algorithms for computation of the manipulator inertia matrix are presented. An algorithm based on composite rigid-body spatial inertia method, which provides better features for parallelization, is used for the computation of the inertia matrix. Two parallel algorithms are developed which achieve the time lower bound in computation. Also described is the mapping of these algorithms with topological variation on a two-dimensional processor array, with nearest-neighbor connection, and with cardinality variation on a linear processor array. An efficient parallel/pipeline algorithm for the linear array was also developed, but at significantly higher efficiency.

Fijany, Amir; Bejczy, Antal K.

1989-01-01

181

Development of a ground signal processor for digital synthetic array radar data  

NASA Technical Reports Server (NTRS)

A modified APQ-102 sidelooking array radar (SLAR) in a B-57 aircraft test bed is used, with other optical and infrared sensors, in remote sensing of Earth surface features for various users at NASA Johnson Space Center. The video from the radar is normally recorded on photographic film and subsequently processed photographically into high resolution radar images. Using a high speed sampling (digitizing) system, the two receiver channels of cross- and co-polarized video are recorded on wideband magnetic tape along with radar and platform parameters. These data are subsequently reformatted and processed into digital synthetic aperture radar images with the image data available on magnetic tape for subsequent analysis by investigators. The system design and results obtained are described.

Griffin, C. R.; Estes, J. M.

1981-01-01

182

Using a Cray Y-MP as an array processor for a RISC Workstation  

NASA Technical Reports Server (NTRS)

As microprocessors increase in power, the economics of centralized computing have changed dramatically. At the beginning of the 1980s, mainframes and supercomputers were often considered to be cost-effective machines for scalar computing. Today, microprocessor-based RISC (reduced-instruction-set computer) systems have displaced many uses of mainframes and supercomputers. Supercomputers are still cost competitive when processing jobs that require both large memory size and high memory bandwidth. One such application is array processing. Certain numerical operations are appropriate to use in a Remote Procedure Call (RPC)-based environment. Matrix multiplication is an example of an operation that can have a sufficient number of arithmetic operations to amortize the cost of an RPC call. An experiment is described which demonstrates that matrix multiplication can be executed remotely on a large system faster than on a workstation.

Lamaster, Hugh; Rogallo, Sarah J.

1992-01-01
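
The amortization argument in this record can be made concrete with a rough cost model: multiplying two n×n matrices takes about 2n³ arithmetic operations but moves only about 3n² elements (two operands sent, one result returned), so the work per transferred element grows with n. The sketch below is an illustrative model only; the function names and every rate and latency figure are hypothetical and are not taken from the experiment.

```python
def flops_per_element_transferred(n):
    """Arithmetic operations per matrix element moved over the network."""
    flops = 2 * n ** 3        # roughly n^3 multiplies plus n^3 additions
    transferred = 3 * n ** 2  # send A and B, receive C
    return flops / transferred


def local_time(n, local_flops_per_s):
    """Estimated seconds to multiply two n x n matrices on the workstation."""
    return 2 * n ** 3 / local_flops_per_s


def remote_time(n, remote_flops_per_s, elements_per_s, rpc_latency_s):
    """Estimated seconds for the same product via an RPC to a faster machine."""
    transfer = 3 * n ** 2 / elements_per_s
    compute = 2 * n ** 3 / remote_flops_per_s
    return rpc_latency_s + transfer + compute


# Hypothetical machine parameters, chosen only to show the crossover.
LOCAL, REMOTE, BANDWIDTH, LATENCY = 1e7, 1e9, 1e6, 0.01
small_is_worth_it = remote_time(10, REMOTE, BANDWIDTH, LATENCY) < local_time(10, LOCAL)
large_is_worth_it = remote_time(1000, REMOTE, BANDWIDTH, LATENCY) < local_time(1000, LOCAL)
```

Under these assumed numbers the remote call loses for a 10×10 product, where latency dominates, and wins for a 1000×1000 product.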

183

ArrayStore: A Storage Manager for Complex Parallel Array Processing  

E-print Network

. Second, it develops a new and efficient storage-management mechanism that enables parallel processing. Categories and Subject Descriptors H.2.4 [Information Systems]: Database Management ---Systems; H.2.8 [Information Systems]: Database Management---Database applications General Terms Algorithms, Design

Balazinska, Magdalena

184

A parallel-series-fed microstrip array with high efficiency and low cross-polarization  

NASA Technical Reports Server (NTRS)

The requirements of a microstrip array with a vertically polarized fan beam are addressed that correspond to its use in C-band interferometric SAR. A combination of parallel- and series-feed techniques is utilized in an array design with a three-stage parallel-fed configuration to enhance bandwidth performance. The linearly polarized traveling-wave microstrip array antenna is fed by microstrip transmission lines in two rows of 36 elements that resonate at 5.30 GHz. The transmission lines are impedance-matched at every junction for all the waves that travel toward the two ends of the array. The two measured principal-plane patterns are shown, and the measured narrow-beam pattern is found to agree with the calculated values. The VSWR bandwidths and narrow and broad beamwidths of the antenna are found to permit efficient performance. The efficiency is attributed to the parallel and series-feed configuration which allows proper impedance matching, and low cross-polarization is a result of the antiphase feed technique employed in the configuration.

Huang, John

1992-01-01

185

Sequence information signal processor  

NASA Technical Reports Server (NTRS)

An electronic circuit is used to compare two sequences, such as genetic sequences, to determine which alignment of the sequences produces the greatest similarity. The circuit includes a linear array of series-connected processors, each of which stores a single element from one of the sequences and compares that element with each successive element in the other sequence. For each comparison, the processor generates a scoring parameter that indicates which segment ending at those two elements produces the greatest degree of similarity between the sequences. The processor uses the scoring parameter to generate a similar scoring parameter for a comparison between the stored element and the next successive element from the other sequence. The processor also delivers the scoring parameter to the next processor in the array for use in generating a similar scoring parameter for another pair of elements. The electronic circuit determines which processor and alignment of the sequences produce the scoring parameter with the highest value.

Peterson, John C. (Inventor); Chow, Edward T. (Inventor); Waterman, Michael S. (Inventor); Hunkapillar, Timothy J. (Inventor)

1999-01-01
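
The scoring recurrence this record describes can be sketched in software. The sketch assumes a Smith-Waterman style local-alignment score with a linear gap penalty; the record does not state the scoring scheme, so the match, mismatch, and gap values and the function name are hypothetical. Each position j of the inner loop plays the role of one processor holding one element of seq_b, while the elements of seq_a stream past it.

```python
def best_local_alignment_score(seq_a, seq_b, match=2, mismatch=-1, gap=-1):
    """Return the highest local-alignment score between two sequences.

    The scoring values are illustrative assumptions, not taken from
    the patent record.
    """
    prev = [0] * (len(seq_b) + 1)  # scores from the previous streamed element
    best = 0
    for a in seq_a:                # elements of seq_a arrive one at a time
        curr = [0]
        for j, b in enumerate(seq_b, start=1):  # cell j stores element b
            diag = prev[j - 1] + (match if a == b else mismatch)
            # curr[j - 1] is the value handed over by the neighbouring cell.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best
```

For example, `best_local_alignment_score("GGACGTGG", "TTACGTTT")` scores the shared segment `ACGT`, which is four matches at the assumed value of 2 each.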

186

Sub30-nm resolution parallel EB lithography based on a planar type Si nanowire array ballistic electron source  

Microsoft Academic Search

Sub-30 nm resolution parallel EB lithography based on a planar type silicon nanowire array ballistic electron emitter (PBE) is demonstrated in this paper. The parallel EB lithography is performed on a 1:1 electron projection system. The system consists of the PBE as a surface electron source, a target wafer parallel to the electron source, and uniform vertical electromagnetic fields. The

A. Kojima; H. Ohyi; N. Koshida

2009-01-01

187

Anisotropic charge and heat conduction through arrays of parallel elliptic cylinders in a continuous medium  

NASA Astrophysics Data System (ADS)

Arrays of circular pores in silicon can exhibit a phononic bandgap when the lattice constant is smaller than the phonon scattering length, and so have become of interest for use as thermoelectric materials, due to the large reduction in thermal conductivity that this bandgap can cause. The reduction in electrical conductivity is expected to be less, because the lattice constant of these arrays is engineered to be much larger than the electron scattering length. As a result, electron transport through the effective medium is well described by the diffusion equation, and the Seebeck coefficient is expected to increase. In this paper, we develop an expression for the purely diffusive thermal (or electrical) conductivity of a composite comprised of square or hexagonal arrays of parallel circular or elliptic cylinders of one material in a continuum of a second material. The transport parallel to the cylinders is straightforward, so we consider the transport in the two principal directions normal to the cylinders, using a self-consistent local field calculation based on the point dipole approximation. There are two limiting cases: large negative contrast (e.g., pores in a conductor) and large positive contrast (conducting pillars in air). In the large negative contrast case, the transport is only slightly affected parallel to the major axis of the elliptic cylinders but can be significantly affected parallel to the minor axis, even in the limit of zero volume fraction of pores. The positive contrast case is just the opposite: the transport is only slightly affected parallel to the minor axis of the pillars but can be significantly affected parallel to the major axis, even in the limit of zero volume fraction of pillars. The analytical results are compared to extensive FEA calculations obtained using Comsol™ and the agreement is generally very good, provided the cylinders are sufficiently small compared to the lattice constant.

Martin, James E.; Ribaudo, Troy

2013-04-01

188

Solving the Traveling Salesman Problem with a Parallel Branch-and-Bound Algorithm on a 1024 Processor Network  

E-print Network

: Combinatorial Optimization; Distributed Algorithms; Dynamic Load-Balancing; Parallel Branch-and-Bound; Symmetric Solving the Traveling Salesman Problem with a Parallel Branch-and-Bound Algorithm on a 1024 a parallelization of a highly efficient best-first branch-and-bound algorithm to solve large symmetric traveling salesman

Cook, William

189

Parallel SPM cantilever arrays for large area surface metrology and lithography  

NASA Astrophysics Data System (ADS)

This paper discusses scanning probe microscopy (SPM) surface metrology using arrays of piezoresistive, thermally actuated cantilevers. The cantilever architecture presented here makes it possible to image surface topography using sensors operating in parallel. In this way the throughput of sample imaging is increased, which is of crucial importance in measurements of large area samples. Application of a piezoresistive detection scheme makes it possible to investigate quantitatively the interaction between the microprobe and the imaged surface. Integration of the thermal deflection actuator with the spring beam decreases the response time and enables fast and high resolution control of the tip-sample distance. The results of parallel topography measurement using a 1×4 cantilever array are presented.

Gotszalk, Teodor; Ivanov, Tzvetan; Rangelow, Ivo W.

2014-04-01

190

Fast Calculation of Computer-Generated Fresnel Hologram Utilizing Distributed Parallel Processing and Array Operation  

Microsoft Academic Search

A Fresnel CGH for a three-dimensional (3-D) object is generated by calculating the Fresnel diffraction, but this requires a huge amount of calculation. This is one reason for the difficulty in realizing real-time holography. We propose a fast calculation method for computer-generated Fresnel holograms (Fresnel CGH) utilizing distributed parallel processing and array operation. In our method, a projected image with depth information

Shogo Nishi; Kojiro Shiba; Kunihiko Mori; Shigeru Nakayama; Sadayuki Murashima

2005-01-01

191

Photon detection with parallel asynchronous processing  

NASA Technical Reports Server (NTRS)

An approach to photon detection with a parallel asynchronous signal processor is described. The visible or IR photon-detection capability of the silicon p(+)-n-n(+) detectors and the parallel asynchronous processing are addressed separately. This approach would permit an independent analog processing channel to be dedicated to every pixel. A laminar architecture consisting of a stack of planar arrays of the devices would form a 2D array processor with a 2D array of inputs located directly behind a focal-plane detector array. A 2D image data stream would propagate in neuronlike asynchronous pulse-coded form through the laminar processor. Such systems can integrate image acquisition and image processing. Acquisition and processing would be performed concurrently as in natural vision systems. The possibility of multispectral image processing is addressed.

Coon, D. D.; Perera, A. G. U.

1990-01-01

192

Excitation of a Parallel Plate Waveguide by an Array of Rectangular Waveguides  

NASA Technical Reports Server (NTRS)

This work addresses the problem of excitation of a parallel plate waveguide by an array of rectangular waveguides that arises in applications such as the continuous transverse stub (CTS) antenna and dual-polarized parabolic cylindrical reflector antennas excited by a scanning line source. In order to design the junction region between the parallel plate waveguide and the linear array of rectangular waveguides, waveguide sizes have to be chosen so that the input match is adequate for the range of scan angles for both polarizations. Electromagnetic wave scattering at the junction of a parallel plate waveguide and an array of rectangular waveguides is analyzed by formulating coupled integral equations for the aperture electric field at the junction. The integral equations are solved by the method of moments (MoM). In order to make the computational process efficient and accurate, the method of weighted averaging was used to evaluate rapidly oscillating integrals encountered in the moment matrix. In addition, the real axis spectral integral is evaluated in a deformed contour for speed and accuracy. The MoM results for a large finite array have been validated by comparing its reflection coefficients with corresponding results for an infinite array generated by the commercial finite element code, HFSS. Once the aperture electric field is determined by MoM, the input reflection coefficients at each waveguide port, and coupling for each polarization over the range of useful scan angles, are easily obtained. Results for the input impedance and coupling characteristics for both the vertical and horizontal polarizations are presented over a range of scan angles. It is shown that the scan range is limited to about 35° for both polarizations and therefore the optimum waveguide is a square of size equal to about 0.62 free-space wavelengths.

Rengarajan, Sembiam

2011-01-01

193

Weak-Periodic Stochastic Resonance in a Parallel Array of Static Nonlinearities  

PubMed Central

This paper studies the output-input signal-to-noise ratio (SNR) gain of an uncoupled parallel array of static, yet arbitrary, nonlinear elements for transmitting a weak periodic signal in additive white noise. In the small-signal limit, an explicit expression for the SNR gain is derived. It serves to prove that the SNR gain is always a monotonically increasing function of the array size for any given nonlinearity and noisy environment. It also determines the SNR gain maximized by the locally optimal nonlinearity as the upper bound of the SNR gain achieved by an array of static nonlinear elements. With locally optimal nonlinearity, it is demonstrated that stochastic resonance cannot occur, i.e. adding internal noise into the array never improves the SNR gain. However, in an array of suboptimal but easily implemented threshold nonlinearities, we show the feasibility of situations where stochastic resonance occurs, and also the possibility of the SNR gain exceeding unity for a wide range of input noise distributions. PMID:23505523

Ma, Yumei; Duan, Fabing; Chapeau-Blondeau, Francois; Abbott, Derek

2013-01-01

194

Efficient field-based CAD of microwave circuits on massively parallel processor computer using TLM and Prony's methods  

Microsoft Academic Search

This paper reports progress in the CAD of microwave circuits using a parallel TLM code with Prony's method. With only 100 TLM time samples, the scattering parameters of a microwave bandpass filter are extracted via Prony's method on a normal workstation. Such a combination of the parallel TLM module and Prony's method brings efficient optimization using time domain techniques within

C. Eswarappa; P. P. M. So; W. J. R. Hoefer

1994-01-01

195

Parallel and series fed microstrip array with high efficiency and low cross polarization  

NASA Technical Reports Server (NTRS)

A microstrip array antenna for vertically polarized fan beam (approximately 2 deg x 50 deg) for C-band SAR applications with a physical area of 1.7 m by 0.17 m comprises two rows of patch elements and employs a parallel feed to left- and right-half sections of the rows. Each section is divided into two segments that are fed in parallel with the elements in each segment fed in series through matched transmission lines for high efficiency. The inboard section has half the number of patch elements of the outboard section, and the outboard sections, which have tapered distribution with identical transmission line sections, terminated with half wavelength long open-circuit stubs so that the remaining energy is reflected and radiated in phase. The elements of the two inboard segments of the two left- and right-half sections are provided with tapered transmission lines from element to element for uniform power distribution over the central third of the entire array antenna. The two rows of array elements are excited at opposite patch feed locations with opposite (180 deg difference) phases for reduced cross-polarization.

Huang, John (inventor)

1995-01-01

196

A parallel hybrid merge-select sorting scheme for K-best LSD MIMO decoder on a dynamically reconfigurable processor  

E-print Network

Instruction Cell Array (RICA). Several popular sorting algorithms adopted in MIMO decoding are analyzed communication standards such as IEEE 802.11n (WiFi), Long Term Evolution (LTE), IEEE 802.16 (WiMAX), etc

Arslan, Tughrul

197

Dual-thread Speculation: A Simple Approach to Uncover Thread-level Parallelism on a Simultaneous Multithreaded Processor  

Microsoft Academic Search

As chip multiprocessors with simultaneous multithreaded cores are becoming commonplace, there is a need for simple approaches to exploit thread-level parallelism. In this paper, we consider thread-level speculation as a means to reap thread-level parallelism out of application binaries. We first investigate the tradeoffs between scheduling speculative threads on the same core and on different cores. While threads contend for

Fredrik Warg; Per Stenström

2008-01-01

198

PROPELLER-EPI With Parallel Imaging Using a Circularly Symmetric Phased-Array RF Coil at 3.0 T  

E-print Network

PROPELLER-EPI With Parallel Imaging Using a Circularly Symmetric Phased-Array RF Coil at 3.0 T (PROPELLER) and parallel imaging is presented for diffusion echo-planar imaging (EPI) at high spatial the phase-encoding direction, and PROPELLER acquisition to further decrease the echo train length (ETL

199

Design and implementation of a parallel array operator for the arbitrary remapping of data.  

SciTech Connect

The data redistribution or remapping functions, gather and scatter, are of long standing in high-performance computing, having been included in Cray Fortran for decades. In this paper, we present a highly general array operator with powerful gather and scatter capabilities unmatched in other array languages. We discuss an efficient parallel implementation, introducing several new optimizations (run-length encoding, dead array reuse, and direct communication) that lessen the costs associated with the operator's wide applicability. In our implementation of this operator in ZPL, we demonstrate comparable performance to the highly tuned, hand-coded Fortran plus MPI versions of the NAS FT and NAS CG benchmarks.

Dietz, Steven; Choi, S. E. (Sung-Eun); Chamberlain, B. L. (Bradford L.); Snyder, Lawrence

2003-01-01
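
For readers unfamiliar with the two remapping functions named in this record, their sequential meaning can be sketched as below. This is not ZPL syntax and not the parallel operator the paper presents; the function names, the `fill` parameter, and the sample data are illustrative assumptions.

```python
def gather(source, index_map):
    """Build a result whose k-th element is source[index_map[k]]."""
    return [source[i] for i in index_map]


def scatter(source, index_map, size, fill=None):
    """Write source[k] to position index_map[k] of a new array of length size.

    If the same destination index occurs twice, the later write wins;
    that tie-breaking rule is an assumption of this sketch.
    """
    dest = [fill] * size
    for value, i in zip(source, index_map):
        dest[i] = value
    return dest


gathered = gather([10, 20, 30, 40], [3, 0, 0, 2])
scattered = scatter([10, 20, 30], [2, 0, 1], size=3)
```

Gather reads through the index map, so an index may repeat freely. Scatter writes through it, which is why collisions need a rule.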

200

A survey of processors with explicit multithreading  

Microsoft Academic Search

Hardware multithreading is becoming a generally applied technique in the next generation of microprocessors. Several multithreaded processors are announced by industry or already in production in the areas of high-performance microprocessors, media, and network processors. A multithreaded processor is able to pursue two or more threads of control in parallel within the processor pipeline. The contexts of two or more threads

Theo Ungerer; Borut Robič; Jurij Šilc

2003-01-01

201

Large-scale parallel arrays of silicon nanowires via block copolymer directed self-assembly  

NASA Astrophysics Data System (ADS)

Extending the resolution and spatial proximity of lithographic patterning below critical dimensions of 20 nm remains a key challenge with very-large-scale integration, especially if the persistent scaling of silicon electronic devices is sustained. One approach, which relies upon the directed self-assembly of block copolymers by chemical-epitaxy, is capable of achieving high density 1:1 patterning with critical dimensions approaching 5 nm. Herein, we outline an integration-favourable strategy for fabricating high areal density arrays of aligned silicon nanowires by directed self-assembly of a PS-b-PMMA block copolymer nanopatterns with a L0 (pitch) of 42 nm, on chemically pre-patterned surfaces. Parallel arrays (5 × 10^6 wires per cm) of uni-directional and isolated silicon nanowires on insulator substrates with critical dimension ranging from 15 to 19 nm were fabricated by using precision plasma etch processes; with each stage monitored by electron microscopy. This step-by-step approach provides detailed information on interfacial oxide formation at the device silicon layer, the polystyrene profile during plasma etching, final critical dimension uniformity and line edge roughness variation nanowire during processing. The resulting silicon-nanowire array devices exhibit Schottky-type behaviour and a clear field-effect. The measured values for resistivity and specific contact resistance were ((2.6 ± 1.2) × 10^5 Ω cm) and ((240 ± 80) Ω cm^2) respectively. These values are typical for intrinsic (un-doped) silicon when contacted by high work function metal albeit counterintuitive as the resistivity of the starting wafer (~10 Ω cm) is 4 orders of magnitude lower. In essence, the nanowires are so small and consist of so few atoms, that statistically, at the original doping level each nanowire contains less than a single dopant atom and consequently exhibits the electrical behaviour of the un-doped host material. Moreover this indicates that the processing successfully avoided unintentional doping. Therefore our approach permits tuning of the device steps to contact the nanowires functionality through careful selection of the initial bulk starting material and/or by means of post processing steps e.g. thermal annealing of metal contacts to produce high performance devices. We envision that such a controllable process, combined with the precision patterning of the aligned block copolymer nanopatterns, could prolong the scaling of nanoelectronics and potentially enable the fabrication of dense, parallel arrays of multi-gate field effect transistors.

Farrell, Richard A.; Kinahan, Niall T.; Hansel, Stefan; Stuen, Karl O.; Petkov, Nikolay; Shaw, Matthew T.; West, Laetitia E.; Djara, Vladimir; Dunne, Robert J.; Varona, Olga G.; Gleeson, Peter G.; Jung, Soon-Jung; Kim, Hye-Young; Koleśnik, Maria M.; Lutz, Tarek; Murray, Christopher P.; Holmes, Justin D.; Nealey, Paul F.; Duesberg, Georg S.; Krstić, Vojislav; Morris, Michael A.

2012-05-01

202

Two-Dimensional Systolic Array For Kalman-Filter Computing  

NASA Technical Reports Server (NTRS)

Two-dimensional, systolic-array, parallel data processor performs Kalman filtering in real time. Algorithm rearranged to be Faddeev algorithm for generalized signal processing. Algorithm mapped onto very-large-scale integrated-circuit (VLSI) chip in two-dimensional, regular, simple, expandable array of concurrent processing cells. Processor does matrix/vector-based algebraic computations. Applications include adaptive control of robots, remote manipulators and flexible structures and processing radar signals to track targets.

Chang, Jaw John; Yeh, Hen-Geul

1988-01-01
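
The Faddeev algorithm mentioned in this record evaluates D + C·A⁻¹·B without forming A⁻¹ explicitly: the compound matrix [[A, B], [−C, D]] is reduced by Gaussian elimination until the −C block is zero, and the lower-right block then holds the result. The sketch below is a sequential illustration under stated assumptions. It does no pivoting (it assumes nonzero pivots), it uses exact fractions for clarity, and it does not model the systolic-array mapping; the function name and the sample matrices are hypothetical.

```python
from fractions import Fraction


def faddeev(A, B, C, D):
    """Return D + C * inverse(A) * B via elimination on [[A, B], [-C, D]].

    A is n x n. No pivoting is performed, so every pivot is assumed nonzero.
    """
    n = len(A)
    top = [[Fraction(x) for x in row_a + row_b] for row_a, row_b in zip(A, B)]
    bottom = [[Fraction(-x) for x in row_c] + [Fraction(x) for x in row_d]
              for row_c, row_d in zip(C, D)]
    M = top + bottom
    for k in range(n):
        pivot = M[k][k]
        if pivot == 0:
            raise ZeroDivisionError("zero pivot; this sketch does not pivot")
        # Zero column k in every row below the pivot row, which
        # triangularizes A and annihilates -C at the same time.
        for r in range(k + 1, len(M)):
            factor = M[r][k] / pivot
            M[r] = [x - factor * y for x, y in zip(M[r], M[k])]
    return [row[n:] for row in M[n:]]


# Solving A x = b is the special case B = b, C = I, D = 0.
A = [[2, 1], [1, 3]]
b = [[3], [5]]
identity = [[1, 0], [0, 1]]
zero = [[0], [0]]
x = faddeev(A, b, identity, zero)
```

Other choices of B, C, and D give products, inverses, and sums from the same elimination pattern, which is what makes one regular array of cells reusable across the matrix steps of a Kalman filter.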

203

Large-scale parallel arrays of silicon nanowires via block copolymer directed self-assembly.  

PubMed

Extending the resolution and spatial proximity of lithographic patterning below critical dimensions of 20 nm remains a key challenge with very-large-scale integration, especially if the persistent scaling of silicon electronic devices is sustained. One approach, which relies upon the directed self-assembly of block copolymers by chemical-epitaxy, is capable of achieving high density 1:1 patterning with critical dimensions approaching 5 nm. Herein, we outline an integration-favourable strategy for fabricating high areal density arrays of aligned silicon nanowires by directed self-assembly of a PS-b-PMMA block copolymer nanopatterns with a L0 (pitch) of 42 nm, on chemically pre-patterned surfaces. Parallel arrays (5 × 10^6 wires per cm) of uni-directional and isolated silicon nanowires on insulator substrates with critical dimension ranging from 15 to 19 nm were fabricated by using precision plasma etch processes; with each stage monitored by electron microscopy. This step-by-step approach provides detailed information on interfacial oxide formation at the device silicon layer, the polystyrene profile during plasma etching, final critical dimension uniformity and line edge roughness variation nanowire during processing. The resulting silicon-nanowire array devices exhibit Schottky-type behaviour and a clear field-effect. The measured values for resistivity and specific contact resistance were ((2.6 ± 1.2) × 10^5 Ω cm) and ((240 ± 80) Ω cm^2) respectively. These values are typical for intrinsic (un-doped) silicon when contacted by high work function metal albeit counterintuitive as the resistivity of the starting wafer (~10 Ω cm) is 4 orders of magnitude lower. In essence, the nanowires are so small and consist of so few atoms, that statistically, at the original doping level each nanowire contains less than a single dopant atom and consequently exhibits the electrical behaviour of the un-doped host material. 
Moreover this indicates that the processing successfully avoided unintentional doping. Therefore our approach permits tuning of the device steps to contact the nanowires functionality through careful selection of the initial bulk starting material and/or by means of post processing steps e.g. thermal annealing of metal contacts to produce high performance devices. We envision that such a controllable process, combined with the precision patterning of the aligned block copolymer nanopatterns, could prolong the scaling of nanoelectronics and potentially enable the fabrication of dense, parallel arrays of multi-gate field effect transistors. PMID:22481430

Farrell, Richard A; Kinahan, Niall T; Hansel, Stefan; Stuen, Karl O; Petkov, Nikolay; Shaw, Matthew T; West, Laetitia E; Djara, Vladimir; Dunne, Robert J; Varona, Olga G; Gleeson, Peter G; Jung, Soon-Jung; Kim, Hye-Young; Kolešnik, Maria M; Lutz, Tarek; Murray, Christopher P; Holmes, Justin D; Nealey, Paul F; Duesberg, Georg S; Krstić, Vojislav; Morris, Michael A

2012-05-21
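
The statistical argument in the record above (fewer than one dopant atom per nanowire) can be illustrated with a rough count. Everything numeric below except the 17 nm width, which sits inside the reported 15 to 19 nm range, is an assumption made for illustration: a dopant concentration of 1e15 per cm^3 (the order of magnitude usually associated with silicon of about 10 Ω cm), a 17 nm wire height, and a 1 µm wire length. The function name is hypothetical.

```python
def dopants_per_wire(conc_per_cm3, width_nm, height_nm, length_nm):
    """Expected number of dopant atoms in a rectangular nanowire."""
    nm_to_cm = 1e-7
    volume_cm3 = (width_nm * nm_to_cm) * (height_nm * nm_to_cm) * (length_nm * nm_to_cm)
    return conc_per_cm3 * volume_cm3


# Assumed values (not taken from the abstract): 1e15 cm^-3, 17 nm height, 1000 nm length.
expected = dopants_per_wire(1e15, width_nm=17, height_nm=17, length_nm=1000)
```

Under these assumed numbers the expected count is roughly 0.3 dopant atoms per wire, which is consistent with the qualitative claim; different geometry or doping would change the figure.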

204

Proceedings of 12th Intl Conference on Parallel Architectures and Compilation Techniques, September 2003. Initial Observations of the Simultaneous Multithreading Pentium 4 Processor  

E-print Network

2003. Initial Observations of the Simultaneous Multithreading Pentium 4 Processor Nathan Tuck Dean M to putting the processor in context with prior published research in simultaneous multithread- ing throughput. The processor is also evaluated in the context of prior work on the interaction of multithreading

Wang, Deli

205

Appears in IASTED International Conference on Parallel and Distributed Systems (Euro-PDS), July 1-3, 1998, Vienna, Austria. SIMULATION STUDY OF MULTITHREADED VIRTUAL PROCESSOR  

E-print Network

-3, 1998, Vienna, Austria. SIMULATION STUDY OF MULTITHREADED VIRTUAL PROCESSOR BEN LEE, HANTAK KWAK}@computer.etri.re.kr ABSTRACT1 This paper proposes the Multithreaded Virtual Processor (MVP) architecture model as a means of integrating the multithreaded programming paradigm and a modern superscalar processor with support for fast

Lee, Ben

206

LOUD: A 1020-Node Modular Microphone Array and Beamformer for Intelligent  

E-print Network

LOUD: A 1020-Node Modular Microphone Array and Beamformer for Intelligent Computing Spaces Eugene. In these environments, traditional methods of sound capture are insufficient, and array microphones are needed, and tested LOUD, a novel 1020-node microphone array utilizing the Raw tile parallel processor architecture

Mohri, Mehryar

207

Room-temperature synthesis of two-dimensional ultrathin gold nanowire parallel array with tunable spacing.  

PubMed

A series of long-chain amidoamine derivatives with different alkyl chain lengths (CnAA where n is 12, 14, 16, or 18) were synthesized and studied with regard to their ability to form organogels and to act as soft templates for the production of Au nanomaterials. These compounds were found to self-assemble into lamellar structures and exhibited gelation ability in some apolar solvents. The gelation concentration, gel-sol phase transition temperature, and lattice spacing of the lamellar structures in organic solvent all varied on the basis of the alkyl chain length of the particular CnAA compound employed. The potential for these molecules to function as templates was evaluated through the synthesis of Au nanowires (NWs) in their organogels. Ultrathin Au NWs were obtained from all CnAA/toluene gel systems, each within an optimal temperature range. Interestingly, in the case of C12AA and C14AA, it was possible to fabricate ultrathin Au NWs at room temperature. In addition, two-dimensional parallel arrays of ultrathin Au NWs were self-assembled onto TEM copper grids as a result of the drying of dispersion solutions of these NWs. The use of CnAA compounds with differing alkyl chain lengths enabled precise tuning of the distance between the Au NWs in these arrays. PMID:23316723

Morita, Clara; Tanuma, Hiromitsu; Kawai, Chika; Ito, Yuki; Imura, Yoshiro; Kawai, Takeshi

2013-02-01

208

Control scheme for microcomputers being used in multiprocessor arrays  

SciTech Connect

In general, microcomputer central processor devices are completely controllable from memory and memory control lines. By interjecting a controlling processor between the central processor chip and its memory, and using the central processor memory ready signal for synchronization, data can be supplied to the microprocessor either from an attached memory or from the controlling processor. The controlling processor may also download codes into the microprocessor's memory to be used either as programs or as data. By manipulating restart, hold and interrupt signal lines in addition to the memory lines, total control is achieved. Such a scheme can be used to orchestrate the simultaneous application of arrays of microcomputers to single large problems or to many discrete smaller problems. We describe the details of such connections to three commercially available devices: a Motorola 68000, an Advanced Micro Devices 29116 and a National Semiconductor NS32032 and indicate how our scheme may be used to connect such devices into a cooperating parallel array.

Meng, J.; Gin, F.

1984-06-01

209

Computation and parallel implementation for early vision  

NASA Technical Reports Server (NTRS)

The problem of early vision is to transform one or more retinal illuminance images (pixel arrays) into image representations built out of primitive visual features such as edges, regions, disparities, and clusters. These transformed representations form the input to later vision stages that perform higher level vision tasks including matching and recognition. Researchers developed algorithms for: (1) edge finding in the scale space formulation; (2) correlation methods for computing matches between pairs of images; and (3) clustering of data by neural networks. These algorithms are formulated for parallel implementation on SIMD machines, such as the Massively Parallel Processor, a 128 x 128 array processor with 1024 bits of local memory per processor. For some cases, researchers can show speedups of three orders of magnitude over serial implementations.

Gualtieri, J. Anthony

1990-01-01

210

Parallel processing  

SciTech Connect

This book provides an introduction to the fundamental principles and practice of parallel processing. After a general introduction to the many facets of parallelism, the first part of the book is devoted to the development of a coherent theoretical framework. Particular attention is paid to the modeling, semantics and complexity of interacting parallel processes. The second part of the book considers the more practical aspects such as parallel processor architecture, parallel and distributed programming, and concurrent transaction handling in databases.

Krishnamurthy, E.V. (Waikato Univ., Hamilton (New Zealand))

1989-01-01

211

Computational Science Technical Note CSTN-102 Comparing Intra- and Inter-Processor Parallelism on Multi-Core CellBE  

E-print Network

of Sony, Toshiba and IBM is a powerful but complex processing device that has attracted much attention Broadband Engine (Cell BE) multi-core processor from the STI consortium of Sony, Toshiba and IBM of Sony, Toshiba and IBM (STI) produced a multi-core processor chip that is known as the Cell Br

Hawick, Ken

212

Forced convection heat transfer in parallel channel array microchannel heat exchanger  

SciTech Connect

An experimental study is underway to investigate heat transfer and pressure drop in a microchannel heat exchanger. Devices with geometry consisting of an array of 54 parallel microchannels with rectangular cross-section, 1.0 mm deep and 0.27 mm wide (aspect ratio approximately 4, with hydraulic diameter 425 µm) have been tested using Refrigerant 124 as the working fluid. Conditions tested include a range of Reynolds numbers between 100 and 570 (where Re_d is defined using the hydraulic diameter of the channels), uniform heat flux up to approximately 40 W/cm^2, and wall surface superheats ranging from approximately 0 to 65 K, in both single-phase and two-phase flow. The average liquid-side heat transfer coefficient showed a significant increase over the expected value in macroscopic flow at the same Reynolds numbers. In the single-phase tests, the Nusselt number values ranged from about 5 to 12 for the conditions tested and varied with Reynolds number, showing increasing Nu with increasing Re_d. In the two-phase tests, the Nusselt number appeared to be approximately constant with Re_d, at a value of approximately 20. Exit qualities achieved in the two-phase testing ranged from approximately 4% up to nearly 60%, as determined from an energy balance on the system. These results indicate that substantial improvements in thermal hydraulic performance can be realized in microscale heat exchangers with no significant penalty in the pressure drop.

Cuta, J.M.; McDonald, C.E.; Shekarriz, A. [Pacific Northwest National Lab., Richland, WA (United States). Environmental Technology Div.

1996-12-31
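
The hydraulic diameter quoted in the record above follows from the channel cross-section by the standard definition D_h = 4A/P. The short sketch below reproduces it for the stated 1.0 mm by 0.27 mm rectangle; the function name is hypothetical and the calculation assumes a plain rectangular duct wetted on all four sides.

```python
def hydraulic_diameter(width_mm, depth_mm):
    """Hydraulic diameter D_h = 4 * area / wetted perimeter of a rectangular duct."""
    area = width_mm * depth_mm
    perimeter = 2 * (width_mm + depth_mm)
    return 4 * area / perimeter


d_h_mm = hydraulic_diameter(width_mm=0.27, depth_mm=1.0)
aspect_ratio = 1.0 / 0.27
```

The result is about 0.425 mm, that is 425 µm, with an aspect ratio of about 3.7, which matches the "approximately 4" given in the abstract.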

213

Acoustic insertion loss due to two dimensional periodic arrays of circular cylinders parallel to a nearby surface  

E-print Network

The acoustical performances of regular arrays of cylindrical elements, with their axes aligned and parallel to a ground plane, have been investigated through predictions and laboratory experiments. Semi-analytical predictions based on multiple scattering theory and numerical simulations based on a boundary element formulation have been made. Measurements have been made in an anechoic chamber using arrays of (a) cylindrical acoustically-rigid scatterers (PVC pipes) and (b) thin elastic shells. Insertion loss (IL) spectra due to the arrays have been measured without and with ground planes for several receiver heights. Data and predictions have been compared. The minima in the excess attenuation spectrum i.e., attenuation maxima due to the ground alone resulting from destructive interference between direct and ground-reflected sound waves, tend to have an adverse influence on the band gaps (BG) related to a periodic array in the free field when these two effects coincide. On the other hand, the presence of rigid ground may result in an IL for an array near the ground similar to or, in the case of the first BG, greater than that resulting from a double array, equivalent to the original array plus its ground plane mirror image, in the free field.

Anton Krynkin; Olga Umnova; Juan Vicente Sanchez-Perez; Alvin Y. B. Chong; Shahram Taherzadeh; Keith Attenborough

2011-02-15

214

First programmable digital optical processor: optical cellular logic image processor  

NASA Astrophysics Data System (ADS)

The construction of digital optical processors based on the cellular logic image processor (CLIP) architecture is discussed. Both a single-channel processor and a parallel version incorporating 256 information channels have been constructed. The single channel version of the processor allows eight different combinatorial logic processes to be carried out under electronic control and can be programmed in real time. Several algorithms including pattern recognition, byte comparison, full addition and subtraction have been implemented with this machine. The 256 channel version operates similarly to the single channel version except that a reduced instruction set internal processor with four selectable logic processes is used. A nearest neighbor interconnect provides the communication required between the different information channels. More advanced processing capability can be achieved with the introduction of such non-local interconnects as shuffle networks. Results and simulations obtained with these processors are presented. Advances in the various components of the O-CLIP circuit, future goals, and potential application are also discussed.

Craig, Robert G. A.; Wherrett, Brian S.; Walker, Andrew C.; McKnight, Douglas J.; Redmond, Ian R.; Snowdon, John F.; Buller, Gerald S.; Restall, Edward J.; Wilson, R. A.; Wakelin, Suzanne; McArdle, Neil; Meredith, P.; Miller, J. M.; Taghizadeh, Mohammad R.; Mackinnon, G.; Smith, Stanley D.

1991-09-01

215

Atmospheric plasma jet array in parallel electric and gas flow fields for three-dimensional surface treatment  

NASA Astrophysics Data System (ADS)

This letter reports on electrical and optical characteristics of a ten-channel atmospheric pressure glow discharge jet array in parallel electric and gas flow fields. Challenged with complex three-dimensional substrates including surgical tissue forceps and sloped plastic plate of up to 15°, the jet array is shown to achieve excellent jet-to-jet uniformity both in time and in space. Its spatial uniformity is four times better than a comparable single jet when both are used to treat a 15° sloped substrate. These benefits are likely from an effective self-adjustment mechanism among individual jets facilitated by individualized ballast and spatial redistribution of surface charges.

Cao, Z.; Walsh, J. L.; Kong, M. G.

2009-01-01

216

MVSP: multithreaded VLIW stream processor  

NASA Astrophysics Data System (ADS)

Stream processing is a new trend in computer architecture design which fills the gap between inflexible special-purpose media architectures and programmable architectures with low computational ability for media processing. Stream processors are designed for computationally intensive media applications characterized by high data parallelism and producer-consumer locality with little global data reuse. In this paper, we propose a new stream processor, named MVSP. This processor is a programmable stream processor based on Imagine [1]. MVSP exploits the TLP, DLP, SP and ILP parallelism inherent in media applications. A full simulator of MVSP has been implemented and several media workloads composed of EEMBC [2] benchmarks have been applied. The simulation results show performance and functional unit utilization improvements of more than two times in comparison with the Imagine processor.

Sardashti, Somayeh; Ghasemi, Hamid Reza; Fatemi, Omid

2006-02-01

217

Parallel multispot smFRET analysis using an 8-pixel SPAD array  

PubMed Central

Single-molecule Förster resonance energy transfer (smFRET) is a powerful tool for extracting distance information between two fluorophores (a donor and acceptor dye) on a nanometer scale. This method is commonly used to monitor binding interactions or intra- and intermolecular conformations in biomolecules freely diffusing through a focal volume or immobilized on a surface. The diffusing geometry has the advantage of not interfering with the molecules and of giving access to fast time scales. However, separating photon bursts from individual molecules requires low sample concentrations. This results in long acquisition times (several minutes to an hour) to obtain sufficient statistics. It also prevents studying dynamic phenomena happening on time scales larger than the burst duration and smaller than the acquisition time. Parallelization of acquisition overcomes this limit by increasing the acquisition rate using the same low concentrations required for individual molecule burst identification. In this work we present a new two-color smFRET approach using multispot excitation and detection. The donor excitation pattern is composed of 4 spots arranged in a linear pattern. The fluorescent emission of donor and acceptor dyes is then collected and refocused on two separate areas of a custom 8-pixel SPAD array. We report smFRET measurements performed on various DNA samples synthesized with various distances between the donor and acceptor fluorophores. We demonstrate that our approach provides identical FRET efficiency values to a conventional single-spot acquisition approach, but with a reduced acquisition time. Our work thus opens the way to high-throughput smFRET analysis on freely diffusing molecules. PMID:24382989

Ingargiola, A.; Colyer, R. A.; Kim, D.; Panzeri, F.; Lin, R.; Gulinatti, A.; Rech, I.; Ghioni, M.; Weiss, S.; Michalet, X.

2012-01-01
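
As a rough illustration of the per-burst quantity discussed in the record above, the sketch below computes an uncorrected FRET proximity ratio from donor and acceptor photon counts and inverts the textbook relation E = 1 / (1 + (r/R0)^6) for distance. This is a simplified sketch, not the authors' analysis pipeline: it omits background, spectral leakage and detection-efficiency corrections, and the function names, the photon counts and the Förster radius of 6 nm are all hypothetical.

```python
def proximity_ratio(n_donor, n_acceptor):
    """Uncorrected FRET efficiency estimate for one burst: nA / (nA + nD)."""
    total = n_donor + n_acceptor
    if total == 0:
        raise ValueError("burst contains no photons")
    return n_acceptor / total


def fret_distance(efficiency, r0_nm):
    """Invert E = 1 / (1 + (r / R0)**6) for the donor-acceptor distance r."""
    if not 0 < efficiency < 1:
        raise ValueError("efficiency must lie strictly between 0 and 1")
    return r0_nm * (1 / efficiency - 1) ** (1 / 6)


# Hypothetical burst: 30 donor-channel and 70 acceptor-channel photons.
e_burst = proximity_ratio(n_donor=30, n_acceptor=70)
# Hypothetical Foerster radius of 6 nm; at E = 0.5 the distance equals R0.
r_half = fret_distance(0.5, r0_nm=6.0)
```

In a multispot measurement the same per-burst calculation would simply be repeated for each excitation spot, which is where the reduction in acquisition time comes from.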

218

Large-Scale Parallel Surface Functionalization of Goblet-type Whispering Gallery Mode Microcavity Arrays for Biosensing Applications.  

PubMed

A novel surface functionalization technique is presented for large-scale selective molecule deposition onto whispering gallery mode microgoblet cavities. The parallel technique allows damage-free individual functionalization of the cavities, arranged on-chip in densely packaged arrays. As the stamp pad a glass slide is utilized, bearing phospholipids with different functional head groups. Coated microcavities are characterized and demonstrated as biosensors. PMID:24990526

Bog, Uwe; Brinkmann, Falko; Kalt, Heinz; Koos, Christian; Mappes, Timo; Hirtz, Michael; Fuchs, Harald; Köber, Sebastian

2014-10-01

219

Quadrature transmit array design using single-feed circularly polarized patch antenna for parallel transmission in MR imaging  

PubMed Central

Quadrature coils are often desired in MR applications because they can improve MR sensitivity and also reduce excitation power. In this work, we propose, for the first time, a quadrature array design strategy for parallel transmission at 298 MHz using single-feed circularly polarized (CP) patch antenna technique. Each array element is a nearly square ring microstrip antenna and is fed at a point on the diagonal of the antenna to generate quadrature magnetic fields. Compared with conventional quadrature coils, the single-feed structure is much simpler and more compact, making the quadrature coil array design practical. Numerical simulations demonstrate that the decoupling between elements is better than –35 dB for all the elements and the RF fields are homogeneous with deep penetration and quadrature behavior in the area of interest. Bloch equation simulation is also performed to simulate the excitation procedure by using an 8-element quadrature planar patch array to demonstrate its feasibility in parallel transmission at the ultrahigh field of 7 Tesla. PMID:24649430

Pang, Yong; Yu, Baiying; Vigneron, Daniel B.

2014-01-01
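
The 298 MHz operating frequency and the 7 Tesla field strength in the record above are linked by the proton Larmor relation f = (γ/2π) B0. The sketch below checks that arithmetic, assuming proton (1H) imaging and the commonly quoted value of about 42.577 MHz/T for γ/2π; the names are hypothetical.

```python
# Proton gyromagnetic ratio divided by 2*pi, in MHz per tesla (approximate).
GAMMA_BAR_MHZ_PER_T = 42.577


def larmor_mhz(b0_tesla):
    """Proton Larmor frequency in MHz for a static field B0 in tesla."""
    return GAMMA_BAR_MHZ_PER_T * b0_tesla


f_7t = larmor_mhz(7.0)
```

The result is close to 298 MHz, which is why the patch elements are designed for that frequency.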

220

Speculative parallelization of partially parallel loops  

E-print Network

with even one cross-processor flow dependence because we have to re-execute sequentially. Moreover, the existing, partial parallelism of loops is not exploited. We demonstrate a generalization of the speculative doall parallelization technique, called...

Dang, Francis Hoai Dinh

2009-05-15

221

Simultaneous multithreading: a platform for next-generation processors  

Microsoft Academic Search

Simultaneous multithreading is a processor design which consumes both thread-level and instruction-level parallelism. In SMT processors, thread-level parallelism can come from either multithreaded, parallel programs or individual, independent programs in a multiprogramming workload. Instruction-level parallelism comes from each single program or thread. Because it successfully (and simultaneously) exploits both types of parallelism, SMT processors use resources more efficiently, and both

Susan J. Eggers; Joel S. Emer; H. M. Levy; Jack L. Lo; Rebecca Stamm; Dean M. Tullsen

1997-01-01

222

Inner-product array processor for retrieval of stored images represented by bipolar binary (+1,-1) pixels using partial input trinary pixels represented by (+1,-1)  

NASA Technical Reports Server (NTRS)

An inner-product array processor is provided with thresholding of the inner product during each iteration to make more significant the inner product employed in estimating a vector to be used as the input vector for the next iteration. While stored vectors and estimated vectors are represented in bipolar binary (1,-1), only those elements of an initial partial input vector that are believed to be common with those of a stored vector are represented in bipolar binary; the remaining elements of a partial input vector are set to 0. This mode of representation, in which the known elements of a partial input vector are in bipolar binary form and the remaining elements are set equal to 0, is referred to as trinary representation. The initial inner products corresponding to the partial input vector will then be equal to the number of known elements. Inner-product thresholding is applied to accelerate convergence and to avoid convergence to a negative input product.

Liu, Hua-Kuang (Inventor); Awwal, Abdul A. S. (Inventor); Karim, Mohammad A. (Inventor)

1993-01-01
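
The retrieval loop described in the record above can be sketched in a few lines. This is a minimal illustration, not the patented design: the thresholding rule used here (inner products below a fixed threshold are set to zero) is an assumption, since the abstract does not state the exact rule, and the function name, the stored vectors and the threshold value are hypothetical.

```python
def recall(stored, partial, threshold, max_iters=10):
    """Iteratively estimate a stored bipolar vector from a trinary partial input.

    stored:    list of bipolar (+1/-1) vectors of equal length
    partial:   trinary input; known elements are +1/-1, unknown elements are 0
    threshold: inner products below this value are zeroed (assumed rule)
    """
    x = list(partial)
    for _ in range(max_iters):
        # Inner product of the current estimate with every stored vector.
        products = [sum(a * b for a, b in zip(v, x)) for v in stored]
        # Thresholding: weak or negative matches contribute nothing.
        weights = [p if p >= threshold else 0 for p in products]
        # Weighted sum of stored vectors, then hard-limit back to bipolar form.
        sums = [sum(w * v[i] for w, v in zip(weights, stored)) for i in range(len(x))]
        estimate = [1 if s >= 0 else -1 for s in sums]
        if estimate == x:
            break  # the estimate no longer changes
        x = estimate
    return x


stored = [
    [1, 1, 1, 1, -1, -1, -1, -1],
    [1, -1, 1, -1, 1, -1, 1, -1],
    [1, 1, -1, -1, 1, 1, -1, -1],
]
# Only the first four elements of the first stored vector are known.
partial = [1, 1, 1, 1, 0, 0, 0, 0]
recovered = recall(stored, partial, threshold=2)
```

With this input the first inner product is 4, equal to the number of known elements as the abstract states, and the other two are 0, so the thresholded estimate settles on the first stored vector.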

223

An associative processor for air traffic control  

Microsoft Academic Search

In recent years associative memories have been receiving an increasing amount of attention. At the same time multiprocessor and parallel processing systems have been under study to solve very large problems. An associative processor is one form of a parallel processor that seems able to provide a cost effective solution to many problems such as the air traffic control (ATC)

Kenneth James Thurber

1971-01-01

224

Appears in "Proceedings of the Sixth International Workshop on Languages and Compilers for Parallel Computing," pages 96-114, 1993. ZPL: An Array Sublanguage  

E-print Network

is an array sublanguage of the Orca family of parallel programming languages [11, 12]. The Orca languages the programmer's job without limiting expressiveness, the Orca languages provide a set 3This research ZPL. Thus, ZPL is the \

Lin, Calvin

225

Biological Information Signal Processor  

NASA Technical Reports Server (NTRS)

Biological Information Signal Processor (BISP) is computing system analyzing data on deoxyribonucleic acid (DNA) sequences for molecular genetic analysis. Includes coprocessors, specialized microprocessors complementing present and future computers by performing rapidly most-time-consuming DNA-sequence-analyzing functions, establishing relationships (alignments) between both global sequences and defining patterns in multiple sequences. Also includes state-of-art software and data-base systems on both conventional and parallel computer systems to augment analytical abilities of developmental coprocessors.

Chow, Edward T.; Peterson, John C.; Yoo, Michael M.

1993-01-01

226

Two-dimensional parallel array technology as a new approach to automated combinatorial solid-phase organic synthesis  

PubMed

An automated, 96-well parallel array synthesizer for solid-phase organic synthesis has been designed and constructed. The instrument employs a unique reagent array delivery format, in which each reagent utilized has a dedicated plumbing system. An inert atmosphere is maintained during all phases of a synthesis, and temperature can be controlled via a thermal transfer plate which holds the injection molded reaction block. The reaction plate assembly slides in the X-axis direction, while eight nozzle blocks holding the reagent lines slide in the Y-axis direction, allowing for the extremely rapid delivery of any of 64 reagents to 96 wells. In addition, there are six banks of fixed nozzle blocks, which deliver the same reagent or solvent to eight wells at once, for a total of 72 possible reagents. The instrument is controlled by software which allows the straightforward programming of the synthesis of a larger number of compounds. This is accomplished by supplying a general synthetic procedure in the form of a command file, which calls upon certain reagents to be added to specific wells via lookup in a sequence file. The bottle position, flow rate, and concentration of each reagent is stored in a separate reagent table file. To demonstrate the utility of the parallel array synthesizer, a small combinatorial library of hydroxamic acids was prepared in high throughput mode for biological screening. Approximately 1300 compounds were prepared on a 10 µmole scale (3-5 mg) in a few weeks. The resulting crude compounds were generally >80% pure, and were utilized directly for high throughput screening in antibacterial assays. Several active wells were found, and the activity was verified by solution-phase synthesis of analytically pure material, indicating that the system described herein is an efficient means for the parallel synthesis of compounds for lead discovery. Copyright 1998 John Wiley & Sons, Inc. PMID:10099494

Brennan; Biddison; Frauendorf; Schwarcz; Keen; Ecker; Davis; Tinder; Swayze

1998-01-01

227

A novel polymeric microelectrode array for highly parallel, long-term neuronal culture and stimulation  

E-print Network

Cell-based high-throughput screening is emerging as a disruptive technology in drug discovery; however, massively parallel electrical assaying of neurons and cardiomyocytes has until now been prohibitively expensive. To ...

Talei Franzesi, Giovanni

2008-01-01

228

A 256×256 CMOS imaging array with wide dynamic range pixels and column-parallel digital output  

Microsoft Academic Search

A stepped reset-gate voltage technique is applied to a CMOS active pixel sensor array to increase dynamic range by 26 dB. A frame rate of 390 frames/s is achieved using column-parallel output circuits. Switched-capacitor correlated double-sampling circuits reduce fixed-pattern noise to 4.0 mV (dark). Cyclic analog-to-digital converters achieve approximately 9-b accuracy. At 30 frames/s, random noise is 0.56 mV (dark),

Steven Decker; D. McGrath; Kevin Brehmer; Charles G. Sodini

1998-01-01
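
The figures in the record above can be put in more familiar terms with two small conversions. The sketch below assumes the usual 20·log10 convention for image-sensor dynamic range (a signal ratio rather than a power ratio) and treats the 256×256 array as read out in full every frame; the function names are hypothetical.

```python
def db_to_signal_ratio(db):
    """Convert decibels to a signal (amplitude) ratio, assuming 20*log10 scaling."""
    return 10 ** (db / 20)


def pixel_rate(rows, cols, frames_per_second):
    """Pixels converted per second when every pixel is read out each frame."""
    return rows * cols * frames_per_second


range_gain = db_to_signal_ratio(26)   # extra dynamic range as a ratio
rate = pixel_rate(256, 256, 390)      # pixels per second at the top frame rate
```

Under these assumptions a 26 dB increase corresponds to roughly a 20-fold wider range of resolvable illumination, and 390 frames/s implies about 25.6 million pixel conversions per second, which is the workload the column-parallel converters share.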

229

ATAC: A Manycore Processor with On-Chip Optical Network  

E-print Network

Ever since industry has turned to parallelism instead of frequency scaling to improve processor performance, multicore processors have continued to scale to larger and larger numbers of cores. Some believe that multicores ...

Liu, Jifeng

2009-05-05

230

Fast calculation of computer-generated Fresnel hologram utilizing distributed parallel processing and array operation  

Microsoft Academic Search

A computer-generated Fresnel hologram (Fresnel CGH) for a 3-D object can be made by calculating Fresnel diffraction. However, since the diffraction computation is complicated, it requires a huge calculation time. In this paper, we propose a fast calculation method for Fresnel CGH utilizing distributed parallel processing.

Syougo Nishi; Takaya Yuizono; Kunihiko Mori; Shigeru Nakayama

2003-01-01
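
One common way to compute a Fresnel CGH, used here only as an illustration, treats the object as a set of point sources and sums one Fresnel zone pattern per point at every hologram pixel. Because each hologram row is independent, the rows can be handed out to separate workers. The sketch below shows that decomposition; it is not the method of the paper above, whose details the abstract does not give, and all names, the pixel pitch and the wavelength are assumptions.

```python
import math
from functools import partial


def hologram_row(y_index, points, width, pitch, wavelength):
    """Compute one row of the fringe pattern as a sum of Fresnel zone patterns.

    points: iterable of (x, y, z, amplitude) object points, lengths in metres.
    """
    y = y_index * pitch
    row = []
    for x_index in range(width):
        x = x_index * pitch
        value = 0.0
        for px, py, pz, amplitude in points:
            r2 = (x - px) ** 2 + (y - py) ** 2
            # Fresnel (paraxial) approximation of the phase from one point source.
            value += amplitude * math.cos(math.pi * r2 / (wavelength * pz))
        row.append(value)
    return row


def fresnel_cgh(points, width, height, pitch, wavelength, mapper=map):
    """Build the hologram row by row; `mapper` decides where each row is computed."""
    worker = partial(hologram_row, points=points, width=width,
                     pitch=pitch, wavelength=wavelength)
    return list(mapper(worker, range(height)))


# Hypothetical object: a single point 0.1 m behind the hologram plane origin.
points = [(0.0, 0.0, 0.1, 1.0)]
hologram = fresnel_cgh(points, width=4, height=3, pitch=10e-6, wavelength=633e-9)
```

Passing the `map` method of a `concurrent.futures.ProcessPoolExecutor` as `mapper` would spread the rows over several processes without changing the per-row code, which is the design reason for keeping `hologram_row` a top-level function.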

231

Dual flux-to-voltage response of YBa2Cu3O7-δ asymmetric parallel arrays of Josephson junctions  

NASA Astrophysics Data System (ADS)

We fabricated a parallel array of 440 YBa2Cu3O7-δ bicrystal grain boundary Josephson junctions having an inductive asymmetric loop configuration within the array. Families of current-voltage characteristics (IVCs) have been measured in the temperature range (4.7-92) K for various values of a magnetic flux applied via a control current Ictrl. For both positive and negative current biases I, current-driven chains of magnetic vortices propagate along the array, producing flux-flow current resonances on the IVCs. However, at 77 K and above, due to the system’s inductive asymmetry the flux flow is suppressed (enhanced) for negative (positive) I. Consequently, the system shows a dual flux-to-voltage response. For negative I it operates like a flux-interferometer having a rather sinusoidal V (Ictrl) response. In contrast, for positive I the device’s response V (Ictrl) remains periodic but highly non-sinusoidal due to the interplay between multiple flux-flow modes. Below 60 K such a dual behaviour is far less pronounced as a result of flux-flow modes being suppressed due to a decrease of the dissipation coefficient with temperature.

Chesca, Boris; John, Daniel; Mellor, Christopher J.

2014-05-01

232

A 1.0 GHz single-issue 64-bit PowerPC integer processor  

Microsoft Academic Search

The organization and circuit design of a 1.0 GHz integer processor built in 0.25 µm CMOS technology are presented, a microarchitecture emphasizing parallel computation with a single late select per cycle, structured control logic implemented by read-only-memories and programmable logic arrays, and a delayed reset dynamic circuit style enabling complex functions to be implemented in a few levels of logic

Joel Silberman; Naoaki Aoki; David Boerstler; Jeffrey L. Burns; Sang Dhong; Axel Essbaum; Uttam Ghoshal; David Heidel; Peter Hofstee; Kyung Tek Lee; David Meltzer; Hung Ngo; Kevin Nowka; Stephen Posluszny; Osamu Takahashi; Ivan Vo; Brian Zoric

1998-01-01

233

Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays  

Microsoft Academic Search

We describe a novel sequencing approach that combines non-gel-based signature sequencing with in vitro cloning of millions of templates on separate 5 µm diameter microbeads. After constructing a microbead library of DNA templates by in vitro cloning, we assembled a planar array of a million template-containing microbeads in a flow cell at a density greater than 3 × 10^6 microbeads/cm^2.

Maria Johnson; John Bridgham; George Golda; David H. Lloyd; Davida Johnson; Shujun Luo; Sarah McCurdy; Michael Foy; Mark Ewan; Rithy Roth; Dave George; Sam Eletr; Glenn Albrecht; Eric Vermaas; Steven R. Williams; Keith Moon; Timothy Burcham; Michael Pallas; Robert B. DuBridge; James Kirchner; Karen Fearon; Jen-i Mao; Kevin Corcoran; Sydney Brenner

2000-01-01

234

Platinum plasmonic nanostructure arrays for massively parallel single-molecule detection based on enhanced fluorescence measurements  

Microsoft Academic Search

We fabricated platinum bowtie nanostructure arrays producing fluorescence enhancement and evaluated their performance using two-photon photoluminescence and single-molecule fluorescence measurements. A comprehensive selection of suitable materials was explored by electromagnetic simulation and Pt was chosen as the plasmonic material for visible light excitation near 500 nm, which is preferable for multicolor dye-labeling applications like DNA sequencing. The observation of bright

Toshiro Saito; Satoshi Takahashi; Takayuki Obara; Naoshi Itabashi; Kazumichi Imai

2011-01-01

235

Database Reorganization in Parallel Disk Arrays with I/O Service Stealing  

NASA Technical Reports Server (NTRS)

We present a model for data reorganization in parallel disk systems that is geared towards load balancing in an environment with periodic access patterns. Data reorganization is performed by disk cooling, i.e. migrating files or extents from the hottest disks to the coldest ones. We develop an approximate queueing model for determining the effective arrival rates of cooling requests and discuss its use in assessing the costs versus benefits of cooling.

Zabback, Peter; Onyuksel, Ibrahim; Scheuermann, Peter; Weikum, Gerhard

1996-01-01
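
The disk cooling idea in the record above (migrating files from the hottest disk to the coldest one) can be illustrated with a small greedy step. This is only a sketch under assumptions: the `heat` measure, the function name `cooling_step`, and the rule for picking the file are hypothetical and are not taken from the paper, whose queueing model is not reproduced here.

```python
def cooling_step(disks, heat):
    """Move one file from the hottest disk to the coldest disk.

    disks: dict mapping disk name -> set of file names (modified in place)
    heat:  dict mapping file name -> access rate (hypothetical heat measure)
    Returns (file, source, target), or None if no move reduces the imbalance.
    """
    load = {d: sum(heat[f] for f in files) for d, files in disks.items()}
    hottest = max(load, key=load.get)
    coldest = min(load, key=load.get)
    gap = load[hottest] - load[coldest]
    # Moving a file of heat h changes the gap to |gap - 2h|,
    # so only files with h < gap make the two disks more balanced.
    candidates = [f for f in sorted(disks[hottest]) if heat[f] < gap]
    if not candidates:
        return None
    chosen = min(candidates, key=lambda f: abs(gap - 2 * heat[f]))
    disks[hottest].remove(chosen)
    disks[coldest].add(chosen)
    return chosen, hottest, coldest
```

Repeating the step until it returns None gives a simple balancing loop; a real system would also weigh the cost of each migration against its benefit, which is the trade-off the abstract describes.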

236

Field programmable gate array based parallel strapdown algorithm design for strapdown inertial navigation systems.  

PubMed

A new generalized optimum strapdown algorithm with coning and sculling compensation is presented, in which the position, velocity and attitude updating operations are carried out based on a single-speed structure in which all computations are executed at a single updating rate that is sufficiently high to accurately account for high frequency angular rate and acceleration rectification effects. Unlike existing algorithms, the updating rates of the coning and sculling compensations are independent of the number of gyro incremental angle samples and the number of accelerometer incremental velocity samples. When the output sampling rate of the inertial sensors remains constant, this algorithm allows the updating rate of the coning and sculling compensation to be increased while using more gyro incremental angle and accelerometer incremental velocity samples, in order to improve the accuracy of the system. Then, in order to implement the new strapdown algorithm in a single FPGA chip, the parallelization of the algorithm is designed and its computational complexity is analyzed. The performance of the proposed parallel strapdown algorithm is tested on the Xilinx ISE 12.3 software platform and the FPGA device XC6VLX550T hardware platform on the basis of some fighter data. It is shown that this parallel strapdown algorithm on the FPGA platform can greatly decrease the execution time of the algorithm to meet the real-time and high precision requirements of the system in a highly dynamic environment, relative to the existing implementation on the DSP platform. PMID:22164058

Li, Zong-Tao; Wu, Tie-Jun; Lin, Can-Long; Ma, Long-Hua

2011-01-01

237

Field Programmable Gate Array Based Parallel Strapdown Algorithm Design for Strapdown Inertial Navigation Systems  

PubMed Central

A new generalized optimum strapdown algorithm with coning and sculling compensation is presented, in which the position, velocity and attitude updating operations are carried out based on a single-speed structure in which all computations are executed at a single updating rate that is sufficiently high to accurately account for high frequency angular rate and acceleration rectification effects. Unlike existing algorithms, the updating rates of the coning and sculling compensations are independent of the number of gyro incremental angle samples and the number of accelerometer incremental velocity samples. When the output sampling rate of the inertial sensors remains constant, this algorithm allows the updating rate of the coning and sculling compensation to be increased while using more gyro incremental angle and accelerometer incremental velocity samples, in order to improve the accuracy of the system. Then, in order to implement the new strapdown algorithm in a single FPGA chip, the parallelization of the algorithm is designed and its computational complexity is analyzed. The performance of the proposed parallel strapdown algorithm is tested on the Xilinx ISE 12.3 software platform and the FPGA device XC6VLX550T hardware platform on the basis of some fighter data. It is shown that this parallel strapdown algorithm on the FPGA platform can greatly decrease the execution time of the algorithm to meet the real-time and high precision requirements of the system in a highly dynamic environment, relative to the existing implementation on the DSP platform. PMID:22164058

Li, Zong-Tao; Wu, Tie-Jun; Lin, Can-Long; Ma, Long-Hua

2011-01-01

238

Multi-output programmable quantum processor  

E-print Network

By combining telecloning and the programmable quantum gate array presented by Nielsen and Chuang [Phys. Rev. Lett. 79, 321 (1997)], we propose a programmable quantum processor which can be programmed to implement a restricted set of operations with several identical data outputs. The outputs are approximately transformed versions of the input data. The processor succeeds with a certain probability.

Yafei Yu; Jian Feng; Mingsheng Zhan

2002-09-10

239

Multi-output programmable quantum processor  

NASA Astrophysics Data System (ADS)

By combining telecloning and the programmable quantum gate array presented by Nielsen and Chuang [Phys. Rev. Lett. 79, 321 (1997)], we propose a programmable quantum processor which can be programmed to implement a restricted set of operations with several identical data outputs. The outputs are approximately transformed versions of the input data. The processor succeeds with a certain probability.

Yu, Yafei; Feng, Jian; Zhan, Mingsheng

2002-11-01

240

System for executing a sequence of operation codes with some codes being executed out of order in a pipeline parallel processor  

SciTech Connect

This patent describes a data processing system having a memory and a data processor coupled to the memory to receive a code stream including memory control words for referencing control words in memory and data words in memory, the data processor having program means to separate the control word referencing control words from the data referencing control words, a memory address formation unit comprising: queue means to receive a sequence of control words some of which reference data words in memory and others of which reference other control words in memory; a first receiving means coupled to the queue means to receive queue addresses of control words which reference data words; a second queue address receiving means to receive queue addresses of control words which reference other control words; address formation means coupled to the first and second receiving means for receiving control words and forming memory addresses; and logic means coupled to the first receiving means for transmitting a later received control word for addressing a control word to the address formation means before an earlier received control word for addressing a data word.

Hassler, J.A.; Burroughs, W.G.

1988-09-20

241

Parallel array of nanochannels grafted with polymer-brushes-stabilized Au nanoparticles for flow-through catalysis.  

PubMed

Smart systems on the nanometer scale for continuous flow-through reaction present fascinating advantages in heterogeneous catalysis, in which a parallel array of straight nanochannels offers a platform with high surface area for assembling and stabilizing metallic nanoparticles working as catalysts. Herein we demonstrate a method for finely modifying the nanoporous anodic aluminum oxide (AAO), and further integration of nanoreactors. By using atom transfer radical polymerization (ATRP), polymer brushes were successfully grafted on the inner wall of the nanochannels of the AAO membrane, followed by exchanging counter ions with a precursor for nanoparticles (NPs), and used as the template for deposition of well-defined Au NPs. The membrane was used as a functional nanochannel for novel flow-through catalysis. High catalytic performance and instantaneous separation of products from the reaction system were achieved in the reduction of 4-nitrophenol. PMID:24129356

Liu, Jianxi; Ma, Shuanhong; Wei, Qiangbing; Jia, Lei; Yu, Bo; Wang, Daoai; Zhou, Feng

2013-12-01

242

Parallel array of nanochannels grafted with polymer-brushes-stabilized Au nanoparticles for flow-through catalysis  

NASA Astrophysics Data System (ADS)

Smart systems on the nanometer scale for continuous flow-through reaction present fascinating advantages in heterogeneous catalysis, in which a parallel array of straight nanochannels offers a platform with high surface area for assembling and stabilizing metallic nanoparticles working as catalysts. Herein we demonstrate a method for finely modifying the nanoporous anodic aluminum oxide (AAO), and further integration of nanoreactors. By using atom transfer radical polymerization (ATRP), polymer brushes were successfully grafted on the inner wall of the nanochannels of the AAO membrane, followed by exchanging counter ions with a precursor for nanoparticles (NPs), and used as the template for deposition of well-defined Au NPs. The membrane was used as a functional nanochannel for novel flow-through catalysis. High catalytic performance and instantaneous separation of products from the reaction system were achieved in the reduction of 4-nitrophenol.

Liu, Jianxi; Ma, Shuanhong; Wei, Qiangbing; Jia, Lei; Yu, Bo; Wang, Daoai; Zhou, Feng

2013-11-01

243

Optimal expression evaluation for data parallel architectures  

NASA Technical Reports Server (NTRS)

A data parallel machine represents an array or other composite data structure by allocating one processor (at least conceptually) per data item. A pointwise operation can be performed between two such arrays in unit time, provided their corresponding elements are allocated in the same processors. If the arrays are not aligned in this fashion, the cost of moving one or both of them is part of the cost of the operation. The choice of where to perform the operation then affects this cost. If an expression with several operands is to be evaluated, there may be many choices of where to perform the intermediate operations. An efficient algorithm is given to find the minimum-cost way to evaluate an expression, for several different data parallel architectures. This algorithm applies to any architecture in which the metric describing the cost of moving an array is robust. This encompasses most of the common data parallel communication architectures, including meshes of arbitrary dimension and hypercubes. Remarks are made on several variations of the problem, some of which are solved and some of which remain open.

Gilbert, John R.; Schreiber, Robert

1990-01-01
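
To make the cost question in the record above concrete, the following sketch evaluates a small expression tree by brute-force dynamic programming. It is not the efficient algorithm the abstract refers to; the tuple encoding of the tree, the name `min_cost`, and the one-dimensional mesh distance used in the example are all assumptions for illustration. The unit cost of each pointwise operation is ignored because it is the same for every placement.

```python
def min_cost(node, positions, dist):
    """Return {p: cheapest movement cost of having node's value at position p}.

    node is ("leaf", home_position) or ("op", left_subtree, right_subtree).
    dist(a, b) is the cost of moving an array from position a to position b.
    """
    if node[0] == "leaf":
        return {p: dist(node[1], p) for p in positions}
    left = min_cost(node[1], positions, dist)
    right = min_cost(node[2], positions, dist)
    # Cost of bringing both operands to q and performing the operation there.
    here = {q: left[q] + right[q] for q in positions}
    # The result may then be moved from q to wherever it is needed next.
    return {p: min(here[q] + dist(q, p) for q in positions) for p in positions}


def mesh_1d(a, b):
    """Hypothetical movement cost on a 1-D mesh: number of hops."""
    return abs(a - b)
```

For example, with operands stored at positions 0, 4 and 4 on a five-position mesh, `min(min_cost(tree, range(5), mesh_1d).values())` gives the cheapest total movement over all choices of where to perform the intermediate operations.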

244

Parallel image-acquisition in continuous-wave electron paramagnetic resonance imaging with a surface coil array: Proof-of-concept experiments.  

PubMed

This article describes a feasibility study of parallel image-acquisition using a two-channel surface coil array in continuous-wave electron paramagnetic resonance (CW-EPR) imaging. Parallel EPR imaging was performed by multiplexing of EPR detection in the frequency domain. The parallel acquisition system consists of two surface coil resonators and radiofrequency (RF) bridges for EPR detection. To demonstrate the feasibility of this method of parallel image-acquisition with a surface coil array, three-dimensional EPR imaging was carried out using a tube phantom. Technical issues in the multiplexing method of EPR detection were also clarified. We found that degradation in the signal-to-noise ratio due to the interference of RF carriers is a key problem to be solved. PMID:24374749

Enomoto, Ayano; Hirata, Hiroshi

2014-02-01

245

A 5.9mW 6.5GMACS CID/DRAM Array Processor Roman Genov, Gert Cauwenberghs, Grant Mulliken, Farhan Adil  

E-print Network

-latency analog accumulation. Delta-sigma analog-to-digital conversion of the analog array outputs is combined, Externally Digital Computation The approach combines the computational efficiency of analog array processing. The digital representation is embedded in the analog array architecture, with matrix elements stored locally

Cauwenberghs, Gert

246

Parallel recognition of cancer cells using an addressable array of solid-state micropores.  

PubMed

Early stage detection and precise quantification of circulating tumor cells (CTCs) in the peripheral blood of cancer patients are important for early diagnosis. Early diagnosis improves the effectiveness of the therapy and results in better prognosis. Several techniques have been used for CTC detection but are limited by their need for dye tagging, low throughput and lack of statistical reliability at single cell level. Solid-state micropores can characterize each cell in a sample providing interesting information about cellular populations. We report a multi-channel device which utilized solid-state micropores array assembly for simultaneous measurement of cell translocation. This increased the throughput of measurement and as the cells passed the micropores, tumor cells showed distinctive current blockade pulses, when compared to leukocytes. The ionic current across each micropore channel was continuously monitored and recorded. The measurement system not only increased throughput but also provided on-chip cross-relation. The whole blood was lysed to get rid of red blood cells, so the blood dilution was not needed. The approach facilitated faster processing of blood samples with tumor cell detection efficiency of about 70%. The design provided a simple and inexpensive method for rapid and reliable detection of tumor cells without any cell staining or surface functionalization. The device can also be used for high throughput electrophysiological analysis of other cell types. PMID:25038540

Ilyas, Azhar; Asghar, Waseem; Kim, Young-tae; Iqbal, Samir M

2014-12-15

247

Platinum plasmonic nanostructure arrays for massively parallel single-molecule detection based on enhanced fluorescence measurements.  

PubMed

We fabricated platinum bowtie nanostructure arrays producing fluorescence enhancement and evaluated their performance using two-photon photoluminescence and single-molecule fluorescence measurements. A comprehensive selection of suitable materials was explored by electromagnetic simulation and Pt was chosen as the plasmonic material for visible light excitation near 500 nm, which is preferable for multicolor dye-labeling applications like DNA sequencing. The observation of bright photoluminescence (λ = 500-600 nm) from each Pt nanostructure, induced by irradiation at 800 nm with a femtosecond laser pulse, clearly indicates that a highly enhanced local field is created near the Pt nanostructure. The attachment of a single dye molecule was attempted between the Pt triangles of each nanostructure by using selective immobilization chemistry. The fluorescence intensities of the single dye molecule localized on the nanostructures were measured. A highly enhanced fluorescence, which was increased by a factor of 30, was observed. The two-photon photoluminescence intensity and fluorescence intensity showed qualitatively consistent gap size dependence. However, the average fluorescence enhancement factor was rather repressed even in the nanostructure with the smallest gap size compared to the large growth of photoluminescence. The variation of the position of the dye molecule attached to the nanostructure may influence the wide distribution of the fluorescence enhancement factor and cause the rather small average value of the fluorescence enhancement factor. PMID:21988776

Saito, Toshiro; Takahashi, Satoshi; Obara, Takayuki; Itabashi, Naoshi; Imai, Kazumichi

2011-11-01

248

Massively parallel information processing systems for space applications  

NASA Technical Reports Server (NTRS)

NASA is developing massively parallel systems for ultra high speed processing of digital image data collected by satellite borne instrumentation. Such systems contain thousands of processing elements. Work is underway on the design and fabrication of the 'Massively Parallel Processor', a ground computer containing 16,384 processing elements arranged in a 128 x 128 array. This computer uses existing technology. Advanced work includes the development of semiconductor chips containing thousands of feedthrough paths. Massively parallel image analog to digital conversion technology is also being developed. The goal is to provide compact computers suitable for real-time onboard processing of images.

Schaefer, D. H.

1979-01-01

249

Massively parallel electron beam direct writing (MPEBDW) system based on micro-electro-mechanical system (MEMS)/nanocrystalline Si emitter array

NASA Astrophysics Data System (ADS)

The characteristics of a prototype massively parallel electron beam direct writing (MPEBDW) system are demonstrated. The electron optics consist of an emitter array, a micro-electro-mechanical system (MEMS) condenser lens array, auxiliary lenses, a stigmator, three-stage deflectors to align and scan the parallel beams, and an objective lens acting as a reduction lens. The emitter array produces 10000 programmable 10 µm square beams. The electron emitter is a nanocrystalline silicon (nc-Si) ballistic electron emitter array integrated with an active matrix driver LSI for high-speed emission current control. Because the LSI also has a field curvature correction function, the system can use a large electron emitter array. In this system, beams that are incident on the outside of the paraxial region of the reduction lens can also be used through use of the optical aberration correction functions. The exposure pattern is stored in the active matrix LSI's memory. Alignment between the emitter array and the condenser lens array is performed by moving the emitter stage that slides along the x- and y-axes, and rotates around the z-theta axis. The electrons of all beams are accelerated, and pass through the anode array. The stigmator and the two-stage deflectors perform fine adjustments to the beam positions. The other deflector simultaneously scans all parallel beams to synchronize the moving target stage. Exposure is carried out by moving the target stage that holds the wafer. The reduction lens focuses all beams on the target wafer surface, and the electron optics of the column reduces the electron image to 0.1% of its original size.

Kojima, A.; Ikegami, N.; Yoshida, T.; Miyaguchi, H.; Muroyama, M.; Nishino, H.; Yoshida, S.; Sugata, M.; Ohyi, H.; Koshida, N.; Esashi, M.

2014-03-01

250

An implementation of a real-time and parallel processing ECG features extraction algorithm in a Field Programmable Gate Array (FPGA)  

Microsoft Academic Search

The objective of the paper is to report the development of a real-time, parallel-processing algorithm for electrocardiogram (ECG) signal feature extraction and its implementation in a Field Programmable Gate Array (FPGA). The prototyped system will extract the ECG features and be tested as a System on Chip (SoC) design. The performance of the algorithm was tested against a MATLAB routine

Weichih Hu; Chun Cheng Lin; Liang Yu Shyu

2011-01-01

251

Upset Characterization of the PowerPC405 Hard-core Processor Embedded in Virtex-II Pro Field Programmable Gate Arrays  

NASA Technical Reports Server (NTRS)

Shown in this presentation are recent results for the upset susceptibility of the various types of memory elements in the embedded PowerPC405 in the Xilinx V2P40 FPGA. For critical flight designs where configuration upsets are mitigated effectively through appropriate design triplication and configuration scrubbing, these upsets of processor elements can dominate the system error rate. Data from irradiations with both protons and heavy ions are given and compared using available models.

Swift, Gary M.; Allen, Gregory S.; Farmanesh, Farhad; George, Jeffrey; Petrick, David J.; Chayab, Fayez

2006-01-01

252

Calculating electronic tunnel currents in networks of disordered irregularly shaped nanoparticles by mapping networks to arrays of parallel nonlinear resistors  

NASA Astrophysics Data System (ADS)

We have shown both theoretically and experimentally that tunnel currents in networks of disordered irregularly shaped nanoparticles (NPs) can be calculated by considering the networks as arrays of parallel nonlinear resistors. Each resistor is described by a one-dimensional or a two-dimensional array of equal-size nanoparticles in which the tunnel junction gaps between nanoparticles are assumed to be equal. The number of tunnel junctions between two contact electrodes and the tunnel junction gaps between nanoparticles are found to be functions of Coulomb blockade energies. In addition, the tunnel barriers between nanoparticles were considered to be tilted at high voltages. Furthermore, the role of the thermal expansion coefficient of the tunnel junction gaps in the tunnel current is taken into account. The model calculations fit very well to the experimental data of a network of disordered gold nanoparticles, a forest of multi-wall carbon nanotubes, and a network of few-layer graphene nanoplates over a wide temperature range (5-300 K) at low and high DC bias voltages (0.001 mV-50 V). Our investigations indicate that, although electron cotunneling in networks of disordered irregularly shaped NPs may occur, non-Arrhenius behavior at low temperatures cannot be described by the cotunneling model due to the size distribution in the networks and the irregular shape of the nanoparticles. Non-Arrhenius behavior of the samples in the zero bias voltage limit was attributed to the disorder in the samples. Unlike the electron cotunneling model, we found that the crossover from Arrhenius to non-Arrhenius behavior occurs at two temperatures, one at a high temperature and the other at a low temperature.

Aghili Yajadda, Mir Massoud

2014-10-01

253

Optical scalable parallel modified signed-digit algorithms for large-scale array addition and multiplication using digit-decomposition-plane representation  

NASA Astrophysics Data System (ADS)

Optical scalable parallel and high-speed 2D-data array computing based on modified signed-digit arithmetic and digit-decomposition-plane representation is presented. The digit-decomposition-plane coding uses m binary planes or m blocks of a binary plane to code an m-digit data array. Therefore, we can easily access each digit individually and can implement array addition with only 13 combinatorial logic formulas. A duplication-shifting-superimposition algorithm for digital array multiplication is proposed. The algorithm generates and records all the bitwise products in mn binary planes simultaneously, and then processes them based on a modified signed-digit adder tree. Only five basic operations of bitwise product, duplication, shifting, masking and magnification are required for digital computing. The features of the proposed algorithm are that it requires no bistable devices, no decimal point, no sign, and no carry. The algorithm and its implementing scheme are scalable because they are independent of the sizes of the data arrays. Therefore it has great promise for large-scale array computing. Optical implementation with classical optical elements, such as beamsplitters, parallel plates, and mirrors, is discussed. A preliminary demonstration experiment with an optoelectronic scheme is described.

Huang, Hongxin; Itoh, Masahide; Yatagai, Toyohiko

1999-03-01
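
The carry-free property of modified signed-digit (MSD) arithmetic mentioned in the record above can be seen in a short software sketch. This uses a common textbook two-step rule for radix-2 digits in {-1, 0, 1}; it is not the authors' set of 13 logic formulas, and the function names and the least-significant-digit-first list encoding are assumptions made for illustration.

```python
def msd_value(digits):
    """Integer value of an MSD number given least significant digit first."""
    return sum(d * (1 << i) for i, d in enumerate(digits))


def msd_add(x, y):
    """Add two MSD numbers (lists of -1/0/1, least significant digit first).

    Step 1 produces a transfer digit and a weight digit at every position,
    looking only at that position and the one below it. Step 2 adds weight
    and incoming transfer, which never overflows the digit set, so no carry
    propagates and every position can be computed in parallel.
    """
    n = max(len(x), len(y))
    x = list(x) + [0] * (n - len(x))
    y = list(y) + [0] * (n - len(y))
    transfer = [0] * (n + 1)
    weight = [0] * (n + 1)
    for i in range(n):
        s = x[i] + y[i]
        lower_nonneg = i == 0 or (x[i - 1] >= 0 and y[i - 1] >= 0)
        if s == 2:
            t, w = 1, 0
        elif s == 1:
            t, w = (1, -1) if lower_nonneg else (0, 1)
        elif s == 0:
            t, w = 0, 0
        elif s == -1:
            t, w = (0, -1) if lower_nonneg else (-1, 1)
        else:  # s == -2
            t, w = -1, 0
        transfer[i + 1] = t
        weight[i] = w
    # Step 2: digit-wise sum; the result has one more digit than the inputs.
    return [weight[i] + transfer[i] for i in range(n + 1)]
```

The loop is written sequentially for readability, but each iteration depends only on the input digits at positions i and i-1, which is what makes the scheme attractive for plane-parallel optical hardware.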

254

Parallel Mandelbrot Set Model  

NSDL National Science Digital Library

The Parallel Mandelbrot Set Model is a parallelization of the sequential MandelbrotSet model, which does all the computations on a single processor core. This parallelization is able to use a computer with more than one core (or processor) to carry out the same computation, thus speeding up the process. The parallelization is done using the model elements in the Parallel Java group. These model elements allow easy use of the Parallel Java library created by Alan Kaminsky. In particular, the parallelization used for this model is based on code in Chapters 11 and 12 of Kaminsky's book Building Parallel Java. The Parallel Mandelbrot Set Model was developed using the Easy Java Simulations (EJS) modeling tool. It is distributed as a ready-to-run (compiled) Java archive. Double click the ejs_chaos_ParallelMandelbrotSet.jar file to run the program if Java is installed.

Franciscouembre

2011-11-24
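
The kind of decomposition described in the record above, where independent rows of the Mandelbrot image are handed to separate workers, can be sketched as follows. This is not the EJS model or the Parallel Java code; it is a Python illustration with hypothetical function names, a fixed viewing window chosen for the example, and a thread pool that can be swapped for a process pool.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def escape_count(c, max_iter):
    """Number of iterations of z -> z*z + c before |z| exceeds 2."""
    z = 0j
    for n in range(max_iter):
        if abs(z) > 2.0:
            return n
        z = z * z + c
    return max_iter


def compute_row(y, width, height, max_iter):
    """Escape counts for one image row over the window [-2, 1] x [-1.5, 1.5]."""
    im = -1.5 + 3.0 * y / (height - 1)
    return [escape_count(complex(-2.0 + 3.0 * x / (width - 1), im), max_iter)
            for x in range(width)]


def mandelbrot_sequential(width, height, max_iter=50):
    return [compute_row(y, width, height, max_iter) for y in range(height)]


def mandelbrot_parallel(width, height, max_iter=50, workers=4,
                        executor_cls=ThreadPoolExecutor):
    """Same image, with rows distributed over a pool of workers."""
    row = partial(compute_row, width=width, height=height, max_iter=max_iter)
    with executor_cls(max_workers=workers) as pool:
        return list(pool.map(row, range(height)))
```

Rows are independent, so no synchronization is needed beyond collecting the results in order. In CPython, threads will not speed up this pure-Python loop because of the global interpreter lock; passing `concurrent.futures.ProcessPoolExecutor` as `executor_cls` is the usual way to use several cores.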

255

A 5.9mW 6.5GMACS CID/DRAM Array Processor Roman Genov, Gert Cauwenberghs, Grant Mulliken, Farhan Adil  

E-print Network

-latency analog accumulation. Delta-sigma analog-to-digital conversion of the analog array outputs is combined delta-sigma ADCs measures 3 mm × 3 mm in 0.5 µm CMOS and delivers 1.1 GMACS/mW. 1. Introduction Real, Externally Digital Computation The approach combines the computational efficiency of analog array processing

Genov, Roman

256

Load follow simulation of three-dimensional boiling water reactor core by PACS32 parallel microprocessor system  

Microsoft Academic Search

The three-dimensional boiling water reactor (BWR) core following the daily load was simulated using the processor array for continuum simulation (PACS-32), a newly developed parallel microprocessor system. The PACS system consists of 32 processing units (PUs) (microprocessors) and has a multiple-instruction, multiple-data (MIMD) architecture, which is well suited to the numerical solution of partial differential equations. The BWR

T. Hoshino; T. Shirakawa

1982-01-01

257

Parallel Architecture for the Solution of Linear Equations Systems Based on Division Free Gaussian Elimination Method Implemented in FPGA  

Microsoft Academic Search

This paper presents a parallel architecture for the solution of systems of linear equations based on the Division-Free Gaussian Elimination Method. This architecture was implemented in a Field Programmable Gate Array (FPGA). The division-free Gaussian elimination method was integrated in identical processors in a Xilinx Spartan 3 FPGA. A top-down design was used. The proposed architecture can handle IEEE

R. MARTINEZ; D. TORRES; M. MADRIGAL; S. MAXIMOV

2009-01-01
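
For readers unfamiliar with the method named in the record above, the following sketch shows a generic cross-multiplication form of division-free Gaussian elimination on an integer augmented matrix. It is a software illustration only, not the paper's FPGA architecture; the function names are hypothetical, and exact fractions are used in the final back substitution, which is the only place a division appears.

```python
from fractions import Fraction


def division_free_eliminate(aug):
    """Forward elimination using only multiplication and subtraction.

    aug is an n x (n+1) augmented matrix of integers. Each row below the
    pivot is replaced by pivot * row_i - a_ik * row_k, which zeroes column k
    without dividing.
    """
    m = [row[:] for row in aug]
    n = len(m)
    for k in range(n):
        if m[k][k] == 0:
            # Swap in a lower row with a nonzero entry in column k.
            for r in range(k + 1, n):
                if m[r][k] != 0:
                    m[k], m[r] = m[r], m[k]
                    break
            else:
                raise ValueError("matrix is singular")
        for i in range(k + 1, n):
            pivot, a_ik = m[k][k], m[i][k]
            m[i] = [pivot * a_ij - a_ik * a_kj
                    for a_ij, a_kj in zip(m[i], m[k])]
    return m


def back_substitute(m):
    """Solve the upper-triangular system with exact rational arithmetic."""
    n = len(m)
    x = [Fraction(0)] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = Fraction(m[i][n] - s, m[i][i])
    return x
```

The row updates for a given pivot are independent of one another, which is the property a parallel implementation with identical processors can exploit. Note that entries grow with each elimination step, so fixed-width hardware would need scaling that this sketch does not model.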

258

Converting thread-level parallelism to instruction-level parallelism via simultaneous multithreading  

Microsoft Academic Search

To achieve high performance, contemporary computer systems rely on two forms of parallelism: instruction-level parallelism (ILP) and thread-level parallelism (TLP). Wide-issue super-scalar processors exploit ILP by executing multiple instructions from a single program in a single cycle. Multiprocessors (MP) exploit TLP by executing different threads in parallel on different processors. Unfortunately, both parallel processing styles statically partition processor resources, thus

Jack L. Lo; Joel S. Emer; Henry M. Levy; Rebecca L. Stamm; Dean M. Tullsen; S. J. Eggers

1997-01-01

259

Algorithmically specialized parallel computers  

SciTech Connect

This book is based on a workshop which dealt with array processors. Topics considered include algorithmic specialization using VLSI, innovative architectures, signal processing, speech recognition, image processing, specialized architectures for numerical computations, and general-purpose computers.

Snyder, L.; Jamieson, L.H.; Gannon, D.B.; Siegel, H.J.

1985-01-01

260

Parallel Optimisation  

NSDL National Science Digital Library

An introduction to optimisation techniques that may improve parallel performance and scaling on HECToR. It assumes that the reader has some experience of parallel programming, including basic MPI and OpenMP. Scaling is a measure of the ability of a parallel code to use increasing numbers of cores efficiently. A scalable application is one that, when the number of processors is increased, performs better by a factor which justifies the additional resource employed. Making a parallel application scale to many thousands of processes requires careful attention not only to the communication, data and work distribution but also to the choice of algorithms. Since the choice of algorithm is too broad a subject, and too particular to the application domain, to include in this brief guide, we concentrate on general good practices for parallel optimisation on HECToR.
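
The notion of scaling in the record above is usually quantified with speedup and parallel efficiency. The sketch below uses the standard definitions; the function names and the example timings are hypothetical and are not measurements from HECToR.

```python
def speedup(t_base, t_parallel):
    """How many times faster the larger run is than the baseline run."""
    return t_base / t_parallel


def parallel_efficiency(t_base, p_base, t_parallel, p_parallel):
    """Fraction of ideal scaling achieved when going from p_base to p_parallel cores.

    1.0 means the run time dropped in exact proportion to the added cores.
    """
    return (t_base * p_base) / (t_parallel * p_parallel)
```

For instance, a run that takes 100 s on 1 core and 28 s on 4 cores has a speedup of about 3.6 and an efficiency of about 0.89; whether that justifies the extra resource is the judgment the abstract refers to.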

261

Magnetic arrays  

DOEpatents

Electromagnet arrays which can provide selected field patterns in either two or three dimensions, and in particular, which can provide single-sided field patterns in two or three dimensions. These features are achieved by providing arrays which have current densities that vary in the windings both parallel to the array and in the direction of array thickness.

Trumper, David L. (Plaistow, NH); Kim, Won-jong (Cambridge, MA); Williams, Mark E. (Pelham, NH)

1997-05-20

262

Magnetic arrays  

DOEpatents

Electromagnet arrays are disclosed which can provide selected field patterns in either two or three dimensions, and in particular, which can provide single-sided field patterns in two or three dimensions. These features are achieved by providing arrays which have current densities that vary in the windings both parallel to the array and in the direction of array thickness. 12 figs.

Trumper, D.L.; Kim, W.; Williams, M.E.

1997-05-20

263

Opto-electronic morphological processor  

NASA Technical Reports Server (NTRS)

The opto-electronic morphological processor of the present invention is capable of receiving optical inputs and emitting optical outputs. The use of optics allows implementation of parallel input/output, thereby overcoming a major bottleneck in prior art image processing systems. The processor consists of three components, namely, detectors, morphological operators and modulators. The detectors and operators are fabricated on a silicon VLSI chip and implement the optical input and morphological operations. A layer of ferro-electric liquid crystals is integrated with a silicon chip to provide the optical modulation. The implementation of the image processing operators in electronics leads to a wide range of applications and the use of optical connections allows cascadability of these parallel opto-electronic image processing components and high speed operation. Such an opto-electronic morphological processor may be used as the pre-processing stage in an image recognition system. In one example disclosed herein, the optical input/optical output morphological processor of the invention is interfaced with a binary phase-only correlator to produce an image recognition system.

Yu, Jeffrey W. (Inventor); Chao, Tien-Hsin (Inventor); Cheng, Li J. (Inventor); Psaltis, Demetri (Inventor)

1993-01-01

264

Programmable pipelined image processor  

NASA Technical Reports Server (NTRS)

A pipelined image processor selectively interconnects modules in a column of a two-dimensional array to modules of the next column of the array of modules 1,1 through M,N, where M is the number of modules in one dimension and N is the number of modules in the other direction. Each module includes two input selectors for A and B inputs, two convolvers, a binary function operator, a neighborhood comparison operator which produces an A output and an output selector which may select as a B output the output of any one of the components in the module, including the A output of the neighborhood comparison operator. Each module may be connected to as many as eight modules in the next column, preferably with the majority always in a different row that is up (or down) in the array for a generally spiral data path around the torus thus formed. The binary function operator is implemented as a look-up table addressed by the most significant 8 bits of each 12-bit argument. The table output includes a function value and the slopes for interpolation of the two arguments by multiplying the 4 least significant bits in multipliers and adding the products to the function value through adders.

Gennery, Donald B. (inventor); Wilcox, Brian (inventor)

1988-01-01
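
The binary function operator in the record above (a look-up table addressed by the top 8 bits of each 12-bit argument, with stored slopes applied to the low 4 bits) can be mimicked in software as a rough sketch. The table layout, the forward-difference slopes, and the function names below are assumptions for illustration; the patent's actual table contents and number formats are not given in the abstract.

```python
def build_table(f):
    """Tabulate f on the 256 x 256 grid of most-significant-byte values.

    Each entry holds the function value at the grid point and one slope per
    argument, estimated here by a forward difference over one grid step (16).
    """
    table = {}
    for hi_a in range(256):
        for hi_b in range(256):
            a, b = hi_a << 4, hi_b << 4
            value = f(a, b)
            slope_a = (f(a + 16, b) - value) / 16
            slope_b = (f(a, b + 16) - value) / 16
            table[(hi_a, hi_b)] = (value, slope_a, slope_b)
    return table


def lookup(table, a, b):
    """Approximate f(a, b) for 12-bit a and b.

    The top 8 bits of each argument address the table; the low 4 bits are
    multiplied by the stored slopes and the products are added to the value.
    """
    value, slope_a, slope_b = table[(a >> 4, b >> 4)]
    return value + slope_a * (a & 0xF) + slope_b * (b & 0xF)
```

For a function that is linear in both arguments the interpolation is exact; for a curved function such as a product, the result differs from the true value by a small term that depends on the low 4 bits of both arguments.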

265

Molecular simulation of cooperative hydrodynamic effects in motion of a periodic array of spheres between parallel walls.  

PubMed

We use molecular dynamics simulations to investigate the cooperative hydrodynamic interactions involved in the collective translation of a periodic array of spheres in a fluid which is confined between two atomistic surfaces. In particular, we study a spherical particle that is moving with a constant velocity parallel to the two confining surfaces. This central sphere along with its periodic images forms the translating two dimensional periodic grid. The cooperative hydrodynamic effects between neighboring spheres in the grid are determined by monitoring the friction force experienced by the spheres that are moving through an atomistic solvent. The dependence of the hydrodynamic cooperativity on the grid spacing is quantified by running simulations in systems with different sizes of the periodic box. Our results show clear evidence of hydrodynamic cooperation between the spherical particles for grid spacing of 90σ and larger, where σ is the solvent molecular diameter. These cooperative interactions lead to a reduced value of the friction force experienced by these spheres as opposed to the case for a single sphere moving in an infinite quiescent fluid. The simulated friction force values are compared with the recent continuum mechanics predictions [Bhattacharya, J. Chem. Phys. 128, 074709 (2008)] for the same problem of the motion of a periodic grid of particles through a confined fluid. The simulated values of friction force were found to follow the same qualitative trend as the continuum results but the continuum predictions were consistently larger than the simulation results by approximately 22%. We attribute this difference to the fluid slip at the surface of the spherical particle, as measured in the simulations. PMID:19045297

Kohale, Swapnil C; Khare, Rajesh

2008-10-28

266

Online track processor for the CDF upgrade  

SciTech Connect

A trigger track processor, called the eXtremely Fast Tracker (XFT), has been designed for the CDF upgrade. This processor identifies high transverse momentum (> 1.5 GeV/c) charged particles in the new central outer tracking chamber for CDF II. The XFT design is highly parallel to handle the input rate of 183 Gbits/s and output rate of 44 Gbits/s. The processor is pipelined and reports the result for a new event every 132 ns. The processor uses three stages: hit classification, segment finding, and segment linking. The pattern recognition algorithms for the three stages are implemented in programmable logic devices (PLDs) which allow in-situ modification of the algorithm at any time. The PLDs reside on three different types of modules. The complete system has been installed and commissioned at CDF II. An overview of the track processor and performance in CDF Run II are presented.

E. J. Thomson et al.

2002-07-17

267

Periodic parallel array of nanopillars and nanoholes resulting from colloidal stripes patterned by geometrically confined evaporative self-assembly for unique anisotropic wetting.  

PubMed

In this paper we present an economical process to create anisotropic microtextures based on periodic parallel stripes of monolayer silica nanoparticles (NPs) patterned by geometrically confined evaporative self-assembly (GCESA). In the GCESA process, a straight meniscus of a colloidal dispersion is initially formed in an open enclosure, which is composed of two parallel plates bounded by a U-shaped spacer sidewall on three sides with an evaporating outlet on the fourth side. Lateral evaporation of the colloidal dispersion leads to periodic "stick-slip" receding of the meniscus (evaporative front), as triggered by the "coffee-ring" effect, promoting the assembly of silica NPs into periodic parallel stripes. The morphology of the stripes can be well controlled by tailoring process variables such as substrate wettability, NP concentration, temperature, and gap height. Furthermore, arrayed patterns of nanopillars or nanoholes are generated on a silicon wafer using the as-prepared colloidal stripes as an etching mask or template. Such arrayed patterns can reveal unique anisotropic wetting properties, which have a large contact angle hysteresis viewed from both the parallel and perpendicular directions in addition to a large wetting anisotropy. PMID:25353399

Li, Xiangmeng; Wang, Chunhui; Shao, Jinyou; Ding, Yucheng; Tian, Hongmiao; Li, Xiangming; Wang, Li

2014-11-26

268

Proceedings of the 1983 international conference on parallel processing  

SciTech Connect

The following topics were dealt with: the performance of existing supercomputers on computationally intensive tasks; multistage networks; numerical algorithms; network connection capabilities; special purpose systems; node-to-node networks; nonnumerical algorithms; tree structured systems; parallel programming and languages; images and speech; expressing parallelism; database machines and signal processing; data flow; simulation and operating systems; models; scheduling resources; system performance; VLSI processor arrays; computer architectures; associative processing and distributed systems; multiprocessor systems; and pipelining. 97 papers were presented, all of which are published in full in the present proceedings. Abstracts of individual papers can be found under the relevant classification codes in this or other issues.

Siegel, H.J.; Siegel, L.

1983-01-01

269

Parallelizing Monte Carlo with PMC  

SciTech Connect

PMC (Parallel Monte Carlo) is a system of generic interface routines that allows easy porting of Monte Carlo packages of large-scale physics simulation codes to Massively Parallel Processor (MPP) computers. By loading various versions of PMC, simulation code developers can configure their codes to run in several modes: serial, Monte Carlo runs on the same processor as the rest of the code; parallel, Monte Carlo runs in parallel across many processors of the MPP with the rest of the code running on other MPP processor(s); distributed, Monte Carlo runs in parallel across many processors of the MPP with the rest of the code running on a different machine. This multi-mode approach allows maintenance of a single simulation code source regardless of the target machine. PMC handles passing of messages between nodes on the MPP, passing of messages between a different machine and the MPP, distributing work between nodes, and providing independent, reproducible sequences of random numbers. Several production codes have been parallelized under the PMC system. Excellent parallel efficiency in both the distributed and parallel modes results if sufficient workload is available per processor. Experiences with a Monte Carlo photonics demonstration code and a Monte Carlo neutronics package are described.

Rathkopf, J.A.; Jones, T.R.; Nessett, D.M.; Stanberry, L.C.

1994-11-01

270

Generic implementations of parallel prefix sums and its applications  

E-print Network

synchronization as the number of processors increases. As part of the applications for parallel prefix sums, parallel radix sort and four parallel tree applications are built on top of the implementation. These applications are also fundamental parallel algorithms...
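As a hedged illustration of the parallel prefix sum mentioned in this excerpt, the sketch below simulates a step-doubling (Hillis-Steele style) inclusive scan in plain Python. It is not the thesis's generic implementation; the function name `inclusive_scan` is hypothetical, and the inner loop stands in for work that separate processors would do concurrently between synchronization points.

```python
def inclusive_scan(values):
    """Return the inclusive prefix sums of values using step doubling."""
    result = list(values)
    n = len(result)
    offset = 1
    while offset < n:
        # Snapshot of the previous step: every update below reads only old
        # values, so on a parallel machine all positions could be updated at
        # once, followed by one synchronization per doubling step.
        previous = result[:]
        for i in range(offset, n):
            result[i] = previous[i] + previous[i - offset]
        offset *= 2
    return result
```

The scan finishes after about log2(n) doubling steps, which is why synchronization cost becomes the concern as the processor count grows.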

Huang, Tao

2009-05-15

271

MULTI-CORE AND OPTICAL PROCESSOR RELATED APPLICATIONS RESEARCH AT OAK RIDGE NATIONAL LABORATORY  

SciTech Connect

High-speed parallelization of common tasks holds great promise as a low-risk approach to achieving the significant increases in signal processing and computational performance required for next generation innovations in reconfigurable radio systems. Researchers at the Oak Ridge National Laboratory have been working on exploiting the parallelization offered by this emerging technology and applying it to a variety of problems. This paper will highlight recent experience with four different parallel processors applied to signal processing tasks that are directly relevant to signal processing required for SDR/CR waveforms. The first is the EnLight Optical Core Processor applied to matched filter (MF) correlation processing via fast Fourier transform (FFT) of broadband Doppler-sensitive waveforms (DSW) using active sonar arrays for target tracking. The second is the IBM CELL Broadband Engine applied to a 2-D discrete Fourier transform (DFT) kernel for image processing and frequency domain processing. The third is the NVIDIA graphical processor applied to document feature clustering. EnLight Optical Core Processor. Optical processing is inherently capable of high parallelism that can be translated to very high performance, low power dissipation computing. The EnLight 256 is a small form factor signal processing chip (5x5 cm^2) with a digital optical core that is being developed by an Israeli startup company. As part of its evaluation of foreign technology, ORNL's Center for Engineering Science Advanced Research (CESAR) had access to precursor EnLight 64 Alpha hardware for a preliminary assessment of capabilities in terms of large Fourier transforms for matched filter banks and on applications related to Doppler-sensitive waveforms. This processor is optimized for array operations, which it performs in fixed-point arithmetic at the rate of 16 TeraOPS at 8-bit precision. This is approximately 1000 times faster than the fastest DSP available today. 
The optical core performs the matrix-vector multiplications, where the nominal matrix size is 256x256. The system clock is 125 MHz. At each clock cycle, 128K multiply-and-add operations (OPS) are carried out, which yields a peak performance of 16 TeraOPS. IBM Cell Broadband Engine. The Cell processor is the product of 5 years of sustained, intensive R&D collaboration (involving over $400M investment) between IBM, Sony, and Toshiba. Its architecture comprises one multithreaded 64-bit PowerPC processor element (PPE) with VMX capabilities and two levels of globally coherent cache, and 8 synergistic processor elements (SPEs). Each SPE consists of a processor (SPU) designed for streaming workloads, local memory, and a globally coherent direct memory access (DMA) engine. Computations are performed in 128-bit wide single instruction multiple data streams (SIMD). An integrated high-bandwidth element interconnect bus (EIB) connects the nine processors and their ports to external memory and to system I/O. The Applied Software Engineering Research (ASER) Group at ORNL is applying the Cell to a variety of text and image analysis applications. Research on Cell-equipped PlayStation3 (PS3) consoles has led to the development of a correlation-based image recognition engine that enables a single PS3 to process images at more than 10X the speed of state-of-the-art single-core processors. NVIDIA Graphics Processing Units. The ASER group is also employing the latest NVIDIA graphical processing units (GPUs) to accelerate clustering of thousands of text documents using recently developed clustering algorithms such as document flocking and affinity propagation.
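The peak-performance figure quoted for the optical core can be checked with a short worked calculation. The sketch assumes that one 256x256 matrix-vector product completes per clock cycle and that each multiply-and-add is counted as two operations; that counting convention is an interpretation, not something the abstract states explicitly.

```python
# Worked arithmetic under the stated assumptions (not vendor data).
MATRIX_DIM = 256          # nominal matrix size from the abstract
CLOCK_HZ = 125e6          # 125 MHz system clock from the abstract

macs_per_cycle = MATRIX_DIM * MATRIX_DIM      # 65,536 multiply-and-adds
ops_per_cycle = 2 * macs_per_cycle            # 131,072 = 128K operations
peak_ops_per_second = ops_per_cycle * CLOCK_HZ

print(f"{ops_per_cycle // 1024}K operations per cycle")
print(f"{peak_ops_per_second / 1e12:.1f} TeraOPS peak")
```

Under these assumptions the product is about 1.64e13 operations per second, consistent with the quoted 16 TeraOPS.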

Barhen, Jacob [ORNL; Kerekes, Ryan A [ORNL; ST Charles, Jesse Lee [ORNL; Buckner, Mark A [ORNL

2008-01-01

272

High performance parallel computers for science: New developments at the Fermilab advanced computer program  

SciTech Connect

Fermilab's Advanced Computer Program (ACP) has been developing highly cost effective, yet practical, parallel computers for high energy physics since 1984. The ACP's latest developments are proceeding in two directions. A Second Generation ACP Multiprocessor System for experiments will include $3500 RISC processors each with performance over 15 VAX MIPS. To support such high performance, the new system allows parallel I/O, parallel interprocess communication, and parallel host processes. The ACP Multi-Array Processor has been developed for theoretical physics. Each $4000 node is a FORTRAN or C programmable pipelined 20 MFlops (peak), 10 MByte single board computer. These are plugged into a 16 port crossbar switch crate which handles both inter and intra crate communication. The crates are connected in a hypercube. Site oriented applications like lattice gauge theory are supported by system software called CANOPY, which makes the hardware virtually transparent to users. A 256 node, 5 GFlop, system is under construction. 10 refs., 7 figs.

Nash, T.; Areti, H.; Atac, R.; Biel, J.; Cook, A.; Deppe, J.; Edel, M.; Fischler, M.; Gaines, I.; Hance, R.

1988-08-01

273

A 0.8-µm CMOS two-dimensional programmable mixed-signal focal-plane array processor with on-chip binary imaging and instructions storage  

Microsoft Academic Search

This paper presents a CMOS chip for the parallel acquisition and concurrent analog processing of two-dimensional (2-D) binary images. Its processing function is determined by a reduced set of 19 analog coefficients whose values are programmable with 7-b accuracy. The internal programming signals are analog, but the external control interface is fully digital. On-chip nonlinear digital-to-analog converters (DAC's) map digitally

R. Dominguez-Castro; S. Espejo; A. Rodriguez-Vazquez; R. A. Carmona; P. Foldesy; A. Zarandy; P. Szolgay; T. Sziranyi; T. Roska

1997-01-01

274

Efficient design space exploration of high performance embedded out-of-order processors  

Microsoft Academic Search

Previous work on efficient customized processor design primarily focused on in-order architectures. However, with the recent introduction of out-of-order processors for high-end high-performance embedded applications, researchers and designers need to address how to automate the design process of customized out-of-order processors. Because of the parallel execution of independent instructions in out-of-order processors, in-order processor design methodologies which

Stijn Eyerman; Lieven Eeckhout; Koen De Bosschere

2006-01-01

275

Parallel programming interface for distributed data  

NASA Astrophysics Data System (ADS)

The Parallel Programming Interface for Distributed Data (PPIDD) library provides an interface, suitable for use in parallel scientific applications, that delivers communications and global data management. The library can be built either using the Global Arrays (GA) toolkit, or a standard MPI-2 library. This abstraction allows the programmer to write portable parallel codes that can utilise the best, or only, communications library that is available on a particular computing platform. Program summary. Program title: PPIDD. Catalogue identifier: AEEF_v1_0. Program summary URL: http://cpc.cs.qub.ac.uk/summaries/AEEF_1_0.html. Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland. Licensing provisions: Standard CPC licence, http://cpc.cs.qub.ac.uk/licence/licence.html. No. of lines in distributed program, including test data, etc.: 17 698. No. of bytes in distributed program, including test data, etc.: 166 173. Distribution format: tar.gz. Programming language: Fortran, C. Computer: Many parallel systems. Operating system: Various. Has the code been vectorised or parallelized?: Yes. 2-256 processors used. RAM: 50 Mbytes. Classification: 6.5. External routines: Global Arrays or MPI-2. Nature of problem: Many scientific applications require management and communication of data that is global, and the standard MPI-2 protocol provides only low-level methods for the required one-sided remote memory access. Solution method: The Parallel Programming Interface for Distributed Data (PPIDD) library provides an interface, suitable for use in parallel scientific applications, that delivers communications and global data management. The library can be built either using the Global Arrays (GA) toolkit, or a standard MPI-2 library. This abstraction allows the programmer to write portable parallel codes that can utilise the best, or only, communications library that is available on a particular computing platform. Running time: Problem dependent. 
The test provided with the distribution takes only a few seconds to run.

Wang, Manhui; May, Andrew J.; Knowles, Peter J.

2009-12-01

276

SUDS : automatic parallelization for raw processors  

E-print Network

A computer can never be too fast or too cheap. Computer systems pervade nearly every aspect of science, engineering, communications and commerce because they perform certain tasks at rates unachievable by any other kind ...

Frank, Matthew I

2003-01-01

277

Lateral Flow through a Parallel Gap Driven by Surface Hydrophilicity and Liquid Edge Pinning for Creating Microlens Array.  

PubMed

This letter proposes a surface-energy driven process for economically creating a polymer microlens array (MLA) with well controllable curvatures. When a UV-curable prepolymer flows into a cell constructed from a top template with multiple holes and a flat substrate, edge pinning of the contact line causes a curved air/prepolymer interface to form around each microhole of the template. UV irradiation of the bulk prepolymer then yields a solid microlens array. The curvature of the air/prepolymer interface can be controlled by choosing materials with different interface free energy or by varying the gap height mechanically. PMID:25348103

Jiang, Chengbao; Li, Xiangming; Tian, Hongmiao; Wang, Chunhui; Shao, Jinyou; Ding, Yucheng; Wang, Li

2014-11-12

278

Parallel algorithms for mapping pipelined and parallel computations  

NASA Technical Reports Server (NTRS)

Many computational problems in image processing, signal processing, and scientific computing are naturally structured for either pipelined or parallel computation. When mapping such problems onto a parallel architecture it is often necessary to aggregate an obvious problem decomposition. Even in this context the general mapping problem is known to be computationally intractable, but recent advances have been made in identifying classes of problems and architectures for which optimal solutions can be found in polynomial time. Among these, the mapping of pipelined or parallel computations onto linear array, shared memory, and host-satellite systems figures prominently. This paper extends that work first by showing how to improve existing serial mapping algorithms. These improvements have significantly lower time and space complexities: in one case a published O(nm^3) time algorithm for mapping m modules onto n processors is reduced to an O(nm log m) time complexity, and its space requirements reduced from O(nm^2) to O(m). Run time complexity is further reduced with parallel mapping algorithms based on these improvements, which run on the architecture for which they create the mappings.
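To make the mapping problem concrete, here is a minimal baseline sketch for one common formulation: a chain of m modules with known execution weights is split into contiguous groups on n processors so that the most heavily loaded processor (the pipeline bottleneck) is as light as possible. This straightforward O(nm^2) dynamic program is an illustration only; it is not the improved algorithm of the paper, communication costs are ignored, and the name `min_bottleneck` is hypothetical.

```python
from itertools import accumulate


def min_bottleneck(weights, n_processors):
    """Smallest achievable maximum load over contiguous chain partitions."""
    m = len(weights)
    prefix = [0] + list(accumulate(weights))   # prefix[j] = sum of first j weights
    inf = float("inf")
    # best[k][j]: minimal bottleneck when the first j modules use k processors.
    best = [[inf] * (m + 1) for _ in range(n_processors + 1)]
    best[0][0] = 0
    for k in range(1, n_processors + 1):
        for j in range(m + 1):
            for i in range(j + 1):
                # Processor k receives modules i..j-1 (possibly none).
                load = prefix[j] - prefix[i]
                candidate = max(best[k - 1][i], load)
                if candidate < best[k][j]:
                    best[k][j] = candidate
    return best[n_processors][m]
```

For example, weights `[2, 3, 4, 5, 6]` on two processors split best as `[2, 3, 4]` and `[5, 6]`, giving a bottleneck of 11.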

Nicol, David M.

1988-01-01

279

Optimal processor assignment for pipeline computations  

NASA Technical Reports Server (NTRS)

The availability of large scale multitasked parallel architectures introduces the following processor assignment problem for pipelined computations. Given a set of tasks and their precedence constraints, along with their experimentally determined individual response times for different processor sizes, find an assignment of processors to tasks. Two objectives are of interest: minimal response given a throughput requirement, and maximal throughput given a response time requirement. These assignment problems differ considerably from the classical mapping problem in which several tasks share a processor; instead, it is assumed that a large number of processors are to be assigned to a relatively small number of tasks. Efficient assignment algorithms were developed for different classes of task structures. For a p processor system and a series parallel precedence graph with n constituent tasks, an O(np^2) algorithm is provided that finds the optimal assignment for the response time optimization problem; the assignment optimizing the constrained throughput is found in O(np^2 log p) time. Special cases of linear, independent, and tree graphs are also considered.
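The response-time objective can be sketched for the simplest special case, a linear chain of tasks. The sketch assumes that the response time of the chain is the sum of the per-task times and that the throughput requirement translates into an upper bound on every task's time; both modeling choices, and the name `min_response_time`, are assumptions made for illustration rather than the paper's formulation for series-parallel graphs.

```python
def min_response_time(times, p, max_stage_time):
    """Minimal total response time for a chain of tasks on at most p processors.

    times[i][k - 1] is the measured time of task i when given k processors.
    Every task must finish within max_stage_time (the throughput bound).
    Returns float('inf') when no feasible assignment exists.
    """
    inf = float("inf")
    best = [inf] * (p + 1)     # best[q]: minimal total time using exactly q processors
    best[0] = 0.0
    for task_times in times:
        nxt = [inf] * (p + 1)
        for used in range(p + 1):
            if best[used] == inf:
                continue
            limit = min(len(task_times), p - used)
            for k in range(1, limit + 1):
                t = task_times[k - 1]
                if t <= max_stage_time and best[used] + t < nxt[used + k]:
                    nxt[used + k] = best[used] + t
        best = nxt
    return min(best)
```

With n tasks the two nested loops over processor counts give O(np^2) work, which matches the order quoted in the abstract for the response-time problem, although the algorithm there handles a more general graph class.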

Nicol, David M.; Simha, Rahul; Choudhury, Alok N.; Narahari, Bhagirath

1991-01-01

280

Graph-Based Dynamic Assignment Of Multiple Processors  

NASA Technical Reports Server (NTRS)

Algorithm-to-architecture mapping model (ATAMM) is strategy minimizing time needed to periodically execute graphically described, data-driven application algorithm on multiple data processors. Implemented as operating system managing flow of data and dynamically assigning nodes of graph to processors. Predicts throughput versus number of processors available to execute given application algorithm. Includes rules ensuring application algorithm represented by graph executed periodically without deadlock and in shortest possible repetition time. ATAMM proves useful in maximizing effectiveness of parallel computing systems.

Hayes, Paul J.; Andrews, Asa M.

1994-01-01

281

Architecture and data processing alternatives for the TSE computer. Volume 3: Execution of a parallel counting algorithm using array logic (Tse) devices  

NASA Technical Reports Server (NTRS)

A parallel algorithm for counting the number of logic-1 elements in a binary array or image developed during preliminary investigation of the Tse concept is described. The counting algorithm is implemented using a basic combinational structure. Modifications which improve the efficiency of the basic structure are also presented. A programmable Tse computer structure is proposed, along with a hardware control unit, Tse instruction set, and software program for execution of the counting algorithm. Finally, a comparison is made between the different structures in terms of their more important characteristics.
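The general idea of counting set elements in parallel can be shown with a pairwise tree reduction. This is a generic sketch, not the Tse array-logic algorithm from the report; the function name `count_ones` is hypothetical, and each list comprehension pass stands in for one level of adders operating simultaneously.

```python
def count_ones(image):
    """Count logic-1 elements of a 2-D binary image by pairwise reduction.

    Returns (count, levels), where levels is the number of reduction passes.
    """
    counts = [int(bit) for row in image for bit in row]
    levels = 0
    while len(counts) > 1:
        if len(counts) % 2:
            counts.append(0)   # pad so every element has a partner
        # One pass: all pairs could be summed at the same time in hardware.
        counts = [counts[i] + counts[i + 1] for i in range(0, len(counts), 2)]
        levels += 1
    return (counts[0] if counts else 0), levels
```

A 3x3 image with five set elements needs four passes, since 9 inputs shrink to 5, 3, 2, and finally 1 partial sum.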

Metcalfe, A. G.; Bodenheimer, R. E.

1976-01-01

282

Transitive closure on the imagine stream processor  

SciTech Connect

The increasing gap between processor and memory speeds is a well-known problem in modern computer architecture. The Imagine system is designed to address the processor-memory gap through streaming technology. Stream processors are best-suited for computationally intensive applications characterized by high data parallelism and producer-consumer locality with minimal data dependencies. This work examines an efficient streaming implementation of the computationally intensive Transitive Closure (TC) algorithm on the Imagine platform. We develop a tiled TC algorithm specifically for the Imagine environment, which efficiently reuses streams to minimize expensive off-chip data transfers. The implementation requires complex stream programming since the memory hierarchy and cluster organization of the underlying architecture are exposed to the Imagine programmer. Results demonstrate that limited performance of TC is achieved primarily due to the complicated data-dependencies of the blocked algorithm. This work is an ongoing effort to identify classes of scientific problems well-suited for streaming processors.
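For reference, the computation being accelerated can be written as the classic Warshall algorithm. The sketch below is the plain, untiled form on a boolean adjacency matrix; the tiled, stream-oriented variant developed for Imagine is not reproduced here, and the function name `transitive_closure` is an illustrative choice.

```python
def transitive_closure(adj):
    """Return the reachability matrix of a directed graph (Warshall).

    adj[i][j] is True when there is an edge from vertex i to vertex j.
    The input matrix is left unmodified.
    """
    n = len(adj)
    reach = [list(row) for row in adj]
    for k in range(n):
        for i in range(n):
            if reach[i][k]:
                # Anything reachable from k is also reachable from i via k.
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    return reach
```

The outer loop over `k` carries a dependence from one iteration to the next, which gives a sense of the data dependencies the abstract identifies as the main performance limiter for the blocked version.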

Griem, Gorden; Oliker, Leonid

2003-11-11

283

Broadcasting collective operation contributions throughout a parallel computer  

DOEpatents

Methods, systems, and products are disclosed for broadcasting collective operation contributions throughout a parallel computer. The parallel computer includes a plurality of compute nodes connected together through a data communications network. Each compute node has a plurality of processors for use in collective parallel operations on the parallel computer. Broadcasting collective operation contributions throughout a parallel computer according to embodiments of the present invention includes: transmitting, by each processor on each compute node, that processor's collective operation contribution to the other processors on that compute node using intra-node communications; and transmitting on a designated network link, by each processor on each compute node according to a serial processor transmission sequence, that processor's collective operation contribution to the other processors on the other compute nodes using inter-node communications.

Faraj, Ahmad (Rochester, MN)

2012-02-21

284

Issues and challenges in compiling for graphics processors  

Microsoft Academic Search

Graphics has been one of the best success stories of parallel processing. Using a unique combination of specialized hardware and a specialized programming model, game developers routinely write high performance code using millions of threads. Each generation of graphics processors (GPUs) delivers higher performance and is more programmable than the last. Unlike CPUs, these processors are designed from the beginning to

Norm Rubin

2008-01-01

285

Optimistic parallelism requires abstractions  

Microsoft Academic Search

The problem of writing software for multicore processors would be greatly simplified if we could automatically parallelize sequential programs. Although auto-parallelization has been studied for many decades, it has succeeded only in a few application areas such as dense matrix computations. In particular, auto-parallelization of irregular programs, which are organized around large, pointer-based data structures like graphs, has seemed intractable.

Milind Kulkarni; Keshav Pingali; Bruce Walter; Ganesh Ramanarayanan; Kavita Bala; L. Paul Chew

2007-01-01

286

Parallel asynchronous hardware implementation of image processing algorithms  

NASA Technical Reports Server (NTRS)

Research is being carried out on hardware for a new approach to focal plane processing. The hardware involves silicon injection mode devices. These devices provide a natural basis for parallel asynchronous focal plane image preprocessing. The simplicity and novel properties of the devices would permit an independent analog processing channel to be dedicated to every pixel. A laminar architecture built from arrays of the devices would form a two-dimensional (2-D) array processor with a 2-D array of inputs located directly behind a focal plane detector array. A 2-D image data stream would propagate in neuron-like asynchronous pulse-coded form through the laminar processor. No multiplexing, digitization, or serial processing would occur in the preprocessing state. High performance is expected, based on pulse coding of input currents down to one picoampere with noise referred to input of about 10 femtoamperes. Linear pulse coding has been observed for input currents ranging up to seven orders of magnitude. Low power requirements suggest utility in space and in conjunction with very large arrays. Very low dark current and multispectral capability are possible because of hardware compatibility with the cryogenic environment of high performance detector arrays. The aforementioned hardware development effort is aimed at systems which would integrate image acquisition and image processing.

Coon, Darryl D.; Perera, A. G. U.

1990-01-01

287

Doppler-free, multiwavelength acousto-optic deflector for two-photon addressing arrays of Rb atoms in a quantum information processor.  

PubMed

We demonstrate a dual wavelength acousto-optic deflector (AOD) designed to deflect two wavelengths to the same angles by driving with two RF frequencies. The AOD is designed as a beam scanner to address two-photon transitions in a two-dimensional array of trapped neutral Rb87 atoms in a quantum computer. Momentum space is used to design AODs that have the same diffraction angles for two wavelengths (780 and 480 nm) and have nonoverlapping Bragg-matched frequency response at these wavelengths, so that there will be no cross talk when proportional frequencies are applied to diffract the two wavelengths. The appropriate crystal orientation, crystal shape, transducer size, and transducer height are determined for an AOD made with a tellurium dioxide crystal (TeO2). The designed and fabricated AOD has more than 100 resolvable spots, widely separated band shapes for the two wavelengths within an overall octave bandwidth, spatially overlapping diffraction angles for both wavelengths (780 and 480 nm), and a 4 µs or less access time. Cascaded AODs in which the first device upshifts and the second downshifts allow Doppler-free scanning as required for addressing the narrow atomic resonance without detuning. We experimentally show the diffraction-limited Doppler-free scanning performance and spatial resolution of the designed AOD. PMID:18404181

Kim, Sangtaek; Mcleod, Robert R; Saffman, M; Wagner, Kelvin H

2008-04-10

288

Doppler-free, Multi-wavelength Acousto-optic deflector for two-photon addressing arrays of Rb atoms in a Quantum Information Processor  

E-print Network

We demonstrate a dual wavelength acousto-optic deflector (AOD) designed to deflect two wavelengths to the same angles by driving with two RF frequencies. The AOD is designed as a beam scanner to address two-photon transitions in a two-dimensional array of trapped neutral Rb atoms in a quantum computer. Momentum space is used to design AODs that have the same diffraction angles for two wavelengths (780 nm and 480 nm) and have non-overlapping Bragg-matched frequency response at these wavelengths, so that there will be no crosstalk when proportional RF frequencies are applied to diffract the two wavelengths. The appropriate crystal orientation, crystal shape, transducer size, and transducer height are determined for an AOD made with a Tellurium dioxide crystal (TeO2). The designed and fabricated AOD has more than 100 resolvable spots, widely separated bandshapes for the two wavelengths within an overall octave bandwidth, spatially overlapping diffraction angles for both wavelengths (780 nm and 480 nm), and a 4 usec or less access time. Cascaded AODs in which the first device upshifts and the second downshifts allow Doppler-free scanning as required for addressing the narrow atomic resonance without detuning. We experimentally show the diffraction-limited Doppler-free scanning performance and spatial resolution of the designed AOD.

Sangtaek Kim; Robert R. Mcleod; Mark Saffman; Kelvin H. Wagner

2007-11-21

289

Scioto: A Framework for Global-View Task Parallelism  

SciTech Connect

We introduce Scioto, Shared Collections of Task Objects, a framework for supporting task-parallelism in one-sided and global-view parallel programming models. Scioto provides lightweight, locality aware dynamic load balancing and interoperates with existing parallel models including MPI, SHMEM, CAF, and Global Arrays. Through task parallelism, the Scioto framework provides a solution for overcoming load imbalance and heterogeneity as well as dynamic mapping of computation onto emerging multicore architectures. In this paper, we present the design and implementation of the Scioto framework and demonstrate its effectiveness on the Unbalanced Tree Search (UTS) benchmark and two quantum chemistry codes: the closed shell Self-Consistent Field (SCF) method and a sparse tensor contraction kernel extracted from a coupled cluster computation. We explore the efficiency and scalability of Scioto through these sample applications and demonstrate that it offers low overhead, achieves good performance on heterogeneous and multicore clusters, and scales to hundreds of processors.

Dinan, James S.; Krishnamoorthy, Sriram; Larkins, D. B.; Nieplocha, Jaroslaw; Sadayappan, Ponnuswamy

2008-09-09

290

GPU Microarchitecture (Schedule Booklet Title: GRAPHICS PROCESSORS)  

E-print Network

EE 7700-1 GPU Microarchitecture (Schedule Booklet Title: GRAPHICS PROCESSORS) Where/When 218 Computer Graphics Transformations, projections, lighting, textures, etc. · GPU/GPGPU APIs: OpenGL, Direct3D Programmable GPU Designs Exploitation of control simplicity and abundant parallelism. GPU/CPU communication

Koppelman, David M.

291

Graphics Processor Based Implementation of Bioinformatics Codes  

Microsoft Academic Search

We created a powerful computing platform based on video cards with the goal of accelerating the performance of bioinformatics codes. To satisfy the demands of the video gaming industry, modern graphics processing units (GPUs) have become very advanced computational devices, using a large set of stream processors to render multiple pixels in parallel. Recently, computer scientists have taken interest in

Andrew Bellenir; Christian Trefftz; Greg Wolffe

2008-01-01

292

Asynchronous Migration in Parallel Genetic Programming  

E-print Network

Asynchronous Migration in Parallel Genetic Programming Shisanu Tongchim and Prabhas Chongstitvatana, every subpopulation was migrated between processors using a fully connected topology. The parallel MPI as a message passing library. In the first stage of the implementation, the migration

Fernandez, Thomas

293

High performance parallel algorithms for incompressible flows  

E-print Network

Object-Oriented design for the algorithms and its parallel implementation in multi-threading and multi-processing environments is presented. Inexpensive parallel matrix-vector products using bounded buffers for inter-processor communication are suggested...

Sambavaram, Sreekanth Reddy

2012-06-07

294

Virtual Reality and Parallel Systems Performance Analysis  

Microsoft Academic Search

Recording and analyzing the dynamics of application program, system software, and hardware interactions are the keys to understanding and tuning the performance of massively parallel systems. Because massively parallel systems contain hundreds or thousands of processors, each potentially with many dynamic performance metrics, the performance data occupy a sparsely populated, high-dimensional space. These dynamic performance metrics for each processor define

Daniel A. Reed; Keith A. Shields; Will H. Scullin; Luis F. Tawera; Christopher L. Elford

1995-01-01

295

A powerful and flexible co-processor for feature extraction in a robot vision system  

Microsoft Academic Search

Concepts, implementation, and evaluation of a novel processor for feature extraction are described. The processor is a freely programmable RISC (reduced instruction set computer) machine with a modified Harvard architecture, designed to operate as a coprocessor in combination with each parallel processor of the multiprocessor robot vision system BVV 3. It performs 10^7 complex operations per second, each of which

Volker Graefe; Karlheinz Fleder

1991-01-01

296

Parallel MATLAB: Parallel For Loops  

E-print Network

FSU: Florida State University; AOE: Department of Aerospace and Ocean Engineering; ARC: Advanced Research Computing; ICAM: Interdisciplinary Center for Applied Mathematics ... MATLAB Parallel ... are completely independent; there are also some restrictions on array-data access. OpenMP implements a directive

Crawford, T. Daniel

297

Switch for serial or parallel communication networks  

DOEpatents

A communication switch apparatus and a method for use in a geographically extensive serial, parallel or hybrid communication network linking a multi-processor or parallel processing system has a very low software processing overhead in order to accommodate random bursts of high density data. Associated with each processor is a communication switch. A data source and a data destination, a sensor suite or robot for example, may also be associated with a switch. The configuration of the switches in the network is coordinated through a master processor node and depends on the operational phase of the multi-processor network: data acquisition, data processing, and data exchange. The master processor node passes information on the state to be assumed by each switch to the processor node associated with the switch. The processor node then operates a series of multi-state switches internal to each communication switch. The communication switch does not parse and interpret communication protocol and message routing information. During a data acquisition phase, the communication switch couples sensors producing data to the processor node associated with the switch, to a downlink destination on the communications network, or to both. It also may couple an uplink data source to its processor node. During the data exchange phase, the switch couples its processor node or an uplink data source to a downlink destination (which may include a processor node or a robot), or couples an uplink source to its processor node and its processor node to a downlink destination. 9 figs.

Crosette, D.B.

1994-07-19

298

Switch for serial or parallel communication networks  

DOEpatents

A communication switch apparatus and a method for use in a geographically extensive serial, parallel or hybrid communication network linking a multi-processor or parallel processing system has a very low software processing overhead in order to accommodate random bursts of high density data. Associated with each processor is a communication switch. A data source and a data destination, a sensor suite or robot for example, may also be associated with a switch. The configuration of the switches in the network is coordinated through a master processor node and depends on the operational phase of the multi-processor network: data acquisition, data processing, and data exchange. The master processor node passes information on the state to be assumed by each switch to the processor node associated with the switch. The processor node then operates a series of multi-state switches internal to each communication switch. The communication switch does not parse and interpret communication protocol and message routing information. During a data acquisition phase, the communication switch couples sensors producing data to the processor node associated with the switch, to a downlink destination on the communications network, or to both. It also may couple an uplink data source to its processor node. During the data exchange phase, the switch couples its processor node or an uplink data source to a downlink destination (which may include a processor node or a robot), or couples an uplink source to its processor node and its processor node to a downlink destination.

Crosette, Dario B. (DeSoto, TX)

1994-01-01

299

Model-driven mapping onto distributed memory parallel computers  

NASA Technical Reports Server (NTRS)

The author addresses the problem of exploiting the parallelism available in a program to efficiently employ the resources of the target machine in the context of building a mapping compiler for a distributed memory parallel machine. He demonstrates the effectiveness of using execution models to select the best mapping technique from among those available for a given program segment on a particular machine. Through analysis of the execution models for several mapping techniques for one class of programs on a linear processor array, it is shown that selecting the best technique for a particular program instance can make a significant difference in performance. On the other hand, the results of benchmarks from a mapping compiler for the Warp systolic array machine show that the execution models considered are accurate enough to select the best mapping technique for a given program.

Sussman, Alan

1992-01-01

300

Parallel Earley's parser and its application to syntactic image analysis  

SciTech Connect

A complete Earley parser which includes recognition and parse extraction has been implemented on a triangular array of processors. The detailed analysis of the complete parser is given. The recognition algorithm is executed in parallel by adopting a new operator, x*, and restricting the input context-free grammar to be lambda-free. The parse extraction algorithm which follows recognition uses a nonrecursive subroutine to generate the correct right-parse in parallel. A special busing arrangement within this array enables the right data to reach the right place at the right time. Simulation examples are provided. The results show that when a string of length n is under testing, at the system time 2n + 1, the correct right-parse will be obtained if the string is accepted. 15 references.

Chiang, Y.P.; Fu, K.S.

1983-01-01
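
As a point of reference for the recognition step named in the abstract above, here is a minimal sequential Earley recognizer in Python. It is only a sketch: it is not the triangular-array parallel algorithm of the record, it does no parse extraction, and the grammar encoding (a dict from nonterminal to tuples of symbols) and the name `earley_recognize` are assumptions made for illustration. Like the record, it assumes a lambda-free grammar.

```python
def earley_recognize(grammar, start, tokens):
    """Return True if `tokens` can be derived from `start`.

    grammar: dict mapping each nonterminal to a list of right-hand sides,
             each a non-empty tuple of symbols (no empty rules).
    An item is (lhs, rhs, dot, origin).
    """
    n = len(tokens)
    chart = [set() for _ in range(n + 1)]
    chart[0] = {(start, rhs, 0, 0) for rhs in grammar[start]}
    for i in range(n + 1):
        agenda = list(chart[i])
        while agenda:
            lhs, rhs, dot, origin = agenda.pop()
            if dot < len(rhs):
                sym = rhs[dot]
                if sym in grammar:                      # predictor
                    new, target = [(sym, r, 0, i) for r in grammar[sym]], i
                elif i < n and tokens[i] == sym:        # scanner
                    new, target = [(lhs, rhs, dot + 1, origin)], i + 1
                else:
                    continue
            else:                                       # completer
                new = [(l, r, d + 1, o) for (l, r, d, o) in chart[origin]
                       if d < len(r) and r[d] == lhs]
                target = i
            for item in new:
                if item not in chart[target]:
                    chart[target].add(item)
                    if target == i:                     # still in this chart set
                        agenda.append(item)
    return any(lhs == start and dot == len(rhs) and origin == 0
               for lhs, rhs, dot, origin in chart[n])
```

With a small expression grammar such as `{"S": [("S", "+", "M"), ("M",)], "M": [("M", "*", "T"), ("T",)], "T": [("n",)]}`, the call `earley_recognize(g, "S", "n + n * n".split())` accepts the string, while `"n + * n"` is rejected.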

301

SPROC: A multiple-processor DSP IC  

NASA Technical Reports Server (NTRS)

A large, single-chip, multiple-processor, digital signal processing (DSP) integrated circuit (IC) fabricated in HP-Cmos34 is presented. The innovative architecture is best suited for analog and real-time systems characterized by both parallel signal data flows and concurrent logic processing. The IC is supported by a powerful development system that transforms graphical signal flow graphs into production-ready systems in minutes. Automatic compiler partitioning of tasks among four on-chip processors gives the IC the signal processing power of several conventional DSP chips.

Davis, R.

1991-01-01

302

Speculative multithreaded processors  

Microsoft Academic Search

In this paper we present a novel processor microarchitecture that relieves four of the most important bottlenecks of superscalar processors: the serialization imposed by true dependences, the instruction window size, the complexity of a wide issue machine and the instruction fetch bandwidth requirements. The new microarchitecture executes simultaneously multiple threads of control obtained from a single program by means of

Pedro Marcuello; Antonio González; Jordi Tubella

1998-01-01

303

Combining Task and Data Parallelism to Speed up Protein Folding on a Desktop Grid Platform Is efficient protein folding possible with CHARMM on the United Devices MetaProcessor?  

Microsoft Academic Search

The steady increase of computing power at lower and lower cost enables molecular dynamics simulations to investigate the process of protein folding with an explicit treatment of water molecules. Such simulations are typically done with well known computational chemistry codes like CHARMM. Desktop grids such as the United Devices MetaProcessor are highly attractive platforms, since scavenging for unused machines on

M. Taufer; T. Stricker; G. Settanni; A. Cavalli; A. Caflisch

304

Allergen arrays for antibody screening and immune cell activation profiling generated by parallel lipid dip-pen nanolithography.  

PubMed

Multiple-allergen testing for high throughput and high sensitivity requires the development of miniaturized immunoassays that allow for a large test area and require only a small volume of the test analyte, which is often available only in limited amounts. Developing such miniaturized biochips containing arrays of test allergens requires a technique able to deposit molecules at high resolution and speed while preserving their functionality. Lipid dip-pen nanolithography (L-DPN) is an ideal technique to create such biologically active surfaces, and it has already been successfully applied for the direct, nanoscale deposition of functional proteins, as well as for the fabrication of biochemical templates for selective adsorption. The work presented here shows the application of L-DPN for the generation of arrays of the ligand 2,4-dinitrophenyl[1,2-dipalmitoyl-sn-glycero-3-phosphoethanolamine-N-[6-[(2,4-dinitrophenyl)amino]hexanoyl] (DNP)] onto glass surfaces as a model system for detection of allergen-specific Immunoglobulin E (IgE) antibodies and for mast cell activation profiling. PMID:22278752

Sekula-Neuner, Sylwia; Maier, Jana; Oppong, Emmanuel; Cato, Andrew C B; Hirtz, Michael; Fuchs, Harald

2012-02-20

305

Processing techniques for software based SAR processors  

NASA Technical Reports Server (NTRS)

Software SAR processing techniques defined to treat Shuttle Imaging Radar-B (SIR-B) data are reviewed. The algorithms are devised for the data processing procedure selection, SAR correlation function implementation, multiple array processors utilization, cornerturning, variable reference length azimuth processing, and range migration handling. The Interim Digital Processor (IDP) originally implemented for handling Seasat SAR data has been adapted for the SIR-B, and offers a resolution of 100 km using a processing procedure based on the Fast Fourier Transformation fast correlation approach. Peculiarities of the Seasat SAR data processing requirements are reviewed, along with modifications introduced for the SIR-B. An Advanced Digital SAR Processor (ADSP) is under development for use with the SIR-B in the 1986 time frame as an upgrade for the IDP, which will be in service in 1984-5.

Leung, K.; Wu, C.

1983-01-01

306

Development of a prototype PET scanner with depth-of-interaction measurement using solid-state photomultiplier arrays and parallel readout electronics  

NASA Astrophysics Data System (ADS)

In this study, we developed a prototype animal PET by applying several novel technologies to use solid-state photomultiplier (SSPM) arrays to measure the depth of interaction (DOI) and improve imaging performance. Each PET detector has an 8 × 8 array of about 1.9 × 1.9 × 30.0 mm^3 lutetium-yttrium-oxyorthosilicate scintillators, with each end optically connected to an SSPM array (16 channels in a 4 × 4 matrix) through a light guide to enable continuous DOI measurement. Each SSPM has an active area of about 3 × 3 mm^2, and its output is read by a custom-developed application-specific integrated circuit to directly convert analogue signals to digital timing pulses that encode the interaction information. These pulses are transferred to and are decoded by a field-programmable gate array-based time-to-digital converter for coincident event selection and data acquisition. The independent readout of each SSPM and the parallel signal process can significantly improve the signal-to-noise ratio and enable the use of flexible algorithms for different data processes. The prototype PET consists of two rotating detector panels on a portable gantry with four detectors in each panel to provide 16 mm axial and variable transaxial field-of-view (FOV) sizes. List-mode ordered subset expectation maximization image reconstruction was implemented. The measured mean energy, coincidence timing and DOI resolution for a crystal were about 17.6%, 2.8 ns and 5.6 mm, respectively. The measured transaxial resolutions at the center of the FOV were 2.0 mm and 2.3 mm for images reconstructed with and without DOI, respectively. In addition, the resolutions across the FOV with DOI were substantially better than those without DOI. The quality of PET images of both a hot-rod phantom and mouse acquired with DOI was much higher than that of images obtained without DOI. 
This study demonstrates that SSPM arrays and advanced readout/processing electronics can be used to develop a practical DOI-measurable PET scanner.

Shao, Yiping; Sun, Xishan; Lan, Kejian A.; Bircher, Chad; Lou, Kai; Deng, Zhi

2014-03-01

307

Rapid, Single-Molecule Assays in Nano/Micro-Fluidic Chips with Arrays of Closely Spaced Parallel Channels Fabricated by Femtosecond Laser Machining  

PubMed Central

Cost-effective pharmaceutical drug discovery depends on increasing assay throughput while reducing reagent needs. To this end, we are developing an ultrasensitive, fluorescence-based platform that incorporates a nano/micro-fluidic chip with an array of closely spaced channels for parallelized optical readout of single-molecule assays. Here we describe the use of direct femtosecond laser machining to fabricate several hundred closely spaced channels on the surfaces of fused silica substrates. The channels are sealed by bonding to a microscope cover slip spin-coated with a thin film of poly(dimethylsiloxane). Single-molecule detection experiments are conducted using a custom-built, wide-field microscope. The array of channels is epi-illuminated by a line-generating red diode laser, resulting in a line focus just a few microns thick across a 500 micron field of view. A dilute aqueous solution of fluorescently labeled biomolecules is loaded into the device and fluorescence is detected with an electron-multiplying CCD camera, allowing acquisition rates up to 7 kHz for each microchannel. Matched digital filtering based on experimental parameters is used to perform an initial, rapid assessment of detected fluorescence. More detailed analysis is obtained through fluorescence correlation spectroscopy. Simulated fluorescence data is shown to agree well with experimental values. PMID:25140634

Canfield, Brian K.; King, Jason K.; Robinson, William N.; Hofmeister, William H.; Davis, Lloyd M.

2014-01-01

308

A Post-Wall Center-Feed Waveguide Circuit Consisting of T-Junctions for Reducing the Slot-Free Area in a Parallel Plate Slot Array Antenna  

NASA Astrophysics Data System (ADS)

A post-wall center-feed waveguide consisting of T-junctions is proposed for reducing the slot-free area of a parallel plate slot array antenna. The width of the slot-free area is reduced from 2.6 λ0 to 2.1 λ0. The sidelobe level in the E-plane is expected to be suppressed below that of the conventional center-feed antenna using cross-junctions. The T-junctions are initially designed by the method of moments with solid-wall replacement, and only the reflection-cancelling post is then modified using HFSS with the post surfaces included. We have designed and fabricated a 61.25 GHz model antenna with uniform aperture illumination. The sidelobe level in the E-plane is suppressed to -9.5 dB, while that of a conventional cross-junction type is -7.8 dB. We also suppress it to -13.8 dB by introducing a -8.3 dB amplitude-tapered distribution in the array of the radiation slot pairs.

Hashimoto, Koh; Hirokawa, Jiro; Ando, Makoto

309

Control structures for high speed processors  

NASA Technical Reports Server (NTRS)

A special processor was designed to function as a Reed Solomon decoder with throughput data rate in the MHz range. This data rate is significantly greater than is possible with conventional digital architectures. To achieve this rate, the processor design includes sequential, pipelined, distributed, and parallel processing. The processor was designed using a high-level register transfer language (RTL). The RTL can be used to describe how the different processes are implemented by the hardware. One problem of special interest was the development of dependent processes which are analogous to software subroutines. For greater flexibility, the RTL control structure was implemented in ROM. The special purpose hardware required approximately 1000 SSI and MSI components. The data rate throughput is 2.5 megabits/second. This data rate is achieved through the use of pipelined and distributed processing. This data rate can be compared with 800 kilobits/second in a recently proposed very large scale integration design of a Reed Solomon encoder.

Maki, G. K.; Mankin, R.; Owsley, P. A.; Kim, G. M.

1982-01-01

310

Fabrication and evaluation of a micro(bio)sensor array chip for multiple parallel measurements of important cell biomarkers.  

PubMed

This report describes the design and development of an integrated electrochemical cell culture monitoring system, based on enzyme-biosensors and chemical sensors, for monitoring indicators of mammalian cell metabolic status. MEMS technology was used to fabricate a microwell-format silicon platform including a thermometer, onto which chemical sensors (pH, O2) and screen-printed biosensors (glucose, lactate), were grafted/deposited. Microwells were formed over the fabricated sensors to give 5-well sensor strips which were interfaced with a multipotentiostat via a bespoke connector box interface. The operation of each sensor/biosensor type was examined individually, and examples of operating devices in five microwells in parallel, in either potentiometric (pH sensing) or amperometric (glucose biosensing) mode are shown. The performance characteristics of the sensors/biosensors indicate that the system could readily be applied to cell culture/toxicity studies. PMID:25360580

Pemberton, Roy M; Cox, Timothy; Tuffin, Rachel; Drago, Guido A; Griffiths, John; Pittson, Robin; Johnson, Graham; Xu, Jinsheng; Sage, Ian C; Davies, Rhodri; Jackson, Simon K; Kenna, Gerry; Luxton, Richard; Hart, John P

2014-01-01

311

Dipole monolayers: a model for elementary information processors  

Microsoft Academic Search

A 2-D array of electric dipoles contacted by suitable electrodes is shown to offer paramount properties necessary for intelligent processing of information. On the basis of a simplified mechanism that models the interactions among the dipoles and between the dipoles and the external electrodes, the possibility of transforming the 2-D dipole array into an elementary molecular processor is explored

S. Cincotti; M. Storace; M. Parodi; A. Chiabrera

1996-01-01

312

Parallel image compression  

NASA Technical Reports Server (NTRS)

A parallel compression algorithm for the 16,384 processor MPP machine was developed. The serial version of the algorithm can be viewed as a combination of on-line dynamic lossless text compression techniques (which employ simple learning strategies) and vector quantization. These concepts are described. How these concepts are combined to form a new strategy for performing dynamic on-line lossy compression is discussed. Finally, the implementation of this algorithm in a massively parallel fashion on the MPP is discussed.

Reif, John H.

1987-01-01

313

Measuring Parallelism in Computation-Intensive Scientific/Engineering Applications  

Microsoft Academic Search

Describes COMET (concurrency measurement tool), a software tool for measuring parallelism in large scientific/engineering applications. The proposed tool measures the total parallelism present in programs, filtering out the effects of communication/synchronization delays, finite storage, limited number of processors, the policies for management of processors and storage, etc. Although an ideal machine that can exploit the total parallelism is not realizable,

Manoj Kumar

1988-01-01
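
One common way to quantify the "total parallelism" of a program on an ideal machine, as discussed in the abstract above, is to divide the number of operations by the length of the longest dependence chain. The Python sketch below illustrates that idea only; it is not COMET's actual method (the abstract is truncated), and the unit-time operations, the `deps` dictionary format and the name `average_parallelism` are assumptions made for illustration.

```python
from functools import lru_cache


def average_parallelism(deps):
    """Average parallelism of a dependence graph on an idealized machine.

    deps maps each operation to the list of operations it depends on.
    Every operation is assumed to take one time unit, with unlimited
    processors and no communication or storage limits.
    """
    @lru_cache(maxsize=None)
    def finish_time(op):
        # An operation finishes one step after its latest dependence.
        return 1 + max((finish_time(d) for d in deps[op]), default=0)

    critical_path = max(finish_time(op) for op in deps)
    return len(deps) / critical_path
```

For example, six operations whose longest dependence chain has three steps give an average parallelism of 2.0.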

314

System and method for representing and manipulating three-dimensional objects on massively parallel architectures  

DOEpatents

A parallel computing system is described that comprises a plurality of uniquely labeled, parallel processors, each processor capable of modeling a three-dimensional object that includes a plurality of vertices, faces and edges. The system comprises a front-end processor for issuing a modeling command to the parallel processors, relating to a three-dimensional object. Each parallel processor, in response to the command and through the use of its own unique label, creates a directed-edge (d-edge) data structure that uniquely relates an edge of the three-dimensional object to one face of the object. Each d-edge data structure at least includes vertex descriptions of the edge and a description of the one face. As a result, each processor, in response to the modeling command, operates upon a small component of the model and generates results, in parallel with all other processors, without the need for processor-to-processor intercommunication. 8 figs.

Karasick, M.S.; Strip, D.R.

1996-01-30
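
The abstract above says each processor uses its own label to build a directed-edge (d-edge) record holding the edge's vertex descriptions and one face. A hypothetical Python sketch of that idea follows. The `DEdge` fields, the label-to-edge numbering and the function `dedge_for_label` are assumptions for illustration and are not taken from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DEdge:
    tail: tuple  # coordinates of the edge's start vertex
    head: tuple  # coordinates of the edge's end vertex
    face: int    # index of the single face this d-edge belongs to


def dedge_for_label(label, vertices, faces):
    """Build the d-edge owned by the processor with the given label.

    faces is a list of vertex-index loops. D-edges are numbered face by
    face (an assumed convention), so a processor needs only its label
    and the shared model description, not data from other processors.
    """
    for face_id, loop in enumerate(faces):
        if label < len(loop):
            a, b = loop[label], loop[(label + 1) % len(loop)]
            return DEdge(vertices[a], vertices[b], face_id)
        label -= len(loop)
    raise IndexError("label exceeds the number of directed edges")
```

For a tetrahedron with four triangular faces, labels 0 to 11 yield twelve d-edges, and each geometric edge appears once in each direction, once per adjacent face.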

315

ALEPH-QP: Universal hybrid quantum processors  

E-print Network

A quantum processor (the programmable gate array) is a quantum network with a fixed structure. The space of states is represented as a tensor product of data and program registers. Different unitary operations on the data register correspond to "loaded" programs, without any changing or "tuning" of the network itself. Because of this property, and because entanglement between the program and data registers is undesirable, the universality of quantum processors is subject to rather strong restrictions. Universal "stochastic" quantum gate arrays have been developed by different authors. It has also been proved that "deterministic" quantum processors with a finite-dimensional space of states may be universal only in an approximate sense. The present paper shows that, using a hybrid system with continuous and discrete quantum variables, it is possible to suggest a design of strictly universal quantum processors. It is also shown that the "deterministic" limit of specific programmable "stochastic" U(1) gates (the probability of success becomes unity for an infinite program register), discussed by other authors, may be essentially the same kind of hybrid quantum system used here.

Alexander Yu. Vlasov

2002-05-13

316

Buffered coscheduling for parallel programming and enhanced fault tolerance  

DOEpatents

A computer implemented method schedules processor jobs on a network of parallel machine processors or distributed system processors. Control information communications generated by each process performed by each processor during a defined time interval are accumulated in buffers, where adjacent time intervals are separated by strobe intervals for a global exchange of control information. A global exchange of the control information communications at the end of each defined time interval is performed during an intervening strobe interval so that each processor is informed by all of the other processors of the number of incoming jobs to be received by each processor in a subsequent time interval. The buffered coscheduling method of this invention also enhances the fault tolerance of a network of parallel machine processors or distributed system processors.

Petrini, Fabrizio (Los Alamos, NM); Feng, Wu-chun (Los Alamos, NM)

2006-01-31
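
The strobe-interval exchange described in the abstract above can be pictured with a small Python sketch: each processor buffers the destinations of the messages it generated during a time interval, and the global exchange tells every processor how many messages to expect in the next interval. This is a single-process simulation under assumed names (`strobe_exchange`, a `buffers` dictionary keyed by processor rank), not the patented implementation.

```python
from collections import Counter


def strobe_exchange(buffers):
    """Simulate the global exchange performed during one strobe interval.

    buffers[src] lists the destination ranks of the messages that
    processor `src` buffered during the preceding time interval.
    Returns a dict giving, for every rank, the number of incoming
    messages it should expect in the next time interval.
    """
    incoming = Counter()
    for destinations in buffers.values():
        incoming.update(destinations)
    return {rank: incoming.get(rank, 0) for rank in buffers}
```

For instance, if rank 0 buffered messages for ranks 1, 2 and 1, rank 1 buffered one message for rank 0, and rank 2 buffered nothing, the exchange reports one incoming message for rank 0, two for rank 1 and one for rank 2.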

317

Hypercluster - Parallel processing for computational mechanics  

NASA Technical Reports Server (NTRS)

An account is given of the development status, performance capabilities and implications for further development of NASA-Lewis' testbed 'hypercluster' parallel computer network, in which multiple processors communicate through a shared memory. Processors have local as well as shared memory; the hypercluster is expanded in the same manner as the hypercube, with processor clusters replacing the normal single processor node. The NASA-Lewis machine has three nodes with a vector personality and one node with a scalar personality. Each of the vector nodes uses four board-level vector processors, while the scalar node uses four general-purpose microcomputer boards.

Blech, Richard A.

1988-01-01

318

The Dragon Processor  

Microsoft Academic Search

The Xerox PARC Dragon is a VLSI research computer that uses several techniques to achieve dense code and fast procedure calls in a system that can support multiple processors on a central high bandwidth memory bus.

Russell R. Atkinson; Edward M. McCreight

1987-01-01

319

3081/E processor  

SciTech Connect

The 3081/E project was formed to prepare a much improved IBM mainframe emulator for the future. Its design is based on a large amount of experience in using the 168/E processor to increase available CPU power in both online and offline environments. The processor will be at least equal to the execution speed of a 370/168 and up to 1.5 times faster for heavy floating point code. A single processor will thus be at least four times more powerful than the VAX 11/780, and five processors on a system would equal at least the performance of the IBM 3081K. With its large memory space and simple but flexible high speed interface, the 3081/E is well suited for the online and offline needs of high energy physics in the future.

Kunz, P.F.; Gravina, M.; Oxoby, G.; Rankin, P.; Trang, Q.; Ferran, P.M.; Fucci, A.; Hinton, R.; Jacobs, D.; Martin, B.

1984-04-01

320

Approximate programmable quantum processors  

E-print Network

A quantum processor is a programmable quantum circuit in which both the data and the program, which specifies the operation that is carried out on the data, are quantum states. We study the situation in which we want to use such a processor to approximate a set of unitary operators to a specified level of precision. We measure how well an operation is performed by the process fidelity between the desired operation and the operation produced by the processor. We show how to find the program for a given processor that produces the best approximation of a particular unitary operation. We also place bounds on the dimension of the program space that is necessary to approximate a set of unitary operators to a specified level of precision.

Mark Hillery; Mario Ziman; Vladimir Buzek

2005-10-20

321

Liquid sample processor  

NASA Technical Reports Server (NTRS)

Processor is automatic and includes series of extraction tubes packed with fibrous absorbent material of large surface area. When introduced into these tubes, liquid test samples become completely absorbed by packing material as thin film.

Jahnsen, V. J.; Campen, C. F., Jr.

1975-01-01

322

Hybrid photomultiplier tube and photodiode parallel detection array for wideband optical spectroscopy of the breast guided by magnetic resonance imaging  

PubMed Central

A new optical parallel detection system of hybrid frequency and continuous-wave domains was developed to improve the data quality and accuracy in recovery of all breast optical properties. This new system was deployed in a previously existing system for magnetic resonance imaging (MRI)-guided spectroscopy, and allows incorporation of additional near-infrared wavelengths beyond 850 nm, with interlaced channels of photomultiplier tubes (PMTs) and silicon photodiodes (PDs). The acquisition time for obtaining frequency-domain data at six wavelengths (660, 735, 785, 808, 826, and 849 nm) and continuous-wave data at three wavelengths (903, 912, and 948 nm) is 12 min. The dynamic ranges of the detected signal are 10^5 and 10^6 for PMT and PD detectors, respectively. Compared to the previous detection system, the SNR of frequency-domain detection was improved by nearly 10^3 through the addition of an RF amplifier and the utilization of programmable gain. The current system is being utilized in a clinical trial imaging suspected breast cancer tumors as detected by contrast MRI scans. PMID:23979460

El-Ghussein, Fadi; Mastanduno, Michael A.; Jiang, Shudong; Pogue, Brian W.; Paulsen, Keith D.

2013-01-01

323

Hybrid photomultiplier tube and photodiode parallel detection array for wideband optical spectroscopy of the breast guided by magnetic resonance imaging  

NASA Astrophysics Data System (ADS)

A new optical parallel detection system of hybrid frequency and continuous-wave domains was developed to improve the data quality and accuracy in recovery of all breast optical properties. This new system was deployed in a previously existing system for magnetic resonance imaging (MRI)-guided spectroscopy, and allows incorporation of additional near-infrared wavelengths beyond 850 nm, with interlaced channels of photomultiplier tubes (PMTs) and silicon photodiodes (PDs). The acquisition time for obtaining frequency-domain data at six wavelengths (660, 735, 785, 808, 826, and 849 nm) and continuous-wave data at three wavelengths (903, 912, and 948 nm) is 12 min. The dynamic ranges of the detected signal are 10^5 and 10^6 for PMT and PD detectors, respectively. Compared to the previous detection system, the SNR of frequency-domain detection was improved by nearly 10^3 through the addition of an RF amplifier and the utilization of programmable gain. The current system is being utilized in a clinical trial imaging suspected breast cancer tumors as detected by contrast MRI scans.

El-Ghussein, Fadi; Mastanduno, Michael A.; Jiang, Shudong; Pogue, Brian W.; Paulsen, Keith D.

2014-01-01

324

Hybrid photomultiplier tube and photodiode parallel detection array for wideband optical spectroscopy of the breast guided by magnetic resonance imaging.  

PubMed

A new optical parallel detection system of hybrid frequency and continuous-wave domains was developed to improve the data quality and accuracy in recovery of all breast optical properties. This new system was deployed in a previously existing system for magnetic resonance imaging (MRI)-guided spectroscopy, and allows incorporation of additional near-infrared wavelengths beyond 850 nm, with interlaced channels of photomultiplier tubes (PMTs) and silicon photodiodes (PDs). The acquisition time for obtaining frequency-domain data at six wavelengths (660, 735, 785, 808, 826, and 849 nm) and continuous-wave data at three wavelengths (903, 912, and 948 nm) is 12 min. The dynamic ranges of the detected signal are 10^5 and 10^6 for PMT and PD detectors, respectively. Compared to the previous detection system, the SNR of frequency-domain detection was improved by nearly 10^3 through the addition of an RF amplifier and the utilization of programmable gain. The current system is being utilized in a clinical trial imaging suspected breast cancer tumors as detected by contrast MRI scans. PMID:23979460

El-Ghussein, Fadi; Mastanduno, Michael A; Jiang, Shudong; Pogue, Brian W; Paulsen, Keith D

2014-01-01

325

Itanium Processor Microarchitecture  

Microsoft Academic Search

The Itanium processor is the first implementation of the IA-64 instruction set architecture (ISA). The design team optimized the processor to meet a wide range of requirements: high performance on Internet servers and workstations, support for 64-bit addressing, reliability for mission-critical applications, full IA-32 instruction set compatibility in hardware, and scalability across a range of operating systems and platforms. The

Harsh Sharangpani; Ken Arora

2000-01-01

326

Programmable Stream Processors  

Microsoft Academic Search

including 3D graphics, image compression, and signal processing, requires tens to hundreds of billions of computations per second. To achieve these computation rates, current media processors use special-purpose architectures tailored to one specific application. Such processors require significant design effort and are thus difficult to change as media-processing applications and algorithms evolve. The demand for flexibility in media processing motivates

Ujval J. Kapasi; Scott Rixner; William J. Dally; Brucek Khailany; Jung Ho Ahn; Peter R. Mattson; John D. Owens

2003-01-01

327

Generating local addresses and communication sets for data-parallel programs  

NASA Technical Reports Server (NTRS)

Generating local addresses and communication sets is an important issue in distributed-memory implementations of data-parallel languages such as High Performance Fortran. We show that for an array A affinely aligned to a template that is distributed across p processors with a cyclic(k) distribution, and a computation involving the regular section A, the local memory access sequence for any processor is characterized by a finite state machine of at most k states. We present fast algorithms for computing the essential information about these state machines, and extend the framework to handle multidimensional arrays. We also show how to generate communication sets using the state machine approach. Performance results show that this solution requires very little runtime overhead and acceptable preprocessing time.

Chatterjee, Siddhartha; Gilbert, John R.; Long, Fred J. E.; Schreiber, Robert; Teng, Shang-Hua

1993-01-01
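
The cyclic(k) layout described in record 327 can be illustrated with a brute-force sketch. It assumes the usual block-cyclic mapping, in which global index i lives on processor (i div k) mod p at local address (i div pk)*k + (i mod k), and an identity alignment of the array to the template. The function names are hypothetical, and the enumeration below is not the authors' algorithm, which builds the state machine without visiting every element.

```python
def owner(i, p, k):
    """Processor that owns global index i under cyclic(k) over p processors."""
    return (i // k) % p

def local_address(i, p, k):
    """Local memory offset of global index i on its owning processor."""
    return (i // (p * k)) * k + i % k

def local_sequence(m, p, k, lo, hi, stride):
    """Local addresses touched by processor m for the section A(lo:hi:stride)."""
    return [local_address(i, p, k)
            for i in range(lo, hi + 1, stride)
            if owner(i, p, k) == m]

def transitions(m, p, k, lo, hi, stride):
    """Set of (offset within block, gap to next local address) pairs.

    The offset within the block plays the role of the state; the gap is
    the memory stride taken when leaving that state.
    """
    seq = local_sequence(m, p, k, lo, hi, stride)
    return {(a % k, b - a) for a, b in zip(seq, seq[1:])}

if __name__ == "__main__":
    # Example parameters chosen only for illustration.
    table = transitions(m=1, p=4, k=8, lo=0, hi=5000, stride=9)
    for offset, gap in sorted(table):
        print(f"offset {offset}: next local address is +{gap}")
```

In this sketch each block offset maps to a single gap, so the table has at most k entries, which is consistent with the bound of k states stated in the abstract.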

328

Accelerator compiler for the VENICE vector processor  

Microsoft Academic Search

This paper describes the compiler design for VENICE, a new soft vector processor (SVP). The compiler is a new back-end target for Microsoft Accelerator, a high-level data parallel library for C++ and C#. This allows us to automatically compile high-level programs into VENICE assembly code, thus avoiding the process of writing assembly code used by previous SVPs. Experimental results show

Zhiduo Liu; Aaron Severance; Satnam Singh; Guy G. F. Lemieux

2012-01-01

329

Niagara: A 32-Way Multithreaded Sparc Processor  

Microsoft Academic Search

The Niagara processor implements a thread-rich architecture designed to provide a high-performance solution for commercial server applications. This is an entirely new implementation of the Sparc V9 architectural specification, which exploits large amounts of on-chip parallelism to provide high throughput. The hardware supports 32 threads with a memory subsystem consisting of an on-board crossbar, level-2 cache, and memory controllers for

Poonacha Kongetira; Kathirgamar Aingaran; Kunle Olukotun

2005-01-01

330

Hybrid optical/electronic processor study  

Microsoft Academic Search

Itek Corporation has performed a three-part study that addresses the use of hybrid optical/electronic processing techniques for high range resolution imaging radar systems. The first part of the study was an investigation of the processing requirements of the HR-cubed system. Parametric evaluations were performed that related processor capabilities such as number of parallel processing channels to the basic system operating

B. A. Horwitz

1978-01-01

331

National Resource for Computation in Chemistry (NRCC). Attached scientific processors for chemical computations: a report to the chemistry community  

SciTech Connect

The demands of chemists for computational resources are well known and have been amply documented. The best and most cost-effective means of providing these resources is still open to discussion, however. This report surveys the field of attached scientific processors (array processors) and attempts to indicate their present and possible future use in computational chemistry. Array processors have the possibility of providing very cost-effective computation. This report attempts to provide information that will assist chemists who might be considering the use of an array processor for their computations. It describes the general ideas and concepts involved in using array processors, the commercial products that are available, and the experiences reported by those currently using them. In surveying the field of array processors, the author makes certain recommendations regarding their use in computational chemistry. 5 figures, 1 table (RWR)

Ostlund, N.S.

1980-01-01

332

Energy optimization of multi-level processor cache architectures  

Microsoft Academic Search

To optimize performance and power of a processor’s cache, a multiple-divided module (MDM) cache architecture is proposed to save power at memory peripherals as well as the bit array. For a MxB-divided MDM cache, latency is equivalent to that of the smallest module and power consumption is only 1/MxB of the regular, non-divided cache. Based on the architecture and given

Uming Ko; Poras T. Balsara; Ashwini K. Nanda

1995-01-01

333

Configurable Multi-Purpose Processor  

NASA Technical Reports Server (NTRS)

Advancements in technology have allowed the miniaturization of systems used in aerospace vehicles. This technology is driven by the need for next-generation systems that provide reliable, responsive, and cost-effective range operations while providing increased capabilities such as simultaneous mission support, increased launch trajectories, and improved launch and landing opportunities. Leveraging the newest technologies, the command and telemetry processor (CTP) concept provides for a compact, flexible, and integrated solution for flight command and telemetry systems and range systems. The CTP is a relatively small circuit board that serves as a processing platform for high dynamic, high vibration environments. The CTP can be reconfigured and reprogrammed, allowing it to be adapted for many different applications. The design is centered around a configurable field-programmable gate array (FPGA) device that contains numerous logic cells that can be used to implement traditional integrated circuits. The FPGA contains two PowerPC processors that run the VxWorks real-time operating system and are used to execute software programs specific to each application. The CTP was designed and developed specifically to provide telemetry functions; namely, the command processing, telemetry processing, and GPS metric tracking of a flight vehicle. However, it can be used as a general-purpose processor board to perform numerous functions implemented in either hardware or software using the FPGA's processors and/or logic cells. Functionally, the CTP was designed for range safety applications where it would ultimately become part of a vehicle's flight termination system. Consequently, the major functions of the CTP are forward link command processing, GPS metric tracking, return link telemetry data processing, error detection and correction, data encryption/decryption, and initiation of flight termination action commands. 
Also, the CTP had to be designed to survive and operate in a launch environment. Additionally, the CTP was designed to interface with the WFF (Wallops Flight Facility) custom-designed transceiver board, which is used in the Low Cost TDRSS Transceiver (LCT2) also developed by WFF. The LCT2's transceiver board demodulates commands received from the ground via the forward link and sends them to the CTP, where they are processed. The CTP inputs and processes data from the inertial measurement unit (IMU) and the GPS receiver board, generates status data, and then sends the data to the transceiver board where it is modulated and sent to the ground via the return link. Overall, the CTP has combined processing with the ability to interface to a GPS receiver, an IMU, and a pulse code modulation (PCM) communication link, while providing the capability to support common interfaces including Ethernet and serial interfaces in a relatively small, lightweight package.

Valencia, J. Emilio; Forney, Chirstopher; Morrison, Robert; Birr, Richard

2010-01-01

334

Issue Mechanism for Embedded Simultaneous Multithreading Processor  

NASA Astrophysics Data System (ADS)

Simultaneous Multithreading (SMT) technology enhances instruction throughput by issuing multiple instructions from multiple threads within one clock cycle. With an in-order pipeline for each thread, SMT processors can issue a number of instructions close to, or exceeding, that of an out-of-order pipeline. In this work, we show an efficient issue logic for predicated instruction sequences with a parallel flag in each instruction, where predicate-register-based issue control is adopted and consecutive instructions with a parallel flag of ‘0’ are executed in parallel. The flag is pre-defined by a compiler. Instructions from different threads are issued in round-robin order. We also introduce an Instruction Queue skip mechanism that skips a thread if its queue is empty. Using this kind of issue logic, we designed a 6-thread, 7-stage, in-order pipeline processor. Based on this processor, we compare the round-robin issue policy (RR(T1-Tn)) with other policies: thread one always has the highest priority (PR(T1)), and thread one or thread n has the highest priority in turn (PR(T1-Tn)). The results show that the RR(T1-Tn) policy outperforms the others and that PR(T1-Tn) is almost the same as RR(T1-Tn) in terms of issued instructions per cycle.

Zang, Chengjie; Imai, Shigeki; Frank, Steven; Kimura, Shinji
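
The round-robin issue order with queue skipping described in record 334 might be sketched as follows. This is a simplified model under stated assumptions: at most one instruction is taken from each thread per cycle, the predicate registers and parallel flags are ignored, and the function name and the (thread, instruction) return format are hypothetical rather than taken from the paper.

```python
from collections import deque

def round_robin_issue(queues, start, width):
    """Pick up to `width` instructions for one cycle.

    Threads are visited in round-robin order beginning at `start`;
    a thread whose instruction queue is empty is skipped.
    Returns (issued, next_start), where issued is a list of
    (thread index, instruction) pairs.
    """
    n = len(queues)
    issued = []
    for step in range(n):
        if len(issued) == width:
            break
        t = (start + step) % n
        if queues[t]:  # skip threads with an empty queue
            issued.append((t, queues[t].popleft()))
    # Rotate the starting thread so that priority moves on each cycle.
    return issued, (start + 1) % n

if __name__ == "__main__":
    queues = [deque(["a0", "a1"]), deque(), deque(["c0"])]
    start = 0
    for cycle in range(3):
        issued, start = round_robin_issue(queues, start, width=2)
        print(cycle, issued)
```

A fixed-priority policy such as PR(T1) would correspond, in this sketch, to always passing start=0 instead of rotating it.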

335

Design and analysis of real-time wavefront processor  

Microsoft Academic Search

The latency of the wavefront processor is an important factor in closed-loop adaptive optical systems. For an adaptive optical system using Shack-Hartmann wavefront sensing and a point beam, a multi-processor structure based on modern parallelism theory is built, by means of task queues, subtask arithmetic decomposition, and subtask structure design, to realize a pipeline of wavefront gradient, wavefront reconstruction and wavefront control.

Luchun Zhou; Chunhong Wang; Mei Li; Wenhan Jiang

2004-01-01

336

Offline symbolic analysis for multi-processor execution replay  

Microsoft Academic Search

The ability to replay a program's execution on a multi-processor system can significantly help parallel programming. To replay a shared-memory multi-threaded program, existing solutions record the program input (I/O, DMA, etc.) and the shared-memory dependencies between threads. Prior processor-based record-and-replay solutions are efficient, but they require non-trivial modifications to the coherency protocol and the memory sub-system for recording the

Dongyoon Lee; Mahmoud Said; Satish Narayanasamy; Zijiang Yang; Cristiano Pereira

2009-01-01

337

IBM Power5 Chip: A Dual-Core Multithreaded Processor  

Microsoft Academic Search

IBM introduced Power4-based systems in 2001. The Power4 design integrates two processor cores on a single chip, a shared second-level cache, a directory for an off-chip third-level cache, and the necessary circuitry to connect it to other Power4 chips to form a system. The dual-processor chip provides natural thread-level parallelism at the chip level. The Power5 is the next-generation chip

Ronald N. Kalla; Balaram Sinharoy; Joel M. Tendler

2004-01-01

338

Flosolver: A parallel computer for fluid dynamics  

NASA Astrophysics Data System (ADS)

The Flosolver parallel computer designed and built at NAL for fluid dynamics problem solving is described. The computer has two nodes, each having four processors based on Intel 8086-8087 chips. In each node one of the processors acts as host and has access to a section of the private memory of the remaining three processors through the Multibus. Inter-node communication is done using parallel ports. Synchronization and inter-processor communication are done by passing messages through a global memory. Prior to execution the host processor loads the absolute codes from a disk to the respective processors. Several fluid dynamical problems of practical interest have been programmed on Flosolver using concurrent algorithms, and the results show that it is comparable in speed with mainframes available in the country.

Sinha, U. N.; Deshpande, M. D.; Sarasamma, V. R.

1988-06-01

339

Data-Parallel Programming on the Cell BE and the GPU using the RapidMind Development Platform  

Microsoft Academic Search

The Cell BE processor is capable of achieving very high levels of performance via parallel computation. The processors in video accelerators, known as GPUs, are also high-performance parallel processors. The RapidMind Development Platform provides a simple data-parallel model of execution that is easy to understand and learn, is usable from any ISO standard C++ program without any

Michael D. Mccool

340

Reversible CAM Processor Modeled After Quantum Computer Behavior  

E-print Network

Proposed below is a reversible digital computer modeled after the natural behavior of a quantum system. Using approaches usually reserved for idealized quantum computers, the Reversible CAM, or State Vector Parallel (RSVP) processor can easily find keywords in an unstructured database (that is, it can solve a needle in a haystack problem). The RSVP processor efficiently solves a SAT (Satisfiability of Boolean Formulae) problem; also it can aid in the solution of a GP (Global Properties of Truth Table) problem. The power delay product of the RSVP processor is exponentially lower than that of a standard CAM programmed to perform similar operations.

John Robert Burger

2005-12-28

341

Is Monte Carlo embarrassingly parallel?  

SciTech Connect

Monte Carlo is often stated as being embarrassingly parallel. However, running a Monte Carlo calculation, especially a reactor criticality calculation, in parallel using tens of processors shows a serious limitation in speedup and the execution time may even increase beyond a certain number of processors. In this paper the main causes of the loss of efficiency when using many processors are analyzed using a simple Monte Carlo program for criticality. The basic mechanism for parallel execution is MPI. One of the bottlenecks turns out to be the rendez-vous points in the parallel calculation used for synchronization and exchange of data between processors. This happens at least at the end of each cycle for fission source generation in order to collect the full fission source distribution for the next cycle and to estimate the effective multiplication factor, which is not only part of the requested results, but also input to the next cycle for population control. Basic improvements to overcome this limitation are suggested and tested. Also other time losses in the parallel calculation are identified. Moreover, the threading mechanism, which allows the parallel execution of tasks based on shared memory using OpenMP, is analyzed in detail. Recommendations are given to get the maximum efficiency out of a parallel Monte Carlo calculation. (authors)

Hoogenboom, J. E. [Delft Univ. of Technology, Mekelweg 15, 2629 JB Delft (Netherlands); Delft Nuclear Consultancy, IJsselzoom 2, 2902 LB Capelle aan den IJssel (Netherlands)]

2012-07-01
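
The effect reported in record 341, where execution time stops improving and may even grow with more processors, can be illustrated with a toy timing model. The model is an assumption made here for illustration and is not taken from the paper: tracking work divides perfectly over p processors, while the end-of-cycle rendez-vous costs a fixed time per additional processor. All names and parameter values are hypothetical.

```python
def cycle_time(p, histories, t_history, t_sync_per_rank):
    """Modelled wall time of one fission-source cycle on p processors."""
    compute = histories * t_history / p   # perfectly divided tracking work
    sync = t_sync_per_rank * (p - 1)      # master collects from p - 1 ranks
    return compute + sync

def speedup(p, **params):
    """Speedup of p processors relative to one processor in this model."""
    return cycle_time(1, **params) / cycle_time(p, **params)

if __name__ == "__main__":
    # Illustrative numbers only: 1 s of tracking per cycle, 1 ms per rank to sync.
    params = dict(histories=10_000, t_history=1e-4, t_sync_per_rank=1e-3)
    for p in (1, 2, 4, 8, 16, 32, 64, 128):
        print(f"p={p:4d}  time={cycle_time(p, **params):.4f} s  "
              f"speedup={speedup(p, **params):.2f}")
```

With these assumed numbers the modelled cycle time has a minimum at a few tens of processors and rises afterwards, because the synchronization term grows while the compute term shrinks.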

342

Incremental Parallelization of Non-Data-Parallel Programs Using the Charon Message-Passing Library  

NASA Technical Reports Server (NTRS)

Message passing is among the most popular techniques for parallelizing scientific programs on distributed-memory architectures. The reasons for its success are wide availability (MPI), efficiency, and full tuning control provided to the programmer. A major drawback, however, is that incremental parallelization, as offered by compiler directives, is not generally possible, because all data structures have to be changed throughout the program simultaneously. Charon remedies this situation through mappings between distributed and non-distributed data. It allows breaking up the parallelization into small steps, guaranteeing correctness at every stage. Several tools are available to help convert legacy codes into high-performance message-passing programs. They usually target data-parallel applications, whose loops carrying most of the work can be distributed among all processors without much dependency analysis. Others do a full dependency analysis and then convert the code virtually automatically. Even more toolkits are available that aid construction from scratch of message passing programs. None, however, allows piecemeal translation of codes with complex data dependencies (i.e. non-data-parallel programs) into message passing codes. The Charon library (available in both C and Fortran) provides incremental parallelization capabilities by linking legacy code arrays with distributed arrays. During the conversion process, non-distributed and distributed arrays exist side by side, and simple mapping functions allow the programmer to switch between the two in any location in the program. Charon also provides wrapper functions that leave the structure of the legacy code intact, but that allow execution on truly distributed data. 
Finally, the library provides a rich set of communication functions that support virtually all patterns of remote data demands in realistic structured grid scientific programs, including transposition, nearest-neighbor communication, pipelining, gather/scatter, and redistribution. At the end of the conversion process most intermediate Charon function calls will have been removed, the non-distributed arrays will have been deleted, and virtually the only remaining Charon function calls are the high-level, highly optimized communications. Distribution of the data is under complete control of the programmer, although a wide range of useful distributions is easily available through predefined functions. A crucial aspect of the library is that it does not allocate space for distributed arrays, but accepts programmer-specified memory. This has two major consequences. First, codes parallelized using Charon do not suffer from encapsulation; user data is always directly accessible. This provides high efficiency, and also retains the possibility of using message passing directly for highly irregular communications. Second, non-distributed arrays can be interpreted as (trivial) distributions in the Charon sense, which allows them to be mapped to truly distributed arrays, and vice versa. This is the mechanism that enables incremental parallelization. In this paper we provide a brief introduction of the library and then focus on the actual steps in the parallelization process, using some representative examples from, among others, the NAS Parallel Benchmarks. We show how a complicated two-dimensional pipeline, the prototypical non-data-parallel algorithm, can be constructed with ease. To demonstrate the flexibility of the library, we give examples of the stepwise, efficient parallel implementation of nonlocal boundary conditions common in aircraft simulations, as well as the construction of the sequence of grids required for multigrid.

VanderWijngaart, Rob F.

2000-01-01

343

Algorithmic commonalities in the parallel environment  

NASA Technical Reports Server (NTRS)

The ultimate aim of this project was to analyze procedures from substantially different application areas to discover what is either common or peculiar in the process of conversion to the Massively Parallel Processor (MPP). Three areas were identified: molecular dynamic simulation, production systems (rule systems), and various graphics and vision algorithms. To date, only selected graphics procedures have been investigated. They are the most readily available, and produce the most visible results. These include simple polygon patch rendering, raycasting against a constructive solid geometric model, and stochastic or fractal based textured surface algorithms. Only the simplest of conversion strategies, mapping a major loop to the array, has been investigated so far. It is not entirely satisfactory.

Mcanulty, Michael A.; Wainer, Michael S.

1987-01-01

344

Identification of Phenylbutyrate-Generated Metabolites in Huntington Disease Patients using Parallel LC/EC-array/MS and Off-line Tandem MS  

PubMed Central

Oral sodium phenyl butyrate (SPB) is currently under investigation as a histone deacetylation (HDAC) inhibitor in Huntington disease (HD). Ongoing studies indicate that symptoms related to HD genetic abnormalities decrease with SPB therapy. In a recently reported safety and tolerability study of SPB in HD, we analyzed overall chromatographic patterns from a method that employs gradient Liquid Chromatography with series Electrochemical array, UV and Fluorescence (LCECA/UV/F) for measuring SPB and its metabolite phenylacetate (PA). We found that plasma and urine from SPB-treated patients yielded individual-specific patterns of ca. 20 metabolites which may provide a means for the selection of subjects for extended trials of SPB. The structural identification of these metabolites is of critical importance, since their characterization will facilitate understanding the mechanisms of drug action and possible side effects. We have now developed an iterative process with LCECA, parallel LCECA/LCMS, and high performance tandem MS, for metabolite characterization. We report here the details of this method and its use for identification of 10 plasma and urinary metabolites in treated subjects, including indole species in urine that are not themselves metabolites of SPB. This approach thus contributes to understanding metabolic pathways that differ among HD individuals being treated with SPB. PMID:20074541

Ebbel, Erika N.; Leymarie, Nancy; Schiavo, Susan; Sharma, Swati; Gevorkian, Sona; Hersch, Steven; Matson, Wayne R.; Costello, Catherine E.

2013-01-01

345

NWChem: scalable parallel computational chemistry  

SciTech Connect

NWChem is a general purpose computational chemistry code specifically designed to run on distributed memory parallel computers. The core functionality of the code focuses on molecular dynamics, Hartree-Fock and density functional theory methods for both plane-wave basis sets as well as Gaussian basis sets, tensor contraction engine based coupled cluster capabilities and combined quantum mechanics/molecular mechanics descriptions. It was realized from the beginning that scalable implementations of these methods required a programming paradigm inherently different from what message passing approaches could offer. In response a global address space library, the Global Array Toolkit, was developed. The programming model it offers is based on using predominantly one-sided communication. This model underpins most of the functionality in NWChem and the power of it is exemplified by the fact that the code scales to tens of thousands of processors. In this paper the core capabilities of NWChem are described as well as their implementation to achieve an efficient computational chemistry code with high parallel scalability. NWChem is a modern, open source, computational chemistry code specifically designed for large scale parallel applications. To meet the challenges of developing efficient, scalable and portable programs of this nature a particular code design was adopted. This code design involved two main features. First of all, the code is built up in a modular fashion so that a large variety of functionality can be integrated easily. Secondly, to facilitate writing complex parallel algorithms the Global Array toolkit was developed. This toolkit allows one to write parallel applications in a shared memory like approach, but offers additional mechanisms to exploit data locality to lower communication overheads. This framework has proven to be very successful in computational chemistry but is applicable to any engineering domain. 
Within the context created by the features above NWChem has grown into a general purpose computational chemistry code that supports a wide variety of energy expressions and capabilities to calculate properties based thereupon. The main energy expressions are classical mechanics force fields, Hartree-Fock and DFT both for finite systems and condensed phase systems, coupled cluster, as well as QM/MM. For most energy expressions single point calculations, geometry optimizations, excited states, and other properties are available. Below we briefly discuss each of the main energy expressions and the critical points involved in scalable implementations thereof.

van Dam, Hubertus JJ; De Jong, Wibe A.; Bylaska, Eric J.; Govind, Niranjan; Kowalski, Karol; Straatsma, TP; Valiev, Marat

2011-11-01

346

Soft-core processor study for node-based architectures.  

SciTech Connect

Node-based architecture (NBA) designs for future satellite projects hold the promise of decreasing system development time and costs, size, weight, and power and positioning the laboratory to address other emerging mission opportunities quickly. Reconfigurable Field Programmable Gate Array (FPGA) based modules will comprise the core of several of the NBA nodes. Microprocessing capabilities will be necessary with varying degrees of mission-specific performance requirements on these nodes. To enable the flexibility of these reconfigurable nodes, it is advantageous to incorporate the microprocessor into the FPGA itself, either as a hard-core processor built into the FPGA or as a soft-core processor built out of FPGA elements. This document describes the evaluation of three reconfigurable FPGA based processors for use in future NBA systems: two soft cores (MicroBlaze and non-fault-tolerant LEON) and one hard core (PowerPC 405). Two standard performance benchmark applications were developed for each processor. The first, Dhrystone, is a fixed-point operation metric. The second, Whetstone, is a floating-point operation metric. Several trials were run at varying code locations, loop counts, processor speeds, and cache configurations. FPGA resource utilization was recorded for each configuration. Cache configurations impacted the results greatly; for optimal processor efficiency it is necessary to enable caches on the processors. Processor caches carry a penalty; cache error mitigation is necessary when operating in a radiation environment.

Van Houten, Jonathan Roger; Jarosz, Jason P.; Welch, Benjamin James; Gallegos, Daniel E.; Learn, Mark Walter

2008-09-01

347

Universal schemes for parallel communication  

Microsoft Academic Search

In this paper we isolate a combinatorial problem that, we believe, lies at the heart of this question and provide some encouragingly positive solutions to it. We show that there exists an N-processor realistic computer that can simulate arbitrary idealistic N-processor parallel computations with only a factor of O(log N) loss of runtime efficiency. The main innovation is an O(log

Leslie G. Valiant; Gordon J. Brebner

1981-01-01

348

Coherent beam combination of a hexagonal distributed high power fiber amplifier array.  

PubMed

We demonstrate the coherent beam combination of a hexagonal distributed high power fiber amplifier array. Six polarization-maintained fiber amplifiers are tiled into a hexagonal array by use of a novel beam combiner. The phase control signal for fiber amplifiers is generated by running a stochastic parallel gradient descent algorithm on a digital signal processor. The coherent beam combination with a total output power of 73.5 W is demonstrated. The contrast of the coherent combined beam profile is as high as 81% when the system is closed-loop controlled. PMID:19935976

Zhou, Pu; Ma, Yanxing; Wang, Xiaolin; Ma, Haotong; Wang, Jianhua; Xu, Xiaojun; Liu, Zejin

2009-11-20
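
The stochastic parallel gradient descent (SPGD) loop mentioned in record 348 can be sketched with a small simulation. This is an illustration under stated assumptions, not the authors' controller: six unit-amplitude beams, a metric equal to the normalised on-axis intensity, two-sided perturbations of fixed size, and gain, perturbation and iteration values chosen arbitrarily. The function names are hypothetical.

```python
import cmath
import random

def combined_metric(phases):
    """Normalised on-axis intensity of N unit-amplitude beams (1.0 = in phase)."""
    field = sum(cmath.exp(1j * ph) for ph in phases)
    return abs(field) ** 2 / len(phases) ** 2

def spgd(piston_errors, gain=15.0, delta=0.1, iterations=3000, seed=0):
    """Run SPGD on simulated piston errors and return the control phases."""
    rng = random.Random(seed)
    n = len(piston_errors)
    control = [0.0] * n
    for _ in range(iterations):
        # Perturb all channels in parallel with random +/- delta.
        du = [rng.choice((-delta, delta)) for _ in range(n)]
        j_plus = combined_metric(
            [e + c + d for e, c, d in zip(piston_errors, control, du)])
        j_minus = combined_metric(
            [e + c - d for e, c, d in zip(piston_errors, control, du)])
        # Move each channel along its own perturbation, scaled by the
        # measured change in the metric.
        dj = j_plus - j_minus
        control = [c + gain * dj * d for c, d in zip(control, du)]
    return control

if __name__ == "__main__":
    errors = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]  # assumed initial phase errors (rad)
    control = spgd(errors)
    before = combined_metric(errors)
    after = combined_metric([e + c for e, c in zip(errors, control)])
    print(f"metric before: {before:.3f}, after: {after:.3f}")
```

Only the scalar metric is measured in each iteration, which is why the method suits a single detector feeding a digital signal processor: no per-channel phase measurement is needed.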

349

Self-reconfiguration on Spartan-III FPGAs with compressed partial bitstreams via a parallel configuration access port (cPCAP) core  

Microsoft Academic Search

This paper presents an alternative approach for dynamic partial self-reconfiguration that enables a field programmable gate array (FPGA) to reconfigure itself at run-time partially through a parallel configuration access port (cPCAP) under the control of the stand-alone cPCAP core within the FPGA instead of using an embedded processor. The cPCAP core with bitstream decompression module needs only 361 slices,

Salih Bayar; Arda Yurdakul

2008-01-01

350

Extending Automatic Parallelization to Optimize High-Level Abstractions for Multicore  

SciTech Connect

Automatic introduction of OpenMP for sequential applications has attracted significant attention recently because of the proliferation of multicore processors and the simplicity of using OpenMP to express parallelism for shared-memory systems. However, most previous research has only focused on C and Fortran applications operating on primitive data types. C++ applications using high-level abstractions, such as STL containers and complex user-defined types, are largely ignored due to the lack of research compilers that are readily able to recognize high-level object-oriented abstractions and leverage their associated semantics. In this paper, we automatically parallelize C++ applications using ROSE, a multiple-language source-to-source compiler infrastructure which preserves the high-level abstractions and gives us access to their semantics. Several representative parallelization candidate kernels are used to explore semantic-aware parallelization strategies for high-level abstractions, combined with extended compiler analyses. Those kernels include an array-based computation loop, a loop with task-level parallelism, and a domain-specific tree traversal. Our work extends the applicability of automatic parallelization to modern applications using high-level abstractions and exposes more opportunities to take advantage of multicore processors.

Liao, C; Quinlan, D J; Willcock, J J; Panas, T

2008-12-12

351

Linear array implementation of the EM algorithm for PET image reconstruction  

SciTech Connect

The PET image reconstruction based on the EM algorithm has several attractive advantages over the conventional convolution back projection algorithms. However, the PET image reconstruction based on the EM algorithm is computationally burdensome for today's single processor systems. In addition, a large memory is required for the storage of the image, projection data, and the probability matrix. Since the computations are easily divided into tasks executable in parallel, multiprocessor configurations are the ideal choice for fast execution of the EM algorithms. In this study, the authors attempt to overcome these two problems by parallelizing the EM algorithm on a multiprocessor system. The parallel EM algorithm on a linear array topology using the commercially available fast floating point digital signal processor (DSP) chips as the processing elements (PEs) has been implemented. The performance of the EM algorithm on a 386/387 machine, IBM 6000 RISC workstation, and on the linear array system is discussed and compared. The results show that the computational speed performance of a linear array using 8 DSP chips as PEs executing the EM image reconstruction algorithm is about 15.5 times better than that of the IBM 6000 RISC workstation. The novelty of the scheme is its simplicity. The linear array topology is expandable with a larger number of PEs. The architecture is not dependent on the DSP chip chosen, and the substitution of the latest DSP chip is straightforward and could yield better speed performance.

Rajan, K.; Patnaik, L.M.; Ramakrishna, J. [Indian Institute of Science, Bangalore (India)]

1995-08-01
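
The EM reconstruction referred to in record 351 is commonly written as the multiplicative update image[j] <- image[j] * (sum_i P[i][j] * y[i] / q[i]) / (sum_i P[i][j]), where q is the forward projection of the current image. The serial sketch below shows one such step; it assumes a tiny dense probability matrix, and the names are hypothetical. It does not reproduce the linear-array partitioning of the paper.

```python
def mlem_step(image, system, counts):
    """One EM update of the image estimate.

    system[i][j] is the probability that an emission in pixel j is
    recorded in detector bin i; counts[i] is the measured count in bin i.
    """
    n_det, n_pix = len(system), len(image)
    # Forward projection: expected counts in each detector bin.
    expected = [sum(system[i][j] * image[j] for j in range(n_pix))
                for i in range(n_det)]
    updated = []
    for j in range(n_pix):
        sensitivity = sum(system[i][j] for i in range(n_det))
        # Back projection of the measured/expected ratio.
        back = sum(system[i][j] * counts[i] / expected[i]
                   for i in range(n_det) if expected[i] > 0)
        updated.append(image[j] * back / sensitivity)
    return updated

if __name__ == "__main__":
    # Toy problem: 2 pixels, 3 detector bins, noise-free counts from [2, 4].
    system = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    counts = [2.0, 4.0, 3.0]
    image = [1.0, 1.0]
    for _ in range(50):
        image = mlem_step(image, system, counts)
    print(image)
```

The forward-projection sums for different detector bins, and the back-projection sums for different pixels, do not depend on one another, which is the kind of independent work that can be divided among processing elements.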

352

Interactive Digital Signal Processor  

NASA Technical Reports Server (NTRS)

Interactive Digital Signal Processor, IDSP, consists of set of time series analysis "operators" based on various algorithms commonly used for digital signal analysis. Processing of digital signal time series to extract information usually achieved by applications of number of fairly standard operations. IDSP excellent teaching tool for demonstrating application for time series operators to artificially generated signals.

Mish, W. H.

1985-01-01

353

Universal voice processor development  

NASA Technical Reports Server (NTRS)

The development of a universal voice processor is discussed. The device is based on several circuit configurations using hybrid techniques to satisfy the electrical specifications. The steps taken during the design process are described. Circuit diagrams of the final design are presented. Mathematical models are included to support the theoretical aspects.

1972-01-01

354

Distribution free Doppler processor  

Microsoft Academic Search

The purpose of this paper is to investigate the performance of a typical Doppler processor when the received clutter data is correlated. As the classical theories of statistical inference do not provide a tractable procedure for consideration of correlated non-Gaussian random variables, it is necessary to resort to Monte Carlo simulation. A technique based on the theory of Best Linear

S. Nawathe; B. Rao

1984-01-01

355

The Alaska SAR processor  

NASA Technical Reports Server (NTRS)

The Alaska SAR processor was designed to process over 200 100 km x 100 km (Seasat-like) frames per day from the raw SAR data, at a ground resolution of 30 m x 30 m from ERS-1, J-ERS-1, and Radarsat. The near real time processor is a set of custom hardware modules operating in a pipelined architecture, controlled by a general purpose computer. Input to the processor is provided from a high density digital cassette recording of the raw data stream as received by the ground station. Two-pass processing is performed. During the first pass clutter-lock and auto-focus measurements are made. The second pass uses the results to accomplish final image formation, which is recorded on a high density digital cassette. The processing algorithm uses fast correlation techniques for range and azimuth compression. Radiometric compensation, interpolation and deskewing are also performed by the processor. The standard product of the ASP is a high resolution four-look image, with a low resolution (100 to 200 m) many-look image provided simultaneously.

Carande, R. E.; Charny, B.

1988-01-01

356

Clustered speculative multithreaded processors  

Microsoft Academic Search

In this paper we present a processor microarchitecture that can simultaneously execute multiple threads and has a clustered design for scalability purposes. A main feature of the proposed microarchitecture is its capability to spawn speculative threads from a single-thread application at run-time. These speculative threads use otherwise idle resources of the machine. Spawning a speculative thread involves predicting its control

Pedro Marcuello; Antonio González

1999-01-01

357

Graphite: A Distributed Parallel Simulator for Multicores  

E-print Network

This paper introduces the open-source Graphite distributed parallel multicore simulator infrastructure. Graphite is designed from the ground up for exploration of future multicore processors containing dozens, hundreds, ...

Beckmann, Nathan

2009-11-09

358

Parallel design patterns for a low-power, software-defined compressed video encoder  

NASA Astrophysics Data System (ADS)

Video compression algorithms such as H.264 offer much potential for parallel processing that is not always exploited by the technology of a particular implementation. Consumer mobile encoding devices often achieve real-time performance and low power consumption through parallel processing in Application Specific Integrated Circuit (ASIC) technology, but many other applications require a software-defined encoder. High quality compression features needed for some applications such as 10-bit sample depth or 4:2:2 chroma format often go beyond the capability of a typical consumer electronics device. An application may also need to efficiently combine compression with other functions such as noise reduction, image stabilization, real time clocks, GPS data, mission/ESD/user data or software-defined radio in a low power, field upgradable implementation. Low power, software-defined encoders may be implemented using a massively parallel memory-network processor array with 100 or more cores and distributed memory. The large number of processor elements allows the silicon device to operate more efficiently than conventional DSP or CPU technology. A dataflow programming methodology may be used to express all of the encoding processes including motion compensation, transform and quantization, and entropy coding. This is a declarative programming model in which the parallelism of the compression algorithm is expressed as a hierarchical graph of tasks with message communication. Data parallel and task parallel design patterns are supported without the need for explicit global synchronization control. An example is described of an H.264 encoder developed for a commercially available, massively parallel memory-network processor device.

Bruns, Michael W.; Hunt, Martin A.; Prasad, Durga; Gunupudi, Nageswara R.; Sonachalam, Sekar

2011-06-01

359

Precise Exceptions in Asynchronous Processors  

Microsoft Academic Search

The presence of precise exceptions in a processor leads to complications in its design. Some recent processor architectures have sacrificed this requirement for performance reasons at the cost of software complexity. We present an implementation strategy for precise exceptions in asynchronous processors that does not block the instruction fetch when exceptions do not occur; the

Rajit Manohar; Mika Nyström; Alain J. Martin

2001-01-01

360

VLSI Processor For Vector Quantization  

NASA Technical Reports Server (NTRS)

Pixel intensities in each kernel compared simultaneously with all code vectors. Prototype high-performance, low-power, very-large-scale integrated (VLSI) circuit designed to perform compression of image data by vector-quantization method. Contains relatively simple analog computational cells operating on direct or buffered outputs of photodetectors grouped into blocks in imaging array, yielding vector-quantization code word for each such block in sequence. Scheme exploits parallel-processing nature of vector-quantization architecture, with consequent increase in speed.

Tawel, Raoul

1995-01-01
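
The comparison step described in the abstract above (each block of pixel intensities matched against all code vectors to yield a code word) can be illustrated in software. This is a minimal sequential sketch, not the chip's analog circuit: the squared-Euclidean distance measure, the function names, and the sample codebook are assumptions made for illustration.

```python
def quantize_block(block, codebook):
    """Return the index (code word) of the code vector closest to `block`."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    # The hardware compares against all code vectors at once;
    # this loop visits them one after another.
    return min(range(len(codebook)),
               key=lambda i: squared_distance(block, codebook[i]))

def quantize_image(blocks, codebook):
    """Emit one code word per block, in sequence."""
    return [quantize_block(block, codebook) for block in blocks]

# Hypothetical 2x2 pixel blocks flattened to 4-tuples.
codebook = [(0, 0, 0, 0), (255, 255, 255, 255), (0, 255, 0, 255)]
blocks = [(10, 5, 0, 3), (250, 240, 255, 251), (12, 240, 3, 250)]
code_words = quantize_image(blocks, codebook)
print(code_words)
```

Only the code words need to be stored or transmitted, which is where the compression comes from.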

361

The U.S. Sarsat geosynchronous experiment - Ground processor description and test results  

NASA Technical Reports Server (NTRS)

The development of a specialized digital signal processor, the Geosynchronous Signal Processor (GSP), for short beacon burst signal detection and demodulation is described. The processing is based on fast Fourier transform techniques for detection and message integration on the respective message bursts for demodulation. The GSP is based on array processor technology; it is designed to yield an ultimate capacity of 50-75 simultaneous beacon transmissions within the nominal 20 kHz bandwidth centered at 406.025 MHz.

Flikkema, P. G.; Davisson, L. D.

1988-01-01

362

Optimizing Vector-Quantization Processor Architecture for Intelligent Query-Search Applications  

NASA Astrophysics Data System (ADS)

The architecture of a very large scale integration (VLSI) vector-quantization processor (VQP) has been optimized to develop a general-purpose intelligent query-search agent. The agent performs a similarity-based search in a large-volume database. Although similarity-based search processing is computationally very expensive, latency-free searches have become possible due to the highly parallel maximum-likelihood search architecture of the VQP chip. Three architectures of the VQP chip have been studied and their performances are compared. In order to give reasonable searching results according to the different policies, the concept of penalty function has been introduced into the VQP. An E-commerce real-estate agency system has been developed using the VQP chip implemented in a field-programmable gate array (FPGA) and the effectiveness of such an agency system has been demonstrated.

Xu, Huaiyu; Mita, Yoshio; Shibata, Tadashi

2002-04-01

363

Scaling first-principles plane-wave codes to thousands of processors  

Microsoft Academic Search

We present some novel computational methods for scaling up first-principles plane-wave based codes to thousands of processors avoiding communication and latency bottlenecks. This allows our code to scale to more processors and larger systems than previous plane-wave codes that are typically limited in scaling to a few hundred processors. We present performance data for the plane-wave pseudopotential code PARATEC (PARAllel

Andrew Canning; David Raczkowski

2005-01-01

364

Fault detection and bypass in a sequence information signal processor  

NASA Technical Reports Server (NTRS)

The invention comprises a plurality of scan registers, each such register respectively associated with a processor element; an on-chip comparator, encoder and fault bypass register. Each scan register generates a unitary signal the logic state of which depends on the correctness of the input from the previous processor in the systolic array. These unitary signals are input to a common comparator which generates an output indicating whether or not an error has occurred. These unitary signals are also input to an encoder which identifies the location of any fault detected so that an appropriate multiplexer can be switched to bypass the faulty processor element. Input scan data can be readily programmed to fully exercise all of the processor elements so that no fault can remain undetected.

Peterson, John C. (Inventor); Chow, Edward T. (Inventor)

1992-01-01
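
The detect, locate, and bypass sequence in the abstract above can be sketched as follows. This is a simplified software model under stated assumptions: each processor element is a plain function, each scan flag is taken to check the output of one element against a known-good reference, and all names are hypothetical rather than drawn from the patent.

```python
def run_array(elements, value, bypass=frozenset()):
    """Pass `value` through the chain, skipping bypassed elements."""
    outputs = []
    for index, element in enumerate(elements):
        if index not in bypass:
            value = element(value)
        outputs.append(value)
    return outputs

def scan_flags(actual, expected):
    """One flag per stage: True when the stage output is correct."""
    return [a == e for a, e in zip(actual, expected)]

def comparator(flags):
    """Single error indication: True if any stage reported a mismatch."""
    return not all(flags)

def encoder(flags):
    """Location of the first faulty stage, or None if all are correct."""
    for index, ok in enumerate(flags):
        if not ok:
            return index
    return None

good = lambda v: v + 1    # working element (hypothetical behaviour)
stuck = lambda v: v       # faulty element that fails to update its input
reference = [good, good, good, good]
array = [good, good, stuck, good]

expected = run_array(reference, 0)
flags = scan_flags(run_array(array, 0), expected)
error_detected = comparator(flags)
fault_index = encoder(flags)
# Switch the multiplexer so data skips the faulty element.
repaired_outputs = run_array(array, 0, bypass={fault_index})
```

After the bypass the array behaves as a shorter chain of three working elements.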

365

Parallel Genetic Algorithm for Alpha Spectra Fitting  

NASA Astrophysics Data System (ADS)

We present a performance study of alpha-particle spectra fitting using a parallel Genetic Algorithm (GA). The method uses a two-step approach. In the first step we run the parallel GA to find an initial solution for the second step, in which we use the Levenberg-Marquardt (LM) method for a precise final fit. The GA is resource-intensive, so we use a Beowulf cluster for parallel simulation. The relationship between simulation time (and parallel efficiency) and the number of processors is studied using several alpha spectra, with the aim of obtaining a method to estimate the optimal number of processors to use in a simulation.

García-Orellana, Carlos J.; Rubio-Montero, Pilar; González-Velasco, Horacio

2005-01-01
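
The relationship between simulation time, parallel efficiency, and processor count mentioned above can be made concrete with the usual definitions: speedup S(p) = T(1)/T(p) and efficiency E(p) = S(p)/p. The sketch below is illustrative only. The timings are invented, and the "largest count above an efficiency threshold" rule is an assumed selection criterion, not the estimation method of the paper.

```python
def speedup_and_efficiency(timings):
    """Map processor count -> (speedup, efficiency) from wall-clock times."""
    t1 = timings[1]
    return {p: (t1 / t, t1 / (t * p)) for p, t in sorted(timings.items())}

def largest_efficient_count(timings, min_efficiency=0.7):
    """Largest processor count whose efficiency stays above the threshold."""
    table = speedup_and_efficiency(timings)
    return max(p for p, (_, eff) in table.items() if eff >= min_efficiency)

# Hypothetical run times in seconds for 1, 2, 4, 8 and 16 processors.
timings = {1: 100.0, 2: 52.0, 4: 28.0, 8: 18.0, 16: 14.0}
best = largest_efficient_count(timings)
```

With these made-up numbers, efficiency drops below 0.7 at 8 processors, so the rule selects 4.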

366

Adaptive optical processor  

Microsoft Academic Search

The Phase 1 in-house effort to develop an optical processor as an electronic counter-counter-measure for radar multipath is discussed. The closed loop system demonstrates the ability to achieve a 15.2 ± 2.4 dB adaptive cancellation of a single tone, single delay jamming signal over a 1-5 MHz bandwidth. The open loop optical system proves capable of providing a 30.2 ±

Michael J. Ward; Christopher W. Keefer; Stephen T. Welstead

1991-01-01

367

Scalable parallel communications  

NASA Technical Reports Server (NTRS)

Coarse-grain parallelism in networking (that is, the use of multiple protocol processors running replicated software sending over several physical channels) can be used to provide gigabit communications for a single application. Since parallel network performance is highly dependent on real issues such as hardware properties (e.g., memory speeds and cache hit rates), operating system overhead (e.g., interrupt handling), and protocol performance (e.g., effect of timeouts), we have performed detailed simulation studies of both a bus-based multiprocessor workstation node (based on the Sun Galaxy MP multiprocessor) and a distributed-memory parallel computer node (based on the Touchstone DELTA) to evaluate the behavior of coarse-grain parallelism. Our results indicate: (1) coarse-grain parallelism can deliver multiple 100 Mbps with currently available hardware platforms and existing networking protocols (such as Transmission Control Protocol/Internet Protocol (TCP/IP) and parallel Fiber Distributed Data Interface (FDDI) rings); (2) scale-up is near linear in n, the number of protocol processors, and channels (for small n and up to a few hundred Mbps); and (3) since these results are based on existing hardware without specialized devices (except perhaps for some simple modifications of the FDDI boards), this is a low cost solution to providing multiple 100 Mbps on current machines.
In addition, from both the performance analysis and the properties of these architectures, we conclude: (1) multiple processors providing identical services and the use of space division multiplexing for the physical channels can provide better reliability than monolithic approaches (it also provides graceful degradation and low-cost load balancing); (2) coarse-grain parallelism supports running several transport protocols in parallel to provide different types of service (for example, one TCP handles small messages for many users, other TCP's running in parallel provide high bandwidth service to a single application); and (3) coarse grain parallelism will be able to incorporate many future improvements from related work (e.g., reduced data movement, fast TCP, fine-grain parallelism) also with near linear speed-ups.

Maly, K.; Khanna, S.; Overstreet, C. M.; Mukkamala, R.; Zubair, M.; Sekhar, Y. S.; Foudriat, E. C.

1992-01-01
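
The idea of spreading one application's traffic over several physical channels, each served by its own protocol processor, can be sketched as a stripe-and-reassemble pair. Round-robin assignment and the function names below are assumptions made for illustration; the abstract does not state how packets are distributed.

```python
def stripe(packets, num_channels):
    """Assign packets to channels round-robin: packet i goes to channel i % n."""
    return [packets[i::num_channels] for i in range(num_channels)]

def reassemble(channels):
    """Interleave the per-channel streams back into the original order."""
    merged = []
    longest = max(len(channel) for channel in channels)
    for position in range(longest):
        for channel in channels:
            if position < len(channel):
                merged.append(channel[position])
    return merged

packets = list(range(10))
channels = stripe(packets, 3)
restored = reassemble(channels)
```

Each channel carries roughly 1/n of the packets, which is the source of the near-linear scale-up in n that the abstract reports for small n.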

368

Parallel machine architecture for production rule systems  

DOEpatents

A parallel processing system for production rule programs utilizes a host processor for storing production rule right hand sides (RHS) and a plurality of rule processors for storing left hand sides (LHS). The rule processors operate in parallel in the recognize phase of the system Recognize-Act cycle to match their respective LHS's against a stored list of working memory elements (WME) in order to find a self-consistent set of WME's. The list of WME is dynamically varied during the Act phase of the system in which the host executes or fires rule RHS's for those rules for which a self-consistent set has been found by the rule processors. The host transmits instructions for creating or deleting working memory elements as dictated by the rule firings until the rule processors are unable to find any further self-consistent working memory element sets at which time the production rule system is halted.

Allen, Jr., John D. (Knoxville, TN); Butler, Philip L. (Knoxville, TN)

1989-01-01
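
The Recognize-Act cycle described above can be sketched in a few lines. This is a deliberately simplified model: a rule's LHS is treated as a set of working memory elements that must all be present (standing in for the "self-consistent set" test), the rule processors are simulated sequentially, and the rule names and facts are invented for the example.

```python
def recognize(lhs_by_rule, working_memory):
    """Recognize phase: every rule whose LHS is satisfied by working memory.
    In the patented system each rule processor performs this match in parallel."""
    return [name for name, lhs in lhs_by_rule.items() if lhs <= working_memory]

def run(lhs_by_rule, rhs_by_rule, working_memory, max_cycles=100):
    """Alternate recognize and act phases until no rule matches."""
    memory = set(working_memory)
    for _ in range(max_cycles):
        matched = recognize(lhs_by_rule, memory)
        if not matched:
            break  # nothing left to fire: the system halts
        for name in matched:
            # Act phase: the host fires the RHS, creating and deleting elements.
            creates, deletes = rhs_by_rule[name]
            memory -= deletes
            memory |= creates
    return memory

lhs_by_rule = {"boil": {"water", "heat"}, "condense": {"steam", "cold"}}
rhs_by_rule = {
    "boil": ({"steam"}, {"water"}),
    "condense": ({"droplets"}, {"steam"}),
}
final_memory = run(lhs_by_rule, rhs_by_rule, {"water", "heat", "cold"})
```

The `max_cycles` guard is an addition for safety in the sketch; rule sets that keep re-enabling each other would otherwise loop forever.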

369

Scalable Parallel Crash Simulations  

SciTech Connect

We are pleased to submit our efforts in parallelizing the PRONTO application suite for consideration in the SuParCup 99 competition. PRONTO is a finite element transient dynamics simulator which includes a smoothed particle hydrodynamics (SPH) capability; it is similar in scope to the well-known DYNA, PamCrash, and ABAQUS codes. Our efforts over the last few years have produced a fully parallel version of the entire PRONTO code which (1) runs fast and scalably on thousands of processors, (2) has performed the largest finite-element transient dynamics simulations we are aware of, and (3) includes several new parallel algorithmic ideas that have solved some difficult problems associated with contact detection and SPH scalability. We motivate this work, describe the novel algorithmic advances, give performance numbers for PRONTO running on Sandia's Intel Teraflop machine, and highlight two prototypical large-scale computations we have performed with the parallel code. We have successfully parallelized a large-scale production transient dynamics code with a novel algorithmic approach that utilizes multiple decompositions for different key segments of the computations. To be able to simulate a more than ten million element model in a few tenths of a second per timestep is unprecedented for solid dynamics simulations, especially when full global contact searches are required. The key reason is our new algorithmic ideas for efficiently parallelizing the contact detection stage. To our knowledge scalability of this computation had never before been demonstrated on more than 64 processors. This has enabled parallel PRONTO to become the only solid dynamics code we are aware of that can run effectively on 1000s of processors. More importantly, our parallel performance compares very favorably to the original serial PRONTO code which is optimized for vector supercomputers. On the container crush problem, a Teraflop node is as fast as a single processor of the Cray Jedi.
This means that on the Teraflop machine we can now run simulations with tens of millions of elements thousands of times faster than we could on the Jedi! This is enabling transient dynamics simulations of unprecedented scale and fidelity. Not only can previous applications be run with vastly improved resolution and speed, but qualitatively new and different analyses have been made possible.

Attaway, Stephen; Barragy, Ted; Brown, Kevin; Gardner, David; Gruda, Jeff; Heinstein, Martin; Hendrickson, Bruce; Metzinger, Kurt; Neilsen, Mike; Plimpton, Steve; Pott, John; Swegle, Jeff; Vaughan, Courtenay

1999-06-01

370

An Experimental Evaluation of Processor Pool-Based Scheduling for Shared-Memory NUMA Multiprocessors  

E-print Network

An Experimental Evaluation of Processor Pool-Based Scheduling for Shared-Memory NUMA processes (or kernel threads) of parallel applications to processors in multiprogrammed, shared-memory NUMA other are desirable properties of algorithms for operating system schedulers executing on NUMA

Feitelson, Dror

371

Efficacy of Code Optimization on Cache-based Processors  

NASA Technical Reports Server (NTRS)

The current common wisdom in the U.S. is that the powerful, cost-effective supercomputers of tomorrow will be based on commodity (RISC) micro-processors with cache memories. Already, most distributed systems in the world use such hardware as building blocks. This shift away from vector supercomputers and towards cache-based systems has brought about a change in programming paradigm, even when ignoring issues of parallelism. Vector machines require inner-loop independence and regular, non-pathological memory strides (usually this means: non-power-of-two strides) to allow efficient vectorization of array operations. Cache-based systems require spatial and temporal locality of data, so that data once read from main memory and stored in high-speed cache memory is used optimally before being written back to main memory. This means that the most cache-friendly array operations are those that feature zero or unit stride, so that each unit of data read from main memory (a cache line) contains information for the next iteration in the loop. Moreover, loops ought to be 'fat', meaning that as many operations as possible are performed on cache data, provided instruction caches do not overflow and enough registers are available. If unit stride is not possible, for example because of some data dependency, then care must be taken to avoid pathological strides, just as on vector computers. For cache-based systems the issues are more complex, due to the effects of associativity and of non-unit block (cache line) size. But there is more to the story. Most modern micro-processors are superscalar, which means that they can issue several (arithmetic) instructions per clock cycle, provided that there are enough independent instructions in the loop body. This is another argument for providing fat loop bodies. With these restrictions, it appears fairly straightforward to produce code that will run efficiently on any cache-based system.
It can be argued that although some of the important computational algorithms employed at NASA Ames require different programming styles on vector machines and cache-based machines, respectively, neither architecture class appeared to be favored by particular algorithms in principle. Practice tells us that the situation is more complicated. This report presents observations and some analysis of performance tuning for cache-based systems. We point out several counterintuitive results that serve as a cautionary reminder that memory accesses are not the only factors that determine performance, and that within the class of cache-based systems, significant differences exist.

VanderWijngaart, Rob F.; Chancellor, Marisa K. (Technical Monitor)

1997-01-01
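
The abstract's point about unit stride versus pathological strides can be illustrated with a toy cache model. The sketch below simulates a direct-mapped cache and counts misses for three access patterns; the cache geometry (64 lines of 8 words) and the function names are assumptions chosen for the example, not parameters of any machine discussed in the report.

```python
def count_misses(addresses, line_words=8, num_lines=64):
    """Count misses in a direct-mapped cache for a sequence of word addresses."""
    tags = [None] * num_lines          # which memory line each slot holds
    misses = 0
    for address in addresses:
        line = address // line_words   # memory line containing this word
        slot = line % num_lines        # the single slot that line can occupy
        if tags[slot] != line:
            tags[slot] = line
            misses += 1
    return misses

def strided(count, stride, passes=2):
    """Word addresses for `passes` sweeps over `count` elements at a stride."""
    return [i * stride for _ in range(passes) for i in range(count)]

unit = count_misses(strided(64, 1))        # eight words share each line
power_of_two = count_misses(strided(64, 512))  # every access hits slot 0
offset = count_misses(strided(64, 520))    # accesses spread over all slots
print(unit, power_of_two, offset)
```

In this model the stride of 512 words equals the cache size, so every access evicts the previous one and even the second sweep misses every time, while the slightly larger stride of 520 lets the second sweep hit entirely in cache.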

372

A parallel algorithm for channel routing on a hypercube  

NASA Technical Reports Server (NTRS)

A new parallel simulated annealing algorithm for channel routing on a P processor hypercube is presented. The basic idea used is to partition a set of tracks equally among processors in the hypercube. In parallel, P/2 pairs of processors perform displacements and exchanges of nets between tracks, compute the changes in cost functions, and accept moves using a parallel annealing criterion. Through the use of a unique distributed data structure, it is possible to minimize message traffic and add versatility and efficiency in a parallel routing tool. The algorithm has been implemented and is being tested on some of the popular channel problems from the literature.

Brouwer, Randall; Banerjee, Prithviraj

1987-01-01
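
Two ingredients named in the abstract above, the equal partition of tracks among processors and the acceptance test for a proposed move, can be sketched as follows. The abstract does not define its parallel annealing criterion, so the standard serial Metropolis rule is substituted here as an assumption; the function names are likewise hypothetical.

```python
import math
import random

def partition_tracks(num_tracks, num_processors):
    """Split track indices into equal contiguous groups, one per processor."""
    if num_tracks % num_processors:
        raise ValueError("sketch assumes tracks divide evenly among processors")
    size = num_tracks // num_processors
    return [list(range(p * size, (p + 1) * size)) for p in range(num_processors)]

def accept_move(delta_cost, temperature, rng=random.random):
    """Metropolis rule: always accept improvements, and accept a cost
    increase with probability exp(-delta_cost / temperature)."""
    if delta_cost <= 0:
        return True
    return rng() < math.exp(-delta_cost / temperature)

groups = partition_tracks(8, 4)
```

Passing `rng` explicitly makes the acceptance test reproducible, for example `accept_move(1.0, 1.0, rng=lambda: 0.1)`.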

373

Distributed processor allocation for launching applications in a massively connected processors complex  

DOEpatents

A compute processor allocator architecture for allocating compute processors to run applications in a multiple processor computing apparatus is distributed among a subset of processors within the computing apparatus. Each processor of the subset includes a compute processor allocator. The compute processor allocators can share a common database of information pertinent to compute processor allocation. A communication path permits retrieval of information from the database independently of the compute processor allocators.

Pedretti, Kevin (Goleta, CA)

2008-11-18

374

Initial Experiences Porting a Bioinformatics Application to a Graphics Processor  

Microsoft Academic Search

Bioinformatics applications are one of the most relevant and compute-demanding applications today. While normally these applications are executed on clusters or dedicated parallel systems, in this work we explore the use of an alternative architecture. We focus on exploiting the compute-intensive characteristics offered by the graphics processors (GPU) in order to accelerate a bioinformatics application. The GPU is a

Maria Charalambous; Pedro Trancoso; Alexandros Stamatakis

2005-01-01

375

Survey of new vector computers: The CRAY 1S from CRAY research; the CYBER 205 from CDC and the parallel computer from ICL - architecture and programming  

NASA Technical Reports Server (NTRS)

Problems which can arise with vector and parallel computers are discussed in a user-oriented context. Emphasis is placed on the algorithms used and the programming techniques adopted. Three recently developed supercomputers are examined and typical application examples are given in CRAY FORTRAN, CYBER 205 FORTRAN and DAP (distributed array processor) FORTRAN. The systems' performance is compared. The addition of parts of two N x N arrays is considered. The influence of the architecture on the algorithms and programming language is demonstrated. Numerical analysis of magnetohydrodynamic differential equations by an explicit difference method is illustrated, showing very good results for all three systems. The prognosis for supercomputer development is assessed.

Gentzsch, W.

1982-01-01

376

FY 2006 Accomplishment Colony - "Services and Interfaces to Support Large Numbers of Processors"  

SciTech Connect

The Colony Project is developing operating system and runtime system technology to enable efficient general purpose environments on tens of thousands of processors. To accomplish this, we are investigating memory management techniques, fault management strategies, and parallel resource management schemes. Recent results show promising findings for scalable strategies based on processor virtualization, in-memory checkpointing, and parallel aware modifications to full featured operating systems.

Jones, T; Kale, L; Moreira, J; Mendes, C; Chakravorty, S; Tauferner, A; Inglett, T

2006-06-30

377

Parallel processing data network of master and slave transputers controlled by a serial control network  

DOEpatents

The present device provides for a dynamically configurable communication network having a multi-processor parallel processing system having a serial communication network and a high speed parallel communication network. The serial communication network is used to disseminate commands from a master processor to a plurality of slave processors to effect communication protocol, to control transmission of high density data among nodes and to monitor each slave processor's status. The high speed parallel processing network is used to effect the transmission of high density data among nodes in the parallel processing system. Each node comprises a transputer, a digital signal processor, a parallel transfer controller, and two three-port memory devices. A communication switch within each node connects it to a fast parallel hardware channel through which all high density data arrives or leaves the node. 6 figs.

Crosetto, D.B.

1996-12-31

378

Scaling properties of geometric parallelization  

NASA Astrophysics Data System (ADS)

We present a universal scaling law for all geometrically parallelized computer simulation algorithms. For algorithms with local interaction laws we calculate the scaling exponents for zero and infinite lattice size. The scaling is tested on local (cellular automata, Metropolis Ising) as well as cluster (Swendsen-Wang) algorithms. The practical aspects of the scaling properties lead to a simple recipe for finding the optimum number of processors to be used for the parallel simulation of a particular system.

Jakobs, A.; Gerling, R. W.

1992-01-01

379

Optical Finite Element Processor  

NASA Astrophysics Data System (ADS)

A new high-accuracy optical linear algebra processor (OLAP) with many advantageous features is described. It achieves floating point accuracy, handles bipolar data by sign-magnitude representation, performs LU decomposition using only one channel, easily partitions and considers data flow. A new application (finite element (FE) structural analysis) for OLAPs is introduced and the results of a case study presented. Error sources in encoded OLAPs are addressed for the first time. Their modeling and simulation are discussed and quantitative data are presented. Dominant error sources and the effects of composite error sources are analyzed.

Casasent, David; Taylor, Bradley K.

1986-01-01

380

Waste from food processors  

SciTech Connect

Food processing companies, by nature of the commodities they deal in and the products they provide, generate a much higher percentage of biodegradable, organic wastes than they do nonorganic wastes. The high percentage of food materials, and to a lesser extent, paper, found in a food processor's waste stream makes composting a highly cost-effective way to manage the wastes. This is the last in a series of articles that discussed solid waste management in various public arenas. Each segment highlighted particulars -- the waste stream; how the waste is handled; waste reduction and recovery programs; and the direction of future waste management -- that are specific to that area.

Sheehan, K.

1993-12-01

381

Parallel MATLAB at VT: Parallel For Loops  

E-print Network

.......... FSU: Florida State University; AOE: Department of Aerospace and Ocean Engineering; ARC: Advanced Research Computing; ICAM: Interdisciplinary Center for Applied Mathematics ... MATLAB Parallel on the order of execution. There are also restrictions on array-data access. OpenMP implements a directive

Crawford, T. Daniel

382

Parallel MATLAB at VT: Parallel For Loops  

E-print Network

.......... FSU: Florida State University; AOE: Department of Aerospace and Ocean Engineering; ARC: Advanced Research Computing; ICAM: Interdisciplinary Center for Applied Mathematics ... MATLAB Parallel are completely independent; there are also some restrictions on array-data access. OpenMP implements a directive

Crawford, T. Daniel

383

Parallel MATLAB at VT: Parallel For Loops  

E-print Network

.......... FSU: Florida State University; AOE: Department of Aerospace and Ocean Engineering; ARC: Advanced Research Computing; ICAM: Interdisciplinary Center for Applied Mathematics ... MATLAB Parallel independent; there are also some restrictions on array-data access. OpenMP implements a directive

Crawford, T. Daniel

384

Reconfigurable data path processor  

NASA Technical Reports Server (NTRS)

A reconfigurable data path processor comprises a plurality of independent processing elements, each advantageously comprising an identical architecture. Each processing element comprises a plurality of data processing means for generating a potential output. Each processor is also capable of through-putting an input as a potential output with little or no processing. Each processing element comprises a conditional multiplexer having a first conditional multiplexer input, a second conditional multiplexer input and a conditional multiplexer output. A first potential output value is transmitted to the first conditional multiplexer input, and a second potential output value is transmitted to the second conditional multiplexer input. The conditional multiplexer couples either the first conditional multiplexer input or the second conditional multiplexer input to the conditional multiplexer output, according to an output control command. The output control command is generated by processing a set of arithmetic status bits through a logical mask. The conditional multiplexer output is coupled to a first processing element output. A first set of arithmetic bits is generated according to the processing of the first processable value. A second set of arithmetic bits may be generated from a second processing operation. The selection of the arithmetic status bits is performed by an arithmetic-status-bit multiplexer, which selects the desired set of arithmetic status bits from among the first and second sets of arithmetic status bits. The conditional multiplexer evaluates the selected arithmetic status bits according to a logical mask defining an algorithm for evaluating the arithmetic status bits.

Donohoe, Gregory (Inventor)

2005-01-01
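
The conditional multiplexer described above, which picks one of two potential outputs according to arithmetic status bits passed through a logical mask, can be sketched in software. The status-bit layout, the rule "select the first input if any masked bit is set", and the absolute-value example are all assumptions made for illustration; the abstract does not specify them.

```python
ZERO = 0b01       # hypothetical status bit: result equals zero
NEGATIVE = 0b10   # hypothetical status bit: result is negative

def status_bits(value):
    """Arithmetic status bits produced by processing `value`."""
    bits = 0
    if value == 0:
        bits |= ZERO
    if value < 0:
        bits |= NEGATIVE
    return bits

def conditional_mux(first, second, status, mask):
    """Couple `first` to the output if any masked status bit is set,
    otherwise couple `second` (assumed evaluation rule)."""
    return first if (status & mask) else second

def absolute_value(x):
    negated = -x           # first potential output: a processed value
    passed_through = x     # second potential output: input passed through
    return conditional_mux(negated, passed_through, status_bits(x), NEGATIVE)
```

Changing only the mask changes the condition being tested, which is how a single element can be reconfigured without altering its data path.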

385

CoNNeCT Baseband Processor Module  

NASA Technical Reports Server (NTRS)

A document describes the CoNNeCT Baseband Processor Module (BPM) based on an updated processor, memory technology, and field-programmable gate arrays (FPGAs). The BPM was developed from a requirement to provide sufficient computing power and memory storage to conduct experiments for a Software Defined Radio (SDR) to be implemented. The flight SDR uses the AT697 SPARC processor with on-chip data and instruction cache. The non-volatile memory has been increased from a 20-Mbit EEPROM (electrically erasable programmable read only memory) to a 4-Gbit Flash, managed by the RTAX2000 Housekeeper, allowing more programs and FPGA bit-files to be stored. The volatile memory has been increased from a 20-Mbit SRAM (static random access memory) to a 1.25-Gbit SDRAM (synchronous dynamic random access memory), providing additional memory space for more complex operating systems and programs to be executed on the SPARC. All memory is EDAC (error detection and correction) protected, while the SPARC processor implements fault protection via TMR (triple modular redundancy) architecture. Further capability over prior BPM designs includes the addition of a second FPGA to implement features beyond the resources of a single FPGA. Both FPGAs are implemented with Xilinx Virtex-II and are interconnected by a 96-bit bus to facilitate data exchange. Dedicated 1.25-Gbit SDRAMs are wired to each Xilinx FPGA to accommodate high rate data buffering for SDR applications as well as independent SpaceWire interfaces. The RTAX2000 manages scrub and configuration of each Xilinx.

Yamamoto, Clifford K; Jedrey, Thomas C.; Gutrich, Daniel G.; Goodpasture, Richard L.

2011-01-01

386

Parallel processing architecture for H.264 deblocking filter on multi-core platforms  

NASA Astrophysics Data System (ADS)

Massively parallel computing (multi-core) chips offer outstanding new solutions that satisfy the increasing demand for high resolution and high quality video compression technologies such as H.264. Such solutions not only provide exceptional quality but also efficiency, low power, and low latency, previously unattainable in software based designs. While custom hardware and Application Specific Integrated Circuit (ASIC) technologies may achieve low latency, low power, and real-time performance in some consumer devices, many applications require a flexible and scalable software-defined solution. The deblocking filter in an H.264 encoder/decoder poses difficult implementation challenges because of heavy data dependencies and the conditional nature of the computations. Deblocking filter implementations tend to be fixed and difficult to reconfigure for different needs. The ability to scale up for higher quality requirements such as 10-bit pixel depth or a 4:2:2 chroma format often reduces the throughput of a parallel architecture designed for a lower feature set. A scalable architecture for deblocking filtering, created with a massively parallel processor based solution, means that the same encoder or decoder will be deployed in a variety of applications, at different video resolutions, for different power requirements, and at higher bit-depths and better color sub sampling patterns like YUV, 4:2:2, or 4:4:4 formats. Low power, software-defined encoders/decoders may be implemented using a massively parallel processor array, like that found in HyperX technology, with 100 or more cores and distributed memory. The large number of processor elements allows the silicon device to operate more efficiently than conventional DSP or CPU technology. This software programming model for massively parallel processors offers a flexible implementation and a power efficiency close to that of ASIC solutions.
This work describes a scalable parallel architecture for an H.264 compliant deblocking filter for multi-core platforms such as HyperX technology. Parallel techniques such as parallel processing of independent macroblocks, sub blocks, and pixel row level are examined in this work. The deblocking architecture consists of a basic cell called the deblocking filter unit (DFU) and a dependent data buffer manager (DFM). The DFU can be used in several instances, catering to different performance needs; the DFM serves the data required for the different numbers of DFUs and also manages all the neighboring data required for future data processing by the DFUs. This approach achieves the scalability, flexibility, and performance excellence required in deblocking filters.

Prasad, Durga P.; Sonachalam, Sekar; Kunchamwar, Mangesh K.; Gunupudi, Nageswara Rao

2012-03-01

387

HSRA: high-speed, hierarchical synchronous reconfigurable array  

Microsoft Academic Search

There is no inherent characteristic forcing Field Programmable Gate Array (FPGA) or Reconfigurable Computing (RC) Array cycle times to be greater than processors in the same process. Modern FPGAs seldom achieve application clock rates close to their processor cousins because (1) resources in the FPGAs are not balanced appropriately for high-speed operation, (2) FPGA CAD does not automatically provide

William Tsu; Kip Macy; Atul Joshi; Randy Huang; Norman Walker; Tony Tung; Omid Rowhani; Varghese George; John Wawrzynek; André DeHon

1999-01-01

388

Highly scalable linear solvers on thousands of processors.  

SciTech Connect

In this report we summarize research into new parallel algebraic multigrid (AMG) methods. We first provide an introduction to parallel AMG. We then discuss our research in parallel AMG algorithms for very large scale platforms. We detail significant improvements in the AMG setup phase to a matrix-matrix multiplication kernel. We present a smoothed aggregation AMG algorithm with fewer communication synchronization points, and discuss its links to domain decomposition methods. Finally, we discuss a multigrid smoothing technique that utilizes two message passing layers for use on multicore processors.

Domino, Stefan Paul (Sandia National Laboratories, Albuquerque, NM); Karlin, Ian (University of Colorado at Boulder, Boulder, CO); Siefert, Christopher (Sandia National Laboratories, Albuquerque, NM); Hu, Jonathan Joseph; Robinson, Allen Conrad (Sandia National Laboratories, Albuquerque, NM); Tuminaro, Raymond Stephen

2009-09-01

389

Never Trust Your Word Processor  

ERIC Educational Resources Information Center

In this article, the author talks about the auto correction mode of word processors that leads to a number of problems and describes an example in biochemistry exams that shows how word processors can lead to mistakes in databases and in papers. The author contends that, where this system is applied, spell checking should not be left to a word…

Linke, Dirk

2009-01-01

390

Performance Tradeoffs in Multithreaded Processors  

Microsoft Academic Search

An analytical performance model for multithreaded processors that includes cache interference, network contention, context-switching overhead, and data-sharing effects is presented. The model is validated through the author's simulations and by comparison with previously published simulation results. The results indicate that processors can substantially benefit from multithreading, even in systems with small caches, provided sufficient network bandwidth exists. Caches that are

Anant Agarwal

1992-01-01

391

Interactive digital signal processor  

NASA Technical Reports Server (NTRS)

The Interactive Digital Signal Processor (IDSP) is examined. It consists of a set of time series analysis Operators each of which operates on an input file to produce an output file. The operators can be executed in any order that makes sense and recursively, if desired. The operators are the various algorithms used in digital time series analysis work. User-written operators can be easily interfaced to the system. The system can be operated both interactively and in batch mode. In IDSP a file can consist of up to n (currently n=8) simultaneous time series. IDSP currently includes over thirty standard operators that range from Fourier transform operations, design and application of digital filters, eigenvalue analysis, to operators that provide graphical output, allow batch operation, editing and display information.

Mish, W. H.; Wenger, R. M.; Behannon, K. W.; Byrnes, J. B.

1982-01-01

392

Parallel recording of neurotransmitters release from chromaffin cells using a 10×10 CMOS IC potentiostat array with on-chip working electrodes.  

PubMed

Neurotransmitter release is modulated by many drugs and molecular manipulations. We present an active CMOS-based electrochemical biosensor array with high throughput capability (100 electrodes) for on-chip amperometric measurement of neurotransmitter release. The high-throughput of the biosensor array will accelerate the data collection needed to determine statistical significance of changes produced under varying conditions, from several weeks to a few hours. The biosensor is designed and fabricated using a combination of CMOS integrated circuit (IC) technology and a photolithography process to incorporate platinum working electrodes on-chip. We demonstrate the operation of an electrode array with integrated high-gain potentiostats and output time-division multiplexing with minimum dead time for readout. The on-chip working electrodes are patterned by conformal deposition of Pt and lift-off photolithography. The conformal deposition method protects the underlying electronic circuits from contact with the electrolyte that covers the electrode array during measurement. The biosensor was validated by simultaneous measurement of amperometric currents from 100 electrodes in response to dopamine injection, which revealed the time course of dopamine diffusion along the surface of the biosensor array. The biosensor simultaneously recorded neurotransmitter release successfully from multiple individual living chromaffin cells. The biosensor was capable of resolving small and fast amperometric spikes reporting release from individual vesicle secretions. We anticipate that this device will accelerate the characterization of the modulation of neurotransmitter secretion from neuronal and endocrine cells by pharmacological and molecular manipulations of the cells. PMID:23084756

Kim, Brian N; Herbst, Adam D; Kim, Sung J; Minch, Bradley A; Lindau, Manfred

2013-03-15

393

Parallel Recording of Neurotransmitters Release from Chromaffin Cells Using a 10 x 10 CMOS IC Potentiostat Array with On-Chip Working Electrodes  

PubMed Central

Neurotransmitter release is modulated by many drugs and molecular manipulations. We present an active CMOS-based electrochemical biosensor array with high throughput capability (100 electrodes) for on-chip amperometric measurement of neurotransmitter release. The high-throughput of the biosensor array will accelerate the data collection needed to determine statistical significance of changes produced under varying conditions, from several weeks to a few hours. The biosensor is designed and fabricated using a combination of CMOS integrated circuit (IC) technology and a photolithography process to incorporate platinum working electrodes on-chip. We demonstrate the operation of an electrode array with integrated high-gain potentiostats and output time-division multiplexing with minimum dead time for readout. The on-chip working electrodes are patterned by conformal deposition of Pt and lift-off photolithography. The conformal deposition method protects the underlying electronic circuits from contact with the electrolyte that covers the electrode array during measurement. The biosensor was validated by simultaneous measurement of amperometric currents from 100 electrodes in response to dopamine injection, which revealed the time course of dopamine diffusion along the surface of the biosensor array. The biosensor simultaneously recorded neurotransmitter release successfully from multiple individual living chromaffin cells. The biosensor was capable of resolving small and fast amperometric spikes reporting release from individual vesicle secretions. We anticipate that this device will accelerate the characterization of the modulation of neurotransmitter secretion from neuronal and endocrine cells by pharmacological and molecular manipulations of the cells. PMID:23084756

Kim, Brian Namghi; Herbst, Adam D.; Kim, Sung June; Minch, Bradley A.; Lindau, Manfred

2012-01-01

394

Asynchronous parallel status comparator  

DOEpatents

Disclosed is an apparatus for matching asynchronously received signals and determining whether two or more out of a total number of possible signals match. The apparatus comprises, in one embodiment, an array of sensors positioned in discrete locations and in communication with one or more processors. The processors will receive signals if the sensors detect a change in the variable sensed from a nominal to a special condition and will transmit location information in the form of a digital data set to two or more receivers. The receivers collect, read, latch and acknowledge the data sets and forward them to decoders that produce an output signal for each data set received. The receivers also periodically reset the system following each scan of the sensor array. A comparator then determines if any two or more, as specified by the user, of the output signals corresponds to the same location. A sufficient number of matches produces a system output signal that activates a system to restore the array to its nominal condition. 4 figs.

Arnold, J.W.; Hart, M.M.

1992-12-15

395

Asynchronous parallel status comparator  

DOEpatents

Apparatus for matching asynchronously received signals and determining whether two or more out of a total number of possible signals match. The apparatus comprises, in one embodiment, an array of sensors positioned in discrete locations and in communication with one or more processors. The processors will receive signals if the sensors detect a change in the variable sensed from a nominal to a special condition and will transmit location information in the form of a digital data set to two or more receivers. The receivers collect, read, latch and acknowledge the data sets and forward them to decoders that produce an output signal for each data set received. The receivers also periodically reset the system following each scan of the sensor array. A comparator then determines if any two or more, as specified by the user, of the output signals corresponds to the same location. A sufficient number of matches produces a system output signal that activates a system to restore the array to its nominal condition.

Arnold, Jeffrey W. (828 Hickory Ridge Rd., Aiken, SC 29801); Hart, Mark M. (223 Limerick Dr., Aiken, SC 29803)

1992-01-01

396

Radiofrequency detector coil performance maps for parallel MRI applications  

E-print Network

Parallel MRI techniques allow acceleration of MR imaging beyond traditional speed limits. In parallel MRI, arrays of radiofrequency (RF) detector coils are used to perform some degree of spatial encoding which ...

Lattanzi, Riccardo

2006-01-01

397

Parallel automated adaptive procedures for unstructured meshes  

NASA Technical Reports Server (NTRS)

Consideration is given to the techniques required to support adaptive analysis of automatically generated unstructured meshes on distributed memory MIMD parallel computers. The key areas of new development are focused on the support of effective parallel computations when the structure of the numerical discretization, the mesh, is evolving, and in fact constructed, during the computation. All the procedures presented operate in parallel on already distributed mesh information. Starting from a mesh definition in terms of a topological hierarchy, techniques to support the distribution, redistribution and communication among the mesh entities over the processors are given, and algorithms to dynamically balance processor workload based on the migration of mesh entities are given. A procedure to automatically generate meshes in parallel, starting from CAD geometric models, is given. Parallel procedures to enrich the mesh through local mesh modifications are also given. Finally, the combination of these techniques to produce a parallel automated finite element analysis procedure for rotorcraft aerodynamics calculations is discussed and demonstrated.

Shephard, M. S.; Flaherty, J. E.; Decougny, H. L.; Ozturan, C.; Bottasso, C. L.; Beall, M. W.

1995-01-01

398

Acousto-optic/CCD real-time SAR data processor  

NASA Technical Reports Server (NTRS)

A SAR processor that uses an acousto-optic device as the input electronic-to-optical transducer and a 2-D CCD image sensor operated in the time-delay-and-integrate (TDI) mode is presented. The CCD serves as the optical detector, and it simultaneously operates as an array of optically addressed correlators. The lines of the focused SAR image form continuously (at the radar PRF) at the final row of the CCD. The principles of operation of this processor, its performance characteristics, the state-of-the-art of the devices used and experimental results are outlined. The methods by which this processor can be made flexible so that it can be dynamically adapted to changing SAR geometries are discussed.

Psaltis, D.

1983-01-01

399

A Cooperative Management Scheme for Power Efficient Implementations of Real-Time Operating Systems on Soft Processors  

Microsoft Academic Search

A cooperative management scheme for power efficient implementations of real-time operating systems on field-programmable gate-array (FPGA)-based soft processors is presented. Dedicated power management hardware peripherals are tightly coupled to a soft processor by utilizing its configurability. These hardware peripherals manage tasks and interrupts in cooperation with the soft processor, while retaining the real-time responsiveness of the operating system. More specifically,

Jingzhao Ou; Viktor K. Prasanna

2008-01-01

400

SPECIAL ISSUE ON OPTICAL PROCESSING OF INFORMATION: Optoelectronic processors with scanning CCD photodetectors  

NASA Astrophysics Data System (ADS)

Two new types of optoelectronic radio-signal processors were investigated. Charge-coupled device (CCD) photodetectors are used in these processors under continuous scanning conditions, i.e. in a time delay and storage mode. One of these processors is based on a CCD photodetector array with a reference-signal amplitude transparency and the other is an adaptive acousto-optical signal processor with linear frequency modulation. The processor with the transparency performs multichannel discrete-analogue convolution of an input signal with a corresponding kernel of the transformation determined by the transparency. If a light source is an array of light-emitting diodes of special (stripe) geometry, the optical stages of the processor can be made from optical fibre components and the whole processor then becomes a rigid 'sandwich' (a compact hybrid optoelectronic microcircuit). A report is given also of a study of a prototype processor with optical fibre components for the reception of signals from a system with antenna aperture synthesis, which forms a radio image of the Earth.

Esepkina, N. A.; Lavrov, A. P.; Anan'ev, M. N.; Blagodarnyi, V. S.; Ivanov, S. I.; Mansyrev, M. I.; Molodyakov, S. A.

1995-10-01

401

Parallel implementation of an algorithm for Delaunay triangulation  

NASA Technical Reports Server (NTRS)

The theory and practice of implementing Tanemura's algorithm for 3D Delaunay triangulation on Intel's Gamma prototype, a 128 processor MIMD computer, is described. Efficient implementation of Tanemura's algorithm on a conventional, vector processing supercomputer is problematic. It does not vectorize to any significant degree and requires indirect addressing. Efficient implementation on a parallel architecture is possible, however. Speeds in excess of 20 times a single processor Cray Y-MP are realized on 128 processors of the Intel Gamma prototype.

Merriam, Marshal L.

1992-01-01

402

Instrumentation for parallel magnetic resonance imaging  

E-print Network

of arrays of sensors. In parallelization, multiple MR scanners (or multiple sensors) are used to collect images from different samples simultaneously. This allows for an increase in the throughput, not the inherent speed, of the MR experiment. Parallel...

Brown, David Gerald

2007-04-25

403

Parallel Processing System  

NASA Technical Reports Server (NTRS)

In order to process very high resolution image data from spacecraft sensors, Goddard Space Flight Center commissioned the development of a Massively Parallel Processor (MPP) based upon simultaneous processing of image picture elements (pixels) rather than serial processing. It resulted in a considerable increase in computational speed. MasPar Computer Corporation's MasPar MP-1 incorporates this technology, allowing users to attack a variety of computationally-intensive problems. The MP-1 is no longer manufactured but has been replaced by the MP-2, a more advanced model.

1991-01-01

404

Eigensolution of finite element problems in a completely connected parallel architecture  

NASA Technical Reports Server (NTRS)

A parallel algorithm is presented for the solution of the generalized eigenproblem in linear elastic finite element analysis. The algorithm is based on a completely connected parallel architecture in which each processor is allowed to communicate with all other processors. The algorithm is successfully implemented on a tightly coupled MIMD parallel processor. A finite element model is divided into m domains each of which is assumed to process n elements. Each domain is then assigned to a processor or to a logical processor (task) if the number of domains exceeds the number of physical processors. The effect of the number of domains, the number of degrees-of-freedom located along the global fronts, and the dimension of the subspace on the performance of the algorithm is investigated. For a 64-element rectangular plate, speed-ups of 1.86, 3.13, 3.18, and 3.61 are achieved on two, four, six, and eight processors, respectively.

Akl, F.; Morel, M.

1989-01-01
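
The speed-ups quoted in the record above can be turned into parallel efficiencies (speed-up divided by processor count), which makes the flattening beyond four processors easier to see. The following is a minimal sketch; only the four speed-up figures come from the abstract, and the function and variable names are illustrative.

```python
# Speed-ups quoted in the abstract for the 64-element rectangular plate,
# keyed by the number of processors used.
reported_speedups = {2: 1.86, 4: 3.13, 6: 3.18, 8: 3.61}


def parallel_efficiency(speedup, processors):
    """Parallel efficiency: speed-up divided by the processor count."""
    return speedup / processors


efficiencies = {
    p: parallel_efficiency(s, p) for p, s in reported_speedups.items()
}

for p in sorted(efficiencies):
    print(f"{p} processors: speed-up {reported_speedups[p]:.2f}, "
          f"efficiency {efficiencies[p]:.2f}")
```

On these figures the efficiency falls steadily as processors are added, which is consistent with the abstract's interest in how the number of domains and the size of the global fronts affect performance.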

405

Parallel processing data network of master and slave transputers controlled by a serial control network  

DOEpatents

The present device provides for a dynamically configurable communication network having a multi-processor parallel processing system having a serial communication network and a high speed parallel communication network. The serial communication network is used to disseminate commands from a master processor (100) to a plurality of slave processors (200) to effect communication protocol, to control transmission of high density data among nodes and to monitor each slave processor's status. The high speed parallel processing network is used to effect the transmission of high density data among nodes in the parallel processing system. Each node comprises a transputer (104), a digital signal processor (114), a parallel transfer controller (106), and two three-port memory devices. A communication switch (108) within each node (100) connects it to a fast parallel hardware channel (70) through which all high density data arrives or leaves the node.

Crosetto, Dario B. (DeSoto, TX)

1996-01-01

406

Using processor affinity in loop scheduling on shared-memory multiprocessors  

Microsoft Academic Search

Loops are the single largest source of parallelism in many applications. One way to exploit this parallelism is to execute loop iterations in parallel on different processors. Previous approaches to loop scheduling attempt to achieve the minimum completion time by distributing the workload as evenly as possible, while minimizing the number of synchronization operations required. In this paper we consider a third dimension

Evangelos P. Markatos; Thomas J. LeBlanc

1992-01-01

407

Parallel network simulations with NEURON.  

PubMed

The NEURON simulation environment has been extended to support parallel network simulations. Each processor integrates the equations for its subnet over an interval equal to the minimum (interprocessor) presynaptic spike generation to postsynaptic spike delivery connection delay. The performance of three published network models with very different spike patterns exhibits superlinear speedup on Beowulf clusters and demonstrates that spike communication overhead is often less than the benefit of an increased fraction of the entire problem fitting into high speed cache. On the EPFL IBM Blue Gene, almost linear speedup was obtained up to 100 processors. Increasing one model from 500 to 40,000 realistic cells exhibited almost linear speedup on 2,000 processors, with an integration time of 9.8 seconds and communication time of 1.3 seconds. The potential for speed-ups of several orders of magnitude makes practical the running of large network simulations that could otherwise not be explored. PMID:16732488

Migliore, M; Cannia, C; Lytton, W W; Markram, Henry; Hines, M L

2006-10-01
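
Two quantities in the record above lend themselves to a short worked example: the integration interval, which the abstract sets equal to the minimum interprocessor connection delay, and the share of time spent communicating in the 40,000-cell run. The sketch below is not NEURON code. The delay values are hypothetical, and it assumes the 9.8 s and 1.3 s figures add up to the relevant run time, which the abstract does not state explicitly.

```python
def exchange_interval(interprocessor_delays_ms):
    """Longest interval a subnet can integrate before it must exchange spikes.

    A spike generated during the interval cannot reach a cell on another
    processor before the interval ends if the interval does not exceed
    the smallest interprocessor connection delay.
    """
    return min(interprocessor_delays_ms)


def communication_fraction(integration_s, communication_s):
    """Share of (integration + communication) time spent communicating."""
    return communication_s / (integration_s + communication_s)


# Hypothetical connection delays in milliseconds (not from the abstract).
delays_ms = [1.0, 2.5, 0.8, 4.0]
interval_ms = exchange_interval(delays_ms)

# Times quoted in the abstract for the 40,000-cell model on 2,000 processors.
fraction = communication_fraction(9.8, 1.3)

print(f"exchange interval: {interval_ms} ms")
print(f"communication share: {fraction:.1%}")
```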

408

Parallelization of Edge Detection Algorithm using MPI on Beowulf Cluster  

NASA Astrophysics Data System (ADS)

In this paper, we present the design of a parallel Sobel edge detection algorithm using Foster's methodology. The parallel algorithm is implemented using the MPI message passing library and a master/slave algorithm. Every processor performs the same sequential algorithm but on a different part of the image. Experimental results conducted on a Beowulf cluster are presented to demonstrate the performance of the parallel algorithm.

Haron, Nazleeni; Amir, Ruzaini; Aziz, Izzatdin A.; Jung, Low Tan; Shukri, Siti Rohkmah
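
The record above describes every processor running the same sequential Sobel algorithm on a different part of the image. A minimal sketch of that idea follows, assuming a row-strip decomposition with one halo row on each side of a strip. The abstract does not specify the decomposition, and the workers here are simulated by a plain loop rather than MPI ranks, so all names are illustrative.

```python
import numpy as np

# Standard 3x3 Sobel kernels for horizontal and vertical gradients.
KX = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
KY = KX.T


def sobel_magnitude(image):
    """Sequential Sobel gradient magnitude; border pixels are left at zero."""
    h, w = image.shape
    gx = np.zeros((h - 2, w - 2))
    gy = np.zeros((h - 2, w - 2))
    for dy in range(3):
        for dx in range(3):
            window = image[dy:dy + h - 2, dx:dx + w - 2]
            gx += KX[dy, dx] * window
            gy += KY[dy, dx] * window
    out = np.zeros((h, w))
    out[1:-1, 1:-1] = np.hypot(gx, gy)
    return out


def sobel_by_row_strips(image, workers):
    """Run the same sequential routine on row strips and stitch the results.

    Each strip is extended by one halo row above and below (where one
    exists) so that rows on a strip boundary see their true neighbours.
    """
    h = image.shape[0]
    bounds = np.linspace(0, h, workers + 1).astype(int)
    pieces = []
    for r0, r1 in zip(bounds[:-1], bounds[1:]):
        lo, hi = max(r0 - 1, 0), min(r1 + 1, h)
        local = sobel_magnitude(image[lo:hi])   # the "worker" computation
        pieces.append(local[r0 - lo:r1 - lo])   # drop the halo rows
    return np.vstack(pieces)                    # the "master" gathers strips


rng = np.random.default_rng(0)
image = rng.random((12, 9))
sequential = sobel_magnitude(image)
stitched = sobel_by_row_strips(image, workers=4)
print(np.allclose(sequential, stitched))
```

In an MPI version the loop body would run on separate ranks and the final stacking would be a gather on the master. The halo rows are the only data a strip needs from its neighbours.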

409

On Supporting Parallelism in a Logic Programming System  

Microsoft Academic Search

Logic Programming is a declarative approach to programming where one can specify a problem in a high-level fashion. Several major approaches to implicit and explicit parallelism have been proposed for logic programming in Prolog. But, arguably, the last few years have seen most interest in the explicit parallelization of Prolog. With the advent of multi-core processors, parallelism is

Vítor Santos Costa

410

Online Scheduling of Parallel Jobs on Hypercubes: Maximizing the Throughput  

E-print Network

Online Scheduling of Parallel Jobs on Hypercubes: Maximizing the Throughput Ondrej Zajíček, Jir of scheduling unit-time parallel jobs on hypercubes. A parallel job has to be scheduled between its release time and deadline on a subcube of processors. The objective is to maximize the number of early jobs. We provide

Sgall, Jiri

411

Method for operating a parallel processing system and related apparatus  

SciTech Connect

This patent describes a method for operating a parallel processing system. The system has processors having access to working memory elements to be examined and the processors implementing production rules specifying operations to be performed when a series of conditions are satisfied by specified ones of the working memory elements.

Barabash, W.; Yerazunis, W.S.

1990-10-23

412

A large scale, homogeneous, fully distributed parallel machine, I  

Microsoft Academic Search

The preliminary hardware description of CHOPP (Columbia Homogeneous Parallel Processor), a MIMD machine supporting a fully distributed host-less operating system is presented. The architecture is intended to permit implementation of machines with 10^5 to 10^6 processors. Issues of interconnection networks, throughput, and memory structure are treated.

Herbert Sullivan; Theodore R. Bashkow; David Klappholz

1977-01-01

413

Debugging Serial and Parallel Codes  

NSDL National Science Digital Library

Introduction to debugger software. Serial debugging of array indexing, arguments mismatch, infinite loops, pointer misuse, and memory allocation. Parallel debugging of process count, shared memory, MPI I/O, collective communications, and OpenMP scope.

NCSA

414

PVM Enhancement for Beowulf Multiple-Processor Nodes  

NASA Technical Reports Server (NTRS)

A recent version of the Parallel Virtual Machine (PVM) computer program has been enhanced to enable use of multiple processors in a single node of a Beowulf system (a cluster of personal computers that runs the Linux operating system). A previous version of PVM had been enhanced by addition of a software port, denoted BEOLIN, that enables the incorporation of a Beowulf system into a larger parallel processing system administered by PVM, as though the Beowulf system were a single computer in the larger system. BEOLIN spawns tasks on (that is, automatically assigns tasks to) individual nodes within the cluster. However, BEOLIN does not enable the use of multiple processors in a single node. The present enhancement adds support for a parameter in the PVM command line that enables the user to specify which Internet Protocol host address the code should use in communicating with other Beowulf nodes. This enhancement also provides for the case in which each node in a Beowulf system contains multiple processors. In this case, by making multiple references to a single node, the user can cause the software to spawn multiple tasks on the multiple processors in that node.

Springer, Paul

2006-01-01

415

The 2nd Symposium on the Frontiers of Massively Parallel Computations  

NASA Technical Reports Server (NTRS)

Programming languages, computer graphics, neural networks, massively parallel computers, SIMD architecture, algorithms, digital terrain models, sort computation, simulation of charged particle transport on the massively parallel processor and image processing are among the topics discussed.

Mills, Ronnie (editor)

1988-01-01

416

To appear in IEEE Trans. on Parallel and Distributed Systems DSC: Scheduling Parallel Tasks on an Unbounded Number of  

E-print Network

To appear in IEEE Trans. on Parallel and Distributed Systems DSC: Scheduling Parallel Tasks the Dominant Sequence Clustering algorithm (DSC) for scheduling parallel tasks on an unbounded number of completely connected processors. The performance of DSC is comparable or even better on average than other

Yang, Tao

417

Parallel contingency statistics with Titan.  

SciTech Connect

This report summarizes existing statistical engines in VTK/Titan and presents the recently parallelized contingency statistics engine. It is a sequel to [PT08] and [BPRT09] which studied the parallel descriptive, correlative, multi-correlative, and principal component analysis engines. The ease of use of this new parallel engine is illustrated by means of C++ code snippets. Furthermore, this report justifies the design of these engines with parallel scalability in mind; however, the very nature of contingency tables prevents this new engine from exhibiting optimal parallel speed-up as the aforementioned engines do. This report therefore discusses the design trade-offs we made and studies performance with up to 200 processors.

Thompson, David C.; Pebay, Philippe Pierre

2009-09-01

418

Signal-to-noise ratio and parallel imaging performance of a 16-channel receive-only brain coil array at 3.0 Tesla  

Microsoft Academic Search

The performance of a 16-channel receive-only RF coil for brain imaging at 3.0 Tesla was investigated using a custom-built 16-channel receiver. Both the image signal-to-noise ratio (SNR) and the noise amplification (g-factor) in sensitivity-encoding (SENSE) parallel imaging applications were quantitatively evaluated. Furthermore, the performance was compared with that of hypothetical coils with one, two, four, and eight elements (n)

Jacco A. de Zwart; Patrick J. Ledden; Peter van Gelderen; Jerzy Bodurka; Renxin Chu; Jeff H. Duyn

2004-01-01

419

Use of maximum entropy method with parallel processing machine. [for x-ray object image reconstruction]  

NASA Technical Reports Server (NTRS)

The maximum entropy method (MEM) and balanced correlation method were used to reconstruct the images of low-intensity X-ray objects obtained experimentally by means of a uniformly redundant array coded aperture system. The reconstructed images from MEM are clearly superior. However, the MEM algorithm is computationally more time-consuming because of its iterative nature. On the other hand, both the inherently two-dimensional character of images and the iterative computations of MEM suggest the use of parallel processing machines. Accordingly, computations were carried out on the massively parallel processor at Goddard Space Flight Center as well as on the serial processing machine VAX 8600, and the results are compared.

Yin, Lo I.; Bielefeld, Michael J.

1987-01-01

420

CS 685--002 Parallel and Distributed Computation  

E-print Network

of parallel computing philosophy Beowulf System A cluster of tightly coupled PC's (processors use of commodity-off-the-shelf (COTS) components (chips and LINUX) Beowulf systems are facilitated

Zhang, Jun

421

Design and evaluation of the Hamal parallel computer  

E-print Network

Parallel shared-memory machines with hundreds or thousands of processor-memory nodes have been built; in the future we will see machines with millions or even billions of nodes. Associated with such large systems is a new ...

Grossman, J. P., 1973-

2003-01-01

422

Design and Evaluation of the Hamal Parallel Computer  

E-print Network

Parallel shared-memory machines with hundreds or thousands of processor-memory nodes have been built; in the future we will see machines with millions or even billions of nodes. Associated with such large systems is a new ...

Grossman, J.P.

2002-12-05

423

Parallel Pascal - An extended Pascal for parallel computers  

NASA Technical Reports Server (NTRS)

Parallel Pascal is an extended version of the conventional serial Pascal programming language which includes a convenient syntax for specifying array operations. It is upward compatible with standard Pascal and involves only a small number of carefully chosen new features. Parallel Pascal was developed to reduce the semantic gap between standard Pascal and a large range of highly parallel computers. Two important design goals of Parallel Pascal were efficiency and portability. Portability is particularly difficult to achieve since different parallel computers frequently have very different capabilities.

Reeves, A. P.

1984-01-01

424

A parallel gravitational N-body kernel  

E-print Network

We describe source code level parallelization for the kira direct gravitational N-body integrator, the workhorse of the starlab production environment for simulating dense stellar systems. The parallelization strategy, called "j-parallelization", involves the partition of the computational domain by distributing all particles in the system among the available processors. Partial forces on the particles to be advanced are calculated in parallel by their parent processors, and are then summed in a final global operation. Once total forces are obtained, the computing elements proceed to the computation of their particle trajectories. We report the results of timing measurements on four different parallel computers, and compare them with theoretical predictions. The computers employ either a high-speed interconnect, a NUMA architecture to minimize the communication overhead or are distributed in a grid. The code scales well in the domain tested, which ranges from 1024 - 65536 stars on 1 - 128 proc...

Zwart, Simon Portegies; Groen, Derek; Gualandris, Alessia; Sipior, Michael; Vermin, Willem

2007-01-01
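
The "j-parallelization" strategy in the record above distributes the particles among processors, has each processor compute a partial force on the particles being advanced, and then sums the partial forces globally. The sketch below illustrates only that partial-sum structure. It is not the kira code: the processors are simulated by a list comprehension, G is set to 1, and the softening length, particle counts, and all names are assumptions made for the example.

```python
import numpy as np


def partial_acceleration(targets, sources, source_masses, eps=1e-3):
    """Softened gravitational acceleration on targets due to sources (G = 1).

    A source that coincides with a target contributes zero, because the
    displacement vector is zero while the softened distance is not.
    """
    d = sources[None, :, :] - targets[:, None, :]      # shape (n_t, n_s, 3)
    r2 = (d ** 2).sum(axis=2) + eps ** 2               # shape (n_t, n_s)
    weights = source_masses[None, :, None] / r2[:, :, None] ** 1.5
    return (weights * d).sum(axis=1)                   # shape (n_t, 3)


rng = np.random.default_rng(1)
n_particles, n_processors = 64, 4
pos = rng.normal(size=(n_particles, 3))
mass = rng.random(n_particles)
active = np.arange(8)   # hypothetical block of particles to be advanced

# Reference: one processor sums over every particle.
direct = partial_acceleration(pos[active], pos, mass)

# j-parallel version: each "processor" owns a subset of the particles and
# computes only the partial force that its own particles exert.
owned = np.array_split(np.arange(n_particles), n_processors)
partials = [
    partial_acceleration(pos[active], pos[idx], mass[idx]) for idx in owned
]

# The final global operation: sum the partial forces over processors.
total = np.sum(partials, axis=0)
print(np.allclose(total, direct))
```

In a message-passing implementation the last step would be a global reduction, so the communication cost per step depends on the number of active particles rather than on the total particle count.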

425

Parallel hypergraph partitioning for scientific computing.  

SciTech Connect

Graph partitioning is often used for load balancing in parallel computing, but it is known that hypergraph partitioning has several advantages. First, hypergraphs more accurately model communication volume, and second, they are more expressive and can better represent nonsymmetric problems. Hypergraph partitioning is particularly suited to parallel sparse matrix-vector multiplication, a common kernel in scientific computing. We present a parallel software package for hypergraph (and sparse matrix) partitioning developed at Sandia National Labs. The algorithm is a variation on multilevel partitioning. Our parallel implementation is novel in that it uses a two-dimensional data distribution among processors. We present empirical results that show our parallel implementation achieves good speedup on several large problems (up to 33 million nonzeros) with up to 64 processors on a Linux cluster.

Heaphy, Robert; Devine, Karen Dragon; Catalyurek, Umit (Ohio State University, Columbus); Bisseling, Robert (Utrecht University, The Netherlands); Hendrickson, Bruce Alan; Boman, Erik Gunnar

2005-07-01
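
The claim in the record above that hypergraphs model communication volume more accurately can be illustrated for sparse matrix-vector multiplication. The following is a minimal sketch, not the Sandia package: it assumes a row-wise partition and the column-net model, in which every matrix column is a net connecting the rows that have a nonzero in it, and it assumes each vector entry is owned by one of the parts that use it. The function names are hypothetical.

```python
from collections import defaultdict


def communication_volume(nonzeros, row_part):
    """Connectivity-minus-one metric of the column-net hypergraph model.

    nonzeros: iterable of (row, col) positions of the sparse matrix.
    row_part: list mapping each row index to the part that owns it.
    Each column is a net; a net touching k parts costs k - 1 words.
    """
    parts_per_column = defaultdict(set)
    for row, col in nonzeros:
        parts_per_column[col].add(row_part[row])
    return sum(len(parts) - 1 for parts in parts_per_column.values())


def graph_edge_cut(nonzeros, row_part):
    """Edge cut of the graph model (symmetric pattern assumed)."""
    edges = {(min(r, c), max(r, c)) for r, c in nonzeros if r != c}
    return sum(1 for a, b in edges if row_part[a] != row_part[b])


# 4x4 "arrow" pattern: full diagonal plus a dense first row and column.
pattern = [(i, i) for i in range(4)]
pattern += [(i, 0) for i in range(1, 4)] + [(0, i) for i in range(1, 4)]
partition = [0, 0, 1, 1]  # rows 0,1 on part 0; rows 2,3 on part 1

volume = communication_volume(pattern, partition)
cut = graph_edge_cut(pattern, partition)
print(volume, cut)
```

In this example the hypergraph metric counts 3 words, while the graph model cuts 2 edges; charging each cut edge as one word in each direction would give 4, because the shared entry of column 0 is counted once per cut edge rather than once per receiving part.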

426

Problem Decomposition for Non-Uniformity and Processor Heterogeneity  

Microsoft Academic Search

Grid problems that exhibit non-uniformity or irregularity are common in science and engineering. Methods for partitioning these problems across traditional parallel machines with homogeneous processors have been examined by a number of researchers. In this paper we extend some typical decomposition schemes for uniform 2-D grid problems to accommodate non-uniformity in the problem space as well as heterogeneity of the processors. A new block decomposition

Michael J. Quinn; Phyllis E. Crandall

1995-01-01

427

Sequence information signal processor for local and global string comparisons  

DOEpatents

A sequence information signal processing integrated circuit chip is designed to perform high speed calculation of a dynamic programming algorithm based upon the algorithm defined by Waterman and Smith. The signal processing chip of the present invention is designed to be a building block of a linear systolic array, the performance of which can be increased by connecting additional sequence information signal processing chips to the array. The chip provides a high speed, low cost linear array processor that can locate highly similar global sequences or segments thereof such as contiguous subsequences from two different DNA or protein sequences. The chip is implemented in a preferred embodiment using CMOS VLSI technology to provide the equivalent of about 400,000 transistors or 100,000 gates. Each chip provides 16 processing elements, and is designed to provide 16-bit, two's complement operation for maximum score precision of between -32,768 and +32,767. It is designed to provide a comparison between sequences as long as 4,194,304 elements without external software and between sequences of unlimited numbers of elements with the aid of external software. Each sequence can be assigned different deletion and insertion weight functions. Each processor is provided with a similarity measure device which is independently variable. Thus, each processor can contribute to maximum value score calculation using a different similarity measure.

Peterson, John C. (Alta Loma, CA); Chow, Edward T. (San Dimas, CA); Waterman, Michael S. (Culver City, CA); Hunkapillar, Timothy J. (Pasadena, CA)

1997-01-01
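
The dynamic programming recurrence behind the record above can be sketched in software. This is only an illustration of local alignment scoring with results clamped to the 16-bit two's complement range mentioned in the abstract; the scoring values (match +2, mismatch -1, linear gap -1) and the function names are assumptions, and the per-sequence weight functions and per-processor similarity measures of the chip are not modeled.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def clamp16(value):
    """Saturate a score to the 16-bit two's complement range."""
    return max(INT16_MIN, min(INT16_MAX, value))


def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of sequences a and b.

    Only the previous row of the table is kept, so memory is O(len(b)).
    """
    previous = [0] * (len(b) + 1)
    best = 0
    for ch_a in a:
        current = [0]
        for j, ch_b in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if ch_a == ch_b else mismatch)
            up = previous[j] + gap        # gap in b
            left = current[j - 1] + gap   # gap in a
            cell = clamp16(max(0, diagonal, up, left))
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


print(local_alignment_score("ACGT", "TTACGTTT"))
```

In a linear systolic array, each processing element would typically hold one character of one sequence while the other sequence streams past, so that the cells on an anti-diagonal of this table are computed at the same time; the sketch computes the same cells sequentially.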

428

Network control processor for a TDMA system  

NASA Astrophysics Data System (ADS)

Two unique aspects of designing a network control processor (NCP) to monitor and control a demand-assigned, time-division multiple-access (TDMA) network are described. The first involves the implementation of redundancy by synchronizing the databases of two geographically remote NCPs. The two sets of databases are kept in synchronization by collecting data on both systems, transferring databases, sending incremental updates, and the parallel updating of databases. A periodic audit compares the checksums of the databases to ensure synchronization. The second aspect involves the use of a tracking algorithm to dynamically reallocate TDMA frame space. This algorithm detects and tracks current and long-term load changes in the network. When some portions of the network are overloaded while others have excess capacity, the algorithm automatically calculates and implements a new burst time plan.

Suryadevara, Omkarmurthy; Debettencourt, Thomas J.; Shulman, R. B.
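
The periodic audit described in the record above can be sketched as follows. This is a hypothetical illustration, not the NCP implementation: it assumes each database can be serialized to a canonical byte string, and it uses SHA-256 as the checksum, whereas the abstract does not say which checksum is used.

```python
import hashlib
import json


def database_checksum(records):
    """Checksum of a database given as JSON-serializable records.

    Keys are sorted so that two databases with the same content
    produce the same bytes regardless of insertion order.
    """
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def audit(primary, backup):
    """Return True when both NCP databases are in synchronization."""
    return database_checksum(primary) == database_checksum(backup)


primary = {"station-1": {"slots": 4}, "station-2": {"slots": 2}}
backup = {"station-2": {"slots": 2}, "station-1": {"slots": 4}}
print(audit(primary, backup))
```

Only the checksums need to cross the link between the two remote sites; a mismatch would then trigger a database transfer or a replay of incremental updates.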

429

Parallel Exact Inference on the Cell Broadband Engine Processor  

E-print Network

probabilistic graphical models. In such a model, the computation complexity increases dramatically graph (DAG) structured computations. I. INTRODUCTION A full joint probability distribution for any real-world system can be used for inference. However, such a distribution increases intractably with the number

Prasanna, Viktor K.

430

Parallel packet classification using GPU co-processors  

Microsoft Academic Search

In the domain of network security, packet filtering for classification purposes is of significant interest. Packet classification provides a mechanism for understanding the composition of packet streams arriving at distinct network interfaces, and is useful in diagnosing threats and uncovering vulnerabilities so as to maximise data integrity and system security. Traditional packet classifiers, such as PCAP, have utilised Control Flow

Alastair Nottingham; Barry Irwin

2010-01-01

431

Software orchestration of instruction level parallelism on tiled processor architectures  

E-print Network

Projection from silicon technology is that while transistor budget will continue to blossom according to Moore's law, latency from global wires will severely limit the ability to scale centralized structures at high ...

Lee, Walter (Walter Cheng-Wan)

2005-01-01

432

Design of a massively parallel computer using bit serial processing elements  

NASA Technical Reports Server (NTRS)

A 1-bit serial processor designed for a parallel computer architecture is described. This processor is used to develop a massively parallel computational engine, with a single instruction-multiple data (SIMD) architecture. The computer is simulated and tested to verify its operation and to measure its performance for further development.

Aburdene, Maurice F.; Khouri, Kamal S.; Piatt, Jason E.; Zheng, Jianqing

1995-01-01

433

An Efficient Solution Method for Multibody Systems with Loops Using Multiple Processors  

NASA Technical Reports Server (NTRS)

This paper describes a multibody dynamics algorithm formulated for parallel implementation on multiprocessor computing platforms using the divide-and-conquer approach. The system of interest is a general topology of rigid and elastic articulated bodies with or without loops. The algorithm divides the multibody system into a number of smaller sets of bodies in chain or tree structures, called "branches", at convenient joints called "connection points", and uses an order-N, O(N), approach to formulate the dynamics of each branch in terms of the unknown spatial connection forces. The equations of motion for the branches, leaving the connection forces as unknowns, are implemented in separate processors in parallel for computational efficiency, and the equations for all the unknown connection forces are synthesized and solved in one or several processors. The performances of two implementations of this divide-and-conquer algorithm in multiple processors are compared with an existing method implemented on a single processor.

Ghosh, Tushar K.; Nguyen, Luong A.; Quiocho, Leslie J.

2015-01-01

434

Anomalies in parallel branch-and-bound algorithms  

Microsoft Academic Search

We consider the effects of parallelizing branch-and-bound algorithms by expanding several live nodes simultaneously. It is shown that it is quite possible for a parallel branch-and-bound algorithm using n2 processors to take more time than one using n1 processors, even though n1 < n2. Furthermore, it is also possible to achieve speed-ups that are in excess of the ratio n2/n1. Experimental

Ten-Hwang Lai; Sartaj Sahni

1984-01-01

435

Anomalies in Parallel Branch-and-Bound Algorithms  

Microsoft Academic Search

We consider the effects of parallelizing branch-and-bound algorithms by expanding several live nodes simultaneously. It is shown that it is quite possible for a parallel branch-and-bound algorithm using n2 processors to take more time than one using n1 processors even though n1 < n2. Furthermore, it is also possible to achieve speedups that are in

Ten-hwang Lai; Sartaj Sahni

1983-01-01

436

UWindsor Nios II: A soft-core processor for design space exploration  

Microsoft Academic Search

Field-programmable gate arrays (FPGAs) are increasingly being used for implementing embedded systems. Soft-core processors for FPGAs are also becoming popular due to reduced design costs and better flexibility. Commercial soft-core processors such as Altera Nios II and Xilinx Microblaze have been widely deployed. While some research has been done exploring the design space of soft-core CPUs, much work remains to

Omar A. Al Rayahi; Mohammed A. S. Khalid

2009-01-01

437

Biosensors: Large-Scale Parallel Surface Functionalization of Goblet-type Whispering Gallery Mode Microcavity Arrays for Biosensing Applications (Small 19/2014).  

PubMed

To enable competitive biomolecular screening of complex analytes, dense and selective surface functionalization of sensor transducers is imperative. On page 3863, M. Hirtz, S. Koeber, and co-workers report a novel technique allowing the large-scale deposition of functional inks onto individual chip-based sensors. Polymer pen lithography-prepared glass slides act as custom stamp pads to directly print biochemical acceptor molecules. The technique is applied to densely packaged arrays of whispering gallery mode microgoblet cavities, which are then shown to function as biosensors. PMID:25292396

Bog, Uwe; Brinkmann, Falko; Kalt, Heinz; Koos, Christian; Mappes, Timo; Hirtz, Michael; Fuchs, Harald; Köber, Sebastian

2014-10-01

438

A restructurable VLSI robotics vector processor architecture for real-time control  

Microsoft Academic Search

The authors propose a restructurable architecture based on a VLSI robotics vector processor (RVP) chip. It is specially tailored to exploit parallelism in the low-level matrix/vector operations characteristic of the kinematics and dynamics computations required for real-time control. The RVP is composed of three tightly synchronized 32-bit floating-point processors to provide adequate computational power. Besides adder and multiplier units in

PONNUSWAMY SADAYAPPAN; YONG-LONG CALVIN LING; KARL W. OLSON; DAVID E. ORIN

1989-01-01

439

Algorithms for Automatic Alignment of Arrays  

NASA Technical Reports Server (NTRS)

Aggregate data objects (such as arrays) are distributed across the processor memories when compiling a data-parallel language for a distributed-memory machine. The mapping determines the amount of communication needed to bring operands of parallel operations into alignment with each other. A common approach is to break the mapping into two stages: an alignment that maps all the objects to an abstract template, followed by a distribution that maps the template to the processors. This paper describes algorithms for solving the various facets of the alignment problem: axis and stride alignment, static and mobile offset alignment, and replication labeling. We show that optimal axis and stride alignment is NP-complete for general program graphs, and give a heuristic method that can explore the space of possible solutions in a number of ways. We show that some of these strategies can give better solutions than a simple greedy approach proposed earlier. We also show how local graph contractions can reduce the size of the problem significantly without changing the best solution. This allows more complex and effective heuristics to be used. We show how to model the static offset alignment problem using linear programming, and we show that loop-dependent mobile offset alignment is sometimes necessary for optimum performance. We describe an algorithm for determining mobile alignments for objects within do loops. We also identify situations in which replicated alignment is either required by the program itself or can be used to improve performance. We describe an algorithm based on network flow that replicates objects so as to minimize the total amount of broadcast communication in replication.

Chatterjee, Siddhartha; Gilbert, John R.; Oliker, Leonid; Schreiber, Robert; Sheffler, Thomas J.

1996-01-01

440

A parallel gravitational N-body kernel  

NASA Astrophysics Data System (ADS)

We describe source code level parallelization for the kira direct gravitational N-body integrator, the workhorse of the starlab production environment for simulating dense stellar systems. The parallelization strategy, called "j-parallelization", involves the partition of the computational domain by distributing all particles in the system among the available processors. Partial forces on the particles to be advanced are calculated in parallel by their parent processors, and are then summed in a final global operation. Once total forces are obtained, the computing elements proceed to the computation of their particle trajectories. We report the results of timing measurements on four different parallel computers, and compare them with theoretical predictions. The computers employ either a high-speed interconnect, a NUMA architecture to minimize the communication overhead, or are distributed in a grid. The code scales well in the domain tested, which ranges from 1024 to 65,536 stars on 1-128 processors, providing satisfactory speedup. Running the production environment on a grid becomes inefficient for more than 60 processors distributed across three sites.

Portegies Zwart, Simon; McMillan, Stephen; Groen, Derek; Gualandris, Alessia; Sipior, Michael; Vermin, Willem

2008-07-01
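
The j-parallelization strategy described in the record above can be sketched in a few lines. This is an illustrative single-process simulation, not code from kira or starlab: the processors are stand-ins (plain list slices), the round-robin distribution, the softening length, and the units (G = 1) are assumptions, and on a real machine the final sum would be a global reduction across processors.

```python
def partial_accelerations(targets, sources, eps=1e-3):
    """Softened accelerations on targets due to the given sources only.

    Particles are (mass, x, y, z) tuples. With eps > 0 a particle's
    contribution to itself is exactly zero, so no self-check is needed.
    """
    result = []
    for _, tx, ty, tz in targets:
        ax = ay = az = 0.0
        for mass, sx, sy, sz in sources:
            dx, dy, dz = sx - tx, sy - ty, sz - tz
            inv_r3 = (dx * dx + dy * dy + dz * dz + eps * eps) ** -1.5
            ax += mass * dx * inv_r3
            ay += mass * dy * inv_r3
            az += mass * dz * inv_r3
        result.append((ax, ay, az))
    return result


def j_parallel_accelerations(particles, active, n_procs):
    """Distribute all particles over n_procs, then sum partial forces."""
    local_sets = [particles[p::n_procs] for p in range(n_procs)]
    targets = [particles[i] for i in active]
    # Each "processor" computes forces on every active particle from
    # its own local particles only.
    partials = [partial_accelerations(targets, local) for local in local_sets]
    # Final global operation: component-wise sum over processors.
    return [
        tuple(sum(component) for component in zip(*per_target))
        for per_target in zip(*partials)
    ]


particles = [(1.0 + 0.1 * i, float(i), float((i * i) % 7), float(i % 3))
             for i in range(16)]
active = [0, 5, 9]  # indices of the particles to be advanced
print(j_parallel_accelerations(particles, active, n_procs=4))
```

Because the inner loop runs over the local (j) particles, the work per processor shrinks as processors are added, while the size of the global sum depends only on the number of particles being advanced.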

441

A parallel gravitational N-body kernel  

E-print Network

We describe source code level parallelization for the kira direct gravitational N-body integrator, the workhorse of the starlab production environment for simulating dense stellar systems. The parallelization strategy, called "j-parallelization", involves the partition of the computational domain by distributing all particles in the system among the available processors. Partial forces on the particles to be advanced are calculated in parallel by their parent processors, and are then summed in a final global operation. Once total forces are obtained, the computing elements proceed to the computation of their particle trajectories. We report the results of timing measurements on four different parallel computers, and compare them with theoretical predictions. The computers employ either a high-speed interconnect, a NUMA architecture to minimize the communication overhead, or are distributed in a grid. The code scales well in the domain tested, which ranges from 1024 to 65536 stars on 1 to 128 processors, providing satisfactory speedup. Running the production environment on a grid becomes inefficient for more than 60 processors distributed across three sites.

Simon Portegies Zwart; Steve McMillan; Derek Groen; Alessia Gualandris; Michael Sipior; Willem Vermin

2007-11-05

442

EFFICIENT SCHEDULING OF PARALLEL JOBS ON MASSIVELY PARALLEL SYSTEMS  

SciTech Connect

We present buffered coscheduling, a new methodology to multitask parallel jobs in a message-passing environment and to develop parallel programs that can pave the way to the efficient implementation of a distributed operating system. Buffered coscheduling is based on three innovative techniques: communication buffering, strobing, and non-blocking communication. By leveraging these techniques, we can perform effective optimizations based on the global status of the parallel machine rather than on the limited knowledge available locally to each processor. The advantages of buffered coscheduling include higher resource utilization, reduced communication overhead, efficient implementation of flow-control strategies and fault-tolerant protocols, accurate performance modeling, and a simplified yet still expressive parallel programming model. Preliminary experimental results show that buffered coscheduling is very effective in increasing the overall performance in the presence of load imbalance and communication-intensive workloads.

F. PETRINI; W. FENG

1999-09-01

443

Integrated fuel processor development challenges.  

SciTech Connect

In the absence of a hydrogen-refueling infrastructure, the success of the fuel cell system in the market will depend on fuel processors to enable the use of available fuels, such as gasoline, natural gas, etc. The fuel processor includes several catalytic reactors, scrubbers to remove chemical species that can poison downstream catalysts or the fuel cell electrocatalyst, and heat exchangers. Most fuel cell power applications seek compact, lightweight hardware with rapid-start and load-following capabilities. Although packaging can partially address the size and volume, balancing the performance parameters while maintaining the fuel conversion (to hydrogen) efficiency requires careful integration of the unit operations and processes. Argonne National Laboratory has developed integrated fuel processors that are compact and light, and that operate efficiently. This paper discusses some of the difficulties encountered in the development process, focusing on the factors/components that constrain performance, and areas that need further research and development.

Ahmed, S.; Pereira, Lee, S. H. D.; Kaun, T.; Krumpelt, M.

2002-01-09

444

Single Breath-Hold Non-Contrast Thoracic MRA Using Highly-Accelerated Parallel Imaging With a 32-element Coil Array  

PubMed Central

OBJECTIVE To evaluate the feasibility of performing single breath-hold 3D thoracic non-contrast magnetic resonance angiography (NC-MRA) using highly-accelerated parallel imaging. MATERIALS AND METHODS We developed a single breath-hold NC MRA pulse sequence using balanced steady state free precession (SSFP) readout and highly-accelerated parallel imaging. In 17 subjects, highly-accelerated non-contrast MRA was compared against electrocardiogram (ECG)-triggered contrast-enhanced MRA. Anonymized images were randomized for blinded review by two independent readers for image quality, artifact severity in 8 defined vessel segments and aortic dimensions in 6 standard sites. NC-MRA and CE-MRA were compared in terms of these measures using paired sample t and Wilcoxon tests. RESULTS The overall image quality (3.21±0.68 for NC-MRA vs. 3.12±0.71 for CE-MRA) and artifact (2.87±1.01 for NC-MRA vs. 2.92±0.87 for CE-MRA) scores were not significantly different, but there were significant differences for the great vessel and coronary artery origins. NC-MRA demonstrated significantly lower aortic diameter measurements compared to CE-MRA; however, this difference was not considered clinically relevant (>3 mm difference) for less than 12% of segments, most commonly at the sinotubular junction. Mean total scan time was significantly lower for NC-MRA compared to CE-MRA (18.2 ± 6.0s vs. 28.1 ± 5.4s, respectively; p < 0.05). CONCLUSION Single breath-hold NC-MRA is feasible and can be a useful alternative for evaluation and follow-up of thoracic aortic diseases. PMID:22147589

Xu, Jian; Mcgorty, Kelly Anne; Lim, Ruth. P.; Bruno, Mary; Babb, James S.; Srichai, Monvadi B.; Kim, Daniel; Sodickson, Daniel K.

2011-01-01

445

Parallel solid mechanics codes at Sandia National Laboratories  

SciTech Connect

Computational physicists at Sandia National Laboratories have moved their production codes to distributed memory parallel computers. The codes include the multi-material CTH Eulerian code, structural mechanics code. This presentation discusses our experiences moving the codes to parallel computers and experiences running the codes. Moving large production codes onto parallel computers requires developing parallel algorithms, parallel data bases and parallel support tools. We rewrote the Eulerian CTH code for parallel computers. We were able to move both ALEGRA and PRONTO to parallel computers with only a modest number of modifications. We restructured the restart and graphics data bases to make them parallel and minimize the I/O to the parallel computer. We developed mesh decomposition tools to divide a rectangular or arbitrary connectivity mesh into sub-meshes. The sub-meshes map to processors and minimize the communication between processors. We developed new visualization tools to process the very large, parallel data bases. This presentation also discusses our experiences running these codes on Sandia's 1840-compute-node Intel Paragon, 1024-processor nCUBE, and networked workstations. The parallel version of CTH uses the Paragon and nCUBE for production calculations. The ALEGRA and PRONTO codes are moving off networked workstations onto the Paragon and nCUBE massively parallel computers.

McGlaun, M.

1994-08-01
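
The mesh decomposition step in the record above, dividing a rectangular mesh into sub-meshes that map to processors with little communication between them, can be illustrated with a simple block decomposition. This sketch is an assumption about one possible approach, not the Sandia tools: it handles only a 2-D rectangular mesh, measures communication by the total length of the interior cuts, and uses hypothetical function names.

```python
def best_processor_grid(nx, ny, n_procs):
    """Choose a px-by-py processor grid minimizing total cut length.

    (px - 1) vertical cuts each cross ny cells and (py - 1) horizontal
    cuts each cross nx cells; cells on a cut need neighbor data.
    """
    best = None
    for px in range(1, n_procs + 1):
        if n_procs % px:
            continue
        py = n_procs // px
        cut = (px - 1) * ny + (py - 1) * nx
        if best is None or cut < best[0]:
            best = (cut, px, py)
    return best[1], best[2]


def submesh_ranges(n_cells, n_parts):
    """Split n_cells into n_parts contiguous half-open index ranges."""
    base, extra = divmod(n_cells, n_parts)
    ranges, start = [], 0
    for part in range(n_parts):
        size = base + (1 if part < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


px, py = best_processor_grid(400, 100, 4)
print(px, py, submesh_ranges(400, px), submesh_ranges(100, py))
```

For a mesh that is much longer in x than in y, the search prefers to cut only along the short direction; an arbitrary-connectivity mesh would need a graph or hypergraph partitioner instead.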

446

Rapid prototyping and evaluation of programmable SIMD SDR processors in LISA  

NASA Astrophysics Data System (ADS)

With the development of international wireless communication standards, the computational requirements for baseband signal processors keep increasing. Time-to-market pressure makes it impossible to completely redesign processors for each evolving standard. Owing to their high flexibility and low power, software defined radio (SDR) digital signal processors have been proposed as a promising technology to replace traditional ASIC and FPGA approaches. In addition, computation-intensive baseband functions process large amounts of data in parallel, which fosters the development of single instruction multiple data (SIMD) architectures for SDR platforms. A way must therefore be found to prototype SDR processors efficiently. In this paper we present a bit- and cycle-accurate model of programmable SIMD SDR processors in the machine description language LISA. LISA is a language for instruction set architectures that enables rapid modeling at the architectural level. To evaluate the proposed processor, three common baseband functions (FFT, FIR digital filtering, and matrix multiplication) have been mapped onto the SDR platform. Analytical results showed that the SDR processor achieved a performance boost of up to 47.1% relative to the processor it was compared against.

Chen, Ting; Liu, Hengzhu; Zhang, Botao; Liu, Dongpei

2013-03-01

447

Fast processor for dilepton triggers  

SciTech Connect

We describe a fast trigger processor, developed for and used in Fermilab experiment E-537, for selecting high-mass dimuon events produced by negative pions and anti-protons. The processor finds candidate tracks by matching hit information received from drift chambers and scintillation counters, and determines their momenta. Invariant masses are calculated for all possible pairs of tracks and an event is accepted if any invariant mass is greater than some preselectable minimum mass. The whole process, accomplished within 5 to 10 microseconds, achieves up to a ten-fold reduction in trigger rate.

Katsanevas, S.; Kostarakis, P.; Baltrusaitis, R.

1983-01-01
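
The selection logic described in the record above (compute the invariant mass of every pair of candidate tracks and accept the event if any pair exceeds a preset minimum) can be sketched in software. This is an illustration only, not the E-537 hardware algorithm: momenta are assumed to be (px, py, pz) in GeV/c, both tracks are assumed to be muons, and the threshold in the example is arbitrary.

```python
import math
from itertools import combinations

MUON_MASS_GEV = 0.1057  # approximate muon mass in GeV/c^2


def invariant_mass(p1, p2, mass=MUON_MASS_GEV):
    """Invariant mass of two tracks with three-momenta p1 and p2."""
    e1 = math.sqrt(sum(c * c for c in p1) + mass * mass)
    e2 = math.sqrt(sum(c * c for c in p2) + mass * mass)
    total_p2 = sum((a + b) ** 2 for a, b in zip(p1, p2))
    m2 = (e1 + e2) ** 2 - total_p2
    return math.sqrt(max(m2, 0.0))  # guard against rounding below zero


def accept_event(track_momenta, min_mass_gev):
    """Accept if any pair of tracks exceeds the minimum invariant mass."""
    return any(
        invariant_mass(a, b) > min_mass_gev
        for a, b in combinations(track_momenta, 2)
    )


back_to_back = [(0.0, 0.0, 2.0), (0.0, 0.0, -2.0)]
collinear = [(0.0, 0.0, 2.0), (0.0, 0.0, 2.0)]
print(accept_event(back_to_back, 4.0), accept_event(collinear, 4.0))
```

Two tracks with a large opening angle give a high invariant mass and pass, while nearly collinear tracks of the same momentum give a mass close to twice the muon mass and are rejected.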

448

High order parallel numerical schemes for solving incompressible flows  

NASA Technical Reports Server (NTRS)

The use of parallel computers for numerically solving flow fields has gained much importance in recent years. This paper introduces a new high-order numerical scheme for computational fluid dynamics (CFD) specifically designed for parallel computational environments. A distributed MIMD system gives the flexibility of treating different elements of the governing equations with totally different numerical schemes in different regions of the flow field. The parallel decomposition of the governing operator to be solved is the primary parallel split. The primary parallel split was studied using a hypercube-like architecture having clusters of shared memory processors at each node. The approach is demonstrated using examples of simple steady-state incompressible flows. Future studies should investigate the secondary split because, depending on the numerical scheme that each of the processors applies and the nature of the flow in the specific subdomain, it may be possible for a processor to seek better, or higher-order, schemes for its particular subcase.

Lin, Avi; Milner, Edward J.; Liou, May-Fun; Belch, Richard A.

1992-01-01

449

High order parallel numerical schemes for solving incompressible flows  

NASA Astrophysics Data System (ADS)

The use of parallel computers for numerically solving flow fields has gained much importance in recent years. This paper introduces a new high-order numerical scheme for computational fluid dynamics (CFD) specifically designed for parallel computational environments. A distributed MIMD system gives the flexibility of treating different elements of the governing equations with totally different numerical schemes in different regions of the flow field. The parallel decomposition of the governing operator to be solved is the primary parallel split. The primary parallel split was studied using a hypercube-like architecture having clusters of shared memory processors at each node. The approach is demonstrated using examples of simple steady-state incompressible flows. Future studies should investigate the secondary split because, depending on the numerical scheme that each of the processors applies and the nature of the flow in the specific subdomain, it may be possible for a processor to seek better, or higher-order, schemes for its particular subcase.

Lin, Avi; Milner, Edward J.; Liou, May-Fun; Belch, Richard A.

1992-02-01

450

FPGA-based reconfigurable processor for ultrafast interlaced ultrasound and photoacoustic imaging.  

PubMed

In this paper, we report, to the best of our knowledge, a unique field-programmable gate array (FPGA)-based reconfigurable processor for real-time interlaced co-registered ultrasound and photoacoustic imaging and its application in imaging tumor dynamic response. The FPGA is used to control, acquire, store, delay-and-sum, and transfer the data for real-time co-registered imaging. The FPGA controls the ultrasound transmission and ultrasound and photoacoustic data acquisition process of a customized 16-channel module that contains all of the necessary analog and digital circuits. The 16-channel module is one of multiple modules plugged into a motherboard; their beamformed outputs are made available for a digital signal processor (DSP) to access using an external memory interface (EMIF). The FPGA performs a key role through ultrafast reconfiguration and adaptation of its structure to allow real-time switching between the two imaging modes, including transmission control, laser synchronization, internal memory structure, beamforming, and EMIF structure and memory size. It performs another role by parallel accessing of internal memories and multi-thread processing to reduce the transfer of data and the processing load on the DSP. Furthermore, because the laser will be pulsing even during ultrasound pulse-echo acquisition, the FPGA ensures that the laser pulses are far enough from the pulse-echo acquisitions by appropriate time-division multiplexing (TDM). A co-registered ultrasound and photoacoustic imaging system consisting of four FPGA modules (64-channels) is constructed, and its performance is demonstrated using phantom targets and in vivo mouse tumor models. PMID:22828830

Alqasemi, Umar; Li, Hai; Aguirre, Andrés; Zhu, Quing

2012-07-01
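
The delay-and-sum step that the record above assigns to the FPGA can be illustrated in software. This is a minimal sketch rather than the authors' design: it assumes integer per-channel delays already expressed in samples, whereas a real beamformer would derive them from the array geometry, the focal point, and the speed of sound.

```python
def delay_and_sum(channel_data, delays_in_samples):
    """Align each channel by its delay and sum across channels.

    channel_data: list of equal-length sample lists, one per channel.
    delays_in_samples: how many samples later the echo arrives on each
    channel relative to the reference channel.
    """
    n_samples = len(channel_data[0])
    output = [0.0] * n_samples
    for samples, delay in zip(channel_data, delays_in_samples):
        for t in range(n_samples):
            source = t + delay
            if 0 <= source < n_samples:
                output[t] += samples[source]
    return output


# Four channels; the same echo reaches channel k at sample 5 + k.
n_samples = 12
channels = [[1.0 if t == 5 + k else 0.0 for t in range(n_samples)]
            for k in range(4)]
beamformed = delay_and_sum(channels, delays_in_samples=[0, 1, 2, 3])
print(beamformed)
```

After alignment the four echoes add coherently at sample 5, which is the gain that motivates performing this step per module before the data reach the DSP.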

451

Parallel Processing of Broad-Band PPM Signals  

NASA Technical Reports Server (NTRS)

A parallel-processing algorithm and a hardware architecture to implement the algorithm have been devised for timeslot synchronization in the reception of pulse-position-modulated (PPM) optical or radio signals. As in the cases of some prior algorithms and architectures for parallel, discrete-time, digital processing of signals other than PPM, an incoming broadband signal is divided into multiple