These are representative sample records from Science.gov related to your search topic.
For comprehensive and current results, perform a real-time search at Science.gov.
1

Design Space Exploration for Massively Parallel Processor Arrays  

Microsoft Academic Search

In this paper, we describe an approach for the optimization of dedicated co-processors that are implemented either in hardware (ASIC) or configware (FPGA). Such massively parallel co-processors are typically part of a heterogeneous hardware/software system. Each co-processor is a massively parallel system consisting of an array of processing elements (PEs). In order to decide whether to map a computational

Frank Hannig; Jürgen Teich

2001-01-01

2

Watershed parallel algorithm for asynchronous processors array  

Microsoft Academic Search

A joint algorithm-architecture analysis leads to a new version of a picture segmentation system adapted to multimedia mobile terminal constraints. The asynchronous processor network, with a granularity level of one processor per pixel and based on a data-flow model, takes less than 10 µs to segment an SQCIF (88 × 72 pixels) image (about 2000 times faster than classical sequential watershed algorithms).

B. Galilee; Franck Mamalet; M. Renaudin; P.-Y. Coulon

2002-01-01

3

Digital Parallel Processor Array for Optimum Path Planning  

NASA Technical Reports Server (NTRS)

The invention computes the optimum path across a terrain or topology represented by an array of parallel processor cells interconnected between neighboring cells by links extending along different directions to the neighboring cells. Such an array is preferably implemented as a high-speed integrated circuit. The computation of the optimum path is accomplished by, in each cell, receiving stimulus signals from neighboring cells along corresponding directions, determining and storing the identity of the direction along which the first stimulus signal is received, and broadcasting a subsequent stimulus signal to the neighboring cells after a predetermined delay time, whereby stimulus signals propagate throughout the array from a starting one of the cells. After propagation of the stimulus signal throughout the array, a master processor traces back from a selected destination cell to the starting cell along an optimum path of the cells in accordance with the identity of the directions stored in each of the cells.

Kemeny, Sabrina E. (Inventor); Fossum, Eric R. (Inventor); Nixon, Robert H. (Inventor)

1996-01-01
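
The record above describes propagating a stimulus wavefront through the cell array and then tracing back along each cell's stored arrival direction. Below is a minimal serial Python sketch of that propagate-and-backtrace idea, not the patented hardware; the grid, start/goal cells, and function names are illustrative assumptions.

    from collections import deque

    def plan_path(grid, start, goal):
        """Wavefront expansion: each free cell remembers the cell that first
        stimulated it; a backtrace from the goal then recovers a shortest path."""
        rows, cols = len(grid), len(grid[0])
        came_from = {start: None}              # cell -> cell it was stimulated from
        frontier = deque([start])
        while frontier:
            cell = frontier.popleft()
            if cell == goal:
                break
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nxt = (cell[0] + dr, cell[1] + dc)
                if (0 <= nxt[0] < rows and 0 <= nxt[1] < cols
                        and grid[nxt[0]][nxt[1]] == 0 and nxt not in came_from):
                    came_from[nxt] = cell      # store the arrival direction
                    frontier.append(nxt)
        path, cell = [], goal
        while cell is not None:                # backtrace to the start cell
            path.append(cell)
            cell = came_from.get(cell)
        return list(reversed(path))

    grid = [[0, 0, 0],
            [1, 1, 0],
            [0, 0, 0]]                         # 1 marks blocked terrain
    print(plan_path(grid, (0, 0), (2, 0)))

In the invention each cell performs this bookkeeping locally and in parallel; the serial queue above simply plays the role of the spreading stimulus.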

4

Parallel B-Spline Surface Interpolation on a Mesh-Connected Processor Array  

Microsoft Academic Search

A parallel implementation of the Chebyshev method is presented for the B-spline surface interpolation problem. The algorithm finds the control points of a uniform bicubic B-spline surface that interpolates m × n data points on an m × n mesh-connected processor array in constant time; hence it is optimal. Due to its numerical stability, the algorithm can successfully be used in

Fuhua Cheng; Grzegorz W. Wasilkowski; Jiaye Wang; Caiming Zhang; Wenping Wang

1995-01-01

5

Adaptive sensing and image processing with a general-purpose pixel-parallel sensor/processor array integrated circuit  

Microsoft Academic Search

In this paper, a pixel-parallel image sensor/processor architecture with a fine-grain massively parallel SIMD analogue processor array is overviewed, and the latest VLSI implementation, the SCAMP-3 vision chip, comprising a 128 × 128 array and fabricated in a 0.35 µm CMOS technology, is presented. Examples of real-time image processing executed on the chip are shown. Sensor-level data reduction, wide dynamic range and adaptive sensing algorithms,

Piotr Dudek

2007-01-01

6

Array processor architecture  

NASA Technical Reports Server (NTRS)

A high speed parallel array data processing architecture fashioned under a computational envelope approach includes a data base memory for secondary storage of programs and data, and a plurality of memory modules interconnected to a plurality of processing modules by a connection network of the Omega gender. Programs and data are fed from the data base memory to the plurality of memory modules, and from there the programs are fed through the connection network to the array of processors (one copy of each program for each processor). Execution of the programs occurs with the processors operating normally quite independently of each other in a multiprocessing fashion. For data dependent operations and other suitable operations, all processors are instructed to finish one given task or program branch before all are instructed to proceed in parallel processing fashion on the next instruction. Even when functioning in the parallel processing mode, however, the processors are not lock-stepped but execute their own copies of the program individually unless or until another overall processor array synchronization instruction is issued.

Barnes, George H. (Inventor); Lundstrom, Stephen F. (Inventor); Shafer, Philip E. (Inventor)

1983-01-01

7

Array processors in chemistry  

SciTech Connect

The field of attached scientific processors ("array processors") is surveyed, and an attempt is made to indicate their present and possible future use in computational chemistry. The current commercial products from Floating Point Systems, Inc., Datawest Corporation, and CSP, Inc. are discussed.

Ostlund, N.S.

1980-01-01

8

Parallel processor/memory circuit  

Microsoft Academic Search

An array of processor/memories is described comprising: an instruction decoder that generates tables of outputs in response to instructions received at the decoder; processor/memories, each of which comprises a memory means into which data may be written and from which data may be read, and a processor for producing an output depending at least in part on data read from

W. D. Hillis; T. F. Knight, Jr.; A. Bawden; B. L. Kahle; D. Chapman; D. P. Christman; C. A. Lasser; C. R. Feynman

1987-01-01

9

Supporting dynamic parallel object arrays  

Microsoft Academic Search

We present efficient support for generalized arrays of parallel data driven objects. The “array elements” are scattered across a parallel machine. Each array element is an object that can be thought of as a virtual processor. The individual elements are addressed by their “index”, which can be an arbitrary object rather than a simple integer. For example, it can be

Orion Sky Lawlor; Laxmikant V. Kalé

2001-01-01

10

Optical systolic array processor using residue arithmetic  

NASA Technical Reports Server (NTRS)

The use of residue arithmetic to increase the accuracy and reduce the dynamic range requirements of optical matrix-vector processors is evaluated. It is determined that matrix-vector operations and iterative algorithms can be performed totally in residue notation. A new parallel residue quantizer circuit is developed which significantly improves the performance of the systolic array feedback processor. Results are presented of a computer simulation of this system used to solve a set of three simultaneous equations.

Jackson, J.; Casasent, D.

1983-01-01
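
As a rough software analogue of the residue-arithmetic idea above, the sketch below performs a matrix-vector product independently modulo several small, pairwise-coprime moduli (the low-dynamic-range channels such a processor can evaluate in parallel) and recombines the channel results with the Chinese remainder theorem. The moduli and matrix values are illustrative assumptions, not values from the paper; the modular inverse via pow requires Python 3.8+.

    def matvec_residue(A, x, moduli=(5, 7, 9, 11)):
        """Matrix-vector product carried out independently modulo several small,
        pairwise-coprime moduli, then recombined by the Chinese remainder theorem."""
        n = len(A)
        residues = [[sum(A[i][j] * x[j] for j in range(len(x))) % m
                     for i in range(n)] for m in moduli]   # one small matvec per modulus
        M = 1
        for m in moduli:
            M *= m
        result = []
        for i in range(n):
            y = 0
            for m, res in zip(moduli, residues):
                Mi = M // m
                y += res[i] * Mi * pow(Mi, -1, m)           # CRT recombination
            result.append(y % M)
        return result

    A = [[2, 3], [1, 4]]
    x = [5, 6]
    print(matvec_residue(A, x))   # [28, 29], exact as long as results stay below M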

11

An inner product processor design using novel parallel counter circuits  

Microsoft Academic Search

This paper presents a novel parallel inner product processor architecture. The proposed processor has the following features: (1) it can be easily reconfigured for computing inner products of input arrays with four or more types of structures. Typically, each input array may contain 64 8-bit items, or 16 16-bit items, or 4 32-bit items, or 1 64-bit item, with items

Rong Lin; A. S. Botha; K. E. Kerr; G. A. Brown

1999-01-01

12

Parallel Analog-to-Digital Image Processor  

NASA Technical Reports Server (NTRS)

Proposed integrated-circuit network of many identical units converts analog outputs of imaging arrays of x-ray or infrared detectors to digital outputs. Converter located near imaging detectors, within cryogenic detector package. Because converter output is digital, it lends itself well to multiplexing and to postprocessing for correction of gain and offset errors peculiar to each picture element and its sampling and conversion circuits. Analog-to-digital image processor is massively parallel system for processing data from array of photodetectors. System built as compact integrated circuit located near focal plane. Buffer amplifier for each picture element has different offset.

Lokerson, D. C.

1987-01-01

13

Flight array processor  

NASA Technical Reports Server (NTRS)

Spaceflight applications for the NASA Scatterometer (NSCAT), an ocean surface wind measuring system flown as part of the Navy Remote Ocean Sensing System (NROSS), are discussed in outline form, along with information on the Advanced Digital Synthetic Aperture Radar Processor (ADSP) that is being developed for ground-based processing of spacecraft Earth observations. Design considerations are listed. A block diagram of the scatterometer is given.

1985-01-01

14

On supercomputing with systolic/wavefront array processors  

Microsoft Academic Search

Tremendous progress has been made on several promising parallel architectures for scientific computations, including a variety of digital filters, fast Fourier transform (FFT) processors, data-flow processors, systolic arrays, and wavefront arrays. This paper describes these computing networks in terms of signal-flow graphs (SFG) or data-flow graphs (DFG), and proposes a methodology of converting SFG computing networks into synchronous systolic arrays

Sun-Yuan Kung

1984-01-01

15

Systolic diagnosis of processor arrays  

SciTech Connect

With the advances in VLSI technology, it has become feasible to implement a multiprocessor system consisting of identical cells on a single chip or on a single wafer. Unfortunately, the realization of such a system is impeded by several difficult technological problems. Of these, probably the most fundamental problem is the high probability of failure of a system of such dimensions. In this thesis, an efficient fault diagnosis method for processor arrays is proposed. The method is based on systolic comparison of array cell functions. In the fault diagnosis, both cells and programmable switches are assumed to fail. The performance of the method is analyzed with respect both to the locatability of fault-free cells and to testing time. Real-time applications of the method to homogeneous systems are illustrated. Finally, algorithm-level systolic diagnosis for a VLSI sorter and an FFT processor is studied. Low hardware and time overhead are shown to be major advantages of the method.

Choi, Y.H.

1986-01-01

16

Adaptively Parallel Processor Allocation for Cilk Jobs  

E-print Network

The problem of allocating processor resources fairly and efficiently to parallel jobs has been studied extensively in the past. Most of this work, however, assumes that the instantaneous parallelism of the jobs is known ...

Sen, Siddhartha

17

Ultrafast Fourier-transform parallel processor  

SciTech Connect

A new, flexible, parallel-processing architecture is developed for a high-speed, high-precision Fourier transform processor. The processor is intended for use in 2-D signal processing including spatial filtering, matched filtering and image reconstruction from projections.

Greenberg, W.L.

1980-04-01

18

Grundy - Parallel processor architecture makes programming easy  

NASA Technical Reports Server (NTRS)

The hardware, software, and firmware of the parallel processor, Grundy, are examined. The Grundy processor uses a simple processor that has a totally orthogonal three-address instruction set. The system contains a relative and indirect processing mode to support the high-level language, and uses pseudoprocessors and read-only memory. The system supports a high-level language in which arbitrary degrees of algorithmic parallelism are expressed. The functions of the compiler and invocation frame are described. Grundy uses an operating system that can be accessed by an arbitrary number of processes simultaneously, and the access time grows only as the logarithm of the number of active processes. Applications for the parallel processor are discussed.

Meier, R. J., Jr.

1985-01-01

19

Adaptive sensing and image processing with a general-purpose pixel-parallel sensor/processor array integrated circuit  

E-print Network

Piotr Dudek, School of Electrical and Electronic Engineering, University of Manchester ... has focused on the development of such arrays, integrated with image sensors, in a single silicon chip ... detection, multi-resolution read-out with pixel binning, high dynamic range sensing with multiple exposure, locally

Dudek, Piotr

20

Parallel Data Mining on Graphics Processors  

Microsoft Academic Search

We introduce GPUMiner, a novel parallel data mining system that utilizes new-generation graphics processing units (GPUs). Our system relies on the massively multi-threaded SIMD (Single Instruction, Multiple Data) architecture provided by GPUs. As special-purpose co-processors, these processors are highly optimized for graphics rendering and rely on the CPU for data input/output as well as complex program control. Therefore,

Wenbin Fang; Ka Keung Lau; Mian Lu; Xiangye Xiao; Chi Kit Lam; Philip Yang Yang; Bingsheng He; Qiong Luo; Pedro V. Sander; Ke Yang

2008-01-01

21

Parallel processor programs in the Federal Government  

NASA Technical Reports Server (NTRS)

In 1982, a report dealing with the nation's research needs in high-speed computing called for increased access to supercomputing resources for the research community, research in computational mathematics, and increased research in the technology base needed for the next generation of supercomputers. Since that time a number of programs addressing future generations of computers, particularly parallel processors, have been started by U.S. government agencies. The present paper provides a description of the largest government programs in parallel processing. Established in fiscal year 1985 by the Institute for Defense Analyses for the National Security Agency, the Supercomputing Research Center will pursue research to advance the state of the art in supercomputing. Attention is also given to the DOE applied mathematical sciences research program, the NYU Ultracomputer project, the DARPA multiprocessor system architectures program, NSF research on multiprocessor systems, ONR activities in parallel computing, and NASA parallel processor projects.

Schneck, P. B.; Austin, D.; Squires, S. L.; Lehmann, J.; Mizell, D.; Wallgren, K.

1985-01-01

22

Binocular Disparity Calculation on a Massively-Parallel Analog Vision Processor  

E-print Network

We studied neuromorphic models of binocular disparity processing and mapped them onto a vision chip containing a massively parallel analog processor array. Our goal was to make efficient use of the available hardware while ...

Mandal, Soumyajit

23

Multiple-fold clustered processor mesh array  

NASA Astrophysics Data System (ADS)

The multiple-fold clustered processor mesh array is a triangular organization of clustered processing elements. This multiple-fold array maintains functional equivalence to the nearest neighbor mesh computer with uni-directional interprocessor communications, but with half the number of connection wires. In addition, the connectivity of the multiple-folded organization is superior to the standard square mesh due to the improved connectivity between the clustered processors. One of the primary application areas targeted is High Performance Architectures for image processing.

Pechanek, Gerald G.; Vassiliadis, Stamatis; Delgado, Jose G.

24

Multiple-fold clustered processor mesh array  

NASA Technical Reports Server (NTRS)

The multiple-fold clustered processor mesh array is a triangular organization of clustered processing elements. This multiple-fold array maintains functional equivalence to the nearest neighbor mesh computer with uni-directional interprocessor communications, but with half the number of connection wires. In addition, the connectivity of the multiple-folded organization is superior to the standard square mesh due to the improved connectivity between the clustered processors. One of the primary application areas targeted is High Performance Architectures for image processing.

Pechanek, Gerald G.; Vassiliadis, Stamatis; Delgado, Jose G.

1993-01-01

25

Grundy: Parallel Processor Architecture Makes Programming Easy  

NASA Astrophysics Data System (ADS)

Grundy, an architecture for parallel processing, facilitates the use of high-level languages. In Grundy, several thousand simple processors are dispersed throughout the address space, and the concept of machine state is replaced by an invocation frame, a data structure of local variables, a program counter, and pointers to superprocesses (parents), subprocesses (children), and concurrent processes (siblings). Each instruction execution consists of five phases: an instruction is fetched, the instruction is decoded, the sources are fetched, the operation is performed, and the destination is written. This breakdown of operations is easily pipelinable. The instruction format of Grundy is completely orthogonal, so Grundy machine code consists of a set of register transfer control bits. The process state pointers are used to collect unused resources such as processors and memory. Joseph Mahon [1] found that as the degree of physical parallelism increases, throughput, including overhead, increases even if extra overhead is needed to split logical processes. As stack pointers, accumulators, and index registers facilitate using high-level languages on conventional computers, pointers to parents, children, and siblings simplify the use of a run-time operating system. The ability to ignore the physical structure of a large number of simple processors supports the use of structured programming. A very simple processor cell allows the replication of approximately 16 32-bit processors on a single Very Large Scale Integration chip (2 Mλ²). A bootstrapper and input/output channels can be hardwired (using ROM cells and pseudo-processor cells) into a 100-chip computer that is expected to have over 500 processors, 500K of memory, and a network supporting up to 64 concurrent messages between 1000 nodes. These sizes are merely typical and not limits.

Meier, Robert J.

1985-12-01

26

Supporting dynamic parallel object arrays  

Microsoft Academic Search

We present efficient support for generalized arrays of parallel data driven objects. Array elements are regular C++ objects, and are scattered across the parallel machine. An individual element is addressed by its

Orion Sky Lawlor; Laxmikant V. Kalé

2003-01-01

27

Demonstration and Architectural Analysis of Complementary Metal-Oxide Semiconductor Multiple-Quantum-Well Smart-Pixel Array Cellular Logic Processors for Single-Instruction Multiple-Data Parallel-Pipeline Processing  

NASA Astrophysics Data System (ADS)

We present an optoelectronic-VLSI system that integrates complementary metal-oxide semiconductor multiple-quantum-well smart pixels for high-throughput computation and signal processing. The system uses 5 × 10 cellular smart-pixel arrays with intrachip electrical mesh interconnections and interchip optical point-to-point interconnections. Each smart pixel is a fine-grain microprocessor that executes binary image algebra instructions. There is one dual-rail optical modulator output and one dual-rail optical detector input in each pixel. These optical input/output arrays provide chip-to-chip optical interconnects. Cascading these smart-pixel array chips permits direct transfer of two-dimensional data or images in parallel. We present laboratory demonstrations of the system for digital image edge detection and digital video motion estimation. We also analyze the performance of the system compared with that of conventional single-instruction multiple-data processors.

Wu, Jen-Ming; Kuznia, Charles B.; Hoanca, Bogdan; Chen, Chih-Hao; Sawchuk, Alexander A.

1999-04-01

28

Advanced computers: parallel and biochip processors  

SciTech Connect

This book is divided into two sections. The first reviews and compares the computational capabilities of very high-speed computers available commercially in the context of applications. Features of vector/parallel processors, surveys of equipment, and prospective hardware and software developments are discussed. Section 2 reviews the state of the art of the emerging fields of bioelectronics and biochip technology. At the present time, proteins and other biological materials are being tested for their utility in the fabrication of microelectronic devices. The authors also project future trends in bioelectronics in terms of anticipated technical advances. Both sections include bibliographies for further study.

Lord, N.W.

1983-01-01

29

Scalable Unix tools on parallel processors  

SciTech Connect

The introduction of parallel processors that run a separate copy of Unix on each processor has introduced new problems in managing the user's environment. This paper discusses some generalizations of common Unix commands for managing files (e.g., ls) and processes (e.g., ps) that are convenient and scalable. These basic tools, just like their Unix counterparts, are text-based. We also discuss a way to use these with a graphical user interface (GUI). Some notes on the implementation are provided. Prototypes of these commands are publicly available.

Gropp, W.; Lusk, E.

1994-12-31

30

Characterization of processing errors on analog fully-programmable cellular sensor-processor arrays  

E-print Network

workarounds, choose different algorithms, or indeed output the data and process off-chip. Storage elements ... instruction parallelism through use of compact processing elements. A fundamental aspect of processor ... processor element, i.e. for an M × N array of elements ... (1), where 1 ≤ i ≤ M and 1 ≤ j ≤ N. The operation

Dudek, Piotr

31

Acceleration of computer-generated hologram by Greatly Reduced Array of Processor Element with Data Reduction  

NASA Astrophysics Data System (ADS)

We have implemented a computer-generated hologram (CGH) calculation on Greatly Reduced Array of Processor Element with Data Reduction (GRAPE-DR) processors. The cost of CGH calculation is enormous, but CGH calculation is well suited to parallel computation. The GRAPE-DR is a multicore processor that has 512 processor elements. The GRAPE-DR supports a double-precision floating-point operation and can perform CGH calculation with high accuracy. The calculation speed of the GRAPE-DR system is seven times faster than that of a personal computer with an Intel Core i7-950 processor.

Sugiyama, Atsushi; Masuda, Nobuyuki; Oikawa, Minoru; Okada, Naohisa; Kakue, Takashi; Shimobaba, Tomoyoshi; Ito, Tomoyoshi

2014-11-01

32

Broadband monitoring simulation with massively parallel processors  

NASA Astrophysics Data System (ADS)

Modern efficient optimization techniques, namely needle optimization and gradual evolution, enable one to design optical coatings of any type. Moreover, these techniques allow one to obtain multiple solutions with close spectral characteristics. It is important, therefore, to develop software tools that allow one to choose a practically optimal solution from a wide variety of possible theoretical designs. A practically optimal solution provides the highest production yield when the optical coating is manufactured. Computational manufacturing is a low-cost tool for choosing a practically optimal solution. The theory of probability predicts that reliable production yield estimations require many hundreds or even thousands of computational manufacturing experiments. As a result, reliable estimation of the production yield may require too much computational time. The most time-consuming operation is calculation of the discrepancy function used by a broadband monitoring algorithm. This function is formed by a sum of terms over a wavelength grid. These terms can be computed simultaneously in different threads of computation, which opens great opportunities for parallelization. Multi-core and multi-processor systems can provide speed-ups of several times. Additional potential for further acceleration is connected with using Graphics Processing Units (GPUs). A modern GPU consists of hundreds of massively parallel processors and is capable of performing floating-point operations efficiently.

Trubetskov, Mikhail; Amotchkina, Tatiana; Tikhonravov, Alexander

2011-09-01
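
The record above points out that the discrepancy function is a sum of independent per-wavelength terms, which is what makes it parallelizable. A minimal sketch of that structure, assuming a placeholder coating model, splits the wavelength grid across worker processes with Python's multiprocessing:

    from multiprocessing import Pool

    def term(args):
        """One term of the discrepancy function: squared deviation between a
        (stand-in) computed spectral value and its target at one wavelength."""
        wavelength, target = args
        computed = 0.5 + 0.1 * wavelength      # placeholder for the coating model
        return (computed - target) ** 2

    def discrepancy(wavelengths, targets, workers=4):
        # every wavelength term is independent, so the grid can be split across
        # processes (or GPU threads) and the partial results summed afterwards
        with Pool(workers) as pool:
            return sum(pool.map(term, zip(wavelengths, targets)))

    if __name__ == "__main__":
        grid = [0.4 + 0.001 * k for k in range(300)]   # wavelength grid, micrometres
        targets = [0.55] * len(grid)
        print(discrepancy(grid, targets))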

33

Large-Scale Parallel Programming: Experience with the BBN Butterfly Parallel Processor  

E-print Network

Large-Scale Parallel Programming: Experience with the BBN Butterfly Parallel Processor. Thomas J. LeBlanc ... of Rochester have used a collection of BBN Butterfly (TM) Parallel Processors to conduct research in parallel ... With the Butterfly we have ported three compilers, developed five major and several minor library packages, built two

Scott, Michael L.

34

Breadboard Signal Processor for Arraying DSN Antennas  

NASA Technical Reports Server (NTRS)

A recently developed breadboard version of an advanced signal processor for arraying many antennas in NASA's Deep Space Network (DSN) can accept inputs in a 500-MHz-wide frequency band from six antennas. The next breadboard version is expected to accept inputs from 16 antennas, and a later version is expected to be designed according to an architecture that will be scalable to accept inputs from as many as 400 antennas. These and similar signal processors could also be used for combining multiple wide-band signals in non-DSN applications, including very-long-baseline interferometry and telecommunications. This signal processor performs the functions of a wide-band FX correlator and a beam-forming signal combiner. [The term "FX" signifies that the digital samples of two given signals are fast Fourier transformed (F), then the fast Fourier transforms of the two signals are multiplied (X) prior to accumulation.] In this processor, the signals from the various antennas are broken up into channels in the frequency domain (see figure). In each frequency channel, the data from each antenna are correlated against the data from each other antenna; this is done for all antenna baselines (that is, for all antenna pairs). The results of the correlations are used to obtain calibration data to align the antenna signals in both phase and delay. Data from the various antenna frequency channels are also combined and calibration corrections are applied. The frequency-domain data thus combined are then synthesized back to the time domain for passing on to a telemetry receiver.

Jongeling, Andre; Sigman, Elliott; Chandra, Kumar; Trinh, Joseph; Soriano, Melissa; Navarro, Robert; Rogstad, Stephen; Goodhart, Charles; Proctor, Robert; Jourdan, Michael; Rayhrer, Benno

2008-01-01
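
A small numerical sketch of the FX scheme defined in the record above: each antenna stream is fast Fourier transformed (F), then the spectra of every antenna pair are cross-multiplied (X). The antenna data, channel count, and one-sample delay in the example are illustrative assumptions.

    import numpy as np

    def fx_correlate(antenna_data, n_chan=64):
        """FX correlator sketch: Fourier-transform each antenna stream (F), then
        cross-multiply spectra for every baseline, i.e. antenna pair (X)."""
        spectra = [np.fft.rfft(x[:n_chan]) for x in antenna_data]      # F step
        baselines = {}
        for i in range(len(spectra)):
            for j in range(i + 1, len(spectra)):
                baselines[(i, j)] = spectra[i] * np.conj(spectra[j])   # X step
        return baselines

    rng = np.random.default_rng(0)
    signal = rng.standard_normal(64)
    data = [signal, np.roll(signal, 1)]     # second antenna sees a 1-sample delay
    vis = fx_correlate(data)
    # the phase ramp across frequency channels encodes the relative delay,
    # the kind of information used to align the antenna signals
    print(np.angle(vis[(0, 1)])[:5])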

35

Low-complexity distributed parallel processor for 2D IIR broadband beam plane-wave filters  

Microsoft Academic Search

Real-time systolic-array-based implementations of VLSI two-dimensional (2D) infinite-impulse-response (IIR) frequency-planar beam plane-wave filters have potentially wide applications in the filtering of spatio-temporal RF broadband plane waves based on their directions of arrival (DOAs). Distributed-parallel-processor (DPP) implementations of the systolic arrays allow synchronous sampling of the 2D input signal array, but because of the direct-form structure they have high circuit complexity.

H. L. P. A. Madanayake; Len Bruton

2007-01-01

36

Scan line graphics generation on the massively parallel processor  

NASA Technical Reports Server (NTRS)

Described here is how researchers implemented a scan line graphics generation algorithm on the Massively Parallel Processor (MPP). Pixels are computed in parallel and their results are applied to the Z buffer in large groups. Performing pixel value calculations, facilitating load balancing across the processors, and applying the results to the Z buffer efficiently in parallel require special virtual routing (sort computation) techniques developed by the author especially for use on single-instruction multiple-data (SIMD) architectures.

Dorband, John E.

1988-01-01
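
A tiny serial sketch of the grouped Z-buffer update mentioned above (pixel candidates computed elsewhere, then applied in batches); it omits the MPP's virtual-routing and SIMD details, and all names and values are illustrative.

    def apply_to_zbuffer(zbuffer, framebuffer, pixel_batch):
        """Apply a batch of (x, y, depth, colour) candidates: a pixel is written
        only if it is nearer than the depth already stored for that location."""
        for x, y, depth, colour in pixel_batch:
            if depth < zbuffer[y][x]:
                zbuffer[y][x] = depth
                framebuffer[y][x] = colour

    W, H = 4, 3
    zbuf = [[float("inf")] * W for _ in range(H)]
    fbuf = [[0] * W for _ in range(H)]
    # two candidates for pixel (1, 1); the nearer one (depth 2.0) wins
    batch = [(1, 1, 5.0, 10), (1, 1, 2.0, 20), (2, 2, 1.0, 30)]
    apply_to_zbuffer(zbuf, fbuf, batch)
    print(fbuf)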

37

Massively Parallel MRI Detector Arrays  

PubMed Central

Originally proposed as a method to increase sensitivity by extending the locally high-sensitivity of small surface coil elements to larger areas, the term parallel imaging now includes the use of array coils to perform image encoding. This methodology has impacted clinical imaging to the point where many examinations are performed with an array comprising multiple smaller surface coil elements as the detector of the MR signal. This article reviews the theoretical and experimental basis for the trend towards higher channel counts relying on insights gained from modeling and experimental studies as well as the theoretical analysis of the so-called “ultimate” SNR and g-factor. We also review the methods for optimally combining array data and changes in RF methodology needed to construct massively parallel MRI detector arrays and show some examples of state-of-the-art for highly accelerated imaging with the resulting highly parallel arrays. PMID:23453758

Keil, Boris; Wald, Lawrence L

2013-01-01

38

Reconfigurable parallel processor for noise suppression  

NASA Astrophysics Data System (ADS)

Digital images corrupted with noise regularly require different filtering techniques to optimally correct the image. Software provides convenience for implementing a variety of different filters, but suffers a speed penalty due to the serial nature of the filter calculations. Conversely, implementation in ASIC technology offers a speed advantage through parallel processing, but at the cost of increased hardware overhead for implementing a variety of filters individually. Advances in Field Programmable Gate Array (FPGA) technology offer a middle ground in which the speed advantage of an ASIC and the reprogrammability of a general-purpose CPU or DSP software approach are combined. In this paper, we present an FPGA-based, reconfigurable system that can perform an assortment of noise filtering algorithms using the same hardware. Filtering of Gaussian and salt-and-pepper noise is evaluated for this system.

Cuviello, Michael; Dang, Philip P.; Chau, Paul M.

1999-05-01

39

Diagnosis and reconfiguration of VLSI/WSI array processors  

SciTech Connect

Some fault-tolerant techniques and analytical methods are presented for linear, mesh, and tree array processors which are implemented in Very Large Scale Integration (VLSI) circuits or Wafer Scale Integration (WSI) circuits. Several techniques are developed for testing, diagnosis, on-line fault detection, and reconfiguration of array processors. A testing strategy, built-in self-test, is presented for array processors to achieve C-testability, by which the test length is independent of the size of the array. The signature comparison approach is used for the diagnostic algorithms. Reconfiguration schemes with two-level redundancy for mesh and tree arrays are described. An on-line fault detection scheme using redundant cells and blocks is developed. Analytical tools for reliability are given for evaluating the proposed schemes. A yield estimation model for WSI mesh array processors with two-level redundancy is presented. Distributed as well as clustered defects are considered in this model.

Wang, M.

1988-01-01

40

DFT algorithms for bit-serial GaAs array processor architectures  

NASA Technical Reports Server (NTRS)

Systems and Processes Engineering Corporation (SPEC) has developed an innovative array processor architecture for computing Fourier transforms and other commonly used signal processing algorithms. This architecture is designed to extract the highest possible array performance from state-of-the-art GaAs technology. SPEC's architectural design includes a high performance RISC processor implemented in GaAs, along with a Floating Point Coprocessor and a unique Array Communications Coprocessor, also implemented in GaAs technology. Together, these data processors represent the latest in technology, both from an architectural and implementation viewpoint. SPEC has examined numerous algorithms and parallel processing architectures to determine the optimum array processor architecture. SPEC has developed an array processor architecture with integral communications ability to provide maximum node connectivity. The Array Communications Coprocessor embeds communications operations directly in the core of the processor architecture. A Floating Point Coprocessor architecture has been defined that utilizes Bit-Serial arithmetic units, operating at very high frequency, to perform floating point operations. These Bit-Serial devices reduce the device integration level and complexity to a level compatible with state-of-the-art GaAs device technology.

Mcmillan, Gary B.

1988-01-01

41

Mapping reusable software components onto the ARC parallel processor  

Microsoft Academic Search

It is shown how to map the components of a program onto the ARC (Architecture for Reusable Components) processor automatically in a way that exploits its features. Mapping consists of two phases. The first phase determines the maximum amount of parallelism attainable from a program in the model of parallel execution. This is done by mapping program components onto logical

Lonnie R. Welch; Bruce W. Weide

1990-01-01

42

Processor Self-Scheduling for Multiple-Nested Parallel Loops  

Microsoft Academic Search

Processor self-scheduling is a useful scheme in a multiprocessor system if the execution time of each iteration in a parallel loop is not known in advance and varies substantially, or if there are multiple nestings in parallel loops which makes static scheduling difficult and inefficient. By using efficient synchronization primitives, the operating system is not needed for loop scheduling. The

Peiyi Tang; Pen-chung Yew

1986-01-01
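
A minimal sketch of processor self-scheduling as described above: worker threads claim the next loop iteration from a shared counter instead of being assigned iterations statically, so uneven iteration costs balance themselves out. The lock stands in for the efficient synchronization primitive (e.g., a fetch-and-add) discussed in the paper; all names are illustrative.

    import threading

    def self_schedule(n_iters, n_workers, body):
        """Each worker repeatedly claims the next unclaimed iteration index."""
        next_iter = [0]
        lock = threading.Lock()

        def worker():
            while True:
                with lock:                    # stand-in for a fetch-and-add
                    i = next_iter[0]
                    next_iter[0] += 1
                if i >= n_iters:
                    return
                body(i)

        threads = [threading.Thread(target=worker) for _ in range(n_workers)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

    results = [0] * 10
    self_schedule(10, 3, lambda i: results.__setitem__(i, i * i))
    print(results)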

43

3081/E emulator, a processor for use in on-line and off-line arrays  

SciTech Connect

This paper presents a status report on the 3081/E covering the processor hardware, interfacing capability, and accompanying software. Details of production figures and preliminary performance results are given. Plans for the use of arrays of 3081/Es for parallel event processing in both on-line and off-line systems are outlined.

Ferran, P.M.; Fucci, A.; Gallno, P.; Hinton, R.; Jacobs, D.; Kudla, M.; Martin, B.; Masuch, H.; Storr, K.M.; Gravina, M.

1985-08-01

44

Direct simulation Monte Carlo analysis on parallel processors  

NASA Technical Reports Server (NTRS)

A method is presented for executing a direct simulation Monte Carlo (DSMC) analysis using parallel processing. The method is based on using domain decomposition to distribute the work load among multiple processors, and the DSMC analysis is performed completely in parallel. Message passing is used to transfer molecules between processors and to provide the synchronization necessary for the correct physical simulation. Benchmark problems are described for testing the method and results are presented which demonstrate the performance on two commercially available multicomputers. The results show that reasonable parallel speedup and efficiency can be obtained if the problem is properly sized to the number of processors. It is projected that with a massively parallel system, performance exceeding that of current supercomputers is possible.

Wilmoth, Richard G.

1989-01-01
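
A schematic (serial) illustration of the domain-decomposition bookkeeping described above: molecules are binned into spatial subdomains owned by different processors, and after a move step any molecule that crosses a boundary is handed to the neighbouring owner, which the real code does with message passing. Only the decomposition and migration are sketched, not the DSMC collision physics; all values are illustrative assumptions.

    def owner(x, n_domains, length=1.0):
        """Map a molecule position to the rank owning that slab of the domain."""
        return min(int(x / (length / n_domains)), n_domains - 1)

    def move_and_migrate(domains, dt, n_domains):
        """Advance every molecule, then hand molecules that left their subdomain
        to the new owner (message passing in a real implementation)."""
        outgoing = [[] for _ in range(n_domains)]
        for rank, molecules in enumerate(domains):
            kept = []
            for x, v in molecules:
                x_new = x + v * dt
                dest = owner(x_new, n_domains)
                (kept if dest == rank else outgoing[dest]).append((x_new, v))
            domains[rank] = kept
        for rank in range(n_domains):          # "receive" the migrated molecules
            domains[rank].extend(outgoing[rank])

    n_domains = 4
    domains = [[(0.05 + 0.25 * r, 0.5)] for r in range(n_domains)]  # (position, velocity)
    move_and_migrate(domains, dt=0.5, n_domains=n_domains)
    print([len(d) for d in domains])   # molecules per processor after migration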

45

Compiling an Array Language to a Graphics Processor Bradford Larsen  

E-print Network

Compiling an Array Language to a Graphics Processor BY Bradford Larsen B.A. Philosophy, University in Computer Science December 2010 #12;ALL RIGHTS RESERVED © 2010 Bradford Larsen #12;This thesis has been

New Hampshire, University of

46

Singular value decomposition utilizing parallel algorithms on graphical processors  

SciTech Connect

One of the current challenges in underwater acoustic array signal processing is the detection of quiet targets in the presence of noise. In order to enable robust detection, one of the key processing steps requires data and replica whitening. This, in turn, involves the eigen-decomposition of the sample spectral matrix, Cx = (1/K) Σ_{k=1..K} X(k) X^H(k), where X(k) denotes a single frequency snapshot with an element for each element of the array. By employing the singular value decomposition (SVD) method, the eigenvectors and eigenvalues can be determined directly from the data without computing the sample covariance matrix, reducing the computational requirements for a given level of accuracy (van Trees, Optimum Array Processing). (Recall that the SVD of a complex matrix A involves determining U, Σ, and V such that A = UΣV^H, where U and V are orthonormal and Σ is a positive, real, diagonal matrix containing the singular values of A. U and V are the eigenvectors of AA^H and A^H A, respectively, while the singular values are the square roots of the eigenvalues of AA^H.) Because it is desirable to be able to compute these quantities in real time, an efficient technique for computing the SVD is vital. In addition, emerging multicore processors like graphical processing units (GPUs) are bringing parallel processing capabilities to an ever increasing number of users. Since the computational tasks involved in array signal processing are well suited for parallelization, it is expected that these computations will be implemented using GPUs as soon as users have the necessary computational tools available to them. Thus, it is important to have an SVD algorithm that is suitable for these processors. This work explores the effectiveness of two different parallel SVD implementations on an NVIDIA Tesla C2050 GPU (14 multiprocessors, 32 cores per multiprocessor, 1.15 GHz clock speed). The first algorithm is based on a two-step algorithm which bidiagonalizes the matrix using Householder transformations, and then diagonalizes the intermediate bidiagonal matrix through implicit QR shifts. This is similar to that implemented for real matrices by Lahabar and Narayanan ("Singular Value Decomposition on GPU using CUDA", IEEE International Parallel & Distributed Processing Symposium, 2009). The implementation is done in a hybrid manner, with the bidiagonalization stage done using the GPU while the diagonalization stage is done using the CPU, with the GPU used to update the U and V matrices. The second algorithm is based on a one-sided Jacobi scheme utilizing a sequence of pair-wise column orthogonalizations such that A is replaced by AV until the resulting matrix is sufficiently orthogonal (that is, equal to UΣ). V is obtained from the sequence of orthogonalizations, while Σ can be found from the square roots of the diagonal elements of A^H A, and, once Σ is known, U can be found by column scaling the resulting matrix. These implementations utilize CUDA Fortran and NVIDIA's CUBLAS library. The primary goal of this study is to quantify the comparative performance of these two techniques against themselves and other standard implementations (for example, MATLAB). Considering that there is significant overhead associated with transferring data to the GPU and with synchronization between the GPU and the host CPU, it is also important to understand when it is worthwhile to use the GPU in terms of the matrix size and number of concurrent SVDs to be calculated.

Kotas, Charlotte W [ORNL; Barhen, Jacob [ORNL

2011-01-01
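
The second algorithm in the record above is a one-sided Jacobi SVD. Below is a minimal real-valued NumPy sketch of that scheme (not the CUDA Fortran implementation): column pairs of A are rotated until mutually orthogonal, so A is gradually replaced by UΣ; V accumulates the rotations and Σ is read off the final column norms. The test matrix and sweep count are illustrative.

    import numpy as np

    def one_sided_jacobi_svd(A, sweeps=30, tol=1e-12):
        """One-sided Jacobi SVD: rotate column pairs of A until they are mutually
        orthogonal; A becomes U*Sigma while V accumulates the rotations."""
        A = A.astype(float).copy()
        n = A.shape[1]
        V = np.eye(n)
        for _ in range(sweeps):
            converged = True
            for p in range(n - 1):
                for q in range(p + 1, n):
                    alpha = A[:, p] @ A[:, p]
                    beta = A[:, q] @ A[:, q]
                    gamma = A[:, p] @ A[:, q]
                    if abs(gamma) > tol * np.sqrt(alpha * beta):
                        converged = False
                        zeta = (beta - alpha) / (2.0 * gamma)
                        sign = 1.0 if zeta >= 0 else -1.0
                        t = sign / (abs(zeta) + np.sqrt(1.0 + zeta * zeta))
                        c = 1.0 / np.sqrt(1.0 + t * t)
                        s = c * t
                        rot = np.array([[c, s], [-s, c]])
                        A[:, [p, q]] = A[:, [p, q]] @ rot   # orthogonalize the pair
                        V[:, [p, q]] = V[:, [p, q]] @ rot   # accumulate rotations
            if converged:
                break
        sigma = np.linalg.norm(A, axis=0)      # singular values
        U = A / sigma                          # scale columns to unit norm
        return U, sigma, V

    A = np.array([[3.0, 1.0], [1.0, 3.0], [0.0, 2.0]])
    U, s, V = one_sided_jacobi_svd(A)
    print(np.allclose((U * s) @ V.T, A))       # True: A = U diag(s) V^T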

47

Method and structure for skewed block-cyclic distribution of lower-dimensional data arrays in higher-dimensional processor grids  

DOEpatents

A method and structure of distributing elements of an array of data in a computer memory to a specific processor of a multi-dimensional mesh of parallel processors includes designating a distribution of elements of at least a portion of the array to be executed by specific processors in the multi-dimensional mesh of parallel processors. The pattern of the designating includes a cyclical repetitive pattern of the parallel processor mesh, as modified to have a skew in at least one dimension so that both a row of data in the array and a column of data in the array map to respective contiguous groupings of the processors such that a dimension of the contiguous groupings is greater than one.

Chatterjee, Siddhartha (Yorktown Heights, NY); Gunnels, John A. (Brewster, NY)

2011-11-08
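
To make the skewed block-cyclic idea concrete, the sketch below uses a plain cyclic assignment of matrix elements to a P × Q processor grid, with the processor-row index skewed by the column block so that a matrix row spreads over a two-dimensional patch of processors instead of a single processor row. The exact skew of the patented scheme (which achieves this for both rows and columns) is not reproduced here; the formula is an illustrative variant.

    def skewed_owner(i, j, P, Q):
        """Cyclic assignment of element (i, j) to a P x Q processor grid, with the
        processor-row index skewed by the column block index j // Q."""
        p = (i + j // Q) % P      # skew applied in the first grid dimension
        q = j % Q
        return p, q

    P, Q, N = 4, 4, 8
    row0 = {skewed_owner(0, j, P, Q) for j in range(N)}   # owners of matrix row 0
    col0 = {skewed_owner(i, 0, P, Q) for i in range(N)}   # owners of matrix column 0
    print(sorted(row0))   # row 0 now covers a 2 x 4 patch of processors
    print(sorted(col0))   # column 0 still maps to one processor column here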

48

Global synchronization of parallel processors using clock pulse width modulation  

DOEpatents

A circuit generates a global clock signal with a pulse width modification to synchronize processors in a parallel computing system. The circuit may include a hardware module and a clock splitter. The hardware module may generate a clock signal and performs a pulse width modification on the clock signal. The pulse width modification changes a pulse width within a clock period in the clock signal. The clock splitter may distribute the pulse width modified clock signal to a plurality of processors in the parallel computing system.

Chen, Dong; Ellavsky, Matthew R.; Franke, Ross L.; Gara, Alan; Gooding, Thomas M.; Haring, Rudolf A.; Jeanson, Mark J.; Kopcsay, Gerard V.; Liebsch, Thomas A.; Littrell, Daniel; Ohmacht, Martin; Reed, Don D.; Schenck, Brandon E.; Swetz, Richard A.

2013-04-02

49

Mesh-connected processor arrays for the transitive closure problem  

NASA Technical Reports Server (NTRS)

The main purpose of this paper is to lay a theoretical foundation for the design of mesh-connected processor arrays for the transitive closure problem. Using a simple path-algebraic formulation of the problem and observing its similarity to certain well-known smoothing problems that occur in digital signal processing, it is shown how to draw upon existing techniques from the signal processing literature to derive regular iterative algorithms for determining the transitive closure of the graph. The regular iterative algorithms that are derived using these considerations are then analyzed and synthesized on mesh-connected processor arrays. Among the vast number of mesh-connected processor arrays that can be designed using this unified approach, the systolic arrays reported in the literature for this problem are shown to be special cases.

Rao, S. K.; Citron, T.; Kailath, T.

1985-01-01
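
As a compact serial reference for the problem the record addresses, the sketch below computes the transitive closure with the classic Warshall iteration rather than the paper's systolic formulation; it is the kind of known-correct baseline a mesh-connected array design can be checked against. The example graph is illustrative.

    def transitive_closure(adj):
        """Warshall's algorithm: reach[i][j] becomes True if any path i -> j exists."""
        n = len(adj)
        reach = [[bool(adj[i][j]) or i == j for j in range(n)] for i in range(n)]
        for k in range(n):                     # allow paths passing through vertex k
            for i in range(n):
                if reach[i][k]:
                    for j in range(n):
                        if reach[k][j]:
                            reach[i][j] = True
        return reach

    adj = [[0, 1, 0, 0],
           [0, 0, 1, 0],
           [0, 0, 0, 1],
           [0, 0, 0, 0]]                       # a simple chain 0 -> 1 -> 2 -> 3
    for row in transitive_closure(adj):
        print([int(v) for v in row])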

50

Adaptive domain decomposition for Monte Carlo simulations on parallel processors  

NASA Technical Reports Server (NTRS)

A method is described for performing direct simulation Monte Carlo (DSMC) calculations on parallel processors using adaptive domain decomposition to distribute the computational work load. The method has been implemented on a commercially available hypercube and benchmark results are presented which show the performance of the method relative to current supercomputers. The problems studied were simulations of equilibrium conditions in a closed, stationary box, a two-dimensional vortex flow, and the hypersonic, rarefied flow in a two-dimensional channel. For these problems, the parallel DSMC method ran 5 to 13 times faster than on a single processor of a Cray-2. The adaptive decomposition method worked well in uniformly distributing the computational work over an arbitrary number of processors and reduced the average computational time by over a factor of two in certain cases.

Wilmoth, Richard G.

1990-01-01

51

Adaptive domain decomposition for Monte Carlo simulations on parallel processors  

NASA Technical Reports Server (NTRS)

A method is described for performing direct simulation Monte Carlo (DSMC) calculations on parallel processors using adaptive domain decomposition to distribute the computational work load. The method has been implemented on a commercially available hypercube and benchmark results are presented which show the performance of the method relative to current supercomputers. The problems studied were simulations of equilibrium conditions in a closed, stationary box, a two-dimensional vortex flow, and the hypersonic, rarefied flow in a two-dimensional channel. For these problems, the parallel DSMC method ran 5 to 13 times faster than on a single processor of a Cray-2. The adaptive decomposition method worked well in uniformly distributing the computational work over an arbitrary number of processors and reduced the average computational time by over a factor of two in certain cases.

Wilmoth, Richard G.

1991-01-01

52

Parallel Information Extraction on Shared Memory Multi-processor System  

Microsoft Academic Search

Text mining is one of the best solutions to today's and the future's information explosion. With the development of modern processor technologies, it will be a mass-market desktop application in the many-core era. In a text mining system, information extraction is a representative module and is the most compute-intensive part. In this paper, we study the performance of parallel

Jiulong Shan; Yurong Chen; Qian Diao; Yimin Zhang

2006-01-01

53

Real-time trajectory optimization on parallel processors  

NASA Technical Reports Server (NTRS)

A parallel algorithm has been developed for rapidly solving trajectory optimization problems. The goal of the work has been to develop an algorithm that is suitable for real-time, on-line optimal guidance through repeated solution of a trajectory optimization problem. The algorithm has been developed on an INTEL iPSC/860 message passing parallel processor. It uses a zero-order-hold discretization of a continuous-time problem and solves the resulting nonlinear programming problem using a custom-designed augmented Lagrangian nonlinear programming algorithm. The algorithm achieves parallelism of function, derivative, and search direction calculations through the principle of domain decomposition applied along the time axis. It has been encoded and tested on 3 example problems: the Goddard problem, the acceleration-limited planar minimum-time-to-the-origin problem, and a National Aerospace Plane minimum-fuel ascent guidance problem. Execution times as fast as 118 sec of wall clock time have been achieved for a 128-stage Goddard problem solved on 32 processors. A 32-stage minimum-time problem has been solved in 151 sec on 32 processors. A 32-stage National Aerospace Plane problem required 2 hours when solved on 32 processors. A speed-up factor of 7.2 has been achieved by using 32 nodes instead of 1 node to solve a 64-stage Goddard problem.

Psiaki, Mark L.

1993-01-01

54

Staging memory for massively parallel processor  

NASA Technical Reports Server (NTRS)

The invention herein relates to a computer organization capable of rapidly processing extremely large volumes of data. A staging memory is provided having a main stager portion consisting of a large number of memory banks which are accessed in parallel to receive, store, and transfer data words simultaneously with each other. Substager portions interconnect with the main stager portion to match input and output data formats with the data format of the main stager portion. An address generator is coded for accessing the data banks for receiving or transferring the appropriate words. Input and output permutation networks arrange the lineal order of data into and out of the memory banks.

Batcher, Kenneth E. (Inventor)

1988-01-01

55

An optically clocked transistor array for dual serial-to-parallel and parallel-to-serial conversion of optical packets  

Microsoft Academic Search

We propose an optically clocked transistor array which performs both serial-to-parallel and parallel-to-serial conversion (time demux/mux) of incoming/outgoing packets, enabling a low cost, low power, compact optical label processor for asynchronous burst optical packets.

Ryohei Urata; R. Takahashi; T. Nakahara; K. Takahata; H. Suzuki

2005-01-01

56

Time synchronous dyadic wavelet processor array using surface acoustic wave devices  

Microsoft Academic Search

In this paper, we propose to implement a time synchronous dyadic wavelet processor array with surface acoustic wave devices. An arbitrary dyadic wavelet scale processor consists of a wavelet interdigital transducer (IDT) apodized by the envelope of a wavelet function and a uniform IDT. A dyadic wavelet processor array consists of a multiscale wavelet processor using a surface acoustic wave

Changbao Wen; Changchun Zhu

2006-01-01

57

A taxonomy of reconfiguration techniques for fault-tolerant processor arrays--  

SciTech Connect

The authors overview, characterize, and classify some typical reconfiguration schemes in light of a proposed taxonomy. This taxonomy can be used as a guide for future research in design and analysis of reconfiguration schemes. Studying how to evaluate fault-tolerant arrays and how to exploit application characteristics to achieve dependable computing are important complementary directions of research towards reliable processor-array design. A related research problem is that of functional reconfiguration, that is, learning how to configure the topology of a parallel system to implement a different function or run a different application. Important directions of research include how to apply or extend processor-array reconfiguration algorithms to other topologies and how to marry functional and fault-tolerance reconfiguration requirements and solutions. The Diogenes approach discussed in this article is a case where this goal is naturally achieved.

Chean, M. (Shell Development Co., Houston, TX (USA)); Fortes, J.A.B. (Purdue Univ., Lafayette, IN (USA))

1990-01-01

58

Parallel LVCSR Algorithm for Cellphone-Oriented Multicore Processors  

Microsoft Academic Search

A parallel large vocabulary continuous speech recognition (LVCSR) algorithm for cellphone-oriented multicore processors is proposed. We introduce an acoustic look-ahead and blockwise computation to our compact LVCSR algorithm, in order to distribute its computational load to multiple CPU cores. We implement the proposed LVCSR algorithm on an evaluation board of a cellphone-oriented three-CPU-core chip, and show real-time processing

S. Ishikawa; K. Yamabana; R. Isotani; A. Okumura

2006-01-01

59

On the application of Array Processors to symbol manipulation  

Microsoft Academic Search

In the past, general-purpose programs for symbol manipulation have been written for traditional von Neumann machine architectures. The design and implementation of a simple prototype symbol manipulation system for the ICL Distributed Array Processor (DAP) is described. The system is restricted to monovariate polynomials with single-precision integer coefficients. The algorithms and data structure are discussed and the design

R. Beardsworth

1981-01-01

60

High Performance Natural Language Processing on Semantic Network Array Processor  

Microsoft Academic Search

This paper describes a natural language processing system developed for the Semantic Network Array Processor (SNAP). The goal of our work is to develop a scalable and high-performance natural language processing system which utilizes the high degree of parallelism provided by the SNAP machine. We have implemented an experimental machine translation system as a central part of

Hiroaki Kitano; Dan I. Moldovan; Seungho Cha

1991-01-01

61

Optimal mapping of irregular finite element domains to parallel processors  

NASA Technical Reports Server (NTRS)

Mapping the solution domain of n finite elements into N subdomains that may be processed in parallel by N processors is optimal if the subdomain decomposition results in a well-balanced workload distribution among the processors. The problem is discussed in the context of irregular finite element domains as an important aspect of the efficient utilization of the capabilities of emerging multiprocessor computers. Finding the optimal mapping is an intractable combinatorial optimization problem, for which a satisfactory approximate solution is obtained here by analogy to a method used in statistical mechanics for simulating the annealing process in solids. The simulated annealing analogy and algorithm are described, and numerical results are given for mapping an irregular two-dimensional finite element domain containing a singularity onto the Hypercube computer.

Flower, J.; Otto, S.; Salama, M.

1987-01-01
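
A toy version of the simulated-annealing mapping described above: elements are assigned to processors, the cost penalizes load imbalance (a realistic mapping cost would also penalize communication across subdomain boundaries), and random reassignments are accepted with the Metropolis rule at a slowly decreasing temperature. The cost function, cooling schedule, and problem sizes are illustrative assumptions.

    import math
    import random

    def anneal_mapping(n_elems, n_procs, steps=5000, t0=1.0, cooling=0.999):
        """Simulated annealing for assigning elements to processors so that the
        per-processor load (element count here) is well balanced."""
        assign = [random.randrange(n_procs) for _ in range(n_elems)]

        def cost(a):
            loads = [0] * n_procs
            for p in a:
                loads[p] += 1
            mean = n_elems / n_procs
            return sum((load - mean) ** 2 for load in loads)   # imbalance penalty

        current, temp = cost(assign), t0
        for _ in range(steps):
            e = random.randrange(n_elems)
            old = assign[e]
            assign[e] = random.randrange(n_procs)
            new = cost(assign)
            # Metropolis rule: always accept improvements, occasionally accept
            # uphill moves so the search can escape local minima
            if new > current and random.random() >= math.exp((current - new) / temp):
                assign[e] = old                                # reject the move
            else:
                current = new
            temp *= cooling
        return assign, current

    random.seed(1)
    assignment, final_cost = anneal_mapping(n_elems=40, n_procs=4)
    print(final_cost)   # close to 0 when the workload is well balanced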

62

Analog parallel processor hardware for high speed pattern recognition  

NASA Technical Reports Server (NTRS)

A VLSI-based analog processor for fully parallel, associative, high-speed pattern matching is reported. The processor consists of two main components: an analog memory matrix for storage of a library of patterns, and a winner-take-all (WTA) circuit for selection of the stored pattern that best matches an input pattern. An inner product is generated between the input vector and each of the stored memories. The resulting values are applied to a WTA network for determination of the closest match. Patterns with up to 22 percent overlap are successfully classified with a WTA settling time of less than 10 µs. Applications such as star pattern recognition and mineral classification with bounded overlap patterns have been successfully demonstrated. This architecture has a potential for an overall pattern matching speed in excess of 10^9 bits per second for a large memory.

Daud, T.; Tawel, R.; Langenbacher, H.; Eberhardt, S. P.; Thakoor, A. P.

1990-01-01
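
A numerical stand-in for the associative matcher described above: inner products between the input vector and every stored pattern, followed by a winner-take-all selection, which the chip performs with an analog WTA circuit. The stored library and input vector are illustrative.

    import numpy as np

    def classify(library, x):
        """Associative matching: one inner product per stored pattern, then a
        winner-take-all pick of the best-matching pattern index."""
        scores = library @ x              # inner products, computed in parallel on-chip
        winner = int(np.argmax(scores))   # the analog WTA circuit performs this step
        return winner, scores

    library = np.array([[1, 1, 0, 0],
                        [0, 0, 1, 1],
                        [1, 0, 1, 0]], dtype=float)
    x = np.array([0.9, 0.8, 0.1, 0.0])    # noisy version of the first stored pattern
    print(classify(library, x))           # pattern 0 wins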

63

The language parallel Pascal and other aspects of the massively parallel processor  

NASA Technical Reports Server (NTRS)

A high level language for the Massively Parallel Processor (MPP) was designed. This language, called Parallel Pascal, is described in detail. A description of the language design, a description of the intermediate language, Parallel P-Code, and details for the MPP implementation are included. Formal descriptions of Parallel Pascal and Parallel P-Code are given. A compiler was developed which converts programs in Parallel Pascal into the intermediate Parallel P-Code language. The code generator to complete the compiler for the MPP is being developed independently. A Parallel Pascal to Pascal translator was also developed. The architecture design for a VLSI version of the MPP was completed with a description of fault tolerant interconnection networks. The memory arrangement aspects of the MPP are discussed and a survey of other high level languages is given.

Reeves, A. P.; Bruner, J. D.

1982-01-01

64

Optimal evaluation of array expressions on massively parallel machines  

NASA Technical Reports Server (NTRS)

We investigate the problem of evaluating FORTRAN 90 style array expressions on massively parallel distributed-memory machines. On such machines, an elementwise operation can be performed in constant time for arrays whose corresponding elements are in the same processor. If the arrays are not aligned in this manner, the cost of aligning them is part of the cost of evaluating the expression. The choice of where to perform the operation then affects this cost. We present algorithms based on dynamic programming to solve this problem efficiently for a wide variety of interconnection schemes, including multidimensional grids and rings, hypercubes, and fat-trees. We also consider expressions containing operations that change the shape of the arrays, and show that our approach extends naturally to handle this case.

Chatterjee, Siddhartha; Gilbert, John R.; Schreiber, Robert; Teng, Shang-Hua

1992-01-01

65

OPALS - Optical parallel array logic system  

Microsoft Academic Search

A new optical-digital computing system called OPALS (optical parallel array logic system) is presented. OPALS can execute various parallel neighborhood operations such as cellular logic as well as parallel logical operations for two-dimensional sampled objects. The system has the ability to perform iterative operations. OPALS is systemized, centering on the optical logic method using image coding and optical correlation techniques.

Jun Tanida; Yoshiki Ichioka

1986-01-01

66

Parallel processors and nonlinear structural dynamics algorithms and software  

NASA Technical Reports Server (NTRS)

Techniques are discussed for the implementation and improvement of vectorization and concurrency in nonlinear explicit structural finite element codes. In explicit integration methods, the computation of the element internal force vector consumes the bulk of the computer time. The program can be efficiently vectorized by subdividing the elements into blocks and executing all computations in vector mode. The structuring of elements into blocks also provides a convenient way to implement concurrency by creating tasks which can be assigned to available processors for evaluation. The techniques were implemented in a 3-D nonlinear program with one-point quadrature shell elements. Concurrency and vectorization were first implemented in a single time step version of the program. Techniques were developed to minimize processor idle time and to select the optimal vector length. A comparison of run times between the program executed in scalar, serial mode and the fully vectorized code executed concurrently using eight processors shows speed-ups of over 25. Conjugate gradient methods for solving nonlinear algebraic equations are also readily adapted to a parallel environment. A new technique for improving convergence properties of conjugate gradients in nonlinear problems is developed in conjunction with other techniques such as diagonal scaling. A significant reduction in the number of iterations required for convergence is shown for a statically loaded rigid bar suspended by three equally spaced springs.

Belytschko, Ted

1990-01-01

67

Bit-parallel arithmetic in a massively-parallel associative processor  

NASA Technical Reports Server (NTRS)

A simple but powerful new architecture based on a classical associative processor model is presented. Algorithms for performing the four basic arithmetic operations both for integer and floating point operands are described. For m-bit operands, the proposed architecture makes it possible to execute complex operations in O(m) cycles as opposed to O(m^2) for bit-serial machines. A word-parallel, bit-parallel, massively-parallel computing system can be constructed using this architecture with VLSI technology. The operation of this system is demonstrated for the fast Fourier transform and matrix multiplication.
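
A small bit-plane sketch of the word-parallel idea behind such machines is given below: N words are stored as m bit-planes, and a ripple-carry addition then touches each plane once, i.e. O(m) steps for the whole array of words at once. This is only a software illustration of bit-plane (word-parallel, bit-serial) arithmetic; the paper's architecture additionally parallelizes across the bits of each operand.

```python
"""
Bit-plane illustration of word-parallel arithmetic (a general illustration,
not the paper's architecture).  N unsigned m-bit words are stored as m
bit-planes; plane b is one integer whose i-th bit is bit b of word i.
"""

def to_planes(words, m):
    return [sum(((w >> b) & 1) << i for i, w in enumerate(words)) for b in range(m)]

def from_planes(planes, n):
    return [sum(((planes[b] >> i) & 1) << b for b in range(len(planes))) for i in range(n)]

def add_planes(a, b):
    """Elementwise add of two bit-plane arrays, one plane per step (mod 2**m)."""
    carry, out = 0, []
    for pa, pb in zip(a, b):
        out.append(pa ^ pb ^ carry)
        carry = (pa & pb) | (pa & carry) | (pb & carry)
    return out

m = 8
xs = [3, 200, 17, 99]
ys = [5, 55, 240, 1]
planes = add_planes(to_planes(xs, m), to_planes(ys, m))
print(from_planes(planes, len(xs)))   # [8, 255, 1, 100] (sums mod 2**m)
```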

Scherson, Isaac D.; Kramer, David A.; Alleyne, Brian D.

1992-01-01

68

Microlens array processor with programmable weight mask and direct optical input  

NASA Astrophysics Data System (ADS)

We present an optical feature extraction system with a microlens array processor. The system is suitable for online implementation of a variety of transforms such as the Walsh transform and DCT. Operating with incoherent light, our processor accepts direct optical input. Employing a sandwich-like architecture, we obtain a very compact design of the optical system. The key elements of the microlens array processor are a square array of 15 X 15 spherical microlenses on acrylic substrate and a spatial light modulator as transmissive mask. The light distribution behind the mask is imaged onto the pixels of a customized a-Si image sensor with adjustable gain. We obtain one output sample for each microlens image and its corresponding weight mask area as summation of the transmitted intensity within one sensor pixel. The resulting architecture is very compact and robust like a conventional camera lens while incorporating a high degree of parallelism. We successfully demonstrate a Walsh transform into the spatial frequency domain as well as the implementation of a discrete cosine transform with digitized gray values. We provide results showing the transformation performance for both synthetic image patterns and images of natural texture samples. The extracted frequency features are suitable for neural classification of the input image. Other transforms and correlations can be implemented in real-time allowing adaptive optical signal processing.

Schmid, Volker R.; Lueder, Ernst H.; Bader, Gerhard; Maier, Gert; Siegordner, Jochen

1999-03-01

69

Prototype Focal-Plane-Array Optoelectronic Image Processor  

NASA Technical Reports Server (NTRS)

Prototype very-large-scale integrated (VLSI) planar array of optoelectronic processing elements combines speed of optical input and output with flexibility of reconfiguration (programmability) of electronic processing medium. Basic concept of processor described in "Optical-Input, Optical-Output Morphological Processor" (NPO-18174). Performs binary operations on binary (black and white) images. Each processing element corresponds to one picture element of image and located at that picture element. Includes input-plane photodetector in form of parasitic phototransistor part of processing circuit. Output of each processing circuit used to modulate one picture element in output-plane liquid-crystal display device. Intended to implement morphological processing algorithms that transform image into set of features suitable for high-level processing; e.g., recognition.

Fang, Wai-Chi; Shaw, Timothy; Yu, Jeffrey

1995-01-01

70

An informal introduction to program transformation and parallel processors  

SciTech Connect

In the summer of 1992, I had the opportunity to participate in a Faculty Research Program at Argonne National Laboratory. I worked under Dr. Jim Boyle on a project transforming code written in pure functional Lisp to Fortran code to run on distributed-memory parallel processors. To perform this project, I had to learn three things: the transformation system, the basics of distributed-memory parallel machines, and the Lisp programming language. Each of these topics in computer science was unfamiliar to me as a mathematician, but I found that they (especially parallel processing) are greatly impacting many fields of mathematics and science. Since most mathematicians have some exposure to computers, but certainly are not computer scientists, I felt it was appropriate to write a paper summarizing my introduction to these areas and how they can fit together. This paper is not meant to be a full explanation of the topics, but an informal introduction for the "mathematical layman." I place myself in that category as well, since my previous use of computers was as a classroom demonstration tool.

Hopkins, K.W. [Southwest Baptist Univ., Bolivar, MO (United States)

1994-08-01

71

On program restructuring, scheduling, and communication for parallel processor systems  

SciTech Connect

This dissertation discusses several software and hardware aspects of program execution on large-scale, high-performance parallel processor systems. The issues covered are program restructuring, partitioning, scheduling and interprocessor communication, synchronization, and hardware design issues of specialized units. All this work was performed focusing on a single goal: to maximize program speedup, or equivalently, to minimize parallel execution time. Parafrase, a Fortran restructuring compiler was used to transform programs in a parallel form and conduct experiments. Two new program restructuring techniques are presented, loop coalescing and subscript blocking. Compile-time and run-time scheduling schemes are covered extensively. Depending on the program construct, these algorithms generate optimal or near-optimal schedules. For the case of arbitrarily nested hybrid loops, two optimal scheduling algorithms for dynamic and static scheduling are presented. Simulation results are given for a new dynamic scheduling algorithm. The performance of this algorithm is compared to that of self-scheduling. Techniques for program partitioning and minimization of interprocessor communication for idealized program models and for real Fortran programs are also discussed. The close relationship between scheduling, interprocessor communication, and synchronization becomes apparent at several points in this work. Finally, the impact of various types of overhead on program speedup and experimental results are presented. 69 refs., 74 figs., 14 tabs.
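
As a minimal example of one of the restructuring transformations named above, the snippet below shows loop coalescing: a doubly nested loop is flattened into a single loop whose iterations are easy to partition evenly among processors. The index-recovery scheme shown is the generic one, not necessarily the dissertation's exact formulation.

```python
"""
Tiny illustration of loop coalescing: an n-by-m nested iteration space is
flattened into a single loop of n*m iterations, which can then be divided
evenly among processors.
"""
n, m = 4, 6
a = [[0] * m for _ in range(n)]

# Original nested loops:
#   for i in range(n):
#       for j in range(m):
#           a[i][j] = i * 10 + j

# Coalesced single loop; the original indices are recovered with divmod.
for k in range(n * m):
    i, j = divmod(k, m)
    a[i][j] = i * 10 + j

print(a[3][5])   # 35, same result as the nested version
```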

Polychronopoulos, Constantine D.

1986-08-01

72

Implementation of SAR interferometric map generation using parallel processors  

SciTech Connect

Interferometric fringe maps are generated by accurately registering a pair of complex SAR images of the same scene imaged from two very similar geometries, and calculating the phase difference between the two images by averaging over a neighborhood of pixels at each spatial location. The phase difference (fringe) map resulting from this IFSAR operation is then unwrapped and used to calculate the height estimate of the imaged terrain. Although the method used to calculate interferometric fringe maps is well known, it is generally executed in a post-processing mode well after the image pairs have been collected. In that mode of operation, there is little concern about algorithm speed and the method is normally implemented on a single processor machine. This paper describes how the interferometric map generation is implemented on a distributed-memory parallel processing machine. This particular implementation is designed to operate on a 16 node Power-PC platform and to generate interferometric maps in near real-time. The implementation is able to accommodate large translational offsets, along with a slight amount of rotation which may exist between the interferometric pair of images. If the number of pixels in the IFSAR image is large enough, the implementation accomplishes nearly linear speed-up times with the addition of processors.
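
A minimal serial sketch of the fringe-map operation described above follows: the per-pixel phase difference of the registered complex image pair is obtained from the complex cross product averaged over a small neighbourhood (here with scipy's uniform_filter). The window size, the synthetic test images, and the box-filter averaging are illustrative assumptions; the paper's contribution is distributing this computation over a 16-node machine in near real time.

```python
"""
Serial sketch of interferometric fringe-map formation from a registered pair
of complex SAR images (illustration only; the paper distributes this work
across a 16-node machine).  Window size and test data are arbitrary.
"""
import numpy as np
from scipy.ndimage import uniform_filter

def fringe_map(img_a, img_b, win=5):
    cross = img_a * np.conj(img_b)               # per-pixel phase difference
    # Neighbourhood averaging of the complex cross product (box filter).
    smoothed = uniform_filter(cross.real, win) + 1j * uniform_filter(cross.imag, win)
    return np.angle(smoothed)                    # wrapped fringe phase

# Synthetic example: the second image carries an extra linear phase ramp.
rng = np.random.default_rng(0)
base = rng.standard_normal((256, 256)) + 1j * rng.standard_normal((256, 256))
ramp = np.exp(1j * 0.05 * np.arange(256))[None, :]
fringes = fringe_map(base, base * ramp)
print(fringes.shape, float(fringes.min()), float(fringes.max()))
```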

Doren, N.; Wahl, D.E.

1998-07-01

73

Serial multiplier arrays for parallel computation  

NASA Technical Reports Server (NTRS)

Arrays of systolic serial-parallel multiplier elements are proposed as an alternative to conventional SIMD mesh serial adder arrays for applications that are multiplication intensive and require few stored operands. The design and operation of a number of multiplier and array configurations featuring locality of connection, modularity, and regularity of structure are discussed. A design methodology combining top-down and bottom-up techniques is described to facilitate development of custom high-performance CMOS multiplier element arrays as well as rapid synthesis of simulation models and semicustom prototype CMOS components. Finally, a differential version of NORA dynamic circuits requiring a single-phase uncomplemented clock signal is introduced for this application.

Winters, Kel

1990-01-01

74

Smart-Pixel Array Processors Based on Optimal Cellular Neural Networks for Space Sensor Applications  

NASA Technical Reports Server (NTRS)

A smart-pixel cellular neural network (CNN) with hardware annealing capability, digitally programmable synaptic weights, and multisensor parallel interface has been under development for advanced space sensor applications. The smart-pixel CNN architecture is a programmable multi-dimensional array of optoelectronic neurons which are locally connected to neighboring neurons and to associated active-pixel sensors. Integration of the neuroprocessor in each processor node of a scalable multiprocessor system offers orders-of-magnitude computing performance enhancements for on-board real-time intelligent multisensor processing and control tasks of advanced small satellites. The smart-pixel CNN operation theory, architecture, design and implementation, and system applications are investigated in detail. The VLSI (Very Large Scale Integration) implementation feasibility was illustrated by a prototype smart-pixel 5x5 neuroprocessor array chip of active dimensions 1380 micron x 746 micron in a 2-micron CMOS technology.

Fang, Wai-Chi; Sheu, Bing J.; Venus, Holger; Sandau, Rainer

1997-01-01

75

Massively parallel processor networks with optical express channels  

DOEpatents

An optical method for separating and routing local and express channel data comprises interconnecting the nodes in a network with fiber optic cables. A single fiber optic cable carries both express channel traffic and local channel traffic, e.g., in a massively parallel processor (MPP) network. Express channel traffic is placed on, or filtered from, the fiber optic cable at a light frequency or a color different from that of the local channel traffic. The express channel traffic is thus placed on a light carrier that skips over the local intermediate nodes one-by-one by reflecting off of selective mirrors placed at each local node. The local-channel-traffic light carriers pass through the selective mirrors and are not reflected. A single fiber optic cable can thus be threaded throughout a three-dimensional matrix of nodes with the x,y,z directions of propagation encoded by the color of the respective light carriers for both local and express channel traffic. Thus frequency division multiple access is used to hierarchically separate the local and express channels to eliminate the bucket brigade latencies that would otherwise result if the express traffic had to hop between every local node to reach its ultimate destination.

Deri, Robert J. (Pleasanton, CA); Brooks, III, Eugene D. (Livermore, CA); Haigh, Ronald E. (Tracy, CA); DeGroot, Anthony J. (Castro Valley, CA)

1999-01-01

76

Massively parallel processor networks with optical express channels  

DOEpatents

An optical method for separating and routing local and express channel data comprises interconnecting the nodes in a network with fiber optic cables. A single fiber optic cable carries both express channel traffic and local channel traffic, e.g., in a massively parallel processor (MPP) network. Express channel traffic is placed on, or filtered from, the fiber optic cable at a light frequency or a color different from that of the local channel traffic. The express channel traffic is thus placed on a light carrier that skips over the local intermediate nodes one-by-one by reflecting off of selective mirrors placed at each local node. The local-channel-traffic light carriers pass through the selective mirrors and are not reflected. A single fiber optic cable can thus be threaded throughout a three-dimensional matrix of nodes with the x,y,z directions of propagation encoded by the color of the respective light carriers for both local and express channel traffic. Thus frequency division multiple access is used to hierarchically separate the local and express channels to eliminate the bucket brigade latencies that would otherwise result if the express traffic had to hop between every local node to reach its ultimate destination. 3 figs.

Deri, R.J.; Brooks, E.D. III; Haigh, R.E.; DeGroot, A.J.

1999-08-24

77

On nonlinear finite element analysis in single-, multi- and parallel-processors  

NASA Technical Reports Server (NTRS)

Numerical solution of nonlinear equilibrium problems of structures by means of Newton-Raphson type iterations is reviewed. Each step of the iteration is shown to correspond to the solution of a linear problem, therefore the feasibility of the finite element method for nonlinear analysis is established. Organization and flow of data for various types of digital computers, such as single-processor/single-level memory, single-processor/two-level-memory, vector-processor/two-level-memory, and parallel-processors, with and without sub-structuring (i.e. partitioning) are given. The effect of the relative costs of computation, memory and data transfer on substructuring is shown. The idea of assigning comparable size substructures to parallel processors is exploited. Under Cholesky type factorization schemes, the efficiency of parallel processing is shown to decrease due to the occasional shared data, just as that due to the shared facilities.
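
The central point of the abstract, that each Newton-Raphson step reduces to a linear solve, can be seen in the following small sketch for an arbitrary nonlinear algebraic system r(u) = 0 with tangent matrix K(u) = dr/du. The example system is invented purely for illustration and has nothing to do with the paper's finite element formulation.

```python
"""
Newton-Raphson iteration for a small nonlinear system: every step solves a
linear problem K(u) du = -r(u), which is what makes the finite element
machinery for linear analysis reusable in the nonlinear case.
"""
import numpy as np

def residual(u):
    return np.array([u[0] + 0.1 * u[0]**3 - 1.0,
                     u[1] + u[0] * u[1] - 2.0])

def tangent(u):
    return np.array([[1.0 + 0.3 * u[0]**2, 0.0],
                     [u[1],                1.0 + u[0]]])

u = np.zeros(2)
for it in range(20):
    r = residual(u)
    if np.linalg.norm(r) < 1e-10:
        break
    du = np.linalg.solve(tangent(u), -r)   # the linear problem of this step
    u += du
print("iterations:", it, "solution:", u, "residual:", residual(u))
```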

Utku, S.; Melosh, R.; Islam, M.; Salama, M.

1982-01-01

78

Parallelizing a micromagnetic program for use on multi-processor shared memory computers  

E-print Network

Parallelization of a finite difference micromagnetic program on shared-memory computer systems is described; the aim is to speed up these computations by parallelizing the code. Most prior parallelization work of this kind has been done on finite element codes [1]-[3].

Donahue, Michael J.

79

Integrated services digital network controller architecture based on parallel reconfigurable processor  

Microsoft Academic Search

In this work a new SH architecture for an ISDN controller is proposed. The architecture of the proposed controller is based on a parallel reconfigurable processor. The main advantages of the proposed solution are highlighted. Data link and network level protocols are implemented in software on general-purpose processors. Physical and data link levels are implemented as standards. The physical level utilizes typical

A. Melnyk; A. Salo

2004-01-01

80

Application of the hypercube parallel processor to a large-scale moment method code  

NASA Technical Reports Server (NTRS)

The applicability of a parallel computing architecture to the solution of a large-scale moment-method code is investigated. Specifically, the NEC (Numerical Electromagnetics Code) method-of-moments scattering program is implemented on a hypercube parallel processor. The accuracy and the increase in the speed of execution on this parallel architecture are demonstrated. The results show a very large reduction in execution time for large problems. The great potential of this parallel processor is shown for interactive solution of large NEC problems as well as other moment-method techniques such as the finite-element method.

Manshadi, Farzin; Liewer, Paulet C.; Patterson, Jean E.

1988-01-01

81

Exploiting Thread-Level Parallelism on Simultaneous Multithreaded Processors

E-print Network

An SMT processor is compared to both wide-issue superscalars and single-chip multiprocessors. Several experiments demonstrate that an SMT achieves a 54% performance edge over a single-chip multiprocessor

Anderson, Richard

82

Parallel Information Transfer in a Multinode Quantum Information Processor  

E-print Network

We describe a method for coupling disjoint quantum bits (qubits) in different local processing nodes of a distributed node quantum information processor. An effective channel for information transfer between nodes is ...

Borneman, Troy William

83

Parallel/Series-Fed Microstrip Array Antenna  

NASA Technical Reports Server (NTRS)

Characteristics include low cross-polarization and high efficiency. Microstrip array antenna fabricated on two rectangular dielectric substrates. Produces fan-shaped beam polarized parallel to its short axis. Mounted conformally on outside surface of aircraft for use in synthetic-aperture radar. Other antennas of similar design mounted on roofs or sides of buildings, ships, or land vehicles for use in radar or communications.

Huang, John

1994-01-01

84

Free programmable smart pixel processor element arrays for standard functions  

NASA Astrophysics Data System (ADS)

The major problems in current VLSI design are restrictions on both the number of available pins and the off-chip communication speed. The ongoing increase in the integration density of VLSI chips keeps these problems alive and further increases the difficulties. For physical reasons, off-chip communication speeds in the same range as on-chip communication are very difficult to achieve. Optoelectronic 3D circuits based on smart pixel technologies offer a solution in principle to the problems mentioned above. We believe that flexible, usable smart pixel circuits are very important for the success of optoelectronic computing. Hence, we present an architecture design for programmable smart pixels. Our approach combines the functional flexibility of FPGAs with the advantages of optoelectronics, which provides fast and highly dense optical interconnections. Moreover, this combination allows the design of various 3D processor element architectures by changing logical behavior and topology, making routing simpler and offering higher data throughput. After an overview of existing solutions, we demonstrate a hardware approach for an ALU based on a 3D free programmable SPPE array for the fast calculation of standard functions, e.g. exp, sin, cos, ...; furthermore, we specify hardware-relevant parameters.

Kasche, Bernd; Fey, Dietmar

1996-12-01

85

Performance of a broadband beam space antenna array processor with improved interference beam  

Microsoft Academic Search

A beam space antenna array processor is a two-stage system where the weighted sum of beams formed at the first stage is used to produce the system output. The paper studies the performance of a two-beam processor in the presence of broadband directional sources by using an improved beam formed at the first stage. For narrowband signals this

Lal C Godara; Presila Israt

2010-01-01

86

Parallel signal processing  

NASA Astrophysics Data System (ADS)

The potential application of parallel computing techniques to digital signal processing for radar is discussed and two types of regular array processor are discussed. The first type of processor is the systolic or wavefront processor. The application of this type of processor to adaptive beamforming is discussed and the joint STL-RSRE adaptive antenna processor test-bed is reviewed. The second type of regular array processor is the SIMD parallel computer. One such processor, the Mil-DAP, is described, and its application to a varied range of radar signal processing tasks is discussed.

McWhirter, John G.

1989-12-01

87

On fault-tolerant structure, distributed fault-diagnosis, reconfiguration, and recovery of the array processors  

SciTech Connect

The increasing need for the design of high-performance computers has led to the design of special purpose computers such as array processors. This paper studies the design of fault-tolerant array processors. First, it is shown how hardware redundancy can be employed in the existing structures in order to make them capable of withstanding the failure of some of the array links and processors. Then distributed fault-tolerance schemes are introduced for the diagnosis of the faulty elements, reconfiguration, and recovery of the array. Fault tolerance is maintained by the cooperation of processors in a decentralized form of control without the participation of any type of hardcore or fault-free central controller such as a host computer.

Hosseini, S.H.

1989-07-01

88

On the Efficacy of a Fused CPU+GPU Processor (or APU) for Parallel Computing  

E-print Network

The graphics processing unit (GPU) has made significant strides as an accelerator in parallel computing. The fused CPU+GPU processor (or APU) combines CPU cores and programmable accelerator cores on the same silicon die. We characterize its performance via a set of micro-benchmarks

Virginia Tech

89

Parallel calculation of multi-electrode array correlation networks.  

PubMed

When calculating correlation networks from multi-electrode array (MEA) data, one works with extensive computations. Unfortunately, as the MEAs grow bigger, the time needed for the computation grows even more: calculating pair-wise correlations for current 60 channel systems can take hours on normal commodity computers whereas for future 1000 channel systems it would take almost 280 times as long, given that the number of pairs increases with the square of the number of channels. Even taking into account the increase of speed in processors, soon it can be unfeasible to compute correlations in a single computer. Parallel computing is a way to sustain reasonable calculation times in the future. We provide a general tool for rapid computation of correlation networks which was tested for: (a) a single computer cluster with 16 cores, (b) the Newcastle Condor System utilizing idle processors of university computers and (c) the inter-cluster, with 192 cores. Our reusable tool provides a simple interface for neuroscientists, automating data partition and job submission, and also allowing coding in any programming language. It is also sufficiently flexible to be used in other high-performance computing environments. PMID:19666054
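
The pair-wise correlation workload described above is embarrassingly parallel over channel pairs. The sketch below farms the pairs out to local worker processes with Python's multiprocessing; the channel count, synthetic data, and chunking are illustrative assumptions, and the published tool targets clusters and Condor pools rather than a single machine.

```python
"""
Pair-wise Pearson correlations between MEA channels computed in parallel
over channel pairs (local multiprocessing stand-in for the cluster/Condor
setups described in the abstract).
"""
import numpy as np
from itertools import combinations
from multiprocessing import Pool

N_CHANNELS, N_SAMPLES = 60, 10_000
rng = np.random.default_rng(1)
data = rng.standard_normal((N_CHANNELS, N_SAMPLES))   # stand-in recordings

def corr_of_pair(pair):
    i, j = pair
    return i, j, float(np.corrcoef(data[i], data[j])[0, 1])

if __name__ == "__main__":
    pairs = list(combinations(range(N_CHANNELS), 2))   # grows as channels**2 / 2
    with Pool(processes=4) as pool:
        results = pool.map(corr_of_pair, pairs, chunksize=64)
    corr = np.eye(N_CHANNELS)
    for i, j, c in results:
        corr[i, j] = corr[j, i] = c
    print("computed", len(pairs), "pairs; corr[0, 1] =", corr[0, 1])
```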

Ribeiro, Pedro; Simonotto, Jennifer; Kaiser, Marcus; Silva, Fernando

2009-11-15

90

Time and Parallel Processor Bounds for Fortran-Like Loops  

Microsoft Academic Search

The main goal of this paper is to show that a large number of processors can be used effectively to speed up simple Fortran-like loops consisting of assignment statements. A practical method is given by which one can check whether or not a statement is dependent upon another. The dependence structure of the whole loop may be of different types.
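
One simple, well-known check of the kind mentioned in the abstract is the GCD test for loop-carried dependence between two array references; it is shown below purely as an illustration (the paper's own dependence criteria are more refined).

```python
"""
GCD dependence test for references a[x*i + b1] (write) and a[y*j + b2] (read)
inside a loop: if gcd(x, y) does not divide (b2 - b1), the subscript equation
has no integer solution, so the statements are independent.
"""
from math import gcd

def gcd_test(x, b1, y, b2):
    """Return True if a dependence is *possible*, False if it is ruled out."""
    return (b2 - b1) % gcd(x, y) == 0

# for i: a[2*i] = ...;  ... = a[2*i + 1]   -> subscripts never collide
print(gcd_test(2, 0, 2, 1))    # False: independent, iterations can run in parallel
# for i: a[2*i] = ...;  ... = a[4*i + 2]   -> collision cannot be ruled out
print(gcd_test(2, 0, 4, 2))    # True: a dependence may exist
```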

Utpal Banerjee; Shyh-ching Chen; David J. Kuck; Ross A. Towle

1979-01-01

91

Real-time tracking with a 3D-Flow processor array  

SciTech Connect

Real-time track finding has to date been performed with CAMs (content addressable memories) or with fast coincidence logic, because a processing-based scheme was thought to be much slower. Advances in technology, together with a new architectural approach, make it feasible to explore a computing technique for real-time track finding as well, giving the advantage over the CAM approach of implementing algorithms that can extract more parameters, such as the sagitta, curvature, pt, etc. The report describes real-time track finding using a new computing approach based on the 3D-Flow array processor system. This system consists of a fixed interconnection architecture scheme, allowing flexible algorithm implementation on a scalable platform. The 3D-Flow parallel processing system for track finding is scalable in size and performance by increasing the number of processors, the processor speed, or the number of pipelined stages. The present article describes the conceptual idea and the design stage of the project.

Crosetto, D.

1993-06-01

92

Preliminary study on the potential usefulness of array processor techniques for structural synthesis  

NASA Technical Reports Server (NTRS)

The effects of the use of array processor techniques within the structural analyzer program SPAR are simulated in order to evaluate the potential analysis speedups which may result. In particular, the connection of a Floating Point Systems AP120 processor to the PRIME computer is discussed. Measurements of execution, input/output, and data transfer times are given. Using these data, estimates are made of the relative speedups that could be achieved in a more complete implementation on an array processor maxi-mini computer system.

Feeser, L. J.

1980-01-01

93

Evaluation of Multicore Processors for Embedded Systems by Parallel Benchmark Program Using OpenMP  

Microsoft Academic Search

Recently, multicore technology has been introduced to embedded systems in order to improve performance and reduce power consumption. In the present study, three SMP multicore processors for embedded systems and a multicore processor for a desktop PC are evaluated by the parallel benchmark using OpenMP. The results indicate that, even if the memory performance is low, applications that are not

Toshihiro Hanawa; Mitsuhisa Sato; Jinpil Lee; Takayuki Imada; Hideaki Kimura; Taisuke Boku

2009-01-01

94

"Iconic" tracking algorithms for high energy physics using the TRAX-I massively parallel processor

NASA Astrophysics Data System (ADS)

TRAX-I, a cost-effective parallel microcomputer, applying associative string processor (ASP) architecture with 16 K parallel processing elements, is being built by Aspex Microsystems Ltd. (UK). When applied to the tracking problem of very complex events with several hundred tracks, the large number of processors allows one to dedicate one or more processors to each wire (in MWPC), each pixel (in digitized images from streamer chambers or other visual detectors), or each pad (in TPC) to perform very efficient pattern recognition. Some linear tracking algorithms based on this "iconic" representation are presented.

Vesztergombi, G.

1989-12-01

95

Using algebra for massively parallel processor design and utilization  

NASA Technical Reports Server (NTRS)

This paper summarizes the author's advances in the design of dense processor networks. Within is reported a collection of recent constructions of dense symmetric networks that provide the largest known values for the number of nodes that can be placed in a network of a given degree and diameter. The constructions are in the range of current potential engineering significance and are based on groups of automorphisms of finite-dimensional vector spaces.

Campbell, Lowell; Fellows, Michael R.

1990-01-01

96

Highly parallel reconfigurable computer architecture for robotic computation having plural processor cells each having right and left ensembles of plural processors  

NASA Technical Reports Server (NTRS)

In a computer having a large number of single-instruction multiple data (SIMD) processors, each of the SIMD processors has two sets of three individual processor elements controlled by a master control unit and interconnected among a plurality of register file units where data is stored. The register files input and output data in synchronism with a minor cycle clock under control of two slave control units controlling the register file units connected to respective ones of the two sets of processor elements. Depending upon which ones of the register file units are enabled to store or transmit data during a particular minor clock cycle, the processor elements within an SIMD processor are connected in rings or in pipeline arrays, and may exchange data with the internal bus or with neighboring SIMD processors through interface units controlled by respective ones of the two slave control units.

Fijany, Amir (inventor); Bejczy, Antal K. (inventor)

1994-01-01

97

Architecture of a VLSI cellular processor array for synchronous/asynchronous image processing  

E-print Network

of synchronisation. Moreover, it reduces the power consumption, since fewer instruction-cycles per image frame are required. Each processing cell of the Asynchronous-Synchronous Processor Array (ASPA) has universal synchronous digital circuitry; the array resembles a SIMD array, since every processing cell executes the same instruction. However, it is possible

Dudek, Piotr

98

High-speed Systolic Array Processor (HISSAP) system development synopsis: Lesson learned. Final report, Oct 83-Oct 90  

SciTech Connect

This report documents the design rationale of the High Speed Systolic Array Processor (HiSSAP) testbed. In addition to reviewing general parallel processing topics, the impact of the HiSSAP testbed architecture on the top level design of the diagnostic and software mapping tools is described. Based on the experience gained in the mapping of matrix-based algorithms on the testbed hardware, specific recommendations are presented in the form of lessons learned, which are intended to offer guidance in the development of future Navy signal processing systems.

Loughlin, J.P.

1991-05-01

99

Redundant disk arrays: Reliable, parallel secondary storage. Ph.D. Thesis  

NASA Technical Reports Server (NTRS)

During the past decade, advances in processor and memory technology have given rise to increases in computational performance that far outstrip increases in the performance of secondary storage technology. Coupled with emerging small-disk technology, disk arrays can provide the cost, volume, and capacity of current disk subsystems and, by leveraging parallelism, many times their performance. Unfortunately, arrays of small disks may have much higher failure rates than the single large disks they replace. Redundant arrays of inexpensive disks (RAID) use simple redundancy schemes to provide high data reliability. The data encoding, performance, and reliability of redundant disk arrays are investigated. Organizing redundant data into a disk array is treated as a coding problem. Among alternatives examined, codes as simple as parity are shown to effectively correct single, self-identifying disk failures.
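
The parity observation in this abstract can be made concrete with a few lines of code: with one parity block per stripe, any single self-identifying disk failure is reconstructed by XOR-ing the survivors. The stripe size and disk count below are arbitrary illustrative choices, not the dissertation's configurations.

```python
"""
Worked example of single-failure recovery with a parity disk: the lost block
is the XOR of all surviving data blocks and the parity block.
"""
import os

NUM_DATA, STRIPE = 4, 16
data_disks = [bytearray(os.urandom(STRIPE)) for _ in range(NUM_DATA)]

def xor_blocks(blocks):
    out = bytearray(STRIPE)
    for blk in blocks:
        for k in range(STRIPE):
            out[k] ^= blk[k]
    return out

parity = xor_blocks(data_disks)          # written to the parity disk

failed = 2                               # the failure is self-identifying
survivors = [d for i, d in enumerate(data_disks) if i != failed] + [parity]
rebuilt = xor_blocks(survivors)          # XOR of survivors recovers the lost disk
print(rebuilt == data_disks[failed])     # True
```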

Gibson, Garth Alan

1990-01-01

100

Aligning parallel arrays to reduce communication  

NASA Technical Reports Server (NTRS)

Axis and stride alignment is an important optimization in compiling data-parallel programs for distributed-memory machines. We previously developed an optimal algorithm for aligning array expressions. Here, we examine alignment for more general program graphs. We show that optimal alignment is NP-complete in this setting, so we study heuristic methods. This paper makes two contributions. First, we show how local graph transformations can reduce the size of the problem significantly without changing the best solution. This allows more complex and effective heuristics to be used. Second, we give a heuristic that can explore the space of possible solutions in a number of ways. We show that some of these strategies can give better solutions than a simple greedy approach proposed earlier. Our algorithms have been implemented; we present experimental results showing their effect on the performance of some example programs running on the CM-5.

Sheffler, Thomas J.; Schreiber, Robert; Gilbert, John R.; Chatterjee, Siddhartha

1994-01-01

101

Array distribution in data-parallel programs  

NASA Technical Reports Server (NTRS)

We consider distribution at compile time of the array data in a distributed-memory implementation of a data-parallel program written in a language like Fortran 90. We allow dynamic redistribution of data and define a heuristic algorithmic framework that chooses distribution parameters to minimize an estimate of program completion time. We represent the program as an alignment-distribution graph. We propose a divide-and-conquer algorithm for distribution that initially assigns a common distribution to each node of the graph and successively refines this assignment, taking computation, realignment, and redistribution costs into account. We explain how to estimate the effect of distribution on computation cost and how to choose a candidate set of distributions. We present the results of an implementation of our algorithms on several test problems.

Chatterjee, Siddhartha; Gilbert, John R.; Schreiber, Robert; Sheffler, Thomas J.

1994-01-01

102

Parallel processors and nonlinear structural dynamics algorithms and software  

NASA Technical Reports Server (NTRS)

A nonlinear structural dynamics finite element program was developed to run on a shared memory multiprocessor with pipeline processors. The program, WHAMS, was used as a framework for this work. The program employs explicit time integration and has the capability to handle both the nonlinear material behavior and large displacement response of 3-D structures. The elasto-plastic material model uses an isotropic strain hardening law which is input as a piecewise linear function. Geometric nonlinearities are handled by a corotational formulation in which a coordinate system is embedded at the integration point of each element. Currently, the program has an element library consisting of a beam element based on Euler-Bernoulli theory and triangular and quadrilateral plate elements based on Mindlin theory.

Belytschko, Ted

1989-01-01

103

Parallel processors and nonlinear structural dynamics algorithms and software  

NASA Technical Reports Server (NTRS)

The adaptation of a finite element program with explicit time integration to a massively parallel SIMD (single instruction multiple data) computer, the CONNECTION Machine is described. The adaptation required the development of a new algorithm, called the exchange algorithm, in which all nodal variables are allocated to the element with an exchange of nodal forces at each time step. The architectural and C* programming language features of the CONNECTION Machine are also summarized. Various alternate data structures and associated algorithms for nonlinear finite element analysis are discussed and compared. Results are presented which demonstrate that the CONNECTION Machine is capable of outperforming the CRAY XMP/14.

Belytschko, Ted; Gilbertsen, Noreen D.; Neal, Mark O.; Plaskacz, Edward J.

1989-01-01

104

Series-parallel method of direct solar array regulation  

NASA Technical Reports Server (NTRS)

A 40 watt experimental solar array was directly regulated by shorting out appropriate combinations of series and parallel segments of a solar array. Regulation switches were employed to control the array at various set-point voltages between 25 and 40 volts. Regulation to within + or - 0.5 volt was obtained over a range of solar array temperatures and illumination levels as an active load was varied from open circuit to maximum available power. A fourfold reduction in regulation switch power dissipation was achieved with series-parallel regulation as compared to the usual series-only switching for direct solar array regulation.

Gooder, S. T.

1976-01-01

105

Software development on the High-Speed Systolic Array Processor (HISSAP): Lessons learned. Final report, Mar 88-Mar 91  

SciTech Connect

This report documents the lessons learned in programming the Naval Ocean System Center's (NOSC's) High-Speed Systolic Array Processor (HISSAP) testbed. The procedures used for code generation, along with the programming utilities provided in the software development environment, are discussed with regard to their impact on the efficient implementation of algorithms on a parallel processing system such as HISSAP. This information is intended for considerations pertaining to software-development environments in future Navy parallel processing systems. Many of HISSAP's software-development utilities played key roles in the implementation of two computationally intensive algorithms: the Multiple-Signal Classification algorithm (MUSIC) and a four-channel, narrowband, finite-impulse response (FIR) filter. The introduction of utilities not included with the HISSAP tools would undoubtedly have increased the speed and efficiency of software development.

Tirpak, F.M.

1991-06-01

106

Interconnection arrangement of routers of processor boards in array of cabinets supporting secure physical partition  

SciTech Connect

A multiple processor computing apparatus includes a physical interconnect structure that is flexibly configurable to support selective segregation of classified and unclassified users. The physical interconnect structure includes routers in service or compute processor boards distributed in an array of cabinets connected in series on each board and to respective routers in neighboring row cabinet boards with the routers in series connection coupled to routers in series connection in respective neighboring column cabinet boards. The array can include disconnect cabinets or respective routers in all boards in each cabinet connected in a toroid. The computing apparatus can include an emulator which permits applications from the same job to be launched on processors that use different operating systems.

Tomkins, James L. (Albuquerque, NM); Camp, William J. (Albuquerque, NM)

2007-07-17

107

Vision Sensor with a SIMD Processor Array in a Vertically Stacked 3D Integrated Circuit Technology  

E-print Network

The array executes according to a program that is broadcast to all PEs from a single controller (currently implemented off-chip). It consists of mixed-mode processing elements and operates in SIMD (Single Instruction Multiple Data) mode, providing an integrated cellular sensor/processor array. The processing element (PE) cells span three layers

Dudek, Piotr

108

The Blitzen array processor as second level trigger element: Evaluation of performance on CDF calorimeter data  

Microsoft Academic Search

The performance of the massively parallel Blitzen processor in real-time cluster finding is presented. The efficiency in energy cluster detection and the related energy measurement has been evaluated on a sample of calorimeter data collected by the CDF experiment at the Fermilab pp̄ collider. A comparison with the CDF off-line analysis is reported; also given are the algorithm execution times on

Giovanni Busetto; Cristina Carpanese; Donatella Pascoli; Elettra Siliotto

1995-01-01

109

Free-space optoelectronic data distributor for fully connected massively parallel processors  

NASA Astrophysics Data System (ADS)

A Code V simulation of a free-space optoelectronic data distributor, commonly known as the Kaleidoscope, is presented. The device is designed to interconnect processing elements in massively parallel processors, and the simulation results indicate that the device is capable of interconnecting 64 processing elements with a 64-bit word in a fully connected topology. The results further suggest that the number of processing elements, in the future improved version of the device, may be increased to 1024 and beyond.

Shimoji, Masao; Othman, Amar; Crosbie, Roy; Frietman, Edward E. E.

1998-05-01

110

Construction of a parallel processor for simulating manipulators and other mechanical systems  

NASA Technical Reports Server (NTRS)

This report summarizes the results of NASA Contract NAS5-30905, awarded under phase 2 of the SBIR Program, for a demonstration of the feasibility of a new high-speed parallel simulation processor, called the Real-Time Accelerator (RTA). The principal goals were met, and EAI is now proceeding with phase 3: development of a commercial product. This product is scheduled for commercial introduction in the second quarter of 1992.

Hannauer, George

1991-01-01

111

Q-plates micro-arrays for parallel processing of the photon orbital angular momentum  

NASA Astrophysics Data System (ADS)

We report on the realization of electrically tunable micro-arrays of space-variant optically anisotropic optical vortex generators. Each individual light orbital angular momentum processor consists of a microscopic self-engineered nematic liquid crystal q-plate made of a nonsingular topological defect spontaneously formed under electric field. Both structural and optical characterizations of the obtained spin-orbit optical interface are analyzed. An analytical model is derived and results of simulations are compared with experimental data. The application potential in terms of parallel processing of the optical orbital angular momentum is quantitatively discussed.

Loussert, Charles; Kushnir, Kateryna; Brasselet, Etienne

2014-09-01

112

Parallelization of the Ensemble Empirical Mode Decomposition (PEEMD) Method on Multi- and Many-core Processors

NASA Astrophysics Data System (ADS)

Cheung, S. (NASA Ames Research Center); Shen, B.-W. (UMCP/ESSIC); Mehrotra, P. (NASA Ames Research Center); Li, J.-L. F. (CalTech/JPL). The trend in high performance computing systems is towards clusters of multi-core nodes; from an 8 cores/node Intel Xeon Harpertown processor in 2008 to the latest Intel Xeon Ivy Bridge processor with 24 cores/node. In addition, hardware vendors are developing many-core coprocessors, such as NVIDIA's General Purpose Graphics Processing Unit (GPGPU) and Intel's Xeon Phi, in order to get around the constraints of power and frequency. The hybrid nature of such systems presents a major challenge for software developers in achieving the desired performance. Applications need to be constructed with multiple levels of parallelization along with hybrid communication regimes in order to exploit the power of such systems. The Ensemble Empirical Mode Decomposition (EEMD) method has been applied to signal processing on nonlinear and non-stationary data. Due to the ensemble nature of the algorithm and the geographical decomposition of the problem, we have developed a parallel version of the EEMD method with 4-level parallelization, from the grid decomposition level, to the time-series level and to the ensemble level, using MPI and OpenMP. The parallel EEMD (PEEMD) is being used to analyze Hurricane Sandy (2012) for better understanding of the multiple scale processes that may have impacted Sandy's movement, intensification and formation. In this presentation, we summarize our experiences with the implementation of the PEEMD, focusing on the programmability and usability of different processors and accelerators for multiscale analysis for Hurricane Sandy.
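
The ensemble-level parallelism described above can be sketched structurally as follows: each ensemble member adds an independent noise realization, decomposes its copy of the signal, and the member results are averaged. The decompose function here is a hypothetical stand-in (a crude two-band moving-average split), not a real EMD sifting routine, and the local process pool stands in for the MPI/OpenMP levels of the PEEMD.

```python
"""
Structural sketch of ensemble-level EEMD parallelism; `decompose` is only a
stand-in for a real EMD implementation.
"""
import numpy as np
from multiprocessing import Pool

t = np.linspace(0, 1, 2048)
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 40 * t)

def decompose(x, width=64):
    """Stand-in for EMD: split x into a slow trend and a fast residual."""
    kernel = np.ones(width) / width
    trend = np.convolve(x, kernel, mode="same")
    return np.stack([x - trend, trend])           # "IMF-like" components

def member(seed, noise_std=0.2):
    rng = np.random.default_rng(seed)
    return decompose(signal + noise_std * rng.standard_normal(signal.shape))

if __name__ == "__main__":
    n_members = 64                                # one task per ensemble member
    with Pool(4) as pool:
        members = pool.map(member, range(n_members))
    eemd_like = np.mean(members, axis=0)          # average over the ensemble
    print(eemd_like.shape)                        # (2, 2048)
```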

Cheung, S.; Shen, B.; Li, J. F.; Mehrotra, P.

2013-12-01

113

Evaluation of soft-core processors on a Xilinx Virtex-5 field programmable gate array.  

SciTech Connect

Node-based architecture (NBA) designs for future satellite projects hold the promise of decreasing system development time and costs, size, weight, and power and positioning the laboratory to address other emerging mission opportunities quickly. Reconfigurable field programmable gate array (FPGA)-based modules will comprise the core of several of the NBA nodes. Microprocessing capabilities will be necessary with varying degrees of mission-specific performance requirements on these nodes. To enable the flexibility of these reconfigurable nodes, it is advantageous to incorporate the microprocessor into the FPGA itself, either as a hard-core processor built into the FPGA or as a soft-core processor built out of FPGA elements. This document describes the evaluation of three reconfigurable FPGA-based soft-core processors for use in future NBA systems: the MicroBlaze (uB), the open-source Leon3, and the licensed Leon3. Two standard performance benchmark applications were developed for each processor. The first, Dhrystone, is a fixed-point operation metric. The second, Whetstone, is a floating-point operation metric. Several trials were run at varying code locations, loop counts, processor speeds, and cache configurations. FPGA resource utilization was recorded for each configuration.

Learn, Mark Walter

2011-04-01

114

Parallel Agent-Based Simulations on Clusters of GPUs and Multi-Core Processors  

SciTech Connect

An effective latency-hiding mechanism is presented in the parallelization of agent-based model simulations (ABMS) with millions of agents. The mechanism is designed to accommodate the hierarchical organization as well as heterogeneity of current state-of-the-art parallel computing platforms. We use it to explore the computation vs. communication trade-off continuum available with the deep computational and memory hierarchies of extant platforms and present a novel analytical model of the tradeoff. We describe our implementation and report preliminary performance results on two distinct parallel platforms suitable for ABMS: CUDA threads on multiple, networked graphical processing units (GPUs), and pthreads on multi-core processors. Message Passing Interface (MPI) is used for inter-GPU as well as inter-socket communication on a cluster of multiple GPUs and multi-core processors. Results indicate the benefits of our latency-hiding scheme, delivering as much as over 100-fold improvement in runtime for certain benchmark ABMS application scenarios with several million agents. This speed improvement is obtained on our system that is already two to three orders of magnitude faster on one GPU than an equivalent CPU-based execution in a popular simulator in Java. Thus, the overall execution of our current work is over four orders of magnitude faster when executed on multiple GPUs.

Aaby, Brandon G [ORNL; Perumalla, Kalyan S [ORNL; Seal, Sudip K [ORNL

2010-01-01

115

Constant Time Algorithms for the Transitive Closure and Some Related Graph Problems on Processor Arrays with Reconfigurable Bus Systems  

Microsoft Academic Search

The transitive closure problem in O(1) time is solved by a new method that is far different from the conventional solution method. On processor arrays with reconfigurable bus systems, two O(1) time algorithms are proposed for computing the transitive closure of an undirected graph. One is designed on a three-dimensional n*n*n processor array with a reconfigurable bus system, and the

Biing-feng Wang; Gen-huey Chen

1990-01-01

116

Block iterative restoration of astronomical images with the massively parallel processor  

NASA Technical Reports Server (NTRS)

A method is described for algebraic image restoration capable of treating astronomical images. For a typical 500 x 500 image, direct algebraic restoration would require the solution of a 250,000 x 250,000 linear system. The block iterative approach is used to reduce the problem to solving 4900 121 x 121 linear systems. The algorithm was implemented on the Goddard Massively Parallel Processor, which can solve a 121 x 121 system in approximately 0.06 seconds. Examples are shown of the results for various astronomical images.
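
The block-iterative idea, solving many small diagonal blocks repeatedly instead of one huge system directly, can be illustrated with a generic block Jacobi solver such as the sketch below. The block size of 121 mirrors the abstract, but the test matrix, convergence settings, and the plain Jacobi sweep are illustrative assumptions rather than the paper's restoration scheme or its MPP mapping.

```python
"""
Generic block Jacobi sketch: unknowns are split into blocks, each small
diagonal block is solved repeatedly while off-block terms are lagged.
"""
import numpy as np

def block_jacobi(A, b, block=121, iters=200):
    n = len(b)
    x = np.zeros(n)
    blocks = [slice(s, min(s + block, n)) for s in range(0, n, block)]
    for _ in range(iters):
        x_new = x.copy()
        for blk in blocks:
            # Right-hand side with off-block contributions lagged at x.
            r = b[blk] - A[blk, :] @ x + A[blk, blk] @ x[blk]
            x_new[blk] = np.linalg.solve(A[blk, blk], r)   # small local solve
        x = x_new
    return x

# Diagonally dominant test system (stands in for the restoration matrix).
rng = np.random.default_rng(0)
n = 484
A = rng.standard_normal((n, n)) * 0.01
A += np.diag(np.abs(A).sum(axis=1) + 1.0)
b = rng.standard_normal(n)
x = block_jacobi(A, b)
print("residual:", np.linalg.norm(A @ x - b))
```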

Heap, Sara R.; Lindler, Don J.

1987-01-01

117

Parallel arrays of Josephson junctions for submillimeter local oscillators  

NASA Technical Reports Server (NTRS)

In this paper we discuss the influence of the DC biasing circuit on operation of parallel biased quasioptical Josephson junction oscillator arrays. Because of nonuniform distribution of the DC biasing current along the length of the bias lines, there is a nonuniform distribution of magnetic flux in superconducting loops connecting every two junctions of the array. These DC self-field effects determine the state of the array. We present analysis and time-domain numerical simulations of these states for four biasing configurations. We find conditions for the in-phase states with maximum power output. We compare arrays with small and large inductances and determine the low inductance limit for nearly-in-phase array operation. We show how arrays can be steered in H-plane using the externally applied DC magnetic field.

Pance, Aleksandar; Wengler, Michael J.

1992-01-01

118

Animated computer graphics models of space and earth sciences data generated via the massively parallel processor  

NASA Technical Reports Server (NTRS)

The capability was developed of rapidly producing visual representations of large, complex, multi-dimensional space and earth sciences data sets via the implementation of computer graphics modeling techniques on the Massively Parallel Processor (MPP) by employing techniques recently developed for typically non-scientific applications. Such capabilities can provide a new and valuable tool for the understanding of complex scientific data, and a new application of parallel computing via the MPP. A prototype system with such capabilities was developed and integrated into the National Space Science Data Center's (NSSDC) Pilot Climate Data System (PCDS) data-independent environment for computer graphics data display to provide easy access to users. While developing these capabilities, several problems had to be solved independently of the actual use of the MPP, all of which are outlined.

Treinish, Lloyd A.; Gough, Michael L.; Wildenhain, W. David

1987-01-01

119

Basic data-base operations on the Butterfly Parallel Processor: experiment results. Memorandum report, January-December 1987  

SciTech Connect

The next phase in speeding up data-base queries will be through the use of highly parallel computers. This paper will discuss the basic data-base operations (select, project, natural join, and scaler aggregates) on a shared-memory multiple instruction stream, multiple data stream (MIMD) computer and the problems associated with implementing them. Some problems associated with getting maximum parallelization are improper data division and hot spots. Improper data division results when the number of tasks does not divide evenly among the processors. Hot spots or contentions occur due to locking if accesses are made to the same segment of a RAMFile and also if attempts are made to get data from the same remote processor at the same time. These algorithms have been implemented on the Butterfly Parallel Processor, and the results of our experiments are described in detail.
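
A minimal shared-nothing sketch of the select-plus-scalar-aggregate pattern discussed above follows: the relation is partitioned into chunks, each worker computes partial results, and the partials are combined. The relation, predicate, and worker count are illustrative assumptions; uneven chunk sizes are exactly the "improper data division" problem the abstract mentions.

```python
"""
Parallel select + scalar aggregate over a partitioned relation
(local multiprocessing illustration, not the Butterfly implementation).
"""
from multiprocessing import Pool

RELATION = [{"dept": i % 7, "salary": 1000 + (i * 37) % 500} for i in range(100_000)]

def partial(chunk):
    # select: dept == 3; scalar aggregates: count and sum of salary
    sel = [row["salary"] for row in chunk if row["dept"] == 3]
    return len(sel), sum(sel)

if __name__ == "__main__":
    n_workers = 4
    size = -(-len(RELATION) // n_workers)            # ceiling division
    chunks = [RELATION[k:k + size] for k in range(0, len(RELATION), size)]
    with Pool(n_workers) as pool:
        parts = pool.map(partial, chunks)
    count = sum(c for c, _ in parts)
    total = sum(s for _, s in parts)
    print("rows selected:", count, "average salary:", total / count)
```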

Rosenau, T.J.; Jajodia, S.

1988-03-04

120

Performance Evaluation and Modeling Techniques for Parallel Processors. Ph.D. Thesis  

NASA Technical Reports Server (NTRS)

In practice, the performance evaluation of supercomputers is still substantially driven by single-point estimates of metrics (e.g., MFLOPS) obtained by running characteristic benchmarks or workloads. With the rapid increase in the use of time-shared multiprogramming in these systems, such measurements are clearly inadequate. This is because multiprogramming and system overhead, as well as other degradations in performance due to time varying characteristics of workloads, are not taken into account. In multiprogrammed environments, multiple jobs and users can dramatically increase the amount of system overhead and degrade the performance of the machine. Performance techniques, such as benchmarking, which characterize performance on a dedicated machine ignore this major component of true computer performance. Due to the complexity of analysis, there has been little work done in analyzing, modeling, and predicting the performance of applications in multiprogrammed environments. This is especially true for parallel processors, where the costs and benefits of multi-user workloads are exacerbated. While some may claim that the issue of multiprogramming is not a viable one in the supercomputer market, experience shows otherwise. Even in recent massively parallel machines, multiprogramming is a key component. It has even been claimed that a partial cause of the demise of the CM2 was the fact that it did not efficiently support time-sharing. In the same paper, Gordon Bell postulates that multicomputers will evolve to multiprocessors in order to support efficient multiprogramming. Therefore, it is clear that parallel processors of the future will be required to offer the user a time-shared environment with reasonable response times for the applications. In this type of environment, the most important performance metric is the completion or response time of a given application. However, there are few evaluation efforts addressing this issue.

Dimpsey, Robert Tod

1992-01-01

121

Computing effective properties of random heterogeneous materials on heterogeneous parallel processors  

NASA Astrophysics Data System (ADS)

In recent decades, finite element (FE) techniques have been extensively used for predicting effective properties of random heterogeneous materials. In the case of very complex microstructures, the choice of numerical methods for the solution of this problem can offer some advantages over classical analytical approaches, and it allows the use of digital images obtained from real material samples (e.g., using computed tomography). On the other hand, having a large number of elements is often necessary for properly describing complex microstructures, ultimately leading to extremely time-consuming computations and high memory requirements. With the final objective of reducing these limitations, we improved an existing freely available FE code for the computation of effective conductivity (electrical and thermal) of microstructure digital models. To allow execution on hardware combining multi-core CPUs and a GPU, we first translated the original algorithm from Fortran to C, and we subdivided it into software components. Then, we enhanced the C version of the algorithm for parallel processing with heterogeneous processors. With the goal of maximizing the obtained performances and limiting resource consumption, we utilized a software architecture based on stream processing, event-driven scheduling, and dynamic load balancing. The parallel processing version of the algorithm has been validated using a simple microstructure consisting of a single sphere located at the centre of a cubic box, yielding consistent results. Finally, the code was used for the calculation of the effective thermal conductivity of a digital model of a real sample (a ceramic foam obtained using X-ray computed tomography). On a computer equipped with dual hexa-core Intel Xeon X5670 processors and an NVIDIA Tesla C2050, the parallel application version features near to linear speed-up progression when using only the CPU cores. It executes more than 20 times faster when additionally using the GPU.

Leidi, Tiziano; Scocchi, Giulio; Grossi, Loris; Pusterla, Simone; D'Angelo, Claudio; Thiran, Jean-Philippe; Ortona, Alberto

2012-11-01

122

An Analog Processor for Image Compression  

NASA Technical Reports Server (NTRS)

This paper describes a novel analog Vector Array Processor (VAP) that was designed for use in real-time and ultra-low power image compression applications. This custom CMOS processor is based architecturally on the Vector Quantization (VQ) algorithm in image coding, and the hardware implementation fully exploits the inherent parallelism built into the VQ algorithm.

Tawel, R.

1992-01-01

123

Parallel three-dimensional free-space optical interconnection for an optoelectronic processor  

Microsoft Academic Search

A hybrid scalable optoelectronic crossbar switching system that uses global parallel free-space optical interconnects and three-dimensional (3D) VLSI chip stacks is presented. The system includes three 3D chip stacks with each consisting of 16 VLSI chips. A single 16 X 16 VCSEL/MSM detector array is flip-chip bonded on top of the chip stack. Each chip supports 16 optical I/Os at

Guoqiang Li; Dawei Huang; Emel Yuceturk; Sadik C. Esener; Volkan H. Ozguz; Yue Liu

2001-01-01

124

Efficient realization of the MD nonrecursive filters: from sequential implementation to mapping on systolic array processors  

Microsoft Academic Search

This paper presents algorithms and architectures for implementing 1-D to multidimensional (M-D) digital nonrecursive filters. These architectures are very regular and support single-chip implementation in VLSI, as well as multiple-chip implementations. The proposed systolic arrays, used in the implementation of these algorithms, are optimal with respect to time. In a systolic implementation the highest degree of parallel processing

Adrian Burian; Corneliu Rusu; Pauli Kuosmanen

1998-01-01

125

High-performance parallel processors based on star-coupled wavelength division multiplexing optical interconnects  

DOEpatents

As the performance of individual elements within parallel processing systems increases, increased communication capability between distributed processor and memory elements is required. There is great interest in using fiber optics to improve interconnect communication beyond that attainable using electronic technology. Several groups have considered WDM, star-coupled optical interconnects. The invention uses a fiber optic transceiver to provide low latency, high bandwidth channels for such interconnects using a robust multimode fiber technology. Instruction-level simulation is used to quantify the bandwidth, latency, and concurrency required for such interconnects to scale to 256 nodes, each operating at 1 GFLOPS performance. Performance has been shown to scale to approximately 100 GFLOPS for scientific application kernels using a small number of wavelengths (8 to 32), only one wavelength received per node, and achievable optoelectronic bandwidth and latency.

Deri, Robert J. (Pleasanton, CA); DeGroot, Anthony J. (Castro Valley, CA); Haigh, Ronald E. (Arvada, CO)

2002-01-01

126

NOSC (Naval Ocean Systems Center) advanced systolic array processor (ASAP). Professional paper for period ending August 1987  

SciTech Connect

Design of a high-speed (250 million 32-bit floating-point operations per second) two-dimensional systolic array composed of 16-bit/slice microsequencer structured processors is presented. System-design features such as broadcast data flow, tag bit movement, and integrated diagnostic test registers are described. The software development tools needed to map complex matrix-based signal-processing algorithms onto the systolic-processor system are described.

Loughlin, J.P.

1987-12-01

127

Developments of 60 ghz antenna and wireless interconnect inside multi-chip module for parallel processor system  

NASA Astrophysics Data System (ADS)

In order to carry out the complicated computation inside high performance computing (HPC) systems, tens to hundreds of parallel processor chips and physical wires are required to be integrated inside the multi-chip package module (MCM). The physical wires serving as the electrical interconnects between the processor chips, however, present placement and routing challenges because of the unequal progress between semiconductor scaling and I/O size reduction. The primary goal of the research is to overcome these package design challenges by providing a hybrid computing architecture with implemented 60 GHz antennas as a highly efficient wireless interconnect capable of over 10 Gbps bandwidth for data transmission. The dissertation is divided into three major parts. In the first part, two performance metrics for evaluating an antenna's system performance within the chip-to-chip wireless interconnect, the power loss required to be recovered (PRE) and the wireless link budget, are introduced to address the design challenges and define the design goals. The second part contains the design concept, fabrication procedure and measurements of the implemented 60 GHz broadband antenna for multi-chip data transmission. The developed antenna utilizes a periodically-patched artificial magnetic conductor (AMC) structure associated with a ground-shielded conductor in order to enhance the antenna's impedance matching bandwidth. The validation shows that the design concept achieves over 10 GHz of -10 dB S11 bandwidth, which indicates the antenna's operating bandwidth, together with the horizontal data transmission capability required by planar chip-to-chip interconnects. In order to reduce both the PRE and the wireless link budget, a 60 GHz two-element array for multi-chip communication is developed in the third part. This section includes the combined-field analysis and the design concepts for the two-element array and its feeding circuitry. The simulation results agree with the predicted field analysis and demonstrate a 5 dBi gain enhancement in the horizontal direction over a single 60 GHz AMC antenna, further reducing both the PRE and the wireless link budget.

Yeh, Ho-Hsin

128

Parallel Processing of Large Scale Microphone Arrays for Sound Capture  

NASA Astrophysics Data System (ADS)

Performance of microphone sound pick up is degraded by deleterious properties of the acoustic environment, such as multipath distortion (reverberation) and ambient noise. The degradation becomes more prominent in a teleconferencing environment in which the microphone is positioned far away from the speaker. Moreover, the ideal teleconference should feel as easy and natural as face-to-face communication with another person. This suggests hands-free sound capture with no tether or encumbrance by hand-held or body-worn sound equipment. Microphone arrays for this application represent an appropriate approach. This research develops new microphone array and signal processing techniques for high quality hands-free sound capture in noisy, reverberant enclosures. The new techniques combine matched-filtering of individual sensors and parallel processing to provide acute spatial volume selectivity which is capable of mitigating the deleterious effects of noise interference and multipath distortion. The new method outperforms traditional delay-and-sum beamformers which provide only directional spatial selectivity. The research additionally explores truncated matched-filtering and random distribution of transducers to reduce complexity and improve sound capture quality. All designs are first established by computer simulation of array performance in reverberant enclosures. The simulation is achieved by a room model which can efficiently calculate the acoustic multipath in a rectangular enclosure up to a prescribed order of images. It also calculates the incident angle of the arriving signal. Experimental arrays were constructed and their performance was measured in real rooms. Real room data were collected in a hard-walled laboratory and a controllable variable acoustics enclosure of similar size, approximately 6 x 6 x 3 m. An extensive speech database was also collected in these two enclosures for future research on microphone arrays. The simulation results are shown to be consistent with the real room data. Localization of sound sources has been explored using cross-power spectrum time delay estimation and has been evaluated using real room data under slightly, moderately and highly reverberant conditions. To improve the accuracy and reliability of the source localization, an outlier detector that removes incorrect time delay estimates has been invented. To provide speaker selectivity for microphone array systems, a hands-free speaker identification system has been studied. A recently invented feature using selected spectrum information outperforms traditional recognition methods. Measured results demonstrate the capabilities of speaker selectivity from a matched-filtered array. In addition, simulation utilities, including matched-filtering processing of the array and hands-free speaker identification, have been implemented on the massively parallel nCube supercomputer. This parallel computation highlights the requirements for real-time processing of array signals.
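
As a point of reference for the comparison above, a delay-and-sum beamformer simply aligns each microphone signal by its propagation delay toward an assumed source position and averages. The sketch below is a generic illustration (sample rate, geometry and integer-sample delays are simplifying assumptions), not the matched-filter processing developed in the thesis:

    import numpy as np

    def delay_and_sum(signals, mic_positions, source_pos, fs, c=343.0):
        # signals: (M, T) array; mic_positions: (M, 3); source_pos: (3,)
        dists = np.linalg.norm(mic_positions - source_pos, axis=1)
        delays = (dists - dists.min()) / c           # relative delays in seconds
        shifts = np.round(delays * fs).astype(int)   # integer-sample approximation
        out = np.zeros(signals.shape[1])
        for m, s in enumerate(shifts):
            out[:signals.shape[1] - s] += signals[m, s:]
        return out / signals.shape[0]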

Jan, Ea-Ee.

1995-01-01

129

Parallel placement for field-programmable gate arrays  

Microsoft Academic Search

Placement and routing are the most time-consuming processes in automatically synthesizing and configuring circuits for field-programmable gate arrays (FPGAs). In this paper, we use the negotiation-based paradigm to parallelize placement. Our new FPGA placer, NAP (Negotiated Analytical Placement), uses an analytical technique for coarse placement and the negotiation paradigm for detailed placement. We describe the serial algorithm and report results.

Pak K. Chan; Martine D. F. Schlag

2003-01-01

130

Parallel vacuum arc discharge with microhollow array dielectric and anode  

NASA Astrophysics Data System (ADS)

An electrode configuration with a microhollow array dielectric and anode was developed to obtain parallel vacuum arc discharge. Compared with conventional electrodes, more than 10 parallel microhollow discharges were ignited for the new configuration, which increased the discharge area significantly and made the cathode erode more uniformly. The number of vacuum discharge channels could be increased effectively by decreasing the distances between holes or increasing the arc current. Experimental results revealed that plasmas ejected from the adjacent hollow and the relatively high arc voltage were two key factors leading to the parallel discharge. The characteristics of plasmas in the microhollow were investigated as well. The spectral line intensity and electron density of plasmas in the microhollow increased markedly with the decrease of the microhollow diameter.

Feng, Jinghua; Zhou, Lin; Fu, Yuecheng; Zhang, Jianhua; Xu, Rongkun; Chen, Faxin; Li, Linbo; Meng, Shijian

2014-07-01

131

Lumped-Element Planar Strip Array (LPSA) for Parallel MRI  

PubMed Central

The recently introduced planar strip array (PSA) can significantly reduce scan times in parallel MRI by enabling the utilization of a large number of RF strip detectors that are inherently decoupled, and are tuned by adjusting the strip length to integer multiples of a quarter-wavelength (λ/4) in the presence of a ground plane and dielectric substrate. In addition, the more explicit spatial information embedded in the phase of the signals from the strip array is advantageous (compared to loop arrays) for limiting aliasing artifacts in parallel MRI. However, losses in the detector as its natural resonance frequency approaches the Larmor frequency (where the wavelength is long at 1.5 T) may limit the signal-to-noise ratio (SNR) of the PSA. Moreover, the PSA’s inherent λ/4 structure severely limits our ability to adjust detector geometry to optimize the performance for a specific organ system, as is done with loop coils. In this study we replaced the dielectric substrate with discrete capacitors, which resulted in both SNR improvement and a tunable lumped-element PSA (LPSA) whose dimensions can be optimized within broad constraints, for a given region of interest (ROI) and MRI frequency. A detailed theoretical analysis of the LPSA is presented, including its equivalent circuit, electromagnetic fields, SNR, and g-factor maps for parallel MRI. Two different decoupling schemes for the LPSA are described. A four-element LPSA prototype was built to test the theory with quantitative measurements on images obtained with parallel and conventional acquisition schemes. PMID:14705058

Lee, Ray F.; Hardy, Christopher J.; Sodickson, Daniel K.; Bottomley, Paul A.

2007-01-01

132

Analog processor design for potentiometric sensor array and its applications in smart living space  

NASA Astrophysics Data System (ADS)

This paper presents an analog processor design for an ion-sensitive field-effect transistor (ISFET)-based flow-through system and its application in smart living space. The dynamic flow-cell measurement provides more information than stationary measurement and is useful in environmental monitoring and electronic tongue systems. The multi-channel floating source readout circuitry has been developed for flow-through analysis of ISFET-based arrays. The flow injection analysis system with two different ISFET structures has been investigated by using performance parameters such as sensitivity, uniformity, and response time of pH sensing. In addition, a self-tuning multi-sensor water quality monitoring system based on the adaptive-network-based fuzzy inference system (ANFIS) learning method is developed. The results can be directly used in drinking water and swimming pool monitoring for improving living space and quality.

Chung, Danny Wen-Yaw; Tsai, You-Lin; Liu, Tai-Tsun; Leu, Chun-Liang; Yang, Chung-Huang; Pijanowska, Dorota G.; Torbicz, Wladyslaw; Grabiec, Piotr B.; Jaroszewicz, Bohdan

2007-04-01

133

Temperature modeling and emulation of an ASIC temperature monitor system for Tightly-Coupled Processor Arrays (TCPAs)  

NASA Astrophysics Data System (ADS)

This contribution provides an approach for emulating the behaviour of an ASIC temperature monitoring system (TMon) during run-time for a tightly-coupled processor array (TCPA) of a heterogeneous invasive multi-tile architecture to be used for FPGA prototyping. It is based on a thermal RC modeling approach. Different usage scenarios of the TCPA are also analyzed and compared.

Glocker, E.; Boppu, S.; Chen, Q.; Schlichtmann, U.; Teich, J.; Schmitt-Landsiedel, D.

2014-11-01

134

Feasibility study for the implementation of NASTRAN on the ILLIAC 4 parallel processor  

NASA Technical Reports Server (NTRS)

The ILLIAC IV, a fourth generation multiprocessor using parallel processing hardware concepts, is operational at Moffett Field, California. Its capability to excel at matrix manipulation makes the ILLIAC well suited for performing structural analyses using the finite element displacement method. The feasibility of modifying the NASTRAN (NASA structural analysis) computer program to make effective use of the ILLIAC IV was investigated. The characteristics of the ILLIAC and of the ARPANET, a telecommunications network which spans the continent and makes the ILLIAC accessible to nearly all major industrial centers in the United States, are summarized. Two distinct approaches are studied: retaining NASTRAN as it now operates on many of the host computers of the ARPANET to process the input and output while using the ILLIAC only for the major computational tasks, and installing NASTRAN to operate entirely in the ILLIAC environment. Though both alternatives offer similar and significant increases in computational speed over modern third generation processors, the full installation of NASTRAN on the ILLIAC is recommended. Specifications are presented for performing that task, with manpower estimates and schedules to correspond.

Field, E. I.

1975-01-01

135

Initial Observations of the Simultaneous Multithreading Pentium 4 Processor (Proceedings of the 12th Intl. Conference on Parallel Architectures and Compilation Techniques, September 2003)

E-print Network

Multi-threaded, multiprogrammed, and parallel workloads, as well as component benchmarks, are used to evaluate this processor. Results include multiprogrammed speedup and parallel speedup, as well as synchronization and communication

Wang, Deli

136

Efficient Support of Parallel Sparse Computation for Array Intrinsic Functions of Fortran 90 *  

E-print Network

Fortran 90 provides a rich set of array intrinsic functions. This work presents an efficient library for parallel sparse computations with Fortran 90 array intrinsic operations. Our method

Lee, Jenq-Kuen

137

Investigations on the usefulness of the Massively Parallel Processor for study of electronic properties of atomic and condensed matter systems  

NASA Technical Reports Server (NTRS)

The usefulness of the Massively Parallel Processor (MPP) for investigation of electronic structures and hyperfine properties of atomic and condensed matter systems was explored. The major effort was directed towards the preparation of algorithms for parallelization of the computational procedure being used on serial computers for electronic structure calculations in condensed matter systems. Detailed descriptions of investigations and results are reported, including MPP adaptation of self-consistent charge extended Hueckel (SCCEH) procedure, MPP adaptation of the first-principles Hartree-Fock cluster procedure for electronic structures of large molecules and solid state systems, and MPP adaptation of the many-body procedure for atomic systems.

Das, T. P.

1988-01-01

138

Real-Time Adaptive Lossless Hyperspectral Image Compression using CCSDS on Parallel GPGPU and Multicore Processor Systems  

NASA Technical Reports Server (NTRS)

The proposed CCSDS (Consultative Committee for Space Data Systems) Lossless Hyperspectral Image Compression Algorithm was designed to facilitate a fast hardware implementation. This paper analyses that algorithm with regard to available parallelism and describes fast parallel implementations in software for GPGPU and Multicore CPU architectures. We show that careful software implementation, using hardware acceleration in the form of GPGPUs or even just multicore processors, can exceed the performance of existing hardware and software implementations by up to 11x and break the real-time barrier for the first time for a typical test application.

Hopson, Ben; Benkrid, Khaled; Keymeulen, Didier; Aranki, Nazeeh; Klimesh, Matt; Kiely, Aaron

2012-01-01

139

A digital magnetic resonance imaging spectrometer using digital signal processor and field programmable gate array  

NASA Astrophysics Data System (ADS)

A digital spectrometer for low-field magnetic resonance imaging is described. A digital signal processor (DSP) is utilized as the pulse programmer, on which a pulse sequence is executed as a subroutine. Field programmable gate array (FPGA) devices that are logically mapped into the external addressing space of the DSP work as auxiliary controllers of gradient control, radio frequency (rf) generation, and rf receiving separately. The pulse programmer triggers an event by setting the 32-bit control register of the corresponding FPGA, and then the FPGA automatically carries out the event function according to preset configurations in cooperation with other devices; accordingly, event control of the spectrometer is flexible and efficient. Digital techniques are in widespread use: gradient control is implemented in real-time by a FPGA; the rf source is constructed using the direct digital synthesis technique, and the rf receiver is constructed using the digital quadrature detection technique. Well-designed performance is achieved, including 1 µs time resolution of the gradient waveform, 1 µs time resolution of the soft pulse, and 2 MHz signal receiving bandwidth. Both rf synthesis and rf digitization operate at the same 60 MHz clock; therefore, the frequency range of transmitting and receiving is from DC to ~27 MHz. A majority of pulse sequences have been developed, and the imaging performance of the spectrometer has been validated through a large number of experiments. Furthermore, the spectrometer is also suitable for relaxation measurement in the nuclear magnetic resonance field.

Liang, Xiao; Binghe, Sun; Yueping, Ma; Ruyan, Zhao

2013-05-01

140

Mechanically verified hardware implementing an 8-bit parallel IO Byzantine agreement processor  

NASA Technical Reports Server (NTRS)

Consider a network of four processors that use the Oral Messages (Byzantine Generals) Algorithm of Pease, Shostak, and Lamport to achieve agreement in the presence of faults. Bevier and Young have published a functional description of a single processor that, when interconnected appropriately with three identical others, implements this network under the assumption that the four processors step in synchrony. By formalizing the original Pease, et al work, Bevier and Young mechanically proved that such a network achieves fault tolerance. We develop, formalize, and discuss a hardware design that has been mechanically proven to implement their processor. In particular, we formally define mapping functions from the abstract state space of the Bevier-Young processor to a concrete state space of a hardware module and state a theorem that expresses the claim that the hardware correctly implements the processor. We briefly discuss the Brock-Hunt Formal Hardware Description Language which permits designs both to be proved correct with the Boyer-Moore theorem prover and to be expressed in a commercially supported hardware description language for additional electrical analysis and layout. We briefly describe our implementation.
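
For orientation, the toy Python sketch below simulates a single round of the Oral Messages algorithm OM(1) for one commander and three lieutenants, the configuration formalized in the work above; message passing is modelled with plain function calls and the fault model is deliberately simplistic:

    def om1(commander_value, faulty=None):
        # The commander sends its value to the three lieutenants (a faulty
        # commander can be modelled by passing a dict of per-lieutenant values).
        sent = {i: (commander_value[i] if isinstance(commander_value, dict) else commander_value)
                for i in range(3)}
        decisions = []
        for i in range(3):
            # Each lieutenant keeps its own copy and collects relayed copies.
            received = [sent[i]]
            for j in range(3):
                if j != i:
                    relayed = sent[j]
                    if faulty == j:          # a faulty lieutenant may lie when relaying
                        relayed = 1 - relayed
                    received.append(relayed)
            decisions.append(max(set(received), key=received.count))  # majority vote
        return decisions

    # With lieutenant 2 faulty, the loyal lieutenants still agree on the commander's value:
    print(om1(1, faulty=2))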

Moore, J. Strother

1992-01-01

141

Numerical methods for matrix computations using arrays of processors. Final report, 15 August 1983-15 October 1986  

SciTech Connect

The basic objective of this project was to consider a large class of matrix computations with particular emphasis on algorithms that can be implemented on arrays of processors. In particular, methods useful for sparse matrix computations were investigated. These computations arise in a variety of applications such as the solution of partial differential equations by multigrid methods and in the fitting of geodetic data. Some of the methods developed have already found their use on some of the newly developed architectures.

Golub, G.H.

1987-04-30

142

A parallel FPGA implementation for real-time 2D pixel clustering for the ATLAS Fast Tracker Processor  

NASA Astrophysics Data System (ADS)

The parallel 2D pixel clustering FPGA implementation used for the input system of the ATLAS Fast TracKer (FTK) processor is presented. The input system for the FTK processor will receive data from the Pixel and micro-strip detectors from the inner ATLAS read out drivers (RODs) at full rate, for a total of 760 Gb/s, as sent by the RODs after level-1 triggers. Clustering serves two purposes: the first is to reduce the high rate of the received data before further processing; the second is to determine the cluster centroid to obtain the best spatial measurement. For the pixel detectors the clustering is implemented by using a 2D-clustering algorithm that takes advantage of a moving window technique to minimize the logic required for cluster identification. The cluster detection window size can be adjusted for optimizing the cluster identification process. Additionally, the implementation can be parallelized by instantiating multiple cores to identify different clusters independently, thus exploiting more FPGA resources. This flexibility makes the implementation suitable for a variety of demanding image processing applications. The implementation is robust against bit errors in the input data stream and drops all data that cannot be identified. In the unlikely event of missing control words, the implementation will ensure stable data processing by inserting the missing control words in the data stream. The 2D pixel clustering implementation is developed and tested in both single flow and parallel versions. The first parallel version with 16 parallel cluster identification engines is presented. The input data from the RODs are received through S-Links, and the processing units that follow the clustering implementation also require a single data stream; therefore data parallelizing (demultiplexing) and serializing (multiplexing) modules are introduced in order to accommodate the parallelized version and restore the data stream afterwards. The results of the first hardware tests of the single flow implementation on the custom FTK input mezzanine (IM) board are presented. We report on the integration of 16 parallel engines in the same FPGA and the resulting performance. The parallel 2D-clustering implementation has sufficient processing power to meet the specification for the Pixel layers of ATLAS, for up to 80 overlapping pp collisions, which corresponds to the maximum LHC luminosity planned until 2022.
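
A simplified software analogue of the clustering step is shown below: adjacent hit pixels are grouped and each cluster's centroid is reported. The FPGA design uses a moving-window technique; this sketch uses a plain flood fill purely to illustrate the input/output behaviour:

    def cluster_hits(hits):
        # hits: set of (row, col) coordinates of pixels above threshold
        remaining, clusters = set(hits), []
        while remaining:
            stack = [remaining.pop()]
            cluster = []
            while stack:
                r, c = stack.pop()
                cluster.append((r, c))
                for nb in [(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)]:
                    if nb in remaining:
                        remaining.remove(nb)
                        stack.append(nb)
            rows, cols = zip(*cluster)
            clusters.append((sum(rows) / len(cluster), sum(cols) / len(cluster)))
        return clusters  # list of cluster centroids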

Sotiropoulou, C. L.; Gkaitatzis, S.; Annovi, A.; Beretta, M.; Kordas, K.; Nikolaidis, S.; Petridou, C.; Volpi, G.

2014-10-01

143

Design of a Replay Debugger for a Large Cellular Array Processor  

E-print Network

tasks (processes). Three communication and synchronisation networks connect all cell processors and the host processor: Synchronisation (S-net), Broadcast (B-net) and point-to-point Torus (T-net). The cell architectures provide more variety of communications and synchronisation operations than previous designs

Johnson, Christopher William

144

Scalable Unix commands for parallel processors : a high-performance implementation.  

SciTech Connect

We describe a family of MPI applications we call the Parallel Unix Commands. These commands are natural parallel versions of common Unix user commands such as ls, ps, and find, together with a few similar commands particular to the parallel environment. We describe the design and implementation of these programs and present some performance results on a 256-node Linux cluster. The Parallel Unix Commands are open source and freely available.
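
The flavour of these tools can be conveyed with a short conceptual sketch of a "parallel ls": every MPI rank lists a directory and rank 0 gathers and prints the results. This assumes an MPI installation plus the mpi4py package and is not the project's actual C implementation:

    import os
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    local_listing = (comm.Get_rank(), sorted(os.listdir(".")))
    all_listings = comm.gather(local_listing, root=0)   # collect per-rank listings
    if comm.Get_rank() == 0:
        for rank, names in all_listings:
            print("rank %d: %s" % (rank, " ".join(names)))

    # run with, e.g.:  mpiexec -n 4 python parallel_ls.py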

Ong, E.; Lusk, E.; Gropp, W.

2001-06-22

145

Appendix E: Parallel Pascal development system  

NASA Technical Reports Server (NTRS)

The Parallel Pascal Development System enables Parallel Pascal programs to be developed and tested on a conventional computer. It consists of several system programs, including a Parallel Pascal to standard Pascal translator, and a library of Parallel Pascal subprograms. The library includes subprograms for using Parallel Pascal on a parallel system with a fixed degree of parallelism, such as the Massively Parallel Processor, to conveniently manipulate arrays which have larger dimensions than the hardware. Programs can be conveniently tested with small-sized arrays on the conventional computer before attempting to run on a parallel system.

1985-01-01

146

Wideband aperture array using RF channelizers and massively parallel digital 2D IIR filterbank  

NASA Astrophysics Data System (ADS)

Wideband receive-mode beamforming applications in wireless location, electronically-scanned antennas for radar, RF sensing, microwave imaging and wireless communications require digital aperture arrays that offer a relatively constant far-field beam over several octaves of bandwidth. Several beamforming schemes including the well-known true time-delay and the phased array beamformers have been realized using either finite impulse response (FIR) or fast Fourier transform (FFT) digital filter-sum based techniques. These beamforming algorithms offer the desired selectivity at the cost of a high computational complexity and frequency-dependent far-field array patterns. A novel approach to receiver beamforming is the use of massively parallel 2-D infinite impulse response (IIR) fan filterbanks for the synthesis of relatively frequency independent RF beams at an order of magnitude lower multiplier complexity compared to FFT or FIR filter based conventional algorithms. The 2-D IIR filterbanks demand fast digital processing that can support several octaves of RF bandwidth, and fast analog-to-digital converters (ADCs) for RF-to-bits type direct conversion of wideband antenna element signals. Fast digital implementation platforms that can realize high-precision recursive filter structures necessary for real-time beamforming, at RF radio bandwidths, are also desired. We propose a novel technique that combines a passive RF channelizer, multichannel ADC technology, and single-phase massively parallel 2-D IIR digital fan filterbanks, realized at low complexity using FPGA and/or ASIC technology. There exists native support for a larger bandwidth than the maximum clock frequency of the digital implementation technology. We also strive to achieve More-than-Moore throughput by processing a wideband RF signal having content with N-fold (B = N Fclk/2) bandwidth compared to the maximum clock frequency Fclk Hz of the digital VLSI platform under consideration. Such an increase in bandwidth is achieved without the use of polyphase signal processing or time-interleaved ADC methods. That is, all digital processors operate at the same Fclk clock frequency without phasing, while wideband operation is achieved by sub-sampling of narrower sub-bands at the RF channelizer outputs.

Sengupta, Arindam; Madanayake, Arjuna; Gómez-García, Roberto; Engeberg, Erik D.

2014-05-01

147

Evaluation of the Leon3 soft-core processor within a Xilinx radiation-hardened field-programmable gate array.  

SciTech Connect

The purpose of this document is to summarize the work done to evaluate the performance of the Leon3 soft-core processor in a radiation environment while instantiated in a radiation-hardened static random-access memory based field-programmable gate array. This evaluation will look at the differences between two soft-core processors: the open-source Leon3 core and the fault-tolerant Leon3 core. Radiation testing of these two cores was conducted at the Texas A&M University Cyclotron facility and Lawrence Berkeley National Laboratory. The results of these tests are included within the report along with designs intended to improve the mitigation of the open-source Leon3. The test setup used for evaluating both versions of the Leon3 is also included within this document.

Learn, Mark Walter

2012-01-01

148

Multimode power processor  

DOEpatents

In one embodiment, a power processor which operates in three modes: an inverter mode wherein power is delivered from a battery to an AC power grid or load; a battery charger mode wherein the battery is charged by a generator; and a parallel mode wherein the generator supplies power to the AC power grid or load in parallel with the battery. In the parallel mode, the system adapts to arbitrary non-linear loads. The power processor may operate on a per-phase basis wherein the load may be synthetically transferred from one phase to another by way of a bumpless transfer which causes no interruption of power to the load when transferring energy sources. Voltage transients and frequency transients delivered to the load when switching between the generator and battery sources are minimized, thereby providing an uninterruptible power supply. The power processor may be used as part of a hybrid electrical power source system which may contain, in one embodiment, a photovoltaic array, diesel engine, and battery power sources.

O'Sullivan, George A. (Pottersville, NJ); O'Sullivan, Joseph A. (St. Louis, MO)

1999-01-01

149

Multimode power processor  

DOEpatents

In one embodiment, a power processor which operates in three modes: an inverter mode wherein power is delivered from a battery to an AC power grid or load; a battery charger mode wherein the battery is charged by a generator; and a parallel mode wherein the generator supplies power to the AC power grid or load in parallel with the battery. In the parallel mode, the system adapts to arbitrary non-linear loads. The power processor may operate on a per-phase basis wherein the load may be synthetically transferred from one phase to another by way of a bumpless transfer which causes no interruption of power to the load when transferring energy sources. Voltage transients and frequency transients delivered to the load when switching between the generator and battery sources are minimized, thereby providing an uninterruptible power supply. The power processor may be used as part of a hybrid electrical power source system which may contain, in one embodiment, a photovoltaic array, diesel engine, and battery power sources. 31 figs.

O'Sullivan, G.A.; O'Sullivan, J.A.

1999-07-27

150

Acoustooptic linear algebra processors - Architectures, algorithms, and applications  

NASA Technical Reports Server (NTRS)

Architectures, algorithms, and applications for systolic processors are described, with attention to the realization of parallel algorithms on various optical systolic array processors. Systolic processors for matrices with special structure and matrices of general structure, and architectures for the realization of matrix-vector, matrix-matrix, and triple-matrix products, are described. Parallel algorithms for direct and indirect solutions to systems of linear algebraic equations and their implementation on optical systolic processors are detailed, with attention to the pipelining and flow of data and operations. Parallel algorithms and their optical realization for LU and QR matrix decomposition are specifically detailed. These represent the fundamental operations necessary in the implementation of least squares, eigenvalue, and SVD solutions. Specific applications (e.g., the solution of partial differential equations, adaptive noise cancellation, and optimal control) are described to typify the use of matrix processors in modern advanced signal processing.

Casasent, D.

1984-01-01

151

Implementation of an EPICS IOC on an Embedded Soft Core Processor Using Field Programmable Gate Arrays  

SciTech Connect

At Jefferson Lab, we have been evaluating soft core processors running an EPICS IOC over µClinux on our custom hardware. A soft core processor is a flexible CPU architecture that is configured in the FPGA as opposed to a hard core processor which is fixed in silicon. Combined with an on-board Ethernet port, the technology incorporates the IOC and digital control hardware within a single FPGA. By eliminating the general purpose computer IOC, the designer is no longer tied to a specific platform, e.g. PC, VME, or VXI, to serve as the intermediary between the high level controls and the field hardware. This paper will discuss the design and development process as well as specific applications for JLab's next generation low-level RF controls and Machine Protection Systems.

Douglas Curry; Alicia Hofler; Hai Dong; Trent Allison; J. Hovater; Kelly Mahoney

2005-09-20

152

Performance evaluation of the JPL interim digital SAR processor  

NASA Technical Reports Server (NTRS)

The performance of the Interim Digital SAR Processor (IDP) was evaluated. The IDP was originally developed for experimental processing of digital SEASAT SAR data. One phase of the system upgrade, which features parallel processing in three peripheral array processors, automated estimation of Doppler parameters, and unsupervised image pixel location determination and registration, was executed. The method used to compensate for the target range curvature effect was improved. A four-point interpolation scheme is implemented to replace the nearest-neighbor scheme used in the original IDP. The processor still maintains its fast throughput speed. The current performance and capability of the processing modes now available on the IDP system are updated.

Wu, C.; Barkan, B.; Curlander, J.; Jin, M.; Pang, S.

1983-01-01

153

Simulation of the λφ4 lattice theory on a loosely coupled array of processors  

NASA Astrophysics Data System (ADS)

We discuss a simulation of the lattice λφ4 theory in a parallel environment. The computation has been performed at the IBM ECSEC scientific center. The hardware consists of a set of ten FPS-164 units connected to a host IBM 4381 computer. The system software is the VM/EPEX experimental parallel control program. A discussion of the parallel computation performance is presented.

Baig, M.

1987-11-01

154

Implementation of monitors with macros: a programming aid for the HEP and other parallel processors  

SciTech Connect

In a previous paper, the advantages of using monitors when implementing multiprocessing algorithms for the Denelcor HEP were delineated. A detailed presentation is given here of how monitors can be implemented on the HEP using a simple macro processor. The thesis is developed that a small body of general-purpose monitors can be defined to handle most standard synchronization patterns. We include the macro packages required to implement some of the more common synchronization patterns, including the fairly complex logic discussed before. Code produced using these macro packages is portable from one multiprocessing environment to another. Indeed, by recoding the set of basic macros (about 100 lines of code for the Denelcor HEP), most programs that are now being written could be moved to any similar multiprocessing system.
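
To make the monitor concept concrete, the sketch below packages a lock and condition variables behind entry procedures guarding a bounded buffer, one of the standard synchronization patterns alluded to above; it is written in Python for brevity, whereas the original macros expand into HEP-specific code:

    import threading
    from collections import deque

    class BoundedBuffer:
        # A monitor: shared state is only touched while holding the internal lock.
        def __init__(self, capacity):
            self._items = deque()
            self._capacity = capacity
            self._lock = threading.Lock()
            self._not_full = threading.Condition(self._lock)
            self._not_empty = threading.Condition(self._lock)

        def put(self, item):
            with self._not_full:
                while len(self._items) >= self._capacity:
                    self._not_full.wait()
                self._items.append(item)
                self._not_empty.notify()

        def get(self):
            with self._not_empty:
                while not self._items:
                    self._not_empty.wait()
                item = self._items.popleft()
                self._not_full.notify()
                return item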

Lusk, E.L.; Overbeek, R.A.

1983-12-01

155

Fast String Search on Multicore Processors: Mapping fundamental algorithms onto parallel hardware  

SciTech Connect

String searching is one of the basic algorithms of computing. It has a host of applications, including search engines, network intrusion detection, virus scanners, spam filters, and DNA analysis, among others. The Cell processor, with its multiple cores, promises to speed up string searching considerably. In this article, we show how we mapped string searching efficiently onto the Cell. We present two implementations: • The fast implementation supports a small dictionary size (approximately 100 patterns) and provides a throughput of 40 Gbps, which is 100 times faster than reference implementations on x86 architectures. • The heavy-duty implementation is slower (3.3-4.3 Gbps), but supports dictionaries with tens of thousands of strings.
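
For contrast with the optimized Cell implementations, a naive multi-pattern search over a text is just the loop below; the dictionary sizes and throughput figures quoted above refer to far more sophisticated, vectorized implementations:

    def find_patterns(text, patterns):
        # Report every (position, pattern) occurrence; O(len(text) * len(patterns)).
        hits = []
        for p in patterns:
            start = 0
            while True:
                i = text.find(p, start)
                if i < 0:
                    break
                hits.append((i, p))
                start = i + 1
        return sorted(hits)

    print(find_patterns("gattacagatta", ["gat", "tac"]))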

Scarpazza, Daniele P.; Villa, Oreste; Petrini, Fabrizio

2008-04-01

156

Obsidian: A Domain Specific Embedded Language for General-Purpose Parallel Programming of Graphics Processors  

E-print Network

Abstract. We present a domain specific language, embedded in Haskell, for general purpose parallel programming on GPUs. Our intention is to explore the use of connection patterns in parallel programming. We briefly present our earlier work on hardware generation, and outline the current state of GPU architectures and programming models. Finally, we present the current status of the Obsidian project, which aims to make GPU programming easier, without relinquishing detailed control of GPU resources. Both a programming example and some details of the implementation are presented. This is a report on work in progress. 1

Joel Svensson; Mary Sheeran; Koen Claessen

2008-01-01

157

Multipoint parallel excitation and CCD-based imaging system for high-throughput fluorescence detection of biochip micro-arrays  

Microsoft Academic Search

We report the development and the characterization of a multipoint parallel excitation and CCD-based imaging system for high-throughput fluorescence detection of biochip micro-arrays. A two-dimensional array of (19×19) points with uniform intensity distribution, generated by a holographic array generator, was used for parallel excitation of two-dimensional micro-arrays of fluorescence samples. A CCD-based imaging system was used for high-throughput parallel detection

D. S. Mehta; C. Y. Lee; A. Chiou

2001-01-01

158

Compiler Optimizations for Parallel Sparse Programs with Array Intrinsics of Fortran 90 *  

E-print Network

Fortran 90 provides a rich set of array intrinsic functions; sparse array support, however, is currently not provided by any vendor Fortran 90 or HPF compilers. In our research work, we

Lee, Jenq-Kuen

159

Arrays, non-determinism, side-effects, and parallelism: A functional perspective  

Microsoft Academic Search

Incremental, functional updates to arrays, executed in a non-deterministic manner, are shown to achieve the same effect (in both efficiency and functionality) as parallel assignment to imperative arrays. The strategy depends critically on the ability of a compiler to recognize not only that the incremental updates can be done destructively, but also that the updates may be done in any

Paul Hudak

1986-01-01

160

Accelerating biomedical signal processing algorithms with parallel programming on graphic processor units  

Microsoft Academic Search

This paper investigates the benefits derived from adopting Graphics Processing Unit (GPU) parallel programming in the field of biomedical signal processing. The differences in execution time when computing the Correlation Dimension (CD) of multivariate neurophysiological recordings and the Skin Conductance Level (SCL) are reported by comparing several common programming environments. Moreover, as indicated in this study, the

Evdokimos I. Konstantinidis; Christos A. Frantzidis; Lazaros Tzimkas; Costas Pappas; P. D. Bamidis

2009-01-01

161

High-performance ultra-low power VLSI analog processor for data compression  

NASA Technical Reports Server (NTRS)

An apparatus for data compression employing a parallel analog processor. The apparatus includes an array of processor cells with N columns and M rows wherein the processor cells have an input device, memory device, and processor device. The input device is used for inputting a series of input vectors. Each input vector is simultaneously input into each column of the array of processor cells in a pre-determined sequential order. An input vector is made up of M components, ones of which are input into ones of M processor cells making up a column of the array. The memory device is used for providing ones of M components of a codebook vector to ones of the processor cells making up a column of the array. A different codebook vector is provided to each of the N columns of the array. The processor device is used for simultaneously comparing the components of each input vector to corresponding components of each codebook vector, and for outputting a signal representative of the closeness between the compared vector components. A combination device is used to combine the signal output from each processor cell in each column of the array and to output a combined signal. A closeness determination device is then used for determining which codebook vector is closest to an input vector from the combined signals, and for outputting a codebook vector index indicating which of the N codebook vectors was the closest to each input vector input into the array.

Tawel, Raoul (Inventor)

1996-01-01

162

Block-Level Added Redundancy Explicit Authentication for Parallelized Encryption and Integrity Checking of Processor-Memory Transactions  

NASA Astrophysics Data System (ADS)

The bus between the System on Chip (SoC) and the external memory is one of the weakest points of computer systems: an adversary can easily probe this bus in order to read private data (data confidentiality concern) or to inject data (data integrity concern). The conventional way to protect data against such attacks and to ensure data confidentiality and integrity is to implement two dedicated engines: one performing data encryption and another data authentication. This approach, while secure, prevents parallelizability of the underlying computations. In this paper, we introduce the concept of Block-Level Added Redundancy Explicit Authentication (BL-AREA) and we describe a Parallelized Encryption and Integrity Checking Engine (PE-ICE) based on this concept. BL-AREA and PE-ICE have been designed to provide an effective solution to ensure both security services while allowing for full parallelization on processor read and write operations and optimizing the hardware resources. Compared to standard encryption which ensures only confidentiality, we show that PE-ICE additionally guarantees code and data integrity for less than 4% of run-time performance overhead.

Elbaz, Reouven; Torres, Lionel; Sassatelli, Gilles; Guillemin, Pierre; Bardouillet, Michel; Martinez, Albert

163

Obtaining identical results with double precision global accuracy on different numbers of processors in parallel particle Monte Carlo simulations  

SciTech Connect

We describe and compare different approaches for achieving numerical reproducibility in photon Monte Carlo simulations. Reproducibility is desirable for code verification, testing, and debugging. Parallelism creates a unique problem for achieving reproducibility in Monte Carlo simulations because it changes the order in which values are summed. This is a numerical problem because double precision arithmetic is not associative. Parallel Monte Carlo simulations, both domain replicated and decomposed, will run their particles in a different order during different runs of the same simulation because of the non-reproducibility of communication between processors. In addition, runs of the same simulation using different domain decompositions will also result in particles being simulated in a different order. In [1], a way of eliminating non-associative accumulations using integer tallies was described. This approach successfully achieves reproducibility at the cost of lost accuracy by rounding double precision numbers to fewer significant digits. This integer approach, and other extended and reduced precision reproducibility techniques, are described and compared in this work. Increased precision alone is not enough to ensure reproducibility of photon Monte Carlo simulations. Non-arbitrary precision approaches require a varying degree of rounding to achieve reproducibility. For the problems investigated in this work, double precision global accuracy was achievable by using 100 bits of precision or greater on all unordered sums, which were subsequently rounded to double precision at the end of every time-step.
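
The core issue can be demonstrated in a few lines: floating-point addition is not associative, so summing identical tallies in a processor-dependent order may change the result, whereas an integer (fixed-point) tally is order-independent at the cost of rounding. The scale factor below is an arbitrary illustrative choice:

    import random

    random.seed(0)
    vals = [random.expovariate(1.0) * 10 ** random.randint(-8, 8) for _ in range(10000)]
    shuffled = vals[:]
    random.shuffle(shuffled)

    print(sum(vals) == sum(shuffled))        # frequently False: result depends on order

    SCALE = 2 ** 32                          # fixed-point tally: accuracy is truncated,
    ints = [int(v * SCALE) for v in vals]    # but integer addition is associative
    print(sum(ints) == sum(int(v * SCALE) for v in shuffled))   # always True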

Cleveland, Mathew A., E-mail: cleveland7@llnl.gov; Brunner, Thomas A.; Gentile, Nicholas A.; Keasler, Jeffrey A.

2013-10-15

164

Retinal Parallel Processors: More than 100 Independent Microcircuits Operate within a Single Interneuron  

PubMed Central

SUMMARY Most neurons are highly polarized cells with branched dendrites that receive and integrate synaptic inputs and extensive axons that deliver action potential output to distant targets. By contrast, amacrine cells, a diverse class of inhibitory interneurons in the inner retina, collect input and distribute output within the same neuritic network. The extent to which most amacrine cells integrate synaptic information and distribute their output is poorly understood. Here, we show that single A17 amacrine cells provide reciprocal feedback inhibition to presynaptic bipolar cells via hundreds of independent microcircuits operating in parallel. The A17 uses specialized morphological features, biophysical properties, and synaptic mechanisms to isolate feedback microcircuits and maximize its capacity to handle many independent processes. This example of a neuron employing distributed parallel processing rather than spatial integration provides insights into how unconventional neuronal morphology and physiology can maximize network function while minimizing wiring cost. PMID:20346762

Grimes, William N.; Zhang, Jun; Graydon, Cole W.; Kachar, Bechara; Diamond, Jeffrey S.

2010-01-01

165

Parallel application benchmarks and performance evaluation of the Intel Xeon 7500 family processors  

Microsoft Academic Search

With the recent advent of novel multi- and many-core hardware architectures, application programmers have to deal with many hardware-specific implementation details and have to be familiar with software optimization techniques to benefit from new high-performance computing machines. Highly efficient parallel application design is in fact an interdisciplinary process involving domain-specific and IT experts. Therefore, this paper aims to present

Piotr Kopta; Michal Kulczewski; Krzysztof Kurowski; Tomasz Piontek; Pawel Gepner; Mariusz Puchalski; Jacek Komasa

2011-01-01

166

Reconfigurable Parallel VLSI CoProcessor for Space Robot Using FPGA  

Microsoft Academic Search

This paper proposes hardware solutions to the computation of the trigonometric and square root functions of inverse kinematics. They are based on an existing pipeline arithmetic which employs the CORDIC (Coordinate Rotation Digital Computer) algorithm. This integrated approach enhances computational efficiency by reducing the duplicate calculations of these functions and maximizing parallel/pipelined processing for real-time robot control. The reliability of
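
As background, the CORDIC iteration that such a coprocessor pipelines in hardware evaluates trigonometric functions using only shift-and-add style updates; the Python sketch below illustrates the rotation-mode algorithm itself, not the authors' FPGA design:

    import math

    def cordic_sin_cos(theta, iterations=32):
        # Valid for |theta| up to about 1.74 rad (the CORDIC convergence range).
        angles = [math.atan(2.0 ** -i) for i in range(iterations)]
        gain = 1.0
        for i in range(iterations):
            gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
        x, y, z = 1.0 / gain, 0.0, theta     # pre-scale by 1/gain
        for i in range(iterations):
            d = 1.0 if z >= 0 else -1.0
            x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
            z -= d * angles[i]
        return y, x                          # (sin(theta), cos(theta))

    print(cordic_sin_cos(0.5), (math.sin(0.5), math.cos(0.5)))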

R. Wei; M. H. Jin; J. J. Xia; Z. W. Xie; Hong Liu

2006-01-01

167

Electrostatic quadrupole array for focusing parallel beams of charged particles  

DOEpatents

An array of electrostatic quadrupoles, capable of providing strong electrostatic focusing simultaneously on multiple beams, is easily fabricated from a single array element comprising a support rod and multiple electrodes spaced at intervals along the rod. The rods are secured to four terminals which are isolated by only four insulators. This structure requires bias voltage to be supplied to only two terminals and eliminates the need for individual electrode bias and insulators, as well as increases life by eliminating beam plating of insulators.

Brodowski, John (Smithtown, NY)

1982-11-23

168

High-speed, automatic controller design considerations for integrating array processor, multi-microprocessor, and host computer system architectures  

NASA Technical Reports Server (NTRS)

Modern control systems must typically perform real-time identification and control, as well as coordinate a host of other activities related to user interaction, online graphics, and file management. This paper discusses five global design considerations which are useful to integrate array processor, multimicroprocessor, and host computer system architectures into versatile, high-speed controllers. Such controllers are capable of very high control throughput, and can maintain constant interaction with the nonreal-time or user environment. As an application example, the architecture of a high-speed, closed-loop controller used to actively control helicopter vibration is briefly discussed. Although this system has been designed for use as the controller for real-time rotorcraft dynamics and control studies in a wind tunnel environment, the controller architecture can generally be applied to a wide range of automatic control applications.

Jacklin, S. A.; Leyland, J. A.; Warmbrodt, W.

1985-01-01

169

VLSI processor with a configurable processing element array for balanced feature extraction in high-resolution images  

NASA Astrophysics Data System (ADS)

A VLSI processor employing a configurable processing element array (PEA) is developed for a newly proposed balanced feature extraction algorithm. In the algorithm, the input image is divided into square regions and the number of features is determined by noise effect analysis in each region. Regions of different sizes are used according to the resolutions and contents of input images. Therefore, inside the PEA, processing elements are hierarchically grouped for feature extraction in regions of different sizes. A proof-of-concept chip is fabricated using a 0.18 µm CMOS technology with a 32 × 32 PEA. From measurement results, a speed of 7.5 kfps is achieved for feature extraction in 128 × 128 pixel regions when operating the chip at 45 MHz, and a speed of 55 fps is also achieved for feature extraction in 1920 × 1080 pixel images.

Zhu, Hongbo; Shibata, Tadashi

2014-01-01

170

Capanic: A Parallel Tree N-Body Code for Inhomogeneous Clusters of Processors  

E-print Network

We have implemented a parallel version of the Barnes-Hut 3-D N-body tree algorithm under PVM 3.2.5, adopting an SPMD paradigm. We parallelize the problem by decomposing the physical domain by means of the Orthogonal Recursive Bisection oct-tree scheme suggested by Salmon (1991), but we modify the original hypercube communication pattern into an incomplete hypercube, which is more suitable for a generic inhomogeneous cluster architecture. We address dynamic load balancing by assigning different "weights" to the spawned tasks according to the dynamically changing workloads of each task. The weights are determined by monitoring the local platforms where the tasks are running and estimating the performance of each task. The monitoring scheme is flexible and allows us to address cluster-related and intrinsic sources of load imbalance at the same time. We then show measurements of our code on a test case of astrophysical interest in order to assess the performance of our implementation.
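
The Orthogonal Recursive Bisection decomposition mentioned above can be sketched in a few lines: the particle set is recursively split at the median along alternating axes so that each processor group receives a spatially compact, equally sized subset (task weights and the incomplete-hypercube mapping are omitted here):

    import numpy as np

    def orb(positions, depth):
        # positions: (N, 3) particle coordinates; returns 2**depth domains
        if depth == 0:
            return [positions]
        axis = depth % 3
        order = positions[:, axis].argsort()
        half = len(positions) // 2
        left, right = positions[order[:half]], positions[order[half:]]
        return orb(left, depth - 1) + orb(right, depth - 1)

    domains = orb(np.random.rand(1000, 3), 3)   # 8 domains for 8 tasks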

V. Antonuccio-Delogu; U. Becciani

1994-06-24

171

Method of up-front load balancing for local memory parallel processors  

NASA Technical Reports Server (NTRS)

In a parallel processing computer system with multiple processing units and shared memory, a method is disclosed for uniformly balancing the aggregate computational load in, and utilizing minimal memory by, a network having identical computations to be executed at each connection therein. Read-only and read-write memory are subdivided into a plurality of process sets, which function like artificial processing units. Said plurality of process sets is iteratively merged and reduced to the number of processing units without exceeding the balance load. Said merger is based upon the value of a partition threshold, which is a measure of the memory utilization. The turnaround time and memory savings of the instant method are functions of the number of processing units available and the number of partitions into which the memory is subdivided. Typical results of the preferred embodiment yielded memory savings of from sixty to seventy five percent.

Baffes, Paul Thomas (inventor)

1990-01-01

172

Synthesis of algorithmically oriented architectures of optical processors  

NASA Astrophysics Data System (ADS)

Methods for designing the architecture of an optical multiplanar array processor are developed. The underlying abstract model (Parallel Substitution Algorithm and Parallel Microprogramming) and the associated theory of parallel distributed computations are briefly described. A general method for determining the cell logic and structure of the array which maps a fine-grained parallel algorithm onto an electrooptic processing unit is proposed. Formal tools for transforming a 2D cellular algorithm into a 3D representation are given in short form. The proposed method provides a formal basis for computer-aided design of algorithmically oriented electrooptic architectures.

Bandman, O. L.

1993-07-01

173

Fast sub-image segmentation using parallel neural processors and image decomposition  

NASA Astrophysics Data System (ADS)

In this paper, an approach to reduce the computation steps required by fast neural networks for the searching process is presented. The divide-and-conquer principle is applied through image decomposition. Each image is divided into small sub-images, and each one is then tested separately using a fast neural network. The operation of fast neural networks is based on applying cross-correlation in the frequency domain between the input image and the weights of the hidden neurons. Compared to conventional and fast neural networks, experimental results show that a speed-up ratio is achieved when applying this technique to locate human faces automatically in cluttered scenes. Furthermore, faster face detection is obtained by using parallel processing techniques to test the resulting sub-images at the same time using the same number of fast neural networks. In contrast to using only fast neural networks, the speed-up ratio increases with the size of the input image when using fast neural networks and image decomposition. This is a new improvement over our previous publications [1,2,7,9]. Moreover, the simulation results are better than those presented in our previous publications.
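
The frequency-domain cross-correlation that underlies the "fast neural network" test of each sub-image can be expressed directly with FFTs; the NumPy sketch below shows that single building block under the usual zero-padding assumptions (it is not the full face-detection pipeline):

    import numpy as np

    def cross_correlate_fft(image, weights):
        # Zero-pad to the full linear-correlation size, multiply spectra
        # (one of them conjugated), and transform back.
        rows = image.shape[0] + weights.shape[0] - 1
        cols = image.shape[1] + weights.shape[1] - 1
        f_img = np.fft.rfft2(image, s=(rows, cols))
        f_w = np.fft.rfft2(weights, s=(rows, cols))
        return np.fft.irfft2(f_img * np.conj(f_w), s=(rows, cols))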

El-Bakry, Hazem M.

2005-07-01

174

Experimental verification of SNR and parallel imaging improvements using composite arrays.  

PubMed

Composite MRI arrays consist of triplets where two orthogonal upright loops are placed over the same imaging area as a standard surface coil. The optimal height of the upright coils is approximately half the width for the 7 cm coils used in this work. Resistive and magnetic coupling is shown to be negligible within each coil triplet. Experimental evaluation of imaging performance was carried out on a Philips 3 T Achieva scanner using an eight-coil composite array consisting of three surface coils and five upright loops, as well as an array of eight surface coils for comparison. The composite array offers lower overall coupling than the traditional array. The sensitivities of upright coils are complementary to those of the surface coils and therefore provide SNR gains in regions where surface coil sensitivity is low, and additional spatial information for improved parallel imaging performance. Near the surface of the phantom the eight-channel surface coil array provides higher overall SNR than the composite array, but this advantage disappears beyond a depth of approximately one coil diameter, where it is typically more challenging to improve SNR. Furthermore, parallel imaging performance is better with the composite array compared with the surface coil array, especially at high accelerations and in locations deep in the phantom. Composite arrays offer an attractive means of improving imaging performance and channel density without reducing the size, and therefore the loading regime, of surface coil elements. Additional advantages of composite arrays include minimal SNR loss using root-sum-of-squares combination compared with optimal, and the ability to switch from high to low channel density by merely selecting only the surface elements, unlike surface coil arrays, which require additional hardware. PMID:25388793

Maunder, Adam; Fallone, B Gino; Daneshmand, Mojgan; De Zanche, Nicola

2015-02-01

175

Parallel RNA extraction using magnetic beads and a droplet array.  

PubMed

Nucleic acid extraction is a necessary step for most genomic/transcriptomic analyses, but it often requires complicated mechanisms to be integrated into a lab-on-a-chip device. Here, we present a simple, effective configuration for rapidly obtaining purified RNA from low concentration cell medium. This Total RNA Extraction Droplet Array (TREDA) utilizes an array of surface-adhering droplets to facilitate the transportation of magnetic purification beads seamlessly through individual buffer solutions without solid structures. The fabrication of TREDA chips is rapid and does not require a microfabrication facility or expertise. The process takes less than 5 minutes. When purifying mRNA from bulk marine diatom samples, its repeatability and extraction efficiency are comparable to conventional tube-based operations. We demonstrate that TREDA can extract the total mRNA of about 10 marine diatom cells, indicating that the sensitivity of TREDA approaches single-digit cell numbers. PMID:25519439

Shi, Xu; Chen, Chun-Hong; Gao, Weimin; Chao, Shih-Hui; Meldrum, Deirdre R

2015-02-01

176

Parallel recording with optical waveguide array (Haifeng Wang)

E-print Network

a linear array of laser diodes onto different tracks operated in the far field [1-5], and using a linear-mode rectangular waveguide. The material of the guide layer is SiN (n=2.05) and of the cladding layer is SiO2 (n=1.47). The thicknesses of the cladding (guide) layers are 1100 (800) nm for the multi-mode waveguide and 1120 (160) nm

177

Skew-free parallel optical links and their array technology  

Microsoft Academic Search

Parallel optical links are promising key techniques for increasing the transmission capacity and distance of high-throughput interconnections in B-ISDN systems. The most important requirements in these links are small size, low power consumption, and low cost. These requirements cannot be met using the optical devices and modules used in conventional long-haul, large-capacity optical transmission systems. This paper describes some new

M. Yano; G. Nakagawa; N. Fujimoto

1995-01-01

178

Optoelectronic smart-pixel arrays for parallel processing  

NASA Astrophysics Data System (ADS)

The transition of thinking from the use of all-optical logic arrays to optoelectronic smart-pixel arrays within digital information processing is described. Detectors with low conversion energies and fast modulators, developed from devices originally intended as bistable elements, now provide one of the technologies being pursued for the optics-electronics interface. The key advantage provided by optics is the huge bandwidth off-chip (THz); electronics provides high-density, locally interconnected (on-chip) logic devices. Applications that exploit this combination are being sought. One possible area is the sorting of data sets, where non-local interconnections between stages and modest logic functionality per stage are required in order to implement fast algorithms. The expected performance of a smart-pixel sorting module, such as that under construction by the Scottish Collaborative Initiative in Optoelectronic Sciences (SCIOS), is summarized. The move from all-optical to hybrid technologies does not eradicate the need for further advances in materials and in the processing control of materials with nonlinear optical (electro-absorption and electro-optic) responses.
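
The sorting application mentioned above relies on fixed compare-exchange stages with non-local wiring between them; as a purely illustrative software analogue (not the SCIOS module itself), the sketch below runs an iterative bitonic sorter, whose stages use exactly that style of long-distance partner connections.

```python
# Illustrative bitonic sorting network: each stage compare-exchanges elements with
# a partner at a fixed, possibly non-local, distance -- the algorithmic pattern a
# staged optoelectronic sorting module would implement in hardware.
def bitonic_sort(data):
    n = len(data)                      # must be a power of two for this sketch
    assert n & (n - 1) == 0
    a = list(data)
    k = 2
    while k <= n:                      # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:                  # "wiring distance" of this stage
            for i in range(n):
                partner = i ^ j        # non-local partner index for compare-exchange
                if partner > i:
                    ascending = (i & k) == 0
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a

print(bitonic_sort([7, 3, 6, 2, 8, 5, 1, 4]))
```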

Wherrett, B. S.; Desmulliez, M. P. Y.

1996-09-01

179

Achieving supercomputer performance for neural net simulation with an array of digital signal processors  

SciTech Connect

Music, a DSP-based system with a parallel distributed-memory architecture, provides enormous computing power yet retains the flexibility of a general-purpose computer. Reaching a peak performance of 2.7 Gflops at a significantly lower cost, power consumption, and space requirement than conventional supercomputers, Music is well suited to computationally intensive applications such as neural network simulation. 12 refs., 9 figs., 2 tabs.

Muller, U.A.; Baumle, B.; Kohler, P.; Gunzinger, A.; Guggenbuhl, W. [Swiss Federal Inst. of Technology, Zurich (Switzerland)] [Swiss Federal Inst. of Technology, Zurich (Switzerland)

1992-10-01

180

Achieving supercomputer performance for neural net simulation with an array of digital signal processors  

Microsoft Academic Search

Music, a DSP-based system with a parallel distributed-memory architecture, provides enormous computing power yet retains the flexibility of a general-purpose computer. Reaching a peak performance of 2.7 Gflops at a significantly lower cost, power consumption, and space requirement than conventional supercomputers, Music is well suited to computationally intensive applications such as neural network simulation. 12 refs., 9 figs., 2 tabs.

Urs A. Muller; Bernhard Baumle; Peter Kohler; Anton Gunzinger; Walter Guggenbuhl

1992-01-01

181

Design Optimization of VLSI Array Processor Architecture for Window Image Processing  

E-print Network

Window operations in image processing, which perform repetitive operations over the pixels in the image data space, mostly consist of identical operations applied to every pixel. In general, algorithms with such high computational demands as window operations can only be carried out by parallel and pipeline processing. In this paper, we focus on VLSI array processor architecture design for window operations in image processing. Systolic architectures have many advantages owing to their regularity and modularity, and allow high performance to be obtained in terms of data flow rate. Many researchers have been developing methods to derive dedicated systolic architectures. Most of these methods are based on the same concept. In practical VLSI design, the performance of an architecture, such as computation time, number of processors, design complexity and so on, is often dependent on the user's requirements. Thus, the designer must look for an optimal architecture for a given specification. However ...

Dongju Li; Li Jiang; Hiroaki KUNIEDA

182

Towards picosecond array detector for single-photon time-resolved multispot parallel analysis  

Microsoft Academic Search

Over the past few years there has been a growing interest in monolithic arrays of single photon avalanche diodes (SPADs) for parallel detection of faint ultrafast optical signals. SPADs implemented in CMOS-compatible planar technologies offer the typical advantages of microelectronic devices (small size, ruggedness, low voltage, low power, etc.). Furthermore, they have inherently higher photon detection efficiency than photomultiplier tubes

Ivan Rech; Angelo Gulinatti; Matteo Crotti; Corrado Cammi; Piera Maccagnani; Massimo Ghioni

2011-01-01

183

Multicoil resonance-based parallel array for smart wireless power delivery.  

PubMed

This paper presents a novel resonance-based multicoil structure as a smart power surface to wirelessly power up apparatus such as mobile devices, animal headstages, implanted devices, etc. The proposed powering system is based on a 4-coil resonance-based inductive link, the resonance coil of which is formed by an array of several paralleled coils acting as a smart power transmitter. The power transmitter employs simple circuit connections and includes only one power driver circuit per multicoil resonance-based array, which enables higher power transfer efficiency and power delivery to the load. The power transmitted by the driver circuit is proportional to the load seen by the individual coils in the array. Thus, the transmitted power scales with the load of the electric/electronic system to be powered up, and does not divide equally over all the parallel coils that form the array. Instead, only the loaded coils of the parallel array transmit a significant part of the total transmitted power to the receiver. Such adaptive behavior enables superior power, size and cost efficiency compared to other solutions, since it does not need complex detection circuitry to find the location of the load. The performance of the proposed structure is verified by measurement results. Natural load detection and coverage of a four times larger area than conventional topologies, with a power transfer efficiency of 55%, are the novelties of the presented paper. PMID:24109796

Mirbozorgi, S A; Sawan, M; Gosselin, B

2013-01-01

184

Using a Cray Y-MP as an array processor for a RISC Workstation  

NASA Technical Reports Server (NTRS)

As microprocessors increase in power, the economics of centralized computing has changed dramatically. At the beginning of the 1980's, mainframes and supercomputers were often considered to be cost-effective machines for scalar computing. Today, microprocessor-based RISC (reduced-instruction-set computer) systems have displaced many uses of mainframes and supercomputers. Supercomputers are still cost competitive when processing jobs that require both large memory size and high memory bandwidth. One such application is array processing. Certain numerical operations are appropriate to use in a Remote Procedure Call (RPC)-based environment. Matrix multiplication is an example of an operation that can have a sufficient number of arithmetic operations to amortize the cost of an RPC call. An experiment which demonstrates that matrix multiplication can be executed remotely on a large system to speed the execution over that experienced on a workstation is described.
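
The record's argument is about amortizing RPC overhead with enough arithmetic. A minimal back-of-envelope model of that trade-off is sketched below; all rates (local and remote speeds, link bandwidth, per-call overhead) are invented placeholder numbers, not measured values from the report.

```python
# Rough cost model (not from the paper): when is it worth shipping an n x n
# matrix multiply to a remote machine over RPC?
def remote_is_faster(n, local_mflops=10.0, remote_mflops=300.0,
                     link_mbytes_per_s=1.0, rpc_overhead_s=0.05):
    flops = 2.0 * n ** 3                       # multiply-add count for C = A @ B
    bytes_moved = 3 * n * n * 8                # A and B out, C back, 64-bit words
    t_local = flops / (local_mflops * 1e6)
    t_remote = (rpc_overhead_s
                + bytes_moved / (link_mbytes_per_s * 1e6)
                + flops / (remote_mflops * 1e6))
    return t_remote < t_local

# Example: a 1000 x 1000 multiply is worth off-loading under these assumed rates.
print(remote_is_faster(1000))
```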

Lamaster, Hugh; Rogallo, Sarah J.

1992-01-01

185

Development of a ground signal processor for digital synthetic array radar data  

NASA Technical Reports Server (NTRS)

A modified APQ-102 sidelooking array radar (SLAR) in a B-57 aircraft test bed is used, with other optical and infrared sensors, in remote sensing of Earth surface features for various users at NASA Johnson Space Center. The video from the radar is normally recorded on photographic film and subsequently processed photographically into high resolution radar images. Using a high speed sampling (digitizing) system, the two receiver channels of cross-and co-polarized video are recorded on wideband magnetic tape along with radar and platform parameters. These data are subsequently reformatted and processed into digital synthetic aperture radar images with the image data available on magnetic tape for subsequent analysis by investigators. The system design and results obtained are described.

Griffin, C. R.; Estes, J. M.

1981-01-01

186

Multithreading and Parallel Microprocessors  

E-print Network

Multithreading and Parallel Microprocessors. Stephen Jenks, Electrical Engineering and Computer Science, Scalable Parallel and Distributed Systems Lab. Outline: parallelism in microprocessors; multicore processor parallelism; parallel programming for shared memory (OpenMP, POSIX Threads, Java Threads); parallel microprocessor ...

Shinozuka, Masanobu

187

Sequence information signal processor  

DOEpatents

An electronic circuit is used to compare two sequences, such as genetic sequences, to determine which alignment of the sequences produces the greatest similarity. The circuit includes a linear array of series-connected processors, each of which stores a single element from one of the sequences and compares that element with each successive element in the other sequence. For each comparison, the processor generates a scoring parameter that indicates which segment ending at those two elements produces the greatest degree of similarity between the sequences. The processor uses the scoring parameter to generate a similar scoring parameter for a comparison between the stored element and the next successive element from the other sequence. The processor also delivers the scoring parameter to the next processor in the array for use in generating a similar scoring parameter for another pair of elements. The electronic circuit determines which processor and alignment of the sequences produce the scoring parameter with the highest value.
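
The following is a minimal software sketch of the kind of segment-scoring recurrence each processor in such a linear array evaluates (a local-alignment-style dynamic program); the patent's exact scoring rule is not reproduced here, and the match/mismatch/gap weights are arbitrary illustration values.

```python
# Local-alignment-style scoring: each cell keeps the best score of a segment
# ending at that pair of elements, analogous to the per-processor scoring
# parameter described in the abstract.
def best_segment_score(seq_a, seq_b, match=2, mismatch=-1, gap=-1):
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    score = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(0,
                              score[i - 1][j - 1] + s,   # extend the segment
                              score[i - 1][j] + gap,     # gap in seq_b
                              score[i][j - 1] + gap)     # gap in seq_a
            best = max(best, score[i][j])
    return best

print(best_segment_score("GATTACA", "GCATGCA"))
```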

Peterson, John C. (Alta Loma, CA); Chow, Edward T. (San Dimas, CA); Waterman, Michael S. (Culver City, CA); Hunkapillar, Timothy J. (Pasadena, CA)

1999-01-01

188

HIFI: A design method for implementing signal processing algorithms on VLSI processor arrays  

NASA Astrophysics Data System (ADS)

A design method, called HIFI, that makes it possible to systematically implement a large class of signal processing algorithms on systolic and wavefront arrays is defined. The model underlying the definition of the design method is a combination of a process-oriented model and an applicative, function-oriented model. The result is a model that combines a high level of abstraction with powerful decomposition mechanisms. The model is used to define the HIFI design method, which allows both top-down and bottom-up design styles. The HIFI design method is based on two different design steps: refinement, which makes it possible to define the decomposition of a function with a dependence graph; and partitioning, used to project a dependence graph onto a so-called signal flow graph, which allows a more efficient implementation of the function defined by the dependence graph. The design method is illustrated by an algorithm for the solution of a system of linear equations, and the transitive closure algorithm.

Annevelink, Jurgen

1988-11-01

189

Photon detection with parallel asynchronous processing  

NASA Technical Reports Server (NTRS)

An approach to photon detection with a parallel asynchronous signal processor is described. The visible or IR photon-detection capability of the silicon p(+)-n-n(+) detectors and the parallel asynchronous processing are addressed separately. This approach would permit an independent analog processing channel to be dedicated to every pixel. A laminar architecture consisting of a stack of planar arrays of the devices would form a 2D array processor with a 2D array of inputs located directly behind a focal-plane detector array. A 2D image data stream would propagate in neuronlike asynchronous pulse-coded form through the laminar processor. Such systems can integrate image acquisition and image processing. Acquisition and processing would be performed concurrently as in natural vision systems. The possibility of multispectral image processing is addressed.

Coon, D. D.; Perera, A. G. U.

1990-01-01

190

Automatic Parallelization of Numerical Python Applications using the Global Arrays Toolkit  

SciTech Connect

Global Arrays is a software system from Pacific Northwest National Laboratory that enables an efficient, portable, and parallel shared-memory programming interface to manipulate distributed dense arrays. The NumPy module is the de facto standard for numerical calculation in the Python programming language, a language whose use is growing rapidly in the scientific and engineering communities. NumPy provides a powerful N-dimensional array class as well as other scientific computing capabilities. However, like the majority of the core Python modules, NumPy is inherently serial. Using a combination of Global Arrays and NumPy, we have reimplemented NumPy as a distributed drop-in replacement called Global Arrays in NumPy (GAiN). Serial NumPy applications can become parallel, scalable GAiN applications with only minor source code changes. Scalability studies of several different GAiN applications will be presented showing the utility of developing serial NumPy codes which can later run on more capable clusters or supercomputers.
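
A hedged illustration of the drop-in idea described above: the numerical expressions stay the same and only the import changes. The GAiN module path shown here is an assumption and should be checked against the installed Global Arrays Python bindings.

```python
# Sketch of the "minor source code change" claim: swap the import, keep the math.
try:
    import ga4py.gain as np      # distributed arrays via Global Arrays, if available (assumed module path)
except ImportError:
    import numpy as np           # falls back to ordinary serial NumPy

a = np.ones((1000, 1000))
b = np.arange(1000 * 1000, dtype=float).reshape(1000, 1000)
print((a * b).sum())             # the same expression runs serially or distributed
```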

Daily, Jeffrey A.; Lewis, Robert R.

2011-11-30

191

Symbolic Array Dataflow Analysis for Array Privatization and Program Parallelization  

E-print Network

This paper proposes a powerful symbolic array dataflow analysis to support array privatization and loop parallelization.

Li, Zhiyuan

192

The effect of steroids on peripheral blood lymphocytes containing parallel tubular arrays.  

PubMed Central

The response of lymphocytes containing cytoplasmic inclusions called parallel tubular arrays (PTA) was determined after the administration of the glucocorticoid dexamethasone to 10 healthy volunteers. The percentage of these lymphocytes was found to increase during the lymphopenia induced by steroid administration. The size and number of parallel tubular arrays per cell showed no differences before and after steroid administration, indicating that the increase was a result of a change in the proportion of whole cells. This indicates, for the first time, that a morphologically defined population of lymphocytes from the normal peripheral circulation has been linked to a specific response, ie, steroid resistance. The possible mechanism of steroid resistance is discussed. PMID:686151

Payne, C. M.; Glasser, L.

1978-01-01

193

A 32-Channel Lattice Transmission Line Array for Parallel Transmit and Receive MRI at 7 Tesla  

PubMed Central

Transmit and receive RF coil arrays have proven to be particularly beneficial for ultra-high-field MR. Transmit coil arrays enable such techniques as B1+ shimming to substantially improve transmit B1 homogeneity compared to conventional volume coil designs, and receive coil arrays offer enhanced parallel imaging performance and SNR. Concentric coil arrangements hold promise for developing transceiver arrays incorporating large numbers of coil elements. At magnetic field strengths of 7 tesla and higher where the Larmor frequencies of interest can exceed 300 MHz, the coil array design must also overcome the problem of the coil conductor length approaching the RF wavelength. In this study, a novel concentric arrangement of resonance elements built from capacitively-shortened half-wavelength transmission lines is presented. This approach was utilized to construct an array with whole-brain coverage using 16 transceiver elements and 16 receive-only elements, resulting in a coil with a total of 16 transmit and 32 receive channels. PMID:20512850

Adriany, Gregor; Auerbach, Edward J.; Snyder, Carl J.; Gözübüyük, Ark; Moeller, Steen; Ritter, Johannes; van de Moortele, Pierre-Francois; Vaughan, Tommy; Uğurbil, Kamil

2010-01-01

194

Transmit and receive transmission line arrays for 7 Tesla parallel imaging.  

PubMed

Transceive array coils, capable of RF transmission and independent signal reception, were developed for parallel, 1H imaging applications in the human head at 7 T (300 MHz). The coils combine the advantages of high-frequency properties of transmission lines with classic MR coil design. Because of the short wavelength at the 1H frequency at 300 MHz, these coils were straightforward to build and decouple. The sensitivity profiles of individual coils were highly asymmetric, as expected at this high frequency; however, the summed images from all coils were relatively uniform over the whole brain. Data were obtained with four- and eight-channel transceive arrays built using a loop configuration and compared to arrays built from straight stripline transmission lines. With both the four- and the eight-channel arrays, parallel imaging with sensitivity encoding with high reduction numbers was feasible at 7 T in the human head. A one-dimensional reduction factor of 4 was robustly achieved with an average g value of 1.25 with the eight-channel transmit/receive coils. PMID:15678527

Adriany, Gregor; Van de Moortele, Pierre-Francois; Wiesinger, Florian; Moeller, Steen; Strupp, John P; Andersen, Peter; Snyder, Carl; Zhang, Xiaoliang; Chen, Wei; Pruessmann, Klaas P; Boesiger, Peter; Vaughan, Tommy; Uğurbil, Kâmil

2005-02-01

195

TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. XX, NO. YY, ZZZ 2006 1 Performance Models for Network Processor Design  

E-print Network

... on-chip memory, various network and off-chip memory interfaces, and other specialized logic components ... high-performance network processor, and presents insights into how best to configure the numerous design elements ... between the four elements listed above and relates them both to specific design elements (e.g., number

Shenoy, Prashant

196

Image fiber optic space-CDMA parallel transmission experiment using 8 x 8 VCSEL/PD arrays.  

PubMed

We experimentally demonstrate space-code-division multiple access (space-CDMA) based two-dimensional (2-D) parallel optical interconnections by using image fibers and 8 x 8 vertical-cavity surface-emitting laser (VCSEL)/photodiode (PD) arrays. Two spatially encoded four-bit (2 x 2) parallel optical signals were emitted from 2-D VCSEL arrays and transmitted through image fibers. The encoded signals were multiplexed by an image-fiber coupler and detected by a 2-D PD array on the receiver side. The receiver recovered the intended parallel signal by decoding the signal. The transmission speed was 64 Mbps/ch (total throughput: 512 Mbps). Bit-error-rate (BER) measurement with a laterally misaligned PD array showed the array had a misalignment tolerance of 25 μm for a BER performance of 10(-9). PMID:12440546

Nakamura, Moriya; Kitayama, Ken-ichi; Igasaki, Yasunori; Shamoto, Naoki; Kaneda, Keiji

2002-11-10

197

Weak-Periodic Stochastic Resonance in a Parallel Array of Static Nonlinearities  

PubMed Central

This paper studies the output-input signal-to-noise ratio (SNR) gain of an uncoupled parallel array of static, yet arbitrary, nonlinear elements for transmitting a weak periodic signal in additive white noise. In the small-signal limit, an explicit expression for the SNR gain is derived. It serves to prove that the SNR gain is always a monotonically increasing function of the array size for any given nonlinearity and noisy environment. It also determines the SNR gain maximized by the locally optimal nonlinearity as the upper bound of the SNR gain achieved by an array of static nonlinear elements. With locally optimal nonlinearity, it is demonstrated that stochastic resonance cannot occur, i.e. adding internal noise into the array never improves the SNR gain. However, in an array of suboptimal but easily implemented threshold nonlinearities, we show the feasibility of situations where stochastic resonance occurs, and also the possibility of the SNR gain exceeding unity for a wide range of input noise distributions. PMID:23505523
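
A toy Monte Carlo sketch of the setup discussed above, under assumed Gaussian noise and unit thresholds: each element thresholds the common noisy input plus its own internal noise, the element outputs are averaged, and a simple correlation with the weak sinusoid serves as a crude stand-in for the output SNR. All parameter values are illustrative.

```python
# Toy simulation of an uncoupled parallel array of threshold nonlinearities.
import numpy as np

def array_output(signal, input_noise_std, internal_noise_std, n_elements, rng):
    x = signal + rng.normal(0.0, input_noise_std, signal.shape)   # common input noise
    outs = [(x + rng.normal(0.0, internal_noise_std, x.shape) > 1.0).astype(float)
            for _ in range(n_elements)]                            # threshold at 1.0
    return np.mean(outs, axis=0)

rng = np.random.default_rng(0)
t = np.arange(20000)
weak_signal = 0.1 * np.sin(2 * np.pi * t / 100)          # weak periodic input
for n in (1, 4, 64):
    y = array_output(weak_signal, 0.5, 0.3, n, rng)
    # correlation with the input sinusoid as a crude proxy for the output SNR
    print(n, np.corrcoef(weak_signal, y - y.mean())[0, 1])
```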

Ma, Yumei; Duan, Fabing; Chapeau-Blondeau, François; Abbott, Derek

2013-01-01

198

Parallel and series FED microstrip array with high efficiency and low cross polarization  

NASA Technical Reports Server (NTRS)

A microstrip array antenna for vertically polarized fan beam (approximately 2 deg x 50 deg) for C-band SAR applications with a physical area of 1.7 m by 0.17 m comprises two rows of patch elements and employs a parallel feed to left- and right-half sections of the rows. Each section is divided into two segments that are fed in parallel with the elements in each segment fed in series through matched transmission lines for high efficiency. The inboard section has half the number of patch elements of the outboard section, and the outboard sections, which have tapered distribution with identical transmission line sections, terminated with half wavelength long open-circuit stubs so that the remaining energy is reflected and radiated in phase. The elements of the two inboard segments of the two left- and right-half sections are provided with tapered transmission lines from element to element for uniform power distribution over the central third of the entire array antenna. The two rows of array elements are excited at opposite patch feed locations with opposite (180 deg difference) phases for reduced cross-polarization.

Huang, John (inventor)

1995-01-01

199

Design and parallel fabrication of wire-grid polarization arrays for polarization-resolved imaging at 1.55 μm  

E-print Network

Polarization-resolved imaging can provide information about the composition and topography ... areas of square-micrometer-size WGP arrays suitable for polarization-resolved imaging on glass were ...

Klotzkin, David

200

PROPELLER-EPI With Parallel Imaging Using a Circularly Symmetric Phased-Array RF Coil at 3.0 T  

E-print Network

... (PROPELLER) and parallel imaging is presented for diffusion echo-planar imaging (EPI) at high spatial resolution ... the phase-encoding direction, and PROPELLER acquisition to further decrease the echo train length (ETL

201

Dynamic scheduling and planning parallel observations on large Radio Telescope Arrays with the Square Kilometre Array in mind  

NASA Astrophysics Data System (ADS)

Scheduling, the task of producing a time table for resources and tasks, is well known to be a difficult problem the more resources are involved (an NP-hard problem). This is about to become an issue in radio astronomy as observatories consisting of hundreds to thousands of telescopes are planned and operated. The Square Kilometre Array (SKA), which Australia and New Zealand bid to host, is aiming for scales where current approaches -- in construction, operation but also scheduling -- are insufficient. Although manual scheduling is common today, the problem is becoming complicated by the demand for (1) independent sub-arrays doing simultaneous observations, which requires the scheduler to plan parallel observations, and (2) dynamic re-scheduling on changed conditions. Both of these requirements apply to the SKA, especially in the construction phase. We review the scheduling approaches taken in the astronomy literature, as well as investigate techniques from human schedulers and today's observatories. The scheduling problem is specified in general for scientific observations and in particular on radio telescope arrays. Also taken into account is the fact that the observatory may be oversubscribed, requiring the scheduling problem to be integrated with a planning process. We solve this long-term scheduling problem using a time-based encoding that works in the very general case of observation scheduling. This research then compares algorithms from various approaches, including fast heuristics from CPU scheduling, linear integer programming, genetic algorithms, and branch-and-bound enumeration schemes. Measures include not only goodness of the solution, but also scalability and re-scheduling capabilities. In conclusion, we have identified a fast and good scheduling approach that allows (re-)scheduling difficult and changing problems by combining heuristics with a genetic algorithm using block-wise mutation operations. We are able to explain and eradicate two problems in the literature: the inability of a GA to properly improve schedules and the generation of schedules with frequent interruptions. Finally, we demonstrate the scheduling framework for several operating telescopes: (1) dynamic re-scheduling with the AUT Warkworth 12m telescope, (2) scheduling for the Australian Mopra 22m telescope, and scheduling for the Allen Telescope Array. Furthermore, we discuss the applicability of the presented scheduling framework to the Atacama Large Millimeter/submillimeter Array (ALMA, in construction) and the SKA. In particular, during the development phase of the SKA, this dynamic, scalable scheduling framework can accommodate changing conditions.
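
For concreteness, a minimal sketch of a block-wise mutation step on a time-slot schedule encoding is given below; the encoding, block length, and mutation rule are assumptions made for illustration, not the thesis's actual operators.

```python
# Hedged sketch of a "block-wise" mutation on a time-based schedule encoding.
import random

def blockwise_mutate(schedule, block_len=4, rng=random.Random(0)):
    """Reassign one contiguous block of time slots to a single observation."""
    mutated = list(schedule)
    start = rng.randrange(0, max(1, len(mutated) - block_len))
    new_obs = rng.choice(mutated)            # pick an observation id already in use
    for slot in range(start, min(start + block_len, len(mutated))):
        mutated[slot] = new_obs              # the whole block gets the same observation
    return mutated

# schedule encoding: slot index -> observation id
print(blockwise_mutate(["A", "B", "C", "A", "D", "B", "C", "D"]))
```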

Buchner, Johannes

2011-12-01

202

Design and implementation of a parallel array operator for the arbitrary remapping of data.  

SciTech Connect

The data redistribution or remapping functions, gather and scatter, are of long standing in high-performance computing, having been included in Cray Fortran for decades. In this paper, we present a highly general array operator with powerful gather and scatter capabilities unmatched in other array languages. We discuss an efficient parallel implementation, introducing several new optimizations (run-length encoding, dead array reuse, and direct communication) that lessen the costs associated with the operator's wide applicability. In our implementation of this operator in ZPL, we demonstrate comparable performance to the highly tuned, hand-coded Fortran plus MPI versions of the NAS FT and NAS CG benchmarks.
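
For readers unfamiliar with the gather/scatter remapping the abstract refers to, the snippet below shows the semantics with ordinary NumPy fancy indexing; the paper's operator itself is a ZPL language construct, not a Python API.

```python
# Gather and scatter semantics illustrated with NumPy.
import numpy as np

src = np.array([10.0, 20.0, 30.0, 40.0])
index_map = np.array([3, 0, 0, 2])

gathered = src[index_map]             # gather: dst[i] = src[map[i]]
scattered = np.zeros_like(src)
np.add.at(scattered, index_map, src)  # scatter-add: dst[map[i]] += src[i]
print(gathered, scattered)
```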

Dietz, Steven; Choi, S. E. (Sung-Eun); Chamberlain, B. L. (Bradford L.); Snyder, Lawrence

2003-01-01

203

Parallel Assisted Assembly of Multilayer DNA and Protein Nanoparticle Structures Using a CMOS Electronic Array  

NASA Astrophysics Data System (ADS)

A CMOS electronic microarray device was used to carry out the rapid parallel assembly of functionalized nanoparticles into multilayer structures. Electronic microarrays produce reconfigurable DC electric fields that allow DNA, proteins, and other charged molecules to be rapidly transported from the bulk solution and addressed to specifically activated sites on the array surface. Such a device was used to carry out the assisted self-assembly of DNA-, biotin- and streptavidin-derivatized fluorescent nanoparticles into multilayer structures. Nanoparticle addressing could be carried out in about 15 seconds, and forty depositions of nanoparticles were completed in less than one hour. The final multilayered 3D nanostructures were verified by scanning electron microscopy.

Heller, Michael J.; Dehlinger, Dietrich A.; Sullivan, Benjamin D.

2006-09-01

204

Layer-to-layer parallel fluidic transportation system by addressable fluidic gate arrays.  

PubMed

This paper presents addressable fluidic gate arrays for a layer-to-layer parallel fluidic transportation system. The proposed addressable fluidic gate consists of double valves driven by pneumatic pressure. One of the double valves is controlled by the row channel and the other is controlled by the column channel for row/column addressing. Our study applies addressable fluidic gate arrays to layer-to-layer transportation beyond a typical in-plane fluidic network system. The layer-to-layer transportation makes it possible to collect targeted samples from a testing well plate. 3 x 3 fluidic gate arrays based on the proposed concept are developed and tested. A single PDMS valve (diameter 400 μm) can be closed by 75.0 kPa. The demonstrated fluidic system is based on all-PDMS structures, taking into account its disposable use. This paper also reports a dome-shaped chamber for robust sealing and a switching valve with a bistable diaphragm for memory function. PMID:18818812

Morimoto, Takashi; Konishi, Satoshi

2008-09-01

205

Implementation of monitors with macros: a programming aid for the HEP and other parallel processors. Rev. 1  

SciTech Connect

In this report we give a detailed presentation of how monitors can be implemented on the HEP using a simple macro processor. We then develop the thesis that a small body of general-purpose monitors can be defined to handle most standard synchronization patterns. We include the macro packages required to implement some of the more common synchronization patterns, including the fairly complex logic discussed in a previous paper. Code produced using these macro packages is portable from one multiprocessing environment to another. Indeed, by recoding the set of basic macros (about 100 lines of code for the Denelcor HEP), most programs that we are new writing could be moved to any similar multiprocessing system.

Lusk, E.L.; Overbeek, R.A.

1984-07-01

206

A parallel hybrid merge-select sorting scheme for K-best LSD MIMO decoder on a dynamically reconfigurable processor  

E-print Network

A parallel hybrid merge-select sorting scheme for K-best LSD MIMO decoder on a dynamically detection (LSD) multi-input multi-output (MIMO) decoder based on a recently developed novel Reconfigurable and mapped onto our proposed platform. We discuss the targeted K-best LSD algorithm as well as the sorting

Arslan, Tughrul

207

Parallel pipeline networking and signal processing with field-programmable gate arrays (FPGAs) and VCSEL-MSM smart pixels  

Microsoft Academic Search

We present a networking and signal processing architecture called Transpar-TR (Translucent Smart Pixel Array-Token-Ring) that utilizes smart pixel technology to perform 2D parallel optical data transfer between digital processing nodes. Transpar-TR moves data through the network in the form of 3D packets (2D spatial and 1D time). By utilizing many spatial parallel channels, Transpar-TR can achieve high throughput, low

C. B. Kuznia; Alexander A. Sawchuk; Liping Zhang; Bogdan Hoanca; Sunkwang Hong; Chris Min; D. Pansatiankul; Z. Y. Alpaslan

2000-01-01

208

Parallel detection of ancient pathogens via array-based DNA capture.  

PubMed

DNA capture coupled with next generation sequencing is highly suitable for the study of ancient pathogens. Screening for pathogens can, however, be meticulous when assays are restricted to the enrichment of single organisms, which is common practice. Here, we report on an array-based DNA capture screening technique for the parallel detection of nearly 100 pathogens that could have potentially left behind molecular signatures in preserved ancient tissues. We demonstrate the sensitivity of our method through evaluation of its performance with a library known to harbour ancient Mycobacterium leprae DNA. This rapid and economical technique will be highly useful for the identification of historical diseases that are difficult to characterize based on archaeological information alone. PMID:25487327

Bos, Kirsten I; Jäger, Günter; Schuenemann, Verena J; Vågene, Ashild J; Spyrou, Maria A; Herbig, Alexander; Nieselt, Kay; Krause, Johannes

2015-01-19

209

New computing environments:Parallel, vector and systolic  

SciTech Connect

This book presents papers on supercomputers and array processors. Topics considered include nested dissection, the systolic level 2 BLAS, parallel processing a hydrodynamic shock wave problem, MACH-1, portable standard LISP on the Cray, distributed combinator evaluation, performance and library issues, scale problems, multiprocessor architecture, the MIDAS multiprocessor system, parallel algorithms for incompressible and compressible flows on a multiprocessor, and parallel algorithms for elliptic equations.

Wouk, A.

1986-01-01

210

Novel optical wavelength interleaver based on symmetrically parallel-coupled and apodized ring resonator arrays  

NASA Astrophysics Data System (ADS)

Optical ring resonators could be used to synthesize filters with low crosstalk and flat passbands. Their application to DWDM interleaving has been proposed and investigated previously. However, a number of important factors related to this topic have not yet been considered and appropriately addressed. In this paper, we propose a novel scheme of a symmetrically parallel-coupled ring resonator array with coupling apodisation. We show that it can be used to construct a wavelength interleaver with remarkably improved performance. Various design factors have been considered. An optimization procedure was developed based on minimizing the channel crosstalk in the through and drop ports simultaneously by adjusting the ring-bus coupling coefficients. We show that apodisation in coupling can suppress channel crosstalk effectively when the optimal coupling coefficients are chosen. We also introduced the equalization of both the input and output coupling coefficients to minimise passband ripple. For 50-100 GHz DWDM applications, four rings are found to be the best choice of array size. A four-ring filter achieves a crosstalk of -24 dB, an insertion loss at resonance of <1 dB, and good passband flatness (shape factor >0.6).

Kaalund, Christopher J.; Jin, Zhe; Li, Wei; Peng, Gang-Ding

2003-10-01

211

The TriMedia Processor  

E-print Network

... of the TriMedia-CPU64 VLIW processor with a Field-Programmable Gate Array (FPGA), and assess the potential ...

Kuzmanov, Georgi

212

12-channel parallel optical-fiber transmission using a low-drive current 1.3-μm LED array and a p-i-n PD array  

Microsoft Academic Search

Twelve-channel 14-Mb/s/channel 1-km parallel optical-fiber transmission using a 1×12 low-drive-current 1.3-μm light-emitting diode (LED) linear array and an InGaAs p-i-n photodiode linear array, with the LED drive current as low as 12 mAp-p/channel, is discussed. No receiver sensitivity degradation has been observed under simultaneous 12-channel operation. The skew was less than 6 ns after transmission through a 1-km-long 12-channel optical-fiber

Kazuhisa Kaede; Toshio Uji; Takeshi Nagahori; Tetsuyuki Suzaki; Toshitaka Torikai; Junji Hayashi; Isao Watanabe; Masataka Itoh; Hiroshi Honmou; Minoru Shikada

1990-01-01

213

FPGA-Based Coprocessor for Singular Value Array Reconciliation Tomography  

Microsoft Academic Search

We present an FPGA-based co-processor for accelerating computations associated with Singular Value Array Reconciliation Tomography (SART), a recently developed method for RF source localization. The co-processor allows this relatively complex computational task to be performed using less hardware and less power than would be required by a microprocessor-based computing cluster with comparable throughput and accuracy. The architecture exploits parallelism of

Jack Coyne; David Cyganski; R. James Duckworth

2008-01-01

214

An Evaluation of Global Address Space Languages: Co-Array Fortran and Unified Parallel C  

SciTech Connect

Co-array Fortran (CAF) and Unified Parallel C (UPC) are two emerging languages for single-program, multiple-data global address space programming. These languages boost programmer productivity by providing shared variables for communication instead of message passing. However, the performance of these emerging languages still has room for improvement. In this paper, we study the performance of variants of the NAS MG, CG, SP, and BT benchmarks on several modern cluster architectures to identify challenges that must be met to deliver top performance. We compare CAF and UPC variants of these programs with the original Fortran+MPI code. Today, CAF and UPC programs deliver scalable performance on clusters only when written to use bulk communication. However, our experiments uncovered some significant performance bottlenecks limiting UPC performance on all platforms. We account for the root causes of these performance anomalies and show that they can be remedied with additional compiler improvements, in particular we show that many of these obstacles can be resolved with adequate optimizations by the backend C compilers.

Coarfa, Cristian; Dotsenko, Yuri; Mellor-Crummey, John M.; Cantonnet, François; El-Ghazawi, Tarek; Mohanti, Ashrujit; Yao, Yiyi; Chavarría-Miranda, Daniel

2005-06-10

215

Sixteen-element Ge-on-SOI PIN photo-detector arrays for parallel optical interconnects  

NASA Astrophysics Data System (ADS)

We describe the structure and testing of a one-dimensional parallel-optics photodetector array with 16 photodiodes, each of which operates at up to 8 Gb/s. The single element is a vertical, top-illuminated, 30-μm-diameter Ge-on-silicon-on-insulator (Ge-on-SOI) PIN photodetector. A high-quality Ge absorption layer is epitaxially grown on the SOI substrate by ultra-high-vacuum chemical vapor deposition (UHV-CVD). The photodiode exhibits a good responsivity of 0.20 A/W at a wavelength of 1550 nm. The dark current is as low as 0.36 μA at a reverse bias of 1 V, and the corresponding current density is about 51 mA/cm2. The detector with a diameter of 30 μm is measured with incident light at 1.55 μm and 0.5 mW, and the 3-dB bandwidth is 7.39 GHz without bias and 13.9 GHz at a reverse bias of 3 V. The 16 devices show good consistency.

Li, Chong; Xue, Chun-Lai; Liu, Zhi; Cheng, Bu-Wen; Wang, Qi-Ming

2014-03-01

216

Atmospheric plasma jet array in parallel electric and gas flow fields for three-dimensional surface treatment  

NASA Astrophysics Data System (ADS)

This letter reports on the electrical and optical characteristics of a ten-channel atmospheric pressure glow discharge jet array in parallel electric and gas flow fields. Challenged with complex three-dimensional substrates including surgical tissue forceps and a plastic plate sloped at up to 15°, the jet array is shown to achieve excellent jet-to-jet uniformity both in time and in space. Its spatial uniformity is four times better than that of a comparable single jet when both are used to treat a 15° sloped substrate. These benefits are likely due to an effective self-adjustment mechanism among individual jets facilitated by individualized ballast and spatial redistribution of surface charges.

Cao, Z.; Walsh, J. L.; Kong, M. G.

2009-01-01

217

Parallel self-mixing imaging system based on an array of vertical-cavity surface-emitting lasers  

SciTech Connect

In this paper we investigate the feasibility of a massively parallel self-mixing imaging system based on an array of vertical-cavity surface-emitting lasers (VCSELs) to measure surface profiles of displacement, distance, velocity, and liquid flow rate. The concept of the system is demonstrated using a prototype to measure the velocity at different radial points on a rotating disk, and the velocity profile of diluted milk in a custom-built diverging-converging planar flow channel. It is envisaged that a scaled-up version of the parallel self-mixing imaging system will enable real-time surface profiling, vibrometry, and flowmetry.

Tucker, John R.; Baque, Johnathon L.; Lim, Yah Leng; Zvyagin, Andrei V.; Rakic, Aleksandar D

2007-09-01

218

Three Dimensional, Massively Parallel, Optically Interconnected Silicon Computational Hardware and Architectures for High Speed IR Scene Generation  

Microsoft Academic Search

High frame rate infrared scene generation depends on high performance digital processors that are tightly coupled to infrared emitter arrays. Massively parallel image generation hardware can realize the type of high throughput, high frame rate processing that will characterize the next generation of scene generators. This work outlines projects in massively parallel, high throughput image generation hardware

Huy H. Cat; D. Scott Wills; Nan Marie Jokerst; Martin Brooke; April Brown

219

Three-dimensional, massively parallel, optically interconnected silicon computational hardware and architectures for high-speed IR scene generation  

Microsoft Academic Search

High frame rate infrared scene generation depends on high performance digital processors that are tightly coupled to infrared emitter arrays. Massively parallel image generation hardware can realize the type of high throughput, high frame rate processing that will characterize the next generation of scene generators. This work outlines projects in massively parallel, high throughput image generation hardware using thin film

Huy H. Cat; D. Scott Wills; Nan Marie Jokerst; Martin A. Brooke; April S. Brown

1995-01-01

220

Large-scale parallel surface functionalization of goblet-type whispering gallery mode microcavity arrays for biosensing applications.  

PubMed

A novel surface functionalization technique is presented for large-scale selective molecule deposition onto whispering gallery mode microgoblet cavities. The parallel technique allows damage-free individual functionalization of the cavities, arranged on-chip in densely packed arrays. A glass slide bearing phospholipids with different functional head groups is utilized as the stamp pad. Coated microcavities are characterized and demonstrated as biosensors. PMID:24990526

Bog, Uwe; Brinkmann, Falko; Kalt, Heinz; Koos, Christian; Mappes, Timo; Hirtz, Michael; Fuchs, Harald; Köber, Sebastian

2014-10-15

221

An associative processor for air traffic control  

Microsoft Academic Search

In recent years associative memories have been receiving an increasing amount of attention. At the same time multiprocessor and parallel processing systems have been under study to solve very large problems. An associative processor is one form of a parallel processor that seems able to provide a cost effective solution to many problems such as the air traffic control (ATC)

Kenneth James Thurber

1971-01-01

222

Parallel detection of harmful algae using reverse transcription polymerase chain reaction labeling coupled with membrane-based DNA array.  

PubMed

Harmful algal blooms (HABs) are a global problem, which can cause economic losses to the aquaculture industry and pose a potential threat to human health. More attention must be paid to the development of effective detection methods for the causative microalgae. Traditional microscopic examination has many disadvantages: it is inefficient and inaccurate, requires specialized identification skills, and is especially unsuited to parallel analysis of several morphologically similar microalgae to species level at one time. This study aimed at exploring the feasibility of using a membrane-based DNA array for parallel detection of several microalgae, selecting five microalgae, including Heterosigma akashiwo, Chaetoceros debilis, Skeletonema costatum, Prorocentrum donghaiense, and Nitzschia closterium, as test species. Five species-specific (taxonomic) probes were designed from variable regions of the large subunit ribosomal DNA (LSU rDNA) by visualizing the alignment of the LSU rDNA of related species. The specificity of the probes was confirmed by dot blot hybridization. The membrane-based DNA array was prepared by spotting the tailed taxonomic probes onto a positively charged nylon membrane. Digoxigenin (Dig) labeling of target molecules was performed by multiplex PCR/RT-PCR using an RNA/DNA mixture of the five microalgae as template. The Dig-labeled amplification products were hybridized with the membrane-based DNA array to produce a visible hybridization signal indicating the presence of the target algae. A detection sensitivity comparison showed that RT-PCR labeling (RPL) coupled with hybridization was tenfold more sensitive than DNA-PCR labeling coupled with hybridization. Finally, the effectiveness of RPL coupled with the membrane-based DNA array was validated by testing with simulated and natural water samples, respectively. All of these results indicated that RPL coupled with a membrane-based DNA array is specific, simple, and sensitive for parallel detection of microalgae, and shows promise for monitoring natural samples in the future. PMID:24338073

Zhang, Chunyun; Chen, Guofu; Ma, Chaoshuai; Wang, Yuanyuan; Zhang, Baoyu; Wang, Guangce

2014-03-01

223

Benchmarks of Low-Level Vision Algorithms for DSP, FPGA, and Mobile PC Processors  

NASA Astrophysics Data System (ADS)

We present recent results of a performance benchmark of selected low-level vision algorithms implemented on different high-speed embedded platforms. The algorithms were implemented on a digital signal processor (DSP) (Texas Instruments TMS320C6414), a field-programmable gate array (FPGA) (Altera Stratix-I and II families), as well as on a mobile PC processor (Intel Mobile Core 2 Duo T7200). These implementations are evaluated, compared, and discussed in detail. The DSP and the mobile PC implementations, both making heavy use of processor-specific acceleration techniques (intrinsics and resource-optimized slicing direct memory access on the DSP, or the Intel Integrated Performance Primitives library on the mobile PC processor), outperform the FPGA implementations, but at the cost of spending all of their resources on these tasks. FPGAs, however, are very well suited to algorithms that benefit from parallel execution.

Baumgartner, Daniel; Roessler, Peter; Kubinger, Wilfried; Zinner, Christian; Ambrosch, Kristian

224

Contact printing of compositionally graded CdS(x)Se(1-x) nanowire parallel arrays for tunable photodetectors.  

PubMed

Spatially composition-graded CdS(x)Se(1-x) (x = 0-1) nanowires are grown and transferred as parallel arrays onto Si/SiO(2) substrates by a one-step, directional contact printing process. Upon subsequent device fabrication, an array of tunable-wavelength photodetectors is demonstrated. From the spectral photoconductivity measurements, the cutoff wavelength for the device array, as determined by the bandgap, is shown to cover a significant portion of the visible spectrum. The ability to transfer a collection of crystalline semiconductor nanowires while preserving the spatially graded composition may enable a wide range of applications, such as tunable lasers and photodetectors, efficient photovoltaics, and multiplexed chemical sensors. PMID:22222254

Takahashi, Toshitake; Nichols, Patricia; Takei, Kuniharu; Ford, Alexandra C; Jamshidi, Arash; Wu, Ming C; Ning, C Z; Javey, Ali

2012-02-01

225

Quadrature transmit array design using single-feed circularly polarized patch antenna for parallel transmission in MR imaging  

PubMed Central

Quadrature coils are often desired in MR applications because they can improve MR sensitivity and also reduce excitation power. In this work, we propose, for the first time, a quadrature array design strategy for parallel transmission at 298 MHz using the single-feed circularly polarized (CP) patch antenna technique. Each array element is a nearly square ring microstrip antenna and is fed at a point on the diagonal of the antenna to generate quadrature magnetic fields. Compared with conventional quadrature coils, the single-feed structure is much simpler and more compact, making the quadrature coil array design practical. Numerical simulations demonstrate that the decoupling between elements is better than -35 dB for all the elements and that the RF fields are homogeneous, with deep penetration and quadrature behavior in the area of interest. Bloch equation simulation is also performed to simulate the excitation procedure using an 8-element quadrature planar patch array to demonstrate its feasibility for parallel transmission at the ultrahigh field of 7 Tesla. PMID:24649430

Pang, Yong; Yu, Baiying; Vigneron, Daniel B.

2014-01-01

226

Quadrature transmit array design using single-feed circularly polarized patch antenna for parallel transmission in MR imaging.  

PubMed

Quadrature coils are often desired in MR applications because they can improve MR sensitivity and also reduce excitation power. In this work, we propose, for the first time, a quadrature array design strategy for parallel transmission at 298 MHz using the single-feed circularly polarized (CP) patch antenna technique. Each array element is a nearly square ring microstrip antenna and is fed at a point on the diagonal of the antenna to generate quadrature magnetic fields. Compared with conventional quadrature coils, the single-feed structure is much simpler and more compact, making the quadrature coil array design practical. Numerical simulations demonstrate that the decoupling between elements is better than -35 dB for all the elements and that the RF fields are homogeneous, with deep penetration and quadrature behavior in the area of interest. Bloch equation simulation is also performed to simulate the excitation procedure using an 8-element quadrature planar patch array to demonstrate its feasibility for parallel transmission at the ultrahigh field of 7 Tesla. PMID:24649430

Pang, Yong; Yu, Baiying; Vigneron, Daniel B; Zhang, Xiaoliang

2014-02-01

227

The Indirect Binary n-Cube Microprocessor Array  

Microsoft Academic Search

This paper explores the possibility of using a large-scale array of microprocessors as a computational facility for the execution of massive numerical computations with a high degree of parallelism. By microprocessor we mean a processor realized on one or a few semiconductor chips that include arithmetic and logical facilities and some memory. The current state of LSI technology makes this

Marshall C. Pease III

1977-01-01

228

Hardware multiplier processor  

DOEpatents

A hardware processor is disclosed which in the described embodiment is a memory mapped multiplier processor that can operate in parallel with a 16 bit microcomputer. The multiplier processor decodes the address bus to receive specific instructions so that in one access it can write and automatically perform single or double precision multiplication involving a number written to it with or without addition or subtraction with a previously stored number. It can also, on a single read command automatically round and scale a previously stored number. The multiplier processor includes two concatenated 16 bit multiplier registers, two 16 bit concatenated 16 bit multipliers, and four 16 bit product registers connected to an internal 16 bit data bus. A high level address decoder determines when the multiplier processor is being addressed and first and second low level address decoders generate control signals. In addition, certain low order address lines are used to carry uncoded control signals. First and second control circuits coupled to the decoders generate further control signals and generate a plurality of clocking pulse trains in response to the decoded and address control signals.

Pierce, Paul E. (Albuquerque, NM)

1986-01-01

229

L-shaped array-based 2-D DOA estimation using parallel factor analysis  

Microsoft Academic Search

For the two-dimensional (2-D) direction-of-arrival (DOA) estimation problem, the L-shaped array seems to have higher accuracy than other structured arrays (see [3] for details) and has received much attention. Many algorithms first estimate two electric angles separately using the two orthogonal subarrays of the L-shaped array, and then obtain the elevation and azimuth angles from these two correctly matched electric angles. However,
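
As a worked illustration of the second step mentioned above (not code from the paper), the snippet below converts a matched pair of electric-angle estimates from the two orthogonal arms into elevation and azimuth, assuming the usual direction-cosine convention u = sin(el)cos(az), v = sin(el)sin(az).

```python
# Elevation/azimuth from the two electric angles of an L-shaped array (illustrative).
import numpy as np

def angles_from_electric(u, v):
    """u = sin(el)*cos(az), v = sin(el)*sin(az) for the x- and y-arm estimates."""
    elevation = np.degrees(np.arcsin(np.hypot(u, v)))
    azimuth = np.degrees(np.arctan2(v, u))
    return elevation, azimuth

print(angles_from_electric(0.25, 0.43))   # roughly el ~ 30 deg, az ~ 60 deg
```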

Ding Liu; Junli Liang

2010-01-01

230

Parallel Solutions for Dynamic Focussing of Large Acoustic Arrays Karen P. Watkins  

E-print Network

... and Techniques of Parallel Programming, Dr. J. C. Browne, Spring 1999. Problem Statement: parallelization ... to acoustic imaging. Typically, in order to achieve an update rate similar to slow-scan television ... The beamforming function is a spatial filter that processes discrete signals from a number of sensors in order

Browne, James C.

231

A 1.0GHz single-issue 64-bit powerPC integer processor  

Microsoft Academic Search

The organization and circuit design of a 1.0 GHz integer processor built in 0.25 μm CMOS technology are presented: a microarchitecture emphasizing parallel computation with a single late select per cycle, structured control logic implemented by read-only memories and programmable logic arrays, and a delayed-reset dynamic circuit style enabling complex functions to be implemented in a few levels of logic

Joel Silberman; Naoaki Aoki; David Boerstler; Jeffrey L. Burns; Sang Dhong; Axel Essbaum; Uttam Ghoshal; David Heidel; Peter Hofstee; Kyung Tek Lee; David Meltzer; Hung Ngo; Kevin Nowka; Stephen Posluszny; Osamu Takahashi; Ivan Vo; Brian Zoric

1998-01-01

232

High-efficiency ordered silicon nano-conical-frustum array solar cells by self-powered parallel electron lithography.  

PubMed

Nanostructured silicon thin-film solar cells are promising due to strongly enhanced light trapping, high carrier collection efficiency, and potential low cost. Ordered nanostructure arrays, with large-area controllable spacing, orientation, and size, are critical for reliable light trapping and high-efficiency solar cells. Available top-down lithography approaches to fabricating large-area ordered nanostructure arrays are challenging because both high lithography resolution and high throughput are required. Here, a novel ordered silicon nano-conical-frustum array structure, exhibiting an impressive absorbance of 99% (upper bound) over wavelengths of 400-1100 nm at a thickness of only 5 μm, is realized by our recently reported technique, self-powered parallel electron lithography, which offers high throughput and sub-35-nm resolution. Moreover, high-efficiency (up to 10.8%) solar cells are demonstrated using these ordered ultrathin silicon nano-conical-frustum arrays. The related fabrication techniques can also be transferred to low-cost substrate solar energy harvesting device applications. PMID:20939564

Lu, Yuerui; Lal, Amit

2010-11-10

233

Simulation of three-dimensional laminar flow and heat transfer in an array of parallel microchannels  

E-print Network

Heat transfer and fluid flow are studied numerically for a repeating microchannel array with water as the circulating fluid. Generalized transport equations are discretized and solved in three dimensions for velocities, pressure, and temperature...

Mlcak, Justin Dale

2009-05-15

234

Comparative Analysis on the Performance of a Short String of Series-Connected and Parallel-Connected Photovoltaic Array Under Partial Shading  

NASA Astrophysics Data System (ADS)

The output power from a photovoltaic (PV) array decreases, and the array exhibits multiple peaks, when it is subjected to partial shading (PS). The power loss in the PV array varies with the array configuration, physical location and the shading pattern. This paper compares the relative performance of a PV array consisting of a short string of three PV modules for two different configurations. The mismatch loss, shading loss, fill factor and the power loss due to failure in tracking the global maximum power point are analysed for a series string with bypass diodes and a short parallel string using a MATLAB/Simulink model. The performance of the system is investigated for three different conditions of solar insolation under the same shading pattern. Results indicate that there is considerably more power loss due to shading in a series string during PS than in a parallel string with the same number of modules.
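
Since the comparison above leans on the fill factor figure of merit, here is a one-line worked example of how it is computed; the operating-point numbers are made up for illustration, not taken from the paper.

```python
# Fill factor: ratio of the maximum-power-point power to the Voc * Isc product.
def fill_factor(v_mp, i_mp, v_oc, i_sc):
    return (v_mp * i_mp) / (v_oc * i_sc)

print(fill_factor(v_mp=17.0, i_mp=7.6, v_oc=21.5, i_sc=8.2))   # ~0.73
```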

Vijayalekshmy, S.; Rama Iyer, S.; Beevi, Bisharathu

2014-07-01

235

Dynamically reconfigurable optical morphological processor and its applications  

NASA Technical Reports Server (NTRS)

An innovative optically implemented morphological processor is introduced. With the use of a large space-bandwidth-product Dammann grating and a high-speed shutter spatial light modulator, effective structuring elements of large size and arbitrary shape can be constructed with dynamic reconfigurability. This reconfigurability is a major improvement over the conventional correlator-based morphological processor, in which fixed holographic filters are used as structuring elements (Casasent and Botha, 1988). A novel two-dimensional thresholding photodetector array, capable of performing parallel thresholding and feedback, is utilized in this system and makes possible the implementation of many complex morphological operations requiring iterative feedback and full programmability. The optical architecture and the principle of operation are presented. Binary image morphological erosion, dilation, opening, and closing are demonstrated experimentally. A technique for extending the approach to gray-scale images using threshold decomposition is also discussed.
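
The erosion, dilation, opening, and closing operations demonstrated optically in this record have a simple digital counterpart. The sketch below is an illustrative Python/NumPy version, not the optical system described above; the function names are ours, the image and structuring element are assumed to be Boolean arrays, and opening and closing are built from the two primitives.

    import numpy as np

    def dilate(img, se):
        """Binary dilation of a Boolean image by a Boolean structuring element."""
        out = np.zeros_like(img)
        ph, pw = se.shape[0] // 2, se.shape[1] // 2
        padded = np.pad(img, ((ph, ph), (pw, pw)), constant_values=False)
        for di in range(se.shape[0]):
            for dj in range(se.shape[1]):
                if se[di, dj]:
                    out |= padded[di:di + img.shape[0], dj:dj + img.shape[1]]
        return out

    def erode(img, se):
        """Binary erosion: the structuring element must fit entirely inside the foreground."""
        out = np.ones_like(img)
        ph, pw = se.shape[0] // 2, se.shape[1] // 2
        padded = np.pad(img, ((ph, ph), (pw, pw)), constant_values=False)
        for di in range(se.shape[0]):
            for dj in range(se.shape[1]):
                if se[di, dj]:
                    out &= padded[di:di + img.shape[0], dj:dj + img.shape[1]]
        return out

    def opening(img, se):
        return dilate(erode(img, se), se)   # erosion followed by dilation

    def closing(img, se):
        return erode(dilate(img, se), se)   # dilation followed by erosion

For simplicity the element is applied without reflection, which is exact for the symmetric structuring elements typically used.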

Chao, Tien-Hsin

1993-01-01

236

Template-directed atomically precise self-organization of perfectly ordered parallel cerium silicide nanowire arrays on Si(110)-16×2 surfaces  

PubMed Central

The perfectly ordered parallel arrays of periodic Ce silicide nanowires can self-organize with atomic precision on single-domain Si(110)-16×2 surfaces. The growth evolution of self-ordered parallel Ce silicide nanowire arrays is investigated over a broad range of Ce coverages on single-domain Si(110)-16×2 surfaces by scanning tunneling microscopy (STM). Three different types of well-ordered parallel arrays, consisting of uniformly spaced and atomically identical Ce silicide nanowires, are self-organized through the heteroepitaxial growth of Ce silicides on a long-range grating-like 16×2 reconstruction at the deposition of various Ce coverages. Each atomically precise Ce silicide nanowire consists of a bundle of chains and rows with different atomic structures. The atomic-resolution dual-polarity STM images reveal that the interchain coupling leads to the formation of registry-aligned chain bundles within individual Ce silicide nanowires. The nanowire width and the interchain coupling can be adjusted systematically by varying the Ce coverage on a Si(110) surface. This natural template-directed self-organization of perfectly regular parallel nanowire arrays allows for the precise control of the feature size and positions within ±0.2 nm over a large area. Thus, it is a promising route to produce parallel nanowire arrays in a straightforward, low-cost, high-throughput process. PMID:24188092

2013-01-01

237

Analog Processor To Solve Optimization Problems  

NASA Technical Reports Server (NTRS)

Proposed analog processor solves "traveling-salesman" problem, considered paradigm of global-optimization problems involving routing or allocation of resources. Includes electronic neural network and auxiliary circuitry based partly on concepts described in "Neural-Network Processor Would Allocate Resources" (NPO-17781) and "Neural Network Solves 'Traveling-Salesman' Problem" (NPO-17807). Processor based on highly parallel computing solves problem in significantly less time.

Duong, Tuan A.; Eberhardt, Silvio P.; Thakoor, Anil P.

1993-01-01

238

Disposable micro-fluidic biosensor array for online parallelized cell adhesion kinetics analysis on quartz crystal resonators  

NASA Astrophysics Data System (ADS)

In this contribution we present a new disposable micro-fluidic biosensor array for the online analysis of adherent Madin Darby canine kidney (MDCK-II) cells on quartz crystal resonators (QCRs). The device was conceived for the parallel cultivation of cells providing the same experimental conditions among all the sensors of the array. As well, dedicated sensor interface electronics were developed and optimized for fast spectra acquisition of all 16 QCRs with a miniaturized impedance analyzer. This allowed performing cell cultivation experiments for the observation of fast cellular reaction kinetics with focus on the comparison of the resulting sensor signals influenced by different cell distributions on the sensor surface. To prove the assumption of equal flow circulation within the symmetric micro-channel network and support the hypothesis of identical cultivation conditions for the cells living above the sensors, the influence of fabrication tolerances on the flow regime has been simulated. As well, the shear stress on the adherent cell layer due to the flowing media was characterized. Injection molding technology was chosen for the cheap mass production of disposable devices. Furthermore, the injection molding process was simulated in order to optimize the mold geometry and minimize the shrinkage and the warpage of the parts. MDCK-II cells were cultivated in the biosensor array. Parallel cultivation of cells on the gold surface of the QCRs led to first observations of the impact of the cell distribution on the sensor signals during cell cultivation. Indeed, the initial cell distribution revealed a significant influence on the changes in the measured acoustic load on the QCRs suggesting dissimilar cell migrations as well as proliferation kinetics of a non-confluent MDCK-II cell layer.

Cama, G.; Jacobs, T.; Dimaki, M. I.; Svendsen, W. E.; Hauptmann, P.; Naumann, M.

2010-08-01

239

Supercomputing on massively parallel bit-serial architectures  

NASA Technical Reports Server (NTRS)

Research on the Goodyear Massively Parallel Processor (MPP) suggests that high-level parallel languages are practical and can be designed with powerful new semantics that allow algorithms to be efficiently mapped to the real machines. For the MPP these semantics include parallel/associative array selection for both dense and sparse matrices, variable precision arithmetic to trade accuracy for speed, micro-pipelined train broadcast, and conditional branching at the processing element (PE) control unit level. The preliminary design of a FORTRAN-like parallel language for the MPP has been completed and is being used to write programs to perform sparse matrix array selection, min/max search, matrix multiplication, Gaussian elimination on single bit arrays and other generic algorithms. A description is given of the MPP design. Features of the system and its operation are illustrated in the form of charts and diagrams.

Iobst, Ken

1985-01-01

240

Implementation and Assessment of Advanced Analog Vector-Matrix Processor  

NASA Technical Reports Server (NTRS)

This paper discusses the design and implementation of an analog optical vector-matrix coprocessor with a throughput of 128 Mops for a personal computer. Vector-matrix calculations are inherently parallel, providing a promising domain for the use of optical calculators. However, to date, digital optical systems have proven too cumbersome to replace electronics, and analog processors have not demonstrated sufficient accuracy in large scale systems. The goal of the work described in this paper is to demonstrate a viable optical coprocessor for linear operations. The analog optical processor presented has been integrated with a personal computer to provide full functionality and is the first demonstration of an optical linear algebra processor with a throughput greater than 100 Mops. The optical vector-matrix processor consists of a laser diode source, an acousto-optical modulator array to input the vector information, a liquid crystal spatial light modulator to input the matrix information, an avalanche photodiode array to read out the result vector of the vector-matrix multiplication, as well as transport optics and the electronics necessary to drive the optical modulators and interface to the computer. The intent of this research is to provide a low cost, highly energy efficient coprocessor for linear operations. Measurements of the analog accuracy of the processor performing 128 Mops are presented along with an assessment of the implications for future systems. A range of noise sources, including cross-talk, source amplitude fluctuations, shot noise at the detector, and non-linearities of the optoelectronic components, are measured and compared to determine the most significant source of error. The possibilities for reducing these sources of error are discussed. Also, the total error is compared with that expected from a statistical analysis of the individual components and their relation to the vector-matrix operation. The measured accuracy of the processor is compared with that required for a range of typical problems. Calculations resolving alloy concentrations from spectral plume data of rocket engines are implemented on the optical processor, demonstrating its sufficiency for this problem. We also show how this technology can be easily extended to a 100 × 100, 10 MHz (200 GOPS) processor.

Gary, Charles K.; Bualat, Maria G.; Lum, Henry, Jr. (Technical Monitor)

1994-01-01

241

Declustered Disk Array Architectures with Optimal and Near-Optimal Parallelism  

Microsoft Academic Search

This paper investigates the placement of data and parity on redundant disk arrays. Declustered organizations have been traditionally used to achieve fast reconstruction of a failed disk's contents. In previous work, Holland and Gibson identified six desirable properties for ideal layouts; however, no declustered layout satisfying all properties has been published in the literature. We present a complete, constructive characterization

Guillermo A. Alvarez; Walter A. Burkhard; Larry J. Stockmeyer; Flaviu Cristian

1998-01-01

242

Communication efficient parallel algorithms for nonnumerical computations  

SciTech Connect

The broad goal of this research is to develop a set of paradigms for mapping data-dependent symbolic computations onto realistic models of parallel architectures. Within this goal, the thesis represents the initial effort to achieve efficient parallel solutions for a number of non-numerical problems on networks of processors. The specific contributions of the thesis are new parallel algorithms exhibiting linear speedup on architectures consisting of fixed numbers of processors (i.e., bounded models). The following problems have been considered in the thesis: (1) Determine the minimum spanning tree (MST), and identify the bridges and articulation points (APs), of an undirected weighted graph represented by an n x n adjacency matrix. (2) The pattern matching problem: given two strings of characters, of lengths m and n (n ≥ m) respectively, mark all positions in the second string where there appears an instance of the first string. (3) Sort n elements. For each problem, the author uses a processor network consisting of p processors. The network model used in the solution of the first set of problems is the linear array, while that used in the solutions of the second and third problems is a butterfly-connected system. The solutions on the butterfly-connected system also apply to a pipelined hypercube. The performances of the solutions are summarized.

Doshi, K.A.

1988-01-01

243

A fast digital fuzzy processor  

Microsoft Academic Search

This digital fuzzy processor, designed and realized in 0.7 µm CMOS technology, demonstrates a processing time of 80 to 320 ns. A parallel pipeline architecture supports fast selection of the active fuzzy rules. Specifically, we designed an Active-Rule-Selector for selecting a subset of the fuzzy rules, called active fuzzy rules, and divided the architecture into parallel and pipeline stages. Despite some initial difficulty,

A. Gabrielli; E. Gandolfi

1999-01-01

244

Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays  

Microsoft Academic Search

We describe a novel sequencing approach that combines non-gel-based signature sequencing with in vitro cloning of millions of templates on separate 5 µm diameter microbeads. After constructing a microbead library of DNA templates by in vitro cloning, we assembled a planar array of a million template-containing microbeads in a flow cell at a density greater than 3 × 10⁶ microbeads/cm².

Maria Johnson; John Bridgham; George Golda; David H. Lloyd; Davida Johnson; Shujun Luo; Sarah McCurdy; Michael Foy; Mark Ewan; Rithy Roth; Dave George; Sam Eletr; Glenn Albrecht; Eric Vermaas; Steven R. Williams; Keith Moon; Timothy Burcham; Michael Pallas; Robert B. DuBridge; James Kirchner; Karen Fearon; Jen-i Mao; Kevin Corcoran; Sydney Brenner

2000-01-01

245

Development of parallel architectures for sensor array-processing algorithms. Semi-Annual report  

SciTech Connect

High resolution direction of arrival (DOA) estimation has been an important area of research for a number of years. Many researchers have developed a variety of algorithms to estimate the direction of arrival. Another important aspect of the DOA estimation area is the development of high speed hardware capable of computing the DOA in real time. In this research the authors have first focused on the development of parallel architectures for the multiple signal classification (MUSIC) and estimation of signal parameters via rotational invariance techniques (ESPRIT) algorithms for narrowband sources. These algorithms are substituted with computationally efficient modules and converted to pipelined and parallel algorithms. For example, one important computation, the eigendecomposition of the covariance matrix, has been performed using Householder transformations and the QR method.
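
For readers unfamiliar with the MUSIC computation being parallelized here, the following NumPy sketch shows a serial reference version under the usual narrowband, uniform-linear-array assumptions (element spacing d in wavelengths). It illustrates the algorithm only, not the pipelined hardware architecture of the report, and the function name and parameters are ours.

    import numpy as np

    def music_spectrum(X, n_sources, d=0.5, angles=np.linspace(-90, 90, 361)):
        """MUSIC pseudospectrum for an M-element uniform linear array.
        X: (M, N) array of N snapshots; returns (angles in degrees, pseudospectrum)."""
        M, N = X.shape
        R = (X @ X.conj().T) / N                  # sample covariance matrix
        eigvals, eigvecs = np.linalg.eigh(R)      # eigenvalues in ascending order
        En = eigvecs[:, :M - n_sources]           # noise subspace
        m = np.arange(M)
        spectrum = []
        for theta in np.deg2rad(angles):
            a = np.exp(2j * np.pi * d * m * np.sin(theta))     # steering vector
            spectrum.append(1.0 / np.real(a.conj() @ En @ En.conj().T @ a))
        return angles, np.array(spectrum)

The eigendecomposition of R is the step the report maps onto Householder transformations and the QR method; the pseudospectrum scan over candidate angles is embarrassingly parallel.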

Jamali, M.M.; Kwatra, S.C.; Djoudi, A.; Sheelvant, R.; Rao, M.

1991-08-01

246

Field Programmable Gate Array Based Parallel Strapdown Algorithm Design for Strapdown Inertial Navigation Systems  

PubMed Central

A new generalized optimum strapdown algorithm with coning and sculling compensation is presented, in which the position, velocity and attitude updating operations are carried out based on a single-speed structure in which all computations are executed at a single updating rate that is sufficiently high to accurately account for high-frequency angular rate and acceleration rectification effects. Unlike existing algorithms, the updating rates of the coning and sculling compensations are unrelated to the number of gyro incremental angle samples and the number of accelerometer incremental velocity samples. When the output sampling rate of the inertial sensors remains constant, this algorithm allows the updating rate of the coning and sculling compensation to be increased while using more gyro incremental angle samples and accelerometer incremental velocity samples, in order to improve the accuracy of the system. Then, in order to implement the new strapdown algorithm in a single FPGA chip, the parallelization of the algorithm is designed and its computational complexity is analyzed. The performance of the proposed parallel strapdown algorithm is tested on the Xilinx ISE 12.3 software platform and the FPGA device XC6VLX550T hardware platform on the basis of some fighter data. It is shown that this parallel strapdown algorithm on the FPGA platform can greatly decrease the execution time of the algorithm, meeting the real-time and high-precision requirements of the system in a highly dynamic environment, relative to an existing implementation on a DSP platform. PMID:22164058

Li, Zong-Tao; Wu, Tie-Jun; Lin, Can-Long; Ma, Long-Hua

2011-01-01

247

Arrays  

NSDL National Science Digital Library

This interactive Flash applet helps students develop the concept of equal groups as a foundation for multiplication and division. The applet displays an array of dots, some of which are covered by a card. Students use the visible number of rows and columns to determine the total number of dots. Clicking on the card reveals the full array, and a voice announces the total.

2011-01-01

248

Massively parallel information processing systems for space applications  

NASA Technical Reports Server (NTRS)

NASA is developing massively parallel systems for ultra high speed processing of digital image data collected by satellite borne instrumentation. Such systems contain thousands of processing elements. Work is underway on the design and fabrication of the 'Massively Parallel Processor', a ground computer containing 16,384 processing elements arranged in a 128 x 128 array. This computer uses existing technology. Advanced work includes the development of semiconductor chips containing thousands of feedthrough paths. Massively parallel image analog to digital conversion technology is also being developed. The goal is to provide compact computers suitable for real-time onboard processing of images.

Schaefer, D. H.

1979-01-01

249

Parallel array of nanochannels grafted with polymer-brushes-stabilized Au nanoparticles for flow-through catalysis.  

PubMed

Smart systems on the nanometer scale for continuous flow-through reaction present fascinating advantages in heterogeneous catalysis, in which a parallel array of straight nanochannels offers a platform with high surface area for assembling and stabilizing metallic nanoparticles working as catalysts. Herein we demonstrate a method for finely modifying nanoporous anodic aluminum oxide (AAO) and for the further integration of nanoreactors. By using atom transfer radical polymerization (ATRP), polymer brushes were successfully grafted on the inner walls of the nanochannels of the AAO membrane, followed by exchange of counter ions with a precursor for nanoparticles (NPs), and the membrane was used as the template for deposition of well-defined Au NPs. The membrane was then used as a functional nanochannel system for novel flow-through catalysis. High catalytic performance and instantaneous separation of products from the reaction system were achieved in the reduction of 4-nitrophenol. PMID:24129356

Liu, Jianxi; Ma, Shuanhong; Wei, Qiangbing; Jia, Lei; Yu, Bo; Wang, Daoai; Zhou, Feng

2013-12-01

250

Interactive animation of fault-tolerant parallel algorithms  

SciTech Connect

Animation of algorithms makes understanding them intuitively easier. This paper describes the software tool Raft (Robust Animator of Fault Tolerant Algorithms). The Raft system allows the user to animate a number of parallel algorithms which achieve fault tolerant execution. In particular, we use it to illustrate the key Write-All problem. It has an extensive user-interface which allows a choice of the number of processors, the number of elements in the Write-All array, and the adversary to control the processor failures. The novelty of the system is that the interface allows the user to create new on-line adversaries as the algorithm executes.

Apgar, S.W.

1992-02-01

251

Parallel rendering techniques for massively parallel visualization  

SciTech Connect

As the resolution of simulation models increases, scientific visualization algorithms which take advantage of the large memory and parallelism of Massively Parallel Processors (MPPs) are becoming increasingly important. For large applications, rendering on the MPP tends to be preferable to rendering on a graphics workstation due to the MPP's abundant resources: memory, disk, and numerous processors. The challenge becomes developing algorithms that can exploit these resources while minimizing overhead, typically communication costs. This paper describes recent efforts in parallel rendering for polygonal primitives as well as parallel volumetric techniques. It presents rendering algorithms, developed for massively parallel processors (MPPs), for polygonal, sphere, and volumetric data. The polygon algorithm uses a data parallel approach whereas the sphere and volume renderers use a MIMD approach. Implementations for these algorithms are presented for the Thinking Machines Corporation CM-5 MPP.

Hansen, C.; Krogh, M.; Painter, J.

1995-07-01

252

An implementation of a real-time and parallel processing ECG features extraction algorithm in a Field Programmable Gate Array (FPGA)  

Microsoft Academic Search

The objective of the paper is to report the development of a real-time, parallel processing algorithm and its implementation in a Field Programmable Gate Array (FPGA) for electrocardiogram (ECG) feature extraction. The prototyped system extracts the ECG features and is tested as a System on Chip (SoC) design. The performance of the algorithm was tested against a MATLAB routine

Weichih Hu; Chun Cheng Lin; Liang Yu Shyu

2011-01-01

253

Magnetic arrays  

DOEpatents

Electromagnet arrays are disclosed which can provide selected field patterns in either two or three dimensions, and in particular, which can provide single-sided field patterns in two or three dimensions. These features are achieved by providing arrays which have current densities that vary in the windings both parallel to the array and in the direction of array thickness. 12 figs.

Trumper, D.L.; Kim, W.; Williams, M.E.

1997-05-20

254

Magnetic arrays  

SciTech Connect

Electromagnet arrays which can provide selected field patterns in either two or three dimensions, and in particular, which can provide single-sided field patterns in two or three dimensions. These features are achieved by providing arrays which have current densities that vary in the windings both parallel to the array and in the direction of array thickness.

Trumper, David L. (Plaistow, NH); Kim, Won-jong (Cambridge, MA); Williams, Mark E. (Pelham, NH)

1997-05-20

255

Upset Characterization of the PowerPC405 Hard-core Processor Embedded in Virtex-II Pro Field Programmable Gate Arrays  

NASA Technical Reports Server (NTRS)

Shown in this presentation are recent results for the upset susceptibility of the various types of memory elements in the embedded PowerPC405 in the Xilinx V2P40 FPGA. For critical flight designs where configuration upsets are mitigated effectively through appropriate design triplication and configuration scrubbing, these upsets of processor elements can dominate the system error rate. Data from irradiations with both protons and heavy ions are given and compared using available models.

Swift, Gary M.; Allen, Gregory S.; Farmanesh, Farhad; George, Jeffrey; Petrick, David J.; Chayab, Fayez

2006-01-01

256

Parallel recognition of cancer cells using an addressable array of solid-state micropores.  

PubMed

Early stage detection and precise quantification of circulating tumor cells (CTCs) in the peripheral blood of cancer patients are important for early diagnosis. Early diagnosis improves the effectiveness of the therapy and results in better prognosis. Several techniques have been used for CTC detection but are limited by their need for dye tagging, low throughput, and lack of statistical reliability at the single cell level. Solid-state micropores can characterize each cell in a sample, providing interesting information about cellular populations. We report a multi-channel device which utilizes a solid-state micropore array assembly for simultaneous measurement of cell translocation. This increased the throughput of measurement, and as the cells passed through the micropores, tumor cells showed distinctive current blockade pulses compared to leukocytes. The ionic current across each micropore channel was continuously monitored and recorded. The measurement system not only increased throughput but also provided on-chip cross-correlation. The whole blood was lysed to remove red blood cells, so blood dilution was not needed. The approach facilitated faster processing of blood samples with a tumor cell detection efficiency of about 70%. The design provided a simple and inexpensive method for rapid and reliable detection of tumor cells without any cell staining or surface functionalization. The device can also be used for high throughput electrophysiological analysis of other cell types. PMID:25038540

Ilyas, Azhar; Asghar, Waseem; Kim, Young-tae; Iqbal, Samir M

2014-12-15

257

Parallel asynchronous systems and image processing algorithms  

NASA Technical Reports Server (NTRS)

A new hardware approach to implementation of image processing algorithms is described. The approach is based on silicon devices which would permit an independent analog processing channel to be dedicated to every pixel. A laminar architecture consisting of a stack of planar arrays of the device would form a two-dimensional array processor with a 2-D array of inputs located directly behind a focal plane detector array. A 2-D image data stream would propagate in neuronlike asynchronous pulse coded form through the laminar processor. Such systems would integrate image acquisition and image processing. Acquisition and processing would be performed concurrently as in natural vision systems. The research is aimed at implementation of algorithms, such as the intensity dependent summation algorithm and pyramid processing structures, which are motivated by the operation of natural vision systems. Implementation of natural vision algorithms would benefit from the use of neuronlike information coding and the laminar, 2-D parallel, vision system type architecture. Besides providing a neural network framework for implementation of natural vision algorithms, a 2-D parallel approach could eliminate the serial bottleneck of conventional processing systems. Conversion to serial format would occur only after raw intensity data has been substantially processed. An interesting challenge arises from the fact that the mathematical formulation of natural vision algorithms does not specify the means of implementation, so that hardware implementation poses intriguing questions involving vision science.

Coon, D. D.; Perera, A. G. U.

1989-01-01

258

Parallel Mandelbrot Set Model  

NSDL National Science Digital Library

The Parallel Mandelbrot Set Model is a parallelization of the sequential MandelbrotSet model, which does all the computations on a single processor core. This parallelization is able to use a computer with more than one core (or processor) to carry out the same computation, thus speeding up the process. The parallelization is done using the model elements in the Parallel Java group. These model elements allow easy use of the Parallel Java library created by Alan Kaminsky. In particular, the parallelization used for this model is based on code in Chapters 11 and 12 of Kaminsky's book Building Parallel Java. The Parallel Mandelbrot Set Model was developed using the Easy Java Simulations (EJS) modeling tool. It is distributed as a ready-to-run (compiled) Java archive. Double click the ejs_chaos_ParallelMandelbrotSet.jar file to run the program if Java is installed.
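
The Parallel Java source itself is not reproduced in this record; as a rough illustration of the same idea, the sketch below partitions the image by rows and farms the rows out to worker processes, using Python's multiprocessing module rather than Kaminsky's Parallel Java library. The function names and image bounds are ours.

    from multiprocessing import Pool
    import numpy as np

    def mandelbrot_row(args):
        """Iteration counts for one image row, an independent unit of work."""
        y, width, max_iter = args
        row = np.empty(width, dtype=np.int32)
        for i in range(width):
            c = complex(-2.0 + 3.0 * i / width, -1.5 + 3.0 * y / width)
            z, n = 0j, 0
            while abs(z) <= 2.0 and n < max_iter:
                z = z * z + c
                n += 1
            row[i] = n
        return y, row

    def mandelbrot(width=400, max_iter=256, workers=4):
        image = np.empty((width, width), dtype=np.int32)
        with Pool(workers) as pool:
            jobs = ((y, width, max_iter) for y in range(width))
            for y, row in pool.imap_unordered(mandelbrot_row, jobs):
                image[y] = row
        return image

    if __name__ == "__main__":
        img = mandelbrot()   # each worker computes whole rows; results are merged by row index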

Franciscouembre

2011-11-24

259

Standard Templates Adaptive Parallel Library  

E-print Network

STAPL (Standard Templates Adaptive Parallel Library) is a parallel C++ library designed as a superset of the C++ Standard Template Library (STL), sequentially consistent for functions with the same name, and executed on uni- or multi-processor...

Arzu, Francisco Jose

2012-06-07

260

A PARALLEL ANALOG-DIGITAL PIN PHOTODIODE PROCESSOR CHIP FOR IMAGE PRE-PROCESSING WITH OPTICAL CHIP-TO-CHIP INTERCONNECTS  

Microsoft Academic Search

A smart detector chip consisting of an array of smart pixel processing elements (PEs) realised in 0.6 µm CMOS technology is presented. The area for one PE is only 250×250 µm², in which a pin photodiode, analogue/digital conversion and programmable digital logic are integrated. Simulation results show a data rate of up to 625 Mbit/s for one PE, resulting in a

Lutz Hoppe; Andreas Loos; Michael Förtsch; Dietmar Fey; Horst Zimmermann

261

Highly Parallel Computing Architectures by using Arrays of Quantum-dot Cellular Automata (QCA): Opportunities, Challenges, and Recent Results  

NASA Technical Reports Server (NTRS)

There has been significant improvement in the performance of VLSI devices, in terms of size, power consumption, and speed, in recent years, and this trend may continue for the near future. However, it is a well known fact that there are major obstacles, i.e., the physical limits of feature size reduction and the ever increasing cost of foundries, that would prevent the long term continuation of this trend. This has motivated the exploration of some fundamentally new technologies that are not dependent on the conventional feature size approach. Such technologies are expected to enable scaling to continue to the ultimate level, i.e., molecular and atomistic size. Quantum computing, quantum dot-based computing, DNA based computing, biologically inspired computing, etc., are examples of such new technologies. In particular, quantum-dot-based computing using Quantum-dot Cellular Automata (QCA) has recently been intensely investigated as a promising new technology capable of offering significant improvement over conventional VLSI in terms of reduction of feature size (and hence increase in integration level), reduction of power consumption, and increase of switching speed. Quantum dot-based computing and memory in general, and QCA specifically, are intriguing to NASA due to their high packing density (10^11 to 10^12 per square cm), low power consumption (no transfer of current), and potentially higher radiation tolerance. Under the Revolutionary Computing Technology (RTC) Program at the NASA/JPL Center for Integrated Space Microelectronics (CISM), we have been investigating the potential applications of QCA for the space program. To this end, exploiting the intrinsic features of QCA, we have designed novel QCA-based circuits for co-planar (i.e., single-layer) and compact implementation of a class of data permutation matrices, a class of interconnection networks, and a bit-serial processor. Building upon these circuits, we have developed novel algorithms and QCA-based architectures for highly parallel and systolic computation of signal/image processing applications, such as the FFT and the Wavelet and Walsh-Hadamard Transforms.

Fijany, Amir; Toomarian, Benny N.

2000-01-01

262

Fault-tolerant computer architecture based on INMOS transputer processor  

NASA Technical Reports Server (NTRS)

Redundant processing has been used for several years in mission flight systems. In these systems, more than one processor performs the same task at the same time, but only one processor is actually in real use. A fault-tolerant computer architecture based on the features provided by INMOS Transputers is presented. The Transputer architecture provides several communication links that allow data and command communication with other Transputers without the use of a bus. Additionally, the Transputer allows the use of parallel processing to increase the system speed considerably. The processor architecture consists of three processors working in parallel, keeping all the processors at the same operational level, but with only one processor in real control of the process. The design allows each Transputer to perform a test of the other two Transputers and report the operating condition of the neighboring processors. A graphic display was developed to facilitate the identification of any problem by the user.
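
As a minimal sketch of the majority-voting idea behind such triple-redundant designs (illustrative Python, not the Transputer code of the report; the function name is ours):

    def tmr_vote(a, b, c):
        """Return the value produced by at least two of the three redundant processors;
        a single disagreeing processor is masked, total disagreement is reported."""
        if a == b or a == c:
            return a
        if b == c:
            return b
        raise RuntimeError("no majority: all three redundant results differ")

    # Example: one faulty channel is out-voted by the two healthy ones.
    assert tmr_vote(42, 42, 17) == 42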

Ortiz, Jorge L.

1987-01-01

263

Opto-electronic morphological processor  

NASA Technical Reports Server (NTRS)

The opto-electronic morphological processor of the present invention is capable of receiving optical inputs and emitting optical outputs. The use of optics allows implementation of parallel input/output, thereby overcoming a major bottleneck in prior art image processing systems. The processor consists of three components, namely, detectors, morphological operators and modulators. The detectors and operators are fabricated on a silicon VLSI chip and implement the optical input and morphological operations. A layer of ferro-electric liquid crystals is integrated with a silicon chip to provide the optical modulation. The implementation of the image processing operators in electronics leads to a wide range of applications and the use of optical connections allows cascadability of these parallel opto-electronic image processing components and high speed operation. Such an opto-electronic morphological processor may be used as the pre-processing stage in an image recognition system. In one example disclosed herein, the optical input/optical output morphological processor of the invention is interfaced with a binary phase-only correlator to produce an image recognition system.

Yu, Jeffrey W. (Inventor); Chao, Tien-Hsin (Inventor); Cheng, Li J. (Inventor); Psaltis, Demetri (Inventor)

1993-01-01

264

Parallel architectures for iterative methods on adaptive, block structured grids  

NASA Technical Reports Server (NTRS)

A parallel computer architecture well suited to the solution of partial differential equations in complicated geometries is proposed. Algorithms for partial differential equations contain a great deal of parallelism. But this parallelism can be difficult to exploit, particularly on complex problems. One approach to extraction of this parallelism is the use of special purpose architectures tuned to a given problem class. The architecture proposed here is tuned to boundary value problems on complex domains. An adaptive elliptic algorithm which maps effectively onto the proposed architecture is considered in detail. Two levels of parallelism are exploited by the proposed architecture. First, by making use of the freedom one has in grid generation, one can construct grids which are locally regular, permitting a one-to-one mapping of grids to systolic-style processor arrays, at least over small regions. All local parallelism can be extracted by this approach. Second, though there may be no regular global structure to the grids constructed, there will still be parallelism at this level. One approach to finding and exploiting this parallelism is to use an architecture having a number of processor clusters connected by a switching network. The use of such a network creates a highly flexible architecture which automatically configures to the problem being solved.

Gannon, D.; Vanrosendale, J.

1983-01-01

265

Parallel Optimisation  

NSDL National Science Digital Library

An introduction to optimisation techniques that may improve parallel performance and scaling on HECToR. It assumes that the reader has some experience of parallel programming, including basic MPI and OpenMP. Scaling is a measure of the ability of a parallel code to use increasing numbers of cores efficiently. A scalable application is one that, when the number of processors is increased, performs better by a factor which justifies the additional resource employed. Making a parallel application scale to many thousands of processes requires careful attention not only to communication, data and work distribution, but also to the choice of algorithms. Since the choice of algorithm is too broad a subject, and too particular to the application domain, to include in this brief guide, we concentrate on general good practice for parallel optimisation on HECToR.
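
The notion of scaling used here is conventionally quantified by parallel speedup and efficiency; in the usual notation (ours, not taken verbatim from the guide), with T(1) the runtime on one core and T(p) the runtime on p cores,

$$ S(p) = \frac{T(1)}{T(p)}, \qquad E(p) = \frac{S(p)}{p}, $$

and an application scales well when E(p) stays close to 1 as p grows, i.e. when the speedup justifies the additional resource employed.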

266

Speculative parallelization of partially parallel loops  

E-print Network

, and applied a fully parallel data dependence test to determine if it had any cross-processor dependences. If the test failed, then the loop was re-executed serially. While this method exploits doall parallelism well, it can cause slowdowns for loops...
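
The run-in-parallel-then-validate pattern described here can be sketched as follows. This is an illustrative Python fragment with a deliberately conservative dependence test, loosely in the spirit of the LRPD test rather than the thesis's actual implementation; the array A, the index vectors reads/writes, and the function f are invented for the example, and f is assumed to be elementwise (a NumPy ufunc or similar).

    import numpy as np

    def speculative_doall(A, reads, writes, f):
        """Speculatively run the loop  for i: A[writes[i]] = f(A[reads[i]])  as a doall,
        then validate; roll back and re-execute serially if a dependence may exist."""
        checkpoint = A.copy()
        read_mark = np.zeros(A.size, dtype=bool)
        write_mark = np.zeros(A.size, dtype=bool)
        # Speculative "parallel" phase: every iteration sees only checkpointed values.
        A[writes] = f(checkpoint[reads])
        read_mark[reads] = True
        write_mark[writes] = True
        # Conservative test: an element both read and written, or written twice,
        # may carry a cross-iteration dependence, so the speculation is not trusted.
        unsafe = np.any(read_mark & write_mark) or len(np.unique(writes)) != len(writes)
        if unsafe:
            A[:] = checkpoint                  # roll back
            for r, w in zip(reads, writes):    # serial re-execution
                A[w] = f(A[r])
        return A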

Dang, Francis Hoai Dinh

2009-05-15

267

Proceedings of the 1983 international conference on parallel processing  

SciTech Connect

The following topics were dealt with: the performance of existing supercomputers on computationally intensive tasks; multistage networks; numerical algorithms; network connection capabilities; special purpose systems; node-to-node networks; nonnumerical algorithms; tree structured systems; parallel programming and languages; images and speech; expressing parallelism; database machines and signal processing; data flow; simulation and operating systems; models; scheduling resources; system performance; VLSI processor arrays; computer architectures; associative processing and distributed systems; multiprocessor systems; and pipelining. 97 papers were presented, all of which are published in full in the present proceedings. Abstracts of individual papers can be found under the relevant classification codes in this or other issues.

Siegel, H.J.; Siegel, L.

1983-01-01

268

MULTI-CORE AND OPTICAL PROCESSOR RELATED APPLICATIONS RESEARCH AT OAK RIDGE NATIONAL LABORATORY  

SciTech Connect

High-speed parallelization of common tasks holds great promise as a low-risk approach to achieving the significant increases in signal processing and computational performance required for next generation innovations in reconfigurable radio systems. Researchers at the Oak Ridge National Laboratory have been working on exploiting the parallelization offered by this emerging technology and applying it to a variety of problems. This paper will highlight recent experience with four different parallel processors applied to signal processing tasks that are directly relevant to signal processing required for SDR/CR waveforms. The first is the EnLight Optical Core Processor applied to matched filter (MF) correlation processing via fast Fourier transform (FFT) of broadband Doppler-sensitive waveforms (DSW) using active sonar arrays for target tracking. The second is the IBM CELL Broadband Engine applied to a 2-D discrete Fourier transform (DFT) kernel for image processing and frequency domain processing. The third is the NVIDIA graphical processor applied to document feature clustering. EnLight Optical Core Processor. Optical processing is inherently capable of high parallelism that can be translated to very high performance, low power dissipation computing. The EnLight 256 is a small form factor signal processing chip (5×5 cm²) with a digital optical core that is being developed by an Israeli startup company. As part of its evaluation of foreign technology, ORNL's Center for Engineering Science Advanced Research (CESAR) had access to precursor EnLight 64 Alpha hardware for a preliminary assessment of capabilities in terms of large Fourier transforms for matched filter banks and on applications related to Doppler-sensitive waveforms. This processor is optimized for array operations, which it performs in fixed-point arithmetic at the rate of 16 TeraOPS at 8-bit precision. This is approximately 1000 times faster than the fastest DSP available today. The optical core performs the matrix-vector multiplications, where the nominal matrix size is 256×256. The system clock is 125 MHz. At each clock cycle, 128K multiply-and-add operations are carried out, which yields a peak performance of 16 TeraOPS. IBM Cell Broadband Engine. The Cell processor is the result of five years of sustained, intensive R&D collaboration (involving over $400M of investment) between IBM, Sony, and Toshiba. Its architecture comprises one multithreaded 64-bit PowerPC processor element (PPE) with VMX capabilities and two levels of globally coherent cache, and 8 synergistic processor elements (SPEs). Each SPE consists of a processor (SPU) designed for streaming workloads, local memory, and a globally coherent direct memory access (DMA) engine. Computations are performed on 128-bit-wide single-instruction, multiple-data (SIMD) streams. An integrated high-bandwidth element interconnect bus (EIB) connects the nine processors and their ports to external memory and to system I/O. The Applied Software Engineering Research (ASER) Group at ORNL is applying the Cell to a variety of text and image analysis applications. Research on Cell-equipped PlayStation3 (PS3) consoles has led to the development of a correlation-based image recognition engine that enables a single PS3 to process images at more than 10X the speed of state-of-the-art single-core processors. NVIDIA Graphics Processing Units.
The ASER group is also employing the latest NVIDIA graphical processing units (GPUs) to accelerate clustering of thousands of text documents using recently developed clustering algorithms such as document flocking and affinity propagation.
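
Assuming one 256 × 256 matrix-vector product per cycle and counting each multiply-and-add as two operations (a reading consistent with the figures quoted above, though the counting convention is ours), the quoted peak rate follows directly:

$$ 125\,\mathrm{MHz} \times (256 \times 256) \times 2 \approx 1.6 \times 10^{13}\ \mathrm{ops/s} \approx 16\ \mathrm{TeraOPS}. $$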

Barhen, Jacob [ORNL; Kerekes, Ryan A [ORNL; ST Charles, Jesse Lee [ORNL; Buckner, Mark A [ORNL

2008-01-01

269

Final Report, Center for Programming Models for Scalable Parallel Computing: Co-Array Fortran, Grant Number DE-FC02-01ER25505  

SciTech Connect

The major accomplishment of this project is the production of CafLib, an 'object-oriented' parallel numerical library written in Co-Array Fortran. CafLib contains distributed objects such as block vectors and block matrices along with procedures, attached to each object, that perform basic linear algebra operations such as matrix multiplication, matrix transpose and LU decomposition. It also contains constructors and destructors for each object that hide the details of data decomposition from the programmer, and it contains collective operations that allow the programmer to calculate global reductions, such as global sums, global minima and global maxima, as well as vector and matrix norms of several kinds. CafLib is designed to be extensible in such a way that programmers can define distributed grid and field objects, based on vector and matrix objects from the library, for finite difference algorithms to solve partial differential equations. A very important extra benefit that resulted from the project is the inclusion of the co-array programming model in the next Fortran standard, called Fortran 2008. It is the first parallel programming model ever included as a standard part of the language. Co-arrays will be a supported feature in all Fortran compilers, and the portability provided by standardization will encourage a large number of programmers to adopt it for new parallel application development. The combination of object-oriented programming in Fortran 2003 with co-arrays in Fortran 2008 provides a very powerful programming model for high-performance scientific computing. Additional benefits from the project, beyond the original goal, include a program to provide access to the co-array model through the Cray compiler as a resource for teaching and research. Several academics, for the first time, included the co-array model as a topic in their courses on parallel computing. A separate collaborative project with LANL and PNNL showed how to extend the co-array model to other languages in a small experimental version of Co-array Python. Another collaborative project defined a Fortran 95 interface to ARMCI to encourage Fortran programmers to use the one-sided communication model in anticipation of their conversion to the co-array model later. A collaborative project with the Earth Sciences community at NASA Goddard and GFDL experimented with the co-array model within computational kernels related to their climate models, first using CafLib and then extending the co-array model to use design patterns. Future work will build on the design-pattern idea with a redesign of CafLib as a true object-oriented library using Fortran 2003 and as a parallel numerical library using Fortran 2008.

Robert W. Numrich

2008-04-22

270

Periodic parallel array of nanopillars and nanoholes resulting from colloidal stripes patterned by geometrically confined evaporative self-assembly for unique anisotropic wetting.  

PubMed

In this paper we present an economical process to create anisotropic microtextures based on periodic parallel stripes of monolayer silica nanoparticles (NPs) patterned by geometrically confined evaporative self-assembly (GCESA). In the GCESA process, a straight meniscus of a colloidal dispersion is initially formed in an open enclosure, which is composed of two parallel plates bounded by a U-shaped spacer sidewall on three sides with an evaporating outlet on the fourth side. Lateral evaporation of the colloidal dispersion leads to periodic "stick-slip" receding of the meniscus (evaporative front), as triggered by the "coffee-ring" effect, promoting the assembly of silica NPs into periodic parallel stripes. The morphology of the stripes can be well controlled by tailoring process variables such as substrate wettability, NP concentration, temperature, and gap height. Furthermore, arrayed patterns of nanopillars or nanoholes are generated on a silicon wafer using the as-prepared colloidal stripes as an etching mask or template. Such arrayed patterns can reveal unique anisotropic wetting properties, which show a large contact angle hysteresis when viewed from both the parallel and perpendicular directions, in addition to a large wetting anisotropy. PMID:25353399

Li, Xiangmeng; Wang, Chunhui; Shao, Jinyou; Ding, Yucheng; Tian, Hongmiao; Li, Xiangming; Wang, Li

2014-11-26

271

Efficient design space exploration of high performance embedded out-of-order processors  

Microsoft Academic Search

Previous work on efficient customized processor design primarily focused on in-order architectures. However, with the recent introduction of out-of-order processors for high-end high-performance embedded applications, researchers and designers need to address how to automate the design process of customized out-of-order processors. Because of the parallel execution of independent instructions in out-of-order processors, in-order processor design methodologies which

Stijn Eyerman; Lieven Eeckhout; Koen De Bosschere

2006-01-01

272

Space-efficient scheduling of nested parallelism  

Microsoft Academic Search

Many of today's high-level parallel languages support dynamic, fine-grained parallelism. These languages allow the user to expose all the parallelism in the program, which is typically of a much higher degree than the number of processors. Hence an efficient scheduling algorithm is required to assign computations to processors at runtime. Besides having low overheads and good load balancing, it is

Girija J. Narlikar; Guy E. Blelloch

1999-01-01

273

Spatio-temporal operator formalism for holographic recording and diffraction in a photorefractive-based true-time-delay phased-array processor.  

PubMed

We present a spatio-temporal operator formalism and beam propagation simulations that describe the broadband efficient adaptive method for a true-time-delay array processing (BEAMTAP) algorithm for an optical beamformer by use of a photorefractive crystal. The optical system consists of a tapped-delay line implemented with an acoustooptic Bragg cell, an accumulating scrolling time-delay detector achieved with a traveling-fringes detector, and a photorefractive crystal to store the adaptive spatio-temporal weights as volume holographic gratings. In this analysis, linear shift-invariant integral operators are used to describe the propagation, interference, grating accumulation, and volume holographic diffraction of the spatio-temporally modulated optical fields in the system to compute the adaptive array processing operation. In addition, it is shown that the random fluctuation in time and phase delays of the optically modulated and transmitted array signals produced by fiber perturbations (temperature fluctuations, vibrations, or bending) are dynamically compensated for through the process of holographic wavefront reconstruction as a byproduct of the adaptive beam-forming and jammer-excision operation. The complexity of the cascaded spatial-temporal integrals describing the holographic formation, and subsequent readout processes, is shown to collapse to a simple imaging condition through standard operator manipulation. We also present spatio-temporal beam propagation simulation results as an illustrative demonstration of our analysis and the operation of a BEAMTAP beamformer. PMID:14503701

Kiruluta, Andrew; Pati, Gour S; Kriehn, Gregory; Silveira, Paulo E X; Sarto, Anthony W; Wagner, Kelvin

2003-09-10

274

Parallelization Strategies for Network Interface Firmware  

Microsoft Academic Search

Typical data-intensive embedded applications have large amounts of instruction-level parallelism that is often exploited with wide-issue VLIW processors. In contrast, event-driven embedded applications are believed to have very little instruction-level parallelism, so these applications often utilize much simpler processor cores. Programmable network interface cards, for example, utilize thread-level parallelism across multiple processor cores to handle multiple

Michael Brogioli; Paul Willmann; Scott Rixner

275

Parallel I/O Systems  

NSDL National Science Digital Library

* Redundant disk array architectures
* Fault tolerance issues in parallel I/O systems
* Caching and prefetching
* Parallel file systems
* Parallel I/O systems
* Parallel I/O programming paradigms
* Parallel I/O applications and environments
* Parallel programming with parallel I/O

Amy Apon

276

The Processor (Chapter 4)  

E-print Network

Control transfer: beq, j (§4.1 Introduction). Instruction execution: the PC addresses instruction memory to fetch the instruction; register numbers select registers to read from the register file; depending on the instruction class, data memory is accessed for load/store; the PC is then updated to the target address or to PC + 4.

Harcourt, Ed

277

Optimal processor assignment for pipeline computations  

NASA Technical Reports Server (NTRS)

The availability of large-scale multitasked parallel architectures introduces the following processor assignment problem for pipelined computations. Given a set of tasks and their precedence constraints, along with their experimentally determined individual response times for different processor sizes, find an assignment of processors to tasks. Two objectives are of interest: minimal response time given a throughput requirement, and maximal throughput given a response time requirement. These assignment problems differ considerably from the classical mapping problem, in which several tasks share a processor; instead, it is assumed that a large number of processors are to be assigned to a relatively small number of tasks. Efficient assignment algorithms were developed for different classes of task structures. For a p-processor system and a series-parallel precedence graph with n constituent tasks, an O(np²) algorithm is provided that finds the optimal assignment for the response time optimization problem; the assignment optimizing the constrained throughput is found in O(np² log p) time. Special cases of linear, independent, and tree graphs are also considered.
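
To give the flavor of the dynamic programming involved, the sketch below treats the special case of a linear pipeline (illustrative Python; the report handles general series-parallel graphs and both optimization directions, and the function name and the example timing table are ours). Given t[i][k], the measured response time of stage i on k processors, it distributes a budget of p processors to minimize total response time in O(np²) time, matching the complexity quoted above.

    def assign_processors(t, p):
        """t[i][k] = response time of pipeline stage i (0-based) with k processors,
        for 1 <= k <= p (index 0 of each t[i] is unused).  Returns the minimum total
        response time and the per-stage allocation for a linear pipeline; stage times
        are assumed non-increasing in k, so the full budget is always used."""
        n = len(t)
        assert p >= n, "need at least one processor per stage"
        INF = float("inf")
        # best[i][q]: minimum total response time of stages 0..i-1 using exactly q processors
        best = [[INF] * (p + 1) for _ in range(n + 1)]
        choice = [[0] * (p + 1) for _ in range(n + 1)]
        best[0][0] = 0.0
        for i in range(1, n + 1):
            for q in range(i, p + 1):                 # every stage gets at least one processor
                for k in range(1, q - (i - 1) + 1):   # processors given to stage i-1
                    cand = best[i - 1][q - k] + t[i - 1][k]
                    if cand < best[i][q]:
                        best[i][q], choice[i][q] = cand, k
        alloc, q = [], p
        for i in range(n, 0, -1):                     # recover the allocation
            k = choice[i][q]
            alloc.append(k)
            q -= k
        return best[n][p], alloc[::-1]

    # Made-up timings: 3 stages, 6 processors; the optimum here is 2 processors per stage.
    t = [[None, 9.0, 5.0, 4.0, 3.5, 3.2, 3.0],
         [None, 6.0, 3.5, 2.8, 2.5, 2.3, 2.2],
         [None, 4.0, 2.5, 2.0, 1.8, 1.7, 1.6]]
    total, alloc = assign_processors(t, 6)            # total == 11.0, alloc == [2, 2, 2]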

Nicol, David M.; Simha, Rahul; Choudhury, Alok N.; Narahari, Bhagirath

1991-01-01

278

Parallel algorithms for mapping pipelined and parallel computations  

NASA Technical Reports Server (NTRS)

Many computational problems in image processing, signal processing, and scientific computing are naturally structured for either pipelined or parallel computation. When mapping such problems onto a parallel architecture it is often necessary to aggregate an obvious problem decomposition. Even in this context the general mapping problem is known to be computationally intractable, but recent advances have been made in identifying classes of problems and architectures for which optimal solutions can be found in polynomial time. Among these, the mapping of pipelined or parallel computations onto linear array, shared memory, and host-satellite systems figures prominently. This paper extends that work first by showing how to improve existing serial mapping algorithms. These improvements have significantly lower time and space complexities: in one case a published O(nm³) time algorithm for mapping m modules onto n processors is reduced to an O(nm log m) time complexity, and its space requirements reduced from O(nm²) to O(m). Run time complexity is further reduced with parallel mapping algorithms based on these improvements, which run on the architecture for which they create the mappings.

Nicol, David M.

1988-01-01

279

Customization of application specific heterogeneous multi-pipeline processors  

Microsoft Academic Search

In this paper we propose Application Specific Instruction Set Processors with heterogeneous multiple pipelines to efficiently exploit the available parallelism at the instruction level. We have developed a design system based on the Thumb processor architecture. Given an application specified in C language, the design system can generate a processor with a number of pipelines specifically suitable to the application

Swarnalatha Radhakrishnan; Hui Guo; Sri Parameswaran

2006-01-01

280

Parallel asynchronous hardware implementation of image processing algorithms  

NASA Technical Reports Server (NTRS)

Research is being carried out on hardware for a new approach to focal plane processing. The hardware involves silicon injection mode devices. These devices provide a natural basis for parallel asynchronous focal plane image preprocessing. The simplicity and novel properties of the devices would permit an independent analog processing channel to be dedicated to every pixel. A laminar architecture built from arrays of the devices would form a two-dimensional (2-D) array processor with a 2-D array of inputs located directly behind a focal plane detector array. A 2-D image data stream would propagate in neuron-like asynchronous pulse-coded form through the laminar processor. No multiplexing, digitization, or serial processing would occur in the preprocessing stage. High performance is expected, based on pulse coding of input currents down to one picoampere with noise referred to input of about 10 femtoamperes. Linear pulse coding has been observed for input currents ranging over seven orders of magnitude. Low power requirements suggest utility in space and in conjunction with very large arrays. Very low dark current and multispectral capability are possible because of hardware compatibility with the cryogenic environment of high performance detector arrays. The aforementioned hardware development effort is aimed at systems which would integrate image acquisition and image processing.

Coon, Darryl D.; Perera, A. G. U.

1990-01-01

281

Transitive closure on the imagine stream processor  

SciTech Connect

The increasing gap between processor and memory speeds is a well-known problem in modern computer architecture. The Imagine system is designed to address the processor-memory gap through streaming technology. Stream processors are best-suited for computationally intensive applications characterized by high data parallelism and producer-consumer locality with minimal data dependencies. This work examines an efficient streaming implementation of the computationally intensive Transitive Closure (TC) algorithm on the Imagine platform. We develop a tiled TC algorithm specifically for the Imagine environment, which efficiently reuses streams to minimize expensive off-chip data transfers. The implementation requires complex stream programming since the memory hierarchy and cluster organization of the underlying architecture are exposed to the Imagine programmer. Results demonstrate that limited performance of TC is achieved primarily due to the complicated data-dependencies of the blocked algorithm. This work is an ongoing effort to identify classes of scientific problems well-suited for streaming processors.
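
Independent of the Imagine-specific tiling, the underlying computation is the Boolean Warshall iteration; a plain NumPy reference version is sketched below (ours, for illustration), and the streaming implementation described above tiles the same triple loop so that blocks of the reachability matrix stay resident in on-chip streams and are reused.

    import numpy as np

    def transitive_closure(adj):
        """Warshall's algorithm on a Boolean n x n adjacency matrix: after processing
        vertex k, reach[i, j] is True iff j is reachable from i using intermediate
        vertices drawn from {0, ..., k}."""
        reach = adj.copy()
        for k in range(reach.shape[0]):
            # Add every path of the form i -> k -> j.
            reach |= np.outer(reach[:, k], reach[k, :])
        return reach

    # Example: a directed 3-cycle closes to the complete reachability matrix.
    A = np.array([[False, True,  False],
                  [False, False, True ],
                  [True,  False, False]])
    assert transitive_closure(A).all()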

Griem, Gorden; Oliker, Leonid

2003-11-11

282

Architecture and data processing alternatives for the TSE computer. Volume 3: Execution of a parallel counting algorithm using array logic (Tse) devices  

NASA Technical Reports Server (NTRS)

A parallel algorithm for counting the number of logic-1 elements in a binary array or image, developed during a preliminary investigation of the Tse concept, is described. The counting algorithm is implemented using a basic combinational structure. Modifications which improve the efficiency of the basic structure are also presented. A programmable Tse computer structure is proposed, along with a hardware control unit, a Tse instruction set, and a software program for execution of the counting algorithm. Finally, a comparison is made between the different structures in terms of their more important characteristics.
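
On a general-purpose machine the same reduction is easy to express directly; the sketch below (illustrative Python, not the Tse combinational structure described above) splits the binary image into blocks, counts the ones in each block in a separate process, and sums the partial counts.

    from multiprocessing import Pool
    import numpy as np

    def count_ones(block):
        """Number of logic-1 elements in one block of the binary image."""
        return int(block.sum())

    def parallel_count(image, workers=4):
        """Split the flattened image into `workers` blocks, count each block in its own
        process, and reduce the partial counts with an ordinary sum."""
        blocks = np.array_split(image.ravel(), workers)
        with Pool(workers) as pool:
            return sum(pool.map(count_ones, blocks))

    if __name__ == "__main__":
        img = np.random.rand(128, 128) > 0.5
        assert parallel_count(img) == int(img.sum())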

Metcalfe, A. G.; Bodenheimer, R. E.

1976-01-01

283

Architectures for reasoning in parallel  

NASA Technical Reports Server (NTRS)

The research conducted has dealt with rule-based expert systems, investigating algorithms that may lead to their effective parallelization. Both the forward and backward chained control paradigms were investigated in the course of this work. The best computer architecture for the developed and investigated algorithms was also studied. Two experimental vehicles were developed to facilitate this research: Backpac, a parallel backward chained rule-based reasoning system, and Datapac, a parallel forward chained rule-based reasoning system. Both systems have been written in Multilisp, a version of Lisp which contains the parallel construct future. Applying future to a function causes it to be evaluated as a task running in parallel with the spawning task. Additionally, Backpac and Datapac have been run on several disparate parallel processors. The machines are an Encore Multimax with 10 processors, the Concert Multiprocessor with 64 processors, and a 32-processor BBN GP1000. Both the Concert and the GP1000 are switch-based machines. The Multimax has all its processors hung off a common bus. All are shared-memory machines, but they have different schemes for sharing the memory and different locales for the shared memory. The main results of the investigations come from experiments on the 10-processor Encore and the Concert with partitions of 32 or fewer processors. Additionally, experiments have been run with a stripped-down version of EMYCIN.

Hall, Lawrence O.

1989-01-01

284

The profiling method in multicore processor for effective performance improvement  

Microsoft Academic Search

Today, multi-core processors are being used widely in mobile environments in addition to the existing PC-based environment. In order to use a multi-core processor efficiently, parallel programming skills are required. However, incorrect parallelization can degrade the performance of the overall system in a mobile multitasking environment. Therefore, a profiling technique that measures the performance of the whole system is needed.

Seung Hyun Yoon; Kyung Min Lee; Yong Seok Kim; Seong Jin Cho; Dong Won Choi; Key Ho Kwon; Kil Jae Kim; Jong Hyun Park; Jae Wook Jeon

2012-01-01

285

Doppler-free, multiwavelength acousto-optic deflector for two-photon addressing arrays of Rb atoms in a quantum information processor.  

PubMed

We demonstrate a dual wavelength acousto-optic deflector (AOD) designed to deflect two wavelengths to the same angles by driving with two RF frequencies. The AOD is designed as a beam scanner to address two-photon transitions in a two-dimensional array of trapped neutral Rb87 atoms in a quantum computer. Momentum space is used to design AODs that have the same diffraction angles for two wavelengths (780 and 480 nm) and have nonoverlapping Bragg-matched frequency response at these wavelengths, so that there will be no cross talk when proportional frequencies are applied to diffract the two wavelengths. The appropriate crystal orientation, crystal shape, transducer size, and transducer height are determined for an AOD made with a tellurium dioxide crystal (TeO2). The designed and fabricated AOD has more than 100 resolvable spots, widely separated band shapes for the two wavelengths within an overall octave bandwidth, spatially overlapping diffraction angles for both wavelengths (780 and 480 nm), and a 4 µs or less access time. Cascaded AODs in which the first device upshifts and the second downshifts allow Doppler-free scanning as required for addressing the narrow atomic resonance without detuning. We experimentally show the diffraction-limited Doppler-free scanning performance and spatial resolution of the designed AOD. PMID:18404181

Kim, Sangtaek; Mcleod, Robert R; Saffman, M; Wagner, Kelvin H

2008-04-10

286

Doppler-free, multiwavelength acousto-optic deflector for two-photon addressing arrays of Rb atoms in a quantum information processor  

NASA Astrophysics Data System (ADS)

We demonstrate a dual wavelength acousto-optic deflector (AOD) designed to deflect two wavelengths to the same angles by driving with two RF frequencies. The AOD is designed as a beam scanner to address two-photon transitions in a two-dimensional array of trapped neutral Rb87 atoms in a quantum computer. Momentum space is used to design AODs that have the same diffraction angles for two wavelengths (780 and 480 nm) and have nonoverlapping Bragg-matched frequency response at these wavelengths, so that there will be no cross talk when proportional frequencies are applied to diffract the two wavelengths. The appropriate crystal orientation, crystal shape, transducer size, and transducer height are determined for an AOD made with a tellurium dioxide crystal (TeO2). The designed and fabricated AOD has more than 100 resolvable spots, widely separated band shapes for the two wavelengths within an overall octave bandwidth, spatially overlapping diffraction angles for both wavelengths (780 and 480 nm), and a 4 µs or less access time. Cascaded AODs in which the first device upshifts and the second downshifts allow Doppler-free scanning as required for addressing the narrow atomic resonance without detuning. We experimentally show the diffraction-limited Doppler-free scanning performance and spatial resolution of the designed AOD.
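
The proportional-frequency condition follows from the small-angle Bragg relation theta = lambda * f / v: the two wavelengths overlap in angle when the drive frequencies scale as 1/lambda. The sketch below illustrates that arithmetic only; the acoustic velocity and the 80 MHz drive frequency are assumed, illustrative values, not parameters taken from the paper.

    # assumed slow-shear acoustic velocity in TeO2, for illustration only
    v_acoustic = 650.0             # m/s
    lam_1, lam_2 = 780e-9, 480e-9  # m, the two addressing wavelengths

    def deflection_angle(lam, f):
        """First-order Bragg deflection angle, small-angle approximation (rad)."""
        return lam * f / v_acoustic

    f_1 = 80e6                     # Hz, example drive frequency for 780 nm
    f_2 = f_1 * lam_1 / lam_2      # proportional frequency for 480 nm

    print(f"f_2 = {f_2 / 1e6:.1f} MHz")                       # 130 MHz
    print(deflection_angle(lam_1, f_1), deflection_angle(lam_2, f_2))  # equal angles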

Kim, Sangtaek; McLeod, Robert R.; Saffman, M.; Wagner, Kelvin H.

2008-04-01

287

A novel picoliter droplet array for parallel real-time polymerase chain reaction based on double-inkjet printing.  

PubMed

We developed and characterized a novel picoliter droplet-in-oil array generated by a double-inkjet printing method on a uniform hydrophobic silicon chip specifically designed for quantitative polymerase chain reaction (qPCR) analysis. Double-inkjet printing was proposed to efficiently address the evaporation issues of picoliter droplets during array generation on a planar substrate without the assistance of a humidifier or glycerol. The method utilizes piezoelectric inkjet printing equipment to precisely eject a reagent droplet into an oil droplet, which had first been dispensed on a hydrophobic and oleophobic substrate. No evaporation, random movement, or cross-contamination was observed during array fabrication and thermal cycling. We demonstrated the feasibility and effectiveness of this novel double-inkjet method for real-time PCR analysis. This method can readily produce multivolume droplet-in-oil arrays with volume variations ranging from picoliters to nanoliters. This feature would be useful for simultaneous multivolume PCR experiments aimed at wide and tunable dynamic ranges. These double-inkjet-based picoliter droplet arrays may have potential for multiplexed applications that require isolated containers for single-cell cultures, single molecular enzymatic assays, or digital PCR and provide an alternative option for generating droplet arrays on planar substrates without chemical patterning. PMID:25070461

Sun, Yingnan; Zhou, Xiaoguang; Yu, Yude

2014-09-21

288

FPGA realization of a split radix FFT processor  

NASA Astrophysics Data System (ADS)

Applications based on the Fast Fourier Transform (FFT), such as signal and image processing, require high computational power, plus the ability to choose the algorithm and architecture to implement it. This paper explains the realization of a Split Radix FFT (SRFFT) processor based on a pipeline architecture reported before by the same authors. This architecture has as basic building blocks a Complex Butterfly and a Delay Commutator. The main advantages of this architecture are that it combines the higher parallelism of the 4r-FFTs with the ability to process sequences whose length is any power of two, and that the multipliers and adder-subtracters implicit in the SRFFT operate simultaneously, which leads to faster operation at the same degree of pipelining. The implementation has been made on a Field Programmable Gate Array (FPGA) as a way of obtaining high performance at an economical price and with a short development time. The Delay Commutator has been designed to be customized for even and odd SRFFT computation levels. It can be used with segmented arithmetic of any level of pipeline in order to speed up the operating frequency. The processor has been simulated up to 350 MHz, with an EP2S15F672C3 Altera Stratix II as a target device, for a transform length of 256 complex points.
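
As a software reference for the transform itself (not the pipelined butterfly/delay-commutator hardware described in the paper), one standard split-radix decomposition splits a length-N sequence into one length-N/2 and two length-N/4 sub-transforms. A minimal sketch, assuming power-of-two lengths, is shown below.

    import cmath

    def srfft(x):
        """Recursive split-radix DFT (decimation in time) of a length-2^m sequence."""
        n = len(x)
        if n == 1:
            return list(x)
        if n == 2:
            return [x[0] + x[1], x[0] - x[1]]
        u = srfft(x[0::2])            # even samples: length n/2 sub-DFT
        z = srfft(x[1::4])            # samples 1 mod 4: length n/4 sub-DFT
        zp = srfft(x[3::4])           # samples 3 mod 4: length n/4 sub-DFT
        X = [0j] * n
        q = n // 4
        for k in range(q):
            w1 = cmath.exp(-2j * cmath.pi * k / n)
            w3 = cmath.exp(-2j * cmath.pi * 3 * k / n)
            s = w1 * z[k] + w3 * zp[k]
            t = w1 * z[k] - w3 * zp[k]
            X[k] = u[k] + s
            X[k + 2 * q] = u[k] - s
            X[k + q] = u[k + q] - 1j * t
            X[k + 3 * q] = u[k + q] + 1j * t
        return X

    # quick check against a direct DFT for 8 points
    xs = [complex(i, 0) for i in range(8)]
    direct = [sum(xs[m] * cmath.exp(-2j * cmath.pi * m * k / 8) for m in range(8))
              for k in range(8)]
    print(max(abs(a - b) for a, b in zip(srfft(xs), direct)) < 1e-9)   # True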

García, Jesús; Michell, Juan A.; Ruiz, Gustavo; Burón, Angel M.

2007-05-01

289

Sandia secure processor : a native Java processor.  

SciTech Connect

The Sandia Secure Processor (SSP) is a new native Java processor that has been specifically designed for embedded applications. The SSP is a system composed of a core Java processor that directly executes Java bytecodes, on-chip intelligent IO modules, and a suite of software tools for simulation and compiling executable binary files. The SSP is unique in that it provides a way to control real-time IO modules for embedded applications. The system software for the SSP is a 'class loader' that takes Java .class files (created with your favorite Java compiler), links them together, and compiles a binary. The complete SSP system provides very powerful functionality with very light hardware requirements, and has the potential to be used in a wide variety of small-system embedded applications. This paper gives a detailed description of the Sandia Secure Processor and its unique features.

Wickstrom, Gregory Lloyd; Gale, Jason Carl; Ma, Kwok Kee

2003-08-01

290

C-slow Technique vs Multiprocessor in designing Low Area Customized Instruction set Processor for Embedded Applications  

E-print Network

The demand for high performance embedded processors for consumer electronics has been rapidly increasing over the past few years. Many of these embedded processors depend upon custom-built Instruction Set Architectures (ISA) such as game processors (GPU), multimedia processors, DSP processors, etc. The primary requirement for the consumer electronics industry is low cost with high performance and low power consumption. A great deal of research has been devoted to enhancing the performance of embedded processors through parallel computing. But some of it focuses on superscalar processors, i.e. single processors with more resources exploiting Instruction Level Parallelism (ILP), which includes Very Long Instruction Word (VLIW) architecture and custom instruction set extensible processor architecture, while other approaches require more processing units on a single chip to exploit Thread Level Parallelism (TLP), which includes Simultaneous Multithreading (SMT), Chip Multithreading (CMT) and Chip Multiprocessing (CMP). In this paper, we present a new technique, n...

Akram, Muhammad Adeel; Sarfaraz, Muhammad Masood

2012-01-01

291

An implementation of scoreboarding mechanism for ARM-based SMT processor  

Microsoft Academic Search

A SMT architecture uses TLP (Thread Level Parallelism) and increases processor throughput, such that issue slots can be filled with instructions from multiple independent threads. Having multiple ready threads reduces the probability that a functional unit is left idle, which increases processor efficiency. To utilize those advantages for the SMT processor, the issue unit must control the flow of instructions

Chang-Yong Heo; Kyu-Baik Choi; In-Pyo Hong; Yong-Surk Lee

2003-01-01

292

Models for Dynamic Load Balancing in a Heterogeneous Multiple Processor System  

Microsoft Academic Search

Queueing models for a simple heterogeneous multiple processor system are presented, analyzed, and compared. Each model is distinguished by a job routing strategy which is designed to reduce the average job turnaround time by balancing the total load among the processors. In each case an arriving job is routed by a job dispatcher to one of m parallel processors. The

Yuan-chieh Chow; Walter H. Kohler

1979-01-01

293

Parallel Computations on Reconfigurable Meshes  

Microsoft Academic Search

The mesh with reconfigurable bus is presented as a model of computation. The reconfigurable mesh captures salient features from a variety of sources, including the CAAPP, CHiP, polymorphic-torus network, and bus automaton. It consists of an array of processors interconnected by a reconfigurable bus system that can be used to dynamically obtain various interconnection patterns between the processors. A variety

Russ Miller; Viktor K. Prasanna; Dionisios I. Reisis; Quentin F. Stout

1993-01-01

294

Primitive operations for a hierarchical parallel processor  

SciTech Connect

Pyramid data structures make some image processing operations easier to compute. This paper discusses the programming strategies for pyramid machines with a view toward a set of primitive operations. One such set of operations is described. 8 references.

Tanimoto, S.L.

1982-01-01

295

Digital optical cellular image processor (DOCIP): experimental implementation.  

PubMed

We demonstrate experimentally the concept of the digital optical cellular image processor architecture by implementing one processing element of a prototype optical computer that includes a 54-gate processor, an instruction decoder, and electronic input-output interfaces. The processor consists of a two-dimensional (2-D) array of 54 optical logic gates implemented by use of a liquid-crystal light valve and a 2-D array of 53 subholograms to provide interconnections between gates. The interconnection hologram is fabricated by a computer-controlled optical system. PMID:20802673

Huang, K S; Sawchuk, A A; Jenkins, B K; Chavel, P; Wang, J M; Weber, A G; Wang, C H; Glaser, I

1993-01-10

296

"Parallel array" of tubules with pipe-like structure in the mitochondria of glomerulosa cells of adrenal cortex. Ultrastructural and freeze-fracture replica studies.  

PubMed

We found "parallel array" of tubules with a pipe-like structure in the mitochondria of glomerulosa cells of the adrenal cortex of the rats treated with atrial natriuretic factor (ANF) that inhibits aldosterone production. This structure was characterized by the regular parallel arrays of tubules running straight along the mitochondrial long axis, representing a pipe-like structure. The cross section exhibited the round shape with an electron lucent core measuring about 40 nm and dual thick membranes measuring 10 nm in thickness. Some of these tubules were connected with the mitochondrial inner membrane. Freeze fracture replica represented round particles distributed on the surface membrane of these tubules, indicating protein particles. This particular structure was considered to be one of the changes occurring in the mitochondrial inner membrane on which structural proteins were localized. These tubules probably appeared in connection with the impaired mitochondrial protein synthesis in the process of the biosynthesis of mineralcorticoid, because this structure was found only in the glomerulosa cells under the decreased and increased aldosterone production. PMID:2527759

Kawai, K; Shigematsu, K; Irie, J; Tsuchiyama, H

1989-01-01

297

Efficient parallel algorithms for some graph problems  

Microsoft Academic Search

We study parallel algorithms for a number of graph problems, using the Single Instruction Stream-Multiple Data Stream model. We assume that the processors have access to a common memory and that no memory or data alignment time penalties are incurred. We derive a general time bound for a parallel algorithm that uses K processors for finding the connected components of

Francis Y. L. Chin; John Lam; I-Ngo Chen

1982-01-01

298

Data parallel sequential circuit fault simulation  

Microsoft Academic Search

Sequential circuit fault simulation is a compute-intensive problem. Parallel simulation is one method to reduce fault simulation time. In this paper, we discuss a novel technique to partition the fault set for the fault parallel simulation of sequential circuits on multiple processors. When applied statically, the technique can scale well for up to thirty two processors on an ethernet. The

Minesh B. Amin; Bapiraju Vinnakota

1996-01-01

299

Virtual Reality and Parallel Systems Performance Analysis  

Microsoft Academic Search

Recording and analyzing the dynamics of application program, system software, and hardware interactions are the keys to understanding and tuning the performance of massively parallel systems. Because massively parallel systems contain hundreds or thousands of processors, each potentially with many dynamic performance metrics, the performance data occupy a sparsely populated, high-dimensional space. These dynamic performance metrics for each processor define

Daniel A. Reed; Keith A. Shields; Will H. Scullin; Luis F. Tawera; Christopher L. Elford

1995-01-01

300

Parallel Algorithms for Computer Vision on the Connection Machine  

E-print Network

The Connection Machine is a fine-grained parallel computer having up to 64K processors. It supports both local communication among the processors, which are situated in a two-dimensional mesh, and high-bandwidth ...

Little, James J.

1986-11-01

301

Design and microfabrication of a high-aspect-ratio PDMS microbeam array for parallel nanonewton force measurement and protein printing  

NASA Astrophysics Data System (ADS)

Cell and protein mechanics has applications ranging from cellular development to tissue engineering. Techniques such as magnetic tweezers, optical tweezers and atomic force microscopy have been used to measure cell deformation forces of the order of piconewtons to nanonewtons. In this study, an array of polymeric polydimethylsiloxane (PDMS) microbeams with diameters of 10-40 µm and lengths of 118 µm was fabricated from Sylgard® with curing agent concentrations ranging from 5% to 20%. The resulting spring constants were 100-300 nN µm⁻¹. The elastic modulus of PDMS was determined experimentally at different curing agent concentrations and found to be 346 kPa to 704 kPa in a millimeter-scale array and ~1 MPa in a microbeam array. Additionally, the microbeam array was used to print laminin for the purpose of cell adhesion. Linear and nonlinear finite element analyses are presented and compared to the closed-form solution. The highly compliant, transparent, biocompatible PDMS may offer a method for more rapid throughput in cell and protein mechanics force measurement experiments with sensitivities necessary for highly compliant structures such as axons.
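
The quoted stiffness range is consistent with the textbook tip-load bending estimate k = 3EI/L^3 for a circular cantilever (an assumption used here for a quick check; the paper relies on finite element analyses and a closed-form solution rather than this simple formula).

    import math

    def cantilever_spring_constant(E_pa, d_m, L_m):
        """Tip-load bending stiffness of a circular cantilever: k = 3*E*I/L^3,
        with I = pi*d^4/64. A small-deflection textbook estimate, not the
        paper's finite element model."""
        I = math.pi * d_m ** 4 / 64.0      # second moment of area of a circle
        return 3.0 * E_pa * I / L_m ** 3   # N/m

    # illustrative numbers from the abstract: E ~ 1 MPa, d = 40 um, L = 118 um
    k = cantilever_spring_constant(1.0e6, 40e-6, 118e-6)
    # about 0.23 N/m, i.e. roughly 230 nN/um, within the quoted 100-300 nN/um range
    print(f"k = {k:.2f} N/m = {k * 1e3:.0f} nN/um")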

Sasoglu, F. M.; Bohl, A. J.; Layton, B. E.

2007-03-01

302

Texas Instruments' DLP products massively paralleled MOEMS arrays for display applications: a distant second to Mother Nature  

NASA Astrophysics Data System (ADS)

This paper describes the business scope within which DLP® Products operates, with emphasis placed on some of the technological complications and challenges present when developing an actuator array with the ultimate intention of rendering visual content at high-definition and standard video rates. Additionally, some general thoughts on alternative applications of this spatial light modulation technology are provided.

Oden, P. I.

2008-04-01

303

Parallel architecture for OPS5  

Microsoft Academic Search

An architecture that captures some of the inherent parallelism of the OPS5 expert system language has been designed and implemented at Oak Ridge National Laboratory. A central feature of this architecture is a network bus over which a single host processor broadcasts messages to a set of parallel rule processors. This transmit-only bus is implemented by a memory-mapped scheme which

Philip L. Butler; J. D. Allen Jr.; Donald W. Bouldin

1988-01-01

304

Parallel Algorithms for Term Matching  

Microsoft Academic Search

We present a new randomized parallel algorithm for term matching. Let n be the number of nodes of the directed acyclic graphs (dags) representing the terms to be matched; then our algorithm uses O(log² n) parallel time and M(n) processors, where M(n) is the complexity of n by n matrix multiplication. The number of processors is a significant improvement over previously

Cynthia Dwork; Paris C. Kanellakis; Larry J. Stockmeyer

1986-01-01

305

Switch for serial or parallel communication networks  

DOEpatents

A communication switch apparatus and a method for use in a geographically extensive serial, parallel or hybrid communication network linking a multi-processor or parallel processing system has a very low software processing overhead in order to accommodate random bursts of high-density data. Associated with each processor is a communication switch. A data source and a data destination, a sensor suite or robot for example, may also be associated with a switch. The configuration of the switches in the network is coordinated through a master processor node and depends on the operational phase of the multi-processor network: data acquisition, data processing, and data exchange. The master processor node passes information on the state to be assumed by each switch to the processor node associated with the switch. The processor node then operates a series of multi-state switches internal to each communication switch. The communication switch does not parse and interpret communication protocol and message routing information. During a data acquisition phase, the communication switch couples sensors producing data to the processor node associated with the switch, to a downlink destination on the communications network, or to both. It also may couple an uplink data source to its processor node. During the data exchange phase, the switch couples its processor node or an uplink data source to a downlink destination (which may include a processor node or a robot), or couples an uplink source to its processor node and its processor node to a downlink destination.

Crosette, Dario B. (DeSoto, TX)

1994-01-01

306

Switch for serial or parallel communication networks  

DOEpatents

A communication switch apparatus and a method for use in a geographically extensive serial, parallel or hybrid communication network linking a multi-processor or parallel processing system has a very low software processing overhead in order to accommodate random bursts of high-density data. Associated with each processor is a communication switch. A data source and a data destination, a sensor suite or robot for example, may also be associated with a switch. The configuration of the switches in the network is coordinated through a master processor node and depends on the operational phase of the multi-processor network: data acquisition, data processing, and data exchange. The master processor node passes information on the state to be assumed by each switch to the processor node associated with the switch. The processor node then operates a series of multi-state switches internal to each communication switch. The communication switch does not parse and interpret communication protocol and message routing information. During a data acquisition phase, the communication switch couples sensors producing data to the processor node associated with the switch, to a downlink destination on the communications network, or to both. It also may couple an uplink data source to its processor node. During the data exchange phase, the switch couples its processor node or an uplink data source to a downlink destination (which may include a processor node or a robot), or couples an uplink source to its processor node and its processor node to a downlink destination. 9 figs.

Crosette, D.B.

1994-07-19

307

A rapidly modulated multifocal detection scheme for parallel acquisition of Raman spectra from a 2-D focal array.  

PubMed

We report the development of a rapidly modulated multifocal detection scheme that enables full Raman spectra (~500-2000 cm⁻¹) from a 2-D focal array to be acquired simultaneously. A spatial light modulator splits a laser beam to generate an m × n multifocal array. Raman signals generated within each focus are projected simultaneously into a spectrometer and imaged onto a TE-cooled CCD camera. A shuttering system using different masks is constructed to collect the superimposed Raman spectra of different multifocal patterns. The individual Raman spectrum from each focus is then retrieved from the superimposed spectra with no crosstalk using a postacquisition data processing algorithm. This system is expected to significantly improve the speed of current Raman-based instruments such as laser tweezers Raman spectroscopy and hyperspectral Raman imaging. PMID:24892877
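
One way to picture the postacquisition retrieval step (the paper's exact algorithm is not reproduced here) is as a linear unmixing problem: if the binary shutter patterns form an invertible matrix, the per-focus spectra follow from a linear solve. The patterns and spectra below are invented for illustration.

    import numpy as np

    rng = np.random.default_rng(0)

    n_foci, n_pix = 4, 256                      # 4 foci, 256 spectral pixels
    true_spectra = rng.random((n_foci, n_pix))  # stand-in per-focus Raman spectra

    # binary shutter patterns: each row says which foci contribute to one exposure
    # (an invertible pattern set assumed here, not taken from the paper)
    patterns = np.array([[1, 1, 1, 1],
                         [1, 0, 1, 0],
                         [1, 1, 0, 0],
                         [1, 0, 0, 1]], dtype=float)

    superimposed = patterns @ true_spectra      # what the CCD records per exposure

    # post-acquisition retrieval: invert the pattern matrix to separate the foci
    recovered = np.linalg.solve(patterns, superimposed)
    print(np.allclose(recovered, true_spectra))  # -> True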

Kong, Lingbo; Chan, James

2014-07-01

308

Speculative multithreaded processors  

Microsoft Academic Search

In this paper we present a novel processor microarchitecture that relieves four of the most important bottlenecks of superscalar processors: the serialization imposed by true dependences, the instruction window size, the complexity of a wide issue machine and the instruction fetch bandwidth requirements. The new microarchitecture executes simultaneously multiple threads of control obtained from a single program by means of

Pedro Marcuello; Antonio González; Jordi Tubella

1998-01-01

309

System and method for representing and manipulating three-dimensional objects on massively parallel architectures  

DOEpatents

A parallel computing system is described that comprises a plurality of uniquely labeled, parallel processors, each processor capable of modelling a three-dimensional object that includes a plurality of vertices, faces and edges. The system comprises a front-end processor for issuing a modelling command to the parallel processors, relating to a three-dimensional object. Each parallel processor, in response to the command and through the use of its own unique label, creates a directed-edge (d-edge) data structure that uniquely relates an edge of the three-dimensional object to one face of the object. Each d-edge data structure at least includes vertex descriptions of the edge and a description of the one face. As a result, each processor, in response to the modelling command, operates upon a small component of the model and generates results, in parallel with all other processors, without the need for processor-to-processor intercommunication.
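
A minimal sketch of the d-edge idea, with field names that are illustrative guesses rather than the patent's actual record layout, might look as follows.

    from dataclasses import dataclass
    from typing import Tuple

    @dataclass(frozen=True)
    class DEdge:
        """Directed-edge record as sketched in the abstract: one edge of the solid
        bound to exactly one incident face."""
        tail: Tuple[float, float, float]   # vertex description: edge start point
        head: Tuple[float, float, float]   # vertex description: edge end point
        face_id: int                       # description of the single incident face
        owner_label: int                   # unique label of the processor that built it

    # a triangular face (face 7) handled by processor 3 yields three d-edges
    v = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
    dedges = [DEdge(v[i], v[(i + 1) % 3], face_id=7, owner_label=3) for i in range(3)]
    print(dedges[0])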

Karasick, Michael S. (Ridgefield, CT); Strip, David R. (Albuquerque, NM)

1996-01-01

310

System and method for representing and manipulating three-dimensional objects on massively parallel architectures  

DOEpatents

A parallel computing system is described that comprises a plurality of uniquely labeled, parallel processors, each processor capable of modeling a three-dimensional object that includes a plurality of vertices, faces and edges. The system comprises a front-end processor for issuing a modeling command to the parallel processors, relating to a three-dimensional object. Each parallel processor, in response to the command and through the use of its own unique label, creates a directed-edge (d-edge) data structure that uniquely relates an edge of the three-dimensional object to one face of the object. Each d-edge data structure at least includes vertex descriptions of the edge and a description of the one face. As a result, each processor, in response to the modeling command, operates upon a small component of the model and generates results, in parallel with all other processors, without the need for processor-to-processor intercommunication. 8 figs.

Karasick, M.S.; Strip, D.R.

1996-01-30

311

Design definition for a digital beamforming processor  

NASA Astrophysics Data System (ADS)

Very large scale integrated circuit technology now makes large bandwidth digital beamforming array antennas practical. Algorithms and architectures were investigated for the implementation of a processor capable of producing large bandwidth multiple output beams for both near and far-term applications. Algorithms in element space and beam space were investigated. Structures for dedicated algorithm execution with highly pipelined, systolic hardware were examined. Arithmetic execution alternatives were considered. The impact of channel errors was investigated and methods of calibrating the beamformer to compensate for these errors were developed. The effects of quantization errors were investigated and processor dynamic range requirements were assessed. The capabilities of Si and GaAs technologies were assessed. The implementation of a processor chip set using Application Specific Integrated Circuits (ASIC) was investigated. A recommended brassboard demonstration system design was derived.
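
For orientation, element-space beamforming in the narrowband case amounts to applying a per-element phase shift for the look direction and summing. The sketch below is a generic illustration with assumed array parameters, not the report's processor architecture.

    import numpy as np

    def steering_phases(n_elem, d_over_lambda, theta_rad):
        """Per-element phase progression of a plane wave from angle theta across a
        uniform linear array (element spacing d, wavelength lambda)."""
        n = np.arange(n_elem)
        return 2.0 * np.pi * d_over_lambda * n * np.sin(theta_rad)

    def beamform(snapshot, theta_look, d_over_lambda):
        """Element-space phase-shift-and-sum: undo the expected phase progression
        for the look direction, then sum the elements into one beam output."""
        w = np.exp(-1j * steering_phases(len(snapshot), d_over_lambda, theta_look))
        return np.dot(w, snapshot)

    # simulate a unit plane wave arriving from 20 degrees on a 16-element array
    n_elem, d_over_lambda = 16, 0.5
    snapshot = np.exp(1j * steering_phases(n_elem, d_over_lambda, np.deg2rad(20.0)))

    for look_deg in (0.0, 20.0, 40.0):
        y = beamform(snapshot, np.deg2rad(look_deg), d_over_lambda)
        print(look_deg, round(abs(y), 2))        # peaks at 16 when look_deg == 20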

Langston, J. L.; Sanzgiri, Shashikant; Hinman, Karl; Keisner, Kevin; Garcia, Domingo

1988-04-01

312

Processing techniques for software based SAR processors  

NASA Technical Reports Server (NTRS)

Software SAR processing techniques defined to treat Shuttle Imaging Radar-B (SIR-B) data are reviewed. The algorithms are devised for the data processing procedure selection, SAR correlation function implementation, multiple array processors utilization, cornerturning, variable reference length azimuth processing, and range migration handling. The Interim Digital Processor (IDP) originally implemented for handling Seasat SAR data has been adapted for the SIR-B, and offers a resolution of 100 km using a processing procedure based on the Fast Fourier Transformation fast correlation approach. Peculiarities of the Seasat SAR data processing requirements are reviewed, along with modifications introduced for the SIR-B. An Advanced Digital SAR Processor (ADSP) is under development for use with the SIR-B in the 1986 time frame as an upgrade for the IDP, which will be in service in 1984-5.
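
The fast correlation approach mentioned above replaces direct convolution with multiplication in the frequency domain. A generic sketch (illustrative only, with an invented chirp reference rather than actual SIR-B parameters) is shown below.

    import numpy as np

    def fast_correlate(signal, reference):
        """Matched filtering by the FFT 'fast correlation' route: multiply the
        signal spectrum by the conjugate reference spectrum, then invert."""
        n = len(signal) + len(reference) - 1
        S = np.fft.fft(signal, n)
        R = np.fft.fft(reference, n)
        return np.fft.ifft(S * np.conj(R))

    # a chirp-like reference buried in noise; the correlation peak marks its delay
    ref = np.exp(1j * np.pi * 0.01 * np.arange(64) ** 2)
    sig = np.concatenate([np.zeros(100, complex), ref, np.zeros(92, complex)])
    sig += 0.1 * (np.random.randn(len(sig)) + 1j * np.random.randn(len(sig)))
    print(int(np.argmax(np.abs(fast_correlate(sig, ref)))))   # ~100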

Leung, K.; Wu, C.

1983-01-01

313

Dynamic parallel complexity of computational circuits  

Microsoft Academic Search

The dynamic parallel complexity of general computational circuits (defined in introduction) is discussed. We exhibit some relationships between parallel circuit evaluation and some uniform closure properties of a certain class of unary functions and present a systematic method for the design of processor efficient parallel algorithms for circuit evaluation. Using this method: (1) we improve the algorithm for parallel Boolean

Gary L. Miller; Shang-Hua Teng

1987-01-01

314

Control structures for high speed processors  

NASA Technical Reports Server (NTRS)

A special processor was designed to function as a Reed Solomon decoder with a throughput data rate in the MHz range. This data rate is significantly greater than is possible with conventional digital architectures. To achieve this rate, the processor design includes sequential, pipelined, distributed, and parallel processing. The processor was designed using a high-level register transfer language (RTL). The RTL can be used to describe how the different processes are implemented by the hardware. One problem of special interest was the development of dependent processes, which are analogous to software subroutines. For greater flexibility, the RTL control structure was implemented in ROM. The special purpose hardware required approximately 1000 SSI and MSI components. The data rate throughput is 2.5 megabits/second. This data rate is achieved through the use of pipelined and distributed processing. This data rate can be compared with 800 kilobits/second in a recently proposed very large scale integration design of a Reed Solomon encoder.

Maki, G. K.; Mankin, R.; Owsley, P. A.; Kim, G. M.

1982-01-01

315

Development of a prototype PET scanner with depth-of-interaction measurement using solid-state photomultiplier arrays and parallel readout electronics.  

PubMed

In this study, we developed a prototype animal PET by applying several novel technologies to use solid-state photomultiplier (SSPM) arrays to measure the depth of interaction (DOI) and improve imaging performance. Each PET detector has an 8 × 8 array of about 1.9 × 1.9 × 30.0 mm³ lutetium-yttrium-oxyorthosilicate scintillators, with each end optically connected to an SSPM array (16 channels in a 4 × 4 matrix) through a light guide to enable continuous DOI measurement. Each SSPM has an active area of about 3 × 3 mm², and its output is read by a custom-developed application-specific integrated circuit to directly convert analogue signals to digital timing pulses that encode the interaction information. These pulses are transferred to and are decoded by a field-programmable gate array-based time-to-digital converter for coincident event selection and data acquisition. The independent readout of each SSPM and the parallel signal process can significantly improve the signal-to-noise ratio and enable the use of flexible algorithms for different data processes. The prototype PET consists of two rotating detector panels on a portable gantry with four detectors in each panel to provide 16 mm axial and variable transaxial field-of-view (FOV) sizes. List-mode ordered subset expectation maximization image reconstruction was implemented. The measured mean energy, coincidence timing and DOI resolution for a crystal were about 17.6%, 2.8 ns and 5.6 mm, respectively. The measured transaxial resolutions at the center of the FOV were 2.0 mm and 2.3 mm for images reconstructed with and without DOI, respectively. In addition, the resolutions across the FOV with DOI were substantially better than those without DOI. The quality of PET images of both a hot-rod phantom and mouse acquired with DOI was much higher than that of images obtained without DOI. This study demonstrates that SSPM arrays and advanced readout/processing electronics can be used to develop a practical DOI-measurable PET scanner. PMID:24556629
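
With dual-ended readout, a common continuous DOI estimator (assumed here for illustration; the abstract does not spell out the estimator used) is the ratio of the light collected at one end to the total, mapped onto the crystal length after calibration.

    def doi_from_dual_readout(signal_a, signal_b, crystal_len_mm=30.0):
        """Continuous depth-of-interaction estimate from dual-ended readout.

        The ratio of one end's collected light to the total varies roughly
        monotonically with depth, so it can be mapped onto the crystal length
        once calibrated; the linear mapping below is a naive placeholder.
        """
        ratio = signal_a / (signal_a + signal_b)
        return ratio * crystal_len_mm

    print(doi_from_dual_readout(620.0, 380.0))   # -> 18.6 mm from the "B" end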

Shao, Yiping; Sun, Xishan; Lan, Kejian A; Bircher, Chad; Lou, Kai; Deng, Zhi

2014-03-01

316

Generating local addresses and communication sets for data-parallel programs  

NASA Technical Reports Server (NTRS)

Generating local addresses and communication sets is an important issue in distributed-memory implementations of data-parallel languages such as High Performance Fortran. We show that for an array A affinely aligned to a template that is distributed across p processors with a cyclic(k) distribution, and a computation involving a regular section of A, the local memory access sequence for any processor is characterized by a finite state machine of at most k states. We present fast algorithms for computing the essential information about these state machines, and extend the framework to handle multidimensional arrays. We also show how to generate communication sets using the state machine approach. Performance results show that this solution requires very little runtime overhead and acceptable preprocessing time.

Chatterjee, Siddhartha; Gilbert, John R.; Long, Fred J. E.; Schreiber, Robert; Teng, Shang-Hua

1993-01-01

317

Generating local addresses and communication sets for data-parallel programs  

NASA Technical Reports Server (NTRS)

Generating local addresses and communication sets is an important issue in distributed-memory implementations of data-parallel languages such as High Performance FORTRAN. We show that, for an array A affinely aligned to a template that is distributed across p processors with a cyclic(k) distribution and a computation involving the regular section A(l:h:s), the local memory access sequence for any processor is characterized by a finite state machine of at most k states. We present fast algorithms for computing the essential information about these state machines, and extend the framework to handle multidimensional arrays. We also show how to generate communication sets using the state machine approach. Performance results show that this solution requires very little run-time overhead and acceptable preprocessing time.
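
The access pattern that the k-state machine reproduces can be generated by brute force for small cases, which is useful as a reference. The sketch below directly enumerates the local offsets for a cyclic(k) distribution, assuming 0-based indices and identity alignment for simplicity; it is not the paper's FSM construction.

    def local_access_sequence(l, h, s, k, p, my_id):
        """Direct enumeration of the local memory offsets touched on one processor
        for the regular section A(l:h:s) under a cyclic(k) distribution over p
        processors (0-based indices, identity alignment assumed)."""
        offsets = []
        for i in range(l, h + 1, s):              # global indices in the section
            owner = (i // k) % p                  # cyclic(k): blocks dealt round-robin
            if owner == my_id:
                local_block = i // (p * k)        # how many of my blocks precede i
                offsets.append(local_block * k + i % k)
        return offsets

    # section A(3:40:4) of an array distributed cyclic(5) over 4 processors
    for pid in range(4):
        print(pid, local_access_sequence(3, 40, 4, k=5, p=4, my_id=pid))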

Chatterjee, Siddhartha; Gilbert, John R.; Long, Fred J. E.; Schreiber, Robert; Teng, Shang-Hua

1993-01-01

318

Rapid, Single-Molecule Assays in Nano/Micro-Fluidic Chips with Arrays of Closely Spaced Parallel Channels Fabricated by Femtosecond Laser Machining  

PubMed Central

Cost-effective pharmaceutical drug discovery depends on increasing assay throughput while reducing reagent needs. To this end, we are developing an ultrasensitive, fluorescence-based platform that incorporates a nano/micro-fluidic chip with an array of closely spaced channels for parallelized optical readout of single-molecule assays. Here we describe the use of direct femtosecond laser machining to fabricate several hundred closely spaced channels on the surfaces of fused silica substrates. The channels are sealed by bonding to a microscope cover slip spin-coated with a thin film of poly(dimethylsiloxane). Single-molecule detection experiments are conducted using a custom-built, wide-field microscope. The array of channels is epi-illuminated by a line-generating red diode laser, resulting in a line focus just a few microns thick across a 500 micron field of view. A dilute aqueous solution of fluorescently labeled biomolecules is loaded into the device and fluorescence is detected with an electron-multiplying CCD camera, allowing acquisition rates up to 7 kHz for each microchannel. Matched digital filtering based on experimental parameters is used to perform an initial, rapid assessment of detected fluorescence. More detailed analysis is obtained through fluorescence correlation spectroscopy. Simulated fluorescence data is shown to agree well with experimental values. PMID:25140634

Canfield, Brian K.; King, Jason K.; Robinson, William N.; Hofmeister, William H.; Davis, Lloyd M.

2014-01-01

319

Rapid, single-molecule assays in nano/micro-fluidic chips with arrays of closely spaced parallel channels fabricated by femtosecond laser machining.  

PubMed

Cost-effective pharmaceutical drug discovery depends on increasing assay throughput while reducing reagent needs. To this end, we are developing an ultrasensitive, fluorescence-based platform that incorporates a nano/micro-fluidic chip with an array of closely spaced channels for parallelized optical readout of single-molecule assays. Here we describe the use of direct femtosecond laser machining to fabricate several hundred closely spaced channels on the surfaces of fused silica substrates. The channels are sealed by bonding to a microscope cover slip spin-coated with a thin film of poly(dimethylsiloxane). Single-molecule detection experiments are conducted using a custom-built, wide-field microscope. The array of channels is epi-illuminated by a line-generating red diode laser, resulting in a line focus just a few microns thick across a 500 micron field of view. A dilute aqueous solution of fluorescently labeled biomolecules is loaded into the device and fluorescence is detected with an electron-multiplying CCD camera, allowing acquisition rates up to 7 kHz for each microchannel. Matched digital filtering based on experimental parameters is used to perform an initial, rapid assessment of detected fluorescence. More detailed analysis is obtained through fluorescence correlation spectroscopy. Simulated fluorescence data is shown to agree well with experimental values. PMID:25140634

Canfield, Brian K; King, Jason K; Robinson, William N; Hofmeister, William H; Davis, Lloyd M

2014-01-01

320

Simulation of an array-based neural net model  

NASA Technical Reports Server (NTRS)

Research in cognitive science suggests that much of cognition involves the rapid manipulation of complex data structures. However, it is very unclear how this could be realized in neural networks or connectionist systems. A core question is: how could the interconnectivity of items in an abstract-level data structure be neurally encoded? The answer appeals mainly to positional relationships between activity patterns within neural arrays, rather than directly to neural connections in the traditional way. The new method was initially devised to account for abstract symbolic data structures, but it also supports cognitively useful spatial analogue, image-like representations. As the neural model is based on massive, uniform, parallel computations over 2D arrays, the massively parallel processor is a convenient tool for simulation work, although there are complications in using the machine to the fullest advantage. An MPP Pascal simulation program for a small pilot version of the model is running.

Barnden, John A.

1987-01-01

321

Integrated fuel processor development.  

SciTech Connect

The Department of Energy's Office of Advanced Automotive Technologies has been supporting the development of fuel-flexible fuel processors at Argonne National Laboratory. These fuel processors will enable fuel cell vehicles to operate on fuels available through the existing infrastructure. The constraints of on-board space and weight require that these fuel processors be designed to be compact and lightweight, while meeting the performance targets for efficiency and gas quality needed for the fuel cell. This paper discusses the performance of a prototype fuel processor that has been designed and fabricated to operate with liquid fuels, such as gasoline, ethanol, methanol, etc. Rated for a capacity of 10 kWe (one-fifth of that needed for a car), the prototype fuel processor integrates the unit operations (vaporization, heat exchange, etc.) and processes (reforming, water-gas shift, preferential oxidation reactions, etc.) necessary to produce the hydrogen-rich gas (reformate) that will fuel the polymer electrolyte fuel cell stacks. The fuel processor work is being complemented by analytical and fundamental research. With the ultimate objective of meeting on-board fuel processor goals, these studies include: modeling fuel cell systems to identify design and operating features; evaluating alternative fuel processing options; and developing appropriate catalysts and materials. Issues and outstanding challenges that need to be overcome in order to develop practical, on-board devices are discussed.

Ahmed, S.; Pereira, C.; Lee, S. H. D.; Krumpelt, M.

2001-12-04

322

National Resource for Computation in Chemistry (NRCC). Attached scientific processors for chemical computations: a report to the chemistry community  

SciTech Connect

The demands of chemists for computational resources are well known and have been amply documented. The best and most cost-effective means of providing these resources is still open to discussion, however. This report surveys the field of attached scientific processors (array processors) and attempts to indicate their present and possible future use in computational chemistry. Array processors have the possibility of providing very cost-effective computation. This report attempts to provide information that will assist chemists who might be considering the use of an array processor for their computations. It describes the general ideas and concepts involved in using array processors, the commercial products that are available, and the experiences reported by those currently using them. In surveying the field of array processors, the author makes certain recommendations regarding their use in computational chemistry. 5 figures, 1 table (RWR)

Ostlund, N.S.

1980-01-01

323

Parallelization of CFD codes  

NASA Astrophysics Data System (ADS)

The use of parallelization is examined for CFD computations such as 3D Navier-Stokes simulations of flows about aircraft for engineering purposes. References are made to fine-, medium-, and coarse-grain levels of parallelism, the use of artificial viscosity, and the use of explicit Runge-Kutta time integration. The inherent parallelism in CFD is examined with attention given to the use of patched multiblocks on shared-memory and local-memory MIMD machines. Medium-grain parallelism is effective for the shared-memory MIMDs when using a compiler directive that advances the equations in time after copying them onto several independent processors. Local-memory computers can be used to avoid the performance restrictions of memory access by using processors with built-in memories. The microblock concept is described, and some examples are given of decomposed domains, including a computational result for a simulation of the Euler equations.
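
The block-decomposition idea can be illustrated with a one-dimensional model problem: each block needs only one halo value from each neighbour per step, so blocks map naturally onto MIMD processors. The sketch below processes the blocks sequentially and is purely illustrative; it is not taken from the paper.

    import numpy as np

    def jacobi_step_blocked(u, n_blocks):
        """One explicit smoothing step of the 1-D Laplace equation, block by block.

        The halo values u[i0-1] and u[i1] would come from neighbouring blocks in a
        distributed setting; here everything lives in one array for illustration.
        """
        n = len(u)
        new = u.copy()
        edges = np.linspace(0, n, n_blocks + 1, dtype=int)
        for b in range(n_blocks):
            lo, hi = edges[b], edges[b + 1]
            i0, i1 = max(lo, 1), min(hi, n - 1)   # keep physical boundaries fixed
            new[i0:i1] = 0.5 * (u[i0 - 1:i1 - 1] + u[i0 + 1:i1 + 1])
        return new

    u = np.zeros(32)
    u[0], u[-1] = 1.0, 0.0                        # Dirichlet boundary values
    for _ in range(2000):
        u = jacobi_step_blocked(u, n_blocks=4)
    print(u[::8].round(3))                        # close to the linear steady state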

Bergman, C. M.; Vos, J. B.

1991-08-01

324

Stochastic propagation of an array of parallel cracks: Exploratory work on matrix fatigue damage in composite laminates  

SciTech Connect

Transverse cracking of polymeric matrix materials is an important fatigue damage mechanism in continuous-fiber composite laminates. The propagation of an array of these cracks is a stochastic problem usually treated by Monte Carlo methods. However, this exploratory work proposes an alternative approach wherein the Monte Carlo method is replaced by a more closed-form recursion relation based on "fractional Brownian motion." A fractal scaling equation is also proposed as a substitute for the more empirical Paris equation describing individual crack growth in this approach. Preliminary calculations indicate that the new recursion relation is capable of reproducing the primary features of transverse matrix fatigue cracking behavior. Although not yet fully tested or verified, this recursion relation may eventually be useful for real-time applications such as monitoring damage in aircraft structures.
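
For reference, the Paris equation mentioned above relates the crack growth per cycle to the stress-intensity range, da/dN = C(dK)^m with dK = Y*dsigma*sqrt(pi*a). The sketch below integrates it cycle by cycle with illustrative material constants that are assumptions, not values from the report.

    import math

    def paris_growth(a0_m, dsigma_mpa, C, m, cycles, Y=1.0):
        """Cycle-by-cycle integration of the Paris law da/dN = C*(dK)^m, with
        dK = Y*dsigma*sqrt(pi*a) in MPa*sqrt(m) and da/dN in m/cycle."""
        a = a0_m
        for _ in range(cycles):
            dK = Y * dsigma_mpa * math.sqrt(math.pi * a)   # stress-intensity range
            a += C * dK ** m                               # growth this cycle
        return a

    # toy numbers: 1 mm starter crack, 80 MPa stress range, C and m typical of metals
    print(paris_growth(1e-3, 80.0, C=1e-11, m=3.0, cycles=200000))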

Williford, R.E.

1989-09-01

325

Portable, Flexible, and Scalable Soft Vector Processors  

Microsoft Academic Search

Field-programmable gate arrays (FPGAs) are increasingly used to implement embedded digital systems; however, the hardware design necessary to do so is time-consuming and tedious. The amount of hardware design can be reduced by employing a microprocessor for less-critical computation in the system. Often this microprocessor is implemented using the FPGA reprogrammable fabric as a soft processor, which presently have simple

Peter Yiannacouras; J. Gregory Steffan; Jonathan Rose

2012-01-01

326

Fabrication and evaluation of a micro(bio)sensor array chip for multiple parallel measurements of important cell biomarkers.  

PubMed

This report describes the design and development of an integrated electrochemical cell culture monitoring system, based on enzyme-biosensors and chemical sensors, for monitoring indicators of mammalian cell metabolic status. MEMS technology was used to fabricate a microwell-format silicon platform including a thermometer, onto which chemical sensors (pH, O2) and screen-printed biosensors (glucose, lactate) were grafted/deposited. Microwells were formed over the fabricated sensors to give 5-well sensor strips which were interfaced with a multipotentiostat via a bespoke connector box interface. The operation of each sensor/biosensor type was examined individually, and examples of operating devices in five microwells in parallel, in either potentiometric (pH sensing) or amperometric (glucose biosensing) mode, are shown. The performance characteristics of the sensors/biosensors indicate that the system could readily be applied to cell culture/toxicity studies. PMID:25360580

Pemberton, Roy M; Cox, Timothy; Tuffin, Rachel; Drago, Guido A; Griffiths, John; Pittson, Robin; Johnson, Graham; Xu, Jinsheng; Sage, Ian C; Davies, Rhodri; Jackson, Simon K; Kenna, Gerry; Luxton, Richard; Hart, John P

2014-01-01

327

Fabrication and Evaluation of a Micro(Bio)Sensor Array Chip for Multiple Parallel Measurements of Important Cell Biomarkers  

PubMed Central

This report describes the design and development of an integrated electrochemical cell culture monitoring system, based on enzyme-biosensors and chemical sensors, for monitoring indicators of mammalian cell metabolic status. MEMS technology was used to fabricate a microwell-format silicon platform including a thermometer, onto which chemical sensors (pH, O2) and screen-printed biosensors (glucose, lactate) were grafted/deposited. Microwells were formed over the fabricated sensors to give 5-well sensor strips which were interfaced with a multipotentiostat via a bespoke connector box interface. The operation of each sensor/biosensor type was examined individually, and examples of operating devices in five microwells in parallel, in either potentiometric (pH sensing) or amperometric (glucose biosensing) mode, are shown. The performance characteristics of the sensors/biosensors indicate that the system could readily be applied to cell culture/toxicity studies. PMID:25360580

Pemberton, Roy M.; Cox, Timothy; Tuffin, Rachel; Drago, Guido A.; Griffiths, John; Pittson, Robin; Johnson, Graham; Xu, Jinsheng; Sage, Ian C.; Davies, Rhodri; Jackson, Simon K.; Kenna, Gerry; Luxton, Richard; Hart, John P.

2014-01-01

328

Multiple Embedded Processors for Fault-Tolerant Computing  

NASA Technical Reports Server (NTRS)

A fault-tolerant computer architecture has been conceived in an effort to reduce vulnerability to single-event upsets (spurious bit flips caused by impingement of energetic ionizing particles or photons). As in some prior fault-tolerant architectures, the redundancy needed for fault tolerance is obtained by use of multiple processors in one computer. Unlike prior architectures, the multiple processors are embedded in a single field-programmable gate array (FPGA). What makes this new approach practical is the recent commercial availability of FPGAs that are capable of having multiple embedded processors. A working prototype (see figure) consists of two embedded IBM PowerPC 405 processor cores and a comparator built on a Xilinx Virtex-II Pro FPGA. This relatively simple instantiation of the architecture implements an error-detection scheme. A planned future version, incorporating four processors and two comparators, would correct some errors in addition to detecting them.

Bolotin, Gary; Watson, Robert; Katanyoutanant, Sunant; Burke, Gary; Wang, Mandy

2005-01-01

329

Efficiency of parallel direct optimization  

NASA Technical Reports Server (NTRS)

Tremendous progress has been made at the level of sequential computation in phylogenetics. However, little attention has been paid to parallel computation. Parallel computing is particularly suited to phylogenetics because of the many ways large computational problems can be broken into parts that can be analyzed concurrently. In this paper, we investigate the scaling factors and efficiency of random addition and tree refinement strategies using the direct optimization software, POY, on a small (10 slave processors) and a large (256 slave processors) cluster of networked PCs running LINUX. These algorithms were tested on several data sets composed of DNA and morphology ranging from 40 to 500 taxa. Various algorithms in POY show fundamentally different properties within and between clusters. All algorithms are efficient on the small cluster for the 40-taxon data set. On the large cluster, multibuilding exhibits excellent parallel efficiency, whereas parallel building is inefficient. These results are independent of data set size. Branch swapping in parallel shows excellent speed-up for 16 slave processors on the large cluster. However, there is no appreciable speed-up for branch swapping with the further addition of slave processors (>16). This result is independent of data set size. Ratcheting in parallel is efficient with the addition of up to 32 processors in the large cluster. This result is independent of data set size. ©2001 The Willi Hennig Society.
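
The scaling measures quoted throughout are the usual ones: speedup T1/Tp and parallel efficiency speedup/p. A small worked example with hypothetical timings (not measurements from the paper) is shown below.

    def speedup_and_efficiency(t_serial, t_parallel, n_procs):
        """Standard scaling measures: speedup = T_1 / T_p, efficiency = speedup / p."""
        speedup = t_serial / t_parallel
        return speedup, speedup / n_procs

    # hypothetical timings: a search taking 960 s serially and 75 s on 16 slaves
    s, e = speedup_and_efficiency(960.0, 75.0, 16)
    print(f"speedup = {s:.1f}x, efficiency = {e:.0%}")   # 12.8x, 80%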

Janies, D. A.; Wheeler, W. C.

2001-01-01

330

Design and analysis of real-time wavefront processor  

Microsoft Academic Search

The latency of the wavefront processor is an important factor in closed-loop adaptive optical systems. For an adaptive optical system using Shack-Hartmann wavefront sensing and a point beam, a multi-processor structure based on modern parallelism theory is built, by way of task queues, subtask arithmetic decomposition and subtask structure design, to realize a pipeline of wavefront gradient computation, wavefront reconstruction and wavefront control.

Luchun Zhou; Chunhong Wang; Mei Li; Wenhan Jiang

2004-01-01

331

Novel real-time infrared image processor with ADSP  

NASA Astrophysics Data System (ADS)

Digital Signal Processing (DSP) processors are microprocessors designed to perform digital signal processing, the mathematical manipulation of digitally represented signals. In this paper, a novel Infrared Image Processor (IIP) and the pattern-processing algorithm that runs on it are designed with ADSP-TS201S DSP chips manufactured by ADI (Analog Devices Inc.). There are two signal channels within the IIP; the signals passing through the channels may or may not be the same, and in this case they are different. Four DSPs are used, combining parallel and serial processing. A distinguishing feature of the algorithm is its use of a neural network method.

Ge, Cheng-liang; Fan, Guo-bin; Liu, Zhi-qiang; Li, Zheng-dong; Wu, Jian-tao; Huang, Zhi-wei; Liang, Zheng

2005-06-01

332

Hybrid photomultiplier tube and photodiode parallel detection array for wideband optical spectroscopy of the breast guided by magnetic resonance imaging  

NASA Astrophysics Data System (ADS)

A new optical parallel detection system of hybrid frequency and continuous-wave domains was developed to improve the data quality and accuracy in recovery of all breast optical properties. This new system was deployed in a previously existing system for magnetic resonance imaging (MRI)-guided spectroscopy, and allows incorporation of additional near-infrared wavelengths beyond 850 nm, with interlaced channels of photomultiplier tubes (PMTs) and silicon photodiodes (PDs). The acquisition time for obtaining frequency-domain data at six wavelengths (660, 735, 785, 808, 826, and 849 nm) and continuous-wave data at three wavelengths (903, 912, and 948 nm) is 12 min. The dynamic ranges of the detected signal are 10⁵ and 10⁶ for the PMT and PD detectors, respectively. Compared to the previous detection system, the SNR of frequency-domain detection was improved by nearly 10³ through the addition of an RF amplifier and the utilization of programmable gain. The current system is being utilized in a clinical trial imaging suspected breast cancer tumors as detected by contrast MRI scans.

El-Ghussein, Fadi; Mastanduno, Michael A.; Jiang, Shudong; Pogue, Brian W.; Paulsen, Keith D.

2014-01-01

333

Hybrid photomultiplier tube and photodiode parallel detection array for wideband optical spectroscopy of the breast guided by magnetic resonance imaging  

PubMed Central

A new optical parallel detection system of hybrid frequency and continuous-wave domains was developed to improve the data quality and accuracy in recovery of all breast optical properties. This new system was deployed in a previously existing system for magnetic resonance imaging (MRI)-guided spectroscopy, and allows incorporation of additional near-infrared wavelengths beyond 850 nm, with interlaced channels of photomultiplier tubes (PMTs) and silicon photodiodes (PDs). The acquisition time for obtaining frequency-domain data at six wavelengths (660, 735, 785, 808, 826, and 849 nm) and continuous-wave data at three wavelengths (903, 912, and 948 nm) is 12 min. The dynamic ranges of the detected signal are 10⁵ and 10⁶ for the PMT and PD detectors, respectively. Compared to the previous detection system, the SNR of frequency-domain detection was improved by nearly 10³ through the addition of an RF amplifier and the utilization of programmable gain. The current system is being utilized in a clinical trial imaging suspected breast cancer tumors as detected by contrast MRI scans. PMID:23979460

El-Ghussein, Fadi; Mastanduno, Michael A.; Jiang, Shudong; Pogue, Brian W.; Paulsen, Keith D.

2013-01-01

334

Parallel VLSI architecture emulation and the organization of APSA/MPP  

NASA Technical Reports Server (NTRS)

The Applicative Programming System Architecture (APSA) combines an applicative language interpreter with a novel parallel computer architecture that is well suited for Very Large Scale Integration (VLSI) implementation. The Massively Parallel Processor (MPP) can simulate VLSI circuits by allocating one processing element in its square array to an area on a square VLSI chip. As long as there are not too many long data paths, the MPP can simulate a VLSI clock cycle very rapidly. The APSA circuit contains a binary tree with a few long paths and many short ones. A skewed H-tree layout allows every processing element to simulate a leaf cell and up to four tree nodes, with no loss in parallelism. Emulation of a key APSA algorithm on the MPP resulted in performance 16,000 times faster than a Vax. This speed will make it possible for the APSA language interpreter to run fast enough to support research in parallel list processing algorithms.

Odonnell, John T.

1987-01-01

335

Configurable Multi-Purpose Processor  

NASA Technical Reports Server (NTRS)

Advancements in technology have allowed the miniaturization of systems used in aerospace vehicles. This technology is driven by the need for next-generation systems that provide reliable, responsive, and cost-effective range operations while providing increased capabilities such as simultaneous mission support, increased launch trajectories, improved launch and landing opportunities, etc. Leveraging the newest technologies, the command and telemetry processor (CTP) concept provides for a compact, flexible, and integrated solution for flight command and telemetry systems and range systems. The CTP is a relatively small circuit board that serves as a processing platform for high dynamic, high vibration environments. The CTP can be reconfigured and reprogrammed, allowing it to be adapted for many different applications. The design is centered around a configurable field-programmable gate array (FPGA) device that contains numerous logic cells that can be used to implement traditional integrated circuits. The FPGA contains two PowerPC processors running the VxWorks real-time operating system, which are used to execute software programs specific to each application. The CTP was designed and developed specifically to provide telemetry functions; namely, the command processing, telemetry processing, and GPS metric tracking of a flight vehicle. However, it can be used as a general-purpose processor board to perform numerous functions implemented in either hardware or software using the FPGA's processors and/or logic cells. Functionally, the CTP was designed for range safety applications where it would ultimately become part of a vehicle's flight termination system. Consequently, the major functions of the CTP are to perform the forward link command processing, GPS metric tracking, return link telemetry data processing, error detection and correction, data encryption/decryption, and initiation of flight termination action commands. Also, the CTP had to be designed to survive and operate in a launch environment. Additionally, the CTP was designed to interface with the WFF (Wallops Flight Facility) custom-designed transceiver board, which is used in the Low Cost TDRSS Transceiver (LCT2) also developed by WFF. The LCT2's transceiver board demodulates commands received from the ground via the forward link and sends them to the CTP, where they are processed. The CTP inputs and processes data from the inertial measurement unit (IMU) and the GPS receiver board, generates status data, and then sends the data to the transceiver board where it is modulated and sent to the ground via the return link. Overall, the CTP has combined processing with the ability to interface to a GPS receiver, an IMU, and a pulse code modulation (PCM) communication link, while providing the capability to support common interfaces including Ethernet and serial interfaces, all aboard a relatively small, lightweight package.

Valencia, J. Emilio; Forney, Christopher; Morrison, Robert; Birr, Richard

2010-01-01

336

Parallel Video Surveillance on the Multi-core Cell Broadband Engine  

Microsoft Academic Search

The IBM Cell Broadband Engine (BE) is a multi-core processor with a PowerPC host processor (PPE) and 8 synergistic processor elements (SPEs). The Cell BE architecture is designed to improve upon conventional processors in terms of memory latency, bandwidth, and computational power. In this paper, we discuss the parallelization, implementation and performance of a video surveillance application on the IBM

Tamer F. Rabie; Hashir Karim Kidwai; Fadi N. Sibai

2009-01-01

337

Incremental Parallelization of Non-Data-Parallel Programs Using the Charon Message-Passing Library  

NASA Technical Reports Server (NTRS)

Message passing is among the most popular techniques for parallelizing scientific programs on distributed-memory architectures. The reasons for its success are wide availability (MPI), efficiency, and full tuning control provided to the programmer. A major drawback, however, is that incremental parallelization, as offered by compiler directives, is not generally possible, because all data structures have to be changed throughout the program simultaneously. Charon remedies this situation through mappings between distributed and non-distributed data. It allows breaking up the parallelization into small steps, guaranteeing correctness at every stage. Several tools are available to help convert legacy codes into high-performance message-passing programs. They usually target data-parallel applications, whose loops carrying most of the work can be distributed among all processors without much dependency analysis. Others do a full dependency analysis and then convert the code virtually automatically. Even more toolkits are available that aid construction from scratch of message passing programs. None, however, allows piecemeal translation of codes with complex data dependencies (i.e. non-data-parallel programs) into message passing codes. The Charon library (available in both C and Fortran) provides incremental parallelization capabilities by linking legacy code arrays with distributed arrays. During the conversion process, non-distributed and distributed arrays exist side by side, and simple mapping functions allow the programmer to switch between the two in any location in the program. Charon also provides wrapper functions that leave the structure of the legacy code intact, but that allow execution on truly distributed data. Finally, the library provides a rich set of communication functions that support virtually all patterns of remote data demands in realistic structured grid scientific programs, including transposition, nearest-neighbor communication, pipelining, gather/scatter, and redistribution. At the end of the conversion process most intermediate Charon function calls will have been removed, the non-distributed arrays will have been deleted, and virtually the only remaining Charon function calls are the high-level, highly optimized communications. Distribution of the data is under complete control of the programmer, although a wide range of useful distributions is easily available through predefined functions. A crucial aspect of the library is that it does not allocate space for distributed arrays, but accepts programmer-specified memory. This has two major consequences. First, codes parallelized using Charon do not suffer from encapsulation; user data is always directly accessible. This provides high efficiency, and also retains the possibility of using message passing directly for highly irregular communications. Second, non-distributed arrays can be interpreted as (trivial) distributions in the Charon sense, which allows them to be mapped to truly distributed arrays, and vice versa. This is the mechanism that enables incremental parallelization. In this paper we provide a brief introduction of the library and then focus on the actual steps in the parallelization process, using some representative examples from, among others, the NAS Parallel Benchmarks. We show how a complicated two-dimensional pipeline, the prototypical non-data-parallel algorithm, can be constructed with ease.
To demonstrate the flexibility of the library, we give examples of the stepwise, efficient parallel implementation of nonlocal boundary conditions common in aircraft simulations, as well as the construction of the sequence of grids required for multigrid.
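
As a rough illustration of the side-by-side arrangement described above, the sketch below keeps a legacy array and a distributed view of programmer-supplied memory in the same program and switches between them around one already-converted loop. The copy and gather steps stand in for Charon's mapping and wrapper functions; the names and structure here are illustrative only and are not the actual Charon API.

```c
/* Minimal sketch (hypothetical, not the Charon API) of incremental
 * parallelization: a legacy array and a distributed view of
 * programmer-supplied memory coexist, and the code switches between
 * them one loop at a time. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int N = 1024;                 /* global problem size           */
    int local_n = N / nprocs;           /* assume N divisible by nprocs  */
    int lo = rank * local_n;            /* first global index owned here */

    /* Legacy (non-distributed) array, still fully allocated everywhere. */
    double *legacy = malloc(N * sizeof(double));
    /* Programmer-supplied memory backing the distributed view. */
    double *dist = malloc(local_n * sizeof(double));

    for (int i = 0; i < N; i++) legacy[i] = (double)i;   /* legacy init */

    /* "Map" legacy -> distributed: each rank keeps only its own block. */
    for (int i = 0; i < local_n; i++) dist[i] = legacy[lo + i];

    /* One loop already converted: it works on distributed data only. */
    for (int i = 0; i < local_n; i++) dist[i] *= 2.0;

    /* "Map" distributed -> legacy so unconverted loops still see all data. */
    MPI_Allgather(dist, local_n, MPI_DOUBLE,
                  legacy, local_n, MPI_DOUBLE, MPI_COMM_WORLD);

    free(dist); free(legacy);
    MPI_Finalize();
    return 0;
}
```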

VanderWijngaart, Rob F.

2000-01-01

338

Extending Automatic Parallelization to Optimize High-Level Abstractions for Multicore  

SciTech Connect

Automatic introduction of OpenMP for sequential applications has attracted significant attention recently because of the proliferation of multicore processors and the simplicity of using OpenMP to express parallelism for shared-memory systems. However, most previous research has only focused on C and Fortran applications operating on primitive data types. C++ applications using high-level abstractions, such as STL containers and complex user-defined types, are largely ignored due to the lack of research compilers that are readily able to recognize high-level object-oriented abstractions and leverage their associated semantics. In this paper, we automatically parallelize C++ applications using ROSE, a multiple-language source-to-source compiler infrastructure which preserves the high-level abstractions and gives us access to their semantics. Several representative parallelization candidate kernels are used to explore semantic-aware parallelization strategies for high-level abstractions, combined with extended compiler analyses. Those kernels include an array-based computation loop, a loop with task-level parallelism, and a domain-specific tree traversal. Our work extends the applicability of automatic parallelization to modern applications using high-level abstractions and exposes more opportunities to take advantage of multicore processors.
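
The array-based computation loop mentioned above is the simplest of the candidate kernels; the sketch below shows the kind of OpenMP annotation such a parallelizer targets for it. This is an illustrative hand-written example, not actual ROSE output.

```c
/* A loop with no loop-carried dependence on y[i]: a parallelizer can
 * legally share its iterations among threads with a single pragma. */
#include <omp.h>

void axpy(int n, double a, const double *x, double *y)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```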

Liao, C; Quinlan, D J; Willcock, J J; Panas, T

2008-12-12

339

QSpike tools: a generic framework for parallel batch preprocessing of extracellular neuronal signals recorded by substrate microelectrode arrays  

PubMed Central

Micro-Electrode Arrays (MEAs) have emerged as a mature technique to investigate brain (dys)functions in vivo and in in vitro animal models. Often referred to as “smart” Petri dishes, MEAs have demonstrated a great potential particularly for medium-throughput studies in vitro, both in academic and pharmaceutical industrial contexts. Enabling rapid comparison of ionic/pharmacological/genetic manipulations with control conditions, MEAs are employed to screen compounds by monitoring non-invasively the spontaneous and evoked neuronal electrical activity in longitudinal studies, with relatively inexpensive equipment. However, in order to acquire sufficient statistical significance, recordings last up to tens of minutes and generate a large amount of raw data (e.g., 60 channels/MEA, 16 bits A/D conversion, 20 kHz sampling rate: approximately 8 GB/MEA/h uncompressed). Thus, when the experimental conditions to be tested are numerous, the availability of fast, standardized, and automated signal preprocessing becomes pivotal for any subsequent analysis and data archiving. To this aim, we developed an in-house cloud-computing system, named QSpike Tools, where CPU-intensive operations, required for preprocessing of each recorded channel (e.g., filtering, multi-unit activity detection, spike-sorting, etc.), are decomposed and batch-queued to a multi-core architecture or to a computer cluster. With the commercial availability of new and inexpensive high-density MEAs, we believe that disseminating QSpike Tools might facilitate its wide adoption and customization, and inspire the creation of community-supported cloud-computing facilities for MEAs users. PMID:24678297
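
The decomposition described above works because each channel can be preprocessed independently. The sketch below shows that per-channel parallelism with OpenMP on a single multi-core machine; the function is a placeholder, and the actual QSpike Tools dispatches channels through a batch queue over a cluster rather than OpenMP.

```c
/* Per-channel decomposition: every MEA channel is filtered, spike-
 * detected, and sorted independently, so channels can be farmed out
 * to cores (or, in the real system, to cluster queue slots). */
#include <omp.h>
#include <stdio.h>

#define N_CHANNELS 60

/* placeholder for the per-channel pipeline (filter, detect, sort) */
static void preprocess_channel(int ch)
{
    printf("channel %d preprocessed by thread %d\n",
           ch, omp_get_thread_num());
}

int main(void)
{
    /* dynamic scheduling: channels with more activity take longer */
    #pragma omp parallel for schedule(dynamic)
    for (int ch = 0; ch < N_CHANNELS; ch++)
        preprocess_channel(ch);
    return 0;
}
```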

Mahmud, Mufti; Pulizzi, Rocco; Vasilaki, Eleni; Giugliano, Michele

2014-01-01

340

QSpike tools: a generic framework for parallel batch preprocessing of extracellular neuronal signals recorded by substrate microelectrode arrays.  

PubMed

Micro-Electrode Arrays (MEAs) have emerged as a mature technique to investigate brain (dys)functions in vivo and in in vitro animal models. Often referred to as "smart" Petri dishes, MEAs have demonstrated a great potential particularly for medium-throughput studies in vitro, both in academic and pharmaceutical industrial contexts. Enabling rapid comparison of ionic/pharmacological/genetic manipulations with control conditions, MEAs are employed to screen compounds by monitoring non-invasively the spontaneous and evoked neuronal electrical activity in longitudinal studies, with relatively inexpensive equipment. However, in order to acquire sufficient statistical significance, recordings last up to tens of minutes and generate a large amount of raw data (e.g., 60 channels/MEA, 16 bits A/D conversion, 20 kHz sampling rate: approximately 8 GB/MEA/h uncompressed). Thus, when the experimental conditions to be tested are numerous, the availability of fast, standardized, and automated signal preprocessing becomes pivotal for any subsequent analysis and data archiving. To this aim, we developed an in-house cloud-computing system, named QSpike Tools, where CPU-intensive operations, required for preprocessing of each recorded channel (e.g., filtering, multi-unit activity detection, spike-sorting, etc.), are decomposed and batch-queued to a multi-core architecture or to a computer cluster. With the commercial availability of new and inexpensive high-density MEAs, we believe that disseminating QSpike Tools might facilitate its wide adoption and customization, and inspire the creation of community-supported cloud-computing facilities for MEAs users. PMID:24678297

Mahmud, Mufti; Pulizzi, Rocco; Vasilaki, Eleni; Giugliano, Michele

2014-01-01

341

Soft-core processor study for node-based architectures.  

SciTech Connect

Node-based architecture (NBA) designs for future satellite projects hold the promise of decreasing system development time and costs, size, weight, and power and positioning the laboratory to address other emerging mission opportunities quickly. Reconfigurable Field Programmable Gate Array (FPGA) based modules will comprise the core of several of the NBA nodes. Microprocessing capabilities will be necessary with varying degrees of mission-specific performance requirements on these nodes. To enable the flexibility of these reconfigurable nodes, it is advantageous to incorporate the microprocessor into the FPGA itself, either as a hard-core processor built into the FPGA or as a soft-core processor built out of FPGA elements. This document describes the evaluation of three reconfigurable FPGA-based processors for use in future NBA systems--two soft cores (MicroBlaze and non-fault-tolerant LEON) and one hard core (PowerPC 405). Two standard performance benchmark applications were developed for each processor. The first, Dhrystone, is a fixed-point operation metric. The second, Whetstone, is a floating-point operation metric. Several trials were run at varying code locations, loop counts, processor speeds, and cache configurations. FPGA resource utilization was recorded for each configuration. Cache configurations impacted the results greatly; for optimal processor efficiency it is necessary to enable caches on the processors. Processor caches carry a penalty; cache error mitigation is necessary when operating in a radiation environment.

Van Houten, Jonathan Roger; Jarosz, Jason P.; Welch, Benjamin James; Gallegos, Daniel E.; Learn, Mark Walter

2008-09-01

342

Data Parallel Switch-Level Simulation  

E-print Network

Data parallel simulation involves simulating the behavior of a circuit over runs on a massively parallel SIMD machine, with each processor simulating the circuit behavior … parallelism in simulation utilize circuit parallelism. In this mode, the simulator extracts parallelism from

Bryant, Randal E.

343

Algorithmic commonalities in the parallel environment  

NASA Technical Reports Server (NTRS)

The ultimate aim of this project was to analyze procedures from substantially different application areas to discover what is either common or peculiar in the process of conversion to the Massively Parallel Processor (MPP). Three areas were identified: molecular dynamic simulation, production systems (rule systems), and various graphics and vision algorithms. To date, only selected graphics procedures have been investigated. They are the most readily available, and produce the most visible results. These include simple polygon patch rendering, raycasting against a constructive solid geometric model, and stochastic or fractal based textured surface algorithms. Only the simplest of conversion strategies, mapping a major loop to the array, has been investigated so far. It is not entirely satisfactory.

Mcanulty, Michael A.; Wainer, Michael S.

1987-01-01

344

Is Monte Carlo embarrassingly parallel?  

SciTech Connect

Monte Carlo is often stated as being embarrassingly parallel. However, running a Monte Carlo calculation, especially a reactor criticality calculation, in parallel using tens of processors shows a serious limitation in speedup and the execution time may even increase beyond a certain number of processors. In this paper the main causes of the loss of efficiency when using many processors are analyzed using a simple Monte Carlo program for criticality. The basic mechanism for parallel execution is MPI. One of the bottlenecks turns out to be the rendezvous points in the parallel calculation used for synchronization and exchange of data between processors. This happens at least at the end of each cycle for fission source generation in order to collect the full fission source distribution for the next cycle and to estimate the effective multiplication factor, which is not only part of the requested results, but also input to the next cycle for population control. Basic improvements to overcome this limitation are suggested and tested. Also other time losses in the parallel calculation are identified. Moreover, the threading mechanism, which allows the parallel execution of tasks based on shared memory using OpenMP, is analyzed in detail. Recommendations are given to get the maximum efficiency out of a parallel Monte Carlo calculation. (authors)
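
The cycle-end rendezvous described above can be pictured as a collective reduction that every rank must reach before the next fission-source cycle starts. The fragment below sketches that synchronization point with MPI; it is illustrative only, not the program analyzed in the paper.

```c
/* End-of-cycle rendezvous: all ranks block here every cycle, so any
 * load imbalance or reduction cost limits parallel efficiency. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n_cycles = 100;
    double local_tally = 0.0, global_keff = 0.0;

    for (int cycle = 0; cycle < n_cycles; cycle++) {
        /* ... transport this rank's share of source neutrons ... */
        local_tally = 1.0;                       /* placeholder tally */

        /* Rendezvous: collect the fission source statistics and the
         * k_eff estimate needed for population control next cycle.  */
        MPI_Allreduce(&local_tally, &global_keff, 1, MPI_DOUBLE,
                      MPI_SUM, MPI_COMM_WORLD);
    }

    if (rank == 0) printf("k_eff tally (placeholder): %f\n", global_keff);
    MPI_Finalize();
    return 0;
}
```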

Hoogenboom, J. E. [Delft Univ. of Technology, Mekelweg 15, 2629 JB Delft (Netherlands); Delft Nuclear Consultancy, IJsselzoom 2, 2902 LB Capelle aan den IJssel (Netherlands)

2012-07-01

345

Wavelength-encoded OCDMA system using opto-VLSI processors.  

PubMed

We propose and experimentally demonstrate a 2.5 Gbit/s per user wavelength-encoded optical code-division multiple-access encoder-decoder structure based on opto-VLSI processing. Each encoder and decoder is constructed using a single 1D opto-very-large-scale-integrated (VLSI) processor in conjunction with a fiber Bragg grating (FBG) array of different Bragg wavelengths. The FBG array spectrally and temporally slices the broadband input pulse into several components and the opto-VLSI processor generates codewords using digital phase holograms. System performance is measured in terms of the autocorrelation and cross-correlation functions as well as the eye diagram. PMID:17603568

Aljada, Muhsen; Alameh, Kamal

2007-07-01

346

NWChem: scalable parallel computational chemistry  

SciTech Connect

NWChem is a general purpose computational chemistry code specifically designed to run on distributed memory parallel computers. The core functionality of the code focuses on molecular dynamics, Hartree-Fock and density functional theory methods for both plane-wave basis sets as well as Gaussian basis sets, tensor contraction engine based coupled cluster capabilities and combined quantum mechanics/molecular mechanics descriptions. It was realized from the beginning that scalable implementations of these methods required a programming paradigm inherently different from what message passing approaches could offer. In response, a global address space library, the Global Array Toolkit, was developed. The programming model it offers is based on using predominantly one-sided communication. This model underpins most of the functionality in NWChem and the power of it is exemplified by the fact that the code scales to tens of thousands of processors. In this paper the core capabilities of NWChem are described as well as their implementation to achieve an efficient computational chemistry code with high parallel scalability. NWChem is a modern, open source, computational chemistry code specifically designed for large-scale parallel applications. To meet the challenges of developing efficient, scalable and portable programs of this nature a particular code design was adopted. This code design involved two main features. First of all, the code is built up in a modular fashion so that a large variety of functionality can be integrated easily. Secondly, to facilitate writing complex parallel algorithms the Global Array toolkit was developed. This toolkit allows one to write parallel applications in a shared-memory-like approach, but offers additional mechanisms to exploit data locality to lower communication overheads. This framework has proven to be very successful in computational chemistry but is applicable to any engineering domain. Within the context created by the features above NWChem has grown into a general purpose computational chemistry code that supports a wide variety of energy expressions and capabilities to calculate properties based thereupon. The main energy expressions are classical mechanics force fields, Hartree-Fock and DFT both for finite systems and condensed phase systems, coupled cluster, as well as QM/MM. For most energy expressions single point calculations, geometry optimizations, excited states, and other properties are available. Below we briefly discuss each of the main energy expressions and the critical points involved in scalable implementations thereof.
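
The one-sided style that the Global Array Toolkit is built on can be illustrated with standard MPI remote memory access: one rank reads a patch of another rank's data without the owner taking part in the transfer. The sketch below uses plain MPI RMA to show the model, not the GA API itself.

```c
/* One-sided read of a remote block, in the spirit of the global
 * address space model described above (MPI RMA, not the GA API). */
#include <mpi.h>
#include <stdio.h>

#define LOCAL 4   /* words of the "global" array owned by each rank */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double block[LOCAL];
    for (int i = 0; i < LOCAL; i++) block[i] = rank * 100.0 + i;

    /* Expose each rank's block as a window of the distributed array. */
    MPI_Win win;
    MPI_Win_create(block, LOCAL * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    double remote[LOCAL];
    int target = (rank + 1) % nprocs;

    MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
    MPI_Get(remote, LOCAL, MPI_DOUBLE, target, 0, LOCAL, MPI_DOUBLE, win);
    MPI_Win_unlock(target, win);      /* completes the one-sided read */

    printf("rank %d read %g from rank %d\n", rank, remote[0], target);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```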

van Dam, Hubertus JJ; De Jong, Wibe A.; Bylaska, Eric J.; Govind, Niranjan; Kowalski, Karol; Straatsma, TP; Valiev, Marat

2011-11-01

347

Highly parallel computer architecture for robotic computation  

NASA Technical Reports Server (NTRS)

In a computer having a large number of single instruction multiple data (SIMD) processors, each of the SIMD processors has two sets of three individual processor elements controlled by a master control unit and interconnected among a plurality of register file units where data is stored. The register files input and output data in synchronism with a minor cycle clock under control of two slave control units controlling the register file units connected to respective ones of the two sets of processor elements. Depending upon which ones of the register file units are enabled to store or transmit data during a particular minor clock cycle, the processor elements within an SIMD processor are connected in rings or in pipeline arrays, and may exchange data with the internal bus or with neighboring SIMD processors through interface units controlled by respective ones of the two slave control units.

Fijany, Amir (inventor); Bejczy, Anta K. (inventor)

1991-01-01

348

CFD on parallel computers  

NASA Astrophysics Data System (ADS)

CFD or Computational Fluid Dynamics is one of the scientific disciplines that has always posed new challenges to the capabilities of the modern, ultra-fast supercomputers, and now to the even faster parallel computers. For applications where number crunching is of primary importance, there is perhaps no escaping parallel computers since sequential computers can only be (as projected) as fast as a few gigaflops and no more, unless, of course, some altogether new technology appears in the future. For parallel computers, on the other hand, there is no such limit since any number of processors can be made to work in parallel. Computationally demanding CFD codes and parallel computers are therefore soul-mates, and will remain so for all foreseeable future. So much so that there is a separate and fast-emerging discipline that tackles problems specific to CFD as applied to parallel computers. For some years now, there has been an international conference on parallel CFD. So, one can indeed say that parallel CFD has arrived. To understand how CFD codes are parallelized, one must understand a little about how parallel computers function. Therefore, in what follows we will first deal with parallel computers, then with what a typical CFD code (if there is one such) looks like, and then with the strategies of parallelization.

Basu, A. J.

1994-10-01

349

Problem size, parallel architecture and optimal speedup  

NASA Technical Reports Server (NTRS)

The communication and synchronization overhead inherent in parallel processing can lead to situations where adding processors to the solution method actually increases execution time. Problem type, problem size, and architecture type all affect the optimal number of processors to employ. The numerical solution of an elliptic partial differential equation is examined in order to study the relationship between problem size and architecture. The equation's domain is discretized into n^2 grid points which are divided into partitions and mapped onto the individual processor memories. The relationships between grid size, stencil type, partitioning strategy, processor execution time, and communication network type are analytically quantified. In so doing, the optimal number of processors to assign to the solution was determined, and the analysis identified (1) the smallest grid size which fully benefits from using all available processors, (2) the leverage on performance given by increasing processor speed or communication network speed, and (3) the suitability of various architectures for large numerical problems.
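
A toy version of the trade-off quantified above is easy to write down: per-iteration time is a compute term that shrinks with the processor count plus a boundary-exchange term that does not, so the total has a minimum at some finite processor count. The constants and the strip-partition communication model below are illustrative, not the paper's.

```c
/* Execution time per iteration for a 1-D strip partition of an n x n
 * grid with a 5-point stencil: compute shrinks with p, boundary
 * exchange does not, so T(p) has a minimum. */
#include <stdio.h>

int main(void)
{
    const double n = 1000.0;        /* grid is n x n points            */
    const double t_calc = 1e-7;     /* seconds per grid-point update   */
    const double t_word = 1e-6;     /* seconds per word communicated   */
    const double t_lat  = 1e-4;     /* per-message startup latency     */

    double best_t = 1e30;
    int best_p = 1;
    for (int p = 1; p <= 1024; p *= 2) {
        double comp = (n * n / p) * t_calc;              /* local work */
        double comm = (p > 1) ? 2.0 * (t_lat + n * t_word) : 0.0;
        double t = comp + comm;
        if (t < best_t) { best_t = t; best_p = p; }
        printf("p=%4d  T=%.3e s\n", p, t);
    }
    printf("optimal processor count under this model: %d\n", best_p);
    return 0;
}
```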

Nicol, David M.; Willard, Frank H.

1987-01-01

350

Kismet: parallel speedup estimates for serial programs  

Microsoft Academic Search

Software engineers now face the difficult task of refactoring serial programs for parallel execution on multicore processors. Currently, they are offered little guidance as to how much benefit may come from this task, or how close they are to the best possible parallelization. This paper presents Kismet, a tool that creates parallel speedup estimates for unparallelized serial programs. Kismet differs

Donghwan Jeon; Saturnino Garcia; Chris Louie; Michael Bedford Taylor

2011-01-01

351

MAPS: multi-algorithm parallel circuit simulation  

Microsoft Academic Search

The emergence of multi-core and many-core processors has introduced new opportunities and challenges to EDA research and development. While the availability of increasing parallel computing power holds new promise to address many computing challenges in CAD, the leverage of hardware parallelism can only be possible with a new generation of parallel CAD applications. In this paper, we propose a novel

Xiaoji Ye; Wei Dong; Peng Li; Sani R. Nassif

2008-01-01

352

Parallel Modem Architectures for High-Data-Rate Space Modems  

NASA Astrophysics Data System (ADS)

Existing software-defined radios (SDRs) for space are limited in data volume by several factors, including bandwidth, space-qualified analog-to-digital converter (ADC) technology, and processor throughput, e.g., the throughput of a space-qualified field-programmable gate array (FPGA). In an attempt to further improve the throughput of space-based SDRs and to fully exploit the newer and more capable space-qualified technology (ADCs, FPGAs), we are evaluating parallel transmitter/receiver architectures for space SDRs. These architectures would improve data volume for both deep-space and particularly proximity (e.g., relay) links. In this article, designs for FPGA implementation of a high-rate parallel modem are presented as well as both fixed- and floating-point simulated performance results based on a functional design that is suitable for FPGA implementation.

Satorius, E.

2014-08-01

353

Optimizing Vector-Quantization Processor Architecture for Intelligent Query-Search Applications  

NASA Astrophysics Data System (ADS)

The architecture of a very large scale integration (VLSI) vector-quantization processor (VQP) has been optimized to develop a general-purpose intelligent query-search agent. The agent performs a similarity-based search in a large-volume database. Although similarity-based search processing is computationally very expensive, latency-free searches have become possible due to the highly parallel maximum-likelihood search architecture of the VQP chip. Three architectures of the VQP chip have been studied and their performances are compared. In order to give reasonable searching results according to the different policies, the concept of a penalty function has been introduced into the VQP. An E-commerce real-estate agency system has been developed using the VQP chip implemented in a field-programmable gate array (FPGA) and the effectiveness of such an agency system has been demonstrated.
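
The search the VQP accelerates is, at its core, a minimum-score scan over a codebook, with the penalty term adjusting scores according to the search policy. The reference version below is sequential and uses placeholder sizes; the chip evaluates the comparisons in parallel.

```c
/* Reference (sequential) version of the similarity search: return the
 * codeword index with the smallest distance-plus-penalty score. */
#include <float.h>

#define DIM   16     /* placeholder vector dimension  */
#define NCODE 256    /* placeholder codebook size     */

int best_match(const double query[DIM],
               const double codebook[NCODE][DIM],
               const double penalty[NCODE])
{
    int best = -1;
    double best_score = DBL_MAX;
    for (int c = 0; c < NCODE; c++) {
        double d = 0.0;
        for (int k = 0; k < DIM; k++) {           /* squared distance */
            double diff = query[k] - codebook[c][k];
            d += diff * diff;
        }
        double score = d + penalty[c];            /* policy adjustment */
        if (score < best_score) { best_score = score; best = c; }
    }
    return best;
}
```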

Xu, Huaiyu; Mita, Yoshio; Shibata, Tadashi

2002-04-01

354

Parallel symmetry-breaking in sparse graphs  

Microsoft Academic Search

We describe efficient deterministic techniques for breaking symmetry in parallel. The techniques work well on rooted trees and graphs of constant degree or genus. Our primary technique allows us to 3-color a rooted tree in O(lg* n) time on an EREW PRAM using a linear number of processors. We apply these techniques to construct fast linear processor algorithms for several problems,
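
The O(lg* n) bound comes from repeatedly shrinking the colour space with a deterministic bit-comparison step. The sketch below runs one such reduction round sequentially; on an EREW PRAM each vertex would execute its body simultaneously on its own processor, and a full implementation also repairs any clash a recoloured root could cause with its children.

```c
/* One round of deterministic "coin tossing" colour reduction on a
 * rooted tree: each vertex takes the lowest bit position where its
 * colour differs from its parent's and forms a much smaller colour. */
#include <stdio.h>

#define N 8

int main(void)
{
    /* parent[v] == v marks the root; initial colours are vertex IDs */
    int parent[N] = {0, 0, 0, 1, 1, 2, 2, 5};
    unsigned color[N], next[N];
    for (int v = 0; v < N; v++) color[v] = (unsigned)v;

    for (int v = 0; v < N; v++) {
        if (parent[v] == v) {                 /* root: keep its low bit */
            next[v] = color[v] & 1u;
            continue;
        }
        unsigned diff = color[v] ^ color[parent[v]];
        int i = 0;
        while (((diff >> i) & 1u) == 0) i++;  /* lowest differing bit   */
        next[v] = 2u * (unsigned)i + ((color[v] >> i) & 1u);
    }

    for (int v = 0; v < N; v++)
        printf("vertex %d: colour %u -> %u\n", v, color[v], next[v]);
    return 0;
}
```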

Andrew V. Goldberg; Serge A. Plotkint; Gregory E. Shannon

1987-01-01

355

Processor Allocation on Cplant: Achieving General Processor Locality Using One-Dimensional Allocation Strategies  

SciTech Connect

The Computational Plant, or Cplant, is a commodity-based supercomputer under development at Sandia National Laboratories. This paper describes resource-allocation strategies to achieve processor locality for parallel jobs in Cplant and other supercomputers. Users of Cplant and other Sandia supercomputers submit parallel jobs to a job queue. When a job is scheduled to run, it is assigned to a set of processors. To obtain maximum throughput, jobs should be allocated to localized clusters of processors to minimize communication costs and to avoid bandwidth contention caused by overlapping jobs. This paper introduces new allocation strategies and performance metrics based on space-filling curves and one-dimensional allocation strategies. These algorithms are general and simple. Preliminary simulations and Cplant experiments indicate that both space-filling curves and one-dimensional packing improve processor locality compared to the sorted free list strategy previously used on Cplant. These new allocation strategies are implemented in the new release of the Cplant System Software, Version 2.0, phased into the Cplant systems at Sandia by May 2002.
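
One way to picture the space-filling-curve strategy is the Morton (Z-order) curve, which linearizes a mesh so that processors close together on the curve are usually close together in the mesh; a one-dimensional allocator can then hand each job a contiguous run of that ordering. The sketch below is illustrative only (the Morton curve is one common choice; the paper evaluates its own curves and metrics).

```c
/* Morton (Z-order) index of a 2-D mesh coordinate: a locality-
 * preserving linearization that a 1-D allocator can work along. */
#include <stdio.h>

static unsigned morton2d(unsigned x, unsigned y)
{
    unsigned m = 0;
    for (int b = 0; b < 16; b++)               /* interleave the bits */
        m |= ((x >> b) & 1u) << (2 * b) | ((y >> b) & 1u) << (2 * b + 1);
    return m;
}

int main(void)
{
    /* 4 x 4 mesh: print each node's position along the curve. */
    for (unsigned y = 0; y < 4; y++) {
        for (unsigned x = 0; x < 4; x++)
            printf("%3u", morton2d(x, y));
        printf("\n");
    }
    /* An allocator would keep the free list sorted by this index and
     * assign each job a contiguous block of it. */
    return 0;
}
```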

LEUNG,VITUS J.; ARKIN,ESTHER M.; BENDER,MICHAEL A.; BUNDE,DAVID; JOHNSTON,JEANETTE R.; LAL,ALOK; MITCHELL,JOSEPH S.B.; PHILLIPS,CYNTHIA; SEIDEN,STEVEN S.

2002-07-01

356

Nearest-Neighbor Mapping of Finite Element Graphs onto Processor Meshes  

Microsoft Academic Search

The processor allocation problem is addressed in the context of the parallelization of a finite element modeling program on a processor mesh. A heuristic two-step, graph-based mapping scheme with polynomial-time complexity is developed: 1) initial generation of a graph partition for nearest-neighbor mapping of the finite element graph onto the processor graph, and, 2) a heuristic boundary refinement procedure to

Ponnuswamy Sadayappan; Fikret Erçal

1987-01-01

357

Energy estimation and optimization of embedded VLIW processors based on instruction clustering  

Microsoft Academic Search

The aim of this paper is to propose a methodology for the definition of an instruction-level energy estimation framework for VLIW (Very Long Instruction Word) processors. The power modeling methodology is the key issue to define an effective energy-aware software optimisation strategy for state-of-the-art ILP (Instruction Level Parallelism) processors. The methodology is based on an energy model for VLIW processors that

Andrea Bona; Mariagiovanna Sami; Donatella Sciuto; Vittorio Zaccaria; Cristina Silvano; Roberto Zafalon

2002-01-01

358

Simulation and Test of an Optical Matrix-Vector Processor.  

NASA Astrophysics Data System (ADS)

This dissertation describes research in the computer simulation and the experimental laboratory evaluation of optical matrix-vector (linear algebra) processors. A single optical linear algebraic processing architecture is used for both the simulations and the laboratory implementation. The case study solved by the processor is a linear dynamic structural analysis finite element problem. The response of a plane frame structure under earthquake loading is investigated. The laboratory optical processor utilizes a new AC-coupled modulation technique which eliminates thermal problems discovered in previous laboratory work. The processor uses laser diodes, a multi-channel acousto-optic Bragg cell, and a multi-channel linear detector array and wide-band detector amplifiers. Simplified optical processor error source models are developed to simulate the optical processor. The error model simplifications ease the computational requirements and reduce the complexity of the simulations. The error source levels are determined for the laboratory optical processor, which is used to verify the validity of the error source models. The case study is run on the laboratory optical processor and its operation is evaluated. We find that the AC-coupled modulation technique is extremely useful for eliminating detector thermal effects. The case study is solved successfully on the optical system. Laboratory optical processor experiments and measurements verify that the error source model simulator accurately predicts the performance of the laboratory processor. Laboratory and simulation results are analyzed, and various critical processor fabrication issues are then detailed. Extensions of the laboratory system to larger size are discussed, and comments on potential improvements with the latest technology are advanced.

Taylor, Bradley Keith

1988-12-01

359

Graphite: A Distributed Parallel Simulator for Multicores  

E-print Network

This paper introduces the open-source Graphite distributed parallel multicore simulator infrastructure. Graphite is designed from the ground up for exploration of future multicore processors containing dozens, hundreds, ...

Beckmann, Nathan

2009-11-09

360

Interactive Digital Signal Processor  

NASA Technical Reports Server (NTRS)

Interactive Digital Signal Processor, IDSP, consists of a set of time series analysis "operators" based on various algorithms commonly used for digital signal analysis. Processing of a digital signal time series to extract information is usually achieved by application of a number of fairly standard operations. IDSP is an excellent teaching tool for demonstrating the application of time series operators to artificially generated signals.

Mish, W. H.

1985-01-01

361

The reconfigurable arithmetic processor  

Microsoft Academic Search

The Reconfigurable Arithmetic Processor (RAP) is an arithmetic processing node for a message-passing, MIMD concurrent computer. It incorporates on one chip several serial, 64 bit floating point arithmetic units connected by a switching network. By sequencing the switch through different patterns, the RAP chip calculates complete arithmetic formulas. By chaining together its arithmetic units the RAP reduces the amount of

Stuart Fiske; William J. Dally

1988-01-01

362

Quantum processors and controllers  

Microsoft Academic Search

In this paper is presented an abstract theory of quantum processors and controllers, a special kind of quantum computational network defined on a composite quantum system with two parts: the controlling and controlled subsystems. Such an approach formally differs from consideration of quantum control as some external influence on a system using some set of Hamiltonians or quantum gates. The model of

Alexander Yu. Vlasov

2003-01-01

363

Quantum Processors and Controllers  

Microsoft Academic Search

In this paper is presented an abstract theory of quantum processors and controllers, a special kind of quantum computational network defined on a composite quantum system with two parts: the controlling and controlled subsystems. Such an approach formally differs from consideration of quantum control as some external influence on a system using some set of Hamiltonians or quantum gates.

Alexander Yu. Vlasov

364

NAS Parallel Benchmarks Results  

NASA Technical Reports Server (NTRS)

The NAS Parallel Benchmarks (NPB) were developed in 1991 at NASA Ames Research Center to study the performance of parallel supercomputers. The eight benchmark problems are specified in a pencil-and-paper fashion, i.e., the complete details of the problem to be solved are given in a technical document, and except for a few restrictions, benchmarkers are free to select the language constructs and implementation techniques best suited for a particular system. In this paper, we present new NPB performance results for the following systems: (a) Parallel-Vector Processors: Cray C90, Cray T90 and Fujitsu VPP500; (b) Highly Parallel Processors: Cray T3D, IBM SP2 and IBM SP-TN2 (Thin Nodes 2); (c) Symmetric Multiprocessing Processors: Convex Exemplar SPP1000, Cray J90, DEC Alpha Server 8400 5/300, and SGI Power Challenge XL. We also present sustained performance per dollar for the Class B LU, SP and BT benchmarks, and mention NAS's future plans for the NPB.

Subhash, Saini; Bailey, David H.; Lasinski, T. A. (Technical Monitor)

1995-01-01

365

Parallel logic simulation on general purpose machines  

Microsoft Academic Search

Three parallel algorithms for logic simulation have been developed and implemented on a general purpose shared-memory parallel machine. The first algorithm is a synchronous version of a traditional event-driven algorithm which achieves speed-ups of 6 to 9 with 15 processors. The second algorithm is a synchronous unit-delay compiled mode algorithm which achieves speed-ups of 10 to 13 with 15 processors.

Larry Soulé; Tom Blank

1988-01-01

366

First-level trigger processor for the ZEUS calorimeter  

SciTech Connect

The design of the first-level trigger processor for the Zeus calorimeter is discussed. This processor accepts data from the 13,000 photomultipliers of the calorimeter, which is topologically divided into 16 regions, and after regional preprocessing performs logical and numerical operations that cross regional boundaries. Because the crossing period at the HERA collider is 96 ns, it is necessary that first-level trigger decisions be made in pipelined hardware. One microsecond is allowed for the processor to perform the required logical and numerical operations, during which time the data from ten crossings would be resident in the processor while being clocked through the pipelined hardware. The circuitry is implemented in 100K emitter-coupled logic (ECL), advanced CMOS discrete devices and programmable gate arrays, and operates in a VME environment. All tables and registers are written/read from VME, and all diagnostic codes are executed from VME. Preprocessed data flows into the processor at a rate of 5.2 Gbyte/s, and processed data flows from the processor to the global first-level trigger at a rate of 70 Mbyte/s. The system allows for subsets of the logic to be configured by software and for various important variables to be histogrammed as they flow through the processor.

Dawson, J.W.; Talaga, R.L.; Burr, G.W.; Laird, R.J. (Argonne National Lab., Argonne, IL (US)); Smith, W.; Lackey, J. (Univ. of Wisconsin, Dept. of Physics, Madison, WI (US))

1990-12-01

367

Parallel Ab initio quantum chemistry on pentium-pro networks  

SciTech Connect

As the performance of inexpensive PCs approaches that of the fastest single processor supercomputers, high-end computing is increasingly dominated by massively parallel computers with hundreds or thousands of CPUs. Although such systems are essential for many applications, smaller parallel computers can achieve much lower price/performance ratios using commodity processors and interconnections. To investigate the feasibility of this approach for parallel quantum chemistry we have constructed a 56-processor parallel computer from fourteen 4-processor shared memory Pentium-Pro motherboards. These are interconnected by a 100 Mbit/s Fast Ethernet switch and each motherboard has 256 Mbytes of RAM and 1 Gbyte of disk. The system runs the LINUX operating system which supports symmetric multiprocessing on each four-processor motherboard. Although some bottlenecks still exist in the inter-system communication, we have achieved very reasonable speedups running our massively parallel quantum chemistry program (MPQC).

Seidl, E.; Janssen, C.; Colvin, M. [Sandia National Lab., Albuquerque, NM (United States)

1997-12-31

368

Survey of new vector computers: The CRAY 1S from CRAY research; the CYBER 205 from CDC and the parallel computer from ICL - architecture and programming  

NASA Technical Reports Server (NTRS)

Problems which can arise with vector and parallel computers are discussed in a user oriented context. Emphasis is placed on the algorithms used and the programming techniques adopted. Three recently developed supercomputers are examined and typical application examples are given in CRAY FORTRAN, CYBER 205 FORTRAN and DAP (distributed array processor) FORTRAN. The systems' performance is compared. The addition of parts of two N x N arrays is considered. The influence of the architecture on the algorithms and programming language is demonstrated. Numerical analysis of magnetohydrodynamic differential equations by an explicit difference method is illustrated, showing very good results for all three systems. The prognosis for supercomputer development is assessed.

Gentzsch, W.

1982-01-01

369

Efficacy of Code Optimization on Cache-based Processors  

NASA Technical Reports Server (NTRS)

The current common wisdom in the U.S. is that the powerful, cost-effective supercomputers of tomorrow will be based on commodity (RISC) micro-processors with cache memories. Already, most distributed systems in the world use such hardware as building blocks. This shift away from vector supercomputers and towards cache-based systems has brought about a change in programming paradigm, even when ignoring issues of parallelism. Vector machines require inner-loop independence and regular, non-pathological memory strides (usually this means: non-power-of-two strides) to allow efficient vectorization of array operations. Cache-based systems require spatial and temporal locality of data, so that data once read from main memory and stored in high-speed cache memory is used optimally before being written back to main memory. This means that the most cache-friendly array operations are those that feature zero or unit stride, so that each unit of data read from main memory (a cache line) contains information for the next iteration in the loop. Moreover, loops ought to be 'fat', meaning that as many operations as possible are performed on cache data, provided instruction caches do not overflow and enough registers are available. If unit stride is not possible, for example because of some data dependency, then care must be taken to avoid pathological strides, just as on vector computers. For cache-based systems the issues are more complex, due to the effects of associativity and of non-unit block (cache line) size. But there is more to the story. Most modern micro-processors are superscalar, which means that they can issue several (arithmetic) instructions per clock cycle, provided that there are enough independent instructions in the loop body. This is another argument for providing fat loop bodies. With these restrictions, it appears fairly straightforward to produce code that will run efficiently on any cache-based system. It can be argued that although some of the important computational algorithms employed at NASA Ames require different programming styles on vector machines and cache-based machines, respectively, neither architecture class appeared to be favored by particular algorithms in principle. Practice tells us that the situation is more complicated. This report presents observations and some analysis of performance tuning for cache-based systems. We point out several counterintuitive results that serve as a cautionary reminder that memory accesses are not the only factors that determine performance, and that within the class of cache-based systems, significant differences exist.
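
The two programming rules emphasized above, unit stride and fat loop bodies, are easy to see in a small example: the first routine below walks a C array down its columns (stride N, cache-hostile), while the second walks rows with unit stride and fuses two updates so each cache line does more work.

```c
/* Illustration of stride and loop "fatness" on a cache-based system. */
#include <stddef.h>

#define N 1024

void strided(double a[N][N], double s)
{
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            a[i][j] *= s;              /* stride N in memory: cache-hostile */
}

void unit_stride_fat(double a[N][N], double b[N][N], double s)
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++) {
            a[i][j] *= s;              /* unit stride                      */
            b[i][j] += a[i][j];        /* fused: more work per cache line  */
        }
}
```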

VanderWijngaart, Rob F.; Chancellor, Marisa K. (Technical Monitor)

1997-01-01

370

Parallelization of a treecode  

E-print Network

I describe here the performance of a parallel treecode with individual particle timesteps. The code is based on the Barnes-Hut algorithm and runs cosmological N-body simulations on parallel machines with a distributed memory architecture using the MPI message-passing library. For a configuration with a constant number of particles per processor the scalability of the code was tested up to P=128 processors on an IBM SP4 machine. In the large $P$ limit the average CPU time per processor necessary for solving the gravitational interactions is $\sim 10\%$ higher than that expected from the ideal scaling relation. The processor domains are determined every large timestep according to a recursive orthogonal bisection, using a weighting scheme which takes into account the total particle computational load within the timestep. The results of the numerical tests show that the load balancing efficiency $L$ of the code is high ($\geq 90\%$) up to P=32, and decreases to $L \sim 80\%$ when P=128. In the latter case it is found that some aspects of the code performance are affected by machine hardware, while the proposed weighting scheme can achieve a load balance as high as $L \sim 90\%$ even in the large $P$ limit.

R. Valdarnini

2003-03-18

371

Multiple processor version of a Monte Carlo code for photon transport in turbid media  

NASA Astrophysics Data System (ADS)

Although Monte Carlo (MC) simulations represent an accurate and flexible tool to study the photon transport in strongly scattering media with complex geometrical topologies, they are very often infeasible because of their very high computation times. Parallel computing, in principle very suitable for the MC approach because it consists in the repeated application of the same calculations to unrelated and superposing events, offers a possible approach to overcome this problem. An MC multiple-processor code for optical and IR photon transport was developed and run on the parallel processor computer CRAY-T3E (128 DEC Alpha EV5 nodes, 600 Mflops) at CINECA (Bologna, Italy). The comparison between single processor and multiple processor runs for the same tissue models shows that the parallelization reduces the computation time by a factor of about N, where N is the number of processors used. This means a computation time reduction by a factor ranging from about 10^2 (as in our case where 128 processors are available) up to about 10^3 (with the most powerful parallel computers with 1024 processors). This reduction could make feasible MC simulations till now impracticable. The scaling of the execution time of the parallel code, as a function of the values of the main input parameters, is also evaluated.

Colasanti, Alberto; Guida, Giovanni; Kisslinger, Annamaria; Liuzzi, Raffaele; Quarto, Maria; Riccio, Patrizia; Roberti, Giuseppe; Villani, Fulvia

2000-10-01

372

Simulating synchronous processors. Technical report  

SciTech Connect

This paper shows how a distributed system with synchronous processors and asynchronous message delays can be simulated by a system with both asynchronous processors and asynchronous message delays in the presence of various types of processor faults. Consequently, the result of Fischer, Lynch and Paterson (1985), that no consensus protocol for asynchronous processors and communication can tolerate one failstop fault, implies a result of Dolev, Dwork, and Stockmeyer (1987), that no consensus protocol for synchronous processors and asynchronous communication can tolerate one failstop fault.

Welch, J.L.

1988-06-01

373

Parallel-computing structures for adaptive maximum-likelihood receivers  

SciTech Connect

Bandwidth-efficient digital data transmission over telephone and radio channels is significantly improved by the use of adaptive equalization. Among the numerous adaptive equalizer and receiver structures developed during the last two decades, adaptive maximum-likelihood receivers have emerged as front runners with respect to error-rate performance. However, the high degree of computational complexity of the optimum maximum-likelihood receivers has prohibited their use in many applications. This dissertation presents a study of parallel-computing structures that provide high computation throughput for implementation of adaptive maximum-likelihood receivers. Based on systolic array concepts, a two-dimensional array implementation of the Viterbi processor for adaptive maximum-likelihood receivers is presented. The array computes state transition metrics and survivor metric table addresses in a highly concurrent fashion. All interprocessor data flow and interconnections within the array are nearest-neighbor. A number of variations in the array design are described that enhance its versatility. A high-bandwidth memory interface for the survivor metric table memory is proposed.
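
The kernel each cell of such a systolic array evaluates is the Viterbi add-compare-select step. The scalar sketch below shows one trellis stage for a small state machine; it is illustrative and not the array design of the dissertation.

```c
/* One Viterbi add-compare-select (ACS) stage for a small trellis. */
#include <float.h>

#define NSTATES 4

/* branch[i][s]: metric of the transition from state i to state s;
 * DBL_MAX marks transitions that do not exist in the trellis.      */
void acs_stage(const double metric_in[NSTATES],
               const double branch[NSTATES][NSTATES],
               double metric_out[NSTATES],
               int survivor[NSTATES])
{
    for (int s = 0; s < NSTATES; s++) {
        double best = DBL_MAX;
        int from = -1;
        for (int i = 0; i < NSTATES; i++) {
            if (branch[i][s] == DBL_MAX) continue;
            double cand = metric_in[i] + branch[i][s];   /* add     */
            if (cand < best) {                           /* compare */
                best = cand;
                from = i;                                /* select  */
            }
        }
        metric_out[s] = best;
        survivor[s]   = from;    /* entry for the survivor table    */
    }
}
```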

Provence, J.D.

1987-01-01

374

A scalable parallel open architecture data acquisition system for low to high rate experiments, test beams and all SSC (Superconducting Super Collider) detectors  

SciTech Connect

A new era of high-energy physics research is beginning, requiring accelerators with much higher luminosities and interaction rates in order to discover new elementary particles. As a consequence, both orders-of-magnitude higher data rates from the detector and online processing power well beyond the capabilities of current high energy physics data acquisition systems are required. This paper describes a new data acquisition system architecture which draws heavily from the communications industry, is totally parallel (i.e., without any bottlenecks), is capable of data rates of hundreds of gigabytes per second from the detector and into an array of online processors (i.e., processor farm), and uses an open systems architecture to guarantee compatibility with future commercially available online processor farms. The main features of the system architecture are standard interface ICs to detector subsystems wherever possible, fiber optic digital data transmission from the near-detector electronics, a self-routing parallel event builder, and the use of industry-supported and high-level language programmable processors in the proposed BCD system for both triggers and online filters. A brief status report of an ongoing project at Fermilab to build the self-routing parallel event builder will also be given in the paper. 3 figs., 1 tab.

Barsotti, E.; Booth, A.; Bowden, M.; Swoboda, C. (Fermi National Accelerator Lab., Batavia, IL (USA)); Lockyer, N.; VanBerg, R. (Pennsylvania Univ., Philadelphia, PA (USA))

1989-12-01

375

A Move Processor for Bio-Inspired Systems  

Microsoft Academic Search

The structure and operation of multi-cellular organisms relies, among other things, on the specialization of the cells' physical structure to a finite set of specific operations. If we wish to make the analogy between a biological cell and a digital processor, we should note that nature's approach to parallel processing is subtly different from conventional von Neumann architectures or even

Gianluca Tempesti; Pierre-andré Mudry; Ralph Hoffmann

2005-01-01

376

Dynamically allocating processor resources between nearby and distant ILP  

Microsoft Academic Search

Modern superscalar processors use wide instruction issue widths and out-of-order execution in order to increase instruction-level parallelism (ILP). Because instructions must be committed in order so as to guarantee precise exceptions, increasing ILP implies increasing the sizes of structures such as the register file, issue queue, and reorder buffer. Simultaneously, cycle time constraints limit the sizes of these

Rajeev Balasubramonian; Sandhya Dwarkadas; David H. Albonesi

2001-01-01

377

Guarded execution and branch prediction in dynamic ILP processors  

Microsoft Academic Search

We evaluate the effects of guarded (or conditional, or predicated) execution on the performance of an instruction level parallel processor employing dynamic branch prediction. First, we assess the utility of guarded execution, both qualitatively and quantitatively, using a variety of application programs. Our assessment shows that guarded execution significantly increases the opportunities, for both compiler and dynamic hardware, to extract

Dionisios N. Pnevmatikatos; Gurindar S. Sohi

1994-01-01

378

Guarded Execution and Branch Prediction in Dynamic ILP Processors  

Microsoft Academic Search

We evaluate the effects of guarded (or conditional, or predicated) execution on the performance of an instruction level parallel processor employing dynamic branch prediction. First, we assess the utility of guarded execution, both qualitatively and quantitatively, using a variety of application programs. Our assessment shows that guarded execution significantly increases the opportunities, for both compiler and dynamic

Dionisios N. Pnevmatikatos; Gurindar S. Sohi

379

Fast Parallel Computation Of Manipulator Inverse Dynamics  

NASA Technical Reports Server (NTRS)

Method for fast parallel computation of inverse dynamics problem, essential for real-time dynamic control and simulation of robot manipulators, undergoing development. Enables exploitation of high degree of parallelism and achievement of significant computational efficiency, while minimizing various communication and synchronization overheads as well as complexity of required computer architecture. Universal real-time robotic controller and simulator (URRCS) consists of internal host processor and several SIMD processors with ring topology. Architecture modular and expandable: more SIMD processors added to match size of problem. Operate asynchronously and in MIMD fashion.

Fijany, Amir; Bejczy, Antal K.

1991-01-01

380

A parallel algorithm for channel routing on a hypercube  

NASA Technical Reports Server (NTRS)

A new parallel simulated annealing algorithm for channel routing on a P processor hypercube is presented. The basic idea used is to partition a set of tracks equally among processors in the hypercube. In parallel, P/2 pairs of processors perform displacements and exchanges of nets between tracks, compute the changes in cost functions, and accept moves using a parallel annealing criterion. Through the use of a unique distributed data structure, it is possible to minimize message traffic and add versatility and efficiency in a parallel routing tool. The algorithm has been implemented and is being tested on some of the popular channel problems from the literature.
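
The acceptance test applied after each proposed displacement or exchange is the usual Metropolis rule, evaluated by each processor pair from the computed change in cost. A minimal sketch, with illustrative names rather than the paper's code:

```c
/* Metropolis acceptance test for a proposed routing move. */
#include <math.h>
#include <stdlib.h>

/* Return 1 to accept the move, 0 to reject it. */
int accept_move(double delta_cost, double temperature)
{
    if (delta_cost <= 0.0)
        return 1;                       /* always take improvements   */
    double u = (double)rand() / ((double)RAND_MAX + 1.0);
    return u < exp(-delta_cost / temperature);   /* Metropolis rule   */
}
```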

Brouwer, Randall; Banerjee, Prithviraj

1987-01-01

381

Implementing clips on a parallel computer  

NASA Technical Reports Server (NTRS)

The C language integrated production system (CLIPS) is a forward chaining rule based language to provide training and delivery for expert systems. Conceptually, rule based languages have great potential for benefiting from the inherent parallelism of the algorithms that they employ. During each cycle of execution, a knowledge base of information is compared against a set of rules to determine if any rules are applicable. Parallelism also can be employed for use with multiple cooperating expert systems. To investigate the potential benefits of using a parallel computer to speed up the comparison of facts to rules in expert systems, a parallel version of CLIPS was developed for the FLEX/32, a large grain parallel computer. The FLEX implementation takes a macroscopic approach in achieving parallelism by splitting whole sets of rules among several processors rather than by splitting the components of an individual rule among processors. The parallel CLIPS prototype demonstrates the potential advantages of integrating expert system tools with parallel computers.

Riley, Gary

1987-01-01

382

Parallel machine architecture for production rule systems  

DOEpatents

A parallel processing system for production rule programs utilizes a host processor for storing production rule right hand sides (RHS) and a plurality of rule processors for storing left hand sides (LHS). The rule processors operate in parallel in the recognize phase of the system's Recognize-Act Cycle to match their respective LHS's against a stored list of working memory elements (WMEs) in order to find a self-consistent set of WMEs. The list of WMEs is dynamically varied during the Act phase of the system in which the host executes or fires rule RHS's for those rules for which a self-consistent set has been found by the rule processors. The host transmits instructions for creating or deleting working memory elements as dictated by the rule firings until the rule processors are unable to find any further self-consistent working memory element sets at which time the production rule system is halted.

Allen, Jr., John D. (Knoxville, TN); Butler, Philip L. (Knoxville, TN)

1989-01-01

383

A Complexity Theory for Unbounded Fan-In Parallelism  

Microsoft Academic Search

A complexity theory for unbounded fan-in parallelism is developed where the complexity measure is the simultaneous measure (number of processors, parallel time). Two models of unbounded fan-in parallelism are (1) parallel random access machines that allow simultaneous reading from or writing to the same common memory location, and (2) circuits containing AND's, OR's and NOT's with no bound placed on

Ashok K. Chandra; Larry J. Stockmeyer; Uzi Vishkin

1982-01-01

384

Increasing the parallelism of filters through transformation to block state variable form  

Microsoft Academic Search

The block state variable form is investigated as a technique to increase the parallelism of a filter. This increase in parallelism allows more parallel processors to be usefully applied to the problem, resulting in a faster processing rate than is possible in the unblocked form. Upper and lower bounds on the sample period bound and the number of processors required
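
The extra parallelism comes from the standard block state-variable relations: over a block of length L, every output depends only on the state at the start of the block and the block of inputs, so all L outputs can be computed concurrently. In terms of the usual state equations x[n+1] = A x[n] + B u[n], y[n] = C x[n] + D u[n], the block form reads:

```latex
% Block state-variable update over a block of length L (standard
% derivation from the single-step state equations; illustrative).
\begin{align*}
  x[n+L] &= A^{L} x[n] \;+\; \sum_{k=0}^{L-1} A^{\,L-1-k} B\, u[n+k],\\
  y[n+j] &= C A^{j} x[n] \;+\; \sum_{k=0}^{j-1} C A^{\,j-1-k} B\, u[n+k] \;+\; D\, u[n+j],
  \qquad j = 0,\dots,L-1 .
\end{align*}
```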

D. Schwartz; T. Barnwell

1984-01-01

385

Performance characterization of the NAS Parallel Benchmarks in OpenCL  

Microsoft Academic Search

Heterogeneous parallel computing platforms, which are composed of different processors (e.g., CPUs, GPUs, FPGAs, and DSPs), are widening their user base in all computing domains. With this trend, parallel programming models need to achieve portability across different processors as well as high performance with reasonable programming effort. OpenCL (Open Computing Language) is an open standard and emerging parallel programming model

Sangmin Seo; Gangwon Jo; Jaejin Lee

2011-01-01

386

Parallel processing data network of master and slave transputers controlled by a serial control network  

DOEpatents

The present device provides for a dynamically configurable communication network having a multi-processor parallel processing system having a serial communication network and a high speed parallel communication network. The serial communication network is used to disseminate commands from a master processor to a plurality of slave processors to effect communication protocol, to control transmission of high density data among nodes and to monitor each slave processor's status. The high speed parallel processing network is used to effect the transmission of high density data among nodes in the parallel processing system. Each node comprises a transputer, a digital signal processor, a parallel transfer controller, and two three-port memory devices. A communication switch within each node connects it to a fast parallel hardware channel through which all high density data arrives or leaves the node. 6 figs.

Crosetto, D.B.

1996-12-31

387

Efficient Spare Allocation for Reconfigurable Arrays  

Microsoft Academic Search

Yield degradation from physical failures in large memories and processor arrays is of significant concern to semiconductor manufacturers. One method of increasing the yield for iterated arrays of memory cells or processing elements is to incorporate spare rows and columns in the die or wafer. These spare rows and columns can then be programmed into the array. The authors discuss

Sy-Yen Kuo; W. K. Fuchs

1987-01-01

388

Distributed processor allocation for launching applications in a massively connected processors complex  

DOEpatents

A compute processor allocator architecture for allocating compute processors to run applications in a multiple processor computing apparatus is distributed among a subset of processors within the computing apparatus. Each processor of the subset includes a compute processor allocator. The compute processor allocators can share a common database of information pertinent to compute processor allocation. A communication path permits retrieval of information from the database independently of the compute processor allocators.

Pedretti, Kevin (Goleta, CA)

2008-11-18

389

Scalable parallel communications  

NASA Technical Reports Server (NTRS)

Coarse-grain parallelism in networking (that is, the use of multiple protocol processors running replicated software sending over several physical channels) can be used to provide gigabit communications for a single application. Since parallel network performance is highly dependent on real issues such as hardware properties (e.g., memory speeds and cache hit rates), operating system overhead (e.g., interrupt handling), and protocol performance (e.g., effect of timeouts), we have performed detailed simulation studies of both a bus-based multiprocessor workstation node (based on the Sun Galaxy MP multiprocessor) and a distributed-memory parallel computer node (based on the Touchstone DELTA) to evaluate the behavior of coarse-grain parallelism. Our results indicate: (1) coarse-grain parallelism can deliver multiple 100 Mbps with currently available hardware platforms and existing networking protocols (such as Transmission Control Protocol/Internet Protocol (TCP/IP) and parallel Fiber Distributed Data Interface (FDDI) rings); (2) scale-up is near linear in n, the number of protocol processors, and channels (for small n and up to a few hundred Mbps); and (3) since these results are based on existing hardware without specialized devices (except perhaps for some simple modifications of the FDDI boards), this is a low cost solution to providing multiple 100 Mbps on current machines. In addition, from both the performance analysis and the properties of these architectures, we conclude: (1) multiple processors providing identical services and the use of space division multiplexing for the physical channels can provide better reliability than monolithic approaches (it also provides graceful degradation and low-cost load balancing); (2) coarse-grain parallelism supports running several transport protocols in parallel to provide different types of service (for example, one TCP handles small messages for many users, other TCP's running in parallel provide high bandwidth service to a single application); and (3) coarse-grain parallelism will be able to incorporate many future improvements from related work (e.g., reduced data movement, fast TCP, fine-grain parallelism) also with near linear speed-ups.

Maly, K.; Khanna, S.; Overstreet, C. M.; Mukkamala, R.; Zubair, M.; Sekhar, Y. S.; Foudriat, E. C.

1992-01-01

390

Electrostatically focused addressable field emission array chips (AFEA's) for high-speed massively parallel maskless digital E-beam direct write lithography and scanning electron microscopy  

DOEpatents

Systems and methods are described for addressable field emission array (AFEA) chips. A method of operating an addressable field-emission array includes: generating a plurality of electron beams from a plurality of emitters that compose the addressable field-emission array; and focusing at least one of the plurality of electron beams with an on-chip electrostatic focusing stack. The systems and methods provide advantages including the avoidance of space-charge blow-up.

Thomas, Clarence E. (Knoxville, TN); Baylor, Larry R. (Farragut, TN); Voelkl, Edgar (Oak Ridge, TN); Simpson, Michael L. (Knoxville, TN); Paulus, Michael J. (Knoxville, TN); Lowndes, Douglas H. (Knoxville, TN); Whealton, John H. (Oak Ridge, TN); Whitson, John C. (Clinton, TN); Wilgen, John B. (Oak Ridge, TN)

2002-12-24

391

A Survey Paper on Processor Workloads  

E-print Network

Benchmarks discussed include Fhourstone, SPEC CPU, SiSoft Sandra, LINPACK, LAPACK, the Lawrence Livermore Loops, and the NAS Parallel Benchmarks. Benchmarks have been developed to measure the performance of processors and rank them among

Jain, Raj

392

Coarray Fortran for parallel programming  

Microsoft Academic Search

Co-Array Fortran, formerly known as F--, is a small extension of Fortran 95 for parallel processing. A Co-Array Fortran program is interpreted as if it were replicated a number of times and all copies were executed asynchronously. Each copy has its own set of data objects and is termed an image. The array syntax of Fortran 95 is extended with

Robert W. Numrich; John Reid

1998-01-01

393

Japanese document recognition and retrieval system using programmable SIMD processor  

NASA Astrophysics Data System (ADS)

This paper describes a new efficient information-filing system for a large number of documents. The system is designed to recognize Japanese characters and make full-text searches across a document database. Key components of the system are a small fully-programmable parallel processor for both recognition and retrieval, an image scanner for document input, and a personal computer as the operator console. The processor is constructed with a bit-serial single-instruction multiple-data-stream (SIMD) architecture, and all components, including the 256 processor elements and 11 MB of RAM, are integrated on one board. The recognition process divides a document into text lines, isolates each character, extracts character pattern features, and then identifies character categories. The entire process is performed by a single micro-program package down-loaded from the console. The recognition accuracy is more than 99.0% for about 3 printed Japanese characters at a speed of more than 14 characters per second. The processor can also be made available for high-speed information retrieval by changing the down-loaded microprogram package. The retrieval process can obtain sentences that include the same information as an inquiry text from the database previously created through character recognition. Retrieval performance is very fast, with 20 million individual Japanese characters being examined each second when the database is stored in the processor's IC memory. It was confirmed that a high performance but flexible and cost-effective document-information-processing system

Miyahara, Sueharu; Suzuki, Akira; Tada, Shunkichi; Kawatani, Takahiko

1991-02-01

394

CoNNeCT Baseband Processor Module  

NASA Technical Reports Server (NTRS)

A document describes the CoNNeCT Baseband Processor Module (BPM) based on an updated processor, memory technology, and field-programmable gate arrays (FPGAs). The BPM was developed from a requirement to provide sufficient computing power and memory storage to conduct experiments for a Software Defined Radio (SDR) to be implemented. The flight SDR uses the AT697 SPARC processor with on-chip data and instruction cache. The non-volatile memory has been increased from a 20-Mbit EEPROM (electrically erasable programmable read only memory) to a 4-Gbit Flash, managed by the RTAX2000 Housekeeper, allowing more programs and FPGA bit-files to be stored. The volatile memory has been increased from a 20-Mbit SRAM (static random access memory) to a 1.25-Gbit SDRAM (synchronous dynamic random access memory), providing additional memory space for more complex operating systems and programs to be executed on the SPARC. All memory is EDAC (error detection and correction) protected, while the SPARC processor implements fault protection via TMR (triple modular redundancy) architecture. Further capability over prior BPM designs includes the addition of a second FPGA to implement features beyond the resources of a single FPGA. Both FPGAs are implemented with Xilinx Virtex-II and are interconnected by a 96-bit bus to facilitate data exchange. Dedicated 1.25-Gbit SDRAMs are wired to each Xilinx FPGA to accommodate high rate data buffering for SDR applications as well as independent SpaceWire interfaces. The RTAX2000 manages scrubbing and configuration of each Xilinx FPGA.

Yamamoto, Clifford K; Jedrey, Thomas C.; Gutrich, Daniel G.; Goodpasture, Richard L.

2011-01-01

395

Silicon Auditory Processors as Computer Peripherals  

E-print Network

John Lazzaro and John Wawrzynek (CS Division, UC Berkeley) describe an alternative output method for silicon auditory models, suitable for direct interface to digital

Lazzaro, John

396

Parallel supercomputing today and the Cedar approach  

Microsoft Academic Search

More and more scientists and engineers are becoming interested in using supercomputers. Earlier barriers to using these machines are disappearing as software for their use improves. Meanwhile, new parallel supercomputer architectures are emerging that may provide rapid growth in performance. These systems may use a large number of processors with an intricate memory system that is both parallel and hierarchical;

D. J. Kuck; E. S. Davidson; D. H. Lawrie; A. H. Sameh

1986-01-01

397

Parallel boxing in B-spline intersection  

Microsoft Academic Search

A modified formulation of oriented boxing called oriented slab boxing is presented. It almost doubles the speed of the oriented boxing component in B-spline intersection. The method used to accelerate B-spline intersection includes algorithmic improvements and parallelization of the algorithm at different levels of granularity to find an optimum solution on a network of parallel processors. The software testbed is

J. Yen; S. Spach; M. T. Smith; R. W. Pulleyblank

1991-01-01

398

Parallel benchmarks on the Transtech Paramid  

Microsoft Academic Search

This paper presents the results of running some benchmarks from the Genesis suite on the Transtech Paramid. The benchmarks use the PARMACS parallel processing standard, and are based on applications in the fields of general relativity, molecular dynamics and QCD. The Paramid is a distributed memory parallel computer, using up to 64 Intel i860-XP processors. The results demonstrate good

R. S. Stephens

1994-01-01

399

Parallel algorithms for finding trigonometric sums  

SciTech Connect

Parallel versions of the Goertzel and Reinsch algorithms for finding trigonometric sums are introduced as a special case of efficient parallel algorithms for solving linear recurrence systems. The results of the experiments performed on a 20-processor Sequent Symmetry are presented and discussed.

Stpiczynski, P. [Marie Curie-Sklodowska Univ., Lublin (Poland); Paprzycki, M. [Univ. of Texas of the Permian Basin, Odessa, TX (United States)

1995-12-01

400

Reconfigurable data path processor  

NASA Technical Reports Server (NTRS)

A reconfigurable data path processor comprises a plurality of independent processing elements. Each of the processing elements advantageously comprises an identical architecture. Each processing element comprises a plurality of data processing means for generating a potential output. Each processor is also capable of through-putting an input as a potential output with little or no processing. Each processing element comprises a conditional multiplexer having a first conditional multiplexer input, a second conditional multiplexer input and a conditional multiplexer output. A first potential output value is transmitted to the first conditional multiplexer input, and a second potential output value is transmitted to the second conditional multiplexer input. The conditional multiplexer couples either the first conditional multiplexer input or the second conditional multiplexer input to the conditional multiplexer output, according to an output control command. The output control command is generated by processing a set of arithmetic status bits through a logical mask. The conditional multiplexer output is coupled to a first processing element output. A first set of arithmetic status bits is generated according to the processing of the first processable value. A second set of arithmetic status bits may be generated from a second processing operation. The selection of the arithmetic status bits is performed by an arithmetic-status-bit multiplexer, which selects the desired set from among the first and second sets of arithmetic status bits. The conditional multiplexer evaluates the selected arithmetic status bits according to a logical mask defining an algorithm for evaluating them.
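
A rough software analogue of the conditional-multiplexer selection described above, assuming a simple status-bit layout and mask; the bit names and values are illustrative, not taken from the patent:

```python
# Sketch (not the patented circuit) of the conditional-multiplexer idea:
# two candidate outputs are computed, a set of arithmetic status bits is
# selected and ANDed with a logical mask, and the result chooses the output.

ZERO, NEG, CARRY, OVF = 0x1, 0x2, 0x4, 0x8   # illustrative status-bit layout

def status_bits(value):
    """Derive simple arithmetic status bits from a processing result."""
    bits = 0
    if value == 0:
        bits |= ZERO
    if value < 0:
        bits |= NEG
    return bits

def conditional_mux(out_a, out_b, status, mask):
    """Select out_a when any masked status bit is set, else out_b."""
    return out_a if (status & mask) else out_b

# Example: route the first result through if it was negative, else the second.
a, b = -5 + 2, 7 * 3
print(conditional_mux(a, b, status_bits(a), mask=NEG))   # -> -3
```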

Donohoe, Gregory (Inventor)

2005-01-01

401

Radiofrequency detector coil performance maps for parallel MRI applications  

E-print Network

Parallel MRI techniques allow acceleration of MR imaging beyond traditional speed limits. In parallel MRI, arrays of radiofrequency (RF) detector coils are used to perform some degree of spatial encoding which ...

Lattanzi, Riccardo

2006-01-01

402

SIMD-parallel understanding of natural language with application to magnitude-only optical parsing of text  

NASA Astrophysics Data System (ADS)

A novel parallel model of natural language (NL) understanding is presented which can realize high levels of semantic abstraction, and is designed for implementation on synchronous SIMD architectures and optical processors. Theory is expressed in terms of the Image Algebra (IA), a rigorous, concise, inherently parallel notation which unifies the design, analysis, and implementation of image processing algorithms. The IA has been implemented on numerous parallel architectures, and IA preprocessors and interpreters are available for the FORTRAN and Ada languages. In a previous study, we demonstrated the utility of IA for mapping MEA-conformable (Multiple Execution Array) algorithms to optical architectures. In this study, we extend our previous theory to map serial parsing algorithms to the synchronous SIMD paradigm. We initially derive a two-dimensional image that is based upon the adjacency matrix of a semantic graph. Via IA template mappings, the operations of bottom-up parsing, semantic disambiguation, and referential resolution are implemented as image-processing operations upon the adjacency matrix. Pixel-level operations are constrained to Hadamard addition and multiplication, thresholding, and row/column summation, which are available in magnitude-only optics. Assuming high parallelism in the parse rule base, the parsing of n input symbols with a grammar consisting of M rules of arity H, on an N-processor architecture, could exhibit a time complexity T(n) that is independent of n; the computational cost is constant and of order H. Since H << n is typical, we claim a fundamental complexity advantage over the current O(n) theoretical time limit of MIMD parsing architectures. Additionally, we show that inference over a semantic net is achievable in parallel in O(m) time, where m corresponds to the depth of the search tree. Results are evaluated in terms of computational cost on SISD and SIMD processors, with discussion of implementation on electro-optic architectures.
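
A small sketch, using NumPy in place of an optical or SIMD array, of the pixel-level primitives named above (Hadamard addition and multiplication, thresholding, and row/column summation) applied to the adjacency matrix of a toy semantic graph; the graph, weights, and threshold are illustrative:

```python
import numpy as np

# The adjacency matrix of a small semantic graph treated as an "image",
# processed with the elementwise primitives named in the abstract.
A = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)          # adjacency matrix
W = np.array([[0.0, 0.9, 0.0],
              [0.0, 0.0, 0.4],
              [0.7, 0.0, 0.0]])                 # illustrative per-edge weights

hadamard_sum  = A + W                               # Hadamard (elementwise) addition
hadamard_prod = A * W                               # Hadamard multiplication
strong_edges  = (hadamard_prod > 0.5).astype(int)   # thresholding
out_degree    = A.sum(axis=1)                       # row summation
in_degree     = A.sum(axis=0)                       # column summation

# Fixed-depth reachability via multiply-and-threshold, the kind of operation
# that maps naturally onto a SIMD or optical processor.
reach2 = ((A @ A) > 0).astype(int)
print(strong_edges, out_degree, in_degree, reach2, sep="\n")
```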

Schmalz, Mark S.

1992-08-01

403

A Cooperative Management Scheme for Power Efficient Implementations of Real-Time Operating Systems on Soft Processors  

Microsoft Academic Search

A cooperative management scheme for power efficient implementations of real-time operating systems on field-programmable gate-array (FPGA)-based soft processors is presented. Dedicated power management hardware peripherals are tightly coupled to a soft processor by utilizing its configurability. These hardware peripherals manage tasks and interrupts in cooperation with the soft processor, while retaining the real-time responsiveness of the operating system. More specifically,

Jingzhao Ou; Viktor K. Prasanna

2008-01-01

404

A Phase Preserving SAR Processor  

Microsoft Academic Search

Synthetic aperture radar (SAR) image phase information is necessary to support many advanced SAR applications. The phase information in the complex image for conventional range-Doppler processors is not a robust estimate of scene phase. A SAR processor specifically designed to preserve phase information is being developed at the Canada Centre for Remote Sensing (CCRS). In addition

R. Keith Raney; Paris W. Vachon

1989-01-01

405

Microprogramming enhances signal processor's performance  

SciTech Connect

The authors describe the use of a modular software-programmable processor for signal processing above 100 kHz. The emitter-coupled logic enables 2×10^7 complex multiplications to be performed each second, while the microprogrammable processor gives designers the freedom to make changes in control or processing flow.

Chin, S.H.; Brooks, C.W.

1982-11-17

406

Microcode development for microprogrammed processors  

Microsoft Academic Search

The aim of this paper is to develop a top-down design automation tool for the design of digital systems such as microprogrammed processors. The package contains a hardware description language to specify the design, a microcode development module to generate an efficient microprogram for the microprogrammed processor's control, and a functional simulator module to verify the validity of the design. The goal

J. P-C Hwang; C. A. Papachristou; D. D. Cornett

1985-01-01

407

Micromechanical Signal Processors  

NASA Astrophysics Data System (ADS)

Completely monolithic high-Q micromechanical signal processors constructed of polycrystalline silicon and integrated with CMOS electronics are described. The signal processors implemented include an oscillator, a bandpass filter, and a mixer + filter--all of which are components commonly required for up- and down-conversion in communication transmitters and receivers, and all of which take full advantage of the high Q of micromechanical resonators. Each signal processor is designed, fabricated, then studied with particular attention to the performance consequences associated with miniaturization of the high-Q element. The fabrication technology which realizes these components merges planar integrated circuit CMOS technologies with those of polysilicon surface micromachining. The technologies are merged in a modular fashion, where the CMOS is processed in the first module, the microstructures in a following separate module, and at no point in the process sequence are steps from each module intermixed. Although the advantages of such modularity include flexibility in accommodating new module technologies, the developed process constrained the CMOS metallization to a high temperature refractory metal (tungsten metallization with TiSi2 contact barriers) and constrained the micromachining process to long-term temperatures below 835 °C. Rapid-thermal annealing (RTA) was used to relieve residual stress in the mechanical structures. To reduce the complexity involved with developing this merged process, capacitively transduced resonators are utilized. High-Q single resonator and spring-coupled micromechanical resonator filters are also investigated, with particular attention to noise performance, bandwidth control, and termination design. The noise in micromechanical filters is found to be fairly high due to poor electromechanical coupling on the micro-scale with present-day technologies. Solutions to this high series resistance problem are suggested, including smaller electrode-to-resonator gaps to increase the coupling capacitance. Active Q-control techniques are demonstrated which control the bandwidth of micromechanical filters and simulate filter terminations with little passband distortion. Noise analysis shows that these active techniques are relatively quiet when compared with other resistive techniques. Modulation techniques are investigated whereby a single resonator or a filter constructed from several such resonators can provide both a mixing and a filtering function, or a filtering and amplitude modulation function. These techniques center around the placement of a carrier signal on the micromechanical resonator. Finally, micro oven stabilization is investigated in an attempt to null the temperature coefficient of a polysilicon micromechanical resonator. Here, surface micromachining procedures are utilized to fabricate a polysilicon resonator on a microplatform--two levels of suspension--equipped with heater and temperature sensing resistors, which are then imbedded in a feedback loop to control the platform (and resonator) temperature. (Abstract shortened by UMI.).

Nguyen, Clark Tu-Cuong

408

Acousto-optic/CCD real-time SAR data processor  

NASA Technical Reports Server (NTRS)

The SAR processor, which uses an acousto-optic device as the input electronic-to-optical transducer and a 2-D CCD image sensor operated in the time-delay-and-integrate (TDI) mode, is presented. The CCD serves as the optical detector, and it simultaneously operates as an array of optically addressed correlators. The lines of the focused SAR image form continuously (at the radar PRF) at the final row of the CCD. The principles of operation of this processor, its performance characteristics, the state-of-the-art of the devices used and experimental results are outlined. The methods by which this processor can be made flexible so that it can be dynamically adapted to changing SAR geometries are discussed.

Psaltis, D.

1983-01-01

409

Performance scalability and dynamic behavior of Parsec benchmarks on many-core processors  

E-print Network

The Parsec benchmark suite [1] is widely used in the evaluation of parallel architectures, both existing and novel. While the benchmarks in this suite

Keidar, Idit

410

A generic fine-grained parallel C  

NASA Technical Reports Server (NTRS)

With the present availability of parallel processors of vastly different architectures, there is a need for a common language interface to multiple types of machines. The parallel C compiler, currently under development, is intended to be such a language. This language is based on the belief that an algorithm designed around fine-grained parallelism can be mapped relatively easily to different parallel architectures, since a large percentage of the parallelism has been identified. The compiler generates a FORTH-like machine-independent intermediate code. A machine-dependent translator will reside on each machine to generate the appropriate executable code, taking advantage of the particular architectures. The goal of this project is to allow a user to run the same program on such machines as the Massively Parallel Processor, the CRAY, the Connection Machine, and the CYBER 205 as well as serial machines such as VAXes, Macintoshes and Sun workstations.

Hamet, L.; Dorband, John E.

1988-01-01

411

Response matrix transport calculations on parallel computers  

SciTech Connect

The response matrix method offers an excellent vehicle for adapting three-dimensional neutron transport methods to parallel computers. Our current thrust is in utilizing the three-dimensional Variational nodal code VARIANT as a point of departure for performing three-dimensional parallel computations on the IBM SPx at Argonne National Laboratory. The code employs a planar red-black iteration with a secondary red-black or four-color iteration within each plane. Speed-up and efficiency results have been obtained with a two-stage parallel implementation. First, the response matrix coefficients are calculated in parallel for each unique node type. Second, parallel iterations are performed with one red-black pair of planes assigned to each processor. A hierarchical structure may be employed to obtain finer parallel granularity by assigning multiple processors to the planar red-black or four-color iterations.
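
A serial sketch of the red-black coloring pattern referred to above, shown on a small 2-D Laplace problem for clarity; in the scheme described, red-black pairs of planes would be assigned to different processors. The grid size, boundary condition, and sweep count are illustrative:

```python
import numpy as np

# Red-black Gauss-Seidel sweep: cells of one color are updated using only
# neighbors of the other color, so each color can be updated in parallel.
def red_black_sweep(u):
    for color in (0, 1):                       # 0 = "red" cells, 1 = "black"
        for i in range(1, u.shape[0] - 1):
            for j in range(1, u.shape[1] - 1):
                if (i + j) % 2 == color:
                    u[i, j] = 0.25 * (u[i-1, j] + u[i+1, j] +
                                      u[i, j-1] + u[i, j+1])
    return u

u = np.zeros((16, 16))
u[0, :] = 1.0                                  # fixed "hot" boundary on one edge
for _ in range(200):
    red_black_sweep(u)
print(round(u[8, 8], 4))                       # interior value after relaxation
```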

Hanebutte, U.R.; Palmiotti, G.; Khalil, H.S. [Argonne National Lab., IL (United States). Reactor Analysis Div.; Tatsumi, M. [Nuclear Fuel Industries, Ltd., Osaka (Japan). Nuclear Engineering Div.; Lewis, E.E. [Northwestern Univ., Evanston, IL (United States). Dept. of Mechanical Engineering

1996-12-31

412

Massively Parallel Finite Element Programming  

Microsoft Academic Search

Today’s large finite element simulations require parallel algorithms to scale on clusters with thousands or tens of thousands of processor cores. We present data structures and algorithms to take advantage of the power of high performance computers in generic finite element codes. Existing generic finite element libraries often restrict the parallelization to parallel linear algebra routines. This is a limiting

Timo Heister; Martin Kronbichler; Wolfgang Bangerth

2010-01-01

413

Performance characteristics of a parallel treecode  

E-print Network

I describe here the performance of a parallel treecode with individual particle timesteps. The code is based on the Barnes-Hut algorithm and runs cosmological N-body simulations on parallel machines with a distributed memory architecture using the MPI message passing library. For a configuration with a constant number of particles per processor, the scalability of the code has been tested up to P=32 processors. The average CPU time per processor necessary for solving the gravitational interactions is within ~10% of that expected from the ideal scaling relation. The load balancing efficiency is high (≳90%) if the processor domains are determined every large timestep according to a weighting scheme which takes into account the total particle computational load within the timestep.
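
A sketch of the kind of weight-based domain decomposition described above, assuming particles are already ordered one-dimensionally (for example along a space-filling curve) and that per-particle work estimates are available; the weights and partitioning rule are illustrative:

```python
# Split an ordered list of particles so that each processor domain receives
# roughly an equal share of the total computational weight. The per-particle
# weights below are illustrative work estimates, not values from the paper.
def partition_by_weight(weights, nproc):
    total = sum(weights)
    target = total / nproc
    domains, current, acc = [], [], 0.0
    for i, w in enumerate(weights):
        current.append(i)
        acc += w
        if acc >= target * (len(domains) + 1) and len(domains) < nproc - 1:
            domains.append(current)
            current = []
    domains.append(current)
    return domains

weights = [1, 1, 8, 1, 1, 1, 6, 1]      # per-particle work estimates
print(partition_by_weight(weights, 3))  # -> [[0, 1, 2], [3, 4, 5, 6], [7]]
```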

R. Valdarnini

2002-12-11

414

Scalable load balancing for massively parallel distributed Monte Carlo particle transport  

SciTech Connect

In order to run computer simulations efficiently on massively parallel computers with hundreds of thousands or millions of processors, care must be taken that the calculation is load balanced across the processors. Examining the workload of every processor leads to an unscalable algorithm, with run time at least as large as O(N), where N is the number of processors. We present a scalable load balancing algorithm, with run time O(log(N)), that involves iterated processor-pair-wise balancing steps, ultimately leading to a globally balanced workload. We demonstrate scalability of the algorithm up to 2 million processors on the Sequoia supercomputer at Lawrence Livermore National Laboratory. (authors)
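
A sketch of iterated processor-pair-wise balancing under simplifying assumptions (the processor count is a power of two, work is arbitrarily divisible, and partners are chosen along hypercube dimensions); it illustrates the O(log N) structure rather than the specific algorithm used in the paper:

```python
# In round k, processor p exchanges work with partner p XOR 2**k and both
# take the average. After log2(N) rounds every processor holds the global
# mean load. Assumes N is a power of two and that work is divisible.
def pairwise_balance(load):
    n = len(load)
    rounds = n.bit_length() - 1                 # log2(n) balancing rounds
    for k in range(rounds):
        new = load[:]
        for p in range(n):
            partner = p ^ (1 << k)
            new[p] = 0.5 * (load[p] + load[partner])
        load = new
    return load

print(pairwise_balance([8.0, 0.0, 4.0, 0.0, 2.0, 0.0, 0.0, 2.0]))
# -> every processor ends up with the mean load of 2.0
```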

O'Brien, M. J.; Brantley, P. S. [Lawrence Livermore National Laboratory, 7000 East Avenue, Livermore, CA 94550 (United States); Joy, K. I. [Institute for Data Analysis and Visualization, Computer Science Department, University of California, One Shields Avenue, Davis, CA 95616 (United States)

2013-07-01

415

Understanding Language Support for Irregular Parallelism  

Microsoft Academic Search

While software support for array-based, data-parallel algorithms has been studied extensively, less attention has been devoted to irregular parallel applications. The majority of these applications are unstructured, that is, they possess asynchronous components that do not fit the data-parallel model. Examples of unstructured applications include sparse matrix and n-body problems. Previous research, such as Parti [11] and CHAOS [13], has concentrated on extending the array-based data-parallel...

Mukund Raghavachari; Anne Rogers

1995-01-01

416

Computation of Watersheds Based on Parallel Graph Algorithms  

Microsoft Academic Search

In this paper the implementation of a parallel watershed algorithm is described. The algorithm has been implemented on a Cray J932, which is a shared memory architecture with 32 processors. The watershed transform has generally been considered to be inherently sequential, but recently a few research groups, have designed parallel algorithms for computing watersheds. Most of these parallel algorithms are

A. Meijster; J. B. T. M. Roerdink

1996-01-01

417

Online Scheduling of Parallel Jobs on Hypercubes: Maximizing the Throughput  

E-print Network

We study the problem of scheduling unit-time parallel jobs on hypercubes. A parallel job has to be scheduled between its release time and deadline on a subcube of processors. The objective is to maximize the number of early jobs. We provide

Sgall, Jiri

418

Architectural Requirements and Scalability of the NAS Parallel Benchmarks  

E-print Network

We study the architectural requirements and scalability of the NAS Parallel Benchmarks. We find that the local processor and memory systems dominate both the overall performance and scalability of the NAS Parallel Benchmarks, as opposed to the communication

Culler, David E.

419

Space Division Multiple Access in MIMO Systems with Parallel Data Transmission  

NASA Astrophysics Data System (ADS)

We study multiple-input multiple-output (MIMO) cellular communication systems with antenna arrays at both link ends and parallel channels for data transmission. These channels (the so-called eigenchannels) are formed with the help of adaptive transmitting and receiving beamformer processors matched with a random fading environment. To increase the capacity of MIMO systems, we propose a space-division multiple-access (SDMA) method, which does not require estimation of signal-arrival directions and is based on orthogonalization of the parallel channels of all users. We find the signal-to-noise ratios at the eigenchannel outputs and the total capacity of a MIMO system in the case of simultaneous servicing of an arbitrary number of users. We present numerical results for the case of Rayleigh fading of signals, which confirm the high effectiveness of the proposed SDMA method.
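
A small NumPy sketch of how parallel eigenchannels arise from the singular value decomposition of a Rayleigh-fading MIMO channel matrix; the antenna counts, equal power allocation, and SNR value are illustrative, not parameters from the paper:

```python
import numpy as np

# Eigenchannels of a random Rayleigh-fading MIMO channel H are formed by
# transmit/receive beamformers taken from the SVD, H = U diag(s) V^H.
rng = np.random.default_rng(0)
nt = nr = 4
H = (rng.standard_normal((nr, nt)) + 1j * rng.standard_normal((nr, nt))) / np.sqrt(2)

U, s, Vh = np.linalg.svd(H)
snr = 10.0                                     # illustrative transmit SNR (linear)

# Per-eigenchannel output SNR scales with the squared singular value; the
# total capacity is the sum over the parallel channels (equal power here).
per_channel_snr = snr * s**2 / nt
capacity = np.sum(np.log2(1 + per_channel_snr))
print(np.round(s, 3), round(float(capacity), 2))
```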

Flaksman, A. G.

2002-11-01

420

Low-power, parallel photonic interconnections for Multi-Chip Module applications  

SciTech Connect

New applications of photonic interconnects will involve the insertion of parallel-channel links into Multi-Chip Modules (MCMs). Such applications will drive photonic link components into more compact forms that consume far less power than traditional telecommunication data links. MCM-based applications will also require simplified drive circuitry, lower cost, and higher reliability than has been demonstrated currently in photonic and optoelectronic technologies. The work described is a parallel link array, designed for vertical (Z-Axis) interconnection of the layers in an MCM-based signal processor stack, operating at a data rate of 100 Mb/s. This interconnect is based upon high-efficiency VCSELs, HBT photoreceivers, integrated micro-optics, and MCM-compatible packaging techniques.

Carson, R.F.; Lovejoy, M.L.; Lear, K.L.

1994-12-31

421

Parallel CFD Benchmarks on Cray Computers  

Microsoft Academic Search

In this paper we present benchmark results from the parallel implementation of the three-dimensional Navier-Stokes solver Prism [1] on the Cray T3D. We compare the single processor performance with other Cray computers, namely the Cray C90, J90 and EL98, as well as Digital Equipment Corporation's DEC 3000/500 (which uses the same processor as the T3D) and AlphaServer 8400 5/300 (which

Constantinos Evangelinos; GEORGE EM KARNIADAKIS

1996-01-01

422

PVM Enhancement for Beowulf Multiple-Processor Nodes  

NASA Technical Reports Server (NTRS)

A recent version of the Parallel Virtual Machine (PVM) computer program has been enhanced to enable use of multiple processors in a single node of a Beowulf system (a cluster of personal computers that runs the Linux operating system). A previous version of PVM had been enhanced by addition of a software port, denoted BEOLIN, that enables the incorporation of a Beowulf system into a larger parallel processing system administered by PVM, as though the Beowulf system were a single computer in the larger system. BEOLIN spawns tasks on (that is, automatically assigns tasks to) individual nodes within the cluster. However, BEOLIN does not enable the use of multiple processors in a single node. The present enhancement adds support for a parameter in the PVM command line that enables the user to specify which Internet Protocol host address the code should use in communicating with other Beowulf nodes. This enhancement also provides for the case in which each node in a Beowulf system contains multiple processors. In this case, by making multiple references to a single node, the user can cause the software to spawn multiple tasks on the multiple processors in that node.

Springer, Paul

2006-01-01

423

ELIPS: Toward a Sensor Fusion Processor on a Chip  

NASA Technical Reports Server (NTRS)

The paper presents the concept and initial tests from the hardware implementation of a low-power, high-speed reconfigurable sensor fusion processor. The Extended Logic Intelligent Processing System (ELIPS) processor is developed to seamlessly combine rule-based systems, fuzzy logic, and neural networks to achieve parallel fusion of sensor data in compact, low-power VLSI. The first demonstration of the ELIPS concept targets interceptor functionality; other applications, mainly in robotics and autonomous systems, are considered for the future. The main assumption behind ELIPS is that fuzzy, rule-based and neural forms of computation can serve as the main primitives of an "intelligent" processor. Thus, in the same way classic processors are designed to optimize the hardware implementation of a set of fundamental operations, ELIPS is developed as an efficient implementation of computational intelligence primitives, and relies on a set of fuzzy set, fuzzy inference and neural modules, built in programmable analog hardware. The hardware programmability allows the processor to reconfigure into different machines, taking the most efficient hardware implementation during each phase of information processing. Following software demonstrations on several interceptor data, three important ELIPS building blocks (a fuzzy set preprocessor, a rule-based fuzzy system and a neural network) have been fabricated in analog VLSI hardware and demonstrated microsecond processing times.

Daud, Taher; Stoica, Adrian; Tyson, Thomas; Li, Wei-te; Fabunmi, James

1998-01-01

424

Parallelized direct execution simulation of message-passing parallel programs  

NASA Technical Reports Server (NTRS)

As massively parallel computers proliferate, there is growing interest in finding ways by which performance of massively parallel codes can be efficiently predicted. This problem arises in diverse contexts such as parallelizing computers, parallel performance monitoring, and parallel algorithm development. In this paper we describe one solution where one directly executes the application code, but uses a discrete-event simulator to model details of the presumed parallel machine such as operating system and communication network behavior. Because this approach is computationally expensive, we are interested in its own parallelization, specifically the parallelization of the discrete-event simulator. We describe methods suitable for parallelized direct execution simulation of message-passing parallel programs, and report on the performance of such a system, the Large Application Parallel Simulation Environment (LAPSE), which we have built on the Intel Paragon. On all codes measured to date, LAPSE predicts performance well, typically within 10 percent relative error. Depending on the nature of the application code, we have observed low slowdowns (relative to natively executing code) and high relative speedups using up to 64 processors.

Dickens, Phillip M.; Heidelberger, Philip; Nicol, David M.

1994-01-01

425

Asynchronous parallel status comparator  

DOEpatents

Disclosed is an apparatus for matching asynchronously received signals and determining whether two or more out of a total number of possible signals match. The apparatus comprises, in one embodiment, an array of sensors positioned in discrete locations and in communication with one or more processors. The processors will receive signals if the sensors detect a change in the variable sensed from a nominal to a special condition and will transmit location information in the form of a digital data set to two or more receivers. The receivers collect, read, latch and acknowledge the data sets and forward them to decoders that produce an output signal for each data set received. The receivers also periodically reset the system following each scan of the sensor array. A comparator then determines if any two or more, as specified by the user, of the output signals corresponds to the same location. A sufficient number of matches produces a system output signal that activates a system to restore the array to its nominal condition. 4 figs.
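
A software sketch of the comparator stage described above: location codes reported asynchronously by the receivers and decoders are counted, and a system output is raised when a user-specified number of reports agree on the same location. The threshold and location codes are illustrative:

```python
from collections import Counter

# Count asynchronously received location reports; any location reported by
# at least `required` decoders triggers the system output. The threshold of
# 2 and the location codes below are illustrative, not from the patent.
def find_matches(location_reports, required=2):
    counts = Counter(location_reports)
    return [loc for loc, n in counts.items() if n >= required]

reports = ["A3", "B7", "A3", "C1"]      # one report per receiver/decoder
matched = find_matches(reports, required=2)
if matched:
    print("system output: restore locations", matched)   # -> ['A3']
```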

Arnold, J.W.; Hart, M.M.

1992-12-15

426

Asynchronous parallel status comparator  

DOEpatents

Apparatus for matching asynchronously received signals and determining whether two or more out of a total number of possible signals match. The apparatus comprises, in one embodiment, an array of sensors positioned in discrete locations and in communication with one or more processors. The processors will receive signals if the sensors detect a change in the variable sensed from a nominal to a special condition and will transmit location information in the form of a digital data set to two or more receivers. The receivers collect, read, latch and acknowledge the data sets and forward them to decoders that produce an output signal for each data set received. The receivers also periodically reset the system following each scan of the sensor array. A comparator then determines if any two or more, as specified by the user, of the output signals corresponds to the same location. A sufficient number of matches produces a system output signal that activates a system to restore the array to its nominal condition.

Arnold, Jeffrey W. (828 Hickory Ridge Rd., Aiken, SC 29801); Hart, Mark M. (223 Limerick Dr., Aiken, SC 29803)

1992-01-01

427

A large scale, homogeneous, fully distributed parallel machine, I  

Microsoft Academic Search

The preliminary hardware description of CHOPP (Columbia Homogeneous Parallel Processor), a MIMD machine supporting a fully distributed host-less operating system, is presented. The architecture is intended to permit implementation of machines with 10^5 to 10^6 processors. Issues of interconnection networks, throughput, and memory structure are treated.

Herbert Sullivan; Theodore R. Bashkow; David Klappholz

1977-01-01

428

Compact hohlraum configuration with parallel planar-wire-array x-ray sources at the 1.7-MA Zebra generator  

NASA Astrophysics Data System (ADS)

A compact Z-pinch x-ray hohlraum design with parallel-driven x-ray sources is experimentally demonstrated in a configuration with a central target and tailored shine shields at a 1.7-MA Zebra generator. Driving in parallel two magnetically decoupled compact double-planar-wire Z pinches has demonstrated the generation of synchronized x-ray bursts that correlated well in time with x-ray emission from a central reemission target. Good agreement between simulated and measured hohlraum radiation temperature of the central target is shown. The advantages of compact hohlraum design applications for multi-MA facilities are discussed.

Kantsyrev, V. L.; Chuvatin, A. S.; Rudakov, L. I.; Velikovich, A. L.; Shrestha, I. K.; Esaulov, A. A.; Safronova, A. S.; Shlyaptseva, V. V.; Osborne, G. C.; Astanovitsky, A. L.; Weller, M. E.; Stafford, A.; Schultz, K. A.; Cooper, M. C.; Cuneo, M. E.; Jones, B.; Vesey, R. A.

2014-12-01

429

Practical Simulation of Large-Scale Parallel Programs and Its Performance Analysis of the NAS Parallel Benchmarks  

Microsoft Academic Search

A simulation technique for very large-scale data parallel programs is proposed. In our simulation method, a data parallel program is divided into computation and communication sections. When the control flow of the parallel program does not depend on the contents of network messages, the computation time on each processor is calculated independently. An instrumentation tool called EXCIT is used to calculate the execution time on

Kazuto Kubota; Ken’ichi Itakura; Mitsuhisa Sato; Taisuke Boku

1998-01-01

430

Parallel Sequence Mining on Shared-Memory Machines, Journal of Parallel and Distributed Computing 61, 401-426 (2001)  

E-print Network

elements must be sent from one processor to the other, utilizing the message passing programming paradigm. Although a shared memory architecture offers programming simplicity, the finite bandwidth of a common bus

Zaki, Mohammed Javeed

431

Algorithms for Automatic Alignment of Arrays  

NASA Technical Reports Server (NTRS)

Aggregate data objects (such as arrays) are distributed across the processor memories when compiling a data-parallel language for a distributed-memory machine. The mapping determines the amount of communication needed to bring operands of parallel operations into alignment with each other. A common approach is to break the mapping into two stages: an alignment that maps all the objects to an abstract template, followed by a distribution that maps the template to the processors. This paper describes algorithms for solving the various facets of the alignment problem: axis and stride alignment, static and mobile offset alignment, and replication labeling. We show that optimal axis and stride alignment is NP-complete for general program graphs, and give a heuristic method that can explore the space of possible solutions in a number of ways. We show that some of these strategies can give better solutions than a simple greedy approach proposed earlier. We also show how local graph contractions can reduce the size of the problem significantly without changing the best solution. This allows more complex and effective heuristics to be used. We show how to model the static offset alignment problem using linear programming, and we show that loop-dependent mobile offset alignment is sometimes necessary for optimum performance. We describe an algorithm for determining mobile alignments for objects within do loops. We also identify situations in which replicated alignment is either required by the program itself or can be used to improve performance. We describe an algorithm based on network flow that replicates objects so as to minimize the total amount of broadcast communication in replication.

Chatterjee, Siddhartha; Gilbert, John R.; Oliker, Leonid; Schreiber, Robert; Sheffler, Thomas J.

1996-01-01

432

Green Secure Processors: Towards Power-Efficient Secure Processor Design  

NASA Astrophysics Data System (ADS)

With the increasing wealth of digital information stored on computer systems today, security issues have become increasingly important. In addition to attacks targeting the software stack of a system, hardware attacks have become equally likely. Researchers have proposed Secure Processor Architectures which utilize hardware mechanisms for memory encryption and integrity verification to protect the confidentiality and integrity of data and computation, even from sophisticated hardware attacks. While there have been many works addressing performance and other system level issues in secure processor design, power issues have largely been ignored. In this paper, we first analyze the sources of power (energy) increase in different secure processor architectures. We then present a power analysis of various secure processor architectures in terms of their increase in power consumption over a base system with no protection and then provide recommendations for designs that offer the best balance between performance and power without compromising security. We extend our study to the embedded domain as well. We also outline the design of a novel hybrid cryptographic engine that can be used to minimize the power consumption for a secure processor. We believe that if secure processors are to be adopted in future systems (general purpose or embedded), it is critically important that power issues are considered in addition to performance and other system level issues. To the best of our knowledge, this is the first work to examine the power implications of providing hardware mechanisms for security.

Chhabra, Siddhartha; Solihin, Yan

433

Integrating firewire peripheral interface with an ethernet custom network processor  

Microsoft Academic Search

Bandwidth demands on ubiquitous Ethernet have grown immensely, driven by the rapid expansion of real-time applications like audio/video streaming. In related research, the authors designed a novel high-performance custom network processor chip using field programmable gate arrays (FPGAs). The main function of this chip (named SPEED) is to bypass the operating system processing of network protocol stack at the

Omar S. Elkeelany; Ghulam Chaudhry

2007-01-01

434

Digital pulse processor for ion beam microprobe tomography  

NASA Astrophysics Data System (ADS)

A digital pulse processor, suitable for 3D imaging with focused ion beam techniques, has been developed using a high performance FPGA (field-programmable gate array) and a high-level FPGA design tool. The hardware and software implementation of several components, including analog signal conditioning, trapezoidal filtering, and median filtering, are presented. The system is applied to the 3D STIM (scanning transmission ion microscopy) tomography technique.
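
A sketch of the trapezoidal (moving-sum difference) shaping stage named above, applied to an ideal step pulse; the filter lengths and synthetic pulse are illustrative, and a real exponential detector pulse would additionally need pole-zero correction:

```python
import numpy as np

# Trapezoidal shaping as the difference of two moving window sums separated
# by a flat-top gap. Rise length k, gap, and the synthetic step are
# illustrative; this is not the FPGA implementation from the paper.
def trapezoidal_filter(x, k, gap):
    c = np.concatenate(([0.0], np.cumsum(x)))
    out = np.zeros_like(x, dtype=float)
    for n in range(len(x)):
        hi = c[n + 1] - c[max(n + 1 - k, 0)]              # recent window sum
        lo_end = max(n + 1 - k - gap, 0)
        lo = c[lo_end] - c[max(lo_end - k, 0)]            # delayed window sum
        out[n] = hi - lo
    return out / k

x = np.zeros(200)
x[50:] = 1.0                                              # step of height 1
y = trapezoidal_filter(x, k=20, gap=10)
print(round(float(y.max()), 3))                           # ~1.0 on the flat top
```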

Bogovac, M.; Jakšić, M.; Wegrzynek, D.; Markowicz, A.

2009-09-01

435

Event Pre Processor for the CZT Detector on MIRAX  

NASA Astrophysics Data System (ADS)

We describe the Event Pre Processor (EPP) for the Hard X-ray Imager (HXI) on MIRAX. The EPP provides on board data reduction and event filtering for the HXI Cadmium Zinc Telluride strip detector. Emphasis is placed upon the EPP requirements, its implementation as VHDL design in a Field Programmable Gate Array (FPGA), and the description of a test environment for both the VHDL code and the FPGA hardware.

Kendziorra, Eckhard; Schanz, Thomas; Suchy, Slawomir; Distratis, Giuseppe

2006-06-01

436

Speech recognizer-based microphone array processing for robust hands-free speech recognition  

Microsoft Academic Search

We present a new array processing algorithm for microphone array speech recognition. Conventionally, the goal of array processing is to take distorted signals captured by the array and generate a cleaner output waveform. However, speech recognition systems operate on a set of features derived from the waveform, rather than the waveform itself. The goal of an array processor used in

Michael L. Seltzer; Bhiksha Raj; Richard M. Stern

2002-01-01

437

An Efficient Solution Method for Multibody Systems with Loops Using Multiple Processors  

NASA Technical Reports Server (NTRS)

This paper describes a multibody dynamics algorithm formulated for parallel implementation on multiprocessor computing platforms using the divide-and-conquer approach. The system of interest is a general topology of rigid and elastic articulated bodies with or without loops. The algorithm divides the multibody system into a number of smaller sets of bodies in chain or tree structures, called "branches", at convenient joints called "connection points", and uses an Order-N (O(N)) approach to formulate the dynamics of each branch in terms of the unknown spatial connection forces. The equations of motion for the branches, leaving the connection forces as unknowns, are implemented in separate processors in parallel for computational efficiency, and the equations for all the unknown connection forces are synthesized and solved in one or several processors. The performance of two implementations of this divide-and-conquer algorithm on multiple processors is compared with an existing method implemented on a single processor.

Ghosh, Tushar K.; Nguyen, Luong A.; Quiocho, Leslie J.

2015-01-01

438

Dynamically allocating processor resources between nearby and distant ILP  

Microsoft Academic Search

Modern superscalar processors use wide instruction issue widths and out-of-order execution in order to increase instruction-level parallelism (ILP). Because instructions must be committed in order so as to guarantee precise exceptions, increasing ILP implies increasing the sizes of structures such as the register file, issue queue, and reorder buffer. Simultaneously, cycle time constraints limit the sizes of these structures, resulting

Rajeev Balasubramonian; Sandhya Dwarkadas; David H. Albonesi

2001-01-01

439

PQEMU: A Parallel System Emulator Based on QEMU  

E-print Network

A full system emulator, such as QEMU, can provide a versatile virtual platform for software ... processor emulations to effectively utilize the underlying parallelism presented by today's multi-core processors

Chung, Yeh-Ching

440

Design of a massively parallel computer using bit serial processing elements  

NASA Technical Reports Server (NTRS)

A 1-bit serial processor designed for a parallel computer architecture is described. This processor is used to develop a massively parallel computational engine, with a single instruction-multiple data (SIMD) architecture. The computer is simulated and tested to verify its operation and to measure its performance for further development.
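
A sketch of how a bit-serial SIMD array performs arithmetic: every 1-bit processing element handles one bit plane of its operands per step while keeping a single carry bit, so an n-bit add costs n steps regardless of the number of elements. NumPy arrays stand in for the PE array, and the word length and data are illustrative:

```python
import numpy as np

# SIMD bit-serial addition: all PEs process bit plane `bit` in lockstep,
# propagating a 1-bit carry register between steps.
def bit_serial_add(a, b, nbits=8):
    result = np.zeros_like(a)
    carry = np.zeros_like(a)
    for bit in range(nbits):                      # one bit plane per step
        pa = (a >> bit) & 1
        pb = (b >> bit) & 1
        s = pa ^ pb ^ carry                        # sum bit for this plane
        carry = (pa & pb) | (pa & carry) | (pb & carry)
        result |= s << bit
    return result

a = np.array([3, 100, 255, 17])
b = np.array([5,  27,   1, 40])
print(bit_serial_add(a, b))        # matches a + b (mod 256) for 8-bit words
```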

Aburdene, Maurice F.; Khouri, Kamal S.; Piatt, Jason E.; Zheng, Jianqing

1995-01-01

441

High-speed digital image processor with special-purpose hardware for two-dimensional convolution.  

PubMed

A high-speed digital image processor combined with a flying spot scanner and special purpose hardware for two-dimensional convolution has been developed. This processor is applicable to image enhancement, image restoration, and feature extraction. The parallel and/or serial architectures of the processor enable digital image processing for a high-resolution image at a fast process rate. Realization of the high-speed operation of two-dimensional convolution is largely due to the special scanning method of the flying spot scanner. This processor is well suited to processing images recorded on photographic film by means of the two-dimensional digital filtering technique. Some experimental results are given to verify the capability of the developed processor. PMID:18699361
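
A sketch of the two-dimensional convolution operation that the special-purpose hardware accelerates, written directly in NumPy; the test image and sharpening kernel are illustrative (and since the kernel is symmetric, the correlation form shown equals true convolution):

```python
import numpy as np

# Direct 2-D convolution: a 3x3 kernel is slid over the image and a
# multiply-accumulate is performed for each output pixel.
def convolve2d(image, kernel):
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.zeros((8, 8))
image[3:5, 3:5] = 1.0                       # bright square on dark background
kernel = np.array([[ 0, -1,  0],
                   [-1,  5, -1],
                   [ 0, -1,  0]], dtype=float)   # simple sharpening kernel
print(convolve2d(image, kernel))
```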

Okuyama, H; Fukui, K; Ichioka, Y

1979-10-01

442

Reconfigurable multi-bit processor for DSP applications in statistical physics  

Microsoft Academic Search

A PC-AT hosted DSP processor architecture implemented in SRAM-based field programmable gate arrays (FPGA) and static memories is described. Despite its simplicity, the processor circuits can be reconfigured under software control to tackle a class of multi-bit 'pixel' processing problems of current interest in the statistical physics of disordered materials, thereby offering some of the problem flexibility of a general

S. Monaghan; C. P. Cowen

1993-01-01

443

What is a configurable, extensible processor?  

Microsoft Academic Search

We're all familiar with fixed Instruction Set Architecture (ISA) processors, such as Intel or AMD x86-class processors, and in the embedded design area, ARM cores, for example. As well as these general purpose cores, there are also fixed ISA processors that are more specific to a particular problem domain, such as Digital Signal Processors (DSPs). But the product requirements of

Grant Martin

2008-01-01

444

Galactic Plane SETI Observations with the Allen Telescope Array  

NASA Astrophysics Data System (ADS)

In the spring of 2006, the Allen Telescope Array (ATA), a joint effort of the U.C. Berkeley Radio Astronomy Lab and the SETI Institute, will begin initial operations. Starting with 42 antennas out of a planned 350, the array will be equivalent to a single 40 meter dish. Using three phased beams, we will survey twenty square degrees around the galactic center for narrowband signals in the frequency range from 1410 to 1730 MHz (the "Water Hole"). Comparison of results from the beams will be used to eliminate signals from terrestrial and satellite sources. At these frequencies, the wide field of view of the array allows us to cover the 2 x 10 degree strip with five antenna positions. The field of view will track one of the five positions for up to five hours, while the phased beams are pointed within the field of view for 98 seconds per 20 MHz frequency band. During these SETI observations spanning approximately seven months, other radio astronomy observations of this very interesting region will run in parallel using two other independently tunable IF systems with a correlator and other phased-array beams feeding other backend processors. Construction of the ATA is supported by private funding, primarily from the Paul G. Allen Foundation. The correlator for the ATA is supported by NSF Grant AST-0322309 to the UCB Radio Astronomy Lab.

Backus, P. R.; Tarter, J. C.; Davis, M. M.; Jordan, J. C.; Kilsdonk, T. N.; Shostak, G. S.; Ackerman, R.; DeBoer, D. R.; Dreher, J. W.; Harp, G. R.; Ross, J. E.; Stauduhar, R.

2005-12-01

445

A Real Time Superresolution Image Enhancement Processor  

NASA Astrophysics Data System (ADS)

An image processor is discussed that combines many types of image enhancement onto a single compact electronics card. The current enhancements include bad pixel compensation, focal plane array non-uniformity correction, and several stages of contrast enhancement, feature sharpening, superresolution, and image motion stabilization. Though there are certainly better algorithms for particular applications, this mixture of algorithms reliably enables the system to substantially improve image quality for a large variety of sensors, platforms, and imaging geometries. The card design, hosting an FPGA and a microprocessor, facilitated rapid development by allowing many complicated algorithm elements to be quickly coded in C, with the FPGA providing horsepower for simpler but more computationally intensive elements. Examples show the quality improvement gained by compensating for image degradations including camera motion, atmospheric turbulence induced blur, focal plane imperfections, camera pixel density, and noise.

Gerwe, D.; Menicucci, P.

446

CCD and IR array controllers  

NASA Astrophysics Data System (ADS)

A family of controllers has been developed that is powerful and flexible enough to operate a wide range of CCD and IR focal plane arrays in a variety of ground-based applications. These include fast readout of small CCD and IR arrays for adaptive optics applications, slow readout of large CCD and IR mosaics, and single CCD and IR array operation at low background/low noise regimes as well as high background/high speed regimes. The CCD and IR controllers have a common digital core based on user-programmable digital signal processors that are used to generate the array clocking and signal processing signals customized for each application. A fiber optic link passes image data and commands between the controller and VME or PCI interface boards resident in a host computer. CCD signal processing is done with a dual slope integrator operating at speeds of up to one Megapixel per second per channel. Signal processing of IR arrays is done either with a dual channel video processor or a four channel video processor that has built-in image memory and a coadder to 32-bit precision for operating high background arrays. Recent developments underway include the implementation of a fast fiber optic data link operating at a speed of 12.5 Megapixels per second for fast image transfer from the controller to the host computer, and supporting image acquisition software and device drivers for the PCI interface board for the Sun Solaris, Linux and Windows 2000 operating systems.

Leach, Robert W.; Low, Frank J.

2000-08-01

447

On the relationship between parallel computation and graph embedding  

SciTech Connect

The problem of efficiently simulating an algorithm designed for an n-processor parallel machine G on an m-processor parallel machine H with n > m arises when parallel algorithms designed for an ideal size machine are simulated on existing machines which are of a fixed size. The author studies this problem when every processor of H takes over the function of a number of processors in G, and he phrases the simulation problem as a graph embedding problem. New embeddings presented address relevant issues arising from the parallel computation environment. The main focus centers around embedding complete binary trees into smaller-sized binary trees, butterflies, and hypercubes. He also considers simultaneous embeddings of r source machines into a single hypercube. Constant factors play a crucial role in his embeddings since they are not only important in practice but also lead to interesting theoretical problems. All of his embeddings minimize dilation and load, which are the conventional cost measures in graph embeddings and determine the maximum amount of time required to simulate one step of G on H. His embeddings also optimize a new cost measure called ({alpha},{beta})-utilization which characterizes how evenly the processors of H are used by the processors of G. Ideally, the utilization should be balanced (i.e., every processor of H simulates at most (n/m) processors of G) and the ({alpha},{beta})-utilization measures how far off from a balanced utilization the embedding is. He presents embeddings for the situation when some processors of G have different capabilities (e.g. memory or I/O) than others and the processors with different capabilities are to be distributed uniformly among the processors of H. Placing such conditions on an embedding results in an increase in some of the cost measures.
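
A toy sketch of the conventional embedding cost measures mentioned above, dilation and load, for an illustrative mapping of a 4-node cycle (guest G) onto a 2-node host H; the graphs, the mapping, and the breadth-first path routine are assumptions for the example, and the (alpha,beta)-utilization measure is not shown:

```python
# Dilation = longest host path that any guest edge is stretched over;
# load = largest number of guest processors mapped onto one host processor.
def shortest_path_len(edges, src, dst):
    if src == dst:
        return 0
    frontier, seen, d = {src}, {src}, 0
    while frontier:
        d += 1
        frontier = {v for u in frontier for (a, b) in edges if u in (a, b)
                    for v in (a, b) if v not in seen}
        if dst in frontier:
            return d
        seen |= frontier
    return float("inf")

G_edges = [(0, 1), (1, 2), (2, 3), (3, 0)]       # guest machine links
H_edges = [(0, 1)]                               # host machine links
phi = {0: 0, 1: 0, 2: 1, 3: 1}                   # guest node -> host node

dilation = max(shortest_path_len(H_edges, phi[u], phi[v]) for u, v in G_edges)
load = max(sum(1 for g in phi if phi[g] == h) for h in set(phi.values()))
print("dilation:", dilation, "load:", load)      # -> dilation: 1 load: 2
```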

Gupta, A.K.

1989-01-01

448

Parallel Pascal - An extended Pascal for parallel computers  

NASA Technical Reports Server (NTRS)

Parallel Pascal is an extended version of the conventional serial Pascal programming language which includes a convenient syntax for specifying array operations. It is upward compatible with standard Pascal and involves only a small number of carefully chosen new features. Parallel Pascal was developed to reduce the semantic gap between standard Pascal and a large range of highly parallel computers. Two important design goals of Parallel Pascal were efficiency and portability. Portability is particularly difficult to achieve since different parallel computers frequently have very different capabilities.
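The point about array-operation syntax can be illustrated by analogy. The snippet below is Python/NumPy rather than Parallel Pascal, and is only meant to show how a single whole-array expression replaces element-by-element loops, which is what narrows the semantic gap between source code and highly parallel hardware.

```python
import numpy as np

a = np.arange(12, dtype=float).reshape(3, 4)
b = np.ones((3, 4))

# Serial style: explicit nested loops visiting one element at a time.
c_loop = np.empty_like(a)
for i in range(a.shape[0]):
    for j in range(a.shape[1]):
        c_loop[i, j] = 2.0 * a[i, j] + b[i, j]

# Array-expression style: one whole-array statement, analogous to Parallel
# Pascal's array operations, which a compiler can map onto parallel hardware.
c_array = 2.0 * a + b

assert np.allclose(c_loop, c_array)
```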

Reeves, A. P.

1984-01-01

449

Time-domain parallelization for geodynamic modeling  

NASA Astrophysics Data System (ADS)

Modern computational Geodynamics increasingly relies on parallel algorithms to speed up calculations. Currently, parallelization in Geodynamic codes is achieved via spatial decomposition, where the physical computational space (or its associated matrix system) is subdivided into domains that are attributed to one processor or to a set of processors. Such an approach that distributes the computational load is efficient as long as the size of the sub-domains is large enough so that the computational time remains larger than the communication time between processors. However, when the size of the sub-domains becomes too small, the parallel speed-up stagnates, which puts bounds on the maximum performance of the parallel calculations. This limitation can be overcome using a time-domain parallelization algorithm. This approach, named parareal, relies on the use of coarse sequential and fine parallel propagators to predict and to iteratively correct the solution over a given time interval. Although this method has been successfully used to solve parabolic and hyperbolic equations in various scientific areas, it has never been applied in geodynamic studies, where motions relevant to the Earth and other planetary mantles are those of a convective fluid at infinite Prandtl number. In that case, the time dependence of the mass and momentum equations is only implicit, due to thermal and/or viscous couplings with the explicitly time-dependent energy equation. This requires a number of modifications to the original algorithm. The performance of this adapted version of the parareal algorithm was investigated using theoretical model predictions, which are in good agreement with numerical experiments. I show that under optimum conditions the parallel speedup increases linearly with the number of processors, and speedups close to 25 were measured using only a few tens of CPUs. This parareal approach can be used alone or combined with any spatial parallel algorithm, allowing a significant additional increase in speedup as the number of processors grows.
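For reference, the textbook parareal iteration the abstract builds on alternates a cheap sequential coarse propagator with expensive fine propagations that can all run in parallel; the sketch below emulates it serially on a toy ODE. It shows only the standard algorithm, not the author's adaptation to the implicitly time-dependent mass and momentum equations, and the propagator choices and parameters are assumptions.

```python
import numpy as np

def parareal(u0, t0, t1, n_slices, coarse, fine, n_iter):
    """Serial emulation of the parareal iteration; in practice the fine
    propagations inside each iteration are distributed across processors."""
    ts = np.linspace(t0, t1, n_slices + 1)
    u = [u0]
    for n in range(n_slices):                 # initial coarse prediction (sequential)
        u.append(coarse(u[n], ts[n], ts[n + 1]))
    for _ in range(n_iter):
        f = [fine(u[n], ts[n], ts[n + 1]) for n in range(n_slices)]      # parallelizable
        g = [coarse(u[n], ts[n], ts[n + 1]) for n in range(n_slices)]
        u_new = [u0]
        for n in range(n_slices):             # sequential correction sweep
            u_new.append(coarse(u_new[n], ts[n], ts[n + 1]) + f[n] - g[n])
        u = u_new
    return ts, u

# Toy problem du/dt = -u: coarse = one Euler step, fine = 100 Euler sub-steps.
coarse = lambda u, ta, tb: u + (tb - ta) * (-u)
def fine(u, ta, tb, steps=100):
    dt = (tb - ta) / steps
    for _ in range(steps):
        u += dt * (-u)
    return u

ts, u = parareal(1.0, 0.0, 2.0, n_slices=10, coarse=coarse, fine=fine, n_iter=3)
```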

Samuel, H.

2012-04-01

450

Solving very large, sparse linear systems on mesh-connected parallel computers  

NASA Technical Reports Server (NTRS)

The implementation of Pan and Reif's Parallel Nested Dissection (PND) algorithm on mesh-connected parallel computers is described. This is the first known algorithm that allows very large, sparse linear systems of equations to be solved efficiently in polylog time using a small number of processors. How the processor bound of PND can be matched to the number of processors available on a given parallel computer by slowing down the algorithm by constant factors is described. Also, for the important class of problems where G(A) is a grid graph, a unique memory mapping that reduces the inter-processor communication requirements of PND to those that can be executed on mesh-connected parallel machines is detailed. A description of an implementation on the Goodyear Massively Parallel Processor (MPP), located at Goddard, is given, along with a detailed discussion of data mappings and performance issues.
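To make the "nested dissection" part concrete, the sketch below produces a recursive separator-based elimination order for an r x c grid graph, the G(A) structure the abstract singles out. It illustrates only the sequential ordering idea, not the PND algorithm itself, its polylog-time parallelization, or the MPP memory mapping.

```python
def nested_dissection_order(rows, cols):
    """Recursive nested-dissection elimination order for a rows x cols grid graph.

    At each level the longer dimension is split by a separator line, the two
    halves are ordered recursively, and the separator vertices are ordered last.
    """
    def order(r0, r1, c0, c1):
        if r1 <= r0 or c1 <= c0:
            return []
        if r1 - r0 == 1 and c1 - c0 == 1:
            return [(r0, c0)]
        if r1 - r0 >= c1 - c0:                 # split along rows
            mid = (r0 + r1) // 2
            sep = [(mid, j) for j in range(c0, c1)]
            return order(r0, mid, c0, c1) + order(mid + 1, r1, c0, c1) + sep
        else:                                  # split along columns
            mid = (c0 + c1) // 2
            sep = [(i, mid) for i in range(r0, r1)]
            return order(r0, r1, c0, mid) + order(r0, r1, mid + 1, c1) + sep
    return order(0, rows, 0, cols)

print(nested_dissection_order(3, 3))   # all 9 grid points, separators eliminated last
```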

Opsahl, Torstein; Reif, John

1987-01-01

451

Adaptive parallel algorithms for integral knapsack problems  

NASA Astrophysics Data System (ADS)

In this paper, the design of a time-efficient and processor-efficient parallel algorithm for the integral knapsack problem is considered. A parallel integral knapsack algorithm is presented, which is adaptive to all parameters, especially to the maximum size of items. The parallel complexity of another important packing problem, the integral exactly-packing problem, is also considered. An optimal O(log n log m)-time parallel integral exactly-packing algorithm is given. Since the partition problem has a constant-time, constant-processor reduction to the exactly-packing problem, our parallel integral exactly-packing algorithm can be used for job scheduling, task partition, and many other important practical problems. Moreover, the methods and techniques used in this paper can be used for developing processor-efficient and time-efficient parallel algorithms for many other problems. Using the new parallel integral knapsack algorithm, the previously known parallel approximation schemes for the 0-1 knapsack problem and the bin-packing problem, by E. W. Mayr and P. S. Gopalkrishnan, are improved upon significantly.
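For orientation, the recurrence such parallel algorithms accelerate is the classic knapsack dynamic program. The sketch below is the serial, unbounded-repetition variant (each item may be taken an integral number of times) and is only a baseline for comparison, not the adaptive parallel algorithm of the paper.

```python
def unbounded_knapsack(capacity, sizes, values):
    """Serial DP for the integral knapsack with unlimited copies of each item:
    best[c] = maximum value packable into total size at most c."""
    best = [0] * (capacity + 1)
    for c in range(1, capacity + 1):
        for s, v in zip(sizes, values):
            if s <= c:
                best[c] = max(best[c], best[c - s] + v)
    return best[capacity]

print(unbounded_knapsack(10, sizes=[3, 4, 5], values=[4, 5, 7]))   # 14 (two items of size 5)
```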

Teng, Shang-Hua

1990-04-01

452

Fast 3-D prestack depth migration with a parallel PSPI algorithm. Final report  

SciTech Connect

There was a need for general expertise in the porting of serial seismic reflection code to a parallel processing environment. The project was a continuation of Task Order 38, involving the improvement of the existing parallel models developed for that task and support for parallelizing other similar seismic codes for a massively parallel processor environment.

Roberts, P.M.; Alde, D.M.; House, L.S. [and others

1997-06-01

453

Parallel Processing of Broad-Band PPM Signals  

NASA Technical Reports Server (NTRS)

A parallel-processing algorithm and a hardware architecture to implement the algorithm have been devised for time-slot synchronization in the reception of pulse-position-modulated (PPM) optical or radio signals. As in the cases of some prior algorithms and architectures for parallel, discrete-time, digital processing of signals other than PPM, an incoming broadband signal is divided into multiple parallel narrower-band signals by means of sub-sampling and filtering. The number of parallel streams is chosen so that the frequency content of the narrower-band signals is low enough to enable processing by relatively low-speed complementary metal oxide semiconductor (CMOS) electronic circuitry. The algorithm and architecture are intended to satisfy requirements for time-varying time-slot synchronization and post-detection filtering, with correction of timing errors independent of estimation of timing errors. They are also intended to afford flexibility for dynamic reconfiguration and upgrading. The architecture is implemented in a reconfigurable CMOS processor in the form of a field-programmable gate array. The algorithm and its hardware implementation incorporate three separate time-varying filter banks for three distinct functions: correction of sub-sample timing errors, post-detection filtering, and post-detection estimation of timing errors. The design of the filter bank for correction of timing errors, the method of estimating timing errors, and the design of a feedback-loop filter are governed by a host of parameters, the most critical one, with regard to processing very broadband signals with CMOS hardware, being the number of parallel streams (equivalently, the rate-reduction parameter).
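The first step described, dividing an incoming broadband stream into multiple parallel narrower-band streams by sub-sampling, can be sketched as below. The stream count is arbitrary, and the time-varying filter banks, timing-error estimation, and correction stages are deliberately omitted; this is an illustration of the rate-reduction idea, not the paper's architecture.

```python
import numpy as np

def split_into_streams(x, m):
    """Demultiplex a broadband sample stream into m parallel sub-sampled streams.

    Stream k holds samples x[k], x[k+m], x[k+2m], ...; each stream then runs at
    1/m of the input sample rate, low enough for slower (e.g. CMOS) hardware
    to filter all m streams concurrently.
    """
    n = (len(x) // m) * m                     # drop any partial final block
    return [np.asarray(x[k:n:m]) for k in range(m)]

streams = split_into_streams(np.arange(32), m=4)
# streams[0] = [0, 4, 8, ...], streams[1] = [1, 5, 9, ...], and so on.
```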

Gray, Andrew; Kang, Edward; Lay, Norman; Vilnrotter, Victor; Srinivasan, Meera; Lee, Clement

2010-01-01

454

Task and instruction scheduling in parallel multithreaded processors  

E-print Network

... are then analyzed in order to find the critical path. As an illustration, see the shaded critical path in Figure 9, which is explained in detail in Section A. Similar to the round-robin policies, our fetch policies are classified as CPF.z.y, where CPF stands for "Critical Path based Prioritized Fetch," z is the number of threads fetched per cycle, and y is the maximum number of instructions fetched from each thread per cycle. We evaluate the following schemes: CPF.1.8, CPF.2.4, and CPF.4.2...
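Read literally, a CPF.z.y policy fetches from the z threads whose pending work looks most critical, up to y instructions from each per cycle. The toy model below is only one interpretation of that excerpt; the thread representation and the critical-path estimate are assumptions, not the thesis's simulator.

```python
def cpf_fetch(threads, z, y):
    """One fetch cycle of a toy CPF.z.y policy: rank threads by estimated remaining
    critical-path length and fetch up to y instructions from each of the top z."""
    ranked = sorted(threads, key=lambda t: t["critical_path_len"], reverse=True)
    fetched = []
    for t in ranked[:z]:
        take, t["queue"] = t["queue"][:y], t["queue"][y:]
        fetched.extend(take)
    return fetched

threads = [
    {"critical_path_len": 12, "queue": ["i0", "i1", "i2"]},
    {"critical_path_len": 30, "queue": ["j0", "j1", "j2", "j3", "j4"]},
    {"critical_path_len": 7,  "queue": ["k0", "k1"]},
]
print(cpf_fetch(threads, z=2, y=4))   # ['j0', 'j1', 'j2', 'j3', 'i0', 'i1', 'i2']
```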

Mishra, Amitabh

2012-06-07

455

Software orchestration of instruction level parallelism on tiled processor architectures  

E-print Network

The projection from silicon technology is that while the transistor budget will continue to blossom according to Moore's law, latency from global wires will severely limit the ability to scale centralized structures at high ...

Lee, Walter (Walter Cheng-Wan)

2005-01-01

456

Parallelization of thermochemical nanolithography  

NASA Astrophysics Data System (ADS)

One of the most pressing technological challenges in the development of next generation nanoscale devices is the rapid, parallel, precise and robust fabrication of nanostructures. Here, we demonstrate the possibility to parallelize thermochemical nanolithography (TCNL) by employing five nano-tips for the fabrication of conjugated polymer nanostructures and graphene-based nanoribbons. Electronic supplementary information (ESI) available: Details on the cantilevers array, on the sample preparation, and on the GO AFM experiments. See DOI: 10.1039/c3nr05696a

Carroll, Keith M.; Lu, Xi; Kim, Suenne; Gao, Yang; Kim, Hoe-Joon; Somnath, Suhas; Polloni, Laura; Sordan, Roman; King, William P.; Curtis, Jennifer E.; Riedo, Elisa

2014-01-01

457

Computing contingency statistics in parallel.  

SciTech Connect

Statistical analysis is typically used to reduce the dimensionality of and infer meaning from data. A key challenge of any statistical analysis package aimed at large-scale, distributed data is to address the orthogonal issues of parallel scalability and numerical stability. Many statistical techniques, e.g., descriptive statistics or principal component analysis, are based on moments and co-moments and, using robust online update formulas, can be computed in an embarrassingly parallel manner, amenable to a map-reduce style implementation. In this paper we focus on contingency tables, through which numerous derived statistics such as joint and marginal probability, point-wise mutual information, information entropy, and χ² independence statistics can be directly obtained. However, contingency tables can become large as data size increases, requiring a correspondingly large amount of communication between processors. This potential increase in communication prevents optimal parallel speedup and is the main difference with moment-based statistics, where the amount of inter-processor communication is independent of data size. Here we present the design trade-offs which we made to implement the computation of contingency tables in parallel. We also study the parallel speedup and scalability properties of our open source implementation. In particular, we observe optimal speedup and scalability when the contingency statistics are used in their appropriate context, namely, when the data input is not quasi-diffuse.
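The pattern described, where each processor builds a partial contingency table, tables are merged in a reduction whose message size grows with the number of distinct categories, and derived statistics are then read off the merged table, can be sketched as follows. This is a generic illustration of that pattern, not the paper's open source implementation; the χ² formula shown is the standard one.

```python
from collections import Counter

def local_table(pairs):
    """Map step on each processor: count (x, y) category pairs in the local data."""
    return Counter(pairs)

def merge_tables(tables):
    """Reduce step: merging is a simple counter sum, but unlike moment-based
    statistics the message size grows with the number of distinct categories."""
    total = Counter()
    for t in tables:
        total.update(t)
    return total

def chi_square(table):
    """Standard chi-square independence statistic from a merged contingency table."""
    n = sum(table.values())
    row, col = Counter(), Counter()
    for (x, y), c in table.items():
        row[x] += c
        col[y] += c
    return sum((table.get((x, y), 0) - row[x] * col[y] / n) ** 2 / (row[x] * col[y] / n)
               for x in row for y in col)

parts = [local_table([("a", 0), ("a", 1), ("b", 1)]), local_table([("b", 0), ("b", 1)])]
print(chi_square(merge_tables(parts)))
```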

Bennett, Janine Camille; Thompson, David; Pebay, Philippe Pierre

2010-09-01

458

Fast Parallel Computation of Longest Common Prefixes  

E-print Network

Suffix arrays [12] have applications in many fields; many of these applications also require the corresponding longest common prefix (LCP) array, which stores the length of the longest common prefix between adjacent suffixes in the suffix array.
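For context, given a string and its suffix array the LCP array can be built sequentially in linear time with Kasai's algorithm. The sketch below shows that classic sequential construction, which is the kind of baseline a parallel LCP algorithm is measured against, not the paper's own parallel method.

```python
def lcp_array(s, sa):
    """Kasai's O(n) sequential LCP construction: lcp[i] is the length of the
    longest common prefix of the suffixes starting at sa[i-1] and sa[i]."""
    n = len(s)
    rank = [0] * n
    for i, suf in enumerate(sa):
        rank[suf] = i
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            h = max(h - 1, 0)
        else:
            h = 0
    return lcp

s = "banana"
sa = [5, 3, 1, 0, 4, 2]          # suffix array of "banana"
print(lcp_array(s, sa))          # [0, 1, 3, 0, 0, 2]
```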

459

Functional MRI Using Regularized Parallel Imaging Acquisition  

E-print Network

Parallel MRI techniques reconstruct full-FOV images from undersampled k-space data by using the uncorrelated information from RF array coil elements. One disadvantage of parallel MRI is that the image signal-to-noise ratio is degraded.
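At the pixel level, regularized parallel-imaging reconstruction amounts to a regularized least-squares "unfolding" of each aliased pixel using the coil sensitivity profiles. The sketch below shows a generic Tikhonov-regularized version of that step; it is not the regularization scheme of this particular paper, and the matrix shapes and the lam parameter are illustrative assumptions.

```python
import numpy as np

def unfold_pixel(aliased, sens, lam=0.05):
    """Tikhonov-regularized unfolding of one aliased pixel.

    aliased: length-n_coils vector of measured aliased values
    sens:    (n_coils, R) coil sensitivities at the R overlapping image positions
    lam:     regularization weight trading residual error against noise amplification
    """
    a = sens.conj().T @ sens + lam * np.eye(sens.shape[1])
    b = sens.conj().T @ aliased
    return np.linalg.solve(a, b)              # estimated true values at the R positions

# Example: 4 coils, acceleration factor R = 2.
rng = np.random.default_rng(0)
sens = rng.standard_normal((4, 2)) + 1j * rng.standard_normal((4, 2))
truth = np.array([1.0 + 0.5j, 0.3 - 0.2j])
aliased = sens @ truth + 0.01 * rng.standard_normal(4)
print(unfold_pixel(aliased, sens))
```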

460

SIAM Conference on Parallel Processing for Scientific Computing - March 12-14, 2008  

SciTech Connect

The themes of the 2008 conference included, but were not limited to: Programming languages, models, and compilation techniques; The transition to ubiquitous multicore/manycore processors; Scientific computing on special-purpose processors (Cell, GPUs, etc.); Architecture-aware algorithms; From scalable algorithms to scalable software; Tools for software development and performance evaluation; Global perspectives on HPC; Parallel computing in industry; Distributed/grid computing; Fault tolerance; Parallel visualization and large scale data management; and The future of parallel architectures.

None

2008-09-08